
Machine Learning Unsupervised Algorithms

This project focuses on assessing the most common unsupervised machine learning algorithms in Python using an open dataset from Kaggle. The code and dataset can be found in the links below.


The data consists of 18 features for almost 10k credit bank customers, and the purpose is to separate the customers into groups of similar behavior. Therefore, this is an unsupervised problem since there is no output class or target.

The first step is to get familiar with the input dataset. For that purpose, a boxplot and a histogram for each feature will be used.

The boxplot sorted by magnitude is depicted below.
As observed, there is a very large number of outliers, which might affect our customer classification, so some data scrubbing work is definitely needed. Apart from the customer ID (not included in the boxplot), there are no categorical features, so the one-hot encoding method does not need to be applied. The customer ID feature is removed because the ID does not explain how the customer behaves or what kinds of transactions they make. There are several empty values, which are filtered out.
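
As an illustration, a minimal scrubbing sketch in Python could look as follows; the file name and the CUST_ID column name are assumptions rather than the exact ones used in the exercise.

import pandas as pd

# Load the raw data; the file name and the "CUST_ID" column are assumptions
data = pd.read_csv("credit_card_customers.csv")

# Drop the customer ID: it does not explain how the customer behaves
data = data.drop(columns=["CUST_ID"])

# Filter out rows with empty values (they could also be imputed instead)
data = data.dropna()

print(data.shape)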

After applying the data scrubbing methods described above, the new boxplot and histogram are depicted as follows.
Reviewing the histograms, very few features have a distribution close to normal. Most of them have one very frequent value and then isolated values. This might be a sign of a very dense distribution with some noise points around it, but it will be assessed later.

After scrubbing the data, the algorithms can be applied. Therefore, a short introduction to the three main unsupervised machine learning algorithms is given below.

KMEANS CLUSTERING
This algorithm is probably the simplest and the most commonly used. It works with cluster centers and distributes the data points among them, so it needs the number of clusters as an input. The procedure is to initialize random cluster centers according to the input number of clusters and assign each data point to the closest center. Once all data points are assigned, the real cluster centers are recalculated and all data points are reassigned to the closest center. The algorithm finishes when no data point changes its assignment.
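
A minimal KMeans sketch with scikit-learn, assuming the scrubbed data is held in the pandas DataFrame data from the previous snippet; the choice of 3 clusters is only a placeholder here and is justified later.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale the features so no single feature dominates the distance computation
X = StandardScaler().fit_transform(data)

# Fit KMeans; n_clusters=3 is only a placeholder at this point
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_kmeans = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # (n_clusters, n_features)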

AGGLOMERATIVE CLUSTERING
This algorithm belongs to the hierarchical family. The starting point is to declare every data point a cluster of its own, so the number of clusters equals the number of data points. From there, at each step the algorithm merges the two most similar clusters until a stopping criterion is reached. In this implementation, the stopping criterion is the desired number of clusters, so, as with KMeans, the number of clusters must be known in advance. The particularity of the algorithm is what is considered the "most similar clusters", and there are three main linkage approaches (a short code sketch follows the list):
- WARD: the default choice and the one used in this exercise. It merges the two clusters such that the variance within all clusters increases the least
- AVERAGE: merges the two clusters that have the smallest average distance between all their points
- COMPLETE (or maximum): merges the two clusters that have the smallest maximum distance between their points.
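
A minimal sketch of agglomerative clustering with scikit-learn, reusing the scaled matrix X from the KMeans snippet; the linkage argument selects one of the three approaches above.

from sklearn.cluster import AgglomerativeClustering

# Ward is the default linkage and the one used in this exercise;
# "average" or "complete" could be passed instead
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels_agg = agg.fit_predict(X)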

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN works by identifying points that are in "crowded" regions, where many data points lie close together. These regions are referred to as dense regions. The idea behind DBSCAN is that clusters form dense regions of data separated by regions that are relatively empty, which helps to identify complex shapes. DBSCAN has two parameters: min_samples and eps. The algorithm starts by picking an arbitrary point; if there are at least min_samples data points within a distance eps of that point, it is classified as a core sample. All points within distance eps of a core sample are put into the same cluster, and then the algorithm assesses each new point in the cluster individually: points with at least min_samples neighbors within eps are core points too, otherwise they are boundary points. The cluster grows until all neighbors of every core point are boundary points. On the other hand, a point that does not belong to any cluster and has fewer than min_samples points within distance eps is labeled as noise and will not belong to any cluster. Therefore, the DBSCAN algorithm automatically detects the number of clusters and might leave certain points unclassified.
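
A minimal DBSCAN sketch with scikit-learn, again reusing the scaled matrix X; the eps and min_samples values here are placeholders, since their tuning is discussed later.

from sklearn.cluster import DBSCAN

# eps and min_samples are placeholders; their tuning is covered below
dbscan = DBSCAN(eps=1.0, min_samples=20)
labels_dbscan = dbscan.fit_predict(X)

# Label -1 marks noise points that belong to no cluster
n_clusters = len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)
n_noise = list(labels_dbscan).count(-1)
print(n_clusters, n_noise)
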
After introducing the algorithms, let's focus first on KMeans and Agglomerative, which require the number of clusters as an input. The main question here is how to identify the optimal number of clusters. There are mainly three methods:

ELBOW CRITERION
The idea behind the elbow method is to run the clustering algorithm on a given dataset for a range of numbers of clusters (for example from 1 to 10) and, for each value, calculate the sum of squared errors (SSE). For KMeans, scikit-learn provides this calculation through the inertia_ attribute, which is exactly the sum of squared distances from each sample to its closest cluster center. For other algorithms there is no direct SSE attribute, since KMeans is the only algorithm working with cluster centers, so when using this method with an algorithm other than KMeans, the cluster centers must be calculated manually. Once the calculation is done, SSE can be plotted against the number of clusters, and the ideal value corresponds to the elbow point (marked in green below), beyond which increasing the number of clusters only provides a minor decrease in SSE. Note that SSE equals zero when the number of clusters equals the number of data points, so the purpose is not to minimize SSE but to find a trade-off between SSE and the number of clusters.
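
A sketch of the elbow calculation with scikit-learn, sweeping KMeans from 1 to 10 clusters and reading the SSE from inertia_.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Compute the SSE (inertia_) for 1 to 10 clusters
sse = []
cluster_range = range(1, 11)
for k in cluster_range:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)

plt.plot(list(cluster_range), sse, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("SSE (inertia_)")
plt.show()
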
DENDROGRAM PLOT
This method is used with agglomerative clustering and is a plot that describes how the clusters are formed. It places all data points on the x-axis and depicts each cluster merge. The y-axis corresponds to the cluster distance, so it is possible to evaluate how far apart two clusters were when they were merged. That might sound abstract, but the dendrogram of the present exercise is assessed in further detail below.
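
A dendrogram can be generated with SciPy's hierarchy module; a sketch on the scaled matrix X, truncated so the plot stays readable for roughly 10k samples, could be:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward

# Build the ward linkage matrix from the scaled data
linkage_matrix = ward(X)

# Truncate the plot so it stays readable with ~10k samples
dendrogram(linkage_matrix, truncate_mode="level", p=5)
plt.ylabel("Cluster distance")
plt.show()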

SILHOUETTE SCORE
The silhouette score computes the compactness of a cluster, where higher is better, with a perfect score of 1. While compact clusters are good, compactness does not allow for complex shapes, so this method might report lower values than deserved when used on complex shapes. A higher silhouette coefficient relates to a model with better-defined clusters. The silhouette coefficient is calculated from the mean distance between a sample and all other points in its own cluster and the mean distance to all points in the nearest neighboring cluster, so this method is only applicable when there are at least two clusters.
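
A minimal silhouette computation with scikit-learn, using the KMeans labels obtained earlier.

from sklearn.metrics import silhouette_score

# Silhouette score for the KMeans labels; only valid with at least 2 clusters
score = silhouette_score(X, labels_kmeans)
print(f"Silhouette score: {score:.3f}")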

The graph below depicts the KMeans and Agglomerative results for 1 to 10 clusters, after which the elbow and silhouette methods are applied to evaluate the optimum number of clusters for the dataset of this exercise.
Overall, KMeans looks better since its SSE line is lower and its silhouette line is higher than the Agglomerative ones. However, identifying the optimum number of clusters is not immediate: the SSE line does not show a clear elbow and the silhouette line has small variance, but 3 clusters looks like a reasonable trade-off choice. To gather more information, the dendrogram plot is depicted below.
As explained earlier, the dendrogram places the data points on the x-axis and the cluster distance on the y-axis. A good way to read the graph is from bottom to top, watching how the number of clusters is reduced from the number of data points down to one. The vertical length of a merge represents the real distance between the two merged clusters, so very long vertical lines should not be kept inside our clusters, and the number of clusters should be cut accordingly. For example, in this exercise the three vertical blue lines are the longest, which reinforces the choice of 3 clusters.

The picture below depicts the data clustering with the KMeans algorithm and 3 clusters. For visualization purposes, a PCA transformation was used to plot the first PCA component against the second one, assuming they both explain a very large amount of the input data variance. However, the plot should be taken only as a visualization reference, since some information was lost in the PCA transformation to just two components and the data point locations are only approximate. That said, the KMeans distribution looks good, but note the input dataset shows one major dense area with lots of isolated points, which suggests DBSCAN will end up categorizing a single cluster with lots of noise points.
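
A sketch of the PCA-based visualization, projecting the scaled data onto two components and coloring the points by their KMeans label.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the scaled data onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_kmeans, cmap="viridis", s=10)
plt.xlabel("First PCA component")
plt.ylabel("Second PCA component")
plt.show()

# Check how much variance the two components actually retain
print(pca.explained_variance_ratio_.sum())
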
As a reference, the picture below depicts the data clustering for the Agglomerative algorithm, which, as previously suspected from the SSE and silhouette plots, offers worse performance than KMeans.
Let's now assess the DBSCAN algorithm. As mentioned, DBSCAN does not look like a very good algorithm for the present dataset, since it has one major dense area with lots of noise points. The first step is to tune the DBSCAN model with min_samples and eps. As a general rule, min_samples should be higher than the number of features, up to twice that for large datasets. However, eps depends strictly on the dataset and there is no easy rule to apply. The best way is to use a nearest-neighbors model to plot the distance between each data point and its closest neighbor, as depicted below, and from that determine a proper range for eps.
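
A sketch of the nearest-neighbor distance plot with scikit-learn, sorting the distances so the knee of the curve suggests a reasonable eps.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Distance from each point to its closest neighbor (index 0 is the point itself)
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, _ = nn.kneighbors(X)
nearest_distances = np.sort(distances[:, 1])

plt.plot(nearest_distances)
plt.xlabel("Data points sorted by distance")
plt.ylabel("Distance to closest neighbor")
plt.show()
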
Based on the graph, a proper range for eps would be between 0.5 and 2. Higher values would lead to a single cluster, and lower values would mark all points as noise or create a high number of very small clusters.

Using the eps and min_samples ranges, a sweep calculating the silhouette score and the number of clusters for each combination is performed and depicted below. Note that when a single cluster is concluded, the silhouette score cannot be calculated, since it requires a minimum of two clusters.
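
A sketch of such a sweep; the exact eps and min_samples grids used in the exercise are assumptions.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

for min_samples in (19, 22, 25):
    for eps in np.arange(0.5, 2.1, 0.3):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Noise points are labeled -1 and do not count as a cluster
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters >= 2:
            score = silhouette_score(X, labels)
            print(f"min_samples={min_samples} eps={eps:.1f} "
                  f"clusters={n_clusters} silhouette={score:.3f}")
        else:
            print(f"min_samples={min_samples} eps={eps:.1f} "
                  f"clusters={n_clusters} silhouette=n/a")
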
In general, the silhouette scores are very low, so DBSCAN does not look like a proper algorithm for the current dataset. For very high eps values, DBSCAN concludes a single cluster; for very low eps values, it concludes a higher number of clusters. Let's take a deeper look: for example, the screenshot below depicts the case of min_samples 25 with eps 0.6.
There are four clusters, but they are very small and most points have been marked as noise, meaning this is not a good representation of the input data. As a different example, let's focus on the particular case of min_samples 19 with eps 1.4.
Here there are two clusters, one very big and the other very small. The small cluster would probably disappear when increasing the min_samples parameter. Again, the representation of the input dataset is mediocre.

After applying all the algorithms and concluding that KMeans offers the best performance for the input dataset, the particularities of each cluster should be understood. For that purpose, the mean value of each feature per cluster is plotted next to provide fast visual information about each cluster.
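
A sketch of how those per-cluster means could be computed and plotted, assuming the scaled matrix X, the KMeans labels and the column names of the DataFrame data from the earlier snippets.

import pandas as pd
import matplotlib.pyplot as plt

# Mean of every scaled feature per KMeans cluster
clustered = pd.DataFrame(X, columns=data.columns)
clustered["cluster"] = labels_kmeans
cluster_means = clustered.groupby("cluster").mean()

# Bar plot with one group of bars per feature for a quick visual comparison
cluster_means.T.plot(kind="bar", figsize=(12, 5))
plt.ylabel("Mean standardized feature value")
plt.show()
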
The graph is very useful and makes it possible to clearly identify the following three types of customers:

- Cluster 0: customers with cash and a very strong balance, but they do not tend to spend much of it, so they keep accumulating balance. That said, when they purchase something, they tend to pay with cash up front instead of credit. Therefore, it would look like a good idea to contact these customers with investment offers to put the excess of unused cash to work.

- Cluster 2: customers with a strong impulse to make purchases, partly because their cash balance is good. However, instead of paying with cash, they prefer to use credit for their purchases. Therefore, it would look like a good idea to contact these customers with potential purchase options, especially at medium-high prices given their good balance.

- Cluster 1: customers with a limited account balance, so they cannot afford many purchases. Taking into account that they purchase more than cluster 0 (the "rich" one), this cluster looks like a potential cluster 2 without that strong financial health. Therefore, it would look like a good idea to contact these customers with small purchase options, probably something implying a small recurrent payment.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero