Summary

Most applications of Machine Learning are based on supervised learning (labels are available); however, the vast majority of available data may be unlabelled. It would be great if a model could be trained automatically without human supervision. Clustering, or cluster analysis, is an unsupervised machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping data points into clusters of similar data points: objects with possible similarities remain in a group that has little or no similarity with other groups." In this notebook, different clustering techniques including K-Means, DBSCAN, Spectral Clustering, Agglomerative Clustering and Gaussian Mixture Models are applied to synthetic and realistic data. The pros and cons of each technique are discussed.

Python functions and data files needed to run this notebook are available via this link.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
import matplotlib as mpl
import time
import os 
import random
import warnings
warnings.filterwarnings('ignore')
from functions import *  # import required functions to run this notebook
#scikit-learn==0.24.2
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

Introduction to Clustering

A simple example of clustering: you find some unknown plants while hiking in the mountains. The flowers may not be completely identical, but you can easily see that they most likely belong to the same species. You would need an expert to tell you the name of the plant, but you do not need anyone to group similar objects, for example by petal length, petal width, colour, scent and so forth. This approach is called clustering: it first identifies similar instances and then assigns them to clusters. Clustering can be used in several applications, as described below:

Customer Segmentation

You can cluster customers based on their activity on your website, their purchases, their returns and so on. This is really helpful for understanding who your customers are and what they need, so you can adapt your products to each segment.

Dimensionality Reduction

If the data can be fit into k clusters, this probably means that each instance's affinity to those k clusters preserves enough information for further processing, so the original features can be replaced by k cluster-affinity features. In this way, clustering can be used as a dimensionality reduction technique.
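
For example, a minimal sketch of this idea with scikit-learn (toy data and an arbitrary k, used only for illustration): KMeans.transform() maps each instance to its distances to the k centroids, so many original features can be replaced by k cluster-affinity features.

# Minimal sketch: K-Means as a dimensionality reduction step (toy data, arbitrary k)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=500, n_features=10, centers=5, random_state=0)
kmeans_dr = KMeans(n_clusters=5, random_state=0).fit(X_demo)

# Each row becomes its distances to the 5 centroids: 10 features -> 5 features
X_reduced = kmeans_dr.transform(X_demo)
print(X_demo.shape, '->', X_reduced.shape)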

Anomaly Detection (Outlier detection)

Examples include preventing fraud, detecting unusual credit card transactions, and catching manufacturing defects. For instance, if the users of your website are clustered based on their behavior, users with unusual behavior, such as an abnormal number of requests per second or per day, can easily be detected.

Data Analysis

It is usually useful to first discover clusters of similar instances when analyzing a new dataset: analyzing clusters separately is often easier.

Semi-supervised Learning

As discussed before, semi-supervised techniques can be applied when only a few labels are available. Clustering can be performed first, and the labels can then be propagated to all the instances in the same cluster. This increases the number of labels available for a subsequent supervised learning algorithm, which can improve its performance.
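
A minimal sketch of this idea, assuming we label by hand only the instance closest to each centroid (toy data, for illustration only):

# Sketch: propagate a few "manual" labels through K-Means clusters (toy data)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_semi, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, random_state=0).fit(X_semi)

# Instance closest to each centroid: the only ones we pretend to label manually
representative_idx = np.argmin(km.transform(X_semi), axis=0)
representative_labels = y_true[representative_idx]

# Propagate each representative label to every instance in the same cluster
y_propagated = representative_labels[km.labels_]
print(y_propagated[:10])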

Synthetic Data

The clustering techniques are first applied to synthetic data. A synthetic dataset with 4 clusters is created.

In [2]:
# Assign center of each blob
centers = np.array(
    [[ 2,  2],
     [ 2,  6],
     [5 ,  3],
     [6,  7],])

# Assign some variance for each blob
blob_std = np.array([0.9, 0.8, 0.85, 0.8])

# generate blobs with 3000 samples
X, y = make_blobs(n_samples=3000, centers=centers,cluster_std=blob_std, random_state=60)
In [3]:
font = {'size'   : 15}
plt.rc('font', **font)
fig = plt.subplots(figsize=(7, 5), dpi= 80, facecolor='w', edgecolor='k')

plt.scatter(X[:, 0], X[:, 1], c=y, s=30,cmap='jet',edgecolors='k', linewidth=0.8)
plt.xlabel('X1', fontsize=22)
plt.ylabel('X2', fontsize=22, rotation=0)
plt.title(" Synthetic 4 Clusters",fontsize=17)

for i in range(4):
    plt.text(centers[i][0], (centers[i][1]), str(i+1),fontsize=15,color='k', 
             fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=5)) 
plt.text(0.2, 9, 'n='+str(len(X[:, 0])), ha = 'center',fontsize=18,bbox=dict(facecolor='k', alpha=0.2))
    
plt.show()
In [4]:
data=pd.DataFrame(X)
scaler = MinMaxScaler()
scaler.fit(data)
scaler.data_max_
data_Min_Max=scaler.transform(data)
data_Min_Max.shape
Out[4]:
(3000, 2)

K-Means

K-Means is a simple algorithm that can perform clustering very quickly and efficiently, often converging to reliable clusters in a few iterations. It starts by randomly placing the centroids for the chosen number of clusters (here 4) and labelling the instances accordingly; it then updates the centroids and relabels the instances, repeating this process until the centroids stop moving. See the code and figures below:

In [5]:
kmeans_iter1 = KMeans(n_clusters=4, init="random", n_init=1, max_iter=1, random_state=25)
kmeans_iter2 = KMeans(n_clusters=4, init="random", n_init=1, max_iter=2, random_state=25)
kmeans_iter3 = KMeans(n_clusters=4, init="random", n_init=1, max_iter=3, random_state=25)
kmeans_iter4 = KMeans(n_clusters=4, init="random", n_init=1, max_iter=4, random_state=25)
kmeans_iter5 = KMeans(n_clusters=4, init="random", n_init=1, max_iter=5, random_state=25)
kmeans_iter6 = KMeans(n_clusters=4, init="random", n_init=1, max_iter=6, random_state=25)
kmeans_iter1.fit(X)
kmeans_iter2.fit(X)
kmeans_iter3.fit(X)
kmeans_iter4.fit(X)
kmeans_iter5.fit(X)
kmeans_iter6.fit(X)
clust_1=kmeans_iter1.cluster_centers_
clust_2=kmeans_iter2.cluster_centers_
clust_3=kmeans_iter3.cluster_centers_
clust_4=kmeans_iter4.cluster_centers_
clust_5=kmeans_iter5.cluster_centers_
clust_6=kmeans_iter6.cluster_centers_
In [6]:
font = {'size'   : 15}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(12, 12), dpi= 80, facecolor='w', edgecolor='k')

plt.subplot(321)
plot_decision_boundaries(kmeans_iter1, X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_iter1.predict(X), s=30,cmap='jet', edgecolors='k',linewidth=0.8)
for i in range(len(clust_1)):
    plt.plot(clust_1[i][0],clust_1[i][1],'gX',markersize=15)
    plt.text(clust_1[i][0], (clust_1[i][1])-0.4, str(i+1),fontsize=15,color='k', 
             fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=1)) 

plt.ylabel('X2', fontsize=18, rotation=0)
plt.title("Initialize Random Centre for Clusters",fontsize=15)
#
plt.subplot(322)
plot_decision_boundaries(kmeans_iter2, X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_iter2.predict(X), s=30,cmap='jet', edgecolors='k',linewidth=0.8)
for i in range(len(clust_2)):
    plt.plot(clust_2[i][0],clust_2[i][1],'gX',markersize=15)
    plt.text(clust_2[i][0], (clust_2[i][1])-0.4, str(i+1),fontsize=15,color='k', 
             fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=1)) 
plt.title("Iteration 1",fontsize=17)
#
plt.subplot(323)
plot_decision_boundaries(kmeans_iter3, X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_iter3.predict(X), s=30,cmap='jet', edgecolors='k',linewidth=0.8)
for i in range(len(clust_2)):
    plt.plot(clust_3[i][0],clust_3[i][1],'gX',markersize=15)
    plt.text(clust_3[i][0], (clust_3[i][1])-0.4, str(i+1),fontsize=15,color='k', 
             fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=1))       
plt.ylabel('X2', fontsize=18, rotation=0)
plt.title("Iteration 2",fontsize=17)
#
plt.subplot(324)
plot_decision_boundaries(kmeans_iter4, X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_iter4.predict(X), s=30,cmap='jet', edgecolors='k',linewidth=0.8)
for i in range(len(clust_2)):
    plt.plot(clust_4[i][0],clust_4[i][1],'gX',markersize=15)
    plt.text(clust_4[i][0], (clust_4[i][1])-0.4, str(i+1),fontsize=15,color='k', 
             fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=1))   
plt.title("Iteration 3",fontsize=17)
#
plt.subplot(325)
plot_decision_boundaries(kmeans_iter5, X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_iter5.predict(X), s=30,cmap='jet', edgecolors='k',linewidth=0.8)
for i in range(len(clust_2)):
    plt.plot(clust_5[i][0],clust_5[i][1],'gX',markersize=15)
    plt.text(clust_5[i][0], (clust_5[i][1])-0.4, str(i+1),fontsize=15,color='k', 
             fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=1))       
plt.xlabel('X1', fontsize=18)
plt.ylabel('X2', fontsize=18, rotation=0)
plt.title("Iteration 4",fontsize=17)

#
plt.subplot(326)
plot_decision_boundaries(kmeans_iter6, X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_iter6.predict(X), s=30,cmap='jet', edgecolors='k',linewidth=0.8)
for i in range(len(clust_2)):
    plt.plot(clust_6[i][0],clust_6[i][1],'gX',markersize=10)
    plt.text(clust_6[i][0], (clust_6[i][1])-0.4, str(i+1),fontsize=15,color='k', 
             fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=1))       
plt.xlabel('X1', fontsize=18)
plt.title("Iteration 5",fontsize=17)
fig.suptitle('How K-Means Clustering Works', fontsize=24,y=0.96)

plt.subplots_adjust(wspace=0.1)
plt.subplots_adjust(hspace=0.3)
plt.show()

Scikit-Learn keeps the best solution for different random centroids. But how exactly does it know which solution is the best? Well of course it uses a performance metric! It is called the model’s inertia: this is the mean squared distance between each instance and its closest centroid.

The KMeans class runs the algorithm n_init times and keeps the model with the lowest inertia. An important improvement to the K-Means algorithm, called K-Means++, was proposed in a 2006 paper by David Arthur and Sergei Vassilvitskii: they introduced a smarter initialization step that tends to select centroids that are distant from one another, and this makes the K-Means algorithm much less likely to converge to a suboptimal solution. They showed that the additional computation required for the smarter initialization step is well worth it since it makes it possible to drastically reduce the number of times the algorithm needs to be run to find the optimal solution.

Instead of initializing the centroids entirely randomly, it is preferable to initialize them using the following algorithm, proposed in a 2006 paper by David Arthur and Sergei Vassilvitskii:

  • Take one centroid $c_1$, chosen uniformly at random from the dataset.
  • Take a new centroid $c_i$, choosing an instance $\mathbf{x}_i$ with probability $D(\mathbf{x}_i)^2 \, / \, \sum_{j=1}^{m} D(\mathbf{x}_j)^2$, where $D(\mathbf{x}_i)$ is the distance between the instance $\mathbf{x}_i$ and the closest centroid that was already chosen. This probability distribution ensures that instances farther away from already chosen centroids are much more likely to be selected as centroids.
  • Repeat the previous step until all $k$ centroids have been chosen.

The KMeans class actually uses this initialization method by default. If you want to force it to use the original method (i.e., picking k instances randomly to define the initial centroids), then you can set the init hyperparameter to "random". You will rarely need to do this.

(Retrieved from Aurélien Géron)
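
As a minimal sketch of the effect of initialization (reusing the synthetic data X above), the default k-means++ initialization with several restarts typically reaches an inertia at least as low as purely random initialization:

# Sketch: default "k-means++" init vs. purely random init on the blobs above
from sklearn.cluster import KMeans

km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
km_rand = KMeans(n_clusters=4, init="random", n_init=1, random_state=42).fit(X)

# Lower inertia means tighter clusters
print("k-means++ inertia :", km_pp.inertia_)
print("random init inertia:", km_rand.inertia_)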

Execution time and inertia are calculated for K-Means with the number of clusters ranging from 2 to 10.

In [7]:
%%time
n_clus_reg,time_c_reg,inertia_c_reg=kmean_eda(data=data_Min_Max,k_start=2,k_end=10,kint=1)
Wall time: 1.8 s
In [8]:
font = {'size'   :15}
plt.rc('font', **font)
fig=plt.figure(figsize=(12, 10), dpi= 80, facecolor='w', edgecolor='k')

ax1=plt.subplot(2,1,1) 
plt.plot(n_clus_reg,time_c_reg,color='b', marker='s', label='K-Means', markersize=7.5,linewidth=2,markeredgecolor='k')
plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Training Time (Seconds)')
plt.ylabel('Time (Seconds)')
plt.xlabel('Number of Clusters (k)',labelpad=15)
plt.legend(ncol=1,markerscale=1.0)
plt.xticks(np.arange(2,11))
plt.xticks(np.arange(2,11),rotation=0,y=0.0, fontsize=14)
plt.text(9, 0.1, 'n='+str(len(X[:, 0])), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))

ax2=plt.subplot(2,1,2)
plt.plot(n_clus_reg,inertia_c_reg,color='b', marker='s', label='K-Means', markersize=7.5,linewidth=2,markeredgecolor='k')
plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Inertia (Mean square Distance)')
plt.xlabel('Number of Clusters (k)',labelpad=10)
plt.ylabel('Inertia')
plt.legend(ncol=1,markerscale=1.0)
plt.xticks(np.arange(2,11),rotation=0,y=0.0, fontsize=14)
plt.text(9, 50, 'n='+str(len(X[:, 0])), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))
plt.subplots_adjust(wspace=0.12)
plt.subplots_adjust(hspace=0.4)

Accelerated K-Means and Mini-batch K-Means

Another important improvement to the K-Means algorithm was proposed in a 2003 paper by Charles Elkan. It considerably accelerates the algorithm by avoiding many unnecessary distance calculations.

Accelerated K-Means

The K-Means algorithm can be significantly accelerated by avoiding many unnecessary distance calculations: this is achieved by exploiting the triangle inequality (given three points A, B and C, the distance AC is always such that AC ≤ AB + BC) and by keeping track of lower and upper bounds for distances between instances and centroids (see this 2003 paper by Charles Elkan for more details). To use Elkan's variant of K-Means, just set algorithm="elkan".
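
For example, a short sketch (valid for the scikit-learn version pinned in the imports, where "elkan" is an accepted value):

# Sketch: Elkan's triangle-inequality variant of K-Means on the synthetic data
from sklearn.cluster import KMeans

kmeans_elkan = KMeans(n_clusters=4, algorithm="elkan", random_state=42).fit(X)
print(kmeans_elkan.inertia_)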

Mini-Batch K-Means

Yet another important variant of the K-Means algorithm was proposed in a 2010 paper by David Sculley. Instead of using the full dataset at each iteration, the algorithm is capable of using mini-batches, moving the centroids just slightly at each iteration. This speeds up the algorithm typically by a factor of 3 or 4 and makes it possible to cluster huge datasets that do not fit in memory. Scikit-Learn implements this algorithm in the MiniBatchKMeans class.

(Retrieved from Aurélien Géron)
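
A minimal MiniBatchKMeans sketch on the scaled synthetic data (the batch size here is an arbitrary choice):

# Sketch: Mini-Batch K-Means fitted directly with scikit-learn
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=42)
mbk.fit(data_Min_Max)
print(mbk.inertia_)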

In [9]:
%%time
n_clus_mbatch,time_c_mbatch,inertia_mbatch=kmean_eda(data=data_Min_Max,k_start=2,
                                                     k_end=10,kint=1,minibatch=True,batch_size=3072)
Wall time: 1.16 s
In [10]:
font = {'size'   :15}
plt.rc('font', **font)
fig=plt.figure(figsize=(12, 10), dpi= 80, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1) 
plt.plot(n_clus_reg,time_c_reg,color='b', marker='s', label=f'Regular K-Means (sum={int(np.sum(time_c_reg))}s)', markersize=7.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_mbatch,time_c_mbatch,color='r', marker='o', label=f'MiniBatch K-Means (sum={int(np.sum(time_c_mbatch))}s)', markersize=7.5,linewidth=2,markeredgecolor='k')

plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Training Time (Seconds)')
plt.ylabel('Time (Seconds)')
plt.xlabel('Number of Clusters (k)',labelpad=15)
plt.legend(ncol=1,markerscale=1.0)
plt.xticks(np.arange(2,11))
plt.xticks(np.arange(2,11),rotation=0,y=0.0, fontsize=14)
plt.xlim(2,7)
plt.text(6.5, 0.06, 'n='+str(len(X[:, 0])), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))

ax2=plt.subplot(2,1,2)
plt.plot(n_clus_reg,inertia_c_reg,color='b', marker='s', label='Regular K-Means', markersize=7.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_mbatch,inertia_mbatch,color='r', marker='o', label='MiniBatch K-Means', markersize=7.5,linewidth=2,markeredgecolor='k')
plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Inertia (Mean square Distance)')
plt.xlabel('Number of Clusters (k)',labelpad=10)
plt.ylabel('Inertia')
plt.axvline(x=4,color='k',linewidth=4,linestyle='--')
plt.legend(ncol=1,markerscale=1.0)
plt.xticks(np.arange(2,11),rotation=0,y=0.0, fontsize=14)
plt.text(6.5, 50, 'n='+str(len(X[:, 0])), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))
plt.xlim(2,7)
plt.subplots_adjust(wspace=0.12)
plt.subplots_adjust(hspace=0.4)

If the dataset does not fit in memory, the simplest option is to use NumPy's memmap class, as in the commented-out cell below.

In [11]:
#filename = "memmap.data"
#X_mm = np.memmap(filename, dtype='float32', mode='write', shape=data.shape)
#X_mm[:] = data

#n_clus_mbatch,time_c_mbatch,inertia_mbatch=kmean_eda(data=X_mm,k_start=2,k_end=20,kint=1,minibatch=True)

Limitations of K-Means

Despite its many merits, most notably being fast and scalable, K-Means is not perfect because:

  • It is necessary to run the algorithm several times to avoid sub-optimal solutions

  • You need to specify the number of clusters, which can be quite a hassle

  • K-Means does not behave very well when the clusters have varying sizes, different densities, or non-spherical shapes.

Optimum Number of Clusters

There are two common approaches for finding the optimum number of clusters:

  • Elbow rule

The inertia drops very quickly as the number of clusters increases, until it reaches a number of clusters beyond which it decreases much more slowly. The curve has roughly the shape of an arm, and the bend is called the "elbow".

  • Silhouette (si·luh·wet) coefficient

A more precise approach (but also more computationally expensive) is to use the silhouette score, which is the mean silhouette coefficient over all the instances. An instance's silhouette coefficient is equal to (b – a) / max(a, b), where a is the mean distance to the other instances in the same cluster (the mean intra-cluster distance) and b is the mean nearest-cluster distance, that is, the mean distance to the instances of the next closest cluster (defined as the one that minimizes b, excluding the instance's own cluster). The silhouette coefficient can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, a coefficient close to 0 means that it is close to a cluster boundary, and a coefficient close to -1 means that the instance may have been assigned to the wrong cluster. To compute the silhouette score, you can use Scikit-Learn's silhouette_score() function.
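
A direct sketch for a single value of k on the scaled synthetic data:

# Sketch: silhouette score for k=4 using scikit-learn directly
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km4 = KMeans(n_clusters=4, random_state=42).fit(data_Min_Max)
print(silhouette_score(data_Min_Max, km4.labels_))  # closer to +1 is better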

In [12]:
#Min and max scaling
scaler = MinMaxScaler()
scaler.fit(data)
scaler.data_max_
data_Min_Max=scaler.transform(data)
In [13]:
%%time
n_clus_reg,time_c_reg,inertia_c_reg,silhouette_kmean_reg=kmean_eda(data=data_Min_Max,k_start=2,k_end=7,kint=1,
                                minibatch=False,batch_size=3072,silhouette=True)
#
n_clus_mbat_kmean,time_mbat_kmean,inertia_mbat_kmean,silhouette_mbat_kmean=kmean_eda(data=data_Min_Max,k_start=2,k_end=7,
                                                        kint=1,minibatch=True,batch_size=1000,silhouette=True)
Wall time: 3.66 s

As you can see in the figure below, the inertia drops very quickly as we increase k up to 4, but then it decreases much more slowly as we keep increasing k. This curve has roughly the shape of an arm, and there is an "elbow" at k=4, so if we did not know better, k=4 would be a good choice: any lower value would be dramatic, while any higher value would not help much and might just split perfectly good clusters in half for no good reason. k=6 could also be considered an elbow, so this technique for choosing the best number of clusters is rather coarse.

In [14]:
font = {'size'   :15}
plt.rc('font', **font)
fig=plt.figure(figsize=(12, 10), dpi= 100, facecolor='w', edgecolor='k')

ax2=plt.subplot(2,1,1)
plt.plot(n_clus_reg,inertia_c_reg,color='b', marker='s', label='Regular K-Means', markersize=8.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_mbat_kmean,inertia_mbat_kmean,color='r', marker='o', label='Mini-Batch K-Means', markersize=8.5,linewidth=2,markeredgecolor='k')
plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Inertia (Mean Squared Distance) \n MinMax Scale')
plt.xlabel('Number of Clusters (k)',labelpad=14)
plt.ylabel('Inertia')
plt.axvline(x=4,color='k',linewidth=4,linestyle='--')
plt.legend(loc=1,ncol=1,markerscale=1.0)
plt.xticks(np.arange(1,10),rotation=0,y=0.0, fontsize=15)

plt.text(4,100, 'Optimum k=4',fontsize=15,color='k', 
         fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=5)) 
plt.text(6.5, 50, 'n='+str(len(X[:, 0])), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))    
plt.xlim(2,7)
plt.subplots_adjust(wspace=0.12)
plt.subplots_adjust(hspace=0.4)
In [15]:
font = {'size'   :15}
plt.rc('font', **font)
fig=plt.figure(figsize=(12, 10), dpi= 100, facecolor='w', edgecolor='k')

ax1=plt.subplot(2,1,1)
plt.plot(n_clus_reg,silhouette_kmean_reg,color='b', marker='s', label='Regular K-Means', 
         markersize=8.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_mbat_kmean,silhouette_mbat_kmean,color='r', marker='o', label='Mini-Batch K-Means', 
         markersize=8.5,linewidth=2,markeredgecolor='k')

plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Silhouette Score (-1 to 1) (Scaled Data) \n MinMax Scale')
plt.xlabel('Number of Clusters (k)',labelpad=12)
plt.ylabel('Silhouette score')
plt.axvline(x=4,color='k',linewidth=4,linestyle='--')
plt.legend(loc=2,ncol=1,markerscale=1.0,fontsize=13)
plt.xticks(np.arange(1,8),rotation=0,y=0.0, fontsize=15)
plt.xlim(2,7)
plt.text(4,0.48, 'Optimum k=4',fontsize=15,color='k', 
         fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=5)) 
plt.text(6.5, 0.55, 'n='+str(len(X[:, 0])), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))    
plt.show()
In [16]:
kmeans_per_k = [MiniBatchKMeans(n_clusters=k,batch_size=1000,random_state=42).fit(data_Min_Max) for k in range(3, 7)]
silhouette_scores = [silhouette_score(data_Min_Max, model.labels_) for model in kmeans_per_k]

An even more informative visualization is obtained when you plot every instance's silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient. This is called a silhouette diagram (see the figure below):

The vertical dashed lines represent the silhouette score for each number of clusters. When most of the instances in a cluster have a lower coefficient than this score (i.e., if many of the instances stop short of the dashed line, ending to the left of it), then the cluster is rather bad, since this means its instances are much too close to other clusters. We can see that when k=3 and when k=6, we get bad clusters. But when k=4 or k=5, the clusters look pretty good: most instances extend beyond the dashed line, to the right and closer to 1.0. When k=4, the cluster at index 1 (the third from the top) is rather big, while when k=5 all clusters have similar sizes, so even though the overall silhouette score for k=4 is slightly greater than for k=5, it seems like a good idea to use k=5 to get clusters of similar sizes.

In [17]:
font = {'size'   :13 }
plt.rc('font', **font)
fig=plt.figure(figsize=(11, 9), dpi= 80, facecolor='w', edgecolor='k')

for k in (3, 4, 5, 6):
    plt.subplot(2, 2, k - 2)
    
    y_pred = kmeans_per_k[k - 3].labels_
    silhouette_coefficients = silhouette_samples(data_Min_Max, y_pred)

    padding = len(data_Min_Max) // 30
    pos = padding
    ticks = []
    for i in range(k):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()

        color = mpl.cm.Spectral(i / k)
        plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ticks.append(pos + len(coeffs) // 2)
        pos += len(coeffs) + padding

    plt.gca().yaxis.set_major_locator(FixedLocator(ticks))
    plt.gca().yaxis.set_major_formatter(FixedFormatter(range(k)))
    if k in (3, 5):
        plt.ylabel("Cluster")
    
    if k in (5, 6):
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        plt.xlabel("Silhouette Coefficient", fontsize=15)
    else:
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        #plt.tick_params(labelbottom=False)

    plt.axvline(x=silhouette_scores[k - 3], color="k", linestyle="--",lw=4)
    plt.title(f"k={k} (Silhouette={np.round(silhouette_scores[k - 3],3)})", fontsize=19)
    plt.grid(True,which="both",ls="-",linewidth=0.2)
    plt.xlim((-0.1, 0.8))

plt.subplots_adjust(wspace=0.2)
plt.subplots_adjust(hspace=0.3)
plt.show()

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. This algorithm defines clusters as continuous regions of high density. It is actually quite simple:

  • For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance’s ε-neighborhood.
  • If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.
  • All instances in the neighborhood of a core instance belong to the same cluster. This may include other core instances, therefore a long sequence of neighboring core instances forms a single cluster.
  • Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.

Notice that some instances have a cluster index equal to -1: this means that they are considered anomalies by the algorithm. The indices of the core instances are available in the core_sample_indices_ instance variable, and the core instances themselves are available in the components_ instance variable:

In [18]:
dbscan = DBSCAN(eps=0.04, min_samples=3)
dbscan.fit(data_Min_Max)

pd.Series(dbscan.labels_).value_counts()
Out[18]:
 0    2956
-1      25
 1       7
 3       5
 2       4
 4       3
dtype: int64
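
As a short follow-up sketch, the attributes mentioned above can be inspected on the fitted model:

# Sketch: core instances and anomalies from the DBSCAN model fitted above
import numpy as np

print(len(dbscan.core_sample_indices_))           # number of core instances
print(dbscan.components_.shape)                   # the core instances themselves
print(np.sum(dbscan.labels_ == -1), 'anomalies')  # instances labelled -1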

Advantages

  • This algorithm works well if all the clusters are dense enough, and they are well separated by low-density regions. It is great for separating clusters of high density versus low density within a given dataset.
  • Great for handling outliers within dataset.

Disadvantages

  • It does not work well when clusters have widely varying densities: a single ε (eps) and min_samples setting is used for all clusters, so DBSCAN struggles to separate a dense cluster from a nearby sparse one.

  • It struggles with high-dimensional data.

Spectral clustering

This algorithm takes a similarity matrix between the instances and creates a low-dimensional embedding from it (i.e., it reduces its dimensionality), then it uses another clustering algorithm in this low-dimensional space (Scikit-Learn's implementation uses K-Means). Spectral clustering can capture complex cluster structures, and it can also be used to cut graphs (e.g., to identify clusters of friends on a social network); however, it does not scale well to large numbers of instances, and it does not behave well when the clusters have very different sizes.

In [19]:
sc1 = SpectralClustering(n_clusters=4, assign_labels='discretize', random_state=42)
sc1.fit(data_Min_Max)
Out[19]:
SpectralClustering(assign_labels='discretize', n_clusters=4, random_state=42)
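
To illustrate the "complex cluster structures" point, here is a small sketch (using make_moons purely for illustration) where spectral clustering with a nearest-neighbors affinity separates two interleaving half-moons, something K-Means cannot do:

# Sketch: spectral clustering on non-convex, moon-shaped clusters
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
sc_moons = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                              assign_labels='discretize', random_state=42)
moon_labels = sc_moons.fit_predict(X_moons)
print(np.bincount(moon_labels))  # size of each of the two clusters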

Agglomerative Clustering

A hierarchy of clusters is built from the bottom up. Think of many tiny bubbles floating on water and gradually attaching to each other until there is just one big group of bubbles. Similarly, at each iteration agglomerative clustering connects the nearest pair of clusters (starting with individual instances). If you draw a tree with a branch for every pair of clusters that merged, you get a binary tree of clusters, where the leaves are the individual instances. This approach can capture clusters of various shapes, it produces a flexible and informative cluster tree instead of forcing you to choose a particular cluster scale, and it can be used with any pairwise distance. It can scale nicely to large numbers of instances if you provide a connectivity matrix: a sparse m × m matrix that indicates which pairs of instances are neighbors (e.g., returned by sklearn.neighbors.kneighbors_graph()). Without a connectivity matrix, the algorithm does not scale well to large datasets; a minimal sketch with a connectivity matrix follows the cell below.

In [20]:
agg = AgglomerativeClustering(n_clusters=4,memory='joblib.Memory').fit(data_Min_Max)
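
A minimal sketch of the connectivity-matrix option mentioned above (the number of neighbors is an arbitrary choice):

# Sketch: agglomerative clustering with a sparse k-nearest-neighbors connectivity matrix
import pandas as pd
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

connectivity = kneighbors_graph(data_Min_Max, n_neighbors=10, include_self=False)
agg_conn = AgglomerativeClustering(n_clusters=4, connectivity=connectivity).fit(data_Min_Max)
print(pd.Series(agg_conn.labels_).value_counts())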

Gaussian Mixture Model

A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size, density and orientation. When you observe an instance, you know it was generated from one of the Gaussian distributions, but you are not told which one, and you do not know what the parameters of these distributions are.

The key assumption when using this technique is that the data follows a mixture of Gaussian distributions. The procedure is similar to K-Means: the number of clusters (components) is chosen, the parameters are initialized, and then the probability that each data point belongs to each cluster is computed and iteratively refined.

In [21]:
%%time
gm = GaussianMixture(n_components=4, n_init=10, random_state=42)
gm.fit(data_Min_Max)
Wall time: 239 ms
Out[21]:
GaussianMixture(n_components=4, n_init=10, random_state=42)
In [22]:
gm.weights_
Out[22]:
array([0.25370461, 0.25071153, 0.24581008, 0.24977378])
In [23]:
gm.means_
Out[23]:
array([[0.27232177, 0.65637108],
       [0.61294467, 0.36643653],
       [0.2683217 , 0.26798464],
       [0.71654869, 0.75419008]])

K-Means: uses a deterministic approach and assigns each data point to one unique cluster (ex: Derrick Henry = TE). This is referred to as a hard clustering method.

GMM: uses a probabilistic approach and gives the probability of each data point belonging to any of the clusters. This is referred to as a soft clustering method.
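
The sketch below contrasts the two on the scaled synthetic data: KMeans.predict() returns a single cluster index per instance, while GaussianMixture.predict_proba() (using the gm model fitted above) returns one probability per cluster.

# Sketch: hard assignments (K-Means) vs. soft assignments (GMM) for the first 5 instances
from sklearn.cluster import KMeans

km_hard = KMeans(n_clusters=4, random_state=42).fit(data_Min_Max)
print(km_hard.predict(data_Min_Max[:5]))            # one cluster index per instance
print(gm.predict_proba(data_Min_Max[:5]).round(3))  # one probability per component per instance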

Execution Time

In [24]:
%%time
# k-mean
n_clus_reg_kmean,time_reg_kmean,inertia_reg_kmean,silhouette_reg_kmean=kmean_eda(data=data_Min_Max,k_start=2,k_end=10,
                                                        kint=1,minibatch=False,batch_size=100,silhouette=True)
n_clus_mbat_kmean,time_mbat_kmean,inertia_mbat_kmean,silhouette_mbat_kmean=kmean_eda(data=data_Min_Max,k_start=2,k_end=10,
                                                        kint=1,minibatch=True,batch_size=100,silhouette=True)
# Gaussian mixture model
n_clus_gmm,time_gmm, silhouette_gmm=gmm_eda(data=data_Min_Max,k_start=2,k_end=10,kint=1)
Wall time: 12 s
In [25]:
font = {'size'   :15}
plt.rc('font', **font)
fig=plt.figure(figsize=(12, 10), dpi= 120, facecolor='w', edgecolor='k')


ax1=plt.subplot(2,1,1)
plt.plot(n_clus_reg_kmean,time_reg_kmean,color='b', marker='s', label=f'Regular K-Means (sum={int(np.sum(time_reg_kmean))}s)', 
         markersize=8.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_mbat_kmean,time_mbat_kmean,color='r', marker='o', label=f'Mini-Batch K-Means (sum={int(np.sum(time_mbat_kmean))}s)', 
         markersize=8.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_gmm,time_gmm,color='g', marker='>', label=f'Gaussian Mixture Model (sum={int(np.sum(time_gmm))}s)', 
         markersize=8.5,linewidth=2,markeredgecolor='k')

plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Training Time (Seconds)')
plt.xlabel('Number of Clusters (k)',labelpad=14)
plt.ylabel('Time (Seconds)')


plt.legend(loc=2,ncol=1,markerscale=1.0,fontsize=12)
plt.xticks(np.arange(1,11),rotation=0,y=0.0, fontsize=15)
plt.xlim(1.5,10.5)
plt.subplots_adjust(wspace=0.12)
plt.subplots_adjust(hspace=0.4)

Results

In [26]:
# Min and Max scaling
scaler = MinMaxScaler()
scaler.fit(data)
data_Min_Max=scaler.transform(data)
In [27]:
# batch kmeans
start_time = time.time()
kmeans_batch = MiniBatchKMeans(n_clusters=4,batch_size=1000)
kmeans_batch.fit_predict(data_Min_Max)
time_took_ = time.time() - start_time
print (time_took_)
0.0709993839263916
In [28]:
%%time
# Calculate 
# regular kmeans
np.random.seed(42)
time_took=np.zeros(5)
wss_=np.zeros(5)
start_time = time.time()
kmeans = KMeans(n_clusters=4)
kmeans.fit_predict(data_Min_Max)
time_took[0] = time.time() - start_time
km_pred=kmeans.labels_
km_pred_score=silhouette_score(data_Min_Max, kmeans.labels_)
km_pred= (pd.Series(km_pred).value_counts()/len(km_pred)).to_dict()
wss_[0]=wss_value(data_Min_Max, kmeans.cluster_centers_)

# batch kmeans
start_time = time.time()
kmeans_batch = MiniBatchKMeans(n_clusters=4,batch_size=1000)
kmeans_batch.fit_predict(data_Min_Max)
time_took[1] = time.time() - start_time
km_batch_pred= kmeans_batch.labels_
km_batch_score=silhouette_score(data_Min_Max, km_batch_pred)
km_batch_pred= (pd.Series(km_batch_pred).value_counts()/len(data_Min_Max)).to_dict()
wss_[1]=wss_value(data_Min_Max, kmeans_batch.cluster_centers_)

# gmm
start_time = time.time()
gm = GaussianMixture(n_components=4, n_init=10, random_state=42)
gm.fit(data_Min_Max)
time_took[2] = time.time() - start_time
gmm_pred_score=silhouette_score(data_Min_Max, gm.predict(data_Min_Max))
gmm_pred=(pd.Series(gm.predict(data_Min_Max)).value_counts()/len(data_Min_Max)).to_dict()
wss_[2]=wss_value(data_Min_Max, gm.means_)

# Spectral clustering
start_time = time.time()
sc = SpectralClustering(n_clusters=4, assign_labels='discretize', random_state=42)
sc.fit(data_Min_Max)
time_took[3] = time.time() - start_time
sc_pred_= sc.labels_
sc_pred_score=silhouette_score(data_Min_Max, sc_pred_)
sc_pred= (pd.Series(sc_pred_).value_counts()/len(data_Min_Max)).to_dict()
sc_centre=np.array([np.mean(data_Min_Max[sc_pred_==i,:], axis=0) for i in range(4)])
wss_[3]=wss_value(data_Min_Max, sc_centre)

# Agglomerative clustering
start_time = time.time()
agg = AgglomerativeClustering(n_clusters=4,memory='joblib.Memory').fit(data_Min_Max)
time_took[4] = time.time() - start_time
agg_pred_= agg.labels_
agg_pred_score=silhouette_score(data_Min_Max, agg_pred_)
agg_pred= (pd.Series(agg_pred_).value_counts()/len(data_Min_Max)).to_dict()
agg_centre=np.array([np.mean(data_Min_Max[agg_pred_==i,:], axis=0) for i in range(4)])
wss_[4]=wss_value(data_Min_Max, agg_centre)
Wall time: 2.27 s
In [29]:
font = {'size'   : 10}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 140, facecolor='w', edgecolor='k')
#
ax1=plt.subplot(2,1,1)
tech=['Regular kmean','Mini Batch kmean','Gaussian Mixture','Spectral Clustering','Agglomerative Clustering']

val=time_took
ind=np.arange(5)
color=['#323539','#8DC63F','cyan','y','g']
ax1.bar(ind, val,width=0.35,lw = 1.0, align='center', alpha=1, ecolor='black',
        edgecolor='k',capsize=10,color=color)

for ii in range(len(val)):
    plt.text(ii-0.15, time_took[ii]+0.009,f'{"{0:.3f}".format(time_took[ii])}s',
    fontsize=10,rotation=0,color='k') 
            
plt.title('Execution Time (Seconds) for Training with 4 Clusters (k=4)',fontsize=13)
plt.ylabel('Execution Time (Seconds)',fontsize=12)
ax1.axhline(min(val),linewidth=0.6, color='k',linestyle='--')
plt.xticks(ind, tech,fontsize=10, rotation=30,y=0.05)
plt.ylim(0, 1.1)
plt.grid(linewidth='0.15')
plt.show()
In [30]:
font = {'size'   : 10}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 140, facecolor='w', edgecolor='k')
#
ax1=plt.subplot(2,1,1)
val=(km_pred_score,km_batch_score,gmm_pred_score,sc_pred_score,agg_pred_score)
ax1.bar(ind, val,width=0.35,lw = 1.0, align='center', alpha=1, ecolor='black',
        edgecolor='k',capsize=10,color=color)

for ii in range(len(val)):
    plt.text(ii-0.18, val[ii]+0.009,f'{"{0:.5f}".format(val[ii])}',
    fontsize=10,rotation=0,color='k') 
    
plt.title('Silhouette Score for 4 Clusters (k=4)',fontsize=13)
plt.ylabel('Silhouette Score',fontsize=12)
plt.xticks(ind, tech,fontsize=11, rotation=30,y=0.05)
ax1.axhline(max(val),linewidth=0.6, color='k',linestyle='--')
plt.grid(linewidth='0.15')
plt.ylim(0, 0.75)
plt.show()
In [31]:
font = {'size'   :8 }
plt.rc('font', **font)
fig=plt.figure(figsize=(6, 5), dpi= 200, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1) 

cluster=['Cluster 1','Cluster 2','Cluster 3','Cluster 4']
plt.plot(cluster,km_pred.values(), '-o',color=color[0], label='Regular kmean', markersize=4,linewidth=1.5)
plt.plot(cluster,km_batch_pred.values(), '-o',color=color[1], label='Mini Batch kmean', markersize=4,linewidth=1.5)
plt.plot(cluster,sc_pred.values(), '-s',color=color[2], label='Spectral clustering', markersize=4,linewidth=1.5)
plt.plot(cluster,agg_pred.values(), '->',color=color[3], label='Agglomerative clustering', markersize=4,linewidth=1.5)
plt.plot(cluster,gmm_pred.values(), '-*',color=color[4], label='Gaussian Mixture', markersize=5.5,linewidth=1.5)

plt.grid(True,which="both",ls="-",linewidth=0.1)
plt.title('Percentage of each Cluster for Each Technique',fontsize=9.5)
plt.ylabel('Percentage of each Cluster',fontsize=7)
#plt.legend(loc=9,bbox_to_anchor=(0.2, 1.0015), ncol=3,fontsize=6,markerscale=1.6)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.legend(loc=1, ncol=2,fontsize=6,markerscale=1.2)
plt.xticks(rotation=0,y=0.0, fontsize=8)
plt.show()

Agglomerative clustering produces clusters with noticeably imbalanced percentages.

In [32]:
font = {'size'   : 10}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 140, facecolor='w', edgecolor='k')
#
ax1=plt.subplot(2,1,1)
val=wss_
ax1.bar(ind, val,width=0.35,lw = 1.0, align='center', alpha=1, ecolor='black',
        edgecolor='k',capsize=10,color=color)

for ii in range(len(val)):
    plt.text(ii-0.18, val[ii]+5,f'{"{0:.3f}".format(val[ii])}',
    fontsize=10,rotation=0,color='k') 
    
plt.title('WSS (Within Cluster Sums of Square Difference) for 4 Clusters (k=4) \n MinMax Scale',fontsize=13)

ax1.axhline(min(val),linewidth=0.6, color='k',linestyle='--')
ax1.grid(linewidth='0.05')
plt.ylim((0, 400))
plt.ylabel('WSS',fontsize=12)
plt.xticks(ind, tech,fontsize=11, rotation=30,y=0.05)
plt.show()

The techniques have very similar WSS. Mini-Batch K-Means has the lowest running time and a high Silhouette score.

Football Premier League Data

K-Means

Premier League Player Stats Data is downloaded from https://www.kaggle.com/datasets/themlphdstudent/premier-league-player-stats-data

  • Rank : Rank of the player
  • Player : Player name
  • Team : Player team name
  • GP : Games played
  • GS : Games started
  • MIN : Minutes played
  • G : Goals
  • ASST : Assists
  • SHOTS : Total shots
  • SOG : Shots on goal
In [33]:
#load our data from CSV
premier_footbal_data =pd.read_csv('./Data/Premier_League_Player_Stats.csv', 
                                  encoding= 'unicode_escape')
premier_footbal_data
Out[33]:
Rank PLAYER TEAM GP GS MIN G ASST SHOTS SOG
0 1 Jamie Vardy Leicester City 35 34 3034 23 5 71 43
1 2 Danny Ings Southampton 38 32 2812 22 2 66 38
2 3 Pierre-Emerick Aubameyang Arsenal 36 35 3138 22 3 70 42
3 4 Raheem Shaquille Sterling Manchester City 33 30 2660 20 1 68 38
4 5 Mohamed Salah Ghaly Liverpool 34 33 2884 19 10 95 59
... ... ... ... ... ... ... ... ... ... ...
535 536 Dennis Cirkin Tottenham Hotspur 0 0 0 0 0 0 0
536 537 Darnell Johnson Leicester City 0 0 0 0 0 0 0
537 538 Timothy Fosu-Mensah Manchester United 3 2 136 0 0 1 0
538 539 Conor Coventry West Ham United 0 0 0 0 0 0 0
539 540 Alex Cochrane Brighton and Hove Albion 0 0 0 0 0 0 0

540 rows × 10 columns

In [34]:
premier_footbal_data=premier_footbal_data.drop(['Rank','PLAYER','TEAM'], axis=1)
premier_footbal_data
Out[34]:
GP GS MIN G ASST SHOTS SOG
0 35 34 3034 23 5 71 43
1 38 32 2812 22 2 66 38
2 36 35 3138 22 3 70 42
3 33 30 2660 20 1 68 38
4 34 33 2884 19 10 95 59
... ... ... ... ... ... ... ...
535 0 0 0 0 0 0 0
536 0 0 0 0 0 0 0
537 3 2 136 0 0 1 0
538 0 0 0 0 0 0 0
539 0 0 0 0 0 0 0

540 rows × 7 columns

In [35]:
#Standardization
scaler = StandardScaler()
scaler.fit(premier_footbal_data)
data_scale=scaler.transform(premier_footbal_data)
In [36]:
%%time
n_clus_mbat_kmean,time_mbat_kmean,inertia_mbat_kmean,silhouette_mbat_kmean=kmean_eda(data=data_scale,k_start=2,k_end=10,
                                                        kint=1,minibatch=False,batch_size=1000,silhouette=True)
Wall time: 435 ms

As you can see in the figure below, the inertia drops very quickly as we increase k up to 3, but then it decreases much more slowly as we keep increasing k. The curve has roughly the shape of an arm with an "elbow" around k=3, so k=3 is a reasonable choice for this dataset: lower values would lose a lot of information, while higher values would not help much and might just split perfectly good clusters for no good reason.

In [37]:
font = {'size'   :15}
plt.rc('font', **font)
fig=plt.figure(figsize=(12, 10), dpi= 100, facecolor='w', edgecolor='k')

ax2=plt.subplot(2,1,1)
#plt.plot(n_clus_reg,inertia_c_reg,color='b', marker='s', label='Regular K-Means', markersize=8.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_mbat_kmean,inertia_mbat_kmean,color='r', marker='o', label='Regular K-Means', markersize=8.5,linewidth=2,markeredgecolor='k')
plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Inertia (Mean square Distance) \n Standardization Scale')
plt.xlabel('Number of Clusters (k)',labelpad=14)
plt.ylabel('Inertia')
plt.axvline(x=3,color='k',linewidth=4,linestyle='--')
plt.legend(loc=1,ncol=1,markerscale=1.0)
plt.xticks(np.arange(1,10),rotation=0,y=0.0, fontsize=15)

plt.text(3,1750, 'Optimum k=3',fontsize=15,color='k', 
         fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=5)) 
plt.text(6.5, 1000, 'n='+str(len(data_scale)), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))    
plt.xlim(2,7)
plt.subplots_adjust(wspace=0.12)
plt.subplots_adjust(hspace=0.4)
In [38]:
font = {'size'   :15}
plt.rc('font', **font)
fig=plt.figure(figsize=(12, 10), dpi= 100, facecolor='w', edgecolor='k')

ax1=plt.subplot(2,1,1)
#plt.plot(n_clus_reg,silhouette_kmean_reg,color='b', marker='s', label='Regular K-Means', 
#         markersize=8.5,linewidth=2,markeredgecolor='k')
plt.plot(n_clus_mbat_kmean,silhouette_mbat_kmean,color='r', marker='o', label='Regular K-Means', 
         markersize=8.5,linewidth=2,markeredgecolor='k')

plt.grid(True,which="both",ls="-",linewidth=0.5)
plt.title('Number of Clusters (k) vs. Silhouette Score (-1 to 1) (Scaled Data) \n Standardization Scale')
plt.xlabel('Number of Clusters (k)',labelpad=12)
plt.ylabel('Silhouette score')
plt.axvline(x=3,color='k',linewidth=4,linestyle='--')
plt.legend(loc=4,ncol=1,markerscale=1.0,fontsize=13)
plt.xticks(np.arange(1,8),rotation=0,y=0.0, fontsize=15)
plt.xlim(2,7)
plt.text(3,0.4, 'Optimum k=3',fontsize=15,color='k', 
         fontweight='bold',style='oblique', ha='center',
             va='top', wrap=True,bbox=dict(facecolor='w', alpha=1,pad=5)) 
plt.text(6.5, 0.46, 'n='+str(len(data_scale)), ha = 'center',fontsize=20,bbox=dict(facecolor='k', alpha=0.2))    
plt.show()

A more informative visualization can be achieved by plotting every instance's silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient. This is called a silhouette diagram.

The vertical dashed lines represent the silhouette score for each number of clusters. When most of the instances in a cluster have a lower coefficient than this score (i.e., if many of the instances stop short of the dashed line, ending to the left of it), then the cluster is rather bad, since its instances are much too close to other clusters. The silhouette diagrams below can be read in this way to compare the candidate values of k for this dataset.

In [39]:
kmeans_per_k = [KMeans(n_clusters=k,random_state=42).fit(data_scale) for k in range(3, 7)]
silhouette_scores = [silhouette_score(data_scale, model.labels_) for model in kmeans_per_k]
In [40]:
font = {'size'   :13 }
plt.rc('font', **font)
fig=plt.figure(figsize=(11, 9), dpi= 80, facecolor='w', edgecolor='k')

for k in (3, 4, 5, 6):
    plt.subplot(2, 2, k - 2)
    
    y_pred = kmeans_per_k[k - 3].labels_
    silhouette_coefficients = silhouette_samples(data_scale, y_pred)

    padding = len(data_scale) // 30
    pos = padding
    ticks = []
    for i in range(k):
        coeffs = silhouette_coefficients[y_pred == i]
        coeffs.sort()

        color = mpl.cm.Spectral(i / k)
        plt.fill_betweenx(np.arange(pos, pos + len(coeffs)), 0, coeffs,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ticks.append(pos + len(coeffs) // 2)
        pos += len(coeffs) + padding

    plt.gca().yaxis.set_major_locator(FixedLocator(ticks))
    plt.gca().yaxis.set_major_formatter(FixedFormatter(range(k)))
    if k in (3, 5):
        plt.ylabel("Cluster")
    
    if k in (5, 6):
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        plt.xlabel("Silhouette Coefficient", fontsize=15)
    else:
        plt.gca().set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
        #plt.tick_params(labelbottom=False)

    plt.axvline(x=silhouette_scores[k - 3], color="k", linestyle="--",lw=4)
    plt.title(f"k={k} (Silhouette={np.round(silhouette_scores[k - 3],3)})", fontsize=19)
    plt.grid(True,which="both",ls="-",linewidth=0.2)
    plt.xlim((-0.1, 0.8))

plt.subplots_adjust(wspace=0.2)
plt.subplots_adjust(hspace=0.3)
plt.show()

Spectral clustering

In [41]:
%%time
sc = SpectralClustering(n_clusters=3, assign_labels='discretize', random_state=42)
sc.fit(data_scale)
#sc1.labels_
Wall time: 126 ms
Out[41]:
SpectralClustering(assign_labels='discretize', n_clusters=3, random_state=42)

Agglomerative Clustering

In [42]:
%%time
agg = AgglomerativeClustering(n_clusters=3,memory='joblib.Memory').fit(data_scale)
#agg.labels_
Wall time: 3 ms

Gaussian Mixture Model

In [43]:
%%time
gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(data_scale)
#gm.predict(data_scale)
Wall time: 222 ms
Out[43]:
GaussianMixture(n_components=3, n_init=10, random_state=42)

Results

In [44]:
# regular kmeans
np.random.seed(42)
wss_=np.zeros(4)

kmeans = KMeans(n_clusters=3)
kmeans.fit_predict(data_scale)
km_pred_=kmeans.labels_
km_pred_score=silhouette_score(data_scale, kmeans.labels_)
km_pred= (pd.Series(km_pred_).value_counts()/len(km_pred_)).to_dict()
wss_[0]=wss_value(data_scale, kmeans.cluster_centers_)

# Spectral clustering
sc = SpectralClustering(n_clusters=3, assign_labels='discretize', random_state=42)
sc.fit(data_scale)
sc_pred_= sc.labels_
sc_score=silhouette_score(data_scale, sc_pred_)
sc_pred= (pd.Series(sc_pred_).value_counts()/len(data_scale)).to_dict()
sc_centre=np.array([np.mean(data_scale[sc_pred_==i,:], axis=0) for i in range(3)])
wss_[1]=wss_value(data_scale, sc_centre)

# Agglomerative clustering
agg = AgglomerativeClustering(n_clusters=3,memory='joblib.Memory').fit(data_scale)
agg_pred_= agg.labels_
agg_score=silhouette_score(data_scale, agg_pred_)
agg_pred= (pd.Series(agg_pred_).value_counts()/len(data_scale)).to_dict()
agg_centre=np.array([np.mean(data_scale[agg_pred_==i,:], axis=0) for i in range(3)])
wss_[2]=wss_value(data_scale, agg_centre)

# gmm
gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(data_scale)
gmm_pred_score=silhouette_score(data_scale, gm.predict(data_scale))
gmm_pred_=gm.predict(data_scale)
gmm_pred=(pd.Series(gm.predict(data_scale)).value_counts()/len(data_scale)).to_dict()
wss_[3]=wss_value(data_scale, gm.means_)
In [45]:
font = {'size'   : 10}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 140, facecolor='w', edgecolor='k')
#
ax1=plt.subplot(2,1,1)
val=(km_pred_score,sc_score,agg_score,gmm_pred_score)
tech=['Regular kmean','Spectral clustering','Agglomerative clustering','Gaussian Mixture']
ind=np.arange(4)
color=['#323539','#8DC63F','#025174','cyan','y','g']
ax1.bar(ind, val,width=0.35,lw = 1.0, align='center', alpha=1, ecolor='black',
        edgecolor='k',capsize=10,color=color)

for ii in range(len(val)):
    plt.text(ii-0.15, val[ii]+0.009,f'{"{0:.3f}".format(val[ii])}',
    fontsize=10,rotation=0,color='k') 

plt.title('Silhouette Score for 3 Clusters (k=3)',fontsize=13)
plt.ylabel('Silhouette Score',fontsize=12)
ax1.axhline(max(val),linewidth=0.6, color='k',linestyle='--')
plt.xticks(ind, tech,fontsize=11)
ax1.grid(linewidth='0.15')
plt.ylim(0,0.55)
plt.show()
In [46]:
font = {'size'   :8 }
plt.rc('font', **font)
fig=plt.figure(figsize=(6, 5), dpi= 200, facecolor='w', edgecolor='k')
ax1=plt.subplot(2,1,1) 

cluster=['Cluster 1','Cluster 2','Cluster 3']
plt.plot(cluster,km_pred.values(), '-o',color=color[0], label='Regular kmean', markersize=4,linewidth=1.5)
plt.plot(cluster,sc_pred.values(), '-s',color=color[1], label='Spectral clustering', markersize=4,linewidth=1.5)
plt.plot(cluster,agg_pred.values(), '->',color=color[2], label='Agglomerative clustering', markersize=4,linewidth=1.5)
plt.plot(cluster,gmm_pred.values(), '-*',color=color[3], label='Gaussian Mixture', markersize=5.5,linewidth=1.5)

plt.grid(True,which="both",ls="-",linewidth=0.1)
plt.title('Percentage of each Cluster for Each Technique',fontsize=9.5)
plt.ylabel('Percentage of each Cluster',fontsize=7)
#plt.legend(loc=9,bbox_to_anchor=(0.2, 1.0015), ncol=3,fontsize=6,markerscale=1.6)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.legend(loc=1, ncol=2,fontsize=6,markerscale=1.2)
plt.xticks(rotation=0,y=0.0, fontsize=8)
plt.show()
In [47]:
font = {'size'   : 10}
plt.rc('font', **font)
fig, ax1 = plt.subplots(figsize=(9, 8), dpi= 140, facecolor='w', edgecolor='k')
#
ax1=plt.subplot(2,1,1)
val=wss_
ax1.bar(ind, val,width=0.35,lw = 1.0, align='center', alpha=1, ecolor='black',
        edgecolor='k',capsize=10,color=color)

for ii in range(len(val)):
    plt.text(ii-0.15, val[ii]+10,f'{"{0:.3f}".format(val[ii])}',
    fontsize=10,rotation=0,color='k') 
    
plt.title('WSS (Within Cluster Sums of Square Difference) for 3 Clusters (k=3)',fontsize=13)
ax1.axhline(min(val),linewidth=0.6, color='k',linestyle='--')
ax1.grid(linewidth='0.05')
plt.ylim((0, 900))
plt.ylabel('WSS',fontsize=12)
plt.xticks(ind, tech,fontsize=11, rotation=20,y=0.05)
plt.show()
In [48]:
pca = PCA(n_components = 2)
X2D = pca.fit_transform(data_scale)
#
font = {'size'   : 12}
plt.rc('font', **font)
fig, ax=plt.subplots(figsize=(10, 10), dpi= 100, facecolor='w', edgecolor='k')

cmap = plt.cm.rainbow
norm = BoundaryNorm(np.arange(-0.5, 3, 1), cmap.N)

# regular kmeans
ax=plt.subplot(2,2,1) 
sc=ax.scatter(X2D[:, 0], X2D[:, 1],c=km_pred_, cmap=cmap, norm=norm, s=40, edgecolor='k')
cbar = plt.colorbar(sc,orientation='vertical',shrink=0.5)
cbar.set_ticks([0., 1., 2.])
cluster=['Cluster 1', ' Cluster 2', 'Cluster 3']
cbar.set_ticklabels(cluster)
ax.set_xlabel("$X$", fontsize=18)
ax.set_ylabel("$Y$", fontsize=18, rotation=0)
plt.gca().set_aspect('1')
ax.set_title('Regular K-Mean',fontsize=13, y=1)
plt.xlim((-3.1, 3.2))
plt.ylim((-2.5, 2.8))
ax.grid(True, linewidth=0.35)


# Spectral clustering
ax=plt.subplot(2,2,2) 
sc=ax.scatter(X2D[:, 0], X2D[:, 1],c=sc_pred_, cmap=cmap, norm=norm, s=40, edgecolor='k')
cbar = plt.colorbar(sc,orientation='vertical',shrink=0.5)
cbar.set_ticks([0., 1., 2.])
cbar.set_ticklabels(cluster)
ax.set_xlabel("$X$", fontsize=18)
ax.set_ylabel("$Y$", fontsize=18, rotation=0)
plt.gca().set_aspect('1')
ax.set_title('Spectral clustering',fontsize=13, y=1)
plt.xlim((-3.1, 3.2))
plt.ylim((-2.5, 2.8))
ax.grid(True, linewidth=0.35)


# Agglomerative clustering
ax=plt.subplot(2,2,3) 
sc=ax.scatter(X2D[:, 0], X2D[:, 1],c=agg_pred_, cmap=cmap, norm=norm, s=40, edgecolor='k')
cbar = plt.colorbar(sc,orientation='vertical',shrink=0.5)
cbar.set_ticks([0., 1., 2.])
cbar.set_ticklabels(cluster)
ax.set_xlabel("$X$", fontsize=18)
ax.set_ylabel("$Y$", fontsize=18, rotation=0)
plt.gca().set_aspect('1')
ax.set_title('Agglomerative clustering',fontsize=13, y=1)
plt.xlim((-3.1, 3.2))
plt.ylim((-2.5, 2.8))
ax.grid(True, linewidth=0.35)
cbar.set_ticks([0., 1., 2.])

# Gaussian Mixture clustering
ax=plt.subplot(2,2,4) 
sc=ax.scatter(X2D[:, 0], X2D[:, 1],c=gmm_pred_, cmap=cmap, norm=norm, s=40, edgecolor='k')
cbar = plt.colorbar(sc,orientation='vertical',shrink=0.5)
cbar.set_ticks([0., 1., 2.])
cbar.set_ticklabels(cluster)
ax.set_xlabel("$X$", fontsize=18)
ax.set_ylabel("$Y$", fontsize=18, rotation=0)
plt.gca().set_aspect('1')
ax.set_title('Gaussian Mixture',fontsize=13, y=1)
plt.xlim((-3.1, 3.2))
plt.ylim((-2.5, 2.8))
ax.grid(True, linewidth=0.35)

fig.suptitle('Dimensionality Reduction with PCA for Clusters', fontsize=16,y=0.88)
plt.subplots_adjust(wspace=0.3)
plt.subplots_adjust(hspace=-0.2)

plt.show()