
The Beginner's Guide to Clustering Algorithms and How to Apply Them in Python



Introduction to clustering algorithms

Oftentimes, you might be in a situation where the available data is unlabeled. Since there are no labels, you cannot perform classification. In such scenarios, you have to employ unsupervised machine learning techniques. One such approach is clustering. In clustering, the objective is to group the data points into separate groups based on their similarity. For example, you may have customer data and want to segment the customers into groups based on their behavior. Other applications of clustering include image segmentation, document clustering, anomaly detection, and recommendation engines. These grouping problems can be solved by a wide range of clustering algorithms, which work differently and require different configurations.

In this article, let’s take a look at some of these clustering algorithms and how to apply them. We’ll also take a look at how you can generate a clustering dataset and fit it into a clustering algorithm.

Getting started with clustering in Python

The quickest way to get started with clustering in Python is through the Scikit-learn library. Once the library is installed, you can choose from a variety of clustering algorithms that it provides. The next thing you need is a clustering dataset.

Clustering dataset

Scikit-learn can be used to generate various datasets. For instance, you can use the library to generate a clustering dataset. This is done using the `make_blobs` function. The function expects:

  • The number of samples 
  • The number of centers 
  • The standard deviation for each cluster 
  • The number of features

from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=400, centers=3,
                       cluster_std=0.50, random_state=3)

The generated dataset is ideal for demonstrating clustering. For example, you can use Matplotlib to visualize the above data.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], s=30)

Scatter plot of the generated clustering dataset

The alternative is to generate a classification dataset and ignore the labels. 

from sklearn.datasets import make_classification

X, _ = make_classification(n_samples=400, n_features=4, random_state=22)

What are clustering algorithms?

There are numerous clustering algorithms. Most of these algorithms are distance-based, which requires the data to be scaled before it is fitted to an algorithm. If the data is high-dimensional, it is usually good practice to apply Principal Component Analysis (PCA) to reduce the effects of the curse of dimensionality on the clustering problem. PCA reduces the number of features by combining and/or removing features; the resulting features are referred to as components.
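
For high-dimensional data, a minimal sketch of this preprocessing might look like the following; it assumes the `X` array from earlier, and the number of components and clusters is purely illustrative:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, reduce them to two components, then cluster the components
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=3, random_state=42),
)
labels = pipeline.fit_predict(X)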

The other important factor to consider while clustering is the number of clusters. Clustering algorithms may have a default number for the clusters to generate. You can also define the number of clusters through guesswork or from your experience. However, you’ll see how to programmatically determine the optimal number of clusters in a later part of this article. 

Let’s now start looking at various clustering algorithms. 

K-Means clustering algorithm

K-Means is one of the most popular clustering algorithms. Given a dataset, it puts the data points into separate groups based on their similarity. The letter K stands for the number of clusters, and each cluster has a center referred to as a centroid. The "Means" in K-Means refers to the fact that the algorithm computes each cluster center as the mean (average) of the points assigned to it.

How does the K-Means algorithm work? 

The algorithm operates in the following steps (a minimal code sketch of this loop follows the list):

  • Specify K number of clusters
  • K data points are randomly selected for the centroids (data points representing the cluster centers)
  • Each sample is assigned to its nearest centroid
  • New centroids are created by taking the mean value of all of the samples assigned to each previous centroid

  • Calculate the difference between the old and new centroid
  • Repeat until the centroids do not change significantly
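
To make these steps concrete, here is a minimal NumPy sketch of the loop described above. It is an illustration of the idea rather than Scikit-learn's implementation, and the function name is made up for this example:

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K random samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each sample to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the samples assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels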

As noted above, you have to specify the number of clusters. The fact that the algorithm cannot learn the best number of clusters on its own is one of the challenges of K-Means. The assumption that data points lie close to their cluster center also makes the algorithm a poor fit for clusters with complex geometries. In such a scenario, you may consider the Spectral Clustering algorithm, which is better at handling highly non-convex clusters. Another challenge is that K-Means may slow down as the number of samples in the dataset grows. This can be addressed by using a subset of the data to update the cluster centers, a variant known as Mini-Batch K-Means (MiniBatchKMeans in Scikit-learn).
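
If you suspect your clusters have non-convex shapes, a quick sketch of Spectral Clustering on the classic two-moons dataset (generated here purely for illustration) could look like this:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving half circles: a shape K-Means struggles with
X_moons, _ = make_moons(n_samples=400, noise=0.05, random_state=3)

spectral = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                              random_state=3)
moon_labels = spectral.fit_predict(X_moons)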

K-Means clustering basic example

Let’s look at K-Means in action by generating a simple dataset. The dataset contains 400 samples, 3 centers, and a cluster standard deviation of 4.2. A random state of 3 is defined for reproducibility. 

from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3,
                  cluster_std=4.2, random_state=3)

The next step is to import the algorithm and instantiate it with the required number of clusters.

from sklearn.cluster import KMeans

kmeans = KMeans(random_state=42, n_clusters=4)

You can check the parameters of the model after instantiating it. Some of these parameters include: 

  • `init` defines the initialization method. You can choose between `k-means++` and `random`. `k-means++` selects the initial cluster centers in a way that speeds up convergence, while `random` picks them at random.
  • `max_iter` is the maximum number of iterations in a single run.
  • `random_state` controls the random number generation used for centroid initialization.

kmeans.get_params()
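
If you want to override the defaults, these parameters can also be passed explicitly when instantiating the model; the values below are just an example:

kmeans = KMeans(n_clusters=4, init="k-means++", max_iter=300, random_state=42)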

The next step here is to fit the algorithm to the data.

kmeans.fit(X)

After this, you can check the cluster centers using the `cluster_centers_` attribute. 

kmeans.cluster_centers_

This will return an array of 4 cluster centers, each given by a pair of coordinates, because the chosen number of clusters was 4.


The `inertia_` attribute indicates how internally coherent the clusters are. The number is the sum of squared distances of the samples to their closest cluster center.

kmeans.inertia_
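
To see where this number comes from, you can recompute it by hand as the sum of squared distances between each sample and its assigned cluster center; a quick sanity-check sketch:

import numpy as np

# Squared distance from every sample to the center of the cluster it was assigned to
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)
print(manual_inertia)  # matches kmeans.inertia_ up to floating point error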

Finding the best number of clusters for K-Means

So far, you have specified the number of clusters manually. Let’s now take a look at a technique that can be used to identify the number of clusters programmatically. 

The elbow method

In this method, K-Means is applied with different numbers of clusters while recording the sum of squared errors (the inertia) for each run.

sse = []
for k in range(1, 13):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    sse.append(kmeans.inertia_)

You can then create a `DataFrame` containing the clusters and the accompanying error. 

import pandas as pd

cluster_df = pd.DataFrame({'Cluster': range(1, 13), 'sse': sse})
cluster_df.head()

The next step is to plot these errors and identify the elbow point. This is the point at which the curve starts to bend.

plt.figure(figsize=(12, 6))
plt.plot(cluster_df['Cluster'], cluster_df['sse'])
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')

Elbow plot of inertia against the number of clusters

In this case, that elbow point seems to be between 2 and 4. Instead of checking this manually, you can employ another technique to obtain the exact number of clusters. 

This can be done using the `KneeLocator` class from the `kneed` package. It is similar to what you did above, but it returns a precise number of clusters.

from kneed import KneeLocator

kl = KneeLocator(range(1, 13), sse, curve="convex", direction="decreasing")

The elbow attribute can then be used to obtain the optimal number of clusters. 

kl.elbow

With the optimal number of clusters, the next step is to fit the model and make predictions. 

kmeans = KMeans(n_clusters=kl.elbow)
kmeans.fit(X)
predictions = kmeans.predict(X)

Cluster analysis

You can then use the above predictions and the initial data to visualize the clusters using Matplotlib. The centers of the clusters, taken from `cluster_centers_`, are marked in blue. In the visualization, you can see the four clusters and their centers.

centers = kmeans.cluster_centers_

plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=predictions, s=30, cmap='cividis')
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=200, alpha=0.6)

Scatter plot of the four clusters, with the cluster centers marked in blue

K-Means clustering image compression example

K-Means can also be applied to reduce the number of colors in an image. An image may contain millions of distinct colors, and many of them can be removed without severely distorting the image. The quality of the image will be lower, and so will its size. Image compression like this is useful for displaying images on devices that support only a few colors. To illustrate this, let's use the flower image that ships with Scikit-learn. The first step is to load it and visualize it.

from sklearn.datasets import load_sample_image

flower = load_sample_image("flower.jpg")
plt.figure(figsize=(10, 7))
plt.axis('off')
plt.imshow(flower)

The sample flower image

The image is currently a three-dimensional array representing the height, width, and RGB intensities; its shape is (427, 640, 3).

You therefore have to scale the pixel values to be between 0 and 1 and then reshape the image into a two-dimensional array of pixels. The image is already a NumPy array since it was loaded with Scikit-learn; you would have to convert it to a NumPy array yourself if you had loaded it with another package.

data = flower / 255.0
data = data.reshape(427 * 640, 3)
data.shape

Let’s now apply K-Means clustering to reduce these colors. The first step is to instantiate K-Means with the number of preferred clusters. These clusters represent the number of colors you would like for the image. Let’s reduce the image to 24 colors. 

kmeans = KMeans(random_state=42, n_clusters=24)
kmeans.fit(data)

The next step is to obtain the labels and the centroids. The centroids represent the new colors. Every pixel in the image is associated with the cluster it belongs to. 

centroids = kmeans.cluster_centers_
labels = kmeans.predict(data)

Let’s now rebuild the image from the labels and the centroids. After that, reshape it back to three dimensions, which is necessary for visualizing it with Matplotlib.

data_2d = centroids[labels]
data_3d = data_2d.reshape(427, 640, 3)

It’s now time to visualize the image in 24 colors. 

plt.figure(figsize=(10, 7))
plt.imshow(data_3d)
plt.axis('off')
plt.show()

You can see that some information has been lost but the flower is still clear. 

Mini-Batch K-Means

You have already learned that K-Means can be slow when applied to a large dataset; the image compression example above is a good illustration. This challenge is addressed by Mini-Batch K-Means. Instead of the entire dataset, the algorithm uses small random subsets (mini-batches) to update the cluster centroids, which makes it faster than K-Means on large datasets.
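
As a rough illustration, you can time both estimators on the reshaped flower pixels from the compression example above; the batch size here is arbitrary, and the exact speed-up depends on your machine:

import time

from sklearn.cluster import KMeans, MiniBatchKMeans

start = time.time()
KMeans(n_clusters=24, random_state=42).fit(data)
print(f"KMeans took {time.time() - start:.2f}s")

start = time.time()
MiniBatchKMeans(n_clusters=24, batch_size=1024, random_state=42).fit(data)
print(f"MiniBatchKMeans took {time.time() - start:.2f}s")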

Let’s look at how the algorithm can be used to group similar digits.

Mini-Batch K-Means clustering digits classification example

The process of applying Mini-Batch K-Means to the digits is quite similar to what was done in the image compression example. Let’s start by loading the digits.

from sklearn.datasets import load_digits

digits = load_digits()
data = digits.data

Next, scale the pixel values so that they lie between 0 and 1. This is important because K-Means uses distances to compute similarities between samples. The pixels in the Scikit-learn digits images take integer values from 0 to 16, so divide by 16.

# The digits pixels range from 0 to 16
data = data / 16.0

The digits data contains the 10 digits, i.e., 0 to 9. You can therefore group the data into 10 clusters. Let’s do that using Mini-Batch K-Means.

from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=10)
kmeans.fit(data)
predictions = kmeans.predict(data)

The next step is to plot the resulting clusters using Matplotlib.

import numpy as np

for i in range(0, 10):
    # Indices of the samples assigned to cluster i
    row = np.where(predictions == i)[0]
    num = row.shape[0]
    rows = int(np.floor(num / 10.)) + 1

    print("Cluster no: " + str(i))
    print(str(num) + " items")

    plt.figure(figsize=(10, 7))
    for k in range(0, num):
        plt.subplot(rows, 10, k + 1)
        image = data[row[k]]
        image = image.reshape(8, 8)
        plt.imshow(image, cmap='cividis')
        plt.axis('off')
    plt.show()

From the results below, you can see that the algorithm can group similar digits in the same cluster. 

One of the resulting clusters of similar digits

The K-Means clustering algorithm is the crème de la crème in the clustering field. However, several other algorithms can be used for clustering. Let’s look at those next. 

How to select the clustering algorithm for your use case

You’ve already seen how to apply clustering algorithms to various problems. Clustering can be divided into two subgroups: soft and hard clustering. In hard clustering, a data point belongs to exactly one cluster. In soft clustering, a data point is assigned a probability of belonging to each cluster. Clustering algorithms also fall into different categories. Let’s take a step back and look at these categories.

Broadly speaking, the algorithms can be divided into the following categories:

  • Hierarchical clustering
  • Partitioning clustering 
  • Density-based clustering
  • Distribution-based clustering

Let’s now take a moment and examine them.

Hierarchical clustering

Hierarchical clustering works best on data with a hierarchical structure, because it builds clusters in a nested, tree-like manner.

Source. Example of hierarchical clustering

This type of clustering can further be divided into two approaches (a short Scikit-learn example follows this list):

  • Agglomerative clustering, which employs a bottom-up approach
  • Divisive clustering, which uses a top-down approach
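
Scikit-learn ships the agglomerative variant. A minimal sketch, reusing the blobs data `X` generated earlier (the linkage choice is illustrative):

from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that increases the total variance the least
agglo = AgglomerativeClustering(n_clusters=3, linkage="ward")
agglo_labels = agglo.fit_predict(X)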

Partitioning clustering

In this form of clustering, the data points are split into several partitions known as clusters. These algorithms may also be said to be centroid-based. Some algorithms that fall into this class include:

  • K-Means clustering 
  • Fuzzy C-Means clustering

The difference between K-Means clustering and Fuzzy C-Means clustering is that in Fuzzy C-Means a data point can belong to more than one cluster while in K-Means this is not possible. 

Density-based clustering

In density-based clustering, areas with numerous data points in close proximity are grouped as clusters. The result is that outliers will not be captured in the clusters. 

Source. Example of density-based clustering. 

Examples of these algorithms include:

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise) 
  • OPTICS (Ordering Points to Identify Clustering Structure)

The algorithms can be applied through Scikit-learn similar to the way you fitted K-Means to a dataset earlier.
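
For instance, a minimal DBSCAN sketch on the same data might look like this; `eps` and `min_samples` are the key knobs, and the values below are only a starting point:

from sklearn.cluster import DBSCAN

# eps: neighborhood radius, min_samples: points required to form a dense region
dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X)

# Samples labeled -1 are treated as noise (outliers) rather than cluster members
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print("Clusters found:", n_clusters)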

Distribution-based clustering

In this case, data points are divided into clusters based on the probability that they belong to a certain distribution, say a normal distribution. 

Source. Example of distribution-based clustering. 

A common example is the Gaussian mixture model, which assumes the data is drawn from a mixture of multivariate normal distributions and is fitted with the expectation-maximization (EM) algorithm.
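
In Scikit-learn this idea is available as the `GaussianMixture` estimator, which is fitted with EM. The sketch below, reusing the blobs data `X`, also shows the soft cluster memberships mentioned earlier:

from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component for each sample
soft_labels = gmm.predict_proba(X)  # probability of belonging to each component
print(gmm.means_)                   # fitted means of the Gaussian components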

Clustering metrics

Several methods can be used in evaluating clustering algorithms. Usually, this will require the ground truth and the predicted labels. These metrics include:

Homogeneity score. A clustering result satisfies homogeneity if each cluster contains only data points that are members of a single class.

Completeness score. This score checks that all members of a certain class are attributed to the same cluster.  

V-measure score. This is the harmonic mean of homogeneity and completeness.

Adjusted Rand score. This is the Rand index adjusted for chance; it measures the similarity between two clusterings.

Adjusted mutual info score. This is the Mutual Information (MI) score adjusted to account for chance.

The metrics can be accessed from Scikit-learn’s `metrics` module. After obtaining the predicted labels, say from K-Means, and the corresponding ground-truth labels, you can apply all of them in a loop.

from sklearn import metrics

# Ground-truth labels are needed for these metrics; for the digits example
# above they are available as digits.target
y = digits.target

clustering_metrics = [
    metrics.homogeneity_score,
    metrics.completeness_score,
    metrics.v_measure_score,
    metrics.adjusted_rand_score,
    metrics.adjusted_mutual_info_score,
]

for metric in clustering_metrics:
    score = metric(y, kmeans.labels_)
    print(f"The {metric.__name__} score is {score}")

K-Nearest Neighbor algorithm versus K-Means clustering

The fact that both of these algorithms start with a K can be a bit confusing. Let’s uncover the two main differences between them (a short code sketch follows the list):

  • K-Nearest Neighbor algorithm is a supervised machine learning algorithm used in classification and regression. Here the true values are known while training the model. The models can therefore be evaluated using regression and classification metrics.  
  • K-Means clustering is an unsupervised machine learning algorithm used in clustering problems. In this case, the data is not labeled, meaning that the true values are not known while training the model.
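
A short side-by-side sketch makes the contrast clear: the classifier needs labels at training time, while the clustering algorithm discovers groups from the features alone (the toy dataset below is generated purely for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# A small labeled toy dataset
X_demo, y_demo = make_blobs(n_samples=400, centers=3, random_state=3)

# Supervised: both features and labels are required to fit the model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_demo, y_demo)

# Unsupervised: only the features are used; the groups are discovered
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_demo)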

Final thoughts

In this article, you have learned about an unsupervised machine learning technique known as clustering. You have also seen how to apply clustering algorithms to a dataset. More precisely you have covered: 

  • What is clustering?
  • How to generate a clustering dataset 
  • Types of clustering algorithms 
  • How to evaluate clustering algorithms 
  • Applying K-Means clustering to digit classification and image compression 

…to mention a few. 

 

Happy clustering! 

