Overview of Clustering Algorithms

Hands-on Clustering Algorithms: A Walkthrough in Python!

Srivignesh Rajan
Towards Data Science


Clustering

Clustering is an unsupervised technique in which similar data points are grouped together to form a cluster. A clustering is said to be good if the intra-cluster similarity (among data points within the same cluster) is high and the inter-cluster similarity (between data points of different clusters) is low. Clustering can also be viewed as a data compression technique, since the data points of a cluster can be treated as a single group. Clustering is also called data segmentation because it partitions the data so that each group of similar data points forms a cluster.

How does Clustering differ from Classification?

Classification algorithms are good at distinguishing between groups, but classification requires manually labeled data, which is a tiring process when the dataset is huge. How about the reverse process, i.e., partitioning similar data points together and assigning cluster labels to them? A clustering algorithm does not require labels to proceed, since it is an unsupervised technique.

Requirements of Clustering

  • Scalability
  • Ability to deal with different types of attributes
  • Robustness to noise and outliers
  • Capability to handle high-dimensional data

Methods of Clustering

The most common methods of Clustering are,

  • Partitioning methods
  • Hierarchical methods
  • Density-based methods
  • Model-based methods

Partitioning methods: Partitioning methods divide the data into partitions and cluster groups of similar items together. Common algorithms used in this method are,

  • K-Means
  • K-Medoids
  • K-Modes

Hierarchical methods: Hierarchical methods involve decomposing the data hierarchically. Two types of hierarchical methods exist,

  • Agglomerative (bottom-up approach)
  • Divisive (top-down approach)

Density-based methods: Density-based methods are often used in anomaly detection. Data points in highly dense regions are grouped together, while low-density data points are left out as noise.

  • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  • OPTICS (Ordering Points To Identify the Clustering Structure)

Model-based methods: Model-based methods involve fitting a model to find the best cluster structure.

  • EM (Expectation-Maximization) Algorithm

In this post, we concentrate on some of the algorithms from the partitioning and density-based methods.

Load the required Libraries in Python
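
The original notebook loads its libraries at this point. A minimal set covering the algorithms walked through below might look like the following (the kmodes package is a third-party library, installable with pip install kmodes; everything else ships with NumPy and scikit-learn):

    import numpy as np

    # K-Means, DBSCAN, and OPTICS all ship with scikit-learn
    from sklearn.cluster import KMeans, DBSCAN, OPTICS
    from sklearn.metrics import silhouette_score

    # K-Modes lives in the third-party kmodes package
    from kmodes.kmodes import KModes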

Partitioning methods

1. K-Means Clustering

K-Means Clustering is a classical approach to clustering: it iteratively relocates the cluster centers by computing the mean of each cluster.

  • Initially, K-Means chooses k cluster centers at random.
  • The distance between each data point and the cluster centers is calculated (Euclidean distance is commonly used), and each data point is assigned to the cluster whose center is closest.
  • After all the data points are assigned to a cluster, the algorithm computes the mean of each cluster's data points and relocates each cluster center to the corresponding mean.
  • This process continues until the cluster centers no longer change.
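As a minimal sketch of these steps, here is K-Means run on a small synthetic dataset with scikit-learn; the three-blob data and the choice of k = 3 are purely illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    # Illustrative 2-D data: three loosely separated blobs
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
                   rng.normal((5, 5), 0.5, (50, 2)),
                   rng.normal((0, 5), 0.5, (50, 2))])

    # n_clusters is k; n_init restarts guard against a poor random
    # initialization of the cluster centers
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(kmeans.cluster_centers_)  # the final, relocated cluster centers
    print(kmeans.labels_[:10])      # cluster assignments of the first 10 points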

The main advantage of K-Means is scalability: it performs well on huge datasets. Its main disadvantage is sensitivity to outliers, which have a severe impact when computing the means of the clusters. Results also vary with the value of k and the initial choice of cluster centers. Finally, K-Means works well only for roughly spherical clusters and fails to perform well on data of arbitrary shapes.

2. K-Modes Clustering

K-Means works well for continuous data, but what about categorical data? K-Modes Clustering comes to the rescue. The algorithm is very similar to K-Means, but instead of calculating the mean of each cluster, K-Modes calculates its mode (the most frequently occurring value).

  • Initially, K-Modes chooses k cluster centers at random.
  • The similarity between each data point and the cluster centers is calculated, and each data point is assigned to the cluster whose center it is most similar to.
  • After all the data points are assigned to a cluster, the algorithm computes the mode of each cluster's data points and relocates each cluster center to the corresponding mode.
  • This process continues until the cluster centers no longer change.
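
A short sketch with the third-party kmodes package; the toy categorical table below is an assumption for illustration:

    import numpy as np
    from kmodes.kmodes import KModes

    # Illustrative categorical data: each row is (color, size, shape)
    X = np.array([["red",   "small", "round"],
                  ["red",   "small", "round"],
                  ["blue",  "large", "square"],
                  ["blue",  "large", "square"],
                  ["green", "small", "round"]])

    # K-Modes replaces the mean with the per-attribute mode of each cluster
    km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
    labels = km.fit_predict(X)

    print(km.cluster_centroids_)  # the modes acting as cluster centers
    print(labels)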

How do you find an optimal value for k?

The elbow method and the silhouette index are the most commonly used ways to find an optimal value for k.
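
As a sketch of both, the loop below fits K-Means for a range of k values on the same kind of blob data as the earlier sketch: for the elbow method you look for the k at which the within-cluster sum of squares (inertia_) stops dropping sharply, while the silhouette score is simply highest for the best-separated k:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # The same three-blob data as in the K-Means sketch above
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal((0, 0), 0.5, (50, 2)),
                   rng.normal((5, 5), 0.5, (50, 2)),
                   rng.normal((0, 5), 0.5, (50, 2))])

    for k in range(2, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss = km.inertia_                     # elbow method: within-cluster sum of squares
        sil = silhouette_score(X, km.labels_)  # silhouette: higher is better
        print(f"k={k}  WCSS={wcss:.1f}  silhouette={sil:.3f}")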

Density-based methods

1. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that has the ability to perform well on data with arbitrary shapes. DBSCAN finds dense groups of data points and assigns them to a cluster, leaving the less dense data points out as noise.

Terminologies of DBSCAN

  • If a data point q is within radius ϵ of another data point p, then q is said to be in the ϵ-neighborhood (epsilon neighborhood) of p.
  • A data point p is said to be a core object if its ϵ-neighborhood contains at least MinPts (Minimum Points) data points. For example, with MinPts = 5, p is a core object if its ϵ-neighborhood contains at least 5 data points.
  • A data point p is said to be directly density-reachable from a point q if q is a core object and p is within the ϵ-neighborhood of q.
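
These definitions translate almost directly into code; a small NumPy sketch, where the eps and min_pts values are the hyperparameters discussed below:

    import numpy as np

    def is_core_object(X, i, eps, min_pts):
        """X[i] is a core object if at least min_pts points
        (including itself) lie within radius eps of it."""
        dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distances to X[i]
        return np.sum(dists <= eps) >= min_pts

    def directly_density_reachable(X, p, q, eps, min_pts):
        """X[p] is directly density-reachable from X[q] if X[q] is a
        core object and X[p] lies within X[q]'s eps-neighborhood."""
        return (is_core_object(X, q, eps, min_pts)
                and np.linalg.norm(X[p] - X[q]) <= eps)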

How does it work? DBSCAN checks the ϵ-neighborhood of each data point. If a point p is a core object, i.e. its ϵ-neighborhood contains at least MinPts points, a new cluster is formed around p. DBSCAN then iteratively collects the points that are directly density-reachable from the cluster's core objects, and may merge a few other clusters along the way. The process continues until there are no more points left to analyze.

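A minimal sketch with scikit-learn's DBSCAN; the two-moons dataset and the eps/min_samples values are illustrative choices:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaving half-circles: an arbitrary shape K-Means handles poorly
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # eps is the neighborhood radius; min_samples corresponds to MinPts
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)

    # DBSCAN labels noise points -1; the rest get a cluster index
    print(set(db.labels_))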

The disadvantage of using DBSCAN is its two hyperparameters, ϵ and MinPts: the results vary according to the values chosen for them.

2. OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is a clustering algorithm that mitigates DBSCAN's sensitivity to its hyperparameters. OPTICS is very similar to DBSCAN, but instead of fixing the radius ϵ in advance, it works with a flexible range of radii up to ϵ.

Terminologies of OPTICS

  • The smallest radius 𝜖′ (less than or equal to ϵ) that makes the point p a core object is called the core-distance of p.
  • The reachability-distance of a point q with respect to a core object p is the maximum of the core-distance of p and the Euclidean distance between p and q.
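A short sketch with scikit-learn's OPTICS, which exposes exactly the quantities defined above; min_samples plays the role of MinPts, and no fixed ϵ has to be chosen in advance (max_eps defaults to infinity):

    from sklearn.cluster import OPTICS
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    opt = OPTICS(min_samples=5).fit(X)

    print(opt.core_distances_[:5])               # core-distance of the first 5 points
    print(opt.reachability_[opt.ordering_][:5])  # reachability-distances in cluster order
    print(set(opt.labels_))                      # extracted clusters (-1 = noise)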

Summary

Clustering is used in the sales and marketing industry to identify the right segments of customers; Netflix, for example, uses clustering algorithms to group viewers with similar interests. Clustering remains an active field of research, and many variants of the available algorithms are still being developed.

Find this post in my Kaggle notebook: https://www.kaggle.com/srivignesh/overview-of-clustering-algorithms


Connect with me on LinkedIn, Twitter!

Happy Machine Learning!

Thank you!
