Skip to article frontmatterSkip to article content

What is k-means Clustering?

Definition: k-means is an unsupervised machine learning algorithm used for clustering data points into k distinct, non-overlapping groups based on their similarities and differences.

Objective: The main goal of k-means is to minimize the within-cluster sum of squares (WCSS), which represents the total squared distance (euclidean distance) between data points and their nearest cluster centroid.

Steps & Process

Steps demo: 1

The data point

The data point

The steps:
1 Initialize with random centroids. 2 Get the nearest neighbors, and calculate the new mean centers. 3 Move the centers to the mean centers

The steps: 1 Initialize with random centroids. 2 Get the nearest neighbors, and calculate the new mean centers. 3 Move the centers to the mean centers

The steps:
1 Re-calculate distance and find nearest neighbors. 2 Re-Calculate new mean centers and move

The steps: 1 Re-calculate distance and find nearest neighbors. 2 Re-Calculate new mean centers and move

The steps:
1 The location of new and previous centers overlapped. Hence, stops.

The steps: 1 The location of new and previous centers overlapped. Hence, stops.

k-means Clustering: Example with 7 clusters

The final clusters.

The final clusters.

Parameters and Initialization

How to select k

Elbow Method: Plot the explained variation (or distortion) as a function of the number of clusters, and choose the elbow point as the number of clusters to use. The elbow point is the point of diminishing returns, where the additional gain in reducing the distortion starts to decrease significantly as more clusters are added. For the final iteration (i), the explained variation (EV^i) can be calculated by the distance between each point (k) and the cluster they were assigned (C_j^i).

Dji=(xkμji2)D_{j}^i = \sum (|x_k - \mu_j^i|^2)

where xkCji{x_{k} \in C_j^i}

Di=j=1kDjiD^i = ∑_{j=1}^k D_j^i
EVi=D0DiEV^i = D^0 - D^i
Wikipedia

Wikipedia

Silhouette Analysis: Calculate the average silhouette score (SiSi) for different values of k and choose the value that yields the highest score. The silhouette score measures the cohesiveness of datapoints within a cluster compared to their closeness to datapoints in other clusters, with values closer to 1 indicating well-defined clusters.

Si=(biai)/max(ai,bi)Si = (b_i - a_i) / \text{max}(a_i, b_i)

aia_i is the average distance between point i and all other points in its own cluster:

ai=(d(xi,xj))Cia_i = \sum \frac{(d(x_i, x_j))}{|C_i|}

where, xjCix_j ∈ C_i

bib_i is the lowest average distance between point i and all points in any other cluster:

bi=min(d(xi,xj))Cjb_i = \text{min} ∑ \frac{(d(x_i, x_j))}{|C_j|}

where, xjCjx_j ∈ C_j and jij\neq i

Note that the Silhouette Coefficient ranges between -1 and 1, with higher values indicating better clustering.

Wikipedia

Wikipedia

Other approaches

How set the initial position of the mean centers?

Properties of k-means clustering