What is k-means Clustering?
Disclaimer: These are my study notes, shared to reinforce my own learning. Proceed at your own risk!

k-means clustering is an unsupervised learning method used for, that’s right, clustering. k is a number chosen in advance: the number of groups the data will be split into. The algorithm finds a representative point for each cluster, known as the centroid, and assigns every data point to its nearest centroid, eventually forming clusters. Because centroids are computed as means and distances are Euclidean, k-means works only on numerical data.
In summary, the k-means algorithm works as follows:
- Place k centroids (c₁ … cₖ) at random locations
- Find the nearest centroid for each data point (using Euclidean distance), and assign the point to that centroid's cluster
- Recompute the centroid of each cluster as the mean of the data points assigned to it
- Repeat the previous two steps (assignment and centroid update) until no data point changes cluster
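The steps above can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters, and the toy `points` are made up for this example, and the random starting locations are chosen by sampling k of the data points, which is one common choice and an assumption on my part.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Place k centroids: here, k distinct data points picked at random (assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once no data point has changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The cluster numbers themselves (0 or 1) depend on which points were picked as starting centroids, so only the grouping is meaningful, not the label values.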
Next, there are two common ways to evaluate the clusters and to choose k: the Elbow Curve Method and Silhouette Analysis. In general, the goal is a small average distance between points within the same cluster and a large average distance between different clusters.
The Elbow Curve Method runs k-means for a range of k values and calculates the WCSS (Within-Cluster Sum of Squares) for each one, which is the sum of the squared distances between each data point and the centroid of its cluster. WCSS keeps decreasing as k grows, so when it is plotted against k, the point to look for is where the steep drop levels off into a much gentler decline. That bend is the Elbow point, and its k is taken as a reasonable number of clusters.
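To make the WCSS calculation concrete, here is a rough sketch in plain Python. The names `kmeans_wcss` and `elbow_curve`, the toy `points`, and the idea of keeping the lowest WCSS over a few random restarts are all assumptions for illustration; the restarts are only there because a single random start can land on a poor clustering.

```python
import math
import random


def kmeans_wcss(points, k, seed=0, max_iter=100):
    """Run a basic k-means once and return the WCSS of the result."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random data points as starting centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # WCSS: squared distance from each point to the centroid of its own cluster.
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def elbow_curve(points, k_values, restarts=10):
    """Map each k to the lowest WCSS found over several random restarts."""
    return {
        k: min(kmeans_wcss(points, k, seed=s) for s in range(restarts))
        for k in k_values
    }


# Hypothetical toy data laid out around three groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 9)]
curve = elbow_curve(points, range(1, 6))
for k, wcss in curve.items():
    print(k, round(wcss, 2))
```

Plotting these values against k gives the elbow curve. For data like this, laid out around three groups, one would expect the curve to flatten after k = 3.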
Silhouette Analysis, on the other hand, computes the Silhouette Score, which measures how well separated the clusters are and how tightly packed each one is. The score ranges from -1 to 1, and a value closer to 1 indicates dense, well-separated clusters.
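As a small sketch of how such a score can be computed, the plain-Python function below uses the standard silhouette definition: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the point's score is (b - a) / max(a, b). The function name, the toy data, and the two example labelings are hypothetical, and giving a single-point cluster a score of 0 is a convention I am assuming here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for a single-point cluster
            continue
        # a: mean distance to the other points in the same cluster (density).
        # The point's distance to itself is 0, so it does not affect the sum.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster (separation).
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight pairs that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # each pair is its own cluster
bad = silhouette_score(points, [0, 1, 0, 1])   # each cluster spans both pairs
print(good, bad)
```

The sensible grouping scores close to 1, while the labeling that mixes the two pairs scores below 0, which matches the idea that higher means denser and better separated.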
References:
Victor Lavrenko. (2014) K-means clustering: how it works. https://youtu.be/_aWzGGNrcic?si=KbStWpVxHqIzhSAj
Scikit Learn. https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html