k means clustering simple explanation
by keshav
K-means clustering is one of the easiest, simple, and most popular unsupervised machine learning algorithms. The goal of K-means is simple: group similar data points together and recognize the underlying patterns. K-means looks for a fixed number (k) of clusters in a dataset, to accomplish this goal. A cluster refers to a collection of data points aggregated together exhibiting certain similarities. The target number k is a hyperparameter (i.e. You have to set its value by yourself before you the learning process begins), which refers to the number of centroids you need in the dataset. A centroid is a fictional or real location representing the center of the cluster. The centroid can be defined in various ways such as by the mean or medoid of the objects (or points) assigned to the cluster (we will deal with K medoid in upcoming articles). Every data point is assigned to each of the clusters by reducing the in-cluster sum of squares.
In other words, the K-means algorithm identifies k number of centroids, and then groups every data point to the nearest cluster, while keeping the centroids as small as possible. The term ‘means’ of the cluster in the K-means refers to the average of the data in a cluster; that is, the centroid.
Suppose a data set, D, contains n data points in Euclidean space. Clustering (Partitioning) methods distribute the data points in D into k clusters, C1,…, Ck, that is, Ci belongs to D. An objective function is used to assess the partitioning quality so that data points within a cluster are similar to one another.
A K means clustering technique uses the mean(i.e. centroid value) of a cluster, Ci, to represent that cluster. The difference between a data point p belongs to Ci and ci, the mean of the cluster, is measured by dist(p, ci), where dist(x,y) gives the Euclidean distance between the two points x and y. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all data points in Ci and the centroid ci, defined as,
|
K means error |
Where E is the sum of the squared error for all data points in the data set; p is the point in space representing a given data point, and ci is the mean of cluster Ci (both p and ci are multidimensional).
In other words, for each data points in each cluster, the distance from the data point to its cluster center is squared, and the distances are summed. This objective function tries to make the resulting k clusters as compact and as separate as possible from other clusters.
How does the k-means algorithm work?
The k-means algorithm defines the centroid of a cluster as the mean value of the points among the cluster.
It works as follows,
First of all, it randomly selects k of the data point in D, each of which initially represents a cluster mean (i.e. centroid). Each of the remaining data points is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. For every cluster, it computes the new mean using the data points allocated to the cluster in the previous iteration. All the data points are then reallocated using the updated means as the new cluster means. The iterations go on until the allocation is stable, that is, the clusters formed in the current round are the same as those formed in the previous round (i.e. the cluster mean obtained in this round is the same as that of the obtained in the previous round). The k-means algorithm is summarized in the figure below,
Algorithm of k-means Clustering:
The k-means algorithm is used for partitioning (clustering), where each cluster’s center is represented by the mean value of the data points in the cluster (group).
We have
Input:
-k: the number of clusters,
-D: a data set containing n objects.
Output:
-A set of k clusters.
Steps:
(1) randomly select k data points from D as the initial cluster centers;
(2)Start loop;
(1) re/assign each data points to the cluster to which the object is the most similar (i.e. which is more near based upon the Euclidean distance measure), based on the mean value of the objects in the cluster;
(2) update the cluster means, that is, calculate the new mean value of the data points for each cluster;
(3) end loop when there is no change in cluster means;
No Comments