In other words, the **K-means** algorithm identifies *k* number of centroids, and then groups every data point to the nearest cluster, while keeping the centroids as small as possible. The term *‘means’* of the cluster in the **K-means** refers to the average of the data in a cluster; that is, the centroid.

Suppose a data set, *D*, contains n data points in Euclidean space. Clustering (Partitioning) methods distribute the data points in D into k clusters, *C1,…, Ck*, that is, *Ci belongs to D*. An objective function is used to assess the partitioning quality so that data points within a cluster are similar to one another.

A K means clustering technique uses the mean(i.e. centroid value) of a cluster, * Ci*, to represent that cluster. The difference between a data point

K means error |

Where

In other words, for each data points in each cluster, the distance from the data point to its cluster center is squared, and the distances are summed. This objective function tries to make the resulting k clusters as compact and as separate as possible from other clusters.

The k-means algorithm defines the centroid of a cluster as the mean value of the points among the cluster.

It works as follows,

First of all, it randomly selects k of the data point in D, each of which initially represents a cluster mean (i.e. centroid). Each of the remaining data points is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. For every cluster, it computes the new mean using the data points allocated to the cluster in the previous iteration. All the data points are then reallocated using the updated means as the new cluster means. The iterations go on until the allocation is stable, that is, the clusters formed in the current round are the same as those formed in the previous round (i.e. the cluster mean obtained in this round is the same as that of the obtained in the previous round). The k-means algorithm is summarized in the figure below,

The k-means algorithm is used for partitioning (clustering), where each cluster’s center is represented by the mean value of the data points in the cluster (group).**We have**

**Input**:

-k: the number of clusters,

-D: a data set containing n objects.

**Output: **

-A set of k clusters.

**Steps:**

(1) randomly select k data points from D as the initial cluster centers;

(2)**Start loop**;

(1) re/assign each data points to the cluster to which the object is the most similar (i.e. which is more near based upon the Euclidean distance measure), based on the mean value of the objects in the cluster;

(2) update the cluster means, that is, calculate the new mean value of the data points for each cluster;

(3)

Also read – Supervised machine learning

Post a Comment

No Comments