K-Means Clustering

What is Clustering?
• Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
• Clustering groups data into several clusters based on their similarity.

What is a natural grouping among these objects?
• Clustering is subjective: the same objects can be grouped as Simpson's family and school employees, or as females and males.

Two Types of Clustering
• Partitional algorithms: construct several partitions and group the objects according to some criterion.
• Hierarchical algorithms: construct a hierarchical decomposition of the objects according to some criterion. For example, split into old and young, then split each of those groups into smokers and non-smokers.

What is Similarity?
• "The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)
• Similarity is hard to define, but "we know it when we see it".

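Distance
• A measure of how similar two objects are, computed with a specific formula. The smaller the distance, the more similar the objects.
[Figure: distance matrix between pictured objects; for example, D = 8 for one pair and D = 1 for another]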

Partitional Clustering
• Nonhierarchical: each object is placed in exactly one cluster.
• Clusters do not overlap.
• The number of clusters to form is fixed in advance.

Algorithm k-means
1. Decide how many clusters k to create.
2. Initialize the centroid of each cluster (randomly, if necessary).
3. Determine the membership of every object by assigning it to the cluster with the nearest centroid (based on its distance to each centroid).
4. Once the clusters and their members are formed, compute the mean of each cluster and use it as the new centroid.
5. If the new centroids differ from the old ones, the memberships must be updated again (go back to step 3). If the new centroids are the same as the old ones, stop.
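The five steps can be sketched in a few lines of plain Python. This is only an illustration, not MATLAB's kmeans implementation: the function names `squared_euclidean` and `kmeans_sketch` are hypothetical, squared Euclidean distance is assumed, initial centroids are assumed to be k distinct data points chosen at random, and an empty cluster simply keeps its old centroid.

```python
import random

def squared_euclidean(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Hypothetical helper: returns (labels, centroids) for a list of points."""
    rng = random.Random(seed)
    # Steps 1-2: k is given; pick k distinct points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # assumption: keep old centroid
        # Step 5: stop when the centroids no longer change.
        if new_centroids == centroids:
            labels = new_labels
            break
        centroids, labels = new_centroids, new_labels
    return labels, centroids

# Two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans_sketch(pts, 2)
print(labels, centroids)
```

The cluster numbers themselves are arbitrary; only which points share a label is meaningful.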

K-means Clustering: Step 1-2
Decide how many clusters k to create, then initialize the centroid of each cluster (randomly, if necessary).
[Figure: scatter plot on 0-5 axes with initial centroids k1, k2, k3]

K-means Clustering: Step 3
Determine the membership of every object by assigning it to the nearest centroid.
[Figure: scatter plot on 0-5 axes with centroids k1, k2, k3]

K-means Clustering: Step 4
Once the clusters and their members are formed, compute the mean of each cluster and use it as the new centroid.
[Figure: scatter plot on 0-5 axes with centroids k1, k2, k3]

K-means Clustering: Step 5
If the new centroids differ from the old ones, update the memberships of the objects again.
[Figure: scatter plot on 0-5 axes with centroids k1, k2, k3]

K-means Clustering: Finish
Repeat steps 3-5 until no centroid changes and no object moves to a different cluster.
[Figure: final clusters with centroids k1, k2, k3]

Comments on the K-Means Method
• Strength
  • Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
• Comment
  • Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
• Weakness
  • Applicable only when a mean is defined, which leaves open how to handle categorical data.
  • The number of clusters, k, must be specified in advance.
  • Unable to handle noisy data and outliers well, because a few extreme values can pull a cluster mean far from most of its members.
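The outlier weakness follows from using the mean as the cluster center. The small Python illustration below uses made-up one-dimensional values (not data from these slides) to show how a single extreme value shifts the mean, while the median barely moves.

```python
from statistics import mean, median

# Made-up 1-D cluster members, for illustration only.
cluster = [1.0, 2.0, 3.0, 4.0]
with_outlier = cluster + [100.0]

print(mean(cluster), median(cluster))            # 2.5 2.5
print(mean(with_outlier), median(with_outlier))  # 22.0 3.0
```

With the outlier included, the mean (22.0) no longer lies near any of the original members.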

Distance measures • SqEuclidean • Cityblock • Cosine • Correlation • Hamming
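As a rough guide to what these five measures compute, here is a plain-Python sketch. The function names are hypothetical and the formulas follow the usual textbook definitions (cosine and correlation as one minus the similarity, Hamming as the fraction of differing coordinates); it is not the MATLAB implementation.

```python
import math

def sqeuclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cityblock(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def correlation(a, b):
    """One minus the sample correlation: cosine distance after centering."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

def hamming(a, b):
    """Fraction of coordinates that differ (intended for binary data)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# First two rows of the example matrix dataku used later in these slides.
row1, row2 = (7, 26, 6, 60), (1, 29, 15, 52)
print(sqeuclidean(row1, row2))  # 190
print(cityblock(row1, row2))    # 26
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))  # 0.5
```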

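MATLAB • [IDX,C] = kmeans(X,k) partitions the rows of the n-by-p data matrix X into k clusters. IDX is an n-by-1 vector holding the cluster index of each row, and C is the k-by-p matrix of cluster centroid locations.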

[...] = kmeans(...,'param1',val1,'param2',val2,...) lets you specify optional parameter name-value pairs that control the iterative algorithm used by kmeans. • The parameters are: • 'distance' • 'start' • 'replicates' • 'maxiter' • 'emptyaction' • 'display'

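'distance'
• The distance measure, in p-dimensional space, that kmeans minimizes. kmeans computes cluster centroids differently for each supported distance measure:
• 'sqEuclidean' : Squared Euclidean distance (default). Each centroid is the mean of the points in its cluster.
• 'cityblock' : Sum of absolute differences. Each centroid is the component-wise median of the points in its cluster.
• 'cosine' : One minus the cosine of the angle between points. Each centroid is the mean of the points in its cluster after normalizing them to unit Euclidean length.
• 'correlation' : One minus the sample correlation between points. Each centroid is the component-wise mean of the points in its cluster after centering and normalizing them to zero mean and unit standard deviation.
• 'Hamming' : Percentage of bits that differ (binary data only). Each centroid is the component-wise median of the points in its cluster.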

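'start'
• Method used to choose the initial cluster centroid positions, sometimes known as "seeds". Valid starting values are:
• 'sample' : Select k observations from X at random.
• 'uniform' : Select k points uniformly at random from the range of X. Not valid for Hamming distance.
• 'cluster' : Perform a preliminary clustering phase on a random 10% subsample of X.
• matrix : A k-by-p matrix of starting centroid locations.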

'replicates'
• Number of times to repeat the clustering, each time with a new set of initial cluster centroid positions.
• kmeans returns the solution with the lowest value for sumd, the within-cluster sums of point-to-centroid distances.
• You can supply 'replicates' implicitly by passing a 3-dimensional array as the value of the 'start' parameter; the size of its third dimension is the number of replicates.
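The selection rule behind 'replicates' can be illustrated without rerunning the clustering: given the labelings produced by several runs, keep the one whose total point-to-centroid distance is smallest. The Python sketch below assumes squared Euclidean distance; the helper names `total_sumd` and `best_replicate` and the two candidate labelings are hypothetical, made up for this illustration.

```python
def total_sumd(points, labels):
    """Total squared distance from each point to the mean of its cluster."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    return total

def best_replicate(points, candidate_labelings):
    """Return the labeling with the lowest total within-cluster distance."""
    return min(candidate_labelings, key=lambda labels: total_sumd(points, labels))

# Made-up points and two made-up outcomes of separate runs.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
run_a = [1, 1, 2, 2]  # groups nearby points together
run_b = [1, 2, 1, 2]  # mixes distant points
print(total_sumd(points, run_a))  # 1.0
print(total_sumd(points, run_b))  # 200.0
print(best_replicate(points, [run_a, run_b]))  # [1, 1, 2, 2]
```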

'maxiter' • Maximum number of iterations. Default is 100.

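'emptyaction'
• Action to take if a cluster loses all its member observations. Can be one of:
• 'error' : Treat an empty cluster as an error.
• 'drop' : Remove any clusters that become empty.
• 'singleton' : Create a new cluster consisting of the one point furthest from its centroid.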

'display'
• Controls display of output.
• 'off' : Display no output.
• 'iter' : Display information about each iteration during minimization, including the iteration number, the optimization phase, the number of points moved, and the total sum of distances.
• 'final' : Display a summary of each replication.
• 'notify' : Display only warning and error messages. (default)

Example
dataku = [ 7 26  6 60; ...
           1 29 15 52; ...
          11 56  8 20; ...
          11 31  8 47; ...
           7 52  6 33; ...
          11 55  9 22; ...
           3 71 17  6; ...
           1 31 22 44; ...
           2 54 18 22; ...
          21 47  4 26; ...
           1 40 23 34; ...
          11 66  9 12; ...
          10 68  8 12]

Using kmeans to build 3 clusters • hasilk = kmeans(dataku,3)

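Result • hasilk = 1 1 2 1 2 2 2 3 2 2 3 2 2 (a column vector with one cluster index per row of dataku)

Meaning of the result
• Rows 1, 2, and 4 are members of the first cluster (cluster number 1).
• Rows 3, 5, 6, 7, 9, 10, 12, and 13 are members of the second cluster (cluster number 2).
• Rows 8 and 11 are members of the third cluster (cluster number 3).
• Because the initial centroids are chosen at random, another run may number the clusters differently or produce a different partition.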
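Reading the membership off the index vector can be automated. The Python sketch below takes the hasilk values shown in the result and groups the 1-based row numbers by cluster; the helper name `rows_by_cluster` is hypothetical.

```python
from collections import defaultdict

# Cluster index of each row, copied from the result above.
hasilk = [1, 1, 2, 1, 2, 2, 2, 3, 2, 2, 3, 2, 2]

def rows_by_cluster(labels):
    """Map each cluster number to the list of 1-based row numbers it contains."""
    groups = defaultdict(list)
    for row, label in enumerate(labels, start=1):  # rows numbered from 1, as in MATLAB
        groups[label].append(row)
    return dict(groups)

members = rows_by_cluster(hasilk)
print(members)
# {1: [1, 2, 4], 2: [3, 5, 6, 7, 9, 10, 12, 13], 3: [8, 11]}
```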
