
K-Means Clustering

K-Means Clustering. What is Clustering? Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. Clustering groups data into several clusters based on their similarity.

macy

Presentation Transcript


  1. K-Means Clustering

  2. What is Clustering? Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing. Clustering groups data into several clusters based on their similarity.

  3. What is a natural grouping among these objects?

  4. What is a natural grouping among these objects? Clustering is subjective. [Figure panels: Simpson's Family, School Employees, Females, Males]

  5. Two Types of Clustering • Partitional algorithms: create several partitions and group the objects according to some criterion • Hierarchical algorithms: create a hierarchical decomposition of the object groupings according to some criterion, e.g. old vs. young, then smokers vs. non-smokers within each age group [Figure panels: Partitional, Hierarchical]

  6. What is Similarity? "The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary) Similarity is hard to define, but "we know it when we see it".

  7. Distance: a measure of how similar two objects are, computed with a specific formula; the smaller the distance, the more similar the objects. [Figure: matrix of pairwise distances between objects, with example values D( , ) = 8 and D( , ) = 1]

  8. Partitional Clustering • Nonhierarchical: each object is placed in exactly one cluster • Nonoverlapping clusters • The number of clusters to form is fixed in advance

  9. Algorithm k-means (1) Decide how many clusters k to create. (2) Initialize the centroid of each cluster (randomly, if necessary). (3) Determine the membership of every other object by assigning it to the nearest centroid (based on its distance to each centroid). (4) Once the clusters and their members are formed, compute the mean of each cluster and use it as the new centroid. (5) If the new centroids differ from the old ones, the memberships must be updated again (go back to step 3). If the new centroids are the same as the old ones, the algorithm is finished.
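
The five steps above can be sketched in plain Python. This is only an illustrative sketch, not the MATLAB implementation discussed later in the slides: the function names, the squared Euclidean distance, the seeded random choice of initial centroids, and the decision to keep the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import random


def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Steps 1-2: k is given; use k distinct data points as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 5: if no point changed cluster, the centroids would not
        # change either, so the algorithm is finished.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: the mean of each cluster becomes its new centroid.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

Labels here are 0-based and the numbering of the clusters depends on which points were drawn as initial centroids, so only the grouping itself is meaningful.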

  10. K-means Clustering: Step 1-2 Decide how many clusters k to create. Initialize the centroid of each cluster (randomly, if necessary). [Figure: scatter plot on 0-5 axes showing centroids k1, k2, k3]

  11. K-means Clustering: Step 3 Determine the membership of every other object by assigning it to the nearest centroid. [Figure: scatter plot on 0-5 axes showing centroids k1, k2, k3]

  12. K-means Clustering: Step 4 Once the clusters and their members are formed, compute the mean of each cluster and use it as the new centroid. [Figure: scatter plot on 0-5 axes showing centroids k1, k2, k3]

  13. K-means Clustering: Step 5 If the new centroids differ from the old ones, the memberships of the objects must be updated again. [Figure: scatter plot on 0-5 axes showing centroids k1, k2, k3]

  14. K-means Clustering: Finish Repeat steps 3-5 until the centroids no longer change and no object moves to a different cluster. [Figure: final clusters with centroids k1, k2, k3]

  15. Comments on the K-Means Method • Strength • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n. • Comment • Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms. • Weakness • Applicable only when the mean is defined, which is a problem for categorical data • Need to specify k, the number of clusters, in advance • Unable to handle noisy data and outliers well
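
The weakness with outliers comes from the centroid being a mean. The small sketch below uses made-up one-dimensional values to show how a single outlier drags the mean away from the bulk of a cluster; the median is shown only for contrast and is not part of the k-means method described in these slides.

```python
from statistics import mean, median

# Hypothetical cluster: four nearby values plus one outlier.
cluster = [1, 2, 3, 4, 100]

center_mean = mean(cluster)      # pulled far toward the outlier
center_median = median(cluster)  # stays with the bulk of the data

print(center_mean, center_median)
```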

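  16. Distance measures • SqEuclidean • Cityblock • Cosine • Correlation • Hamming

  17. MATLAB • [IDX,C] = kmeans(X,k) partitions the points in the n-by-p data matrix X into k clusters, returning the cluster index of each point in the n-by-1 vector IDX and the k cluster centroid locations in the k-by-p matrix C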
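
As a rough guide to what the five measures compute, here is a plain Python sketch. The function names are hypothetical and the definitions follow the usual textbook forms (cosine and correlation as one minus the similarity, Hamming as the fraction of differing coordinates); they are assumptions for illustration rather than a copy of any library's code.

```python
import math


def sq_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cityblock(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def correlation(a, b):
    """One minus the sample correlation: cosine distance after centering."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


def hamming(a, b):
    """Fraction of coordinates that differ (intended for binary data)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)
```

For example, `cityblock((0, 0), (3, 4))` gives 7, while `sq_euclidean` on the same pair gives 25.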

  18. [...] = kmeans(...,'param1',val1,'param2',val2,...) enables you to specify optional parameter name-value pairs to control the iterative algorithm used by kmeans. • The parameters are: • 'distance' • 'start' • 'replicates' • 'maxiter' • 'emptyaction' • 'display'

  19. 'distance' • Distance measure, in p-dimensional space, that kmeans minimizes with respect to. kmeans computes centroid clusters differently for the different supported distance measures:

  20. 'start' • Method used to choose the initial cluster centroid positions, sometimes known as "seeds". Valid starting values are:

  21. 'replicates' • Number of times to repeat the clustering, each with a new set of initial cluster centroid positions. • kmeans returns the solution with the lowest value for sumd. • You can supply 'replicates' implicitly by supplying a 3-dimensional array as the value for the 'start' parameter.
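
The idea behind 'replicates' can be sketched in plain Python: run the clustering several times from different random starts and keep the run with the lowest total of point-to-centroid distances. This is an illustration of the idea only, not MATLAB's code; `kmeans_once`, `kmeans_replicates`, the squared Euclidean distance, and the seeding scheme are assumptions made for the example.

```python
import random


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run; returns (total_distance, labels, centroids)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        dists = [
            [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            for p in points
        ]
        new_labels = [d.index(min(d)) for d in dists]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Total within-cluster sum of squared distances for this run.
    total = sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )
    return total, labels, centroids


def kmeans_replicates(points, k, replicates=5, seed=0):
    """Repeat k-means from new random starts and keep the best run."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(replicates)]
    return min(runs, key=lambda run: run[0])


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
total, labels, centroids = kmeans_replicates(points, k=2, replicates=3)
```

Because each run can stop at a different local optimum, taking the minimum over several runs is a cheap way to reduce the chance of reporting a poor one.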

  22. 'maxiter' • Maximum number of iterations. Default is 100.

  23. 'emptyaction' • Action to take if a cluster loses all its member observations. Can be one of:

  24. 'display' • Controls display of output. • 'off': Display no output. • 'iter': Display information about each iteration during minimization, including the iteration number, the optimization phase, the number of points moved, and the total sum of distances. • 'final': Display a summary of each replication. • 'notify': Display only warning and error messages. (default)

  25. Example dataku = [7 26 6 60; 1 29 15 52; 11 56 8 20; 11 31 8 47; 7 52 6 33; 11 55 9 22; 3 71 17 6; 1 31 22 44; 2 54 18 22; 21 47 4 26; 1 40 23 34; 11 66 9 12; 10 68 8 12];

  26. Using kmeans to build 3 clusters • hasilk = kmeans(dataku,3)

  27. Result hasilk = 1 1 2 1 2 2 2 3 2 2 3 2 2

  28. Meaning of the result • Rows 1, 2, and 4 of the data are members of the first cluster (cluster number 1). • Rows 3, 5, 6, 7, 9, 10, 12, and 13 are members of the second cluster (cluster number 2). • Rows 8 and 11 are members of the third cluster (cluster number 3).
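
The same reading of the result can be done mechanically. The short Python sketch below takes the label vector from slide 27 and collects the 1-based row numbers belonging to each cluster; the variable name `clusters` is just a choice for this example.

```python
from collections import defaultdict

# Cluster index of each data row, copied from the result on slide 27.
hasilk = [1, 1, 2, 1, 2, 2, 2, 3, 2, 2, 3, 2, 2]

# Map each cluster number to the 1-based row numbers assigned to it.
clusters = defaultdict(list)
for row, label in enumerate(hasilk, start=1):
    clusters[label].append(row)

for label in sorted(clusters):
    print(f"cluster {label}: rows {clusters[label]}")
```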
