
ROCK: A Robust Clustering Algorithm for Categorical Attributes

ROCK: A Robust Clustering Algorithm for Categorical Attributes. Sudipto Guha Stanford University Stanford, CA 94305. Rajeev Rastogi Bell Laboratories Murray Hill, NJ 07974. Kyuseok Shim Bell Laboratories Murray Hill, NJ 07974.



Presentation Transcript


  1. ROCK: A Robust Clustering Algorithm for Categorical Attributes. Sudipto Guha, Stanford University, Stanford, CA 94305; Rajeev Rastogi, Bell Laboratories, Murray Hill, NJ 07974; Kyuseok Shim, Bell Laboratories, Murray Hill, NJ 07974. Proceedings of the 15th International Conference on Data Engineering (ICDE), 1999. Presented by 王真儀, 張家騏, 翁岳廷, 李文淦, 2013/12/12

  2. Clustering? Group similar transactions together: {Wine, cheese} with {Wine}, and {toy, milk, baby food} with {toy, milk}.

  3. Partitional clustering • k-means

  4. Hierarchical clustering • Agglomerative / Divisive • single-linkage • complete-linkage • average-linkage

  5. Categorical Problem: Distance. Transactions as bit vectors: (a) {1,2,3,5} → (1,1,1,0,1,0), (b) {2,3,4,5} → (0,1,1,1,1,0), (c) {1,4} → (1,0,0,1,0,0), (d) {6} → (0,0,0,0,0,1). Euclidean distances: d(a,b) = √2, d(a,c) = 2, d(a,d) = √5, d(b,c) = 2, d(b,d) = √5, d(c,d) = √3. Merging a and b gives centroid (0.5, 1, 1, 0.5, 1, 0), with d(ab,c) = √3.5 and d(ab,d) = √4.5. But c and d, which share no items at all, are closer to each other (√3 ≈ 1.73) than c is to a or b (distance 2) — so centroid-based distance is misleading on categorical data.

  6. Categorical Problem: Jaccard + single-linkage. With sets {1, 2, 3}, {2, 3, 5}, {3, 4, 5}, ..., adjacent sets have Jaccard similarity 0.5 while the endpoints have only 0.2 — the similarities range from 0.2 to 0.5. A chain of pairwise-similar points can therefore connect two well-separated clusters, and single-linkage merges them.
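The similarity values on this slide follow directly from the Jaccard coefficient |A ∩ B| / |A ∪ B|; a minimal check in Python:

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    return len(a & b) / len(a | b)

s1, s2, s3 = {1, 2, 3}, {2, 3, 5}, {3, 4, 5}
print(jaccard(s1, s2))  # 2/4 = 0.5
print(jaccard(s2, s3))  # 2/4 = 0.5
print(jaccard(s1, s3))  # 1/5 = 0.2  (the weak end of the 0.2–0.5 range)
```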

  7. Local Relation vs. Global Relation [Figure: two clusters, A and B]

  8. Definitions • Neighbors: two points are neighbors if their similarity is at least a threshold θ. • Link(pi, pj) is the number of common neighbors between pi and pj: if a has neighbor b, and c also has neighbor b, then a and c share a link through b.
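These two definitions can be sketched directly in Python, using Jaccard similarity as the similarity measure (for brevity this sketch omits the paper's convention that a point is a neighbor of itself):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

def neighbors(points, theta):
    """points: list of sets. Points i and j are neighbors if sim >= theta."""
    n = len(points)
    nbr = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if jaccard(points[i], points[j]) >= theta:
            nbr[i].add(j)
            nbr[j].add(i)
    return nbr

def link(nbr, i, j):
    """link(p_i, p_j) = number of common neighbors of p_i and p_j."""
    return len(nbr[i] & nbr[j])

points = [{1, 2, 3}, {2, 3, 5}, {3, 4, 5}]
nbr = neighbors(points, theta=0.5)
# {1,2,3} and {3,4,5} are not neighbors (sim 0.2),
# but both neighbor {2,3,5}, so they share one link:
print(link(nbr, 0, 2))  # 1
```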

  9. Criterion Function of ROCK • Maximize the number of links within each cluster. • Minimize the number of links between clusters. • Each cluster's link count is normalized by its expected number of links.

  10. Criterion Function of ROCK cont. Expected number of links: suppose each point in cluster i has x neighbors; each of those neighbors in turn has x neighbors, so the cluster's links total roughly ni · x². If cluster i has ni points and each point has ni^f(θ) neighbors, then x = ni^f(θ), and the expected number of links is ni · (ni^f(θ))² = ni^(1+2f(θ)).

  11. The Goodness Measure between Clusters. For a candidate merge of Ci (ni points, expected links A = ni^(1+2f(θ))) and Cj (nj points, expected links B = nj^(1+2f(θ))): the merged cluster Ci ∪ Cj has expected links C = (ni+nj)^(1+2f(θ)), so the expected number of cross links between Ci and Cj is C − A − B = (ni+nj)^(1+2f(θ)) − ni^(1+2f(θ)) − nj^(1+2f(θ)). The "goodness" of merging Ci and Cj is g(Ci, Cj) = link[Ci, Cj] divided by this expected cross-link count; ROCK merges the pair with the highest goodness.
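A direct sketch of the goodness measure, using f(θ) = (1 − θ) / (1 + θ) as proposed in the ROCK paper:

```python
def goodness(links_ij, ni, nj, theta):
    """g(Ci, Cj) = link[Ci, Cj] / ((ni+nj)^(1+2f) - ni^(1+2f) - nj^(1+2f)),
    with f(theta) = (1 - theta) / (1 + theta) as in the ROCK paper."""
    f = (1 - theta) / (1 + theta)
    e = 1 + 2 * f
    expected_cross_links = (ni + nj) ** e - ni ** e - nj ** e
    return links_ij / expected_cross_links

# More observed cross links between equal-sized clusters -> higher goodness.
print(goodness(4, 3, 3, 0.5) > goodness(2, 3, 3, 0.5))  # True
```

Dividing by the expected cross-link count is what keeps large clusters from winning every merge: raw link counts alone would always favor the biggest pair.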

  12. ROCK Clustering Algorithm • RObust Clustering using linKs (ROCK) • An agglomerative hierarchical clustering algorithm

  13. Clustering Algorithm. For each cluster u, keep a local heap of the clusters v with link[u, v] > 0, ordered by decreasing goodness; a global heap orders all clusters by the goodness of their best candidate merge. Repeatedly extract the best pair u, v, merge them into w, and update the affected heaps, shrinking the problem from n×n toward (n−1)×(n−1) at each step.
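A simplified sketch of the agglomerative loop, without the paper's local/global heaps (so it runs in roughly O(n³) rather than the paper's bound) and with Jaccard similarity as an assumed similarity measure; it merges the highest-goodness pair until k clusters remain:

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def rock_cluster(points, k, theta):
    """Greedy agglomerative sketch of ROCK: no heaps, no sampling.
    points: list of sets; returns k clusters as lists of point indices."""
    n = len(points)
    # Neighbor sets under the similarity threshold theta.
    nbr = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if jaccard(points[i], points[j]) >= theta:
            nbr[i].add(j)
            nbr[j].add(i)

    f = (1 - theta) / (1 + theta)
    e = 1 + 2 * f
    clusters = [[i] for i in range(n)]

    def cross_links(a, b):
        # link[Ca, Cb] = sum of point-pair links across the two clusters.
        return sum(len(nbr[p] & nbr[q]) for p in a for q in b)

    def goodness(a, b):
        denom = (len(a) + len(b)) ** e - len(a) ** e - len(b) ** e
        return cross_links(a, b) / denom

    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: goodness(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, two groups of overlapping baskets with no items in common across groups are separated cleanly: `rock_cluster([{1,2,3}, {1,2,4}, {1,3,4}, {5,6,7}, {5,6,8}, {5,7,8}], k=2, theta=0.4)` recovers the index groups {0, 1, 2} and {3, 4, 5}.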

  14. Compute links. Build each point's neighbor list (nbrlist); then, for every point, each pair of points in its neighbor list shares it as a common neighbor, so increment that pair's link count. [Figure: nbrlist and link tables for points 1–6]
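The neighbor-list trick on this slide can be sketched as follows; the small nbrlist at the end is a made-up example, not the one from the figure:

```python
from collections import defaultdict
from itertools import combinations

def compute_links(nbrlist):
    """Each pair in a point's neighbor list shares that point as a common
    neighbor, so bump the pair's link count. Total work O(sum_i m_i^2),
    where m_i is the size of point i's neighbor list."""
    link = defaultdict(int)
    for nbrs in nbrlist:
        for i, j in combinations(sorted(nbrs), 2):
            link[(i, j)] += 1
    return link

# nbrlist[p] = neighbors of point p (hypothetical 3-point example)
nbrlist = [{1, 2}, {0, 2}, {0, 1}]
links = compute_links(nbrlist)
print(links[(0, 1)], links[(0, 2)], links[(1, 2)])  # each pair links once
```

Iterating over neighbor lists instead of all point pairs is what gives the O(n · mm · ma) link-computation bound on the next slide.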

  15. Time Complexity. mi: number of neighbors of point i; ma: average number of neighbors; mm: maximum number of neighbors. Computing links costs O(Σ mi²) = O(n·mm·ma); including heap maintenance, the worst-case time is O(n² + n·mm·ma + n² log n) and the space is O(min{n², n·mm·ma}).

  16. Time Complexity cont.

  17. Labeling data on disk. ROCK clusters a random sample; each remaining point p on disk is then labeled. Let Li denote the set of sampled points from cluster i used for labeling, and Ni the number of neighbors p has in Li; p is assigned to the cluster maximizing Ni / (ni + 1)^f(θ).
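The labeling rule can be sketched as below; the cluster names, Jaccard similarity measure, and tiny labeling sets are illustrative assumptions:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def label(point, clusters, theta):
    """Assign a disk-resident point to the cluster with the most neighbors
    among its labeling points L_i, normalized as on the slide:
    argmax_i N_i / (n_i + 1)^f(theta), with f(theta) = (1-theta)/(1+theta)."""
    f = (1 - theta) / (1 + theta)
    best, best_score = None, -1.0
    for cid, Li in clusters.items():
        Ni = sum(1 for q in Li if jaccard(point, q) >= theta)
        score = Ni / (len(Li) + 1) ** f
        if score > best_score:
            best, best_score = cid, score
    return best

# Hypothetical labeling sets for two clusters "A" and "B":
clusters = {"A": [{1, 2, 3}, {1, 2, 4}], "B": [{7, 8, 9}]}
print(label({1, 2, 5}, clusters, theta=0.4))  # A
```

The (ni + 1)^f(θ) denominator plays the same role as in the goodness measure: it stops large labeling sets from attracting every point just by having more members.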

  18. Experiment • Real-life data sets with categorical attributes. • Outlier handling: eliminate clusters containing only one point once the number of clusters drops below 1/3 of the original number of points.

  19. Result [Figure: result charts; values 25% and 12%]

  20. Experiment cont.

  21. Discussion • Is ROCK only good for categorical problems? How about numerical ones? • What happens if we merge all clusters into two in the mushroom experiment?

  22. Any Questions?
