ROCK: A Robust Clustering Algorithm for Categorical Attributes. Sudipto Guha Stanford University Stanford, CA 94305. Rajeev Rastogi Bell Laboratories Murray Hill, NJ 07974. Kyuseok Shim Bell Laboratories Murray Hill, NJ 07974.
ROCK: A Robust Clustering Algorithm for Categorical Attributes. Sudipto Guha (Stanford University, Stanford, CA 94305), Rajeev Rastogi (Bell Laboratories, Murray Hill, NJ 07974), Kyuseok Shim (Bell Laboratories, Murray Hill, NJ 07974). Proceedings of the 15th International Conference on Data Engineering, 1999. Presented by 王真儀, 張家騏, 翁岳廷, 李文淦, 2013/12/12.
Clustering? {Wine, cheese} {toy, milk, baby food} {Wine} {toy, milk}
Partitional clustering • k-means
Hierarchical clustering • Agglomerative / Divisive • single-linkage • complete-linkage • average-linkage
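Categorical Problem: Distance
Transactions over items 1 to 6, encoded as 0/1 vectors:
(a) {1,2,3,5} = (1,1,1,0,1,0)
(b) {2,3,4,5} = (0,1,1,1,1,0)
(c) {1,4} = (1,0,0,1,0,0)
(d) {6} = (0,0,0,0,0,1)
Euclidean distances: $d(a,b) = \sqrt{2}$, $d(a,c) = 2$, $d(a,d) = \sqrt{5}$, $d(b,c) = 2$, $d(b,d) = \sqrt{5}$, $d(c,d) = \sqrt{3}$
a and b are closest, so they merge first; the centroid of {a, b} is (0.5, 1, 1, 0.5, 1, 0).
$d(ab,c) = \sqrt{3.5}$, $d(ab,d) = \sqrt{4.5}$, $d(c,d) = \sqrt{3}$, so c and d merge next even though they share no item.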
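The distances on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming plain Euclidean distance on the 0/1 vectors and a centroid taken as the coordinate-wise mean; the helper names `euclidean` and `centroid` are illustrative and not from the slides.

```python
import math

# Transactions from the slide, encoded as 0/1 vectors over items 1..6.
points = {
    "a": (1, 1, 1, 0, 1, 0),  # {1,2,3,5}
    "b": (0, 1, 1, 1, 1, 0),  # {2,3,4,5}
    "c": (1, 0, 0, 1, 0, 0),  # {1,4}
    "d": (0, 0, 0, 0, 0, 1),  # {6}
}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def centroid(*vectors):
    """Coordinate-wise mean of the given vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

ab = centroid(points["a"], points["b"])        # centroid of the merged cluster {a, b}
d_ab = euclidean(points["a"], points["b"])     # distance that triggers the first merge
d_cd = euclidean(points["c"], points["d"])     # distance between c and d
d_ab_c = euclidean(ab, points["c"])            # centroid of {a, b} to c
d_ab_d = euclidean(ab, points["d"])            # centroid of {a, b} to d
```

Because `d_cd` is smaller than both `d_ab_c` and `d_ab_d`, a centroid-based method merges c and d, which is the problem the slide points at.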
Categorical Problem: Jaccard, single-linkage
Transactions: {1, 2, 3}, {2, 3, 5}, {3, 4, 5}
Jaccard coefficients: J({1,2,3}, {2,3,5}) = 0.5, J({2,3,5}, {3,4,5}) = 0.5, J({1,2,3}, {3,4,5}) = 0.2
Range: 0.2 to 0.5
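The Jaccard values on this slide follow from the usual definition, intersection size over union size. A minimal sketch; the function name `jaccard` and the convention of returning 1.0 for two empty sets are assumptions made for illustration.

```python
def jaccard(s, t):
    """Jaccard coefficient |s & t| / |s | t| of two item sets."""
    s, t = set(s), set(t)
    union = s | t
    # Assumption: two empty transactions are treated as identical.
    return len(s & t) / len(union) if union else 1.0

t1, t2, t3 = {1, 2, 3}, {2, 3, 5}, {3, 4, 5}
j12 = jaccard(t1, t2)  # two shared items out of four distinct items
j23 = jaccard(t2, t3)  # two shared items out of four distinct items
j13 = jaccard(t1, t3)  # one shared item out of five distinct items
```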
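Define
• Neighbors: two points are neighbors if their similarity is at least a threshold θ.
• Link(pi, pj) is the number of common neighbors of pi and pj.
• Example: if a has neighbor b and c also has neighbor b, then b is a common neighbor, so there is a link between a and c.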
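The two definitions above can be sketched directly. This example assumes Jaccard similarity, a threshold θ = 0.5, and that a point is not counted as its own neighbor; the labels a, b, c are attached to three sample transactions purely for illustration.

```python
from itertools import combinations

def jaccard(s, t):
    """Jaccard coefficient of two non-empty item sets."""
    return len(s & t) / len(s | t)

def neighbors(points, theta):
    """Map each point name to the set of names whose similarity is >= theta."""
    nbrs = {name: set() for name in points}
    for p, q in combinations(points, 2):
        if jaccard(points[p], points[q]) >= theta:
            nbrs[p].add(q)
            nbrs[q].add(p)
    return nbrs

def link(nbrs, p, q):
    """Number of common neighbors of p and q."""
    return len(nbrs[p] & nbrs[q])

# Hypothetical sample: a and c are not neighbors, but both are neighbors of b.
points = {"a": {1, 2, 3}, "b": {2, 3, 5}, "c": {3, 4, 5}}
nbrs = neighbors(points, theta=0.5)
links_ac = link(nbrs, "a", "c")  # b is the single common neighbor
links_ab = link(nbrs, "a", "b")  # a and b are neighbors but share no third neighbor
```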
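Criterion Function of ROCK
• Maximize the number of links within each cluster.
• Minimize the number of links between clusters.
• The links inside each cluster are divided by the expected number of links for a cluster of that size:
$E_l = \sum_{i=1}^{k} n_i \sum_{p_q, p_r \in C_i} \frac{link(p_q, p_r)}{n_i^{1+2f(\theta)}}$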
Criterion Function of ROCK (cont.): number of expected links
• Cluster $C_i$ has $n_i$ points.
• Each point has $X$ neighbors, and each of those neighbors also has $X$ neighbors, so each point accounts for $X^2$ links and the cluster for $n_i X^2$ links.
• If cluster $C_i$ has $n_i$ points, each point should have $n_i^{f(\theta)}$ neighbors, so $X = n_i^{f(\theta)}$.
• Number of expected links $= n_i \cdot \left(n_i^{f(\theta)}\right)^2 = n_i^{1+2f(\theta)}$
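The Goodness between Clusters
• $link[C_i, C_j]$: number of links between points of $C_i$ and points of $C_j$
• Number of expected links in $C_i$: $A = n_i^{1+2f(\theta)}$
• Number of expected links in $C_j$: $B = n_j^{1+2f(\theta)}$
• Number of expected links in the merged cluster $C_i + C_j$: $C = (n_i + n_j)^{1+2f(\theta)}$
• Number of expected links between $C_i$ and $C_j$: $C - A - B = (n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}$
• The "goodness" of merging $C_i$ and $C_j$: $g(C_i, C_j) = \frac{link[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{1+2f(\theta)} - n_j^{1+2f(\theta)}}$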
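The goodness measure can be written as a small function. The slides do not say what $f(\theta)$ is; this sketch assumes $f(\theta) = (1 - \theta)/(1 + \theta)$, the choice the ROCK paper proposes for market-basket data, and the function names are illustrative.

```python
def f(theta):
    """Assumed form of f(theta); the slides leave it unspecified."""
    return (1.0 - theta) / (1.0 + theta)

def expected_links(n, theta):
    """Expected number of links inside a cluster of n points: n^(1 + 2 f(theta))."""
    return n ** (1.0 + 2.0 * f(theta))

def goodness(cross_links, n_i, n_j, theta):
    """Observed cross links divided by the expected cross links (C - A - B)."""
    expected_cross = (
        expected_links(n_i + n_j, theta)
        - expected_links(n_i, theta)
        - expected_links(n_j, theta)
    )
    # Note: with theta = 1 the exponent is 1 and expected_cross is 0,
    # so this sketch assumes theta < 1.
    return cross_links / expected_cross

g_small = goodness(4, 2, 2, theta=0.5)  # 4 cross links between two 2-point clusters
g_large = goodness(4, 5, 5, theta=0.5)  # same 4 cross links between two 5-point clusters
```

The same number of cross links scores lower for the larger pair, which is the point of the normalization: large clusters are expected to have many cross links by chance.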
ROCK Clustering Algorithm • RObust Clustering using linKs (ROCK) • An agglomerative hierarchical clustering algorithm
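Clustering Algorithm
[Figure: candidate merges are kept in decreasing order of goodness; the best pair of clusters u and v is merged into a new cluster w, and the link table shrinks from n × n to (n-1) × (n-1).]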
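The merge loop can be sketched as follows. This is a simplified illustration, not the presented implementation: it rescans all cluster pairs on every step instead of keeping them in sorted heaps, and it assumes Jaccard similarity, non-empty transactions, θ < 1, and $f(\theta) = (1 - \theta)/(1 + \theta)$. The name `rock_sketch` is hypothetical.

```python
from itertools import combinations

def jaccard(s, t):
    """Jaccard coefficient of two non-empty item sets."""
    return len(s & t) / len(s | t)

def rock_sketch(transactions, theta, k):
    """Greedy link-based agglomerative clustering; returns lists of point indices."""
    n = len(transactions)
    # Neighbor sets by index (a point is not counted as its own neighbor here).
    nbrs = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if jaccard(transactions[i], transactions[j]) >= theta:
            nbrs[i].add(j)
            nbrs[j].add(i)

    def link(i, j):
        return len(nbrs[i] & nbrs[j])

    exponent = 1.0 + 2.0 * (1.0 - theta) / (1.0 + theta)  # 1 + 2 f(theta), assumed f
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None  # (goodness, index of first cluster, index of second cluster)
        for a, b in combinations(range(len(clusters)), 2):
            cross = sum(link(p, q) for p in clusters[a] for q in clusters[b])
            if cross == 0:
                continue  # clusters without cross links are never merged
            na, nb = len(clusters[a]), len(clusters[b])
            g = cross / ((na + nb) ** exponent - na ** exponent - nb ** exponent)
            if best is None or g > best[0]:
                best = (g, a, b)
        if best is None:
            break  # no linked pair is left, so stop before reaching k clusters
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

# Hypothetical data: two groups of transactions with no items in common.
data = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4},
        {7, 8, 9}, {7, 8, 10}, {7, 9, 10}, {8, 9, 10}]
result = rock_sketch(data, theta=0.5, k=2)
```

Asking for k = 1 on this data still yields two clusters, because the loop stops once no pair of clusters has any link between them.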
Compute links
[Figure: example graph on points 1 to 6, showing each point's neighbor list (nbrlist) and the resulting link counts (link).]
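Link counts can be derived from neighbor lists alone: every point adds one link to each pair of its own neighbors. The sketch below illustrates that idea; the neighbor lists are hypothetical, since the graph from the original figure is not recoverable, and the name `compute_links` is illustrative.

```python
from collections import defaultdict
from itertools import combinations

def compute_links(nbrlist):
    """Return {(j, l): link count} with j < l, built from per-point neighbor lists."""
    link = defaultdict(int)
    for i, nbrs in nbrlist.items():
        # Point i is a common neighbor of every pair (j, l) in its neighbor list.
        for j, l in combinations(sorted(nbrs), 2):
            link[(j, l)] += 1
    return dict(link)

# Hypothetical neighbor lists for six points (not the graph from the slide).
nbrlist = {
    1: [2, 3],
    2: [1, 3],
    3: [1, 2, 4],
    4: [3, 5, 6],
    5: [4, 6],
    6: [4, 5],
}
links = compute_links(nbrlist)
```

Point i with $m_i$ neighbors touches $m_i(m_i - 1)/2$ pairs, so the work is driven by the sizes of the neighbor lists rather than by all n × n pairs.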
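Time Complexity
• $m_i$: number of neighbors of point $i$
• $m_a$: average number of neighbors
• $m_m$: maximum number of neighbors
• Computing links: $O\left(\sum_i m_i^2\right)$, which is at most $O(n \, m_m \, m_a)$
• Overall time complexity: $O(n^2 + n \, m_m \, m_a + n^2 \log n)$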
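Labeling data in the database
• $p$: a point in the database that still needs a cluster label
• $L_i$: the set of points taken from cluster $i$ and used for labeling
• $N_i$: the number of neighbors of $p$ in $L_i$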
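The labeling step can be sketched as below. The slide text does not give the assignment rule, so this example assumes the one described in the ROCK paper: assign p to the cluster i that maximizes $N_i / (|L_i| + 1)^{f(\theta)}$, with $f(\theta) = (1 - \theta)/(1 + \theta)$ and Jaccard similarity. The function name `label_point` and the sample sets are hypothetical.

```python
def jaccard(s, t):
    """Jaccard coefficient of two non-empty item sets."""
    return len(s & t) / len(s | t)

def f(theta):
    """Assumed form of f(theta); the slides leave it unspecified."""
    return (1.0 - theta) / (1.0 + theta)

def label_point(p, labeled_sets, theta):
    """Return the id of the best-scoring cluster, or None if p has no neighbors."""
    best_cluster, best_score = None, 0.0
    for cluster_id, L_i in labeled_sets.items():
        # N_i: how many of the sampled points in L_i are neighbors of p.
        N_i = sum(1 for q in L_i if jaccard(p, q) >= theta)
        # Normalize by the expected neighbor count for a set of this size.
        score = N_i / (len(L_i) + 1) ** f(theta)
        if score > best_score:
            best_cluster, best_score = cluster_id, score
    return best_cluster

# Hypothetical labeling sets L_i drawn from two clusters.
labeled_sets = {
    "A": [{1, 2, 3}, {1, 2, 4}],
    "B": [{7, 8, 9}, {7, 8, 10}],
}
label_a = label_point({1, 2, 5}, labeled_sets, theta=0.5)
label_none = label_point({20, 21}, labeled_sets, theta=0.5)
```

Returning `None` for a point without neighbors in any $L_i$ is a design choice of this sketch; such points could be reported as outliers.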
Experiment
• Real-life data sets
• Categorical attributes
• Outlier handling: eliminate clusters that contain only one point once the number of clusters drops below 1/3 of the original number of points.
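The outlier rule above amounts to a single check during merging. A minimal sketch, assuming clusters are lists of point indices; the function name `prune_singletons` is hypothetical.

```python
def prune_singletons(clusters, n_points):
    """Drop one-point clusters once fewer than n_points / 3 clusters remain."""
    if len(clusters) < n_points / 3:
        return [c for c in clusters if len(c) > 1]
    return clusters

# 3 clusters remain out of 12 original points (3 < 4), so the singleton is dropped.
pruned = prune_singletons([[0, 1], [2], [3, 4, 5]], n_points=12)
# 3 clusters out of 6 original points (3 >= 2), so nothing is dropped yet.
kept = prune_singletons([[0], [1], [2]], n_points=6)
```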
Result 25% 12%
Discussion
• Is ROCK only good for categorical data? How about numerical data?
• What happens if all clusters are merged into two in the mushroom experiment?
Any questions?