Incremental Clustering And Dynamic Information Retrieval By: MOSES CHARIKAR, CHANDRA CHEKURI, TOMAS FEDER, AND RAJEEV MOTWANI Presented By: Sarah Hegab
Outline: • Motivation • Main Problem • Hierarchical Agglomerative Clustering • A Model for Incremental Clustering • Different incremental algorithms • Lower bounds for incremental algorithms • Dual Problem
I. Main Problem The clustering problem is as follows: given n points in a metric space M, partition the points into k clusters so as to minimize the maximum cluster diameter.
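As a concrete reading of this objective, here is a minimal sketch (names are illustrative; Euclidean distance stands in for a general metric) that evaluates the cost of a given k-partition:

```python
from itertools import combinations
from math import dist  # Euclidean metric as a stand-in for a general metric space M

def max_cluster_diameter(clusters):
    """clusters: list of clusters, each a list of points (tuples).
    The cost is the largest pairwise distance within any single cluster."""
    best = 0.0
    for cluster in clusters:
        for p, q in combinations(cluster, 2):
            best = max(best, dist(p, q))
    return best

clusters = [[(0, 0), (1, 0)], [(5, 0), (5, 2), (6, 0)]]
print(max_cluster_diameter(clusters))  # diameter of the second cluster (≈ 2.236)
```

An optimal k-clustering is one minimizing this value over all partitions into k clusters.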
1.Greedy Incremental Clustering • Center-Greedy • Diameter-greedy
a) Center-Greedy The center-greedy algorithm associates a center with each cluster and merges the two clusters whose centers are closest. The center of the old cluster with the larger radius becomes the new center. Theorem: The center-greedy algorithm’s performance ratio has a lower bound of 2k − 1.
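A minimal incremental sketch of this merge rule, assuming Euclidean points and a naive search over center pairs (the names and data layout are mine, not the paper's):

```python
from itertools import combinations
from math import dist  # Euclidean stand-in for a general metric

def radius(center, members):
    return max(dist(center, q) for q in members)

def center_greedy(points, k):
    clusters = []  # list of (center, members) pairs
    for p in points:
        clusters.append((p, [p]))
        if len(clusters) > k:
            # merge the two clusters whose centers are closest
            i, j = min(combinations(range(len(clusters)), 2),
                       key=lambda ij: dist(clusters[ij[0]][0], clusters[ij[1]][0]))
            (ci, mi), (cj, mj) = clusters[i], clusters[j]
            # the center of the cluster with larger radius becomes the new center
            c = ci if radius(ci, mi) >= radius(cj, mj) else cj
            clusters = [clusters[t] for t in range(len(clusters)) if t not in (i, j)]
            clusters.append((c, mi + mj))
    return clusters
```

For example, `center_greedy([(0,), (1,), (10,), (11,)], 2)` ends with the two natural clusters {0, 1} and {10, 11} on the line.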
a) Center-Greedy cont. Proof: • 1. Tree construction, k = 2 [figure: a tree on vertices v1, . . . , v5 grouped into sets S0, S1, S2, S3, with edge weights 0, 1, and −1]
a) Center-Greedy cont. • 2. Tree graph [figure: the same tree on v1, . . . , v5 and sets S0, . . . , S3, with edge weights 1, 1 − e1, 1 − e2, 1 − e3] • The set Ai (in our example Ai = {{v1}, {v2}, {v3}, {v4}})
a) Center-Greedy cont. Claims: • For 1 ≤ i ≤ 2k − 1, Ai is the set of clusters of center-greedy which contain more than one vertex after the k + i vertices v1, . . . , vk+i are given. • There is a k-clustering of G of diameter 1. The clustering which achieves this diameter is {S0 ∪ S1, . . . , S2k−2 ∪ S2k−1}.
Competitiveness of Center-Greedy Theorem: The center-greedy algorithm has a performance ratio of 2k − 1 in any metric space.
b) Diameter-Greedy The diameter-greedy algorithm always merges the two clusters which minimize the diameter of the resulting merged cluster. Theorem: The diameter-greedy algorithm’s performance ratio is Ω(log k), even on the line.
b) Diameter-Greedy cont. • Proof: 1) Assumptions: Ui = ∪_{j=1}^{Fi} {{pij, qij}, {rij, sij}}, Vi = ∪_{j=1}^{Fi} {{qij}, {rij}}, Wi = ∪_{j=1}^{Fi} {{pij}, {qij, rij}}, Xi = ∪_{j=1}^{Fi} {{pij}, {qij, rij}, {sij}}, Yi = ∪_{j=1}^{Fi} {{pij, qij, rij}, {sij}}, Zi = ∪_{j=1}^{Fi} {{pij, qij, rij, sij}}.
b) Diameter-Greedy cont. • Proof: 2) Invariant: when the last element of Kt is received, diameter-greedy’s k + 1 clusters are (∪_{i=1}^{t−2} Zi) ∪ Y_{t−1} ∪ Xt ∪ (∪_{i=t+1}^{r} Vi). Since there are k + 1 clusters, two of them have to be merged, and the algorithm merges two clusters in V_{t+1} to form a cluster of diameter (t + 1). Without loss of generality, we may assume that the clusters merged are {q(t+1)1} and {r(t+1)1}.
Competitiveness of Diameter-Greedy Theorem: For k = 2, the diameter-greedy algorithm has a performance ratio of 3 in any metric space.
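The diameter-greedy merge rule can be sketched the same way (a hypothetical Euclidean toy, not the paper's implementation):

```python
from itertools import combinations
from math import dist  # Euclidean stand-in for a general metric

def cluster_diameter(pts):
    return max((dist(p, q) for p, q in combinations(pts, 2)), default=0.0)

def diameter_greedy(points, k):
    clusters = []
    for p in points:
        clusters.append([p])
        if len(clusters) > k:
            # merge the pair whose union has the smallest diameter
            i, j = min(combinations(range(len(clusters)), 2),
                       key=lambda ij: cluster_diameter(clusters[ij[0]] + clusters[ij[1]]))
            merged = clusters[i] + clusters[j]
            clusters = [clusters[t] for t in range(len(clusters)) if t not in (i, j)]
            clusters.append(merged)
    return clusters
```

The Ω(log k) lower bound above shows that, unlike center-greedy, this rule can be forced into bad merges by an adversarial arrival order even on the line.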
2.Doubling Algorithm • Deterministic • Randomized • Oblivious • Randomized Oblivious
a) Deterministic doubling algorithm • The algorithm works in phases. • At the start of phase i it has k + 1 clusters. • It uses parameters α and β with α/(α − 1) ≤ β. • At the start of phase i the following invariants hold: 1. for each cluster Cj, the radius of Cj, defined as max_{p ∈ Cj} d(cj, p), is at most α·di; 2. for each pair of clusters Cj and Cl, the inter-center distance d(cj, cl) is at least di; 3. di ≤ OPT.
a) Deterministic doubling algorithm • Each phase has two stages: 1. a merging stage, in which the algorithm reduces the number of clusters by merging certain pairs; 2. an update stage, in which the algorithm accepts new updates and tries to maintain at most k clusters without increasing the radius of the clusters or violating the invariants. A phase ends when the number of clusters exceeds k.
a) Deterministic doubling algorithm • Definition: The t-threshold graph on a set of points P = {p1, p2, . . . , pn} is the graph G = (P, E) such that (pi, pj) ∈ E if and only if d(pi, pj) ≤ t. • The merging stage sets di+1 = β·di and builds the di+1-threshold graph on the centers c1, . . . , ck+1. • The merged clusters are C′1, . . . , C′m. If m = k + 1, phase i ends here.
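A small sketch of the threshold graph and a greedy reading of the merging stage (the exact merging mechanics here are my assumption; what the slides guarantee is only that surviving centers end up far apart and absorbed centers end up close to a survivor):

```python
from math import dist  # Euclidean stand-in for a general metric

def threshold_graph(points, t):
    """t-threshold graph: edge (i, j) iff d(p_i, p_j) <= t."""
    n = len(points)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if dist(points[i], points[j]) <= t}

def merging_stage(centers, t):
    # Greedy sketch: keep centers one by one, skipping any center adjacent
    # (within t) to an already-kept one. Survivors are pairwise more than t
    # apart; every dropped center lies within t of some survivor.
    edges = threshold_graph(centers, t)
    survivors = []
    for i in range(len(centers)):
        if all((min(i, s), max(i, s)) not in edges for s in survivors):
            survivors.append(i)
    return [centers[i] for i in survivors]

print(merging_stage([(0,), (1,), (3,), (7,)], t=2))  # → [(0,), (3,), (7,)]
```

Here (1,) is dropped because it is within the threshold of (0,), matching the intuition that centers closer than di+1 collapse into one cluster.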
a) Deterministic doubling algorithm • Lemma: The pairwise distance between cluster centers after the merging stage of phase i is at least di+1. • Lemma: The radius of the clusters after the merging stage of phase i is at most di+1 + α·di ≤ α·di+1. • The update stage continues while the number of clusters is at most k; insertions are restricted by the radius bound α·di+1. Then phase i ends.
a) Deterministic doubling algorithm • Initialization: the algorithm waits until k + 1 points have arrived, then enters phase 1 with each point as the center of a cluster containing just itself, and d1 set to the distance between the closest pair of points.
a) Deterministic doubling algorithm • Lemma: The k + 1 clusters at the end of the ith phase satisfy the following conditions: 1. the radius of the clusters is at most α·di+1; 2. the pairwise distance between the cluster centers is at least di+1; 3. di+1 ≤ OPT, where OPT is the diameter of the optimal clustering for the current set of points. Theorem: The doubling algorithm has performance ratio 8 in any metric space.
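A quick numeric check (my own arithmetic, not spelled out in the slides) of how the condition on α and β yields the ratio 8 with the natural choice α = β = 2:

```python
# With alpha = beta = 2 the condition alpha/(alpha - 1) <= beta holds with equality.
alpha, beta = 2.0, 2.0
assert alpha / (alpha - 1) <= beta

d_i = 1.0           # current lower bound on OPT
d_next = beta * d_i  # d_{i+1} after the merging stage

# Radius after merging is at most d_{i+1} + alpha*d_i, which must stay <= alpha*d_{i+1}:
assert d_next + alpha * d_i <= alpha * d_next

# Diameter <= 2 * radius <= 2*alpha*d_{i+1}, while OPT >= d_i = d_{i+1}/beta,
# so the performance ratio is at most 2*alpha*beta.
print(2 * alpha * beta)  # → 8.0
```

Any α, β with α/(α − 1) ≤ β make the radius invariant self-sustaining across phases; α = β = 2 minimizes the resulting bound 2αβ.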
a) Deterministic doubling algorithm Example showing the analysis is tight (k ≥ 3): the input consists of k + 3 points p1, . . . , pk+3; the points p1, . . . , pk+1 are at pairwise distance 1, while pk+2 and pk+3 are at distance 4 from the others and at distance 8 from each other.
b) Randomized doubling algorithm • Choose a random value r from [1/e, 1] according to the probability density function 1/r. • Let x be the minimum pairwise distance among the first k + 1 points, and set d1 = r·x. • Take β = e and α = e/(e − 1).
b) Randomized doubling algorithm Theorem: The randomized doubling algorithm has expected performance ratio 2e in any metric space. The same bound is also achieved for the radius measure.
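Drawing r with density 1/r on [1/e, 1] can be done by inverse-transform sampling (this derivation is mine; the slides only name the density): the CDF is F(r) = ln r + 1, so r = e^(u−1) for u uniform on [0, 1].

```python
import random
from math import exp

def sample_r():
    # Inverse-transform sampling for density 1/r on [1/e, 1]:
    # F(r) = integral_{1/e}^{r} dt/t = ln(r) + 1, hence r = exp(u - 1).
    u = random.random()
    return exp(u - 1)

samples = [sample_r() for _ in range(10_000)]
# u = 0 maps to 1/e and u = 1 maps to 1, so all samples fall in [1/e, 1].
```

The randomization smooths out the adversary's ability to place points just past a phase boundary, which is where the deterministic ratio 8 drops to the expected 2e.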
c) Oblivious clustering algorithm • Does not need to know k. • Assume we have an upper bound of 1 on the maximum distance between points. • Points are maintained in a tree.
c) Oblivious clustering algorithm cont. The tree satisfies, for every depth i ≥ 0 (the root is at depth 0): • each vertex at depth i is within distance 1/2^(i−1) of its parent; • vertices at depth i are at pairwise distance greater than 1/2^i.
c) Oblivious clustering algorithm cont. • How do we obtain the k clusters from the tree? • If k is given, let i be the greatest depth containing at most k vertices. • These vertices are the k cluster centers; the subtrees rooted at the vertices at depth i are the clusters. • As points are added, the number of vertices at depth i increases; if it goes beyond k, we change i to i − 1, collapsing certain clusters; otherwise, the new point is inserted into one of the existing clusters.
c) Oblivious clustering algorithm cont. Theorem: The algorithm that outputs the k clusters obtained from the tree construction has performance ratio 8 for both the diameter measure and the radius measure. Proof sketch: • Suppose the optimal diameter d satisfies 1/2^(i+1) < d ≤ 1/2^i. • Then vertices at depth i are in different optimal clusters (they are more than 1/2^i ≥ d apart), so there are at most k of them. • Let j ≥ i be the greatest depth containing at most k vertices. • Every point in a subtree rooted at depth j is within distance 1/2^j + 1/2^(j+1) + 1/2^(j+2) + · · · = 1/2^(j−1) ≤ 1/2^(i−1) < 4d of its root.
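An illustrative insertion routine maintaining the two tree invariants (the exact insertion mechanics here are my assumption; the slides only state the invariants):

```python
from math import dist

class Node:
    def __init__(self, point, depth):
        self.point, self.depth, self.children = point, depth, []

def insert(root, p):
    # Attach p as a child of the deepest vertex v with d(v, p) <= 1/2**depth(v).
    # This keeps both invariants: the new node at depth i+1 is within 1/2**i of
    # its parent, and no other depth-(i+1) vertex can be within 1/2**(i+1) of p
    # (such a vertex would itself have been a deeper valid parent).
    nodes, stack = [], [root]
    while stack:
        v = stack.pop()
        nodes.append(v)
        stack.extend(v.children)
    parent = max((v for v in nodes if dist(v.point, p) <= 0.5 ** v.depth),
                 key=lambda v: v.depth, default=root)
    parent.children.append(Node(p, parent.depth + 1))

root = Node((0.0,), 0)   # assumes all pairwise distances are at most 1
insert(root, (0.6,))     # becomes a child of the root at depth 1
insert(root, (0.55,))    # within 1/2 of (0.6,), so it nests at depth 2
```

Because insertion never moves existing vertices, the structure is oblivious to k: any k can be answered later by cutting the tree at the appropriate depth.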
d) Randomized Oblivious • The distance threshold for depth i is r/e^i. • r is chosen once at random from [1, e], according to the PDF 1/r. • The expected diameter is at most 2e·OPT.
Lower Bounds Theorem 1: For k ≥ 2, there are lower bounds of 2 and 2 − 1/2^(k/2) on the performance ratio of deterministic and randomized algorithms, respectively, for incremental clustering on the line.
Lower Bounds cont. Theorem 2: There is a lower bound of 1 + √2 on the performance ratio of any deterministic incremental clustering algorithm for arbitrary metric spaces.
Lower Bounds cont. Theorem 3: For any ε > 0 and k ≥ 2, there is a lower bound of 2 − ε on the performance ratio of any randomized incremental algorithm.
Lower Bounds cont. Theorem 4: For the radius measure, no deterministic incremental clustering algorithm has a performance ratio better than 3, and no randomized algorithm has a ratio better than 3 − ε for any fixed ε > 0.
II. Dual Problem For a sequence of points p1, p2, . . . , pn ∈ R^d, cover each point with a unit ball in R^d as it arrives, so as to minimize the total number of balls used.
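A naive incremental baseline for this problem (my own sketch: center a new unit ball on each uncovered point as it arrives; this is not the paper's algorithm, whose O(2^d · d log d) ratio relies on a structured covering of R^d):

```python
from math import dist

def online_unit_cover(points):
    """Place a unit ball (radius 1) centered on each arriving point
    that no existing ball already covers."""
    centers = []
    for p in points:
        if not any(dist(p, c) <= 1 for c in centers):
            centers.append(p)
    return centers

print(len(online_unit_cover([(0, 0), (0.5, 0), (3, 0), (3.2, 0)])))  # → 2
```

The baseline's centers are pairwise more than 1 apart, so its ratio degrades with the number of such points a single optimal ball can contain, which grows with the dimension d; this is the gap the paper's structured algorithm addresses.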
II. Dual Problem Rogers’ Theorem: R^d can be covered by translates of any convex shape with covering density O(d log d). Theorem: For the dual clustering problem in R^d, there is an incremental algorithm with performance ratio O(2^d · d log d). Theorem: For the dual clustering problem in R^d, any incremental algorithm must have performance ratio Ω(log d / log log log d).