540 likes | 707 Views
On the Parallel Complexity of Hierarchical Clustering and CC -Complete Problems. Dr. Raymond Greenlaw School of Computing Armstrong Atlantic State University and Dr. Sanpawat Kantabutra Department of Computer Science Chiang Mai University. Outline. Introduction Preliminaries
E N D
On the Parallel Complexity of Hierarchical Clustering and CC-Complete Problems Dr. Raymond Greenlaw School of Computing Armstrong Atlantic State University and Dr. Sanpawat Kantabutra Department of Computer Science Chiang Mai University
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
Introduction • Clustering is a division of data into groups of ‘similar’ objects, where each group is given a more-compact representation. • Used to model very large data sets. • Points are more similar to their own cluster than to points in other clusters.
Introduction • Useful tool in data mining, where immense data sets which are difficult to store and to manipulate are involved. • Study the parallel complexity of the hierarchical clustering problem. • Builds a tree of clusters. • Sibling clusters in this tree partition the points associated with their parent. • Can explore data using various levels of granularity.
Introduction • Two widely studied models • Bottom-Up Starts with single-point clusters and then recursively merges two or more of the most-‘appropriate’ clusters. • Top-Down Starts with one large cluster consisting of all the data points and then recursively splits the most-‘appropriate’ cluster. • In both methods, the process continues until a desired stopping condition is met such as a required number of clusters or a diameter bound of the ‘largest’ cluster.
Introduction • A variety of sequential versions of hierarchical-clustering methods have been studied: • Cure Guha, et al.: Bottom-Up, good for clusters having arbitrary shapes or outliers • Chameleon Karypis et al.: Bottom-Up, relies heavily on graph partitioning • Principal Direction Divisive Partitioning Boley: Top-Down, good for document collections • Hierarchical Divisive Bisecting k-means Steinbach: Top-Down
Introduction • Address the parallel complexity of hierarchical clustering. • Describe known sequential algorithms for top-down and bottom-up hierarchical clustering. • Parallelize top-down, when n points are to be clustered, provide an O(log n)-time, n2-processor CREW-PRAM algorithm that computes the same output as the corresponding sequential algorithm.
Introduction • Define a natural decision problem based on bottom-up hierarchical clustering and add this Hierarchical Clustering Problem (HCP) to the list of CC-complete problems, adding a data mining problem for the first time. • Show that HCP is one of the computationally most-difficult problems in the Comparator Circuit Value Problem (CCVP) class.
Introduction • Demonstrate that the HCP is very unlikely to have an NC algorithm. • In sharp contrast, give an NC algorithm for the top-down sequential approach. • Parallel complexities of top-down and bottom-up are different, unless CC equals NC.
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
Preliminaries • Interested in relating the complexity of hierarchical clustering to that of a problem involving Boolean circuits containing comparator gates. • Comparator gates have two output wires, the first outputting the minimum and the second outputting the maximum of its two inputs. • Each output has a maximum fanout of one.
Preliminaries • Based on the comparator gate • Basis for an entire complexity class Comparator Circuit Value Problem (CCVP) • Given: An encoding of a Boolean circuit composed of comparator gates, inputs x1,…,xn, and a designated output y. • Problem: Is output y of TRUE on input x1,…,xn?
Preliminaries • Let P denote the class of all languages decidable in polynomial time. • Let NC denote the class of all languages decidable in poly-logarithmic time using a polynomial number of processors on a PRAM. • Let RNC denote the randomized version of NC. • Let NLOG denote the class non-deterministic logarithmic space. • Let CC denote the class of problems that are NC many-one reducible to CCVP.
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
Algorithms for Hierarchical Clustering Sequential Algorithms – Bottom-Up • Input: set of points, distance function, bound B, and desired number of clusters, k • Output: set of clusters • Pair up all points starting with the two closest ones, then the next remaining two closest ones, and so on, until all are paired. • Next, the sets of points X and Y minimizing dmin(X,Y) over all remaining sets are merged, until only k sets remain.
Algorithms for Hierarchical Clustering Sequential Algorithms – Bottom-Up (cont.) • Assumed that the number of input points is even. • There are no restrictions placed on the distance function. • In the first phase of the algorithm points are clustered whose distance is less than or equal to B. • Operates in polynomial time.
Algorithms for Hierarchical Clustering Sequential Algorithms – Top-Down • Function v(G) takes a graph as its argument and returns a set that consists of the vertices from G. • Input: set of points, a distance function, and the desired number of clusters k • Output: set of clusters • All points start in the same cluster.
Algorithms for Hierarchical Clustering • Compute a minimum-cost spanning tree. • Form clusters by repeatedly removing the highest-cost edge from what remains of a minimum-cost spanning tree of the graph corresponding to the initial set of points with respect to the distance function, until exactly k sets have been formed. • Runs in polynomial time.
Algorithms for Hierarchical Clustering • Top-Down and Bottom-Up have different parallel complexities, unless CC equals NC. • Prove that the exact same clusters as produced by the Sequential (Top-Down) Hierarchical Clustering Algorithm can be computed in NC. • A natural decision problem based on the Sequential (Bottom-Up) Hierarchical Clustering Algorithm is CC-complete.
Algorithms for Hierarchical Clustering • Since a CC-complete problem is very unlikely to have an NC algorithm and a problem with an NC algorithm is very unlikely to be CC-complete, the parallel complexities of these two sequential algorithms are different. • For a fast parallel algorithm for hierarchical clustering, the algorithm should be based on the Top-Down approach.
Algorithms for Hierarchical Clustering • Theorem: Let n denote the number of points to be clustered. The Parallel (Top-Down) Hierarchical Clustering Algorithm can be implemented in O(log n) time using n2 processors on the CREW PRAM. • This algorithm is an NC algorithm, which means that the clusters can be computed very fast in parallel. • Any reasonable decision problem based on this algorithm will be in NC.
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
Complexity of Hierarchical Clustering Hierarchical Clustering Problem (HCP) • Given: A set S of n points in Rd, a distance function dS : S x S N, the number of clusters k ≤ n/2 N, a distance bound B, and two points x, y S. • Problem: Are x and y with dS(x, y) ≤ B in the same cluster C after the first-phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm?
Complexity of Hierarchical Clustering • No restrictions placed on the properties the distance function must satisfy, the distances themselves must be natural numbers. • This version of the problem easily reduces to the version where the weights come from R+. • Not concerned with the distance between a point and itself, the k is the number of clusters to be formed. • x and y are required to be no further apart than the distance bound B.
Complexity of Hierarchical Clustering Lexicographically First Maximal Matching Problem (LFMMP) • Given: An undirected graph G = (V, E) with an ordering on its edges plus a distinguished edge e E. • Problem: Is e in the lexicographically first maximal matching of G? • A matching is maximal if it cannot be extended.
NC m Complexity of Hierarchical Clustering • LFMMP is CC-complete [Cook 1982, Mayr and Subramanian 1992]. • Theorem: The Hierarchical Clustering Problem is NC many-one reducible to the Lexicographically First Maximal Matching Problem, that is, HCP ≤ LFMMP. • HCP is in CC.
NC m Complexity of Hierarchical Clustering • Theorem: The Lexicographically First Maximal Matching Problem is NC many-one reducible to the Hierarchical Clustering Problem, that is, LFMMP ≤ HCP. • Proof Sketch: Let G = (V = {1,…,n},E), ø : E {1,…,|E|} be an ordering on E, and e = {u,v} E be an instance of the LFMMP. Construct instance of HCP, a set S of n points p1,…,pn in Rd, a distance function dS : S x S N, clusters k≤ n/2 N, bound B, and x,yS.
Complexity of Hierarchical Clustering • Proof (cont.): Let S = {1,…,n,n+1,…,2n}. Let V’ = S – V. Define the distance function between each pair of points in S as follows: • Let B = |E|, k = n, and take u and v as our points {x,y} E
Complexity of Hierarchical Clustering • Theorem: The Hierarchical Clustering Problem is CC-complete. • This expands the list of CC-complete problems and adds the first clustering/data mining problem to the class.
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
CC-Complete Problems Comparator Circuit Value Problem (CCVP) • Given: An encoding of a Boolean circuit composed of comparator gates, inputs x1,…,xn, and a designated output y. • Problem: Is output y of TRUE on input x1,…,xn? • References: [Cook 1982, Mayr and Subramanian 1992]
CC-Complete Problems Lexicographically First Maximal Matching Problem (LFMMP) • Given: An undirected graph G = (V, E) with an ordering on its edges plus a distinguished edge e E. • Problem: Is e in the lexicographically first maximal matching of G? • References: [Cook 1982, Mayr and Subramanian 1992] • Remarks: Resembles the Lexicographically First Maximal Independent Set Problem which is P-complete.
CC-Complete Problems Stable Marriage Problem (SMP) • Given: A set of n men and a set of n women. For each person a ranking of the opposite sex according to their preference for a marriage partner. • Problem: Does the given instance of the problem have a set of marriages that is stable? The set is stable if there is no unmatched pair {m, w} such that both m and w prefer each other to their current partners.
CC-Complete Problems Stable Marriage Problem (SMP) • References: [Mayr and Subramanian 1992, Subramanian 1989] • Remarks: If the preference lists are complete, there is always a solution. Several variations of the SMP are also known to be equivalent to the CCVP. The Male-Optimal Stable Marriage Problem finds a matching in which no man could do any better in a stable marriage.
CC-Complete Problems Stable Marriage Stable Pair Problem (SMSPP) • Given: A set of n men and n women, for each person a ranking of the opposite sex according to their preference for a marriage partner, and a designated couple Alice and Bob. • Problem: Are Alice and Bob a stable pair for the given instance of the problem? That is, is it the case that Alice and Bob are married to each other in some stable marriage? • References: [Mayr and Subramanian 1992, Subramanian 1989]
CC-Complete Problems Stable Marriage Minimum Regret Problem (SMMRP) • Given: A set of n men and n women, for each person a ranking of the opposite sex according to their preference for a marriage partner, and a natural number k, 1 ≤ k ≤ n. • Problem: Is there a stable marriage in which every person has regret at most k? The regret of a person in a stable marriage is the position of her mate on her preference list.
CC-Complete Problems Stable Marriage Minimum Regret Problem (SMMRP) • References: [Mayr and Subramanian 1992, Subramanian 1989] • Remarks: The goal in this problem is to minimize the maximum regret of any person.
CC-Complete Problems Telephone Connection Problem (TCP) • Given: A telephone line with a fixed channel capacity k, a natural number l, and a sequence of calls (s1, f1),…, (sn, fn), where si(fi) denotes the starting (respectively, finishing) time of the i-th call. The i-th call can be serviced at time si if the number of calls being served at that time is less than k. If the call cannot be served, it is discarded. When a call is completed, the channel is freed up.
CC-Complete Problems Telephone Connection Problem (TCP) • Problem: Is the l-th call serviced? • References: [Ramachandran and Wang 1991] • Remarks: There is an O(min( ,k) log n)-time EREW-PRAM algorithm that uses n processors for solving the TCP.
CC-Complete Problems Internal Diffusion Limited Aggregation Predication Problem (IDLAPP) • Given: A time T and a list of moves (t,i,s), one for each time 0 ≤ t ≤ T indicating that at time t for particle i, if still active, will visit site s, plus a designated site d, and a designated particle p. A particle is active if it is still moving within the cluster, that is, the particle has not yet stuck to the cluster because all of the sites that it has visited so far were occupied already.
CC-Complete Problems Internal Diffusion Limited Aggregation Predication Problem (IDLAPP) • Problem: Is site d occupied and is site p active at time T? • References: [Moore and Machta 2000]
CC-Complete Problems Internal Diffusion Limited Aggregation Predication Square Lattice Problem • Given: A time T and a list of moves (t,i,s) on a square lattice, one for each time 0 ≤ t ≤ T indicating that at time t for particle i, if still active, will visit site s, plus a designated site d, and a designated particle p. • Problem: Is site d on the square latice occupied and is site p active at time T? • References: [Moore and Matcha 2000]
CC-Complete Problems Hierarchical Clustering Problem (HCP) • Given: A set S of n points in Rd, a distance function dS : S x S N, the number of clusters k ≤ n/2 N, a distance bound B, and two points x, y S. • Problem: Are x and y with dS(x, y) ≤ B in the same cluster C after the first-phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm? • Reference: [This work 2006]
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
Conclusions • A natural decision problem based on bottom-up hierarchical clustering is CC-complete. • Top-down hierarchical clustering is in NC. • Brings the number of known CC-complete problems to ten, and shows that the HCP is unlikely to have a NC algorithm. • Fast parallel algorithms for hierarchical clustering should be based on a top-down approach.
Open Problems • Is Euclidean HCP CC-complete? (It is in CC.) • Determine the complexity of the second-phase of the Sequential (Bottom-Up) Hierarchical Clustering Algorithm. • Add new problems to the class of CC-complete problems.
Outline • Introduction • Preliminaries • Algorithms for Hierarchical Clustering • Complexity of Hierarchical Clustering • CC-Complete Problems • Conclusions and Open Problems • References • Acknowledgments
[Blumenthal 1953] Theory and Applications of Distance Geometry. Oxford University Press. [Boley 1998] Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4):325—344. [Chong, Han, and Lam 2001] Concurrent threads and optimal parallel minimum spanning tree algorithms. Journal of the ACM, 48(2):297—323. [Cole 1988] Parallel Merge Sort. SIAM Journal of Computing, 17(4):770—785. [Cook 1985] A taxonomy of problems with fast parallel algorithms. Information and Control, 64(1—3):2—22. [Dash, Petrutiu, and Scheuermann 2004] Efficient parallel hierarchical clustering. Lecture Notes in Computer Science, 3149:363—371. [Feder 1992] A new fixed point approach to stable networks and stable marriages. Journal of Computer and System Sciences, 45(2):233—284. [Gibbons 1985] Algorithmic Graph Theory. Cambridge University Press. [Greenlaw 1992] A model classifying algorithms as inherently sequential with applications to graph searching. Information and Computing, 97(2):133—149. References
References • [Greenlaw, Hoover, and Ruzzo 1995] Limits to Parallel Computation: P-Completeness Theory. Oxford University Press. • [Guha, Rastogi, and Shim 1998] Cure: An efficient clustering algorithm for large databases. In ACM SIGMOD, pages 378—385, Seattle, WA. Association for Computing Machinery. • [Jain and Dubes 1988] Algorithms for Clustering Data. Prentice-Hall. • [Karypis, Han, and Vumar 1999] CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68—75. • [Kaufman and Rousseeuw 1990] Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons. • [Li 1990] Parallel algorithms for hierarchical clustering and clustering validity. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(11):1088—1092. • [Li and Fang 1989] Parallel clustering algorithms. Parallel Computing, 11(3):275—290. • [Mayr and Subramanian 1992] The complexity of circuit value and network stability. Journal of Computer and System Sciences, 44(2):302—323.