Constraint-Driven Clustering. Rong Ge 1 , Martin Ester 1 , Wen Jin 1 , Ian Davidson 2 Presenter: Rong Ge 1 Simon Fraser University 2 University of California - Davis. Introduction. Clustering methods aim at grouping data objects into clusters based on some criteria
Constraint-Driven Clustering • Rong Ge¹, Martin Ester¹, Wen Jin¹, Ian Davidson² • Presenter: Rong Ge • ¹Simon Fraser University • ²University of California, Davis
Introduction • Clustering methods • aim at grouping data objects into clusters based on some criteria • can be either data-driven or need-driven [Banerjee’06] • Data-Driven methods • discover the true structure of the underlying data by grouping similar data objects together • Need-Driven methods • group data objects based not only on similarity but also on application needs • discover more actionable clusters
Capturing Application Needs • Two methodologies: • Design sophisticated objective functions based on business needs • E.g., in catalog segmentation, clustering results are evaluated by their utility in decision making [Kleinberg et al.’99] • Capture application needs with constraints • E.g., discovering balanced customer groups in market segmentation [Ghosh et al.’02] • Yet, existing models often require users to provide the number of clusters, which is • Often unknown • Or not suited to the application needs
Constraint-Driven Clustering • Constraint-Driven Clustering • Utilizes constraints to control cluster formation • Discovers an arbitrary number of clusters • Goals: • Discover compact clusters • Satisfy all constraints • Two constraint types (Cluster-level constraints) • Minimum significance constraint • Specifies the minimum number of objects in a cluster • Minimum variance constraint • Specifies the minimum variance of a cluster
Motivation - Energy Aware Sensor Networks • Goal: minimize energy consumption • Solution: • Group sensors into clusters • A master node is selected from sensors in a cluster or deployed • Other sensors communicate with the outside through the master nodes • Constraint-Driven Clustering: • Minimum Significance Constraint • Balances the workload of master nodes • Minimum Variance Constraint • Allows sensor clusters to be balanced in terms of energy consumption • [Figure: sensor network showing sensors, master nodes, communication channels, and the command node]
Motivation - Privacy Preservation • Goal: publish personal records without a privacy breach • Solution: • Group records into clusters • Release the summary of each cluster to the public • Constraint-Driven Clustering: • Minimum Significance Constraint • Similar to k-Anonymity in preserving individual privacy • Minimum Variance Constraint • Variance translates into the width of the confidence interval of the adversary's estimate • Prevents clusters of similar, or even identical, records from being released
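The release step above can be illustrated with a minimal Python sketch. It is not taken from the talk: the summary format (count, mean, variance), the function names `summarize` and `publish`, and the definition of cluster variance as the mean squared distance to the cluster mean are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Record = Sequence[float]


def summarize(cluster: List[Record]) -> Dict[str, object]:
    """Summarize a cluster as count, per-attribute mean, and variance.

    Assumption: variance = mean squared Euclidean distance to the mean.
    """
    n = len(cluster)
    dim = len(cluster[0])
    mean = [sum(r[d] for r in cluster) / n for d in range(dim)]
    sq_dist = sum(sum((x - m) ** 2 for x, m in zip(r, mean)) for r in cluster)
    return {"count": n, "mean": mean, "variance": sq_dist / n}


def publish(clusters: List[List[Record]], sig: int, var: float) -> List[Dict[str, object]]:
    """Return one summary per cluster, refusing to release any cluster
    that violates the minimum significance or minimum variance constraint."""
    summaries = []
    for cluster in clusters:
        s = summarize(cluster)
        if s["count"] < sig or s["variance"] < var:
            raise ValueError("cluster violates a constraint; not released")
        summaries.append(s)
    return summaries


if __name__ == "__main__":
    # Hypothetical two-attribute records grouped into one cluster.
    print(publish([[(1.0, 2.0), (3.0, 6.0)]], sig=2, var=1.0))
```

In this sketch a cluster of identical records has variance 0 and is rejected whenever `var` is positive, which mirrors the last bullet of the slide.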
Related Work • Clustering with Cluster-level Constraints • Constrained k-means algorithm [Bradley et al.’00] • The existential constraint [Tung et al.’01] • Specifies the minimum # of objects in a subset of the input data • Is a generalization of the minimum significance constraint • Differs from our model: K must be specified • K-Anonymity [Samarati et al.’98][Sweeney et al.’02] • Each record is indistinguishable from k-1 other records • On categorical data • PPMicroCluster [Jin et al.’06] • Minimum significance and minimum radius constraints • The constraint is posed on the radius of a cluster • Did not analyze the complexity of the clustering model
Constraint-Driven Clustering (CDC) • Given a set of points $P$ and a set of constraints $C$ • Partition $P$ into disjoint clusters $\{P_1, \ldots, P_m\}$ s.t.: • Each cluster satisfies all constraints • The sum of squared distances of data points to their corresponding cluster representatives is minimized • Constraints • For each cluster $P_i$, $1 \le i \le m$: • Minimum significance: $|P_i| \ge Sig$ • Minimum variance: $Var(P_i) \ge Var$ • Our model searches for clusters which are balanced in terms of cardinality and/or variance
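As a rough illustration of the model, the following Python sketch evaluates a candidate partition: it checks the minimum significance and minimum variance constraints on every cluster and, if all hold, returns the sum of squared distances. Two points are assumptions rather than statements from the slides: the cluster representative is taken to be the centroid, and the variance of a cluster is taken to be its SSE divided by its size. All function names are hypothetical.

```python
from typing import List, Optional, Sequence

Point = Sequence[float]


def centroid(cluster: List[Point]) -> List[float]:
    """Assumed cluster representative: the coordinate-wise mean."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]


def sse(cluster: List[Point]) -> float:
    """Sum of squared Euclidean distances to the centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)


def variance(cluster: List[Point]) -> float:
    """Assumed definition of cluster variance: SSE divided by cluster size."""
    return sse(cluster) / len(cluster)


def satisfies_constraints(cluster: List[Point], sig: int, var: float) -> bool:
    """Minimum significance and minimum variance constraints for one cluster."""
    return len(cluster) >= sig and variance(cluster) >= var


def cdc_objective(clusters: List[List[Point]], sig: int, var: float) -> Optional[float]:
    """Total SSE of the partition, or None if some cluster is infeasible."""
    if not all(satisfies_constraints(c, sig, var) for c in clusters):
        return None
    return sum(sse(c) for c in clusters)


if __name__ == "__main__":
    partition = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
    print(cdc_objective(partition, sig=2, var=1.0))
    print(cdc_objective(partition, sig=3, var=1.0))
```

Note that the number of clusters is not an input: any partition whose clusters all pass `satisfies_constraints` is a candidate, and the model prefers the one with the lowest objective.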
Theoretical Results • Note that the CDC problem has a feasible solution whenever the whole data set, taken as a single cluster, satisfies the given constraints
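The feasibility remark can be turned into a simple check, sketched below in Python: if the whole data set viewed as one cluster meets both constraints, then the single-cluster partition is itself a feasible solution. The function name is hypothetical, and the variance definition (mean squared distance to the mean) is an assumption.

```python
from typing import List, Sequence

Point = Sequence[float]


def single_cluster_is_feasible(points: List[Point], sig: int, var: float) -> bool:
    """Check whether the trivial partition {P} satisfies both constraints.

    Assumption: variance = mean squared Euclidean distance to the mean.
    """
    n = len(points)
    if n == 0 or n < sig:
        return False
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    sq_dist = sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in points)
    return sq_dist / n >= var


if __name__ == "__main__":
    data = [(0, 0), (2, 0), (10, 0), (12, 0)]
    print(single_cluster_is_feasible(data, sig=2, var=1.0))
```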
Heuristic Algorithm • Intuition • The generated clusters must be balanced • Membership assignment of each point depends on its close neighbors • Data structure: CD-Tree • Helps to retrieve close neighbors easily • Obtain a solution to the CDC problem by post-processing leaf nodes • Two parameters • Significance parameter S (S = Sig) • Variance parameter V (V = Var)
CD-Tree • Leaf nodes • Each entry contains an individual data point • Upper-bound capacity and variance • Max capacity: 2S - 1 (in an optimal solution, no cluster contains more than 2S - 1 data objects) • Max variance: 2V (to keep leaf nodes compact s.t. the SSE is minimized) • Non-leaf nodes • Each entry • contains pointers to child nodes and summaries of points in the child nodes • corresponds to the subtree rooted at the child node • Max capacity Z (a constant, can be set arbitrarily)
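The leaf-node bounds can be sketched as a small Python class. This is only an illustration of the two upper bounds (capacity 2S - 1 and variance 2V), not the authors' implementation: the class name `LeafNode`, its methods, and the variance definition (mean squared distance to the mean) are assumptions, and a rejected insertion simply returns `False` where a real CD-Tree would presumably trigger a split.

```python
from typing import List, Sequence

Point = Sequence[float]


def _variance(points: List[Point]) -> float:
    """Assumed variance: mean squared Euclidean distance to the mean."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in points) / n


class LeafNode:
    """Hypothetical CD-Tree leaf holding individual data points."""

    def __init__(self, s: int, v: float) -> None:
        self.max_capacity = 2 * s - 1   # upper bound on the number of points
        self.max_variance = 2 * v       # upper bound on the leaf variance
        self.points: List[Point] = []

    def can_accept(self, point: Point) -> bool:
        """True if adding the point keeps both upper bounds satisfied."""
        candidate = self.points + [point]
        return (len(candidate) <= self.max_capacity
                and _variance(candidate) <= self.max_variance)

    def insert(self, point: Point) -> bool:
        """Insert the point if allowed; otherwise report that a split is needed."""
        if not self.can_accept(point):
            return False
        self.points.append(point)
        return True


if __name__ == "__main__":
    leaf = LeafNode(s=2, v=1.0)          # capacity 3, max variance 2.0
    for p in [(0, 0), (1, 0), (10, 0), (2, 0), (1, 0)]:
        print(p, leaf.insert(p))
```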
CD-Tree vs. CF-Tree and R*-Tree • CF-Tree • Does not save individual data points • No max capacity specified for leaf nodes • R*-Tree • No max variance specified for leaf nodes • Neither the CF-Tree nor the R*-Tree is designed for generating clusters that satisfy constraints • CD-Tree • One CD-Tree is built for a set of constraints • When a constraint value is changed slightly, we can obtain a solution by post-processing leaf nodes
Algorithm • Two steps: • Build the CD-Tree (Insertion and Split) • Post-process leaf nodes to solve the CDC problem • [Figure: example CD-Tree with S = 5, showing the root, non-leaf nodes nll and nlr, and leaf nodes l1, l2, l3]
Experimental Results • Comparison partner • PPMicroCluster algorithm • Similar problem definition • Can be adapted to handle the minimum variance constraint • Static algorithm • Data sets • Synthetic data set (DS1) • 5000 2-d data points to simulate sensors deployed uniformly • Two real UCI data sets (Abalone and Letter)
Results on Synthetic data set • [Figure: results for the DS1 data set; only significance constraints are specified]
Results on Letter data set • [Figure: results for the Letter data set; both significance and variance constraints are specified]
Conclusion & Future work • A new Constraint-Driven Clustering (CDC) model • Need-driven • Focused on two cluster-level constraints • Proved NP-Hardness of the CDC problem • Proposed a new data structure (CD-Tree) • Developed a heuristic algorithm based on CD-Tree • Future Work • Allow constraints to be ranges instead of exact values • Design other types of constraints to capture different application needs • Generalize the heuristic algorithm to handle other constraints, such as minimum separation constraint [Davidson et al.’05]
Reference • [Ghosh’02] J. Ghosh and A. Strehl. Clustering and visualization of retail market baskets. In N. R. Pal and L. Jain, editors, Knowledge Discovery in Advanced Information Systems. Springer, 2002. • [Kleinberg’99] J. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. J. Data Mining and Knowledge Discovery, 1999. • [Bradley’00] P. Bradley, K. P. Bennett, and A. Demiriz. Constrained k-means clustering. Technical report, MSR-TR-2000-65, Microsoft Research, 2000. • [Wagstaff’00] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In ICML, 2000. • [Davidson’05] I. Davidson and S. S. Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In SDM, 2005. • [Samarati’98] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information (abstract). In PODS, 1998.
Reference • [Sweeney’02] L. Sweeney. k-anonymity: A model for protecting privacy. In IJUFKS, 2002. • [Jin’06] W. Jin, R. Ge, and W. Qian. On robust and effective k-anonymity in large databases. In PAKDD, 2006. • [Aggarwal’04] C. C. Aggarwal and P. S. Yu. A condensation approach to privacy preserving data mining. In EDBT, 2004. • [Tung’01] A. K. H. Tung, J. Han, R. T. Ng, and L. V. S. Lakshmanan. Constraint-based clustering in large databases. In ICDT, 2001. • [Banerjee’06] A. Banerjee and J. Ghosh. Scalable clustering algorithms with balancing constraints. Data Mining and Knowledge Discovery, 13(3), 2006.
Thanks! Poster: this evening (Tuesday), board #1
Split • Create a new leaf node • Move the point furthest from the mean of the old leaf node into the new leaf node • Calculate the new objective value • Does the objective value drop? • Yes: move the next furthest point and repeat • No: link the new node appropriately
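The split loop can be sketched in Python as follows. Several details are assumptions, since the slide does not state them: the objective is taken to be the combined SSE of the old and new leaf, a move that does not lower the objective is not committed, and at least one point always stays in the old leaf. The function names are hypothetical, and linking the new node into the tree is left out.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def _mean(points: List[Point]) -> List[float]:
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def _sq_dist(p: Point, q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _sse(points: List[Point]) -> float:
    """Sum of squared distances to the mean; 0 for an empty leaf."""
    if not points:
        return 0.0
    m = _mean(points)
    return sum(_sq_dist(p, m) for p in points)


def split_leaf(points: List[Point]) -> Tuple[List[Point], List[Point]]:
    """Split a leaf by repeatedly moving the point furthest from the old
    leaf's mean into a new leaf while the assumed objective keeps dropping."""
    old, new = list(points), []
    objective = _sse(old)
    while len(old) > 1:
        m = _mean(old)
        idx = max(range(len(old)), key=lambda i: _sq_dist(old[i], m))
        cand_old = old[:idx] + old[idx + 1:]
        cand_new = new + [old[idx]]
        cand_objective = _sse(cand_old) + _sse(cand_new)
        if cand_objective >= objective:
            break                       # objective did not drop: stop moving
        old, new, objective = cand_old, cand_new, cand_objective
    return old, new


if __name__ == "__main__":
    old_leaf, new_leaf = split_leaf([(0, 0), (1, 0), (10, 0), (11, 0)])
    print(old_leaf, new_leaf)
```

On two well-separated groups of points, as in the usage example, the loop stops once each group sits in its own leaf, because moving any further point would raise the combined SSE again.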
Runtime • $O(n^2 + n \cdot Sig^2)$ • The runtime of inserting one point is $O(n)$ • The height of a CD-Tree can be $O(n)$ • Total time for split is $O(Sig^2)$ • Total time for building a tree is $O(n^2 + n \cdot Sig^2)$
Outline • Introduction • Two classes of clustering methods • Motivation for constraint-driven clustering • Related Work • Constraint-Driven Clustering model • Theoretical Results • Heuristic Algorithm • Experimental Results • Conclusion • Future Work
Related Work • Actionable Clustering [Kleinberg’99] • Objective function measures the utility of a clustering in decision making • Cluster-level Constraints • Constrained k-means algorithm [Bradley’00] • Differs from our model: K must be specified • Instance-level Constraints • Must-link and cannot-link constraints [Wagstaff’00] • Feasibility issue with the instance-level constraints [Davidson’05] • Modeling a cluster-level constraint with instance-level constraints • Requires a large number of instance-level constraints • Specifying too many constraints is problematic
Related Work (Contd.) • K-Anonymity [Samarati’98][Sweeney’02] • Each record is indistinguishable from k-1 other records • On categorical data • The condensation approach extends K-Anonymity to numerical data [Aggarwal’04] • PPMicroCluster [Jin’06] • Minimum significance constraint and minimum radius constraint • Differs from our model: • Constrains the radius rather than the variance of a cluster • Does not analyze the complexity of the clustering model • Proposes a static algorithm