310 likes | 525 Views
Cluster Description and Related Problems. DDM Lab Seminar 2005 Spring Byron Gao. Road Map. Motivations Cluster Description Problems Related Problems in Machine Learning Related Problems in Theory Related Problems in Computational Geometry Conclusions and Past / Future Work.
E N D
Cluster Description and Related Problems DDM Lab Seminar 2005 Spring Byron Gao
Road Map • Motivations • Cluster Description Problems • Related Problems in Machine Learning • Related Problems in Theory • Related Problems in Computational Geometry • Conclusions and Past / Future Work
1. Motivations Clustering • Group objects into clusters such that objects within a cluster are similar and objects from different clusters are dissimilar • one of the major data mining tasks • K-means, hierarchical, density-based…
1. Motivations Cluster Descriptions • Literature lacks systematic study of cluster descriptions • Most clustering algorithms just give membership assignments • Hidden knowledge in data need to be represented for inference • Integration of database and data mining • Purposes of cluster descriptions • Summarize data to gain initial knowledge • Compress data supporting further investigation • One way of describing a cluster: DNF formula • A set of isothetic hyper-rectangles • Interpretable • Used as search condition to retrieve objects
1. Motivations Summarize Data • Gain initial and basic understanding of the clusters • Capture shape and location in the multi-dimensional space • Requirement: interpretability • Simple and clear structure of description format • e.g., DNF, B-DNF… • B-DNF: Set difference between the bounding box B and a DNF formula describing objects in B but not in cluster C • Short in length • Another requirement: accuracy • Interpreting the correct target • May trade accuracy for shorter length Trade-off
1. Motivations Compress Data (1) • Data compression has important applications in data transmission and data storage • Transmission: distribute clustering results to branches • Clustering performed interactively by expert at head office • Storage: support further investigation • Partial results or results (in some sense, always partial results) • Description process ≈ encoding (compression) • Retrieval process ≈ decoding (decompression) • Description as search condition in SELECT query statement
1. Motivations Compress Data (2) Both summarization & compression • Requirement: compression ratio & retrieval efficiency • Short in length • comp. ratio = |C| / (|DESC| * 2) • Typically < 0.01 • Another requirement: accuracy • Retrieve desired objects • May trade accuracy for shorter length: lossy compression / faster retrieval Prefer Shorter Descriptoins May trade accuracy for shorter length
1. Motivations Compress Data (3) • Support Further investigation • Exploratory queries without retrieval of original objects • e.g.,… • Incremental clustering • Exploratory queries requiring retrieval of original objects • Information not associated with the description • Information not revealed by the description • Interactive and iterative mining environment • Inductive database framework
2. Cluster Description Problems SDL Problem • Shortest Description Length (SDL) problem Given a cluster C, its bounding box B, and a description format, obtain a logical expression DESC in the given format with minimum length such that for any object o, (oCoDESC) (oB – CoDESC) • Lossless • NP-hard • variant of the Minimum Set Cover problem
2. Cluster Description Problems MDQ Problem • Maximum Description Quality (MDQ) problem Given a cluster C, its bounding box B, an integer l, a description format, and a quality measure, obtain a logical expression DESC in the given format with length at most l such that the quality measure is maximized • Lossy • NP-hard: • reducible to the Set Cover problem
2. Cluster Description Problems MDQ Problem: Quality Measure • F-measureF = 2PR / (P + R) • Harmonic mean of P and R • R = |Cdescribed| / |C| • P = |Cdescribed| / |Bdescribed| • where Cdescribed (Bdescribed) is the set of objects in C (B) that is described by the description • RP=1 (Recall at fixed Precision of 1) • PR=1 (Precision at fixed Recall of 1)
3. Related Problems in Machine Learning Concept Learning • Concept learning: find a classifier for a concept • Given a labeled training sequence, generate a hypothesis that is a good approximation of the target concept and can be used as a predictor for other query points • Typical machine learning problem • Most extensively studied problem in computational learning theory • Representation of target concept assumed known • DNF formulas, Boolean circuits, geometric objects, pattern languages, deterministic finite automata, etc…
3. Related Problems in Machine Learning Geometric Concept Learning • Target is a geometric object • Half spaces. Separated by linear hyper planes • perceptron learning • Intersections of half spaces. Convex polyhedra • particularly, axis-parallel box • More accurate yet simple enough to offer intuitive solutions • Unions of axis-parallel boxes • e.g., 2 boxes. Less studied • Popular topic in computational learning theory
3. Related Problems in Machine Learning Learning Models • PAC learning model: Probably Approximately Correct [15] • Given a reasonable number of labeled instances, generate with high probability a hypothesis that is a good approximation of the target concept within a reasonable amount of time • Exact Learning Model [2] • The learner is allowed to pose queries in order to exactly identify the target concept within a reasonable amount of time • Common queries: answered by an oracle • Membership query: feed instance to oracle, return its classification • Equivalence query: present hypothesis to oracle, return “correct” or a counterexample
3. Related Problems in Machine Learning Similarities • Learn unions of axis-parallel boxes • DNF formula • Learner is fed with positive and negative examples which need to be classified accurately
3. Related Problems in Machine Learning Dissimilarities (1) : objective • Geometric concept learning • exact or approximate identification of geometric object • related to geometric object construction in computational geometry • Ultimate goal is the classifying accuracy over the instance space • labeled instances are a small part of the instance space • Optimization: least disagreement problem • fixed number of boxes, usually 1 or 2 • “+” and “–” examples are treated equally • Cluster description problems • Entire instance space is given • To describe “+” examples instead of a classify “+” and “–” examples • Optimization problem: SDL or MDQ problems. • For quality measure, “+” and “–” examples are not treated equally • Examples could be labeled “1”, “5”, “9”... for clustering description
3. Related Problems in Machine Learning Dissimilarities (2) : research focus • Geometric concept learning • Learnability • Under certain learning models • Learning from queries (membership, equivalence) rather than from random examples to significantly improve performance • Low-dimensional • Cluster description problems • Heuristics • Large data set, possibly high-dimensional • Other concerns: compression tool, retrieval efficiency etc.
4. Related Problems in Theory Complexity and Approximation Algorithm Theory Literatures Two Traditional Problems • Minimum Set Cover problem [10] • Given a collection S of subsets of a finite ground set X, find S’ S with minimum cardinality such that U Si S’Si = X • Approximable within ln|S| [8]: greedy set cover • Iteratively picks up the subset that covers the maximum number of uncovered elements • Maximum Coverage problem (max k-cover): a variant problem • Maximize the number of covered elements in X using at most k subsets from S. • Approximable within e / (e – 1) by the greedy algorithm [8] • The ratios are optimal [6]
4. Related Problems in Theory Red Blue Set Cover Problem • Given a finite set of “red” elements R, a finite set of “blue” elements B and a family S 2RUB, find a subfamily S’ S which covers all blue elements and minimum number of red elements • Raised recently (00’) [4] • Related to Group Steiner tree, minimum monotone satisfying assignment and minimum color path problems • Reducible to the set cover problem (at least as hard as) [4]
4. Related Problems in Theory SDL Problem is NP-hard • a variant of the Minimum Set Cover problem • Ground set X = set of objects in C, the cluster being described • Collection S of rectangles (subsets): • Each Sj Sis a rectangle minimally covering some objects in C and none objects not in C • S 2C • Find S’ S with minimum cardinality such that U Si S’Si = C
4. Related Problems in Theory MDQ Problem is NP-hard • RP=1: a variant of the Maximum Coverage problem • In the same fashion as we showed in last slide • Given k, find S’ S with |S’| at most k such that the number of covered elements in C is maximized • PR=1: a variant of the Red Blue Set Cover problem • F-measure: reducible to either of the two
4. Related Problems in Theory Similarities and Dissimilarities • Set Cover problem and its variants arise in a variety of settings • Useful to argue NP-hardness of SDL and MDQ problems • Not useful from algorithm design point of view because of the essential differences: in SDL and MDQ problems, the collection of subsets S is not explicitly given • e.g., the greedy set cover algorithm cannot be applied
5. Related Problems in Computational Geometry Maximum Box Problem • Given two finite sets of points X + and X - in d, find an axis-parallel box B such that B ∩ X - = and the total number of points from X + covered is maximized • Raised recently (02’) [5] • NP-hard when d is part of the input • Studied by [12] [14] for d = 2 finding exact solutions • Corresponds to MDQ problem with k = 1 and RP=1 • Techniques for low-dimensional and small datasets
5. Related Problems in Computational Geometry Rectilinear Polygon Covering Problem • Find a collection of axis-parallel rectangles with minimum cardinality whose union is exactly to a given rectilinear polygon • Covering rectilinear polygon with axis-parallel rectangles with holes [7] (79’) • Different from partition, overlapping is allowed • 2-d • NP-complete [13] • No possibility of polynomial time approximation scheme [3] • A special case of the general set covering problem • performance guarantee O(log n) • Recent result O(√ log n) [9]
5. Related Problems in Computational Geometry Similarities and Dissimilarities • If extended to higher-dimensional space, exactly the same formulation as in Greedy Growth [1] • If further allowing “don’t care” cells, similar formulation as in Algorithm BP [11] • If not restricted to grid data, same as SDL problem • [1] and [11] are related studies in DB literature • Logic minimization is also related to the grid case • Most researches are on Boolean variables
6. Conclusions and Past / Future Work Conclusions • Cluster (clustering) description is of fundamental importance to KDD, but not systematically studied • Formulated problems have significant differences to related problems in machine learning, theory, computational geometry and DB literatures.
6. Conclusions and Past / Future Work Past Work • Formulated and provided heuristic algorithms for the SDL and MDQ problem • for both DNF and B-DNF formats • work for both vector and grid data • Studied efficient retrieval issue • cost model • efficiency estimation methodology • optimal ordering of description terms
6. Conclusions and Past / Future Work Future Work • Optimization of algorithms: focused on introducing concepts and ideas • Real data experiments • Experimental validation of the retrieval efficiency issue: DBMS • Description alphabet: always rectangles? • if not, getting farther away from some related problems • Clustering description: gaining quality and efficiency • even more distinguishable from related problems • Incremental description: description process could take long • Interaction with clustering algorithms: online description?
References: • R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD 1998. • D. Angluin. Queries and concept learning. Machine Learning, 2(4): 319-342, 1988. • P. Berman and B. Dasgupta. Approximating rectilinear polygon cover problems. Algorithmica, 17: 331-356, 1997. • R.D. Carr, S. Doddi, G. Konjevod and M. Marathe. On the red-blue set cover problem. SODA 2000. • J. Eckstain, P. Hammer, Y. Liu, M. Nediak, and B. Simeone. The maximum box problem and its application to data analysis. Computational Optimization and Applications, 23(3): 285-298, 2002. • U. Feige. A threshold of ln n for approximating set cover. J.ACM 45(4): 634-652. 1998. • M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York, 1979. • D.S. Hochbaum. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. Approximation algorithms for NP-hard problems. PWS, New York, 1997. • V.S.A. Kumar and H. Ramesh. Covering rectilinear polygons with axis-parallel rectangles. SIAM J. Comput. 32(6): 1509-1541, 2003. • D.S. Johnson. Approximation algorithms for combinatorial problems. J.Comput.System Sci., 9: 256-278, 1974. • L.V.S. Lakshmanan, R.T. Ng, C.X. Wang, X. Zhou and T.J. Johnson. The generalized MDL approach for summarization. VLDB 1999. • Y. Liu and M. Nediak. Planar case of the maximum box and related problems. CCCG 2003. • W.J. Masek. Some NP-Complete Set Covering Problems, manuscript, MIT, Cambridge, MA, 1979. • M. Segal. Planar maximum box problem. Journal of Mathematical Modelling and Algorithms, 3: 31-38, 2004. • L.G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, 1984.
Questions? Thank you !