Information Theoretic Clustering, Co-clustering and Matrix Approximations
Inderjit S. Dhillon, University of Texas, Austin
Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha
Data Mining Seminar Series, Mar 26, 2004
Clustering: Unsupervised Learning • Grouping together of “similar” objects • Hard Clustering -- Each object belongs to a single cluster • Soft Clustering -- Each object is probabilistically assigned to clusters
Contingency Tables • Let X and Y be discrete random variables • X and Y take values in {1, 2, …, m} and {1, 2, …, n} • p(X, Y) denotes the joint probability distribution; if not known, it is often estimated from co-occurrence data • Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc. • Key Obstacles in Clustering Contingency Tables • High Dimensionality, Sparsity, Noise • Need for robust and scalable algorithms
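As a small illustration of estimating p(X, Y) from co-occurrence data, one can simply normalize a word-by-document count matrix; the counts below are made up, and this and all later sketches use Python/NumPy:

```python
import numpy as np

# Hypothetical word-by-document co-occurrence counts (rows = words, columns = documents).
counts = np.array([[5., 0., 2.],
                   [1., 3., 0.],
                   [0., 4., 6.]])

# Estimate the joint distribution p(X, Y) by normalizing the counts,
# then read off the marginals p(X) and p(Y).
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1)   # p(X): row marginal
p_y = p_xy.sum(axis=0)   # p(Y): column marginal
```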
Co-Clustering • Simultaneously • Cluster rows of p(X, Y) into k disjoint groups • Cluster columns of p(X, Y) into l disjoint groups • Key goal is to exploit the “duality” between row and column clustering to overcome sparsity and noise
Co-clustering Example for Text Data • Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix • [Figure: word-by-document matrix with rows grouped into word clusters and columns grouped into document clusters]
Co-clustering and Information Theory • View “co-occurrence” matrix as a joint probability distribution over row & column random variables • We seek a “hard-clustering” of both rows and columns such that “information” in the compressed matrix is maximized.
Information Theory Concepts • Entropy of a random variable X with probability distribution p: H(X) = -\sum_x p(x) \log p(x) • The Kullback-Leibler (KL) Divergence or “Relative Entropy” between two probability distributions p and q: KL(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} • Mutual Information between random variables X and Y: I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
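These three quantities translate directly into code; a minimal sketch (function names are mine, base-2 logarithms assumed):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), with 0 log 0 taken as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(p_xy):
    """I(X; Y) = KL( p(x, y) || p(x) p(y) ) for a joint distribution stored as a matrix."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())
```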
“Optimal” Co-Clustering • Seek random variables \hat{X} and \hat{Y} taking values in {1, 2, …, k} and {1, 2, …, l} such that the mutual information I(\hat{X}; \hat{Y}) is maximized, where \hat{X} = R(X) is a function of X alone and \hat{Y} = C(Y) is a function of Y alone
Related Work • Distributional Clustering • Pereira, Tishby & Lee (1993), Baker & McCallum (1998) • Information Bottleneck • Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002) • Probabilistic Latent Semantic Indexing • Hofmann (1999), Hofmann & Puzicha (1999) • Non-Negative Matrix Approximation • Lee & Seung (2000)
Information-Theoretic Co-clustering • Lemma: “Loss in mutual information” equals I(X; Y) - I(\hat{X}; \hat{Y}) = KL\big(p(X, Y) \,\|\, q(X, Y)\big), where q(x, y) = p(\hat{x}, \hat{y})\, p(x \mid \hat{x})\, p(y \mid \hat{y}) for x \in \hat{x}, y \in \hat{y} • p is the input distribution • q is an approximation to p • Can be shown that q(x, y) is a maximum entropy approximation to p subject to the cluster constraints
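A minimal sketch of how such an approximation q could be assembled from p and hard row/column assignments; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def coclustering_approximation(p_xy, row_assign, col_assign, k, l):
    """Build q(x, y) = p(xhat, yhat) p(x | xhat) p(y | yhat) from a joint
    distribution p and hard row/column cluster assignments."""
    row_assign = np.asarray(row_assign)
    col_assign = np.asarray(col_assign)
    m, n = p_xy.shape
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    # Indicator matrices: R[a, x] = 1 iff row x belongs to row cluster a.
    R = np.zeros((k, m)); R[row_assign, np.arange(m)] = 1.0
    C = np.zeros((l, n)); C[col_assign, np.arange(n)] = 1.0

    p_hat = R @ p_xy @ C.T            # p(xhat, yhat): co-cluster masses, k x l
    p_xhat, p_yhat = R @ p_x, C @ p_y

    # p(x | xhat) and p(y | yhat); clusters with zero mass contribute zeros.
    px_in = np.where(p_xhat[row_assign] > 0, p_x / np.maximum(p_xhat[row_assign], 1e-12), 0.0)
    py_in = np.where(p_yhat[col_assign] > 0, p_y / np.maximum(p_yhat[col_assign], 1e-12), 0.0)

    return p_hat[np.ix_(row_assign, col_assign)] * px_in[:, None] * py_in[None, :]
```

By construction this q preserves the row and column marginals of p as well as the co-cluster sums, matching the properties listed later in the talk.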
Decomposition Lemma • Question: How to minimize KL\big(p(X, Y) \,\|\, q(X, Y)\big)? • The following lemma reveals the answer: KL\big(p(X, Y) \,\|\, q(X, Y)\big) = \sum_{\hat{x}} \sum_{x \in \hat{x}} p(x)\, KL\big(p(Y \mid x) \,\|\, q(Y \mid \hat{x})\big) = \sum_{\hat{y}} \sum_{y \in \hat{y}} p(y)\, KL\big(p(X \mid y) \,\|\, q(X \mid \hat{y})\big) • Note that q(Y \mid \hat{x}) may be thought of as the “prototype” of row cluster \hat{x}. Similarly, q(X \mid \hat{y}) is the “prototype” of column cluster \hat{y}.
Co-Clustering Algorithm • [Step 1] Set t = 0. Start with an initial co-clustering (R^{(0)}, C^{(0)}); compute q^{(0)}. • [Step 2] For every row x, assign it to the row cluster \hat{x} that minimizes KL\big(p(Y \mid x) \,\|\, q^{(t)}(Y \mid \hat{x})\big) • [Step 3] We have R^{(t+1)}. Compute q^{(t+1)}. • [Step 4] For every column y, assign it to the column cluster \hat{y} that minimizes KL\big(p(X \mid y) \,\|\, q^{(t+1)}(X \mid \hat{y})\big) • [Step 5] We have C^{(t+2)}. Compute q^{(t+2)}. Iterate Steps 2–5.
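A compact sketch of Steps 1-5 as alternating minimization; this is my paraphrase of the procedure, and the names, random initialization and fixed iteration count are illustrative:

```python
import numpy as np

EPS = 1e-12

def kl_to_prototypes(cond, protos):
    """D[i, a] = KL(cond[i] || protos[a]) for row-stochastic cond (N x d) and protos (K x d)."""
    log_ratio = np.log(cond[:, None, :] + EPS) - np.log(protos[None, :, :] + EPS)
    return (cond[:, None, :] * log_ratio).sum(axis=2)

def cocluster_stats(p_xy, rows, cols, k, l):
    """Co-cluster masses p(xhat, yhat) and cluster marginals under the current assignments."""
    m, n = p_xy.shape
    R = np.zeros((k, m)); R[rows, np.arange(m)] = 1.0
    C = np.zeros((l, n)); C[cols, np.arange(n)] = 1.0
    p_hat = R @ p_xy @ C.T
    return p_hat, p_hat.sum(axis=1), p_hat.sum(axis=0)

def itcc(p_xy, k, l, n_iter=20, seed=0):
    """Alternate row and column reassignment, each against the current q prototypes."""
    rng = np.random.default_rng(seed)
    m, n = p_xy.shape
    rows, cols = rng.integers(k, size=m), rng.integers(l, size=n)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    for _ in range(n_iter):
        # Row step: prototypes q(y | xhat) = p(yhat | xhat) p(y | yhat).
        p_hat, p_xhat, p_yhat = cocluster_stats(p_xy, rows, cols, k, l)
        row_protos = (p_hat[:, cols] / (p_xhat[:, None] + EPS)) * (p_y / (p_yhat[cols] + EPS))
        rows = kl_to_prototypes(p_xy / (p_x[:, None] + EPS), row_protos).argmin(axis=1)

        # Column step: prototypes q(x | yhat) = p(xhat | yhat) p(x | xhat).
        p_hat, p_xhat, p_yhat = cocluster_stats(p_xy, rows, cols, k, l)
        col_protos = (p_hat[rows, :].T / (p_yhat[:, None] + EPS)) * (p_x / (p_xhat[rows] + EPS))
        cols = kl_to_prototypes((p_xy / (p_y[None, :] + EPS)).T, col_protos).argmin(axis=1)

    return rows, cols
```

Each reassignment can only decrease the objective KL(p || q), which is the monotonicity property stated on the next slide.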
Properties of Co-clustering Algorithm • Main Theorem: Co-clustering “monotonically” decreases loss in mutual information • Co-clustering converges to a local minimum • Can be generalized to multi-dimensional contingency tables • q can be viewed as a “low complexity” non-negative matrix approximation • q preserves marginals of p, and co-cluster statistics • Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality • Computationally economical
Applications -- Text Classification • Assigning class labels to text documents • Training and Testing Phases • [Figure: a document collection grouped into classes Class-1, …, Class-m forms the training data; a classifier learns from the training data and assigns a class label to each new document]
Feature Clustering (dimensionality reduction) • Feature Selection: select the “best” words and throw away the rest, using frequency-based pruning or information-criterion-based pruning; each document is then a bag-of-words vector over the k selected words (Word#1, …, Word#k) • Feature Clustering: do not throw away words; cluster words instead and use the word clusters as features; each document is then a vector over the k word clusters (Cluster#1, …, Cluster#k)
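A minimal sketch of the feature-clustering idea: collapse a bag-of-words representation onto word clusters by summing counts within each cluster; the arrays and the cluster assignment below are made up:

```python
import numpy as np

def cluster_features(doc_term_counts, word_to_cluster, n_clusters):
    """Collapse a document-by-word count matrix into document-by-word-cluster
    features by summing the counts of all words assigned to the same cluster."""
    n_docs, n_words = doc_term_counts.shape
    features = np.zeros((n_docs, n_clusters))
    for w in range(n_words):
        features[:, word_to_cluster[w]] += doc_term_counts[:, w]
    return features

# Hypothetical example: 2 documents, 4 words, 2 word clusters.
counts = np.array([[3, 0, 1, 2],
                   [0, 4, 2, 0]])
word_to_cluster = [0, 1, 1, 0]          # illustrative assignment
print(cluster_features(counts, word_to_cluster, 2))
```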
Experiments • Data sets • 20 Newsgroups data • 20 classes, 20000 documents • Classic3 data set • 3 classes (cisi, med and cran), 3893 documents • Dmoz Science HTML data • 49 leaves in the hierarchy • 5000 documents with 14538 words • Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt • Implementation Details • Bow – for indexing, co-clustering, clustering and classifying
Results (20Ng) • Classification Accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split • Divisive clustering beats feature selection algorithms by a large margin • The effect is more significant when fewer features are used
Results (Dmoz) • Classification Accuracy on Dmoz data with a 1/3-2/3 test-train split • Divisive Clustering is better when fewer features are used • Note the contrasting behavior of Naïve Bayes and SVMs
Results (Dmoz) • Naïve Bayes on Dmoz data with only 2% training data • Note that Divisive Clustering achieves a higher maximum than Information Gain (IG), a significant 13% increase • Divisive Clustering performs better than IG when less training data is available
Hierarchical Classification • [Figure: example class hierarchy rooted at Science, with children Math (Number Theory, Logic), Physics (Quantum Theory, Mechanics) and Social Science (Economics, Archeology)] • Flat classifier builds a single classifier over the leaf classes of the above hierarchy • Hierarchical Classifier builds a classifier at each internal node of the hierarchy
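A minimal sketch of how a hierarchical classifier routes a document from the root to a leaf; the Node structure and the predict interface are assumptions for illustration, not the toolkit actually used in the experiments:

```python
class Node:
    """One node of the class hierarchy; internal nodes carry a classifier over their children."""
    def __init__(self, name, children=None, classifier=None):
        self.name = name
        self.children = children or {}   # child name -> Node
        self.classifier = classifier     # None at leaves

def hierarchical_classify(doc, node):
    """Walk from the root to a leaf, letting each internal node's classifier
    pick which child branch the document belongs to."""
    while node.children:
        child_name = node.classifier.predict(doc)   # assumed classifier interface
        node = node.children[child_name]
    return node.name                                 # leaf class label
```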
Results (Dmoz) • Hierarchical Classifier (Naïve Bayes at each node) • Hierarchical Classifier: 64.54% accuracy with just 10 features (the flat classifier achieves 64.04% accuracy with 1000 features) • Hierarchical Classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers
Anecdotal Evidence • Top few words from the word clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data
Co-Clustering Results (CLASSIC3) • Confusion matrices on the Classic3 data set:

Co-Clustering (0.9835)
  992     4     8
   40  1452     7
    1     4  1387

1-D Clustering (0.821)
  847   142    44
   41   954   405
  275    86  1099
Results – Binary (subset of 20Ng data) • Confusion matrices; the values in parentheses are (Co-clustering, 1-D Clustering):

Binary (0.852, 0.67)
  Co-clustering        1-D Clustering
  207   31             179   94
   43  219              71  156

Binary_subject (0.946, 0.648)
  Co-clustering        1-D Clustering
  234   11             178  104
   16  239              72  146
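For reference, one common way to score a hard clustering against known class labels is to assign each cluster to its majority class and count the correctly placed documents; the slides do not state exactly which figure of merit the parenthetical values use, so the function below is illustrative:

```python
import numpy as np

def micro_precision(confusion):
    """Assign each cluster (row) to its majority class (column) and report the
    fraction of documents that end up correctly labeled."""
    confusion = np.asarray(confusion)
    return confusion.max(axis=1).sum() / confusion.sum()

# Example with the Binary_subject co-clustering confusion matrix above.
print(micro_precision([[234, 11], [16, 239]]))   # -> 0.946
```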
Conclusions • Information-theoretic approach to clustering, co-clustering and matrix approximation • Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality • The theoretical approach has the potential to extend to other problems: • Multi-dimensional co-clustering • MDL to choose the number of co-clusters • Generalized co-clustering via Bregman divergences
More Information • Email: inderjit@cs.utexas.edu • Papers are available at: http://www.cs.utexas.edu/users/inderjit • “Divisive Information-Theoretic Feature Clustering for Text Classification”, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD 2002) • “Information-Theoretic Co-clustering”, Dhillon, Mallela & Modha, KDD 2003 • “Clustering with Bregman Divergences”, Banerjee, Merugu, Dhillon & Ghosh, SIAM International Conference on Data Mining, April 2004 • “A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation”, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004