1. Multi-way Distributional Clustering via Pairwise Interactions
Ron Bekkerman, UMass
Ran El-Yaniv, Technion
Andrew McCallum, UMass
2. Contingency table clustering
Not necessarily documents!
Images/features, genes/samples, movies/actors…
First, a short introduction to distributional clustering. Say we are given a collection of documents. We can represent the collection as a contingency table of documents and their words, where each cell contains a count: the number of times a certain word appears in a certain document. Distributional clustering then groups together rows that are similar according to some similarity measure. The name comes from the fact that each document is represented as a distribution over words (an equivalent formulation of the contingency table). This clustering scheme can be used in many domains, such as machine vision, bioinformatics, collaborative filtering, etc.
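As a concrete illustration of such a table, here is a minimal sketch of my own (not from the talk; the toy corpus is hypothetical):

    # Build a document-word contingency table and the induced row
    # distributions (a hedged sketch, not the authors' code).
    from collections import Counter

    docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus
    vocab = sorted({w for d in docs for w in d.split()})

    # counts[i][j] = number of times vocab[j] appears in docs[i]
    counts = [[Counter(d.split())[w] for w in vocab] for d in docs]

    # Each row (document) as a distribution over words: p(w | d)
    row_dists = [[c / sum(row) for c in row] for row in counts]
    print(vocab)        # ['cat', 'dog', 'ran', 'sat', 'the']
    print(row_dists[0])

Grouping the rows of row_dists under any distributional similarity measure (e.g. KL divergence) is exactly the one-way scheme described above.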
3. Contingency table clustering
Similarly, we can cluster columns (words)
Note that the problem is essentially symmetric: words can be represented as distributions over the documents in which they appear, and then clustered in the same way. But an even better idea is to do both word clustering and document clustering, which is called two-way clustering.
4. Two-way clustering
Main motivation:
Can overcome statistical sparseness
Extensively studied:
All showing impressive improvements
Also called double clustering, bi-clustering, co-clustering, coupled clustering, bimodal clustering, etc. The issue is that the contingency table is usually very sparse (in the text case, only about 5% of the cells are typically occupied). So you cannot get much out of a representation of a document as a 50,000-dimensional vector in which only 50 entries are non-zero. However, if you represent a document as a distribution over word _clusters_, the representation becomes meaningful; see the sketch after this note. That is what we did for our JMLR paper a couple of years ago: in the text categorization domain, we represented each word as a distribution over document categories, clustered the words using the Information Bottleneck method, represented each document as a distribution over word clusters, applied an SVM, and achieved the best results on 20 Newsgroups. But that was not the symmetric case, in which words are distributed over documents and documents over words. The symmetric case has been widely explored as well: Slonim & Tishby proposed it in 2000 for text, Getz et al. proposed it in bioinformatics, El-Yaniv & Souroujon then came up with an iterative version of Slonim & Tishby's approach, and Dhillon et al. applied a k-means-style algorithm to two-way clustering and theoretically proved its convergence to a local minimum. All of these works showed a significant improvement over the one-way clustering scheme.
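To make the sparseness argument concrete, here is a small sketch (my illustration, not the JMLR paper's code; all names are hypothetical) of re-representing a document over word clusters by summing within-cluster counts:

    # Collapse a sparse document-word count vector into a dense
    # distribution over word clusters (hedged sketch).
    def doc_over_word_clusters(word_counts, word_to_cluster, n_clusters):
        """word_counts: {word: count}; word_to_cluster: {word: cluster id}."""
        cluster_counts = [0.0] * n_clusters
        for w, c in word_counts.items():
            cluster_counts[word_to_cluster[w]] += c
        total = sum(cluster_counts)
        return [c / total for c in cluster_counts]

    # Toy usage: a large word space collapsed to 3 cluster coordinates.
    w2c = {"cat": 0, "dog": 0, "sat": 1, "genes": 2}
    print(doc_over_word_clusters({"cat": 2, "sat": 1}, w2c, 3))  # ~[0.667, 0.333, 0.0]

Even a document with only 50 non-zero word counts gets a dense, meaningful representation this way.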
5. Multiple views
Various views of data can be observed
Example: email
What the two-way clustering approach does not take into account is that the data is often richer than just a stream of words; there are more than two dimensions. For example, consider an email collection. Together with the email messages and their words, you have email senders, who are naturally distributed both over the messages and over the words. You also have email receivers (the groups of senders and receivers probably overlap but are not identical). You have subject lines (when clustering, documents with similar subject lines should probably end up in the same cluster, though perhaps not if their bodies' vocabularies have nothing in common). If some of your email is written in HTML, you have HTML constructs that are probably similar across messages that belong in the same cluster, and some messages have attachments that can be taken into account as well.
6. Multiple views
Various views of data can be observed
Example: email
Statistical interaction of views
As you can see, there are many different dimensions, and all of them interact with each other.
7. Multiple views
Various views of data can be observed
Example: email
Statistical interaction of views
Not necessarily all interactions are relevant
But probably not all of the interactions should be exploited. For example, I do not see an intuitive correlation between subject lines and message receivers: perhaps messages with similar subject lines go to people who are connected to each other, but perhaps not. Unintuitive interactions can introduce noise, and it is always better to keep the model light, so it is worth omitting them if they are not necessary.
8. Multi-way clustering
Motivating question: can we extend two-way clustering to utilize multiple views?
Goal: construct N “clean” clusterings of N interdependent variables
It looks like it would be nice to generalize the simple two-way clustering problem to multi-way clustering, by which I mean the construction of N clustering systems over N highly correlated random variables. Note that the goal of multi-way clustering need not be just document clustering: if you are interested in word clusters, you can use that part of the system's output; if you are doing social network analysis, you would probably be interested in the clusters of people; and so on.
9. Our contributions
Objective function for fitting useful multi-way interaction model
Novel clustering algorithm to maximize the objective
Striking empirical results
We propose a novel model for multi-way distributional clustering and a new algorithm that efficiently maximizes our objective function. Importantly, our algorithm achieves strong results, 5-10% better than two previous state-of-the-art document clustering algorithms, on six real-world datasets.
10. Earlier attempts
Multivariate Information Bottleneck (mIB)
Very general approach for dealing with several variables
Objective: Multi-Information
Not feasible for practical applications
Some work has previously been done in this field.
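For reference, the standard definition of the multi-information (also known as total correlation) of N variables is:

    \[
      \mathcal{I}(X_1; \dots; X_N)
      = \mathrm{KL}\!\left( p(x_1, \dots, x_N) \,\middle\|\, \prod_{i=1}^{N} p(x_i) \right)
      = \sum_{x_1, \dots, x_N} p(x_1, \dots, x_N)
        \log \frac{p(x_1, \dots, x_N)}{\prod_{i=1}^{N} p(x_i)}
    \]

Estimating it requires the full N-dimensional joint table, which is what makes the approach impractical at scale.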
11. Our approach
Consider only pairwise interactions
Pairwise interaction graph
Defines interactions between N ≥ 2 variables
The example interaction graph shown here is from the multimedia information retrieval domain.
12. Our objective
Let $G = (V, E)$ be a pairwise interaction graph over variables $X_1, \dots, X_N$
Extending Dhillon et al.:
Objective: weighted sum of pairwise MI, $\max \sum_{(i,j) \in E} w_{ij}\, I(\tilde{X}_i; \tilde{X}_j)$
Subject to a fixed number of clusters $|\tilde{X}_i| = k_i$ for each variable
No multi-dimensional probability tables
Can be easily factorized
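A hedged sketch of evaluating this objective from data (my own illustration; the edge-list and co-occurrence formats are assumptions, not the paper's API):

    import math
    from collections import Counter

    def mutual_information(pairs):
        """I(A; B) estimated from a list of co-occurring (a, b) cluster labels."""
        n = len(pairs)
        joint = Counter(pairs)
        pa = Counter(a for a, _ in pairs)
        pb = Counter(b for _, b in pairs)
        return sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
                   for (a, b), c in joint.items())

    def objective(edges, weights, cooccurrences):
        """Weighted sum of pairwise MI over the interaction graph's edges.

        cooccurrences[(i, j)]: a list of (cluster of x_i, cluster of x_j)
        pairs, one per co-occurrence recorded in the contingency tables.
        """
        return sum(weights[e] * mutual_information(cooccurrences[e])
                   for e in edges)

Only two-dimensional tables (one per edge) are ever touched, which is the point of the pairwise restriction.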
13. Objective factorization
Consider a triangle: $E = \{(X_1, X_2), (X_2, X_3), (X_1, X_3)\}$
Objective in this case: $I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_2; \tilde{X}_3) + I(\tilde{X}_1; \tilde{X}_3)$
…is broken into 3 parts: when updating the clustering of $X_i$, only the terms involving $\tilde{X}_i$ matter, e.g. $I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_1; \tilde{X}_3)$ for $X_1$
14. Implementation
We have tried various schemes:
Top-down
Bottom-up
Flat (K-means, sequential IB)
Best results obtained with hybrid
Top-down for some variables
Bottom-up for other variables
Flat correction routine after each split/merge
Here is the implementation scheme.
15. Multi-way Distributional Clustering
Initialization
If $X_i$ is clustered bottom-up, put each $x_i \in X_i$ in a singleton cluster
If $X_i$ is clustered top-down, put all $x_i \in X_i$ in one common cluster
Main loop
If bottom-up, merge every two closest clusters
If top-down, split each cluster into two halves
Correction loop
Pull each $x_i$ out of its cluster
Put it into the cluster $\tilde{x}_i$ s.t. the objective is maximized
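A schematic sketch of this schedule (illustration only, not the authors' code; the merge step is simplified to pairwise merging, whereas real MDC chooses merges, splits, and corrections so as to maximize the pairwise-MI objective):

    def split_each_in_two(clusters):
        """Top-down step: split every cluster into two halves."""
        out = []
        for c in clusters:
            mid = max(1, len(c) // 2)
            out.extend([c[:mid], c[mid:]] if len(c) > 1 else [c])
        return out

    def merge_pairs(clusters):
        """Bottom-up step: merge clusters pairwise (stand-in for 'closest')."""
        return [sum(clusters[i:i + 2], []) for i in range(0, len(clusters), 2)]

    def mdc_schedule(elements, direction, n_iterations):
        """direction: 'top-down' or 'bottom-up' for this variable."""
        clusters = ([list(elements)] if direction == "top-down"
                    else [[x] for x in elements])
        for _ in range(n_iterations):
            clusters = (split_each_in_two(clusters) if direction == "top-down"
                        else merge_pairs(clusters))
            # The correction loop would go here: pull each element out of its
            # cluster and re-insert it where the objective is maximized.
        return clusters

    print(mdc_schedule(list("abcdefgh"), "bottom-up", 2))  # 8 -> 4 -> 2 clusters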
16. Example: 4-way MDC
17. Computational complexity
General case
At each iteration of the main loop:
Pass over all variables $X_i$
Pass over all elements $x_i \in X_i$
Pass over all clusters $\tilde{x}_i \in \tilde{X}_i$
If only one system is clustered bottom-up
2-way case
At each iteration, the number of clusters in the top-down system is doubled
While the number of clusters in the bottom-up system is halved
So their product, and hence the per-iteration cost, stays roughly constant. The complexity is appealing.
18. Experimental setup
2-way MDC
Documents and Words
3-way MDC
Documents, Words and Authors
4-way MDC
Documents, Words, Authors and documents' Titles
Documents: bottom-up, the rest: top-down
Note that we omit the interaction between authors and titles.
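Written as edge lists, these interaction graphs might look as follows (a hedged illustration; exactly which pairs are connected is my reading of the setup, except for the explicitly omitted Authors-Titles edge):

    # Hypothetical edge lists for the three experimental setups.
    graph_2way = [("Documents", "Words")]
    graph_3way = graph_2way + [("Documents", "Authors"),
                               ("Words", "Authors")]
    graph_4way = graph_3way + [("Documents", "Titles"),
                               ("Words", "Titles")]   # no ("Authors", "Titles")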
19. Evaluation methodology
Clustering evaluation
Is generally unintuitive
Is an entire ML research field
We use the “accuracy” measure
Following Slonim et al. and Dhillon et al.
Ground truth: the datasets' category (folder) labels
Our results: each cluster is labeled with its dominant category; accuracy is the fraction of documents whose cluster label matches their true label
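A sketch of this accuracy computation, under the common convention (an assumption on my part) that each cluster is labeled by its dominant ground-truth category:

    from collections import Counter

    def clustering_accuracy(clusters, true_label):
        """clusters: list of lists of documents; true_label: {doc: category}."""
        correct, total = 0, 0
        for cluster in clusters:
            votes = Counter(true_label[d] for d in cluster)
            correct += votes.most_common(1)[0][1]  # size of the dominant class
            total += len(cluster)
        return correct / total

    # Toy usage with hypothetical folder labels:
    labels = {"d1": "work", "d2": "work", "d3": "family", "d4": "family"}
    print(clustering_accuracy([["d1", "d2", "d3"], ["d4"]], labels))  # 0.75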
20. Datasets
Three CALO email datasets:
acheyer: 664 messages, 38 folders
mgervasio: 777 messages, 15 folders
mgondek: 297 messages, 14 folders
Two Enron email datasets:
kitchen-l: 4015 messages, 47 folders
sanders-r: 1188 messages, 30 folders
The 20 Newsgroups: 19997 messages
21. Results
22. Improvement over the baseline
23. More results
24. Even more results
25. Even more results
26. Discussion
Improvement over Slonim et al.
Which is a 1-way clustering algorithm
Shows that multi-modality helps
Improvement over Dhillon et al.
Which is a 2-way clustering algorithm
Shows that hierarchical setup helps
MDC is an efficient method
Which allows exploring complex models
3-way, 4-way, etc.
Moving to 3-way and 4-way models may also improve the results.
27. Conclusion
Unsupervised model without generative assumptions
Exploit multiple views of your data
Efficient algorithm
Impressive empirical results
28. Future work
Inference of an optimal schedule
Inference of an "optimal" number of clusters?
Extension to a semi-supervised setup
For our future work, we have both short-term and long-term goals.