1. Multi-way Distributional Clustering via Pairwise Interactions
Ron Bekkerman, UMass
Ran El-Yaniv, Technion
Andrew McCallum, UMass
2. Contingency table clustering
Not necessarily documents!
Images/features, genes/samples, movies/actors…
First, a short introduction to distributional clustering. Say we are given a collection of documents. We can represent the collection as a contingency table of documents and their words, where each cell contains a count: the number of times a certain word appears in a certain document. Distributional clustering then groups together rows that are similar according to some similarity measure. The name comes from the fact that each document is represented as a distribution over words (an equivalent formulation of the contingency table). This clustering scheme can be used in many domains, such as machine vision, bioinformatics, collaborative filtering, etc.
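As a concrete illustration of such a table, here is a minimal sketch of my own (not from the talk; the toy corpus is hypothetical):

    # Build a document-word contingency table and the induced row
    # distributions (a hedged sketch, not the authors' code).
    from collections import Counter

    docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus
    vocab = sorted({w for d in docs for w in d.split()})

    # counts[i][j] = number of times vocab[j] appears in docs[i]
    counts = [[Counter(d.split())[w] for w in vocab] for d in docs]

    # Each row (document) as a distribution over words: p(w | d)
    row_dists = [[c / sum(row) for c in row] for row in counts]
    print(vocab)        # ['cat', 'dog', 'ran', 'sat', 'the']
    print(row_dists[0])

Grouping the rows of row_dists under any distributional similarity measure (e.g. KL divergence) is exactly the one-way scheme described above.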
3. Contingency table clustering
Similarly, we can cluster columns (words)
Note that the problem is essentially symmetric: words can be represented as distributions over the documents in which they appear, and then clustered in the same way. But an even better idea is to do both word clustering and document clustering, which is called two-way clustering.
4. Two-way clustering
Main motivation:
Can overcome statistical sparseness
Extensively studied:
All showing impressive improvements
Also called double clustering, bi-clustering, co-clustering, coupled clustering, bimodal clustering, etc. The issue is that the contingency table is usually very sparse (in the text case, only about 5% of the cells are typically occupied). So you cannot get much out of a representation of a document as a 50,000-dimensional vector in which only 50 entries are non-zero. However, if you represent a document as a distribution over word _clusters_, the representation becomes meaningful; see the sketch after this note. That is what we did for our JMLR paper a couple of years ago: in the text categorization domain, we represented each word as a distribution over document categories, clustered the words using the Information Bottleneck method, represented each document as a distribution over word clusters, applied an SVM, and achieved the best results on 20 Newsgroups. But that was not the symmetric case, in which words are distributed over documents and documents over words. The symmetric case has been widely explored as well: Slonim & Tishby proposed it in 2000 for text, Getz et al. proposed it in bioinformatics, El-Yaniv & Souroujon then came up with an iterative version of Slonim & Tishby's approach, and Dhillon et al. applied a k-means-style algorithm to two-way clustering and theoretically proved its convergence to a local minimum. All of these works showed a significant improvement over the one-way clustering scheme.
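To make the sparseness argument concrete, here is a small sketch (my illustration, not the JMLR paper's code; all names are hypothetical) of re-representing a document over word clusters by summing within-cluster counts:

    # Collapse a sparse document-word count vector into a dense
    # distribution over word clusters (hedged sketch).
    def doc_over_word_clusters(word_counts, word_to_cluster, n_clusters):
        """word_counts: {word: count}; word_to_cluster: {word: cluster id}."""
        cluster_counts = [0.0] * n_clusters
        for w, c in word_counts.items():
            cluster_counts[word_to_cluster[w]] += c
        total = sum(cluster_counts)
        return [c / total for c in cluster_counts]

    # Toy usage: a large word space collapsed to 3 cluster coordinates.
    w2c = {"cat": 0, "dog": 0, "sat": 1, "genes": 2}
    print(doc_over_word_clusters({"cat": 2, "sat": 1}, w2c, 3))  # ~[0.667, 0.333, 0.0]

Even a document with only 50 non-zero word counts gets a dense, meaningful representation this way.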
5. Multiple views
Various views of data can be observed
Example: email
What the two-way clustering approach does not take into account is that the data is often richer than just a stream of words; there are more than two dimensions. For example, consider an email collection. Together with the email messages and their words, you have email senders, who are naturally distributed both over the messages and over the words. You also have email receivers (the groups of senders and receivers probably overlap but are not identical). You have subject lines (when clustering, documents with similar subject lines should probably end up in the same cluster, though perhaps not if their bodies' vocabularies have nothing in common). If some of your email is written in HTML, you have HTML constructs that are probably similar across messages that belong in the same cluster, and some messages have attachments that can be taken into account as well.
6. Multiple views
Various views of data can be observed
Example: email
Statistical interaction of views
As you can see, there are many different dimensions, and all of them interact with each other.
7. Multiple views
Various views of data can be observed
Example: email
Statistical interaction of views
Not necessarily all interactions are relevant
But probably not all of the interactions should be exploited. For example, I do not see an intuitive correlation between subject lines and message receivers: perhaps messages with similar subject lines go to people who are connected to each other, but perhaps not. Unintuitive interactions can introduce noise, and it is always better to keep the model light, so it is worth omitting them if they are not necessary.
8. Multi-way clustering
Motivating question: can we extend two-way clustering to utilize multiple views?
Goal: construct N “clean” clusterings of N interdependent variables
It looks like it would be nice to generalize the simple two-way clustering problem to multi-way clustering, by which I mean the construction of N clustering systems over N highly correlated random variables. Note that the goal of multi-way clustering need not be just document clustering: if you are interested in word clusters, you can use that part of the system's output; if you are doing social network analysis, you would probably be interested in the clusters of people; and so on.
9. Our contributions
Objective function for fitting useful multi-way interaction model
Novel clustering algorithm to maximize the objective
Striking empirical results
We propose a novel model for multi-way distributional clustering and a new algorithm that efficiently maximizes our objective function. Importantly, our algorithm achieves strong results, 5-10% better than two previous state-of-the-art document clustering algorithms, on six real-world datasets.
10. Earlier attempts
Multivariate Information Bottleneck (mIB)
Very general approach for dealing with several variables
Objective: Multi-Information
Not feasible for practical applications
Some work has previously been done in this field.
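For reference, the standard definition of the multi-information (also known as total correlation) of N variables is:

    \[
      \mathcal{I}(X_1; \dots; X_N)
      = \mathrm{KL}\!\left( p(x_1, \dots, x_N) \,\middle\|\, \prod_{i=1}^{N} p(x_i) \right)
      = \sum_{x_1, \dots, x_N} p(x_1, \dots, x_N)
        \log \frac{p(x_1, \dots, x_N)}{\prod_{i=1}^{N} p(x_i)}
    \]

Estimating it requires the full N-dimensional joint table, which is what makes the approach impractical at scale.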
11. Our approach
Consider only pairwise interactions
Pairwise interaction graph
Defines interactions between N ≥ 2 variables
The example interaction graph shown here is from the multimedia information retrieval domain.
12. Our objective
Let $G = (V, E)$ be a pairwise interaction graph over variables $X_1, \dots, X_N$
Extending Dhillon et al.:
Objective: weighted sum of pairwise MI, $\max \sum_{(i,j) \in E} w_{ij}\, I(\tilde{X}_i; \tilde{X}_j)$
Subject to a fixed number of clusters $|\tilde{X}_i| = k_i$ for each variable
No multi-dimensional probability tables
Can be easily factorized
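A hedged sketch of evaluating this objective from data (my own illustration; the edge-list and co-occurrence formats are assumptions, not the paper's API):

    import math
    from collections import Counter

    def mutual_information(pairs):
        """I(A; B) estimated from a list of co-occurring (a, b) cluster labels."""
        n = len(pairs)
        joint = Counter(pairs)
        pa = Counter(a for a, _ in pairs)
        pb = Counter(b for _, b in pairs)
        return sum((c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
                   for (a, b), c in joint.items())

    def objective(edges, weights, cooccurrences):
        """Weighted sum of pairwise MI over the interaction graph's edges.

        cooccurrences[(i, j)]: a list of (cluster of x_i, cluster of x_j)
        pairs, one per co-occurrence recorded in the contingency tables.
        """
        return sum(weights[e] * mutual_information(cooccurrences[e])
                   for e in edges)

Only two-dimensional tables (one per edge) are ever touched, which is the point of the pairwise restriction.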
13. Objective factorization
Consider a triangle: $E = \{(X_1, X_2), (X_2, X_3), (X_1, X_3)\}$
Objective in this case: $I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_2; \tilde{X}_3) + I(\tilde{X}_1; \tilde{X}_3)$
…is broken into 3 parts: when updating the clustering of $X_i$, only the terms involving $\tilde{X}_i$ matter, e.g. $I(\tilde{X}_1; \tilde{X}_2) + I(\tilde{X}_1; \tilde{X}_3)$ for $X_1$
14. Implementation
We have tried various schemes:
Top-down
Bottom-up
Flat (K-means, sequential IB)
Best results obtained with hybrid
Top-down for some variables
Bottom-up for other variables
Flat correction routine after each split/merge
Here is the implementation scheme.
15. Multi-way Distributional Clustering
Initialization
If $X_i$ is clustered bottom-up, put each $x_i \in X_i$ in a singleton cluster
If $X_i$ is clustered top-down, put all $x_i \in X_i$ in one common cluster
Main loop
If bottom-up, merge every two closest clusters
If top-down, split each cluster into two halves
Correction loop
Pull each $x_i$ out of its cluster
Put it into the cluster $\tilde{x}_i$ s.t. the objective is maximized
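A schematic sketch of this schedule (illustration only, not the authors' code; the merge step is simplified to pairwise merging, whereas real MDC chooses merges, splits, and corrections so as to maximize the pairwise-MI objective):

    def split_each_in_two(clusters):
        """Top-down step: split every cluster into two halves."""
        out = []
        for c in clusters:
            mid = max(1, len(c) // 2)
            out.extend([c[:mid], c[mid:]] if len(c) > 1 else [c])
        return out

    def merge_pairs(clusters):
        """Bottom-up step: merge clusters pairwise (stand-in for 'closest')."""
        return [sum(clusters[i:i + 2], []) for i in range(0, len(clusters), 2)]

    def mdc_schedule(elements, direction, n_iterations):
        """direction: 'top-down' or 'bottom-up' for this variable."""
        clusters = ([list(elements)] if direction == "top-down"
                    else [[x] for x in elements])
        for _ in range(n_iterations):
            clusters = (split_each_in_two(clusters) if direction == "top-down"
                        else merge_pairs(clusters))
            # The correction loop would go here: pull each element out of its
            # cluster and re-insert it where the objective is maximized.
        return clusters

    print(mdc_schedule(list("abcdefgh"), "bottom-up", 2))  # 8 -> 4 -> 2 clusters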
16. Example: 4-way MDC
17. Computational complexity
General case
At each iteration of the main loop:
Pass over all variables $X_i$
Pass over all elements $x_i \in X_i$
Pass over all clusters $\tilde{x}_i \in \tilde{X}_i$
If only one system is clustered bottom-up
2-way case
At each iteration, the number of clusters in the top-down system is doubled
While the number of clusters in the bottom-up system is halved
So their product, and hence the per-iteration cost, stays roughly constant. The complexity is appealing.
18. Experimental setup
2-way MDC
Documents and Words
3-way MDC
Documents, Words and Authors
4-way MDC
Documents, Words, Authors and documents' Titles
Documents: bottom-up, the rest: top-down
Note that we omit the interaction between authors and titles.
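Written as edge lists, these interaction graphs might look as follows (a hedged illustration; exactly which pairs are connected is my reading of the setup, except for the explicitly omitted Authors-Titles edge):

    # Hypothetical edge lists for the three experimental setups.
    graph_2way = [("Documents", "Words")]
    graph_3way = graph_2way + [("Documents", "Authors"),
                               ("Words", "Authors")]
    graph_4way = graph_3way + [("Documents", "Titles"),
                               ("Words", "Titles")]   # no ("Authors", "Titles")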
19. Evaluation methodology
Clustering evaluation
Is generally unintuitive
Is an entire ML research field
We use the “accuracy” measure
Following Slonim et al. and Dhillon et al.
Ground truth: the datasets' category (folder) labels
Our results: each cluster is labeled with its dominant category; accuracy is the fraction of documents whose cluster label matches their true label
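A sketch of this accuracy computation, under the common convention (an assumption on my part) that each cluster is labeled by its dominant ground-truth category:

    from collections import Counter

    def clustering_accuracy(clusters, true_label):
        """clusters: list of lists of documents; true_label: {doc: category}."""
        correct, total = 0, 0
        for cluster in clusters:
            votes = Counter(true_label[d] for d in cluster)
            correct += votes.most_common(1)[0][1]  # size of the dominant class
            total += len(cluster)
        return correct / total

    # Toy usage with hypothetical folder labels:
    labels = {"d1": "work", "d2": "work", "d3": "family", "d4": "family"}
    print(clustering_accuracy([["d1", "d2", "d3"], ["d4"]], labels))  # 0.75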
20. Datasets
Three CALO email datasets:
acheyer: 664 messages, 38 folders
mgervasio: 777 messages, 15 folders
mgondek: 297 messages, 14 folders
Two Enron email datasets:
kitchen-l: 4015 messages, 47 folders
sanders-r: 1188 messages, 30 folders
The 20 Newsgroups: 19997 messages
21. Results
22. Improvement over the baseline
23. More results
24. Even more results
25. Even more results
26. Discussion
Improvement over Slonim et al.
Which is a 1-way clustering algorithm
Shows that multi-modality helps
Improvement over Dhillon et al.
Which is a 2-way clustering algorithm
Shows that hierarchical setup helps
MDC is an efficient method
Which allows exploring complex models
3-way, 4-way, etc.
Moving to 3-way and 4-way models may also improve the results.
27. Conclusion
Unsupervised model without generative assumptions
Exploit multiple views of your data
Efficient algorithm
Impressive empirical results
28. Future work
Inference of an optimal schedule
Inference of an "optimal" number of clusters?
Extension to a semi-supervised setup
For our future work, we have both short-term and long-term goals.