
Document Clustering


Presentation Transcript


  1. Document Clustering Carl Staelin

  2. Motivation
  It is hard to rapidly understand a big bucket of documents
  • Humans look for patterns, and are good at pattern matching
  • “Random” collections of documents don’t have a recognizable structure
  • Clustering documents into recognizable groups makes it easier to see patterns
  • Can rapidly eliminate irrelevant clusters

  3. Basic Idea
  • Choose a document similarity measure
  • Choose a cluster cost criterion

  4. Basic Idea
  • Choose a document similarity measure
  • Choose a cluster cost or similarity criterion
  • Group like documents into clusters with minimal cluster cost
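The slides leave the choice of document similarity measure open. A minimal sketch of one common concrete choice, TF-IDF vectors compared with cosine similarity, assuming scikit-learn (a library the slides do not mention):

```python
# Sketch: one common document similarity measure for clustering --
# TF-IDF vectors compared with cosine similarity. This is an
# illustrative choice, not necessarily the one used in the lecture.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "clustering groups similar documents together",
    "information retrieval finds relevant documents",
    "digital libraries store large document collections",
]

vectors = TfidfVectorizer().fit_transform(docs)   # one TF-IDF vector per document
similarity = cosine_similarity(vectors)           # n x n pairwise similarity matrix
print(similarity.round(2))
```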

  5. Cluster Cost Criteria
  • Sum-of-squared-error: $\mathrm{Cost} = \sum_i \|x_i - \bar{x}\|^2$, where $\bar{x}$ is the cluster centroid
  • Average squared distance: $\mathrm{Cost} = \frac{1}{n^2} \sum_i \sum_j \|x_i - x_j\|^2$
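A small sketch of the two cost criteria, assuming documents in one cluster are already represented as NumPy vectors (a representation the slides do not specify):

```python
import numpy as np

def sum_of_squared_error(X):
    """Cost = sum_i ||x_i - xbar||^2, where xbar is the cluster centroid."""
    centroid = X.mean(axis=0)
    return float(np.sum(np.linalg.norm(X - centroid, axis=1) ** 2))

def average_squared_distance(X):
    """Cost = (1 / n^2) * sum_i sum_j ||x_i - x_j||^2 over all point pairs."""
    diffs = X[:, None, :] - X[None, :, :]            # all pairwise differences
    return float(np.sum(np.linalg.norm(diffs, axis=2) ** 2)) / len(X) ** 2

# Toy cluster of three 2-d "documents".
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(sum_of_squared_error(X), average_squared_distance(X))
```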

  6. Cluster Similarity Measure
  Measures the similarity of two clusters $C_i$, $C_j$ (with $n_i$ and $n_j$ points):
  • $d_{\min}(C_i, C_j) = \min_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
  • $d_{\max}(C_i, C_j) = \max_{x_i \in C_i,\, x_j \in C_j} \|x_i - x_j\|$
  • $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \|x_i - x_j\|$
  • $d_{\mathrm{mean}}(C_i, C_j) = \left\| \frac{1}{n_i} \sum_{x_i \in C_i} x_i - \frac{1}{n_j} \sum_{x_j \in C_j} x_j \right\|$
  • …
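A sketch of the four inter-cluster measures for clusters given as arrays of point vectors; the helper names are illustrative, not from the slides:

```python
import numpy as np
from scipy.spatial.distance import cdist

def d_min(Ci, Cj):
    return cdist(Ci, Cj).min()        # single-link: closest pair of points

def d_max(Ci, Cj):
    return cdist(Ci, Cj).max()        # complete-link: farthest pair of points

def d_avg(Ci, Cj):
    return cdist(Ci, Cj).mean()       # average of all n_i * n_j pairwise distances

def d_mean(Ci, Cj):
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # distance between centroids

Ci = np.array([[0.0, 0.0], [0.0, 1.0]])
Cj = np.array([[3.0, 0.0], [4.0, 0.0]])
print(d_min(Ci, Cj), d_max(Ci, Cj), d_avg(Ci, Cj), d_mean(Ci, Cj))
```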

  7. Iterative Clustering
  • Assign points to initial k clusters
    • Often this is done by random assignment
  • Until done:
    • Select a candidate point x, currently in cluster c
    • Find the “best” cluster c′ for x
    • If c ≠ c′, then move x to c′
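A minimal sketch of this loop, using distance to the cluster centroid as the “best cluster” test; the slide leaves that criterion open, so this is only one reasonable reading:

```python
import numpy as np

def iterative_clustering(X, k, max_sweeps=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))          # random initial assignment
    for _ in range(max_sweeps):
        if len(np.unique(labels)) < k:                # keep the sketch simple:
            break                                     # stop if any cluster emptied
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        moved = False
        for i, x in enumerate(X):                     # candidate point x in cluster c
            best = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
            if best != labels[i]:                     # if c != c', move x to c'
                labels[i] = best
                moved = True
        if not moved:                                 # done: no point wants to move
            break
    return labels

X = np.random.default_rng(1).normal(size=(30, 4))
print(iterative_clustering(X, k=3))
```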

  8. Iterative Clustering
  • The user must pre-select the number of clusters
    • Often the “correct” number is not known in advance!
  • The quality of the outcome usually depends on the quality of the initial assignment
    • Possibly use some other algorithm to create a good initial assignment?

  9. Hierarchical Agglomerative Clustering
  • Create N single-document clusters
  • For i in 1..N-1:
    • Merge the two clusters with greatest similarity
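A direct, deliberately inefficient sketch of this loop, using average pairwise distance as the cluster similarity (the slide does not fix a particular linkage) and stopping at a chosen number of clusters rather than merging all the way down to one:

```python
import numpy as np
from scipy.spatial.distance import cdist

def hac(X, target_k):
    clusters = [[i] for i in range(len(X))]           # N single-document clusters
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):                # find the two most similar clusters
            for b in range(a + 1, len(clusters)):
                d = cdist(X[clusters[a]], X[clusters[b]]).mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])               # merge them
        del clusters[b]
    return clusters

X = np.random.default_rng(2).normal(size=(12, 3))
print(hac(X, target_k=3))
```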



  12. Hierarchical Agglomerative Clustering
  • Hierarchical agglomerative clustering gives a hierarchy of clusters
  • This makes it easier to explore the set of possible k-cluster values to choose the best number of clusters
  [figure: cluster hierarchy cut at 3, 4, and 5 clusters]
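A sketch of how the hierarchy can be built once and then cut at several candidate k values, here with SciPy's agglomerative clustering routines (a library choice not made in the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 5))    # stand-in for document vectors
Z = linkage(X, method="average")                     # full merge hierarchy (dendrogram)
for k in (3, 4, 5):
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the hierarchy into k clusters
    print(k, labels)
```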

  13. High density variations
  • Intuitively “correct” clustering [figure]

  14. High density variations
  • Intuitively “correct” clustering [figure]
  • HAC-generated clusters [figure]

  15. Hybrid
  Combine HAC and iterative clustering:
  • Assign points to initial clusters using HAC
  • Until done:
    • Select a candidate point x, currently in cluster c
    • Find the “best” cluster c′ for x
    • If c ≠ c′, then move x to c′
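A sketch of the hybrid scheme under the same assumptions as the earlier sketches: SciPy's agglomerative clustering supplies the initial assignment, and a centroid-based reassignment loop refines it.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hybrid_clustering(X, k, max_sweeps=100):
    # HAC supplies the initial assignment instead of a random one.
    Z = linkage(X, method="average")
    labels = fcluster(Z, t=k, criterion="maxclust") - 1   # 0-based cluster ids
    for _ in range(max_sweeps):
        if len(np.unique(labels)) < k:       # keep the sketch simple: stop if a cluster empties
            break
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Move every point to its nearest centroid; stop when nothing moves.
        new_labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

X = np.random.default_rng(3).normal(size=(25, 4))
print(hybrid_clustering(X, k=4))
```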

  16. Other Algorithms
  • Support Vector Clustering
  • Information Bottleneck
  • …

  17. High density variations
