Document Clustering Carl Staelin
Motivation
It is hard to rapidly understand a big bucket of documents
• Humans look for patterns, and are good at pattern matching
• “Random” collections of documents don’t have a recognizable structure
• Clustering documents into recognizable groups makes it easier to see patterns
• Can rapidly eliminate irrelevant clusters
Basic Idea
• Choose a document similarity measure (see the sketch below)
• Choose a cluster cost or similarity criterion
• Group like documents into clusters with minimal cluster cost
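A common choice of document similarity measure is cosine similarity between bag-of-words vectors. The sketch below is a minimal, illustrative version; the whitespace tokenizer, the raw term-frequency weighting, and the toy documents are assumptions, not something the slides specify.

```python
# Minimal sketch of one possible document similarity measure:
# cosine similarity between term-frequency vectors.
# The naive tokenizer and toy documents are illustrative assumptions.
import math
from collections import Counter

def tf_vector(text):
    """Very naive bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = tf_vector("document clustering groups similar documents")
doc2 = tf_vector("clustering finds groups of similar documents")
print(cosine_similarity(doc1, doc2))  # value in [0, 1]; higher means more similar
```

In practice TF-IDF weights are usually preferred over raw term frequencies, but the cosine form is the same.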
Cluster Cost Criteria
• Sum-of-squared-error: $\mathrm{Cost} = \sum_{i} \lVert x_i - \bar{x} \rVert^2$, where $\bar{x}$ is the cluster mean
• Average squared distance: $\mathrm{Cost} = \frac{1}{n^2} \sum_{i} \sum_{j} \lVert x_i - x_j \rVert^2$
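The two criteria translate directly into code. This is a minimal sketch assuming each document has already been reduced to a dense numeric vector (e.g., a TF-IDF vector); that representation is an assumption, not fixed by the slides.

```python
# Minimal sketch of the two cluster cost criteria, for one cluster of points.
# Documents are assumed to be dense numeric vectors.
import numpy as np

def sum_squared_error(points):
    """Cost = sum_i ||x_i - x_bar||^2, where x_bar is the cluster mean."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

def avg_squared_distance(points):
    """Cost = (1/n^2) * sum_i sum_j ||x_i - x_j||^2."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]   # all pairwise differences
    return float((diffs ** 2).sum() / (n * n))

cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(sum_squared_error(cluster), avg_squared_distance(cluster))
```

The two are closely related: for a single cluster, the average squared distance equals (2/n) times the sum-of-squared-error.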
Cluster Similarity Measure
Measures the similarity of two clusters $C_i$, $C_j$:
• $d_{\min}(C_i, C_j) = \min_{x_i \in C_i,\ x_j \in C_j} \lVert x_i - x_j \rVert$
• $d_{\max}(C_i, C_j) = \max_{x_i \in C_i,\ x_j \in C_j} \lVert x_i - x_j \rVert$
• $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x_i \in C_i} \sum_{x_j \in C_j} \lVert x_i - x_j \rVert$
• $d_{\mathrm{mean}}(C_i, C_j) = \left\lVert \frac{1}{n_i} \sum_{x_i \in C_i} x_i - \frac{1}{n_j} \sum_{x_j \in C_j} x_j \right\rVert$
• …
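A sketch of the four inter-cluster distances above, again assuming clusters are given as arrays of numeric vectors (the representation is an assumption):

```python
# Minimal sketch of the inter-cluster distance measures d_min, d_max,
# d_avg, and d_mean for two clusters of numeric vectors.
import numpy as np

def _pairwise(ci, cj):
    ci, cj = np.asarray(ci, float), np.asarray(cj, float)
    return np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=-1)

def d_min(ci, cj):   # distance between the closest cross-cluster pair
    return float(_pairwise(ci, cj).min())

def d_max(ci, cj):   # distance between the farthest cross-cluster pair
    return float(_pairwise(ci, cj).max())

def d_avg(ci, cj):   # average over all cross-cluster pairs
    return float(_pairwise(ci, cj).mean())

def d_mean(ci, cj):  # distance between the two cluster means
    return float(np.linalg.norm(np.mean(ci, axis=0) - np.mean(cj, axis=0)))

a = [[0.0, 0.0], [0.0, 1.0]]
b = [[2.0, 0.0], [3.0, 1.0]]
print(d_min(a, b), d_max(a, b), d_avg(a, b), d_mean(a, b))
```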
Iterative Clustering
• Assign points to initial k clusters
  • Often this is done by random assignment
• Until done:
  • Select a candidate point x in cluster c
  • Find the “best” cluster c' for x
  • If c ≠ c', then move x to c'
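A minimal sketch of the loop above. The slides leave the “best cluster” test open; here it is assumed to mean the nearest cluster centroid under squared Euclidean distance (a k-means-style choice), and the initial assignment is random.

```python
# Minimal sketch of iterative clustering with random initial assignment.
# "Best" cluster is assumed to mean nearest centroid (k-means style).
import numpy as np

def iterative_clustering(points, k, max_iterations=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(points))           # random initial clusters
    for _ in range(max_iterations):
        # Recompute centroids; guard against a cluster going empty.
        centroids = np.array([points[labels == c].mean(axis=0)
                              if np.any(labels == c)
                              else points[rng.integers(len(points))]
                              for c in range(k)])
        moved = False
        for i, x in enumerate(points):
            best = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
            if best != labels[i]:                            # if c != c', move x to c'
                labels[i] = best
                moved = True
        if not moved:                                        # done: nothing moved
            break
    return labels

print(iterative_clustering([[0, 0], [0, 1], [5, 5], [5, 6]], k=2))
```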
Iterative Clustering
• The user must pre-select the number of clusters
  • Often the “correct” number is not known in advance!
• The quality of the outcome usually depends on the quality of the initial assignment
  • Possibly use some other algorithm to create a good initial assignment?
Hierarchical Agglomerative Clustering
• Create N single-document clusters
• For i in 1..N-1:
  • Merge the two clusters with greatest similarity
Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering gives a hierarchy of clusters
• This makes it easier to explore the set of possible k-cluster values to choose the best number of clusters
(figure: example cuts of the hierarchy at 3, 4, and 5 clusters)
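Below is a minimal sketch of the merge loop from the slides above, using single-link distance (d_min) as the similarity; that linkage choice is an assumption, since the slides leave it open. Stopping the merging at different values of target_k corresponds to cutting the hierarchy at different levels, which is how the hierarchy helps when the best number of clusters is not known in advance.

```python
# Minimal sketch of hierarchical agglomerative clustering (single-link).
# Start from N singleton clusters and repeatedly merge the closest pair
# until target_k clusters remain.
import numpy as np

def d_min(ci, cj):
    ci, cj = np.asarray(ci, float), np.asarray(cj, float)
    return float(np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=-1).min())

def hac(points, target_k=1):
    clusters = [[p] for p in points]                  # N single-document clusters
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):                # find the most similar pair
            for j in range(i + 1, len(clusters)):
                d = d_min(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]       # merge the pair
        del clusters[j]
    return clusters

for k in (3, 2):                                      # explore different cluster counts
    print(k, hac([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]], target_k=k))
```

Re-running the loop per k repeats work; a production version would record the full merge history once and cut it at any level, but the sketch keeps the idea visible.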
High density variations
• Intuitively “correct” clustering
• HAC-generated clusters
Hybrid
Combine HAC and iterative clustering:
• Assign points to initial clusters using HAC
• Until done:
  • Select a candidate point x in cluster c
  • Find the “best” cluster c' for x
  • If c ≠ c', then move x to c'
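A minimal sketch of the hybrid scheme: a single-link HAC pass provides the initial assignment, and an iterative pass then moves each point to its nearest centroid until nothing changes. Both the linkage and the nearest-centroid test are assumed choices; the slides leave them open.

```python
# Minimal sketch of hybrid clustering: HAC initialization + iterative refinement.
import numpy as np

def hac_labels(points, k):
    """Single-link HAC down to k clusters; returns a cluster label per point."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]      # clusters of point indices
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]                    # merge the closest pair
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

def hybrid_clustering(points, k, max_iterations=100):
    points = np.asarray(points, dtype=float)
    labels = hac_labels(points, k)                    # HAC initial assignment
    for _ in range(max_iterations):
        centroids = np.array([points[labels == c].mean(axis=0)
                              if np.any(labels == c) else points[0]  # arbitrary guard
                              for c in range(k)])
        new_labels = np.argmin(((points[:, None, :] - centroids[None, :, :]) ** 2)
                               .sum(axis=-1), axis=1)
        if np.array_equal(new_labels, labels):        # done: no point moved
            break
        labels = new_labels
    return labels

print(hybrid_clustering([[0, 0], [0, 1], [5, 5], [5, 6]], k=2))
```

The HAC seed addresses the initialization sensitivity noted on the earlier Iterative Clustering slide.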
Other Algorithms
• Support Vector Clustering
• Information Bottleneck
• …