A matrix density based algorithm to hierarchically co-cluster documents and words • Advisor: Dr. Hsu • Graduate: Keng-Wei Chang • Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru
Outline • Motivation • Objective • Introduction • Background • Rowset Partitioning and Submatrix Agglomeration (RPSA) • Experimental Results • Conclusions • Personal Opinion
Motivation • With the explosion of unstructured information, it has become increasingly important to organize that information in a comprehensible and navigable manner.
Objective • A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows. • This paper proposes an algorithm that hierarchically clusters documents to address this problem.
Introduction • 1990s: 100 thousand pages • 2002: 2 billion pages • It has become increasingly important to organize the information • Manual organization is accurate, but not always feasible • Tools are needed to automatically arrange documents into labeled hierarchies • The paper proposes RPSA, a two-step partitional-agglomerative algorithm
Background • Vector Model for Documents • Evaluation of Clustering Quality • Evaluation of Hierarchical Clustering
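Vector Model for Documents • Unitized TF-IDF • We have d documents • Document i is represented by a vector of TF-IDF weights, one entry per word, scaled to unit length • The weight of word j in document i is built from the number of occurrences of word j in document i • Term Frequency (TF) • Inverse Document Frequency (IDF)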
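The slide's formulas did not survive, so the following is only a minimal sketch of a unitized TF-IDF representation under common textbook assumptions: the raw weight of word j in document i is taken as (occurrence count) × log(d / number of documents containing j), and each document vector is then divided by its Euclidean norm. The function name `unitized_tfidf` and the exact weighting are illustrative assumptions, not necessarily the paper's definition.

```python
import math
from collections import Counter

def unitized_tfidf(docs):
    """Return one {word: weight} dict per document, scaled to unit length.

    docs: list of token lists.
    Assumption: weight = count * log(d / document_frequency), then L2-normalized.
    """
    d = len(docs)
    counts = [Counter(doc) for doc in docs]

    # Document frequency: in how many documents each word appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    vectors = []
    for c in counts:
        raw = {w: n * math.log(d / df[w]) for w, n in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # Guard against an all-zero vector (every word occurs in every document).
        vectors.append({w: (v / norm if norm > 0 else 0.0) for w, v in raw.items()})
    return vectors

vectors = unitized_tfidf([["a", "b", "b"], ["a", "c"]])
```

In this toy input the word "a" occurs in both documents, so its IDF (and weight) is zero, and each document's remaining weight is carried entirely by its distinctive word.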
Evaluation of Clustering Quality • 1. Purity: • 2. Entropy:
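The purity and entropy formulas are missing from the slide. As a hedged illustration, the sketch below uses the usual definitions: the purity of a cluster is the fraction of its documents belonging to its majority class, the entropy of a cluster is the Shannon entropy of its class distribution, and both are averaged over clusters weighted by cluster size. The function names and the base-2 logarithm are assumptions; the paper may normalize differently.

```python
import math
from collections import Counter

def purity(clusters):
    """clusters: list of clusters, each a list of true class labels."""
    total = sum(len(c) for c in clusters)
    # Each cluster contributes the size of its majority class.
    return sum(max(Counter(c).values()) for c in clusters if c) / total

def entropy(clusters):
    """Size-weighted average of the per-cluster class entropy (in bits)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        size = len(c)
        h = -sum((n / size) * math.log2(n / size) for n in Counter(c).values())
        result += (size / total) * h
    return result

example = [["x", "x", "y"], ["y", "y"]]
p, e = purity(example), entropy(example)
```

Higher purity and lower entropy both indicate clusters that agree more closely with the known classes.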
Rowset Partitioning and Submatrix Agglomeration (RPSA) • Two-step partitional-agglomerative algorithm • 1st step: The Partitioning Step • 2nd step: The Agglomerative Step
The Partitioning Step • Define the density of submatrices • Notation: a row r, a column c, a set R of rows, a set C of columns
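The density formula itself is not preserved on the slide. A minimal sketch, assuming the density of the submatrix selected by a row set R and a column set C is the average of its entries (the density of a single row r or column c is then the special case where R or C has one element). The function name `density` and the dense list-of-lists matrix are illustrative assumptions.

```python
def density(matrix, rows, cols):
    """Average entry of the submatrix given by the row and column index sets.

    matrix: list of rows (document-by-word weights).
    Assumption: density = sum of selected entries / (|R| * |C|).
    """
    rows, cols = list(rows), list(cols)
    if not rows or not cols:
        return 0.0
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))

m = [[1, 0, 2],
     [0, 3, 0]]
whole = density(m, [0, 1], [0, 1, 2])   # density of the full matrix
block = density(m, [0], [0, 2])         # density of a denser sub-block
```

A sub-block with higher density than the whole matrix is the kind of document-word group a co-clustering method looks for.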
The Partitioning Step • Generating a Leaf Cluster
The Partitioning Step • Choice of Leader Documents • Length of a document: the sum of the entries of the TF-IDF vector representing that document • Documents with relatively large lengths were observed to be better leader documents for the algorithm above
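As a rough illustration of the leader heuristic, the sketch below treats the length of a document as the sum of its TF-IDF vector entries and picks, among the documents not yet assigned to a cluster, the one with the largest length. Choosing the strict maximum is a simplifying assumption (the slide only says that relatively long documents work better), and the names `document_length` and `pick_leader` are hypothetical.

```python
def document_length(row):
    """Length of a document: sum of the entries of its TF-IDF row."""
    return sum(row)

def pick_leader(matrix, unassigned):
    """Return the index of the unassigned document with the largest length.

    Assumption: the longest remaining document is used as the leader.
    Indices are sorted first so that ties are broken deterministically.
    """
    return max(sorted(unassigned), key=lambda i: document_length(matrix[i]))

tfidf = [[0.1, 0.1],
         [0.5, 0.5],
         [0.3, 0.0]]
leader = pick_leader(tfidf, {0, 1, 2})
```

Once a leader's cluster is formed, the same call can be repeated on the remaining unassigned indices.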
The Partitioning Step • The Complete Partitioning Algorithm
The Partitioning Step • Complexity Analysis • The time complexity is O(mz) • The space complexity is O(z)
The Agglomerative Step • Reduce the number of clusters • The similarity measure between two clusters for merging • Flat Clustering • Hierarchical Clustering
The Agglomerative Step • Complexity Analysis • The time complexity is O( ) • The space complexity is O( )
Experimental results-Flat Clustering • Data Sets
Experimental results-Flat Clustering • Results
Experimental results-Hierarchical Clustering • Data Sets
Experimental results-Hierarchical Clustering • Data Sets
Conclusions • RPSA is comparable with or better than the best k-means run • Its performance does not degrade on small data sets • Its purity on hierarchical clustering is acceptable