A matrix density based algorithm to hierarchically co-cluster documents and words

A matrix density based algorithm to hierarchically co-cluster documents and words • Advisor: Dr. Hsu • Graduate: Keng-Wei Chang • Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru

Outline • Motivation • Objective • Introduction • Background • Rowset Partitioning and Submatrix Agglomeration (RPSA) • Experimental results • Conclusions • Personal Opinion

Motivation • With this explosion of unstructured information, it has become increasingly important to organize the information in a comprehensible and navigable manner.

Objective • A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows. • This paper proposes an algorithm that hierarchically clusters documents to address this need.

Introduction • 1990s -> 100 thousand pages • 2002 -> 2 billion pages • It has become increasingly important to organize the information • Manual organization is accurate, but not always feasible • Tools are needed to automatically arrange documents into labeled hierarchies • The paper proposes RPSA -> a two-step partitional-agglomerative algorithm

Background • Vector Model for Documents • Evaluation of Clustering Quality • Evaluation of Hierarchical Clustering

Vector Model for Documents • Unitized TF-IDF • We have d documents • Document i is represented by a vector built from the number of occurrences of word j in document i • Term Frequency (TF) • Inverse Document Frequency (IDF)
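The formulas on this slide were not preserved, so the following is only a minimal sketch of one common way to build unit-length TF-IDF vectors. The weighting `count * log(d / df)` followed by L2 normalization is an assumption, not necessarily the exact form the paper uses, and the function name `unitized_tfidf` is hypothetical.

```python
import math

def unitized_tfidf(counts):
    """counts[i][j] = number of occurrences of word j in document i."""
    d = len(counts)                # number of documents
    n_words = len(counts[0])
    # Document frequency: in how many documents does word j occur?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    vectors = []
    for row in counts:
        # Assumed weighting: raw term count times log(d / df).
        vec = [row[j] * math.log(d / df[j]) if df[j] else 0.0
               for j in range(n_words)]
        # "Unitized": scale each document vector to Euclidean length 1.
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm for v in vec] if norm else vec)
    return vectors

# Three toy documents over a three-word vocabulary.
counts = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]
vectors = unitized_tfidf(counts)
```

Each row of `vectors` is one document; a word that a document does not contain keeps weight 0.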

Evaluation of Clustering Quality • 1. Purity • 2. Entropy
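The slide's formulas for the two measures are missing. As a sketch, the widely used definitions can be computed as below: purity is the fraction of documents that belong to the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy. Whether the paper normalizes entropy differently is not recoverable from the slide, so treat these definitions and the function names as assumptions.

```python
import math
from collections import Counter

def purity(clusters):
    """clusters: one list of true class labels per cluster."""
    total = sum(len(c) for c in clusters)
    # Count the majority class in each non-empty cluster.
    return sum(max(Counter(c).values()) for c in clusters if c) / total

def entropy(clusters):
    """Size-weighted average of per-cluster class entropy (base 2)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        h = -sum((n / len(c)) * math.log2(n / len(c))
                 for n in Counter(c).values())
        result += (len(c) / total) * h
    return result

# Two clusters: one mixed, one pure.
example = [["a", "a", "b"], ["b", "b", "b"]]
example_purity = purity(example)
example_entropy = entropy(example)
```

Higher purity and lower entropy both indicate clusters that agree more closely with the true classes.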

Evaluation of Hierarchical Clustering

Rowset Partitioning and Submatrix Agglomeration (RPSA) • Two-step partitional-agglomerative algorithm • 1st step: The Partitioning Step • 2nd step: The Agglomerative Step

The Partitioning Step • Define the density of submatrices • a row r, a column c • a set R of rows, a set C of columns
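The density formulas themselves did not survive on this slide. A minimal sketch, assuming the density of a submatrix is the average of the entries selected by a row set R and a column set C (this definition is an assumption, and the helper names are hypothetical):

```python
def density(matrix, rows, cols):
    """Average of matrix[r][c] over r in rows and c in cols."""
    rows, cols = list(rows), list(cols)
    if not rows or not cols:
        return 0.0
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))

def row_density(matrix, r):
    """Density of a single row r taken over all columns."""
    return density(matrix, [r], range(len(matrix[r])))

# Toy document-by-word matrix with two documents and three words.
M = [[1, 0, 3],
     [0, 2, 0]]
sub = density(M, [0], [0, 2])   # entries 1 and 3
whole = density(M, [0, 1], [0, 1, 2])
```

A single row or column is the special case where R or C has one element.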

The Partitioning Step • Generating a Leaf Cluster

The Partitioning Step • Choice of Leader Documents • The length of a document is the sum of the TF-IDF vector representing that document • Documents with relatively large lengths were observed to be better leader documents for the algorithm above
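The leader choice described on this slide can be sketched as follows. The slide only says that documents with relatively large lengths make better leaders; always picking the longest unassigned document is a simplifying assumption, and `pick_leader` is a hypothetical helper name.

```python
def document_length(vec):
    """Length of a document: the sum of its TF-IDF vector entries."""
    return sum(vec)

def pick_leader(tfidf, unassigned):
    """Return the index of the longest document not yet in a cluster."""
    return max(unassigned, key=lambda i: document_length(tfidf[i]))

# Toy TF-IDF rows for three documents.
tfidf = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3]]
first = pick_leader(tfidf, {0, 1, 2})
second = pick_leader(tfidf, {0, 2})   # after document 1 is assigned
```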

The Partitioning Step • The Complete Partitioning Algorithm

The Partitioning Step • Complexity Analysis • The time complexity is O(mz) • The space complexity is O(z)

The Agglomerative Step • Reduce the number of clusters • The similarity measure between two clusters for merging • Flat Clustering • Hierarchical Clustering

The Agglomerative Step • Complexity Analysis • The time complexity is O( ) • The space complexity is O( )

Experimental results-Flat Clustering • Data Sets

Experimental results-Flat Clustering • Results

Experimental results-Flat Clustering

Experimental results-Hierarchical Clustering • Data Sets

Experimental results-Hierarchical Clustering • Results

Conclusions • It is comparable with or better than the best k-means run • Its performance does not degrade on small data sets • Its purity on hierarchies is acceptable

Personal Opinion
