A Technique For Text Clustering Using A Partition Based Approach
Doan Nguyen
Presentation Outline
• Knowledge Retrieval Scenarios
• Challenges for Clustering of Documents
• Applicability Assumptions
• Cluster Analysis Steps
• An Example
• Conclusion
Knowledge Retrieval Scenarios
• Brute-force search (no guidance)
• Contextual search
• On-the-fly structuring/grouping of the result list
Clustering With Retrieved Documents (Jamie Callan, 2002)
Challenges for Clustering of Documents
• Handling high dimensionality
• Clustering quality
• Support for multi-subject content
• Presentation of clustering data
• Performance throughput
Applicability Assumptions
• Clustering is applied to documents returned by a search engine
• There is a maximum number of documents that searchers may gain access to
• Clustering is applied only when the result list is broad
• Web searchers want quick access to content whenever possible
Proposed System Architecture
[Figure: architecture diagram. Input: User Query. Components: Query Processor, Search Engine, Search Indexes, Cluster Analysis, Stop Words, Stemming Rules, Result Formulator. Output: Result List + Clustering Data.]
Document Vectors
• Documents are represented as vectors when used computationally
• Each vector holds a place for every term in the collection
• Therefore, most vectors are sparse
[Table: term counts for Doc A through Doc I over the terms nova, galaxy, heat, h’wood, film, role, diet, fur. Cell positions were lost in extraction; the nonzero counts, in extracted order, are 10 5 3 5 10 10 8 7 9 10 5 10 10 9 10 5 7 9 6 10 2 8 7 5 1 3.]
• “Nova” occurs 10 times in Doc A
• “Galaxy” occurs 5 times in Doc A
• “Heat” occurs 3 times in Doc A
• (Blank means 0 occurrences.)
(Hearst & Larson: SIMS 202, UCB)
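A minimal sketch of the sparse representation described on this slide, assuming whitespace tokenization and lowercasing (the slides do not specify either). The names `build_vectors` and `to_dense` and the two sample documents are hypothetical and only serve the illustration.

```python
from collections import Counter


def build_vectors(docs):
    """Return (vocabulary, sparse_vectors) for a dict of doc_id -> text.

    Each sparse vector is a Counter that stores only the terms that occur
    in the document, so absent terms implicitly count as 0.
    """
    sparse = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    vocabulary = sorted({term for counts in sparse.values() for term in counts})
    return vocabulary, sparse


def to_dense(vocabulary, sparse_vector):
    """Expand a sparse vector to one slot per vocabulary term."""
    return [sparse_vector.get(term, 0) for term in vocabulary]


# Hypothetical sample documents (not taken from the slides).
docs = {
    "Doc A": "nova nova galaxy heat",
    "Doc B": "film role film",
}
vocabulary, vectors = build_vectors(docs)
dense_a = to_dense(vocabulary, vectors["Doc A"])
```

Storing only the nonzero counts keeps memory proportional to document length rather than to vocabulary size, which is the point of the "most vectors are sparse" remark.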
Measuring Similarity
• Cosine Similarity Expression:
• Tanimoto Expression:
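The expressions on this slide did not survive extraction. As a hedged sketch, the code below assumes the standard definitions: cosine similarity is A·B / (|A| |B|), and the Tanimoto coefficient is A·B / (|A|² + |B|² − A·B). Whether the presentation used exactly these forms is an assumption. Vector `a` reuses the Doc A counts quoted in the slides (nova 10, galaxy 5, heat 3); vector `b` is hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length numeric vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """A.B / (|A| |B|); returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def tanimoto_similarity(a, b):
    """A.B / (|A|^2 + |B|^2 - A.B); returns 0.0 if both vectors are all zeros."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


a = [10, 5, 3, 0]  # Doc A counts from the slides, plus one unused term
b = [5, 10, 0, 0]  # hypothetical second document
cos_ab = cosine_similarity(a, b)
tan_ab = tanimoto_similarity(a, b)
```

Both measures equal 1 for identical vectors and 0 for vectors with no shared terms; Tanimoto additionally penalizes differences in magnitude, while cosine depends only on direction.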
Document Frequency (DF)
• Document frequency: the number of documents in the dataset in which a word occurs. Used for feature selection.
[Table: occurrence marks (x) for each term across DocA through DocI, with a DF column. The positions of the marks were lost in extraction; the DF column is reproduced below.]
Term     DF
nova     3
galaxy   4
heat     2
h’wood   4
film     5
role     2
diet     3
fur      3
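A minimal sketch of computing document frequency and using it for feature selection. The slides do not say how DF drives the selection, so the `min_df` cutoff here is an assumption, as are the function names and the sample documents.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs.values():
        # set() ensures a term is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(text.lower().split()))
    return df


def select_features(df, min_df=2):
    """Keep terms that occur in at least min_df documents (assumed rule)."""
    return sorted(term for term, n in df.items() if n >= min_df)


# Hypothetical sample documents (not taken from the slides).
docs = {
    "DocA": "nova galaxy heat nova",
    "DocB": "nova galaxy",
    "DocC": "film role",
}
df = document_frequency(docs)
features = select_features(df, min_df=2)
```

Dropping terms that appear in only one document reduces dimensionality, since such terms cannot link two documents together.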
Cluster Analysis Steps
• Text Preprocessor
  - Build Word_Index_Table (Word ids, Doc ids, Document Frequency (DF))
• Feature Extraction (Data Splitter)
  - Build Word_Processed_Table (Document Vectors)
  - Combine words when applicable
• Build Clusters
  - Compute seeded (centroid) documents. A centroid document consists of the words that appear in two or more documents under the same Word_Cluster_Id.
  - Assign a document to a seeded cluster if (1) it shares a common Word_Clustered_Id and (2) the distance between the centroid document and the inspected document is within a threshold value
  - Decompose the main cluster into sub-clusters
• Cluster Output
  - Order all clusters by their DF values
  - Label each cluster with a name: Word_Clustered_Ids and DF
  - Output the result
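The steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the presenter's implementation: each word with DF of at least 2 is treated as a cluster id, distance is taken to be Jaccard distance over word sets, the threshold value of 0.8 is arbitrary, and sub-cluster decomposition and word combining are omitted. All function names and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_word_index(docs):
    """Map each word to the set of doc ids containing it (DF = set size)."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return index


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two word sets (assumed distance measure)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.8):
    """Return [(label, member_doc_ids)] ordered by descending DF."""
    index = build_word_index(docs)
    clusters = []
    for word, doc_ids in index.items():
        if len(doc_ids) < 2:  # a seed needs at least two documents
            continue
        # Centroid: words present in two or more documents sharing this seed word.
        counts = Counter(w for d in doc_ids for w in docs[d])
        centroid = {w for w, n in counts.items() if n >= 2}
        # Assign documents that share the seed word and lie within the threshold.
        members = sorted(
            d for d in doc_ids if jaccard_distance(docs[d], centroid) <= threshold
        )
        if members:
            clusters.append((word, len(doc_ids), members))
    clusters.sort(key=lambda c: (-c[1], c[0]))  # order by DF, then by name
    return [(f"{word} ({df})", members) for word, df, members in clusters]


# Hypothetical preprocessed documents (word sets after stop-word removal).
docs = {
    "d1": {"car", "rental", "airport"},
    "d2": {"car", "rental", "cheap"},
    "d3": {"car", "insurance", "quote"},
    "d4": {"car", "insurance", "cheap"},
}
result = cluster(docs)
```

In this sketch a document can land in more than one cluster, which is one way to accommodate multi-subject content; a stricter partition would assign each document only to its nearest centroid.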
An Example Search Result for a Query of “Car”
Sample Output of Preprocessor Word_Index_Table
Sample Output of Feature Extraction (Data Splitter)
Sample Output of Word_Processed_Table
Sample Output of Word_Processed_Table (with combined words)
Result of Clustering Formulation
Conclusion
• Implementation concerns
• Data analysis
• Throughput performance considerations
• Usability of cluster name labels