A novel document similarity measure based on earth mover’s distance Presenter: Shao-Wei Cheng Authors: Xiaojun Wan InfSci 2007
Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments
Motivation • Measuring pairwise document similarity is crucial for various text applications, including document clustering, document filtering, and nearest neighbor search. • Many methods already exist: • VSM: Cosine, Dice, Jaccard, Overlap • Information-theoretic measures • Retrieval models: BM25, NVSM, LM • OM-based (optimal matching): one-to-one matching between subtopics, which uses document structure information
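For reference, the four VSM-style measures named above can be sketched in a few lines. This is a minimal illustration, not the paper's setup: Cosine is computed on raw term-frequency vectors, while Dice, Jaccard, and Overlap are shown in their simple set-based form. The whitespace tokenization and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice(x: set, y: set) -> float:
    """Dice coefficient on term sets: 2|X∩Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 0.0


def jaccard(x: set, y: set) -> float:
    """Jaccard coefficient on term sets: |X∩Y| / |X∪Y|."""
    return len(x & y) / len(x | y) if x or y else 0.0


def overlap(x: set, y: set) -> float:
    """Overlap coefficient on term sets: |X∩Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y)) if x and y else 0.0


# Toy documents (assumed example data), tokenized on whitespace.
doc1 = "earth mover distance".split()
doc2 = "earth mover similarity measure".split()
tf1, tf2 = Counter(doc1), Counter(doc2)
set1, set2 = set(doc1), set(doc2)
```

All four treat a document as a single bag of terms, so none of them uses the subtopic structure that the OM-based and EMD-based measures rely on.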
Objectives • Go beyond one-to-one matching: allow many-to-many matching between subtopics • Use more information to obtain a more natural similarity measure
Methodology • Framework • Document decomposition: TextTiling or sentence clustering • Similarity measure: the proposed EMD-based (earth mover’s distance) measure, which improves the OM-based measure to allow many-to-many matching
Methodology • TextTiling: tokenization, lexical score determination, boundary identification • Sentence clustering: hierarchical agglomerative clustering, using the average-link method to compute the similarity between clusters; the merging threshold can be determined through cross-validation
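The sentence clustering step can be sketched as follows. This is a minimal, unoptimized illustration of average-link agglomerative clustering with a merging threshold. The whitespace tokenization, the use of cosine similarity between term-frequency vectors, and the names `cluster_sentences` and `average_link` are assumptions for the example, not details taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(c1, c2, vecs) -> float:
    """Average pairwise similarity between the members of two clusters."""
    total = sum(cosine(vecs[i], vecs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_sentences(sentences, threshold):
    """Merge the most similar pair of clusters until none reaches the threshold."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    clusters = [[i] for i in range(len(vecs))]  # start with one cluster per sentence
    while len(clusters) > 1:
        best_sim, (a, b) = max(
            (average_link(clusters[a], clusters[b], vecs), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Toy input (assumed example data); each cluster plays the role of one subtopic.
sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply",
    "stock prices rose sharply",
]
clusters = cluster_sentences(sentences, threshold=0.5)
```

The threshold of 0.5 is arbitrary here; the slide says it can be chosen through cross-validation.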
Methodology • OM-based measure • Recast the similarity measure as an optimal matching problem. • The constraint of the optimal matching problem: no two edges share the same node. • Find the matching M (the best subset of the edges E) that has the largest total weight. • The one-to-one matching may lose information.
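To make the one-to-one constraint concrete, here is a small brute-force sketch that finds the maximum-weight matching between the subtopics of two documents. It only suits tiny inputs; a practical implementation would use a polynomial-time method such as the Hungarian (Kuhn-Munkres) algorithm. The weight matrix, the function name `optimal_matching`, and the assumption of non-negative weights are illustrative, and any normalization of the total into a final similarity score is left out.

```python
from itertools import permutations


def optimal_matching(weights):
    """Return (total weight, matched pairs) of the best one-to-one matching.

    weights[i][j] is the (non-negative) similarity between subtopic i of the
    first document and subtopic j of the second document.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    transposed = n_rows > n_cols
    if transposed:
        # Work with the smaller side as rows so every row can be matched.
        weights = [list(col) for col in zip(*weights)]
        n_rows, n_cols = n_cols, n_rows

    best_total, best_pairs = -1.0, []
    # Each permutation assigns a distinct column to every row,
    # so no two edges share a node.
    for cols in permutations(range(n_cols), n_rows):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))

    if transposed:
        best_pairs = [(j, i) for i, j in best_pairs]
    return best_total, best_pairs


# Toy similarity matrix (assumed example data).
weights = [
    [0.9, 0.8],
    [0.8, 0.1],
]
total, pairs = optimal_matching(weights)
```

In this toy matrix the first subtopic is similar to both subtopics of the other document, but the matching can credit only one of those edges. That is the information loss the slide refers to.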
Methodology • EMD-based measure • Recast the similarity measure as a transportation problem. • The earth mover’s distance: find a flow F = [f_ij] that minimizes the overall cost, subject to constraints on the flow.
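The formulas from this slide did not survive in the transcript. For orientation, the standard earth mover's distance formulation is given below. Apart from the flow F = [f_ij], the notation is an assumption and may differ from the paper's: document P has m subtopics with weights w_{p_i}, document Q has n subtopics with weights w_{q_j}, and d_{ij} is the ground distance between subtopic i of P and subtopic j of Q.

```latex
\[
\begin{aligned}
\min_{F}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} d_{ij}\, f_{ij} \\
\text{subject to}\quad
  & f_{ij} \ge 0, && 1 \le i \le m,\ 1 \le j \le n, \\
  & \sum_{j=1}^{n} f_{ij} \le w_{p_i}, && 1 \le i \le m, \\
  & \sum_{i=1}^{m} f_{ij} \le w_{q_j}, && 1 \le j \le n, \\
  & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}
      = \min\!\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j}\Big).
\end{aligned}
\]
\[
\mathrm{EMD}(P, Q) =
  \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} d_{ij}\, f_{ij}}
       {\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}
\]
```

Because a subtopic's weight can be split across several f_{ij}, one subtopic may send flow to several subtopics of the other document. This is what permits many-to-many matching, in contrast to the OM-based measure.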
Experiments • Performance comparison for different similarity measures. • Evaluation metric: MAP (non-interpolated mean average precision)
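As a reminder of how the metric is computed, here is a minimal sketch of non-interpolated average precision and its mean over queries. The function names and the toy rankings are assumptions for illustration and do not reproduce the paper's data.

```python
def average_precision(ranked, relevant):
    """Non-interpolated AP: mean of precision@k over the ranks k of relevant hits."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: average of AP over a list of (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries (assumed example data).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
```

Each similarity measure produces its own ranked list per query, so comparing measures amounts to comparing their MAP scores on the same set of queries.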
Experiments • Influence of the document decomposition algorithm • Sentence clustering algorithm • TextTiling
Conclusion • The proposed measure overcomes the limitation of one-to-one matching, and the experimental results show the effectiveness and robustness of the EMD-based measure. • Future work • Combine the Cosine measure and the EMD-based measure in a re-ranking process. • Try other document decomposition algorithms.
Comments • Advantage • Recasts document similarity measurement as a well-studied mathematical problem (the transportation problem). • Drawback • Application • Clustering • Classification • Search engine • …