
A novel document similarity measure based on earth mover’s distance

Presenter: Shao-Wei Cheng. Author: Xiaojun Wan. InfSci 2007.


Presentation Transcript


  1. A novel document similarity measure based on earth mover’s distance • Presenter: Shao-Wei Cheng • Author: Xiaojun Wan • InfSci 2007

  2. Outline • Motivation • Objective • Methodology • Experiments • Conclusion • Comments

  3. Motivation • Measuring pairwise document similarity is crucial for many text applications, including document clustering, document filtering, and nearest-neighbor search. • Many similarity measures already exist: • VSM-based: Cosine, Dice, Jaccard, Overlap • Information-theoretic • Retrieval models: BM25, NVSM, LM • OM-based (one-to-one matching): measures similarity over subtopics, using document structure information

  4. Objectives • Go beyond one-to-one matching → many-to-many matching • Use more information to obtain a more natural measure

  5. Methodology • Framework • Document decomposition: TextTiling or sentence clustering • Similarity measure: the proposed EMD-based (earth mover’s distance) measure, which improves the OM-based measure to allow many-to-many matching

  6. Methodology • TextTiling: tokenization, lexical score determination, boundary identification • Sentence clustering: hierarchical agglomerative clustering • Uses the average-link method to compute the similarity between clusters • The merging threshold can be determined through cross-validation
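The sentence-clustering step on slide 6 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the bag-of-words sentence vectors, the cosine similarity between sentences, the function names, and the 0.5 threshold are all hypothetical choices made for the example.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(c1, c2, vectors):
    """Average-link similarity: mean pairwise similarity across two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_sentences(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches `threshold`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, (a, b) = max(
            (average_link(clusters[a], clusters[b], vectors), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break  # no pair is similar enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected
    return clusters


# Toy sentences as term-frequency dicts (assumed representation).
sentences = [
    {"earth": 1, "mover": 1, "distance": 1},
    {"earth": 1, "distance": 1, "flow": 1},
    {"document": 1, "clustering": 1},
    {"document": 1, "clustering": 1, "search": 1},
]
clusters = cluster_sentences(sentences, threshold=0.5)
```

Each resulting cluster plays the role of one subtopic of the document. In practice the threshold would be tuned, for example by the cross-validation the slide mentions.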

  7. Methodology • OM-based measure • Recasts similarity measurement as an optimal matching problem • Constraint of the optimal matching problem: no two edges share the same node • Find the matching M (a subset of the edges E) that has the largest total weight • One-to-one matching might lose information
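The optimal matching idea on slide 7 can be illustrated with a brute-force search over one-to-one assignments. This is only a sketch for tiny inputs: the function name and the weight matrix are hypothetical, and a practical implementation would use a polynomial-time assignment algorithm (for example the Hungarian method) rather than enumerating permutations.

```python
from itertools import permutations


def optimal_matching(weights):
    """Return (total_weight, pairs) of the best one-to-one matching.

    weights[i][j] is the similarity between subtopic i of one document
    and subtopic j of the other. Brute force, so only for tiny inputs.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    if n_rows > n_cols:
        # Transpose so the rows are the smaller side, then swap pairs back.
        transposed = [list(col) for col in zip(*weights)]
        total, pairs = optimal_matching(transposed)
        return total, [(i, j) for j, i in pairs]
    best_total, best_pairs = -1.0, []
    for cols in permutations(range(n_cols), n_rows):
        # Each row gets a distinct column, so no two edges share a node.
        total = sum(weights[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs


# Hypothetical subtopic similarities: 2 subtopics in X, 3 subtopics in Y.
weights = [
    [0.9, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
total, pairs = optimal_matching(weights)
```

In this toy matrix the best matching pairs X0 with Y0 and X1 with Y2, so the strong 0.8 similarity between X0 and Y1 contributes nothing. That is the information loss of one-to-one matching that the slide points out.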

  8. Methodology • EMD-based measure • Recasts similarity measurement as a transportation problem • The earth mover’s distance • Find a flow F = [fij] that minimizes the overall cost • The constraints:
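The constraint formulas on slide 8 are missing from the transcript. For reference, the standard earth mover’s distance formulation (as usually stated following Rubner et al.) is written out below; that the slide used exactly this form is an assumption. Apart from F = [fij], the symbols are introduced here for illustration: w_{x_i} and w_{y_j} are the weights of subtopic i of document X and subtopic j of document Y, and d_{ij} is the distance between those two subtopics.

```latex
\[
\begin{aligned}
\min_{F=[f_{ij}]} \quad & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{subject to} \quad
  & f_{ij} \ge 0, && 1 \le i \le m,\ 1 \le j \le n, \\
  & \sum_{j=1}^{n} f_{ij} \le w_{x_i}, && 1 \le i \le m, \\
  & \sum_{i=1}^{m} f_{ij} \le w_{y_j}, && 1 \le j \le n, \\
  & \sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}
      = \min\!\Bigl(\sum_{i=1}^{m} w_{x_i},\ \sum_{j=1}^{n} w_{y_j}\Bigr).
\end{aligned}
\]
\[
\mathrm{EMD}(X, Y) =
  \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\, d_{ij}}
       {\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}
\]
```

Because a subtopic's weight may be split across several flows f_{ij}, one subtopic can be matched to several subtopics of the other document. This is what makes the measure many-to-many rather than one-to-one.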

  9. Experiments • Performance comparison of different similarity measures • MAP: non-interpolated mean average precision
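The evaluation metric named on slide 9 can be computed as in the sketch below. The function names and the toy rankings are hypothetical; the code shows the usual definition of non-interpolated average precision (precision taken at the rank of each relevant document, divided by the number of relevant documents), which is assumed to be the one used in the experiments.

```python
def average_precision(ranked, relevant):
    """Non-interpolated AP: precision at each relevant hit, averaged
    over the total number of relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranked_list, relevant_set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries with made-up document ids.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),              # AP = 1/2
]
map_score = mean_average_precision(runs)
```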

  10. Experiments • Influence of the document decomposition algorithm • Sentence clustering algorithm • TextTiling

  11. Conclusion • The proposed measure overcomes the limitation of one-to-one matching, and the experimental results show the effectiveness and robustness of the EMD-based measure. • Future work • Combine the Cosine measure and the EMD-based measure in a re-ranking process. • Try other document decomposition algorithms.

  12. Comments • Advantage • Recasts document similarity measurement as a different mathematical problem. • Drawback • Application • Clustering • Classification • Search engine • …
