1 / 15

Enhancing Text Clustering by Leveraging Wikipedia Semantics

Enhancing Text Clustering by Leveraging Wikipedia Semantics. Presenter : Cheng- Hui Chen Authors : Jian Hu , Lujun Fang, Yang Cao, Hua -Jun Zeng , Hua Li, Qiang Yang, Zheng Chen SIGIR, 2008. Outlines. Motivation Objectives Methodology Experiments Conclusions Comments.

jett
Download Presentation

Enhancing Text Clustering by Leveraging Wikipedia Semantics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Enhancing Text Clustering by Leveraging Wikipedia Semantics Presenter : Cheng-HuiChen Authors : JianHu, Lujun Fang, Yang Cao, Hua-Jun Zeng, Hua Li, Qiang Yang, Zheng Chen SIGIR, 2008

  2. Outlines • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

  3. Motivation Most traditional text clustering methods ignores the important information on the semantic relationships between key terms. Lack of an effective word sense disambiguation method. The synonymy and polysemy are not easy to handle the problems.

  4. Objectives Enhance the clustering result by obtaining a more accurate distance measure. The generated thesaurus serves as a control vocabulary that bridges the variety of idiolects and terminologies present in the document corpus.

  5. Methodology • Wikipedia Thesaurus • Traditional text clustering • The framework

  6. Methodology • Wikipedia Thesaurus • Wikipedia Concept • Synonymy • Polysemy • Hypernymy (Hierarchical Relation) • Associative relations • Content based measure • Out-linked category based measure • Combination of the two measure

  7. Methodology • Traditional text clustering • Traditional text similarity measure • Compute cosine similarity • Traditional text representation enrichment strategies • Generated new features replace or append to original document, and construct new vector representation

  8. Methodology Use the category decay factor μ • The framework • Mapping text documents into wikipedia concept sets • Considering frequently occurred synonymy, polysemy andhypernymy in text documents, accurate allocation of terms in Wikipedia. • Enriching similarity measure with hierarchical relation • Similarity measure using category vectors • Considering the original document content, the similarity measure can be represented as:

  9. Methodology Ca={(CS,1),(ML,1)} Cb={(DM,1),(DB,1)} =0.57 • Enriching similarity measure with synonym and associative relation • The expanded weighted concept set • The set Cextto Cband get the extended Cbas: • Define the similarity

  10. Methodology • The set α and β to equal weights α =β =1/ 3 The combination

  11. Experiments M C N1 = 180 N2 = 100 N1 ∩N2=30 Purity = 33.33% Inpurity = 30% • Evaluation Criteria • M = M1,M2, ...,Mn represent the nmanually labeled clusters, C = C1,C2, ...,Cn represent the n clusters generated using our algorithm. • Precision of Ci and Mj is defined: • The purity of the clustering result is defined: • The corresponding inverse purity is defined:

  12. Experiments BASE1: Traditional text document similarity measure. BASE2: Improved with Gabrilovich’s feature generation technique on Wikipedia. BASE3: K-Means clustering with Hotho’s text document representation enrichment with WordNet.

  13. Conclusions • The clustering performance of our method is improved compared with previous methods. • The future work • Use the multilingual relations to explore the application in Cross language Information Retrieval and Cross-language Text Categorization.

  14. Comments • Advantages • Improved text clustering performance. • Applications • Clustering • Information Retrieval

More Related