Enhancing Cluster Labeling Using Wikipedia. David Carmel, Haggai Roitman, Naama Zwerdling, IBM Research Lab, {carmel,haggai,naamaz}@il.ibm.com. Presented by Miguel Panuera, mpanuera@gmail.com. School of Computer Science, San Pablo Catholic University, AREQUIPA – PERU, 2010.
CONTENT • Cluster Labeling • Why Wikipedia • Terms extracted: JSD vs Wikipedia • General Framework for Cluster Labeling • Experiments • Summary
Cluster Labeling • This process tries to select descriptive labels for the clusters.
Why Wikipedia • One of the major knowledge resources for many information retrieval tasks. • Text categorization and clustering. • Computing semantic relatedness between concepts. • Predicting document topics.
Terms extracted: JSD vs Wikipedia • The important terms selected by JSD (Jensen-Shannon divergence) fairly represent the content of the categories, but they serve as appropriate labels for only a few categories. • Labels extracted from Wikipedia agree with the human-annotated labels much more often.
GENERAL FRAMEWORK FOR CLUSTER LABELING • Documents are first parsed and tokenized.
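The slides do not say which parser or tokenizer is used. As a minimal sketch of this first step, assuming lowercase alphanumeric tokens and a small hypothetical stop-word list (`STOP_WORDS` and `tokenize` are illustrative names, not taken from the presentation):

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text, keep alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(tokenize("The Cluster of Documents is parsed."))
```

Each document then becomes a list of tokens that the later steps (clustering, term scoring) can consume.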
GENERAL FRAMEWORK FOR CLUSTER LABELING • The clustering algorithm's goal is to create coherent clusters, in which the documents within a cluster share the same topics.
GENERAL FRAMEWORK FOR CLUSTER LABELING • The next step is to find a list of terms, ordered by their estimated importance, that represents the content of the cluster's documents. Such terms consist of single keywords.
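One way to estimate term importance, in line with the JSD column shown earlier, is to score each term by its contribution to the Jensen-Shannon divergence between the cluster's term distribution and the whole collection's. The sketch below is an illustration under that assumption, not the authors' exact implementation; the function names and the maximum-likelihood term distributions are hypothetical choices.

```python
import math
from collections import Counter


def term_distribution(docs):
    """Maximum-likelihood term distribution over a list of tokenized documents."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def jsd_term_scores(cluster_docs, collection_docs):
    """Rank the cluster's terms by their contribution to JSD(P || Q).

    P is the cluster distribution, Q the collection distribution,
    and M = (P + Q) / 2 is the mixture used by the divergence.
    """
    p = term_distribution(cluster_docs)
    q = term_distribution(collection_docs)
    scores = {}
    for term, p_t in p.items():
        q_t = q.get(term, 0.0)
        m_t = 0.5 * (p_t + q_t)
        score = 0.5 * p_t * math.log(p_t / m_t)
        if q_t > 0:  # 0 * log(0) is taken as 0
            score += 0.5 * q_t * math.log(q_t / m_t)
        scores[term] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    cluster = [["wiki", "label", "label"]]
    collection = cluster + [["wiki", "news", "news", "news"]]
    for term, score in jsd_term_scores(cluster, collection):
        print(f"{term}: {score:.4f}")
```

Terms whose frequency in the cluster differs most from their frequency in the collection receive the highest scores, which is why they tend to describe the cluster's content.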
GENERAL FRAMEWORK FOR CLUSTER LABELING • We now wish to extract candidate labels for cluster C.
GENERAL FRAMEWORK FOR CLUSTER LABELING • Candidate labels are evaluated by several judges. Each judge scores the candidates according to its own evaluation policy.
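The slides do not describe the individual judges or how their scores are combined. The following sketch only illustrates the general idea, assuming two hypothetical judges (`overlap_judge`, `weighted_term_judge`) and a weighted sum as the aggregation rule; none of these names or policies come from the presentation.

```python
def overlap_judge(label, important_terms):
    """Hypothetical judge: fraction of the label's words that are important terms."""
    words = label.lower().split()
    if not words:
        return 0.0
    return sum(w in important_terms for w in words) / len(words)


def weighted_term_judge(label, term_weights):
    """Hypothetical judge: sum of the importance weights of the label's words."""
    return sum(term_weights.get(w, 0.0) for w in label.lower().split())


def rank_labels(candidates, judges, weights):
    """Combine the judges' scores with a weighted sum and sort the candidates."""
    scored = []
    for label in candidates:
        total = sum(w * judge(label) for judge, w in zip(judges, weights))
        scored.append((label, total))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    term_weights = {"cluster": 0.5, "labeling": 0.3}
    judges = [
        lambda label: overlap_judge(label, set(term_weights)),
        lambda label: weighted_term_judge(label, term_weights),
    ]
    candidates = ["sports news", "cluster analysis", "cluster labeling"]
    for label, score in rank_labels(candidates, judges, [0.5, 0.5]):
        print(f"{label}: {score:.2f}")
```

Treating each judge as a plain function keeps the evaluation policies independent, so a judge can be added or removed without touching the aggregation step.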
Experiments • K: the number of required cluster labels. • Match@K: the fraction of clusters for which at least one of the top-K labels is correct.
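The Match@K definition can be computed as in the sketch below. It assumes that a label counts as correct when it exactly matches one of the cluster's ground-truth labels; the slides do not state the matching rule, so that criterion and the function name are assumptions.

```python
def match_at_k(ranked_labels_per_cluster, correct_labels_per_cluster, k):
    """Fraction of clusters with at least one correct label among the top k.

    ranked_labels_per_cluster: one ranked list of suggested labels per cluster.
    correct_labels_per_cluster: one set of ground-truth labels per cluster.
    Assumption: a suggested label is correct only on an exact string match.
    """
    if not ranked_labels_per_cluster:
        return 0.0
    hits = 0
    for ranked, correct in zip(ranked_labels_per_cluster, correct_labels_per_cluster):
        if any(label in correct for label in ranked[:k]):
            hits += 1
    return hits / len(ranked_labels_per_cluster)


if __name__ == "__main__":
    ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
    correct = [{"b"}, {"c"}, {"x"}]
    print(match_at_k(ranked, correct, 1))
    print(match_at_k(ranked, correct, 2))
```

Because a larger K can only add labels to each cluster's top list, Match@K never decreases as K grows.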
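Summary • We described a general framework for cluster labeling that extracts candidate labels from the text and from Wikipedia. • Using Wikipedia for cluster labeling was highly effective in our experiments: Wikipedia labels agreed with the human-annotated labels much more often than the important terms extracted from the text.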