ENHANCING CLUSTER LABELING USING WIKIPEDIA David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab SIGIR’09
Document Clustering
• A method of aggregating a set of documents such that:
  • Documents within a cluster are as similar as possible.
  • Documents from different clusters are dissimilar.
[Figure: documents grouped into Cluster 1, Cluster 2, and Cluster 3]
Cluster Labeling
• Assign each cluster a human-readable label that best represents the cluster.
• The traditional method picks the label from the important terms within the cluster.
• The statistically significant terms may not make a good label.
• A good label may not occur directly in the text.
[Figure: Clusters 1-3 with example labels Electronics, Bowling, and Ice Hockey]
Approach
• Use an external resource, Wikipedia, to help with cluster labeling.
• Besides the important terms extracted from the cluster, Wikipedia metadata such as page titles and categories serve as candidate labels.
A General Framework
[Figure: diagram of the general framework]
Step 1: Indexing
• Documents are parsed and tokenized.
• Term weights are determined by tf-idf.
• Lucene is used to build a search index so that the tf and idf values of a term t can be accessed quickly.
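The indexing step can be illustrated with a minimal plain-Python sketch. This is a stand-in for the Lucene index, not the authors' setup: the whitespace tokenizer, the function names, and the `log(N/df)` idf variant are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()

def build_index(docs):
    """Return per-document term frequencies and collection-wide idf values."""
    tfs = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per term
    # Assumption: idf = log(N / df), one common variant among several.
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}
    return tfs, idf

def tfidf(tfs, idf, doc_id, term):
    """tf-idf weight of `term` in document `doc_id` (0.0 if absent)."""
    return tfs[doc_id][term] * idf.get(term, 0.0)

docs = ["ice hockey puck", "ice bowling", "hockey stick"]
tfs, idf = build_index(docs)
```

Precomputing `tfs` and `idf` once plays the role the slide assigns to the search index: both values can then be looked up per term without rescanning the documents.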
Step 2: Clustering
• Given the document collection D, return a set of document clusters C = {C1, C2, …, Cn}.
• A cluster is represented by the centroid of its documents.
• The term weight in the cluster's centroid is slightly modified:
[Equation: modified centroid term weight]
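The modified weight itself is not shown here. The name "ctf-cdf-idf" on the Parameters slide suggests a product of a cluster term frequency, a cluster document frequency, and idf. The sketch below is written under that assumption: `ctf` as the average term frequency over the cluster's documents and `cdf` as `log(n + 1)`, where `n` is the number of cluster documents containing the term. Both definitions and the function name are assumptions, not taken from the slide.

```python
import math
from collections import Counter

def centroid_weights(cluster_tfs, idf):
    """cluster_tfs: one Counter (term -> tf) per document in the cluster."""
    size = len(cluster_tfs)
    total_tf = Counter()   # summed term frequency over the cluster
    doc_count = Counter()  # number of cluster documents containing the term
    for tf in cluster_tfs:
        total_tf.update(tf)
        doc_count.update(tf.keys())
    weights = {}
    for t in total_tf:
        ctf = total_tf[t] / size          # assumed: average tf in the cluster
        cdf = math.log(doc_count[t] + 1)  # assumed: damped cluster doc frequency
        weights[t] = ctf * cdf * idf.get(t, 0.0)
    return weights

cluster = [Counter({"hockey": 2, "ice": 1}), Counter({"hockey": 1})]
weights = centroid_weights(cluster, {"hockey": 1.0, "ice": 2.0})
```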
Step 3: Important Terms Extraction
• Given a cluster, find a list of important terms ordered by their estimated importance.
• This can be achieved by:
  • Selecting the top-weighted terms from the cluster centroid.
  • Using the Jensen-Shannon divergence (JSD) to measure the distance between the cluster and the collection, and ranking terms by their contribution to it.
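A rough sketch of JSD-based term selection follows. It splits the Jensen-Shannon divergence between the cluster's term distribution `p` and the collection's term distribution `q` into per-term contributions and keeps the largest ones. Restricting candidates to terms that occur in the cluster is an assumption of this sketch, as are the function names.

```python
import math

def jsd_contributions(p, q):
    """Per-term contributions to JSD(p, q); their sum is the divergence."""
    out = {}
    for t in set(p) | set(q):
        pt, qt = p.get(t, 0.0), q.get(t, 0.0)
        m = 0.5 * (pt + qt)  # mixture distribution
        c = 0.0
        if pt > 0:
            c += 0.5 * pt * math.log(pt / m)
        if qt > 0:
            c += 0.5 * qt * math.log(qt / m)
        out[t] = c
    return out

def top_terms(p, q, k):
    """Top-k cluster terms by JSD contribution (assumption: only terms with p > 0)."""
    contrib = jsd_contributions(p, q)
    in_cluster = [t for t in contrib if p.get(t, 0.0) > 0]
    return sorted(in_cluster, key=contrib.get, reverse=True)[:k]

p = {"hockey": 0.5, "the": 0.5}                  # cluster distribution
q = {"hockey": 0.1, "the": 0.5, "bowling": 0.4}  # collection distribution
best = top_terms(p, q, 1)
```

A term such as "the", which is equally frequent in the cluster and the collection, contributes nothing, while "hockey" is over-represented in the cluster and ranks first.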
Step 4: Label Extraction
• One way is to use the top k important terms directly.
• The other way is to use the top k important terms to query Wikipedia. The titles and categories of the returned Wikipedia documents serve as candidate labels.
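The two sources of candidates can be combined as in the sketch below. The shape of `wiki_results` (a ranked list of dictionaries with `title` and `categories` keys) is hypothetical and stands in for whatever a Wikipedia search over the top k terms would return. The case-insensitive deduplication is also an assumption.

```python
def build_query(important_terms, k):
    """Join the top-k important terms into a single query string."""
    return " ".join(important_terms[:k])

def candidate_labels(important_terms, wiki_results, k=3):
    """Collect candidates: top-k terms, then titles and categories of each result.

    wiki_results: hypothetical ranked list of {'title': str, 'categories': [str]}.
    """
    candidates = list(important_terms[:k])
    for page in wiki_results:
        candidates.append(page["title"])
        candidates.extend(page["categories"])
    seen, unique = set(), []
    for c in candidates:
        key = c.lower()  # assumption: labels differing only in case are duplicates
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique

terms = ["puck", "goalie", "nhl", "rink"]
results = [
    {"title": "Ice hockey", "categories": ["Team sports", "Winter sports"]},
    {"title": "NHL", "categories": ["Ice hockey leagues"]},
]
labels = candidate_labels(terms, results, k=3)
```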
Step 5: Output the Recommended Labels from Candidate Labels
• MI (Mutual Information) Judge
  • Score each candidate label by its pointwise mutual information with the cluster's important terms.
• SP (Score Propagation) Judge
  • Propagate the document score to the candidate label.
  • The document score can be the original score from the IR system or the reciprocal rank, rank(d)^-1.
• Score Aggregation
  • Use a linear combination to combine the two judges above.
  • The recommended labels are the top-ranked labels.
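A simplified sketch of the SP judge and the score aggregation is given below. Each returned page passes its reciprocal rank to the labels it yields, and the final ranking is a linear combination of the two judges. The MI scores are taken as given inputs, the mixing weight `beta` is a hypothetical parameter, and the two judges are assumed to produce scores on comparable scales. This is a one-step simplification, not the authors' exact propagation scheme.

```python
from collections import defaultdict

def sp_scores(ranked_pages):
    """ranked_pages: list of label lists, best-ranked page first."""
    scores = defaultdict(float)
    for rank, labels in enumerate(ranked_pages, start=1):
        for label in set(labels):
            scores[label] += 1.0 / rank  # document score = reciprocal rank
    return dict(scores)

def aggregate(mi, sp, beta=0.5):
    """Rank labels by beta * MI + (1 - beta) * SP (beta is an assumed weight)."""
    labels = set(mi) | set(sp)
    combined = {l: beta * mi.get(l, 0.0) + (1 - beta) * sp.get(l, 0.0)
                for l in labels}
    return sorted(combined, key=combined.get, reverse=True)

pages = [["Ice hockey", "Team sports"], ["Ice hockey", "NHL"]]
sp = sp_scores(pages)
mi = {"Ice hockey": 0.2, "NHL": 0.9, "Team sports": 0.1}  # made-up MI scores
ranking = aggregate(mi, sp, beta=0.5)
```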
Data Collection
• 20 Newsgroups
  • 20 clusters × 1,000 documents per cluster
• Open Directory Project (ODP)
  • 100 clusters × 100 documents per cluster
• The Ground Truth (a suggested label counts as correct if it is one of the following)
  • The correct label itself.
  • An inflection of the correct label.
  • A WordNet synonym of the correct label.
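Evaluation Metrics
• Match@K: the fraction of clusters with at least one correct label among the top K.
  • Ex: two clusters (c1, c2) with four ranked labels each; only one cluster has a correct label in its top 4, so Match@4 = 1/2 = 0.5.
• Mean Reciprocal Rank (MRR@K): the reciprocal rank of the first correct label among the top K, averaged over clusters.
  • Ex: the correct labels of c1 and c2 appear at ranks 2 and 3, so MRR@4 = ((1/2) + (1/3))/2 ≈ 0.417.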
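Both metrics are short to compute. The sketch below reproduces the two examples on this slide. The data layout (a dictionary of ranked label lists per cluster plus a dictionary of acceptable labels per cluster) and the function names are assumptions made for illustration.

```python
def match_at_k(ranked_labels, correct, k):
    """Fraction of clusters with at least one correct label in the top k."""
    hits = sum(
        1 for c, labels in ranked_labels.items()
        if any(l in correct[c] for l in labels[:k])
    )
    return hits / len(ranked_labels)

def mrr_at_k(ranked_labels, correct, k):
    """Mean over clusters of 1/rank of the first correct label in the top k (0 if none)."""
    total = 0.0
    for c, labels in ranked_labels.items():
        for rank, l in enumerate(labels[:k], start=1):
            if l in correct[c]:
                total += 1.0 / rank
                break
    return total / len(ranked_labels)

ranked = {
    "c1": ["label1", "label2", "label3", "label4"],
    "c2": ["label1", "label2", "label3", "label4"],
}
# Match example: only c1 has a correct label among its top 4.
match_example = match_at_k(ranked, {"c1": {"label2"}, "c2": {"other"}}, 4)
# MRR example: correct labels at rank 2 (c1) and rank 3 (c2).
mrr_example = mrr_at_k(ranked, {"c1": {"label2"}, "c2": {"label3"}}, 4)
```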
Parameters
• The important-term selection method (JSD, ctf-cdf-idf, MI, chi-square).
• The number of important terms used for querying Wikipedia.
• The number of Wikipedia results used for label extraction.
• The judges used for candidate evaluation.
Evaluation 1
• The effectiveness of using Wikipedia to enhance cluster labeling.
Evaluation 2
• Candidate label extraction.
Evaluation 3
• Judge effectiveness.
Evaluation 4.1
• The Effect of Clusters' Coherency on Label Quality
• Testing on "noisy clusters":
  • For a noise level p in [0, 1], each document in a cluster is swapped, with probability p, with a document from another cluster.
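The noise procedure can be sketched as follows. Choosing the partner cluster and the partner document uniformly at random is an assumption, as are the function name and the fixed seed. Clusters are assumed to be non-empty lists of document identifiers.

```python
import random

def add_noise(clusters, p, seed=0):
    """Return a copy of `clusters` where each document is swapped with probability p."""
    rng = random.Random(seed)
    noisy = [list(c) for c in clusters]
    if len(noisy) < 2:
        return noisy  # nothing to swap with
    for i in range(len(noisy)):
        for a in range(len(noisy[i])):
            if rng.random() < p:
                # Assumption: partner cluster and document are chosen uniformly.
                j = rng.choice([x for x in range(len(noisy)) if x != i])
                b = rng.randrange(len(noisy[j]))
                noisy[i][a], noisy[j][b] = noisy[j][b], noisy[i][a]
    return noisy

clusters = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
clean = add_noise(clusters, p=0.0)
noisy = add_noise(clusters, p=1.0)
```

Because documents are swapped rather than moved, cluster sizes stay fixed at every noise level, so only the coherency of each cluster changes.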
Evaluation 4.2
• The Effect of Clusters' Coherency on Label Quality
Conclusion
• Proposed a general framework for the cluster labeling problem.
• Wikipedia metadata can boost the performance of cluster labeling.
• The proposed method is resilient to noisy clusters.