This presentation describes the incorporation of new techniques for entity linking and acronym expansion in text analysis: it investigates algorithms for NIL query clustering, proposes a combination system, and introduces a Wikipedia-LDA model for entity linking with batch-size-changing instance selection.
I2R-NUS-MSRA at TAC 2011: Entity Linking Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2, Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1 1 National University of Singapore 2 Institute for Infocomm Research 3 Microsoft Research Asia Text Analysis Conference, November 14-15, 2011
I2R-NUS-MSRA at TAC 2011: Entity Linking Outline • I2R-NUS team at TAC • Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigate three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet Allocation (LDA) • Combination system • Offline combination with the MSRA team's system at the KB linking step Text Analysis Conference, November 14-15, 2011
I2R-NUS-MSRA at TAC 2011: Entity Linking Acronym Expansion - Motivation • Expanding an acronym from its context reduces the ambiguity of a name • E.g., TSE in Wikipedia refers to 33 entries, whereas Tokyo Stock Exchange is unambiguous. Text Analysis Conference, November 14-15, 2011
I2R-NUS-MSRA at TAC 2011: Entity Linking Step 1 – Find Expansion Candidates • Identifying Candidate Expansions (e.g. for ACM) Text Analysis Conference, November 14-15, 2011
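As a rough sketch of what candidate identification might look like, the snippet below scans each sentence for word windows whose initials spell the acronym. This is an assumed illustration, not the authors' exact procedure; the stop-word list and the matching rule are simplifications.

```python
import re

def candidate_expansions(acronym, text):
    """Collect word sequences whose initials could form the acronym.

    A minimal sketch: slide a window of len(acronym) content words over
    each sentence and keep windows whose leading characters spell the
    acronym. Swapped-letter cases are left to the ranking step.
    """
    stop = {"of", "the", "for", "and", "in", "at", "on"}  # assumed list
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        content = [w for w in words if w.lower() not in stop]
        n = len(acronym)
        for i in range(len(content) - n + 1):
            window = content[i:i + n]
            initials = "".join(w[0].upper() for w in window)
            if initials == acronym.upper():
                candidates.append(" ".join(window))
    return candidates

print(candidate_expansions("ACM", "He joined the Association for Computing Machinery."))
```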
I2R-NUS-MSRA at TAC 2011: Entity Linking Step 2 – Candidate Expansion Ranking • An SVM classifier ranks the candidates • Our SVM-based acronym expansion • can link acronyms to full forms that appear in different sentences of an article • Number of common characters between the acronym and the leading characters of the expansion's words • can handle acronyms with swapped letters • E.g. Communist Party of China vs. CCP • Sentence distance between the acronym and the expansion Text Analysis Conference, November 14-15, 2011
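The two features named above could be computed along these lines. Function and feature names are hypothetical; the character-overlap count is deliberately order-insensitive, so a swapped-letter acronym like CCP still matches Communist Party of China.

```python
from collections import Counter

def expansion_features(acronym, expansion, acro_sent, exp_sent):
    """Sketch of two ranking features for an (acronym, candidate) pair."""
    # Leading characters of the expansion's words.
    initials = Counter(w[0].upper() for w in expansion.split())
    # Common characters, ignoring order, so swapped-letter acronyms
    # (CCP vs. Communist Party of China -> C, P, C) still score highly.
    common = sum(min(n, initials[ch]) for ch, n in Counter(acronym.upper()).items())
    return {
        "common_chars": common,                       # overlap feature
        "sent_distance": abs(acro_sent - exp_sent),   # how far apart they occur
    }

print(expansion_features("CCP", "Communist Party of China", acro_sent=0, exp_sent=3))
```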
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work on Context Similarity • Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010 • Term Matching • However: 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence. 2) Michael Jordan is currently a full professor at the University of California, Berkeley. 3) Michael Jordan (born February, 1963) is a former American professional basketball player. 4) Michael Jordan wins NBA MVP of 91-92 season. • No term match between these contexts The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Our System - A Wikipedia-LDA Model • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence. • 2) Michael Jordan is currently a full professor at the University of California, Berkeley. • 3) Michael Jordan (born February, 1963) is a former American professional basketball player. • 4) Michael Jordan wins NBA MVP of 91-92 season. • Topics assigned by the model: 1) and 2) → Science; 3) and 4) → Basketball The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Wikipedia – LDA Model • [Figure: the model estimates P(word i | category j) and P(category i | document j) over Wikipedia documents] The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Wikipedia – LDA Model • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence. • 2) Michael Jordan is currently a full professor at the University of California, Berkeley. • 3) Michael Jordan (born February, 1963) is a former American professional basketball player. • 4) Michael Jordan wins NBA MVP of 91-92 season. The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
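Under such a model, context similarity can be computed between topic distributions rather than raw terms. The minimal sketch below uses plain LDA from scikit-learn over a toy stand-in corpus; the actual system trains on Wikipedia with its category structure, which plain LDA omits, and every corpus string here is an illustrative assumption.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for two candidate Wikipedia pages.
wiki_docs = [
    "machine learning research professor university artificial intelligence",
    "basketball player NBA season MVP team game",
]
# Query contexts with no term overlap between the two senses.
contexts = [
    "Michael Jordan is a leading researcher in machine learning.",
    "Michael Jordan wins NBA MVP of 91-92 season.",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(wiki_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Topic distributions for contexts and candidate pages; their cosine
# similarity serves as a semantic feature even when no terms match.
theta = lda.transform(vec.transform(contexts + wiki_docs))
print(cosine_similarity(theta[:2], theta[2:]))
```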
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work • Vector Space Model • Difficult to combine bag of words (BOW) with other features • Its performance also leaves room for improvement • Supervised Approaches • Using manually annotated training instances • Dredze et al., 2010; Zheng et al., 2010 • Using automatically generated training instances • Zhang et al., 2010 The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work • (News article) Obama Campaign Drops The George W. Bush Talking Point … • Auto-generated training instance (Zhang et al., 2010): the unambiguous mention "George W. Bush" in the article yields a labeled training instance for that entity The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Related Work • From "George W. Bush" articles • No positive instances are generated for "George H. W. Bush", "George P. Bush", or "George Washington Bush" • No negative instances are generated for "George W. Bush" • Such positive/negative training-instance distributions may differ from those of the truly ambiguous cases in the raw text collection • The distribution of the unambiguous mentions may also differ from that of the test data The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
The Approach in Our System A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection • An instance selection approach • Select an informative, representative, and diverse subset from the auto-generated data set • Reduce the effect of the distribution differences The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection Instance Selection • [Diagram, with a 2-D data set illustration showing the SVM hyperplane] Train an SVM classifier on a small initial data set, test it on the auto-generated data set, select informative, representative, and diverse instances, add the selected instances to the initial data set, and repeat The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
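As a rough illustration of this loop, the sketch below scores pool instances by their distance to the current SVM hyperplane (informativeness) and grows the training set iteratively. It is a plain margin-based selection loop with an assumed fixed batch size; the representativeness and diversity criteria and the batch-size-changing schedule from the paper are not modeled, and all function names are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_batch(clf, X_pool, batch_size):
    """Pick the pool instances closest to the SVM hyperplane."""
    margins = np.abs(clf.decision_function(X_pool))
    return np.argsort(margins)[:batch_size]

def instance_selection(X_init, y_init, X_pool, y_pool, rounds=5, batch_size=20):
    """Grow a small initial set by repeated margin-based selection."""
    X, y = X_init, y_init
    for _ in range(rounds):
        clf = LinearSVC().fit(X, y)                   # train on current set
        picked = select_batch(clf, X_pool, batch_size)
        X = np.vstack([X, X_pool[picked]])            # add selected instances
        y = np.concatenate([y, y_pool[picked]])
        keep = np.setdiff1d(np.arange(len(y_pool)), picked)
        X_pool, y_pool = X_pool[keep], y_pool[keep]   # shrink the pool
    return LinearSVC().fit(X, y)
```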
Spectral Clustering • Advantages over other clustering techniques • Globally optimized results • Efficient in time and space • Generally produces better results • Successful in many areas • Image segmentation • Gene expression clustering
Spectral Clustering • Eigendecomposition of the graph Laplacian: A = QΛQ⁻¹ • Dimensionality reduction • (Luxburg, 2006) • [Figure: a document-similarity graph partitioned into two clusters, one for George W. Bush and one for George H. W. Bush]
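A minimal sketch of the standard normalized spectral clustering recipe from von Luxburg's tutorial: build the normalized Laplacian from a document-similarity matrix, embed the documents with its smallest eigenvectors, and run k-means. This is the textbook procedure, not necessarily the exact variant used in the submission.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Cluster docs given a symmetric similarity matrix A into k groups."""
    d = A.sum(axis=1)                                  # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    # Eigendecomposition L = Q Lambda Q^-1; the k eigenvectors with the
    # smallest eigenvalues give a low-dimensional document embedding.
    _, Q = eigh(L)
    U = Q[:, :k]
    U /= np.linalg.norm(U, axis=1, keepdims=True)      # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```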
I2R-NUS-MSRA at TAC 2011: Entity Linking Hierarchical Agglomerative Clustering • Convert each doc into a feature vector over Wikipedia concepts, bag-of-words, and named entities • Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010): • this model shows good performance in Web People Search • In our work, the original query name, its Wikipedia redirect names, and its coreference-chain mentions are all treated as appearances of the query name in the text • Similarity scores: cosine similarity and overlap similarity Text Analysis Conference, November 14-15, 2011
I2R-NUS-MSRA at TAC 2011: Entity Linking Hierarchical Agglomerative Clustering • Docs referring to the same entity are clustered according to pairwise doc similarity scores • Start with singletons: each doc is its own cluster • Merge two clusters Ci and Cj into a new cluster Cij if there are docs D in Ci and D' in Cj with Sim(D, D') > γ, where γ = 0.25 • Calculate the similarity between the new cluster Cij and all remaining clusters, and repeat Text Analysis Conference, November 14-15, 2011
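A compact sketch of this merge rule, assuming a precomputed pairwise similarity matrix. It performs single-link merging with the γ threshold exactly as stated on the slide and leaves out any cluster-similarity recomputation details.

```python
def hac(sim, gamma=0.25):
    """Agglomerative clustering over a doc-similarity matrix `sim`.

    Merge clusters Ci, Cj whenever some pair (D, D') across them has
    Sim(D, D') > gamma; repeat until no merge is possible.
    """
    clusters = [{i} for i in range(len(sim))]        # start with singletons
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(sim[d][e] > gamma for d in clusters[i] for e in clusters[j]):
                    clusters[i] |= clusters.pop(j)   # merge Cj into Ci
                    merged = True
                    break
            if merged:
                break                                # rescan after each merge
    return clusters
```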
I2R-NUS-MSRA at TAC 2011: Entity Linking Latent Dirichlet Allocation (LDA) • LDA has been applied to many NLP tasks, such as summarization and text classification • In our approach, the learned topics represent the underlying entities behind the ambiguous names • Generative story: for each document, draw a topic distribution θ ~ Dirichlet(α); for each word position, draw a topic z ~ Multinomial(θ) and then a word w ~ Multinomial(φz) Text Analysis Conference, November 14-15, 2011
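If each learned topic stands for one underlying entity, NIL-query documents can be clustered by their dominant topic. A minimal sketch with scikit-learn's plain LDA; the number of topics is an assumed parameter here, and the submission's inference details may differ.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_nil_clusters(nil_docs, n_entities):
    """Cluster NIL-query documents by their most probable LDA topic.

    Assumes each topic corresponds to one underlying entity; in practice
    n_entities would need to be tuned or estimated.
    """
    X = CountVectorizer(stop_words="english").fit_transform(nil_docs)
    theta = LatentDirichletAllocation(n_components=n_entities,
                                      random_state=0).fit_transform(X)
    return theta.argmax(axis=1)   # doc i -> dominant topic, i.e. entity label
```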
I2R-NUS-MSRA at TAC 2011: Entity Linking Three Clustering Systems Combination • A three-class SVM classifier decides which clustering system to trust • Features: the scores given by the three systems. Combination with the MSRA team's system at the KB linking step • A binary SVM classifier decides which system to trust • Features: the scores given by the two systems Text Analysis Conference, November 14-15, 2011
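The selector can be realized as an ordinary multiclass SVM over per-query system scores. The toy data below is purely illustrative; the feature layout and labels are assumptions, not the teams' actual training set.

```python
from sklearn.svm import SVC

# Hypothetical sketch: each row holds the confidence scores the three
# clustering systems (SGP, HAC, LDA) assign to one query; the label says
# which system was correct on that training query.
X_train = [[0.9, 0.4, 0.3], [0.2, 0.8, 0.5], [0.1, 0.3, 0.7]]
y_train = [0, 1, 2]                        # 0=SGP, 1=HAC, 2=LDA

chooser = SVC().fit(X_train, y_train)      # three-class system selector
print(chooser.predict([[0.7, 0.2, 0.4]]))  # -> index of the system to trust
```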
I2R-NUS-MSRA at TAC 2011: Entity Linking Experiments with the Three Clustering Algorithms Text Analysis Conference, November 14-15, 2011
I2R-NUS-MSRA at TAC 2011: Entity Linking Submissions Text Analysis Conference, November 14-15, 2011
I2R-NUS-MSRA at TAC 2011: Entity Linking Conclusion • Incorporated the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011) • Acronym Expansion • Semantic Features • Instance Selection • Investigated three algorithms for NIL query clustering • Spectral Graph Partitioning (SGP) • Hierarchical Agglomerative Clustering (HAC) • Latent Dirichlet Allocation (LDA) Text Analysis Conference, November 14-15, 2011