Entity Linking and Acronym Expansion for Text Analysis

This presentation describes the I2R-NUS-MSRA entity linking system at the Text Analysis Conference (TAC) 2011. It incorporates acronym expansion, semantic features and instance selection, investigates three algorithms for NIL query clustering, and proposes a combination system. It also draws on a Wikipedia-LDA model for entity linking with batch size changing instance selection.

Presentation Transcript


  1. I2R-NUS-MSRA at TAC 2011: Entity Linking
     Wei Zhang1, Jian Su2, Bin Chen2, Wenting Wang2, Zhiqiang Toh2, Yanchuan Sim2, Yunbo Cao3, Chin Yew Lin3 and Chew Lim Tan1
     1 National University of Singapore
     2 Institute for Infocomm Research
     3 Microsoft Research Asia
     Text Analysis Conference, November 14-15, 2011

  2. Outline
     • I2R-NUS team at TAC
       • Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
         • Acronym Expansion
         • Semantic Features
         • Instance Selection
       • Investigate three algorithms for NIL query clustering
         • Spectral Graph Partitioning (SGP)
         • Hierarchical Agglomerative Clustering (HAC)
         • Latent Dirichlet Allocation (LDA)
     • Combination system
       • Offline combination with the system of the MSRA team at the KB linking step

  4. Acronym Expansion - Motivation
     • Expanding an acronym from its context reduces the ambiguity of a name
     • E.g. TSE refers to 33 entries in Wikipedia, whereas Tokyo Stock Exchange is unambiguous

  5. Step 1 – Find Expansion Candidates
     • Identifying candidate expansions (e.g. for ACM)

  6. Step 2 – Candidate Expansions Ranking
     • Using an SVM classifier to rank the candidates
     • Our SVM-based acronym expansion
       • can link acronyms and full strings that appear in different sentences of an article
       • Number of common characters between the acronym and the leading characters of the expansion
       • can handle acronyms with swapped letters
         • E.g. Communist Party of China vs. CCP
       • Sentence distance between acronym and expansion
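
The two features named on this slide can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, sentences are assumed to be pre-split, and matching is plain substring search. Counting shared letters as a multiset (rather than in order) is what lets CCP match Communist Party of China despite the swapped letters.

```python
import re
from collections import Counter


def common_leading_chars(acronym, expansion):
    """Count acronym letters that also occur as word-initial letters of the
    expansion, ignoring order (multiset intersection)."""
    initials = Counter(word[0].upper() for word in re.findall(r"[A-Za-z]+", expansion))
    letters = Counter(ch.upper() for ch in acronym if ch.isalpha())
    return sum((letters & initials).values())


def sentence_distance(sentences, acronym, expansion):
    """Absolute distance, in sentences, between the first sentence containing
    the acronym and the first one containing the expansion (None if either
    string is absent)."""
    acr_idx = next((i for i, s in enumerate(sentences) if acronym in s), None)
    exp_idx = next((i for i, s in enumerate(sentences) if expansion in s), None)
    if acr_idx is None or exp_idx is None:
        return None
    return abs(acr_idx - exp_idx)


# Hypothetical three-sentence article, made up for illustration.
article = [
    "The Communist Party of China met today.",
    "Analysts were surprised.",
    "The CCP issued a statement.",
]
swapped = common_leading_chars("CCP", "Communist Party of China")
distance = sentence_distance(article, "CCP", "Communist Party of China")
```

Values like these would then be assembled into a feature vector for each (acronym, candidate) pair and passed to the ranking classifier.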

  8. A Wikipedia-LDA Model for Entity Linking with Batch Size Changing Instance Selection
     Related Work on Context Similarity
     • Zhang et al., 2010; Zheng et al., 2010; Dredze et al., 2010
       • Term Matching
     • However,
       1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
       2) Michael Jordan is currently a full professor at the University of California, Berkeley.
       3) Michael Jordan (born February, 1963) is a former American professional basketball player.
       4) Michael Jordan wins NBA MVP of 91-92 season.
       No Term Match
     The 5th International Joint Conference on Natural Language Processing, November 8-13, 2011, Chiang Mai, Thailand
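
One way to read the "No Term Match" label is that sentences about the same person share no content words once the query name is set aside. The sketch below checks this on the four example sentences. The stop word list and function names are assumptions made for illustration, and the tokenizer is a plain regular expression rather than anything the authors describe.

```python
import re

# Assumption: a tiny hand-picked stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "of", "the"}


def content_terms(sentence, query_name):
    """Lower-cased word tokens minus stop words and the query name tokens."""
    name_tokens = set(re.findall(r"\w+", query_name.lower()))
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return tokens - STOP_WORDS - name_tokens


def shared_terms(first, second, query_name):
    """Content terms that two sentences have in common."""
    return content_terms(first, query_name) & content_terms(second, query_name)


S1 = "Michael Jordan is a leading researcher in machine learning and artificial intelligence."
S2 = "Michael Jordan is currently a full professor at the University of California, Berkeley."
S3 = "Michael Jordan (born February, 1963) is a former American professional basketball player."
S4 = "Michael Jordan wins NBA MVP of 91-92 season."

researcher_overlap = shared_terms(S1, S2, "Michael Jordan")
player_overlap = shared_terms(S3, S4, "Michael Jordan")
```

Both overlaps come out empty, so a pure term-matching similarity gives sentences 1 and 2 (or 3 and 4) no evidence that they refer to the same entity.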

  9. Our System - A Wikipedia-LDA model
     • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
     • 2) Michael Jordan is currently a full professor at the University of California, Berkeley.
     • 3) Michael Jordan (born February, 1963) is a former American professional basketball player.
     • 4) Michael Jordan wins NBA MVP of 91-92 season.
     Topic: Science
     Topic: Basketball

  10. Wikipedia – LDA Model
      P(word i | category j)
      P(category i | document j)
      Diagram labels: Document, Document, …
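
Read as a topic model, the two distributions on this slide combine into a mixture over categories. The following is a sketch of that standard decomposition; the symbols w (word), d (document), c_j (category) and K (number of categories) are notation assumed here, not taken from the slides.

```latex
P(w \mid d) = \sum_{j=1}^{K} P(w \mid c_j)\, P(c_j \mid d)
```

Under this reading, each document is summarized by its K-dimensional vector of P(c_j | d) values, so two contexts can be compared by their category distributions even when they share no words.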

  11. Wikipedia – LDA Model
      • 1) Michael Jordan is a leading researcher in machine learning and artificial intelligence.
      • 2) Michael Jordan is currently a full professor at the University of California, Berkeley.
      • 3) Michael Jordan (born February, 1963) is a former American professional basketball player.
      • 4) Michael Jordan wins NBA MVP of 91-92 season.

  13. Related Work
      • Vector Space Model
        • Difficult to combine bag of words (BOW) with other features
        • Performance needs to be improved
      • Supervised Approaches
        • Using manually annotated training instances
          • Dredze et al., 2010; Zheng et al., 2010
        • Using automatically generated training instances
          • Zhang et al., 2010

  14. Related Work
      • (News Article) Obama Campaign Drops The George W. Bush Talking Point …
      • Auto-generate training instance (Zhang et al., 2010)

  15. Related Work
      • From “George W. Bush” articles
        • No positive instances are generated for “George H. W. Bush”, “George P. Bush” and “George Washington Bush”
        • No negative instances are generated for “George W. Bush”
      • Such positive/negative training instance distributions may differ from those of the original ambiguous cases in the raw text collection
      • The distribution of the unambiguous mentions may not be the same in the test data

  16. The Approach in Our System
      • An instance selection approach
        • Select an informative, representative, and diverse subset from the auto-generated data set
        • Reduce the effect of the distribution differences

  17. Instance Selection
      (Diagram: 2-D data set illustration with the SVM hyperplane)
      • Small initial data set: training → SVM classifier
      • SVM classifier: test on → auto-generated data set
      • Select informative, representative and diverse instances
      • Add these selected instances to the initial data set
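
One round of the loop in this diagram might look like the sketch below. It is a simplified stand-in, not the authors' algorithm: "informative" is taken to mean close to the current SVM hyperplane, "diverse" is enforced with a cosine threshold against already selected instances, and the "representative" criterion is left out. The function names, the `max_sim` threshold and the toy numbers are all hypothetical, and the margins are assumed to come from an SVM trained elsewhere.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_batch(candidates, margins, batch_size, max_sim=0.9):
    """Pick up to batch_size candidate indices.

    candidates: feature vectors from the auto-generated data set
    margins:    signed distance of each candidate from the current hyperplane
    """
    # Informative first: smallest absolute margin = least certain prediction.
    order = sorted(range(len(candidates)), key=lambda i: abs(margins[i]))
    chosen = []
    for i in order:
        if len(chosen) == batch_size:
            break
        # Diverse: skip candidates too similar to one already chosen.
        if all(cosine(candidates[i], candidates[j]) < max_sim for j in chosen):
            chosen.append(i)
    return chosen


# Toy 2-D data, made up for illustration.
pool = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [1.0, 1.0]]
pool_margins = [0.1, 0.05, 0.2, 2.0]
batch = select_batch(pool, pool_margins, batch_size=2)
```

The selected instances would then be added to the initial data set and the classifier retrained. Since `batch_size` is a plain parameter, it can be varied from round to round.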

  19. Spectral Clustering
      • Advantages over other clustering techniques
        • Globally optimized results
        • Efficient in time and space
        • Generally produces a better result
      • Success in many areas
        • Image segmentation
        • Gene expression clustering

  20. Spectral Clustering
      • Eigen decomposition on graph Laplacian
      • Dimensionality reduction
      • (Luxburg, 2006)
      $A = Q \Lambda Q^{-1}$
      Graph node labels: George W. Bush, George H.W. Bush
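
For reference, the decomposition on this slide can be written out as follows. This is a sketch that assumes the unnormalized graph Laplacian; the slides do not say which variant was used. W (pairwise document similarity matrix), D (degree matrix), L and k are notation introduced here, with L playing the role of the matrix the slide calls A.

```latex
D_{ii} = \sum_{j} W_{ij}, \qquad L = D - W, \qquad L = Q \Lambda Q^{-1}
```

When W is symmetric, so is L, and Q can be chosen orthogonal so that Q^{-1} = Q^T. Keeping only the eigenvectors for the k smallest eigenvalues gives each document a k-dimensional representation, which is the dimensionality reduction step. The documents are then clustered in that reduced space.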

  21. Hierarchical Agglomerative Clustering
      • Convert a doc into a feature vector: Wikipedia concepts, bag-of-words and named entities
      • Estimate the weight of each feature using the Query Relevance Weighting Model (Long and Shi, 2010)
        • This model shows good performance in Web People Search
        • In our work, the original query name, its Wikipedia redirected names and its coreference chain mentions are all considered as appearances of the query name in the text
      • Similarity scores: cosine similarity and overlap similarity
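
The two similarity scores can be sketched over sparse feature vectors stored as dictionaries. The slides do not define "overlap similarity"; the version below assumes the common overlap coefficient (shared features divided by the size of the smaller feature set), and the feature weights in the example are made up rather than produced by the Query Relevance Weighting Model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors (feature -> weight)."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlap_similarity(u, v):
    """Overlap coefficient on the sets of features with non-zero weight.
    Assumption: this is one common definition, not confirmed by the slides."""
    features_u = {f for f, w in u.items() if w}
    features_v = {f for f, w in v.items() if w}
    if not features_u or not features_v:
        return 0.0
    return len(features_u & features_v) / min(len(features_u), len(features_v))


# Hypothetical weighted features for two documents.
doc_a = {"basketball": 2.0, "nba": 1.0}
doc_b = {"nba": 1.0}
cos_ab = cosine_similarity(doc_a, doc_b)
overlap_ab = overlap_similarity(doc_a, doc_b)
```

The overlap coefficient is less sensitive to document length than cosine similarity, since a short document fully contained in a longer one still scores 1.0.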

  22. Hierarchical Agglomerative Clustering
      • Docs referring to the same entity are clustered according to doc pair-wise similarity scores
      • Start with singletons: each doc is a cluster
      • If there are two docs D and D' in clusters Ci and Cj respectively:
        • The two clusters Ci and Cj are merged to form a new cluster Cij if Sim(D, D') > γ, with γ = 0.25
        • Calculate the similarity between the new cluster Cij and all remaining clusters
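
The merge rule on this slide reads like single-link agglomerative clustering with a stopping threshold. The sketch below follows that reading and should be taken as an assumption about the linkage rather than a confirmed detail. It expects a precomputed symmetric similarity matrix, and the function name and toy matrix are hypothetical.

```python
def single_link_hac(sim, gamma=0.25):
    """Threshold-based agglomerative clustering.

    sim:   symmetric matrix (list of lists) of pairwise document similarities
    gamma: two clusters merge only if some cross-cluster pair exceeds it
    """
    # Start with singletons: each document is its own cluster.
    clusters = [{i} for i in range(len(sim))]
    while len(clusters) > 1:
        best_score, best_pair = gamma, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: similarity of the closest cross-cluster pair.
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link > best_score:
                    best_score, best_pair = link, (a, b)
        if best_pair is None:
            break  # no pair of clusters exceeds gamma
        a, b = best_pair
        clusters[a] |= clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return sorted(sorted(cluster) for cluster in clusters)


# Toy similarity matrix for three documents, made up for illustration.
toy_sim = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
toy_clusters = single_link_hac(toy_sim)
```

With the slide's threshold of 0.25, documents 0 and 1 merge and document 2 stays on its own, because its best link to the merged cluster (0.2) falls below the threshold.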

  23. Latent Dirichlet Allocation (LDA)
      • LDA has been applied to many NLP tasks, such as summarization and text classification
      • In our approach, the learned topics can represent the underlying entities of the ambiguous names
      • Generative story:
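
The figure that followed "Generative story" did not survive in this transcript. For orientation only, the standard smoothed LDA generative process is sketched below; whether the authors used exactly this variant is an assumption, and the symbols (K topics, hyperparameters α and β, topic-word distributions φ, document-topic distributions θ) are conventional notation rather than the slide's own.

```latex
\begin{aligned}
&\text{for each topic } k = 1, \dots, K: && \phi_k \sim \mathrm{Dirichlet}(\beta) \\
&\text{for each document } d: && \theta_d \sim \mathrm{Dirichlet}(\alpha) \\
&\quad \text{for each word position } n \text{ in } d: && z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \quad w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})
\end{aligned}
```

In the NIL clustering setting described on this slide, each learned topic would stand for one underlying entity, so a document could be assigned to the entity whose topic has the highest weight in θ_d.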

  24. Three Clustering Systems Combination
      • Three-class SVM classifier to decide which system to trust
      • Features: scores given by the three systems
      Combine with the system of the MSRA team at the KB linking step
      • Binary SVM classifier to decide which system to trust
      • Features: scores given by the two systems

  25. Experiment for Three Clustering Algorithms

  26. Submissions

  27. Conclusion
      • Incorporate the new technologies proposed in our recent papers (IJCAI 2011, IJCNLP 2011)
        • Acronym Expansion
        • Semantic Features
        • Instance Selection
      • Investigate three algorithms for NIL query clustering
        • Spectral Graph Partitioning (SGP)
        • Hierarchical Agglomerative Clustering (HAC)
        • Latent Dirichlet Allocation (LDA)