
Self-adjustable bootstrapping for Named Entity set expansion




Presentation Transcript


  1. Self-adjustable bootstrapping for Named Entity set expansion
     Sushant Narsale (JHU), Satoshi Sekine (NYU)
     Lexical Knowledge from Ngrams, July 30th, 2009

  2. Nail: Set (NE list) Expansion using bootstrapping
     • Expand Named Entity sets for 150 Named Entity categories
     • Self-adjustable bootstrapping
     • N-grams

  3. Our Task
     • Input: seeds for 150 Named Entity categories
     • Output: more examples like the seeds
     • Motivation: “Creating lists of Named Entities on Web is critical for query analysis, document categorization and ad matching” (Web-Scale Distributional Similarity and Entity Set Expansion, Pantel et al.)

  4. Examples of 3 categories from the 150

  5. 150-category Named Entities

  6. Bootstrapping
     • Get more of a similar set of names (e.g., Presidents): Clinton, Bush, Putin, Chirac
     • They must share something: they share the same contexts in texts
        • “President * said yesterday”, “of President * in”
        • “President * , the”, “President * , who”
     • The contexts may be shared by other Presidents: Yeltsin, Zemin, Hussein, Obama
     • We need a scoring function to score the candidates
     • We need to set the number of contexts/examples to learn
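
A minimal sketch of one such bootstrapping round in Python, assuming the n-grams are pre-split into (left context, slot filler, right context); the helper names and the placeholder scoring here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one bootstrapping round: seeds -> shared contexts -> candidates.
from collections import Counter

def bootstrap_round(seeds, ngrams, n_contexts=10, n_candidates=20):
    """Expand a seed set using contexts shared by the seeds in an n-gram corpus."""
    # 1. Collect contexts (the n-gram with its middle slot wildcarded) that seeds fill.
    context_counts = Counter()
    for left, filler, right in ngrams:
        if filler in seeds:
            context_counts[(left, right)] += 1

    # 2. Keep the top contexts (placeholder score: number of seed occurrences).
    top_contexts = {c for c, _ in context_counts.most_common(n_contexts)}

    # 3. Collect new fillers of those contexts as candidates (placeholder score: count).
    candidate_counts = Counter()
    for left, filler, right in ngrams:
        if (left, right) in top_contexts and filler not in seeds:
            candidate_counts[filler] += 1
    return [c for c, _ in candidate_counts.most_common(n_candidates)]

# Toy usage with the slide's "President *" examples.
ngrams = [
    ("President", "Clinton", "said yesterday"), ("President", "Bush", "said yesterday"),
    ("of President", "Putin", "in"), ("of President", "Chirac", "in"),
    ("President", "Obama", "said yesterday"), ("of President", "Yeltsin", "in"),
]
print(bootstrap_round({"Clinton", "Bush", "Putin", "Chirac"}, ngrams))
```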

  7. Problem
     Different NE categories need different parameter settings in bootstrapping:
     • “Academic” has a small number of strong contexts (Department of … at)
     • “Company” has a large number of weak contexts (… was bankrupted, … hires)
     • “Award” has a strong suffix feature (… Award/Prize)
     • “Nationality” has a specific length (1), while “Book” has a wide length variation

  8. Self-Adjustable bootstrapping
     • We need to find the best parameter setting for each category
     • Idea: bootstrapping + machine learning
     • Approach: use 80% of the seeds for training (train-data) and 20% of the seeds to optimize the functions and thresholds (dev-data)
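
A hedged sketch of the train/dev idea: split the seeds 80/20, run the expansion under each candidate parameter setting on the train split, and keep the setting that best recovers the dev seeds. The call signatures of `expand_fn` and `score_fn` are assumptions for illustration, not the paper's code.

```python
import random

def tune_parameters(seeds, expand_fn, score_fn, param_grid, dev_fraction=0.2):
    """Pick the parameter setting whose ranked output best recovers the dev seeds.

    expand_fn(train_seeds, **params) -> ranked candidate list (one bootstrapping run)
    score_fn(ranked, dev_seeds)      -> float (e.g. Total Reciprocal Rank)
    """
    seeds = list(seeds)
    random.Random(0).shuffle(seeds)
    cut = int(len(seeds) * (1 - dev_fraction))       # 80% train / 20% dev split
    train, dev = set(seeds[:cut]), set(seeds[cut:])

    best_score, best_params = float("-inf"), None
    for params in param_grid:                        # e.g. a list of dicts of settings
        ranked = expand_fn(train, **params)
        score = score_fn(ranked, dev)
        if score > best_score:
            best_score, best_params = score, params
    return best_params
```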

  9. Our Approach: Parameters
     • Context: formulas to score contexts and targets; the number of contexts to be used
     • Suffix/Prefix: e.g., Suffix = Awards for the award categories
     • Length: a bias on the lengths of the retrieved entity set
     • Weighted linear interpolation of the three functions
     • Optimization function: Total Reciprocal Rank (TRR)
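
The "weighted linear interpolation of the three functions" presumably combines the context-formula score, the prefix/suffix score, and the length bias per candidate; a sketch along those lines, where the component scorers and the weights are assumptions for illustration only.

```python
def combined_score(candidate, context_score, affix_score, length_score,
                   weights=(0.6, 0.2, 0.2)):
    """Weighted linear interpolation of the three per-candidate scores;
    the weights are per-category parameters tuned on the dev seeds."""
    w_ctx, w_affix, w_len = weights
    return (w_ctx * context_score(candidate)
            + w_affix * affix_score(candidate)
            + w_len * length_score(candidate))

# Illustrative component scorers (assumed forms, not the authors' definitions):
award_suffix = lambda c: 1.0 if c.endswith(("Award", "Prize")) else 0.0   # suffix feature
single_token = lambda c: 1.0 if len(c.split()) == 1 else 0.0              # length bias
```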

  10. Our Approach (outline repeated: 1. scoring formulas)

  11. 1. Scoring formulas
     We observed that different scoring formulas work best for different categories.
     Scoring contexts:
     • Fi / CF
     • Ft / log(CF)
     • Fi * log(Ft) / CF
     • log(Fi) * Ft / CF
     • log(Fi) * Ft / log(CF)
     Scoring targets:
     • Fi / CF
     • Ft / log(CF)
     • Ft * log(Fi) / CF
     • log(Fi) * Ft / CF
     • log(Fi) * Ft / log(CF)
     where Fi = co-occurrence frequency of the target and the context, Ft = number of target types that co-occurred with the context, and CF = corpus frequency of the context.
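
The context-scoring formulas above can be written out directly as functions of Fi, Ft, and CF (the target formulas are the analogous set); only the formulas are from the slide, the surrounding code is a sketch.

```python
import math

# Fi = co-occurrence frequency of the target and the context
# Ft = number of target types that co-occurred with the context
# CF = corpus frequency of the context
CONTEXT_FORMULAS = {
    "Fi/CF":              lambda Fi, Ft, CF: Fi / CF,
    "Ft/log(CF)":         lambda Fi, Ft, CF: Ft / math.log(CF),
    "Fi*log(Ft)/CF":      lambda Fi, Ft, CF: Fi * math.log(Ft) / CF,
    "log(Fi)*Ft/CF":      lambda Fi, Ft, CF: math.log(Fi) * Ft / CF,
    "log(Fi)*Ft/log(CF)": lambda Fi, Ft, CF: math.log(Fi) * Ft / math.log(CF),
}

# Example: score one context (made-up counts) under each candidate formula.
Fi, Ft, CF = 120, 15, 5000
for name, formula in CONTEXT_FORMULAS.items():
    print(f"{name:20s} {formula(Fi, Ft, CF):.4f}")
```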

  12. Our Approach (outline repeated: 2. Prefix/Suffix)

  13. 2. Prefix/Suffix: e.g., Suffix = Award for the award categories

  14. Our Approach (outline repeated: 3. Length)

  15. 3. Length: set a bias for the length of the retrieved entity set based on the distribution of lengths over the seed words (length distribution shown for Nationality)

  16. 3. Length (length distribution shown for Bird)

  17. 3. Length (length distribution shown for Book)

  18. 3. Length (bias based on the length distribution over the seed words)
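
One way to realize this length bias, assuming it is read off the token-length distribution of the seed words; the smoothing is an added assumption so that unseen lengths are not scored as exactly zero.

```python
from collections import Counter

def length_bias(seeds, smoothing=0.01):
    """Return a scorer biased toward candidate lengths common among the seeds."""
    counts = Counter(len(s.split()) for s in seeds)    # token-length histogram of seeds
    total = sum(counts.values())
    return lambda cand: (counts[len(cand.split())] + smoothing) / (total + smoothing)

# Nationality seeds are almost all one token, so single-token candidates get nearly
# all of the mass; book-title seeds would spread it over many lengths.
bias = length_bias(["French", "German", "Japanese", "Brazilian"])
print(bias("Spanish"), bias("The Old Man and the Sea"))
```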

  19. Our Approach (outline repeated: optimization function)

  20. Optimization Function: TRR (Total Reciprocal Rank)
     • We want a higher score for parameter settings that retrieve our held-out examples near the top of the retrieved set.
     • TRR = Σ 1/rank(e), summed over the held-out examples e
     • Example: examples retrieved at ranks 1, 2, 8: score = 1/1 + 1/2 + 1/8 = 1.625
     • Example: examples retrieved at ranks 2, 3, 4, 6: score = 1/2 + 1/3 + 1/4 + 1/6 = 1.25
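
Total Reciprocal Rank sums 1/rank over the held-out examples that appear in the retrieved ranked list; a short sketch of the computation, reproducing the slide's first example.

```python
def total_reciprocal_rank(ranked, held_out):
    """Sum of 1/rank (1-based) over held-out examples found in the ranked list;
    examples that are not retrieved contribute nothing."""
    return sum(1.0 / (i + 1) for i, cand in enumerate(ranked) if cand in held_out)

# Held-out examples retrieved at ranks 1, 2, and 8: 1/1 + 1/2 + 1/8 = 1.625
ranked = ["a", "b", "x", "x", "x", "x", "x", "c"]
print(total_reciprocal_rank(ranked, {"a", "b", "c"}))
```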

  21. Experiment
     • Data: seeds for all 150 NE categories; the number of seeds per category varies from 20 to 20,000, extracted from Wikipedia list pages and other list pages (Sekine et al. 2004)
     • Program: n-gram search engine for Wikipedia, with 1.7 billion tokens and 1.2 billion 7-grams

  22. Optimization Result
     • Different parameter settings give the best results for different categories.
     • The last line of the table is the single best setting for all categories combined (the baseline).

  23. Results
     • Recall: percentage of held-out seed examples found in the top 2,000 retrieved targets
     • Precision: percentage of correct targets in a random sample of 100 from the top 2,000

  24. Future Work
     • More features: phrase clustering, genre information, longer dependencies
     • Better optimization
     • Start with a smaller number of seeds
     • Other targets (e.g., relations)
     • Make a tool (like Google Sets)

  25. Using Phrase Clusters

  26. Airport Cluster
     • 1,301 “airport” phrases in Cluster #287, for example:
     • Chicago 's O'Hare Airport
     • Ben Gurion International Airport
     • Little Rock National Airport
     • London 's Heathrow airport
     • Austin airport
     • Burbank airport
     • London 's Heathrow Airport
     • Memphis airport
     • La Guardia airport
     • Corpus Christi International Airport
     • Boston 's Logan Airport
     • Cincinnati/Northern Kentucky International Airport
     • Sea-Tac airport

  27. Conclusion
     • A solution to the problem that “different methods work best for different categories”
     • A large dictionary of Named Entities in 150 categories
     • Contact: sushant@jhu.edu, sekine@cs.nyu.edu
