Self-adjustable bootstrapping for Named Entity set expansion
Sushant Narsale (JHU), Satoshi Sekine (NYU)
Nail: Set (NE list) Expansion using bootstrapping
• Expand Named Entity sets for 150 Named Entity categories
• Self-adjustable bootstrapping
• Ngrams
Lexical Knowledge from Ngrams, July 30th, 2009
Our Task
• Input: seeds for 150 Named Entity categories
• Output: more examples like the seeds
• Motivation: "Creating lists of Named Entities on Web is critical for query analysis, document categorization and ad matching" (Pantel et al., Web-Scale Distributional Similarity and Entity Set Expansion)
Examples of 3 of the 150 categories
The 150-category Named Entity set
Bootstrapping
Start from a set of names (e.g. Presidents): Clinton, Bush, Putin, Chirac. To get more names like them, ask what they share.
• They share the same contexts in text:
• "President * said", "yesterday of President * in"
• "President * , the", "President * , who"
• The same contexts are also shared by other Presidents: Yeltsin, Zemin, Hussein, Obama
We need a scoring function to rank the candidates, and we need to set the number of contexts/examples to learn. A minimal sketch of one iteration follows.
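A minimal sketch of the loop above, assuming an n-gram search backend. The `corpus` object and its `contexts_of`/`fillers_of` methods are hypothetical stand-ins, not the authors' actual interface.

```python
from collections import Counter

def bootstrap_step(seeds, corpus, n_contexts=10):
    """One bootstrapping iteration: shared contexts -> new candidates."""
    # 1. Find contexts ("President * said") that the seed names appear in.
    context_counts = Counter()
    for seed in seeds:
        for ctx in corpus.contexts_of(seed):   # hypothetical corpus API
            context_counts[ctx] += 1
    # 2. Keep the contexts shared by the most seeds.
    top_contexts = [c for c, _ in context_counts.most_common(n_contexts)]
    # 3. Retrieve every name that fills the wildcard in those contexts.
    candidate_counts = Counter()
    for ctx in top_contexts:
        for name in corpus.fillers_of(ctx):    # hypothetical corpus API
            if name not in seeds:
                candidate_counts[name] += 1
    return top_contexts, candidate_counts

# e.g. bootstrap_step({"Clinton", "Bush", "Putin", "Chirac"}, corpus)
# might surface Yeltsin, Zemin, Hussein, Obama as high-count candidates.
```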
Problem
Different NE categories need different parameter settings in bootstrapping:
• "Academic" has a small number of strong contexts ("Department of … at")
• "Company" has a large number of weak contexts ("… was bankrupted", "… hires")
• "Award" has a strong suffix feature ("… Award/Prize")
• "Nationality" has a specific length (1), while "Book" has a wide length variation
Self-Adjustable Bootstrapping
We need to find the best parameter setting for each category.
Idea: bootstrapping + a machine-learning approach.
Use 80% of the seeds for training (train-data) and 20% of the seeds to optimize the scoring functions and thresholds (dev-data); a sketch of the search follows.
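A sketch of the self-adjusting search under stated assumptions: enumerate candidate parameter settings and keep whichever scores highest on the 20% dev seeds. The `run_bootstrap` and `trr` callables stand in for the pieces described on the following slides; all names here are ours, not the authors'.

```python
import itertools
import random

def tune(seeds, context_formulas, target_formulas, run_bootstrap, trr,
         n_context_options=(5, 10, 20)):
    """run_bootstrap(train_seeds, params) -> ranked candidate list."""
    seeds = list(seeds)
    random.shuffle(seeds)
    split = int(0.8 * len(seeds))           # 80% train / 20% dev
    train, dev = seeds[:split], seeds[split:]

    best_score, best_params = float("-inf"), None
    for params in itertools.product(context_formulas, target_formulas,
                                    n_context_options):
        ranked = run_bootstrap(train, params)  # ranked candidates
        score = trr(ranked, dev)               # Total Reciprocal Rank (later slide)
        if score > best_score:
            best_score, best_params = score, params
    return best_params
```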
Our Approach
Parameters:
• Context: formulas to score contexts and targets; the number of contexts to be used
• Suffix/Prefix: e.g. Suffix = Awards for the award categories
• Length: a bias on the lengths of the retrieved entity set
The three functions are combined by weighted linear interpolation.
Optimization function: Total Reciprocal Rank (TRR).
1. Scoring formula’s We observed that different scoring formula’s work best for different categories • Scoring Context • Fi / CF • Ft / log(CF) • Fi * log(Ft) / CF • log(Fi) * Ft / CF • log(Fi) * Ft / log(CF) • Scoring Targets • Fi / CF • Ft / log(CF) • Ft * log(Fi) /CF • log(Fi)*Ft / CF • log(Fi)*Ft / log(CF) Fi = Co-occurrence frequency of targets and the context Ft = Number of target types co-occurred with the context CF = Corpus frequency of the context Lexical Knowledge from Ngrams
2. Prefix/Suffix
Some categories have a strong prefix or suffix feature (e.g. Suffix = Awards for the award categories); a sketch of one possible form follows.
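The table on this slide is not preserved in the export, so the following is only a guess at the feature's shape, assuming the suffix is learned as the most common final token of the seeds; every name here is ours.

```python
from collections import Counter

def best_suffix(seeds, min_share=0.5):
    """Most common final token of the seeds, if frequent enough."""
    last_tokens = Counter(s.split()[-1] for s in seeds)
    token, count = last_tokens.most_common(1)[0]
    return token if count / len(seeds) >= min_share else None

def suffix_score(candidate, suffix):
    return 1.0 if suffix and candidate.endswith(suffix) else 0.0

# e.g. best_suffix(["Nobel Prize", "Academy Award", "Booker Prize"])
# returns "Prize", which then boosts candidates ending in "Prize".
```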
3. Length
Set a bias for the length of the retrieved entities based on the distribution of lengths over the seed words. The slides show length histograms for Nationality (concentrated at length 1), Bird, and Book (a wide spread of lengths). A sketch follows.
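A minimal sketch, assuming the bias is simply the empirical distribution of token lengths over the seeds; the helper names are ours.

```python
from collections import Counter

def length_bias(seeds):
    """Empirical distribution of token lengths over the seed words."""
    lengths = Counter(len(s.split()) for s in seeds)
    total = sum(lengths.values())
    return {n: c / total for n, c in lengths.items()}

def length_score(candidate, bias):
    return bias.get(len(candidate.split()), 0.0)

# Nationality seeds are almost all length 1, so a 3-token candidate gets
# (near-)zero bias; Book seeds spread their mass over many lengths.
```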
Optimization Function: TRR (Total Reciprocal Rank)
We want a higher score for parameter settings that retrieve the held-out test examples near the top of the retrieved set.
TRR = sum of 1/rank over the retrieved test examples.
• Test examples at ranks 1, 2, 8: score = 1/1 + 1/2 + 1/8 = 1.625
• Test examples at ranks 2, 3, 4, 6: score = 1/2 + 1/3 + 1/4 + 1/6 = 1.25
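The metric is small enough to write out exactly; this function reproduces the worked examples above.

```python
def trr(ranked_candidates, dev_seeds):
    """Total Reciprocal Rank of the dev seeds in a ranked candidate list."""
    dev = set(dev_seeds)
    return sum(1.0 / rank
               for rank, cand in enumerate(ranked_candidates, start=1)
               if cand in dev)

# Dev seeds retrieved at ranks 1, 2, 8    -> 1/1 + 1/2 + 1/8 = 1.625
# Dev seeds retrieved at ranks 2, 3, 4, 6 -> 1/2 + 1/3 + 1/4 + 1/6 = 1.25
```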
Experiment Data
• The dataset consists of seeds for all 150 NE categories.
• The number of seeds varies from 20 to 20,000, extracted from Wikipedia list pages and other list pages (Sekine et al. 2004).
• Program: an n-gram search engine over Wikipedia, with 1.7 billion tokens and 1.2 billion 7-grams.
Optimization Result
Different parameter settings give the best results for different categories. The last line of the table is the single setting that is best for all categories combined (the baseline).
Results
• Recall: the percentage of held-out seed examples found in the top 2,000 retrieved targets.
• Precision: the percentage of correct targets in a random sample of 100 drawn from the top 2,000. A sketch of both metrics follows.
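A sketch of the two metrics as defined above. In practice the precision sample is judged by hand, so `is_correct` is a hypothetical oracle, not part of the authors' system.

```python
import random

def recall_at_2000(ranked, held_out_seeds):
    """Fraction of held-out seeds appearing in the top 2,000 targets."""
    top = set(ranked[:2000])
    return sum(s in top for s in held_out_seeds) / len(held_out_seeds)

def precision_sample(ranked, is_correct, k=100):
    """Fraction of correct targets in a random sample of k from the top 2,000."""
    sample = random.sample(ranked[:2000], k)
    return sum(is_correct(t) for t in sample) / k
```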
Future Work
• More features: phrase clustering, genre information, longer dependencies
• Better optimization
• Start with a smaller number of seeds
• Other targets (e.g. relations)
• Make a tool (like Google Sets)
Using Phrase Clusters
Airport Cluster
1,301 "airport" phrases in Cluster #287, for example:
• Chicago 's O'Hare Airport
• Ben Gurion International Airport
• Little Rock National Airport
• London 's Heathrow airport
• Austin airport
• Burbank airport
• London 's Heathrow Airport
• Memphis airport
• La Guardia airport
• Corpus Christi International Airport
• Boston 's Logan Airport
• Cincinnati/Northern Kentucky International Airport
• Sea-Tac airport
Conclusion
• A solution to the problem that different methods work best for different categories
• A large dictionary of Named Entities in 150 categories
sushant@jhu.edu, sekine@cs.nyu.edu