Self-adjustable bootstrapping for Named Entity set expansion
Sushant Narsale (JHU), Satoshi Sekine (NYU)
Nail: Set (NE list) Expansion using bootstrapping
• Expand Named Entity sets for 150 Named Entity categories
• Self-adjustable bootstrapping
• Ngrams
Lexical Knowledge from Ngrams, July 30th, 2009
Our Task
• Input: seeds for 150 Named Entity categories
• Output: more examples like the seeds
• Motivation: "Creating lists of Named Entities on Web is critical for query analysis, document categorization and ad matching" (Pantel et al., Web-Scale Distributional Similarity and Entity Set Expansion)
Examples of 3 of the 150 categories
The 150-category Named Entity set
Bootstrapping
Start from a set of names (e.g. Presidents): Clinton, Bush, Putin, Chirac. To get more names like them, ask what they share.
• They share the same contexts in text:
• "President * said", "yesterday of President * in"
• "President * , the", "President * , who"
• The same contexts are also shared by other Presidents: Yeltsin, Zemin, Hussein, Obama
We need a scoring function to rank the candidates, and we need to set the number of contexts/examples to learn. A minimal sketch of one iteration follows.
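A minimal sketch of the loop above, assuming an n-gram search backend. The `corpus` object and its `contexts_of`/`fillers_of` methods are hypothetical stand-ins, not the authors' actual interface.

```python
from collections import Counter

def bootstrap_step(seeds, corpus, n_contexts=10):
    """One bootstrapping iteration: shared contexts -> new candidates."""
    # 1. Find contexts ("President * said") that the seed names appear in.
    context_counts = Counter()
    for seed in seeds:
        for ctx in corpus.contexts_of(seed):   # hypothetical corpus API
            context_counts[ctx] += 1
    # 2. Keep the contexts shared by the most seeds.
    top_contexts = [c for c, _ in context_counts.most_common(n_contexts)]
    # 3. Retrieve every name that fills the wildcard in those contexts.
    candidate_counts = Counter()
    for ctx in top_contexts:
        for name in corpus.fillers_of(ctx):    # hypothetical corpus API
            if name not in seeds:
                candidate_counts[name] += 1
    return top_contexts, candidate_counts

# e.g. bootstrap_step({"Clinton", "Bush", "Putin", "Chirac"}, corpus)
# might surface Yeltsin, Zemin, Hussein, Obama as high-count candidates.
```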
Problem
Different NE categories need different parameter settings in bootstrapping:
• "Academic" has a small number of strong contexts ("Department of … at")
• "Company" has a large number of weak contexts ("… was bankrupted", "… hires")
• "Award" has a strong suffix feature ("… Award/Prize")
• "Nationality" has a specific length (1), while "Book" has a wide length variation
Self-Adjustable Bootstrapping
We need to find the best parameter setting for each category.
Idea: bootstrapping + a machine-learning approach.
Use 80% of the seeds for training (train-data) and 20% of the seeds to optimize the scoring functions and thresholds (dev-data); a sketch of the search follows.
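A sketch of the self-adjusting search under stated assumptions: enumerate candidate parameter settings and keep whichever scores highest on the 20% dev seeds. The `run_bootstrap` and `trr` callables stand in for the pieces described on the following slides; all names here are ours, not the authors'.

```python
import itertools
import random

def tune(seeds, context_formulas, target_formulas, run_bootstrap, trr,
         n_context_options=(5, 10, 20)):
    """run_bootstrap(train_seeds, params) -> ranked candidate list."""
    seeds = list(seeds)
    random.shuffle(seeds)
    split = int(0.8 * len(seeds))           # 80% train / 20% dev
    train, dev = seeds[:split], seeds[split:]

    best_score, best_params = float("-inf"), None
    for params in itertools.product(context_formulas, target_formulas,
                                    n_context_options):
        ranked = run_bootstrap(train, params)  # ranked candidates
        score = trr(ranked, dev)               # Total Reciprocal Rank (later slide)
        if score > best_score:
            best_score, best_params = score, params
    return best_params
```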
Our Approach
Parameters:
• Context: formulas to score contexts and targets; the number of contexts to be used
• Suffix/Prefix: e.g. Suffix = Awards for the award categories
• Length: a bias on the lengths of the retrieved entity set
The three functions are combined by weighted linear interpolation.
Optimization function: Total Reciprocal Rank (TRR).
1. Scoring formula’s We observed that different scoring formula’s work best for different categories • Scoring Context • Fi / CF • Ft / log(CF) • Fi * log(Ft) / CF • log(Fi) * Ft / CF • log(Fi) * Ft / log(CF) • Scoring Targets • Fi / CF • Ft / log(CF) • Ft * log(Fi) /CF • log(Fi)*Ft / CF • log(Fi)*Ft / log(CF) Fi = Co-occurrence frequency of targets and the context Ft = Number of target types co-occurred with the context CF = Corpus frequency of the context Lexical Knowledge from Ngrams
2. Prefix/Suffix
Some categories have a strong prefix or suffix feature (e.g. Suffix = Awards for the award categories); a sketch of one possible form follows.
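The table on this slide is not preserved in the export, so the following is only a guess at the feature's shape, assuming the suffix is learned as the most common final token of the seeds; every name here is ours.

```python
from collections import Counter

def best_suffix(seeds, min_share=0.5):
    """Most common final token of the seeds, if frequent enough."""
    last_tokens = Counter(s.split()[-1] for s in seeds)
    token, count = last_tokens.most_common(1)[0]
    return token if count / len(seeds) >= min_share else None

def suffix_score(candidate, suffix):
    return 1.0 if suffix and candidate.endswith(suffix) else 0.0

# e.g. best_suffix(["Nobel Prize", "Academy Award", "Booker Prize"])
# returns "Prize", which then boosts candidates ending in "Prize".
```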
3. Length
Set a bias for the length of the retrieved entities based on the distribution of lengths over the seed words. The slides show length histograms for Nationality (concentrated at length 1), Bird, and Book (a wide spread of lengths). A sketch follows.
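A minimal sketch, assuming the bias is simply the empirical distribution of token lengths over the seeds; the helper names are ours.

```python
from collections import Counter

def length_bias(seeds):
    """Empirical distribution of token lengths over the seed words."""
    lengths = Counter(len(s.split()) for s in seeds)
    total = sum(lengths.values())
    return {n: c / total for n, c in lengths.items()}

def length_score(candidate, bias):
    return bias.get(len(candidate.split()), 0.0)

# Nationality seeds are almost all length 1, so a 3-token candidate gets
# (near-)zero bias; Book seeds spread their mass over many lengths.
```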
Optimization Function: TRR (Total Reciprocal Rank)
We want a higher score for parameter settings that retrieve the held-out test examples near the top of the retrieved set.
TRR = sum of 1/rank over the retrieved test examples.
• Test examples at ranks 1, 2, 8: score = 1/1 + 1/2 + 1/8 = 1.625
• Test examples at ranks 2, 3, 4, 6: score = 1/2 + 1/3 + 1/4 + 1/6 = 1.25
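The metric is small enough to write out exactly; this function reproduces the worked examples above.

```python
def trr(ranked_candidates, dev_seeds):
    """Total Reciprocal Rank of the dev seeds in a ranked candidate list."""
    dev = set(dev_seeds)
    return sum(1.0 / rank
               for rank, cand in enumerate(ranked_candidates, start=1)
               if cand in dev)

# Dev seeds retrieved at ranks 1, 2, 8    -> 1/1 + 1/2 + 1/8 = 1.625
# Dev seeds retrieved at ranks 2, 3, 4, 6 -> 1/2 + 1/3 + 1/4 + 1/6 = 1.25
```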
Experiment Data
• The dataset consists of seeds for all 150 NE categories.
• The number of seeds varies from 20 to 20,000, extracted from Wikipedia list pages and other list pages (Sekine et al. 2004).
• Program: an n-gram search engine over Wikipedia, with 1.7 billion tokens and 1.2 billion 7-grams.
Optimization Result
Different parameter settings give the best results for different categories. The last line of the table is the single setting that is best for all categories combined (the baseline).
Results
• Recall: the percentage of held-out seed examples found in the top 2,000 retrieved targets.
• Precision: the percentage of correct targets in a random sample of 100 drawn from the top 2,000. A sketch of both metrics follows.
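A sketch of the two metrics as defined above. In practice the precision sample is judged by hand, so `is_correct` is a hypothetical oracle, not part of the authors' system.

```python
import random

def recall_at_2000(ranked, held_out_seeds):
    """Fraction of held-out seeds appearing in the top 2,000 targets."""
    top = set(ranked[:2000])
    return sum(s in top for s in held_out_seeds) / len(held_out_seeds)

def precision_sample(ranked, is_correct, k=100):
    """Fraction of correct targets in a random sample of k from the top 2,000."""
    sample = random.sample(ranked[:2000], k)
    return sum(is_correct(t) for t in sample) / k
```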
Future Work
• More features: phrase clustering, genre information, longer dependencies
• Better optimization
• Start with a smaller number of seeds
• Other targets (e.g. relations)
• Make a tool (like Google Sets)
Using Phrase Clusters
Airport Cluster
1,301 "airport" phrases in Cluster #287, for example:
• Chicago 's O'Hare Airport
• Ben Gurion International Airport
• Little Rock National Airport
• London 's Heathrow airport
• Austin airport
• Burbank airport
• London 's Heathrow Airport
• Memphis airport
• La Guardia airport
• Corpus Christi International Airport
• Boston 's Logan Airport
• Cincinnati/Northern Kentucky International Airport
• Sea-Tac airport
Conclusion
• A solution to the problem that different methods work best for different categories
• A large dictionary of Named Entities in 150 categories
sushant@jhu.edu, sekine@cs.nyu.edu