Bootstrapping Information Extraction with Unlabeled Data
Rayid Ghani, Accenture Technology Labs
Rosie Jones, Carnegie Mellon University & Overture
(With contributions from Tom Mitchell and Ellen Riloff)
What is Information Extraction?
• Analyze unrestricted text in order to extract pre-specified types of events, entities, or relationships
• Recent commercial applications:
  • Database of job postings extracted from corporate web pages (flipdog.com)
  • Extracting specific fields from resumes to populate HR databases (mohomine.com)
  • Information integration (fetch.com)
  • Shopping portals
IE Approaches
• Hand-constructed rules
• Supervised learning
  • Still costly to train and port to new domains
  • 3–6 months to port to a new domain (Cardie 98)
  • 20,000 words to learn named entity extraction (Seymore et al 99)
  • 7,000 labeled examples to learn MUC extraction rules (Soderland 99)
• Semi-supervised learning
Semi-Supervised Approaches
• Several algorithms proposed for different tasks (semantic tagging, text categorization) and tested on different corpora
  • Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.
• Goal: systematically analyze and test
  • the assumptions underlying the algorithms
  • the effectiveness of the algorithms on a common set of problems and a common corpus
Tasks
• Extract noun phrases belonging to the following semantic classes:
  • Locations
  • Organizations
  • People
Aren’t you missing the obvious?
• Acquire lists of proper nouns
  • Locations: countries, states, cities
  • Organizations: online databases
  • People: names
• Named entity extraction?
• But not all instances are proper nouns
  • *by the river*, *customer*, *client*
Use context to disambiguate
• A lot of NPs are unambiguous
  • “The corporation”
• A lot of contexts are also unambiguous
  • Subsidiary of <NP>
• But as always, there are exceptions… and a LOT of them in this case
  • customer, John Hancock, Washington
Bootstrapping Approaches
• Utilize redundancy in text
  • Noun phrases: New York, China, place we met last time
  • Contexts: located in <X>, traveled to <X>
• Learn two models
  • Use NPs to label contexts
  • Use contexts to label NPs
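The two-view redundancy can be made concrete with a small sketch. The pairs, seed, and one-step propagation below are hypothetical illustrations, not the talk's actual data: every occurrence of an NP gives two views (the phrase and its context), and a context seen with a seed NP lets us label new NPs.

```python
from collections import defaultdict

# Hypothetical (noun phrase, context) pairs extracted from a parsed corpus.
pairs = [
    ("australia", "travelled to <X>"),
    ("france", "travelled to <X>"),
    ("france", "located in <X>"),
    ("the dog", "<X> ran away"),
]

# Each occurrence gives two redundant views: the NP itself and its context.
np_to_contexts = defaultdict(set)
context_to_nps = defaultdict(set)
for np, ctx in pairs:
    np_to_contexts[np].add(ctx)
    context_to_nps[ctx].add(np)

# One bootstrapping step: contexts seen with a seed location suggest that
# other NPs appearing in those contexts are also locations.
seed_locations = {"australia"}
location_contexts = {c for np in seed_locations for c in np_to_contexts[np]}
new_locations = {np for c in location_contexts
                 for np in context_to_nps[c]} - seed_locations
# new_locations == {"france"}
```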
Interesting Dimensions for Bootstrapping Algorithms
• Incremental vs. iterative
• Symmetric vs. asymmetric
• Probabilistic vs. heuristic
Algorithms for Bootstrapping
• Meta-Bootstrapping (Riloff & Jones, 1999)
  • Incremental, asymmetric, heuristic
• Co-Training (Blum & Mitchell, 1998)
  • Incremental, symmetric, probabilistic(?)
• Co-EM (Nigam & Ghani, 2000)
  • Iterative, symmetric, probabilistic
• Baselines
  • Seed-labeling: label all NPs that match the seeds
  • Head-labeling: label all NPs whose head matches the seeds
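The two baselines are simple enough to sketch directly. This is an illustrative interpretation, with the head of an NP approximated as its last word; the seed and NP lists are made up for the example:

```python
def seed_label(nps, seeds):
    """Baseline 1 (seed-labeling): label an NP only if it matches a seed exactly."""
    return {np for np in nps if np.lower() in seeds}

def head_label(nps, seeds):
    """Baseline 2 (head-labeling): label an NP if its head noun matches a seed.
    The head is approximated here as the last word of the phrase."""
    return {np for np in nps if np.lower().split()[-1] in seeds}

seeds = {"canada", "china"}
nps = ["Canada", "western China", "China Daily"]
seed_label(nps, seeds)   # {"Canada"}
head_label(nps, seeds)   # {"Canada", "western China"}
```

Head-labeling is strictly more aggressive than seed-labeling, which is why it serves as a second, stronger baseline.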
Data Set
• ~4,200 corporate web pages (WebKB project at CMU)
• Test data marked up manually by labeling every NP with one or more of the following semantic categories: location, organization, person, none
• Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)
Seeds
• Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan
• People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director
• Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, rayonier
Intuition Behind Bootstrapping
• Noun phrases: the dog, australia, france, the canary islands
• Contexts: <X> ran away, travelled to <X>, <X> is beautiful
Co-Training (Blum & Mitchell, 98)
• Incremental, symmetric, probabilistic
• Initialize with positive and negative NP seeds
• Use NPs to label all contexts
• Add the n top-scoring contexts for both the positive and negative classes
• Use the new contexts to label all NPs
• Add the n top-scoring NPs for both the positive and negative classes
• Loop
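The incremental loop above can be sketched as follows. This is a simplified illustration tracking the positive class only (the talk's version also tracks a negative class), and the co-occurrence-count scorer is a hypothetical stand-in for whatever scoring function a real system would use:

```python
def top_n(cooc, other_labeled, already_labeled, n):
    """Score each unlabeled item by how many labeled items from the other
    view it co-occurs with; return the n best with a nonzero score."""
    scored = [(len(cooc[item] & other_labeled), item)
              for item in cooc if item not in already_labeled]
    scored = [(s, item) for s, item in scored if s > 0]
    scored.sort(reverse=True)
    return {item for _, item in scored[:n]}

def co_train(np_cooc, ctx_cooc, np_seeds, n=1, rounds=2):
    """np_cooc maps each NP to the contexts it appears in; ctx_cooc maps
    each context to the NPs it appears with."""
    labeled_nps, labeled_ctxs = set(np_seeds), set()
    for _ in range(rounds):
        # Use labeled NPs to commit to the n best contexts...
        labeled_ctxs |= top_n(ctx_cooc, labeled_nps, labeled_ctxs, n)
        # ...then use labeled contexts to commit to the n best NPs.
        labeled_nps |= top_n(np_cooc, labeled_ctxs, labeled_nps, n)
    return labeled_nps, labeled_ctxs
```

Note the defining property: each round makes hard, irrevocable additions of n items per view, which is what "incremental" means on the previous slide.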
Co-EM (Nigam & Ghani, 2000)
• Iterative, symmetric, probabilistic
• Similar to Co-Training
• Probabilistically labels and adds all NPs and contexts to the labeled set
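The contrast with co-training can be sketched the same way. In this simplified one-class illustration (an assumption, not the paper's exact update), every item keeps a soft label that is re-estimated on every iteration, rather than a hard label committed once:

```python
def co_em(np_cooc, ctx_cooc, np_seeds, iters=10):
    """Simplified co-EM for one class: every item holds a soft label
    P(class); non-seed labels are re-estimated each iteration as the mean
    soft label of co-occurring items in the other view."""
    p_np = {np: 1.0 if np in np_seeds else 0.0 for np in np_cooc}
    p_ctx = {ctx: 0.0 for ctx in ctx_cooc}
    for _ in range(iters):
        for ctx, nps in ctx_cooc.items():
            p_ctx[ctx] = sum(p_np[np] for np in nps) / len(nps)
        for np, ctxs in np_cooc.items():
            if np not in np_seeds:  # seed labels stay clamped at 1.0
                p_np[np] = sum(p_ctx[ctx] for ctx in ctxs) / len(ctxs)
    return p_np, p_ctx
```

Because every NP and context is (soft-)labeled on every pass, no early hard decision can lock in a mistake; this is the "iterative, probabilistic" cell of the dimensions slide.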
Meta-Bootstrapping (Riloff & Jones, 99)
• Incremental, asymmetric, heuristic
• Two-level process
  • NPs are used to score contexts according to co-occurrence frequency and diversity
  • After the first level, all contexts are discarded and only the best NPs are retained
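The "frequency and diversity" scoring can be illustrated with an RlogF-style metric in the spirit of Riloff's work; treat the exact formula here as a sketch rather than the paper's implementation:

```python
import math

def rlogf(extracted_nps, known_members):
    """RlogF-style context score: F = distinct known category members the
    context extracts (diversity), N = distinct NPs it extracts overall;
    score = (F / N) * log2(F), so precision and diversity both matter."""
    extracted = set(extracted_nps)
    f = len(extracted & known_members)
    n = len(extracted)
    return 0.0 if f == 0 else (f / n) * math.log2(f)

# A context that extracts two known locations out of three NPs:
rlogf(["australia", "france", "the dog"], {"australia", "france"})  # ~0.667
```

The asymmetry of meta-bootstrapping shows up in what survives each level: contexts are only a scoring device and are thrown away, while the best NPs accumulate.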
Common Assumptions
• Seeds
  • Seed density in the corpus
  • Head-labeling accuracy
  • Syntactic-semantic agreement
• Redundancy
  • Feature sets are redundant and sufficient
• Labeling disagreement
Feature Set Ambiguity
• Feature sets: NPs and contexts
• If the feature sets were redundantly sufficient, either of them alone would be enough to correctly classify the instance
• Calculate the ambiguity for each feature set
  • Washington, went to <X>, visit <X>
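One hypothetical way to compute such a per-view ambiguity measure (the slide does not give the exact formula, so this is an assumption): count the fraction of features observed with more than one semantic class.

```python
def view_ambiguity(feature_classes):
    """Fraction of features (NPs or contexts) observed with more than one
    semantic class -- a simple proxy for how far a single view falls short
    of being sufficient on its own."""
    ambiguous = [f for f, classes in feature_classes.items()
                 if len(set(classes)) > 1]
    return len(ambiguous) / len(feature_classes)

view_ambiguity({
    "washington": ["location", "person"],        # ambiguous NP
    "australia": ["location"],
    "went to <X>": ["location"],
    "visit <X>": ["location", "organization"],   # ambiguous context
})  # 0.5
```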
NP Ambiguity: 2%
Labeling Disagreement
• Agreement among human labelers
• Same set of instances but different levels of information:
  • NP only
  • Context only
  • NP and context
  • NP, context, and the entire sentence from the corpus
Labeling Disagreement
• 90.5% agreement when the NP, context, and sentence are given
• 88.5% when the sentence is not given
Results Comparing Bootstrapping Algorithms
• Meta-Bootstrapping, Co-Training, Co-EM
• Locations, Organizations, People
[Results charts comparing Co-EM, Meta-Bootstrapping, and Co-Training on each semantic class]
More Results
• Bootstrapping outperforms both baselines
• Improvement is less pronounced for the “people” class
• Ambiguous classes don’t benefit as much from bootstrapping?
Why does Co-EM work well?
• Co-EM outperforms Meta-Bootstrapping and Co-Training
• Co-EM is probabilistic and does not make hard classifications
  • Reflective of the ambiguity among classes
Summary
• Starting with 10 seed words, extract NPs matching specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM
• Probabilistic bootstrapping with redundant feature sets is effective, even for ambiguous classes
• Co-EM performs robustly even when the underlying assumptions are violated
Ongoing Work
• Varying initial seed size and type
• Collecting the training corpus automatically (from the Web)
• Incorporating the user in the loop (active learning)