Bootstrapping Information Extraction with Unlabeled Data
Rayid Ghani, Accenture Technology Labs
Rosie Jones, Carnegie Mellon University & Overture
(With contributions from Tom Mitchell and Ellen Riloff)
What is Information Extraction?
• Analyze unrestricted text in order to extract pre-specified types of events, entities, or relationships
• Recent commercial applications:
  • Database of job postings extracted from corporate web pages (flipdog.com)
  • Extracting specific fields from resumes to populate HR databases (mohomine.com)
  • Information integration (fetch.com)
  • Shopping portals
IE Approaches
• Hand-constructed rules
• Supervised learning
  • Still costly to train and port to new domains
  • 3–6 months to port to a new domain (Cardie 98)
  • 20,000 words to learn named entity extraction (Seymore et al 99)
  • 7,000 labeled examples to learn MUC extraction rules (Soderland 99)
• Semi-supervised learning
Semi-Supervised Approaches
• Several algorithms proposed for different tasks (semantic tagging, text categorization) and tested on different corpora
  • Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.
• Goal: systematically analyze and test
  • the assumptions underlying the algorithms
  • the effectiveness of the algorithms on a common set of problems and a common corpus
Tasks
• Extract noun phrases belonging to the following semantic classes:
  • Locations
  • Organizations
  • People
Aren’t you missing the obvious?
• Acquire lists of proper nouns
  • Locations: countries, states, cities
  • Organizations: online databases
  • People: names
• Named entity extraction?
• But not all instances are proper nouns
  • *by the river*, *customer*, *client*
Use context to disambiguate
• A lot of NPs are unambiguous
  • “The corporation”
• A lot of contexts are also unambiguous
  • Subsidiary of <NP>
• But as always, there are exceptions… and a LOT of them in this case
  • customer, John Hancock, Washington
Bootstrapping Approaches
• Utilize redundancy in text
  • Noun phrases: New York, China, place we met last time
  • Contexts: located in <X>, traveled to <X>
• Learn two models
  • Use NPs to label contexts
  • Use contexts to label NPs
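The two-view redundancy can be made concrete with a small sketch. The pairs, seed, and one-step propagation below are hypothetical illustrations, not the talk's actual data: every occurrence of an NP gives two views (the phrase and its context), and a context seen with a seed NP lets us label new NPs.

```python
from collections import defaultdict

# Hypothetical (noun phrase, context) pairs extracted from a parsed corpus.
pairs = [
    ("australia", "travelled to <X>"),
    ("france", "travelled to <X>"),
    ("france", "located in <X>"),
    ("the dog", "<X> ran away"),
]

# Each occurrence gives two redundant views: the NP itself and its context.
np_to_contexts = defaultdict(set)
context_to_nps = defaultdict(set)
for np, ctx in pairs:
    np_to_contexts[np].add(ctx)
    context_to_nps[ctx].add(np)

# One bootstrapping step: contexts seen with a seed location suggest that
# other NPs appearing in those contexts are also locations.
seed_locations = {"australia"}
location_contexts = {c for np in seed_locations for c in np_to_contexts[np]}
new_locations = {np for c in location_contexts
                 for np in context_to_nps[c]} - seed_locations
# new_locations == {"france"}
```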
Interesting Dimensions for Bootstrapping Algorithms
• Incremental vs. iterative
• Symmetric vs. asymmetric
• Probabilistic vs. heuristic
Algorithms for Bootstrapping
• Meta-Bootstrapping (Riloff & Jones, 1999)
  • Incremental, asymmetric, heuristic
• Co-Training (Blum & Mitchell, 1998)
  • Incremental, symmetric, probabilistic(?)
• Co-EM (Nigam & Ghani, 2000)
  • Iterative, symmetric, probabilistic
• Baselines
  • Seed-labeling: label all NPs that match the seeds
  • Head-labeling: label all NPs whose head matches the seeds
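The two baselines are simple enough to sketch directly. This is an illustrative interpretation, with the head of an NP approximated as its last word; the seed and NP lists are made up for the example:

```python
def seed_label(nps, seeds):
    """Baseline 1 (seed-labeling): label an NP only if it matches a seed exactly."""
    return {np for np in nps if np.lower() in seeds}

def head_label(nps, seeds):
    """Baseline 2 (head-labeling): label an NP if its head noun matches a seed.
    The head is approximated here as the last word of the phrase."""
    return {np for np in nps if np.lower().split()[-1] in seeds}

seeds = {"canada", "china"}
nps = ["Canada", "western China", "China Daily"]
seed_label(nps, seeds)   # {"Canada"}
head_label(nps, seeds)   # {"Canada", "western China"}
```

Head-labeling is strictly more aggressive than seed-labeling, which is why it serves as a second, stronger baseline.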
Data Set
• ~4,200 corporate web pages (WebKB project at CMU)
• Test data marked up manually by labeling every NP with one or more of the following semantic categories: location, organization, person, none
• Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)
Seeds
• Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan
• People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director
• Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, rayonier
Intuition Behind Bootstrapping
• Noun phrases: the dog, australia, france, the canary islands
• Contexts: <X> ran away, travelled to <X>, <X> is beautiful
Co-Training (Blum & Mitchell, 98)
• Incremental, symmetric, probabilistic
• Initialize with positive and negative NP seeds
• Use NPs to label all contexts
• Add the n top-scoring contexts for both the positive and negative classes
• Use the new contexts to label all NPs
• Add the n top-scoring NPs for both the positive and negative classes
• Loop
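The incremental loop above can be sketched as follows. This is a simplified illustration tracking the positive class only (the talk's version also tracks a negative class), and the co-occurrence-count scorer is a hypothetical stand-in for whatever scoring function a real system would use:

```python
def top_n(cooc, other_labeled, already_labeled, n):
    """Score each unlabeled item by how many labeled items from the other
    view it co-occurs with; return the n best with a nonzero score."""
    scored = [(len(cooc[item] & other_labeled), item)
              for item in cooc if item not in already_labeled]
    scored = [(s, item) for s, item in scored if s > 0]
    scored.sort(reverse=True)
    return {item for _, item in scored[:n]}

def co_train(np_cooc, ctx_cooc, np_seeds, n=1, rounds=2):
    """np_cooc maps each NP to the contexts it appears in; ctx_cooc maps
    each context to the NPs it appears with."""
    labeled_nps, labeled_ctxs = set(np_seeds), set()
    for _ in range(rounds):
        # Use labeled NPs to commit to the n best contexts...
        labeled_ctxs |= top_n(ctx_cooc, labeled_nps, labeled_ctxs, n)
        # ...then use labeled contexts to commit to the n best NPs.
        labeled_nps |= top_n(np_cooc, labeled_ctxs, labeled_nps, n)
    return labeled_nps, labeled_ctxs
```

Note the defining property: each round makes hard, irrevocable additions of n items per view, which is what "incremental" means on the previous slide.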
Co-EM (Nigam & Ghani, 2000)
• Iterative, symmetric, probabilistic
• Similar to Co-Training
• Probabilistically labels and adds all NPs and contexts to the labeled set
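The contrast with co-training can be sketched the same way. In this simplified one-class illustration (an assumption, not the paper's exact update), every item keeps a soft label that is re-estimated on every iteration, rather than a hard label committed once:

```python
def co_em(np_cooc, ctx_cooc, np_seeds, iters=10):
    """Simplified co-EM for one class: every item holds a soft label
    P(class); non-seed labels are re-estimated each iteration as the mean
    soft label of co-occurring items in the other view."""
    p_np = {np: 1.0 if np in np_seeds else 0.0 for np in np_cooc}
    p_ctx = {ctx: 0.0 for ctx in ctx_cooc}
    for _ in range(iters):
        for ctx, nps in ctx_cooc.items():
            p_ctx[ctx] = sum(p_np[np] for np in nps) / len(nps)
        for np, ctxs in np_cooc.items():
            if np not in np_seeds:  # seed labels stay clamped at 1.0
                p_np[np] = sum(p_ctx[ctx] for ctx in ctxs) / len(ctxs)
    return p_np, p_ctx
```

Because every NP and context is (soft-)labeled on every pass, no early hard decision can lock in a mistake; this is the "iterative, probabilistic" cell of the dimensions slide.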
Meta-Bootstrapping (Riloff & Jones, 99)
• Incremental, asymmetric, heuristic
• Two-level process
  • NPs are used to score contexts according to co-occurrence frequency and diversity
  • After the first level, all contexts are discarded and only the best NPs are retained
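The "frequency and diversity" scoring can be illustrated with an RlogF-style metric in the spirit of Riloff's work; treat the exact formula here as a sketch rather than the paper's implementation:

```python
import math

def rlogf(extracted_nps, known_members):
    """RlogF-style context score: F = distinct known category members the
    context extracts (diversity), N = distinct NPs it extracts overall;
    score = (F / N) * log2(F), so precision and diversity both matter."""
    extracted = set(extracted_nps)
    f = len(extracted & known_members)
    n = len(extracted)
    return 0.0 if f == 0 else (f / n) * math.log2(f)

# A context that extracts two known locations out of three NPs:
rlogf(["australia", "france", "the dog"], {"australia", "france"})  # ~0.667
```

The asymmetry of meta-bootstrapping shows up in what survives each level: contexts are only a scoring device and are thrown away, while the best NPs accumulate.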
Common Assumptions
• Seeds
  • Seed density in the corpus
  • Head-labeling accuracy
  • Syntactic-semantic agreement
• Redundancy
  • Feature sets are redundant and sufficient
• Labeling disagreement
Feature Set Ambiguity
• Feature sets: NPs and contexts
• If the feature sets were redundantly sufficient, either of them alone would be enough to correctly classify the instance
• Calculate the ambiguity for each feature set
  • Washington, went to <X>, visit <X>
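One hypothetical way to compute such a per-view ambiguity measure (the slide does not give the exact formula, so this is an assumption): count the fraction of features observed with more than one semantic class.

```python
def view_ambiguity(feature_classes):
    """Fraction of features (NPs or contexts) observed with more than one
    semantic class -- a simple proxy for how far a single view falls short
    of being sufficient on its own."""
    ambiguous = [f for f, classes in feature_classes.items()
                 if len(set(classes)) > 1]
    return len(ambiguous) / len(feature_classes)

view_ambiguity({
    "washington": ["location", "person"],        # ambiguous NP
    "australia": ["location"],
    "went to <X>": ["location"],
    "visit <X>": ["location", "organization"],   # ambiguous context
})  # 0.5
```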
NP Ambiguity: 2%
Labeling Disagreement
• Agreement among human labelers
• Same set of instances but different levels of information:
  • NP only
  • Context only
  • NP and context
  • NP, context, and the entire sentence from the corpus
Labeling Disagreement
• 90.5% agreement when the NP, context, and sentence are given
• 88.5% when the sentence is not given
Results Comparing Bootstrapping Algorithms
• Meta-Bootstrapping, Co-Training, Co-EM
• Locations, Organizations, People
[Results charts comparing Co-EM, Meta-Bootstrapping, and Co-Training on each semantic class]
More Results
• Bootstrapping outperforms both baselines
• Improvement is less pronounced for the “people” class
• Ambiguous classes don’t benefit as much from bootstrapping?
Why does Co-EM work well?
• Co-EM outperforms Meta-Bootstrapping and Co-Training
• Co-EM is probabilistic and does not make hard classifications
  • Reflective of the ambiguity among classes
Summary
• Starting with 10 seed words, extract NPs matching specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM
• Probabilistic bootstrapping with redundant feature sets is effective, even for ambiguous classes
• Co-EM performs robustly even when the underlying assumptions are violated
Ongoing Work
• Varying initial seed size and type
• Collecting the training corpus automatically (from the Web)
• Incorporating the user in the loop (active learning)