Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones, AAAI 99 Presented by: Sunandan Chakraborty. Information Extraction. Extracting domain-specific information from NL text Example Domains Locations Companies Terrorism.
Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping Ellen Riloff and Rosie Jones, AAAI 99 Presented by: Sunandan Chakraborty
Information Extraction • Extracting domain-specific information from NL text • Example Domains • Locations • Companies • Terrorism
Required Lexical Resources • Semantic lexicons • Dictionary of words tagged with semantic categories • e.g. names of locations (countries, cities) • Extraction patterns • e.g. outlets in <x>, from <y> • Applied to a noun phrase • Outlets in New York
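The two resources above can be pictured with a toy sketch. It assumes a regular expression as a stand-in for an extraction pattern and a plain dict as the semantic lexicon; the patterns in the source come from AutoSlog, not from regular expressions, and the helper name `apply_pattern` and the sample sentence are hypothetical.

```python
import re

# Toy semantic lexicon: word or phrase -> semantic category (illustrative entries).
semantic_lexicon = {"new york": "location", "bolivia": "location"}

def apply_pattern(prefix, text):
    """Return capitalized phrases that directly follow `prefix` in `text`.

    Crude stand-in for a pattern such as "outlets in <x>": the <x> slot is
    approximated by a run of capitalized words, not by a parsed noun phrase.
    """
    slot = r" ((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*)"
    return re.findall(re.escape(prefix) + slot, text)

sentence = "The chain has outlets in New York and offices in Paris."
extracted = apply_pattern("outlets in", sentence)

# Look each extraction up in the lexicon; unknown phrases map to None.
categories = {np_: semantic_lexicon.get(np_.lower()) for np_ in extracted}
```

Here `extracted` holds the phrase filling the `<x>` slot, and `categories` shows whether the lexicon already knows it.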
Mutual Bootstrapping • No annotated corpus • Learning extraction patterns and semantic lexicon • Input • Unannotated corpus • Seed words
Mutual Bootstrapping • Starting from seed words • Identifying NPs related to the seed words [for extraction patterns] • Using extraction patterns to identify new terms • New terms should be in the same lexical category • Using new terms to search for more patterns
Algorithm • Input: • Candidate extraction patterns from AutoSlog • Seed words • Data Structures • EPdata – to store candidate extraction patterns • Initial value: extraction patterns from AutoSlog and their extractions • SemLex – to store semantic lexicon entries as they are identified • Initial value: seed words • Cat_EPlist – to store the extraction patterns selected for the category • Initial value: null
Algorithm (contd...) • 1. For all extraction patterns Pi in EPdata • score(Pi) = Ri * log2(Fi) • Fi = no. of unique lexicon entries (members of SemLex) among Pi's extractions • Ri = Fi/Ni, Ni: no. of unique NPs extracted by Pi • 2. Insert the Pi with the maximum score(Pi) into Cat_EPlist • 3. Insert Pi's extractions into SemLex • 4. Repeat from step 1.
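The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `ep_data` is a hypothetical dict from pattern string to the set of NPs it extracts, patterns with Fi = 0 are skipped because log2(0) is undefined, ties are broken by pattern name, and the stopping rule `max_patterns` is an assumption.

```python
import math

def score_pattern(extractions, lexicon):
    """Return Ri * log2(Fi) for one pattern, or None when Fi (or Ni) is 0."""
    n = len(extractions)            # Ni: unique NPs extracted by the pattern
    f = len(extractions & lexicon)  # Fi: extractions already in SemLex
    if n == 0 or f == 0:
        return None
    return (f / n) * math.log2(f)

def mutual_bootstrap(ep_data, seeds, max_patterns=10):
    sem_lex = set(seeds)        # SemLex, initialised with the seed words
    cat_ep_list = []            # Cat_EPlist, initially empty
    remaining = dict(ep_data)   # candidate patterns not yet selected
    while remaining and len(cat_ep_list) < max_patterns:
        scored = [(score_pattern(nps, sem_lex), name)
                  for name, nps in remaining.items()]
        scored = [(s, name) for s, name in scored if s is not None]
        if not scored:
            break
        _, best = max(scored)               # step 2: highest-scoring pattern
        cat_ep_list.append(best)
        sem_lex |= remaining.pop(best)      # step 3: add all its extractions
    return cat_ep_list, sem_lex

# Illustrative data only (not from the source).
ep_data = {
    "offices in <x>": {"bolivia", "colombia", "peru"},
    "traveled to <x>": {"peru", "chile", "colombia"},
    "owned by <x>": {"ibm", "bolivia"},
}
patterns, lexicon = mutual_bootstrap(ep_data, {"bolivia", "colombia"}, max_patterns=2)
```

In the first pass only "offices in <x>" has two lexicon hits (score 2/3 * log2(2)), so it is selected and "peru" enters the lexicon; that in turn raises the score of "traveled to <x>" in the second pass. Running a third pass on this data would select "owned by <x>" and add "ibm", which shows how one weak pattern can pollute the lexicon.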
Multi-level Bootstrapping • Problem with mutual bootstrapping • Inserting an incorrect word into SemLex can drastically reduce accuracy • Solution • Second level of bootstrapping
Meta-bootstrapping • Outer level of bootstrapping • Retains the best 5 NPs • Corresponding lexicon entries are added to a permanent list • Reliability score: • $\mathrm{rel}(NP_i) = \sum_{k=1}^{N_i} \bigl(1 + \mathrm{score}(p_k)\bigr)$, where $N_i$ is the number of patterns that extracted $NP_i$ • Using reliable lexicon entries for the next iteration of Mutual-BS
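The reliability score on this slide can be worked through on a small example. The sketch below follows the formula as written on the slide (one point per extracting pattern plus that pattern's score); the function names, the pattern names `p1` and `p2`, and all numbers are hypothetical.

```python
def reliability(np_, pattern_scores, extractions):
    """Sum (1 + score(p)) over the category patterns p that extracted np_."""
    return sum(1 + score
               for p, score in pattern_scores.items()
               if np_ in extractions[p])

def best_nps(candidates, pattern_scores, extractions, keep=5):
    """Return the `keep` most reliable NPs; ties are broken alphabetically."""
    ranked = sorted(
        candidates,
        key=lambda np_: (-reliability(np_, pattern_scores, extractions), np_),
    )
    return ranked[:keep]

# Illustrative values only (not from the source).
pattern_scores = {"p1": 1.0, "p2": 0.5}
extractions = {"p1": {"peru", "chile"}, "p2": {"peru", "ibm"}}

rel_peru = reliability("peru", pattern_scores, extractions)  # (1 + 1.0) + (1 + 0.5)
kept = best_nps({"peru", "chile", "ibm"}, pattern_scores, extractions, keep=2)
```

An NP extracted by several patterns outranks one extracted by a single pattern, so "peru" is kept ahead of "chile" and "ibm" is dropped when only two NPs are retained.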
Evaluation • Corpus • 4,160 corporate web pages • 1,500 terrorism texts • AutoSlog candidate extraction patterns • 19,690 for the web pages • 14,064 for the terrorism texts • Seed words • Web company: Co., Company, Corp… • Web location: different country names • Terrorism location: Bolivia, city, Colombia, district
Evaluation (contd…) • 50 iterations of meta-bootstrapping • Mutual bootstrapping ran until it produced 10 unique patterns
Evaluation (contd…) • Other systems’ accuracy (weapon): • 17% (Riloff & Shepherd, 1997) • 36% (Roark & Charniak, 1998)
Evaluation (contd…) • Tested on 233 new web pages