Automatic Acquisition of Lexical Classes and Extraction Patterns for Information Extraction • Kiyoshi Sudo • Ph.D. Research Proposal • New York University • Committee: Ralph Grishman, Satoshi Sekine, I. Dan Melamed
Outline • Introduction • Research Proposal • Problem Setting • Approach • Application to Information Extraction • Discussion Kiyoshi Sudo Thesis Proposal Presentation
MUC Scenario Template Task MURREE, Pakistan (AP) -- Masked gunmen firing Kalashnikov rifles burst through the front gates of a Christian school Monday, killing six people and wounding three in the latest attack against Western interests since Pakistan joined the war against terrorism.
MUC Scenario Template Task • Phrases highlighted for extraction in the article above: Masked gunmen • Kalashnikov rifles • a Christian school • Monday • six people • three
High Cost for Acquiring Knowledge-Base • Find extraction patterns • Find relevant documents • Find relevant events • Analyze sentences • Find domain-specific lexicon • Find existing KB (e.g. thesaurus, gazetteers)
Prior Work: Automatic Knowledge Acquisition (Lexical Acquisition, Pattern Acquisition) • Mutual Bootstrapping (Riloff and Jones 1999) • Pattern Discovery with Document Re-ranking (Yangarber et al. 2000) • Simultaneous Multi-Semantic Class (Thelen and Riloff 2002) (Yangarber et al. 2002) • Pattern Acquisition for QA (Ravichandran and Hovy 2002)
Challenge (MUC-3: Terrorism Event) [Diagram: User; Seed Lexicon; Seed Pattern; Expanded Lexicon; Expanded Pattern Set; Knowledge Base]
Meeting the Challenge [Diagram: Scenario Description; Semantic Clustering; Semantic Cluster; User; Seed Lexicon; Seed Pattern; Expanded Lexicon; Expanded Pattern Set; Knowledge Base]
Semantic Clustering • Input: • Description specific enough to define the scenario (terrorism, bombing, kidnapping) • "Tell me about the terrorism action, such as bombing and kidnapping." • Goal: • Find Scenario-specific Semantic Clusters, each of which consists of • Semantic Lexicon • Extraction Patterns [Diagram: Scenario Description; Semantic Clustering; Semantic Cluster; Semantic Lexicon; Extraction Patterns]
Benefit for User • Simplify Domain Analysis • Low-cost Knowledge-base Acquisition for IE systems [Diagram: Scenario Description; Semantic Clustering; Semantic Cluster]
Extraction Patterns • Definition (formula not preserved), where c unifies with the context that is defined by semantic class L • context = • Sequential: (x, bombs, himself) • Case Frame: (bomb (v), x (subj), himself (obj)) • Dependency: bomb with dependents x (V:subj) and himself (V:obj) (cf. Sudo et al. 2001)
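The case-frame representation above can be illustrated with a small matcher. This is only a sketch under assumptions: the slide's formal definition is not preserved, so "unifies" is read here as "every fixed element of the pattern agrees with the frame and the x slot is filled by a member of the class L". The function name `match_case_frame`, the dict layout, and the sample class are hypothetical.

```python
def match_case_frame(pattern, frame, lexical_class):
    """Return the filler bound to 'x' if frame unifies with pattern, else None.

    pattern, frame: dicts mapping a role ('verb', 'subj', 'obj') to a word.
    lexical_class: set of words allowed in the 'x' slot (the class L).
    """
    binding = None
    for role, expected in pattern.items():
        actual = frame.get(role)
        if actual is None:
            return None          # the frame lacks a role the pattern needs
        if expected == "x":
            if actual not in lexical_class:
                return None      # slot filler is outside the semantic class
            binding = actual
        elif actual != expected:
            return None          # a fixed element of the pattern differs
    return binding


# Hypothetical data, for illustration only.
pattern = {"verb": "bomb", "subj": "x", "obj": "himself"}
perpetrators = {"terrorist", "gunman"}
hit = match_case_frame(
    pattern, {"verb": "bomb", "subj": "terrorist", "obj": "himself"}, perpetrators
)
miss = match_case_frame(
    pattern, {"verb": "bomb", "subj": "company", "obj": "himself"}, perpetrators
)
```

Here `hit` is bound to the subject filler and `miss` is `None`, because "company" is not in the assumed class.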
Outline • Introduction • Research Proposal • Problem Setting • Approach • Information Extraction • Evaluation
Overview [Diagram of Semantic Clustering: Source; Information Retrieval; Scenario Description; Bootstrapping; Query Expansion; Semantic Cluster]
Information Retrieval • Get Relevant Document set R • Get list of lexical items and extraction patterns ordered by relevance to the scenario • TF/IDF scoring
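A minimal sketch of how terms from the relevant document set might be ordered by TF/IDF. The slides do not give the exact weighting, so the score tf × log(N / df) is an assumption, as are the function name `rank_terms` and the toy documents.

```python
import math
from collections import Counter


def rank_terms(relevant_docs, collection):
    """Rank terms by frequency in the relevant set times IDF over the collection.

    Assumption: score(t) = tf_R(t) * log(N / df(t)); the slides do not state
    the weighting that was actually used.
    """
    n_docs = len(collection)
    df = Counter()
    for doc in collection:
        df.update(set(doc))          # document frequency: count each doc once
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc)               # term frequency within the relevant set
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t] > 0}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical tokenized documents, for illustration only.
collection = [
    ["president", "resign", "company", "president"],
    ["company", "report", "profit"],
    ["company", "name", "president"],
    ["weather", "report", "rain"],
]
relevant_docs = [collection[0], collection[2]]
ranking = rank_terms(relevant_docs, collection)
```

The same scoring could be applied to extraction patterns by treating each pattern instance as a token.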
Example of TF/IDF scoring (Management Succession: Business) • 300 documents retrieved from WSJ (7/94 - 8/94) • Extracted by MINIPAR (Lin 1998)
Overview [Diagram of Semantic Clustering: Source; Information Retrieval; Scenario Description; extraction patterns; lexicon; Bootstrapping; Query Expansion; Semantic Cluster]
Bootstrapping • Assumption: • Patterns provide Lexical Classes. • Lexicon provides contextual information. • Find one cluster that consists of Lexicon and Extraction Patterns (Riloff and Jones 1999; Agichtein and Gravano 2000)
Bootstrapping (Cont.) • Algorithm (cf. Riloff and Jones 1999) • Given • the ordered list of terms • the ordered list of extraction patterns • Lexicon = (), Pattern = () • w ← the most relevant term in the list; add it to Lexicon • p ← the most relevant pattern among those that extract w • Add p to Pattern • w ← the most relevant term among those that are extracted by p • Add w to Lexicon • Go to 1
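The loop above can be sketched as follows. This is one reading of the slide, not the author's implementation: it assumes that already selected terms and patterns are skipped on later rounds and that the loop stops when nothing new can be added. The names `bootstrap` and `extracts`, and the sample data, are hypothetical.

```python
def bootstrap(ranked_terms, ranked_patterns, extracts, max_iters=10):
    """Alternately grow a lexicon and a pattern set.

    ranked_terms, ranked_patterns: lists ordered by relevance, best first.
    extracts: dict mapping each pattern to the set of terms it extracts.
    Assumption: selected items are never re-selected, and the loop ends
    when no new pattern or term is available.
    """
    lexicon, patterns = [], []
    w = ranked_terms[0]              # seed: the most relevant term overall
    lexicon.append(w)
    for _ in range(max_iters):
        # Most relevant unused pattern that extracts the current term.
        p = next(
            (q for q in ranked_patterns
             if q not in patterns and w in extracts.get(q, set())),
            None,
        )
        if p is None:
            break
        patterns.append(p)
        # Most relevant unused term extracted by that pattern.
        w_next = next(
            (t for t in ranked_terms if t not in lexicon and t in extracts[p]),
            None,
        )
        if w_next is None:
            break
        lexicon.append(w_next)
        w = w_next
    return lexicon, patterns


# Hypothetical ranked lists, for illustration only.
ranked_terms = ["president", "chairman", "CEO", "company"]
ranked_patterns = ["(x, resign)", "(name, x)", "(x, report)"]
extracts = {
    "(x, resign)": {"president", "chairman"},
    "(name, x)": {"chairman", "CEO"},
    "(x, report)": {"company"},
}
lexicon, patterns = bootstrap(ranked_terms, ranked_patterns, extracts)
```

With this toy data the cluster grows from "president" through two patterns and stops before reaching "company", since no remaining pattern extracts the last term added.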
Example of Bootstrapping (Management Succession: Business) • From WSJ (7/94 - 8/94) • Extracted by MINIPAR (Lin 1998)
Problem: Polysemous Lexicon, Pattern • Lexicon can be ambiguous • e.g. Clinton (Person, Organization, Location, ...) • Extraction patterns can be ambiguous • e.g. be killed in <x> (x: Location, Date, ...) • Needs more study • More restriction • Probabilistic model?
Overview [Diagram of Semantic Clustering: Source; Information Retrieval; Scenario Description; Bootstrapping; Query Expansion; Semantic Cluster (pattern, lexicon); labels: pt, lex]
Query Expansion • Generalize terms in a query with a newly discovered cluster • cf. Rocchio 1971 (vector model); Zhai and Lafferty 2001 (language modeling)
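A minimal sketch of Rocchio-style expansion in the vector model, using only the positive-feedback part of the formula (the original query plus a weighted centroid of relevant items). How the proposal would plug a discovered cluster into this is not stated on the slide; treating the cluster members as the relevant documents is an assumption, as are the name `rocchio_expand` and the weights.

```python
from collections import Counter


def rocchio_expand(query, relevant_docs, alpha=1.0, beta=0.75):
    """Return term weights for alpha * q + beta * centroid(relevant_docs).

    query: list of terms; relevant_docs: list of term lists.
    The negative-feedback term of the full Rocchio formula is omitted.
    """
    weights = Counter()
    for term, count in Counter(query).items():
        weights[term] += alpha * count
    for doc in relevant_docs:
        for term, count in Counter(doc).items():
            # Each document contributes 1/|relevant_docs| of its counts.
            weights[term] += beta * count / len(relevant_docs)
    return dict(weights)


# Hypothetical query and feedback items, for illustration only.
expanded = rocchio_expand(
    ["bombing"],
    [["bombing", "kidnapping"], ["kidnapping", "attack"]],
)
```

Terms that appear only in the feedback items ("kidnapping", "attack") enter the query with smaller weights than the original term.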
Outline • Introduction • Research Proposal • Problem Setting • Approach • Application to Information Extraction • Discussion
Application to Information Extraction [Diagram: Scenario Description; Semantic Clustering; Semantic Cluster (Semantic Lexicon, Extraction Patterns); Pattern Matching; IE stages: Preprocessing, Entity Recognition, Event Recognition, Role Assignment, Merging]
Human Intervention • Extraction patterns • Event pattern • Context contains a verb or a nominalization of a verb • Used for event extraction and role assignment • e.g. (terrorist, fire, x) • Local pattern • Context contains only enough information to recognize the semantic class • Used for entity recognition only • e.g. (x, Inc.) • Association of Event Pattern to Role • e.g. (company, hire, x) → PersonIn and (company, fire, x) → PersonOut
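The association of event patterns to roles can be pictured as a small lookup table that a person fills in. The sketch below is hypothetical: the table name `ROLE_OF_PATTERN` and the function `assign_role` are not from the slides, and the first element of each pattern is matched literally here, although on the slide "company" more likely stands for a semantic class.

```python
# Hand-built association of event patterns to template roles (from the slide).
ROLE_OF_PATTERN = {
    ("company", "hire", "x"): "PersonIn",
    ("company", "fire", "x"): "PersonOut",
}


def assign_role(triple, role_of_pattern=ROLE_OF_PATTERN):
    """Return (role, filler) for the first pattern that matches, else None.

    triple: a (subject, verb, object) tuple taken from a parsed sentence.
    'x' in a pattern matches any word and marks the slot filler.
    """
    for pattern, role in role_of_pattern.items():
        if all(p == "x" or p == t for p, t in zip(pattern, triple)):
            filler = next(t for p, t in zip(pattern, triple) if p == "x")
            return role, filler
    return None


# Hypothetical parsed triples, for illustration only.
hired = assign_role(("company", "hire", "Smith"))
fired = assign_role(("company", "fire", "Jones"))
other = assign_role(("company", "sue", "Jones"))
```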
Outline • Introduction • Research Proposal • Problem Setting • Approach • Application to Information Extraction • Discussion
Discussion • Domain Portability • User only needs to specify the scenario • Language Portability • Language-dependent Tools • Segmentation (Lemmatization) • Dependency Parsing
Evaluation • MUC-style (Scenario-Template task) • Slot-based • Precision, Recall, F-measure • Domain Portability • Several pre-defined tasks that differ in difficulty • Language Portability • Japanese • English
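For reference, slot-based precision, recall, and balanced F-measure can be computed as below. This is a simplified sketch: it counts only exact (slot, filler) matches and ignores the partial-credit and alignment rules of the official MUC scorer. The slot names in the example are hypothetical; the fillers come from the news example earlier in the deck.

```python
def slot_scores(predicted, gold):
    """Return (precision, recall, F1) over sets of (slot, filler) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical slot names, for illustration only.
gold = {
    ("perpetrator", "Masked gunmen"),
    ("weapon", "Kalashnikov rifles"),
    ("target", "a Christian school"),
    ("date", "Monday"),
}
predicted = {
    ("perpetrator", "Masked gunmen"),
    ("date", "Monday"),
}
precision, recall, f1 = slot_scores(predicted, gold)
```

In this example every predicted slot is correct but only half of the gold slots are found, so precision is 1.0 and recall is 0.5.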
Contribution • Tool for Domain Analysis • Low-cost Knowledge-base Acquisition • Towards Open-domain Information Extraction
Conclusion • Proposed a new approach to knowledge-base acquisition (Semantic Clustering) • Discussed the application of the acquired KB to Information Extraction (human intervention; local vs. event patterns) • Discussed evaluation on several predefined MUC-style tasks that differ in difficulty and across languages (domain portability and language portability)
ToDo • Implementation • Preparation for Evaluation • Evaluation
Time for Questions