Information Extraction: What has Worked, What hasn't, and What has Promise for the Future
Ralph Weischedel, BBN Technologies
7 November 2000
Outline
• Information extraction tasks & past performance
• Two approaches
• Learning to extract relations
• Our view of the future
MUC Tasks
Generic:
• Named Entity (NE): names only of persons, organizations, locations
• Template Element (TE): all names, a description (if any), and type of organizations and persons; name and type of a named location
• Template Relations (TR): who works for what organization; where an organization is located; what an organization produces
Domain specific:
• Scenario Template (ST)
Scenario Template Example
"Georgian leader Eduard Shevardnadze suffered nothing worse than cuts and bruises when a bomb exploded yesterday near the parliament building. Officials investigating the bombing said they are blaming a group of people with plans of the parliament building."
Terrorism Event
• Location: Georgia
• Date: 09/06/95
• Type: bombing_event
• Instrument: a bomb
• Victim: Georgian leader Eduard Shevardnadze
• Injury: nothing worse than cuts and bruises
• Accused: a group of people with plans of the parliament building
• Accuser: Officials investigating the bombing
Best Performance in Scenario Template
[Chart: best F-measure by year for MUC-3 through MUC-7]
• No discernible progress on the domain-specific task of scenario templates
Problems with Scenario Template Task
• Templates are too domain dependent
  • not reusable or extensible to new domains
• Answer keys are inappropriate for machine learning
  • inadequate information content: many facts are omitted due to relevancy filtering
  • weak association between facts and texts
  • insufficient quantity: too expensive to produce
• Scenario template conflates too many phenomena
  • entity and descriptor finding
  • sentence understanding
  • co-reference
  • world knowledge / inference
  • relevancy filtering
Named Entity
• Within a document, identify every
  • name mention of locations, persons, and organizations
  • mention of dates, times, monetary amounts, and percentages
Example (with locations, persons, and organizations highlighted on the slide):
"The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."
Template Entity Task
Find all names of an organization/person/location, one description of the organization/person, and a classification for the organization/person/location.
"...according to the report by Edwin Dorn, under secretary of defense for personnel and readiness. … Dorn's conclusion that Washington…"
<ENTITY-9601020516-13> :=
  ENT_NAME: "Edwin Dorn" "Dorn"
  ENT_TYPE: PERSON
  ENT_DESCRIPTOR: "under secretary of defense for personnel and readiness"
  ENT_CATEGORY: PER_CIV
Template Relation Task
Determine who works for what organization, where an organization is located, and what an organization produces.
"Donald M. Goldstein, a historian at the University of Pittsburgh who helped write…"
<EMPLOYEE_OF-9601020516-5> :=
  PERSON: <ENTITY-9601020516-18>
  ORGANIZATION: <ENTITY-9601020516-9>
<ENTITY-9601020516-9> :=
  ENT_NAME: "University of Pittsburgh"
  ENT_TYPE: ORGANIZATION
  ENT_CATEGORY: ORG_CO
<ENTITY-9601020516-18> :=
  ENT_NAME: "Donald M. Goldstein"
  ENT_TYPE: PERSON
  ENT_DESCRIPTOR: "a historian at the University of Pittsburgh"
Performance in MUC/Broadcast News
[Chart: scores from the MUC-6, MUC-7, and broadcast news (BN) evaluations]
Performance in MUC/BN Tasks
• Clear progress in named entity for broadcast news
• Promising progress in the template element task for newswire
• Promising start on three template relations
Existing Approaches
• Manually constructed rules
  • New rules required for each domain, relation/template type, divergent source, new language
  • Written by an expert computational linguist (not just any computational linguist)
  • Cascaded components
  • Adequate performance apparently only for named entity recognition
• Learning algorithms
  • Require manually annotated training data
  • Integrated search space offers potential of reduced errors
  • Adequate performance apparently only for named entity recognition
Named Entity (NE) Extraction
[Diagram: a training program learns NE models from training sentences and their answers; the extractor then applies the models to text or speech-recognition output, marking the locations, persons, and organizations in sentences such as "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."]
• Up to 1996 - no learning approach competitive with hand-built rules
• Since 1997 - statistical approaches (BBN, NYU, MITRE) achieve state-of-the-art performance
• By 1998 - performance on automatically transcribed broadcast news of interest
Typical Architecture
[Pipeline: Message → Text Finder → Morphological Analyzer → Lexical Pattern Matcher → Output (traditional, rule-based architecture)]
• Morphological analysis may determine part of speech
• Lots of manually constructed patterns (illustrated in the sketch below), e.g.
  • <NNP>+ ["Inc" | "Ltd" | "GmbH" ...]
  • <NNP> ["Power & Light"]
  • <NNP>+ ["City" | "River" | "Valley", …]
  • <title> <NNP>+
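As a concrete illustration of this style of rule, a minimal Python sketch; the patterns and keyword lists are invented stand-ins for the slide's <NNP>+ rules, not BBN's actual rule set:

```python
import re

# One or more capitalized words (a stand-in for <NNP>+ in the slide).
CAPSEQ = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"

# Hand-built lexical patterns; the suffix/keyword lists are illustrative only.
PATTERNS = [
    ("ORGANIZATION", re.compile(CAPSEQ + r" (?:Inc|Ltd|GmbH)\.?")),
    ("LOCATION",     re.compile(CAPSEQ + r" (?:City|River|Valley)")),
    ("PERSON",       re.compile(r"(?:Mr|Mrs|Dr|Gen)\. " + CAPSEQ)),
]

def find_names(text):
    """Return (type, string) pairs for every pattern match in the text."""
    return [(label, m.group()) for label, pat in PATTERNS
            for m in pat.finditer(text)]

print(find_names("Gen. Michael Rose toured Acme Widgets Inc. on the Hudson River."))
# [('ORGANIZATION', 'Acme Widgets Inc.'), ('LOCATION', 'Hudson River'),
#  ('PERSON', 'Gen. Michael Rose')]
```

Each new domain, source, or language means writing more lists and patterns of this kind, which is exactly the maintenance burden the slide's "Existing Approaches" critique points at.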
Name Extraction via IdentiFinder™
Structure of IdentiFinder's model:
• One language model for each category, plus one for other (not-a-name) text
• Bi-gram transition probabilities between categories
• The number of categories is learned from training
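A toy Viterbi decoder over such a model; a minimal sketch, not IdentiFinder itself. The probability tables are assumed to have been estimated (with smoothing) from annotated training, and the real model also conditions on word features and on whether the previous word was in the same name class:

```python
import math

CLASSES = ["PERSON", "ORGANIZATION", "OTHER"]

def viterbi(words, p_trans, p_word):
    """Most likely name-class sequence for a sentence.
    p_trans[c_prev][c]     : bigram transition probability between classes
    p_word[c][(w_prev, w)] : word-bigram probability within class c
    (both assumed smoothed so no zero probabilities arise)."""
    # Initialize with the transition out of the START state.
    best = {c: (math.log(p_trans["START"][c])
                + math.log(p_word[c][("START", words[0])]), [c])
            for c in CLASSES}
    # Extend one word at a time, keeping the best path into each class.
    for w_prev, w in zip(words, words[1:]):
        best = {c: max((best[cp][0]
                        + math.log(p_trans[cp][c])
                        + math.log(p_word[c][(w_prev, w)]),
                        best[cp][1] + [c])
                       for cp in CLASSES)
                for c in CLASSES}
    return max(best.values())[1]   # one class label per word
```

Because the whole labeling is chosen jointly by one search, an unlikely word in one class can be rescued by strong context, which is the "integrated search space" advantage the previous slide credits to learning approaches.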
Effect of Speech Recognition Errors (HUB-4 1998)
SER(speech) ≈ SER(text) + WER
That is, the named-entity slot error rate on recognized speech is roughly the slot error rate on clean text plus the recognizer's word error rate; e.g., 10% SER on text and 15% WER predict roughly 25% SER on speech.
Traditional (Rule-Based) Architecture Beyond Names: BBN Architecture in MUC-6 (1995)
[Pipeline diagram: Message → Text Finder → Part of Speech (HMM after 1995) → Named Entity Extraction (HMM) → (Chunk) Parser (LPCFG-HD) → (Chunk) Semantics → Sentence-Level Pattern Matcher → Coref, Merging, & Inference → Template Generator → Output; the early stages operate at the clause/sentence level, the later ones at the discourse/document level]
Waterfall architecture:
• Errors in early processes propagate
• Little chance of correcting errors
• Learning rules for one component at a time
Rule-based Extraction Examples
Determining which person holds what office in what organization:
• [person] , [office] of [org]
  e.g., "Vuk Draskovic, leader of the Serbian Renewal Movement"
• [org] (named, appointed, etc.) [person] P [office]
  e.g., "NATO appointed Wesley Clark as Commander in Chief"
Determining where an organization is located:
• [org] in [loc]
  e.g., "NATO headquarters in Brussels"
• [org] [loc] (division, branch, headquarters, etc.)
  e.g., "KFOR Kosovo headquarters"
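A rough Python sketch of the first pattern, assuming an upstream name finder has already bracketed the entities; the bracket notation and the office word list are invented for the illustration:

```python
import re

PER = r"\[PER ([^\]]+)\]"      # a bracketed person from the name finder
ORG = r"\[ORG ([^\]]+)\]"      # a bracketed organization
OFFICE = r"(leader|president|chairman|commander in chief)"

# Pattern: [person] , [office] of [org]
HOLDS_OFFICE = re.compile(PER + r", " + OFFICE + r" of " + ORG)

text = "[PER Vuk Draskovic], leader of [ORG the Serbian Renewal Movement]"
m = HOLDS_OFFICE.search(text)
if m:
    person, office, org = m.groups()
    print(f"{person} holds office '{office}' in {org}")
```

Every paraphrase ("who heads", "at the helm of", an appositive in a different order) needs its own pattern, which is why the next slides look for a trainable alternative.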
Motivation for a New Approach
• Breakthrough in parsing technology achieved in the mid-90s
  • Few attempts to embed the technology in a task
• Information extraction tasks in MUC-7 (1998) offered an opportunity
Questions:
• How can the (limited) semantic interpretation required for MUC be integrated with parsing?
• Since the source documents were NYT newswire, rather than WSJ, would we need to treebank NYT data?
• Would computational linguists be required as semantic annotators?
New Approach
[Diagram: a trainer learns syntax and semantics models from language and answers; a message is processed at the sentence and discourse levels, then fact identification and a template generator produce the output]
• SIFT, statistical processing of
  • Name finding
  • Part of speech
  • Parsing
• Penn TREEBANK for data about syntax
• Core semantics of descriptions
A TREEBANK Skeletal Parse
[Parse tree (S, NP, VP, SBAR, WHNP, and PP constituents) for the sentence "Nawaz Sharif, who led Pakistan, was ousted October 12 by Pakistani Army General Pervez Muscharraf"]
Integrated Syntactic-Semantic Parsing: Semantic Annotation Required
[Annotated example: "Nance, who is also a paid consultant to ABC News, said ..." marked with a person (Nance), a person-descriptor (a paid consultant to ABC News), an organization (ABC News), the coreference between descriptor and person, and the employee relation linking person and organization]
Semantic training data consists ONLY of:
• Named entities (as in NE)
• Descriptor phrases (as in MUC TE)
• Descriptor references (as in MUC TE)
• Relation/events to be extracted (as in MUC TR)
Automatic Augmentation of Parse Trees (a partial sketch follows this list)
• Add nodes for names and descriptors not bracketed in Treebank, e.g. "Lt. Cmdr. Edwin Lewis"
• Attach semantics to noun phrases corresponding to entities (persons, organizations, descriptors)
• Insert a node indicating the relation between entities (where one entity is a modifier of another)
• Attach semantics indicating the relation to the lowermost ancestor node of the related entities (where one entity is not a modifier of another)
• Add pointer semantics to intermediate nodes for entities not immediately dominated by the relation node
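A schematic of the simplest of these steps, attaching semantic labels to entity noun phrases. The tree encoding and span bookkeeping are invented for the sketch; the full procedure also inserts missing nodes, relation nodes, and pointer chains as listed above:

```python
# A tree node is (label, children); a leaf is (pos_tag, word_string).
# `annotations` maps a token span to a semantic label,
# e.g. {(0, 1): "per", (4, 10): "per-desc"}.

def augment(node, span, annotations):
    """Prefix a constituent's syntactic label with the semantic label of
    any annotated entity whose token span exactly matches it."""
    label, children = node
    if span in annotations:
        label = annotations[span] + "/" + label      # e.g. "per/np"
    if isinstance(children, str):                    # leaf: (tag, word)
        return (label, children)
    start, new_children = span[0], []
    for child in children:
        width = leaf_count(child)
        new_children.append(augment(child, (start, start + width), annotations))
        start += width
    return (label, new_children)

def leaf_count(node):
    _, children = node
    return 1 if isinstance(children, str) else sum(map(leaf_count, children))
```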
Augmented Semantic Tree
[Parse tree for "Nance, who is also a paid consultant to ABC News, said ...", in which each node carries a combined semantic/syntactic label, e.g. per/np, per-desc/np, emp-of/pp-lnk, org-ptr/pp; purely syntactic nodes keep plain labels such as vp and whnp]
Do we need to treebank NYT data? No
• Key idea is to exploit the Penn Treebank
• Train the sentence-level model on syntactic trees from Treebank
• For each sentence in the semantically annotated corpus
  • Parse the sentence, constraining the search to find parses that are consistent with the semantics
  • Augment the syntactic parse with semantic structure
• Result is a corpus that is annotated both semantically and syntactically (outlined in the sketch below)
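In outline; the function names below are placeholders for the slide's steps, not a real API:

```python
# 1. Train a purely syntactic parser on Penn Treebank (WSJ) trees.
parser = train_parser(wsj_treebank_trees)

# 2. For each semantically annotated NYT sentence, restrict the search
#    to parses consistent with the entity/descriptor/relation annotation,
#    then overlay the semantic labels on the chosen parse.
augmented_trees = [
    attach_semantic_labels(parser.parse(sentence, constraints=semantics),
                           semantics)
    for sentence, semantics in nyt_annotated_corpus
]

# 3. Retrain on the augmented trees: one model now produces joint
#    syntactic/semantic analyses, with no NYT treebanking required.
sift_model = train_parser(augmented_trees)
```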
Lexicalized Probabilistic CFG Model
(1) Choose the tree maximizing the product, over all tree nodes, of P(node | history)
Each node's probability is factored as:
• Head category: P(ch | cp), e.g. P(vp | s)
• Left modifier categories: PL(cm | cp, chp, cm-1, wp), e.g. PL(per/np | s, vp, null, said)
• Right modifier categories: PR(cm | cp, chp, cm-1, wp), e.g. PR(emp-of/pp-lnk | per-desc-r/np, per-desc/np, null, consultant)
• Head part-of-speech: P(tm | cm, th, wh), e.g. P(per/nnp | per/np, vbd, said)
• Head word: P(wm | cm, tm, th, wh), e.g. P(nance | per/np, per/nnp, vbd, said)
• Head word features: P(fm | cm, tm, th, wh, known(wm)), e.g. P(cap | per/np, per/nnp, vbd, said, true)
A Generative Model
• Trees are generated top-down, except
  • immediately upon generating each node, its head word and part-of-speech tag are generated
• For each node, child nodes are constructed in three steps (sketched after this list):
  (1) the head node is generated
  (2) premodifier nodes, if any, are generated
  (3) postmodifier nodes, if any, are generated
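A schematic of that three-step generation, with the conditioning contexts abbreviated. The `grammar` sampling interface is invented; each draw would come from the factored distributions of the model above:

```python
def generate(category, head_word, grammar):
    """Generate a subtree top-down: head child first, then premodifiers,
    then postmodifiers, each modifier sequence ending at a STOP draw."""
    if grammar.is_terminal(category):
        return (category, head_word)
    # (1) the head child is generated first and inherits the head word
    head_cat = grammar.sample_head_child(category)
    head = generate(head_cat, head_word, grammar)
    # (2) premodifiers: each conditioned on parent, head child,
    #     previous modifier, and the head word (the P_L factor)
    pre, prev = [], None
    while (m := grammar.sample_premod(category, head_cat, prev, head_word)) != "STOP":
        pre.append(generate(m, grammar.sample_head_word(m, head_word), grammar))
        prev = m
    # (3) postmodifiers, generated the same way (the P_R factor)
    post, prev = [], None
    while (m := grammar.sample_postmod(category, head_cat, prev, head_word)) != "STOP":
        post.append(generate(m, grammar.sample_head_word(m, head_word), grammar))
        prev = m
    return (category, pre + [head] + post)
```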
Tree Generation Example
[Figure showing the tree for "Nance, who is also a paid consultant to ABC News, said ..." partway through top-down generation: the upper structure (s, per/np, per-desc-of/sbar-lnk, vbd "said") is already generated, while the per-desc-ptr/vp subtree for "is also a paid consultant to ABC News" is still being expanded]
Searching the Model
• CKY bottom-up search of the top-down model
• Dynamic programming (see the sketch below)
  • keep only the most likely constituent if several are equivalent relative to all future decisions
• Constituents are considered identical if
  • they have identical category labels,
  • their head constituents have identical labels,
  • they have the same head word,
  • their leftmost modifiers have identical labels, and
  • their rightmost modifiers have identical labels
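The equivalence test can be read as a five-field signature keyed into the chart; a sketch in which the constituent and chart structures are invented for illustration:

```python
def signature(c):
    """Five fields that determine all future parsing decisions:
    constituents agreeing on these are interchangeable."""
    return (c.label,                 # category label
            c.head_child.label,      # head constituent's label
            c.head_word,             # lexical head word
            c.children[0].label,     # leftmost modifier's label
            c.children[-1].label)    # rightmost modifier's label

def add_to_chart(chart, span, constituent):
    """Dynamic programming: keep only the most likely constituent
    per (span, signature) cell."""
    key = (span, signature(constituent))
    if key not in chart or constituent.logprob > chart[key].logprob:
        chart[key] = constituent
```

Collapsing equivalent constituents keeps the chart polynomial in size even though the lexicalized label set is large.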
Cross-Sentence (Merging) Model
• Classifier model applied to entity pairs
  • whose types fit the relation
  • whose first argument is not already related
• Feature-based model
  • structural features, e.g. distance between closest references
  • content features, e.g. similar relations found in training
  • feature probabilities estimated from annotated training
• Compute the odds ratio (sketched below):
  odds = [p(rel) / p(~rel)] × [p(feat1 | rel) / p(feat1 | ~rel)] × [p(feat2 | rel) / p(feat2 | ~rel)] × …
• Create a new relation if the odds ratio > 1.0
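The decision rule is a naive-Bayes-style odds ratio; a minimal sketch with hypothetical numbers:

```python
def merge_odds(p_rel, feature_probs):
    """Odds ratio for creating a relation, treating features as
    conditionally independent.
    feature_probs: list of (p(feat_i | rel), p(feat_i | ~rel)) pairs,
    estimated from annotated training data."""
    odds = p_rel / (1.0 - p_rel)
    for p_given_rel, p_given_not in feature_probs:
        odds *= p_given_rel / p_given_not
    return odds

# Hypothetical feature probabilities; merge iff the odds exceed 1.0.
if merge_odds(0.3, [(0.8, 0.2), (0.6, 0.5)]) > 1.0:
    print("create new cross-sentence relation")
```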
Issues and Answers
• How can the (limited) semantic interpretation required for MUC be integrated with parsing?
  Integrate syntax and semantics by training on, and then generating, parse trees augmented with semantic labels.
• Since the source documents were NYT newswire, rather than WSJ, would we need to treebank NYT data?
  No. First train the parser on WSJ. Then constrain the parser on NYT to produce trees consistent with the semantic annotation. Retrain the parser to produce augmented syntax/semantics trees on the NYT data.
• Must computational linguists be the semantic annotators?
  No; college students from various majors are sufficient.
Issues and Answers (cont.)
• LPCFG can be effectively applied to information extraction
  • A single model performed all necessary sentential processing
• Much future work required for successful deployment
  • Statistical modeling of co-reference
  • Improved performance
  • Cross-document tracking of entities, facts, and events
  • Robust handling of noisy input (speech recognition and OCR)
Pronoun Resolution
• Statistical model attempts to resolve pronouns to
  • a previous mention of an entity (person, organization, geo-political entity, location, facility), or
  • an arbitrary noun phrase, for cases where the pronoun resolves to a non-entity, or
  • null, for cases where the pronoun is unresolvable (such as "it is raining")
• This generative model depends on
  • all previous noun phrases and pronouns
  • the syntactically local lexical environment
  • tree distance (similar to Hobbs '76)
  • number and gender
(A toy sketch of the candidate ranking follows.)
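A toy version of the ranking over candidate antecedents; the factor functions here are an invented interface, and the actual model is generative and conditioned on the items listed above:

```python
def best_antecedent(pronoun, prior_mentions, model):
    """Rank candidate antecedents: prior entity mentions, arbitrary
    noun phrases, or None for the unresolvable case ('it is raining')."""
    candidates = prior_mentions + [None]
    def score(cand):
        return (model.p_tree_distance(pronoun, cand)     # Hobbs-style locality
                * model.p_agreement(pronoun, cand)       # number and gender
                * model.p_lexical_context(pronoun, cand))
    return max(candidates, key=score)
```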
Status
• Named entity extraction is mature enough for technology transfer
  • in multiple languages
  • on online text or automatically recognized text (speech or OCR)
• Fact extraction would benefit from further R&D
  • to increase accuracy from 70-75% on newswire to 85-95% on WWW, newswire, audio, video, or printed matter
  • to reduce training requirements from two person-months to two person-days
  • to correlate facts about entities and events across time and across sources for updating a relational database
Key Effective Ideas in the 90s
• Recent results in learning algorithms
  • Named entity recognition via hidden Markov models
  • Lexicalized probabilistic context-free grammars
  • Pronoun resolution
  • Co-training
• New training data: TREEBANK data (parse trees, pronoun co-reference, ...)
• A recipe for progress
  • Corpus of annotated data
  • An appropriate model for the data
  • Automatic learning techniques
  • Recognition search algorithm
  • Metric-based evaluation
Our Vision
[Diagram: a training program learns extraction models from training sentences and answers; an information extraction system applies them to pull entities, relations, and events into database tables, which feed link analysis, geographic display, and time-line views]