Information Extraction: What has Worked, What hasn't, and What has Promise for the Future
Ralph Weischedel, BBN Technologies
7 November 2000
Outline
• Information extraction tasks & past performance
• Two approaches
• Learning to extract relations
• Our view of the future
MUC Tasks
Generic:
• Named Entity (NE): names only of persons, organizations, locations
• Template Element (TE): all names, a description (if any), and type of organizations and persons; name and type of a named location
• Template Relations (TR): who works for what organization; where an organization is located; what an organization produces
Domain specific:
• Scenario Template (ST)
Scenario Template Example
"Georgian leader Eduard Shevardnadze suffered nothing worse than cuts and bruises when a bomb exploded yesterday near the parliament building. Officials investigating the bombing said they are blaming a group of people with plans of the parliament building."
Terrorism Event
• Location: Georgia
• Date: 09/06/95
• Type: bombing_event
• Instrument: a bomb
• Victim: Georgian leader Eduard Shevardnadze
• Injury: nothing worse than cuts and bruises
• Accused: a group of people with plans of the parliament building
• Accuser: Officials investigating the bombing
Best Performance in Scenario Template
[Chart: best F-measure by year for MUC-3 through MUC-7]
• No discernible progress on the domain-specific task of scenario templates
Problems with Scenario Template Task
• Templates are too domain dependent
  • not reusable or extensible to new domains
• Answer keys are inappropriate for machine learning
  • inadequate information content: many facts are omitted due to relevancy filtering
  • weak association between facts and texts
  • insufficient quantity: too expensive to produce
• Scenario template conflates too many phenomena
  • entity and descriptor finding
  • sentence understanding
  • co-reference
  • world knowledge / inference
  • relevancy filtering
Named Entity
• Within a document, identify every
  • name mention of locations, persons, and organizations
  • mention of dates, times, monetary amounts, and percentages
Example (with locations, persons, and organizations highlighted on the slide):
"The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."
Template Entity Task
Find all names of an organization/person/location, one description of the organization/person, and a classification for the organization/person/location.
"...according to the report by Edwin Dorn, under secretary of defense for personnel and readiness. … Dorn's conclusion that Washington…"
<ENTITY-9601020516-13> :=
  ENT_NAME: "Edwin Dorn" "Dorn"
  ENT_TYPE: PERSON
  ENT_DESCRIPTOR: "under secretary of defense for personnel and readiness"
  ENT_CATEGORY: PER_CIV
Template Relation Task
Determine who works for what organization, where an organization is located, and what an organization produces.
"Donald M. Goldstein, a historian at the University of Pittsburgh who helped write…"
<EMPLOYEE_OF-9601020516-5> :=
  PERSON: <ENTITY-9601020516-18>
  ORGANIZATION: <ENTITY-9601020516-9>
<ENTITY-9601020516-9> :=
  ENT_NAME: "University of Pittsburgh"
  ENT_TYPE: ORGANIZATION
  ENT_CATEGORY: ORG_CO
<ENTITY-9601020516-18> :=
  ENT_NAME: "Donald M. Goldstein"
  ENT_TYPE: PERSON
  ENT_DESCRIPTOR: "a historian at the University of Pittsburgh"
Performance in MUC/Broadcast News
[Chart: scores from the MUC-6, MUC-7, and broadcast news (BN) evaluations]
Performance in MUC/BN Tasks
• Clear progress in named entity for broadcast news
• Promising progress in the template element task for newswire
• Promising start on three template relations
Existing Approaches
• Manually constructed rules
  • New rules required for each domain, relation/template type, divergent source, new language
  • Written by an expert computational linguist (not just any computational linguist)
  • Cascaded components
  • Adequate performance apparently only for named entity recognition
• Learning algorithms
  • Require manually annotated training data
  • Integrated search space offers potential of reduced errors
  • Adequate performance apparently only for named entity recognition
Named Entity (NE) Extraction
[Diagram: a training program learns NE models from training sentences and their answers; the extractor then applies the models to text or speech-recognition output, marking the locations, persons, and organizations in sentences such as "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."]
• Up to 1996 - no learning approach competitive with hand-built rules
• Since 1997 - statistical approaches (BBN, NYU, MITRE) achieve state-of-the-art performance
• By 1998 - performance on automatically transcribed broadcast news of interest
Typical Architecture
[Pipeline: Message → Text Finder → Morphological Analyzer → Lexical Pattern Matcher → Output (traditional, rule-based architecture)]
• Morphological analysis may determine part of speech
• Lots of manually constructed patterns (illustrated in the sketch below), e.g.
  • <NNP>+ ["Inc" | "Ltd" | "GmbH" ...]
  • <NNP> ["Power & Light"]
  • <NNP>+ ["City" | "River" | "Valley", …]
  • <title> <NNP>+
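As a concrete illustration of this style of rule, a minimal Python sketch; the patterns and keyword lists are invented stand-ins for the slide's <NNP>+ rules, not BBN's actual rule set:

```python
import re

# One or more capitalized words (a stand-in for <NNP>+ in the slide).
CAPSEQ = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"

# Hand-built lexical patterns; the suffix/keyword lists are illustrative only.
PATTERNS = [
    ("ORGANIZATION", re.compile(CAPSEQ + r" (?:Inc|Ltd|GmbH)\.?")),
    ("LOCATION",     re.compile(CAPSEQ + r" (?:City|River|Valley)")),
    ("PERSON",       re.compile(r"(?:Mr|Mrs|Dr|Gen)\. " + CAPSEQ)),
]

def find_names(text):
    """Return (type, string) pairs for every pattern match in the text."""
    return [(label, m.group()) for label, pat in PATTERNS
            for m in pat.finditer(text)]

print(find_names("Gen. Michael Rose toured Acme Widgets Inc. on the Hudson River."))
# [('ORGANIZATION', 'Acme Widgets Inc.'), ('LOCATION', 'Hudson River'),
#  ('PERSON', 'Gen. Michael Rose')]
```

Each new domain, source, or language means writing more lists and patterns of this kind, which is exactly the maintenance burden the slide's "Existing Approaches" critique points at.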
Name Extraction via IdentiFinder™
Structure of IdentiFinder's model:
• One language model for each category, plus one for other (not-a-name) text
• Bi-gram transition probabilities between categories
• The number of categories is learned from training
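A toy Viterbi decoder over such a model; a minimal sketch, not IdentiFinder itself. The probability tables are assumed to have been estimated (with smoothing) from annotated training, and the real model also conditions on word features and on whether the previous word was in the same name class:

```python
import math

CLASSES = ["PERSON", "ORGANIZATION", "OTHER"]

def viterbi(words, p_trans, p_word):
    """Most likely name-class sequence for a sentence.
    p_trans[c_prev][c]     : bigram transition probability between classes
    p_word[c][(w_prev, w)] : word-bigram probability within class c
    (both assumed smoothed so no zero probabilities arise)."""
    # Initialize with the transition out of the START state.
    best = {c: (math.log(p_trans["START"][c])
                + math.log(p_word[c][("START", words[0])]), [c])
            for c in CLASSES}
    # Extend one word at a time, keeping the best path into each class.
    for w_prev, w in zip(words, words[1:]):
        best = {c: max((best[cp][0]
                        + math.log(p_trans[cp][c])
                        + math.log(p_word[c][(w_prev, w)]),
                        best[cp][1] + [c])
                       for cp in CLASSES)
                for c in CLASSES}
    return max(best.values())[1]   # one class label per word
```

Because the whole labeling is chosen jointly by one search, an unlikely word in one class can be rescued by strong context, which is the "integrated search space" advantage the previous slide credits to learning approaches.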
Effect of Speech Recognition Errors (HUB-4 1998)
SER(speech) ≈ SER(text) + WER
That is, the named-entity slot error rate on recognized speech is roughly the slot error rate on clean text plus the recognizer's word error rate; e.g., 10% SER on text and 15% WER predict roughly 25% SER on speech.
Traditional (Rule-Based) Architecture Beyond Names: BBN Architecture in MUC-6 (1995)
[Pipeline diagram: Message → Text Finder → Part of Speech (HMM after 1995) → Named Entity Extraction (HMM) → (Chunk) Parser (LPCFG-HD) → (Chunk) Semantics → Sentence-Level Pattern Matcher → Coref, Merging, & Inference → Template Generator → Output; the early stages operate at the clause/sentence level, the later ones at the discourse/document level]
Waterfall architecture:
• Errors in early processes propagate
• Little chance of correcting errors
• Learning rules for one component at a time
Rule-based Extraction Examples
Determining which person holds what office in what organization:
• [person] , [office] of [org]
  e.g., "Vuk Draskovic, leader of the Serbian Renewal Movement"
• [org] (named, appointed, etc.) [person] P [office]
  e.g., "NATO appointed Wesley Clark as Commander in Chief"
Determining where an organization is located:
• [org] in [loc]
  e.g., "NATO headquarters in Brussels"
• [org] [loc] (division, branch, headquarters, etc.)
  e.g., "KFOR Kosovo headquarters"
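A rough Python sketch of the first pattern, assuming an upstream name finder has already bracketed the entities; the bracket notation and the office word list are invented for the illustration:

```python
import re

PER = r"\[PER ([^\]]+)\]"      # a bracketed person from the name finder
ORG = r"\[ORG ([^\]]+)\]"      # a bracketed organization
OFFICE = r"(leader|president|chairman|commander in chief)"

# Pattern: [person] , [office] of [org]
HOLDS_OFFICE = re.compile(PER + r", " + OFFICE + r" of " + ORG)

text = "[PER Vuk Draskovic], leader of [ORG the Serbian Renewal Movement]"
m = HOLDS_OFFICE.search(text)
if m:
    person, office, org = m.groups()
    print(f"{person} holds office '{office}' in {org}")
```

Every paraphrase ("who heads", "at the helm of", an appositive in a different order) needs its own pattern, which is why the next slides look for a trainable alternative.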
Motivation for a New Approach
• Breakthrough in parsing technology achieved in the mid-90s
  • Few attempts to embed the technology in a task
• Information extraction tasks in MUC-7 (1998) offered an opportunity
Questions:
• How can the (limited) semantic interpretation required for MUC be integrated with parsing?
• Since the source documents were NYT newswire, rather than WSJ, would we need to treebank NYT data?
• Would computational linguists be required as semantic annotators?
New Approach
[Diagram: a trainer learns syntax and semantics models from language and answers; a message is processed at the sentence and discourse levels, then fact identification and a template generator produce the output]
• SIFT, statistical processing of
  • Name finding
  • Part of speech
  • Parsing
• Penn TREEBANK for data about syntax
• Core semantics of descriptions
A TREEBANK Skeletal Parse
[Parse tree (S, NP, VP, SBAR, WHNP, and PP constituents) for the sentence "Nawaz Sharif, who led Pakistan, was ousted October 12 by Pakistani Army General Pervez Muscharraf"]
Integrated Syntactic-Semantic Parsing: Semantic Annotation Required
[Annotated example: "Nance, who is also a paid consultant to ABC News, said ..." marked with a person (Nance), a person-descriptor (a paid consultant to ABC News), an organization (ABC News), the coreference between descriptor and person, and the employee relation linking person and organization]
Semantic training data consists ONLY of:
• Named entities (as in NE)
• Descriptor phrases (as in MUC TE)
• Descriptor references (as in MUC TE)
• Relation/events to be extracted (as in MUC TR)
Automatic Augmentation of Parse Trees (a partial sketch follows this list)
• Add nodes for names and descriptors not bracketed in Treebank, e.g. "Lt. Cmdr. Edwin Lewis"
• Attach semantics to noun phrases corresponding to entities (persons, organizations, descriptors)
• Insert a node indicating the relation between entities (where one entity is a modifier of another)
• Attach semantics indicating the relation to the lowermost ancestor node of the related entities (where one entity is not a modifier of another)
• Add pointer semantics to intermediate nodes for entities not immediately dominated by the relation node
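A schematic of the simplest of these steps, attaching semantic labels to entity noun phrases. The tree encoding and span bookkeeping are invented for the sketch; the full procedure also inserts missing nodes, relation nodes, and pointer chains as listed above:

```python
# A tree node is (label, children); a leaf is (pos_tag, word_string).
# `annotations` maps a token span to a semantic label,
# e.g. {(0, 1): "per", (4, 10): "per-desc"}.

def augment(node, span, annotations):
    """Prefix a constituent's syntactic label with the semantic label of
    any annotated entity whose token span exactly matches it."""
    label, children = node
    if span in annotations:
        label = annotations[span] + "/" + label      # e.g. "per/np"
    if isinstance(children, str):                    # leaf: (tag, word)
        return (label, children)
    start, new_children = span[0], []
    for child in children:
        width = leaf_count(child)
        new_children.append(augment(child, (start, start + width), annotations))
        start += width
    return (label, new_children)

def leaf_count(node):
    _, children = node
    return 1 if isinstance(children, str) else sum(map(leaf_count, children))
```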
Augmented Semantic Tree
[Parse tree for "Nance, who is also a paid consultant to ABC News, said ...", in which each node carries a combined semantic/syntactic label, e.g. per/np, per-desc/np, emp-of/pp-lnk, org-ptr/pp; purely syntactic nodes keep plain labels such as vp and whnp]
Do we need to treebank NYT data? No
• Key idea is to exploit the Penn Treebank
• Train the sentence-level model on syntactic trees from Treebank
• For each sentence in the semantically annotated corpus
  • Parse the sentence, constraining the search to find parses that are consistent with the semantics
  • Augment the syntactic parse with semantic structure
• Result is a corpus that is annotated both semantically and syntactically (outlined in the sketch below)
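In outline; the function names below are placeholders for the slide's steps, not a real API:

```python
# 1. Train a purely syntactic parser on Penn Treebank (WSJ) trees.
parser = train_parser(wsj_treebank_trees)

# 2. For each semantically annotated NYT sentence, restrict the search
#    to parses consistent with the entity/descriptor/relation annotation,
#    then overlay the semantic labels on the chosen parse.
augmented_trees = [
    attach_semantic_labels(parser.parse(sentence, constraints=semantics),
                           semantics)
    for sentence, semantics in nyt_annotated_corpus
]

# 3. Retrain on the augmented trees: one model now produces joint
#    syntactic/semantic analyses, with no NYT treebanking required.
sift_model = train_parser(augmented_trees)
```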
Lexicalized Probabilistic CFG Model
(1) Choose the tree maximizing the product, over all tree nodes, of P(node | history)
Each node's probability is factored as:
• Head category: P(ch | cp), e.g. P(vp | s)
• Left modifier categories: PL(cm | cp, chp, cm-1, wp), e.g. PL(per/np | s, vp, null, said)
• Right modifier categories: PR(cm | cp, chp, cm-1, wp), e.g. PR(emp-of/pp-lnk | per-desc-r/np, per-desc/np, null, consultant)
• Head part-of-speech: P(tm | cm, th, wh), e.g. P(per/nnp | per/np, vbd, said)
• Head word: P(wm | cm, tm, th, wh), e.g. P(nance | per/np, per/nnp, vbd, said)
• Head word features: P(fm | cm, tm, th, wh, known(wm)), e.g. P(cap | per/np, per/nnp, vbd, said, true)
A Generative Model
• Trees are generated top-down, except
  • immediately upon generating each node, its head word and part-of-speech tag are generated
• For each node, child nodes are constructed in three steps (sketched after this list):
  (1) the head node is generated
  (2) premodifier nodes, if any, are generated
  (3) postmodifier nodes, if any, are generated
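A schematic of that three-step generation, with the conditioning contexts abbreviated. The `grammar` sampling interface is invented; each draw would come from the factored distributions of the model above:

```python
def generate(category, head_word, grammar):
    """Generate a subtree top-down: head child first, then premodifiers,
    then postmodifiers, each modifier sequence ending at a STOP draw."""
    if grammar.is_terminal(category):
        return (category, head_word)
    # (1) the head child is generated first and inherits the head word
    head_cat = grammar.sample_head_child(category)
    head = generate(head_cat, head_word, grammar)
    # (2) premodifiers: each conditioned on parent, head child,
    #     previous modifier, and the head word (the P_L factor)
    pre, prev = [], None
    while (m := grammar.sample_premod(category, head_cat, prev, head_word)) != "STOP":
        pre.append(generate(m, grammar.sample_head_word(m, head_word), grammar))
        prev = m
    # (3) postmodifiers, generated the same way (the P_R factor)
    post, prev = [], None
    while (m := grammar.sample_postmod(category, head_cat, prev, head_word)) != "STOP":
        post.append(generate(m, grammar.sample_head_word(m, head_word), grammar))
        prev = m
    return (category, pre + [head] + post)
```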
Tree Generation Example
[Figure showing the tree for "Nance, who is also a paid consultant to ABC News, said ..." partway through top-down generation: the upper structure (s, per/np, per-desc-of/sbar-lnk, vbd "said") is already generated, while the per-desc-ptr/vp subtree for "is also a paid consultant to ABC News" is still being expanded]
Searching the Model
• CKY bottom-up search of the top-down model
• Dynamic programming (see the sketch below)
  • keep only the most likely constituent if several are equivalent relative to all future decisions
• Constituents are considered identical if
  • they have identical category labels,
  • their head constituents have identical labels,
  • they have the same head word,
  • their leftmost modifiers have identical labels, and
  • their rightmost modifiers have identical labels
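The equivalence test can be read as a five-field signature keyed into the chart; a sketch in which the constituent and chart structures are invented for illustration:

```python
def signature(c):
    """Five fields that determine all future parsing decisions:
    constituents agreeing on these are interchangeable."""
    return (c.label,                 # category label
            c.head_child.label,      # head constituent's label
            c.head_word,             # lexical head word
            c.children[0].label,     # leftmost modifier's label
            c.children[-1].label)    # rightmost modifier's label

def add_to_chart(chart, span, constituent):
    """Dynamic programming: keep only the most likely constituent
    per (span, signature) cell."""
    key = (span, signature(constituent))
    if key not in chart or constituent.logprob > chart[key].logprob:
        chart[key] = constituent
```

Collapsing equivalent constituents keeps the chart polynomial in size even though the lexicalized label set is large.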
Cross-Sentence (Merging) Model
• Classifier model applied to entity pairs
  • whose types fit the relation
  • whose first argument is not already related
• Feature-based model
  • structural features, e.g. distance between closest references
  • content features, e.g. similar relations found in training
  • feature probabilities estimated from annotated training
• Compute the odds ratio (sketched below):
  odds = [p(rel) / p(~rel)] × [p(feat1 | rel) / p(feat1 | ~rel)] × [p(feat2 | rel) / p(feat2 | ~rel)] × …
• Create a new relation if the odds ratio > 1.0
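The decision rule is a naive-Bayes-style odds ratio; a minimal sketch with hypothetical numbers:

```python
def merge_odds(p_rel, feature_probs):
    """Odds ratio for creating a relation, treating features as
    conditionally independent.
    feature_probs: list of (p(feat_i | rel), p(feat_i | ~rel)) pairs,
    estimated from annotated training data."""
    odds = p_rel / (1.0 - p_rel)
    for p_given_rel, p_given_not in feature_probs:
        odds *= p_given_rel / p_given_not
    return odds

# Hypothetical feature probabilities; merge iff the odds exceed 1.0.
if merge_odds(0.3, [(0.8, 0.2), (0.6, 0.5)]) > 1.0:
    print("create new cross-sentence relation")
```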
Issues and Answers
• How can the (limited) semantic interpretation required for MUC be integrated with parsing?
  Integrate syntax and semantics by training on, and then generating, parse trees augmented with semantic labels.
• Since the source documents were NYT newswire, rather than WSJ, would we need to treebank NYT data?
  No. First train the parser on WSJ. Then constrain the parser on NYT to produce trees consistent with the semantic annotation. Retrain the parser to produce augmented syntax/semantics trees on the NYT data.
• Must computational linguists be the semantic annotators?
  No; college students from various majors are sufficient.
Issues and Answers (cont.)
• LPCFG can be effectively applied to information extraction
  • A single model performed all necessary sentential processing
• Much future work required for successful deployment
  • Statistical modeling of co-reference
  • Improved performance
  • Cross-document tracking of entities, facts, and events
  • Robust handling of noisy input (speech recognition and OCR)
Pronoun Resolution
• Statistical model attempts to resolve pronouns to
  • a previous mention of an entity (person, organization, geo-political entity, location, facility), or
  • an arbitrary noun phrase, for cases where the pronoun resolves to a non-entity, or
  • null, for cases where the pronoun is unresolvable (such as "it is raining")
• This generative model depends on
  • all previous noun phrases and pronouns
  • the syntactically local lexical environment
  • tree distance (similar to Hobbs '76)
  • number and gender
(A toy sketch of the candidate ranking follows.)
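A toy version of the ranking over candidate antecedents; the factor functions here are an invented interface, and the actual model is generative and conditioned on the items listed above:

```python
def best_antecedent(pronoun, prior_mentions, model):
    """Rank candidate antecedents: prior entity mentions, arbitrary
    noun phrases, or None for the unresolvable case ('it is raining')."""
    candidates = prior_mentions + [None]
    def score(cand):
        return (model.p_tree_distance(pronoun, cand)     # Hobbs-style locality
                * model.p_agreement(pronoun, cand)       # number and gender
                * model.p_lexical_context(pronoun, cand))
    return max(candidates, key=score)
```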
Status
• Named entity extraction is mature enough for technology transfer
  • in multiple languages
  • on online text or automatically recognized text (speech or OCR)
• Fact extraction would benefit from further R&D
  • to increase accuracy from 70-75% on newswire to 85-95% on WWW, newswire, audio, video, or printed matter
  • to reduce training requirements from two person-months to two person-days
  • to correlate facts about entities and events across time and across sources for updating a relational database
Key Effective Ideas in the 90s
• Recent results in learning algorithms
  • Named entity recognition via hidden Markov models
  • Lexicalized probabilistic context-free grammars
  • Pronoun resolution
  • Co-training
• New training data: TREEBANK data (parse trees, pronoun co-reference, ...)
• A recipe for progress
  • Corpus of annotated data
  • An appropriate model for the data
  • Automatic learning techniques
  • Recognition search algorithm
  • Metric-based evaluation
Our Vision
[Diagram: a training program learns extraction models from training sentences and answers; an information extraction system applies them to pull entities, relations, and events into database tables, which feed link analysis, geographic display, and time-line views]