Information Extraction • Extract meaningful information from text • Without fully understanding everything! • Basic idea: • Define domain-specific templates • Simple and reliable linguistic processing • Recognize known types of entities and relations • Fill templates with recognized information
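To make the template idea concrete, here is a minimal Python sketch of a domain-specific template as a plain data structure. The class and field names mirror the tornado example on the next slide and are purely illustrative, not taken from any particular system.

```python
# Minimal sketch: a domain-specific template as a plain data structure.
# Field names mirror the tornado example below; all names are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TornadoTemplate:
    event: Optional[str] = None            # e.g. "tornado"
    date: Optional[str] = None             # e.g. "4/3/97"
    time: Optional[str] = None             # e.g. "19:15"
    location: list = field(default_factory=list)   # most to least specific
    damage: list = field(default_factory=list)     # (object, count) pairs
    injuries: Optional[str] = None

t = TornadoTemplate(event="tornado", time="19:15",
                    location=["northwest Dallas", "Texas", "USA"],
                    damage=[("mobile homes", 2), ("Texaco station", 1)])
print(t)
```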
Example
Source text: "4 Apr. Dallas – Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about 7:15 pm and destroyed two mobile homes. The Texaco station at 102 Main St. was also severely damaged, but no injuries were reported."
Filled template: Event: tornado • Date: 4/3/97 • Time: 19:15 • Location: "northwest Dallas" : Texas : USA • Damage: "mobile homes" (2), "Texaco station" (1) • Injuries: none
Tokenization & Tagging Sentence Analysis Early last evening, a tornado swept through northwest Dallas. The twister occurred without warning at about .... Merging Pattern Extraction Template Generation tornado swept: Event: tornado through northwest Dallas: Loc: “northwest Dallas” causing extensive damage: Damage Early last evening: adv-phrase:time a tornado: noun-group:subject swept: verb-group ... Early/ADV last/ADJ evening/NN:time ,/, a/DT tornado/NN:weather swept/VBD ... 4 Apr. Dallas – Early last evening, a tornado swept through northwest.... Event: tornado Date: 4/3/97 Time: 19:15 Location: “northwest Dallas” : Texas : USA ...
MUC: Message Understanding Conference • “Competitive” conference with predefined tasks for research groups to address • Tasks (MUC-7): • Named Entities: Extract typed entities from text • Equivalence Classes: Solving coreference • Attributes: Fill in attributes of entities • Facts: Extract logical relations between entities • Events: Extract descriptions of events from text
Tokenization & Tagging • Tokenization & POS tagging • Also lexical semantic information, such as “time”, “location”, “weather”, “person”, etc. Sentence Analysis • Shallow parsing for phrase types • Use tagging & semantics to tag phrases • Note phrase heads
Pattern Extraction • Find domain-specific relations between text units • Typically use lexical triggers and relation-specific patterns to recognize relations
Concept: Damaged-Object • Trigger: destroyed • Position: direct-object • Constraints: physical-thing
Example: ... and [ destroyed ] [ two mobile homes ] → Damaged-Object = "two mobile homes"
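A hedged sketch of applying such a pattern: the frame fields mirror the Damaged-Object example above, while the clause representation (a dict from syntactic roles to fillers) is an assumption made for illustration.

```python
# Sketch of the Damaged-Object pattern as a trigger + syntactic-position
# check over a shallow parse. The frame fields mirror the slide; the
# clause representation is an assumption.
PATTERN = {"concept": "Damaged-Object",
           "trigger": "destroyed",
           "position": "direct-object",
           "constraint": "physical-thing"}

def apply_pattern(pattern, clause):
    # clause: dict mapping syntactic roles to (text, semantic-class) pairs
    verb = clause.get("verb-group")
    filler = clause.get(pattern["position"])
    if (verb == pattern["trigger"] and filler
            and filler[1] == pattern["constraint"]):
        return {pattern["concept"]: filler[0]}
    return None

clause = {"verb-group": "destroyed",
          "direct-object": ("two mobile homes", "physical-thing")}
print(apply_pattern(PATTERN, clause))
# -> {'Damaged-Object': 'two mobile homes'}
```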
Learning Extraction Patterns • Very difficult to predefine extraction patterns • Must be redone for each new domain • Hence, corpus-based approaches are indicated • Some methods: • AutoSlog (1992) – “syntactic” learning • PALKA (1995) – “conceptual” learning • CRYSTAL (1995) – covering algorithm
AutoSlog (Lehnert 1992) • Patterns based on recognizing “concepts” • Concept: what concept to recognize • Trigger: a word indicating an occurrence • Position: what syntactic role the concept will take in the sentence • Constraints: what type of entity to allow • Enabling conditions: constraints on the linguistic context
Example: Concept: Event-Time • Trigger: "at" • Position: prep-phrase-object • Constraints: time • Enabling conditions: post-verb
"The twister occurred without warning at about 7:15 pm and destroyed two mobile homes." → Event-Time = 19:15
Learning Patterns • Supervised: the training data is text annotated with the items to be extracted • Knowledge: 13 general syntactic patterns • Algorithm: • Find a sentence containing the target noun phrase "two mobile homes" • Partially parse the sentence to find syntactic relations • Try all 13 linguistic patterns to find a match • Generate a concept pattern from the match
Linguistic Patterns • Identify domain-specific thematic roles based on syntactic structure
Example heuristic: active-voice-verb followed by target = direct-object
Concept = target concept • Trigger = verb of active-voice-verb • Position = direct-object • Constraints = semantic-class of target • Enabling conditions = active-voice
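A sketch of how this heuristic could turn one annotated example into a concept pattern, in the spirit of AutoSlog; the parse representation and function name are assumptions, and real AutoSlog tries all 13 heuristics against each annotated noun phrase.

```python
# Sketch of AutoSlog-style pattern generation: given a target NP and a
# shallow parse of its sentence, try the heuristic "active-voice verb
# followed by target as direct object" and emit a concept frame.
def active_verb_dobj(parse, target, concept, sem_class):
    if parse.get("voice") == "active" and parse.get("direct-object") == target:
        return {"concept": concept,
                "trigger": parse["verb"],
                "position": "direct-object",
                "constraints": sem_class,
                "enabling": "active-voice"}
    return None

parse = {"voice": "active", "verb": "destroyed",
         "subject": "a tornado", "direct-object": "two mobile homes"}
print(active_verb_dobj(parse, "two mobile homes",
                       "Damaged-Object", "physical-thing"))
# -> the Damaged-Object concept frame from the earlier slide
```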
More Examples • victim was murdered • perpetrator bombed • perpetrator attempted to kill • was aimed at target • Some bad extraction patterns occur (e.g., "is" as a trigger) • A human review process is therefore needed
CRYSTAL • Complex syntactic patterns • Use “covering” algorithm: • Generate most specific possible patterns for all occurrences of targets in corpus • Loop: • Find most specific unifier of the most similar patterns C & C’, generating new pattern P • If P has less than ε error on corpus, replace C and C’ with P • Continue until no new patterns can be added
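A rough sketch of the covering loop under strong simplifications: a pattern is a frozenset of constraints, the "most specific unifier" is set intersection, and "most similar" is approximated by choosing the merge that preserves the most constraints. None of this is CRYSTAL's actual representation.

```python
# Simplified CRYSTAL-style covering loop over a labeled corpus of clauses.
def error(pattern, corpus):
    # Fraction of clauses covered by `pattern` that are NOT positive.
    hits = [label for feats, label in corpus if pattern <= feats]
    return 1 - sum(hits) / len(hits) if hits else 0.0

def crystal(corpus, eps=0.4):
    # Start from the most specific pattern for each positive instance.
    patterns = {feats for feats, label in corpus if label}
    while True:
        best = None
        for c in patterns:
            for c2 in patterns:
                if c is c2:
                    continue
                p = c & c2                     # most specific unifier
                if p and error(p, corpus) < eps:
                    if best is None or len(p) > len(best[2]):
                        best = (c, c2, p)      # prefer the most similar pair
        if best is None:
            return patterns                    # no admissible merge remains
        c, c2, p = best
        patterns -= {c, c2}
        patterns.add(p)

corpus = [(frozenset({"verb=destroyed", "dobj=physical-thing"}), 1),
          (frozenset({"verb=damaged", "dobj=physical-thing"}), 1),
          (frozenset({"verb=reported", "dobj=physical-thing"}), 0)]
print(crystal(corpus))   # merges the two positives into one general pattern
```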
Merging Motor Vehicles International Corp. announced a major management shake-up ... MVI said the CEO has resigned ... The Big 10 auto maker is attempting to regain market share ... It will announce losses ... A company spokesman said they are moving their operations ... MVI, the first company to announce such a move since the passage of the new international trade agreement, is facing increasing demands from unionized workers...
Coreference Resolution • Many different kinds of linguistic phenomena: • Proper names • Aliases (MVI) • Definite NPs (the Big 10 auto maker) • Pronouns (it, they) • Appositives (, the first company to ...) • Errors of previous phases may be amplified
Learning to Merge • Treat coreference as a classification task • Should this pair of entities be linked? • Methodology: • Training corpus: manually link all coreferential expressions • Each possible pair of entities is a training example: positive if the pair is linked, negative otherwise • Create a feature vector for each example • Use your favorite learning algorithm
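A minimal sketch of the example-generation step: every pair of annotated mentions becomes one labeled feature vector. The three features here are toy stand-ins for a real feature set (MLR, next slide, used 66).

```python
# Sketch: turning manually linked coreference chains into pairwise training
# examples. `mentions` pairs each phrase with its chain id.
from itertools import combinations

PRONOUNS = {"it", "they", "he", "she"}

def pair_features(m1, m2):
    t1, t2 = m1.lower(), m2.lower()
    return {"word_overlap": int(bool(set(t1.split()) & set(t2.split()))),
            "both_pronoun": int(t1 in PRONOUNS and t2 in PRONOUNS),
            "same_head": int(t1.split()[-1] == t2.split()[-1])}

def make_examples(mentions):
    for (m1, c1), (m2, c2) in combinations(mentions, 2):
        yield pair_features(m1, m2), int(c1 == c2)   # positive iff linked

mentions = [("Motor Vehicles International Corp.", 0), ("MVI", 0),
            ("it", 0), ("a company spokesman", 1)]
for feats, label in make_examples(mentions):
    print(label, feats)
```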
MLR (1995) • 66 features were used, in 4 categories: • Lexical features of each phrase, e.g., do they overlap? • Grammatical role of each phrase, e.g., subject, direct-object • Semantic classes of each phrase, e.g., physical-thing, company • Relative positions of the phrases, e.g., X one sentence after Y • Decision-tree learning (C4.5)
C4.5 • Incrementally build a decision tree from labeled training examples • At each stage choose the "best" attribute to split the dataset • E.g., use info-gain to compare features • After building the complete tree, prune the leaves to prevent overfitting • Use statistical tests to check whether a leaf holds enough examples; if not, prune it
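A small worked sketch of the split criterion: plain information gain over a labeled dataset (C4.5 proper uses the closely related gain ratio; plain gain is shown for brevity).

```python
# Information gain for choosing the "best" split attribute.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    # examples: list of (feature-dict, label) pairs
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

data = [({"alias": 1, "same_sent": 0}, 1),
        ({"alias": 0, "same_sent": 0}, 0),
        ({"alias": 1, "same_sent": 1}, 1),
        ({"alias": 0, "same_sent": 1}, 0)]
print(info_gain(data, "alias"), info_gain(data, "same_sent"))
# alias perfectly separates the labels -> gain 1.0; same_sent -> gain 0.0
```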
[Figure: example C4.5 decision tree. The root splits 40 training examples on f1 into subsets of 25 and 15; these split on f2 (18 vs. 7 examples) and f3 (2 vs. 13 examples); the leaves are labeled C1 and C2.]
RESOLVE (1995) • C4.5 with 8 complex features: • NAME-{1,2}: does reference include a name? • JV-CHILD-{1,2}: does reference refer to part of a joint venture? • ALIAS: does one reference contain an alias for the other? • BOTH-JV-CHILD: do both refer to part of a joint venture? • COMMON-NP: do both contain a common NP? • SAME-SENTENCE: are both in the same sentence?
RESOLVE Results • 50 texts, leave-one-out cross-validation
Full System: FASTUS (1996)
Pipeline: Input Text → Pattern Recognition → Partial Templates → Coreference Resolution → Template Merger → Output Template
Pattern Recognition • Multiple passes of finite-state methods
Example: "John Smith, 47, was named president of ABC Corp." Early passes tag Pers-Name, Num, Aux, V, P, N, Org-Name; later passes group these into Poss-N-Group and V-Group; a final pass recognizes a Domain-Event.
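A toy sketch of the cascade idea using regex rewrites in place of genuine finite-state machines; the bracketed tag names follow the slide, everything else is illustrative.

```python
# Cascaded finite-state passes as successive regex rewrites over a
# bracketed string. FASTUS compiles real finite-state machines; the
# regexes here only illustrate the layering, pass by pass.
import re

s = "John Smith , 47 , was named president of ABC Corp."

# Pass 1: basic entities and tokens.
s = re.sub(r"[A-Z][a-z]+ [A-Z][a-z]+(?= ,)", "[Pers-Name]", s)
s = re.sub(r"[A-Z][A-Za-z]* Corp\.", "[Org-Name]", s)
s = re.sub(r"\b\d+\b", "[Num]", s)

# Pass 2: phrase groups built over the pass-1 output.
s = re.sub(r"was named", "[V-Group]", s)
s = re.sub(r"president of \[Org-Name\]", "[Poss-N-Group]", s)

# Pass 3: a domain event built over the phrase groups.
s = re.sub(r"\[Pers-Name\] , \[Num\] , \[V-Group\] \[Poss-N-Group\]",
           "[Domain-Event]", s)
print(s)   # -> [Domain-Event]
```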
Partially-Instantiated Templates • Domain-dependent!!
• Person: _______ | Pos: President | Org: ABC Corp.
• Person: John Smith | Pos: President | Org: ABC Corp. | Start: ___ | End: ___
The Next Sentence... "He replaces Mike Jones." • Coreference analysis: He = John Smith
• Person: Mike Jones | Pos: ________ | Org: ________
• Person: John Smith | Pos: ________ | Org: ________ | Start: ___ | End: ___
Unification • Unify new templates with the preceding template(s), if possible...
• Person: Mike Jones | Pos: President | Org: ABC Corp.
• Person: John Smith | Pos: President | Org: ABC Corp. | Start: ___ | End: ___
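A minimal sketch of the unification step: two partial templates merge iff their filled fields do not conflict, with blanks (None) unifying with anything.

```python
# Template unification: merge two partial templates field by field,
# failing on any conflicting fillers.
def unify(t1, t2):
    merged = {}
    for key in t1.keys() | t2.keys():
        v1, v2 = t1.get(key), t2.get(key)
        if v1 is not None and v2 is not None and v1 != v2:
            return None                     # conflicting fillers: fail
        merged[key] = v1 if v1 is not None else v2
    return merged

outgoing = {"Person": "Mike Jones", "Pos": None, "Org": None}
earlier = {"Person": None, "Pos": "President", "Org": "ABC Corp."}
print(unify(outgoing, earlier))
# -> {'Person': 'Mike Jones', 'Pos': 'President', 'Org': 'ABC Corp.'}
```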
Principle of Least Commitment • Idea: Maintain options as long as possible • E.g., parsing – maintain a lattice structure:
"The committee heads announced that..." The lattice keeps both readings of "heads" (noun: The/DT committee/NN1 heads/NN2 as one N-GRP; verb: heads/VBZ); here announced/VBD that/CSub forces the noun-group reading → Event: Announce, Actor: Committee heads
Principle of Least Commitment (cont.) • The same lattice, resolved the other way:
"The committee heads ABC's recruitment effort." Here heads/VBZ is the main verb, and The/DT committee/NN1 and ABC's/NNpos recruitment/NN effort/NN2 form two N-GRPs → Head: Committee, Effort: ABC's recruitment
More Least Commitment • Maintain multiple coreference hypotheses: • Disambiguate when creating domain-events • More information is available by then • Too many possibilities? • Use a beam-search algorithm: maintain the k 'best' hypotheses at every stage (sketched below)
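A hedged sketch of that beam search: each hypothesis assigns the mentions seen so far to entity ids, and only the k best-scoring hypotheses survive each step. The scoring function is a toy acronym heuristic, not any system's actual model.

```python
# Beam search over coreference hypotheses: keep only the k best partial
# assignments of mentions to entity ids at every step.
def beam_search(mentions, score, k=2):
    beam = [((), 0.0)]                       # (assignment, running score)
    for _ in mentions:
        candidates = []
        for assign, s in beam:
            # Join any existing entity, or start a fresh one.
            for entity in range(max(assign, default=-1) + 2):
                new = assign + (entity,)
                candidates.append((new, s + score(mentions, new)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]                # prune to the k best
    return beam

def toy_score(mentions, assign):
    # +1 when the newest mention is an acronym of a co-assigned mention.
    i = len(assign) - 1
    return sum(1.0 for j in range(i)
               if assign[j] == assign[i]
               and mentions[i] == "".join(w[0] for w in mentions[j].split()))

mentions = ["Motor Vehicles International", "MVI", "Mike Jones"]
print(beam_search(mentions, toy_score)[0])
# best hypothesis links "MVI" with "Motor Vehicles International"
```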