LING / C SC 439/539 Statistical Natural Language Processing • Lecture 22 • 4/8/2013
Recommended reading • Jurafsky & Martin, Chapter 22, Information Extraction • Ellen Riloff. 1996. Automatically generating extraction patterns from untagged text. Proceedings of AAAI. • Roman Yangarber et al. 2000. Automatic acquisition of domain knowledge for information extraction. Proceedings of COLING. • Roman Yangarber. 2003. Counter-training in discovery of semantic patterns. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL). • William Gale and Kenneth Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1). • Regina Barzilay and Kathleen McKeown. 2001. Extracting paraphrases from a parallel corpus. Proceedings of ACL/EACL.
Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases
Pipeline (sequence of steps) for pattern-based extraction: Lexical Analysis → Name Recognition → (Partial) Syntax → Rel'n/Event Patterns → Reference Resolution → Output Generation
Paraphrase relations • Similar linguistic patterns • person was appointed as post of company • company named person to post • Results from: • Different words • named, appointed, selected, chosen, promoted, … • Different syntactic constructions • IBM named Fred president • IBM announced the appointment of Fred as president • Fred, who was named president by IBM
MUC templates and gold standard answers

TST4-MUC4-0010
SANTIAGO, 31 JUL 88 (EL MERCURIO) -- [TEXT] RANCAGUA – THE NATIONAL VANGUARD OFFICES IN THIS CITY WERE ATTACKED ON 29 JULY AT 2220. UNIDENTIFIED INDIVIDUALS DETONATED A BOMB THAT DAMAGED THE WINDOWS OF THE NATIONAL VANGUARD OFFICES AND THOSE OF THE NEIGHBORING HOUSES.

0. MESSAGE: ID                    TST4-MUC4-0010
1. MESSAGE: TEMPLATE              1
2. INCIDENT: DATE                 29 JUL 88
3. INCIDENT: LOCATION             CHILE: RANCAGUA (CITY)
4. INCIDENT: TYPE                 BOMBING
5. INCIDENT: STAGE OF EXECUTION   ACCOMPLISHED
6. INCIDENT: INSTRUMENT ID        "BOMB"
7. INCIDENT: INSTRUMENT TYPE      BOMB: "BOMB"
8. PERP: INCIDENT CATEGORY        -
9. PERP: INDIVIDUAL ID            "UNIDENTIFIED INDIVIDUALS"
10. PERP: ORGANIZATION ID         -
11. PERP: ORGANIZATION CONFIDENCE -
12. PHYS TGT: ID                  "NATIONAL VANGUARD OFFICES"; "NEIGHBORING HOUSES" / "HOUSES"
13. PHYS TGT: TYPE                ORGANIZATION OFFICE / COMMERCIAL / OTHER: "NATIONAL VANGUARD OFFICES"; CIVILIAN RESIDENCE: "NEIGHBORING HOUSES" / "HOUSES"
14. PHYS TGT: NUMBER              PLURAL: "NATIONAL VANGUARD OFFICES"; PLURAL: "NEIGHBORING HOUSES" / "HOUSES"
15. PHYS TGT: FOREIGN NATION      -
16. PHYS TGT: EFFECT OF INCIDENT  SOME DAMAGE: "NATIONAL VANGUARD OFFICES"; SOME DAMAGE: "NEIGHBORING HOUSES" / "HOUSES"
17. PHYS TGT: TOTAL NUMBER        -
Machine learning for information extraction • Want to learn IE patterns and paraphrases • Annotation of corpora is very expensive • MUC and other corpora have limited linguistic coverage • Patterns would need to be highly domain-specific • Medicine, terrorism, business, news, law, etc. • Therefore would need an annotated corpus for each domain • Solution: use semi-supervised learning
Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases
Semi-supervised learning of IE patterns • Intuition: if we collect a set of documents D_R relevant to the scenario, patterns relevant to the scenario will occur more frequently in D_R than in the language as a whole
Riloff 1996 • AutoSlog: an earlier supervised system that began with knowledge of target strings • Strings in the text are already annotated as: victim, perpetrator, target, instrument • Want to discover extraction patterns for these strings • Apply sentence analyzer • CIRCUS (Lehnert 1991) • Finds subject, verb, direct object, prepositional phrases • See what patterns occur with annotated strings
Semi-supervised pattern extraction: AutoSlog-TS • Begin with a corpus of relevant documents and irrelevant documents • Construct patterns for all noun phrases in all documents • Using the 13 pattern templates from the previous slide • Compare frequencies of patterns in relevant vs. irrelevant documents
Finding relevant patterns • Relevance rate for pattern_i = p(relevant doc | doc contains pattern_i) • Also consider raw frequency of patterns • Some relevant patterns, such as "X was killed", occur in both relevant and irrelevant texts • Rank patterns by: relevance rate * log2(frequency), as in the sketch below
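A minimal sketch of this ranking in Python (not Riloff's exact implementation, which counts pattern instances; this sketch counts at the document level, with each document represented as the set of pattern IDs that fire in it):

```python
from collections import Counter
from math import log2

def rank_patterns(rel_docs, irrel_docs, min_freq=2):
    """AutoSlog-TS-style pattern ranking (a sketch).
    rel_docs / irrel_docs: iterables of sets of pattern IDs."""
    # Document frequency of each pattern in relevant docs and overall
    rel_counts = Counter(p for doc in rel_docs for p in doc)
    all_counts = rel_counts + Counter(p for doc in irrel_docs for p in doc)

    scores = {}
    for pattern, freq in all_counts.items():
        if freq < min_freq:          # discard patterns occurring only once
            continue
        relevance_rate = rel_counts[pattern] / freq   # p(relevant | pattern)
        scores[pattern] = relevance_rate * log2(freq)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The log2(frequency) factor keeps very rare but perfectly "relevant" patterns from dominating the ranking.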
Apply to terrorism data • MUC-4 training set: 1500 documents, about 50% relevant • AutoSlog-TS generated 32,345 unique extraction patterns • Discarded patterns occurring only once; 11,225 patterns remained • Rank patterns • By the formula on the previous slide • Manual filtering by a human (!!!) • Started with 1,970 patterns, kept 210 • Evaluation: compared by hand against 100 documents in the test set
Problems with AutoSlog-TS • Assumes you already know which documents are relevant and irrelevant • Manual review of extracted patterns by a human (kept 210 out of 1970) • How many should we choose?
Yangarber et al. 2000: ExDisco ("semi-unsupervised") • Begin with seed extraction patterns written by hand • Use these seed patterns to identify relevant documents • Construct new patterns for all documents • Rank patterns by their correlation with document relevance • Add the highest ranking pattern to the pattern set • Apply patterns to corpus, and repeat process
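A minimal sketch of this bootstrapping loop, assuming each document is represented as the set of candidate patterns that fire in it (the scoring function here is a simplified stand-in for ExDisco's actual correlation measure):

```python
def exdisco(seed_patterns, documents, candidate_patterns, n_iters=100):
    """Sketch of the ExDisco bootstrapping loop (Yangarber et al. 2000).
    documents: dict mapping doc ID -> set of patterns occurring in it."""
    accepted = set(seed_patterns)
    for _ in range(n_iters):
        # 1. Documents matched by current patterns are deemed relevant
        relevant = {d for d, pats in documents.items() if pats & accepted}
        if not relevant:
            break

        # 2. Score each remaining candidate by its association with relevance
        def score(p):
            docs_with_p = {d for d, pats in documents.items() if p in pats}
            if not docs_with_p:
                return 0.0
            return len(docs_with_p & relevant) / len(docs_with_p)

        candidates = [p for p in candidate_patterns if p not in accepted]
        if not candidates:
            break
        # 3. Add the single highest-scoring pattern and repeat
        accepted.add(max(candidates, key=score))
    return accepted
```

Because newly accepted patterns enlarge the relevant document set, each iteration can pull in patterns the seeds alone would never have found.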
Experiments • Data: • MUC-6 (news reports) • Topic: negotiation of labor disputes and corporate management succession • Compared performance of: • Seed patterns only • Top 100 extracted by ExDisco • Patterns manually developed by computational linguists for 1 month on MUC data • (and others)
Performance on test data

          Recall   Precision   F-measure
Seed        27        74         39.58
ExDisco     52        72         60.16
Manual      47        70         56.40
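For reference, the F-measure here is the balanced harmonic mean of precision and recall; recomputing it from the rounded values above gives numbers within a few tenths of the table (the paper presumably used unrounded precision/recall):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta=1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_measure(74, 27), 2))  # approx. 39.56 for the Seed row
```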
Yangarber 2003 • Problem: how do you know how many patterns to keep? • Earlier system kept the top 100, an arbitrary number • Lower-ranked patterns tend not to be domain-specific
Counter-training • Basic idea (see paper for details): • Identify patterns for multiple scenarios simultaneously • Begin with seed patterns for each scenario, grow incrementally • Automatically stop: stop adding patterns if they are more common in other scenarios
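A rough sketch of the acceptance test at the heart of counter-training, assuming each scenario maintains its own pattern scores (the scoring and stopping details in Yangarber 2003 are more involved):

```python
def counter_train_step(scores_by_scenario, accepted):
    """One acceptance step of counter-training (a sketch of the idea, not
    the paper's exact method). scores_by_scenario: scenario -> {pattern:
    score under that scenario's seeds}. accepted: scenario -> set."""
    grew = False
    for scenario, scores in scores_by_scenario.items():
        others = [s for s in scores_by_scenario if s != scenario]
        for pattern, score in sorted(scores.items(), key=lambda kv: -kv[1]):
            if pattern in accepted[scenario]:
                continue
            # Accept only if the pattern fits this scenario better than
            # every competing scenario; otherwise this scenario stops growing
            if all(score > scores_by_scenario[o].get(pattern, 0.0)
                   for o in others):
                accepted[scenario].add(pattern)
                grew = True
            break  # consider only the top unaccepted pattern per step
    return grew

# Usage: repeat until no scenario can grow further
# while counter_train_step(scores, accepted): pass
```

The competing scenarios act as negative evidence for each other, which replaces the arbitrary "keep the top 100" cutoff with an automatic stopping point.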
Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases
Sentence alignment • Used as a preprocessing step by the paraphrase induction algorithm • Though most often used in machine translation • Previous explanation of MT: • Begin with sentence-aligned corpus • Then estimate alignments • Develop translation model from alignments
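As a concrete illustration, here is a much-simplified length-based sentence aligner in the spirit of Gale & Church (1993). It handles only 1-1, 1-0, and 0-1 alignments with an ad hoc length-mismatch cost and a hypothetical skip penalty; the real algorithm uses a probabilistic length model and also allows 2-1, 1-2, and 2-2 beads:

```python
def align_sentences(src, tgt, skip=10.0):
    """Dynamic-programming sentence alignment by length (a sketch).
    src, tgt: lists of sentences. Returns aligned (src, tgt) pairs."""
    n, m = len(src), len(tgt)
    INF = float("inf")

    def match_cost(a, b):
        # Penalize mismatched character lengths (a crude proxy for the
        # Gale-Church length model)
        return abs(len(a) - len(b)) / max(len(a) + len(b), 1)

    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                c = D[i-1][j-1] + match_cost(src[i-1], tgt[j-1])
                if c < D[i][j]:
                    D[i][j], back[i, j] = c, (i-1, j-1)   # 1-1 alignment
            if i > 0 and D[i-1][j] + skip < D[i][j]:
                D[i][j], back[i, j] = D[i-1][j] + skip, (i-1, j)  # 1-0
            if j > 0 and D[i][j-1] + skip < D[i][j]:
                D[i][j], back[i, j] = D[i][j-1] + skip, (i, j-1)  # 0-1
    # Follow backpointers to recover the aligned sentence pairs
    pairs, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        pi, pj = back[i, j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[i-1], tgt[j-1]))
        i, j = pi, pj
    return list(reversed(pairs))
```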
Review: minimum edit distance • Find the minimum-cost sequence of operations needed to transform one string into another, e.g. intention → execution • Fill in a table with total edit cost: • Cost of 1 for insertions/deletions; cost of 2 for substitutions • Follow backpointers to recover the sequence of edit operations
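A minimal implementation of the table-filling step with the costs given above (backpointers omitted for brevity):

```python
def min_edit_distance(source, target):
    """Minimum edit distance with the slide's costs:
    1 for insertion/deletion, 2 for substitution."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i          # delete all of source[:i]
    for j in range(1, m + 1):
        D[0][j] = j          # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i-1] == target[j-1] else 2
            D[i][j] = min(D[i-1][j] + 1,        # deletion
                          D[i][j-1] + 1,        # insertion
                          D[i-1][j-1] + sub)    # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with these costs
```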
Outline • Review: information extraction and paraphrases • Semi-supervised learning of rules for information extraction • Sentence alignment • Semi-supervised learning of paraphrases
Paraphrases • Phrases that mean roughly the same thing • <PER> was killed • <PER> died • <PER> kicked the bucket • According to linguists (Halliday 1985; de Beaugrande and Dressler 1981), paraphrases retain "approximate conceptual equivalence"
Acquisition of paraphrases • Lexical resources • Hand-built • E.g. WordNet: limited in scope, doesn’t include phrasal or syntactically-based paraphrases • Unsupervised acquisition using parallel corpora • Barzilay & McKeown 2001: multiple English translations of foreign novels • Shinyama et al. 2002: multiple news articles about the same subject
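To make the parallel-corpus idea concrete, here is a toy sketch of the underlying intuition: identical words in two aligned translations act as shared context, and the differing material in between is a paraphrase candidate. Barzilay & McKeown's actual method bootstraps context rules iteratively; this sketch just uses the longest common prefix and suffix, and the example sentences are made up:

```python
def candidate_paraphrases(aligned_pairs):
    """Extract paraphrase candidates from aligned sentence pairs
    (a simplified sketch of the parallel-corpus intuition)."""
    candidates = []
    for s1, s2 in aligned_pairs:
        t1, t2 = s1.split(), s2.split()
        # Longest common prefix of the two token sequences
        i = 0
        while i < min(len(t1), len(t2)) and t1[i] == t2[i]:
            i += 1
        # Longest common suffix, not overlapping the prefix
        j = 0
        while j < min(len(t1), len(t2)) - i and t1[-1 - j] == t2[-1 - j]:
            j += 1
        mid1 = " ".join(t1[i:len(t1) - j])
        mid2 = " ".join(t2[i:len(t2) - j])
        if mid1 and mid2 and mid1 != mid2:
            candidates.append((mid1, mid2))
    return candidates

pairs = [("IBM named Fred president", "IBM appointed Fred president")]
print(candidate_paraphrases(pairs))  # [('named', 'appointed')]
```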
Data • Based on literary texts by foreign authors • 11 English translations total, across 3 different books: • Madame Bovary (Flaubert) • Fairy Tales (Andersen) • Twenty Thousand Leagues Under the Sea (Verne)