Information Extraction

Sources:
• Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
• Hobbs, J. R., & Riloff, E. (2010). Information extraction. Handbook of Natural Language Processing, 2.
History
• Genesis = recognition of named entities (organization & people names)
• Online access pushes towards:
  • personal desktops -> structured databases,
  • scientific publications -> structured records,
  • the Internet -> structured fact-finding queries.
Driving workshops / conferences
• 1987–97: MUC (Message Understanding Conference): filling slots, named entities & coreference (from '95)
• 1999–2008: ACE (Automatic Content Extraction): "supporting various classification, filtering, and selection applications by extracting and representing language content"
• 2008–now: TAC (Text Analysis Conference)
  • Knowledge Base Population (2009–11)
  • Others: textual entailment, summarization, QA (until 2009)
Example: MUC
 0. MESSAGE: ID                   TST1-MUC3-0001
 1. MESSAGE: TEMPLATE             1
 2. INCIDENT: DATE                02 FEB 90
 3. INCIDENT: LOCATION            GUATEMALA: SANTO TOMAS (FARM)
 4. INCIDENT: TYPE                ATTACK
 5. INCIDENT: STAGE OF EXECUTION  ACCOMPLISHED
 6. INCIDENT: INSTRUMENT ID       -
 7. INCIDENT: INSTRUMENT TYPE     -
 8. PERP: INCIDENT CATEGORY       TERRORIST ACT
 9. PERP: INDIVIDUAL ID           "GUERRILLA COLUMN" / "GUERRILLAS"
10. PERP: ORGANIZATION ID         "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
11. PERP: ORGANIZATION CONFIDENCE REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
12. PHYS TGT: ID                  "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
13. PHYS TGT: TYPE                GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
14. PHYS TGT: NUMBER              1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
15. PHYS TGT: FOREIGN NATION      -
16. PHYS TGT: EFFECT OF INCIDENT  -
17. PHYS TGT: TOTAL NUMBER        -
18. HUM TGT: NAME                 "CEREZO"
19. HUM TGT: DESCRIPTION          "PRESIDENT": "CEREZO" "CIVILIAN"
20. HUM TGT: TYPE                 GOVERNMENT OFFICIAL: "CEREZO" CIVILIAN: "CIVILIAN"
21. HUM TGT: NUMBER               1: "CEREZO" 1: "CIVILIAN"
22. HUM TGT: FOREIGN NATION       -
23. HUM TGT: EFFECT OF INCIDENT   NO INJURY: "CEREZO" DEATH: "CIVILIAN"
24. HUM TGT: TOTAL NUMBER         -
Applications
• Enterprise applications
  • News tracking (terrorists, disease)
  • Customer care (linking mails to products, etc.)
  • Data cleaning
  • Classified ads
• Personal information management (PIM)
• Scientific applications (e.g. bio-informatics)
• Web-oriented
  • Citation databases
  • Opinion databases
  • Community websites (DBLife, Rexa - UMASS)
  • Comparison shopping
  • Ad placement on webpages
  • Structured web searches
IE – Taxonomy
• Types of structures extracted
  • Entities, records, relationships
  • Open vs. closed IE
• Sources
  • Granularity of extraction
  • Heterogeneity: machine-generated, (semi-)structured, open
• Input resources
  • Structured DB
  • Labelled unstructured text
  • Preprocessing (tokenizer, chunker, parser, …)
Process
• Input
  • Annotated documents
  • Relevant & non-relevant documents
  • Seeds -> bootstrapping
• Rules
  • Hand-crafted by humans (1500 hours!)
  • Generated by a system, then evaluated by humans
  • Learnt
• Models
  • Logic: First-Order Logic
  • Sequence: e.g. HMM
  • Classifiers: e.g. MEM, CRF
• Decomposition into a series of subproblems
  • Complex words, basic phrases, complex phrases, events and merging
Rule-based systems
• Rules to mark an entity (or more)
  • Before the start of the entity
  • Tokens of the entity
  • After the end of the entity
• Rules to mark the boundaries
• Conflicts between rules (see the sketch below)
  • Prefer the larger span
  • Merge (if same action)
  • Order the rules
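A minimal sketch of such a matcher in Python. Everything here is illustrative rather than taken from the sources: the two example rules, the encoding of context directly in the regex, and the larger-span conflict policy.

    import re

    # Hypothetical rule format: (entity type, regex where the named group
    # "ent" is the entity and the surrounding material is the context).
    RULES = [
        ("PERSON", re.compile(r"\b(?:Mr\.|Dr\.)\s+(?P<ent>[A-Z][a-z]+(?:\s[A-Z][a-z]+)?)")),
        ("ORG", re.compile(r"(?P<ent>[A-Z][A-Za-z]+(?:\s[A-Z][A-Za-z]+)*)\s+(?:Inc\.|Corp\.)")),
    ]

    def apply_rules(text):
        """Collect all rule matches, then resolve conflicts by larger span."""
        matches = []
        for label, pattern in RULES:
            for m in pattern.finditer(text):
                matches.append((m.start("ent"), m.end("ent"), label))
        # Conflict resolution: earlier start first, larger span preferred,
        # overlapping later matches dropped.
        matches.sort(key=lambda s: (s[0], -(s[1] - s[0])))
        resolved, last_end = [], -1
        for start, end, label in matches:
            if start >= last_end:
                resolved.append((start, end, label))
                last_end = end
        return resolved

    print(apply_rules("Dr. Alice Smith joined Acme Widgets Inc. last year."))

The other two strategies from the slide, merging same-action matches and imposing a fixed rule order, would replace the sort-and-drop step.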
Learning rules
• Algorithms are based on (see the scoring sketch below)
  • Coverage: how many cases are covered by the rule
  • Precision
• Two approaches
  • Top-down (e.g. FOIL): start with coverage = 100%, then specialize
  • Bottom-up: start with precision = 100%, then generalize
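A small sketch of the two quantities in Python; the function name and example data are hypothetical.

    def score_rule(rule, examples):
        """Coverage = fraction of positive examples the rule matches;
        precision = fraction of the rule's matches that are positive.
        `rule` is a predicate, `examples` a list of (instance, is_positive)."""
        matched = [(x, pos) for x, pos in examples if rule(x)]
        positives = sum(1 for _, pos in examples if pos)
        if not matched:
            return 0.0, 0.0
        hits = sum(1 for _, pos in matched if pos)
        return hits / max(positives, 1), hits / len(matched)

    # Top-down (FOIL-style) starts from the most general rule (coverage 100%)
    # and adds conditions to raise precision; bottom-up starts from a single
    # example (precision 100%) and generalizes to raise coverage.
    examples = [("Mr. Smith", True), ("Mr. Chair", False), ("Alice", True)]
    print(score_rule(lambda x: x.startswith("Mr."), examples))  # (0.5, 0.5)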
Rules – AutoSlog
Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. Proceedings of AAAI-93, 811–816.
• Rule learning
  • Look at sentences containing targets
  • Heuristic: look for a linguistic pattern, e.g. the heuristic <subject> passive-verb turns "the embassy was bombed" into the extraction pattern "<x> was bombed"
Rules – LIEP
Huffman, S. B. (1995). Learning information extraction patterns from examples.
• Learn (sets of) meta-heuristics by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob, named), object(named, CEO)]
• Followed by generalization (matching + disjunction), as in the toy sketch below
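A toy illustration of the generalization step, heavily simplified relative to LIEP: two aligned patterns are merged position by position, and positions that differ become a disjunction.

    def generalize(p1, p2):
        """Merge two equal-length patterns; differing positions become
        a disjunction (set of alternatives). Toy LIEP-style sketch."""
        return [a if a == b else {a, b} for a, b in zip(p1, p2)]

    p1 = ["subject(Bob, named)", "object(named, CEO)"]
    p2 = ["subject(Bob, appointed)", "object(appointed, CEO)"]
    print(generalize(p1, p2))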
Statistical models
• How to label
  • IOB sequences (Inside, Outside, Beginning)
  • Segmentation example (see the helper below):
    Alleged/B guerrilla/I urban/I commandos/I launched/O two/B high-power/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I.
• Grammar-based (longer dependencies)
• Many ML models: HMM; ME, CRF; SVM
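A minimal helper that produces such a labeling from entity spans; the end-exclusive token-index span encoding is an assumption of this sketch.

    def to_iob(tokens, spans):
        """Turn entity spans (start, end), end-exclusive token indices,
        into an IOB sequence: B = beginning, I = inside, O = outside."""
        tags = ["O"] * len(tokens)
        for start, end in spans:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
        return list(zip(tokens, tags))

    tokens = "Alleged guerrilla urban commandos launched two bombs".split()
    print(to_iob(tokens, [(0, 4), (5, 7)]))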
Statistical models (cont'd)
• Features
  • Word
  • Orthographic
  • Dictionary
  • …
  • First-order (also depend on the previous label)
• Scope
  • Position: the feature fires at a single token position, e.g. f(i, x, y_i, y_{i-1})
  • Segment: the feature fires over a whole segment [l, u], as in semi-Markov models, e.g. f(l, u, x, y)
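An illustrative position-level feature extractor; the feature names and the one-word gazetteer are invented for the example.

    def token_features(tokens, i, gazetteer=frozenset({"Salvador"})):
        """Word, orthographic, and dictionary features for token i."""
        w = tokens[i]
        return {
            "word": w.lower(),
            "is_capitalized": w[0].isupper(),
            "has_digit": any(c.isdigit() for c in w),
            "in_gazetteer": w in gazetteer,
            "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        }

    print(token_features("in downtown San Salvador".split(), 3))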
Statistical models (cont'd)
• Learning:
  • Likelihood: maximize the conditional log-likelihood of the training labelings
  • Max-margin: require the true labeling to outscore alternatives by a margin
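For reference, the likelihood objective for a chain CRF can be written as follows; the notation is generic and only loosely follows the survey:

    P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
      \exp\Big( \sum_i \mathbf{w} \cdot \mathbf{f}(y_i, y_{i-1}, \mathbf{x}, i) \Big),
    \qquad
    \mathbf{w}^{*} = \arg\max_{\mathbf{w}} \sum_{\ell} \log P(\mathbf{y}^{\ell} \mid \mathbf{x}^{\ell})

Max-margin training replaces the log-likelihood with a structured hinge loss that requires the true labeling to outscore every alternative by a margin that grows with the alternative's error.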
Overall: relationship extraction
• Goal: classify (E1, E2, x), i.e. decide which relationship (if any) holds between entities E1 and E2 in context x
• Features
  • Surface tokens (words, entities), e.g. [entity label of E1 = Person, entity label of E2 = Location]
  • Parse tree (syntactic constituents, dependency graph), e.g. [POS = (noun, verb, noun), flag = "(1, none, 2)", type = "dependency"]
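A sketch of what such a feature function could look like; the span encoding, feature names, and example sentence are assumptions.

    def relation_features(tokens, e1, e2):
        """Surface features for a candidate pair (E1, E2, x): entity labels
        plus the word sequence between the two mentions."""
        (s1, t1, lab1), (s2, t2, lab2) = sorted([e1, e2])
        return {
            "e1_label": lab1,
            "e2_label": lab2,
            "between": " ".join(tokens[t1:s2]),
            "distance": s2 - t1,
        }

    tokens = "Cerezo visited the Santo Tomas farm".split()
    print(relation_features(tokens, (0, 1, "PERSON"), (3, 6, "LOCATION")))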
Models
• Standard classifier (e.g. SVM)
• Kernel-based methods
  • e.g. a measure of common properties between two paths in the dependency tree (toy sketch below)
  • Convolution-based kernels
• Rule-based methods
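A toy version of the path-similarity idea. Real path kernels (e.g. Bunescu & Mooney's shortest-path kernel) count common features at each aligned position and return 0 for paths of different lengths; this sketch reduces each position to a single symbol.

    def path_kernel(p1, p2):
        """Similarity of two dependency paths: number of positions where
        the elements agree, 0 if the lengths differ. Toy simplification."""
        if len(p1) != len(p2):
            return 0
        return sum(1 for a, b in zip(p1, p2) if a == b)

    path_a = ["PERSON", "nsubj", "visited", "dobj", "ORG"]
    path_b = ["PERSON", "nsubj", "toured", "dobj", "ORG"]
    print(path_kernel(path_a, path_b))  # 4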
Extracting entities for a set of relationships
• Three steps (sketched as a loop below)
  1. Learn extraction patterns for the seeds
    • Find documents where entities appear close to each other
    • Filtering
  2. Generate candidate triplets
    • Pattern- or keyword-based
  3. Validation
    • # of occurrences
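A compact sketch of the loop for a binary relation. The corpus, the 5-token window, and the between-entities pattern representation are all illustrative, and the count-based validation step is omitted.

    def bootstrap(corpus, seeds, rounds=2):
        """Seed-based bootstrapping: harvest patterns from sentences where
        a known pair co-occurs, then match patterns to find new pairs."""
        pairs, patterns = set(seeds), set()
        for _ in range(rounds):
            # Step 1: learn patterns from co-occurrences of known pairs.
            for sent in corpus:
                words = sent.split()
                for a, b in list(pairs):
                    if a in words and b in words:
                        i, j = words.index(a), words.index(b)
                        if 0 < j - i <= 5:
                            patterns.add(" ".join(words[i + 1:j]))
            # Step 2: use the patterns to extract new candidate pairs.
            for sent in corpus:
                for pat in patterns:
                    pre, sep, post = sent.partition(" " + pat + " ")
                    if sep and pre and post:
                        pairs.add((pre.split()[-1], post.split()[0]))
        return pairs, patterns

    corpus = ["Paris is the capital of France",
              "Rome is the capital of Italy"]
    print(bootstrap(corpus, {("Paris", "France")}))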
Summary
• Performance
  • Document selection: subset, crawling
  • Queries to a DB: related entities (top-k retrieval)
• Handling changes
  • Detecting when a page has changed
• Integration
  • Detecting duplicate entities
  • Redundant extractions (open IE)
Metrics
• Precision-Recall
• F-measure (harmonic mean of precision and recall)
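For reference, with precision P and recall R:

    F_\beta = \frac{(1+\beta^{2})\,PR}{\beta^{2}P + R},
    \qquad
    F_1 = \frac{2PR}{P + R}

F_1 is the harmonic mean of P and R; larger β weights recall more heavily.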