790 likes | 928 Views
Information Extraction. Day 18. HW#5. The mystery languages: 1 2 3 4 5 6 7 8 9 10. HW#5. The mystery languages: 1 - Afrikaans 2 - Breton 3 - Basque 4 - Catalan 5 - Croatian 6 - Icelandic 7 - Indonesian 8 - Manx 9 - Frisian 10 - Estonian. Roadmap. Information extraction:
E N D
Information Extraction Day 18
HW#5 • The mystery languages: 1 2 3 4 5 6 7 8 9 10
HW#5 • The mystery languages: 1 - Afrikaans 2 - Breton 3 - Basque 4 - Catalan 5 - Croatian 6 - Icelandic 7 - Indonesian 8 - Manx 9 - Frisian 10 - Estonian
Roadmap • Information extraction: • Definition & Motivation • General approach • Case Study of Extracting Structured Language Data
Information Extraction • Motivation: • Unstructured data hard to manage, assess, analyze
Information Extraction • Motivation: • Unstructured data hard to manage, assess, analyze • Many tools for manipulating structured data: • Databases: efficient search, agglomeration • Statistical analysis packages • Decision support systems
Information Extraction • Motivation: • Unstructured data hard to manage, assess, analyze • Many tools for manipulating structured data: • Databases: efficient search, agglomeration • Statistical analysis packages • Decision support systems • Goal: • Given unstructured information in texts • Convert to structured data • E.g., populate DBs
IE Example Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. • Lead airline: • Amount: • Effective Date: • Follower:
IE Example Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. • Lead airline: United Airlines • Amount: $6 • Effective Date: 2006-10-26 • Follower: American Airlines
Other Applications • Shared tasks: • Message Understanding Conferences (MUC)
Other Applications • Shared tasks: • Message Understanding Conferences (MUC) • Joint ventures/mergers:
Other Applications • Shared tasks: • Message Understanding Conferences (MUC) • Joint ventures/mergers: • CompanyA, CompanyB, CompanyC, amount, type • Terrorist incidents
Other Applications • Shared tasks: • Message Understanding Conferences (MUC) • Joint ventures/mergers: • CompanyA, CompanyB, CompanyC, amount, type • Terrorist incidents • Incident type, Actor, Victim, Date, Location • General:
Other Applications • Shared tasks: • Message Understanding Conferences (MUC) • Joint ventures/mergers: • CompanyA, CompanyB, CompanyC, amount, type • Terrorist incidents • Incident type, Actor, Victim, Date, Location • General: • Pricegrabber • Stock market analysis • Apartment finder
Information ExtractionComponents • Incorporates range of techniques, tools
Information ExtractionComponents • Incorporates range of techniques, tools • FSAs (pattern extraction) • Supervised learning: • Classification • Sequence labeling
Information ExtractionComponents • Incorporates range of techniques, tools • FSAs (pattern extraction) • Supervised learning: • Classification • Sequence labeling • Semi-, un-supervised learning: clustering, bootstrapping
Information ExtractionComponents • Incorporates range of techniques, tools • FSAs (pattern extraction) • Supervised learning: • Classification • Sequence labeling • Semi-, un-supervised learning: clustering, bootstrapping • Part-of-Speech tagging • Named Entity Recognition • Chunking • Parsing
Information Extraction Process • Typically pipeline of major components
Information Extraction Process • Typically pipeline of major components • First step: Named Entity Recognition (NER) • Often application-specific • Generic type: Person, Organization, Location • Specific: genes to college courses
Information Extraction Process • Typically pipeline of major components • First step: Named Entity Recognition (NER) • Often application-specific • Generic type: Person, Organization, Location • Specific: genes to college courses • Identify NE mentions: • specific instances
Example: Pre-NER • Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
Example: post-NER • Citing high fuel prices, [ORGUnited Airlines]said [TIMEFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [ORGUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday]an applies to most routes where it competes against discount carriers, such as [LOCChicago]to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].
Information Extraction:Reference Resolution • Builds on NER • Links entity mentions referring to same things • Creates entity equivalence classes
Information Extraction:Reference Resolution • Builds on NER • Links entity mentions referring to same things • Creates entity equivalence classes • Includes anaphora resolution • E.g. pronoun resolution (it, he, she…; one; this, that)
Information Extraction:Reference Resolution • Builds on NER • Links entity mentions referring to same things • Creates entity equivalence classes • Includes anaphora resolution • E.g. pronoun resolution (it, he, she…; one; this, that) • Also non-anaphoric, general mentions
Example: post-NER • Citing high fuel prices, [ORGUnited Airlines] said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].
Example: post-NER • Citing high fuel prices, [ORGUnited Airlines] said [TimeFriday]it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].
Relation Detection & Classification • Relation detection & classification: • Find and classify relations among entities in text
Relation Detection & Classification • Relation detection & classification: • Find and classify relations among entities in text • Mix of application-specific and generic relations • Generic relations:
Relation Detection & Classification • Relation detection & classification: • Find and classify relations among entities in text • Mix of application-specific and generic relations • Generic relations: • Family, employment, part-whole, membership, geospatial
Example: post-NER • Citing high fuel prices, [ORGUnited Airlines] said [TimeFriday]it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].
Relation Detection & Classification • Relation detection & classification: • Find and classify relations among entities in text • Mix of application-specific and generic relations • Generic relations: • Family, employment, part-whole, membership, geospatial • Generic relations: • United is part-of UAL Corp. • American Airlines is part-of AMR • Tim Wagner is an employee of American Airlines
Relation Detection & Classification • Relation detection & classification: • Find and classify relations among entities in text • Mix of application-specific and generic relations • Generic relations: • Family, employment, part-whole, membership, geospatial • Generic relations: • United is part-of UAL Corp. • American Airlines is part-of AMR • Tim Wagner is an employee of American Airlines • Domain-specific relations: • United serves Chicago; United serves Denver, etc
Example: post-NER • Citing high fuel prices, [ORGUnited Airlines] said [TimeFriday]it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].
Event Detection &Classification • Event detection and classification: • Find key events
Event Detection &Classification • Event detection and classification: • Find key events • Fare increase by United • Matching fare increase by American • Reporting of fare increases • How cued?
Event Detection &Classification • Event detection and classification: • Find key events • Fare increase by United • Matching fare increase by American • Reporting of fare increases • How cued? • Instances of “said” and ”cite”
Temporal Expressions:Recognition and Analysis • Temporal expression recognition • Finds temporal events in text
Example: post-NER • Citing high fuel prices, [ORGUnited Airlines] said [TimeFriday]it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].
Temporal Expressions:Recognition and Analysis • Temporal expression recognition • Finds temporal events in text • E.g. Friday, Thursday in example text
Temporal Expressions:Recognition and Analysis • Temporal expression recognition • Finds temporal events in text • E.g. Friday, Thursday in example text • Others: • Date expressions: • Absolute: days of week, holidays • Relative: two days from now, next year • Clock times: • noon, 3:30pm
Temporal Expressions:Recognition and Analysis • Temporal expression recognition • Finds temporal events in text • E.g. Friday, Thursday in example text • Others: • Date expressions: • Absolute: days of week, holidays • Relative: two days from now, next year • Clock times: • noon, 3:30pm • Temporal analysis: • Map expressions to points/spans on timeline • Anchor: e.g. Friday, Thursday: tied to example dateline
Template Filling • Texts describe stereotypical situations in domain • Identify texts evoking template situations • Fill slots in templates based on text
Template Filling • Texts describe stereotypical situations in domain • Identify texts evoking template situations • Fill slots in templates based on text • In example: fare-raise is stereo-typical event
Information Extraction Steps • Named Entity Recognition • Reference Resolution • Relation Detection & Classification • Event Detection & Classification • Temporal Expression Recognition & Analysis • Template-filling
Case Study: Extracting Semi-Structured Language Data (“Template Filling”) IE and language data
Linguistic Document • Page from scholarly linguistic paper • Note the embedded annotated language data • Identify characteristics • Of the data wrt the rest of the document • Of the data itself
Linguistic Document • Plan: Extract data of this type form Linguistic Docs • Find linguistic docs (crawl) • Identify docs that contain this kind of data • Identify the language for each instance • Align instances with • Papers/Titles • Authors • Languages • Extract the data • Preserve and align annotations • Enrich the data with further annotations
Interlinear Glossed Text • Interlinear Glossed Text (IGT) - enriched language data used for illustrative purposes as part of a larger analysis Transcription Line Gloss Line Translation Line