Information Extraction Adapted from slides by Junichi Tsujii, Ronen Feldman and others
Most Data are Unstructured (Text) or Semi-Structured…
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
• …
Text data mining has become more and more important…
(Adapted from J. Dorre et al. “Text Mining: Finding Nuggets in Mountains of Textual Data”)
Application Tasks of NLP
(1) Information Retrieval/Detection: To search and retrieve documents in response to queries for information
(2) Passage Retrieval: To search and retrieve parts of documents in response to queries for information
(3) Information Extraction: To extract information that fits pre-defined database schemas or templates, specifying the output formats
(4) Question/Answering Tasks: To answer general questions by using texts as a knowledge base: fact retrieval, combination of IR and IE
(5) Text Understanding: To understand texts as people do: Artificial Intelligence
Information Extraction: A Pragmatic Approach • Let application requirements drive semantic analysis • Identify the types of entities that are relevant to a particular task • Identify the range of facts that one is interested in for those entities • Ignore everything else
IE definitions • Entity: an object of interest such as a person or organization • Attribute: A property of an entity such as name, alias, descriptor or type • Fact: A relationship held between two or more entities such as Position of Person in Company • Event: An activity involving several entities such as terrorist act, airline crash, product information
IE accuracy typical figures by information type • Entity: 90-98% • Attribute: 80% • Fact: 60-70% • Event: 50-60%
MUC conferences • MUC 1 to MUC 7 • 1987 to 1997 • Topics: • Naval operations (2) • Terrorist Activity (2) • Joint venture and microelectronics • Management changes • Space Vehicles and Missile launches
MUC and Scenario Templates • Define a set of “interesting entities” • Persons, organizations, locations… • Define a complex scenario involving interesting events and relations over entities • Example: management succession: persons, companies, positions, reasons for succession • This collection of entities and relations is called a “scenario template.”
Problems with Scenario Template • Encouraged development of highly domain specific ontologies, rule systems, heuristics, etc. • Most of the effort expended on building a scenario template system was not directly applicable to a different scenario template.
Addressing the Problem • Address a large number of smaller, more focused scenario templates (Event-99) • Develop a more systematic ground-up approach to semantics by focusing on elementary entities, relations, and events (ACE)
The ACE Evaluation • The ACE program addresses the challenge of extracting content from human language. The research effort is directed at mastering: • first, the extraction of “entities” • then, the extraction of “relations” among these entities • finally, the extraction of “events” that are causally related sets of relations • After two years, top systems successfully capture well over 50% of the value at the entity level
The ACE Program • “Automated Content Extraction” • Develop core information extraction technology by focusing on extracting specific semantic entities and relations over a very wide range of texts. • Corpora: Newswire and broadcast transcripts, but broad range of topics and genres. • Third person reports • Interviews • Editorials • Topics: foreign relations, significant events, human interest, sports, weather • Discourage highly domain- and genre-dependent solutions
Applications of IE • Routing of information • Infrastructure for IR and categorization (higher level features) • Event based summarization • Automatic creation of databases and knowledge bases
Where would IE be useful? • Semi-structured text • Generic documents like news articles • Most of the information in the doc is centered around a set of easily identifiable entities
The Problem [Figure labels: Date; Time: Start - End; Location; Speaker; Person]
What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME | TITLE | ORGANIZATION
Courtesy of William W. Cohen
What is “Information Extraction”
As a task: Filling slots in a database from sub-segments of text. Running IE over the October 14, 2002 article about Microsoft fills the slots:

NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
Courtesy of William W. Cohen
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
Segmentation (aka “named entity extraction”) picks out the relevant spans of the same article:
• Microsoft Corporation
• CEO
• Bill Gates
• Microsoft
• Gates
• Microsoft
• Bill Veghte
• Microsoft
• VP
• Richard Stallman
• founder
• Free Software Foundation
Courtesy of William W. Cohen
What is “Information Extraction”
Information Extraction = segmentation + classification + association + clustering
Once the extracted mentions have been associated and clustered, the database is filled:

NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
Courtesy of William W. Cohen
Landscape of IE Tasks: Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
• Single entity (“named entity” extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
• Binary relationship: Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut)
• N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt)
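The three output shapes on this slide can be sketched as plain data structures. This is only an illustration: the class and field names (`Entity`, `BinaryRelation`, `SuccessionRecord`) are hypothetical and not taken from the slides or from any IE toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Single entity: a typed span of text."""
    type: str
    text: str


@dataclass(frozen=True)
class BinaryRelation:
    """Binary relationship: a named link between two entities."""
    name: str
    arg1: Entity
    arg2: Entity


@dataclass(frozen=True)
class SuccessionRecord:
    """N-ary record: several slots filled for one event."""
    company: Entity
    title: Entity
    person_out: Entity
    person_in: Entity


# Values taken from the Jack Welch example sentence.
welch = Entity("Person", "Jack Welch")
immelt = Entity("Person", "Jeffrey Immelt")
ge = Entity("Company", "General Electric")
ceo = Entity("Title", "CEO")
connecticut = Entity("Location", "Connecticut")

person_title = BinaryRelation("Person-Title", welch, ceo)
company_location = BinaryRelation("Company-Location", ge, connecticut)
succession = SuccessionRecord(company=ge, title=ceo, person_out=welch, person_in=immelt)
```

Each step up (entity, relation, record) needs more slots to be right at once, which is one way to read the falling accuracy figures quoted earlier for entities, facts, and events.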
Landscape of IE Techniques
Each technique is illustrated on the sentence “Abraham Lincoln was born in Kentucky.”
• Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
• Boundary Models: classifiers mark where a field BEGINs and ENDs
• Finite State Machines: find the most likely state sequence
• Context Free Grammars: find the most likely parse (tags NNP NNP V V P NP; constituents NP, PP, VP, S)
Courtesy of William W. Cohen
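As a rough sketch of the simplest two techniques combined, the snippet below slides windows of alternate sizes over the example sentence and tests each window for lexicon membership. The function name `find_lexicon_matches` and the tiny lexicon (a handful of state names, including "Kentucky", which the slide only implies with "…") are assumptions made for illustration.

```python
def find_lexicon_matches(tokens, lexicon, max_window=3):
    """Return (start, end, phrase) for every token window that is in the lexicon."""
    matches = []
    for size in range(1, max_window + 1):          # try alternate window sizes
        for start in range(len(tokens) - size + 1):
            phrase = " ".join(tokens[start:start + size])
            if phrase in lexicon:                   # the "member?" test
                matches.append((start, start + size, phrase))
    return matches


# Illustrative subset of a state-name lexicon (assumed, not the full list).
US_STATES = {"Alabama", "Alaska", "Kentucky", "New Mexico", "Wisconsin", "Wyoming"}

tokens = "Abraham Lincoln was born in Kentucky .".split()
matches = find_lexicon_matches(tokens, US_STATES)
```

A lexicon alone cannot tell a state from a person or company with the same name, which is why the other techniques on the slide replace the membership test with a trained classifier or a sequence model.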
IE with Hidden Markov Models
Given a sequence of observations: “Yesterday Pedro Domingos spoke this example sentence.” and a trained HMM with the states person name, location name, and background,
find the most likely state sequence (Viterbi): Yesterday [Pedro Domingos] spoke this example sentence.
Any words said to be generated by the designated “person name” state are extracted as a person name:
Person name: Pedro Domingos
HMM for Segmentation • Simplest Model: One state per entity type
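A minimal sketch of this simplest model, with one state per entity type plus a background state, decoded with Viterbi. All probabilities below are hand-set for illustration and are assumptions; a real system would estimate them from an annotated corpus.

```python
import math

STATES = ["background", "person", "location"]

# Hand-set, illustrative parameters (assumed, not trained).
START = {"background": 0.8, "person": 0.1, "location": 0.1}
TRANS = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person": {"background": 0.4, "person": 0.5, "location": 0.1},
    "location": {"background": 0.6, "person": 0.1, "location": 0.3},
}
EMIT = {
    "background": {"Yesterday": 0.15, "spoke": 0.15, "this": 0.15,
                   "example": 0.15, "sentence": 0.15, ".": 0.15},
    "person": {"Pedro": 0.4, "Domingos": 0.4},
    "location": {"Kentucky": 0.8},
}
FLOOR = 1e-6  # probability assigned to words a state has never emitted


def log_emit(state, word):
    return math.log(EMIT[state].get(word, FLOOR))


def viterbi(tokens):
    """Return the most likely state sequence for a non-empty token list."""
    best = [{s: math.log(START[s]) + log_emit(s, tokens[0]) for s in STATES}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in STATES:
            prev = max(STATES, key=lambda p: best[t - 1][p] + math.log(TRANS[p][s]))
            best[t][s] = best[t - 1][prev] + math.log(TRANS[prev][s]) + log_emit(s, tokens[t])
            back[t][s] = prev
    path = [max(STATES, key=lambda s: best[-1][s])]
    for t in range(len(tokens) - 1, 0, -1):   # follow back-pointers
        path.append(back[t][path[-1]])
    return path[::-1]


def extract(tokens, path, state):
    """Join maximal runs of tokens labelled with `state` into phrases."""
    spans, current = [], []
    for token, label in zip(tokens, path):
        if label == state:
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = "Yesterday Pedro Domingos spoke this example sentence .".split()
path = viterbi(tokens)
person_names = extract(tokens, path, "person")
```

With one state per entity type, two adjacent names of the same type would be merged into one span; richer state layouts are one way to avoid that, at the cost of more parameters.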
Discriminative Approaches
Yesterday Pedro Domingos spoke this example sentence.
• Is this phrase (X) a name? Y=1 (yes); Y=0 (no)
• Learn from many examples to predict Y from X
• Maximum Entropy, Logistic Regression: learned parameters weight features (e.g., is the phrase capitalized?)
• More sophisticated: consider dependency between different labels (e.g., Conditional Random Fields)
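The logistic regression case can be sketched in a few lines. The single feature (are all words capitalized?), the toy training phrases, and the function names are assumptions chosen to mirror the slide, not a recommended feature set.

```python
import math


def features(phrase):
    """Map a phrase X to a feature vector: [bias, every word capitalized?]."""
    words = phrase.split()
    all_capitalized = all(w[0].isupper() for w in words)
    return [1.0, 1.0 if all_capitalized else 0.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, phrase):
    """Estimated probability that Y=1 (the phrase is a name)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features(phrase))))


def train(examples, epochs=500, lr=0.5):
    """Batch gradient descent on the logistic loss, starting from zero weights."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for phrase, y in examples:
            error = predict(weights, phrase) - y
            for i, x in enumerate(features(phrase)):
                grad[i] += error * x
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights


# Toy labelled phrases (assumed for illustration): 1 = name, 0 = not a name.
EXAMPLES = [
    ("Bill Gates", 1), ("Richard Stallman", 1), ("Jack Welch", 1), ("Jeffrey Immelt", 1),
    ("open source", 0), ("crown jewels", 0), ("the company", 0), ("golf clubs", 0),
]

weights = train(EXAMPLES)
p_name = predict(weights, "Pedro Domingos")
p_other = predict(weights, "this example")
```

With only the capitalization feature, the sentence-initial word "Yesterday" would also score as a name. More features help with that, and label dependencies are what models such as Conditional Random Fields add on top.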
Example of IE: FASTUS (1993)
Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”
Activity: ACTIVITY-1
Amount: NT$20000000

ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990
FASTUS
Based on finite state automata (FSA)
1. Complex Words: Recognition of multi-words and proper names (e.g., “set up”, “new Taiwan dollars”)
2. Basic Phrases: Simple noun groups, verb groups and particles (e.g., “a Japanese trading house”, “had set up”)
3. Complex Phrases: Complex noun groups and verb groups (e.g., “production of 20,000 iron and metal wood clubs”)
4. Domain Events: Patterns for events of interest to the application, e.g., [company] [set up] [Joint-Venture] with [company]. Basic templates are to be built.
5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.
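The cascade idea can be approximated with regular expressions, which are themselves compiled to finite-state machinery. The sketch below is not FASTUS's actual grammar: it collapses stages 1 to 3 into one chunking pass, hard-codes a single domain pattern for stage 4, and omits merging. All pattern and function names are hypothetical.

```python
import re

TEXT = ("Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan "
        "with a local concern and a Japanese trading house to produce golf clubs "
        "to be supplied to Japan.")

# Stages 1-3 (collapsed): recognize company names and a few simple noun groups.
CHUNK = re.compile(
    r"(?P<COMPANY>(?:[A-Z][a-z]+ )+Co\.)"
    r"|(?P<NOUN_GROUP>\ba (?:\w+ )?(?:concern|trading house)\b)"
)

# Stage 4: domain event pattern over the chunked text,
# [company] [set up] [joint venture] with [partner] and [partner].
TIE_UP = re.compile(
    r"<COMPANY:(\d+)> .*?set up a joint venture .*?with "
    r"<NOUN_GROUP:(\d+)> and <NOUN_GROUP:(\d+)>"
)


def tag_chunks(text):
    """Replace each recognized chunk with a tag such as <COMPANY:0>."""
    chunks = []

    def replace(match):
        chunks.append((match.lastgroup, match.group(0)))
        return f"<{match.lastgroup}:{len(chunks) - 1}>"

    return CHUNK.sub(replace, text), chunks


def extract_tie_up(text):
    """Build a basic TIE-UP template, or return None if the pattern is absent."""
    tagged, chunks = tag_chunks(text)
    match = TIE_UP.search(tagged)
    if match is None:
        return None
    return {
        "Relationship": "TIE-UP",
        "Entities": [chunks[int(i)][1] for i in match.groups()],
    }


template = extract_tie_up(TEXT)
```

Running later patterns over tags rather than raw words is what lets each stage stay small: the domain pattern never needs to know how a company name is spelled.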
Information Extraction
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives.
……….
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X? Country: Germany
(2) Name: Y? Country: China
Information Extraction
A German vehicle-firm executive was stabbed to death ….
……….
Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives.
……….
Different template for crimes:
Crime-Type: Murder
Type: Stabbing
The killed:
  Name: Jurgen Pfrang
  Age: 51
  Profession: Deputy general manager
Location: Nanjing, China
Interpretation of Texts (by the user or by the system)
(1) Information Retrieval/Detection: User
(2) Passage Retrieval: User
(3) Information Extraction: System
(4) Question/Answering Tasks: System
(5) Text Understanding: System
[Figure, built up over four slides. IR side: Queries, Characterization of Texts, IR System, Collection of Texts, Passage. IE side: Texts, NLP, Structures of Sentences, IE System, Templates. Also labelled: Knowledge, Interpretation.]
General Framework of NLP/NLU: IE as compromise NLP
[Figure labels: Texts, IE System, Templates (Predefined), Interpretation, Knowledge]
Performance Evaluation
(1) Information Retrieval/Detection: Rather clear
(2) Passage Retrieval: A bit vague
(3) Information Extraction: Rather clear
(4) Question/Answering Tasks: A bit vague
(5) Text Understanding: Very vague
Evaluating a Query against a Collection of Documents
N: Correct Documents
M: Retrieved Documents
C: Correct Documents that are actually retrieved
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P \cdot R}{P + R}$
Evaluating a Query against a Collection of Documents (IE)
N: Correct Templates
M: Retrieved Templates
C: Correct Templates that are actually retrieved
Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-Value: $F = \frac{2 P \cdot R}{P + R}$
More complicated due to partially filled templates
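A small sketch of these three measures, treating the correct and retrieved items as sets and counting only exact matches. The function name and the document identifiers are made up for illustration. Partial credit for partially filled templates is deliberately not modelled here.

```python
def precision_recall_f(correct, retrieved):
    """Return (P, R, F) with P = C/M, R = C/N, F = 2PR/(P+R)."""
    correct, retrieved = set(correct), set(retrieved)
    c = len(correct & retrieved)                     # C: correct and retrieved
    p = c / len(retrieved) if retrieved else 0.0     # M = len(retrieved)
    r = c / len(correct) if correct else 0.0         # N = len(correct)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical identifiers: N = 4 correct, M = 3 retrieved, C = 2 in common.
correct_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(correct_docs, retrieved_docs)
```

Here P = 2/3, R = 1/2 and F = 4/7. For templates, a scorer would first have to decide how to align and score individual slots, which is the complication the slide points to.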
Framework of IE: IE as compromise NLP
Difficulties of NLP
General Framework of NLP: Morphological and Lexical Processing → Syntactic Analysis → Semantic Analysis → Context Processing and Interpretation
(1) Robustness: Incomplete Knowledge
• Incomplete Domain Knowledge
• Interpretation Rules
Predefined Aspects of Information
Approaches for building IE systems • Knowledge Engineering Approach • Rules crafted by linguists in cooperation with domain experts • Most of the work is done by inspecting a set of relevant documents
Approaches for building IE systems • Automatically trainable systems • Techniques based on statistics and almost no linguistic knowledge • Language independent • Main input: annotated corpus • Small effort for creating rules, but creating the annotated corpus is laborious