1 / 107

Relation Extraction

Relation Extraction. Slides from Dan Jurafsky , Rion Snow, Jim Martin, Chris Manning and William Cohen. Background: Information Extraction. Extract information from text Sometimes called text analytics commercially

simpkinsa
Download Presentation

Relation Extraction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Relation Extraction Slides from Dan Jurafsky, RionSnow, Jim Martin, Chris Manning and William Cohen

  2. Background: Information Extraction • Extract information from text • Sometimes called text analyticscommercially • Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text) • Extract the relations between entities • Figure out the larger events that are taking place

  3. Information Extraction • Creating knowledge bases and ontologies • Implications for cognitive modeling • Digital Libaries • Google scholar, Citeseer need to extract the title, author and references • Bioinformatics • Patent analysis • Specific market segments for stock analysis • SEC filings • Intelligence analysis

  4. Paradigms: Data Mining vs Text Mining • Data Mining is a set of techniques that aims to extract knowledge by big data analysis, e.g. from databases, using simple patterns and statistical methods • In contrast, the Text Mining's goal (or Machine Reading), • is to extract relationship between entities that just existing in examinatedtext, semantically deeper than data mining does • The difference is in the source information: • relationship in text just exist in our source, • we don't need to try to guess it but only discover it!

  5. Outline • Reminder: Named Entity Tagging • Relation Extraction • Hand-built patterns • Seed (bootstrap) methods • Supervised classification • Distant supervision

  6. Slide from William Cohen What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION

  7. Slide from William Cohen What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… IE NAME TITLE ORGANIZATION Bill GatesCEOMicrosoft Bill VeghteVPMicrosoft Richard StallmanfounderFree Soft..

  8. Slide from William Cohen What is “Information Extraction” As a familyof techniques: Information Extraction = segmentation + classification + clustering + association October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation “named entity extraction”

  9. Slide from William Cohen What is “Information Extraction” As a familyof techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

  10. Slide from William Cohen What is “Information Extraction” As a familyof techniques: Information Extraction = segmentation + classification+ association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

  11. Slide from William Cohen TITLE ORGANIZATION NAME Bill Gates CEO Microsoft Bill Veghte VP Microsoft Free Soft.. Stallman founder Richard What is “Information Extraction” As a familyof techniques: Information Extraction = segmentation + classification+ association+ clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft CorporationCEOBill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a MicrosoftVP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation * * * *

  12. Extracting Structured Knowledge Each article can contain hundreds or thousands of items of knowledge... “The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.” LLNL EQ Lawrence Livermore National Laboratory LLNL LOC-IN California Livermore LOC-IN California LLNL IS-A scientific research laboratory LLNL FOUNDED-BY University of California LLNL FOUNDED-IN 1952

  13. Goal: Machine-readable summaries Structured knowledge extraction: Summary for machine Textual abstract: Summary for human

  14. From Unstructured Text to Structured Knowledge Unstructured Text News articles... slide from Rion Snow

  15. From Unstructured Text to Structured Knowledge Unstructured Text Blog posts.... slide from Rion Snow

  16. From Unstructured Text to Structured Knowledge Unstructured Text Scientific journal articles... slide from Rion Snow

  17. From Unstructured Text to Structured Knowledge Unstructured Text Tweets, instant messages, chat logs... slide from Rion Snow

  18. From Unstructured Text to Structured Knowledge Unstructured Text slide from Rion Snow

  19. From Unstructured Text to Structured Knowledge Structured Knowledge Unstructured Text slide from Rion Snow

  20. From Unstructured Text to Structured Knowledge Structured Knowledge Unstructured Text slide from Rion Snow

  21. From Unstructured Text to Structured Knowledge Structured Knowledge Unstructured Text slide from Rion Snow

  22. From Unstructured Text to Structured Knowledge Structured Knowledge Unstructured Text slide from Rion Snow

  23. From Unstructured Text to Structured Knowledge Structured Knowledge Unstructured Text slide from Rion Snow

  24. From Unstructured Text to Structured Knowledge Structured Knowledge Unstructured Text slide from Rion Snow

  25. Relation Extraction • What is relation extraction? • Founded in 1801 as South Carolina College, USC is the flagship institution of the University of South Carolina System and offers more than 350 programs of study leading to bachelor's, master's, and doctoral degrees from fourteen degree-granting colleges and schools to an enrollment of approximately 45,251 students, 30,967 on the main Columbia campus. … [wiki] • complex relation = summarization • focus on binary relation predicate(subject, object) or triples <subj predicate obj>

  26. Wiki Info Box – structured data • template • standard things about Universities • Established • type • faculty • students • location • mascot

  27. Focus on extracting binary relations • predicate(subject, object) from predicate logic • triples <subj relation object> • Directed graphs

  28. Why relation extraction? • create new structured KB • Augmenting existing: words -> wordnet, facts -> FreeBase or DBPedia • Support question answering: Jeopardy • Which relations • Automated Content Extraction (ACE) http://www.itl.nist.gov/iad/mig//tests/ace/ • 17 relations • ACE examples

  29. Unified Medical Language System (UMLS) • UMLS: Unified Medical 134 entities, 54 relations http://www.nlm.nih.gov/research/umls/

  30. UMLS semantic network

  31. Current Relations in the UMLS Semantic Network • isaassociated_withphysically_related_topart_ofconsists_of            contains connected_to            interconnects branch_oftributary_ofingredient_ofspatially_related_tolocation_ofadjacent_to            surrounds             traverses functionally_related_to            affects                  … • … • temporally_related_to             co-occurs_with             precedes • conceptually_related_to • evaluation_ofdegree_of             analyzes assesses_effect_ofmeasurement_of             measures              diagnoses property_ofderivative_ofdevelopmental_form_ofmethod_of             …

  32. Databases of Wikipedia Relations • DBpedia is a crowd-sourced community effort • to extract structured information from Wikipedia • and to make this information readily available • DBpedia allows you to make sophisticated queries http://dbpedia.org/About

  33. English version of the DBpedia knowledge base • 3.77 million things • 2.35 million are classified in an ontology • including: • including 764,000 persons, • 573,000 places (including 387,000 populated places), • 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), • 192,000 organizations (including 45,000 companies and 42,000 educational institutions), • 202,000 species and  • 5,500 diseases.

  34. freebase • google (freebase wiki) http://wiki.freebase.com/wiki/Main_Page

  35. Ontological relations • Ontological relations • IS-A hypernym • Instance-of • has-Part • hyponym (opposite of hypernym)

  36. Slide from Chris Manning Reminder: Task 1: Named Entity Tagging • General NER or Biomedical NER <PER> John Hennessy</PER> is a professor at <ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>. <RNA> TAR </RNA> independent transactivation by <PROTEIN> Tat </PROTEIN>in cells derived from the <CELL> CNS </CELL>- a novel mechanism of <DNA> HIV-1 gene </DNA>regulation.

  37. Slide from Chris Manning Reminder: Maximum Entropy Markov Model DNA O DNA O regulation HIV−1 gene of

  38. Task II: Relation Extraction

  39. Relations between words • Language Understanding Applications needs word meaning! • Question answering • Conversational agents • Summarization • One key meaning component: word relations • Hierarchical (ontological) relations • “San Francisco” ISA “city” • Other relations between words • “alternator” is a part of a “car”

  40. Relation Prediction “...works by such authors as Herrick, Goldsmith, and Shakespeare.” “If you consider authors like Shakespeare...” “Some authors (including Shakespeare)...” “Shakespeare was the author of several...” “Shakespeare, author of The Tempest...” ShakespeareIS-A author(0.87) How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?

  41. Hyponymy • One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other • car is a hyponym of vehicle • dog is a hyponym of animal • mango is a hyponym of fruit • Conversely • vehicle is a hypernym/superordinate of car • animal is a hypernym of dog • fruit is a hypernym of mango

  42. WordNet relations X is-a-kind-of Y (hyponym / hypernym) X is-a-part-of Y (meronym / holonym)

  43. WordNet is incomplete • Ontological relations are missing for many words • Especially true for specific domains (restaurants, auto parts, finance)

  44. Slide from Eugene Agichtein Other kinds of Relations: Disease Outbreaks • Extract structured information from text May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis… Disease Outbreaks in The New York Times Information Extraction System (e.g., NYU’s Proteus)

  45. interact complex CBF-A CBF-C CBF-B CBF-A-CBF-C complex associates More relations: Protein Interactions „We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“ Slide from Rosario and Hearst

  46. Slide from Jim Martin Yet More Relations CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

  47. Slide from Jim Martin Relation Types • For generic news texts...

  48. More relations: UMLS • Unified Medical Language System • integrates linguistic, terminological and semantic information • Semantic Network consists of 134 semantic types and 54 relations between types Pharmacologic Substance affects Pathologic Function Pharmacologic Substance causes Pathologic Function Pharmacologic Substance complicates Pathologic Function Pharmacologic Substance diagnoses Pathologic Function Pharmacologic Substance prevents Pathologic Function Pharmacologic Substance treats Pathologic Function Slide from Paul Buitelaar

  49. Relations in Ontologies: GO (Gene Ontology) • GO (Gene Ontology) • Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes • Organizing principles are molecular function, biological process and cellular component Accession: GO:0009292 Ontology: biological process Synonyms: broad: genetic exchange Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual. Term Lineage all : all (164142) GO:0008150 : biological process (115947) GO:0007275 : development (11892) GO:0009292 : genetic transfer (69) Slide from Paul Buitelaar

  50. F-Logic Ontology similar Relations in Ontologies: geographical Geographical Entity (GE) is-a flow_through Inhabited GE Natural GE capital_of city mountain river country instance_of located_in Zugspitze Neckar Germany capital_of height (m) length (km) flow_through located_in Stuttgart Berlin 2962 367 flow_through Design: Philipp Cimiano Slide from Paul Buitelaar

More Related