Relation Extraction Slides from Dan Jurafsky, Rion Snow, Jim Martin, Chris Manning and William Cohen
Background: Information Extraction • Extract information from text • Sometimes called text analytics commercially • Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text) • Extract the relations between entities • Figure out the larger events that are taking place
Information Extraction • Creating knowledge bases and ontologies • Implications for cognitive modeling • Digital Libraries • Google Scholar, CiteSeer need to extract the title, author and references • Bioinformatics • Patent analysis • Specific market segments for stock analysis • SEC filings • Intelligence analysis
Paradigms: Data Mining vs Text Mining • Data mining is a set of techniques that aims to extract knowledge through large-scale data analysis, e.g. from databases, using simple patterns and statistical methods • In contrast, the goal of text mining (or machine reading) is to extract relationships between entities that are stated in the examined text, at a semantically deeper level than data mining • The difference lies in the source of the information: • the relationships already exist in the source text, • so we do not need to guess them, only discover them!
Outline • Reminder: Named Entity Tagging • Relation Extraction • Hand-built patterns • Seed (bootstrap) methods • Supervised classification • Distant supervision
Slide from William Cohen What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Slots to fill: NAME, TITLE, ORGANIZATION
Slide from William Cohen What is “Information Extraction” As a task: Filling slots in a database from sub-segments of text. IE applied to the news excerpt above fills the slots:
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
Slide from William Cohen What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering • Segments found in the news excerpt above (“named entity extraction”): Microsoft Corporation • CEO • Bill Gates • Microsoft • Gates • Microsoft • Bill Veghte • Microsoft • VP • Richard Stallman • founder • Free Software Foundation
Slide from William Cohen What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering • After association and clustering, the segments from the news excerpt above form database records:
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
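The last two steps of this decomposition can be sketched in a few lines of Python. This is a minimal illustration, not the method from the slides: it assumes segmentation and classification have already run, so the labeled mentions in `sentences` are written by hand, and both the same-sentence association rule and the suffix-stripping clustering rule are simplifications chosen for this example.

```python
# Hand-labeled output of segmentation + classification, one list per sentence.
# (Hypothetical input; a real system would produce this with a tagger.)
sentences = [
    [("Microsoft Corporation", "ORG"), ("CEO", "TITLE"), ("Bill Gates", "NAME")],
    [("Microsoft", "ORG")],
    [("Gates", "NAME"), ("Microsoft", "ORG")],
    [("Bill Veghte", "NAME"), ("Microsoft", "ORG"), ("VP", "TITLE")],
    [("Richard Stallman", "NAME"), ("founder", "TITLE"),
     ("Free Software Foundation", "ORG")],
]


def associate(sentence):
    """Association: link a NAME, TITLE and ORG that co-occur in one sentence."""
    by_label = {label: text for text, label in sentence}
    if all(label in by_label for label in ("NAME", "TITLE", "ORG")):
        return by_label["NAME"], by_label["TITLE"], by_label["ORG"]
    return None  # not enough slots filled to form a record


def canonical_org(org):
    """Clustering (toy rule): 'Microsoft Corporation' and 'Microsoft' are one entity."""
    suffix = " Corporation"
    return org[: -len(suffix)] if org.endswith(suffix) else org


records = []
for sentence in sentences:
    record = associate(sentence)
    if record is not None:
        name, title, org = record
        records.append((name, title, canonical_org(org)))

for record in records:
    print(" | ".join(record))
```

Sentences that mention only some of the slots produce no record here; a fuller system would also cluster "Gates" with "Bill Gates" before deciding that.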
Extracting Structured Knowledge Each article can contain hundreds or thousands of items of knowledge... “The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research laboratory founded by the University of California in 1952.” • LLNL EQ Lawrence Livermore National Laboratory • LLNL LOC-IN California • Livermore LOC-IN California • LLNL IS-A scientific research laboratory • LLNL FOUNDED-BY University of California • LLNL FOUNDED-IN 1952
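The extracted items can be held as plain (subject, relation, object) records. The sketch below stores the six facts from the LLNL sentence; the `Triple` type and the `facts_about` helper are names invented for this illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted item of knowledge (illustrative type, not from the slides)."""
    subject: str
    relation: str
    obj: str


triples = [
    Triple("LLNL", "EQ", "Lawrence Livermore National Laboratory"),
    Triple("LLNL", "LOC-IN", "California"),
    Triple("Livermore", "LOC-IN", "California"),
    Triple("LLNL", "IS-A", "scientific research laboratory"),
    Triple("LLNL", "FOUNDED-BY", "University of California"),
    Triple("LLNL", "FOUNDED-IN", "1952"),
]


def facts_about(subject, kb):
    """Collect every relation stored for one subject as a relation -> object map."""
    return {t.relation: t.obj for t in kb if t.subject == subject}


llnl = facts_about("LLNL", triples)
print(llnl["FOUNDED-BY"])
```

A single sentence already yields six records, which is why one article can contribute hundreds of them.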
Goal: Machine-readable summaries Structured knowledge extraction: Summary for machine Textual abstract: Summary for human
From Unstructured Text to Structured Knowledge • Unstructured Text: news articles, blog posts, scientific journal articles, tweets, instant messages, chat logs... • Structured Knowledge • slide from Rion Snow
Relation Extraction • What is relation extraction? • Founded in 1801 as South Carolina College, USC is the flagship institution of the University of South Carolina System and offers more than 350 programs of study leading to bachelor's, master's, and doctoral degrees from fourteen degree-granting colleges and schools to an enrollment of approximately 45,251 students, 30,967 on the main Columbia campus. … [wiki] • complex relation = summarization • focus on binary relation predicate(subject, object) or triples <subj predicate obj>
Wiki Info Box – structured data • template • standard things about Universities • Established • type • faculty • students • location • mascot
Focus on extracting binary relations • predicate(subject, object) from predicate logic • triples <subj relation object> • Directed graphs
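A set of binary relations can be read as a labeled directed graph: each subject has outgoing edges tagged with the predicate. The sketch below is a minimal illustration; the relation names (`founded_in`, `flagship_of`, `located_in`) are assumptions made up for the USC example and are not part of any standard inventory.

```python
from collections import defaultdict


def build_graph(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))
    return graph


def objects(graph, subj, pred):
    """Follow edges labeled `pred` that leave `subj`."""
    return [o for p, o in graph.get(subj, []) if p == pred]


# Hypothetical triples read off the USC sentence.
triples = [
    ("USC", "founded_in", "1801"),
    ("USC", "flagship_of", "University of South Carolina System"),
    ("USC", "located_in", "Columbia"),
]
graph = build_graph(triples)

print(objects(graph, "USC", "founded_in"))
print(objects(graph, "1801", "founded_in"))  # edges are directed: nothing leaves "1801"
```

Because the edges are directed, founded_in(USC, 1801) does not imply founded_in(1801, USC).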
Why relation extraction? • Create new structured KBs • Augment existing ones: words -> WordNet, facts -> Freebase or DBpedia • Support question answering: Jeopardy • Which relations? • Automated Content Extraction (ACE) http://www.itl.nist.gov/iad/mig//tests/ace/ • 17 relations • ACE examples
Unified Medical Language System (UMLS) • 134 entity types, 54 relations http://www.nlm.nih.gov/research/umls/
Current Relations in the UMLS Semantic Network • isa • associated_with • physically_related_to: part_of, consists_of, contains, connected_to, interconnects, branch_of, tributary_of, ingredient_of • spatially_related_to: location_of, adjacent_to, surrounds, traverses • functionally_related_to: affects … • … • temporally_related_to: co-occurs_with, precedes • conceptually_related_to: evaluation_of, degree_of, analyzes, assesses_effect_of, measurement_of, measures, diagnoses, property_of, derivative_of, developmental_form_of, method_of …
Databases of Wikipedia Relations • DBpedia is a crowd-sourced community effort • to extract structured information from Wikipedia • and to make this information readily available • DBpedia allows you to make sophisticated queries http://dbpedia.org/About
English version of the DBpedia knowledge base • 3.77 million things • 2.35 million are classified in an ontology, including: • 764,000 persons, • 573,000 places (including 387,000 populated places), • 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games), • 192,000 organizations (including 45,000 companies and 42,000 educational institutions), • 202,000 species and • 5,500 diseases.
Freebase • Google (Freebase wiki) http://wiki.freebase.com/wiki/Main_Page
Ontological relations • IS-A (hypernym) • Instance-of • has-Part • hyponym (opposite of hypernym)
Slide from Chris Manning Reminder: Task 1: Named Entity Tagging • General NER or Biomedical NER <PER>John Hennessy</PER> is a professor at <ORG>Stanford University</ORG>, in <LOC>Palo Alto</LOC>. <RNA>TAR</RNA> independent transactivation by <PROTEIN>Tat</PROTEIN> in cells derived from the <CELL>CNS</CELL> - a novel mechanism of <DNA>HIV-1 gene</DNA> regulation.
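Text annotated in this inline style can be turned into (entity, type) pairs with a small regular expression. This is only a sketch for reading such annotations; it assumes tags are uppercase, properly closed and not nested, and it is not a tagger itself. The names `TAGGED` and `entities` are invented for the example.

```python
import re

# <LABEL> text </LABEL>, tolerating stray spaces just inside the tags.
TAGGED = re.compile(r"<(?P<label>[A-Z]+)>\s*(?P<text>.*?)\s*</(?P=label)>")


def entities(tagged_text):
    """Return (entity text, entity type) pairs in order of appearance."""
    return [(m.group("text"), m.group("label"))
            for m in TAGGED.finditer(tagged_text)]


general = ("<PER> John Hennessy</PER> is a professor at "
           "<ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>.")
biomedical = ("<RNA> TAR </RNA> independent transactivation by "
              "<PROTEIN> Tat </PROTEIN>in cells derived from the <CELL> CNS </CELL>")

print(entities(general))
print(entities(biomedical))
```

The typed entity pairs are the input to relation extraction: a relation classifier then decides, for instance, whether a PER and an ORG in the same sentence stand in an employment relation.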
Slide from Chris Manning Reminder: Maximum Entropy Markov Model • Figure: MEMM tagging the tokens “of HIV-1 gene regulation” with the labels DNA and O
Relations between words • Language understanding applications need word meaning! • Question answering • Conversational agents • Summarization • One key meaning component: word relations • Hierarchical (ontological) relations • “San Francisco” ISA “city” • Other relations between words • “alternator” is a part of a “car”
Relation Prediction “...works by such authors as Herrick, Goldsmith, and Shakespeare.” “If you consider authors like Shakespeare...” “Some authors (including Shakespeare)...” “Shakespeare was the author of several...” “Shakespeare, author of The Tempest...” Shakespeare IS-A author (0.87) How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?
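One way to see the problem is to write one hand-built pattern per phrasing and count how often each (instance, class) pair is matched. The sketch below does this for the five example sentences. The regular expressions, the crude `NAME` recognizer and the `isa_evidence` helper are assumptions made for this illustration, and the raw counts it returns are not the 0.87 confidence shown on the slide.

```python
import re
from collections import Counter

NAME = r"[A-Z][a-z]+"  # crude stand-in for a proper-name recognizer

# One hand-built pattern per phrasing; `cls` is the class noun, `inst` the instance(s).
PATTERNS = [
    re.compile(rf"such (?P<cls>[a-z]+)s as (?P<inst>{NAME}(?:,? (?:and )?{NAME})*)"),
    re.compile(rf"(?P<cls>[a-z]+)s like (?P<inst>{NAME})"),
    re.compile(rf"(?P<cls>[a-z]+)s \(including (?P<inst>{NAME})\)"),
    re.compile(rf"(?P<inst>{NAME}) was the (?P<cls>[a-z]+) of"),
    re.compile(rf"(?P<inst>{NAME}), (?P<cls>[a-z]+) of"),
]


def isa_evidence(sentences):
    """Count pattern matches supporting each (instance, class) IS-A pair."""
    counts = Counter()
    for sentence in sentences:
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                for instance in re.findall(NAME, match.group("inst")):
                    counts[(instance, match.group("cls"))] += 1
    return counts


corpus = [
    "...works by such authors as Herrick, Goldsmith, and Shakespeare.",
    "If you consider authors like Shakespeare...",
    "Some authors (including Shakespeare)...",
    "Shakespeare was the author of several...",
    "Shakespeare, author of The Tempest...",
]
evidence = isa_evidence(corpus)
print(evidence.most_common())
```

Each new phrasing needs a new pattern, which is the limitation that motivates learning the patterns from a large corpus instead of writing them by hand.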
Hyponymy • One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other • car is a hyponym of vehicle • dog is a hyponym of animal • mango is a hyponym of fruit • Conversely • vehicle is a hypernym/superordinate of car • animal is a hypernym of dog • fruit is a hypernym of mango
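The two directions can be sketched with a small lookup table. This is a toy illustration under a simplifying assumption: each word has exactly one direct hypernym, whereas a resource such as WordNet links senses rather than words and allows several. The names `HYPERNYM_OF`, `is_hyponym` and `hyponyms` are invented for the example.

```python
# Direct hypernym of each word, taken from the examples above.
HYPERNYM_OF = {"car": "vehicle", "dog": "animal", "mango": "fruit"}


def is_hyponym(word, candidate, hypernym_of=HYPERNYM_OF):
    """True if `candidate` is reached by following hypernym links up from `word`."""
    seen = set()
    while word in hypernym_of and word not in seen:
        seen.add(word)  # guard against accidental cycles in the table
        word = hypernym_of[word]
        if word == candidate:
            return True
    return False


def hyponyms(word, hypernym_of=HYPERNYM_OF):
    """Direct hyponyms of `word`: the inverse of the hypernym table."""
    return sorted(w for w, parent in hypernym_of.items() if parent == word)


print(is_hyponym("car", "vehicle"))
print(is_hyponym("vehicle", "car"))
print(hyponyms("animal"))
```

The relation is asymmetric: car is a hyponym of vehicle, but vehicle is not a hyponym of car.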
WordNet relations X is-a-kind-of Y (hyponym / hypernym) X is-a-part-of Y (meronym / holonym)
WordNet is incomplete • Ontological relations are missing for many words • Especially true for specific domains (restaurants, auto parts, finance)
Slide from Eugene Agichtein Other kinds of Relations: Disease Outbreaks • Extract structured information from text May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis… Disease Outbreaks in The New York Times Information Extraction System (e.g., NYU’s Proteus)
More relations: Protein Interactions “We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.” • Figure: CBF-A and CBF-C linked by interact and complex; CBF-B linked to the CBF-A-CBF-C complex by associates • Slide from Rosario and Hearst
Slide from Jim Martin Yet More Relations CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.
Slide from Jim Martin Relation Types • For generic news texts...
More relations: UMLS • Unified Medical Language System • integrates linguistic, terminological and semantic information • Semantic Network consists of 134 semantic types and 54 relations between types • Pharmacologic Substance affects Pathologic Function • Pharmacologic Substance causes Pathologic Function • Pharmacologic Substance complicates Pathologic Function • Pharmacologic Substance diagnoses Pathologic Function • Pharmacologic Substance prevents Pathologic Function • Pharmacologic Substance treats Pathologic Function • Slide from Paul Buitelaar
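Because the Semantic Network states which relations may hold between which semantic types, it can be used to filter candidate relations by type. The sketch below encodes only the six relations listed above; the dictionary layout and the `is_licensed` helper are assumptions for illustration and do not reflect how UMLS itself is distributed.

```python
# Relations the slide lists between one pair of semantic types.
SEMANTIC_NETWORK = {
    ("Pharmacologic Substance", "Pathologic Function"): {
        "affects", "causes", "complicates", "diagnoses", "prevents", "treats",
    },
}


def is_licensed(subject_type, relation, object_type, network=SEMANTIC_NETWORK):
    """True if the network allows `relation` from subject_type to object_type."""
    return relation in network.get((subject_type, object_type), set())


print(is_licensed("Pharmacologic Substance", "treats", "Pathologic Function"))
print(is_licensed("Pathologic Function", "treats", "Pharmacologic Substance"))
```

An extractor can use such a check to discard candidates whose argument types do not fit the relation, for example a disease that "treats" a drug.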
Relations in Ontologies: GO (Gene Ontology) • Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes • Organizing principles are molecular function, biological process and cellular component • Accession: GO:0009292 • Ontology: biological process • Synonyms: broad: genetic exchange • Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual. • Term Lineage: all : all (164142) > GO:0008150 : biological process (115947) > GO:0007275 : development (11892) > GO:0009292 : genetic transfer (69) • Slide from Paul Buitelaar
Relations in Ontologies: geographical • Figure: geographical ontology (labeled “F-Logic Ontology similar”) • Concepts: Geographical Entity (GE), Inhabited GE, Natural GE, country, city, mountain, river • Relations: is-a, instance_of, located_in, capital_of, flow_through • Instances: Zugspitze, Neckar, Germany, Stuttgart, Berlin • Attribute values: height (m) 2962, length (km) 367 • Design: Philipp Cimiano • Slide from Paul Buitelaar