Simple Algorithms for Complex Relation Extraction with Applications to Biomedical IE • Ryan McDonald, Fernando Pereira, Seth Kulick: CIS and IRCS, University of Pennsylvania, Philadelphia, PA • Scott Winters, Yang Jin, Pete White: Division of Oncology, Children’s Hospital of Pennsylvania, Philadelphia, PA • ACL 2005
Abstract • Simple two-stage method for extracting complex relations between named entities in text. • Complex relations are n-ary relations • First stage: create a graph from pairs of related entities • Second stage: find maximal cliques in the graph • Experiments on biomedical text
Introduction - 1/2 • n-ary relation • The relation is defined by the schema (t1,…, tn) • ti is an entity type • A tuple in the relation is a list of entities (e1,...,en) • Type(ei)=ti or ei=⊥ (missing argument) • Example : • Types : {person, job, company} • “John Smith is the CEO at Inc. Corp.” • (John Smith, CEO, Inc. Corp.) • “Everyday John Smith goes to his office at Inc. Corp.” • (John Smith, ⊥, Inc. Corp.)
Introduction - 2/2 • Applications : • Question answering • Automatic database generation • Intelligent document searching and indexing • Most relation extraction systems focus on binary relations, such as : • the employee-of relation • the protein-protein interaction relation • Extracting keyphrases to represent relations in social networks from the Web (Matsuo et al., IJCAI-07)
Previous Work • Zelenko et al., 2003 • Binary relations in news text • “John Smith, not Jane Smith, works at IBM.” • (John Smith, IBM) : positive • (Jane Smith, IBM) : negative • Miller et al., 2000 • Identify all relations • Relation extraction from probabilistic parse trees • Rosario and Hearst, 2004 • Extracting seven relationships between treatments and diseases
Definitions • n-ary relation • The relation is defined by the schema (t1,…, tn) • ti is an entity type • A tuple in the relation is a list of entities (e1,...,en) • Type(ei)=ti or ei=⊥ (missing argument) • A maximal clique • An undirected graph G=(V,E) • V: a set of vertices, E: a set of edges • A clique C of G is a subgraph of G in which there is an edge between every pair of vertices. • A maximal clique of G is a clique C=(Vc, Ec) such that there is no other clique C’=(Vc’, Ec’) with Vc ⊂ Vc’.
Methods : Classifying Binary Relations-1/3 • Example : {person, job, company} • “John and Jane are CEOs at Inc. Corp. and Biz. Corp. respectively.” • 12 possible tuples • Problems with building a classifier over whole tuples : • Exponential run time • How to manage incomplete but correct instances such as (John, ⊥, Inc. Corp.) • If it is marked as negative, the model might incorrectly disfavor features that correlate John with Inc. Corp. • If it is labeled as positive, the model may tend to prefer the shorter and more compact incomplete relations. • If we ignore instances of this form, the data would be heavily skewed towards negative instances.
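The count of 12 candidate tuples can be reproduced with a short enumeration. This is a minimal sketch under one assumption that is not stated on the slide: a candidate must have at least two filled arguments (the `min_args` parameter). The names `MISSING` and `candidate_tuples` are hypothetical.

```python
from itertools import product

MISSING = None  # stands in for a missing argument (the ⊥ slot)

def candidate_tuples(people, jobs, companies, min_args=2):
    """Enumerate (person, job, company) candidates, allowing missing slots.

    Assumption: only tuples with at least `min_args` filled slots count
    as candidates; this is what yields 12 for the example sentence.
    """
    slots = [people + [MISSING], jobs + [MISSING], companies + [MISSING]]
    return [t for t in product(*slots)
            if sum(arg is not MISSING for arg in t) >= min_args]

tuples = candidate_tuples(["John", "Jane"], ["CEO"],
                          ["Inc. Corp.", "Biz. Corp."])
print(len(tuples))  # 12
```

Every extra entity multiplies the number of candidates, which is where the exponential run time of naive enumeration comes from.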
Methods : Classifying Binary Relations-2/3 • Solution : • The set of all possible pairs is much smaller than the set of all possible complex relation instances. • Train a classifier to identify pairs of related entities. • Positive : (John, CEO), (John, Inc. Corp.), (CEO, Inc. Corp.), (CEO, Biz. Corp.), (Jane, CEO) and (Jane, Biz. Corp.) • Negative : (John, Biz. Corp.) and (Jane, Inc. Corp.)
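The positive and negative pairs listed on this slide can be derived from the gold tuples. The sketch below is illustrative only: `label_pairs` is a hypothetical helper, and skipping pairs of same-typed entities (for example John and Jane) is an assumption made so that the output matches the slide.

```python
from itertools import combinations

def label_pairs(entities, gold_tuples):
    """Split entity pairs into positive and negative training examples.

    entities: dict mapping entity name -> entity type.
    gold_tuples: annotated relation instances; None marks a missing slot.
    Assumption: two entities of the same type never form a candidate pair.
    """
    related = set()
    for tup in gold_tuples:
        present = [e for e in tup if e is not None]
        related.update(frozenset(p) for p in combinations(present, 2))
    positive, negative = set(), set()
    for a, b in combinations(entities, 2):
        if entities[a] == entities[b]:
            continue
        pair = frozenset((a, b))
        (positive if pair in related else negative).add(pair)
    return positive, negative

entities = {"John": "person", "Jane": "person", "CEO": "job",
            "Inc. Corp.": "company", "Biz. Corp.": "company"}
gold = [("John", "CEO", "Inc. Corp."), ("Jane", "CEO", "Biz. Corp.")]
positive, negative = label_pairs(entities, gold)
print(len(positive), len(negative))  # 6 2
```

An incomplete gold tuple such as ("John", None, "Inc. Corp.") simply contributes one positive pair, so it needs no special handling at this stage.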
Methods : Classifying Binary Relations-3/3 • Learning a binary relation classifier : • A standard maximum entropy classifier (Berger et al., 1996) implemented as part of MALLET (McCallum, 2002)
Methods : Reconstructing Complex Relations • Example : pairs judged related by the binary classifier • (John, CEO), (John, Inc. Corp.), (John, Biz. Corp.), (CEO, Inc. Corp.), (CEO, Biz. Corp.) and (Jane, CEO) • Relation graph : Figure 2a • Cliques : Figure 2b • Algorithm for finding all maximal cliques : • Bron and Kerbosch, 1973
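The reconstruction step for this example can be sketched with the basic Bron-Kerbosch recursion (no pivoting). This is an illustrative version, not the authors' implementation; the function names are hypothetical.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Basic Bron-Kerbosch: r = current clique, p = candidates, x = excluded."""
    if not p and not x:
        out.append(frozenset(r))  # r cannot be extended, so it is maximal
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}

def maximal_cliques(edges):
    """Build an undirected adjacency map and return all maximal cliques."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    out = []
    bron_kerbosch(set(), set(adj), set(), adj, out)
    return out

# Edges from the slide: pairs the binary classifier judged to be related.
edges = [("John", "CEO"), ("John", "Inc. Corp."), ("John", "Biz. Corp."),
         ("CEO", "Inc. Corp."), ("CEO", "Biz. Corp."), ("Jane", "CEO")]
cliques = maximal_cliques(edges)
for clique in cliques:
    print(sorted(clique))
```

On this graph the maximal cliques are {John, CEO, Inc. Corp.}, {John, CEO, Biz. Corp.} and {Jane, CEO}. The last one corresponds to an incomplete tuple, because the classifier linked Jane to no company.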
Methods : Probabilistic Cliques • The above approach has a major shortcoming: it assumes the output of the binary classifier is always correct. • Weight of a clique : w(C) • w(e) : weight (probability) of edge e • A valid tuple : w(C) ≥ 0.5
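The formula for the clique weight did not survive on this slide. The sketch below assumes it is the geometric mean of the edge probabilities, and it assumes that only cliques maximal among those reaching the threshold are kept. The brute-force enumeration is exponential and purely illustrative; `clique_weight` and `probabilistic_cliques` are hypothetical names.

```python
from itertools import combinations
from math import prod

def clique_weight(vertices, w):
    """Geometric mean of pairwise edge weights (assumed definition of w(C)).

    w maps frozenset({a, b}) -> classifier probability; absent pairs get 0.
    """
    edges = [frozenset(p) for p in combinations(vertices, 2)]
    return prod(w.get(e, 0.0) for e in edges) ** (1.0 / len(edges))

def probabilistic_cliques(vertices, w, threshold=0.5):
    """Keep vertex sets with weight >= threshold that no larger valid set contains."""
    valid = [frozenset(c)
             for k in range(2, len(vertices) + 1)
             for c in combinations(vertices, k)
             if clique_weight(c, w) >= threshold]
    return [c for c in valid if not any(c < other for other in valid)]

# Made-up probabilities: one edge falls below 0.5 on its own.
weights = {frozenset(("John", "CEO")): 0.9,
           frozenset(("John", "Inc. Corp.")): 0.9,
           frozenset(("CEO", "Inc. Corp.")): 0.4}
result = probabilistic_cliques(["John", "CEO", "Inc. Corp."], weights)
print(result)
```

With these made-up weights the full triple is still accepted, since two confident edges compensate for the weak (CEO, Inc. Corp.) edge. A hard 0.5 cut on individual edges would have dropped it.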
Experiments-1/2 • Extracting genomic variation events from biomedical text (McDonald et al., 2004) • (var-type, location, initial-state, altered-state) • “At codons 12 and 61, the occurrence of point mutations from G/A to T/G were observed” • (point mutation, codon 12, G, T) • (point mutation, codon 61, A, G) • 447 abstracts selected from MEDLINE • 4691 sentences • 4773 entities and 1218 relations • Of the 1218 relations : • 760 have two ⊥ arguments, 283 have one ⊥ argument, 175 have no ⊥ arguments • 38% cannot be handled using binary relations • 4% of the annotated relations are non-sentential • Maximum recall : 96%
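The percentages on this slide follow from the counts. The check below reads the three counts as relations with two, one and zero missing (⊥) arguments out of four, which is an interpretation of the slide; the variable names are hypothetical.

```python
# Relations grouped by number of missing arguments (schema has 4 slots).
relations_by_missing = {2: 760, 1: 283, 0: 175}

total = sum(relations_by_missing.values())  # 1218

# Two missing arguments leave a binary relation; the rest are ternary or 4-ary.
non_binary = relations_by_missing[1] + relations_by_missing[0]
non_binary_share = non_binary / total

# Relations spanning several sentences cannot be recovered sentence by sentence.
non_sentential_share = 0.04
max_recall = 1.0 - non_sentential_share

print(total, round(non_binary_share * 100), round(max_recall * 100))
```

This gives 458 of 1218 relations, about 38%, that a purely binary extractor cannot represent, and a recall ceiling of 96%.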
Experiments-2/2 • MC: • Uses the maximum entropy binary classifier coupled with the maximal clique complex relation reconstructor. • PC: • Same as above, except it uses the probabilistic clique complex relation reconstructor. • NE: • A maximum entropy classifier that naively enumerates all possible relation instances, as described in Methods : Classifying Binary Relations-1/3.
Conclusions and Future Work • Complex relation extraction: • Binary relation learning: maximum entropy classifier • Finding maximal cliques in the relation graph • Genomic variation relations • Future work • Parse trees • Learn how to cluster vertices into relational groups • A vertex/entity can participate in one or more relations
Learning Field Compatibilities to Extract Database Records from Unstructured Text • M Wick, A Culotta, A McCallum (EMNLP 2006) • Using Dependency Parsing and Probabilistic Inference to Extract Relationships between Genes • B Goertzel, H Pinto, A Heljakka, IF Goertzel, M (BioNLP 2006) • Relation Extraction for Semantic Intranet Annotations (Technical Report) • L Specia, C Baldassarre, E Motta - kmi.open.ac.uk