4. Relationship Extraction
Part 4 of Information Extraction, Sunita Sarawagi
CS 652, Peter Lindes
The Problem
• Relate extracted entities: the text is unstructured and not partitioned into records
• Various competitions
  • MUC
  • ACE
  • BioCreAtIvE II Protein-Protein Interaction
Groups of Relationships
• ACE:
  • located at, near, part, role, social
  • for entities: person, organization, facility, location, and geo-political entity
• Biomedical: gene-disease, protein-protein, subcellular regularizations
• NAGA knowledge base: 26 relationships such as isA, bornInYear, establishedInYear, hasWonPrize, locatedIn, politicianOf, …
Three Problem Levels
• First case:
  • Entities preidentified in unstructured text
  • Given a pair of entities, find the type of relationship
• Second case:
  • Given relationship type r and entity name e
  • Extract entities with which e has relationship r
• Third case:
  • Open-ended corpus, such as the web
  • Given relationship type r, find entity pairs
Given Entity Pair, Find Relationship
• R: set of relationship types
• R ∪ {other}: R plus a special member for “other”
• x: a “snippet” of text (might be a sentence)
• E1 and E2: two entities in x
• Identify the relationship in R ∪ {other} that holds between E1 and E2
• Resources available:
  • Surface Tokens
  • Part of Speech tags
  • Syntactic Parse Tree Structure
  • Dependency Graph
• Use these clues to classify (x, E1, E2) into one of R ∪ {other}
Parse Tree (figure)
Dependency Graph (figure)
Methods to Extract Relationships
• Feature-based methods
  • String form, orthographic type, POS tag, etc.
  • Features from Dependency Graph
  • Features from Word Sequence
  • Features from Parse Trees
• Kernel-based methods
  • Kernel function K(X, X’) captures similarity
  • Support Vector Machine (SVM) classifier
• Rule-based methods
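A minimal sketch of the feature-based and kernel-based ideas above, restricted to word-sequence features. The function names, the two-token context window, and the feature naming scheme are assumptions made for illustration; they are not taken from the slides. The kernel shown is a plain dot product over sparse feature counts, the simplest valid choice of K(X, X’).

```python
from collections import Counter


def word_sequence_features(tokens, e1_span, e2_span):
    """Count features from tokens before, between, and after two entities.

    tokens: list of str. e1_span, e2_span: (start, end) token offsets,
    end exclusive, with E1 occurring before E2 in the sentence.
    """
    (s1, t1), (s2, t2) = e1_span, e2_span
    feats = Counter()
    for tok in tokens[max(0, s1 - 2):s1]:      # up to two tokens before E1
        feats["before=" + tok.lower()] += 1
    for tok in tokens[t1:s2]:                  # all tokens between E1 and E2
        feats["between=" + tok.lower()] += 1
    for tok in tokens[t2:t2 + 2]:              # up to two tokens after E2
        feats["after=" + tok.lower()] += 1
    feats["gap_len=%d" % (s2 - t1)] += 1       # number of tokens between them
    return feats


def linear_kernel(fx, fy):
    """K(X, X') as the dot product of two sparse feature-count vectors."""
    return sum(v * fy[k] for k, v in fx.items() if k in fy)


x = word_sequence_features("Google acquired YouTube in 2006".split(), (0, 1), (2, 3))
y = word_sequence_features("Oracle acquired Sun in 2010".split(), (0, 1), (2, 3))
similarity = linear_kernel(x, y)
```

The two snippets share the features `between=acquired`, `after=in`, and `gap_len=1`, so they score as similar even though the entity names differ. A feature-based classifier would consume the counts directly, while an SVM could use `linear_kernel` (or a richer kernel over parse trees or dependency paths) in its place.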
Given Relationship, Find Entity Pairs
• Given one or more relationship types
• Find all occurrences in a corpus
  • Open document collection
  • No labeled unstructured training data
  • Instead, seeding for each relationship type is used
Seed Data for Relationship Type r
• The types of entities that are arguments of r
  • Often specified at a high level, e.g. proper noun, common noun, numeric, etc.
  • Types such as “Person” or “Company” require patterns to recognize them
• A seed database S of entity pairs that have r
  • May include negative examples
• A seed set of manually coded patterns
  • Easy for generic relationships, e.g. hypernym or meronym (part-of)
3 Steps for Relationship Extraction
• Start with above seeding data
  • A corpus D
  • Relationship types r1,…,rk
  • Entity types Tr1, Tr2 for each r
  • A set S of examples (Ei1, Ei2, ri), 1 ≤ i ≤ N
• 1: Use S to learn extraction patterns M
• 2: Use a subset of patterns to create candidates
• 3: Validation: select a subset based on statistical tests
Example Data
• Relationships: “IsPhDAdvisorOf”, “Acquired”
• Entity types: “(Person, Person)”, “(Company, Company)”
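One way the seeding inputs could be held in code is sketched below. The class and method names are hypothetical, and the entity names `AdvisorA` and `StudentB` are placeholders rather than real seed data. The sketch also assumes that a pair seeded for one relationship counts as a negative example for every other relationship.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RelationSpec:
    name: str         # relationship type r, e.g. "Acquired"
    arg_types: tuple  # (Tr1, Tr2), e.g. ("Company", "Company")


@dataclass
class SeedData:
    relations: list                               # list of RelationSpec
    examples: list = field(default_factory=list)  # seed triples (E1, E2, r)

    def positives_for(self, r):
        """Entity pairs seeded with relationship r."""
        return [(e1, e2) for e1, e2, rel in self.examples if rel == r]

    def negatives_for(self, r):
        """Pairs seeded with some other relationship, treated as negatives for r."""
        return [(e1, e2) for e1, e2, rel in self.examples if rel != r]


seeds = SeedData(
    relations=[
        RelationSpec("IsPhDAdvisorOf", ("Person", "Person")),
        RelationSpec("Acquired", ("Company", "Company")),
    ],
    examples=[
        ("AdvisorA", "StudentB", "IsPhDAdvisorOf"),  # placeholder names
        ("Google", "YouTube", "Acquired"),
    ],
)
```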
Learn Patterns from Seed Triples
• Assume only one relationship for each pair
  • Thus each example for r is negative for r’
• 1: Find sentences with entity pairs
  • For (E1, E2, r) query for “E1 NEAR E2”
  • Filter out where E1, E2 don’t match Tr1, Tr2
• 2: Filter sentences for the relationship
• 3: Learn patterns from sentences
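Step 1 above can be approximated without a search engine by scanning sentences for both seed entities within a token window. This is only a rough stand-in for an “E1 NEAR E2” query: the function names and the `max_gap` parameter are assumptions, entities are treated as single whitespace-delimited tokens, and the entity-type filter is left out.

```python
def sentences_with_pair(sentences, e1, e2, max_gap=10):
    """Return tokenized sentences in which e1 and e2 occur within max_gap tokens."""
    hits = []
    for sent in sentences:
        tokens = sent.split()
        pos1 = [i for i, t in enumerate(tokens) if t == e1]
        pos2 = [i for i, t in enumerate(tokens) if t == e2]
        if any(abs(i - j) <= max_gap for i in pos1 for j in pos2):
            hits.append(tokens)
    return hits


def collect_training_sentences(sentences, seed_triples, max_gap=10):
    """Map each seed triple (E1, E2, r) to the sentences mentioning both entities."""
    return {
        (e1, e2, r): sentences_with_pair(sentences, e1, e2, max_gap)
        for e1, e2, r in seed_triples
    }


corpus = [
    "Google acquired YouTube in 2006",
    "YouTube is a video site",
    "Google is in California",
]
matched = collect_training_sentences(corpus, [("Google", "YouTube", "Acquired")])
```

Only the first sentence mentions both seed entities, so it is the only one kept for the `Acquired` seed. The later filtering step still has to decide whether such a sentence actually expresses the relationship.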
Filtering Sentences
• Example:
• Banko: a simple heuristic using the length of dependency links
  • This fails for the above example
Learn Patterns from Sentences
• Formulate as a standard classification problem
• Two practical problems:
  • No guarantee of positive examples
    • Bunescu and Mooney: use SVM
  • Many sentences for each pair
    • Bunescu and Mooney: down-weight correlated terms
Extract Candidate Entity Pairs
• Learned model M: (x, E1, E2) → r
• Simple method: sequential scan over D
  • Look for Tr1, Tr2, then apply M
• Large, indexed corpus: retrieve relevant sentences
  • Use keyword search
    • Pattern-based
    • Keyword-based
  • Agichtein and Gravano: iterative solution
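The simple sequential-scan method might look like the sketch below. Both `find_mentions` (an entity recognizer for a given type) and `model` (the learned M) are hypothetical callables supplied by the caller; the toy versions here exist only to make the example runnable and are not a real recognizer or a trained classifier.

```python
from itertools import product


def extract_candidates(sentences, type1, type2, find_mentions, model):
    """Scan every sentence and apply the model to each (type1, type2) mention pair.

    find_mentions(tokens, entity_type) -> list of token indices
    model(tokens, i, j) -> a relationship name, or "other"
    """
    candidates = []
    for sent in sentences:
        tokens = sent.split()
        pairs = product(find_mentions(tokens, type1), find_mentions(tokens, type2))
        for i, j in pairs:
            if i == j:
                continue  # an entity cannot be related to itself here
            r = model(tokens, i, j)
            if r != "other":
                candidates.append((tokens[i], tokens[j], r))
    return candidates


# Toy stand-ins, for illustration only.
KNOWN_COMPANIES = {"Google", "YouTube", "Oracle", "Sun"}


def toy_find_mentions(tokens, entity_type):
    if entity_type != "Company":
        return []
    return [i for i, t in enumerate(tokens) if t in KNOWN_COMPANIES]


def toy_model(tokens, i, j):
    return "Acquired" if i < j and "acquired" in tokens[i + 1:j] else "other"


found = extract_candidates(
    ["Google acquired YouTube in 2006", "Oracle and Sun are companies"],
    "Company", "Company", toy_find_mentions, toy_model,
)
```

A full scan touches every sentence in D, which is why the slide turns to keyword retrieval for large indexed corpora.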
Validate Extracted Relationships
• Extraction has high error rates
• Validation based on corpus-wide statistics
  • Probabilities based on count of occurrences
  • Extract only high-confidence relationships
• Rare relationships:
  • Use contextual pattern
  • Alternative: correct entity boundary errors
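Count-based validation can be illustrated with a noisy-or score. The sketch assumes every individual extraction is independently correct with the same probability `precision`, so a triple seen n times is taken to be correct with probability 1 - (1 - precision)^n. That independence assumption, the default values, and the function name are all illustrative choices, not the specific statistical test used by any system named in the slides.

```python
from collections import Counter


def validate(candidates, precision=0.6, threshold=0.9):
    """Keep triples whose noisy-or confidence reaches the threshold.

    candidates: list of (E1, E2, r) triples, one per extracted occurrence.
    Returns a dict mapping each retained triple to its confidence.
    """
    counts = Counter(candidates)
    confident = {}
    for triple, n in counts.items():
        confidence = 1.0 - (1.0 - precision) ** n
        if confidence >= threshold:
            confident[triple] = confidence
    return confident


occurrences = [("Google", "YouTube", "Acquired")] * 3 + [("Oracle", "Google", "Acquired")]
kept = validate(occurrences)
```

With these defaults a triple extracted three times scores 1 - 0.4³ = 0.936 and is kept, while a triple extracted once scores 0.6 and is dropped. This also shows why rare relationships need the other strategies listed on the slide: a single occurrence can never pass a count-based threshold.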
Summary
• Setting 1: entities already marked
  • Feature-based and kernel-based methods
  • Clues from word sequence, parse trees, and dependency graphs
  • Training data with labeled relationships
• Setting 2: open corpus, given relationship types
  • No labeled unstructured data
  • Seed database of (E1, E2, r) examples
  • Bootstrapping from seed data
  • Filter based on relevancy
• Accuracy:
  • 50%-70% for closed benchmark datasets
  • Lots of special case handling for the web
Further Readings
• Concentrated here on binary relationships
• Natural extension: records with multi-way relationships
• Requires cross-sentence analysis:
  • Co-reference resolution
  • Discourse analysis
• Much literature on this topic
• Future research: discovering relevant relationship types