Taxonomy-Aided Massive Relationship Extraction from the Web
The Task • Goal: • Extract instance pairs that satisfy a certain relationship from the Web. • E.g., (Microsoft, Redmond) satisfies the HeadquarterOf relationship • Input: • Web corpus • Probase Taxonomy: classes/instances
Traditional Approaches • Traditional approach • Manually supply a set of seed pairs for a given relationship • Find patterns from the seed pairs • Use the patterns to find more pairs • Problem: Scalability • Cannot manually supply seed pairs for millions of relationships • Relationship extraction is time-consuming even for one relationship • Problem: Quality • No semantic guarantee: nothing ensures that the extracted left entity is a company or that the right entity is a location
Our goal: targeting all the relationships we can think of • How large is the set R of relationships? • Each relationship in R is defined over a pair of taxonomy classes • Example: (companies, locations) for the HeadquarterOf relationship
Challenge: where does R come from? • From Probase? • From Freebase? • From Tables? • Language patterns? • What is the <attr> of <instance>?
Challenge: What are the seed pairs for each relationship? • Not enough seed pairs • Possible to miss some useful patterns • Hard to evaluate quality of patterns • Two-phase approach • From one to several pairs • Extract high-quality patterns, then extract seed pairs • From several to more pairs • Extract discriminative terms, then extract more pairs • Terms are more general and have more semantics
Challenge: What are the candidate entities? • Cannot afford to scan the web corpus for each relationship • Scan web texts to find candidates for millions of relationships simultaneously • Need to avoid generating massive candidate pairs • Massive relations * massive candidate pairs = infeasible • Solution: • use instances in a taxonomy
Approach overview • Phase 1: from one pair to several seed pairs • Extract high-quality patterns • Extract seed pairs • Phase 2: from seed pairs to more pairs • Map classes in seed pairs and taxonomy • Extract discriminative terms • Extract more pairs
Overview
• Web pages: Page 1: "You can view the upcoming Toy Story 3 trailer here!" Page 2: "View company contact information for Pretty Woman on IMDbPro" ……
• Input relations: (Spider-Man, director, Sam Raimi), (Microsoft, headquarter, Redmond), (telephone, invented, Bell), ……
• Patterns: (The director of, is, .), (The headquarter of, is, .), (, is invented by, .), ……
• Seed pairs: (Natural Born Killers, Oliver Stone), (Army of Darkness, Sam Raimi), (Toy Story, John Lasseter), (Microsoft, Redmond), (IBM, Armonk), (Google, Mountain View), (telephone, Bell), (polygraph machine, Mackenzie), (dry cleaning, Baptiste), ……
• Taxonomy: Movies: Toy Story, Matrix, …; Companies: Microsoft, IBM, …; Directors: Stone, Lasseter, ……
• Classes: R1: (movies, directors), R2: (companies, locations), R3: (products, persons), ……
• Terms: (director, directed, movie, …), (location, located, headquarter, …), (inventor, invented, …), ……
• More pairs: (Titanic, Cameron), …
• Phase 1 produces patterns and seed pairs; Phase 2 produces terms and more pairs
Phase 1: from one to several • Step 1: extract patterns for each relation • E.g., relation: (Microsoft, headquarter, Redmond) • Sentence "The headquarter of el is er." gives the pattern (The headquarter of, is, .) • General pattern form: (pl, pm, pr) • Step 2: extract seed pairs based on patterns
Step 1: extract patterns • Input: • One tuple (el, r, er) for each relation • Web corpus • Output: • One pattern for each relation • Approach: • MAP: • Scan web texts and find sentences that contain all three elements of a relation • For each selected sentence, output a candidate pattern (pl, pm, pr) • REDUCE: • Rank the patterns of each relation by tf-idf value and select the top-ranked one
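As a minimal sketch of these MAP/REDUCE steps in Python (function names, the substring-matching heuristic, and the exact tf-idf variant are illustrative assumptions, not the slides' actual implementation):

```python
import math
from collections import Counter, defaultdict

def map_patterns(sentence, relations):
    """MAP: for each relation (el, r, er) whose three elements all occur in
    the sentence, emit the candidate pattern (pl, pm, pr): the text before
    el, between el and er, and after er."""
    out = []
    for el, r, er in relations:
        i, j = sentence.find(el), sentence.find(er)
        if i == -1 or j == -1 or r not in sentence or i >= j:
            continue
        pl = sentence[:i].strip()
        pm = sentence[i + len(el):j].strip()
        pr = sentence[j + len(er):].strip()
        out.append(((el, r, er), (pl, pm, pr)))
    return out

def reduce_patterns(emitted):
    """REDUCE: rank each relation's candidate patterns by a tf-idf-style
    score (frequency within the relation, discounted for patterns shared
    across many relations) and keep the top one."""
    tf = defaultdict(Counter)   # relation -> pattern -> count
    df = Counter()              # pattern -> number of relations it occurs in
    for rel, pat in emitted:
        tf[rel][pat] += 1
    for rel, pats in tf.items():
        for pat in pats:
            df[pat] += 1
    n = len(tf)
    return {rel: max(pats, key=lambda p: pats[p] * math.log((1 + n) / df[p]))
            for rel, pats in tf.items()}
```

On the slide's example sentence, the map step recovers the pattern (The headquarter of, is, .) for the (Microsoft, headquarter, Redmond) tuple.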
Step 2: extract seed pairs • Input: • One pattern for each relation • Suchas dataset • Web corpus • Output: • k seed pairs for each relation • Approach: • MAP: • Scan web texts and find sentences in which any pattern occurs • Output candidate seed pairs • REDUCE: • Rank the candidate pairs of each relation by frequency and keep the top k
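One way to sketch this step: turn each relation's pattern (pl, pm, pr) into a regular expression whose two capture groups are the candidate pair. The regex construction is an assumption; the slides do not specify the matching mechanics.

```python
import re
from collections import Counter, defaultdict

def map_seed_pairs(sentence, patterns):
    """MAP: patterns maps a relation name to its (pl, pm, pr) pattern.
    If the pattern matches the sentence, emit the captured (el, er) pair."""
    out = []
    for rel, (pl, pm, pr) in patterns.items():
        regex = (re.escape(pl) + r"\s+(.+?)\s+" +
                 re.escape(pm) + r"\s+(.+?)\s*" + re.escape(pr))
        m = re.search(regex, sentence)
        if m:
            out.append((rel, (m.group(1), m.group(2))))
    return out

def reduce_seed_pairs(emitted, k=3):
    """REDUCE: rank candidate pairs within each relation by frequency
    and keep the top k as seed pairs."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: [p for p, _ in c.most_common(k)] for rel, c in counts.items()}
```

For example, the HeadquarterOf pattern applied to "The headquarter of IBM is Armonk." yields the candidate pair (IBM, Armonk).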
Phase 2: from several to more • Approach • Step 1: match classes in seed pairs and taxonomy • Step 2: extract terms for each relation • Step 3: extract pairs based on terms • Advantage of term-based approach • More efficient • More general
“Suchas” dataset • Extract information from the web corpus matching: “NNS including/such as I1, I2, …, and In” • E.g., “companies including MS, IBM and Google” • Class label: companies • Instances: MS, IBM, Google • Combine all instances of a class label and rank them by tf-idf • Instance as term, class label as document • companies: IBM 0.1, MS 0.09, Google 0.08, Intel 0.078, … • countries: China 0.2, USA 0.18, France 0.15, … • Millions of instances and classes
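A minimal sketch of both halves of this slide (a toy "such as/including" parser plus the instance-as-term, class-as-document tf-idf ranking; the regex and scoring formula are assumptions for illustration):

```python
import math
import re
from collections import Counter, defaultdict

SUCHAS = re.compile(r"(\w+) (?:such as|including) (.+)")

def parse_suchas(sentence):
    """Pull (class_label, [instances]) out of a 'NNS such as/including
    I1, I2 and In' sentence; returns None if the pattern is absent."""
    m = SUCHAS.search(sentence)
    if not m:
        return None
    insts = re.split(r",\s*|\s+and\s+", m.group(2).rstrip("."))
    return m.group(1), [i.strip() for i in insts if i.strip()]

def rank_instances(observations):
    """observations: (class_label, instance) pairs. Treat each instance as
    a term and each class label as a document, then rank each class's
    instances by a tf-idf score."""
    tf = defaultdict(Counter)   # class -> instance -> count
    df = Counter()              # instance -> number of classes containing it
    for cls, inst in observations:
        tf[cls][inst] += 1
    for cls, insts in tf.items():
        for inst in insts:
            df[inst] += 1
    n = len(tf)
    return {cls: sorted(insts, key=lambda i: insts[i] * math.log((1 + n) / df[i]),
                        reverse=True)
            for cls, insts in tf.items()}
```

On the slide's example, `parse_suchas("companies including MS, IBM and Google")` yields the class label "companies" with instances MS, IBM, and Google.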
Step 1: map classes • For each relation, obtain its left class and right class • Find the most relevant class in the taxonomy for the left/right class • Measurement of “relevance” • For a left/right class c and a class c’ in the Suchas dataset
Step 2: extract terms • Input: • k seed pairs for each relation • Web corpus • Output: • Terms for each relation • A term can appear in multiple relations • Approach: • MAP: • Scan web texts and find sentences in which a seed pair appears • Take each word (except the words of the instance pair) as a candidate term • REDUCE: • Rank the terms of each relation by tf-idf value
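These MAP/REDUCE steps can be sketched as follows (assuming whitespace tokenization and single-token instances for simplicity; the slides do not fix these details):

```python
import math
from collections import Counter, defaultdict

def map_terms(sentence, seed_pairs):
    """MAP: seed_pairs maps a relation to its list of (el, er) seed pairs.
    If a sentence contains both elements of a seed pair, every other word
    in the sentence is a candidate term for that relation."""
    words = sentence.lower().rstrip(".").split()
    out = []
    for rel, pairs in seed_pairs.items():
        for el, er in pairs:
            if el.lower() in words and er.lower() in words:
                skip = {el.lower(), er.lower()}
                out += [(rel, w) for w in words if w not in skip]
    return out

def reduce_terms(emitted, top=5):
    """REDUCE: rank each relation's candidate terms by a tf-idf score
    (frequency in the relation, discounted for terms shared by many
    relations) and keep the top ones."""
    tf = defaultdict(Counter)
    df = Counter()
    for rel, w in emitted:
        tf[rel][w] += 1
    for rel, ws in tf.items():
        for w in ws:
            df[w] += 1
    n = len(tf)
    return {rel: sorted(ws, key=lambda w: ws[w] * math.log((1 + n) / df[w]),
                        reverse=True)[:top]
            for rel, ws in tf.items()}
```

A sentence such as "The headquarter of Microsoft is located in Redmond." then contributes terms like "headquarter" and "located" to the HeadquarterOf relation.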
Step 3: extract more pairs • Input: • Terms for each relation • Suchas dataset • Web corpus • Output: • More instance pairs for each relation • Approach: • MAP: • Scan web texts and find sentences in which any term occurs • Generate candidate pairs from taxonomy instances • REDUCE: • Rank the candidate pairs of each relation
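The key idea from the earlier slide — restricting candidates to taxonomy instances of the relation's mapped classes — can be sketched like this (whitespace tokenization and single-token instances are simplifying assumptions):

```python
from collections import Counter, defaultdict

def map_more_pairs(sentence, rel_terms, taxonomy, rel_classes):
    """MAP: rel_terms maps a relation to its term set, rel_classes maps it
    to its (left_class, right_class), and taxonomy maps a class to its
    instance set. A sentence is considered for a relation only if one of
    its terms occurs, and only taxonomy instances of the mapped classes
    become candidate pair elements, keeping the candidate set small."""
    words = set(sentence.lower().rstrip(".").split())
    out = []
    for rel, terms in rel_terms.items():
        if not (terms & words):
            continue
        lc, rc = rel_classes[rel]
        lefts = [i for i in taxonomy[lc] if i.lower() in words]
        rights = [i for i in taxonomy[rc] if i.lower() in words]
        out += [(rel, (l, r)) for l in lefts for r in rights]
    return out

def reduce_more_pairs(emitted):
    """REDUCE: rank each relation's candidate pairs by frequency."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: [p for p, _ in c.most_common()] for rel, c in counts.items()}
```

With a small taxonomy, "The headquarter of IBM is in Armonk." produces the candidate (IBM, Armonk) for HeadquarterOf, while non-instance words are never paired.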
Implementation • MAP operation • Build a hash table for the input pair or seed pairs • Build a hash table for patterns or terms • Retrieving a value from a hash table costs nearly O(1)
Implementation • Challenge: when the taxonomy is too big • Partition terms and classes into groups • Route sentences to reducers by the group ID of the matched term • For each group, use only the instances of the corresponding classes • Example grouping: group 1: terms {director, directed, actor, role}, classes {movie}; group 2: terms {company, CEO}, classes {CEO}
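A toy sketch of the partitioning idea (the slides do not say how relations are assigned to groups; round-robin assignment here is purely illustrative, and a real system would balance by dictionary size):

```python
def partition(terms_by_rel, classes_by_rel, num_groups):
    """Assign each relation, together with its terms and classes, to one
    of num_groups groups so that each group's term/class dictionaries fit
    in memory. Sentences can then be routed to reducers by the group ID
    of the term they matched."""
    groups = [{"terms": set(), "classes": set()} for _ in range(num_groups)]
    for i, rel in enumerate(sorted(terms_by_rel)):
        g = groups[i % num_groups]   # round-robin assignment (assumption)
        g["terms"] |= terms_by_rel[rel]
        g["classes"] |= set(classes_by_rel[rel])
    return groups
```

Each reducer then only needs the hash tables for its own group's terms and the instances of its own group's classes.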
Efficiency • Number of instances • More than 1 million • Runtime • Step 1: less than 30 min • Step 2: about 1 hour • Step 3: about 3 hours • Analysis • Good scalability • Scans the whole web corpus twice • Processes all relations simultaneously • All instances are organized in a dictionary with O(1) access cost