Taxonomy-Aided Massive Relationship Extraction from the Web
The Task • Goal: • Extract instance pairs that satisfy a certain relationship from the Web. • E.g., (Microsoft, Redmond) satisfies the HeadquarterOf relationship • Input: • Web corpus • Probase Taxonomy: classes/instances
Traditional Approaches • Traditional approach • Manually supply a set of seed pairs for a given relationship • Find patterns from the seed pairs • Use the patterns to find more pairs • Problem: Scalability • Cannot manually supply seed pairs for millions of relationships • Relationship extraction is time-consuming even for one relationship • Problem: Quality • No semantic guarantee: nothing ensures that the extracted left entity is a company or that the right entity is a location
Our goal: targeting all the relationships we can think of • How large is the set R of relationships? • Each relationship in R is defined over a pair of taxonomy classes • Example: (companies, locations) for the HeadquarterOf relationship
Challenge: where does R come from? • From Probase? • From Freebase? • From Tables? • Language patterns? • What is the <attr> of <instance>?
Challenge: What are the seed pairs for each relationship? • Not enough seed pairs • Possible to miss some useful patterns • Hard to evaluate quality of patterns • Two-phase approach • From one to several pairs • Extract high-quality patterns, then extract seed pairs • From several to more pairs • Extract discriminative terms, then extract more pairs • Terms are more general and have more semantics
Challenge: What are the candidate entities? • Cannot afford to scan the web corpus for each relationship • Scan web texts to find candidates for millions of relationships simultaneously • Need to avoid generating massive candidate pairs • Massive relations * massive candidate pairs = infeasible • Solution: • use instances in a taxonomy
Approach overview • Phase 1: from one pair to several seed pairs • Extract high-quality patterns • Extract seed pairs • Phase 2: from seed pairs to more pairs • Map classes in seed pairs and taxonomy • Extract discriminative terms • Extract more pairs
Overview
• Web pages: Page 1: "You can view the upcoming Toy Story 3 trailer here!" Page 2: "View company contact information for Pretty Woman on IMDbPro" ……
• Input relations: (Spider-Man, director, Sam Raimi), (Microsoft, headquarter, Redmond), (telephone, invented, Bell), ……
• Patterns: (The director of, is, .), (The headquarter of, is, .), (, is invented by, .), ……
• Seed pairs: (Natural Born Killers, Oliver Stone), (Army of Darkness, Sam Raimi), (Toy Story, John Lasseter), (Microsoft, Redmond), (IBM, Armonk), (Google, Mountain View), (telephone, Bell), (polygraph machine, Mackenzie), (dry cleaning, Baptiste), ……
• Taxonomy: Movies: Toy Story, Matrix, …; Companies: Microsoft, IBM, …; Directors: Stone, Lasseter, ……
• Classes: R1: (movies, directors), R2: (companies, locations), R3: (products, persons), ……
• Terms: (director, directed, movie, …), (location, located, headquarter, …), (inventor, invented, …), ……
• More pairs: (Titanic, Cameron), …
• Phase 1 produces patterns and seed pairs; Phase 2 produces terms and more pairs
Phase 1: from one to several • Step 1: extract patterns for each relation • E.g., relation: (Microsoft, headquarter, Redmond) • Sentence "The headquarter of el is er." gives the pattern (The headquarter of, is, .) • General pattern form: (pl, pm, pr) • Step 2: extract seed pairs based on patterns
Step 1: extract patterns • Input: • One tuple (el, r, er) for each relation • Web corpus • Output: • One pattern for each relation • Approach: • MAP: • Scan web texts and find sentences that contain all three elements of a relation • For each selected sentence, output a candidate pattern (pl, pm, pr) • REDUCE: • Rank the patterns of each relation by tf-idf value and select the top-ranked one
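As a minimal sketch of these MAP/REDUCE steps in Python (function names, the substring-matching heuristic, and the exact tf-idf variant are illustrative assumptions, not the slides' actual implementation):

```python
import math
from collections import Counter, defaultdict

def map_patterns(sentence, relations):
    """MAP: for each relation (el, r, er) whose three elements all occur in
    the sentence, emit the candidate pattern (pl, pm, pr): the text before
    el, between el and er, and after er."""
    out = []
    for el, r, er in relations:
        i, j = sentence.find(el), sentence.find(er)
        if i == -1 or j == -1 or r not in sentence or i >= j:
            continue
        pl = sentence[:i].strip()
        pm = sentence[i + len(el):j].strip()
        pr = sentence[j + len(er):].strip()
        out.append(((el, r, er), (pl, pm, pr)))
    return out

def reduce_patterns(emitted):
    """REDUCE: rank each relation's candidate patterns by a tf-idf-style
    score (frequency within the relation, discounted for patterns shared
    across many relations) and keep the top one."""
    tf = defaultdict(Counter)   # relation -> pattern -> count
    df = Counter()              # pattern -> number of relations it occurs in
    for rel, pat in emitted:
        tf[rel][pat] += 1
    for rel, pats in tf.items():
        for pat in pats:
            df[pat] += 1
    n = len(tf)
    return {rel: max(pats, key=lambda p: pats[p] * math.log((1 + n) / df[p]))
            for rel, pats in tf.items()}
```

On the slide's example sentence, the map step recovers the pattern (The headquarter of, is, .) for the (Microsoft, headquarter, Redmond) tuple.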
Step 2: extract seed pairs • Input: • One pattern for each relation • Suchas dataset • Web corpus • Output: • k seed pairs for each relation • Approach: • MAP: • Scan web texts and find sentences in which any pattern occurs • Output candidate seed pairs • REDUCE: • Rank the candidate pairs of each relation by frequency and keep the top k
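One way to sketch this step: turn each relation's pattern (pl, pm, pr) into a regular expression whose two capture groups are the candidate pair. The regex construction is an assumption; the slides do not specify the matching mechanics.

```python
import re
from collections import Counter, defaultdict

def map_seed_pairs(sentence, patterns):
    """MAP: patterns maps a relation name to its (pl, pm, pr) pattern.
    If the pattern matches the sentence, emit the captured (el, er) pair."""
    out = []
    for rel, (pl, pm, pr) in patterns.items():
        regex = (re.escape(pl) + r"\s+(.+?)\s+" +
                 re.escape(pm) + r"\s+(.+?)\s*" + re.escape(pr))
        m = re.search(regex, sentence)
        if m:
            out.append((rel, (m.group(1), m.group(2))))
    return out

def reduce_seed_pairs(emitted, k=3):
    """REDUCE: rank candidate pairs within each relation by frequency
    and keep the top k as seed pairs."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: [p for p, _ in c.most_common(k)] for rel, c in counts.items()}
```

For example, the HeadquarterOf pattern applied to "The headquarter of IBM is Armonk." yields the candidate pair (IBM, Armonk).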
Phase 2: from several to more • Approach • Step 1: match classes in seed pairs and taxonomy • Step 2: extract terms for each relation • Step 3: extract pairs based on terms • Advantage of term-based approach • More efficient • More general
“Suchas” dataset • Extract information from the web corpus matching: “NNS including/such as I1, I2, …, and In” • E.g., “companies including MS, IBM and Google” • Class label: companies • Instances: MS, IBM, Google • Combine all instances of a class label and rank them by tf-idf • Instance as term, class label as document • companies: IBM 0.1, MS 0.09, Google 0.08, Intel 0.078, … • countries: China 0.2, USA 0.18, France 0.15, … • Millions of instances and classes
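A minimal sketch of both halves of this slide (a toy "such as/including" parser plus the instance-as-term, class-as-document tf-idf ranking; the regex and scoring formula are assumptions for illustration):

```python
import math
import re
from collections import Counter, defaultdict

SUCHAS = re.compile(r"(\w+) (?:such as|including) (.+)")

def parse_suchas(sentence):
    """Pull (class_label, [instances]) out of a 'NNS such as/including
    I1, I2 and In' sentence; returns None if the pattern is absent."""
    m = SUCHAS.search(sentence)
    if not m:
        return None
    insts = re.split(r",\s*|\s+and\s+", m.group(2).rstrip("."))
    return m.group(1), [i.strip() for i in insts if i.strip()]

def rank_instances(observations):
    """observations: (class_label, instance) pairs. Treat each instance as
    a term and each class label as a document, then rank each class's
    instances by a tf-idf score."""
    tf = defaultdict(Counter)   # class -> instance -> count
    df = Counter()              # instance -> number of classes containing it
    for cls, inst in observations:
        tf[cls][inst] += 1
    for cls, insts in tf.items():
        for inst in insts:
            df[inst] += 1
    n = len(tf)
    return {cls: sorted(insts, key=lambda i: insts[i] * math.log((1 + n) / df[i]),
                        reverse=True)
            for cls, insts in tf.items()}
```

On the slide's example, `parse_suchas("companies including MS, IBM and Google")` yields the class label "companies" with instances MS, IBM, and Google.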
Step 1: map classes • For each relation, obtain its left class and right class • Find the most relevant class in the taxonomy for the left/right class • Measurement of “relevance” • For a left/right class c and a class c’ in the Suchas dataset
Step 2: extract terms • Input: • k seed pairs for each relation • Web corpus • Output: • Terms for each relation • A term can appear in multiple relations • Approach: • MAP: • Scan web texts and find sentences in which a seed pair appears • Take each word (except the words of the instance pair) as a candidate term • REDUCE: • Rank the terms of each relation by tf-idf value
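These MAP/REDUCE steps can be sketched as follows (assuming whitespace tokenization and single-token instances for simplicity; the slides do not fix these details):

```python
import math
from collections import Counter, defaultdict

def map_terms(sentence, seed_pairs):
    """MAP: seed_pairs maps a relation to its list of (el, er) seed pairs.
    If a sentence contains both elements of a seed pair, every other word
    in the sentence is a candidate term for that relation."""
    words = sentence.lower().rstrip(".").split()
    out = []
    for rel, pairs in seed_pairs.items():
        for el, er in pairs:
            if el.lower() in words and er.lower() in words:
                skip = {el.lower(), er.lower()}
                out += [(rel, w) for w in words if w not in skip]
    return out

def reduce_terms(emitted, top=5):
    """REDUCE: rank each relation's candidate terms by a tf-idf score
    (frequency in the relation, discounted for terms shared by many
    relations) and keep the top ones."""
    tf = defaultdict(Counter)
    df = Counter()
    for rel, w in emitted:
        tf[rel][w] += 1
    for rel, ws in tf.items():
        for w in ws:
            df[w] += 1
    n = len(tf)
    return {rel: sorted(ws, key=lambda w: ws[w] * math.log((1 + n) / df[w]),
                        reverse=True)[:top]
            for rel, ws in tf.items()}
```

A sentence such as "The headquarter of Microsoft is located in Redmond." then contributes terms like "headquarter" and "located" to the HeadquarterOf relation.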
Step 3: extract more pairs • Input: • Terms for each relation • Suchas dataset • Web corpus • Output: • More instance pairs for each relation • Approach: • MAP: • Scan web texts and find sentences in which any term occurs • Generate candidate pairs from taxonomy instances • REDUCE: • Rank the candidate pairs of each relation
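The key idea from the earlier slide — restricting candidates to taxonomy instances of the relation's mapped classes — can be sketched like this (whitespace tokenization and single-token instances are simplifying assumptions):

```python
from collections import Counter, defaultdict

def map_more_pairs(sentence, rel_terms, taxonomy, rel_classes):
    """MAP: rel_terms maps a relation to its term set, rel_classes maps it
    to its (left_class, right_class), and taxonomy maps a class to its
    instance set. A sentence is considered for a relation only if one of
    its terms occurs, and only taxonomy instances of the mapped classes
    become candidate pair elements, keeping the candidate set small."""
    words = set(sentence.lower().rstrip(".").split())
    out = []
    for rel, terms in rel_terms.items():
        if not (terms & words):
            continue
        lc, rc = rel_classes[rel]
        lefts = [i for i in taxonomy[lc] if i.lower() in words]
        rights = [i for i in taxonomy[rc] if i.lower() in words]
        out += [(rel, (l, r)) for l in lefts for r in rights]
    return out

def reduce_more_pairs(emitted):
    """REDUCE: rank each relation's candidate pairs by frequency."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: [p for p, _ in c.most_common()] for rel, c in counts.items()}
```

With a small taxonomy, "The headquarter of IBM is in Armonk." produces the candidate (IBM, Armonk) for HeadquarterOf, while non-instance words are never paired.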
Implementation • MAP operation • Build a hash table for the input pair or seed pairs • Build a hash table for patterns or terms • Retrieving a value from a hash table costs nearly O(1)
Implementation • Challenge: when the taxonomy is too big • Partition terms and classes into groups • Route sentences to reducers by the group ID of the matched term • For each group, use only the instances of the corresponding classes • Example grouping: group 1: terms {director, directed, actor, role}, classes {movie}; group 2: terms {company, CEO}, classes {CEO}
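A toy sketch of the partitioning idea (the slides do not say how relations are assigned to groups; round-robin assignment here is purely illustrative, and a real system would balance by dictionary size):

```python
def partition(terms_by_rel, classes_by_rel, num_groups):
    """Assign each relation, together with its terms and classes, to one
    of num_groups groups so that each group's term/class dictionaries fit
    in memory. Sentences can then be routed to reducers by the group ID
    of the term they matched."""
    groups = [{"terms": set(), "classes": set()} for _ in range(num_groups)]
    for i, rel in enumerate(sorted(terms_by_rel)):
        g = groups[i % num_groups]   # round-robin assignment (assumption)
        g["terms"] |= terms_by_rel[rel]
        g["classes"] |= set(classes_by_rel[rel])
    return groups
```

Each reducer then only needs the hash tables for its own group's terms and the instances of its own group's classes.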
Efficiency • Number of instances • More than 1 million • Runtime • Step 1: less than 30 min • Step 2: about 1 hour • Step 3: about 3 hours • Analysis • Good scalability • Scans the whole web corpus twice • Processes all relations simultaneously • All instances are organized in a dictionary with O(1) access cost