
Taxonomy-Aided Massive Relationship Extraction from the Web


halona

Presentation Transcript


  1. Taxonomy-Aided Massive Relationship Extraction from the Web

  2. The Task • Goal: • Extract instance pairs that satisfy a certain relationship from the Web. • E.g., (Microsoft, Redmond) satisfies the HeadquarterOf relationship • Input: • Web corpus • Probase Taxonomy: classes/instances

  3. Traditional Approaches • Traditional approach • Manually supply a set of seed pairs for a given relationship • Find patterns from the seed pairs • Use the patterns to find more pairs • Problem: Scalability • Cannot manually supply seed pairs for millions of relationships • Relationship extraction is time-consuming even for one relationship • Problem: Quality • Semantics are not guaranteed: for an extracted HeadquarterOf pair, there is no guarantee that the first element is a company entity and the second is a location entity.

  4. Our goal: targeting all the relationships we can think of … • How large is ? • Each is in the form of • Example:

  5. Challenge: where does R come from? • From Probase? • From Freebase? • From Tables? • Language patterns? • What is the <attr> of <instance>?

  6. Challenge: What are the seed pairs for each relationship? • Not enough seed pairs • Possible to miss some useful patterns • Hard to evaluate quality of patterns • Two-phase approach • From one to several pairs • Extract high-quality patterns, then extract seed pairs • From several to more pairs • Extract discriminative terms, then extract more pairs • Terms are more general and have more semantics

  7. Challenge: What are the candidate entities? • Cannot afford to scan the web corpus for each relationship • Scan web texts to find candidates for millions of relationships simultaneously • Need to avoid generating massive candidate pairs • Massive relations * massive candidate pairs = infeasible • Solution: • use instances in a taxonomy

  8. Approach overview • Phase 1: from one pair to several seed pairs • Extract high-quality patterns • Extract seed pairs • Phase 2: from seed pairs to more pairs • Map classes in seed pairs and taxonomy • Extract discriminative terms • Extract more pairs

  9. Overview • Web pages: • Page 1: You can view the upcoming Toy Story 3 trailer here! • Page 2: View company contact information for Pretty Woman on IMDbPro … • Relations: (spider-man, director, Sam Raimi); (Microsoft, headquarter, Redmond); (telephone, invented, Bell); … • Patterns: (The director of, is, .); (The headquarter of, is, .); (, is invented by, .); … • Seed pairs: (Natural Born Killers, Oliver Stone); (Army of Darkness, Sam Raimi); (Toy Story, John Lasseter); (Microsoft, Redmond); (IBM, Armonk); (Google, Mountain View); (telephone, Bell); (polygraph machine, Mackenzie); (dry cleaning, Baptiste); … • Taxonomy: • Movies: Toy Story, Matrix, … • Companies: Microsoft, IBM, … • Directors: Stone, Lasseter, … • Classes: • R1: (movies, directors) • R2: (companies, locations) • R3: (products, persons) • … • Terms: (director, directed, movie, …); (location, located, headquarter, …); (inventor, invented, …); … • More pairs: (Titanic, Cameron) … • Diagram labels: Phase 1, Phase 2

  10. Phase 1: from one to several • Step 1: extract patterns for each relation • E.g. relation: (Microsoft, headquarter, Redmond) • Pattern: (The headquarter of e_l is e_r.) • General pattern: (p_l, p_m, p_r) • Step 2: extract seed pairs based on patterns

  11. Step 1: extract patterns • Input: • One tuple (e_l, r, e_r) for each relation • Web corpus • Output: • One pattern for each relation • Approach: • MAP: • Scan web texts and find sentences that contain all three elements of a relation tuple • For each selected sentence, output a candidate pattern (p_l, p_m, p_r) • REDUCE: • Rank the patterns of each relation by TF-IDF value and select the top one
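The MAP and REDUCE steps on this slide could be sketched roughly as follows. This is a minimal single-machine illustration, not the presenters' implementation: the function names, the plain substring matching, the requirement that the left entity precede the right one, and the smoothed IDF formula are all assumptions made for the example. Each relation is treated as a "document" and each pattern as a "term".

```python
import math
from collections import Counter, defaultdict

def map_patterns(sentence, relations):
    """MAP: emit (relation, (p_l, p_m, p_r)) for every tuple fully contained in the sentence."""
    emitted = []
    for e_l, r, e_r in relations:
        i, j = sentence.find(e_l), sentence.find(e_r)
        # All three elements must occur; this sketch also requires e_l before e_r.
        if i < 0 or j < 0 or r not in sentence or i + len(e_l) > j:
            continue
        p_l = sentence[:i].strip()                # text left of the first entity
        p_m = sentence[i + len(e_l):j].strip()    # text between the two entities
        p_r = sentence[j + len(e_r):].strip()     # text right of the second entity
        emitted.append((r, (p_l, p_m, p_r)))
    return emitted

def reduce_patterns(emitted):
    """REDUCE: keep the top TF-IDF pattern per relation (relation = document, pattern = term)."""
    counts = defaultdict(Counter)
    for rel, pattern in emitted:
        counts[rel][pattern] += 1
    n_rel = len(counts)
    df = Counter(p for c in counts.values() for p in c)
    best = {}
    for rel, c in counts.items():
        total = sum(c.values())
        # Smoothed IDF (an assumption) so that a lone relation does not score zero.
        best[rel] = max(
            c, key=lambda p: c[p] / total * (math.log((1 + n_rel) / (1 + df[p])) + 1)
        )
    return best

# Hypothetical toy corpus for illustration only.
relations = [("Microsoft", "headquarter", "Redmond"),
             ("Spider-Man", "director", "Sam Raimi")]
corpus = ["The headquarter of Microsoft is Redmond.",
          "The director of Spider-Man is Sam Raimi."]
emitted = [e for s in corpus for e in map_patterns(s, relations)]
print(reduce_patterns(emitted))
```

In a real MapReduce job the relation name would serve as the shuffle key, so that all candidate patterns for one relation reach the same reducer.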

  12. Step 2: extract seed pairs • Input: • One pattern for each relation • Suchas dataset • Web corpus • Output: • k seed pairs for each relation • Approach: • MAP: • Scan web texts and find sentences in which any pattern occurs • Output candidate seed pairs • REDUCE: • Rank the candidate pairs of each relation by frequency
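One way to picture this step is to turn each (p_l, p_m, p_r) pattern into a regular expression with two capture slots, then count the captured pairs. The sketch below is illustrative only. The slide does not say how the Suchas dataset is used here, so filtering candidates against a set of known instances is an assumption, as are all function and parameter names; the middle part p_m is assumed to be non-empty.

```python
import re
from collections import Counter, defaultdict

def compile_pattern(p_l, p_m, p_r):
    """Turn a (p_l, p_m, p_r) pattern into a regex with two capture slots."""
    left = re.escape(p_l) + r"\s+" if p_l else ""
    right = r"\s*" + re.escape(p_r) if p_r else r"$"
    return re.compile(left + r"(.+?)\s+" + re.escape(p_m) + r"\s+(.+?)" + right)

def map_seed_pairs(sentence, patterns, known_instances):
    """MAP: emit (relation, (left, right)) when a pattern matches and both sides are known."""
    emitted = []
    for rel, regex in patterns.items():
        m = regex.search(sentence)
        # Assumption: a Suchas-like instance set is used to reject non-entities.
        if m and m.group(1) in known_instances and m.group(2) in known_instances:
            emitted.append((rel, (m.group(1), m.group(2))))
    return emitted

def reduce_seed_pairs(emitted, k):
    """REDUCE: keep the k most frequent candidate pairs per relation."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: [p for p, _ in c.most_common(k)] for rel, c in counts.items()}

# Hypothetical toy data for illustration only.
patterns = {"headquarter": compile_pattern("The headquarter of", "is", ".")}
known = {"IBM", "Armonk", "Google", "Mountain View"}
corpus = ["The headquarter of IBM is Armonk.",
          "The headquarter of Google is Mountain View."]
emitted = [e for s in corpus for e in map_seed_pairs(s, patterns, known)]
print(reduce_seed_pairs(emitted, k=2))
```

Requiring whitespace around p_m keeps a short middle part such as "is" from matching inside a longer word.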

  13. Phase 2: from several to more • Approach • Step 1: match classes in seed pairs and taxonomy • Step 2: extract terms for each relation • Step 3: extract pairs based on terms • Advantage of term-based approach • More efficient • More general

  14. “Suchas” dataset • Extract information from the web corpus that matches: “NNS including/such as I1, I2, …, and In” • E.g., “companies including MS, IBM and Google” • Class label: companies • Instances: MS, IBM, and Google • Combine all instances of a class label and rank them by TF-IDF • Instance as term, and class label as document • companies: IBM 0.1, MS 0.09, Google 0.08, Intel 0.078, … • countries: China 0.2, USA 0.18, France 0.15, … • Millions of instances and classes
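The extraction and ranking described on this slide might look like the sketch below. It is a simplification under stated assumptions: no part-of-speech tagger is used, so the class label is taken to be the single word before "including" or "such as" rather than a true NNS match, and the smoothed TF-IDF formula and all names are choices made for this example. The scores it produces are not the ones shown on the slide.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: the class label is the one word right before the trigger phrase.
SUCHAS = re.compile(r"(\w+)\s+(?:including|such as)\s+(.+)")
# Split the instance list on commas and on the final "and".
SPLIT = re.compile(r",\s*(?:and\s+)?|\s+and\s+")

def extract_suchas(sentence):
    """Return (class_label, [instances]) or None if the sentence has no such-as pattern."""
    m = SUCHAS.search(sentence)
    if not m:
        return None
    label = m.group(1).lower()
    parts = SPLIT.split(m.group(2).rstrip(" ."))
    return label, [p.strip() for p in parts if p.strip()]

def rank_instances(pairs):
    """Rank instances per class by TF-IDF (instance = term, class label = document)."""
    counts = defaultdict(Counter)
    for label, instances in pairs:
        counts[label].update(instances)
    n = len(counts)
    df = Counter(i for c in counts.values() for i in c)
    ranked = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores = {i: c[i] / total * (math.log((1 + n) / (1 + df[i])) + 1) for i in c}
        ranked[label] = sorted(scores.items(), key=lambda kv: -kv[1])
    return ranked

# Hypothetical toy corpus for illustration only.
corpus = ["companies including MS, IBM and Google",
          "companies such as IBM and Intel",
          "countries such as China, USA and France"]
pairs = [p for p in map(extract_suchas, corpus) if p]
print(rank_instances(pairs))
```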

  15. Step 1: map classes • For each relation, obtain the left class and the right class • Find the most relevant class in the taxonomy for the left/right class • Measurement of “relevance” • For left/right class c and c’ in Suchas

  16. Step 2: extract terms • Input: • k seed pairs for each relation • Web corpus • Output: • Terms for each relation • A term can appear in multiple relations • Approach: • MAP: • Scan web texts and find sentences in which a seed pair appears • Take each word (except the words in the instance pair) as a candidate term • REDUCE: • Rank the terms of each relation by TF-IDF value
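A rough sketch of the term extraction follows, treating each relation as a "document" so that words shared across relations (such as "the") score lower than relation-specific ones. The tokenizer, the smoothed IDF, and every name here are assumptions for illustration rather than details from the slides.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case alphabetic tokens; a deliberately naive tokenizer for this sketch."""
    return re.findall(r"[a-z]+", text.lower())

def map_terms(sentence, seeds):
    """MAP: for each seed pair found in the sentence, emit every other word as a candidate term."""
    emitted = []
    for rel, pairs in seeds.items():
        for e_l, e_r in pairs:
            if e_l in sentence and e_r in sentence:
                skip = set(tokenize(e_l)) | set(tokenize(e_r))
                emitted.extend((rel, t) for t in tokenize(sentence) if t not in skip)
    return emitted

def reduce_terms(emitted):
    """REDUCE: rank terms per relation by TF-IDF (relation = document)."""
    counts = defaultdict(Counter)
    for rel, term in emitted:
        counts[rel][term] += 1
    n = len(counts)
    df = Counter(t for c in counts.values() for t in c)
    ranked = {}
    for rel, c in counts.items():
        total = sum(c.values())
        scores = {t: c[t] / total * (math.log((1 + n) / (1 + df[t])) + 1) for t in c}
        ranked[rel] = sorted(scores.items(), key=lambda kv: -kv[1])
    return ranked

# Hypothetical toy data for illustration only.
seeds = {"director": [("Toy Story", "John Lasseter")],
         "headquarter": [("Microsoft", "Redmond")]}
corpus = ["Toy Story was directed by John Lasseter",
          "John Lasseter directed the movie Toy Story",
          "Microsoft is located in Redmond",
          "the headquarter of Microsoft is in Redmond"]
emitted = [e for s in corpus for e in map_terms(s, seeds)]
print(reduce_terms(emitted)["director"][:3])
```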

  17. Step 3: extract more pairs • Input: • Terms for each relation • Suchas dataset • Web corpus; • Output: • More instance pairs for each relation • Approach • MAP: • Scan web texts and find sentences in which any term(s) occur(s) • Generate candidate pairs • REDUCE: • Rank candidate pairs in each relation
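The pair extraction above could be sketched as below, with the taxonomy held in a dictionary so that each instance lookup is a hash probe. The slide does not state the ranking criterion, so ranking by raw frequency is an assumption, as are the n-gram lookup, the data layout of `relations`, and all names.

```python
import re
from collections import Counter, defaultdict
from itertools import product

def find_instances(tokens, instance_class, max_len=3):
    """Look up every n-gram (n <= max_len) in the instance dictionary."""
    found = set()
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            phrase = " ".join(tokens[i:i + n])
            if phrase in instance_class:      # dictionary lookup, O(1) on average
                found.add(phrase)
    return found

def map_pairs(sentence, relations, instance_class):
    """MAP: if a relation's term occurs, pair up instances of its left and right classes."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    present = find_instances(tokens, instance_class)
    emitted = []
    for rel, spec in relations.items():
        if not spec["terms"] & set(tokens):
            continue
        left_cls, right_cls = spec["classes"]
        lefts = [x for x in present if instance_class[x] == left_cls]
        rights = [x for x in present if instance_class[x] == right_cls]
        emitted.extend((rel, pair) for pair in product(lefts, rights))
    return emitted

def reduce_pairs(emitted):
    """REDUCE: rank candidate pairs per relation by frequency (an assumed criterion)."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: c.most_common() for rel, c in counts.items()}

# Hypothetical toy data for illustration only.
instance_class = {"titanic": "movies", "toy story": "movies", "cameron": "directors"}
relations = {"director": {"classes": ("movies", "directors"),
                          "terms": {"director", "directed"}}}
corpus = ["Titanic was directed by Cameron", "Cameron is the director of Titanic"]
emitted = [e for s in corpus for e in map_pairs(s, relations, instance_class)]
print(reduce_pairs(emitted))
```

Restricting candidates to taxonomy instances of the mapped classes is what keeps the number of generated pairs manageable in this sketch.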

  18. Implementation • MAP operation • Build a hash table for the initial pair or the seed pairs • Build a hash table for the patterns or terms • Retrieving a value from a hash table costs nearly O(1)

  19. Implementation • Challenge: when the taxonomy is too big • Partition terms and classes into groups • Reduce sentences by the group ID of the term • For each group, use instances in the corresponding classes • (Figure: terms and classes partitioned into group1 and group2; labels shown: director, directed, actor, role, movie, company, CEO)
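The grouping idea could be sketched as follows: sentences are routed by the group ID of any term they contain, and each group only needs the instances of its own classes. The group assignments, the class names, and the function names below are hypothetical; the slide's figure does not survive well enough to recover the actual partition.

```python
import re
from collections import defaultdict

def route_sentences(sentences, term_group):
    """Key each sentence by the group ID of every term it contains."""
    buckets = defaultdict(list)
    for s in sentences:
        tokens = set(re.findall(r"[a-z]+", s.lower()))
        for g in sorted({term_group[t] for t in tokens if t in term_group}):
            buckets[g].append(s)
    return buckets

def instances_for_group(group, group_classes, taxonomy):
    """Load only the instances of the classes assigned to this group."""
    return {inst for cls in group_classes[group] for inst in taxonomy.get(cls, ())}

# Hypothetical partition for illustration only.
term_group = {"director": 1, "directed": 1, "role": 1, "ceo": 2}
group_classes = {1: ["movies", "actors"], 2: ["companies"]}
taxonomy = {"movies": {"Toy Story"}, "actors": {"Tom Hanks"}, "companies": {"Microsoft"}}

sentences = ["Toy Story was directed by John Lasseter", "The CEO of Microsoft spoke"]
for group, bucket in route_sentences(sentences, term_group).items():
    print(group, bucket, instances_for_group(group, group_classes, taxonomy))
```

With this layout, a worker handling one group holds only a slice of the taxonomy in memory instead of the whole dictionary.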

  20. Results

  21. Efficiency • Number of instances • More than 1 million • Runtime • Step 1: less than 30 min • Step 2: about 1 hour • Step 3: about 3 hours • Analysis • Good scalability • Scan the whole web corpus twice • Process all relations simultaneously • All instances are organized in a dictionary with O(1) access cost
