1 / 39

The Intelligence in Wikipedia Project

The Intelligence in Wikipedia Project Daniel S. We ld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Joint Work with Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel,

emily
Download Presentation

The Intelligence in Wikipedia Project

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Intelligence in Wikipedia Project Daniel S. Weld Department of Computer Science & Engineering University of Washington Seattle, WA, USA Joint Work with Fei Wu, Eytan Adar, Saleema Amershi, Oren Etzioni, James Fogarty, Raphael Hoffmann, Kayur Patel, Stef Schoenmackers & Michael Skinner

  2. Wikipedia for AI Benefit to AI • Fantastic Corpus • Dynamic Environment • E.g., Bot Framework • Powers Reasoning • Semantic Distance Measure [Ponzetto&Strube07] • Word-Sense Disambiguation [Bunescu&Pasca06, Mihalcea07] • Coreference Resolution [Ponzetto&Strube06, Yang&Su07] • Ontology / Taxonomy [Suchanek07, Muchnik07] • Multi-Lingual Alignment [Adafre&Rijke06] • Question Answering [Ahn et al.05, Kaisser08] • Basis of Huge KB[Auer et al.07]

  3. AI for Wikipedia Benefit to Wikipedia: Tools Internal link maintenance Infobox Creation Schema Management Reference suggestion & fact checking Disambiguation page maintenance Translation across languages Vandalism Alerts ComscoreMediaMetrix – August 2007

  4. Albert Einstein was a German-born theoretical physicist … … Einstein was a guest lecturer at the Institute for Advanced Study in New Jersey … … New Jersey is a state in the Northeastern region of the United States … Motivating Vision Next-Generation Search = Information Extraction + Ontology + Inference Which German Scientists Taught at US Universities?

  5. Next-Generation Search Information Extraction <Einstein, Born-In, Germany> <Einstein, ISA, Physicist> <Einstein, Lectured-At, IAS> <IAS, In, New-Jersey> <New-Jersey, In, United-States> … Ontology Physicist (x)  Scientist(x) … Inference Einstein = Einstein … … Albert Einstein was a German-born theoretical physicist … … New Jersey is a state in the Northeastern region of the United States … Scalable Means Self-Supervised

  6. Why Mine Wikipedia? Pros Cons Natural-language Missing data Inconsistent Low redundancy • High-quality, comprehensive • UID for key concepts • First sentence as definition • Infoboxes • Categories & lists • Redirection pages • Disambiguation pages • Revision history • Multilingual corpus

  7. The Intelligence in Wikipedia Project Outline 1. Self-supervised extraction from Wikipedia text (and the greater Web) 2. Automatic ontology generation 3. Scalable probabilistic inference for Q/A

  8. Outline 1. Self-supervised extraction from Wikipedia text 2. Automatic ontology generation 3. Scalable probabilistic inference for Q/A

  9. Building on • SNOWBALL [Agichtein&Gravano 00] • MULDER [Kwok et al. TOIS01] • AskMSR [Brill et al. EMNLP02] • KnowItAll [Etzioni et al. AAAI04] Outline 1. Self-supervised extraction from Wikipedia text 2. Automatic ontology generation 3. Scalable probabilistic inference for Q/A

  10. Kylin: Self-Supervised Information Extraction from Wikipedia [Wu & Weld CIKM 2007] From infoboxes to a training set Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. Its county seat is Clearfield. 2,972 km² (1,147 mi²) of it is land and 17 km²  (7 mi²) of it (0.56%) is water. As of 2005, the population density was 28.2/km².

  11. Kylin Architecture

  12. Preliminary Evaluation • Kylin Performed Well on Popular Classes: • Precision: mid 70% ~ high 90% • Recall: low 50% ~ mid 90% • ... Floundered on Sparse Classes – Little Training Data 82% < 100 instances; 40% <10 instances

  13. Shrinkage? person (1201) .birth_place performer (44) .location .birthplace .birth_place .cityofbirth .origin actor (8738) comedian (106)

  14. Outline 1. Self-Supervised Extraction from Wikipedia Text • Training on Infoboxes • Improving Recall – Shrinkage, Retraining, Web Extraction • Community Content Creation 2. Automatic Ontology Generation 3. Scalable Probabilistic Inference for Q/A

  15. KOG: Kylin Ontology Generator[Wu & Weld, WWW08]

  16. Subsumption Detection • Binary Classification Problem • Nine Complex Features E.g., String Features … IR Measures … Mapping to Wordnet … Hearst Pattern Matches … Class Transitions in Revision History • Learning Algorithm SVM & MLN Joint Inference Person 6/07: Einstein Scientist Physicist

  17. KOG Architecture

  18. Schema Mapping Performer Person birth_date birth_place name other_names … birthdate location name othername … • Heuristics • Edit History • String Similarity • Experiments • Precision: 94% Recall: 87% • Future • Integrated Joint Inference

  19. Outline 1. Self-Supervised Extraction from Wikipedia Text • Training on Infoboxes • Improving Recall – Shrinkage, Retraining, Web Extraction • Community Content Creation 2. Automatic Ontology Generation 3. Scalable Probabilistic Inference for Q/A

  20. Improving Recall on Sparse Classes [Wu et al. KDD-08] Shrinkage Extra Training Examples from Related Classes How Weight New Examples? Retraining Compare Kylin Extractions with Ones from Textrunner [Banko et al. IJCAI-07] Additional Positive Examples Eliminate False Negatives Extraction from Broader Web person (1201) performer (44) actor (8738) comedian (106)

  21. Effect of Shrinkage & Retraining

  22. Effect of Shrinkage & Retraining 1755% improvement for a sparse class 13.7% improvement for a popular class

  23. Related Work on Ontology Driven Information Extraction • SemTag and Seeker [Dill WWW03] • PANKOW [Cimiano WWW05] • OntoSyphon [McDowell & Cafarella ISWC06]

  24. Improving Recall on Sparse Classes [Wu et al. KDD-08] Shrinkage Retraining Extract from Broader Web 44% of Wikipedia Pages = “stub” Extractor quality irrelevant Query Google & Extract How maintain high precision? Many Web pages noisy, describe multiple objects How integrate with Wikipedia extractions?

  25. Combining Wikipedia & Web

  26. Outline 1. Self-Supervised Extraction from Wikipedia Text • Training on Infoboxes • Improving Recall – Shrinkage, Retraining, Web Extraction • Community Content Creation 2. Automatic Ontology Generation 3. Scalable Probabilistic Inference for Q/A

  27. Problem • Information Extraction is Imprecise • Wikipedians Don’t Want 90% Precision • How Improve Precision? • People!

  28. Accelerate

  29. Contributing as a Non-Primary Task • Encourage contributions • Without annoying or abusing readers • Compared 5 different interfaces

  30. Adwords Deployment Study [Hoffman et al. 2008] • 2000 articles containing writer infobox • Query for “ray bradbury” would show • Redirect to mirror with injected JavaScript • Round-robin interface selection: • baseline, popup, highlight, icon • Track clicks, load, unload, and show survey

  31. Results • Contribution Rate • 1.6%  13% • 90% of positive labels were correct

  32. Outline 1. Self-Supervised Extraction from Wikipedia Text 2. Automatic Ontology Generation 3. Scalable Probabilistic Inference for Q/A

  33. Scalable Probabilistic Inference [Schoenmackeret al. 2008] • Eight MLN Inference Rules • Transitivity of predicates, etc. • Knowledge-Based Model Construction • Tested on 100 Million Tulples • Extracted by Textrunner from Web

  34. Effect of Limited Inference

  35. Cost of Inference Approximately Pseudo-Functional Relations

  36. Conclusion • Wikipedia is a Fantastic Platform & Corpus • Self-Supervised Extraction from Wikipedia Training on Infoboxes Works well on popular classes Improving Recall – Shrinkage, Retraining, Web Extraction High precision & recall - even on sparse classes, stub articles Community Content Creation • Automatic Ontology Generation Probabilistic Joint Inference • Scalable Probabilistic Inference for Q/A Simple Inference - Scales to Large Corpora Tested on 100 M Tuples

  37. Future Work • Improved Ontology Generation • Joint Schema Mapping • Incorporate Freebase, etc. • Multi-Lingual Extraction • Automatically Learn Inference Rules • Make Available as Web Service • Integrate Back Into Wikipedia

  38. The End AI

More Related