Automating Wikipedia Semantics with Kylin Project

Explore autonomously semantifying Wikipedia, addressing the chicken-egg problem in Semantic Web. Join Fei Wu & Dan Weld's research to create Semantic Data Applications effortlessly.

Presentation Transcript


  1. 454 Project Ideas

  2. Administrivia • Office Hours 11-noon, Fridays in 588 • Or by email • Project proposals due today • Not binding (at least not yet) • To be elaborated • In-person project reviews next week. • HW 1 – due next Tues @ noon

  3. Autonomously Semantifying Wikipedia Fei Wu Dept. Computer Science & Eng. University of Washington (Joint work with Dan Weld)

  4. Motivation • Semantic Web [Berners-Lee 01] is great. • Web content machine readable • Software agents find, share and integrate information

  5. Motivation • Semantic Web [Berners-Lee 01] is great. • Web content machine readable • Software agents find, share and integrate information • Chicken-egg problem: Semantic Data ↔ Applications

  6. Motivation • Semantic Web [Berners-Lee 01] • Web content machine readable • Software agents find, share and integrate information • Chicken-egg problem: Semantic Data ↔ Applications • Bootstrapping: Automatically Semantifying Data

  7. Idea: “Semantify” Wikipedia • Wikipedia [http://wikipedia.org] • Comprehensive • (1.7 million English articles) • High-quality • Important • 6th most popular web-site & growing • Benefits: • User-tagged data • (links, infobox, lists, categories, etc.) • Large, but not too large

  8. Wikipedia Challenges • Much natural-language text • Missing data • Inconsistency • Low information redundancy

  9. [Wu & Weld CIKM-07] Kylin: Autonomously Semantifying Wikipedia • Totally autonomous with no additional human effort • Information extraction from both semi-structured and unstructured data • Kylin: "a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage." (Source: Wikipedia)

  10. Outline • Semantics in Wikipedia • Opportunities • Challenges • Kylin System • Infobox Generation • Link Creation • Conclusion

  11. Semantics in Wikipedia {{Infobox U.S. County| county = Clearfield County| state = Pennsylvania | seal = | map = Map of Pennsylvania highlighting Clearfield County.svg | map size = 225| founded = [[March 26]], [[1804]]| seat = [[Clearfield, Pennsylvania|Clearfield]] | area = 2,988 [[km²]] (1,154 [[square mile|mi²]]) | area water = 17 km² (6 mi²) | area percentage = 0.56% | census yr = 2000| pop = 83,382 | density = 28| |}} • Infobox • Link • List • Category • Redirection • Disambiguation • ……
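
The infobox template above is plain wikitext, so its attribute/value pairs can be read out mechanically. The following is a minimal Python sketch of that idea, not Kylin's actual parser: the function name `parse_infobox` is hypothetical, and it assumes a single template with no nested `{{...}}` templates, protecting only the pipes that occur inside `[[...]]` links.

```python
def parse_infobox(wikitext: str) -> dict:
    """Parse one non-nested {{Infobox ...}} template into {attribute: value}."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("expected a {{...}} template")
    body = body[2:-2]

    # Split on '|' separators, but not on pipes inside [[target|label]] links.
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair == "[[":
            depth += 1
            current.append(pair)
            i += 2
        elif pair == "]]":
            depth = max(0, depth - 1)
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))

    attributes = {}
    for part in parts[1:]:  # parts[0] is the template name
        if "=" not in part:
            continue  # skip empty segments such as the trailing "| |"
        key, value = part.split("=", 1)
        attributes[key.strip()] = value.strip()
    return attributes
```

Empty attributes such as `seal =` are kept with an empty string as their value, which makes the missing-data problem visible to later stages.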

  12. Self-Supervised Learning of Infoboxes

  13. Infobox Challenges • Incompleteness • US County: ~50% of articles have infoboxes • Inconsistency • Manual process -> contradictions between text & infobox • 16% of US County articles had an error (revision) • Schema Drift • U.S. County (1428), US County (574), Counties (50), County (19) • Attribute drift & duplication • Rare attributes: only 29% of attributes are used by 30% or more of the articles

  14. Infobox Challenges (Continued) • Type-free System • Deliberate low-tech design • “King County” has the following attributes: • Land area = 2126 sq miles • Land area (km) = 5506 sq km • Irregular lists • Some separate information in items • Others use tables with different schemata • Others are hierarchical • Example list pages: List of cities & towns in US; Places in Florida; List of counties in Florida
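
The two “Land area” attributes above carry the same fact in different units, which a type-free system cannot detect. As a small worked example, the sketch below converts both values to square kilometres and compares them. The function name `area_in_sq_km` and the accepted unit spellings are assumptions made for this illustration; the conversion factor (1 square mile ≈ 2.589988 km²) is standard.

```python
import re

# Conversion factors to square kilometres for the two unit spellings
# that appear on the slide (other spellings are not handled here).
SQ_KM_PER_UNIT = {"sq miles": 2.589988, "sq km": 1.0}


def area_in_sq_km(value: str) -> float:
    """Convert a string such as '2126 sq miles' to square kilometres."""
    match = re.fullmatch(r"\s*([\d,.]+)\s*(sq miles|sq km)\s*", value)
    if match is None:
        raise ValueError(f"unrecognised area value: {value!r}")
    number = float(match.group(1).replace(",", ""))
    return number * SQ_KM_PER_UNIT[match.group(2)]


def same_area(a: str, b: str, tolerance_sq_km: float = 1.0) -> bool:
    """True if two area strings agree after unit normalisation."""
    return abs(area_in_sq_km(a) - area_in_sq_km(b)) <= tolerance_sq_km
```

With these definitions, `same_area("2126 sq miles", "5506 sq km")` holds, since 2126 × 2.589988 ≈ 5506.3, so a typed system could flag the two attributes as duplicates.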

  15. Infobox Challenges (Continued) • Infoboxes are themselves hierarchical • Country leader: instead of a name attribute, has a nested element giving the title (e.g., “king”), with the name at a lower level

  16. Semantics in Wikipedia • Infobox • Link • List • Category • Redirection • Disambiguation Why are these useful?

  17. Semantics in Wikipedia • Infobox • Link • List • Category • Redirection • Disambiguation Why useful? Why challenging?

  18. Semantics in Wikipedia “Seattle, Washington” • Infobox • Link • List • Category • Redirection • Disambiguation • Challenge: crappy • flattened • “to be merged” since 3/06

  19. Semantics in Wikipedia • Infobox • Link • List • Category • Redirection • Disambiguation Why useful?

  21. Semantics in Wikipedia • Infobox • Link • List • Category • Redirection • Disambiguation • Opportunities • Semantic source • Training dataset Challenges • Missing data • Inconsistency

  22. Semantics in Wikipedia • Infobox • Link • List • Category • Redirection • Disambiguation • Opportunities • Semantic source • Training dataset Challenges • Missing data • Inconsistency Kylin: Autonomously Semantifying Wikipedia

  23. Outline • Semantics in Wikipedia • Opportunities • Challenges • Kylin System • Infobox Generation • Link Creation • Conclusion

  24. Infobox Generation

  25. Preprocessor [diagram: Preprocessor, Classifier, Extractor, Infobox] • Schema Refinement • Free edit -> schema drift • Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19) • Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year” • Low usage of attributes • Kylin: • Strict name match • ???? • >15% occurrences
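
A minimal Python sketch of the two refinement rules that are legible on the slide above: keep only infoboxes whose template name matches exactly, then keep only attributes that occur in more than 15% of them. The function name `refine_schema` and the `(template_name, attributes)` input format are assumptions for illustration, and the step shown as “????” on the slide is not modelled.

```python
from collections import Counter


def refine_schema(infoboxes, template_name, min_usage=0.15):
    """Return the attributes used by more than `min_usage` of the infoboxes
    whose template name strictly equals `template_name`.

    `infoboxes` is an iterable of (template_name, {attribute: value}) pairs.
    """
    matching = [attrs for name, attrs in infoboxes if name == template_name]
    if not matching:
        return []
    # Count each attribute at most once per infobox.
    usage = Counter(attr for attrs in matching for attr in set(attrs))
    total = len(matching)
    return sorted(attr for attr, count in usage.items()
                  if count / total > min_usage)
```

Because the name match is strict, infoboxes filed under “US County” or “Counties” are simply ignored here rather than merged; merging the duplicate templates would need an additional mapping.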

  26. Preprocessor [diagram: Preprocessor, Classifier, Extractor, Infobox] • Training Dataset Construction • Example article text: “Clearfield County was created on 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812. Its county seat is Clearfield. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water. As of 2005, the population density was 28.2/km².” • Steps: • Segment into sentences • Find unique match (heuristics) • Problems: • Missing data • Noise
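
The two steps above can be sketched as follows. This is a rough illustration under stated assumptions, not Kylin's heuristics: sentences are split with a naive regular expression (it would wrongly split abbreviations such as “U.S. state”), and an attribute is matched only if its infobox value occurs verbatim in exactly one sentence. The names `split_sentences` and `build_training_examples` are hypothetical.

```python
import re


def split_sentences(text: str) -> list:
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_training_examples(article_text: str, infobox: dict) -> dict:
    """Map each attribute to the single sentence that contains its value.

    Attributes whose value matches no sentence (missing data) or several
    sentences (ambiguous, a source of noise) are left out.
    """
    sentences = split_sentences(article_text)
    examples = {}
    for attribute, value in infobox.items():
        if not value:
            continue
        hits = [s for s in sentences if value in s]
        if len(hits) == 1:
            examples[attribute] = hits[0]
    return examples
```

On the Clearfield County text, a value such as “17 km²” matches only the land/water sentence, while a seat value of “Clearfield” matches two sentences (“Clearfield County was created …” and “Its county seat is Clearfield.”) and is therefore skipped by this sketch.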

  27. Classifier [diagram: Preprocessor, Classifier, Extractor, Infobox] • Document Classifiers (1 per article type) • List & Category • Fast • Precision (98.5%), with no learning! • Recall (68.8%) • Sentence Classifier (1 per article type x attribute) • Trained on preprocessor output • Features: bag of words, POS tags • Maximum Entropy Classifier with Bagging: multi-class, multi-label, missing data
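
The slide does not spell out how the list and category heuristics work, so the following Python sketch is only one plausible reading: an article is assigned to a class if its title appears on a list page associated with that class, or if it carries a category associated with that class. The function name `classify_document` and both lookup tables are hypothetical.

```python
def classify_document(title, categories, class_lists, class_categories):
    """Return the set of article classes suggested by list/category membership.

    class_lists:      {class_name: set of article titles found on list pages}
    class_categories: {class_name: set of category names tied to the class}
    No learning is involved; this is a pure lookup.
    """
    labels = set()
    article_categories = set(categories)
    for class_name, titles in class_lists.items():
        if title in titles:
            labels.add(class_name)
    for class_name, class_cats in class_categories.items():
        if class_cats & article_categories:
            labels.add(class_name)
    return labels
```

A lookup like this is fast and tends to be precise, but it misses articles that nobody has listed or categorised, which is consistent with the high-precision, lower-recall figures on the slide.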

  28. Extractor [diagram: Preprocessor, Classifier, Extractor, Infobox] • Input • A sentence predicted to contain an attribute: “After considerable debate, the county was incorporated on September 13, 1852” • Output • <founding date, September 13, 1852>

  29. Landscape of Extraction Techniques (running example: “Abraham Lincoln was born in Kentucky.”) • Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)? • Classify Pre-segmented Candidates: classifier decides which class? • Sliding Window: classifier decides which class? Try alternate window sizes • Boundary Models: classifier marks BEGIN and END • Finite State Machines: most likely state sequence? • Context Free Grammars: most likely parse? (NNP NNP V V P NP; NP, PP, VP, S) • …and beyond • Any of these models can be used to capture words, formatting or both. Slides from Cohen & McCallum

  30. Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Slides from Cohen & McCallum

  34. A “Naïve Bayes” Sliding Window Model [Freitag 1997] • Example text: “… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …” • Window layout: prefix w_{t-m} … w_{t-1}; contents w_t … w_{t+n}; suffix w_{t+n+1} … w_{t+n+m} • Estimate Pr(LOCATION|window) using Bayes rule • Try all “reasonable” windows (vary length, position) • Assume independence for length, prefix words, suffix words, content words • Estimate from data quantities like: Pr(“Place” in prefix|LOCATION) • If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it. Slides from Cohen & McCallum
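
The sliding-window idea above can be sketched in a few dozen lines of Python. This is a simplified illustration, not Freitag's estimator: each window is scored by a log-odds ratio between a field model (independent prefix, contents, suffix and length terms) and a background unigram model, and the smoothing scheme, the class name `SlidingWindowNB` and all parameters (`context`, `max_len`, `mu`) are assumptions of this sketch.

```python
import math
from collections import Counter


class SlidingWindowNB:
    def __init__(self, context=2, max_len=4, mu=1.0):
        self.context, self.max_len, self.mu = context, max_len, mu
        self.slots = {"prefix": Counter(), "contents": Counter(), "suffix": Counter()}
        self.lengths = Counter()     # field lengths seen in training
        self.background = Counter()  # unigram counts over all training tokens

    def _parts(self, tokens, start, end):
        c = self.context
        return {"prefix": tokens[max(0, start - c):start],
                "contents": tokens[start:end],
                "suffix": tokens[end:end + c]}

    def fit(self, examples):
        """examples: iterable of (tokens, start, end) with the field at [start, end)."""
        for tokens, start, end in examples:
            self.background.update(tokens)
            self.lengths[end - start] += 1
            for slot, words in self._parts(tokens, start, end).items():
                self.slots[slot].update(words)
        return self

    def _p_background(self, word):
        # Add-one smoothing, with one extra outcome reserved for unseen words.
        vocab = len(self.background) + 1
        return (self.background[word] + 1) / (sum(self.background.values()) + vocab)

    def score(self, tokens, start, end):
        """Log-odds that tokens[start:end] is the field, under independence."""
        n_fields = sum(self.lengths.values())
        log_odds = math.log((self.lengths[end - start] + 1) / (n_fields + self.max_len))
        for slot, words in self._parts(tokens, start, end).items():
            total = sum(self.slots[slot].values())
            for w in words:
                p_bg = self._p_background(w)
                # Slot probability backed off to the background model.
                p_field = (self.slots[slot][w] + self.mu * p_bg) / (total + self.mu)
                log_odds += math.log(p_field / p_bg)
        return log_odds

    def best_window(self, tokens):
        """Try every window up to max_len tokens; return (score, start, end)."""
        candidates = [(self.score(tokens, s, e), s, e)
                      for s in range(len(tokens))
                      for e in range(s + 1, min(len(tokens), s + self.max_len) + 1)]
        return max(candidates)
```

In use, one would call `fit` on announcements with annotated locations, then extract the window returned by `best_window` only if its score exceeds a threshold tuned on held-out data.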

  35. “Naïve Bayes” Sliding Window Results • Domain: CMU UseNet Seminar Announcements • Example announcement: GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. • F1 by field: Person Name 30%; Location 61%; Start Time 98%. Slides from Cohen & McCallum
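
The results above are reported as F1, the harmonic mean of precision and recall. A short Python sketch of the computation follows; the function names are chosen for this illustration, and extractions are compared as exact-match sets, which is one common convention and not necessarily the one used for the figures on the slide.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of extractions against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, a precision of 98.5% with a recall of 68.8% gives an F1 of about 0.81, because the harmonic mean is pulled toward the lower of the two values.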

  36. State of the Art Performance • Named entity recognition • Person, Location, Organization, … • F1 in high 80’s or low- to mid-90’s • Binary relation extraction • Contained-in (Location1, Location2) • Member-of (Person1, Organization1) • F1 in 60’s or 70’s or 80’s • Wrapper induction • Extremely accurate performance obtainable • Human effort (~30min) required on each site. Slides from Cohen & McCallum

  37. CRF Extractor [diagram: Preprocessor, Classifier, Extractor, Infobox] • Conditional Random Fields Model [Lafferty 01] • Attribute value extraction: sequential data labeling • CRF model for each attribute independently • Relabel: filter false negative training examples • Example: “2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.” Preprocessor: Water_area; Classifier: Water_area, Land_area • Pipeline: prune irrelevant sentences • Precision +, Recall -
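
Treating attribute extraction as sequential data labeling means that every token in a sentence receives a label, and the CRF learns to predict the label sequence. The Python sketch below shows only how such training labels could be produced from a sentence and a known attribute value. The B/I/O tagging scheme and the function name `bio_labels` are assumptions of this illustration (a common convention, not confirmed as Kylin's), and no CRF is trained here.

```python
def bio_labels(sentence_tokens, value_tokens, attribute):
    """Label each token O, B-<attribute> or I-<attribute>.

    The first occurrence of `value_tokens` inside `sentence_tokens` is marked
    as the attribute value; every other token is labeled 'O' (outside).
    """
    n, m = len(sentence_tokens), len(value_tokens)
    labels = ["O"] * n
    if m == 0:
        return labels
    for i in range(n - m + 1):
        if sentence_tokens[i:i + m] == value_tokens:
            labels[i] = f"B-{attribute}"          # beginning of the value
            for j in range(i + 1, i + m):
                labels[j] = f"I-{attribute}"      # inside the value
            break
    return labels
```

Applied to the tokens of “After considerable debate, the county was incorporated on September 13, 1852” with the value “September 13, 1852”, the last four tokens are tagged as the founding date and all earlier tokens as outside.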

  38. Infobox Generation Experiments • Dataset: Wikipedia dump of 2007.02.06 • 4 popular classes: • U.S. County (1245) • Actor (3819) • Airline (791) • University (4025) • 50 random test articles per class

  39. Kylin performance

  40. Kylin performance (detailed view) • U.S. County (better than manual labeling) • Strict expression • Number-typed • Example: “Abbeville County is a county located in the U.S. state of South Carolina. The county has a total area of 2,988 square kilometers (1,154 mi²). 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.”

  41. Kylin performance (detailed view) • University (worse than manual labeling) • Flexible expression: “The College began first in 1855 as a one room schoolhouse.” “UCL was founded in 1826 under the name ‘University of London’.” “The college opened in 1973 with the Charlestown campus.” • Global context: “Former U.S. President Dwight D. Eisenhower served as President of the University.” • Implicit: e.g., the student counts at 3 campuses sum to the total student number

  42. Effect of Relabel, Pipeline

  43. Default Project • Reimplement Kylin (or build on Fei’s code) • Improve it • See how much information we can extract • Post on web: DBpedia • Merge back into Wikipedia? • Bot issues • Associate JavaScript • Extraction from the Greater WWW • Self-verify accuracy by external extraction • Add infobox facts which are missing from articles

  44. Extensions • Semi-automated bot interface • Firefox plugin • Displays improved infobox – user checks & says ok • Safer than a bot • For general Wikipedia authors • Extraction in real-time & error checking • Attribute values • Guide towards best schema & attribute • Typing & microformats

  45. Extensions • Other Wikipedia issues • Learn author reputation • Watch for changes • Look for framing or biased language • Recognize vandalism • Auto-generate disambiguation pages • Extract events & create a timeline view • Citation assistance • Identify correspondence between text and citation • Semiautomatic article generation

  46. Extensions • Where could this be applied besides Wikipedia? • Broader Questions • Internet enables generation of structured content • How to integrate methods? • Overwrite, training data, ???
