
Harvesting Knowledge from Web Data and Text CIKM 2010 Tutorial (1/2 Day)

Hady W. Lauw 1, Ralf Schenkel 2, Fabian Suchanek 3, Martin Theobald 4, and Gerhard Weikum 4. 1 Institute for Infocomm Research, Singapore; 2 Saarland University, Saarbruecken; 3 INRIA Saclay, Paris; 4 Max Planck Institute Informatics, Saarbruecken


Presentation Transcript


  1. Harvesting Knowledge from Web Data and Text, CIKM 2010 Tutorial (1/2 Day). Hady W. Lauw 1, Ralf Schenkel 2, Fabian Suchanek 3, Martin Theobald 4, and Gerhard Weikum 4. 1 Institute for Infocomm Research, Singapore; 2 Saarland University, Saarbruecken; 3 INRIA Saclay, Paris; 4 Max Planck Institute Informatics, Saarbruecken

  2. All slides for download… http://www.mpi-inf.mpg.de/yago-naga/CIKM10-tutorial/ Harvesting Knowledge from Web Data

  3. Outline • Part I • What and Why • Available Knowledge Bases • Part II • Extracting Knowledge • Part III • Ranking and Searching • Part IV • Conclusion and Outlook

  4. Motivation Elvis, when I need you, I can hear you! Elvis Presley 1935 - 1977 Will there ever be someone like him again?

  5. Motivation Another Elvis Elvis Presley: The Early Years Elvis spent more weeks at the top of the charts than any other artist. www.fiftiesweb.com/elvis.htm

  6. Motivation Another singer called Elvis, young Personal relationships of Elvis Presley – Wikipedia ...when Elvis was a young teen.... another girl whom the singer's mother hoped Presley would .... The writer called Elvis “a hillbilly cat” en.wikipedia.org/.../Personal_relationships_of_Elvis_Presley

  7. Motivation Dear Mr. Page, you don’t understand me. I just... Elvis Presley - Official page for Elvis Presley Welcome to the Official Elvis Presley Web Site, home of the undisputed King of Rock 'n' Roll and his beloved Graceland ... www.elvis.com/

  8. Motivation Other (more serious?) queries: • when is Madonna’s next concert in Europe? • which protein inhibits atherosclerosis? • who was king of England when Napoleon I was emperor of France? King George III • has any scientist ever won the Nobel Prize in Literature? Bertrand Russell • which countries have an HDI comparable to Sweden’s? • which scientific papers have led to patents? • is there another famous singer named “Elvis”?

  9. This Tutorial Mr. Page, let’s try this again. Is there another singer named Elvis? In this tutorial, we will explain • how the knowledge is organized • what knowledge bases exist already • how we can construct knowledge bases • how we can query knowledge bases Ontology/Semantic Knowledge Base singer type type ? “Elvis” “Elvis”

  10. Ontologies entity subclassOf subclassOf Classes person location subclassOf subclassOf scientists singer city type type Relations type bornIn ? Instances Tupelo The same entity has two labels: synonymy label The same label for two entities: homonymy label Labels/words “Elvis” “The King”

  11. Classes entity entity subclassOf person subclassOf person scientists singer singer type type scientist ? Transitivity: type(x,y) /\ subclassOf(y,z) => type(x,z)
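
The transitivity rule on this slide can be sketched as a small fixpoint computation. This is only an illustration: the instance name `Elvis` and the function name `infer_types` are assumptions made for the example, not part of the slides.

```python
def infer_types(type_facts, subclass_facts):
    """Apply the rule: type(x, y) AND subclassOf(y, z) => type(x, z),
    repeating until no new type facts can be derived."""
    inferred = set(type_facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(inferred):
            for (sub, sup) in subclass_facts:
                if sub == y and (x, sup) not in inferred:
                    inferred.add((x, sup))
                    changed = True
    return inferred


# Toy class hierarchy taken from the slide: singer -> person -> entity.
subclass_facts = {("singer", "person"), ("person", "entity")}
# Hypothetical instance, used only for illustration.
type_facts = {("Elvis", "singer")}

all_types = infer_types(type_facts, subclass_facts)
```

Under these assumptions, `all_types` also contains `("Elvis", "person")` and `("Elvis", "entity")`, although only the `singer` fact was stated explicitly.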

  12. Relations entity subclassOf subclassOf person location domain range subclassOf bornIn singer city type type bornIn Tupelo Domain and range constraints: domain(r,c) /\ r(x,y) => type(x,c) range(r,c) /\ r(x,y) => type(y,c) Looks like higher order, but is not. Consider introducing a predicate fact(r,x,y)
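
A minimal sketch of the domain and range rules, storing each fact as a `(r, x, y)` tuple in the spirit of the `fact(r,x,y)` predicate suggested on the slide. The relation name then becomes an ordinary value, which is why no higher-order logic is needed. The function name and the `Elvis` instance are assumptions for illustration.

```python
def infer_from_signatures(facts, domain, range_):
    """Apply domain(r, c) AND r(x, y) => type(x, c)
    and   range(r, c)  AND r(x, y) => type(y, c)."""
    types = set()
    for (r, x, y) in facts:
        if r in domain:
            types.add((x, domain[r]))
        if r in range_:
            types.add((y, range_[r]))
    return types


# Facts are stored as fact(r, x, y), so the relation is plain data.
facts = {("bornIn", "Elvis", "Tupelo")}
domain = {"bornIn": "person"}     # domain(bornIn, person)
range_ = {"bornIn": "location"}   # range(bornIn, location)

inferred = infer_from_signatures(facts, domain, range_)
```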

  13. Event Entities An event entity is an artificial entity introduced to represent an n-ary relationship. Event entities allow representing arbitrary relational data as binary graphs. [Diagram labels: ElvisGrammy, winner, prize, year, 1967, won, Grammy Award, Row42, Row43]
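
As a rough illustration of the idea, one row of a three-column table can be turned into binary facts that all hang off one event entity. The column names follow the slide's labels; the row contents and the helper name `to_event_triples` are assumptions for the example.

```python
def to_event_triples(event_id, row):
    """Turn one n-ary row into binary facts attached to an event entity."""
    return [(event_id, column, value) for column, value in row.items()]


# Hypothetical row of a (winner, prize, year) table.
row = {"winner": "Elvis", "prize": "Grammy Award", "year": 1967}
triples = to_event_triples("Row42", row)
```

Each triple has exactly two arguments besides the relation name, so the whole table fits into a graph of binary edges.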

  14. Reification Reification is the method of creating an entity that represents a fact. 1967 year source #42 Wikipedia won bornIn Grammy Award Tupelo #42 #43 There are different ways to reify a fact; this is the one used in this talk.
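
One possible reading of this slide, sketched in code: every fact gets an identifier, and the identifier can itself be the subject of further facts. Which subject the facts belong to, and which fact the `year` and `source` edges attach to, is an assumption here, since the diagram is not recoverable from the transcript.

```python
# Facts keyed by their identifiers (the subject "Elvis" is assumed).
facts = {
    "#42": ("Elvis", "won", "Grammy Award"),
    "#43": ("Elvis", "bornIn", "Tupelo"),
}

# Facts about facts: the fact identifier acts as an ordinary entity.
meta_facts = [
    ("#42", "year", 1967),
    ("#42", "source", "Wikipedia"),
]


def describe(fact_id):
    """Collect everything stated about one reified fact."""
    return {rel: val for (subj, rel, val) in meta_facts if subj == fact_id}
```

With this layout, `describe("#42")` gathers the year and the source of the award fact, while `describe("#43")` is empty because nothing was stated about that fact.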

  15. RDF The Resource Description Framework (RDF) is a W3C standard that provides a standard vocabulary to model ontologies. • An RDF ontology can be seen as a directed labeled multi-graph where • the nodes are entities • the edges are labeled with relations • Edges (facts) are commonly written • as triples: <Elvis, bornIn, Tupelo> • as literals: bornIn(Elvis, Tupelo) [Diagram labels: resource, location, city, Tupelo; edges subclassOf, type, bornIn] [W3C recommendation: RDF, 2004]
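
The graph view and the two notations can be sketched with plain Python data structures. This does not use an RDF library or real RDF syntax; the helper names and the extra example facts are assumptions for illustration.

```python
from collections import defaultdict

# Each fact is a (subject, relation, object) triple.
triples = [
    ("Elvis", "bornIn", "Tupelo"),
    ("Elvis", "type", "singer"),
    ("Tupelo", "type", "city"),
    ("city", "subclassOf", "location"),
]


def build_graph(triples):
    """Directed labeled multi-graph: node -> list of (edge label, target)."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


def as_literal(triple):
    """Render <s, p, o> in the literal notation p(s, o)."""
    s, p, o = triple
    return f"{p}({s}, {o})"


graph = build_graph(triples)
```

A list of outgoing edges per node allows several edges between the same pair of nodes, which is what makes the structure a multi-graph.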

  16. Outline • Part I • What and Why ✔ • Available Knowledge Bases • Part II • Extracting Knowledge • Part III • Ranking and Searching • Part IV • Conclusion and Outlook

  17. Cyc What if we could make all common sense knowledge computer-processable? Cyc project • started in 1984 • driven by • staff of 20 • goal: formalize knowledge manually Douglas Lenat [Lenat, Comm. ACM, 1995]

  18. Cyc: Language CycL is the formal language that Cyc uses to represent knowledge (semantics based on first-order logic, syntax based on LISP). (#$forall ?A (#$implies (#$isa ?A #$Animal) (#$thereExists ?M (#$mother ?A ?M)))) (#$arity #$GovernmentFn 1) (#$arg1Isa #$GovernmentFn #$GeopoliticalEntity) (#$resultIsa #$GovernmentFn #$RegionalGovernment) (#$governs (#$GovernmentFn #$Canada) #$Canada) + a logical reasoner http://cyc.com/cycdoc/ref/cycl-syntax.html

  19. Cyc: Knowledge #$Love Strong affection for another agent arising out of kinship or personal ties. Love may be felt towards things, too: warm attachment, enthusiasm, or devotion. #$Love is a collection, as further explained under #$Happiness. Specialized forms of #$Love are #$Love-Romantic, platonic love, maternal love, infatuation, agape, etc. guid: bd589433-9c29-11b1-9dad-c379636f7270 direct instance of: #$FeelingType direct specialization of: #$Affection direct generalization of: #$Love-Romantic Cyc project http://cyc.com/cycdoc/vocab/emotion-vocab.html#Love Facts and axioms about: Transportation, Ecology, everyday living, chemistry, healthcare, animals, law, computer science... “If a computer network implements IEEE 802.11 Wireless LAN Protocol and some computer is a node in that computer network, then that computer is vulnerable to decryption.” http://cyc.com/cyc/technology/whatiscyc_dir/maptest

  20. Cyc: Summary http://cyc.com/cyc/technology/whatiscyc_dir/whatsincyc SUMO (the Suggested Upper Merged Ontology) is a research project in a similar spirit, driven by Adam Pease of Articulate Software. http://ontologyportal.org

  21. WordNet What if we could make the English language computer-processable? • started in 1985 • Cognitive Science Laboratory, Princeton University • written by lexicographers • goal: support automatic text analysis and AI applications George Miller [Miller, CACM 1995]

  22. WordNet: Lexical Database Word Sense synonymous words photographic camera sense1 camera polysemous words sense2 television camera
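
The word/sense distinction on this slide can be sketched as a mapping from senses to the words that express them. The sense identifiers follow the slide; the helper names are assumptions, and this is not WordNet's actual data format or API.

```python
# Each sense groups the words that can express it (synonymous words).
senses = {
    "sense1": {"camera", "photographic camera"},
    "sense2": {"camera", "television camera"},
}


def senses_of(word):
    """A word with more than one sense is polysemous."""
    return sorted(s for s, words in senses.items() if word in words)


def synonyms(word):
    """Words that share at least one sense with the given word."""
    shared = set().union(*(words for words in senses.values() if word in words))
    return shared - {word}
```

In this toy data, "camera" is polysemous because it appears in both senses, while "photographic camera" has a single sense and one synonym.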

  23. WordNet

  24. WordNet: Semantic Relations Hypernymy Meronymy Is-value-of Camera Speed Kitchen Appliances Slow Fast Optical Lens Toaster

  25. WordNet: Semantic Relations

  26. WordNet: Hierarchy Hypernymy Is-A relations

  27. WordNet: Size http://wordnet.princeton.edu/wordnet/man2.1/wnstats.7WN.html Downloadable at http://wordnet.princeton.edu

  28. Wikipedia If a small number of people can create a knowledge base, how about a LARGE number of people? • started in 2001 • driven by Wikimedia Foundation and a large number of volunteers • goal: build world’s largest encyclopedia Jimmy Wales

  29. Wikipedia: Entities and Attributes Entities Attributes

  30. Wikipedia: Synonymy and Polysemy Redirection (synonyms) Disambiguation (polysemy)

  31. Wikipedia: Classes/Categories Class hierarchy different from WordNet

  32. Wikipedia: Others Inter-lingual Links Navigation/ Topic box

  33. Wikipedia: Numbers • English: • 1B words, • 2.8M articles, • 152K contributors • All (250 languages): • 1.74B words, • 9.25M articles, • 283K contributors • vs. Britannica: • 25X as many words • ½ avg article length • License: Creative Commons Attribution-ShareAlike (CC-BY-SA) Growth 2001 - 2008 Downloadable at http://download.wikimedia.org/

  34. Automatically Constructed Knowledge Bases • Manual approaches (Cyc, WordNet, Wikipedia) • produce high quality knowledge bases • labor-intensive and limited in scope Can we construct the knowledge bases automatically? YAGO … , etc.

  35. YAGO Can we exploit Wikipedia and WordNet to build an ontology? • started as PhD thesis in 2007 • now major project at the Max Planck Institute for Informatics in Germany • goal: extract ontology from Wikipedia with high accuracy and consistency [Suchanek et al., WWW 2007]

  36. YAGO: Construction WordNet Person Person subclassOf subclassOf Singer Singer subclassOf Rock Singer Elvis Presley type ~Infobox~ Born: 1935 ... born Blah blah blub fasel (do not read this, better listen to the talk) blah blah Elvis blub (you are still reading this) blah Elvis blah blub later became astronaut blah 1935 Exploit Infoboxes Exploit conceptual categories Add WordNet Categories: Rock singer

  37. YAGO: Consistency Checks Person subclassOf Singer subclassOf Guitar Guitarist Rock Singer type born born Physics 1935 Check uniqueness of entities and functional arguments Check domains and ranges of relations Check type coherence
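
One of these checks, uniqueness of functional arguments, can be sketched as follows. This is an illustrative toy, not YAGO's actual implementation; the subject `Elvis` and the function name are assumptions, and the conflicting `born` values echo the labels on the slide.

```python
def functional_violations(facts, functional_relations):
    """Report subjects that have two different values for a functional relation."""
    seen = {}
    violations = []
    for subj, rel, obj in facts:
        if rel not in functional_relations:
            continue
        key = (subj, rel)
        if key in seen and seen[key] != obj:
            violations.append((subj, rel, seen[key], obj))
        else:
            seen.setdefault(key, obj)
    return violations


# Candidate facts as an extractor might produce them (hypothetical).
candidates = [
    ("Elvis", "born", "1935"),
    ("Elvis", "born", "Physics"),
    ("Elvis", "type", "Rock Singer"),
]
conflicts = functional_violations(candidates, {"born"})
```

A domain and range check in the same style would then reject the `Physics` value, since it is not a date.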

  38. YAGO: Relations ca. 100 relations with range and domain

  39. YAGO: Numbers Downloadable at http://mpii.de/yago incl. converters for RDF, XML, databases

  40. DBpedia Can we harvest facts more exhaustively with community effort? • community effort started in 2007 • driven by Free U. Berlin, U. Leipzig, OpenLink • goal: "extract structured information from Wikipedia and to make this information available on the Web" [Bizer et al., Journal of Web Semantics 2009]

  41. DBpedia: Ontology • In YAGO, the taxonomy is based on WordNet classes. • DBpedia: • places entities extracted from Wikipedia into its own ontology • hand-crafted: 259 classes, 6 levels, 1200 properties • emphasizes recall • only half of the extracted entities are currently placed in its own ontology • alternative classifications: Wikipedia, YAGO, UMBEL (OpenCyc)

  42. DBpedia: Mapping Rules • DBpedia mapping rules: • map Wikipedia infoboxes and tables to its ontology • target datatypes (normalize units, ignore deviant values) • Community effort: • hand-craft mapping rules • expand ontology <http://en.wikipedia.org/wiki/Elvis_Presley> {{Infobox musical artist |Name = Elvis Presley |Background = solo_singer |Birth_name = Elvis Aaron Presley }} <http://dbpedia.org/page/Elvis_Presley> foaf:name “Elvis Presley”; background “solo_singer”; foaf:givenName “Elvis Aaron Presley”; Note that the values do not change.
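
The effect of such mapping rules on the example infobox can be sketched as a dictionary lookup. This is a toy under stated assumptions: `parse_infobox` and `apply_mapping` are hypothetical helpers, the parser only handles flat `|key = value` pairs, and real DBpedia mappings are considerably richer.

```python
# Hand-crafted mapping from infobox attribute names to ontology properties,
# following the example on the slide.
MAPPING = {
    "Name": "foaf:name",
    "Background": "background",
    "Birth_name": "foaf:givenName",
}


def parse_infobox(text):
    """Toy parser for flat '|key = value' pairs (no nested templates)."""
    attrs = {}
    for part in text.strip("{} \n").split("|")[1:]:
        key, _, value = part.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def apply_mapping(attrs):
    """Rename mapped attributes; the values are passed through unchanged."""
    return {MAPPING[k]: v for k, v in attrs.items() if k in MAPPING}


infobox = ("{{Infobox musical artist |Name = Elvis Presley "
           "|Background = solo_singer |Birth_name = Elvis Aaron Presley }}")
properties = apply_mapping(parse_infobox(infobox))
```

Only the attribute names are rewritten, which matches the slide's remark that the values do not change.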

  43. DBpedia: Numbers • plus • 5.5 m links to external Web pages • 1.5 m links to images • 5 m links to other RDF data sets Downloadable at http://dbpedia.org

  44. Freebase What if we could harvest both automatic extraction and user contribution? • started in 2000 • driven by Metaweb, part of Google since Jul 2010 • goals: • “an open shared database of the world's knowledge” • “a massive, collaboratively-edited database of cross-linked data”

  45. Freebase • Like DBpedia and YAGO, Freebase imports data from Wikipedia. • Differently: • also imports from other sources (e.g., ChefMoz, NNDB, and MusicBrainz) • including individually contributed data • users can collaboratively edit its data (without having to edit Wikipedia).

  46. Freebase: User Contribution Edit Entities: • create new entities • assign a new type/class to an entity • add/change attributes • connect to other entities • upload/edit images Edit Schema: • define new class, specifying the attributes of the class • class definition can only be changed by creator/admin • class not part of commons until peer-reviewed & promoted by staff/admin Review: • flag vandalism • flag entities to be merged/deleted • vote on flagged content (3 unanimous votes, or an expert has to be tie-breaker) Data Game: • finding aliases in Wikipedia redirects • extracts dates of events from Wikipedia articles • uses the Yahoo image search API to find candidates

  47. Freebase: Community Experts: • tie breaker in reviews • split entities • “rewind” changes • New experts are inducted by current experts. Admins: • create new classes and attributes • respond to community suggestions • Promoted by staff or other admins. Members: • contribute (edit, review, vote) • Anyone can be a member.

  48. Freebase: Numbers Downloadable at http://download.freebase.com

  49. Question Answering Systems Objective is to answer user queries from an underlying knowledge base. • data from Wikipedia and user edits • natural language translation of queries • 9 m entities, 300 m facts • computes answers from an internal knowledge base of curated, structured data • stores not just facts, but also algorithms and models

  50. Application: Semantic Similarity • Task: determine similarity between two words • topological distance of two words in the graph • taxonomic distance: hierarchical is-a relations • Example application: correct real-word spelling errors Tofu is made from soy jeans. [Hirst et al., Natural Language Engineering 2001]
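
A taxonomic distance can be sketched as the number of is-a steps between two words via their closest common ancestor. The tiny taxonomy below is invented for the example and is not WordNet data; the measure is also a simplification and not necessarily the one used by Hirst et al.

```python
# Hypothetical is-a hierarchy: child -> parent.
parent = {
    "tofu": "food", "beans": "legume", "legume": "food", "food": "entity",
    "jeans": "trousers", "trousers": "clothing", "clothing": "entity",
}


def ancestors(word):
    """Map the word and each of its ancestors to the number of is-a steps."""
    chain = {}
    steps = 0
    while word is not None:
        chain[word] = steps
        word = parent.get(word)
        steps += 1
    return chain


def taxonomic_distance(a, b):
    """Shortest path between a and b through a common ancestor.
    Assumes both words sit under a shared root."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    return min(up_a[c] + up_b[c] for c in common)
```

In this toy hierarchy "tofu" is closer to "beans" than to "jeans", which is the kind of signal a spelling corrector could use to flag "soy jeans" as a likely real-word error.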
