For a Few Triples More. Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/. Acknowledgements. LOD: RDF Triples on the Web. http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png. LOD: Linked RDF Triples on the Web.
For a Few Triples More. Gerhard Weikum, Max Planck Institute for Informatics, http://www.mpi-inf.mpg.de/~weikum/
LOD: RDF Triples on the Web http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
LOD: Linked RDF Triples on the Web yago/wordnet:Artist109812338 rdfs:subClassOf rdfs:subClassOf yago/wordnet:Actor109765278 rdf:type yago/wikicategory:ItalianComposer rdf:type imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone prop:actedIn prop:composedMusicFor imdb.com/title/tt0361748/ dbpprop:citizenOf dbpedia.org/resource/Rome owl:sameAs owl:sameAs rdf.freebase.com/ns/en.rome data.nytimes.com/51688803696189142301 owl:sameAs geonames.org/3169070/roma Coord N 41° 54' 10'' E 12° 29' 2''
LOD: Linked RDF Triples on the Web • Size: 30 billion triples • Linkage: 500 million links • Dynamics: encyclopedic reference data
For a Few Triples More. 30 billion triples: still not enough? • No! Consider: • Dynamics • Linkage • Ubiquity
Outline • Explain Title • Why More Triples: Dynamics, Linkage, Ubiquity • Linkage & Ubiquity: Named-Entity Disambiguation • Web-Scale Linkage • Wrap-up
1. Dynamics: in a Fast Paced World
• Anecdotal examples:
  <… rdf:about="http://dbpedia.org/resource/Steve_Jobs">
    <dbpprop:occupation…> Chairman and CEO, Apple Inc.
  <… "http://... Ellen_Johnson_Sirleaf">
    <dcterms:subject rdf:resource="http://... Category:Nobel_Peace_Prize_laureates"/>
  <… "http://... Scarlett_Johansson">
    <dbpprop:spouse rdf:resource="http://... Ryan_Reynolds"/>
  <… "http://... Clint_Eastwood">
    <dbpprop:spouse…> Dina Ruiz 1 child
  <… rdf:about="http://... Clint_Eastwood">
    <dbpprop:spouse…> Maggie Johnson 2 children
  <… rdf:about="http://... Paul_McCartney">
    <dbpprop:spouse rdf:resource="http://… Linda_McCartney"/>
  <… rdf:about="http://... Paul_McCartney">
    <dbpprop:spouse rdf:resource="http://… Heather_Mills"/>
  <… rdf:about="http://... Paul_McCartney">
    <dbpprop:spouse…> Nancy Shevell
Status labels (in slide order): still there, not there, never there, both there, none there
1. Dynamics: As Fresh As Possible http://data.gov.uk/openspending
1. Dynamics: Updates in the Web of Data http://sindice.com
1. Dynamics: Closer to the Sources
• RDF data on the Web is produced by:
  • Maintained, but mostly "static" reference collections (e.g. geo)
  • Periodic exports from curated databases (e.g. gov, bio, music)
  • Periodic extraction from Web sources (e.g. encyclopedia, news)
  • Tags in social streams and advertisements
  Freshness labels (in slide order): mostly fresh, often stale, often stale, very noisy
• Get closer to the data origin:
  • RDF engines (SPARQL APIs) for production DBs
  • view maintenance by pub-sub push (feeds)
  • Deep-Web crawl/query for surfacing of RDF data
1. Dynamics: Nothing Lasts Forever
Even old and "static" data often needs temporal scope (timepoint, timespan) for proper interpretation:
1: PaulMcCartney hasSpouse HeatherMills [11-Jun-2002, 2008]
2: PaulMcCartney hasSpouse NancyShevell [Oct-2011, now]
3: PaulMcCartney gotHonor SirPaul [1999]
With reification, or use quads (quints, pints, etc.):
1 validFrom 11-Jun-2002
1 validUntil 2008
2 validFrom Oct-2011
3 happenedOn 1999
Need to add temporal properties to RDF and SPARQL:
Select ?w Where {
  ?id1: PM gotHonor SirPaul .
  ?id1 happenedOn ?t .
  ?id2: PM hasSpouse ?w .
  ?id2 validFrom ?b .
  ?id2 validUntil ?e .
  ?t containedIn [?b,?e] .
}
but: principled, expressive, easy-to-use
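The temporally scoped facts on this slide can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the `Fact` record, the `spouses_at` helper, and the reduction of dates to whole years are illustrative choices, not part of the talk or of any RDF standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    """One triple plus an optional temporal scope (years only, as a simplification)."""
    fact_id: int
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[int] = None
    valid_until: Optional[int] = None   # None means open-ended ("now")
    happened_on: Optional[int] = None


# The three facts from the slide, with dates coarsened to years.
FACTS = [
    Fact(1, "PaulMcCartney", "hasSpouse", "HeatherMills", valid_from=2002, valid_until=2008),
    Fact(2, "PaulMcCartney", "hasSpouse", "NancyShevell", valid_from=2011),
    Fact(3, "PaulMcCartney", "gotHonor", "SirPaul", happened_on=1999),
]


def spouses_at(facts: List[Fact], subject: str, year: int) -> List[str]:
    """Return the spouses whose validity interval contains the given year."""
    return [
        f.obj
        for f in facts
        if f.subject == subject
        and f.predicate == "hasSpouse"
        and f.valid_from is not None
        and f.valid_from <= year
        and (f.valid_until is None or year <= f.valid_until)
    ]


# Analogue of the slide's query: who was the spouse when the honor was received?
honor_year = next(f.happened_on for f in FACTS if f.predicate == "gotHonor")
print(spouses_at(FACTS, "PaulMcCartney", honor_year))
print(spouses_at(FACTS, "PaulMcCartney", 2005))
```

With only the three facts listed on the slide, the honor year 1999 falls outside both marriage intervals, so the first query returns an empty list; the interval check itself is the point of the example.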
1. Dynamics: Nothing Lasts Forever http://www.mpi-inf.mpg.de/yago-naga/yago/
2. Linkage: sameAs Links
dbpedia.org/resource/Linda_Louise_Eastman owl:sameAs yago-knowledge.org/resource/Linda_McCartney
www.freebase.com/view/en/man_with_no_name owl:sameAs dbpedia.org/page/Clint_Eastwood
data.linkedmdb.org/page/film/38166 owl:sameAs de.dbpedia.org/page/Zwei_glorreiche_Halunken
• LOD statistics: 30 billion triples, 500 million links
  • 330 million links trivial (ID-based): within pub, within bio
  • tens of millions of links near-trivial: DBpedia, Freebase, Yago, GeoNames
• sameas.org: 17 million bundles for 50 million URIs
• data.nytimes.com: 5000 people, 2000 locations
• Way too few for a world with: 1 million people, 10 million locations, tens of millions of species, 6 million books, 2 million movies, 10 million songs, etc.
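One way to picture how pairwise sameAs links turn into "bundles" of co-referring URIs is a union-find pass over the link pairs. The sketch below is an assumption about how such grouping could be done, not a description of how sameas.org works; the function name `same_as_bundles` and the short URIs are hypothetical.

```python
def same_as_bundles(links):
    """Group URIs into bundles by taking the transitive closure of sameAs pairs."""
    parent = {}

    def find(uri):
        # Locate the representative of a URI, compressing the path on the way.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    bundles = {}
    for uri in list(parent):
        bundles.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(bundle) for bundle in bundles.values())


# Hypothetical, shortened URIs standing in for real dataset identifiers.
LINKS = [
    ("a:Rome", "b:rome"),
    ("b:rome", "c:roma"),
    ("a:Linda_Eastman", "b:Linda_McCartney"),
]

for bundle in same_as_bundles(LINKS):
    print(bundle)
```

Treating sameAs as transitive is itself a design choice: a single wrong link merges two bundles, which is one reason the accuracy of sameAs links matters.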
2. Linkage: sameAs Accuracy http://sameas.org
3. Ubiquity: Web of Data & Other Contents • RDF data and Web contents need to be interconnected • RDFa & microformats provide the mechanism • How do we get the Web RDF-annotated (at large scale)? • Largely automated, but allow humans in the loop
3. Ubiquity: Web of Data & Other Contents
<html …
May 2, 2011
<div typeof=event:music>
  <span id="Maestro_Morricone"> Maestro Morricone
    <a rel="sameAs" resource="dbpedia…/Ennio_Morricone "/>
  </span>
  …
  <span property="event:location"> Smetana Hall </span>
  …
  <span property="rdf:type" resource="yago:performance"> The concert</span> will feature …
  <span property="event:date" content="14-07-2011"></span> July 1
</div>
Rendered text: May 2, 2011. Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.
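To illustrate how annotations like these can be read back out of a page, here is a minimal sketch using Python's standard `html.parser`. It is not a conforming RDFa processor: it only collects elements carrying a `property` attribute and assumes well-formed, explicitly closed tags. The class and function names are hypothetical, and the snippet is a trimmed version of the slide's markup.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collects (property, value) pairs from elements carrying a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = []  # one [property-or-None, text parts] entry per open element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property")
        explicit = attributes.get("content") or attributes.get("resource")
        if prop and explicit:
            # An explicit content/resource attribute wins over the element text.
            self.pairs.append((prop, explicit.strip()))
            prop = None
        self._open.append([prop, []])

    def handle_data(self, data):
        # Text belongs to every enclosing element that still waits for its value.
        for entry in self._open:
            if entry[0]:
                entry[1].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        prop, parts = self._open.pop()
        if prop:
            self.pairs.append((prop, " ".join("".join(parts).split())))


def extract_properties(markup):
    parser = PropertyExtractor()
    parser.feed(markup)
    parser.close()
    return parser.pairs


SNIPPET = """
<div typeof="event:music">
  <span property="event:location"> Smetana Hall </span>
  <span property="rdf:type" resource="yago:performance">The concert</span>
  <span property="event:date" content="14-07-2011"></span>
</div>
"""

for prop, value in extract_properties(SNIPPET):
    print(prop, "=", value)
```

A production setup would rather rely on a full RDFa parser that resolves prefixes and subjects; the sketch only shows that the annotations are machine-readable alongside the human-readable text.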
Why a Few Triples More? Linked Data is great! But still in its infancy. Need to add triples to capture further issues: • Dynamics: Where is the live data? • Linkage: Where are the links in Linked Data? • Ubiquity: Where are the paths between the Web-of-Data and the Web?
Outline • Explain Title • Why More Triples: Dynamics, Linkage, Ubiquity • Linkage & Ubiquity: Named-Entity Disambiguation • Web-Scale Linkage • Wrap-up
Entities on the Web http://sig.ma
Named-Entity Disambiguation (NED)
Example: "Harry fought with you know who. He defeats the dark lord."
Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)
Mentions, Meanings, Mappings
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
KB:
  Sergio means Sergio_Leone
  Sergio means Serge_Gainsbourg
  Ennio means Ennio_Antonelli
  Ennio means Ennio_Morricone
  Eli means Eli_(bible)
  Eli means ExtremeLightInfrastructure
  Eli means Eli_Wallach
  Ecstasy means Ecstasy_(drug)
  Ecstasy means Ecstasy_of_Gold
  trilogy means Star_Wars_Trilogy
  trilogy means Lord_of_the_Rings
  trilogy means Dollars_Trilogy
  …
Entities (meanings): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Benny Goodman, Benny Andersson, Star Wars Trilogy, Lord of the Rings, Dollars Trilogy
Mentions (surface names): the names in the text; which entity does each one mean?
D5 Overview, May 30, 2011
Mention-Entity Graph: weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
• Popularity (m,e): freq(e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e)); contexts as bag-of-words or language model: words, bigrams, phrases
KB+Stats
Mention-Entity Graph: weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Goal: joint mapping
• Popularity (m,e): freq(e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
KB+Stats
Mention-Entity Graph: weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
KB+Stats
Mention-Entity Graph: weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Types shown for coherence:
  Eli Wallach: American Jews, film actors, artists, Academy Award winners
  Ecstasy of Gold: Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
  Dollars Trilogy: spaghetti westerns, film trilogies, movies, artifacts
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
KB+Stats
Mention-Entity Graph: weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Links shown for coherence:
  Eli Wallach: http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_Award
  Ecstasy of Gold: http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone
  Dollars Trilogy: http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/For_a_Few_Dollars_More, http://.../wiki/Ennio_Morricone
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
KB+Stats
Mention-Entity Graph: weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Anchor words shown for coherence:
  Eli Wallach: The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin
  Ecstasy of Gold: Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition
  Dollars Trilogy: For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)
KB+Stats
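The three families of edge weights named on these slides (popularity, similarity, coherence) can each be sketched with a small function. This is a rough illustration only: the function names and the anchor counts are hypothetical, cosine is just one of the similarity options listed, and Jaccard overlap is used here as a simple stand-in for "overlap(links)" rather than as the measure used in the talk.

```python
import math
from collections import Counter


def prior(anchor_counts, mention, entity):
    """Popularity: fraction of occurrences of `mention` that refer to `entity`."""
    counts = anchor_counts.get(mention, {})
    total = sum(counts.values())
    return counts.get(entity, 0) / total if total else 0.0


def cosine(words_a, words_b):
    """Similarity: cosine between two bag-of-words contexts."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def link_overlap(links_a, links_b):
    """Coherence: Jaccard overlap of the link sets of two entities."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical anchor-text counts; the real statistics would come from the KB.
ANCHOR_COUNTS = {"Eli": {"Eli_Wallach": 3, "Eli_(bible)": 1}}

mention_context = "role in the ecstasy scene of the western films".split()
entity_context = "actor in western films the good the bad and the ugly".split()

print(prior(ANCHOR_COUNTS, "Eli", "Eli_Wallach"))
print(round(cosine(mention_context, entity_context), 3))
print(link_overlap(
    {"Dollars_Trilogy", "Clint_Eastwood", "The_Good,_the_Bad,_the_Ugly"},
    {"Sergio_Leone", "Ennio_Morricone", "The_Good,_the_Bad,_the_Ugly"},
))
```

In the graph, the first two scores would weight mention-entity edges and the third would weight entity-entity edges.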
Joint Mapping [Figure: mention-entity graph with numeric edge weights] • Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB • Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)
Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: mention-entity graph with edge weights and weighted degrees of entity nodes] • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search
Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: the same graph after a greedy removal step, with updated weighted degrees]
Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: the same graph after further greedy removal steps, with updated weighted degrees]
Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: the remaining dense subgraph, with one entity per mention]
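The greedy approximation named on the Coherence Graph Algorithm slides (iteratively remove the weakest entity and its edges, never removing a mention's last candidate) can be sketched as follows. This is a simplified illustration, not the published algorithm: it omits keeping alternative solutions and the local/randomized search, and the function name, tie-breaking rule, and all edge weights are assumptions made for the example.

```python
def greedy_disambiguate(me_edges, ee_edges):
    """me_edges: {(mention, entity): weight}; ee_edges: {(entity, entity): weight}."""
    candidates = {}
    for mention, entity in me_edges:
        candidates.setdefault(mention, set()).add(entity)
    alive = {e for entities in candidates.values() for e in entities}

    def weighted_degree(entity):
        # Sum of mention edges plus coherence edges to entities still in the graph.
        degree = sum(w for (_, e), w in me_edges.items() if e == entity)
        degree += sum(
            w for (a, b), w in ee_edges.items()
            if (a == entity and b in alive) or (b == entity and a in alive)
        )
        return degree

    while True:
        # An entity may go only if no mention would lose its last candidate.
        removable = [
            e for e in alive
            if all(len(es) > 1 for es in candidates.values() if e in es)
        ]
        if not removable:
            break
        weakest = min(removable, key=lambda e: (weighted_degree(e), e))
        alive.discard(weakest)
        for entities in candidates.values():
            entities.discard(weakest)

    # If a mention still has several candidates, prefer the strongest one.
    return {
        mention: max(entities, key=lambda e: (weighted_degree(e), e))
        for mention, entities in candidates.items()
    }


# Hypothetical weights for part of the running example.
ME_EDGES = {
    ("Eli", "Eli_(bible)"): 50, ("Eli", "Eli_Wallach"): 30,
    ("Ecstasy", "Ecstasy_(drug)"): 50, ("Ecstasy", "Ecstasy_of_Gold"): 20,
    ("trilogy", "Star_Wars_Trilogy"): 30, ("trilogy", "Dollars_Trilogy"): 20,
}
EE_EDGES = {
    ("Eli_Wallach", "Ecstasy_of_Gold"): 90,
    ("Eli_Wallach", "Dollars_Trilogy"): 100,
    ("Ecstasy_of_Gold", "Dollars_Trilogy"): 90,
    ("Ecstasy_(drug)", "Star_Wars_Trilogy"): 5,
}

print(greedy_disambiguate(ME_EDGES, EE_EDGES))
```

With these weights, each mention's individually most popular candidate (the bible figure, the drug, Star Wars) loses to the mutually coherent set, which is the effect the coherence edges are meant to produce.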
Named-Entity Disambiguation: State-of-the-Art
• Literature:
  • Razvan Bunescu, Marius Pasca: EACL 2006
  • Silviu Cucerzan: EMNLP 2007
  • David Milne, Ian Witten: CIKM 2008
  • S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
  • G. Limaye, S. Sarawagi, S. Chakrabarti: VLDB 2010
  • Paolo Ferragina, Ugo Scaiella: CIKM 2010
  • Mark Dredze et al.: COLING 2010
  • Johannes Hoffart et al.: EMNLP 2011
  • etc.
• Online tools:
  • https://d5gate.ag5.mpi-sb.mpg.de/webaida/
  • http://tagme.di.unipi.it/
  • http://spotlight.dbpedia.org/demo/index.html
  • http://viewer.opencalais.com/
  • http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
  • etc.
NED: Experimental Evaluation
• Benchmark:
  • Extended CoNLL 2003 dataset: 1400 newswire articles
  • originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase
  • difficult texts:
    … Australia beats India … → Australian_Cricket_Team
    … White House talks to Kremlin … → President_of_the_USA
    … EDS made a contract with … → HP_Enterprise_Services
• Results:
  • Best: AIDA method with prior+sim+coh + robustness test
  • 82% precision @100% recall, 87% mean average precision
  • Comparison to other methods: see paper
J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/
AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/
Interesting Research Issues
• More efficient graph algorithms (multicore, etc.)
• Allow mentions of unknown entities, mapped to null
• Leverage deep-parsing structures, leverage semantic types
• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts
• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)
Why Named Entity Disambiguation is Key • Linked data is best if it has many good links • New & rich contents mostly in traditional Web • Create sameAs links in (X)HTML contents, via RDFa • Links for named entities give best mileage/effort • Methods & tools greatly advanced & gradually maturing • Keep human in the loop, embed NED in authoring tools
Outline • Explain Title • Why More Triples: Dynamics, Linkage, Ubiquity • Linkage & Ubiquity: Named-Entity Disambiguation • Web-Scale Linkage • Wrap-up
Variants of NED at Web Scale
Tools can map short text onto entities in a few seconds.
• How to run this on a big batch of 1 million input texts?
  • partition inputs across distributed machines, organize dictionary appropriately, …
  • exploit cross-document contexts
• How to deal with inputs from different time epochs?
  • consider time-dependent contexts, map to entities of proper epoch (e.g. harvested from Wikipedia history)
• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)
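The first step mentioned on this slide, partitioning inputs across distributed machines, can be illustrated with a stable hash of the document identifier. This is only one possible scheme and an assumption on my part; the function names and the `doc-N` identifiers are hypothetical, and the sketch says nothing about dictionary organization or cross-document contexts.

```python
import hashlib


def worker_for(doc_id, num_workers):
    """Map a document id to a worker index using a hash that is stable across runs."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


def partition(doc_ids, num_workers):
    """Split document ids into one bucket per worker."""
    buckets = [[] for _ in range(num_workers)]
    for doc_id in doc_ids:
        buckets[worker_for(doc_id, num_workers)].append(doc_id)
    return buckets


# Hypothetical document ids standing in for a large batch of input texts.
docs = [f"doc-{i}" for i in range(1000)]
buckets = partition(docs, 8)
print([len(bucket) for bucket in buckets])
```

Hashing by document id spreads the load but scatters related documents; exploiting cross-document contexts would instead call for partitioning by source, topic, or time.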
Linked RDF Triples on the Web yago/wordnet:Artist109812338 rdfs:subClassOf rdfs:subClassOf yago/wordnet:Actor109765278 rdf:type yago/wikicategory:ItalianComposer rdf:type imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone prop:actedIn prop:composedMusicFor imdb.com/title/tt0361748/ dbpprop:citizenOf dbpedia.org/resource/Rome ? ? owl:sameAs owl:sameAs rdf.freebase.com/ns/en.rome_ny data.nytimes.com/51688803696189142301 ? referential data quality: automatic, dynamic, high coverage! owl:sameAs geonames.org/5134301/city_of_rome Coord N 43° 12' 46'' W 75° 27' 20''
Outline • Explain Title • Why More Triples: Dynamics, Linkage, Ubiquity • Linkage & Ubiquity: Named-Entity Disambiguation • Web-Scale Linkage • Wrap-up
Summary
Linked Data is great! But it needs more triples to capture:
• Dynamics: (Deep-Web) sources, feeds, pub-sub, …? Fresh & versioned triples
• Linkage: LOD, entity mapping, user community
• Ubiquity: RDFa, entity disambiguation, authoring