Semantic Search: from Names and Phrases to Entities and Relations
Gerhard Weikum
Max Planck Institute for Informatics & Saarland University
http://www.mpi-inf.mpg.de/~weikum/

Big Picture: Opportunities Now!
• Very Large Knowledge Bases
• KB Population
• Entity Linkage
• Web of Users & Contents
• Web of Data
• Disambiguation
• Semantic Docs
• Semantic Authoring
• Info Extraction
This talk: How do we search this world of knowledge, data, and text (and cope with ambiguity)?
For knowledge harvesting, see the talks at Collège de France and at the VLDB School in Kunming.

Web of Data: RDF, Tables, Microdata
30 billion SPO triples (RDF) and growing
Knowledge bases shown: SUMO, BabelNet, ConceptNet5, WikiTaxonomy/WikiNet, Cyc, YAGO, ReadTheWeb, TextRunner/ReVerb
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

Web of Data: RDF, Tables, Microdata
30 billion SPO triples (RDF) and growing
• 4M entities in 250 classes; 500M facts for 6000 properties; live updates
• 10M entities in 350K classes; 120M facts for 100 relations; 100 languages; 95% accuracy
• 25M entities in 2000 topics; 100M facts for 4000 properties; powers Google knowledge graph
Example SPO triples:
Ennio_Morricone type composer
Ennio_Morricone type GrammyAwardWinner
composer subclassOf musician
Ennio_Morricone bornIn Rome
Rome locatedIn Italy
Ennio_Morricone created Ecstasy_of_Gold
Ennio_Morricone wroteMusicFor The_Good,_the_Bad_,and_the_Ugly
Sergio_Leone directed The_Good,_the_Bad_,and_the_Ugly
YAGO
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

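The subject-predicate-object facts on this slide can be held in a very small in-memory structure. The following is only a sketch under assumptions: the index layout and the helper names `build_index` and `types_of` are illustrative and are not taken from any of the knowledge bases mentioned. It shows how a `subclassOf` fact lets a lookup return `musician` for an entity that is only stated to be a `composer`.

```python
from collections import defaultdict

# SPO facts copied from the slide; the storage layout is an illustrative assumption.
TRIPLES = [
    ("Ennio_Morricone", "type", "composer"),
    ("Ennio_Morricone", "type", "GrammyAwardWinner"),
    ("composer", "subclassOf", "musician"),
    ("Ennio_Morricone", "bornIn", "Rome"),
    ("Rome", "locatedIn", "Italy"),
    ("Ennio_Morricone", "created", "Ecstasy_of_Gold"),
]


def build_index(triples):
    """Index objects by (subject, predicate) for direct lookups."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def types_of(entity, index):
    """Return direct types plus all superclasses reachable via subclassOf."""
    seen = set()
    stack = list(index[(entity, "type")])
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            stack.extend(index[(cls, "subclassOf")])
    return seen


INDEX = build_index(TRIPLES)
print(sorted(types_of("Ennio_Morricone", INDEX)))
```

A real triple store would add indexes for the other access paths (by predicate, by object) and persistent storage; this sketch only covers the subject-predicate lookup.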
Linked RDF Triples on the Web
500 million links
[Figure: graph of linked RDF resources]
Nodes: yago/wordnet:Artist109812338, yago/wordnet:Actor109765278, yago/wikicategory:ItalianComposer, imdb.com/name/nm0910607/, dbpedia.org/resource/Ennio_Morricone, imdb.com/title/tt0361748/, dbpedia.org/resource/Rome, rdf.freebase.com/ns/en.rome, data.nytimes.com/51688803696189142301, geonames.org/3169070/roma (Coord N 41° 54' 10'' E 12° 29' 2'')
Edge labels: rdfs:subclassOf, rdf:type, prop:actedIn, prop:composedMusicFor, dbpprop:citizenOf, owl:sameAs

Embedding (RDF) Microdata in HTML Pages
<html …
May 2, 2011
<div typeof=event:music>
  <span id="Maestro_Morricone"> Maestro Morricone
    <a rel="sameAs" resource="dbpedia/Ennio_Morricone"/>
  </span>
  …
  <span property="event:location"> Smetana Hall </span>
  …
  <span property="rdf:type" resource="yago:performance"> The concert</span> will feature …
  <span property="event:date" content="14-07-2011"></span> July 1
</div>
Supported by RDFa and microformats like schema.org
Page text: May 2, 2011 Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th.

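To illustrate how a consumer might read such annotations back out of a page, here is a minimal sketch using Python's standard `html.parser`. It is not a conforming RDFa processor: the class name `PropertyExtractor` and the simplified rules (an explicit `content` or `resource` attribute wins over the element text, and nested markup inside an annotated element is not handled) are assumptions made for this example.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, value) pairs from elements carrying a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.pairs = []     # extracted (property, value) pairs
        self._open = None   # property whose text value is still being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        # An explicit content/resource attribute overrides the visible text.
        value = attrs.get("content") or attrs.get("resource")
        if value is not None:
            self.pairs.append((prop, value.strip()))
        else:
            self._open, self._text = prop, []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None:
            self.pairs.append((self._open, "".join(self._text).strip()))
            self._open = None


# Cut-down version of the slide's markup (attribute values quoted for the parser).
SAMPLE = """
<div typeof="event:music">
  <span property="event:location"> Smetana Hall </span>
  <span property="rdf:type" resource="yago:performance"> The concert</span>
  <span property="event:date" content="14-07-2011"></span>
</div>
"""

parser = PropertyExtractor()
parser.feed(SAMPLE)
print(parser.pairs)
```

The extracted pairs would still need a subject (here the enclosing `typeof` element) before they form complete triples; that step is left out to keep the sketch short.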
Outline
• Opportunities Now
• Semantic Search Today
• Entity Name Disambiguation
• Question Answering
• Disambiguation Reloaded
• Wrap-Up

Semantic Search Today (2)
Select ?x Where {
  ?x type composer[western movie] .
  ?x wasBornIn ?y .
  ?y locatedIn Europe .
}

Semantic Search Today (2)
Select ?x Where {
  ?x type composer .
  ?x participatedIn ?y .
  ?y type western_film .
}

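A query of this shape is a conjunction of triple patterns that share variables. The sketch below evaluates such a conjunction over a handful of toy facts with naive nested-loop joins. It assumes Python 3.8 or later; the facts, and the helper names `match` and `evaluate`, are invented for illustration and do not describe how any particular SPARQL engine works.

```python
# Toy facts; the participatedIn and western_film assertions are assumed for illustration.
FACTS = {
    ("Ennio_Morricone", "type", "composer"),
    ("Ennio_Morricone", "participatedIn", "The_Good,_the_Bad_,and_the_Ugly"),
    ("The_Good,_the_Bad_,and_the_Ugly", "type", "western_film"),
    ("Sergio_Leone", "participatedIn", "The_Good,_the_Bad_,and_the_Ugly"),
}


def is_var(term):
    """Variables are written with a leading question mark, as in the slide."""
    return term.startswith("?")


def match(pattern, fact, binding):
    """Try to extend `binding` so that `pattern` equals `fact`; return None on a clash."""
    result = dict(binding)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if result.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return result


def evaluate(patterns, facts):
    """Evaluate a conjunction of triple patterns by nested-loop joins."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for fact in facts
            if (extended := match(pattern, fact, binding)) is not None
        ]
    return bindings


QUERY = [
    ("?x", "type", "composer"),
    ("?x", "participatedIn", "?y"),
    ("?y", "type", "western_film"),
]
answers = sorted({b["?x"] for b in evaluate(QUERY, FACTS)})
print(answers)
```

The query only returns an answer if the user happens to pick the exact relation and class names stored in the knowledge base, which is the ambiguity problem the talk turns to next.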
Semantic Search Today (4)
Key problem in semantic search: diversity and ambiguity of names and phrases!

Outline
• Opportunities Now
• Semantic Search Today
• Entity Name Disambiguation
• Question Answering
• Disambiguation Reloaded
• Wrap-Up

Three Different NLP Problems
Harry fought with you know who. He defeats the dark lord.
Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort
Three NLP tasks:
1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger)
2) co-reference resolution: link to preceding NP (trained classifier over linguistic features)
3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)

Named Entity Disambiguation
Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Mentions (surface names): Sergio, Ennio, Eli, Ecstasy, trilogy
Entities (meanings) in the KB: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Benny Goodman, Benny Andersson, Star Wars Trilogy, Lord of the Rings, Dollars Trilogy
Name dictionary:
• Sergio means Sergio_Leone
• Sergio means Serge_Gainsbourg
• Ennio means Ennio_Antonelli
• Ennio means Ennio_Morricone
• Eli means Eli_(bible)
• Eli means ExtremeLightInfrastructure
• Eli means Eli_Wallach
• Ecstasy means Ecstasy_(drug)
• Ecstasy means Ecstasy_of_Gold
• trilogy means Star_Wars_Trilogy
• trilogy means Lord_of_the_Rings
• trilogy means Dollars_Trilogy
• …

Mention-Entity Graph
Weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Context representation: bag-of-words or language model (words, bigrams, phrases)
Edge weights from KB+Stats:
• Popularity (m,e): freq(e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))

Mention-Entity Graph
Weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Goal: joint mapping
Edge weights from KB+Stats:
• Popularity (m,e): freq(e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))

Mention-Entity Graph
Weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Edge weights from KB+Stats:
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)

Mention-Entity Graph
Weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Semantic types of selected candidates:
• Eli Wallach: American Jews, film actors, artists, Academy Award winners
• Ecstasy of Gold: Metallica songs, Ennio Morricone songs, artifacts, soundtrack music
• Dollars Trilogy: spaghetti westerns, film trilogies, movies, artifacts
Edge weights from KB+Stats:
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)

Mention-Entity Graph
Weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Links of selected candidates:
• Eli Wallach: http://.../wiki/Dollars_Trilogy, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Clint_Eastwood, http://.../wiki/Honorary_Academy_Award
• Ecstasy of Gold: http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/Metallica, http://.../wiki/Bellagio_(casino), http://.../wiki/Ennio_Morricone
• Dollars Trilogy: http://.../wiki/Sergio_Leone, http://.../wiki/The_Good,_the_Bad,_the_Ugly, http://.../wiki/For_a_Few_Dollars_More, http://.../wiki/Ennio_Morricone
Edge weights from KB+Stats:
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)

Mention-Entity Graph
Weighted undirected graph with two types of nodes
Text: Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films.
Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy
Anchor words and keyphrases of selected candidates:
• Eli Wallach: The Magnificent Seven, The Good, the Bad, and the Ugly, Clint Eastwood, University of Texas at Austin
• Ecstasy of Gold: Metallica on Morricone tribute, Bellagio water fountain show, Yo-Yo Ma, Ennio Morricone composition
• Dollars Trilogy: For a Few Dollars More, The Good, the Bad, and the Ugly, Man with No Name trilogy, soundtrack by Ennio Morricone
Edge weights from KB+Stats:
• Popularity (m,e): freq(m,e|m), length(e), #links(e)
• Similarity (m,e): cos/Dice/KL (context(m), context(e))
• Coherence (e,e'): dist(types), overlap(links), overlap(anchor words)

Joint Mapping
[Figure: mention-entity graph with weighted edges]
• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)

Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11]
[Figure: mention-entity graph with weighted edges and weighted degrees of entity nodes]
• Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
• Greedy approximation: iteratively remove weakest entity and its edges
• Keep alternative solutions, then use local/randomized search

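The greedy step can be sketched in a few lines. This is a simplified illustration, not the published algorithm: it only performs the iterative removal of the weakest entity, it never removes the last remaining candidate of a mention, and it omits the bookkeeping of alternative solutions and the local/randomized search. All edge weights and the function names are made up for the example.

```python
def weighted_degree(entity, me_edges, ee_edges, alive):
    """Sum of mention-entity and entity-entity edge weights incident to `entity`."""
    total = sum(w for (_, e), w in me_edges.items() if e == entity)
    total += sum(w for pair, w in ee_edges.items() if entity in pair and pair <= alive)
    return total


def greedy_disambiguate(me_edges, ee_edges):
    """Drop the weakest removable entity until every mention has one candidate left."""
    candidates = {}
    for mention, entity in me_edges:
        candidates.setdefault(mention, set()).add(entity)
    alive = set().union(*candidates.values())

    def degree(e):
        # The entity name breaks ties so that the result is deterministic.
        return (weighted_degree(e, me_edges, ee_edges, alive), e)

    while True:
        # The sole remaining candidate of a mention must not be removed.
        taboo = {e for ents in candidates.values() if len(ents) == 1 for e in ents}
        removable = alive - taboo
        if not removable:
            break
        weakest = min(removable, key=degree)
        alive.discard(weakest)
        for ents in candidates.values():
            ents.discard(weakest)
    return {m: max(ents, key=degree) for m, ents in candidates.items()}


# Illustrative weights: popularity/similarity on mention-entity edges,
# coherence on entity-entity edges (keys are unordered pairs).
ME_EDGES = {
    ("Eli", "Eli_(bible)"): 30, ("Eli", "Eli_Wallach"): 20,
    ("Ecstasy", "Ecstasy_(drug)"): 50, ("Ecstasy", "Ecstasy_of_Gold"): 10,
    ("trilogy", "Star_Wars_Trilogy"): 40, ("trilogy", "Dollars_Trilogy"): 20,
}
EE_EDGES = {
    frozenset({"Eli_Wallach", "Ecstasy_of_Gold"}): 90,
    frozenset({"Eli_Wallach", "Dollars_Trilogy"}): 100,
    frozenset({"Ecstasy_of_Gold", "Dollars_Trilogy"}): 80,
}
MAPPING = greedy_disambiguate(ME_EDGES, EE_EDGES)
print(MAPPING)
```

With these weights, each mention's most popular candidate is the wrong one, yet the coherence edges keep the three mutually related entities in the graph while the isolated candidates are removed first.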
Mention-Entity Popularity Weights [Milne/Witten 2008, Spitkovsky/Chang 2012]
• Need dictionary with entities' names:
  • full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corp.
  • short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, …
  • nicknames & aliases: Terminator, City of Angels, Evil Empire, …
  • acronyms: LA, UCLA, MS, MSFT
  • role names: the Austrian action hero, Californian governor, CEO of MS, …
  • …
  plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her.
• Collect hyperlink anchor-text / link-target pairs from
  • Wikipedia redirects
  • Wikipedia links between articles
  • Interwiki links between Wikipedia editions
  • Web links pointing to Wikipedia articles
  • …
• Build statistics to estimate P[entity | name]

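One straightforward way to turn collected anchor-text / link-target pairs into P[entity | name] is a relative frequency per name. The sketch below assumes exactly that estimator, with lowercasing as the only name normalization; the sample pairs and the function name `estimate_priors` are invented for illustration, and real counts would come from the link sources listed above.

```python
from collections import Counter, defaultdict


def estimate_priors(anchor_pairs):
    """Estimate P[entity | name] as the relative link-target frequency per anchor text."""
    counts = defaultdict(Counter)
    for name, entity in anchor_pairs:
        counts[name.lower()][entity] += 1
    return {
        name: {entity: n / sum(targets.values()) for entity, n in targets.items()}
        for name, targets in counts.items()
    }


# Made-up (anchor text, link target) pairs, not real Wikipedia statistics.
PAIRS = [
    ("Arnie", "Arnold_Schwarzenegger"),
    ("Terminator", "Arnold_Schwarzenegger"),
    ("Terminator", "The_Terminator"),
    ("Terminator", "The_Terminator"),
    ("Terminator", "The_Terminator"),
    ("LA", "Los_Angeles"),
    ("LA", "Louisiana"),
]
PRIORS = estimate_priors(PAIRS)
print(PRIORS["terminator"])
```

In practice the estimate would usually be smoothed and rare anchors pruned, since a name that was seen once as an anchor otherwise gets probability 1 for its single target.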
Mention-Entity Similarity Edges
• Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in e's page with high PMI, e.g. "Metallica tribute to Ennio Morricone"
• Match keyphrase q of candidate e in context of mention m, taking into account:
  • extent of partial matches
  • weight of matched words
  Example context: The Ecstasy piece was covered by Metallica on the Morricone tribute album.
• Compute overall similarity of context(m) and candidate e

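The scoring formula itself did not survive on this slide, so the following is only a sketch of how the two named ingredients could be combined; the exact combination, including the squared weight ratio, is an assumption made here. The "extent" is approximated crudely as the number of distinct matched words divided by the span between the first and last match, and the word weights are invented.

```python
def keyphrase_score(keyphrase, context, weight):
    """Score a partial match of `keyphrase` in `context` (both lists of lowercase tokens)."""
    wanted = set(keyphrase)
    positions = [i for i, tok in enumerate(context) if tok in wanted]
    if not positions:
        return 0.0
    matched = {context[i] for i in positions}
    # Extent of the partial match: matched words relative to the span they occupy.
    span = positions[-1] - positions[0] + 1
    extent = len(matched) / span
    # Weight of matched words relative to the total weight of the keyphrase.
    total = sum(weight.get(w, 1.0) for w in wanted)
    covered = sum(weight.get(w, 1.0) for w in matched)
    return extent * (covered / total) ** 2


KEYPHRASE = "metallica tribute to ennio morricone".split()
CONTEXT = "the ecstasy piece was covered by metallica on the morricone tribute album".split()
# Hypothetical word weights (e.g. derived from PMI or IDF); unknown words default to 1.0.
WEIGHTS = {"metallica": 2.0, "morricone": 2.0, "ennio": 1.5, "tribute": 1.0, "to": 0.1}

print(keyphrase_score(KEYPHRASE, CONTEXT, WEIGHTS))
```

An overall mention-entity similarity could then be obtained by summing such scores over all keyphrases of the candidate entity, which is again one possible choice rather than the definitive one.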
Entity-Entity Coherence Edges
• Precompute overlap of incoming links for entities e1 and e2
• Alternatively compute overlap of anchor texts for e1 and e2, or overlap of keyphrases, or similarity of bag-of-words, or …
• Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances)
• For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance

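As one concrete instance of "overlap of incoming links", the sketch below uses the Jaccard index of the two in-link sets. The slide does not fix the overlap measure for links, so this choice is an assumption, and the in-link sets are hypothetical rather than real Wikipedia data.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b| of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical in-link sets (pages that link to each entity).
INLINKS = {
    "Eli_Wallach": {"The_Good,_the_Bad,_the_Ugly", "Clint_Eastwood", "Honorary_Academy_Award"},
    "Dollars_Trilogy": {
        "The_Good,_the_Bad,_the_Ugly", "Clint_Eastwood", "Sergio_Leone", "Ennio_Morricone",
    },
    "Eli_(bible)": {"Books_of_Samuel"},
}


def coherence(e1, e2, inlinks=INLINKS):
    """Coherence weight of an entity pair based on shared incoming links."""
    return jaccard(inlinks.get(e1, set()), inlinks.get(e2, set()))


print(coherence("Eli_Wallach", "Dollars_Trilogy"))
print(coherence("Eli_(bible)", "Dollars_Trilogy"))
```

The same `jaccard` helper would apply unchanged to sets of anchor texts, keyphrases, or type instances, which are the alternatives the slide lists.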
AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/
AIDA: Very Difficult Example http://www.mpi-inf.mpg.de/yago-naga/aida/
Some NED Online Tools
• J. Hoffart et al.: EMNLP 2011, VLDB 2011
  https://d5gate.ag5.mpi-sb.mpg.de/webaida/
• P. Ferragina, U. Scaiella: CIKM 2010
  http://tagme.di.unipi.it/
• R. Isele, C. Bizer: VLDB 2012
  http://spotlight.dbpedia.org/demo/index.html
• Reuters Open Calais
  http://viewer.opencalais.com/
• S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009
  http://www.cse.iitb.ac.in/soumen/doc/CSAW/
• D. Milne, I. Witten: CIKM 2008
  http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
• perhaps more
• some use the Stanford NER tagger for detecting mentions
  http://nlp.stanford.edu/software/CRF-NER.shtml

NED: Experimental Evaluation
Benchmark:
• Extended CoNLL 2003 dataset: 1400 newswire articles
• originally annotated with mention markup (NER), now with NED mappings to YAGO and Freebase
• difficult texts:
  … Australia beats India … → Australian_Cricket_Team
  … White House talks to Kremlin … → President_of_the_USA
  … EDS made a contract with … → HP_Enterprise_Services
Results:
• Best: AIDA method with prior+sim+coh + robustness test
• 82% precision @100% recall, 87% mean average precision
• Comparison to other methods: see paper
J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011
http://www.mpi-inf.mpg.de/yago-naga/aida/

Ongoing Research & Remaining Challenges
• More efficient graph algorithms (multicore, etc.)
• Allow mentions of unknown entities, mapped to null
• Leverage deep-parsing structures, leverage semantic types
  Example: Page played Kashmir on his Gibson (dependency labels: subj, obj, mod)
• Short and difficult texts:
  • tweets, headlines, etc.
  • fictional texts: novels, song lyrics, etc.
  • incoherent texts
• Structured Web data: tables and lists
• Disambiguation beyond entity names:
  • coreferences: pronouns, paraphrases, etc.
  • common nouns, verbal phrases (general WSD)

Variants of NED at Web Scale
Tools can map short text onto entities in a few seconds
• How to run this on a big batch of 1 million input texts?
  • partition inputs across distributed machines, organize dictionary appropriately, …
  • exploit cross-document contexts
• How to handle Web-scale inputs (100 million pages) restricted to a set of interesting entities? (e.g. tracking politicians and companies)

Outline
• Opportunities Now
• Semantic Search Today
• Entity Name Disambiguation
• Question Answering
• Disambiguation Reloaded
• Wrap-Up

Deep Question Answering
• William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
• This town is known as "Sin City" & its downtown is "Glitter Gulch"
• As of 2010, this is the only former Yugoslav republic in the EU
• 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
question classification & decomposition
knowledge back-ends (YAGO)
D. Ferrucci et al.: Building Watson. AI Magazine, Fall 2010.
IBM Journal of R&D 56(3/4), 2012: This is Watson.

Semantic Keyword Search [Ilyas et al. SIGMOD'10]
Need to map (groups of) keywords onto entities & relationships based on name-entity similarities/probabilities
q: composer Rome scores westerns
Candidate interpretations:
• composer: composer (creator of music), Media Composer video editor
• Rome: Rome (NY), Rome (Italy), Lazio Roma, AS Roma
• scores: goal in football, film music
• westerns: western world, western movies, Western (airline), Western Digital, Western (NY)
• relations: … plays for …, … born in …, … recorded at …, … used in …

Natural Language Questions are Natural
Who composed scores for westerns and is from Rome?
• translate question into SPARQL query:
  • dependency parsing to decompose question
  • mapping of question units onto entities, classes, relations
• map results into tabular or visual presentation or speech

From Questions to Queries
Dependency parsing exposes structure of question: "triploids" (sub-cues)
NL question: Who composed scores for westerns and is from Rome?
• Who composed scores
• is from Rome
• scores for westerns
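Once the triploids are available, a query can be assembled by mapping each phrase to a KB item. The sketch below assumes the dependency parse has already produced (subject, relation, object) phrase triples and that the phrase-to-KB lexicons are given. The lexicon entries, the KB names such as `film_music` and `usedIn`, and the function `to_query` are all hypothetical, and the hard part (choosing among ambiguous mappings) is not addressed.

```python
# Hypothetical lexicons from question phrases to KB relations, entities, and classes.
RELATIONS = {"composed": "created", "is from": "bornIn", "for": "usedIn"}
ENTITIES = {"Rome": "Rome"}
CLASSES = {"scores": "film_music", "westerns": "western_film"}


def to_query(triploids):
    """Turn (subject, relation, object) phrase triples into a SPARQL-like query string."""
    variables = {"Who": "?x"}  # the question word becomes the answer variable
    patterns = []

    def term(phrase):
        if phrase in ENTITIES:
            return ENTITIES[phrase]
        if phrase not in variables:
            variables[phrase] = f"?v{len(variables)}"
            if phrase in CLASSES:
                # Class phrases become a fresh variable constrained by a type pattern.
                patterns.append((variables[phrase], "type", CLASSES[phrase]))
        return variables[phrase]

    for subj, rel, obj in triploids:
        s, o = term(subj), term(obj)
        patterns.append((s, RELATIONS[rel], o))
    body = " . ".join(" ".join(p) for p in patterns)
    return f"SELECT ?x WHERE {{ {body} . }}"


TRIPLOIDS = [
    ("Who", "composed", "scores"),
    ("Who", "is from", "Rome"),
    ("scores", "for", "westerns"),
]
QUERY_TEXT = to_query(TRIPLOIDS)
print(QUERY_TEXT)
```

The shared phrase "scores" maps to the same variable in the first and third triploid, which is what joins the two patterns in the resulting query.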