Big Text: from Language (Names and Phrases) to Knowledge (Entities and Relations). Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany, http://www.mpi-inf.mpg.de/~weikum/. From Natural-Language Text to Knowledge: more knowledge, analytics, insight.
Big Text: from Language (Names and Phrases) to Knowledge (Entities and Relations). Gerhard Weikum, Max Planck Institute for Informatics, Saarbrücken, Germany, http://www.mpi-inf.mpg.de/~weikum/
From Natural-Language Text to Knowledge: more knowledge, analytics, insight. [Figure labels: knowledge acquisition, Web Contents, Knowledge, intelligent interpretation]
Web of Data & Knowledge (Linked Open Data): > 50 billion subject-predicate-object triples from > 1000 sources. [Figure labels: SUMO, BabelNet, WikiTaxonomy/WikiNet, ConceptNet5, Cyc, ReadTheWeb, TextRunner/ReVerb] http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
Web of Data & Knowledge: > 50 billion subject-predicate-object triples from > 1000 sources
• 4M entities in 250 classes; 500M facts for 6000 properties; live updates
• 600M entities in 15000 topics; 20B facts
• 10M entities in 350K classes; 120M facts for 100 relations; 100 languages; 95% accuracy
• 40M entities in 15000 topics; 1B facts for 4000 properties; core of Google Knowledge Graph
Web of Data & Knowledge: > 50 billion subject-predicate-object triples from > 1000 sources
• Bob_Dylan type songwriter
• Bob_Dylan type civil_rights_activist
• songwriter subclassOf artist
• Bob_Dylan composed Hurricane
• Hurricane isAbout Rubin_Carter
• Bob_Dylan marriedTo Sara_Lownds validDuring [Sep-1965, June-1977]
• Bob_Dylan knownAs "voice of a generation"
• Steve_Jobs "was big fan of" Bob_Dylan
• Bob_Dylan "briefly dated" Joan_Baez
Kinds of knowledge: taxonomic knowledge, factual knowledge, temporal knowledge, terminological knowledge, evidence & belief knowledge
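A minimal sketch of how such subject-predicate-object triples can be stored and queried. The triples are taken from the slide; the index layout and the helper names `build_index` and `all_types` are illustrative assumptions, not part of any particular knowledge base.

```python
from collections import defaultdict

# Subject-predicate-object triples copied from the slide.
TRIPLES = [
    ("Bob_Dylan", "type", "songwriter"),
    ("Bob_Dylan", "type", "civil_rights_activist"),
    ("songwriter", "subclassOf", "artist"),
    ("Bob_Dylan", "composed", "Hurricane"),
    ("Hurricane", "isAbout", "Rubin_Carter"),
]


def build_index(triples):
    """Index objects by (subject, predicate) for simple lookups."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def all_types(entity, index):
    """Direct types plus every superclass reachable via subclassOf."""
    found = set()
    frontier = list(index[(entity, "type")])
    while frontier:
        cls = frontier.pop()
        if cls not in found:
            found.add(cls)
            frontier.extend(index[(cls, "subclassOf")])
    return found


index = build_index(TRIPLES)
print(sorted(all_types("Bob_Dylan", index)))
```

The taxonomic triples let a query for artists find Bob_Dylan even though no triple states that type directly.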
Knowledge for Intelligent Applications
Enabling technology for:
• disambiguation in written & spoken natural language
• deep reasoning (e.g. QA to win quiz game)
• machine reading (e.g. to summarize book or corpus)
• semantic search in terms of entities & relations (not keywords & pages)
• entity-level linkage for Big Data & Big Text analytics
Use-Case: Semantic Search. Politicians who are also scientists? European composers who have won film music awards? Internet companies founded by Brazilian professors? Enzymes that inhibit HIV? Influenza drugs for teens with high blood pressure? ...
Use-Case: Question Answering
This town is known as "Sin City" & its downtown is "Glitter Gulch"
Q: Sin City? movie, graphic novel, nickname for city, …
A: Vegas? Vega? Strip? Vega (star), Suzanne Vega, Vincent Vega, Las Vegas, …; comic strip, striptease, Las Vegas Strip, …
This American city has two airports named after a war hero and a WW II battle
question classification & decomposition; knowledge back-ends
D. Ferrucci et al.: Building Watson. AI Magazine, Fall 2010. IBM Journal of R&D 56(3/4), 2012: This is Watson.
Big Text Analytics: Who Covered Whom?
1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages
in different language, country, key, … with more sales, awards, media buzz, …
  Musician          Original          Title
  Elvis Presley     Frank Sinatra     My Way
  Robbie Williams   Frank Sinatra     My Way
  Sex Pistols       Frank Sinatra     My Way
  Frank Sinatra     Claude Francois   Comme d'Habitude
  Claudia Leitte    Bruno Mars        Famo$a (Billionaire)
  ...               ...               ...
Big Text Analytics: Who Covered Whom?
1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages
in different language, country, key, … with more sales, awards, media buzz, …
  Musician          PerformedTitle
  Sex Pistols       My Way
  Frank Sinatra     My Way
  Claudia Leitte    Famo$a
  Petula Clark      Boy from Ipanema

  Name              Show
  Petula C.         Muppets
  Claudia L.        FIFA 2014

  Name              Group
  Sid Vicious       Sex Pistols
  Bono              U2

  Musician          CreatedTitle
  Francis Sinatra   My Way
  Paul Anka         My Way
  Bruno Mars        Billionaire
  Astrud Gilberto   Garota de Ipanema
Big Text Analytics: Who Covered Whom?
1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages
in different language, country, key, … with more sales, awards, media buzz, …
Big Data & Big Text: Volume, Velocity, Variety, Veracity
  Musician          PerformedTitle
  Sex Pistols       My Way
  Frank Sinatra     My Way
  Claudia Leitte    Famo$a
  Petula Clark      Boy from Ipanema

  Musician          CreatedTitle
  Francis Sinatra   My Way
  Paul Anka         My Way
  Bruno Mars        Billionaire
  Astrud Gilberto   Garota de Ipanema
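To illustrate why entity linkage matters for this question, here is a small sketch that joins the PerformedTitle and CreatedTitle tables from the slide. The alias tables `MUSICIAN_IDS` and `TITLE_IDS` are hypothetical stand-ins for the output of an entity-linkage step, and the function `covers` is an illustrative name, not an existing tool.

```python
# Rows copied from the two tables on the slide.
PERFORMED = [
    ("Sex Pistols", "My Way"),
    ("Frank Sinatra", "My Way"),
    ("Claudia Leitte", "Famo$a"),
    ("Petula Clark", "Boy from Ipanema"),
]
CREATED = [
    ("Francis Sinatra", "My Way"),
    ("Paul Anka", "My Way"),
    ("Bruno Mars", "Billionaire"),
    ("Astrud Gilberto", "Garota de Ipanema"),
]

# Hypothetical alias tables standing in for the result of entity linkage.
MUSICIAN_IDS = {"Francis Sinatra": "Frank Sinatra"}
TITLE_IDS = {"Famo$a": "Billionaire", "Boy from Ipanema": "Garota de Ipanema"}


def covers(performed, created, musician_ids=None, title_ids=None):
    """Return (performer, creator) pairs that share a title after canonicalization."""
    musician_ids = musician_ids or {}
    title_ids = title_ids or {}
    creators = {}
    for musician, title in created:
        key = title_ids.get(title, title)
        creators.setdefault(key, set()).add(musician_ids.get(musician, musician))
    pairs = set()
    for musician, title in performed:
        performer = musician_ids.get(musician, musician)
        for creator in creators.get(title_ids.get(title, title), ()):
            if creator != performer:
                pairs.add((performer, creator))
    return pairs


naive = covers(PERFORMED, CREATED)                            # plain string join
linked = covers(PERFORMED, CREATED, MUSICIAN_IDS, TITLE_IDS)  # join on linked entities
print(sorted(naive))
print(sorted(linked))
```

The plain string join reports Frank Sinatra as covering "Francis Sinatra" and misses the Famo$a and Ipanema rows. The linked join drops the spurious pair and finds both missing ones.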
Big Data & Big Text Analytics
Entertainment: Who covered which other singer? Who influenced which other musicians?
Health: Drugs (combinations) and their side effects
Politics: Politicians' positions on controversial topics and their involvement with industry
Business: Customer opinions on small-company products, gathered from social media
Culturomics: Trends in society, cultural factors, etc.
General Design Pattern:
• Identify relevant content sources
• Identify entities of interest & their relationships
• Position in time & space
• Group and aggregate
• Find insightful patterns & predict trends
Outline: Introduction, Lovely NERD, The New Chocolate, The Dark Side, Conclusion
Named Entity Recognition & Disambiguation (NERD)
Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington.
• contextual similarity: mention vs. entity (bag-of-words, language model)
• prior popularity of name-entity pairs
Named Entity Recognition & Disambiguation (NERD)
Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington.
Coherence of entity pairs:
• semantic relationships
• shared types (categories)
• overlap of Wikipedia links
Named Entity Recognition & Disambiguation
Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington.
Entity-specific keyphrases:
• racism protest song, boxing champion, wrong conviction
• racism victim, middleweight boxing, nickname Hurricane, falsely convicted
• Grammy Award winner, protest song writer, film music composer, civil rights advocate
• Academy Award winner, African-American actor, Cry for Freedom film, Hurricane film
Coherence: (partial) overlap of (statistically weighted) entity-specific keyphrases
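One simple way to turn "partial overlap of statistically weighted keyphrases" into a number is a weighted Jaccard measure over the words of the keyphrases. This is a sketch of that idea, not the measure used in the cited systems. The keyphrases come from the slide; the entity names attached to each group and the IDF-style weighting are assumptions made for illustration.

```python
import math

# Keyphrases from the slide; the entity label for each group is an assumption.
KEYPHRASES = {
    "Hurricane (song)": ["racism protest song", "boxing champion", "wrong conviction"],
    "Rubin Carter": ["racism victim", "middleweight boxing",
                     "nickname Hurricane", "falsely convicted"],
    "Bob Dylan": ["Grammy Award winner", "protest song writer",
                  "film music composer", "civil rights advocate"],
    "Denzel Washington": ["Academy Award winner", "African-American actor",
                          "Cry for Freedom film", "Hurricane film"],
}


def words(phrases):
    """Lower-cased set of words, so that phrases can overlap partially."""
    return {w.lower() for p in phrases for w in p.split()}


def idf_weights(groups):
    """Words that occur in fewer keyphrase groups get a higher weight."""
    n = len(groups)
    df = {}
    for phrases in groups.values():
        for w in words(phrases):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(1 + n / count) for w, count in df.items()}


def coherence(a, b, weights):
    """Weighted Jaccard overlap between the keyphrase words of two entities."""
    wa, wb = words(KEYPHRASES[a]), words(KEYPHRASES[b])
    union = wa | wb
    if not union:
        return 0.0
    shared = sum(weights[w] for w in wa & wb)
    return shared / sum(weights[w] for w in union)


WEIGHTS = idf_weights(KEYPHRASES)
for other in ("Rubin Carter", "Bob Dylan", "Denzel Washington"):
    print(other, round(coherence("Hurricane (song)", other, WEIGHTS), 3))
```

Splitting phrases into words is what makes the overlap partial: "racism protest song" and "racism victim" share one word even though the phrases differ.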
Named Entity Recognition & Disambiguation
Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington.
KB provides building blocks:
• name-entity dictionary,
• relationships, types,
• text descriptions, keyphrases,
• statistics for weights
NED algorithms compute mention-to-entity mapping over weighted graph of candidates by popularity & similarity & coherence
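The popularity and similarity signals can be combined per mention as a weighted sum, which is sketched below; coherence is left out here because it needs a joint decision over all mentions. All priors and keywords are hypothetical toy values, and the mixing weight `alpha` is an assumed parameter, not one from the cited work.

```python
import re

# Hypothetical KB excerpt: prior popularity of name-entity pairs and keywords
# describing each candidate entity (illustrative values only).
PRIOR = {"Carter": {"Jimmy Carter": 0.9, "Rubin Carter": 0.1}}
KEYWORDS = {
    "Jimmy Carter": {"president", "georgia", "election", "nobel"},
    "Rubin Carter": {"boxing", "hurricane", "convicted", "film", "washington"},
}

TEXT = ("Hurricane, about Carter, is on Bob's Desire. "
        "It is played in the film with Washington.")


def tokenize(text):
    """Bag of lower-cased words in the mention context."""
    return set(re.findall(r"[a-z]+", text.lower()))


def similarity(context, entity):
    """Fraction of the entity's keywords that appear in the context."""
    keywords = KEYWORDS[entity]
    return len(context & keywords) / len(keywords)


def disambiguate(mention, text, alpha=0.3):
    """Pick the candidate with the best mix of prior and context similarity."""
    context = tokenize(text)
    scores = {
        entity: alpha * prior + (1 - alpha) * similarity(context, entity)
        for entity, prior in PRIOR[mention].items()
    }
    return max(scores, key=scores.get), scores


best, scores = disambiguate("Carter", TEXT)
print(best, scores)
```

With `alpha=1.0` only the prior counts and the more popular candidate wins; lowering `alpha` lets the context words override it.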
Joint Mapping
[Figure: weighted graph connecting mentions m1 to m4 with candidate entities e1 to e6]
• Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
• Compute high-likelihood mapping (ML or MAP) or dense subgraph (with high total edge weight) such that each m is connected to exactly one e (or at most one e)
Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11, VLDB'12]
[Figure: weighted graph connecting mentions m1 to m4 with candidate entities e1 to e6, annotated with weighted degrees]
• Compute dense subgraph to maximize min weighted degree among entity nodes such that each m is connected to exactly one e (or at most one e)
• Approx. algorithms (greedy, randomized, …), hash sketches, …
• 82% precision on CoNLL'03 benchmark
• Open-source software & online service AIDA: http://www.mpi-inf.mpg.de/yago-naga/aida/
D5 Overview, May 14, 2013
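A simplified greedy sketch of the dense-subgraph idea: repeatedly drop the candidate entity with the smallest weighted degree, but never the last remaining candidate of a mention. This is an illustration under stated assumptions, not the published algorithm, which includes further steps. The graph, all edge weights, and the function names are hypothetical.

```python
def weighted_degree(entity, alive, me_edges, ee_edges):
    """Mention-entity weights plus weights to entities still in the graph."""
    degree = sum(w for (_, e), w in me_edges.items() if e == entity)
    for (a, b), w in ee_edges.items():
        if (a == entity and b in alive) or (b == entity and a in alive):
            degree += w
    return degree


def greedy_mapping(candidates, me_edges, ee_edges):
    """Shrink the candidate graph until no further entity can be removed."""
    active = {m: set(es) for m, es in candidates.items()}
    while True:
        alive = set().union(*active.values())
        # Entities that are the only candidate left for some mention must stay.
        sole = {e for es in active.values() if len(es) == 1 for e in es}
        removable = {e for es in active.values() if len(es) > 1 for e in es} - sole
        if not removable:
            break
        victim = min(sorted(removable),
                     key=lambda e: weighted_degree(e, alive, me_edges, ee_edges))
        for es in active.values():
            es.discard(victim)
    alive = set().union(*active.values())
    return {m: max(sorted(es),
                   key=lambda e: weighted_degree(e, alive, me_edges, ee_edges))
            for m, es in active.items()}


# Hypothetical toy graph: mention-entity weights mix popularity and similarity,
# entity-entity weights express coherence.
CANDIDATES = {
    "Hurricane": ["Hurricane_(song)", "Hurricane_Katrina"],
    "Carter": ["Rubin_Carter", "Jimmy_Carter"],
    "Washington": ["Denzel_Washington", "Washington_DC"],
}
ME_EDGES = {
    ("Hurricane", "Hurricane_(song)"): 30, ("Hurricane", "Hurricane_Katrina"): 50,
    ("Carter", "Rubin_Carter"): 20, ("Carter", "Jimmy_Carter"): 60,
    ("Washington", "Denzel_Washington"): 30, ("Washington", "Washington_DC"): 70,
}
EE_EDGES = {
    ("Hurricane_(song)", "Rubin_Carter"): 90,
    ("Rubin_Carter", "Denzel_Washington"): 80,
    ("Hurricane_(song)", "Denzel_Washington"): 40,
    ("Jimmy_Carter", "Washington_DC"): 50,
}

mapping = greedy_mapping(CANDIDATES, ME_EDGES, EE_EDGES)
print(mapping)
```

In this toy graph each mention's most popular candidate has the higher mention-entity weight, yet the coherent trio survives because its entity-entity edges keep its weighted degrees high.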
NERD Online Tools
• J. Hoffart et al.: EMNLP 2011, VLDB 2011, https://d5gate.ag5.mpi-sb.mpg.de/webaida/
• P. Ferragina, U. Scaiella: CIKM 2010, http://tagme.di.unipi.it/
• R. Isele, C. Bizer: VLDB 2012, http://spotlight.dbpedia.org/demo/index.html
• D. Milne, I. Witten: CIKM 2008, http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/
• L. Ratinov, D. Roth, D. Downey, M. Anderson: ACL 2011, http://cogcomp.cs.illinois.edu/page/demo_view/Wikifier
• Reuters Open Calais: http://viewer.opencalais.com/
• Alchemy API: http://www.alchemyapi.com/api/demo.html
NERD at Work https://gate.d5.mpi-inf.mpg.de/webaida/
Entity Matching in Structured Data
Variety & Veracity!
  Hurricane            Dylan
  Like a Hurricane     Young
  Hurricane            Everette
  Hurricane Katrina    New Orleans    2005
  Hurricane Sandy      New York       2012
  ...
?
  Hurricane            1975
  Forever Young        1972
  Like a Hurricane     1975
  ...
?
  Dylan     Bob       1941
  Thomas    Dylan     Swansea    1914
  Young     Brigham   1801
  Young     Neil      Toronto    1945
  Denny     Sandy     London     1947
Entity linkage:
• key to data integration
• long-standing problem, very difficult, unsolved
H.L. Dunn: Record Linkage. American Journal of Public Health 36 (12), 1946
H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science 130 (3381), 1959
Entity Matching in Structured Data. [Figure: records e1 to e3 and f1 to f4] sameAs linking: similarity of contexts
Entity Matching in Structured Data. [Figure: records e1 to e3, f1 to f4, and g1 to g3] sameAs linking: similarity of contexts & coherence of neighborhoods & constraints (transitivity etc.); joint inference over (probabilistic) graph!
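The transitivity constraint on sameAs links can be sketched with a union-find structure: pairwise matches above a threshold are merged, and the merge propagates to records that never matched directly. This is a deterministic toy, not the probabilistic joint inference named on the slide. The name variants appear elsewhere in this deck; the token-overlap similarity and the threshold of 0.3 are assumptions standing in for real context similarity.

```python
import re
from itertools import combinations

# Name variants that appear in the tables of this deck.
NAMES = ["Frank Sinatra", "Francis Sinatra", "Sinatra",
         "Claudia Leitte", "Claudia L.", "C. Leitte"]


def tokens(name):
    return set(re.findall(r"[a-z]+", name.lower()))


def similarity(a, b):
    """Jaccard overlap of name tokens (a crude stand-in for context similarity)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def same_as_clusters(names, threshold=0.3):
    """Merge every pair above the threshold; transitivity follows from union-find."""
    parent = {n: n for n in names}
    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for n in names:
        clusters.setdefault(find(parent, n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


clusters = same_as_clusters(NAMES)
print(clusters)
```

"Claudia L." and "C. Leitte" share no token, so they end up in the same cluster only through their common match with "Claudia Leitte".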
Linking Big Data & Big Text
  Musician       Song          Year    Listeners    Charts    ...
  Sinatra        My Way        1969    435 420
  Sex Pistols    My Way        1978    87 729
  Pavarotti      My Way        1993    4 239
  C. Leitte      Famo$a        2011    272 468
  B. Mars        Billionaire   2010    218 116
  ...
Research Challenges & Opportunities
• Efficient interactive & high-throughput batch NERD: a day's news, a month's publications, a decade's archive
• Entity name disambiguation in difficult situations: short and noisy texts about long-tail entities in social media
• Handling long-tail and emerging entities to complement and continuously update KB: key for KB life-cycle management
• Web-scale entity linkage with high quality across text sources, linked data, KBs, Web tables, …
Outline: Introduction, Lovely NERD, The New Chocolate, The Dark Side, Conclusion
Semantic Search over News https://stics.mpi-inf.mpg.de
Entity Analytics over News https://stics.mpi-inf.mpg.de
Machine Reading of Scholarly Papers https://gate.d5.mpi-inf.mpg.de/knowlife/
Machine Reading of Health Forums https://gate.d5.mpi-inf.mpg.de/knowlife/ [P. Ernst et al.: ICDE'14]
Big Data & Text Analytics: Side Effects of Drug Combinations
Deeper insight from both expert data & social media:
• actual side effects of drugs … and drug combinations
• risk factors and complications of (wide-spread) diseases
• alternative therapies
• aggregation & comparison by age, gender, life style, etc.
Structured Expert Data, Social Media: http://www.patient.co.uk, http://dailymed.nlm.nih.gov
Credibility of Statements in Health Communities [S. Mukherjee et al.: KDD'14]
I took the whole med cocktail at once. Xanax gave me wild hallucinations and a demonic feel.
Xanax made me dizzy and sleepless.
Xanax and Prozac are known to cause drowsiness.
[Figure labels: p1, p2, p3, u1, u2, u3, s1, s2; Language Objectivity, User Trustworthiness, Statement Credibility]
joint reasoning with probabilistic graphical model
Machine Reading: from Names and Phrases to Entities, Classes, and Relations
The Maestro from Rome wrote scores for westerns. Ma played his version of the Ecstasy.
[Figure labels: Maestro Card, Rome (Italy), Jack Ma, MDMA, Leonard Bernstein, AS Roma, Yo-Yo Ma, l'Estasi dell'Oro, Lazio Roma, Ennio Morricone]
[Figure labels: plays sport, western movie, goal in football, cover of, born in, Western Digital, plays music, story about, film music, plays for]
Paraphrases of Relations
composed: musician song; covered: musician song
Dylan wrote a sad song Knockin' on Heaven's Door, a cover song by the Dead
Morricone's masterpiece is the Ecstasy of Gold, covered by Yo-Yo Ma
Amy's souly interpretation of Cupid, a classic piece of Sam Cooke
Nina Simone's singing of Don't Explain revived Holiday's old song
Cat Power's voice is haunting in her version of Don't Explain
Cale performed Hallelujah written by L. Cohen
• SOL patterns over words, wildcards, POS tags, semantic types: <musician> wrote ADJ piece <song>. Sequence Mining with Type Lifting (N. Nakashole et al.: EMNLP'12, ACL'13, VLDB'12)
• Relational phrases are typed: <singer> covered <song>; <book> covered <event>
• Relational synsets (and subsumptions): covered: cover song, interpretation of, singing of, voice in version, …; composed: wrote, classic piece of, 's old song, written by, composed, …
350 000 SOL patterns from Wikipedia: http://www.mpi-inf.mpg.de/yago-naga/patty/
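A rough sketch of how relational synsets can map surface phrases to relations. The synset phrases and example sentences are from the slide. Writing "voice in version" as `voice * version`, letting the wildcard span one to four words, and matching with regular expressions are all assumptions for illustration; POS tags and semantic types are not modeled here.

```python
import re

# Synset phrases from the slide; "*" marks a wildcard gap (assumed notation).
SYNSETS = {
    "covered": ["cover song", "interpretation of", "singing of", "voice * version"],
    "composed": ["wrote", "classic piece of", "'s old song", "written by", "composed"],
}

# Assumption: a wildcard stands for one to four arbitrary words.
WILDCARD = r"\S+(?:\s+\S+){0,3}"


def compile_pattern(pattern):
    """Turn a phrase pattern into a regex over whitespace-separated words."""
    parts = [WILDCARD if tok == "*" else re.escape(tok) for tok in pattern.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


COMPILED = {rel: [compile_pattern(p) for p in pats] for rel, pats in SYNSETS.items()}


def relations_in(sentence):
    """Relations whose synset contains a phrase that occurs in the sentence."""
    return {rel for rel, pats in COMPILED.items()
            if any(p.search(sentence) for p in pats)}


SENTENCES = [
    "Cale performed Hallelujah written by L. Cohen",
    "Cat Power's voice is haunting in her version of Don't Explain",
    "Nina Simone's singing of Don't Explain revived Holiday's old song",
]
for s in SENTENCES:
    print(sorted(relations_in(s)), "<-", s)
```

A fuller version would also check the argument types on both sides of the phrase, which is what separates `<singer> covered <song>` from `<book> covered <event>`.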
Disambiguation for Entities, Classes & Relations (M. Yahya et al.: EMNLP'12, CIKM'13)
Candidates per phrase (e: entity, c: class, r: relation), connected by weighted edges (coherence, similarity, etc.):
• Maestro: e: Maestro Card, e: Ennio Morricone, c: conductor, c: musician
• from: r: actedIn, r: bornIn
• Rome: e: Rome (Italy), e: Lazio Roma
• wrote scores: r: composed, r: giveExam
• scores for: c: soundtrack, r: soundtrackFor, r: shootsGoalFor
• westerns: c: western movie, e: Western Digital
Combinatorial Optimization by ILP (with type constraints etc.)
ILP optimizers like Gurobi solve this in seconds
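For a single subject-relation-object phrase triple, the effect of type constraints can be shown by brute-force enumeration, which is what this sketch does in place of an ILP solver. The candidate names come from the slide; the weights, entity types, relation signatures, and the function `best_interpretation` are hypothetical.

```python
from itertools import product

# Candidates per phrase with hypothetical local weights (popularity, similarity).
CANDIDATES = {
    "Maestro": {"Maestro Card": 0.6, "Ennio Morricone": 0.4},
    "from": {"actedIn": 0.6, "bornIn": 0.4},
    "Rome": {"Rome (Italy)": 0.7, "Lazio Roma": 0.3},
}

# Hypothetical types and relation signatures used as hard constraints.
ENTITY_TYPE = {
    "Maestro Card": "product",
    "Ennio Morricone": "person",
    "Rome (Italy)": "location",
    "Lazio Roma": "sports_team",
}
SIGNATURE = {"actedIn": ("person", "movie"), "bornIn": ("person", "location")}


def best_interpretation(subj_phrase, rel_phrase, obj_phrase):
    """Enumerate all assignments and keep the best one that is type-consistent."""
    best, best_score = None, float("-inf")
    for s, r, o in product(CANDIDATES[subj_phrase],
                           CANDIDATES[rel_phrase],
                           CANDIDATES[obj_phrase]):
        if (ENTITY_TYPE[s], ENTITY_TYPE[o]) != SIGNATURE[r]:
            continue  # violates the relation's type signature
        score = (CANDIDATES[subj_phrase][s]
                 + CANDIDATES[rel_phrase][r]
                 + CANDIDATES[obj_phrase][o])
        if score > best_score:
            best, best_score = (s, r, o), score
    return best


print(best_interpretation("Maestro", "from", "Rome"))
```

Enumeration grows exponentially with the number of phrases, which is why realistic inputs are handed to an ILP solver instead.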
Outline: Introduction, Lovely NERD, The New Chocolate, The Dark Side, Conclusion
The Dark Side of Big Data: Nobody interested in your research? We read your papers!