
Big Text: from Language to Knowledge

Gerhard Weikum, Max Planck Institute for Informatics & Saarland University, Saarbrücken, Germany. http://www.mpi-inf.mpg.de/~weikum/

Presentation Transcript


  1. Big Text: from Language to Knowledge. Gerhard Weikum, Max Planck Institute for Informatics & Saarland University, Saarbrücken, Germany. http://www.mpi-inf.mpg.de/~weikum/

  2. From Natural-Language Text to Knowledge. Diagram labels: Web Contents; Knowledge; knowledge acquisition; intelligent interpretation; more knowledge, analytics, insight.

  3. Web of Data & Knowledge (Linked Open Data): > 50 billion subject-predicate-object triples from > 1000 sources. Projects shown: SUMO, BabelNet, WikiTaxonomy/WikiNet, ConceptNet5, Cyc, ReadTheWeb, TextRunner/ReVerb. Image: http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

  4. Web of Data & Knowledge: > 50 billion subject-predicate-object triples from > 1000 sources.
     • 4M entities in 250 classes; 500M facts for 6000 properties; live updates
     • 600M entities in 15000 topics; 20B facts
     • 10M entities in 350K classes; 120M facts for 100 relations; 100 languages; 95% accuracy
     • 40M entities in 15000 topics; 1B facts for 4000 properties; core of Google Knowledge Graph

  5. Web of Data & Knowledge: > 50 billion subject-predicate-object triples from > 1000 sources. Example triples:
     • Bob_Dylan type songwriter
     • Bob_Dylan type civil_rights_activist
     • songwriter subclassOf artist
     • Bob_Dylan composed Hurricane
     • Hurricane isAbout Rubin_Carter
     • Bob_Dylan marriedTo Sara_Lownds validDuring [Sep-1965, June-1977]
     • Bob_Dylan knownAs "voice of a generation"
     • Steve_Jobs "was big fan of" Bob_Dylan
     • Bob_Dylan "briefly dated" Joan_Baez
     Kinds of knowledge: taxonomic knowledge, factual knowledge, temporal knowledge, terminological knowledge, evidence & belief knowledge.
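The subject-predicate-object triples on this slide can be held in a very small data structure. The following is a minimal sketch, not the storage format of any particular knowledge base: the `Triple` tuple and the `types_of` helper are hypothetical names introduced here, and the data is limited to triples shown on the slide.

```python
from collections import namedtuple

# Hypothetical container for one subject-predicate-object statement.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Triples copied from the slide.
triples = [
    Triple("Bob_Dylan", "type", "songwriter"),
    Triple("Bob_Dylan", "type", "civil_rights_activist"),
    Triple("songwriter", "subclassOf", "artist"),
    Triple("Bob_Dylan", "composed", "Hurricane"),
    Triple("Hurricane", "isAbout", "Rubin_Carter"),
]


def types_of(entity, triples):
    """Return the entity's direct types plus all superclasses via subclassOf."""
    direct = {t.obj for t in triples
              if t.subject == entity and t.predicate == "type"}
    result = set(direct)
    frontier = list(direct)
    while frontier:
        cls = frontier.pop()
        for t in triples:
            if (t.subject == cls and t.predicate == "subclassOf"
                    and t.obj not in result):
                result.add(t.obj)
                frontier.append(t.obj)
    return result


dylan_types = types_of("Bob_Dylan", triples)
```

Combining the `type` facts with the `subclassOf` fact is what lets a query for artists also find Bob_Dylan, even though no triple states that directly.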

  6. Knowledge for Intelligent Applications. Enabling technology for:
     • disambiguation in written & spoken natural language
     • deep reasoning (e.g. QA to win quiz game)
     • machine reading (e.g. to summarize book or corpus)
     • semantic search in terms of entities & relations (not keywords & pages)
     • entity-level linkage for Big Data & Big Text analytics

  7. Big Text Analytics: Who Covered Whom? Sources: 1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages. Qualifiers: in different language, country, key, …; with more sales, awards, media buzz, …
     Musician | Original | Title
     Hannes Wader | Elvis Presley | Wooden Heart
     Elvis Presley | F. Silcher | Muss i denn
     Tote Hosen | Hannes Wader | Heute hier morgen dort
     …

  8. Big Text Analytics: Who Covered Whom? Sources: 1000s of databases, 100s of millions of Web tables, 100s of billions of Web & social media pages. Qualifiers: in different language, country, key, …; with more sales, awards, media buzz, … Big Data & Big Text: challenge Variety & Veracity.
     Musician | PerformedTitle: Hannes Wader | Wooden Heart; Hannes Wader | Heute Hier; Tote Hosen | Morgen Dort
     Name | Place: U2 | Dublin; Dagstuhl | Wadern
     Name | Group: Bono | U2; Campino | Tote Hosen
     Wadern
     Musician | CreatedTitle: Elvis | Wood Heart; F. Silcher | Muss i denn; Hans E. Wader | Heute Hier; …

  9. Big Data & Big Text Analytics
     • Entertainment: Who covered which other singer? Who influenced which other musicians?
     • Health: Drugs (combinations) and their side effects
     • Politics: Politicians' positions on controversial topics and their involvement with industry
     • Business: Customer opinions on small-company products, gathered from social media
     • Culturomics: Trends in society, cultural factors, etc.
     General Design Pattern:
     • Identify relevant content sources
     • Identify entities of interest & their relationships
     • Position in time & space
     • Group and aggregate
     • Find insightful patterns & predict trends

  10. Outline: ✓ Introduction; Lovely NERD; The New Chocolate; The Dark Side; Conclusion

  11. Lovely NERD

  12. Named Entity Recognition & Disambiguation (NERD). Example text: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington."
     • contextual similarity: mention vs. entity (bag-of-words, language model)
     • prior popularity of name-entity pairs

  13. Named Entity Recognition & Disambiguation (NERD). Example text: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington."
     • Coherence of entity pairs: semantic relationships; shared types (categories); overlap of Wikipedia links

  14. Named Entity Recognition & Disambiguation. Example text: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington." Keyphrases attached to the candidate entities:
     • racism protest song; boxing champion; wrong conviction
     • racism victim; middleweight boxing; nickname Hurricane; falsely convicted
     • Grammy Award winner; protest song writer; film music composer; civil rights advocate
     • Academy Award winner; African-American actor; Cry Freedom film; Hurricane film
     Coherence: (partial) overlap of (statistically weighted) entity-specific keyphrases

  15. Named Entity Recognition & Disambiguation. Example text: "Hurricane, about Carter, is on Bob's Desire. It is played in the film with Washington."
     • KB provides building blocks: name-entity dictionary; relationships, types; text descriptions, keyphrases; statistics for weights
     • NED algorithms compute a mention-to-entity mapping over a weighted graph of candidates by popularity & similarity & coherence
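The per-mention part of this scoring (prior popularity plus contextual similarity) can be sketched in a few lines. This is an illustrative sketch only: the candidate names, prior values, keyword lists, the weight `alpha`, and the use of Jaccard overlap as the similarity measure are all assumptions made up for this example, not values or methods taken from the talk.

```python
def jaccard(a, b):
    """Word-set overlap as a stand-in for bag-of-words similarity."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def local_score(context_words, candidate, alpha=0.3):
    """Mix prior popularity and context similarity (alpha is an assumed weight)."""
    similarity = jaccard(context_words, candidate["keywords"])
    return alpha * candidate["prior"] + (1 - alpha) * similarity


# Hypothetical candidates for the mention "Hurricane"; all numbers are invented.
candidates = {
    "Hurricane (song)": {
        "prior": 0.3,
        "keywords": ["protest", "song", "carter", "dylan", "desire"],
    },
    "Hurricane (storm)": {
        "prior": 0.7,
        "keywords": ["storm", "wind", "flood", "damage"],
    },
}

# Lower-cased words surrounding the mention in the example text.
context = ["carter", "bob", "desire", "film", "washington"]

best = max(candidates, key=lambda name: local_score(context, candidates[name]))
```

With these invented numbers the storm is the more popular reading, but the context overlap with "carter" and "desire" tips the score towards the song. Coherence between the entities chosen for different mentions is a separate, joint step that this sketch does not cover.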

  16. Joint Mapping. [Figure: weighted graph connecting mentions m1 to m4 with candidate entities e1 to e6]
     • Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB
     • Compute high-likelihood mapping (ML or MAP) or dense subgraph (with high total edge weight) such that: each m is connected to exactly one e (or at most one e)

  17. Coherence Graph Algorithm. [Figure: weighted graph connecting mentions m1 to m4 with candidate entities e1 to e6, annotated with per-entity totals]
     • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e)
     • Approx. algorithms (greedy, randomized, …), hash sketches, …
     • 82% precision on CoNLL'03 benchmark
     • Open-source software & online service AIDA: http://www.mpi-inf.mpg.de/yago-naga/aida/
     (Slide footer: D5 Overview, May 14, 2013)
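One way to read the greedy approximation mentioned on this slide is: repeatedly drop the entity node with the smallest weighted degree, as long as every mention keeps at least one candidate. The sketch below follows that reading under simplifying assumptions. It is not the AIDA implementation; the function names, the tie-breaking rule, and all edge weights are invented for illustration.

```python
def weighted_degree(entity, alive, mention_edges, entity_edges):
    """Sum of mention-entity weights plus coherence weights to surviving entities."""
    total = sum(w for (_, e), w in mention_edges.items() if e == entity)
    total += sum(w for pair, w in entity_edges.items()
                 if entity in pair and pair <= alive)
    return total


def greedy_dense_subgraph(mention_edges, entity_edges):
    """Greedily remove low-degree entities; return one entity per mention."""
    mentions = {m for (m, _) in mention_edges}
    alive = {e for (_, e) in mention_edges}

    def candidates(m):
        return {e for (mm, e) in mention_edges if mm == m and e in alive}

    while True:
        # An entity may go only if no mention would lose its last candidate.
        removable = [e for e in alive
                     if all(len(candidates(m)) > 1
                            for m in mentions if e in candidates(m))]
        if not removable:
            break
        victim = min(removable, key=lambda e: (
            weighted_degree(e, alive, mention_edges, entity_edges), e))
        alive.remove(victim)

    # If a mention still has several candidates, keep its heaviest edge.
    return {m: max(candidates(m), key=lambda e: mention_edges[(m, e)])
            for m in mentions}


# Invented weights: popularity favors the storm and the president,
# coherence favors the song and the boxer.
mention_edges = {
    ("Hurricane", "Hurricane_(song)"): 30,
    ("Hurricane", "Hurricane_(storm)"): 50,
    ("Carter", "Rubin_Carter"): 20,
    ("Carter", "Jimmy_Carter"): 60,
}
entity_edges = {
    frozenset({"Hurricane_(song)", "Rubin_Carter"}): 100,
    frozenset({"Hurricane_(storm)", "Jimmy_Carter"}): 5,
}

mapping = greedy_dense_subgraph(mention_edges, entity_edges)
```

In this toy graph the strong coherence edge between the song and Rubin Carter outweighs the higher individual popularity of the other two candidates, so both mentions end up mapped to the coherent pair.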

  18. NERD at Work https://gate.d5.mpi-inf.mpg.de/webaida/

  19. NERD at Work https://gate.d5.mpi-inf.mpg.de/webaida/

  20. NERD auf Deutsch (NERD in German)

  21. NERD on Tables

  22. Entity Matching in Structured Data: Variety & Veracity!
     Table 1: Hurricane | Dylan; Like a Hurricane | Young; Hurricane | Everette.; Hurricane Katrina | New Orleans | 2005; Hurricane Sandy | New York | 2012; …
     Table 2: Hurricane | 1975; Forever Young | 1972; Like a Hurricane | 1975; …
     Table 3: Dylan | Bob | 1941; Thomas | Dylan | Swansea | 1914; Young | Brigham | 1801; Young | Neil | Toronto | 1945; Denny | Sandy | London | 1947
     • entity linkage: key to data integration
     • long-standing problem, very difficult, unsolved
     References: H.L. Dunn: Record Linkage. American Journal of Public Health 36 (12), 1946. H.B. Newcombe et al.: Automatic Linkage of Vital Records. Science 130 (3381), 1959.

  23. Linking Big Data & Big Text
     Musician | Song | Year | Listeners | Charts
     Bob Dylan | Death is not … | 1988 | 14 218 | …
     Bob Dylan | Don't think twice | 1962 | 319 588 | …
     Bob Dylan | Make you feel … | 1997 | 72 468 | …
     Nick Cave | Death is not … | 1996 | 85 333 | …
     Kronos Q. | Don't think twice | 2012 | 679 | …
     Adele | Make you feel … | 2008 | 559 715 | …
     H. Wader | Heute hier ... | 1972 | 2 630 | …
     Tote Hosen | Heute hier … | 2012 | 6 432 | …
     …
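Once musician and song names have been resolved to the same entities, the "who covered whom" question over a table like this reduces to a group-and-compare step. The sketch below assumes that resolution has already happened (song titles are written identically) and uses the heuristic that the earliest recorded year marks the original. Both are simplifying assumptions, and the function name `who_covered_whom` is hypothetical. Truncated titles are kept as they appear on the slide.

```python
from collections import defaultdict

# (musician, song, year) rows taken from the slide, titles already normalized.
rows = [
    ("Bob Dylan", "Death is not ...", 1988),
    ("Bob Dylan", "Don't think twice", 1962),
    ("Bob Dylan", "Make you feel ...", 1997),
    ("Nick Cave", "Death is not ...", 1996),
    ("Kronos Q.", "Don't think twice", 2012),
    ("Adele", "Make you feel ...", 2008),
    ("H. Wader", "Heute hier ...", 1972),
    ("Tote Hosen", "Heute hier ...", 2012),
]


def who_covered_whom(rows):
    """Return (cover_artist, original_artist, song) triples.

    Heuristic: within each song, the earliest year is treated as the original.
    """
    by_song = defaultdict(list)
    for musician, song, year in rows:
        by_song[song].append((year, musician))

    covers = []
    for song, versions in by_song.items():
        versions.sort()  # earliest year first
        _, original = versions[0]
        for _, musician in versions[1:]:
            covers.append((musician, original, song))
    return covers


covers = who_covered_whom(rows)
```

The grouping itself is trivial; the hard part the talk points to is upstream, where "Heute hier ..." and "Heute hier …" or "H. Wader" and "Hannes Wader" must first be recognized as the same song and the same musician.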

  24. Outline: ✓ Introduction; ✓ Lovely NERD; The New Chocolate; The Dark Side; Conclusion

  25. Big Text: the New Chocolate

  26. Semantic Search over News https://stics.mpi-inf.mpg.de

  27. Semantic Search over News https://stics.mpi-inf.mpg.de

  28. Entity Analytics over News https://stics.mpi-inf.mpg.de

  29. Entity Analytics over News https://stics.mpi-inf.mpg.de

  30. Machine Reading of Scholarly Papers https://gate.d5.mpi-inf.mpg.de/knowlife/

  31. Machine Reading of Health Forums https://gate.d5.mpi-inf.mpg.de/knowlife/

  32. Big Data & Text Analytics: Side Effects of Drug Combinations. Deeper insight from both expert data & social media:
     • actual side effects of drugs … and drug combinations
     • risk factors and complications of (wide-spread) diseases
     • alternative therapies
     • aggregation & comparison by age, gender, life style, etc.
     Sources: Structured Expert Data (http://dailymed.nlm.nih.gov); Social Media (http://www.patient.co.uk)

  33. Machine Reading: from Names and Phrases to Entities, Classes, and Relations. Example text: "The Maestro from Rome wrote scores for westerns. Ma played his version of the Ecstasy."
     • Candidate entities: Maestro Card, Rome (Italy), Jack Ma, MDMA, Leonard Bernstein, AS Roma, Yo-Yo Ma, l'Estasi dell'Oro, Lazio Roma, Ennio Morricone, Western Digital
     • Candidate classes and relations: plays sport, western movie, goal in football, cover of, born in, plays music, story about, film music, plays for

  34. Disambiguation for Entities, Classes & Relations (M. Yahya et al.: EMNLP'12, CIKM'13)
     • Phrases: Maestro; from; Rome; wrote scores; scores for; westerns
     • Candidates (e = entity, c = class, r = relation): e: MaestroCard; e: Ennio Morricone; c: conductor; c: musician; r: actedIn; r: bornIn; e: Rome (Italy); e: Lazio Roma; r: composed; r: giveExam; c: soundtrack; r: soundtrackFor; r: shootsGoalFor; c: western movie; e: Western Digital
     • weighted edges (coherence, similarity, etc.)
     • Combinatorial Optimization by ILP (with type constraints etc.)
     • ILP optimizers like Gurobi solve this in seconds
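A joint disambiguation of this kind can be written as an integer linear program along the following lines. This is a generic sketch under assumptions, not the formulation from the cited papers: the symbols $C(p)$ (candidate set of phrase $p$), $w_{p,s}$ (similarity weight), $v_{s,t}$ (coherence weight), and $T(r)$ (items whose type is compatible with relation $r$) are all introduced here for illustration.

```latex
\begin{align*}
\max \quad
  & \sum_{p} \sum_{s \in C(p)} w_{p,s}\, x_{p,s}
    \;+\; \sum_{\{s,t\}} v_{s,t}\, y_{s,t} \\
\text{s.t.} \quad
  & \sum_{s \in C(p)} x_{p,s} \le 1
    && \text{each phrase $p$ maps to at most one item} \\
  & z_s \le \sum_{p \,:\, s \in C(p)} x_{p,s}
    && \text{item $s$ is selected only if some phrase maps to it} \\
  & y_{s,t} \le z_s, \quad y_{s,t} \le z_t
    && \text{coherence counts only between selected items} \\
  & z_r \le \sum_{s \in T(r)} z_s
    && \text{a selected relation $r$ needs a type-compatible argument} \\
  & x_{p,s},\; y_{s,t},\; z_s \in \{0, 1\}
\end{align*}
```

Here $x_{p,s} = 1$ means phrase $p$ is mapped to entity, class, or relation $s$. The type constraint is what would rule out, for example, pairing a relation such as shootsGoalFor with a class such as western movie.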

  35. Outline: ✓ Introduction; ✓ Lovely NERD; ✓ The New Chocolate; The Dark Side; Conclusion

  36. The Dark Side of Big Data

  37. Entity Linking: Privacy at Stake. User Zoe's activities on the Internet: search (search engine); discuss & seek help (online forum); publish & recommend (social network). Traces shown: female 25-30, Somalia; female 29y, Jamame; Levothroid shaking; Addison's disease; …; Nive concert; Greenland singers; Somalia elections; Steve Biko; Synthroid tremble; …; Addison disorder; …; Nive Nielsen; Cry Freedom.

  38. Privacy Adversaries. Linkability Threats:
     • Weak cues: profiles, friends, etc.
     • Semantic cues: health, taste, queries
     • Statistical cues: correlations
     User activities on the Internet: search (search engine); discuss & seek help (online forum); publish & recommend (social network). Traces shown: female 25-30, Somalia; female 29y, Jamame; Levothroid shaking; Addison's disease; …; Nive concert; Greenland singers; Somalia elections; Steve Biko; Synthroid tremble; …; Addison disorder; …; Nive Nielsen; Cry Freedom.

  39. Goal: Automated Privacy Advisor. Privacy Advisor (PA): software tool that
     • analyses risk
     • alerts user
     • advises user
     • explains consequences
     • recommends policy changes
     User activities on the Internet: search (search engine); discuss & seek help (online forum); publish & recommend (social network). Traces shown: female 25-30, Somalia; female 29y, Jamame; Levothroid shaking; Addison's disease; …; Nive concert; Greenland singers; Somalia elections; Synthroid tremble; …; Addison disorder; …; Nive Nielsen; Cry Freedom.
     Example alert: "Your queries may lead to linking your identities in Facebook and patient.co.uk! … Would you like to use an anonymization tool for your search requests? …"
     ERC Project imPACT (Backes/Druschel/Majumdar/Weikum)

  40. Outline: ✓ Introduction; Lovely NERD; The New Chocolate; The Dark Side; Conclusion

  41. Big Text & Big Data
     • Big Text & NERD: valuable content about entities, lifted towards knowledge & analytic insight
     • Machine Reading: discover and interpret names & phrases as entities, classes, relations, spatio-temporal modifiers, sentiments, beliefs, …
     • Big Data: interlink natural-language text, social media, structured data & knowledge bases, images, videos, and help users cope with privacy risks

  42. Take-Home Message: From Language to Knowledge. Diagram labels: Web Contents; Knowledge; knowledge acquisition; intelligent interpretation; more knowledge, analytics, insight. "Who Covered Whom?" and More! (Entities, Classes, Relations)
