From Information to Knowledge. Harvesting Entities and Relationships From Web Sources. Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/. Martin Theobald Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~mtb/.
Goal: Turn Web into Knowledge Base
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
• comprehensive DB of human knowledge
• everything that Wikipedia knows
• everything machine-readable
• capturing entities, classes, relationships
Approach: Harvesting Facts from Web

Politician | Position
Angela Merkel | Chancellor Germany
Karl-Theodor zu Guttenberg | Minister of Defense Germany
Christoph Hartmann | Minister of Economy Saarland
…

Actor | Award
Christoph Waltz | Oscar
Sandra Bullock | Oscar
Sandra Bullock | Golden Raspberry
…

Politician | Political Party
Angela Merkel | CDU
Karl-Theodor zu Guttenberg | CDU
Christoph Hartmann | FDP
…

Company | CEO
Google | Eric Schmidt
…

Movie | Reported Revenue
Avatar | $ 2,718,444,933
The Reader | $ 108,709,522
…

Political Party | Spokesperson
CDU | Philipp Wachholz
Die Grünen | Claudia Roth
…

Company | Acquired Company
Google | YouTube
Yahoo | Overture
Facebook | FriendFeed
Software AG | IDS Scheer
…

Cyc, IWP, ReadTheWeb, TextRunner, YAGO-NAGA
Knowledge as Enabling Technology
• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps (e.g. deep QA)
• semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)
US president when Barack Obama was born?
Indy 500 winners who are still alive?
Politicians who are also scientists?
Relationship between Angela Merkel, Jim Gray, Dalai Lama?
Enzymes that inhibit HIV?
Influenza drugs for teens with high blood pressure?
...
Knowledge Search (1) Who was US president when Barack Obama was born? http://www.wolframalpha.com
Knowledge Search (1) Who was mayor of Indianapolis when Barack Obama was born? not enough facts in KB! http://www.wolframalpha.com
Knowledge Search (2) Indy 500 winners? http://www.google.com/squared/
Knowledge Search (2) Indy 500 winners from Europe? no types, no inference! http://www.google.com/squared/
Related Work
[Figure: landscape of projects, arranged on the slide around the labels information extraction, ontologies, communities, (Semantic Web), (Statistical Web), (Social Web).]
Yago-Naga, EntityRank, Cazoodle, Text2Onto, Powerset, ReadTheWeb, Avatar, System T, Hakia, Cyc, UIMA, Kylin, KOG, WebTables, kosmix, KnowItAll, TextRunner, WolframAlpha, SWSE, StatSnowball, EntityCube, sig.ma, DBpedia, Cimple, DBlife, PSOX, TrueKnowledge, GoogleSquared, Freebase, Answers, START, WorldWideTables, IWP
Outline
What and Why
Framework
Entities and Classes
Relationships
Temporal Knowledge
Wrap-up
Framework: Types of Knowledge
• facts / assertions: bornIn (JohnDillinger, Indianapolis), hasWon (JimGray, TuringAward), …
• taxonomic: instanceOf (JohnDillinger, bankRobbers), subclassOf (bankRobbers, criminals), …
• lexical / terminology: means ("Big Apple", NewYorkCity), means ("Big Mike", MichaelStonebraker), means ("MS", Microsoft), means ("MS", MultipleSclerosis), …
• common-sense properties:
  apples are green, red, juicy, sweet, sour … - but not fast, smart …
  balls are round, smooth, slippery … - but not square, funny …
• common-sense axioms:
  $\forall x:\ \mathit{human}(x) \Rightarrow \mathit{male}(x) \lor \mathit{female}(x)$
  $\forall x:\ (\mathit{male}(x) \Rightarrow \lnot\mathit{female}(x)) \land (\mathit{female}(x) \Rightarrow \lnot\mathit{male}(x))$
  $\forall x:\ \mathit{animal}(x) \Rightarrow (\mathit{hasLegs}(x) \Rightarrow \mathit{isEven}(\mathit{numberOfLegs}(x)))$ …
• procedural: how to fix/install/prepare/remove …
• epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …
Framework: Information Extraction (IE)
Input text: Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …
Extracted facts:
instanceOf (Surajit, scientist)
inField (Surajit, computer science)
hasAdvisor (Surajit, Jeff Ullman)
almaMater (Surajit, Stanford U)
workedFor (Surajit, HP)
friendOf (Surajit, Umesh Dayal)
…
source-centric IE (one source): 1) recall! 2) precision
yield-centric harvesting (many sources): 1) precision! 2) recall; near-human quality!

hasAdvisor
Student | Advisor
Surajit Chaudhuri | Jeffrey Ullman
Alon Halevy | Jeffrey Ullman
Jim Gray | Mike Harrison
…

almaMater
Student | University
Surajit Chaudhuri | Stanford U
Alon Halevy | Stanford U
Jim Gray | UC Berkeley
…
Framework: Knowledge Representation
• RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples, binary relations; structure, but no (prescriptive) schema
• Relations, frames
• Description logics: OWL, DL-lite
• Higher-order logics, epistemic logics
facts (RDF triples):
1: (JimGray, hasAdvisor, MikeHarrison)
2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
3: (Madonna, marriedTo, GuyRitchie)
4: (NicolasSarkozy, marriedTo, CarlaBruni)
facts about facts:
5: (1, inYear, 1968)
6: (2, inYear, 2006)
7: (3, validFrom, 22-Dec-2000)
8: (3, validUntil, Nov-2008)
9: (4, validFrom, 2-Feb-2008)
10: (2, source, SigmodRecord)
temporal & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: "Color" Sub Prop Obj)
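The reification idea on this slide can be sketched in a few lines of Python. This is only an illustration: the `Fact` and `FactStore` names and the rule that ids follow insertion order are assumptions made for the example, not part of RDF or of any system named in the slides.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Fact:
    fact_id: int
    subject: Union[str, int]  # an entity name, or the id of another (reified) fact
    prop: str
    obj: str


class FactStore:
    """Toy in-memory store (hypothetical); ids follow insertion order, starting at 1."""

    def __init__(self) -> None:
        self.facts: Dict[int, Fact] = {}

    def add(self, subject: Union[str, int], prop: str, obj: str) -> int:
        fact_id = len(self.facts) + 1
        self.facts[fact_id] = Fact(fact_id, subject, prop, obj)
        return fact_id

    def annotations(self, fact_id: int) -> Dict[str, str]:
        """Return the facts-about-facts whose subject is the given fact id."""
        return {f.prop: f.obj for f in self.facts.values() if f.subject == fact_id}


store = FactStore()
store.add("JimGray", "hasAdvisor", "MikeHarrison")          # fact 1
store.add("SurajitChaudhuri", "hasAdvisor", "JeffUllman")   # fact 2
madonna = store.add("Madonna", "marriedTo", "GuyRitchie")   # fact 3
store.add("NicolasSarkozy", "marriedTo", "CarlaBruni")      # fact 4
store.add(1, "inYear", "1968")                              # fact 5
store.add(2, "inYear", "2006")                              # fact 6
store.add(madonna, "validFrom", "22-Dec-2000")              # fact 7
store.add(madonna, "validUntil", "Nov-2008")                # fact 8
store.add(4, "validFrom", "2-Feb-2008")                     # fact 9
store.add(2, "source", "SigmodRecord")                      # fact 10

print(store.annotations(madonna))
```

Using the fact id as the subject of another triple keeps everything in plain SPO form, which is the point of the slide's comparison with quadruples.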
KBs: Example YAGO (Suchanek et al.: WWW'07)
2 million entities, 20 million facts, 40 million RDF triples (entity1-relation-entity2, subject-predicate-object); accuracy 95%
[Figure: excerpt of the YAGO graph. Class nodes connected by subclass edges: Entity, Organization, Person, Location, Country, State, City, Scientist, Politician, Biologist, Physicist. Entity nodes: Max_Planck, Erwin_Planck, Angela Merkel, Max_Planck Society, Germany, Kiel, Schleswig-Holstein, Nobel Prize. Edge labels: instanceOf, locatedIn, bornIn, bornOn (Apr 23, 1858), diedOn (Oct 4, 1947; Oct 23, 1944), hasWon, FatherOf, citizenOf, and means edges (weights 0.9 and 0.1 shown) from the names "Max Planck", "Max Karl Ernst Ludwig Planck", "Angela Merkel", "Angela Dorothea Merkel".]
http://www.mpi-inf.mpg.de/yago-naga/
KBs: Example DBpedia (Auer, Bizer, et al.: ISWC'07)
• 3 million entities, 1 billion facts (RDF triples)
• 1.5 million entities mapped to hand-crafted taxonomy of 259 classes with 1200 properties
http://www.dbpedia.org
Entities & Classes
Which entity types (classes, unary predicates) are there?
  scientists, doctoral students, computer scientists, …
  female humans, male humans, married humans, …
Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies)?
  subclassOf (computer scientists, scientists), subclassOf (scientists, humans), …
Which individual entities belong to which classes?
  instanceOf (Surajit Chaudhuri, computer scientists), instanceOf (Barbara Liskov, computer scientists), instanceOf (Barbara Liskov, female humans), …
Which names denote which entities?
  means ("Lady Di", Diana Spencer), means ("Diana Frances Mountbatten-Windsor", Diana Spencer), …
  means ("Madonna", Madonna Louise Ciccone), means ("Madonna", Madonna (painting by Edward Munch)), …
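A minimal sketch of how instanceOf and subclassOf interact: an entity belongs to every superclass of its direct classes. The dictionaries reuse the slide's examples; the entry subclassOf (female humans, humans) is an assumption added for the illustration, and a single-parent hierarchy is assumed for brevity.

```python
# subclassOf edges (child -> parent); single parent assumed for this sketch.
subclass_of = {
    "computer scientists": "scientists",
    "scientists": "humans",
    "female humans": "humans",  # assumption, not stated on the slide
}

# Direct instanceOf facts taken from the slide.
instance_of = {
    "Surajit Chaudhuri": {"computer scientists"},
    "Barbara Liskov": {"computer scientists", "female humans"},
}


def superclasses(cls):
    """Follow subclassOf edges upward; stop if a cycle is detected."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        if cls in chain:
            break
        chain.append(cls)
    return chain


def all_classes(entity):
    """Direct classes of an entity plus everything they are subsumed by."""
    result = set()
    for cls in instance_of.get(entity, ()):
        result.add(cls)
        result.update(superclasses(cls))
    return result


print(sorted(all_classes("Barbara Liskov")))
```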
WordNet Thesaurus [Miller/Fellbaum 1998] 3 concepts / classes & their synonyms (synsets) http://wordnet.princeton.edu/
WordNet Thesaurus [Miller/Fellbaum 1998] subclasses (hyponyms) superclasses (hypernyms) http://wordnet.princeton.edu/
WordNet Thesaurus [Miller & Fellbaum 1998]
• > 100 000 classes and lexical relations
• can be cast into description logics or graph, with weights for relation strengths (derived from co-occurrence statistics)
but: only few individual entities (instances of classes)
scientist, man of science (a person with advanced knowledge)
  => cosmographer, cosmographist
  => biologist, life scientist
  => chemist
  => cognitive scientist
  => computer scientist
  ...
  => principal investigator, PI
  …
  HAS INSTANCE => Bacon, Roger Bacon
  …
http://wordnet.princeton.edu/
Mapping: Wikipedia → WordNet [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
[Figure: the Wikipedia entity Jim Gray (computer specialist) next to candidate class labels: Missing Person; Sailor, Crewman; American; Computer Scientist; Scientist; Chemist; Artist.]
Mapping: Wikipedia → WordNet [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
[Figure: Jim Gray (computer specialist) with his Wikipedia categories and candidate WordNet classes; uncertain links are marked "?", and instanceOf and subclassOf edges are shown.]
Wikipedia categories: People Lost at Sea, Computer Scientists by Nation, American Computer Scientists, Databases, Database Researcher, Engineering Societies, Fellows of the ACM, Members of Learned Societies
Candidate WordNet classes: Missing Person; Sailor, Crewman; American; Computer Scientist; Scientist; Database; Fellow (1), Comrade; Fellow (2), Colleague; Fellow (3) (of Society); Member (1), Fellow; Member (2), Extremity; ACM
Possible matching signals:
• name similarity (edit dist., n-gram overlap)?
• context similarity (word/phrase level)?
• machine learning?
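The two name-similarity signals named on this slide, edit distance and n-gram overlap, can be sketched as follows. The choice of character trigrams and of Jaccard overlap is an assumption for the illustration; the slides do not say which variant the cited systems use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def ngrams(s: str, n: int = 3) -> set:
    """Set of lower-cased character n-grams of s."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets (0.0 if both sets are empty)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(edit_distance("Fellows", "Fellow"), ngram_overlap("Fellows", "Fellow"))
```

String similarity alone cannot tell Fellow (1), Comrade from Fellow (3) (of Society), which is why the slide also lists context similarity and learning.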
Mapping: Wikipedia → WordNet [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Analyzing category names with a noun group parser:
pre-modifier | head | post-modifier
American | Musicians | of Italian Descent
American Folk | Music | of the 20th Century
American Indy 500 | Drivers | on Pole Positions
Head word is key, should be in plural for instanceOf
Mapping Wikipedia Entities to WordNet Classes [Suchanek: WWW'07, Ponzetto & Strube: AAAI'07]
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Heuristic Method:
for each ci do
  if head word w of category name ci is plural {
    1) match w against synsets of WordNet classes
    2) choose best fitting class c and set $e \in c$
    3) expand w by pre-modifier and set $c_i \subseteq w^{+} \subseteq c$ }
tuned conservatively: high precision, reduced recall
• can also derive features this way
• feed into supervised classifier
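The heuristic can be sketched in Python under strong simplifying assumptions: `TOY_CLASSES` is a hypothetical stand-in for WordNet synset lookup, the noun-group parser just splits at the first preposition, and plural detection is a naive trailing-"s" test. None of this is the cited systems' actual code.

```python
# Hypothetical stand-in for WordNet: singular head noun -> class name.
TOY_CLASSES = {"musician": "musician", "driver": "driver", "scientist": "scientist"}
PREPOSITIONS = {"of", "on", "in", "by", "from", "at"}


def parse_category(name):
    """Split a non-empty category name into (pre-modifier, head, post-modifier).

    Simplification: the head is the last word before the first preposition.
    """
    tokens = name.split()
    cut = next((i for i, t in enumerate(tokens)
                if i > 0 and t.lower() in PREPOSITIONS), len(tokens))
    return " ".join(tokens[:cut - 1]), tokens[cut - 1], " ".join(tokens[cut:])


def singular_of(word):
    """Naive plural test: return the singular form, or None if not plural."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return None


def map_category(entity, category):
    """Emit instanceOf / subclassOf facts for one (entity, category) pair."""
    pre, head, _post = parse_category(category)
    singular = singular_of(head)
    if singular is None or singular not in TOY_CLASSES:
        return []  # conservative: skip non-plural or unmatched heads
    cls = TOY_CLASSES[singular]
    facts = [("instanceOf", entity, cls)]
    if pre:
        expanded = f"{pre} {head}"  # w+ : head word expanded by its pre-modifier
        facts.append(("subclassOf", category, expanded))
        facts.append(("subclassOf", expanded, cls))
    return facts


print(map_category("SomeEntity", "American Musicians of Italian Descent"))
print(map_category("SomeEntity", "American Folk Music of the 20th Century"))
```

The second call returns no facts because "Music" is not plural, which mirrors the slide's point that a singular head usually signals a topic rather than a class.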
Learning More Mappings [Wu & Weld: WWW'08]
• Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
  • YAGO as training data
  • advanced ML methods (MLNs, SVMs)
  • rich features from various sources
• category/class name similarity measures
• category instances and their infobox templates: template names, attribute names (e.g. knownFor)
• Wikipedia edit history: refinement of categories
• Hearst patterns: C such as X, X and Y and other Cs, …
• other search-engine statistics: co-occurrence frequencies
> 3 million entities, > 1 million w/ infoboxes, > 500 000 categories
Goal: Comprehensive & Consistent!
[Figure: tangled graph connecting entities, category names, infobox attributes, and candidate classes. Node labels: Jeffrey Ullman, Jim Gray (computer specialist), Madonna (entertainer), Bob Dylan; Telecomm. History, Knuth Prize Laureate, Doctoral Students, Bell Labs, Known For, Princeton Alumni, American People by Occupation, Alma Mater, American Computer Scientists, Notable Awards, Databases, Database Researcher, Born, Fellows of the ACM, Members of Learned Societies, Years Active, U Michigan Alumni, Genres, Americans of Italian Descent, World Record Holders, Also Known As, People by Status, American Songwriters, Hall of Fame Inductees, Website, Guitar Players; American, Academic, Scientist, Fellow(1), Fellow(2), Computer, Data, Award Winner, Athlete, Artist, Musician, Singer, Italian.]
Goal: Comprehensive & Consistent!
Clean up the mess:
• graph algorithms?
  • random walk with restart
  • dense subgraphs …
• statistical machine learning?
• logical consistency reasoning?
• gigantic schema integration?
  • ontology merging
[Figure: the same graph of entities (Jeffrey Ullman, Jim Gray (computer specialist), Madonna (entertainer), Bob Dylan), Wikipedia categories, infobox attributes, and candidate classes, with the list above overlaid.]
Long Tail of Class Instances [Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]
• State-of-the-Art Approach (e.g. SEAL):
  • Start with seeds: a few class instances
  • Find lists, tables, text snippets ("for example: …"), … that contain one or more seeds
  • Extract candidates: noun phrases from vicinity
  • Gather co-occurrence stats (seed & cand, cand & className pairs)
  • Rank candidates
    • point-wise mutual information, …
    • random walk (PR-style) on seed-cand graph
But:
Precision drops for classes with sparse statistics (DB profs, …)
Harvested items are names, not entities
Canonicalization (de-duplication) unsolved
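The seed-expansion loop above can be sketched with point-wise mutual information as the ranking function. The snippet data is made up for the illustration, and summing PMI over seeds is one simple choice among many; this is not SEAL's actual scoring.

```python
import math
from collections import Counter
from itertools import combinations


def rank_candidates(snippets, seeds):
    """Rank non-seed items by summed PMI with the seeds they co-occur with."""
    n = len(snippets)
    item_count = Counter()
    pair_count = Counter()
    for items in snippets:
        unique = set(items)
        item_count.update(unique)
        pair_count.update(frozenset(p) for p in combinations(sorted(unique), 2))

    scores = {}
    for cand in item_count:
        if cand in seeds:
            continue
        total, co_occurs = 0.0, False
        for seed in seeds:
            joint = pair_count[frozenset((seed, cand))]
            if joint:
                co_occurs = True
                # PMI = log( P(seed, cand) / (P(seed) * P(cand)) )
                total += math.log(joint * n / (item_count[seed] * item_count[cand]))
        if co_occurs:
            scores[cand] = total
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Hypothetical lists/snippets found on the Web; each is a bag of names.
snippets = [
    ["Jim Gray", "Jeffrey Ullman", "Hector Garcia-Molina"],
    ["Jim Gray", "Hector Garcia-Molina", "Michael Stonebraker"],
    ["Jeffrey Ullman", "Hector Garcia-Molina"],
    ["Jim Gray", "Bob Dylan"],
    ["Bob Dylan", "Madonna"],
    ["Madonna", "Bob Dylan"],
]
ranked = rank_candidates(snippets, seeds={"Jim Gray", "Jeffrey Ullman"})
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Note that the output is a ranked list of name strings; mapping them to canonical entities is the unsolved step the slide points out.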
Individual Entity Disambiguation
Names: "Penn" (?), "U Penn", "Penn State", "PSU"
Entities: Sean Penn, University of Pennsylvania, Pennsylvania State University, Pennsylvania (US State), Passenger Service Unit
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links
Collective Entity Disambiguation [McCallum 2003, Doan 2005, Getoor 2006, Domingos 2007, Chakrabarti 2009, …]
• Consider a set of names {n1, n2, …} in same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define joint objective function (e.g. likelihood for prob. model) that rewards coherence of mappings ni → eij
• Solve optimization problem
Example: name "Stuart Russell" with candidates Stuart Russell (DJ), Stuart Russell (computer scientist); name "Michael Jordan" with candidates Michael Jordan (computer scientist), Michael Jordan (NBA)
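A minimal sketch of a joint objective for the Stuart Russell / Michael Jordan example: each assignment is scored by per-entity priors plus pairwise coherence, and the best assignment is found by exhaustive search. All numbers are hypothetical, and the additive objective is an assumption for the illustration rather than any cited paper's model.

```python
from itertools import product

CANDIDATES = {
    "Stuart Russell": ["Stuart Russell (DJ)", "Stuart Russell (computer scientist)"],
    "Michael Jordan": ["Michael Jordan (NBA)", "Michael Jordan (computer scientist)"],
}

# Hypothetical popularity priors per candidate entity.
PRIOR = {
    "Stuart Russell (DJ)": 0.3,
    "Stuart Russell (computer scientist)": 0.7,
    "Michael Jordan (NBA)": 0.9,
    "Michael Jordan (computer scientist)": 0.1,
}

# Hypothetical coherence scores for entity pairs (unlisted pairs score 0).
COHERENCE = {
    frozenset({"Stuart Russell (computer scientist)",
               "Michael Jordan (computer scientist)"}): 1.0,
}


def disambiguate(candidates, prior, coherence):
    """Exhaustively maximize sum of priors plus pairwise coherence."""
    names = list(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[n] for n in names)):
        score = sum(prior[e] for e in choice)
        score += sum(coherence.get(frozenset({a, b}), 0.0)
                     for i, a in enumerate(choice) for b in choice[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(names, choice)), score
    return best


best = disambiguate(CANDIDATES, PRIOR, COHERENCE)
print(best)
```

Resolved in isolation, "Michael Jordan" would go to the NBA player because of its higher prior; the coherence term flips the choice once both names are resolved together. Exhaustive search is exponential in the number of names, so it only works for toy inputs.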
Problems and Challenges
Wikipedia categories reloaded: comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet (via consistency reasoning?)
Long tail of entities: beyond Wikipedia: domain-specific entity catalogs; discover new entities, detect new names for known entities
Tags, tables, topics: tap on other sources: Web 2.0, Web tables, directories, etc.
Robust disambiguation: near-real-time mapping of names to entities with near-human quality
Relationships
Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
  hasAdvisor (JimGray, MikeHarrison)
  hasAdvisor (HectorGarcia-Molina, Gio Wiederhold)
  hasAdvisor (Susan Davidson, Hector Garcia-Molina)
  graduatedAt (JimGray, Berkeley)
  graduatedAt (HectorGarcia-Molina, Stanford)
  hasWonPrize (JimGray, TuringAward)
  bornOn (JohnLennon, 9Oct1940)
  diedOn (JohnLennon, 8Dec1980)
  marriedTo (JohnLennon, YokoOno)
Which additional & interesting relation types are there between given classes of entities?
  competedWith(x,y), nominatedForPrize(x,y), …
  divorcedFrom(x,y), affairWith(x,y), …
  assassinated(x,y), rescued(x,y), admired(x,y), …
Deterministic Pattern Matching [Kushmerick 97, Califf & Mooney 99, Gottlob 01, …]
• Regular expressions matching
• Wrapper induction (grammar learning for restricted regular languages)
• Well understood
French Marriage Problem
facts in KB: married (Hillary, Bill), married (Carla, Nicolas), married (Angelina, Brad)
new facts or fact candidates: married (Cecilia, Nicolas), married (Carla, Benjamin), married (Carla, Mick), married (Michelle, Barack), married (Yoko, John), married (Kate, Leonardo), married (Carla, Sofie), married (Larry, Google)
for recall: pattern-based harvesting
for precision: consistency reasoning
Pattern-Based Harvesting (Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
Facts & Fact Candidates: (Hillary, Bill), (Carla, Nicolas), (Angelina, Brad), (Victoria, David), (Yoko, John), (Kate, Pete), (Carla, Benjamin), (Larry, Google), …
Patterns:
  X and her husband Y
  X and Y on their honeymoon
  X and Y and their children
  X has been dating with Y
  X loves Y
  …
• good for recall
• noisy, drifting
• not robust enough for high precision
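One half of the pattern/fact loop, applying patterns to text to collect candidate pairs, can be sketched with regular expressions. The sentences are invented for the illustration, and the single-capitalized-word `NAME` expression is a deliberate oversimplification of entity recognition.

```python
import re

NAME = r"([A-Z][a-z]+)"  # simplification: a name is one capitalized word

# Three of the slide's patterns, with X and Y as capture groups.
PATTERNS = [
    rf"{NAME} and her husband {NAME}",
    rf"{NAME} and {NAME} on their honeymoon",
    rf"{NAME} loves {NAME}",
]


def harvest(sentences):
    """Return all (X, Y) pairs matched by any pattern in any sentence."""
    found = set()
    for sentence in sentences:
        for pattern in PATTERNS:
            for x, y in re.findall(pattern, sentence):
                found.add((x, y))
    return found


# Hypothetical input sentences.
sentences = [
    "Hillary and her husband Bill attended the dinner.",
    "Photos show Carla and Nicolas on their honeymoon.",
    "Everybody knows that Larry loves Google.",
]
pairs = harvest(sentences)
print(sorted(pairs))
```

The pair (Larry, Google) shows the precision problem named on the slide: a loose pattern such as "X loves Y" matches text that has nothing to do with marriage.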
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates
ground atoms:
  spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
  f(Hillary), f(Carla), f(Cecilia), f(Sofie)
  m(Bill), m(Nicolas), m(Ben), m(Mick)
FOL rules (restricted):
  $\mathit{spouse}(x,y) \land \mathit{diff}(y,z) \Rightarrow \lnot\mathit{spouse}(x,z)$
  $\mathit{spouse}(x,y) \land \mathit{diff}(w,y) \Rightarrow \lnot\mathit{spouse}(w,y)$
  $\mathit{spouse}(x,y) \Rightarrow f(x)$
  $\mathit{spouse}(x,y) \Rightarrow m(y)$
  $\mathit{spouse}(x,y) \Rightarrow (f(x) \land m(y)) \lor (m(x) \land f(y))$
Rules reveal inconsistencies
Find consistent subset(s) of atoms ("possible world(s)", "the truth")
• Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
• uncertain / probabilistic data
• compute prob. distr. of subset of atoms being the truth
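A sketch of how the rules reveal inconsistencies among the slide's ground atoms, followed by a brute-force search for the largest consistent subsets ("possible worlds"). Treating the rules as hard, unweighted constraints and enumerating subsets are simplifications for the illustration.

```python
from itertools import combinations

# Ground atoms from the slide.
SPOUSE = [("Hillary", "Bill"), ("Carla", "Nicolas"), ("Cecilia", "Nicolas"),
          ("Carla", "Ben"), ("Carla", "Mick"), ("Carla", "Sofie")]
FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}
MALE = {"Bill", "Nicolas", "Ben", "Mick"}


def violations(pairs):
    """List every rule violation among the given spouse(x, y) atoms."""
    found = []
    for x, y in pairs:
        if x not in FEMALE:
            found.append(("spouse(x,y) => f(x)", (x, y)))
        if y not in MALE:
            found.append(("spouse(x,y) => m(y)", (x, y)))
    for a, b in combinations(pairs, 2):
        if a[0] == b[0] and a[1] != b[1]:
            found.append(("x has two different spouses", a, b))
        if a[1] == b[1] and a[0] != b[0]:
            found.append(("y has two different spouses", a, b))
    return found


def largest_consistent_subsets(pairs):
    """Brute force: try subset sizes from large to small, keep the first hits."""
    for size in range(len(pairs), -1, -1):
        worlds = [set(c) for c in combinations(pairs, size) if not violations(c)]
        if worlds:
            return worlds
    return []


worlds = largest_consistent_subsets(SPOUSE)
for world in worlds:
    print(sorted(world))
```

Both maximal worlds keep spouse(Cecilia,Nicolas) and drop spouse(Carla,Nicolas), because size alone is the wrong criterion; this is where the weights mentioned in the bullets come in.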
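Markov Logic Networks (MLNs) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
constraints:
  $s(x,y) \land \mathit{diff}(y,z) \Rightarrow \lnot s(x,z)$
  $s(x,y) \land \mathit{diff}(w,y) \Rightarrow \lnot s(w,y)$
  $s(x,y) \Rightarrow f(x)$
  $s(x,y) \Rightarrow m(y)$
  $f(x) \Rightarrow \lnot m(x)$
  $m(x) \Rightarrow \lnot f(x)$
fact candidates: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
Grounding: Literal → Boolean Var; Literal → binary RV
ground clauses:
  $\lnot s(Ca,Nic) \lor \lnot s(Ce,Nic)$
  $\lnot s(Ca,Nic) \lor \lnot s(Ca,Ben)$
  $\lnot s(Ca,Nic) \lor \lnot s(Ca,So)$
  $\lnot s(Ca,Ben) \lor \lnot s(Ca,So)$
  $\lnot s(Ca,Nic) \lor m(Nic)$
  $\lnot s(Ce,Nic) \lor m(Nic)$
  $\lnot s(Ca,Ben) \lor m(Ben)$
  $\lnot s(Ca,So) \lor m(So)$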
Markov Logic Networks (MLNs) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
constraints:
  $s(x,y) \land \mathit{diff}(y,z) \Rightarrow \lnot s(x,z)$
  $s(x,y) \land \mathit{diff}(w,y) \Rightarrow \lnot s(w,y)$
  $s(x,y) \Rightarrow f(x)$
  $s(x,y) \Rightarrow m(y)$
  $f(x) \Rightarrow \lnot m(x)$
  $m(x) \Rightarrow \lnot f(x)$
fact candidates: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
[Figure: MRF over the RVs s(Ce,Nic), s(Ca,Nic), s(Ca,Ben), s(Ca,So), m(Nic), m(Ben), m(So)]
RVs coupled by MRF edge if they appear in same clause
MRF assumption: $P[X_i \mid X_1 \ldots X_n] = P[X_i \mid N(X_i)]$
joint distribution has product form over all cliques
Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
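The log-linear semantics behind an MLN can be illustrated on three ground atoms: each possible world gets probability proportional to exp(sum of weights of satisfied ground clauses). The clause weights and the unit "evidence" clauses below are hypothetical, and exact enumeration is used here only because the example is tiny; it stands in for the sampling and propagation algorithms listed on the slide.

```python
import math
from itertools import product

ATOMS = ["s(Ca,Nic)", "s(Ce,Nic)", "s(Ca,Ben)"]

# (weight, clause); a clause is a list of (atom, required truth value) literals.
# All weights are hypothetical.
CLAUSES = [
    (2.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # Nicolas has one spouse
    (2.0, [("s(Ca,Nic)", False), ("s(Ca,Ben)", False)]),  # Carla has one spouse
    (1.5, [("s(Ca,Nic)", True)]),                          # strong pattern evidence
    (0.5, [("s(Ce,Nic)", True)]),                          # weak pattern evidence
    (0.5, [("s(Ca,Ben)", True)]),                          # weak pattern evidence
]


def world_distribution(atoms, clauses):
    """P(world) proportional to exp(total weight of satisfied clauses)."""
    unnormalized = {}
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        score = sum(w for w, lits in clauses
                    if any(world[a] == polarity for a, polarity in lits))
        unnormalized[values] = math.exp(score)
    z = sum(unnormalized.values())  # partition function
    return {values: u / z for values, u in unnormalized.items()}


def marginal(dist, atoms, atom):
    """Probability that the given atom is true, summed over all worlds."""
    i = atoms.index(atom)
    return sum(p for values, p in dist.items() if values[i])


dist = world_distribution(ATOMS, CLAUSES)
for atom in ATOMS:
    print(atom, round(marginal(dist, ATOMS, atom), 3))
```

Unlike hard constraints, a violated clause only lowers a world's probability, so the world with both s(Ca,Nic) and s(Ce,Nic) true remains possible but unlikely.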
Related Alternative Probabilistic Models
Constrained Conditional Models [D. Roth et al. 2007]
  log-linear classifiers with constraint-violation penalty
  mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]
  RVs share "factors" (joint feature functions)
  generalizes MRF, BN, CRF, …
  inference via advanced MCMC
  flexible coupling & constraining of RVs
[Figure: factor graph over the RVs s(Ce,Nic), s(Ca,Nic), s(Ca,Ben), s(Ca,So), m(Nic), m(Ben), m(So)]
software tools: alchemy.cs.washington.edu, code.google.com/p/factorie/, research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route (F. Suchanek et al.: WWW'09)
facts in KB: married (Hillary, Bill), married (Carla, Nicolas), married (Angelina, Brad)
new fact candidates (?): married (Cecilia, Nicolas), married (Carla, Benjamin), married (Carla, Mick), married (Carla, Sofie), married (Larry, Google)
+ patterns: X and her husband Y; X and Y and their children; X has been dating with Y; X loves Y
Direct approach:
• facts are true; fact candidates & patterns → hypotheses
• grounded constraints → clauses with hypotheses as vars
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact cand consistency, pattern goodness, entity disambig.
www.mpi-inf.mpg.de/yago-naga/sofie/
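A toy version of the Weighted Max-Sat casting: fact candidates and pattern meanings are Boolean hypotheses, grounded constraints become weighted clauses, and the assignment with the highest satisfied weight wins. The clause set, the weights, and the assumed pattern occurrences are invented for the illustration, and brute-force enumeration replaces the customized approximation algorithm mentioned on the slide.

```python
from itertools import product

H_CECILIA = "married(Cecilia,Nicolas)"
H_BENJAMIN = "married(Carla,Benjamin)"
H_HUSBAND = "expresses('X and her husband Y', married)"
H_LOVES = "expresses('X loves Y', married)"
HYPOTHESES = [H_CECILIA, H_BENJAMIN, H_HUSBAND, H_LOVES]

HARD = 100.0  # large weight standing in for a hard constraint

# (weight, clause); a clause is a list of (hypothesis, required truth value).
CLAUSES = [
    # KB fact married(Carla,Nicolas) plus functional constraints on spouse:
    (HARD, [(H_CECILIA, False)]),
    (HARD, [(H_BENJAMIN, False)]),
    # Assumed: 'X and her husband Y' occurs with the KB fact (Carla, Nicolas).
    (2.0, [(H_HUSBAND, True)]),
    # Assumed: 'X loves Y' occurs with (Carla, Benjamin); if the pattern
    # expresses married, then married(Carla,Benjamin) should hold.
    (1.0, [(H_LOVES, False), (H_BENJAMIN, True)]),
]


def weighted_max_sat(variables, clauses):
    """Enumerate all assignments; return the one with maximal satisfied weight."""
    best, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        weight = sum(w for w, lits in clauses
                     if any(world[v] == polarity for v, polarity in lits))
        if weight > best_weight:
            best, best_weight = world, weight
    return best, best_weight


solution, total = weighted_max_sat(HYPOTHESES, CLAUSES)
for hypothesis, value in solution.items():
    print(value, hypothesis)
```

In this toy instance the solver accepts "X and her husband Y" as a marriage pattern and rejects both "X loves Y" and the conflicting fact candidates in one pass, which is the sense in which the approach unifies pattern goodness with fact consistency.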
Facts & Patterns Consistency (F. Suchanek et al.: WWW'09)
constraints to connect facts, fact candidates, patterns
functional dependencies:
  spouse(X,Y): $X \rightarrow Y$, $Y \rightarrow X$
relation properties:
  asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
  $\mathit{spouse} \subseteq \mathit{Person} \times \mathit{Person}$
  $\mathit{capitalOfCountry} \subseteq \mathit{cityOfCountry}$
pattern-fact duality:
  $\mathit{occurs}(p,x,y) \land \mathit{expresses}(p,R) \Rightarrow R(x,y)$
  $\mathit{occurs}(p,x,y) \land R(x,y) \Rightarrow \mathit{expresses}(p,R)$
domain-specific constraints:
  $\mathit{bornInYear}(x) + 10\,\text{years} \le \mathit{graduatedInYear}(x)$
  $\mathit{hasAdvisor}(x,y) \land \mathit{graduatedInYear}(x,t) \land \mathit{graduatedInYear}(y,s) \Rightarrow s < t$
name(-in-context)-to-entity mapping:
  means(n,e1) means(n,e2) …
www.mpi-inf.mpg.de/yago-naga/sofie/