Interactive Reasoning in Large and Uncertain RDF Knowledge Bases

Martin Theobald
Joint work with: Maximilian Dylla, Timm Meiser, Ndapa Nakashole, Christina Teflioudi, Yafang Wang, Mohamed Yahya, Mauro Sozio, and Fabian Suchanek
Max Planck Institute for Informatics

French Marriage Problem
marriedTo: person × person
∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z
marriedTo_French: person × person ...

French Marriage Problem
Facts in KB: marriedTo(Hillary, Bill), marriedTo(Carla, Nicolas), marriedTo(Angelina, Brad)
New facts or fact candidates: marriedTo(Cecilia, Nicolas), marriedTo(Carla, Benjamin), marriedTo(Carla, Mick), marriedTo(Michelle, Barack), marriedTo(Yoko, John), marriedTo(Kate, Leonardo), marriedTo(Carla, Sofie), marriedTo(Larry, Google)
• for recall: pattern-based harvesting
• for precision: consistency reasoning
∀x,y,z: marriedTo(x,y) ∧ marriedTo(x,z) ⇒ y=z
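The consistency check on this slide can be sketched in a few lines of Python. This is only an illustration, not URDF code: the function name `functional_conflicts` is hypothetical, and the split into KB facts and candidates is an assumption taken from the later "Revisited" slide. The constraint is applied to the first argument only, exactly as the formula is written.

```python
from collections import defaultdict

def functional_conflicts(facts):
    """Return every subject that is related to more than one distinct object.

    Implements: marriedTo(x,y) AND marriedTo(x,z) => y = z (first argument only).
    """
    by_subject = defaultdict(set)
    for subj, obj in facts:
        by_subject[subj].add(obj)
    return {s: sorted(objs) for s, objs in by_subject.items() if len(objs) > 1}

# Assumed split of the slide's facts (illustrative).
kb_facts = [("Hillary", "Bill"), ("Carla", "Nicolas"), ("Angelina", "Brad")]
candidates = [
    ("Cecilia", "Nicolas"), ("Carla", "Benjamin"), ("Carla", "Mick"),
    ("Michelle", "Barack"), ("Yoko", "John"), ("Kate", "Leonardo"),
    ("Carla", "Sofie"), ("Larry", "Google"),
]

conflicts = functional_conflicts(kb_facts + candidates)
print(conflicts)
```

A reasoner would then have to decide which of the conflicting facts to keep, which is where the weights introduced on the following slides come in.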

Agenda • URDF: Reasoning in Uncertain Knowledge Bases • Resolving uncertainty at query-time • Lineage of answers • Propositional vs. probabilistic reasoning • Temporal reasoning extensions • UViz: The URDF Visualization Frontend • Demo!

URDF: Reasoning in Uncertain KBs [Theobald, Sozio, Suchanek, Nakashole: MPII Tech-Report ’10]
• Knowledge harvesting from the Web may yield knowledge bases which are
  • Incomplete: bornIn(Albert_Einstein, ?x) → {}
  • Incorrect: bornIn(Albert_Einstein, ?x) → {Stuttgart}
  • Inconsistent: bornIn(Albert_Einstein, ?x) → {Ulm, Stuttgart} (weights shown on the slide: 0.2, 0.7)
• Combine grounding of first-order logic rules with an additional step of consistency reasoning
  • Propositional: Constrained Weighted MaxSat
  • Probabilistic: Lineage & Possible Worlds Semantics
→ At query time!

Soft Rules vs. Hard Constraints
(Soft) Inference Rules vs. (Hard) Consistency Constraints
• People may live in more than one place
  livesIn(x,y) ∧ marriedTo(x,z) ⇒ livesIn(z,y) [0.6]
  livesIn(x,y) ∧ hasChild(x,z) ⇒ livesIn(z,y) [0.2]
• People are not born in different places/on different dates
  bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
• People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)

Soft Rules vs. Hard Constraints (ct’d)
• Datalog-style grounding (deductive & potentially recursive soft rules):
  livesIn(x,y) ∧ type(y,City) ∧ locatedIn(y,z) ∧ type(z,Country) ⇒ livesIn(x,z)
• Enforce FDs (e.g., mutual exclusion) as hard constraints:
  hasAdvisor(x,y) ∧ hasAdvisor(x,z) ⇒ y=z
• Combine soft and hard constraints
  • No longer regular MaxSat
  • Constrained (weighted) MaxSat instead
• Generalize to other forms of constraints:
  • Hard constraint: hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
  • Soft constraint: firstPaper(x,p) ∧ firstPaper(y,q) ∧ author(p,x) ∧ author(p,y) ∧ inYear(p) > inYear(q)+5years ⇒ hasAdvisor(x,y) [0.6]

Deductive Grounding (SLD Resolution/Datalog)
Query: livesIn(Bill, ?x)
Answers (derived facts): livesIn(Bill, Arkansas), livesIn(Bill, New_York)
RDF Base Facts
• F1: marriedTo(Bill, Hillary)
• F2: represents(Hillary, New_York)
• F3: governorOf(Bill, Arkansas)
…
First-Order Rules (Horn clauses)
R1: livesIn(?x, ?y) :- marriedTo(?x, ?z), livesIn(?z, ?y)
R2: livesIn(?x, ?y) :- represents(?x, ?y)
R3: livesIn(?x, ?y) :- governorOf(?x, ?y)
[Figure: SLD resolution tree for the query over R1, R2, R3, with OR (\/) and AND (/\) nodes, leaves F1, F2, F3, and branches marked X]
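The grounding step can be illustrated with a small Python sketch over the slide's facts F1 to F3 and rules R1 to R3. Note that it uses naive bottom-up (forward-chaining) Datalog evaluation rather than the top-down SLD resolution named on the slide; both derive the same answers here. The helper names `match` and `ground` are hypothetical.

```python
def match(pattern, fact, binding):
    """Extend `binding` so that `pattern` equals `fact`; return None on failure."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    b = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            if b.setdefault(p, f) != f:   # variable already bound to something else
                return None
        elif p != f:                      # constant mismatch
            return None
    return b

def ground(facts, rules):
    """Apply all rules until no new fact can be derived (fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            bindings = [{}]
            for atom in body:             # join the body atoms left to right
                bindings = [b2 for b in bindings for f in facts
                            if (b2 := match(atom, f, b)) is not None]
            for b in bindings:
                derived = tuple(b.get(t, t) for t in head)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts

base_facts = [
    ("marriedTo", "Bill", "Hillary"),       # F1
    ("represents", "Hillary", "New_York"),  # F2
    ("governorOf", "Bill", "Arkansas"),     # F3
]
rules = [
    (("livesIn", "?x", "?y"), [("marriedTo", "?x", "?z"), ("livesIn", "?z", "?y")]),  # R1
    (("livesIn", "?x", "?y"), [("represents", "?x", "?y")]),                          # R2
    (("livesIn", "?x", "?y"), [("governorOf", "?x", "?y")]),                          # R3
]

closure = ground(base_facts, rules)
answers = sorted(f[2] for f in closure if f[:2] == ("livesIn", "Bill"))
print(answers)
```

A query-driven system would restrict the derivation to facts relevant for the query instead of computing the full closure; the sketch skips that optimization.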

URDF: Reasoning Example
Rules
• hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
• graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z
Derived Facts
• gradFr(Surajit, Stanford)
• gradFr(David, Stanford)
[Figure: KB base facts as a graph. Jeff, Surajit, and David have type Computer Scientist [1.0]; Stanford and Princeton have type University [1.0]. Edge labels shown: hasAdvisor[0.7], hasAdvisor[0.8], worksAt[0.9], graduatedFrom[0.9], graduatedFrom[0.6], graduatedFrom[0.7], and two graduatedFrom[?] edges for the derived facts]

URDF: CNF Construction & MaxSat Solving [Theobald, Sozio, Suchanek, Nakashole: MPII Tech-Report ’10]
Query: graduatedFrom(?x, ?y)
• 1) Deductive Grounding
  • Yields only facts and rules which are relevant for answering the query (dependency graph D)
• 2) Boolean Formula in CNF consisting of
  • Grounded hard rules
  • Grounded soft rules (weighted)
  • Base facts (weighted)
• 3) Propositional Reasoning
  • Compute truth assignment for all facts in D such that the sum of weights is maximized
  → Compute “most likely” possible world
CNF:
(¬graduatedFrom(Surajit, Stanford) ∨ ¬graduatedFrom(Surajit, Princeton))
∧ (¬graduatedFrom(David, Stanford) ∨ ¬graduatedFrom(David, Princeton))
∧ (¬hasAdvisor(Surajit, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(Surajit, Stanford))
∧ (¬hasAdvisor(David, Jeff) ∨ ¬worksAt(Jeff, Stanford) ∨ graduatedFrom(David, Stanford))
∧ worksAt(Jeff, Stanford)
∧ hasAdvisor(Surajit, Jeff)
∧ hasAdvisor(David, Jeff)
∧ graduatedFrom(Surajit, Princeton)
∧ graduatedFrom(Surajit, Stanford)
∧ graduatedFrom(David, Princeton)
∧ graduatedFrom(David, Stanford)
Weights shown on the slide: 0.4, 0.4, 0.9, 0.8, 0.7, 0.6, 0.7, 0.9, 0.0
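To make step 3 concrete, here is a brute-force sketch of constrained weighted MaxSat for this example. It is not the URDF solver: it simply enumerates all truth assignments, discards those violating a hard constraint, and keeps the one with the largest total weight of satisfied soft clauses. The weight assigned to each fact is an assumption (Surajit's values follow the lineage slide; David's are a guess at how the slide's weights map to facts), and all identifiers are illustrative.

```python
from itertools import product

# Base facts as weighted unit clauses (weight-to-fact mapping is assumed).
fact_weights = {
    "worksAt(Jeff,Stanford)": 0.9,
    "hasAdvisor(Surajit,Jeff)": 0.8,
    "hasAdvisor(David,Jeff)": 0.7,
    "graduatedFrom(Surajit,Princeton)": 0.7,
    "graduatedFrom(Surajit,Stanford)": 0.6,
    "graduatedFrom(David,Princeton)": 0.9,
    "graduatedFrom(David,Stanford)": 0.0,
}
# Grounded soft rules: (body atoms, head atom, weight).
soft_rules = [
    (["hasAdvisor(Surajit,Jeff)", "worksAt(Jeff,Stanford)"],
     "graduatedFrom(Surajit,Stanford)", 0.4),
    (["hasAdvisor(David,Jeff)", "worksAt(Jeff,Stanford)"],
     "graduatedFrom(David,Stanford)", 0.4),
]
# Grounded hard constraints: the two facts of a pair must not both be true.
mutex = [
    ("graduatedFrom(Surajit,Stanford)", "graduatedFrom(Surajit,Princeton)"),
    ("graduatedFrom(David,Stanford)", "graduatedFrom(David,Princeton)"),
]

def solve():
    names = list(fact_weights)
    best_world, best_weight = None, -1.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[a] and world[b] for a, b in mutex):
            continue  # hard constraint violated: not a possible world
        weight = sum(fact_weights[n] for n in names if world[n])
        # An implication is satisfied if its body is false or its head is true.
        weight += sum(w for body, head, w in soft_rules
                      if not all(world[b] for b in body) or world[head])
        if weight > best_weight:
            best_world, best_weight = world, weight
    return best_world, best_weight

best_world, best_weight = solve()
for name, value in best_world.items():
    print(f"{name}: {value}")
print("total weight:", round(best_weight, 3))
```

Exhaustive enumeration is exponential in the number of facts and only works for toy inputs; the problem is NP-complete in general, which is why approximation algorithms are used in practice.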

URDF: Lineage & Possible Worlds
Query: graduatedFrom(Surajit, ?y)
1) Deductive Grounding
  • Same as before, but trace lineage of query answers
2) Lineage DAG (not CNF!) consisting of
  • Grounded hard rules
  • Grounded soft rules
  • Base facts
  plus: derivation structure
3) Probabilistic Inference
  • Marginalization: aggregate probabilities of all possible worlds where the answer is “true”
  • Drop “impossible worlds”
Lineage of the answers (from the slide's DAG):
• hasAdvisor(Surajit, Jeff) [0.8] /\ worksAt(Jeff, Stanford) [0.9]: 0.8 × 0.9 = 0.72
• derived fact \/ base fact graduatedFrom(Surajit, Stanford) [0.6]: 1 − (1 − 0.72) × (1 − 0.6) = 0.888
• base fact graduatedFrom(Surajit, Princeton) [0.7]
• graduatedFrom(Surajit, Princeton): 0.7 × (1 − 0.888) = 0.078
• graduatedFrom(Surajit, Stanford): (1 − 0.7) × 0.888 = 0.266
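The arithmetic on this slide can be reproduced, and cross-checked against an explicit enumeration of possible worlds, with the sketch below. It assumes independent base facts, as the slides do; the function names and the short variable names are illustrative only.

```python
from itertools import product
from math import prod

def p_and(*ps):
    """Probability that all independent events hold."""
    return prod(ps)

def p_or(*ps):
    """Probability that at least one independent event holds."""
    return 1 - prod(1 - p for p in ps)

# Closed-form evaluation along the lineage DAG.
derived = p_and(0.8, 0.9)              # hasAdvisor /\ worksAt
stanford = p_or(derived, 0.6)          # derived \/ base fact
princeton = 0.7                        # base fact
princeton_only = princeton * (1 - stanford)
stanford_only = (1 - princeton) * stanford

# Cross-check by summing over all worlds of the four base facts.
base = {"hasAdvisor": 0.8, "worksAt": 0.9, "gradStanford": 0.6, "gradPrinceton": 0.7}

def world_sum(condition):
    total = 0.0
    for values in product([False, True], repeat=len(base)):
        w = dict(zip(base, values))
        pr = prod(base[k] if w[k] else 1 - base[k] for k in base)
        in_stanford = w["gradStanford"] or (w["hasAdvisor"] and w["worksAt"])
        if condition(in_stanford, w["gradPrinceton"]):
            total += pr
    return total

enum_princeton_only = world_sum(lambda s, p: p and not s)
enum_stanford_only = world_sum(lambda s, p: s and not p)

print(round(stanford, 3), round(princeton_only, 3), round(stanford_only, 3))
```

Worlds in which both answers are true violate the hard constraint and are the "impossible worlds" that get dropped; the two remaining values are therefore unnormalized, as on the slide.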

Classes & Complexities
• Grounding first-order Horn formulas (Datalog)
  • Decidable
  • EXPTIME-complete, PSPACE-complete (including recursion, but in P w/o recursion)
• Max-Sat (Constrained & Weighted)
  • NP-complete
• Probabilistic inference in graphical models
  • #P-complete
[Figure: language classes FOL, OWL, OWL-DL/lite, Horn]

Monte Carlo Simulation (I) [Karp, Luby, Madras: J.Alg. ’89]
Boolean formula: F = X1X2 ∨ X1X3 ∨ X2X3
Naïve sampling:
  cnt = 0
  repeat N times
    randomly choose X1, X2, X3 ∈ {0,1}
    if F(X1, X2, X3) = 1 then cnt = cnt+1
  P = cnt/N
  return P /* ≈ Pr(F) */
Theorem (zero/one-estimator theorem): If N ≥ (1/Pr(F)) × (4 ln(2/δ)/ε²) then: Pr[ |P/Pr(F) − 1| > ε ] < δ
• N may be very big for small Pr(F)
• Works for any F (not in PTIME)
[Figure: Venn diagram of the regions X1X2, X1X3, X2X3]
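A runnable version of the naive sampler for the slide's formula might look as follows. It assumes each variable is true with probability 1/2, which the slide's "randomly choose" suggests but does not state; `samples_needed` evaluates the bound from the theorem and is a hypothetical helper name.

```python
import math
import random

def f(x1, x2, x3):
    """F = X1X2 OR X1X3 OR X2X3."""
    return (x1 and x2) or (x1 and x3) or (x2 and x3)

def naive_estimate(n, rng):
    """Estimate Pr(F) as the fraction of uniform samples satisfying F."""
    cnt = 0
    for _ in range(n):
        x1, x2, x3 = (rng.random() < 0.5 for _ in range(3))
        if f(x1, x2, x3):
            cnt += 1
    return cnt / n

def samples_needed(pr_f, eps, delta):
    """Sample size from the zero/one-estimator theorem."""
    return math.ceil(4 * math.log(2 / delta) / (eps ** 2 * pr_f))

estimate = naive_estimate(20000, random.Random(0))
print(estimate)                        # close to the exact value 4/8 = 0.5
print(samples_needed(0.5, 0.1, 0.05))
print(samples_needed(0.001, 0.1, 0.05))
```

The last two lines illustrate the slide's caveat: the required N grows with 1/Pr(F), so rare events make naive sampling impractical.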

Monte Carlo Simulation (II) [Karp, Luby, Madras: J.Alg. ’89]
Boolean formula in DNF: F = C1 ∨ C2 ∨ … ∨ Cm
Improved sampling:
  cnt = 0; S = Pr(C1) + … + Pr(Cm)
  repeat N times
    randomly choose i ∈ {1,2,…,m}, with prob. Pr(Ci)/S
    randomly choose X1, …, Xn ∈ {0,1} s.t. Ci = 1
    if C1=0 and C2=0 and … and Ci-1=0 then cnt = cnt+1
  P = (cnt/N) × S
  return P /* ≈ Pr(F) */
Theorem: If N ≥ m × (4 ln(2/δ)/ε²) then: Pr[ |P/Pr(F) − 1| > ε ] < δ
• Now it’s better: the bound no longer depends on Pr(F)
• Only for F in DNF
• In PTIME
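The improved sampler can be sketched in Python as follows, assuming independent Boolean variables with given marginal probabilities. Clauses are represented as dictionaries from variable name to required truth value; this representation and the function names are illustrative choices. The counter estimates Pr(F)/S, so the sketch scales it by S to obtain an estimate of Pr(F).

```python
import math
import random

def clause_prob(clause, p):
    """Probability that a conjunction of independent literals is true."""
    return math.prod(p[v] if val else 1 - p[v] for v, val in clause.items())

def satisfied(clause, world):
    return all(world[v] == val for v, val in clause.items())

def karp_luby(clauses, p, n, rng):
    probs = [clause_prob(c, p) for c in clauses]
    s = sum(probs)
    cnt = 0
    for _ in range(n):
        # Pick clause i with probability Pr(Ci)/S.
        i = rng.choices(range(len(clauses)), weights=probs)[0]
        # Sample a world conditioned on Ci being true.
        world = {v: rng.random() < pv for v, pv in p.items()}
        world.update(clauses[i])
        # Count the sample only if no earlier clause is satisfied.
        if not any(satisfied(c, world) for c in clauses[:i]):
            cnt += 1
    return s * cnt / n

p = {"X1": 0.5, "X2": 0.5, "X3": 0.5}
clauses = [
    {"X1": True, "X2": True},
    {"X1": True, "X3": True},
    {"X2": True, "X3": True},
]
estimate = karp_luby(clauses, p, 20000, random.Random(1))
print(estimate)  # close to the exact value 0.5
```

Each satisfying world is counted only for the first clause it satisfies, which is what removes the double counting that the plain sum S would contain.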

Learning “Soft” Rules
• Extend Inductive Logic Programming (ILP) techniques to large and incomplete knowledge bases
• Goal: learn livesIn(?x,?y) ⇐ bornIn(?x,?y)
• Positive examples: livesIn(?x,?y) ∧ bornIn(?x,?y)
• Negative examples: ¬livesIn(?x,?y) ∧ bornIn(?x,?y) ∧ livesIn(?x,?z)
[Figure: Venn diagram over the background knowledge with regions bornIn(x,y), livesIn(x,y), and livesIn(x,z)]
Software tools:
alchemy.cs.washington.edu
http://www.doc.ic.ac.uk/~shm/progol.html
http://dtai.cs.kuleuven.be/ml/systems/claudien

More Variants of Consistency Reasoning
• Propositional Reasoning
  • Constrained Weighted MaxSat solver
• Lineage & Possible Worlds (independent base facts)
  • Monte Carlo simulations (Luby-Karp)
• First-Order Logic & Probabilistic Graphical Models
  • Markov Logic (currently via interface to Alchemy*) [Richardson & Domingos: ML’06]
  • Even more general: Factor Graphs [McCallum et al. 2008]
  • MCMC sampling for probabilistic inference
*Alchemy, Open-Source AI: http://alchemy.cs.washington.edu/

Experiments
• YAGO Knowledge Base: 2 million entities, 20 million facts
• Basic query answering: SLD grounding & MaxSat solving of 10 queries over 16 soft rules (partly recursive) & 5 hard rules (bornIn, diedIn, marriedTo, …)
• Asymptotic runtime checks: runtime comparisons for synthetic soft rule expansions
  • URDF: SLD grounding & MaxSat solving
  • URDF vs. Markov Logic (MAP inference & MC-SAT)
Legend: |C| = number of literals in soft rules, |S| = number of literals in hard rules

French Marriage Problem (Revisited)
Facts in KB:
1: marriedTo(Hillary, Bill)
2: marriedTo(Carla, Nicolas)
3: marriedTo(Angelina, Brad)
New fact candidates:
4: marriedTo(Cecilia, Nicolas)
5: marriedTo(Carla, Benjamin)
6: marriedTo(Carla, Mick)
7: divorced(Madonna, Guy)
8: domPartner(Angelina, Brad)
Temporal annotations: validFrom(2, 2008), validFrom(4, 1996), validUntil(4, 2007), validFrom(5, 2010), validFrom(6, 2006), validFrom(7, 2008)
[Figure: timeline with a month axis JAN to DEC]
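With time annotations, the marriage constraint becomes a disjointness check on validity intervals. The sketch below applies it to the numbered marriedTo facts that carry a validFrom annotation. Two points are assumptions, not taken from the slides: a missing validUntil is treated as "still valid" (open-ended interval), and the constraint is applied symmetrically to both spouses. The function names are hypothetical.

```python
import math
from itertools import combinations

# marriedTo facts with temporal annotations, keyed by the slide's fact numbers.
married = {
    2: ("Carla", "Nicolas"),
    4: ("Cecilia", "Nicolas"),
    5: ("Carla", "Benjamin"),
    6: ("Carla", "Mick"),
}
# (validFrom, validUntil); math.inf stands for an assumed open end.
valid = {
    2: (2008, math.inf),
    4: (1996, 2007),
    5: (2010, math.inf),
    6: (2006, math.inf),
}

def disjoint(a, b):
    """True if the closed intervals a and b share no point in time."""
    return a[1] < b[0] or b[1] < a[0]

def temporal_conflicts(married, valid):
    """Pairs of marriages that share a person and overlap in time."""
    conflicts = []
    for (i, a), (j, b) in combinations(sorted(married.items()), 2):
        shares_person = bool(set(a) & set(b))
        if shares_person and set(a) != set(b) and not disjoint(valid[i], valid[j]):
            conflicts.append((i, j))
    return conflicts

conflicts = temporal_conflicts(married, valid)
print(conflicts)
```

Under these assumptions, facts 2 and 4 are compatible because their intervals do not overlap, while the candidates involving Carla still clash with fact 2 and with each other.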

Challenge: Temporal Knowledge Harvesting
• Consistency constraints are potentially helpful:
  • functional dependencies: {husband, time} → {wife, time}
  • inclusion dependencies: marriedPerson ⊆ adultPerson
  • age/time/gender restrictions: birthdate + Δ < marriage < divorce
For all people in Wikipedia (100,000’s) gather all spouses, incl. divorced & widowed, and corresponding time periods! >95% accuracy, >95% coverage, in one night!

Difficult Dating

(Even More Difficult) Implicit Dating
• vague dates
• relative dates
• narrative text
• relative order

TARSQI: Extracting Time Annotations [Verhagen et al: ACL ’05]
http://www.timeml.org/site/tarsqi/
Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election. Under electoral rules imposed by Chinese officials, only 796 people on the election committee – the bulk of them with close ties to mainland China – will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion – Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.
Slide annotation: extraction errors!

13 Relations between Time Intervals [Allen, 1984; Allen & Hayes, 1989]
• A Before B / B After A
• A Meets B / B MetBy A
• A Overlaps B / B OverlappedBy A
• A Starts B / B StartedBy A
• A During B / B Contains A
• A Finishes B / B FinishedBy A
• A Equal B
[Figure: interval diagrams for A and B illustrating each relation]
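The thirteen relations can be told apart purely by comparing interval endpoints. The following sketch classifies a pair of intervals; it assumes proper intervals given as (start, end) with start < end, and the function name `allen` and the lower-camel-case relation names are illustrative.

```python
def allen(a, b):
    """Return the Allen relation that holds between interval a and interval b."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e2 < s1:
        return "after"
    if e1 == s2:
        return "meets"
    if e2 == s1:
        return "metBy"
    if s1 == s2 and e1 == e2:
        return "equal"
    if s1 == s2:
        return "starts" if e1 < e2 else "startedBy"
    if e1 == e2:
        return "finishes" if s1 > s2 else "finishedBy"
    if s2 < s1 and e1 < e2:
        return "during"
    if s1 < s2 and e2 < e1:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if s1 < s2 else "overlappedBy"

print(allen((2003, 2007), (2002, 2005)))
print(allen((1996, 2007), (2008, 2010)))
```

Every relation in the list has an inverse obtained by swapping the arguments, except Equal, which is its own inverse.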

Possible Worlds in Time (I) [Wang, Yahya, Theobald: VLDB/MUD Workshop ’10]
Derived Facts
teamMates(Beckham, Ronaldo, T3) ⇐ playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1, T2)
Base Facts
• playsFor(Beckham, Real) (state relation)
• playsFor(Ronaldo, Real) (state relation)
[Figure: confidence histograms over time. Base facts: bins between ’00 and ’07 with values 0.6, 0.4, 1.0, 0.9, 0.4, 0.2, 0.2, 0.1. Derived state: bins between ’03 and ’07 with values 0.36, 0.16, 0.12, 0.08]

Possible Worlds in Time (II) [Wang, Yahya, Theobald: VLDB/MUD Workshop ’10]
Derived Facts
won(Beckham, ChampionsL, T3) ⇐ playsFor(Beckham, United, T1) ∧ wonCup(United, ChampionsL, T2) ∧ overlaps(T1, T2)
Base Facts
• playsFor(Beckham, United) (state)
• wonCup(United, ChampionsLeague) (event)
Need Lineage!
• Closed and complete representation model (incl. lineage) → Stanford Trio project [Widom: CIDR’05, Benjelloun et al: VLDB’06]
• Interval computation remains linear in the number of bins
• Confidence computation per bin is #P-complete → In general requires possible-worlds-based sampling techniques (Luby-Karp, Gibbs sampling, etc.)
[Figure: confidence histograms over time. Base facts: bins between ’95 and ’02 with values 0.6, 0.5, 0.9, 1.0, 0.2, 0.1, 0.3, 0.3, 0.2. Derived event: bins between ’96 and ’01 labeled “Non-independent” and “Independent” with values 0.30, 0.54, 0.12, 0.12, 0.06, 0.06]

Agenda • URDF: Reasoning in Uncertain Knowledge Bases • Resolving uncertainty at query-time • Lineage of answers • Propositional vs. probabilistic reasoning • Temporal reasoning extensions • UViz: The URDF Visualization Frontend • Demo!

UViz: The URDF Visualization Engine • UViz System Architecture • Flash client • Tomcat server (JRE) • Relational backend (JDBC) • Remote Method Invocation & Object Serialization (BlazeDS)

UViz: The URDF Visualization Engine Demo!
