Markov Logic: A Representation Language for Natural Language Semantics

Markov Logic: A Representation Language for Natural Language Semantics • Pedro Domingos • Dept. Computer Science & Eng., University of Washington • (Based on joint work with Stanley Kok, Matt Richardson and Parag Singla)

Overview • Motivation • Background • Representation • Inference • Learning • Applications • Discussion

Motivation • Natural language is characterized by • Complex relational structure • High uncertainty (ambiguity, imperfect knowledge) • First-order logic handles relational structure • Probability handles uncertainty • Let’s combine the two

Markov Logic [Richardson & Domingos, 2006] • Syntax: First-order logic + Weights • Semantics: Templates for Markov nets • Inference: Weighted satisfiability + MCMC • Learning: Voted perceptron + ILP

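Markov Networks • Undirected graphical models [Figure: example graph over nodes A, B, C, D] • Potential functions defined over cliques: $P(x) = \frac{1}{Z} \prod_c \Phi_c(x_c)$, with $Z = \sum_x \prod_c \Phi_c(x_c)$ • Log-linear form: $P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)$, where $w_i$ is the weight of feature $i$ and $f_i(x)$ is feature $i$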

First-Order Logic • Constants, variables, functions, predicates. E.g.: Anna, X, mother_of(X), friends(X, Y) • Grounding: Replace all variables by constants. E.g.: friends(Anna, Bob) • World (model, interpretation): Assignment of truth values to all ground predicates

Markov Logic Networks • A logical KB is a set of hard constraints on the set of possible worlds • Let’s make them soft constraints: When a world violates a formula, it becomes less probable, not impossible • Give each formula a weight (Higher weight ⇒ Stronger constraint)

Definition • A Markov Logic Network (MLN) is a set of pairs (F, w) where • F is a formula in first-order logic • w is a real number • Together with a set of constants, it defines a Markov network with • One node for each grounding of each predicate in the MLN • One feature for each grounding of each formula F in the MLN, with the corresponding weight w

Example: Friends & Smokers • Suppose we have two constants: Anna (A) and Bob (B) • Ground atoms (nodes of the ground Markov network): Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
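The definition and the Friends & Smokers example can be made concrete with a small sketch. The two formulas and their weights (1.5 and 1.1) are assumptions chosen for illustration, since the slide text above lists only the ground atoms; the helper names are hypothetical. The sketch grounds each formula over the constants A and B, sums the weights of the true groundings, and normalizes by brute-force enumeration of all 2^8 worlds.

```python
import itertools
import math

CONSTANTS = ["A", "B"]  # Anna and Bob

# One node per grounding of each predicate.
ATOMS = (
    [("Smokes", (c,)) for c in CONSTANTS]
    + [("Cancer", (c,)) for c in CONSTANTS]
    + [("Friends", (c, d)) for c in CONSTANTS for d in CONSTANTS]
)

def implies(p, q):
    return (not p) or q

# Assumed formulas (illustrative; not taken from the slide text).
def smoking_causes_cancer(world, x):
    return implies(world[("Smokes", (x,))], world[("Cancer", (x,))])

def friends_smoke_alike(world, x, y):
    return implies(world[("Friends", (x, y))],
                   world[("Smokes", (x,))] == world[("Smokes", (y,))])

# (weight, formula, number of variables); weights are assumed values.
FORMULAS = [
    (1.5, smoking_causes_cancer, 1),
    (1.1, friends_smoke_alike, 2),
]

def weighted_count(world):
    """Sum over formulas of weight * number of true groundings in this world."""
    total = 0.0
    for weight, formula, arity in FORMULAS:
        for args in itertools.product(CONSTANTS, repeat=arity):
            if formula(world, *args):
                total += weight
    return total

def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

# Partition function by brute force; only feasible for tiny domains.
Z = sum(math.exp(weighted_count(w)) for w in all_worlds())

def probability(world):
    return math.exp(weighted_count(world)) / Z
```

A world in which Anna smokes but does not have cancer violates one grounding of the first formula, so it is less probable than the all-false world by a factor of exp(-1.5), but not impossible.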

More on MLNs • MLN is template for ground Markov nets • Typed variables and constants greatly reduce size of ground Markov net • Functions, existential quantifiers, etc. • MLN without variables = Markov network (subsumes graphical models)

Relation to First-Order Logic • Infinite weights ⇒ First-order logic • Satisfiable KB, positive weights ⇒ Satisfying assignments = Modes of distribution • MLNs allow contradictions between formulas

MPE/MAP Inference • Find most likely truth values of non-evidence ground atoms given evidence • Apply weighted satisfiability solver (maximizes sum of weights of satisfied clauses) • MaxWalkSat algorithm [Kautz et al., 1997] • Start with random truth assignment • With prob p, flip atom that maximizes weight sum; else flip random atom in unsatisfied clause • Repeat n times • Restart m times
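The MaxWalkSat loop above can be sketched as follows. This is a minimal illustration, not the reference implementation: the clause encoding (a weight plus a list of `(atom, wanted_value)` literals), the function names, and the default parameters are assumptions. It also assumes positive weights and non-empty clauses, and it restricts the greedy choice to atoms of the randomly chosen unsatisfied clause, as WalkSAT-style solvers usually do.

```python
import random

def satisfied(literals, assignment):
    """A clause (disjunction) holds if any literal has its wanted truth value."""
    return any(assignment[atom] == wanted for atom, wanted in literals)

def satisfied_weight(clauses, assignment):
    """Sum of the weights of all satisfied clauses."""
    return sum(w for w, literals in clauses if satisfied(literals, assignment))

def max_walk_sat(clauses, atoms, p=0.5, n_flips=1000, n_restarts=10, seed=0):
    rng = random.Random(seed)
    best, best_weight = None, float("-inf")

    def weight_if_flipped(atom, assignment):
        # Temporarily flip the atom, score the result, then undo the flip.
        assignment[atom] = not assignment[atom]
        w = satisfied_weight(clauses, assignment)
        assignment[atom] = not assignment[atom]
        return w

    for _ in range(n_restarts):                      # restart m times
        assignment = {a: rng.random() < 0.5 for a in atoms}
        for flip in range(n_flips + 1):              # repeat n times
            current = satisfied_weight(clauses, assignment)
            if current > best_weight:
                best, best_weight = dict(assignment), current
            unsatisfied = [lits for _, lits in clauses
                           if not satisfied(lits, assignment)]
            if not unsatisfied or flip == n_flips:
                break
            literals = rng.choice(unsatisfied)
            if rng.random() < p:
                # Greedy move: flip the atom that maximizes the weight sum.
                atom = max((a for a, _ in literals),
                           key=lambda a: weight_if_flipped(a, assignment))
            else:
                # Random move: flip a random atom of the unsatisfied clause.
                atom = rng.choice(literals)[0]
            assignment[atom] = not assignment[atom]
    return best, best_weight
```

For example, `max_walk_sat([(2.0, [("a", True)]), (1.0, [("a", False), ("b", True)]), (0.5, [("b", False)])], ["a", "b"])` searches a four-state space whose best assignment sets both atoms to true.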

Conditional Inference • P(Formula|MLN,C) = ? • MCMC: Sample worlds, check formula holds • P(Formula1|Formula2,MLN,C) = ? • If Formula2 = Conjunction of ground atoms • First construct min subset of network necessary to answer query (generalization of KBMC) • Then apply MCMC (or other)

Ground Network Construction • Initialize Markov net to contain all query preds • For each node in network • Add node’s Markov blanket to network • Remove any evidence nodes • Repeat until done
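One way to read the construction above is as a breadth-first expansion from the query atoms that stops at evidence. The sketch below assumes ground clauses are given simply as sets of atom names, so that an atom's Markov blanket is every atom sharing a clause with it; the function names are hypothetical.

```python
from collections import deque

def markov_blanket(atom, ground_clauses):
    """Atoms that appear together with `atom` in at least one ground clause."""
    blanket = set()
    for clause_atoms in ground_clauses:
        if atom in clause_atoms:
            blanket.update(clause_atoms)
    blanket.discard(atom)
    return blanket

def construct_network(query_atoms, evidence_atoms, ground_clauses):
    """Return (non-evidence atoms needed for the query, evidence atoms on the boundary)."""
    network = set(query_atoms)
    boundary = set()
    frontier = deque(query_atoms)
    while frontier:                      # repeat until done
        atom = frontier.popleft()
        for neighbour in markov_blanket(atom, ground_clauses):
            if neighbour in evidence_atoms:
                # Evidence is kept only as a fixed boundary and never expanded.
                boundary.add(neighbour)
            elif neighbour not in network:
                network.add(neighbour)
                frontier.append(neighbour)
    return network, boundary
```

Atoms that are reachable only through evidence, or not connected to the query at all, never enter the network, which is what keeps the sampled network small.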

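Probabilistic Inference • Recall: $P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)$ • Exact inference is #P-complete • Conditioning on Markov blanket is easy: $P(x \mid MB(x)) = \frac{\exp\left(\sum_i w_i f_i(x)\right)}{\exp\left(\sum_i w_i f_i(x{=}0)\right) + \exp\left(\sum_i w_i f_i(x{=}1)\right)}$ • Gibbs sampling exploits this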

Markov Chain Monte Carlo • Gibbs Sampler 1. Start with an initial assignment to nodes 2. One node at a time, sample node given others 3. Repeat 4. Use samples to compute P(X) • Apply to ground network • Initialization: MaxWalkSat • Can use multiple chains
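The Gibbs sampler steps can be sketched on a ground network of weighted clauses. This is an illustrative sketch under stated assumptions: clauses are encoded as a weight plus `(atom, wanted_value)` literals, the chain starts from a random assignment rather than from MaxWalkSat, a single chain is used, and the function names and defaults are hypothetical.

```python
import math
import random

def log_weight(clauses, state):
    """Sum of weights of satisfied ground clauses (unnormalized log-probability)."""
    return sum(w for w, lits in clauses
               if any(state[a] == wanted for a, wanted in lits))

def gibbs_marginals(clauses, atoms, evidence, n_samples=20000, burn_in=1000, seed=0):
    """Estimate P(atom = True | evidence) for every non-evidence atom."""
    rng = random.Random(seed)
    state = dict(evidence)
    free = [a for a in atoms if a not in evidence]
    for a in free:                       # 1. initial assignment (random here)
        state[a] = rng.random() < 0.5
    counts = {a: 0 for a in free}
    for step in range(burn_in + n_samples):
        for a in free:                   # 2. one node at a time, given the others
            state[a] = True
            score_true = log_weight(clauses, state)
            state[a] = False
            score_false = log_weight(clauses, state)
            # Clauses not containing `a` cancel in the difference, so this is
            # the conditional given the Markov blanket.
            p_true = 1.0 / (1.0 + math.exp(score_false - score_true))
            state[a] = rng.random() < p_true
        if step >= burn_in:              # 4. use the samples to estimate P(X)
            for a in free:
                counts[a] += state[a]
    return {a: counts[a] / n_samples for a in free}
```

Scoring all clauses for every flip is wasteful; a practical version would index clauses by atom and rescore only those in the atom's Markov blanket.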

Learning • Data is a relational database • Closed world assumption (if not: EM) • Learning parameters (weights) • Generatively: Pseudo-likelihood • Discriminatively: Voted perceptron + MaxWalkSat • Learning structure • Generalization of feature induction in Markov nets • Learn and/or modify clauses • Inductive logic programming with pseudo-likelihood as the objective function

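Generative Weight Learning • Maximize likelihood (or posterior) • Use gradient ascent: $\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$, where $n_i(x)$ is the feature count according to data and $E_w[n_i(x)]$ is the feature count according to model • Requires inference at each step (slow!)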

Pseudo-Likelihood [Besag, 1975] • Likelihood of each variable given its Markov blanket in the data • Does not require inference at each step • Widely used

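Optimization • Parameter tying over groundings of same clause • Maximize using L-BFGS [Liu & Nocedal, 1989], with gradient $\frac{\partial}{\partial w_i} \log PL_w = \sum_x \left[ nsat_i(x) - p(x{=}0)\, nsat_i(x{=}0) - p(x{=}1)\, nsat_i(x{=}1) \right]$, where the sum is over ground predicates x, $p(x{=}v)$ is the probability that x takes value v given its Markov blanket, and $nsat_i(x{=}v)$ is the number of satisfied groundings of clause i in the training data when x takes value v • Most terms not affected by changes in weights • After initial setup, each iteration takes O(# ground predicates × # first-order clauses)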
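The pseudo-log-likelihood and its gradient can be computed directly from satisfied-grounding counts, as sketched below. The data layout is an assumption: each first-order clause is a list of its ground clauses, each ground clause is a list of `(atom, wanted_value)` literals, and all groundings of a clause share one weight (parameter tying). The function names are hypothetical, and the counts are recomputed from scratch rather than cached.

```python
import math

def nsat(groundings, world):
    """Number of satisfied groundings of one first-order clause."""
    return sum(any(world[a] == wanted for a, wanted in g) for g in groundings)

def pseudo_log_likelihood(weights, clauses, world):
    """Return (log pseudo-likelihood, gradient with respect to each clause weight)."""
    pll = 0.0
    grad = [0.0] * len(weights)
    for atom, actual in world.items():
        counts, scores = {}, {}
        for value in (False, True):
            flipped = dict(world)
            flipped[atom] = value
            counts[value] = [nsat(g, flipped) for g in clauses]
            scores[value] = sum(w * n for w, n in zip(weights, counts[value]))
        log_norm = math.log(math.exp(scores[False]) + math.exp(scores[True]))
        # log P(atom = actual | rest of the data)
        pll += scores[actual] - log_norm
        for i in range(len(weights)):
            expected = sum(math.exp(scores[v] - log_norm) * counts[v][i]
                           for v in (False, True))
            grad[i] += counts[actual][i] - expected
    return pll, grad
```

No inference over the whole network is needed: each term conditions on the observed values of all other atoms. The returned pair can be negated and handed to any quasi-Newton optimizer, such as an L-BFGS routine.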

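Discriminative Weight Learning • Gradient of conditional log-likelihood: $\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$ • $n_i(x, y)$: # true groundings of formula in DB • $E_w[n_i(x, y)]$: Expected # of true groundings – slow! • Approximate expected count by MAP count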

Voted Perceptron [Collins, 2002] • Used for discriminative training of HMMs • Expected count in gradient approximated by count in MAP state • MAP state found using Viterbi algorithm • Weights averaged over all iterations • initialize w_i = 0 • for t = 1 to T do • find the MAP configuration using Viterbi • w_i ← w_i + η * (training count – MAP count) • end for

Voted Perceptron for MLNs [Singla & Domingos, 2004] • HMM is special case of MLN • Expected count in gradient approximated by count in MAP state • MAP state found using MaxWalkSat • Weights averaged over all iterations • initialize w_i = 0 • for t = 1 to T do • find the MAP configuration using MaxWalkSat • w_i ← w_i + η * (training count – MAP count) • end for
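The update loop can be sketched as follows. To keep the example short and deterministic, the MAP state is found by brute-force enumeration of the query atoms, which is a stand-in for MaxWalkSat and only works on tiny problems. The clause encoding (one list of ground clauses per first-order clause, each a list of `(atom, wanted_value)` literals), the function names, and the default learning rate are assumptions.

```python
import itertools

def counts(clauses, world):
    """True-grounding count of every first-order clause in the given world."""
    return [sum(any(world[a] == wanted for a, wanted in g) for g in groundings)
            for groundings in clauses]

def map_state(weights, clauses, evidence, query_atoms):
    """Brute-force MAP over the query atoms (stand-in for MaxWalkSat)."""
    best, best_score = None, float("-inf")
    for values in itertools.product([False, True], repeat=len(query_atoms)):
        world = dict(evidence)
        world.update(zip(query_atoms, values))
        score = sum(w * n for w, n in zip(weights, counts(clauses, world)))
        if score > best_score:
            best, best_score = world, score
    return best

def voted_perceptron(clauses, evidence, truth, n_iters=20, eta=0.1):
    query_atoms = sorted(truth)
    train_world = dict(evidence)
    train_world.update(truth)
    train_counts = counts(clauses, train_world)
    weights = [0.0] * len(clauses)              # initialize w_i = 0
    totals = [0.0] * len(clauses)
    for _ in range(n_iters):                    # for t = 1 to T
        map_counts = counts(clauses, map_state(weights, clauses, evidence, query_atoms))
        # w_i <- w_i + eta * (training count - MAP count)
        weights = [w + eta * (t - m)
                   for w, t, m in zip(weights, train_counts, map_counts)]
        totals = [s + w for s, w in zip(totals, weights)]
    return [s / n_iters for s in totals]        # weights averaged over all iterations
```

Once the MAP state matches the training data the counts agree and the weights stop changing, so the average is dominated by the converged weights when T is large.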

Applications to Date • Entity resolution (Cora, BibServ) • Information extraction for biology (won LLL-2005 competition) • Probabilistic Cyc • Link prediction • Topic propagation in scientific communities • Etc.

Entity Resolution • Most logical systems make unique names assumption • What if we don’t? • Equality predicate: Same(A,B), or A = B • Equality axioms • Reflexivity, symmetry, transitivity • For every unary predicate P: x1 = x2 => (P(x1) <=> P(x2)) • For every binary predicate R: x1 = x2 ∧ y1 = y2 => (R(x1,y1) <=> R(x2,y2)) • Etc. • But in Markov logic these are soft and learnable • Can also introduce reverse direction: R(x1,y1) ∧ R(x2,y2) ∧ x1 = x2 => y1 = y2 • Surprisingly, this is all that’s needed

Example: Citation Matching

Markov Logic Formulation: Predicates • Are two bibliography records the same? SameBib(b1,b2) • Are two field values the same? SameAuthor(a1,a2), SameTitle(t1,t2), SameVenue(v1,v2) • How similar are two field strings? Predicates for ranges of cosine TF-IDF score: • TitleTFIDF.0(t1,t2) is true iff TF-IDF(t1,t2) = 0 • TitleTFIDF.2(t1,t2) is true iff 0 < TF-IDF(t1,t2) < 0.2 • Etc.
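The range predicates can be generated by computing a cosine TF-IDF score for a pair of field strings and mapping it to a bucket. The sketch below is illustrative only: the whitespace tokenizer, the smoothed IDF formula, the five equal-width buckets, the inclusive upper bounds, and the label of the top bucket (`.10`) are all assumptions, since the slide spells out only the `.0` and `.2` predicates.

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    """Bag-of-words TF-IDF vectors (smoothed IDF) for a small corpus of field strings."""
    docs = [Counter(s.lower().split()) for s in strings]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct token once.
    df = Counter(tok for doc in docs for tok in doc)
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in df}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def range_predicate(score, prefix="TitleTFIDF"):
    """Map a similarity score to the name of the range predicate that is true."""
    if score <= 0.0:
        return f"{prefix}.0"
    # Buckets (0, 0.2], (0.2, 0.4], ..., (0.8, 1.0]; clamp against rounding error.
    k = min(5, math.ceil(score * 5))
    return f"{prefix}.{2 * k}"
```

Each pair of field values then contributes exactly one true range predicate as evidence, and the MLN learns a separate weight for each range.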

Markov Logic Formulation: Formulas • Unit clauses (defaults): !SameBib(b1,b2) • Two fields are same => Corresponding bib. records are same: Author(b1,a1) ∧ Author(b2,a2) ∧ SameAuthor(a1,a2) => SameBib(b1,b2) • Two bib. records are same => Corresponding fields are same: Author(b1,a1) ∧ Author(b2,a2) ∧ SameBib(b1,b2) => SameAuthor(a1,a2) • High similarity score => Two fields are same: TitleTFIDF.8(t1,t2) => SameTitle(t1,t2) • Transitive closure (not incorporated in experiments): SameBib(b1,b2) ∧ SameBib(b2,b3) => SameBib(b1,b3) • 25 predicates, 46 first-order clauses

What Does This Buy You? • Objects are matched collectively • Multiple types matched simultaneously • Constraints are soft, and strengths can be learned from data • Easy to add further knowledge • Constraints can be refined from data • Standard approach still embedded

Example Subset of a Bibliography Database

Standard Approach [Fellegi & Sunter, 1969] • [Figure: two independent record-match nodes, "b1=b2?" and "b3=b4?", each linked to its own field-similarity (evidence) nodes. For b1=b2: Title Sim(Object Identification using CRFs, Object Identification using CRFs), Author Sim(Linda Stewart, Linda Stewart), Venue Sim(PKDD 04, 8th PKDD). For b3=b4: Title Sim(Learning Boolean Formulas, Learning of Boolean Expressions), Author Sim(Bill Johnson, William Johnson), Venue Sim(PKDD 04, 8th PKDD).]

What’s Missing? • [Figure: the two record-match nodes "b1=b2?" and "b3=b4?" with separate Title, Author and Venue field-similarity nodes; both have their own copy of Venue Sim(PKDD 04, 8th PKDD).] • If from b1=b2, you infer that “PKDD 04” is same as “8th PKDD”, how can you use that to help figure out if b3=b4?

Merging the Evidence Nodes • [Figure: record-match nodes "b1=b2?" and "b3=b4?" now share a single Venue evidence node Sim(PKDD 04, 8th PKDD); the Title nodes Sim(Object Identification using CRFs, Object Identification using CRFs) and Sim(Learning Boolean Formulas, Learning of Boolean Expressions) and the Author nodes Sim(Linda Stewart, Linda Stewart) and Sim(Bill Johnson, William Johnson) stay separate.] • Still does not solve the problem. Why?

Introducing Field-Match Nodes • [Figure: full representation in the collective model. Field-match nodes b1.T=b2.T?, b1.A=b2.A?, b1.V=b2.V? and b3.T=b4.T?, b3.A=b4.A?, b3.V=b4.V? sit between the record-match nodes "b1=b2?" and "b3=b4?" and the evidence nodes: Title Sim(Object Identification using CRFs, Object Identification using CRFs), Title Sim(Learning Boolean Formulas, Learning of Boolean Expressions), Author Sim(Linda Stewart, Linda Stewart), Author Sim(Bill Johnson, William Johnson), and a shared Venue Sim(PKDD 04, 8th PKDD).]

Flow of Information • [Figure: animation over the collective model showing information flowing between the record-match nodes "b1=b2?" and "b3=b4?" through the field-match nodes (b1.T=b2.T?, b1.A=b2.A?, b1.V=b2.V?, b3.T=b4.T?, b3.A=b4.A?, b3.V=b4.V?) and the shared Venue evidence node Sim(PKDD 04, 8th PKDD).]

Experiments • Databases: • Cora [McCallum et al., IRJ, 2000]: 1295 records, 132 papers • BibServ.org [Richardson & Domingos, ISWC-03]: 21,805 records, unknown # of papers • Goal: De-duplicate bib. records, authors and venues • Pre-processing: Form canopies [McCallum et al., KDD-00] • Compared with naïve Bayes (standard method), etc. • Measured area under precision-recall curve (AUC) • Our approach wins across the board

Results: Matching Venues on Cora
