Learning the Structure of Markov Logic Networks Stanley Kok
Overview • Introduction • CLAUDIEN, CRFs • Algorithm • Evaluation Measure • Clause Construction • Search Strategies • Speedup Techniques • Experiments
Introduction • Richardson & Domingos (2004) learned MLN structure in two disjoint steps: • Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN) • Learn clause weights by optimizing pseudo-likelihood • This work develops an algorithm that: • Learns first-order clauses by directly optimizing pseudo-likelihood • Is fast enough to be practical • Learns better structure than R&D, pure ILP, purely probabilistic, and purely knowledge-based approaches
CLAUDIEN • CLAUsal DIscovery ENgine • Starts with the trivially false clause • Repeatedly refines current clauses by adding literals • Adds clauses that satisfy minimum accuracy and coverage to the KB • Refinement lattice (over literals m, f, h): true ⇒ false; m ⇒ false; f ⇒ false; h ⇒ false; h ⇒ f; h ⇒ m; f ⇒ h; f ⇒ m; f ∧ h ⇒ false; m ⇒ h; m ∧ f ⇒ false; m ∧ h ⇒ false; m ⇒ f; h ⇒ m ∨ f
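The refinement loop above can be sketched propositionally. This is a minimal illustration, not CLAUDIEN itself: the toy interpretations, literal set, and validity threshold are invented for the example.

```python
# Minimal sketch of CLAUDIEN-style top-down clausal discovery.
# A clause is a set of literals read as a disjunction; the empty
# clause is the trivially false clause "true => false".
DATA = [  # toy interpretations over propositions m(ale), f(emale), h(uman)
    {"m": True,  "f": False, "h": True},
    {"m": False, "f": True,  "h": True},
]
LITERALS = ["m", "f", "h", "!m", "!f", "!h"]

def holds(lit, world):
    return not world[lit[1:]] if lit.startswith("!") else world[lit]

def accuracy(clause):
    return sum(any(holds(l, w) for l in clause) for w in DATA) / len(DATA)

def claudien(max_len=2, min_acc=1.0):
    kb, frontier = [], [frozenset()]
    for _ in range(max_len):
        nxt = set()
        for clause in frontier:
            preds = {l.lstrip("!") for l in clause}
            for lit in LITERALS:                 # refine by adding one literal
                if lit.lstrip("!") in preds:
                    continue                     # skip duplicated predicates
                c2 = clause | {lit}
                if any(k <= c2 for k in kb):     # subsumed by a kept clause
                    continue
                if accuracy(c2) >= min_acc:      # valid in all interpretations
                    kb.append(c2)
                else:
                    nxt.add(c2)                  # keep refining
        frontier = nxt
    return kb
```

On this toy data the engine keeps the most general valid clauses, e.g. `h` ("true ⇒ h") and `m ∨ f`, and prunes their refinements by subsumption.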
CLAUDIEN • language bias ≡ clause template • Can refine a handcrafted KB • Example: • Professor(P) ⇐ AdvisedBy(S,P) in KB • dlab_template('1-2:[Professor(P),Student(S)]<-AdvisedBy(S,P)') • Professor(P) ∨ Student(S) ⇐ AdvisedBy(S,P)
Conditional Random Fields • Markov networks used to compute P(y|x) (McCallum 2003) • [Figure: linear-chain CRF with hidden labels y1 … yn (e.g. Misc, Person, Org) over observed words x1, x2, …, xn, e.g. "IBM hired Alice …"] • Model: P(y|x) = (1/Zx) exp( Σt Σk λk fk(yt−1, yt, x, t) ) • Features fk, e.g. "current word is capitalized and next word is Inc"
CRF – Feature Induction • Set of atomic features (word=the, capitalized, etc.) • Starts from an empty CRF • While convergence criterion is not met: • Create a list of new features consisting of • Atomic features • Binary conjunctions of atomic features • Conjunctions of atomic features with features already in the model • Evaluate the gain in P(y|x) of adding each feature to the model • Add the best K features to the model (100s–1000s of features)
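The induction loop can be sketched as follows. This is an illustrative skeleton only: the toy data and features are invented, and a simple label-split score stands in for the true gain in P(y|x).

```python
# Sketch of greedy CRF feature induction: candidates are atomic features
# plus conjunctions of atomics with features already in the model.
DATA = [  # toy (active atomic features, label) pairs
    ({"cap", "next=Inc"}, 1), ({"cap"}, 1), ({"word=the"}, 0), (set(), 0)]
ATOMIC = ["cap", "next=Inc", "word=the"]

def gain(feat):
    # Stand-in for the gain in P(y|x): how unevenly the conjunction
    # (a frozenset of atomics) splits the labels.
    pos = sum(feat <= atoms for atoms, y in DATA if y == 1)
    neg = sum(feat <= atoms for atoms, y in DATA if y == 0)
    return abs(pos - neg)

def induce(k=2, rounds=2):
    model = []
    for _ in range(rounds):
        cands = {frozenset([a]) for a in ATOMIC}           # atomic features
        cands |= {f | {a} for a in ATOMIC for f in model   # conjunctions with
                  if a not in f}                           # model features
        cands -= set(model)
        best = sorted(cands, key=gain, reverse=True)[:k]   # evaluate each gain
        model += [f for f in best if gain(f) > 0]          # add the best k
    return model
```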
Algorithm • High-level algorithm: Repeat — Clauses ← FindBestClauses(MLN); Add Clauses to MLN — Until Clauses = ∅ • FindBestClauses(MLN): Search for and create candidate clauses; For each candidate clause c, compute the evaluation-measure gain of adding c to the MLN; Return the k clauses with highest gain
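The outer loop can be written as a small Python skeleton; `candidates` and `gain` are placeholder hooks, not the real clause generator or WPLL gain.

```python
def find_best_clauses(mln, candidates, gain, k=5):
    """Score every candidate clause; return the k with highest positive gain."""
    scored = [(gain(mln, c), c) for c in candidates(mln)]
    scored = sorted((g, c) for g, c in scored if g > 0)  # improving clauses only
    return [c for _, c in reversed(scored)][:k]

def learn_structure(mln, find_best):
    """Repeat: add the best clauses to the MLN until none improve it."""
    while True:
        clauses = find_best(mln)
        if not clauses:          # Clauses = empty set: done
            return mln
        mln = mln + clauses
```

The loop terminates because `find_best_clauses` only returns clauses whose gain is strictly positive.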
Evaluation Measure • Ideally use log-likelihood, but it is slow to optimize • Recall: P(X=x) = (1/Z) exp( Σi wi ni(x) ), where ni(x) is the number of true groundings of formula i • Value: log Pw(X=x) = Σi wi ni(x) − log Z • Gradient: ∂/∂wi log Pw(X=x) = ni(x) − Ew[ni(x)] • Computing Ew[ni(x)] requires inference over the model
Evaluation Measure • Use pseudo-log-likelihood (R&D 2004): log P*w(X=x) = Σl log Pw(Xl = xl | MBx(Xl)), where l ranges over all ground predicates and MBx(Xl) is the Markov blanket of Xl • But this gives undue weight to predicates with large numbers of groundings • E.g.: a predicate with |Person|² groundings (AdvisedBy) contributes |Person| times as many terms as one with |Person| groundings (Student)
Evaluation Measure • Use weighted pseudo-log-likelihood (WPLL): log P•w(X=x) = Σr cr Σg∈Gr log Pw(Xg = xg | MBx(Xg)), where r ranges over first-order predicates, Gr is the set of groundings of r, and cr = 1/|Gr| • E.g.: each predicate now contributes the mean CLL of its groundings, regardless of how many groundings it has
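A sketch of the weighting, assuming the per-grounding CLLs have already been computed and gathered by first-order predicate:

```python
def pll(cll_by_pred):
    """Pseudo-log-likelihood: predicates with many groundings dominate."""
    return sum(sum(clls) for clls in cll_by_pred.values())

def wpll(cll_by_pred):
    """Weighted PLL with c_r = 1/|G_r|: each first-order predicate
    contributes the mean CLL of its groundings."""
    return sum(sum(clls) / len(clls) for clls in cll_by_pred.values())
```

With hypothetical CLLs of four AdvBy groundings at −1.0 and two Prof groundings at −0.5, the PLL is −5.0 (dominated by AdvBy) while the WPLL is −1.5 (one mean term per predicate).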
Algorithm • High-level algorithm: Repeat — Clauses ← FindBestClauses(MLN); Add Clauses to MLN — Until Clauses = ∅ • FindBestClauses(MLN): Search for and create candidate clauses; For each candidate clause c, compute the evaluation-measure gain of adding c to the MLN; Return the k clauses with highest gain
Clause Construction • Add a literal (negative/positive) • Consider all possible ways the new literal's variables can be shared with those of the clause • ¬Student(S) ∨ AdvBy(S,P) • Remove a literal (when refining an existing MLN) • Removes spurious conditions from rules • ¬Student(S) ∨ ¬YrInPgm(S,5) ∨ TA(S,C) ∨ TmpAdvBy(S,P)
Clause Construction • Flip signs of literals (when refining an existing MLN) • Moves literals that are on the wrong side of an implication • ¬CseQtr(C1,Q1) ∨ ¬CseQtr(C2,Q2) ∨ ¬SameCse(C1,C2) ∨ ¬SameQtr(Q1,Q2) • Done at the beginning of the algorithm; expensive, so optional • Limit the number of distinct variables to restrict the search space
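The three operators can be sketched over clauses represented as frozensets of (sign, predicate, args) literals. The predicate names and fresh-variable names here are illustrative, and the add operator generates some alphabetic variants redundantly in this sketch.

```python
from itertools import product

# A clause is a frozenset of literals: (sign, predicate, args-tuple).

def add_literal(clause, pred, arity):
    """All ways to add pred: each argument slot is either shared with an
    existing variable or bound to a fresh one."""
    old = sorted({v for _, _, args in clause for v in args})
    slots = [old + [f"V{i}"] for i in range(arity)]  # existing vars + fresh
    return {clause | {(sign, pred, args)}
            for sign in (True, False) for args in product(*slots)}

def remove_literal(clause):
    """Drop one literal (removes spurious conditions when refining an MLN)."""
    return {clause - {lit} for lit in clause}

def flip_signs(clause):
    """Flip one literal's sign (moves it across the implication)."""
    return {(clause - {(s, p, a)}) | {(not s, p, a)} for (s, p, a) in clause}
```

For example, adding a binary AdvBy literal to ¬Student(S) yields, among others, ¬Student(S) ∨ AdvBy(S,P) with the first argument shared and the second fresh.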
Algorithm • High-level algorithm: Repeat — Clauses ← FindBestClauses(MLN); Add Clauses to MLN — Until Clauses = ∅ • FindBestClauses(MLN): Search for and create candidate clauses; For each candidate clause c, compute the evaluation-measure gain of adding c to the MLN; Return the k clauses with highest gain
Search Strategies • Shortest-first search (SFS) • Start with a candidate set of length-2 clauses, e.g. ¬AdvBy(S,P) ∨ Stu(S) • Find the gain of each clause • Sort the clauses by gain • Return the top 5 with positive gain and add them to the MLN [MLN: wt1, ¬AdvBy(S,P); wt2, clause2; …] • Retrain the weights of the MLN • (Yikes! What if all length-2 clauses have gains ≤ 0?)
Shortest-First Search • Extend the 20 length-2 clauses with highest gains, e.g. ¬AdvBy(S,P) ∨ Stu(S) → ¬AdvBy(S,P) ∨ Stu(S) ∨ Prof(P) • Form a new candidate set • Keep the 1000 clauses with highest gains
Shortest-First Search • Repeat the process • Extend all length-2 clauses before any length-3 ones • How do you refine a non-empty MLN?
SFS – MLN Refinement • Extend the 20 length-2 candidate clauses with highest gains • Extend length-2 clauses in the MLN • Remove a predicate from length-4 clauses in the MLN • Flip signs of length-3 clauses in the MLN (optional) • A refinement produced by the last three operations replaces the original clause in the MLN
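The shortest-first discipline above can be sketched with the clause representation, refinement operator, and gain left abstract; the constants mirror the slides (extend the top 20, keep a pool of 1000, add up to 5 per round).

```python
def sfs(frontier, refine, gain, rounds=2, add_k=5, extend_top=20, pool=1000):
    """Shortest-first search: fully process all clauses of one length
    (adding the best to the MLN) before extending to the next length."""
    mln = []
    for _ in range(rounds):
        frontier.sort(key=gain, reverse=True)
        mln += [c for c in frontier[:add_k] if gain(c) > 0]   # top 5, gain > 0
        extended = {r for c in frontier[:extend_top] for r in refine(c)}
        frontier = sorted(extended, key=gain, reverse=True)[:pool]
    return mln
```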
Search Strategies • Beam Search • Keep a beam of the 5 clauses with highest gains • Track the best clause found so far • Stop when the best clause does not change for two consecutive iterations • How do you refine a non-empty MLN? Seed the beam with the MLN's clauses [MLN: wt1, clause1; wt2, clause2; …]
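A schematic of the beam-search variant, again with `refine` and `gain` as abstract hooks:

```python
def beam_search(seed, refine, gain, beam=5, patience=2):
    """Keep the 5 highest-gain refinements; stop when the best clause
    found so far is unchanged for two consecutive iterations."""
    frontier, best, stale = [seed], seed, 0
    while stale < patience:
        cands = {r for c in frontier for r in refine(c)}
        frontier = sorted(cands, key=gain, reverse=True)[:beam]
        top = max(frontier, key=gain, default=best)
        if gain(top) > gain(best):
            best, stale = top, 0          # improvement: reset the counter
        else:
            stale += 1                    # best clause unchanged
    return best
```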
Algorithm • High-level algorithm: Repeat — Clauses ← FindBestClauses(MLN); Add Clauses to MLN — Until Clauses = ∅ • FindBestClauses(MLN): Search for and create candidate clauses; For each candidate clause c, compute the evaluation-measure gain of adding c to the MLN; Return the k clauses with highest gain
Differences from CRF Feature Induction • We can refine a non-empty MLN • We use pseudo-likelihood, and different optimizations • Applicable to arbitrary Markov networks (not only linear chains) • We maintain a separate candidate set • We add the best ≈10s of clauses to the model (vs. 100s–1000s of features) • Flexible enough to fit into different search algorithms
Overview • Introduction • CLAUDIEN, CRFs • Algorithm • Evaluation Measure • Clause Construction • Search Strategies • Speedup Techniques • Experiments
Speedup Techniques • Recall FindBestClauses(MLN): Search for and create candidate clauses; For each candidate clause c, compute the WPLL gain of adding c to the MLN; Return the k clauses with highest gain • LearnWeights(MLN+c) optimizes the WPLL with L-BFGS • L-BFGS computes the value and gradient of the WPLL • There are many candidate clauses, so it is important to compute the WPLL and its gradient efficiently
Speedup Techniques • A ground predicate's CLL: log Pw(Xg = xg | MBx(Xg)) • When computing it, ignore clauses in which the predicate does not appear • E.g., if predicate l does not appear in clause 1, clause 1 contributes nothing to l's CLL
Speedup Techniques • A ground predicate's CLL is affected only by the clauses that contain it • Most clause weights do not change significantly between iterations • So most CLLs do not change much • Hence we don't have to recompute all CLLs • Store the WPLL and the CLLs • Recompute a CLL only if a weight affecting it changes beyond some threshold • Subtract the old CLL from, and add the new CLL to, the WPLL
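A sketch of the caching scheme (unweighted for brevity; all names here are invented for illustration):

```python
def refresh_wpll(wpll, clls, affected_by, weight_delta, recompute, eps=1e-4):
    """Recompute a ground predicate's cached CLL only when some clause
    weight affecting it moved by more than eps, patching the stored sum.

    clls: cached CLL per ground predicate; affected_by: ground predicate
    -> ids of clauses containing it; weight_delta: clause id -> weight
    change; recompute(g): fresh CLL of ground predicate g.
    """
    for g, clause_ids in affected_by.items():
        if any(abs(weight_delta.get(c, 0.0)) > eps for c in clause_ids):
            new = recompute(g)
            wpll += new - clls[g]   # subtract old CLL, add new CLL
            clls[g] = new
    return wpll
```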
Speedup Techniques • The WPLL is a sum over all ground predicates • Estimate the WPLL by • Uniformly sampling groundings of each first-order predicate • Sampling x% of the groundings, subject to a minimum and maximum • Extrapolating from the sample average
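A minimal sketch of the subsampling, assuming illustrative defaults for the fraction and the min/max clamps:

```python
import random

def estimate_pred_cll(groundings, cll, frac=0.05, lo=50, hi=10000, rng=random):
    """Estimate a predicate's mean CLL from a uniform sample of its
    groundings: sample frac of them, clamped to [lo, hi].  The sample
    mean extrapolates directly, since the WPLL already weights each
    predicate by 1/|G_r|."""
    n = min(len(groundings), max(lo, min(hi, int(frac * len(groundings)))))
    sample = rng.sample(groundings, n)
    return sum(cll(g) for g in sample) / n
```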
Speedup Techniques • The WPLL and its gradient require computing the number of true groundings of a clause • This is a #P-complete problem • Use Karp & Luby (1983)'s Monte-Carlo algorithm • Gives an estimate within ε of the true value with probability 1−δ • Draws samples of a clause's groundings • We found that the estimate converges faster than the algorithm specifies • So we apply a convergence test (DeGroot & Schervish 2002) after every 100 samples • Earlier termination
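The sample-and-stop-early idea can be sketched as below. This is not Karp & Luby's algorithm itself: a simple normal-approximation confidence interval stands in for the convergence test, and ε and the checking interval are illustrative.

```python
import math
import random

def estimate_true_groundings(total, sample_truth, eps=0.05, check_every=100,
                             max_samples=100_000, rng=random):
    """Monte-Carlo estimate of a clause's number of true groundings
    (exact counting is #P-complete): sample uniform random groundings,
    and every check_every samples stop early once a 95% confidence
    interval on the true fraction has half-width below eps."""
    z = 1.96                                 # two-sided 95% normal quantile
    true_count = 0
    for n in range(1, max_samples + 1):
        true_count += sample_truth(rng)      # 1 if a random grounding is true
        if n % check_every == 0:
            p = true_count / n
            if z * math.sqrt(p * (1 - p) / n) < eps:
                return p * total             # converged early
    return true_count / max_samples * total
```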
Speedup Techniques • L-BFGS is used to learn the clause weights that optimize the WPLL • Two parameters: • Maximum number of iterations • Convergence threshold • Use a smaller maximum number of iterations and a looser convergence threshold when evaluating a candidate clause's gain • Faster termination
Speedup Techniques • Impose a lexicographic ordering on clauses • Avoids redundant computation for clauses that are syntactically identical • Does not detect semantically identical but syntactically different clauses (an NP-complete problem) • Cache new clauses • Avoids recomputation
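One way to sketch the ordering is a canonical cache key: sort the literals, then rename variables in order of first appearance. This is a simple approximation, not the paper's exact scheme: variants whose literals sort differently may still get distinct keys, and semantic equivalence is not attempted.

```python
def canonical(clause):
    """Cache key for a clause (a frozenset of (sign, pred, args) literals):
    sort the literals lexicographically, then rename variables in order
    of first appearance so syntactic duplicates collapse to one key."""
    names, out = {}, []
    for sign, pred, args in sorted(clause):
        out.append((sign, pred,
                    tuple(names.setdefault(v, f"v{len(names)}") for v in args)))
    return tuple(out)

def cached_gain(clause, gain, cache={}):
    """Compute a clause's gain once per canonical form."""
    key = canonical(clause)
    if key not in cache:
        cache[key] = gain(clause)
    return cache[key]
```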
Speedup Techniques • Also used R&D (2004)'s techniques for the WPLL gradient: • Ignore predicates that don't appear in the ith formula • Ignore ground formulas whose truth value is unaffected by changing the truth value of any single literal • The number of true groundings of a clause is computed once and cached
Overview • Introduction • CLAUDIEN, CRFs • Algorithm • Evaluation Measure • Clause Construction • Search Strategies • Speedup Techniques • Experiments
Experiments • UW-CSE domain • 22 predicates, e.g. AdvisedBy, Professor, etc. • 10 types, e.g. Person, Course, Quarter, etc. • Total # of ground predicates ≈ 4 million • # of true ground predicates (in DB) = 3212 • Handcrafted KB with 94 formulas • E.g., each student has at most one advisor; if a student is an author of a paper, so is her advisor; etc.
Experiments • Cora domain • 1295 citations to 112 CS research papers • Author, Venue, Title, Year fields • 5 predicates, viz. SameCitation, SameAuthor, SameVenue, SameTitle, SameYear • Evidence predicates, e.g. WordsInCommonInTitle20%(title1, title2) • Total # of ground predicates ≈ 5 million • # of true ground predicates (in DB) = 378,589 • Handcrafted KB with 26 clauses • E.g., if two citations are the same, then they have the same authors, titles, etc., and vice versa; if two titles have many words in common, then they are the same; etc.
Systems • MLN(KB): weight-learning applied to handcrafted KB • MLN(CL): structure-learning with CLAUDIEN; weight-learning • MLN(KB+CL): structure-learning with CLAUDIEN, using the handcrafted KB as its language bias; weight-learning • MLN(SLB): structure-learning with beam search, start from empty MLN • MLN(KB+SLB): ditto, start from handcrafted KB • MLN(SLB+KB): structure-learning with beam search, start from empty MLN, allow handcrafted clauses to be added in a first search step • MLN(SLS): structure-learning with SFS, start from empty MLN
Systems • CL: CLAUDIEN alone • KB: handcrafted KB alone • KB+CL: CLAUDIEN with the KB as its language bias • NB: naïve Bayes • BN: Bayesian networks
Methodology • UW-CSE domain • DB divided into 5 areas: AI, graphics, languages, systems, theory • Leave-one-out testing by area • Cora domain • 5 different train-test splits • Measured • the average CLL of the predicates • the average area under the precision-recall curve of the predicates (AUC)
Results • MLN(SLS), MLN(SLB) better than • MLN(CL), MLN(KB), CL, KB, NB, BN • [Charts: negative CLL and AUC per system]
Results • MLN(SLB+KB) better than • MLN(KB+CL), KB+CL • [Charts: negative CLL and AUC per system]
Results • MLN(&lt;system&gt;) does better than the corresponding &lt;system&gt; • [Charts: negative CLL and AUC per system]
Results • MLN(SLS) on UW-CSE; cluster of 15 dual-CPU 2.8 GHz Pentium 4 machines • With speedups: 5.3 hrs • Without speedups: didn't finish running in 24 hrs • MLN(SLB) on UW-CSE; single 2.8 GHz Pentium 4 machine • With speedups: 8.8 hrs • Without speedups: 13.7 hrs
Future Work • Speeding up counting of # true groundings of clause • Probabilistically bounding the loss in accuracy due to subsampling • Probabilistic predicate discovery
Conclusion • Developed an algorithm that: • Learns first-order clauses by directly optimizing pseudo-likelihood • Is fast enough to be practical • Learns better structure than R&D, pure ILP, purely probabilistic, and purely knowledge-based approaches