Learning the Structure of Markov Logic Networks Stanley Kok & Pedro Domingos, Dept. of Computer Science and Eng., University of Washington
Overview • Motivation • Background • Structure Learning Algorithm • Experiments • Future Work & Conclusion
Motivation • Statistical Relational Learning (SRL) combines the benefits of: • Statistical Learning: uses probability to handle uncertainty in a robust and principled way • Relational Learning: models domains with multiple relations
Motivation • Many SRL approaches combine a logical language and Bayesian networks • e.g. Probabilistic Relational Models [Friedman et al., 1999] • The need to avoid cycles in Bayesian networks causes many difficulties [Taskar et al., 2002] • This led to the use of Markov networks instead
Motivation • Relational Markov Networks [Taskar et al., 2002] • conjunctive database queries + Markov networks • Require space exponential in the size of the cliques • Markov Logic Networks [Richardson & Domingos, 2004] • First-order logic + Markov networks • Compactly represent large cliques • Did not learn structure (used external ILP system) • This paper develops a fast algorithm that learns MLN structure • Most powerful SRL learner to date
Overview • Motivation • Background • Structure Learning Algorithm • Experiments • Future Work & Conclusion
Markov Logic Networks • First-order KB: set of hard constraints • A world that violates even one formula has zero probability • MLNs soften constraints • OK to violate formulas • The fewer formulas a world violates, the more probable it is • Each formula is given a weight that reflects how strong a constraint it is
MLN Definition • A Markov Logic Network (MLN) is a set of pairs (F, w) where • F is a formula in first-order logic • w is a real number • Together with a finite set of constants, it defines a Markov network with • One node for each grounding of each predicate in the MLN • One feature for each grounding of each formula F in the MLN, with the corresponding weight w
Ground Markov Network • Formula (weight 2.7): AdvisedBy(S,P) ⇒ Student(S) ∧ Professor(P) • Constants: STAN, PEDRO • Ground predicates: AdvisedBy(STAN,STAN), Student(STAN), Professor(STAN), AdvisedBy(STAN,PEDRO), AdvisedBy(PEDRO,STAN), Student(PEDRO), Professor(PEDRO), AdvisedBy(PEDRO,PEDRO)
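To make the correspondence concrete, here is a minimal sketch (not the authors' code; the string encodings are purely illustrative) of how the nodes and features of this ground network can be enumerated:

```python
from itertools import product

constants = ["STAN", "PEDRO"]
predicates = {"AdvisedBy": 2, "Student": 1, "Professor": 1}  # name -> arity

# One node per grounding of each predicate (8 nodes in this example).
nodes = [f"{pred}({','.join(args)})"
         for pred, arity in predicates.items()
         for args in product(constants, repeat=arity)]

# One feature per grounding of the formula
# AdvisedBy(S,P) => Student(S) ^ Professor(P), each with weight 2.7.
weight = 2.7
features = [(weight, f"AdvisedBy({s},{p}) => Student({s}) ^ Professor({p})")
            for s, p in product(constants, repeat=2)]

print(len(nodes), "ground predicates;", len(features), "ground formulas")
```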
MLN Model • x: vector of value assignments to the ground predicates • w_i: weight of the ith formula • n_i(x): # of true groundings of the ith formula • Z: partition function; sums over all possible value assignments to the ground predicates
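The equation these labels annotate did not survive extraction; it is the standard MLN distribution from Richardson & Domingos:

$$P_w(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_i w_i\, n_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big(\sum_i w_i\, n_i(x')\Big)$$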
MLN Weight Learning • Likelihood is a concave function of the weights • Quasi-Newton methods can find the optimal weights, e.g. L-BFGS [Liu & Nocedal, 1989] • SLOW: evaluating the likelihood requires counting the true groundings of each formula, which is #P-complete • SLOW: computing the gradient requires inference over the ground network, also #P-complete
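For reference, the gradient of the log-likelihood with respect to a weight (as given by Richardson & Domingos) is the difference between actual and expected counts; the expectation is what requires inference over all possible worlds:

$$\frac{\partial}{\partial w_i}\,\log P_w(X = x) \;=\; n_i(x) \;-\; E_w\big[n_i(X)\big]$$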
MLN Weight Learning • R&D used pseudo-likelihood[Besag, 1975]
MLN Weight Learning • R&D used pseudo-likelihood[Besag, 1975]
MLN Structure Learning • R&D “learned” MLN structure in two disjoint steps: • Learn first-order clauses with an off-the-shelf ILP system (CLAUDIEN [De Raedt & Dehaspe, 1997]) • Learn clause weights by optimizing pseudo-likelihood • Unlikely to give best results because CLAUDIEN • find clauses that hold with some accuracy/frequency in the data • don’t find clauses that maximize data’s (pseudo-)likelihood
Overview • Motivation • Background • Structure Learning Algorithm • Experiments • Future Work & Conclusion
MLN Structure Learning • This paper develops an algorithm that: • Learns first-order clauses by directly optimizing pseudo-likelihood • Is fast enough to be practical • Performs better than R&D, pure ILP, purely knowledge-based, and purely probabilistic approaches
Structure Learning Algorithm • High-level algorithm: REPEAT MLN ← MLN ∪ FindBestClauses(MLN) UNTIL FindBestClauses(MLN) returns NULL • FindBestClauses(MLN): Create candidate clauses; FOR EACH candidate clause c, compute increase in evaluation measure of adding c to MLN; RETURN k clauses with greatest increase
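A minimal Python sketch of this outer loop (not the authors' implementation; the MLN representation and the find_best_clauses callable are assumptions standing in for the candidate-generation and scoring step):

```python
def learn_structure(mln, find_best_clauses, k=5):
    """Outer loop of MLN structure learning (sketch).

    mln: list of (clause, weight) pairs -- a hypothetical representation.
    find_best_clauses(mln, k): assumed to return the k candidate clauses
        that most increase the evaluation measure, or [] if none improve it.
    """
    while True:
        best = find_best_clauses(mln, k)
        if not best:                      # FindBestClauses returned NULL
            return mln
        mln = mln + best                  # MLN <- MLN ∪ FindBestClauses(MLN)
```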
Structure Learning • Evaluation measure • Clause construction operators • Search strategies • Speedup techniques
Evaluation Measure • R&D used pseudo-log-likelihood • This gives undue weight to predicates with large # of groundings
Evaluation Measure • Weighted pseudo-log-likelihood (WPLL) • Gaussian weight prior • Structure prior • c_r: weight given to predicate r • inner sum is over the groundings of predicate r • CLL: conditional log-likelihood of each grounding
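The labels above annotate the WPLL equation, which was lost in extraction; a reconstruction consistent with them:

$$\log P^{\bullet}_w(X = x) \;=\; \sum_{r \in R} c_r \sum_{k=1}^{g_r} \log P_w\big(X_{r,k} = x_{r,k} \,\big|\, MB_x(X_{r,k})\big)$$

where $R$ is the set of first-order predicates, $g_r$ is the number of groundings of predicate $r$, and $c_r$ is the weight given to predicate $r$ (e.g. $c_r = 1/g_r$, so that predicates with many groundings do not dominate the objective).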
Clause Construction Operators • Add a literal (negative/positive) • Remove a literal • Flip signs of literals • Limit # of distinct variables to restrict search space
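A small sketch of these operators on an illustrative clause representation (a clause as a tuple of signed literals; this encoding is an assumption, not the paper's data structure):

```python
# A literal is (positive?, predicate, variables); a clause is a tuple of literals.
clause = ((True, "Student", ("x",)), (False, "AdvisedBy", ("x", "y")))

def add_literal(clause, literal):
    """Add a positive or negative literal."""
    return clause + (literal,)

def remove_literal(clause, i):
    """Remove the i-th literal."""
    return clause[:i] + clause[i + 1:]

def flip_signs(clause, positions):
    """Flip the signs of the literals at the given positions."""
    return tuple(((not sign) if i in positions else sign, pred, args)
                 for i, (sign, pred, args) in enumerate(clause))

def num_distinct_vars(clause):
    """Count distinct variables, used to bound the search space."""
    return len({v for _, _, args in clause for v in args})
```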
Beam Search • Same as that used in ILP & rule induction • Repeatedly find the single best clause
Shortest-First Search (SFS) • Start from an empty or hand-coded MLN • FOR L ← 1 TO MAX_LENGTH • Apply each literal addition & deletion to each clause to create clauses of length L • Repeatedly add the K best clauses of length L to the MLN until no clause of length L improves WPLL • Similar to Della Pietra et al. (1997), McCallum (2003)
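A sketch of the SFS loop (beam search, on the previous slide, differs mainly in repeatedly keeping only the single best clause). The expand and score callables are assumptions standing in for literal addition/deletion and the WPLL gain computation:

```python
def shortest_first_search(mln, expand, score, k, max_length):
    """Shortest-first search over clauses (sketch, not the authors' code).

    expand(mln, length): assumed to apply single literal additions and
        deletions to the current clauses, returning candidates of `length`.
    score(mln, clause): assumed to return the WPLL gain of adding `clause`.
    """
    for length in range(1, max_length + 1):
        while True:
            candidates = expand(mln, length)
            gains = sorted(((score(mln, c), c) for c in candidates),
                           key=lambda gc: gc[0], reverse=True)
            best = [c for g, c in gains[:k] if g > 0]
            if not best:          # no clause of this length improves WPLL
                break
            mln = mln + best      # add the K best clauses, then repeat
    return mln
```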
Speedup Techniques • FindBestClauses(MLN): Create candidate clauses (SLOW: many candidates) • FOR EACH candidate clause c, compute increase in WPLL (using L-BFGS) of adding c to MLN (SLOW: many CLLs; each CLL involves a #P-complete problem; L-BFGS itself is not that fast) • RETURN k clauses with greatest increase
Speedup Techniques • Clause Sampling • Predicate Sampling • Avoid Redundancy • Loose Convergence Thresholds • Ignore Unrelated Clauses • Weight Thresholding
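As one illustration of how these speedups cut the cost of each WPLL evaluation, here is a sketch of predicate sampling: estimate each predicate's contribution from a random subsample of its groundings. The interfaces (the cll callable and the groundings dict) are assumptions, not the authors' API:

```python
import random

def sampled_wpll(groundings_by_pred, cll, sample_size=500, seed=0):
    """Estimate the WPLL from a subsample of groundings per predicate.

    groundings_by_pred: dict mapping predicate name -> list of its groundings.
    cll(g): assumed to return the conditional log-likelihood of ground
        predicate g given the state of its Markov blanket.
    """
    rng = random.Random(seed)
    total = 0.0
    for pred, groundings in groundings_by_pred.items():
        if not groundings:
            continue
        sample = (groundings if len(groundings) <= sample_size
                  else rng.sample(groundings, sample_size))
        # With c_r = 1/g_r, a predicate's contribution to the WPLL is the
        # mean CLL of its groundings, so a sample mean estimates it.
        total += sum(cll(g) for g in sample) / len(sample)
    return total
```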
Overview • Motivation • Background • Structure Learning Algorithm • Experiments • Future Work & Conclusion
Experiments • UW-CSE domain • 22 predicates, e.g., AdvisedBy(X,Y), Student(X), etc. • 10 types, e.g., Person, Course, Quarter, etc. • # ground predicates ≈ 4 million • # true ground predicates ≈ 3,000 • Handcrafted KB with 94 formulas • Each student has at most one advisor • If a student is an author of a paper, so is her advisor • Cora domain • Computer science research papers • Collective deduplication of author, venue, title
Systems • MLN(SLB): structure learning with beam search • MLN(SLS): structure learning with SFS • MLN(KB), MLN(CL), MLN(FO), MLN(AL) • KB: hand-coded KB • CL: CLAUDIEN • FO: FOIL • AL: Aleph • NB: naïve Bayes • BN: Bayesian networks
Methodology • UW-CSE domain • DB divided into 5 areas: AI, Graphics, Languages, Systems, Theory • Leave-one-out testing by area • Measured • average CLL of the ground predicates • average area under the precision-recall curve of the ground predicates (AUC)
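A sketch of how the two reported metrics could be computed from predicted probabilities of the test ground predicates (using scikit-learn for the precision-recall curve; this is an illustration, not the evaluation code used in the paper):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def evaluate(y_true, p_pred, eps=1e-6):
    """Average CLL and area under the precision-recall curve (AUC).

    y_true: 0/1 truth values of the test ground predicates.
    p_pred: predicted probabilities that each ground predicate is true.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    avg_cll = float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    precision, recall, _ = precision_recall_curve(y, p)
    return avg_cll, auc(recall, precision)
```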