Discriminative Learning for Markov Logic Networks PhD Proposal October 9th, 2009 Tuyen N. Huynh Adviser: Prof. Raymond J. Mooney Some slides are taken from [Domingos, 2007], [Mooney, 2008]
Motivation • Most machine learning methods assume independent and identically distributed (i.i.d.) examples represented as feature vectors. • Most real-world data are not i.i.d. and cannot be effectively represented as feature vectors: • Biochemical data • Social network data • Multi-relational data • …
Biochemical data Predicting mutagenicity [Srinivasan et al., 1995]
Characteristics of such structured data • Contain multiple objects/entities and relationships among them • There is a lot of uncertainty in the data: • Uncertainty about the attributes of an object • Uncertainty about the type of an object • Uncertainty about relationships between objects
Statistical Relational Learning (SRL) • SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data. • Some proposed SRL models: • Stochastic Logic Programs (SLPs) [Muggleton, 1996] • Probabilistic Relational Models (PRMs) [Friedman et al., 1999] • Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001] • Relational Markov networks (RMNs) [Taskar et al., 2002] • Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]
Discriminative learning • Generative learning: learn a joint model over all variables • Discriminative learning: learn a conditional model of the output variables given the input variables • Directly learning a model for predicting the outputs generally gives better predictive performance on those outputs • Most problems on structured/relational data are discriminative: make predictions based on some evidence (observable data), so discriminative learning is more suitable
Discriminative Learning for Markov Logic Networks
Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses [Huynh & Mooney, 2008] • Max-margin weight learning for MLNs [Huynh & Mooney, 2009] • Future work • Conclusion
First-Order Logic • Constants: Anna, Bob • Variables: x, y • Function: fatherOf(x) • Predicate: Boolean-valued function, e.g. Smoke(x), Friends(x,y) • Literal: a predicate (atom) or its negation • Grounding: replace all variables by constants, e.g. Friends(Anna, Bob) • World (model, interpretation): assignment of truth values to all ground literals
First-Order Clauses • Clause: a disjunction of literals • Can be rewritten as a set of implication rules, e.g.: ¬Smoke(x) v Cancer(x) Smoke(x) => Cancer(x) ¬Cancer(x) => ¬Smoke(x)
Markov Networks [Pearl, 1988] • Undirected graphical models (example: a network over the variables Smoking, Cancer, Asthma, Cough) • Potential function: a function defined over a clique (a complete sub-graph) • Log-linear model: P(x) = (1/Z) exp( Σ_i w_i f_i(x) ), where w_i is the weight of feature i and f_i(x) is the value of feature i
Markov Logic Networks [Richardson & Domingos, 2006] • A set of weighted first-order clauses • A larger weight indicates a stronger belief that the clause should hold • The clauses are called the structure of the MLN • MLNs are templates for constructing Markov networks for a given set of constants MLN Example: Friends & Smokers
Example: Friends & Smokers • Two constants: Anna (A) and Bob (B) • Grounding produces the ground atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B) • (Figure: the ground Markov network over these atoms, built up edge by edge from the clause groundings)
Probability of a possible world • P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) ), where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in the possible world x • A possible world becomes exponentially less likely as the total weight of the ground clauses it violates increases
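To make the formula concrete, here is a minimal brute-force sketch, written for this document (not taken from the proposal or from Alchemy), that enumerates all 256 possible worlds of the two-constant Friends & Smokers example and computes P(X = x); the two clauses and their weights (1.5 and 1.1) are illustrative values commonly used with this example, not learned weights.

```python
from itertools import product
import math

# Brute-force sketch of P(X=x) = (1/Z) exp(sum_i w_i * n_i(x)) for the
# Friends & Smokers example with constants A and B.  Exact enumeration is
# feasible only because there are just 8 ground atoms here.
people = ["A", "B"]
weights = (1.5, 1.1)   # illustrative weights for the two clauses below

def n_counts(world):
    """Number of true groundings of each clause in a possible world."""
    smokes, cancer, friends = world
    # Clause 1: Smokes(x) => Cancer(x)
    n1 = sum(1 for x in people if (not smokes[x]) or cancer[x])
    # Clause 2: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    n2 = sum(1 for x in people for y in people
             if (not friends[(x, y)]) or (smokes[x] == smokes[y]))
    return (n1, n2)

def score(world):
    """Unnormalized probability exp(sum_i w_i * n_i(x)) of a world."""
    return math.exp(sum(w * n for w, n in zip(weights, n_counts(world))))

def all_worlds():
    pairs = [(x, y) for x in people for y in people]
    for s in product([False, True], repeat=2):
        for c in product([False, True], repeat=2):
            for f in product([False, True], repeat=4):
                yield (dict(zip(people, s)), dict(zip(people, c)), dict(zip(pairs, f)))

Z = sum(score(w) for w in all_worlds())          # partition function

# Probability of one particular possible world
world = ({"A": True, "B": False},                # Smokes
         {"A": True, "B": False},                # Cancer
         {(x, y): False for x in people for y in people})   # Friends
print(f"P(world) = {score(world) / Z:.4f}")
```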
Inference in MLNs • MAP/MPE inference: find the most likely state of all unknown ground literals given the evidence • MaxWalkSAT algorithm [Kautz et al., 1997] (a sketch follows below) • Cutting Plane Inference algorithm [Riedel, 2008] • Computing the marginal conditional probability of a set of ground literals: P(Y=y|x) • MC-SAT algorithm [Poon & Domingos, 2006] • Lifted first-order belief propagation [Singla & Domingos, 2008]
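The following is a compact sketch of the MaxWalkSAT idea, with weighted ground clauses encoded as lists of signed integers (DIMACS-style); it is a simplified illustration of the algorithm cited above, not the Alchemy implementation, and the settings max_tries, max_flips and the noise probability p are arbitrary.

```python
import random

def cost(clauses, assign):
    """Total weight of clauses unsatisfied by the assignment."""
    return sum(w for w, lits in clauses
               if not any(assign[abs(l)] == (l > 0) for l in lits))

def maxwalksat(clauses, n_vars, max_tries=10, max_flips=1000, p=0.5):
    """MaxWalkSAT sketch: clauses are (weight, [signed variable ids]) pairs.
    Returns the lowest-cost truth assignment found."""
    best, best_cost = None, float("inf")
    for _ in range(max_tries):
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            c = cost(clauses, assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
            if c == 0:
                return best
            # pick a random unsatisfied clause
            unsat = [lits for w, lits in clauses
                     if not any(assign[abs(l)] == (l > 0) for l in lits)]
            lits = random.choice(unsat)
            if random.random() < p:
                v = abs(random.choice(lits))          # random-walk move
            else:                                     # greedy move: best flip in the clause
                v = min((abs(l) for l in lits),
                        key=lambda v: cost(clauses, {**assign, v: not assign[v]}))
            assign[v] = not assign[v]
    return best

# Toy usage: four ground clauses over variables 1..4, e.g. [-1, 2] means (not x1 or x2)
clauses = [(1.5, [-1, 2]), (1.5, [-3, 4]), (1.1, [-1, 3]), (1.1, [-3, 1])]
print(maxwalksat(clauses, n_vars=4))
```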
Existing structure learning methods for MLNs • Top-down approach: MSL [Kok & Domingos, 2005], [Biba et al., 2008] • Start from unit clauses and search for new clauses • Bottom-up approach: BUSL [Mihalkova & Mooney, 2007], LHL [Kok & Domingos, 2009] • Use data to generate candidate clauses
Existing weight learning methods for MLNs • Generative: maximize the (pseudo) log-likelihood [Richardson & Domingos, 2006] • Discriminative: maximize the conditional log-likelihood (CLL) • [Singla & Domingos, 2005]: structured perceptron [Collins, 2002] • [Lowd & Domingos, 2007]: first- and second-order methods to optimize the CLL; found that Preconditioned Scaled Conjugate Gradient (PSCG) performs best
Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses • Max-margin weight learning for MLNs • Future work • Conclusion
Drug design for Alzheimer’s disease • Comparing different analogues of the Tacrine drug for Alzheimer’s disease on four biochemical properties: • Maximization of inhibition of amine re-uptake • Minimization of toxicity • Maximization of acetyl cholinesterase inhibition • Maximization of the reversal of scopolamine-induced memory impairment (Figure: the Tacrine drug and the template for the proposed drugs)
Inductive Logic Programming • Use first-order logic to represent background knowledge and examples • Automated learning of logic rules from examples and background knowledge
Inductive Logic Programming systems • GOLEM [Muggleton and Feng, 1992] • FOIL [Quinlan, 1993] • PROGOL [Muggleton, 1995] • CHILLIN [Zelle and Mooney, 1996] • ALEPH [Srinivasan, 2001]
Results with existing learning methods for MLNs (Chart: average predictive accuracy) • What happened: the existing learning methods for MLNs fail to capture the relations between the background predicates and the target predicate. This motivates new discriminative learning methods for MLNs. *MLN1: MSL + PSCG **MLN2: BUSL + PSCG
Proposed approach • Step 1: Discriminative structure learning (a clause learner generates candidate clauses) • Step 2: Discriminative weight learning (selects the good clauses by learning their weights)
Discriminative structure learning • Use a variant of ALEPH, called ALEPH++, to produce a larger set of candidate clauses: • Score the clauses by the m-estimate [Dzeroski, 1991], a Bayesian estimate of the accuracy of a clause (a sketch of this score follows below) • Keep all the clauses with an m-estimate greater than a pre-defined threshold (0.6), instead of keeping only the final theory produced by ALEPH
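A minimal sketch of the m-estimate score as it is usually defined; the parameter value m = 2 is a placeholder of my own, since the slides only specify the 0.6 acceptance threshold.

```python
def m_estimate(pos_covered, neg_covered, prior_pos, m=2.0):
    """m-estimate of a clause's accuracy: (p + m * prior) / (p + n + m),
    where p/n are the positive/negative examples covered by the clause
    and prior_pos is the prior probability of the positive class."""
    return (pos_covered + m * prior_pos) / (pos_covered + neg_covered + m)

# A candidate clause is kept if its score exceeds the threshold:
keep = m_estimate(pos_covered=18, neg_covered=5, prior_pos=0.5) > 0.6
```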
Facts: r_subst_1(A1,H) r_subst_1(B1,H) r_subst_1(D1,H) x_subst(B1,7,CL) x_subst(HH1,6,CL) x_subst(D1,6,OCH3) polar(CL,POLAR3) polar(OCH3,POLAR2) great_polar(POLAR3,POLAR2) size(CL,SIZE1) size(OCH3,SIZE2) great_size(SIZE2,SIZE1) alk_groups(A1,0) alk_groups(B1,0) alk_groups(D1,0) alk_groups(HH1,1) flex(CL,FLEX0) flex(OCH3,FLEX1) less_toxic(A1,D1) less_toxic(B1,D1) less_toxic(HH1,A1) → ALEPH++ → Candidate clauses: x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2) alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2) x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2) … They are all non-recursive clauses
Discriminative weight learning • Maximize the CLL with L1-regularization • Use exact inference instead of approximate inference • Use L1-regularization instead of L2-regularization
Exact inference • Since the candidate clauses are non-recursive, the query predicate appears only once in each clause, so the probability of a query atom being true or false depends only on the evidence (a sketch follows below)
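Because every candidate clause has the form Body => less_toxic(d1,d2) and contains the query predicate exactly once, the conditional probability of a query atom reduces to a logistic function of the summed weights of the ground clauses whose bodies are satisfied by the evidence. The sketch below illustrates this; the clause bodies and weights are hypothetical, loosely mirroring the example clauses above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weighted clauses of the form Body => less_toxic(d1,d2).
# Each body is a function over the evidence; the weights are made up.
weighted_clauses = [
    (0.34, lambda ev, d1, d2: ("alk_groups", d1, 0) in ev and ("r_subst_1", d2, "H") in ev),
    (2.70, lambda ev, d1, d2: ("x_subst", d1, 6, "OCH3") in ev and ("alk_groups", d1, 1) in ev),
]

def prob_less_toxic(evidence, d1, d2):
    """Exact P(less_toxic(d1,d2) = true | evidence) for non-recursive
    Body => Query clauses: a ground clause changes the query atom's
    probability only when its body is satisfied by the evidence."""
    z = sum(w for w, body in weighted_clauses if body(evidence, d1, d2))
    return sigmoid(z)

evidence = {("alk_groups", "A1", 0), ("r_subst_1", "D1", "H")}
print(prob_less_toxic(evidence, "A1", "D1"))   # sigmoid(0.34) ~ 0.58
```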
L1-regularization • Put a zero-mean Laplacian prior on each weight wi • L1-regularization ignores irrelevant features by setting their weights to zero [Ng, 2004] • A larger value of the regularization parameter b corresponds to a smaller variance of the prior distribution
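For reference, a standard parameterization of the zero-mean Laplacian prior (the exact parameterization in the proposal may differ, but the qualitative relation between b and the variance is the one stated above):

```latex
p(w_i) = \frac{b}{2}\, e^{-b\,|w_i|},
\qquad
\operatorname{Var}(w_i) = \frac{2}{b^{2}}
```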
CLL with L1-regularization • This is a convex but non-smooth optimization problem • Use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) software [Andrew & Gao, 2007] to solve the optimization problem
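A common way to write the objective this slide refers to, assuming b is the regularization parameter from the previous slide (a sketch of the standard form, not necessarily the exact notation used in the proposal):

```latex
\max_{\mathbf{w}} \;\; \log P(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) \;-\; b\,\lVert \mathbf{w} \rVert_{1}
```

The CLL term is concave in the weights and the L1 penalty is piecewise linear, which is why the problem is convex but non-smooth and calls for a solver such as OWL-QN.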
Facts: r_subst_1(A1,H) r_subst_1(B1,H) r_subst_1(D1,H) x_subst(B1,7,CL) x_subst(HH1,6,CL) x_subst(D1,6,OCH3) … Candidate clauses: alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2) x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2) x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2) … → L1 weight learner → Weighted clauses: 0.34487 alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2) 2.70323 x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2) … 0 x_subst(v8719,6,v8774) ^ alk_groups(v8719,1) => less_toxic(v8719,v8720)
Methodology • 10-fold cross-validation • Metric: • Average predictive accuracy over 10 folds
Q1: Does the proposed approach perform better than existing learning methods for MLNs and traditional ILP methods? (Chart: average predictive accuracy)
Q2: The effect of L1-regularization (Chart: number of clauses in the learned MLNs)
Q2: The effect of L1-regularization (cont.) (Chart: average predictive accuracy)
Q3: The benefit of collective inference • Add a transitive clause with infinite weight to the learned MLNs: less_toxic(a,b) ^ less_toxic(b,c) => less_toxic(a,c) (Chart: average predictive accuracy)
Q4: The performance of our approach against other “advanced ILP” methods (Chart: average predictive accuracy)
Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses • Max-margin weight learning for MLNs • Future work • Conclusion
Motivation • All of the existing training methods for MLNs learn a model that produces good predictive probabilities • In many applications, the actual goal is to optimize an application-specific performance measure such as F1 score (the harmonic mean of precision and recall) • Max-margin training methods, especially Structural Support Vector Machines (SVMs), provide a framework for optimizing such application-specific measures. This motivates training MLNs under the max-margin framework
Generic Structural SVMs [Tsochantaridis et al., 2004] • Learn a discriminant function f: X x Y → R • Predict, for a given input x, the output y that maximizes the discriminant function • Maximize the separation margin between the correct output and the closest incorrect one • Can be formulated as a quadratic optimization problem (a standard formulation is sketched below)
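For reference, the standard margin-rescaling formulation from [Tsochantaridis et al., 2004]; this is a sketch of the generic setup, and the proposal's instantiation for MLNs may use a different loss or notation.

```latex
f(\mathbf{x}, \mathbf{y}) = \mathbf{w}^{\top} \Psi(\mathbf{x}, \mathbf{y}),
\qquad
\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})
```

```latex
\min_{\mathbf{w}, \boldsymbol{\xi} \ge 0} \;\;
\frac{1}{2}\lVert\mathbf{w}\rVert^{2} + \frac{C}{n}\sum_{i=1}^{n}\xi_{i}
\quad \text{s.t.} \quad
\forall i,\; \forall \mathbf{y} \neq \mathbf{y}_i:\;
\mathbf{w}^{\top}\big[\Psi(\mathbf{x}_i,\mathbf{y}_i)-\Psi(\mathbf{x}_i,\mathbf{y})\big]
\;\ge\; \Delta(\mathbf{y}_i,\mathbf{y}) - \xi_i
```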
Generic Structural SVMs (cont.) • [Joachims et al., 2009] proposed the 1-slack formulation of the Structural SVM, which makes the original cutting-plane algorithm [Tsochantaridis et al., 2004] faster and more scalable (a standard form of the 1-slack problem is sketched below)
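The 1-slack (margin-rescaling) formulation from [Joachims et al., 2009] replaces the n slack variables above with a single shared one; again this is a sketch of the standard form rather than the proposal's exact notation.

```latex
\min_{\mathbf{w}, \xi \ge 0} \;\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\,\xi
\quad \text{s.t.} \quad
\forall (\bar{\mathbf{y}}_1,\dots,\bar{\mathbf{y}}_n) \in \mathcal{Y}^{n}:\;
\frac{1}{n}\,\mathbf{w}^{\top}\!\sum_{i=1}^{n}\big[\Psi(\mathbf{x}_i,\mathbf{y}_i)-\Psi(\mathbf{x}_i,\bar{\mathbf{y}}_i)\big]
\;\ge\; \frac{1}{n}\sum_{i=1}^{n}\Delta(\mathbf{y}_i,\bar{\mathbf{y}}_i) - \xi
```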
Cutting plane algorithm for solving structural SVMs • The structural SVM problem has exponentially many constraints, but most are dominated by a small set of “important” constraints • The cutting plane algorithm repeatedly finds the next most violated constraint … until no new violated constraint can be found (a schematic version follows below) *Slide credit: Yisong Yue
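A schematic version of the cutting-plane loop (a pseudocode-level sketch, not the SVMstruct implementation); the separation oracle and the QP solver are passed in as callables, since their details depend on the structured output space.

```python
def cutting_plane(examples, find_most_violated, solve_qp,
                  epsilon=1e-3, max_iters=100):
    """Generic cutting-plane loop for a structural SVM.

    find_most_violated(w, examples) -> (constraint, violation): the
    separation oracle returning the currently most violated constraint.
    solve_qp(working_set) -> (w, xi): re-optimizes over the constraints
    collected so far.  Both are supplied by the caller."""
    working_set = []
    w, xi = [0.0], 0.0                      # trivial start; solve_qp overwrites
    for _ in range(max_iters):
        constraint, violation = find_most_violated(w, examples)
        if violation <= xi + epsilon:       # no constraint violated by more than epsilon
            break
        working_set.append(constraint)      # add the new cutting plane
        w, xi = solve_qp(working_set)
    return w
```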