Discriminative Learning for Markov Logic Networks PhD Proposal October 9th, 2009 Tuyen N. Huynh Adviser: Prof. Raymond J. Mooney Some slides are taken from [Domingos, 2007], [Mooney, 2008]
Motivation • Most machine learning methods assume independent and identically distributed (i.i.d.) examples represented as feature vectors. • Most real-world data are not i.i.d. and cannot be effectively represented as feature vectors: • Biochemical data • Social network data • Multi-relational data • …
Biochemical data Predicting mutagenicity [Srinivasan et al., 1995]
Characteristics of such structured data • Contain multiple objects/entities and relationships among them • There is a lot of uncertainty in the data: • Uncertainty about the attributes of an object • Uncertainty about the type of an object • Uncertainty about relationships between objects
Statistical Relational Learning (SRL) • SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data. • Some proposed SRL models: • Stochastic Logic Programs (SLPs) [Muggleton, 1996] • Probabilistic Relational Models (PRMs) [Friedman et al., 1999] • Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001] • Relational Markov networks (RMNs) [Taskar et al., 2002] • Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]
Discriminative learning • Generative learning: learn a joint model over all variables • Discriminative learning: learn a conditional model of the output variables given the input variables • Directly learning a model for predicting the outputs generally gives better predictive performance on those outputs • Most problems on structured/relational data are discriminative: make predictions based on some evidence (observable data), so discriminative learning is more suitable
Discriminative Learning for Markov Logic Networks
Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses [Huynh & Mooney, 2008] • Max-margin weight learning for MLNs [Huynh & Mooney, 2009] • Future work • Conclusion
First-Order Logic • Constants: Anna, Bob • Variables: x, y • Function: fatherOf(x) • Predicate: Boolean-valued function, e.g. Smoke(x), Friends(x,y) • Literal: a predicate (atom) or its negation • Grounding: replace all variables by constants, e.g. Friends(Anna, Bob) • World (model, interpretation): assignment of truth values to all ground literals
First-Order Clauses • Clause: a disjunction of literals • Can be rewritten as a set of implication rules, e.g.: ¬Smoke(x) v Cancer(x) Smoke(x) => Cancer(x) ¬Cancer(x) => ¬Smoke(x)
Markov Networks [Pearl, 1988] • Undirected graphical models (example: a network over the variables Smoking, Cancer, Asthma, Cough) • Potential function: a function defined over a clique (a complete sub-graph) • Log-linear model: P(x) = (1/Z) exp( Σ_i w_i f_i(x) ), where w_i is the weight of feature i and f_i(x) is the value of feature i
Markov Logic Networks [Richardson & Domingos, 2006] • A set of weighted first-order clauses • A larger weight indicates a stronger belief that the clause should hold • The clauses are called the structure of the MLN • MLNs are templates for constructing Markov networks for a given set of constants MLN Example: Friends & Smokers
Example: Friends & Smokers • Two constants: Anna (A) and Bob (B) • Grounding produces the ground atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B) • (Figure: the ground Markov network over these atoms, built up edge by edge from the clause groundings)
Probability of a possible world • P(X = x) = (1/Z) exp( Σ_i w_i n_i(x) ), where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in the possible world x • A possible world becomes exponentially less likely as the total weight of the ground clauses it violates increases
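To make the formula concrete, here is a minimal brute-force sketch, written for this document (not taken from the proposal or from Alchemy), that enumerates all 256 possible worlds of the two-constant Friends & Smokers example and computes P(X = x); the two clauses and their weights (1.5 and 1.1) are illustrative values commonly used with this example, not learned weights.

```python
from itertools import product
import math

# Brute-force sketch of P(X=x) = (1/Z) exp(sum_i w_i * n_i(x)) for the
# Friends & Smokers example with constants A and B.  Exact enumeration is
# feasible only because there are just 8 ground atoms here.
people = ["A", "B"]
weights = (1.5, 1.1)   # illustrative weights for the two clauses below

def n_counts(world):
    """Number of true groundings of each clause in a possible world."""
    smokes, cancer, friends = world
    # Clause 1: Smokes(x) => Cancer(x)
    n1 = sum(1 for x in people if (not smokes[x]) or cancer[x])
    # Clause 2: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    n2 = sum(1 for x in people for y in people
             if (not friends[(x, y)]) or (smokes[x] == smokes[y]))
    return (n1, n2)

def score(world):
    """Unnormalized probability exp(sum_i w_i * n_i(x)) of a world."""
    return math.exp(sum(w * n for w, n in zip(weights, n_counts(world))))

def all_worlds():
    pairs = [(x, y) for x in people for y in people]
    for s in product([False, True], repeat=2):
        for c in product([False, True], repeat=2):
            for f in product([False, True], repeat=4):
                yield (dict(zip(people, s)), dict(zip(people, c)), dict(zip(pairs, f)))

Z = sum(score(w) for w in all_worlds())          # partition function

# Probability of one particular possible world
world = ({"A": True, "B": False},                # Smokes
         {"A": True, "B": False},                # Cancer
         {(x, y): False for x in people for y in people})   # Friends
print(f"P(world) = {score(world) / Z:.4f}")
```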
Inference in MLNs • MAP/MPE inference: find the most likely state of all unknown ground literals given the evidence • MaxWalkSAT algorithm [Kautz et al., 1997] (a sketch follows below) • Cutting Plane Inference algorithm [Riedel, 2008] • Computing the marginal conditional probability of a set of ground literals: P(Y=y|x) • MC-SAT algorithm [Poon & Domingos, 2006] • Lifted first-order belief propagation [Singla & Domingos, 2008]
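The following is a compact sketch of the MaxWalkSAT idea, with weighted ground clauses encoded as lists of signed integers (DIMACS-style); it is a simplified illustration of the algorithm cited above, not the Alchemy implementation, and the settings max_tries, max_flips and the noise probability p are arbitrary.

```python
import random

def cost(clauses, assign):
    """Total weight of clauses unsatisfied by the assignment."""
    return sum(w for w, lits in clauses
               if not any(assign[abs(l)] == (l > 0) for l in lits))

def maxwalksat(clauses, n_vars, max_tries=10, max_flips=1000, p=0.5):
    """MaxWalkSAT sketch: clauses are (weight, [signed variable ids]) pairs.
    Returns the lowest-cost truth assignment found."""
    best, best_cost = None, float("inf")
    for _ in range(max_tries):
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            c = cost(clauses, assign)
            if c < best_cost:
                best, best_cost = dict(assign), c
            if c == 0:
                return best
            # pick a random unsatisfied clause
            unsat = [lits for w, lits in clauses
                     if not any(assign[abs(l)] == (l > 0) for l in lits)]
            lits = random.choice(unsat)
            if random.random() < p:
                v = abs(random.choice(lits))          # random-walk move
            else:                                     # greedy move: best flip in the clause
                v = min((abs(l) for l in lits),
                        key=lambda v: cost(clauses, {**assign, v: not assign[v]}))
            assign[v] = not assign[v]
    return best

# Toy usage: four ground clauses over variables 1..4, e.g. [-1, 2] means (not x1 or x2)
clauses = [(1.5, [-1, 2]), (1.5, [-3, 4]), (1.1, [-1, 3]), (1.1, [-3, 1])]
print(maxwalksat(clauses, n_vars=4))
```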
Existing structure learning methods for MLNs • Top-down approach: MSL [Kok & Domingos, 2005], [Biba et al., 2008] • Start from unit clauses and search for new clauses • Bottom-up approach: BUSL [Mihalkova & Mooney, 2007], LHL [Kok & Domingos, 2009] • Use data to generate candidate clauses
Existing weight learning methods for MLNs • Generative: maximize the (pseudo) log-likelihood [Richardson & Domingos, 2006] • Discriminative: maximize the conditional log-likelihood (CLL) • [Singla & Domingos, 2005]: structured perceptron [Collins, 2002] • [Lowd & Domingos, 2007]: first- and second-order methods to optimize the CLL; found that Preconditioned Scaled Conjugate Gradient (PSCG) performs best
Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses • Max-margin weight learning for MLNs • Future work • Conclusion
Drug design for Alzheimer’s disease • Comparing different analogues of the Tacrine drug for Alzheimer’s disease on four biochemical properties: • Maximization of inhibition of amine re-uptake • Minimization of toxicity • Maximization of acetyl cholinesterase inhibition • Maximization of the reversal of scopolamine-induced memory impairment (Figure: the Tacrine drug and the template for the proposed drugs)
Inductive Logic Programming • Use first-order logic to represent background knowledge and examples • Automated learning of logic rules from examples and background knowledge
Inductive Logic Programming systems • GOLEM [Muggleton and Feng, 1992] • FOIL [Quinlan, 1993] • PROGOL [Muggleton, 1995] • CHILLIN [Zelle and Mooney, 1996] • ALEPH [Srinivasan, 2001]
Results with existing learning methods for MLNs (Chart: average predictive accuracy) • What happened: the existing learning methods for MLNs fail to capture the relations between the background predicates and the target predicate. This motivates new discriminative learning methods for MLNs. *MLN1: MSL + PSCG **MLN2: BUSL + PSCG
Proposed approach • Step 1: Discriminative structure learning (a clause learner generates candidate clauses) • Step 2: Discriminative weight learning (selects the good clauses by learning their weights)
Discriminative structure learning • Use a variant of ALEPH, called ALEPH++, to produce a larger set of candidate clauses: • Score the clauses by the m-estimate [Dzeroski, 1991], a Bayesian estimate of the accuracy of a clause (a sketch of this score follows below) • Keep all the clauses with an m-estimate greater than a pre-defined threshold (0.6), instead of keeping only the final theory produced by ALEPH
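A minimal sketch of the m-estimate score as it is usually defined; the parameter value m = 2 is a placeholder of my own, since the slides only specify the 0.6 acceptance threshold.

```python
def m_estimate(pos_covered, neg_covered, prior_pos, m=2.0):
    """m-estimate of a clause's accuracy: (p + m * prior) / (p + n + m),
    where p/n are the positive/negative examples covered by the clause
    and prior_pos is the prior probability of the positive class."""
    return (pos_covered + m * prior_pos) / (pos_covered + neg_covered + m)

# A candidate clause is kept if its score exceeds the threshold:
keep = m_estimate(pos_covered=18, neg_covered=5, prior_pos=0.5) > 0.6
```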
Facts: r_subst_1(A1,H) r_subst_1(B1,H) r_subst_1(D1,H) x_subst(B1,7,CL) x_subst(HH1,6,CL) x_subst(D1,6,OCH3) polar(CL,POLAR3) polar(OCH3,POLAR2) great_polar(POLAR3,POLAR2) size(CL,SIZE1) size(OCH3,SIZE2) great_size(SIZE2,SIZE1) alk_groups(A1,0) alk_groups(B1,0) alk_groups(D1,0) alk_groups(HH1,1) flex(CL,FLEX0) flex(OCH3,FLEX1) less_toxic(A1,D1) less_toxic(B1,D1) less_toxic(HH1,A1) → ALEPH++ → Candidate clauses: x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2) alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2) x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2) … They are all non-recursive clauses
Discriminative weight learning • Maximize the CLL with L1-regularization • Use exact inference instead of approximate inference • Use L1-regularization instead of L2-regularization
Exact inference • Since the candidate clauses are non-recursive, the query predicate appears only once in each clause, so the probability of a query atom being true or false depends only on the evidence (a sketch follows below)
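Because every candidate clause has the form Body => less_toxic(d1,d2) and contains the query predicate exactly once, the conditional probability of a query atom reduces to a logistic function of the summed weights of the ground clauses whose bodies are satisfied by the evidence. The sketch below illustrates this; the clause bodies and weights are hypothetical, loosely mirroring the example clauses above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weighted clauses of the form Body => less_toxic(d1,d2).
# Each body is a function over the evidence; the weights are made up.
weighted_clauses = [
    (0.34, lambda ev, d1, d2: ("alk_groups", d1, 0) in ev and ("r_subst_1", d2, "H") in ev),
    (2.70, lambda ev, d1, d2: ("x_subst", d1, 6, "OCH3") in ev and ("alk_groups", d1, 1) in ev),
]

def prob_less_toxic(evidence, d1, d2):
    """Exact P(less_toxic(d1,d2) = true | evidence) for non-recursive
    Body => Query clauses: a ground clause changes the query atom's
    probability only when its body is satisfied by the evidence."""
    z = sum(w for w, body in weighted_clauses if body(evidence, d1, d2))
    return sigmoid(z)

evidence = {("alk_groups", "A1", 0), ("r_subst_1", "D1", "H")}
print(prob_less_toxic(evidence, "A1", "D1"))   # sigmoid(0.34) ~ 0.58
```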
L1-regularization • Put a zero-mean Laplacian prior on each weight wi • L1-regularization ignores irrelevant features by setting their weights to zero [Ng, 2004] • A larger value of the regularization parameter b corresponds to a smaller variance of the prior distribution
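For reference, a standard parameterization of the zero-mean Laplacian prior (the exact parameterization in the proposal may differ, but the qualitative relation between b and the variance is the one stated above):

```latex
p(w_i) = \frac{b}{2}\, e^{-b\,|w_i|},
\qquad
\operatorname{Var}(w_i) = \frac{2}{b^{2}}
```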
CLL with L1-regularization • This is a convex but non-smooth optimization problem • Use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) software [Andrew & Gao, 2007] to solve the optimization problem
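A common way to write the objective this slide refers to, assuming b is the regularization parameter from the previous slide (a sketch of the standard form, not necessarily the exact notation used in the proposal):

```latex
\max_{\mathbf{w}} \;\; \log P(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) \;-\; b\,\lVert \mathbf{w} \rVert_{1}
```

The CLL term is concave in the weights and the L1 penalty is piecewise linear, which is why the problem is convex but non-smooth and calls for a solver such as OWL-QN.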
Facts: r_subst_1(A1,H) r_subst_1(B1,H) r_subst_1(D1,H) x_subst(B1,7,CL) x_subst(HH1,6,CL) x_subst(D1,6,OCH3) … Candidate clauses: alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2) x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2) x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2) … → L1 weight learner → Weighted clauses: 0.34487 alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2) 2.70323 x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2) … 0 x_subst(v8719,6,v8774) ^ alk_groups(v8719,1) => less_toxic(v8719,v8720)
Methodology • 10-fold cross-validation • Metric: • Average predictive accuracy over 10 folds
Q1: Does the proposed approach perform better than existing learning methods for MLNs and traditional ILP methods? (Chart: average predictive accuracy)
Q2: The effect of L1-regularization (Chart: number of clauses in the learned MLNs)
Q2: The effect of L1-regularization (cont.) (Chart: average predictive accuracy)
Q3: The benefit of collective inference • Add a transitive clause with infinite weight to the learned MLNs: less_toxic(a,b) ^ less_toxic(b,c) => less_toxic(a,c) (Chart: average predictive accuracy)
Q4: The performance of our approach against other “advanced ILP” methods (Chart: average predictive accuracy)
Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses • Max-margin weight learning for MLNs • Future work • Conclusion
Motivation • All of the existing training methods for MLNs learn a model that produces good predictive probabilities • In many applications, the actual goal is to optimize an application-specific performance measure such as F1 score (the harmonic mean of precision and recall) • Max-margin training methods, especially Structural Support Vector Machines (SVMs), provide a framework for optimizing such application-specific measures. This motivates training MLNs under the max-margin framework
Generic Structural SVMs [Tsochantaridis et al., 2004] • Learn a discriminant function f: X x Y → R • Predict, for a given input x, the output y that maximizes the discriminant function • Maximize the separation margin between the correct output and the closest incorrect one • Can be formulated as a quadratic optimization problem (a standard formulation is sketched below)
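For reference, the standard margin-rescaling formulation from [Tsochantaridis et al., 2004]; this is a sketch of the generic setup, and the proposal's instantiation for MLNs may use a different loss or notation.

```latex
f(\mathbf{x}, \mathbf{y}) = \mathbf{w}^{\top} \Psi(\mathbf{x}, \mathbf{y}),
\qquad
\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})
```

```latex
\min_{\mathbf{w}, \boldsymbol{\xi} \ge 0} \;\;
\frac{1}{2}\lVert\mathbf{w}\rVert^{2} + \frac{C}{n}\sum_{i=1}^{n}\xi_{i}
\quad \text{s.t.} \quad
\forall i,\; \forall \mathbf{y} \neq \mathbf{y}_i:\;
\mathbf{w}^{\top}\big[\Psi(\mathbf{x}_i,\mathbf{y}_i)-\Psi(\mathbf{x}_i,\mathbf{y})\big]
\;\ge\; \Delta(\mathbf{y}_i,\mathbf{y}) - \xi_i
```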
Generic Structural SVMs (cont.) • [Joachims et al., 2009] proposed the 1-slack formulation of the Structural SVM, which makes the original cutting-plane algorithm [Tsochantaridis et al., 2004] faster and more scalable (a standard form of the 1-slack problem is sketched below)
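The 1-slack (margin-rescaling) formulation from [Joachims et al., 2009] replaces the n slack variables above with a single shared one; again this is a sketch of the standard form rather than the proposal's exact notation.

```latex
\min_{\mathbf{w}, \xi \ge 0} \;\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\,\xi
\quad \text{s.t.} \quad
\forall (\bar{\mathbf{y}}_1,\dots,\bar{\mathbf{y}}_n) \in \mathcal{Y}^{n}:\;
\frac{1}{n}\,\mathbf{w}^{\top}\!\sum_{i=1}^{n}\big[\Psi(\mathbf{x}_i,\mathbf{y}_i)-\Psi(\mathbf{x}_i,\bar{\mathbf{y}}_i)\big]
\;\ge\; \frac{1}{n}\sum_{i=1}^{n}\Delta(\mathbf{y}_i,\bar{\mathbf{y}}_i) - \xi
```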
Cutting plane algorithm for solving structural SVMs • The structural SVM problem has exponentially many constraints, but most are dominated by a small set of “important” constraints • The cutting plane algorithm repeatedly finds the next most violated constraint … until no new violated constraint can be found (a schematic version follows below) *Slide credit: Yisong Yue
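A schematic version of the cutting-plane loop (a pseudocode-level sketch, not the SVMstruct implementation); the separation oracle and the QP solver are passed in as callables, since their details depend on the structured output space.

```python
def cutting_plane(examples, find_most_violated, solve_qp,
                  epsilon=1e-3, max_iters=100):
    """Generic cutting-plane loop for a structural SVM.

    find_most_violated(w, examples) -> (constraint, violation): the
    separation oracle returning the currently most violated constraint.
    solve_qp(working_set) -> (w, xi): re-optimizes over the constraints
    collected so far.  Both are supplied by the caller."""
    working_set = []
    w, xi = [0.0], 0.0                      # trivial start; solve_qp overwrites
    for _ in range(max_iters):
        constraint, violation = find_most_violated(w, examples)
        if violation <= xi + epsilon:       # no constraint violated by more than epsilon
            break
        working_set.append(constraint)      # add the new cutting plane
        w, xi = solve_qp(working_set)
    return w
```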