
Discriminative Learning for Markov Logic Networks


Presentation Transcript


  1. Discriminative Learning for Markov Logic Networks PhD Proposal October 9th, 2009 Tuyen N. Huynh Adviser: Prof. Raymond J. Mooney Some slides are taken from [Domingos, 2007], [Mooney, 2008]

  2. Motivation • Most machine learning methods assume independent and identically distributed (i.i.d.) examples represented as feature vectors. • Most real-world data are not i.i.d. and cannot be effectively represented as feature vectors, e.g.: • Biochemical data • Social network data • Multi-relational data • …

  3. Biochemical data Predicting mutagenicity [Srinivasan et al., 1995]

  4. Web-KB dataset [Slattery & Craven, 1998]

  5. Characteristics of these structured data • Contain multiple objects/entities and relationships among them • There is a lot of uncertainty in the data: • Uncertainty about the attributes of an object • Uncertainty about the type of an object • Uncertainty about relationships between objects

  6. Statistical Relational Learning (SRL) • SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data. • Some proposed SRL models: • Stochastic Logic Programs (SLPs) [Muggleton, 1996] • Probabilistic Relational Models (PRMs) [Friedman et al., 1999] • Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001] • Relational Markov networks (RMNs) [Taskar et al., 2002] • Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]

  7. Statistical Relational Learning (SRL) • SRL attempts to integrate methods from first-order logic and probabilistic graphical models to handle such noisy structured/relational data. • Some proposed SRL models: • Stochastic Logic Programs (SLPs) [Muggleton, 1996] • Probabilistic Relational Models (PRMs) [Friedman et al., 1999] • Bayesian Logic Programs (BLPs) [Kersting & De Raedt, 2001] • Relational Markov networks (RMNs) [Taskar et al., 2002] • Markov Logic Networks (MLNs) [Richardson & Domingos, 2006]

  8. Discriminative learning • Generative learning: learn a joint model over all variables • Discriminative learning: learn a conditional model of the output variables given the input variables • Directly learns a model for predicting the outputs ⇒ generally has better predictive performance on the outputs • Most problems in structured/relational data are discriminative: make predictions based on some evidence (observable data) ⇒ discriminative learning is more suitable

  9. Discriminative Learning for Markov Logic Networks

  10. Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses [Huynh & Mooney, 2008] • Max-margin weight learning for MLNs [Huynh & Mooney, 2009] • Future work • Conclusion

  11. First-Order Logic • Constants: Anna, Bob • Variables: x, y • Function: fatherOf(x) • Predicate: a Boolean-valued function, e.g., Smoke(x), Friends(x,y) • Literal: a predicate or its negation • Grounding: replace all variables by constants, e.g., Friends(Anna, Bob) • World (model, interpretation): an assignment of truth values to all ground literals

  12. First-Order Clauses • Clause: a disjunction of literals • Can be rewritten as a set of implication rules: ¬Smoke(x) v Cancer(x) is equivalent to Smoke(x) => Cancer(x) and to ¬Cancer(x) => ¬Smoke(x)

  13. Markov Networks [Pearl, 1988] • Undirected graphical models [Figure: example network over Smoking, Cancer, Asthma, Cough] • Potential function: a function defined over a clique (a complete sub-graph)

  14. Markov Networks [Pearl, 1988] • Undirected graphical models [Figure: example network over Smoking, Cancer, Asthma, Cough] • Log-linear model: P(x) = (1/Z) exp(Σᵢ wᵢ fᵢ(x)), where wᵢ is the weight of feature i and fᵢ(x) is feature i

  15. Markov Logic Networks [Richardson & Domingos, 2006] • A set of weighted first-order clauses. • A larger weight indicates a stronger belief that the clause should hold. • The clauses are called the structure of the MLN. • MLNs are templates for constructing Markov networks for a given set of constants. Example: Friends & Smokers
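As an illustration, the Friends & Smokers MLN can be written as two weighted clauses in the notation used above (the weights shown are only illustrative placeholders, not values from the slides):

    1.5  Smokes(x) => Cancer(x)
    1.1  Friends(x,y) => (Smokes(x) <=> Smokes(y))

A higher weight on the first clause would express a stronger belief that smokers get cancer than that friends share smoking habits.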

  16. Example: Friends & Smokers Two constants: Anna (A) and Bob (B)

  17. Example: Friends & Smokers Two constants: Anna (A) and Bob (B) Friends(A,B) Friends(A,A) Smokes(A) Smokes(B) Friends(B,B) Cancer(A) Cancer(B) Friends(B,A)

  18. Example: Friends & Smokers Two constants: Anna (A) and Bob (B) Friends(A,B) Friends(A,A) Smokes(A) Smokes(B) Friends(B,B) Cancer(A) Cancer(B) Friends(B,A)

  19. Example: Friends & Smokers Two constants: Anna (A) and Bob (B) Friends(A,B) Friends(A,A) Smokes(A) Smokes(B) Friends(B,B) Cancer(A) Cancer(B) Friends(B,A)

  20. Probability of a possible world • P(X = x) = (1/Z) exp(Σᵢ wᵢ nᵢ(x)), where wᵢ is the weight of formula i and nᵢ(x) is the number of true groundings of formula i in the possible world x • A possible world becomes exponentially less likely as the total weight of all the ground clauses it violates increases.

  21. Inference in MLNs • MAP/MPE inference: find the most likely state of all unknown ground literals given the evidence • MaxWalkSAT algorithm [Kautz et al., 1997] • Cutting Plane Inference algorithm [Riedel, 2008] • Computing the marginal conditional probability of a set of ground literals: P(Y=y|x) • MC-SAT algorithm [Poon & Domingos, 2006] • Lifted first-order belief propagation [Singla & Domingos, 2008]
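To make the MAP step concrete, here is a minimal sketch of the MaxWalkSAT idea: a stochastic local search that flips ground atoms to reduce the total weight of unsatisfied ground clauses. The clause encoding and parameters below are assumptions for illustration, not the Alchemy implementation:

    import random

    def max_walk_sat(clauses, atoms, max_flips=10000, p=0.5):
        """Weighted stochastic local search for MAP inference (sketch).

        clauses: list of (weight, literals) ground clauses, where literals is a
                 list of (atom, sign) pairs; a clause is satisfied when some
                 literal has state[atom] == sign.
        atoms:   list of ground atom names.
        """
        state = {a: random.random() < 0.5 for a in atoms}

        def unsat_cost(s):
            # total weight of ground clauses violated by state s
            return sum(w for w, lits in clauses
                       if not any(s[a] == sign for a, sign in lits))

        best, best_cost = dict(state), unsat_cost(state)
        for _ in range(max_flips):
            unsat = [lits for w, lits in clauses
                     if not any(state[a] == sign for a, sign in lits)]
            if not unsat:
                return state                     # all ground clauses satisfied
            lits = random.choice(unsat)          # pick an unsatisfied clause
            if random.random() < p:              # random-walk move
                flip = random.choice(lits)[0]
            else:                                # greedy move: best atom in the clause
                flip = min((a for a, _ in lits),
                           key=lambda a: unsat_cost({**state, a: not state[a]}))
            state[flip] = not state[flip]
            cost = unsat_cost(state)
            if cost < best_cost:
                best, best_cost = dict(state), cost
        return best

With probability p the flip is random (to escape local optima); otherwise the atom whose flip most reduces the unsatisfied weight is chosen.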

  22. Existing structure learning methods for MLNs • Top-down approach: MSL [Kok & Domingos, 2005], [Biba et al., 2008] • Start from unit clauses and search for new clauses • Bottom-up approach: BUSL [Mihalkova & Mooney, 2007], LHL [Kok & Domingos, 2009] • Use data to generate candidate clauses

  23. Existing weight learning methods in MLNs • Generative: maximize the (Pseudo-)Log-Likelihood [Richardson & Domingos, 2006] • Discriminative: maximize the Conditional Log-Likelihood (CLL) • [Singla & Domingos, 2005]: a variant of the Structured Perceptron [Collins, 2002] • [Lowd & Domingos, 2007]: first- and second-order methods to optimize the CLL • Found that the Preconditioned Scaled Conjugate Gradient (PSCG) performs best
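For reference, the discriminative objective these methods optimize is the conditional log-likelihood of the query atoms given the evidence; in the notation of slide 20, its gradient with respect to a clause weight is the standard difference between observed and expected counts:

    \log P(y \mid x) = \sum_i w_i\, n_i(x, y) - \log Z_x

    \frac{\partial}{\partial w_i} \log P(y \mid x) = n_i(x, y) - \mathbb{E}_w\!\left[\, n_i(x, Y) \mid x \,\right]

The expectation is intractable in general, which is why the cited methods rely on approximate inference (e.g., MC-SAT) to estimate it.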

  24. Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses • Max-margin weight learning for MLNs • Future work • Conclusion

  25. Drug design for Alzheimer’s disease • Comparing different analogues of the Tacrine drug for Alzheimer’s disease on four biochemical properties: • Maximization of inhibition of amine re-uptake • Minimization of toxicity • Maximization of acetyl cholinesterase inhibition • Maximization of the reversal of scopolamine-induced memory impairment [Figures: Tacrine drug; template for the proposed drugs]

  26. Inductive Logic Programming • Use first-order logic to represent background knowledge and examples • Automated learning of logic rules from examples and background knowledge

  27. Inductive Logic Programming systems • GOLEM [Muggleton and Feng, 1992] • FOIL [Quinlan, 1993] • PROGOL [Muggleton, 1995] • CHILLIN [Zelle and Mooney, 1996] • ALEPH [Srinivasan, 2001]

  28. Inductive Logic Programming example [King et al., 1995]

  29. Results with existing learning methods for MLNs [Chart: average accuracy] • What happened: the existing learning methods for MLNs fail to capture the relations between the background predicates and the target predicate ⇒ new discriminative learning methods for MLNs are needed *MLN1: MSL + PSCG  **MLN2: BUSL + PSCG

  30. Proposed approach • Step 1: a clause learner generates candidate clauses (discriminative structure learning) • Step 2: good clauses are selected (discriminative weight learning)

  31. Discriminative structure learning • Use a variant of ALEPH, called ALEPH++, to produce a larger set of candidate clauses: • Score the clauses by the m-estimate [Dzeroski, 1991], a Bayesian estimate of the accuracy of a clause. • Keep all the clauses having an m-estimate greater than a pre-defined threshold (0.6), instead of only the final theory produced by ALEPH.
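A minimal sketch of the m-estimate score, assuming its usual definition (p and n are the positive and negative examples covered by a clause, p0 the prior probability of the positive class, m the equivalent sample size of the prior); the exact settings used by ALEPH++ may differ:

    def m_estimate(p, n, p0, m=2.0):
        """Bayesian estimate of a clause's accuracy [Dzeroski, 1991]."""
        return (p + m * p0) / (p + n + m)

    # Hypothetical usage: keep clauses whose score exceeds the 0.6 threshold
    candidates = [("clause_a", 40, 5), ("clause_b", 10, 9)]
    kept = [c for c, p, n in candidates if m_estimate(p, n, p0=0.5) > 0.6]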

  32. Facts:
    r_subst_1(A1,H)  r_subst_1(B1,H)  r_subst_1(D1,H)
    x_subst(B1,7,CL)  x_subst(HH1,6,CL)  x_subst(D1,6,OCH3)
    polar(CL,POLAR3)  polar(OCH3,POLAR2)  great_polar(POLAR3,POLAR2)
    size(CL,SIZE1)  size(OCH3,SIZE2)  great_size(SIZE2,SIZE1)
    alk_groups(A1,0)  alk_groups(B1,0)  alk_groups(D1,0)  alk_groups(HH1,1)
    flex(CL,FLEX0)  flex(OCH3,FLEX1)
    less_toxic(A1,D1)  less_toxic(B1,D1)  less_toxic(HH1,A1)
  ALEPH++ produces candidate clauses:
    x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
    alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
    x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
    …
  They are all non-recursive clauses.

  33. Discriminative weight learning • Maximize the CLL with L1-regularization • Use exact inference instead of approximate inference • Use L1-regularization instead of L2-regularization

  34. Exact inference • Since the candidate clauses are non-recursive, the query predicate appears only once in each clause, i.e. the probability of a query atom being true or false depends only on the evidence
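Concretely (a standard consequence of this structure, stated here for reference rather than taken from the slides), each query atom y_j then has the closed-form conditional probability

    P(y_j = 1 \mid x) = \frac{\exp\big(\sum_i w_i\, n_i(x, y_j{=}1)\big)}{\exp\big(\sum_i w_i\, n_i(x, y_j{=}1)\big) + \exp\big(\sum_i w_i\, n_i(x, y_j{=}0)\big)}

where n_i(x, y_j = v) counts the true groundings of clause i that mention y_j when it is set to v, so no approximate inference over the other query atoms is needed.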

  35. L1-regularization • Put a Laplacian prior with zero mean on each weight wi • L1 ignores irrelevant features by setting their weights to zero [Ng, 2004] • A larger value of b, the regularizing parameter, corresponds to a smaller variance of the prior distribution
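In the standard form (using the slide's b as the regularization parameter), the prior and the penalty it contributes to the training objective are

    p(w_i) = \frac{b}{2}\, e^{-b\,|w_i|} \quad\Rightarrow\quad \log p(w) = \text{const} - b \sum_i |w_i|

so maximizing the posterior amounts to maximizing the CLL minus b times the L1 norm of the weights.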

  36. CLL with L1-regularization • This is a convex but non-smooth optimization problem • Use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) software [Andrew & Gao, 2007] to solve the optimization problem
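Putting the two previous slides together, a minimal sketch of the objective that an L1-capable optimizer such as OWL-QN would minimize in the non-recursive case, written with NumPy; the array encoding and names are assumptions for illustration:

    import numpy as np

    def neg_l1_cll(w, n_true, n_false, b):
        """Negative CLL plus L1 penalty for non-recursive clauses (sketch).

        n_true[j, i]  : true groundings of clause i with query atom j at its
                        observed truth value; n_false[j, i] : with it flipped.
        """
        s_true = n_true @ w                 # log-score of the observed values
        s_false = n_false @ w               # log-score of the flipped values
        cll = np.sum(s_true - np.logaddexp(s_true, s_false))
        return -cll + b * np.sum(np.abs(w))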

  37. Facts:
    r_subst_1(A1,H)  r_subst_1(B1,H)  r_subst_1(D1,H)
    x_subst(B1,7,CL)  x_subst(HH1,6,CL)  x_subst(D1,6,OCH3)
    …
  Candidate clauses:
    alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
    x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
    x_subst(d1,6,m1) ^ alk_groups(d1,1) => less_toxic(d1,d2)
    …
  L1 weight learner produces weighted clauses:
    0.34487  alk_groups(d1,0) ^ r_subst_1(d2,H) => less_toxic(d1,d2)
    2.70323  x_subst(d1,6,m1) ^ polar(m1,POLAR3) ^ alk_groups(d1,1) => less_toxic(d1,d2)
    …
    0        x_subst(v8719,6,v8774) ^ alk_groups(v8719,1) => less_toxic(v8719,v8720)

  38. Experiments

  39. Datasets

  40. Methodology • 10-fold cross-validation • Metric: average predictive accuracy over the 10 folds
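For readers unfamiliar with the protocol, a minimal sketch of 10-fold cross-validated accuracy; the learner interface and data arrays here are placeholders (the actual experiments split relational data by fold, not i.i.d. feature vectors):

    import numpy as np
    from sklearn.model_selection import KFold

    def cross_validated_accuracy(learn, X, y, folds=10, seed=0):
        """Average predictive accuracy over k folds (sketch)."""
        accs = []
        for train_idx, test_idx in KFold(folds, shuffle=True, random_state=seed).split(X):
            model = learn(X[train_idx], y[train_idx])   # train on 9 folds
            preds = model.predict(X[test_idx])          # predict the held-out fold
            accs.append(np.mean(preds == y[test_idx]))
        return float(np.mean(accs))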

  41. Q1: Does the proposed approach perform better than existing learning methods for MLNs and traditional ILP methods? [Chart: average accuracy]

  42. Q2: The effect of L1-regularization [Chart: number of clauses]

  43. Q2: The effect of L1-regularization (cont.) [Chart: average accuracy]

  44. Q3: The benefit of collective inference • Adding a transitive clause with infinite weight to the learned MLNs: less_toxic(a,b) ^ less_toxic(b,c) => less_toxic(a,c) [Chart: average accuracy]

  45. Q4: The performance of our approach against other “advanced ILP” methods [Chart: average accuracy]

  46. Outline • Motivation • Background • Discriminative learning for MLNs with non-recursive clauses • Max-margin weight learning for MLNs • Future work • Conclusion

  47. Motivation • All of the existing training methods for MLNs learn a model that produces good predictive probabilities • In many applications, the actual goal is to optimize some application-specific performance measure such as F1 score (the harmonic mean of precision and recall) • Max-margin training methods, especially Structural Support Vector Machines (SVMs), provide a framework to optimize these application-specific measures ⇒ train MLNs under the max-margin framework

  48. Generic Structural SVMs [Tsochantaridis et al., 2004] • Learn a discriminant function f: X × Y → R • Predict, for a given input x, the output y with the highest score f(x, y) • Maximize the separation margin between the correct output and the best runner-up • Can be formulated as a quadratic optimization problem
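The standard formulation behind these bullets, reconstructed here from the cited paper's conventions (linear discriminant f(x,y) = wᵀΨ(x,y), loss Δ, slack variables ξᵢ) rather than copied from the slide:

    \hat{y} = \arg\max_{y \in Y} w^\top \Psi(x, y)

    \min_{w,\;\xi \ge 0}\ \tfrac{1}{2}\lVert w \rVert^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i
    \quad \text{s.t.}\quad \forall i,\ \forall y \ne y_i:\ \ w^\top \Psi(x_i, y_i) - w^\top \Psi(x_i, y) \ \ge\ \Delta(y_i, y) - \xi_i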

  49. Generic Structural SVMs (cont.) • [Joachims et al., 2009] proposed the 1-slack formulation of the Structural SVM, which makes the original cutting-plane algorithm [Tsochantaridis et al., 2004] faster and more scalable
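For reference, the 1-slack (margin-rescaling) formulation replaces the n slack variables above with a single slack shared across joint constraints:

    \min_{w,\;\xi \ge 0}\ \tfrac{1}{2}\lVert w \rVert^2 + C\,\xi
    \quad \text{s.t.}\quad \forall (\bar{y}_1,\dots,\bar{y}_n) \in Y^n:\ \ \frac{1}{n}\sum_{i=1}^{n} w^\top\big[\Psi(x_i, y_i) - \Psi(x_i, \bar{y}_i)\big] \ \ge\ \frac{1}{n}\sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi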

  50. Cutting plane algorithm for solving structural SVMs • The structural SVM problem has exponentially many constraints, but most are dominated by a small set of “important” constraints • The cutting plane algorithm repeatedly finds the next most violated constraint and adds it to a working set… • …until no new violated constraint can be found *Slide credit: Yisong Yue
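A minimal sketch of that loop for the 1-slack case, assuming a separation oracle most_violated(w) and a restricted-QP solver solve_qp(working_set); both names and their interfaces are placeholders, not the SVM-struct API:

    def cutting_plane(most_violated, solve_qp, epsilon=1e-3, max_iters=100):
        """1-slack cutting-plane training loop (sketch).

        most_violated(w): returns a constraint c exposing c.loss (right-hand
                          side) and c.margin(w) (left-hand side).
        solve_qp(W):      solves the QP restricted to working set W,
                          returning (w, xi).
        """
        working_set = []
        w, xi = solve_qp(working_set)          # start from the unconstrained problem
        for _ in range(max_iters):
            c = most_violated(w)               # separation oracle
            if c.loss - c.margin(w) <= xi + epsilon:
                break                          # nothing violated by more than epsilon
            working_set.append(c)
            w, xi = solve_qp(working_set)      # re-solve over the enlarged set
        return w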
