470 likes | 577 Views
Symbolic model checking of biochemical systems Logic programming steps towards formal biology François Fages, INRIA Rocquencourt http://contraintes.inria.fr/. Joint work with and Nathalie Chabrier-Rivier Sylvain Soliman
E N D
Symbolic model checking of biochemical systemsLogic programming steps towards formal biologyFrançois Fages, INRIA Rocquencourthttp://contraintes.inria.fr/ • Joint work with and • Nathalie Chabrier-Rivier Sylvain Soliman • In collaboration with ARC CPBIO http://contraintes.inria.fr/cpbio • Alexander Bockmayr, Vincent Danos, Vincent Schächter et al.
Current revolution in Biology • Elucidation of high-level biological processes • in terms of their biochemical basis at the molecular level. • Mass production of genomic and post-genomic data: • ARN expression, protein synthesis, protein-protein interactions,… • Need for a strong parallel effort on the formal representation of biological processes. • Need for formal tools for modeling and reasoning about their global behavior.
Formalisms for modeling biochemical systems • Diagrammatic notation • Boolean networks [Thomas 73] • Milner’s p–calculus [Regev-Silverman-Shapiro 99-01, Nagasali et al. 00] • Concurrent transition systems[Chabrier-Chiaverini-Danos-Fages-Schachter 03] • Biochemical abstract machine BIOCHAM [Chabrier-Fages-Soliman 03] • Pathway logic [Eker-Knapp-Laderoute-Lincoln-Meseguer-Sonmez 02] • Bio-ambients [Regev-Panina-Silverman-Cardelli-Shapiro 03] • Differential equations • Hybrid Petri nets [Hofestadt-Thelen 98, Matsuno et al. 00] • Hybrid automata [Alur et al. 01, Ghosh-Tomlin 01] • Hybrid concurrent constraint languages[Bockmayr-Courtois 01]
Our goal • Beyond simulation, provide formal tools for querying, validating and completing biological models. • Our proposal: • Use of temporal logic CTL as a query language for models of biological processes; • Use of concurrent transition systems for their modeling; • Use of symbolic and constraint-based model checkers for automatically evaluating CTL queries in qualitative and quantitative models. • Use of inductive logic programming for learning models [EU APRIL 2] • In course, learn and teach bits of biology with logic programs.
Plan of the talk • Introduction • A simple algebra of cell molecules • Concurrent transition systems of biochemical reactions • Example of the mammalian cell cycle control • Temporal logic CTL as a query language • Computational results with BIOCHAM • Learning models • An experiment with inductive logic programming • Quantitative models • Simulation with differential equations • Constraint-based model checking • Conclusion
References • A wonderful textbook: • Molecular Cell Biology. 5th Edition, 1100 pages+CD, Freeman Publ. • Lodish, Berk, Zipursky, Matsudaira, Baltimore, Darnell. Nov. 2003. • Genes and signals. Ptashne, Gann. CSHL Press. 2002. • Modeling dynamic phenomena in molecular and cellular biology. • Segel. Cambridge Univ. Press. 1987. • Modeling and querying bio-molecular interaction networks. • Chabrier, Chiaverini, Danos, Fages, Schächter. To appear in TCS. 2003. • The biochemical abstract machine BIOCHAM. Chabrier, Fages, Soliman. http://contraintes.inria.fr/BIOCHAM
2. A Simple Algebra of Cell Molecules • Small molecules: covalent bonds (outer electrons shared) 50-200 kcal/mol • 70% water • 1% ions • 6% amino acids (20), nucleotides (5), • fats, sugars, ATP, ADP, … • Macromolecules: hydrogen bonds, ionic, hydrophobic, Waals 1-5 kcal/mol • Stability and bindings determined by the number of weak bonds: 3D shape • 20% proteins (50-104 amino acids) • RNA (102-104 nucleotides AGCU) • DNA (102-106 nucleotides AGCT)
Structure levels of proteins • 1) Primary structure: word of n amino acids residues (20n possibilities) • linked with C-N bonds • ICLP • Isoleucine Cysteine Leucine Proline • 2) Secondary: word of m a-helix, b-strands, random coils,… (3m-10m) • stabilized by hydrogen bonds H---O • 3) Tertiary 3D structure: spatial folding • stabilized by • hydrophobic • interactions
Formal proteins • Cyclin dependent kinase 1 Cdk1 • (free, inactive) • Complex Cdk1-Cyclin A Cdk1–CycB • (low activity) • Phosphorylated Cdk1~{thr161}-CycB • at site threonine 161 • (high activity) • (BIOCHAM syntax)
Gene expression: DNA RNA protein • DNA:word over 4 nucleotides Adenine, Guanine, Cytosine, Thymine • double helix of pairs A--T and C---G • Replication: DNA synthesis • Genes: parts of DNA • Transcription: RNA copying from a gene • ERCC1-(PRB-JUN-CFOS)
Genome Size 3,200,000,000 pairs of nucleotides single nucleotide polymorphism 1 / 2kb
Algebra of Cell Molecules • E ::= Name|E-E|E~{E,…,E}|(E) S ::= _|E|S+S • Names: proteins, gene binding sites, molecules, abstract processes… • - : binding operator for protein complexes, gene binding sites, … • Non associative, non commutative (could be in most cases) • ~{…}: modification operator for phosphorylated sites, … • Associative, Commutative, Idempotent. • + : solution operator, “soup aspect”, Assoc. Comm. Idempotent, Neutral _ • No membranes, no transport formalized. Bitonal calculi [Cardelli 03].
Plan of the talk • Introduction • A simple algebra of cell molecules • Concurrent transition systems of biochemical reactions • Example of the mammalian cell cycle control • Temporal logic CTL as a query language • Computational results with BIOCHAM • Learning models • An experiment with inductive logic programming • Quantitative models • Simulation with differential equations • Constraint-based model checking • Conclusion
3. Concurrent Transition Syst. of Biochemical Reactions • Enzymatic reactions: • R ::= S=>S | S=[E]=>S | S=[R]=>S | S<=>S | S<=[E]=>S • (where A<=>B stands for A=>BB=>A and A=[C]=>B for A+C=>B+C, etc.) • define a concurrent transition system over integers denoting the multiplicity of the molecules (multiset rewriting). • One can associate a finite abstract CTS over booleanstate variables denoting the presence/absence of molecules • which correctly over-approximates the set of all possible behaviors • If we translate a reaction A+B=>C+D by 4 rules for possible consumption: • A+BA+B+C+D A+BA+B +C+D • A+BA+B+C+D A+BA+B+C+D
Four Rule Schemas • Complexation: A + B => A-B • Cdk1+CycB => Cdk1–CycB • Phosphorylation: A =[C]=> A~{p} • Cdk1–CycB =[Myt1]=> Cdk1~{thr161}-CycB • Cdk1~{thr14,tyr15}-CycB =[Cdc25~{Nterm}]=> Cdk1-CycB • Synthesis: _ =[C]=> A. • _ =[Ge2-E2f13-Dp12]=> CycA • Degradation: A =[C]=> _. • CycE =[UbiPro]=> _ (not for CycE-Cdk2 which is stable)
An Actin-Myosin Engine with ATP fuel • A two-stroke nano-engine: • Myosin + ATP => Myosin-ATP • Myosin-ATP => Myosin + ADP • http://www.sci.sdsu.edu/movies http://www-rocq.inria.fr/sosso/icema2
Cell Cycle: G1 DNA Synthesis G2 Mitosis • G1: CdK4-CycD • Cdk6-CycD • Cdk2-CycE • S: Cdk2-CycA • G2 • M: Cdk1-CycA • Cdk1-CycB
Kohn’s map detail for Cdk2 • Complexation with CycA and CycE Phosphorylation sites PY15 and P • Concurrent Transition Rules: • cdk2+cycA => cdk2-cycA. • cdk2~{p2}+cycA => cdk2~{p2}-cycA. • cdk2~{p1}+cycA => cdk2~{p1}-cycA. • cdk2~{p1,p2}+cycA => cdk2~{p1,p2}-cycA. • cdk2+cycE => cdk2-cycE. • cdk2+cycE~{p1} => cdk2-cycE~{p1}. • cdk2~{p2}+cycE => cdk2~{p2}-cycE. • … • 700 rules, 165 proteins and genes, 500 variables, 2500 states.
Translation in Prolog • Encode states with a single predicate p(A,B,C,D,E) • A+BC+D. p(1,1,_,_,E):-p(_,_,1,1,E). • C A. p(_,B,1,D,E):- p(1,B,_,D,E). • Thm. [Delzanno-Podelski 99] Predecessor(S) = TP(S) • Backward analysis by computing lfp(TP{p(x):-s}). • CLP-based Deductive Model Checker DMC [Delzanno-Podelski 99] • More efficient implementation using state-of-the-art symbolic model-checker NuSMV [Cimatti Clarke Giunchiglia Giunchiglia Pistore 02].
Plan of the talk • Introduction • A simple algebra of cell molecules • Concurrent transition systems of biochemical reactions • Example of the mammalian cell cycle control • Temporal logic CTL as a query language • Computational results with BIOCHAM • Learning models • An experiment with inductive logic programming • Quantitative models • Simulation with differential equations • Constraint-based model checking • Conclusion
E, A Non-determinism AG EU EF F,G,U Time 4. Temporal Logic CTL as a Query Language • Computation Tree Logic
Kripke Structures • A Kripke structure K is a triple (S; R; L) where S is a set of states, and RSxS is a total relation. • s |= f if f is true in s, • s |= E f if there is a path from s such that |= f, • s |= A f if for every path from s, |= f, • |= f if s |= f where s is the starting state of , • |= X f if 1 |= f, • |= F f if there exists k >0 such that k |= f, • |= G f if for every k >0, k |= f, • |= f1 U f2 iff there exists k>0 such that k |= f for all j < k j |= f. • Following [Emerson 90] we identify a formula f to the set of states which satisfy it f ~ {sS : s |= f}.
Symbolic Model Checking • Model Checking is an algorithm for computing, in a given finite Kripke structure the set of states satisfying a CTL formula: {sS : s |= f}. • Basic algorithm: represent K as a graph and iteratively label the nodes with the subformulas of f which are true in that node. • Add f to the states satisfying f • Add EF f(EX f) to the (immediate) predecessors of states labeled by f • Add E(f1 U f2 ) to the predecessor states of f2 while they satisfy f1 • Add EG f to the states for which there exists a path leading to a non trivial strongly connected component of the subgraph of states satisfying f • Symbolic model checking: use OBDDs to represent states and transitions as boolean formulas (S is finite).
Biological Queries (1/3) • About reachability: • Given an initial state init, can the cell produce some protein P? init EF(P) • Which are the states from which a set of products P1,. . . , Pn can be produced simultaneously? EF(P1^…^Pn) • About pathways: • Can the cell reach a state s while passing by another state s2? init EF(s2^EFs) • Is state s2 a necessary checkpoint for reaching state s? EF(s2U s) • Is it possible to produce P without using nor creating Q? EF(Q U s) • Can the cell reach a state s without violating some constraints c? init EF(cUs)
Biological Queries (2/3) • About stability: • Is a certain (partially described) state s a stable state? sAG(s)sAG(s) (s denotes both the state and the formula describing it). • Is s a steady state (with possibility of escaping) ? sEG(s) • Can the cell reach a stable state? initEF(AG(s))not a LTL formula. • Must the cell reach a stable state? initAF(AG(s)) • What are the stable states? Not expressible in CTL [Chan 00]. • Can the system exhibit a cyclic behavior w.r.t. the presence of P ? init EG((P EF P) ^ (P EF P))
Biological Queries (3/3) • About the correctness of the model: • Can one see the inaccuracies of the model and correct them? • Exhibit a counterexample pathway or a witness. Suggest refinements of the model or biological experiments to validate/invalidate the property of the model. • About durations: • How long does it take for a molecule to become activated? • In a given time, how many Cyclins A can be accumulated? • What is the duration of a given cell cycle’s phase? • CTL operators abstract from durations. Time intervals can be modeled in FO by adding numerical arguments for start times and durations.
Cell to Cell Signaling by Hormones and Receptors • Receptor Tyrosine Kinase RTK • RAF + RAFK -> RAF-RAFK • RAFp + RAFPH -> RAFp-RAFPH • MEKp + RAFp -> MEKp-RAFp • … • RAF-RAFK -> RAF + RAFK. • RAFp-RAFPH -> RAFp + RAFPH. • MEKp-RAFp -> MEKp + RAFp. • … • RAF-RAFK -> RAFK + RAFp. • RAFp-RAFPH -> RAF + RAFPH. • MEKp-RAFp -> MEKpp + RAFp. • …
Cell to Cell Signaling by Hormones and Receptors • Receptor Tyrosine Kinase RTK • RAF + RAFK -> RAF-RAFK • RAFp + RAFPH -> RAFp-RAFPH • MEKp + RAFp -> MEKp-RAFp • … • RAF-RAFK -> RAF + RAFK. • RAFp-RAFPH -> RAFp + RAFPH. • MEKp-RAFp -> MEKp + RAFp. • … • RAF-RAFK -> RAFK + RAFp. • RAFp-RAFPH -> RAF + RAFPH. • MEKp-RAFp -> MEKpp + RAFp. • … MEKp is a checkpoint for the cascade (producing MAPKpp) ?- nusmv(!(E(!(MEKp) U MAPKpp))). true The PH complexes are only here to "slow down" the cascade ?- nusmv(E(!(MEKp-MEKPH) U MAPKpp)). true
Cell Cycle: G1 DNA Synthesis G2 Mitosis • G1: CdK4-CycD • Cdk6-CycD • Cdk2-CycE • S: Cdk2-CycA • G2 • M: Cdk1-CycA • Cdk1-CycB
Mammalian Cell Cycle Control Benchmark • 700 rules, 165 proteins and genes, 500 variables, 2500 states. • BIOCHAM NuSMV model-checker time in seconds:
Plan of the talk • Introduction • A simple algebra of cell molecules • Concurrent transition systems of biochemical reactions • Example of the mammalian cell cycle control • Temporal logic CTL as a query language • Computational results with BIOCHAM • Learning models • An experiment with inductive logic programming • Quantitative models • Simulation with differential equations • Constraint-based model checking • Conclusion
5. Learning Models • Basic idea: learn reaction rules from temporal properties of the system. • Learning of yeast cell cycle rules from reachability properties and counterexamples with Progol [Muggleton 00]. • reaction([m_CP,m_Y],[m_pM]). • reaction([m_CP],[m_C2]). • % reaction([m_pM],[m_M]). • reaction([m_M],[m_C2,m_YP]). • reaction([m_C2],[m_CP]). • reaction([m_YP],[]). • reaction([],[m_Y]). • pathway(S1,S2) :- same(S1,S2). • pathway(S1,S2) :- reaction(L1,L2), transition(S1,L1,S3,L2), pathway(S3,S2).
Inductive Logic Programming • reaction([m_pM],[m_M]) learned… • 6th PCRD APRIL 2 “Applications of Probabilistic Inductive Logic Progr.” Luc de Raedt, Univ. Freiburg, Stephen Muggleton, Univ. London.
Plan of the talk • Introduction • A simple algebra of cell molecules • Concurrent transition systems of biochemical reactions • Example of the mammalian cell cycle control • Temporal logic CTL as a query language • Computational results with BIOCHAM • Learning models • An experiment with inductive logic programming • Quantitative models • Simulation with differential equations • Constraint-based model checking • Conclusion
6. Quantitative Models • Enzymatic reactions with rates k1 k2 k3 • E+S k1 C k2 E+P • E+S k3 C • can be compiled by the law of mass action into a system of • Ordinary Differential Equations • dE/dt = -k1ES+(k2+k3)C • dS/dt = -k1ES+k3C • dC/dt = k1ES-(k2+k3)C • dP/dt = k2C
Circadian Cycle Model • C' = -(k1*C)-k4*C-kdC*C +k2*CN+k3*P2*T2 • CN' = k1*C-k2*CN-kdN*CN • MP' = (KIP^n*nusP)/(KIP^n+CN^n) • -kd* MP-(numP*MP)/(KmP+MP) • MT' = (KIT^n*nusT)/(KIT^n+CN^n) • -MT[ t]*(kd+numT/(KmT+MT)) • P0' = ksP*MP-kd*P0-(V1P*P0)/( K1P+P0) • +(V2P*P1)/(K2P+P1) • P1' = (V1P*P0)/(K1P+P0)-kd*P1 -(V2P*P1)/(K2P+P1) • -(V3P*P1)/( K3P+P1)+(V4P*P2)/(K4P+P2) • P2' = k4*C+(V3P*P1)/(K3P+P1) -kd*P2-(V4P*P2)/(K4P+P2) • -(nudP*P2)/(KdP+P2)-k3*P2*T2 • T0' = ksT*MT-kd*T0-(V1T*T0)/( K1T+T0)+(V2T*T1)/(K2T+T1) • T1' = (V1T*T0)/(K1T+T0)-kd*T1 -(V2T*T1)/(K2T+T1)-(V3T*T1)/( K3T+T1)+(V4T*T2)/(K4T+T2) • T2' = k4*C+(V3T*T1)/(K3T+T1) -k3*P2*T2-(V4T*T2)/(K4T+T2) -T2*(kd+nudT/(KdT+T2))
Gene Interaction Networks • Gene interaction example [Bockmayr-Courtois 01] • Hybrid Concurrent Constraint Programming HCC [Saraswat et al.] • 2 genes x and y. • dx/dt = 0.01 – 0.02*x if y < 0.8 • dx/dt = – 0.02*x if y ≥ 0.8 • dy/dt = 0.01*x
Concurrent Transition System • Time discretized using Euler’s method (Runge-Kutta method in HCC): • y < 0.8 x’ = x + dt*(0.01-0.02*x) , y’ = y + dt*0.01*x • y ≥ 0.8 x’ = x + dt*(0.01-0.02*x) , y’ = y + dt*0.01*x • Initial condition: x=0, y=0. • CLP(R) program • Init :- X=0, Y=0, p(X,Y). • p(X,Y):-X>=0, Y>=0, Y<0.8, • X1=X-0.02*X+0.01, Y1=Y+0.01*X, p(X1,Y1). • p(X,Y):-X>=0, Y>=0, Y>=0.8, • X1=X-0.02*X, Y1=Y+0.01*X, p(X1,Y1).
Proving CTL properties by computing fixpoints of CLP programs Theorem [Delzanno Podelski 99] EF(f)=lfp(TP{p(x):-f}), EG(f)=gfp(TPf). Safety property AG(f) iff EF(f) iff initlfp(TP{f}) Liveness property AG(f1AF(f2)) iff initlfp(TPf1gfp(TP{f2})) Prolog-based implementation in CLP(R,B) [Delzanno 00] Applications to life in silico: Proof of protocols, cache consistency, etc. [Delzanno 01]
Deductive Model Checker DMC: Gene Interaction • r(init, p(s_s,A,B), {A=0,B=0}). • r(p(s_s,A,B), p(s_s,C,D), {A>=0,B>=0.8,C=A-0.02*A,D=B+0.01*A}). • r(p(s_s,A,B), p(s_s,C,D), {A>=0,B>=0,B<0.8, • C=A-0.02*A+0.01,D=B+0.01*A}). • | ?- prop(P,S). • P = unsafe, S = p:s*(x>=0.6) • | ?- ti. • Property satisfied. Execution time 0.0 • | ?- ls. • s(0, p(s_s,A,_), {A>=0.6}, 1, (0,0)).
Demonstration DMC (continued) • | ?- prop(P,S). • P = unsafe, S = p:s*(x>=0.2) ? • | ?- ti. • Property NOT satisfied. Execution time 1.5 • | ?- ls. • s(0, p(s_s,A,_), {A>=0.2}, 1, (0,0)). • s(1, p(s_s,A,B), {B<0.8,B>=-0.0,A>=0.19387755102040816}, 2, (2,1)). • … • s(26, p(s_s,A,B), {B>=0.0,A>=0.0, • B+0.1982676351105516*A<0.7741338175552753}, 27, (2,26)). • s(27, init, {}, 28, (1,27)).
7. Conclusion • The great ambition of logic programming is to make of programming a modeling task in the first place, with equations, constraints and logical formulae. • In this respect, computational molecular biology offers numerous challenges to the logic programming community at large. • Besides combinatorial search and optimization problems coming from molecular biology (DNA and protein sequence comparison, protein structure prediction,…) there is a need to model globally the system at hand and automate reasoning on all its possible behaviors.
Conclusion • The biochemical abstract machine BIOCHAM project aims at developing: • Qualitative models of complex biochemical processes: • Intracellular and extracellular signaling, cell-cycle control,… [http://contraintes.inria.fr/CMBSlib] • Prolog-based implementation + BDD symbolic model-checking • ILP-based learning of models from temporal properties [6thPCRD APRIL 2] • Membranes and transportation not modeled • Bitonal algebras [Cardelli et al. 03] BioAmbients, Brane calculi [Cardelli et al. 03]
Perspectives for LP • Quantitative models: • Differential equations • Hybrid discrete-continuous time models • Hybrid concurrent constraint programming [Bockmayr-Courtois-Eveillard 03] • CLP-based model-checking [Delzanno-Podelski 99] [Chabrier-Fages 03] • Multi-scale molecular-electro-physiological models[Sorine et al. 03] • http://www-rocq.inria.fr/sosso/icema2 • http://www.sci.sdsu.edu/movies