Models and Algorithmic Tools for Computational Processes in Cellular Biology
Bhaskar DasGupta
Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607-7053
dasgupta@cs.uic.edu

What is “systems biology” in one sentence?
The study that seeks to unravel and conceptualize the dynamic processes, feedback control loops and signal processing mechanisms underlying life
ISBRA 2012

Cellular Networks
• A single cell by itself is complex enough
• Various technologies have facilitated the monitoring of gene expression and protein activity
• It remains difficult to find the causal relations and overall structure of the network
[Figure: http://www.nyas.org/ebriefreps/ebrief/000534/images/mendes2.gif]

Cellular Networks
Genes and gene products interact on several levels, e.g.:
• Genes regulate each other’s expression as part of gene regulatory networks
  • transcription factors can activate or inhibit the transcription of genes to give mRNAs
  • these transcription factors are themselves products of genes
• Protein-protein interaction networks
  • proteins can participate in diverse post-translational interactions that lead to modified protein functions or to formation of protein complexes that have new roles
• Different levels of interactions are integrated
  • e.g., presence of an external signal triggers a cascade of interactions that involves biochemical reactions, protein-protein interactions and transcriptional regulation

Cellular networks
• cellular interaction maps only represent a network of possibilities: not all edges are present and active in vivo in a given condition or in a given cellular location
• only an integration of time-dependent interaction and activity information can give the correct dynamical picture of a cellular network

Modeling problem
• interaction data is produced by the biologist in the form of a diagram (e.g., some type of labeled digraph)
• we wish to pose questions about the behavior (dynamics) of such a network
• it is essential to provide a precise mathematical formulation of its dynamics, and specifically of how the state of each node depends on the state of the nodes interacting with it

Models
• discrete, continuous and hybrid models
• their inter-relationships, powers and limitations
• computational complexity and algorithmic issues
• biological implications and validations
• fascinating interplay between several areas such as:
  • biology
  • control theory
  • discrete mathematics and computer science

System dynamics
• state variables
  • continuous
  • discrete (e.g., small number of quantitative states)
• time variables
  • continuous (e.g., partial differential equations, delay equations)
  • discrete (difference equations, quantized descriptions of continuous variables)
• deterministic or probabilistic nature of the model
• hybrid models
  • combine continuous and discrete time-scales and/or
  • combine continuous and discrete state variables

Continuous-state dynamics
• Differential equation (continuous-time)
• Difference equation (discrete-time)
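The two forms can be related numerically. The following is a minimal sketch, assuming a generic continuous-time model dx/dt = f(x) (an assumed form, not taken from the slide): a forward-Euler step with a hypothetical step size h turns it into the difference equation x[k+1] = x[k] + h·f(x[k]). The decay model f(x) = -x is purely illustrative.

```python
import math

def euler_trajectory(f, x0, h, steps):
    """Iterate the difference equation x[k+1] = x[k] + h * f(x[k])."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + h * f(xs[-1]))
    return xs

def decay(x):
    """Illustrative right-hand side of dx/dt = -x (an assumption)."""
    return -x

# Integrate from t = 0 to t = 1 and compare with the exact solution exp(-t).
xs = euler_trajectory(decay, x0=1.0, h=0.01, steps=100)
error = abs(xs[-1] - math.exp(-1.0))
```

A smaller h brings the discrete-time trajectory closer to the continuous-time one, at the cost of more steps.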

Examples of other models
• Boolean: x1, x2, x3 ∈ {0,1}
• Boolean feedforward
• Signal transduction
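A Boolean model of this kind can be sketched in a few lines. The update rules below are hypothetical and chosen only for illustration; they are not the rules shown on the slide. Each variable takes a value in {0,1} and all three are updated synchronously.

```python
from itertools import product

def step(state):
    """Synchronous update of a 3-node Boolean network (hypothetical rules)."""
    x1, x2, x3 = state
    return (
        int(x2 and not x3),  # x1 is activated by x2 and inhibited by x3
        int(x1),             # x2 copies x1
        int(x1 or x2),       # x3 is activated by x1 or x2
    )

def fixed_points():
    """Enumerate all states in {0,1}^3 that map to themselves."""
    return [s for s in product((0, 1), repeat=3) if step(s) == s]

steady = fixed_points()
```

With only 2^3 states, exhaustive enumeration is enough; larger networks would need a different strategy.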

Reverse engineering of models
Given
• partial knowledge about the process/network
• access to suitable biological experiments
How to gain more knowledge about the model?
• effective use of resources (time, cost)

Reverse engineering
Process of backward reasoning, requiring careful observation of inputs and outputs, to elucidate the structure of the system
[Figure: http://www.computerworld.com/computerworld/records/images/story/46Reverse-engineering.gif]

Ingredients for reverse engineering
• Mathematical model to be reverse engineered
  • e.g., differential equation model
• Biological experiments available, e.g.,
  • perturbation experiments
  • gene expression measurements

Many reverse engineering approaches are possible
I will discuss two types of approaches:
• “hitting set” based combinatorial approaches
• modular response analysis (MRA) approach

Reverse Engineering of Networks Via Modular Response Analysis Method

Ingredients for reverse engineering via modular response analysis approach
• Mathematical models
  • differential equation model
• Biological experiments available
  • perturbation experiments

Differential Equation Model
• state variables evolve by (unknown) ordinary differential equations dx/dt = f(x, p)
• x = (x_1(t), ..., x_n(t)): state variables over time t, measurable (e.g., activity levels of proteins)
• p = (p_1, ..., p_m): parameters that can be manipulated
• f(x*, p*) = 0, where p* is the “wild-type” (i.e., normal) condition of p and x* is the corresponding steady-state condition

Settings for modular response analysis method
• do not know f
• but prior information of the following type is available:
  • whether parameter p_j affects variable x_i or not (i.e., whether ∂f_i/∂p_j ≡ 0 or not)
Kholodenko, Kiyatkin, Bruggeman, Sontag, Westerhoff and Hoek, PNAS, 2002

Experimental protocols (perturbation experiments)
• perturb one parameter, say p_j
• for perturbed p, measure the steady state vector x = ξ(p)
  • let the system relax to steady state
  • measure x_i (western blots, microarrays, etc.)
• estimate n “sensitivities”: ∂ξ_i/∂p_j (p*) ≈ (ξ_i(p* + Δp_j e_j) − ξ_i(p*)) / Δp_j for i = 1, ..., n, where Δp_j is the size of the perturbation and e_j is the jth canonical basis vector
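The sensitivity estimate amounts to a finite difference between the perturbed and wild-type steady states. A minimal sketch, assuming the measurement is available as a function from p to the steady state; the map `steady_state` below is a hypothetical stand-in for an experiment, and `delta` is an assumed perturbation size:

```python
def steady_state(p):
    """Hypothetical steady-state map for a toy 2-variable system."""
    p1, p2 = p
    return [2.0 * p1, p1 + 3.0 * p2]

def sensitivities(measure, p_star, j, delta=1e-6):
    """Estimate the change of every steady-state variable per unit change of p_j."""
    perturbed = list(p_star)
    perturbed[j] += delta                 # p* + delta * e_j
    base = measure(p_star)                # wild-type steady state
    moved = measure(perturbed)            # steady state after the perturbation
    return [(m - b) / delta for m, b in zip(moved, base)]

column_0 = sensitivities(steady_state, [1.0, 1.0], j=0)
column_1 = sensitivities(steady_state, [1.0, 1.0], j=1)
```

Each call corresponds to one perturbation experiment and yields one column of the sensitivity matrix. Real measurements are noisy, so the perturbation cannot be made arbitrarily small in practice.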

Modeling Goal
Modeling goal can be at different levels:
• Topology of connections only
• Direction of the relationship
• Information about stimulatory or inhibitory effects
• Strength of relationship
[Figure: example network on nodes A, B, C, D with signed (+/-) and weighted edges]

Goal of MRA approach
Obtain information about the sign of ∂f_i/∂x_j (x, p)
e.g., if ∂f_i/∂x_j > 0, then x_j has a positive (catalytic) effect on the formation of x_i

In a nutshell
After some combinatorics and linear algebra, one can quantify the additional prior knowledge necessary to reach the goal
• Kholodenko, Kiyatkin, Bruggeman, Sontag, Westerhoff and Hoek, PNAS, 2002
• Berman, DasGupta and Sontag, Discrete Applied Math, 2007
• Berman, DasGupta and Sontag, Annals of NYAS, 2007
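One piece of the linear algebra can be sketched as follows. Write ξ(p) for the steady state reached under parameters p, so that f(ξ(p), p) = 0. Differentiating with respect to p_j gives Σ_k (∂f_i/∂x_k)(∂ξ_k/∂p_j) + ∂f_i/∂p_j = 0, so whenever ∂f_i/∂p_j ≡ 0 the ith row of ∂f/∂x is orthogonal to the jth column of sensitivities. For n = 3, two such independent columns fix that row up to scale, and a cross product recovers it. The numbers below come from a hypothetical linear cascade x1 → x2 → x3 in which p_j acts only on x_j, and the sign convention (negative diagonal entry) is an assumption standing in for additional prior knowledge.

```python
def cross(u, v):
    """Cross product of two 3-vectors; the result is orthogonal to both."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )

def row_direction(sens_cols, i, unaffecting):
    """Recover row i of df/dx up to a positive scale (n = 3 only)."""
    j, k = unaffecting                     # parameters known not to affect f_i
    a = cross(sens_cols[j], sens_cols[k])  # orthogonal to both sensitivity columns
    if a[i] > 0:                           # assumed convention: diagonal entry < 0
        a = tuple(-c for c in a)
    return a

# Toy sensitivity columns (hypothetical): column j is d(steady state)/d p_j.
sens_cols = [(1, 1, 1), (0, 1, 1), (0, 0, 1)]
row2 = row_direction(sens_cols, i=1, unaffecting=(0, 2))
signs = [(c > 0) - (c < 0) for c in row2]
```

Here the recovered signs say that x1 has a positive effect on x2, x2 has a negative (self-degrading) effect on itself, and x3 has no direct effect on x2.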

But, assuming (near-)sufficient prior information
• how to determine a minimum or near-minimum number of perturbation experiments that will work?
This now becomes an algorithmic/complexity issue...

After some effort, one can see that designing minimal sets of experiments leads to the set multi-cover problem

In our biological application context, our set-multicover algorithm provides a set of suggested experiments such that
number of experiments ≈ minimum possible
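To make the problem concrete: in set multicover, every element of a universe must be contained in at least r of the chosen sets, and the goal is to choose as few sets as possible. The sketch below uses a plain greedy heuristic, which is not the randomized algorithm referred to in the talk; the experiment names and their contents are hypothetical.

```python
def greedy_multicover(universe, sets, r):
    """Greedily pick set names until every element is covered r times."""
    need = {u: r for u in universe}        # residual demand per element
    remaining = dict(sets)                 # each set may be used at most once
    chosen = []
    while any(need.values()):
        def gain(name):
            # Number of still-needed elements this set would help cover.
            return sum(1 for u in remaining[name] if need[u] > 0)
        best = max(sorted(remaining), key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("coverage demand cannot be met")
        for u in remaining.pop(best):
            if need[u] > 0:
                need[u] -= 1
        chosen.append(best)
    return chosen

# Hypothetical experiments, each "covering" some of the elements 1..4.
experiments = {
    "E1": {1, 2}, "E2": {2, 3}, "E3": {3, 4}, "E4": {1, 4}, "E5": {1, 2, 3, 4},
}
picked = greedy_multicover({1, 2, 3, 4}, experiments, r=2)
```

In this small instance three sets are needed, and the greedy choice happens to match that minimum; in general a greedy heuristic only approximates it.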

Overall high-level picture
Modular Response Analysis for differential equations model → linear algebraic formulation → combinatorial formulation → combinatorial algorithms (randomized) → selection of appropriate perturbation experiments

Experimental validation of MRA method
See the paper: S. D. M. Santos, P. J. Verveer, P. I. H. Bastiaens, Growth factor-induced MAPK network topology shapes Erk response determining PC-12 cell fate, Nature Cell Biology 9, 324-330 (2007)
• the MAPK pathway involving proteins Raf, Mek and Erk is activated through the receptor tyrosine kinases TrkA and epidermal growth factor receptor (EGFR) by two different stimuli, NGF (nerve growth factor) or EGF (epidermal growth factor)
• the MRA method was applied to determine the MAPK network architecture in the context of NGF and EGF stimulations

Reverse Engineering of Networks Via Hitting-set based (combinatorial) Method

“Hitting set” based combinatorial approaches
• steady state profiles of perturbations of the network → hitting set → topology of interconnection network
  • multi-hitting set: introduce redundancy
• expression data representing state transition measurements for wild-type and perturbation data → hitting set → topology of interconnection network
  • multi-hitting set: introduce redundancy

Basic idea behind the hitting-set based approaches
• x5 changes, and so do x1, x3, x4: at least one of {x1,x3,x4} must influence x5
• build dependency information over all successive time steps
• minimal dependency (hitting set problem): which variables influence x5?
  • observed candidate sets: {x1,x2}, {x1,x3,x4}, {x2,x3,x4}
  • a minimal hitting set after each successive observation: {x1}, {x1}, {x1,x3}
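The hitting set step can be sketched by brute force on the three candidate sets for x5 listed on this slide. This is only an illustration for tiny inputs; the hitting set problem is NP-hard in general, and the published methods use approximations rather than exhaustive search.

```python
from itertools import combinations

def minimum_hitting_sets(families):
    """Return every smallest set that intersects each set in `families`."""
    universe = sorted(set().union(*families))
    for size in range(1, len(universe) + 1):
        hits = [set(c) for c in combinations(universe, size)
                if all(set(c) & s for s in families)]
        if hits:
            return hits
    return []

# Candidate influencer sets for x5, one per observed change.
observations = [{"x1", "x2"}, {"x1", "x3", "x4"}, {"x2", "x3", "x4"}]
# Recompute the minimum hitting sets as each observation is added.
after_each_step = [minimum_hitting_sets(observations[:k]) for k in (1, 2, 3)]
```

After the second observation {x1} is the unique minimum hitting set. After the third, several two-element sets qualify, {x1,x3} among them, which shows that minimal dependency sets need not be unique.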

Why construct “minimal” dependency?
Occam's razor: entia non sunt multiplicanda praeter necessitatem (entities must not be multiplied beyond necessity)
However, biological networks may be redundant, e.g.:
• G. Tononi, O. Sporns, G. M. Edelman, PNAS, 1999
• R. Albert et al., Physical Review E, 2011
How can we introduce redundancy if necessary?

How can we introduce redundancy if necessary?
• First idea: add random extra dependencies (edges)
  • not good: these edges may not be supported by the given data
• Better idea: modify hitting set to “multi-hitting set”
  • for a set such as {x1,x3,x4}: previously, select at least 1; now, select at least 2 (in general, some r)
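The multi-hitting set variant can be sketched by the same kind of exhaustive search, now requiring at least r chosen elements from every set. Capping the requirement at the size of the set (for sets smaller than r) is an assumption made here so that a solution always exists; the example sets are illustrative.

```python
from itertools import combinations

def minimum_multi_hitting_set(families, r):
    """Smallest set containing at least min(r, |S|) elements of every S."""
    universe = sorted(set().union(*families))
    for size in range(len(universe) + 1):
        for cand in combinations(universe, size):
            chosen = set(cand)
            if all(len(chosen & s) >= min(r, len(s)) for s in families):
                return chosen
    # Not reached: the full universe always satisfies every requirement.

families = [{"x1", "x2"}, {"x1", "x3", "x4"}, {"x2", "x3", "x4"}]
plain = minimum_multi_hitting_set(families, r=1)
redundant = minimum_multi_hitting_set(families, r=2)
```

With r = 1 this reduces to the ordinary hitting set problem; raising r adds dependencies that are all still supported by the data.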

Evaluation of performance of reverse engineering methods
Reverse-engineering problems are ill-posed, i.e., their solution is not unique
• existence of measurement error
• not all molecular species involved in a given analyzed phenomenon are included in the construction of a network, i.e., existence of hidden variables
Two possible ways for evaluation:
• Experimental testing of predictions: after a model has been inferred, newly found interactions or predictions can be tested experimentally
• Benchmark testing: measure how “accurate” the method of interest is in recovering a known (“gold standard”) network

Evaluation of performance of reverse engineering methods
Metrics for accuracy for benchmark testing
Measurements:
• correct interactions inferred (true positives, TP)
• incorrect interactions inferred (false positives, FP)
• correct non-interactions inferred (true negatives, TN)
• incorrect non-interactions inferred (false negatives, FN)
Metrics:
• recall or true positive rate
• false positive rate
• accuracy
• precision or positive predictive value
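These metrics have standard definitions in terms of the four counts: recall = TP/(TP+FN), false positive rate = FP/(FP+TN), accuracy = (TP+TN)/(TP+FP+TN+FN), precision = TP/(TP+FP). A minimal sketch follows, assuming a network is represented as a set of directed edges and that self-loops are ignored (both are assumptions for illustration); the tiny gold and inferred networks are hypothetical.

```python
from itertools import permutations

def edge_counts(nodes, gold, inferred):
    """Count TP, FP, TN, FN over all ordered pairs of distinct nodes."""
    tp = fp = tn = fn = 0
    for pair in permutations(nodes, 2):
        in_gold, in_inferred = pair in gold, pair in inferred
        if in_inferred and in_gold:
            tp += 1
        elif in_inferred:
            fp += 1
        elif in_gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Standard benchmark metrics; assumes no denominator is zero."""
    return {
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
    }

gold = {("A", "B"), ("B", "C")}
inferred = {("A", "B"), ("A", "C")}
counts = edge_counts(["A", "B", "C"], gold, inferred)
scores = metrics(*counts)
```

A point in ROC space is the pair (false positive rate, recall); random guessing lies on the diagonal where the two are equal.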

Two published methods based on the hitting set approach
(A) Ideker, Thorsson, Karp, PSB (2000)
• First step (network inference): estimate a set of Boolean networks consistent with an observed set of steady-state gene expression profiles, each generated from a different perturbation to the genetic network
• Second step (optimization): use an entropy-based approach to select an additional perturbation experiment to perform a model selection from the set of predicted Boolean networks
(B) Jarrah, Laubenbacher, Stigler, Stillman, Adv. in Applied Mathematics (2007)
• Attempts to infer the most likely causal relationships among network elements from gene expression data
For other published results, see, for example: Krupa, Journal of Theoretical Biology (2002)

Comparative analysis (via benchmark testing) of two approaches by
(A) Ideker, Thorsson, Karp
(B) Jarrah, Laubenbacher, Stigler, Stillman
Two gold standard networks:
a. Segment polarity network of Drosophila melanogaster (fruit fly):
• last step in the hierarchical cascade of gene families initiating the segmented body of the fruit fly
• genes of this network include engrailed (en), wingless (wg), hedgehog (hh), patched (ptc), cubitus interruptus (ci) and sloppy paired (slp), coding for the corresponding proteins
• 1 para-segment of 4 cells
• 60 nodes: variables are expression levels of segment polarity genes/proteins
• Boolean model from (Albert and Othmer, Journal of Theoretical Biology, 2003)
DasGupta, Vera-Licona, Sontag, 2011

b. In silico network: gene regulatory network with external perturbations
• 13 species: 10 genes plus 3 different environmental perturbations
• perturbations affect the transcription rate of the gene on which they act directly (through inhibition or activation), and their effect is propagated throughout the network by the interactions between the genes
• generated using the software package in (Mendes, Trends Biochem. Sci, 1997)
DasGupta, Vera-Licona, Sontag, 2011

• generated time courses for both networks (a) and (b)
• for method (A), we considered both greedy and linear programming based approximations to the hitting set problem, as well as redundancy values R = 1, 2
• for method (B), input data must be discrete; we used three discretization methods:
  • graph-theoretic based approach “D” from (Dimitrova, Garcia-Puente, Jarrah, Laubenbacher, Stigler, Stillman, Vera-Licona, 2010)
  • quantile “Q” discretization (method in which each variable state receives an equal number of data values)
  • interval “I” discretization (select thresholds for the different discrete values)
DasGupta, Vera-Licona, Sontag, 2011
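The quantile and interval discretizations can be sketched as below. This is a simplified illustration, not the implementation used in the cited study: the function names, the threshold value and the sample data are hypothetical, and ties in the quantile version are broken arbitrarily by sort order.

```python
from bisect import bisect_right

def interval_discretize(values, thresholds):
    """Map each value to the index of the threshold interval it falls in."""
    cuts = sorted(thresholds)
    return [bisect_right(cuts, v) for v in values]

def quantile_discretize(values, levels):
    """Assign states so that each receives (nearly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    states = [0] * len(values)
    for rank, i in enumerate(order):
        states[i] = rank * levels // len(values)
    return states

expression = [0.1, 0.9, 0.4, 0.2, 0.8, 0.3]
by_interval = interval_discretize(expression, thresholds=[0.5])
by_quantile = quantile_discretize(expression, levels=2)
```

The two schemes disagree on the value 0.4: it lies below the threshold 0.5, yet it belongs to the upper half of the data, so the choice of discretization can change the input that a method sees.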

Summary of Comparison
• network (b):
  • method (B) was better than method (A) in ROC space
  • method (A) achieved a performance no better than random guessing
• network (a):
  • method (B) could not obtain any results after running for over 12 hours
  • method (A) was able to compute results in less than 1 minute
  • method (A) improved slightly when small redundancy was introduced
DasGupta, Vera-Licona, Sontag, 2011

• implementation of method (B): http://polymath.vbi.vt.edu/polynome/
• implementation of method (A), done by (DasGupta, Vera-Licona, Sontag, 2011): http://sts.bioengr.uic.edu/causal/

Direct Synthesization of Signal Transduction Networks Only from known interactions and information No new experiments needed

Overall Goal
• input:
  • direct interactions: A → B, A ┤ B
  • double-causal interactions: A → (B → C), A → (B ┤ C)
  • additional information
• method (algorithms, software): FAST
• output: network of minimal complexity that is biologically relevant

Nature of experimental evidence
• biochemical
  • direct interaction, e.g.,
    • binding of two proteins
    • a transcription factor activating the transcription of a gene
    • a chemical reaction with a single reactant and single product
• pharmacological
  • indirect causal effects most probably resulting from a chain of interactions and reactions, e.g.,
    • binding of a chemical to a receptor protein starts a cascade of protein-protein interactions and chemical reactions that ultimately results in the transcription of a gene
• genetic evidence of differential responses to a stimulus
  • can be direct, but most often indirect (double-causal)

We describe a method for synthesizing double-causal (path-level) information into a consistent network

Direct interactions
• A promotes B: A → B
• A inhibits B: A ┤ B
Illustration of double-causal interaction
• C promotes the process of A promoting B
[Figure: nodes A, B and C, with a pseudo-vertex placed between A and B]

“Critical” edge (known direct interaction, part of input)

Main computational steps for network synthesis
• Pseudo-vertex collapse (PVC)
  • easy
• Binary transitive reduction (BTR)
  • hard
  • need heuristics

Pseudo-vertex collapse (PVC)
Intuitively, PVC is useful for reducing the pseudo-vertex set to the minimal set that keeps the graph consistent with all indirect experimental observations.
[Figure: two pseudo-vertices u and v with in(u) = in(v) and out(u) = out(v) are collapsed into a new pseudo-vertex uv]

Illustration of Binary Transitive Reduction (BTR)
[Figure: two candidate edges marked “remove?”; one can be removed (yes, alternate path), the other cannot (no, critical edge)]
Intuitively, the BTR problem is useful for determining the sparsest graph consistent with a set of experimental observations
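A simple greedy heuristic conveys the idea; it is a sketch under stated assumptions, not the algorithm from the talk. Assumptions: each edge carries a label 0 (promotes, →) or 1 (inhibits, ┤), the effect of a path is the sum of its labels modulo 2, an "alternate path" may be any walk, and critical edges are never removed. The example graph is hypothetical.

```python
from collections import deque

def reaches(edges, src, dst, parity):
    """Is there a nonempty walk src -> dst whose labels sum to `parity` mod 2?"""
    seen, queue = set(), deque([(src, 0)])
    while queue:
        node, par = queue.popleft()
        for (u, v, x) in edges:
            if u == node:
                state = (v, par ^ x)
                if state == (dst, parity):
                    return True
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

def greedy_btr(edges, critical):
    """Drop each non-critical edge that an alternate walk of equal parity replaces."""
    kept = list(edges)
    for e in edges:
        if e in critical:
            continue                      # critical edges must stay
        rest = [k for k in kept if k != e]
        if reaches(rest, e[0], e[1], e[2]):
            kept = rest                   # yes, alternate path: remove the edge
    return kept

edges = [("A", "B", 0), ("B", "C", 1), ("A", "C", 1), ("C", "D", 0), ("A", "D", 1)]
reduced = greedy_btr(edges, critical={("A", "D", 1)})
```

The edge A ┤ C is removed because A → B ┤ C has the same net effect, while A ┤ D stays because it is critical even though an alternate path exists. The result of a greedy pass depends on the order in which edges are examined and is not guaranteed to be the sparsest possible graph.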

Some biologists did look at very simplified or somewhat different versions of BTR, e.g.:
• A. Wagner, Estimating Coarse Gene Network Structure from Large-Scale Gene Perturbation Data, Genome Research, 12, pp. 309-315, 2002
  • too special (reachability only), no efficient algorithms reported
• T. Chen, V. Filkov and S. Skiena, Identifying Gene Regulatory Networks from Experimental Data, Third Annual International Conference on Computational Molecular Biology, pp. 94-103, 1999
  • “excess edge deletion” problem, biologically too restrictive version
See the following excellent surveys for more comprehensive information about biological network inference and modeling:
• V. Filkov, Identifying Gene Regulatory Networks from Gene Expression Data, in Handbook of Computational Molecular Biology (edited by S. Aluru), Chapman & Hall/CRC Press, 2005
• H. de Jong, Modelling and Simulation of Genetic Regulatory Systems: A Literature Review, Journal of Computational Biology, Volume 9, Number 1, pp. 67-103, 2002
