CS 774B Topics in AI Machine Learning: Theory and Practice
Learning Genetic Regulation Networks from mRNA Expression Data
reviewed by PhilHyoun Lee, BioInformatics System Laboratory, Department of BioSystems, KAIST, KOREA
Outline
• Introduction
• Overview of Previous Approaches
  • Clustering
  • Boolean Network
  • Differential Equation
• Bayesian Network Approaches
  • Bayesian network
  • Friedman et al. 2000: first Bayesian network modeling
  • Pe’er et al. 2001: inferring causality
  • Hartemink et al. 2002: combining location information
• Summary
Part 1. Introduction
• Genetic Regulatory Network
• mRNA Gene Expression Data
A central goal of biology is to understand the regulation of protein synthesis and its reactions to external and internal signals.
[Figure: muscle cell, neuron cell, blood cell]
Genetic Regulatory Network
The set of mutually activating and repressing genes and gene products, together with their interactions.
[Figure labels: Extracellular space, Ligands, Receptors, Ion Channels, Electrophysiology, Cytoplasm, Intracellular Signaling, Translation + processing, Nucleus, Transcription Factors, cis sites, mRNA, Genetic Regulatory Network]
Gene Expression
The overall process by which genetic information flows from genes to proteins.
[Figure labels: mRNA data from cytoplasm, genetic regulatory junctions]
mRNA Expression Data Format
• From cDNA microarray: N X P matrix
• 0 < ratio < +Inf
• -Inf < log2(ratio) < +Inf
• where
  • log2(ratio) > 0: increase
  • log2(ratio) < 0: decrease
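The log2 transform above can be sketched in a few lines. This is only an illustration: the intensity values and the function names `log2_ratios` and `direction` are hypothetical, not part of the original slides.

```python
import math

def log2_ratios(sample, reference):
    """Return log2(sample / reference) for paired, strictly positive intensities."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

def direction(log_ratio):
    """Interpret the sign of a log2 ratio as on the slide."""
    if log_ratio > 0:
        return "increase"
    if log_ratio < 0:
        return "decrease"
    return "no change"

# Hypothetical intensities for three genes (sample channel vs. reference channel).
ratios = log2_ratios([200.0, 50.0, 100.0], [100.0, 100.0, 100.0])
labels = [direction(v) for v in ratios]
```

The transform makes a two-fold increase and a two-fold decrease symmetric around zero (+1 and -1), which the raw ratio does not.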
Problem Definition
Difficulties in reconstructing a genetic regulatory network from microarray data:
1. mRNA expression is only a partial picture
2. the number of samples is much smaller than the number of genes
3. high noise
[Figure: microarray data mapped to a genetic regulation network over Gene 1 to Gene 6]
Part 2. Previous Approaches
I. Clustering
II. Boolean Network
III. Differential Equation
I. Clustering
• Grouping genes with similar patterns of expression
  • Genes with a common role are clustered together
  • Regulation and interaction patterns are inferred
  • The function of uncharacterized genes is guessed
• Similarity measure: standard correlation coefficient, ...
• Method: hierarchical clustering, K-means, SOM, ...
• Cannot reveal the inner interaction structure!
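As a minimal sketch of the similarity measure, the code below computes the standard (Pearson) correlation coefficient between expression profiles and picks the most similar pair, which is the first merge an agglomerative hierarchical clustering would make. The profiles and the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Standard correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def most_similar_pair(profiles):
    """Return (correlation, gene_a, gene_b) for the most correlated pair."""
    names = sorted(profiles)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(profiles[a], profiles[b])
            if best is None or r > best[0]:
                best = (r, a, b)
    return best

# Hypothetical expression profiles over four experiments.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],   # same shape as g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # opposite trend
}
best = most_similar_pair(profiles)
```

Note that a high correlation only says the two profiles move together; it gives no direction or mechanism, which is the limitation the slide points out.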
II. Boolean Network
• A Boolean network is a graph G(V, F)
  • V is a set of nodes (genes) x1, x2, …, xn
  • F is a list of Boolean functions f(x1, x2, …, xn)
• Gene expression data is quantized to 1 (active) and 0 (inactive)

          Ex1  Ex2  Ex3  Ex4  Ex5  Ex6
  Gene 1   1    1    0    0    0    0
  Gene 2   1    1    1    0    0    0
  Gene 3   0    1    1    1    0    0

  State sequence (Gene 1, Gene 2, Gene 3): 110 → 111 → 011 → 001 → 000
  Example function: X2(t+1) = X1(t) ∧ X2(t)
• Trajectory: a series of state transitions
• Attractor: a set of states that repeats itself in a fixed sequence
• Semantics similar to biological phenomena such as the cell cycle and differentiation
• But the assumptions are too unrealistic!
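The trajectory and attractor notions can be sketched with a three-gene synchronous Boolean network. Only the rule for gene 2 (X2 = X1 ∧ X2) comes from the slide; the rules for genes 1 and 3 are assumptions chosen so that the simulation reproduces the state sequence in the table.

```python
def step(state):
    """Apply all Boolean update functions synchronously to one state."""
    x1, x2, x3 = state
    return (
        int(x1 and not x3),  # assumed rule for gene 1 (not given on the slide)
        int(x1 and x2),      # rule from the slide: X2 = X1 AND X2
        x2,                  # assumed rule for gene 3 (not given on the slide)
    )

def trajectory(state):
    """Follow state transitions until a state repeats.

    Returns the visited states and the attractor (the repeating tail).
    """
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen, seen[seen.index(state):]

path, attractor = trajectory((1, 1, 0))
```

Starting from 110, `path` visits 110, 111, 011, 001, 000, and the attractor is the single fixed state 000.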
III. Differential Equation
Dynamic System for Gene Expression
  n     the number of genes in the genome
  r     mRNA concentrations (n-dimensional vector)
  p     protein concentrations (n-dimensional vector)
  f(p)  transcription function
  L     translational constants (n X n non-degenerate diagonal matrix)
  V     degradation rates of mRNAs (n X n non-degenerate diagonal matrix)
  U     degradation rates of proteins (n X n non-degenerate diagonal matrix)
• Change in mRNA concentration: dr/dt = f(p) - V r
• Change in protein concentration: dp/dt = L r - U p
• Linear Transcription Model
  • Assume the transcription function f(p) is a linear function of p: f(p) = C p
  • Let x = (r, p)^T and let M be a 2n X 2n transition matrix; gene expression can then be modeled as dx/dt = M x, with M = [[-V, C], [L, -U]]
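A rough numerical sketch of the linear transcription model, assuming the dynamics dr/dt = C p - V r and dp/dt = L r - U p, so that dx/dt = M x with M = [[-V, C], [L, -U]]. The one-gene parameter values, the forward-Euler scheme, and the helper names are illustrative assumptions.

```python
def build_m(V, C, L, U):
    """Assemble the 2n x 2n block matrix M = [[-V, C], [L, -U]] from n x n blocks."""
    n = len(V)
    top = [[-V[i][j] for j in range(n)] + list(C[i]) for i in range(n)]
    bottom = [list(L[i]) + [-U[i][j] for j in range(n)] for i in range(n)]
    return top + bottom

def euler(M, x0, dt, steps):
    """Integrate dx/dt = M x with fixed-step forward Euler."""
    x = list(x0)
    for _ in range(steps):
        dx = [sum(m * xj for m, xj in zip(row, x)) for row in M]
        x = [xi + dt * di for xi, di in zip(x, dx)]
    return x

# Hypothetical one-gene system (n = 1): all rates are made-up values.
M = build_m(V=[[1.0]], C=[[0.5]], L=[[1.0]], U=[[1.0]])
x_final = euler(M, x0=[1.0, 0.0], dt=0.01, steps=2000)  # x = (r, p)
```

With these values both eigenvalues of M are negative, so the mRNA and protein concentrations decay toward zero; a different C could instead produce growth or oscillation.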
Part 3. Bayesian Network
I. Bayesian Network
II. Friedman et al. 2000
III. Pe’er et al. 2001
IV. Hartemink et al. 2002
I. Bayesian Network - Definition
Probabilistic framework for robust inference of interactions in the presence of noise
• G: a directed acyclic graph structure
• Θ: a set of parameters for the conditional distribution of each variable
[Figure: network over Gene A, Gene B, Gene C, Gene D, Gene E]
P(A, B, C, D, E) = ∏ P(Xi | Parent(Xi)) = P(A) P(B) P(C|A,B) P(D|B) P(E|D)
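The factorization P(A) P(B) P(C|A,B) P(D|B) P(E|D) can be checked with a small sketch over binary variables. The parent sets follow the formula on the slide; the conditional probability values are made up for illustration.

```python
from itertools import product

# Parent sets read off the factorization P(A) P(B) P(C|A,B) P(D|B) P(E|D).
PARENTS = {"A": (), "B": (), "C": ("A", "B"), "D": ("B",), "E": ("D",)}

# P(X = 1 | parent values); all numbers are hypothetical.
P_ONE = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "D": {(0,): 0.2, (1,): 0.7},
    "E": {(0,): 0.1, (1,): 0.8},
}

def joint(assignment):
    """Joint probability of a full 0/1 assignment as a product of local terms."""
    prob = 1.0
    for var, parents in PARENTS.items():
        key = tuple(assignment[p] for p in parents)
        p_one = P_ONE[var][key]
        prob *= p_one if assignment[var] == 1 else 1.0 - p_one
    return prob

# Summing over all 2^5 assignments should give 1.
total = sum(joint(dict(zip(PARENTS, values))) for values in product((0, 1), repeat=5))
```

The five local tables need 1 + 1 + 4 + 2 + 2 = 10 numbers, against 31 for an unrestricted joint distribution over five binary variables.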
I. Bayesian Network - Process
Data: expression data (N X P matrix) → Structure learning! → network over A, B, C
Independence: P(A,B,C) = P(A) P(B) P(C|A)
I. Bayesian Network - Structure Learning
[Figure: six candidate graphs over A, B, C with scores S(G:D) = 79, 56, 86, 76, 64, 56]
• Heuristic search approaches
  • greedy hill-climbing, simulated annealing, etc.
• Model selection
  • select a good model
• Selective model averaging
  • select a number of good models and pretend these models are exhaustive
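A minimal sketch of greedy hill-climbing over DAGs, using single-edge additions, deletions, and reversals as moves. The score function here is a toy stand-in (agreement with a hypothetical target graph), not a data-driven S(G:D); in practice one would plug in a score computed from the expression data.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (peel off root nodes)."""
    remaining = set(nodes)
    while remaining:
        roots = {n for n in remaining
                 if not any(a in remaining and b == n for a, b in edges)}
        if not roots:
            return False
        remaining -= roots
    return True

def neighbors(nodes, edges):
    """Yield every DAG one edge addition, deletion, or reversal away."""
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            yield edges - {(a, b)}                      # deletion
            candidate = (edges - {(a, b)}) | {(b, a)}   # reversal
        else:
            candidate = edges | {(a, b)}                # addition
        if is_acyclic(nodes, candidate):
            yield candidate

def hill_climb(nodes, score, start=frozenset()):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = frozenset(start)
    while True:
        best = max(neighbors(nodes, current), key=score)
        if score(best) <= score(current):
            return current
        current = frozenset(best)

# Toy score (assumption): reward edges of a hypothetical target, penalize others.
TARGET = {("A", "C"), ("B", "C")}

def toy_score(edges):
    return len(edges & TARGET) - len(edges - TARGET)

result = hill_climb(["A", "B", "C"], toy_score)
```

Hill-climbing stops at the first local optimum, which is why the slide also lists simulated annealing and model averaging as alternatives.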
I. Bayesian Network - Structure Learning
• Independence: X and Z are conditionally independent given Y
[Figure: graphs over X, Y, Z and the partially directed graph (PDAG) that represents them]
• Equivalence class: two graphs are equivalent iff they have the same v-structures and the same graph when arc direction is ignored
• V-structure: an ordered tuple (X, Y, Z) such that there is an arc from X to Y and from Z to Y, but no arc between X and Z
• We can’t distinguish between equivalent graphs
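The equivalence test can be sketched directly from the definition: two DAGs are equivalent when they share the same undirected skeleton and the same v-structures. The function names are illustrative; graphs are given as sets of (parent, child) pairs.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: each arc becomes an unordered pair."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All (X, Y, Z) with X -> Y <- Z and no arc between X and Z."""
    skel = skeleton(edges)
    found = set()
    for y in {child for _, child in edges}:
        parents = sorted(a for a, b in edges if b == y)
        for x, z in combinations(parents, 2):
            if frozenset((x, z)) not in skel:
                found.add((x, y, z))
    return found

def equivalent(g1, g2):
    """True if both graphs have the same skeleton and the same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

chain = {("X", "Y"), ("Y", "Z")}          # X -> Y -> Z
reverse_chain = {("Z", "Y"), ("Y", "X")}  # X <- Y <- Z
fork = {("Y", "X"), ("Y", "Z")}           # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}       # X -> Y <- Z
```

The chain, the reversed chain, and the fork all encode "X and Z are independent given Y" and fall into one class; only the collider stands apart.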
I. Bayesian Network - Structure Learning
Get the score for each network with respect to the training data
S(G:D) = log p(D, S^h) = log p(S^h) + log p(D|S^h)   (log prior + log likelihood)
From the chain rule of probability, the likelihood is log p(D|S^h) = ∑ log p(xi | pa(xi), S^h)
• Assuming equal priors on structures:
  • I. The model with the highest log likelihood is the model that best predicts the data D
  • II. The score can use local criteria ∑ Slocal(Xi, Pa(Xi), D), and is the same for all members of an equivalence class
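A sketch of a decomposable score: the log likelihood with maximum-likelihood parameters, computed as a sum of local terms, one per variable. The tiny data set and the function names are hypothetical, and the structure prior is taken as uniform so it is omitted.

```python
import math
from collections import Counter

def local_score(data, child, parents):
    """Log-likelihood contribution of one variable given its parents,
    using maximum-likelihood estimates of the conditional probabilities."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(n * math.log(n / parent_counts[pa]) for (pa, _), n in joint.items())

def score(data, structure):
    """Sum of local scores; structure maps each variable to its parent tuple."""
    return sum(local_score(data, child, parents)
               for child, parents in structure.items())

# Hypothetical discretized data in which A and B always agree.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
empty = {"A": (), "B": ()}
a_to_b = {"A": (), "B": ("A",)}
b_to_a = {"A": ("B",), "B": ()}
```

Here A → B and B → A score the same, as expected for members of one equivalence class. The raw likelihood never decreases when edges are added, so practical scores combine it with a prior or a complexity penalty.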
II. Using Bayesian Networks to Analyze Expression Data by Friedman et al. 2000.
1. S. cerevisiae cell-cycle data by Spellman (1998)
2. Discretization
3. Bayesian network structure learning
4. Feature estimation: bootstrap method
5. Feature analysis
II. Using Bayesian Networks to Analyze Expression Data by Friedman et al. 2000.
• Data: 800 X 76; expression varied over the different cell-cycle stages (mainly 250 cell-cycle-regulated genes, trans-acting factors)
• Discretization: 3 categories: -1, 0, and 1 (threshold value of 0.5 in log2 scale)
• Structure learning algorithm: Sparse Candidate Algorithm (identify a small number of candidate parents for each gene based on local statistics such as correlation)
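The three-level discretization can be sketched as below. Treating values exactly at the threshold as 0 is an assumption, as is the function name; the threshold of 0.5 on the log2 scale is from the slide.

```python
def discretize(log_ratio, threshold=0.5):
    """Map a log2 expression ratio to -1 (under-expressed), 0 (unchanged),
    or 1 (over-expressed)."""
    if log_ratio > threshold:
        return 1
    if log_ratio < -threshold:
        return -1
    return 0

# Hypothetical log2 ratios for one gene across four experiments.
levels = [discretize(v) for v in [1.2, 0.3, -0.7, 0.5]]
```

Discretization loses information but lets the network use simple multinomial conditional distributions, which is why better discretization appears under future work.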
II. Using Bayesian Networks to Analyze Expression Data by Friedman et al. 2000.
• Feature estimation: extract useful features
  • Markov relation: holds iff there is either an edge between the two variables, or both are parents of another variable => the two genes are related in some joint biological interaction or process
  • Order relation: A is an ancestor of B in all the equivalent Bayesian networks learned => transcription of one gene is a direct cause of the transcription of another gene
  • Dominant gene: potential causal source of the cell-cycle process
• 200-fold bootstrap approach: generate “perturbed” versions of the original data set and learn from them
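A sketch of bootstrap feature confidence: resample the data with replacement, learn a network from each resample, and report the fraction of networks in which each feature appears. The `learn` argument stands for any structure learner and is an assumption; the stand-in used here ignores its input, so it only demonstrates the bookkeeping, not real learning.

```python
import random

def markov_pairs(edges):
    """Markov relations of a DAG: pairs joined by an edge or sharing a child."""
    pairs = {frozenset(e) for e in edges}
    for a, child in edges:
        for b, other in edges:
            if other == child and a != b:
                pairs.add(frozenset((a, b)))
    return pairs

def resample(data, rng):
    """Draw a bootstrap sample of the same size, with replacement."""
    return [rng.choice(data) for _ in data]

def bootstrap_confidence(data, learn, feature, rounds=200, seed=0):
    """Fraction of bootstrap networks in which each feature is present."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(rounds):
        for f in feature(learn(resample(data, rng))):
            counts[f] = counts.get(f, 0) + 1
    return {f: c / rounds for f, c in counts.items()}

# Stand-in learner (assumption): always returns the same hypothetical DAG.
def fixed_learner(sample):
    return {("A", "C"), ("B", "C")}

confidence = bootstrap_confidence([1, 2, 3], fixed_learner, markov_pairs, rounds=10)
```

With a real learner, features that survive most resamples are the ones worth reporting; features that appear only occasionally are likely artifacts of noise in the small sample.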
II. Using Bayesian Networks to Analyze Expression Data by Friedman et al. 2000.
• Summary
  • The first Bayesian network model
  • Focus is on extracting features, not the whole network structure
• Future work
  • Deal with continuous data
  • Incorporate biological knowledge
  • Apply Dynamic Bayesian Networks to temporal expression data
  • Improve the discretization method
  • Discover causal patterns
III. Inferring Subnetworks from Perturbed Expression Profiles by Pe’er et al. 2001.
1. S. cerevisiae perturbation data by Hughes et al. (2000)
2. Discretization
3. Bayesian network structure learning
4. Feature estimation: bootstrap method
5. Subnetwork analysis
III. Inferring Subnetworks from Perturbed Expression Profiles by Pe’er et al. 2001.
• Data: 565 X 300 (including mutated genes and genes with significant change in at least 4 profiles)
• Ideal intervention: the gene is assigned a specific value => gene deletion, over-expression mutants
  [Figure: two graphs over A and B, labeled “no effect on gene A” and “effect on gene B”]
• Other perturbations are modeled as indicator variables, constrained to be roots => temperature-sensitive mutations, kinetic mutations, and external treatments
III. Inferring Subnetworks from Perturbed Expression Profiles by Pe’er et al. 2001.
• New feature: activation, inhibition
  • Let X be one of Pa(Y) and U = Pa(Y) - {X}
  • X activates Y if P(Y=1 | X, U) increases when X increases and U is fixed
  • X inhibits Y if P(Y=-1 | X, U) increases when X increases and U is fixed
• Sub-network construction
  • select a threshold ts (= 0.75) of significant confidence
  • find maximal connected components
  • each component (more than 3 variables) is a seed
  • expand each seed with variables by a confidence t’ (t’ < ts, t’ = 0.5)
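The sub-network construction steps can be sketched as follows. The thresholds 0.75 and 0.5 and the "more than 3 variables" rule come from the slide; the one-step expansion rule (add any variable linked to a seed member with confidence at least t’), the gene names, and the confidence values are assumptions for illustration.

```python
def components(edges):
    """Connected components of an undirected graph given as pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, result = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adjacency[node] - comp)
        seen |= comp
        result.append(comp)
    return result

def subnetworks(confidence, ts=0.75, t_expand=0.5, min_size=4):
    """Seed with components of high-confidence features, then expand each seed."""
    strong = [pair for pair, c in confidence.items() if c >= ts]
    seeds = [c for c in components(strong) if len(c) >= min_size]
    expanded = []
    for seed in seeds:
        grown = set(seed)
        for (a, b), c in confidence.items():
            if c >= t_expand and (a in seed or b in seed):
                grown |= {a, b}
        expanded.append(grown)
    return expanded

# Hypothetical pairwise feature confidences from a bootstrap run.
confidence = {
    ("g1", "g2"): 0.90, ("g2", "g3"): 0.80, ("g3", "g4"): 0.80,
    ("g4", "g5"): 0.60, ("g6", "g7"): 0.95, ("g7", "g8"): 0.40,
}
found = subnetworks(confidence)
```

In this example g1 to g4 form the only seed, g5 joins it through a 0.6-confidence link, and the strongly linked pair g6, g7 is dropped because its component is too small.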
III. Inferring Subnetworks from Perturbed Expression Profiles by Pe’er et al. 2001.
• Summary
  • Can identify causal relationships from data
  • New types of features
  • Focus is on extracting sub-networks, not the whole network
• Future work
  • Incorporate biological knowledge
  • Identify latent factors that interact with several observed genes
IV. Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models by Hartemink et al. 2002.
1. S. cerevisiae mating data by Hartemink et al. (2002)
2. Discretization
3. Structure learning with prior knowledge
4. Feature estimation: model averaging
5. Comparison with the unconstrained model
IV. Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models by Hartemink et al. 2002.
• Expression data: 32 * 320, under a diversity of experimental conditions => pheromone response signaling or mating-related genes; discretized into 4 stages
• Location data: find the upstream regions where a specific transcription factor binds, using a chromatin immunoprecipitation assay
  [Figure labels: STE12, FUS1, MCM1]
• Incorporate genomic location data to guide the model structure: a non-uniform prior over structures that gives zero weight to models with required edges absent
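A minimal sketch of such a structure prior: any model missing a required edge gets zero weight (log prior of minus infinity), and all other models are treated alike. The optional `missing_weight` parameter illustrates the relaxation to a small positive weight mentioned later as future work. The specific edges below are hypothetical examples built from the gene names in the figure.

```python
import math

def log_structure_prior(edges, required, missing_weight=0.0):
    """Unnormalized log prior over structures.

    Returns 0 if every required edge is present; otherwise log(missing_weight),
    which is -inf when missing_weight is 0 (the hard constraint).
    """
    if required <= edges:
        return 0.0
    return math.log(missing_weight) if missing_weight > 0 else -math.inf

# Hypothetical edge required by location data (illustrative direction).
required = {("STE12", "FUS1")}
with_edge = {("STE12", "FUS1"), ("MCM1", "STE12")}
without_edge = {("MCM1", "STE12")}

hard = log_structure_prior(without_edge, required)
soft = log_structure_prior(without_edge, required, missing_weight=0.01)
```

Adding this log prior to the log likelihood gives the total score, so under the hard constraint a model without the required edge can never be selected, however well it fits the expression data.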
IV. Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models by Hartemink et al. 2002.
• Model averaging
  • gather the 500 highest-scoring models
  • compute the probability of each edge using a weighted-average approximation
• Draw the result network
  • edges are included if their posterior probability > 0.5
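The averaging step can be sketched as below, assuming each model is weighted in proportion to the exponential of its log score and that the retained top models are treated as if they were exhaustive. The three models and their scores are made up; the real procedure uses the 500 highest-scoring models.

```python
import math

def edge_posteriors(models):
    """Approximate edge probabilities from (log_score, edge_set) pairs."""
    top = max(s for s, _ in models)
    weights = [math.exp(s - top) for s, _ in models]  # subtract max for stability
    total = sum(weights)
    posterior = {}
    for w, (_, edges) in zip(weights, models):
        for e in edges:
            posterior[e] = posterior.get(e, 0.0) + w / total
    return posterior

def consensus(models, cutoff=0.5):
    """Edges whose approximate posterior probability exceeds the cutoff."""
    return {e for e, p in edge_posteriors(models).items() if p > cutoff}

# Hypothetical top-scoring models with their log scores.
models = [
    (-10.0, {("A", "B"), ("B", "C")}),
    (-10.0, {("A", "B")}),
    (-12.0, {("A", "C")}),
]
network = consensus(models)
```

Here A → B appears in both high-scoring models and is kept, while B → C (about 0.47) and A → C (about 0.06) fall below the 0.5 cutoff.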
IV. Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models by Hartemink et al. 2002.
• Summary
  • model reconstruction is not possible from expression data alone
  • edges indicate a statistical dependence between the transcript levels of genes, but not necessarily the form or presence of a physical dependence => location data supplies evidence of a direct physical edge
• Future work
  • location data could also be noisy => relax the model prior from 0 to a small but positive weight
Part 4. Summary
• Bayesian networks are suitable for genetic network reconstruction
  • Can deal with stochastic nature
  • Ideal for sparse domains (useful for locally interacting components)
  • Can handle noisy data
  • Missing data
  • Hidden variables (protein level, other molecules)
  • Inference reasoning
• More research needed
  • To solve the dimensionality problem => incorporation of more biological information
  • To model feedback processes => Dynamic Bayesian networks
Reference
• General
  • A bibliography on learning causal networks of gene interactions, by Florian Markowetz, 2003.
  • Modeling and Simulation of Genetic Regulatory Systems: A Literature Review, by Hidde De Jong, 2002.
  • Genetic Network Analysis - From Bench to Computers and Back, by Zoltan Szallasi, 2001.
  • Modeling Transcriptional Control in Gene Networks - Methods, Recent Results, and Future Directions, by Paul Smolen et al., 2000.
• Boolean Network
  • Identification of Genetic Networks from a Small Number of Gene Expression Patterns under the Boolean Network Model, by Tatsuya Akutsu et al., 1999.
  • REVEAL, a General Reverse Engineering Algorithm for Inference of Genetic Network Architectures, by Shoudan Liang, 1998.
Reference
• Differential Equation
  • Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data Using Differential Equations, by Michiel de Hoon et al., 2002.
  • Stability of Genetic Regulatory Networks with Time Delay, by Luonan Chen et al., 2002.
  • Modeling Gene Expression with Differential Equations, by Ting Chen et al., 1999.
• Bayesian Network
  • Estimating Gene Networks from Gene Expression Data by Combining Bayesian Network Model with Promoter Element Detection, by Yoshinori Tamada et al., 2003.
  • Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models, by Hartemink et al., 2002.
  • Inferring Subnetworks from Perturbed Expression Profiles, by Pe’er et al., 2001.
  • Using Bayesian Networks to Analyze Expression Data, by Friedman et al., 2000.
Reference
• Dynamic Bayesian Network
  • Sensitivity and Specificity of Inferring Genetic Regulatory Interactions from Microarray Experiments with Dynamic Bayesian Networks, by Dirk Husmeier et al., 2003.
  • Modeling Regulatory Pathways in E. coli from Time Series Expression Profiles, by Irene M. Ong et al., 2002.
  • Evaluating Functional Network Inference Using Simulations of Complex Biological Systems, by V. Anne Smith et al., 2002.
  • Modeling Gene Expression Data Using Dynamic Bayesian Networks, by Kevin Murphy et al., 2000.
• Etc
  • Inference of Gene Regulatory Model by Genetic Algorithm, by Shin Ando, 2001.
  • Gene Network Reconstruction Using a Distributed GA with a Backprop Local Search, by Mark Cumiskey et al., 2002.