Beyond Co-expression: Gene Network Inference

Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University http:/genetics.med.harvard.edu/~patrik

Beyond Co-expression • Clustering approaches rely on co-expression of genes under different conditions • Assumes co-expression is caused by co-regulation • We would like to do better than that: • Causal inference • What is regulating what?

Gene Network Inference

Overview • Modeling Issues: • Level of biochemical detail • Boolean or continuous? • Deterministic or stochastic? • Spatial or non-spatial? • Data Requirements • Linear Models • Nonlinear models • Conclusions

Level of Biochemical Detail • Detailed models require lots of data! • Highly detailed biochemical models are only feasible for very small systems which are extensively studied • Example: Arkin et al. (1998), Genetics 149(4):1633-48 lysis-lysogeny switch in Lambda: 5 genes, 67 parameters based on 50 years of research, stochastic simulation required supercomputer

Example: Lysis-Lysogeny Arkin et al. (1998), Genetics 149(4):1633-48

Level of Biochemical Detail • In-depth biochemical simulation of e.g. a whole cell is infeasible (so far) • Less detailed network models are useful when data is scarce and/or network structure is unknown • Once network structure has been determined, we can refine the model

Boolean or Continuous? • [Figure: example Boolean gate, C = A AND B] • Boolean Networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states. • Allows analysis at the network level • Provides useful insights into network dynamics • Algorithms exist for network inference from binary data

Boolean or Continuous? • Boolean abstraction is a poor fit to real data • Cannot model important concepts: • amplification of a signal • subtraction and addition of signals • compensating for smoothly varying environmental parameters (e.g. temperature, nutrients) • varying dynamical behavior (e.g. cell cycle period) • Feedback control: negative feedback is used to stabilize expression, but causes oscillation in a Boolean model
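A minimal Python sketch of a Boolean network, assuming a synchronous update scheme (every gene is updated at once from the previous state). It shows the AND gate C = A AND B, and a single self-repressing gene that flips state at every step instead of settling to a stable level. The function names and rule dictionaries are illustrative, not taken from the talk.

```python
def step(state, rules):
    """Synchronously update every gene from the previous state."""
    return {gene: rule(state) for gene, rule in rules.items()}

def trajectory(state, rules, n_steps):
    """Return the list of states visited, including the initial one."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states

# AND gate: C is ON only when both A and B are ON (A and B hold their values).
and_rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["B"],
    "C": lambda s: s["A"] and s["B"],
}

# Negative feedback: gene X represses itself.
feedback_rules = {"X": lambda s: not s["X"]}

and_result = step({"A": True, "B": True, "C": False}, and_rules)
x_values = [s["X"] for s in trajectory({"X": False}, feedback_rules, 4)]
print(and_result["C"])
print(x_values)
```

In a continuous model the same negative feedback loop can relax to an intermediate steady level, which the two-state abstraction cannot represent.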

Deterministic or Stochastic? • Use of concentrations assumes individual molecules can be ignored • Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda) • Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5) • Significantly increases model complexity
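To illustrate what modeling molecule counts involves, here is a minimal sketch of a stochastic simulation for a single mRNA species with constant production and first-order decay, in the style of Gillespie's direct method. The rate constants are made-up illustrative values and the model is far simpler than the lambda switch of Arkin et al.

```python
import random

def gillespie_birth_death(k_make, k_decay, n0, t_end, seed=0):
    """Simulate counts for: 0 -> mRNA (rate k_make), mRNA -> 0 (rate k_decay * n)."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        rate_make = k_make
        rate_decay = k_decay * n
        total = rate_make + rate_decay
        # Waiting time to the next reaction is exponential with the total rate.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # Pick which reaction fires, in proportion to its rate.
        if rng.random() * total < rate_make:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts

times, counts = gillespie_birth_death(k_make=2.0, k_decay=1.0, n0=0, t_end=20.0)
```

Every reaction event is simulated individually, which is why the cost grows quickly with the number of species and reactions.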

Deterministic or Stochastic? • Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects. • Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell, Holstege et al. (1998), Cell 95:717-728. • Human: 95% of the transcriptome is expressed at <5 copies/cell, Velculescu et al. (1997), Cell 88:243-251.

Spatial or Non-Spatial • Spatiality introduces additional complexity: • intercellular interactions • spatial differentiation • cell compartments • cell types • Spatial patterns also provide more data e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152: 429-454. • Few (no?) large-scale spatial gene expression data sets available so far.


Overview • Modeling Issues • Data Requirements: • Lower bounds from information theory • Effect of limited connectivity • Comparison with clustering • Variety of data points needed • Linear Models • Nonlinear models • Conclusions

Lower Bounds from Information Theory • How many bits of information are needed just to specify the connection pattern of a network? • N^2 possible connections between N nodes → N^2 bits needed to specify which connections are present or absent • O(N) bits of information per “data point” → O(N) data points needed

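Effect of Limited Connectivity • Assume only K inputs per gene (on average) → NK connections out of N^2 possible: C(N^2, NK) possible connection patterns, where C(n, k) is the binomial coefficient • Number of bits needed to fully specify the connection pattern: log2 C(N^2, NK), which is O(NK log(N/K)) • At O(N) bits per data point → O(K log(N/K)) data points needed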
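The counting argument can be written out as follows. This is a reconstruction, assuming the number of connection patterns is the binomial coefficient for choosing NK connections out of N^2, and using the standard bounds (n/k)^k ≤ C(n, k) ≤ (en/k)^k; logarithms are base 2.

```latex
\[
\#\text{patterns} = \binom{N^2}{NK},
\qquad
NK \log_2\frac{N}{K}
\;\le\;
\log_2 \binom{N^2}{NK}
\;\le\;
NK \log_2\frac{N}{K} + NK \log_2 e .
\]
\[
\frac{O\!\left(NK \log (N/K)\right)\ \text{bits}}{O(N)\ \text{bits per data point}}
= O\!\left(K \log (N/K)\right)\ \text{data points}.
\]
```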

Comparison with clustering • Use pairwise correlation comparisons as a stand-in for clustering • As the number of genes increases, the number of false positives will increase as well → need to use a more stringent correlation test • If we want to use the same correlation cutoff value r, we need to increase the number of data points as N increases → O(log(N)) data points needed

Summary • Data points needed: • Fully connected: N (thousands) • Connectivity K: K log(N/K) (hundreds?) • Clustering: log(N) (tens) • Additional constraints reduce data requirements: • limited connectivity • choice of regulatory functions • Network inference is feasible, but does require much more data than clustering
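A rough numerical illustration of the three scaling laws. The values N = 6000 genes (roughly yeast-sized) and K = 10 inputs per gene are assumptions chosen for illustration, and the constants hidden by the O() notation are dropped, so only the orders of magnitude are meaningful.

```python
import math

def data_points_needed(n_genes, k_inputs):
    """Order-of-magnitude estimates (constants dropped), using log base 2."""
    return {
        "fully connected": n_genes,
        "connectivity K": k_inputs * math.log2(n_genes / k_inputs),
        "clustering": math.log2(n_genes),
    }

estimates = data_points_needed(n_genes=6000, k_inputs=10)
for model, n_points in estimates.items():
    print(f"{model}: about {n_points:.0f}")
```

With these assumed values the estimates come out near 6000, 92 and 13, in line with the thousands / hundreds / tens ordering on the slide.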

Variety of Data Points Needed • To unravel regulation of a gene, need to sample many different combinations of its regulatory inputs (different environmental conditions and perturbations) • Time series data yields dynamics, but all data points are related • Steady-state data yields attractors, can sample state space more efficiently • Both types of data will be needed, and multiple data sets of each

Overview • Modeling Issues • Data Requirements • Linear Models: • Formulation • Underdetermined problem! • Solution 1: reduce N • Solution 2 • Nonlinear models • Conclusions

Linear Models • Basic model: weighted sum of inputs • Simple network representation: [Figure: genes g1 to g5 as nodes, connected by weighted edges such as w12, w23 and w55] • Only first-order approximation • Parameters of the model: weight matrix containing NxN interaction weights • “Fitting” the model: find the parameters wji, bi such that the model best fits the available data
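The slide's equation did not survive extraction. One common way to write a weighted-sum model, assumed here purely for illustration, is the discrete-time update x_i(t+1) = sum_j w_ji x_j(t) + b_i, where w_ji is the influence of gene j on gene i and b_i is a bias term. The 3-gene network below is hypothetical.

```python
def linear_step(x, w, b):
    """One update of the assumed model: x_i(t+1) = sum_j w[j][i] * x[j] + b[i]."""
    n = len(x)
    return [sum(w[j][i] * x[j] for j in range(n)) + b[i] for i in range(n)]

# Hypothetical network: gene 0 activates gene 1, gene 1 represses gene 2.
w = [
    [0.0, 0.5, 0.0],   # outputs of gene 0
    [0.0, 0.0, -1.0],  # outputs of gene 1
    [0.0, 0.0, 0.0],   # outputs of gene 2
]
b = [1.0, 0.0, 2.0]

x0 = [2.0, 0.0, 0.0]
x1 = linear_step(x0, w, b)
x2 = linear_step(x1, w, b)
print(x1, x2)
```

Fitting the model means choosing `w` and `b` so that updates like these reproduce the measured expression levels.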

Underdetermined problem! • Assumes fully connected network: need at least as many data points (arrays, conditions) as variables (genes)! • Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit • No single solution, rather infinite number of parameter settings that will all fit the data equally well
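A toy illustration of the underdetermined case, using made-up numbers: one target gene with two candidate regulators. With a single data point, many weight settings fit exactly; a second, different data point narrows them down.

```python
def predict(weights, inputs):
    """Weighted sum of inputs for one target gene (no bias term)."""
    return sum(w * x for w, x in zip(weights, inputs))

def fits(weights, data, tol=1e-9):
    """True if the weights reproduce every (inputs, observed) pair."""
    return all(abs(predict(weights, x) - y) <= tol for x, y in data)

candidates = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, -1.0]]

one_point = [([1.0, 1.0], 2.0)]
two_points = one_point + [([1.0, 0.0], 2.0)]

fit_one = [w for w in candidates if fits(w, one_point)]
fit_two = [w for w in candidates if fits(w, two_points)]
print(len(fit_one), fit_two)
```

All four candidates fit the single data point, while only `[2.0, 0.0]` survives once the second point is added.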

Solution 1: reduce N • Rather than trying to model all genes, we can reduce the dimensionality of the problem: • Network of clusters: construct a linear model based on the cluster centroids • rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55, 1-3:129-136. • yeast cell cycle (15-18 clusters): Mjolsness et al. (2000), Advances in Neural Information Processing Systems 12; van Someren et al. (2000), ISMB2000, 355-366. • Network of Principal Components: linear model between “characteristic modes” of the data, Holter et al. (2001), PNAS 98(4):1693-1698.

Solution 2: • Take advantage of additional information: • replicates • accuracy of measurements • smoothness of time series • … • Most likely, the network will still be poorly constrained. → Need a method to identify and extract those parts of the model that are well-determined and robust

What’s next? • Regulatory motifs: once we have identified the corresponding DNA binding proteins (transcription factors), we can start building the gene network from there • Integration with other data: • transcription factors • functional annotation • known interactions in the literature • protein-protein interactions • protein expression levels • genetic data • ...

Linking Regulatory Motifs to Expression Data

Introduction • [Figure: a TF bound to a DNA regulatory motif upstream of a gene] • Gene expression is regulated by Transcription Factors (TFs) that bind to specific regulatory motifs in the promoter region of the gene. Synonyms: regulatory element, regulatory sequence, promoter elements, promoter motifs, (TF) binding site, operator (in prokaryotes), … • Question: Do genes with similar expression patterns share regulatory motifs?

1: Systematic Determination of Genetic Network Architecture • Tavazoie et al., Nature Genetics 22, 281–285 (1999) • [Figure: normalized expression plotted against time point; axes labeled Time-point 1, Time-point 2 and Time-point 3]

Search for Motifs in Promoter Regions • 300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

Best Motif Found by AlignACE • Same seven promoter sequences as on the previous slide (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3) • Aligned sites:
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
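Once a set of aligned sites is available, it can be summarized as a per-position count matrix and a consensus sequence. The sketch below does only that summary step for the eight sites listed on the slide; it is not AlignACE itself, which finds the alignment by Gibbs sampling. The function names are illustrative.

```python
from collections import Counter

sites = [
    "AAAAGAGTCA", "AAATGACTCA", "AAGTGAGTCA", "AAAAGAGTCA",
    "GGATGAGTCA", "AAATGAGTCA", "GAATGAGTCA", "AAAAGAGTCA",
]

def count_matrix(aligned_sites):
    """Per-position base counts for equal-length aligned sites."""
    return [Counter(column) for column in zip(*aligned_sites)]

def consensus(aligned_sites):
    """Most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in count_matrix(aligned_sites))

matrix = count_matrix(sites)
motif = consensus(sites)
print(motif)
```

The last six positions (GAGTCA) are nearly invariant across the sites, while the first four are more variable.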

MCB • Cluster size: N=182 • ORFs within each MIPS functional category (# ORFs in category): • DNA synthesis and replication (82): 23 • Cell cycle control and mitosis (312): 30 • Recombination and DNA repair (84): 11 • Nuclear organization (720): 40 • [Figure: number of sites versus distance from ATG (b.p.)]

Systematic Determination of Genetic Network Architecture • Tavazoie et al., Nature Genetics 22, 281–285 (1999) • Most motifs found are highly selective for the cluster they were found in. • Can find many known binding sites for transcription factors. • Also finds many novel regulatory motifs, associated with specific functional categories. • Approach: 1) cluster 2) identify regulatory motifs in clustered genes 3) identify TFs that bind to those motifs → Gene regulation network

2: Regulatory Element Detection Using Correlation with Expression • What is the contribution of each regulatory motif (or the TF that binds to that motif) to the expression level of the genes containing the motif? • Given a set of known or putative regulatory motifs, identify all genes that contain the motif in their promoter region. • For a single expression experiment (e.g. single point in a time series), is the presence of the motif correlated with the expression level of the genes? • Perform multiple regression of (log) expression level on the presence/absence of the motifs. • Plot contribution of motif throughout time series. Bussemaker et al., Nature Genetics 27, 167 – 174 (2001)

Contribution of Motifs to Expression Levels • [Figure: three example panels] • Motif 1: its presence is correlated with the expression levels of the genes in which it appears • Motif 2: not correlated with the expression levels of the genes in which it appears • Motif 3: negatively correlated with the expression levels of the genes in which it appears

Linear Combination of Motif Contributions • Find the most highly correlated motif. • Determine its contribution Fi to expression level by linear regression. • Subtract its contribution from the expression levels. • Find the next highest correlated motif. • Repeat until there are no more significantly correlated motifs. • Repeat this entire analysis for each time point of a time series → weights Fi for the individual motifs will change throughout the time course.
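The iterative procedure above can be sketched for a single time point as follows. This is a simplified illustration, not the implementation of Bussemaker et al.: it uses a fixed cutoff on |r| (the hypothetical parameter `min_abs_r`) as a stand-in for a proper significance test, and the four-gene data set is made up.

```python
def mean(values):
    return sum(values) / len(values)

def covariance_sum(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def correlation(x, y):
    sxx, syy = covariance_sum(x, x), covariance_sum(y, y)
    if sxx == 0 or syy == 0:
        return 0.0
    return covariance_sum(x, y) / (sxx * syy) ** 0.5

def greedy_motif_fit(expression, motifs, min_abs_r=0.5):
    """Repeatedly pick the motif most correlated with the residual expression."""
    residual = [e - mean(expression) for e in expression]
    remaining = dict(motifs)
    contributions = {}
    while remaining:
        name = max(remaining, key=lambda m: abs(correlation(remaining[m], residual)))
        if abs(correlation(remaining[name], residual)) < min_abs_r:
            break
        presence = remaining.pop(name)
        # Least-squares slope of the residual on motif presence.
        f = covariance_sum(presence, residual) / covariance_sum(presence, presence)
        mp = mean(presence)
        residual = [r - f * (p - mp) for r, p in zip(residual, presence)]
        contributions[name] = f
    return contributions

# Made-up log expression for 4 genes and presence/absence of 3 motifs.
motifs = {"m1": [1, 1, 0, 0], "m2": [1, 0, 1, 0], "m3": [1, 0, 0, 1]}
expression = [1.0, 2.0, -1.0, 0.0]
contributions = greedy_motif_fit(expression, motifs)
print(contributions)
```

Running the same fit separately at each time point would give one value of each motif's contribution per time point.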

Time Courses of Regulatory Signals • [Figure: regulatory signals plotted against time (minutes)] • We can think of the time-varying contributions Fi of each motif as the Regulatory Signals of the transcription factors that bind to these motifs

Regulatory Element Detection Using Correlation with Expression • Bussemaker et al., Nature Genetics 27, 167–174 (2001) • Can be used with known regulatory motifs, sets of putative motifs, and even exhaustively on the set of all motifs up to a certain length (n=7). • Known motifs generally have high statistical significance. • Allows us to infer regulatory inputs of (possibly unknown) transcription factors. • Accounts for only 30% of total signal present in genome-wide expression patterns. • Purely linear model: no synergistic effects between TF’s, cooperative binding, etc.

3: Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements • Pilpel et al., Nature Genetics 29, 153–159 (2001) • Most transcription factors are thought to work in concert with other TFs. → Synergistic effects • Clustering: • a motif may occur in more than one cluster, because it may give rise to different expression patterns depending on its interaction partners. • several motifs may occur in the same cluster. • Correlation with expression pattern: • by itself, a motif may not show a clear expression pattern. • contributions of multiple motifs may not be simply additive.

Synergy between Mcm1 and SFF in Cell Cycle Data Set • [Figure: expression level versus time for three gene sets: SFF but not Mcm1 (EC = 0.05), Mcm1 but not SFF (EC = 0.05), SFF and Mcm1 (EC = 0.23)] • Mcm1 and SFF were not detected in Tavazoie et al. • Yet TFs that bind these motifs are known to interact in control of G2 genes (Nature. 2000 406:90-4). • Bussemaker et al. found that these motifs are antagonistic.

Expression Coherence and Synergy • Expression Coherence (EC) score indicates how tightly clustered the expression profiles of a set of genes are. • For every combination of N=2,3 motifs: 1) Calculate the expression coherence score of the genes that have all N motifs 2) Calculate the expression coherence score of the genes that have every possible subset of N-1 motifs 3) Test (statistically) the hypothesis that the score of the ORFs with N motifs is significantly higher than that of ORFs that have any subset of N-1 motifs
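The slide does not give the formula for the EC score. The sketch below assumes one plausible definition: the fraction of gene pairs whose normalized expression profiles lie within a distance threshold of each other. Both the definition and the threshold value are assumptions for illustration, and the profiles are made up (constant profiles are not handled).

```python
from itertools import combinations
import math

def normalize(profile):
    """Scale a profile to mean 0 and standard deviation 1."""
    m = sum(profile) / len(profile)
    sd = math.sqrt(sum((v - m) ** 2 for v in profile) / len(profile))
    return [(v - m) / sd for v in profile]

def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose normalized profiles are within `threshold`."""
    normed = [normalize(p) for p in profiles]
    pairs = list(combinations(normed, 2))
    close = sum(1 for a, b in pairs if math.dist(a, b) <= threshold)
    return close / len(pairs)

# Genes carrying both motifs rise together; genes with one motif are mixed.
both_motifs = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 1.0, 2.0]]
one_motif = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]

ec_both = expression_coherence(both_motifs, threshold=0.5)
ec_one = expression_coherence(one_motif, threshold=0.5)
print(ec_both, ec_one)
```

A higher score for the genes carrying all N motifs than for those carrying only a subset is the pattern the synergy test looks for; the statistical test itself is not shown here.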

The “Combinogram” (cell cycle data) • Highly synergistic interaction between MCB and SFF • Previously unknown • Subsequently predicted via chromatin immuno-precipitation (ChIP) • Ho et al., Nature (2002)

Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements • Pilpel et al., Nature Genetics 29, 153–159 (2001) • Found several known and novel interactions between regulatory elements active in cell cycle, sporulation and stress response. • Doesn’t assume a specific (e.g. linear) model of TF interactions. • Combined with TF expression patterns, may allow us to infer a model of interaction.

Protein Networks

Yeast 2-Hybrid Assays • Fields and Song, Nature 340:245-246 (1989) • [Figure: a transcription factor (e.g. Gal4) with a DNA-binding domain (BD) and an activation domain (AD); “bait” and “prey” fusions of Prot1 and Prot2 to these domains are brought together by mating; if Prot1 and Prot2 interact, the reconstituted active TF sits at the binding site and drives the reporter gene]

Large-Scale 2-Hybrid Data Sets • Uetz et al., Nature 403:623-627 (2000) • 6000 x 192 protein pairs screened using protein array • nearly all 6000 x 6000 pairs, using pooled prey libraries • total of 957 putative interactions between 1004 proteins • Ito et al., PNAS 98:4569-4574 (2001) • nearly all 6000 x 6000 pairs, using bait and prey pools • total of 4549 putative interactions between 3278 proteins • core set of 841 interactions between 797 proteins • Surprisingly little overlap between the data sets, possibly indicating a large number of missed interactions (false negatives).

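Intersections between Protein Interaction Data Sets • Data set sizes: MIPS 1546, Ito full 4475, Ito core 806, Uetz 947 • [Venn diagrams; region assignment below inferred from the set totals] • MIPS / Ito full / Uetz: MIPS only 1415, Ito full only 4242, Uetz only 709, MIPS and Ito full only 49, MIPS and Uetz only 54, Ito full and Uetz only 156, all three 28 • MIPS / Ito core / Uetz: MIPS only 1436, Ito core only 648, Uetz only 756, MIPS and Ito core only 28, MIPS and Uetz only 61, Ito core and Uetz only 109, all three 21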
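The Venn-diagram counts can be checked against the stated data set sizes. In the sketch below, the assignment of each number to a region is an inference (the labels were lost in extraction), chosen so that the regions of every set add up to its stated total.

```python
# Region counts keyed by membership: (in MIPS, in Ito, in Uetz).
full = {
    (1, 0, 0): 1415, (0, 1, 0): 4242, (0, 0, 1): 709,
    (1, 1, 0): 49, (1, 0, 1): 54, (0, 1, 1): 156, (1, 1, 1): 28,
}
core = {
    (1, 0, 0): 1436, (0, 1, 0): 648, (0, 0, 1): 756,
    (1, 1, 0): 28, (1, 0, 1): 61, (0, 1, 1): 109, (1, 1, 1): 21,
}

def set_totals(regions):
    """Sum the regions belonging to each of the three data sets."""
    return tuple(
        sum(n for member, n in regions.items() if member[i]) for i in range(3)
    )

def shared(regions, i, j):
    """Interactions found in both data set i and data set j."""
    return sum(n for member, n in regions.items() if member[i] and member[j])

full_totals = set_totals(full)
core_totals = set_totals(core)
ito_uetz_shared = shared(full, 1, 2)
print(full_totals, core_totals, ito_uetz_shared)
```

Under this assignment, only 184 interactions are shared between the full Ito set (4475) and the Uetz set (947), which is the "surprisingly little overlap" noted on the previous slide.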

Causes of False Positives • [Figure: 2-hybrid schematic with Prot1, Prot2, AD, BD, binding site and reporter gene] • Bait acts as activator • Bait interacts with endogenous activator • Prey binds to DNA • Prey interacts with endogenous transcription factors • Bait interacts with Activation Domain • Prey interacts with DNA Binding Domain • “Sticky” proteins (nonspecific binding) • Changes in plasmid copy number • Various other artifacts • . . .

Yeast Protein-Protein Interaction Map • Each node is a protein • Each line is an interaction • 5560 putative interactions • 3725 different proteins • ~3 interactions / protein • Uetz, Schwikowski, Fields and co-workers; Ito and co-workers
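The figure of about 3 interactions per protein follows from counting each interaction once for each of its two partners. This assumes every interaction joins two distinct proteins (self-interactions ignored).

```latex
\[
\bar{k} \;=\; \frac{2 \times \#\text{interactions}}{\#\text{proteins}}
\;=\; \frac{2 \times 5560}{3725} \;\approx\; 2.99
\]
```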

[Figure: interaction map highlighting Membrane Proteins and Transcription Factors]

[Figure legend] • membrane protein • DNA-binding protein • all other yeast proteins • physical interaction between two proteins
