Discovering Regulatory Networks from Gene Expression and Promoter Sequence Eran Segal Stanford University
Modules Interactions Activity From Parts to Systems Parts
Gene 1 Gene 2 RNA Protein is a tightly regulated process DNA RNA Gene Regulation DNA
Gene 1 Gene 2 Coding Coding Regulator Control Control Swi5 RNA ACGTGC Motif Swi5 Regulator (transcription factor) Gene Regulation DNA
Gene 1 Gene 2 Coding Coding Control Control Genome-wide Available Data • DNA Sequence • Gene Expression • mRNA level of all genes • Measured in different conditions ……ACTAGCGGCTATAATGACTGGACCTACGTACCGATATAATGTCAGCTAGCA…… RNA DNA Microarray
Gene Regulation • How are genes regulated? • Who regulates whom? • Under which conditions? • Which genes are co-regulated? Many diagnostic, prognostic and therapeutic implications Gene 1 Gene 2 Coding Coding Regulator Control Control ACGTGC Motif Swi5
Example: Finding Motifs Procedural • Apply a different method to each type of data • Use output of one method as input to the next Steps • Cluster gene expression profiles (Genes I to VI across experiments) • Search for motifs in control regions of clustered genes Control regions: AGCTAGCTGAGACTGCACAC TTCGGACTGCGCTATATAGA GACTGCAGCTAGTAGAGCTC CTAGAGCTCTATGACTGCCG ATTGCGGGGCGTCTGAGCTC TTTGCTCTTGACTGCCGCTT Motif: GACTGC
What is a model? probabilistic stochastic A description of the biological process that could have generated the observed data Our Approach: Model Based
Our Approach: Model Based • Statistical modeling language for biological domains • Based on Bayesian networks • Classes of objects • Properties • Observed: gene sequence, experiment conditions • Hidden: gene module • Interactions • Expression level as a function of gene and experiment properties Gene Experiment Condition Tumor Module Expression STGFK ’01 (ISMB)
Bayesian Network Condition1 Condition2 Tumor1 Tumor2 Module1 Level1,1 Level1,2 Module2 Level2,1 Level2,2 P(Level2,1 | Module2,Condition2,Tumor2) Probabilistic Model • Defines a joint distribution Exper. Condition Gene Module Tumor Level Expression STGFK ’01 (ISMB)
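The joint distribution that this network defines can be written out as a product of local terms. The factorization below is a sketch under an assumption the slide does not state: Module, Condition and Tumor are treated as parentless variables, and each expression Level depends only on its gene's Module and its experiment's Condition and Tumor. The indices g (gene) and e (experiment) are introduced here for illustration.

```latex
P(\mathrm{Module}, \mathrm{Condition}, \mathrm{Tumor}, \mathrm{Level})
  = \prod_{g} P(\mathrm{Module}_g)
    \prod_{e} P(\mathrm{Condition}_e)\, P(\mathrm{Tumor}_e)
    \prod_{g,e} P(\mathrm{Level}_{g,e} \mid \mathrm{Module}_g, \mathrm{Condition}_e, \mathrm{Tumor}_e)
```

Every gene-experiment pair reuses the same conditional distribution for Level, which is what lets a small template describe a network with many variables.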
Probabilistic Model STGFK ’01 (ISMB) • Defines a joint distribution • Learned automatically from data • Parameterization • Structure • Assignment to hidden variables Find model M that maximizes P(M | D) NP-Hard • Learn parameterization and structure of distributions • Learn network structure • Thousands of variables • Space of possible networks is super-exponential • Probabilistic inference in the Bayesian network • Millions of hidden variables • Variables are highly dependent Problem-specific structure • Modularity in biological systems • Convex optimization • Graph theoretic algorithms • Dynamic programming • Heuristic search Exper. Condition Gene Module Tumor Level Expression
Biological problem Analyze results • Visualization • Literature • Statistics Model design • Classes of objects • Properties • Interactions Learn model • Automatically from data • Structure • Parameterization Derive biological insights from model Scheme Analyze results Model design Learn model Data STGFK ’01 (ISMB)
Outline • Who regulates whom and when? • Model • Learning algorithm • Evaluation • Wet lab experiments • How are genes regulated? • Regulation of multi-functional genes • Evolution of gene regulation
Ongoing Biological Debate Can we discover actual regulators from gene expression data alone?
State 1 State 2 State 3 Repressor Regulated gene Activator Activator Activator Activator Repressor Activator Repressor Repressor Regulators Regulators DNA Microarray DNA Microarray Regulated gene Regulated gene Regulated gene Gene Regulation: Simple Example
Regulation program Module genes Regulation Tree SSRPBKF ’03 (Nature Genetics) Activator? Activator expression false true true Repressor? Repressor expression false true Genes in the same module share the same regulation program State 1 State 2 State 3
false true HAP4 true false CMK1 Module Networks SSRPBKF ’03 (Nature Genetics) Modules Goal: Discover regulatory modules and their regulators • Module genes: set of genes that are similarly controlled • Regulation program: expression as function of regulators
P(Level | Module, Regulators) Module HAP4 Expression level of Regulator1 in experiment CMK1 1 What module does gene “g” belong to? 0 Regulator1 0 0 BMH1 Regulator2 GIC2 2 Regulator3 0 0 0 Expression level in each module is a function of expression of regulators Level Module Network Probabilistic Model SSRPBKF ’03 (Nature Genetics) Experiment Module Gene Expression
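A regulation program of this kind can be sketched as a small decision tree over regulator expression, with a distribution over expression levels at each leaf. The code below is an illustrative sketch, not the authors' implementation: the Gaussian leaves, the thresholds, the numeric values and the names `Leaf`, `Split`, `find_leaf` and `log_p_level` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    mean: float  # assumed mean expression of the module's genes in this context
    std: float   # assumed standard deviation in this context


@dataclass
class Split:
    regulator: str    # regulator gene tested at this node
    threshold: float  # hypothetical cutoff on the regulator's expression
    low: "Node"       # subtree used when regulator expression <= threshold
    high: "Node"      # subtree used when regulator expression > threshold


Node = Union[Leaf, Split]


def find_leaf(tree: Node, regulators: Dict[str, float]) -> Leaf:
    """Walk the tree using the regulators' expression in one experiment."""
    node = tree
    while isinstance(node, Split):
        value = regulators[node.regulator]
        node = node.high if value > node.threshold else node.low
    return node


def log_p_level(level: float, tree: Node, regulators: Dict[str, float]) -> float:
    """Log density of one expression level under the Gaussian at the selected leaf."""
    leaf = find_leaf(tree, regulators)
    z = (level - leaf.mean) / leaf.std
    return -0.5 * z * z - math.log(leaf.std * math.sqrt(2.0 * math.pi))


# Hypothetical program: test HAP4 first, then CMK1 when HAP4 is high.
tree = Split("HAP4", 0.0,
             low=Leaf(-1.0, 0.5),
             high=Split("CMK1", 0.0, low=Leaf(0.5, 0.5), high=Leaf(2.0, 0.5)))
experiment = {"HAP4": 1.3, "CMK1": 0.7}
```

All genes assigned to the module would be scored against the same tree, which is the parameter sharing the model relies on.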
Goal: Find gene module assignments and tree structures that maximize P(M|D) Hard Gene module assignments Regulator1 Tree structures Regulator2 Regulator3 HAP4 CMK1 Level 0 0 0 Learning Problem SSRPBKF ’03 (Nature Genetics) • Genes: 5000-10000 • Regulators: ~500 Experiment Module Gene Expression
clustering Gene module assignment Learn regulation programs Relearn gene assignments to modules Regulatory modules HAP4 CMK1 Learning Algorithm Overview SSRPBKF ’03 (Nature Genetics)
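The alternation in this overview (start from a clustering, learn a program per module, reassign genes, repeat) can be sketched as a short loop. This is a deliberately simplified stand-in: each module's "program" here is just the mean expression profile of its genes, and genes are reassigned by squared error, rather than by learning regulation trees and maximizing a Bayesian score. The function names are hypothetical.

```python
def learn_programs(expr, assign, n_modules):
    """Stand-in for 'learn regulation programs': mean profile of each module's genes."""
    n_experiments = len(expr[0])
    programs = []
    for m in range(n_modules):
        members = [row for row, a in zip(expr, assign) if a == m]
        if members:
            programs.append([sum(col) / len(members) for col in zip(*members)])
        else:
            programs.append([0.0] * n_experiments)  # empty module: flat profile
    return programs


def reassign(expr, programs):
    """Stand-in for 'relearn gene assignments': pick the best-fitting module per gene."""
    def sq_err(row, prog):
        return sum((x - p) ** 2 for x, p in zip(row, prog))
    return [min(range(len(programs)), key=lambda m: sq_err(row, programs[m]))
            for row in expr]


def fit_modules(expr, init_assign, n_modules, max_iter=20):
    """Alternate the two steps until assignments stop changing."""
    assign = list(init_assign)
    for _ in range(max_iter):
        programs = learn_programs(expr, assign, n_modules)
        new_assign = reassign(expr, programs)
        if new_assign == assign:
            break
        assign = new_assign
    return assign


# Toy data: 4 genes x 3 experiments, starting from an imperfect clustering.
expr = [[2.0, 2.0, -2.0], [2.1, 1.9, -2.0], [-2.0, -2.0, 2.0], [-1.9, -2.1, 2.0]]
modules = fit_modules(expr, init_assign=[0, 0, 0, 1], n_modules=2)
```

Like any alternating scheme of this shape, the result depends on the starting assignment, so the initial clustering matters.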
Learning Regulation Programs • Module genes shown with experiments sorted in original order, then with experiments sorted by Hap4 expression • Candidate regulators: HAP4, CMK1, SIP4 • Score: log P(M|D) = log P(D|μ,σ) + log P(μ,σ) + const • Split on HAP4: log P(D_HAP4,1 | μ_HAP4,1, σ_HAP4,1) + log P(D_HAP4,2 | μ_HAP4,2, σ_HAP4,2) + log P(μ_HAP4,1, σ_HAP4,1, μ_HAP4,2, σ_HAP4,2) • Split on SIP4: log P(D_SIP4,1 | μ_SIP4,1, σ_SIP4,1) + log P(D_SIP4,2 | μ_SIP4,2, σ_SIP4,2) + log P(μ_SIP4,1, σ_SIP4,1, μ_SIP4,2, σ_SIP4,2) • Split on HAP4, then on CMK1: log P(D_HAP4,1 | μ_HAP4,1, σ_HAP4,1) + log P(D_CMK1,1 | μ_CMK1,1, σ_CMK1,1) + log P(D_CMK1,2 | μ_CMK1,2, σ_CMK1,2) + …
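Choosing a regulator for a split can be illustrated by scoring each candidate on how well it partitions the experiments. The sketch below makes several assumptions: it fits one maximum-likelihood Gaussian to the module's expression values on each side of the split, it uses a fixed hypothetical threshold of 0.0, and it leaves out the prior term, so it is a likelihood-only simplification of the score. The function names and toy values are invented for the example.

```python
import math


def gaussian_loglik(values):
    """Log-likelihood of values under a Gaussian fitted to those same values."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    var = max(ss / n, 1e-6)  # floor keeps the log finite for constant data
    return -0.5 * n * math.log(2.0 * math.pi * var) - ss / (2.0 * var)


def split_score(module_expr, regulator_expr, threshold=0.0):
    """Sum of per-side log-likelihoods when experiments are split on one regulator."""
    low, high = [], []
    for row in module_expr:  # one row per module gene
        for level, reg in zip(row, regulator_expr):
            (high if reg > threshold else low).append(level)
    if not low or not high:
        return float("-inf")  # the regulator does not separate the experiments
    return gaussian_loglik(low) + gaussian_loglik(high)


def best_regulator(module_expr, candidates):
    """Return the candidate name whose split scores highest."""
    return max(candidates, key=lambda name: split_score(module_expr, candidates[name]))


# Toy module: 2 genes x 4 experiments; HAP4 tracks the module, SIP4 does not.
module_expr = [[2.0, 2.1, -2.0, -1.9], [1.9, 2.2, -2.1, -2.0]]
candidates = {"HAP4": [1.0, 1.2, -1.0, -0.8], "SIP4": [1.0, -1.0, 1.0, -1.0]}
chosen = best_regulator(module_expr, candidates)
```

Without a prior or penalty term, deeper trees always fit at least as well, so a real scoring function needs something like the prior on (μ, σ) to decide when to stop splitting.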
Learning Algorithm Performance SPRKF ’03 (UAI) • Significant improvements across learning iterations • Many genes (50%) change module assignment in learning [Figure: Bayesian score (avg. per gene) vs. algorithm iterations; gene module assignment changes (% of total) vs. algorithm iterations]
Yeast Stress Data • Genes • Selected 2355 that showed activity • Experiments (173) • Diverse environmental stress conditions: heat shock, nitrogen depletion,…
Comparison to Bayesian Networks • Bayesian Network (Friedman et al. ’00; Hartemink et al. ’01) • Expression level of each gene is a function of expression of regulators • Fragment of learned Bayesian network: Hap4 Mig1 Yap1 Cmk1 Ste12 Gic1 • 2355 variables (genes) • 173 instances (experiments) Problems • Robustness • Interpretability
Comparison to Bayesian Networks • Bayesian Network (Friedman et al. ’00; Hartemink et al. ’01): Hap4 Mig1 Yap1 Cmk1 Ste12 Gic1 • Module Network (SPRKF ’03, UAI): Regulator1 Regulator2 Regulator3 Module Level Problems • Robustness • Interpretability Solutions • Robustness: sharing parameters • Interpretability: module-level model
Comparison to Bayesian Networks SPRKF ’03 (UAI) • Learn which parameters are shared (by learning which genes are in the same module) Problems • Robustness • Interpretability Solutions • Robustness: sharing parameters • Interpretability: module-level model [Figure: test data log-likelihood (gain per instance) vs. number of modules (0 to 500), with Bayesian network performance shown for reference]
HAP4 CMK1 HAP4 CMK1 0 0 0 Regulator1 Regulator2 Regulator3 Module Biologically relevant? Level From Model to Regulatory Modules SSRPBKF ’03 (Nature Genetics)
Regulation program Module genes Respiration Module SSRPBKF ’03 (Nature Genetics) • Module genes functionally coherent? • Module genes known targets of predicted regulators? Predicted regulator Energy production (oxid. phos. 26/55, P < 10^-30) Hap4+Msn4 known to regulate module genes
Regulation program Module genes Energy, Osmolarity, & cAMP Signaling • Regulation by non-TFs (Tpk1, a cAMP-dependent protein kinase) • Module genes known targets of predicted regulators?
Are the module genes functionally coherent? Are some module genes known targets of the predicted regulators? Biological Evaluation Summary SSRPBKF ’03 (Nature Genetics) 46/50 Functionally coherent = module genes enriched for GO annotations with hypergeometric p-value < 0.01 (corrected for multiple hypotheses) 30/50 Known targets = direct biological experiments reported in the literature
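The functional-coherence test rests on a hypergeometric tail probability: the chance of seeing at least k annotated genes in a module of n genes, when K of the N genes in the background carry the annotation. The sketch below computes that tail exactly with the standard library. The slide does not say which multiple-hypothesis correction was used, so the Bonferroni helper is an assumption chosen for simplicity, and both function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(population N, K annotated, n drawn)."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def bonferroni(p, n_tests):
    """One simple correction (an assumption here): scale p by the number of tests."""
    return min(1.0, p * n_tests)


# Small worked case: 10 genes, 5 annotated, module of 2 genes, both annotated.
p_small = hypergeom_pvalue(k=2, n=2, K=5, N=10)
```

A module would then count as functionally coherent for an annotation when the corrected p-value falls below the 0.01 cutoff quoted above.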
From Model to Detailed Predictions SSRPBKF ’03 (Nature Genetics) • Prediction: Regulator ‘X’ regulates process ‘Y’ • Experiment: Knock out ‘X’ and repeat experiment HAP4 Ypl230w ?
wild-type mutant 1334 regulated genes (312 expected by chance) Modules predicted to be regulated by Ypl230w >4x Regulated genes Does ‘X’ Regulate Predicted Genes? SSRPBKF ’03 (Nature Genetics) Experiment: knock out Ypl230w (stationary phase) Rank modules by regulated genes Ypl230w regulates computationally predicted genes Predicted modules
wild-type mutant wild-type mutant Does ‘X’ Regulate Predicted Genes? SSRPBKF ’03 (Nature Genetics) Ppt1 knockout (hypo-osmotic stress) Kin82 knockout (heat shock) Regulated genes (1014) Regulated genes (1034)
New yeast biology suggested • Ypl230w activates protein-folding, cell wall and ATP-binding genes • Ppt1 represses phosphate metabolism and rRNA processing • Kin82 activates energy and osmotic stress genes Wet Lab Experiments Summary SSRPBKF ’03 (Nature Genetics) 3/3 regulators regulate computationally predicted genes
Many regulatory relationships can be induced from gene expression data Ongoing Biological Debate SSRPBKF ’03 (Nature Genetics) Can we discover actual regulators from gene expression data alone?
Feedforward, auto-regulatory “motifs” (Shen-Orr et al. 2002) TFs and SMs have detectable expression signature Sip2 (SM) Phd1 (TF) Msn4 (TF) Yap6 (TF) Hap4 (TF) Statistical methods can infer their regulatory relationships from gene expression data Vid24 Tor1 Gut2 Vid24 Tor1 Gut2 Cox4 Cox6 Atp17 Positive signaling loop Auto regulation Regulator chain (Sporulation & cAMP) (Respiration) (Snf kinase regulated processes) Undetected regulators Detected regulators Detected target Why Does it Work? SSRPBKF ’03 (Nature Genetics) Assumption: Regulators are transcriptionally regulated
Outline • Who regulates whom and when? • How are genes regulated? • Model • Evaluation • Regulation of multi-functional genes • Evolution of gene regulation
DNA control sequence GATAG GATAG ACGTGC Motif ACGTGC + No motifs GATAG GATAG From Sequence to Expression DNA Microarray Repressor Activator ? ? ? Activator Activator Repressor Gene 1 Gene 2 Gene 3
Sequence Expression ACGTGC + No motifs GATAG GATAG From Sequence to Expression Goal: Explain how expression arises from sequence • Construct mechanistic model of gene regulation • Learn the model from sequence and expression data
Two Phase Approach (I) Procedural • Apply a different method to each type of data • Use output of one method as input to the next Steps • Cluster gene expression profiles (Genes I to VI across experiments) • Search for motifs in control regions of clustered genes Control regions: AGCTAGCTGAGACTGCACAC TTCGGACTGCGCTATATAGA GACTGCAGCTAGTAGAGCTC CTAGAGCTCTATGACTGCCG ATTGCGGGGCGTCTGAGCTC TTTGCTCTTGACTGCCGCTT Motif: GACTGC
Shared Motif Clustering B Clustering A Shared Motif Cluster I Cluster I Cluster II Cluster II Two Phase Approach: Problems • Expression clustering is not perfect
TCGACT CGATGG AAATTA TCGACT ACGAGA GATACC GATACC TTCGCA ACGACT AAATGC CGCTGA GATACC Two Phase Approach (II) • Iterate over all sequences of length k • Find all genes that have each k-mer in their promoter • Keep k-mers whose genes are coherent in expression
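The three steps of this k-mer approach can be sketched directly. The code below indexes every k-mer that occurs in some promoter (k-mers that occur nowhere have no genes to test), then keeps a k-mer when the genes carrying it are coherent in expression. The coherence measure (mean pairwise Pearson correlation), the cutoffs `min_genes` and `min_corr`, the function names and the toy data are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def genes_with_kmer(promoters, k):
    """Map each k-mer seen in any promoter to the set of genes containing it."""
    index = {}
    for gene, seq in promoters.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(gene)
    return index


def coherent_kmers(promoters, expression, k, min_genes=3, min_corr=0.8):
    """Keep k-mers whose genes have high mean pairwise expression correlation."""
    kept = {}
    for kmer, genes in genes_with_kmer(promoters, k).items():
        if len(genes) < min_genes:
            continue
        pairs = list(combinations(sorted(genes), 2))
        corr = sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)
        if corr >= min_corr:
            kept[kmer] = corr
    return kept


# Toy data: three co-expressed genes share GATACC; two other genes do not.
promoters = {"g1": "TTGATACCAA", "g2": "CCGATACCGG", "g3": "AGGATACCTC",
             "g4": "TCGACTAAAT", "g5": "ACGAGATTCG"}
expression = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.5], "g3": [0, 1, 2, 3],
              "g4": [4, 3, 2, 1], "g5": [1, -1, 1, -1]}
motifs = coherent_kmers(promoters, expression, k=6)
```

Because each k-mer is judged on its own, this sketch inherits the limitation raised on the following slides: a motif that only acts in combination with another may not look coherent by itself.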
Two Phase Approach: Problems • Single motifs may not have coherent expression • Activator: TCGACTGC • Repressor: GATAC
OR TCGACTGC ? + CCAAT Two Phase Approach: Problems • Are we missing motifs? TCGACTGC
Unified Model of Gene Regulation SYK ’03 (ISMB) • Sequence • Motifs: TCGACTGC, GATAC, CCAAT, GCAGTT • Motif Profiles • Expression Profiles [Figure: promoter sequence of each gene annotated with motif occurrences; genes grouped by motif profile and shown with their expression profiles]
Unified Model of Gene Regulation Sequence cis-regulatory modules Motifs TCGACTGC GCAGTT Motif Profiles CCAAT + + GATAC CCAAT Expression Profiles
Regulatory Module DNA control sequences of module genes Expression of module genes TCGACTGC GATAC Motif Profile: + Unified Model of Gene Regulation Experiments Modules SYK ’03 (ISMB)