From Sequence to Expression: A Probabilistic Framework
Joint work with: Eran Segal (Stanford), Nir Friedman (Hebrew U.), Daphne Koller (Stanford), Yoseph Barash (Hebrew U.), Itamar Simon (Whitehead Inst.)

Understanding Cellular Processes
• Complex biological processes (e.g. cell cycle)
• Coordination of multiple events
• Each event requires different modules
Can we recover the regulatory circuits that control such processes?
[Figure: cell cycle diagram with phases G1, S, G2, M]

Gene Structure
[Figure: a gene with its promoter region and coding region (example sequence CTAGTAGATATCGATCAG), with mRNA and protein as products]

Gene Regulation
[Figure: Genes 1 to 5, a sequence motif (AGACTTCAGA) labeled A, and mRNA]

Gene Regulation
[Figure: transcription factor Swi5; motif A marked on some of the promoters of Genes 1 to 5; mRNA]

Gene Regulation
[Figure: Swi5 bound at motif A on Genes 1, 2, and 4; these genes are activated, giving more mRNA (higher expression)]

Gene Regulation
[Figure: a second motif, B (AGTTGA), marked on Genes 1, 3, 4, and 5; Swi5 bound at motif A on Genes 1, 2, and 4]

Gene Regulation
[Figure: Ndd1 bound at motif B on Genes 1, 3, 4, and 5; Swi5 bound at motif A on Genes 1, 2, and 4; activation by A + B]

Goal
[Figure: a long promoter sequence with a recurring GACTGC site; example promoters ACTAGTGCTGA, CTATTATTGCA, CTGATGCTAGC; a motif for t1 and a motif for t2; regulation variables R(t1) and R(t2); cell cycle phases G1 and G2]

Model of Gene Regulation
Probabilistic Relational Models (PRMs): Pfeffer and Koller (1998), Friedman et al. (1999), Segal et al. (2001)
• Gene: promoter sequences (Sequence), regulation by transcription factors
• Experiment: context, cluster
• Expression: expression measurements

Regulation to Expression
[Figure: PRM schema. Gene: R(t1), R(t2). Experiment: Exp. type, Exp. cluster. Expression: Level]
• R(t1) = yes: t1 regulates the gene
• R(t1) = no: t1 does not regulate the gene

Regulation to Expression: CPD
R(t1)  R(t2)  Etype  P(Level)
0      0      I      -0.7, 1.2
0      1      II     0.8, 0.6
…
[Figure: each row's P(Level) is drawn as a curve over Level; the two curves shown are centered at -0.7 and 0.8. Same PRM schema as the previous slide]

Modeling Context Specificity
• Gaussian decision tree
• t1 only relevant in G1
• t2 only relevant in G2
[Figure: tree CPD for Level. The root tests Exp. type = G1 (true / false); inner nodes test R(t1) = yes and R(t2) = yes; each leaf is a P(Level) curve, with Level values 0, 2, and 3 marked on the leaves shown. Same PRM schema as the previous slides]
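
The context-specific tree above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the leaf parameters, the explicit test for G2, and the names `select_leaf` and `level_density` are hypothetical and do not come from the slides. The point it shows is that R(t2) is never consulted on the G1 branch, and R(t1) is never consulted on the G2 branch.

```python
import math

# Hypothetical leaf parameters (mean, std) for P(Level); not taken from the talk.
BASELINE = (0.0, 1.0)
T1_ACTIVE_IN_G1 = (2.0, 1.0)
T2_ACTIVE_IN_G2 = (3.0, 1.0)


def select_leaf(exp_type, r_t1, r_t2):
    """Walk the tree and return the (mean, std) of the leaf that applies."""
    if exp_type == "G1":
        # In G1 only R(t1) is tested; R(t2) is ignored on this branch.
        return T1_ACTIVE_IN_G1 if r_t1 else BASELINE
    if exp_type == "G2":
        # In G2 only R(t2) is tested; R(t1) is ignored on this branch.
        return T2_ACTIVE_IN_G2 if r_t2 else BASELINE
    # Assumption: every other experiment type shares one baseline leaf.
    return BASELINE


def level_density(level, exp_type, r_t1, r_t2):
    """Gaussian density of an expression level under the selected leaf."""
    mean, std = select_leaf(exp_type, r_t1, r_t2)
    z = (level - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
```

For example, `select_leaf("G1", True, False)` and `select_leaf("G1", True, True)` return the same leaf, which is the context-specific independence the slide describes.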

From Sequence to Regulation
• Assumptions:
  • Binding site is of length k
  • Binding may occur at any k-mer
  • TF regulates gene if binding occurs anywhere
• PSSM:
  • Background distribution
  • Motif distribution
  • Discriminative training
[Equation: not preserved in this transcript]
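
One way to turn the three assumptions above into a computation is sketched below. The slide's own equation did not survive, so the combining rule here is an assumption rather than the talk's formula: each k-mer is scored by its motif-to-background likelihood ratio, the ratios are averaged over all positions, and a logistic function of the log average gives P(R = yes | promoter). The motif table, the uniform background, the `bias` parameter, and the function names are all hypothetical.

```python
import math

# Hypothetical uniform background distribution over nucleotides.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical PSSM for a binding site of length k = 3 (one dict per position).
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def kmer_ratio(kmer, motif, background):
    """Likelihood ratio of one k-mer under the motif versus the background."""
    ratio = 1.0
    for position, base in enumerate(kmer):
        ratio *= motif[position][base] / background[base]
    return ratio


def regulation_probability(promoter, motif, background, bias=0.0):
    """Logistic of the log of the position-averaged likelihood ratio.

    Assumption: this combining rule stands in for the missing slide equation.
    """
    k = len(motif)
    positions = len(promoter) - k + 1
    if positions <= 0:
        raise ValueError("promoter is shorter than the binding site")
    total = sum(
        kmer_ratio(promoter[j:j + k], motif, background)
        for j in range(positions)
    )
    average = total / positions
    return 1.0 / (1.0 + math.exp(-(bias + math.log(average))))
```

With these made-up parameters, a promoter that contains the site `ACG` (such as `TTACGTT`) scores well above 0.5, while `TTTTTTT` scores well below it. In a discriminatively trained model the motif entries and `bias` would be fitted to predict regulation, not set by hand.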

Localization Assay
• Localization data: measure TF binding to the promoter of each gene (assign binding confidence)
Simon et al. (2001)

Is Regulation Observed?
• Not quite…
• Localization is measured for specific conditions
• Localization is measured for large DNA regions
• Localization is noisy

Localization Model
[Figure: Gene with hidden R(t1) and observed L(t1)]
• Localization p-value is a noisy sensor of actual regulation
• If regulation occurs, the p-value is likely to be low
• If there is no regulation, the p-value is likely to be high
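
The noisy-sensor idea can be illustrated with a small Bayes-rule calculation. The slides do not give the form of P(L | R), so the densities below are assumptions chosen only for illustration: a Uniform(0, 1) p-value when there is no regulation, and a Beta(a, 1) density with a < 1 (mass concentrated near zero) when there is. The prior, the shape value, and the function name are hypothetical.

```python
def posterior_regulation(p_value, prior=0.1, shape=0.2):
    """Posterior P(R = yes | localization p-value) by Bayes' rule.

    Assumed sensor model (not from the talk):
      R = yes: p-value ~ Beta(shape, 1), density shape * p**(shape - 1)
      R = no:  p-value ~ Uniform(0, 1), density 1
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    if not 0.0 < shape < 1.0:
        raise ValueError("shape must lie in (0, 1) so low p-values favor regulation")
    density_if_regulated = shape * p_value ** (shape - 1.0)
    density_if_not = 1.0
    numerator = prior * density_if_regulated
    return numerator / (numerator + (1.0 - prior) * density_if_not)
```

Under these assumptions a p-value of 0.001 raises the belief in regulation well above the prior, and a p-value of 0.5 lowers it below the prior, so the localization measurement shifts the belief about R without fixing it.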

Model Learning
• Structure Learning:
  • Tree structure
  • Bayesian score, heuristic search
• Missing Data:
  • Experiment cluster
  • Regulation variables
  • Expectation Maximization
• Motif Model:
  • Parameter estimation
  • Discriminative training (conjugate gradient)
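
To make the role of Expectation Maximization for the hidden regulation variables concrete, here is a deliberately reduced sketch. It is a toy stand-in, not the talk's learning procedure: it assumes a single transcription factor, a single expression level per gene, Gaussian levels with a fixed standard deviation, and exact posteriors in the E-step. All names and starting values are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_hidden_regulation(levels, iterations=50, std=1.0):
    """EM for a hidden binary regulation variable R per gene.

    Toy model (an assumption): Level ~ N(mean_on, std) if R = yes,
    Level ~ N(mean_off, std) if R = no, with P(R = yes) = prior.
    Levels very far (tens of std) from both means could underflow.
    """
    prior = 0.5
    mean_off, mean_on = min(levels), max(levels)
    responsibilities = []
    for _ in range(iterations):
        # E-step: posterior probability that each gene is regulated.
        responsibilities = []
        for x in levels:
            on = prior * normal_pdf(x, mean_on, std)
            off = (1.0 - prior) * normal_pdf(x, mean_off, std)
            responsibilities.append(on / (on + off))
        # M-step: maximum-likelihood updates weighted by those posteriors.
        total_on = sum(responsibilities)
        total_off = len(levels) - total_on
        prior = total_on / len(levels)
        mean_on = sum(r * x for r, x in zip(responsibilities, levels)) / total_on
        mean_off = sum((1.0 - r) * x for r, x in zip(responsibilities, levels)) / total_off
    return prior, mean_off, mean_on, responsibilities
```

On levels such as `[-1.1, -0.9, -1.0, 1.9, 2.1, 2.0]` the loop separates the two groups and assigns high regulation posteriors to the last three genes. In the full model the posteriors are coupled across genes and experiments, so the E-step cannot be computed gene by gene as it is here.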

Model Learning
[Figure: the full PRM schema, with promoter sequences (e.g. ACGCCTA), experimental details, and localization data as inputs. Gene: promoter s1 … sk, R(t1), R(t2), localization L(t1). Experiment: Exp. type, Exp. cluster. Expression: Level]

Resulting Bayesian Network
[Figure: the ground network for three genes and two experiments. For each gene g: sequence positions s1,g … sk,g, R(t1)g, R(t2)g, L(t1)g, L(t2)g. For each experiment: Exp. type, Exp. cluster. For each gene g and experiment e: Level g,e]

Model Learning: E-Step
• Loopy belief propagation
[Figure: the same ground network]

Model Learning: M-Step
• Conjugate gradient
• Standard ML estimation
[Figure: the same ground network]

Experimental Results
Yeast:
• Cell cycle expression data (Spellman et al.)
• Localization data for 9 TFs (Simon et al.)
• Yeast genome (promoters)

Generalization
Model                                     Gene log-likelihood
Clustering genes                          -112.24
Localization                              -121.48
Localization + exp. cluster               -103.76
Localization + exp. cluster + sequence    -94.59
[Figure: the PRM schema used for each row, built up across four slides: Gene (promoter s1 … sk, R(t1), R(t2), L(t1), L(t3)), Experiment (Exp. type, Exp. cluster), Expression (Level)]

Generating Hypotheses
Example: genes regulated by Swi6, but not by Mcm1 or Fkh2, exhibit a unique expression pattern in phase G1 of the cell cycle.
Gene functions:
• DNA repair [P 3e-09]
• DNA synthesis [P 7e-05]

Expression vs Regulation
Genes predicted to be regulated by Swi5 are probably real Swi5 targets.
[Figure: expression (y axis, -1 to 1) over time for the cdc15, cdc28, elu, and alpha time courses; series: Phase, Swi5 regulated, Swi5 expression]

Combinatorial Effects
[Figure: expression (y axis, -1 to 1) over time for the cdc15, cdc28, elu, and alpha time courses; series: Phase, Mcm1 & Ndd1, Mcm1 & Ace2, Mcm1 & Swi5]

Motifs Found
• Ndd1
The expanded set identified additional genes regulated by Ndd1.
[Figure: labels Simon et al., Expanded Set, Remaining Genes; values 17, 28, 1]

Conclusions
• Unified probabilistic model explaining gene regulation using sequence, localization and expression data
• Models complex interactions between regulators
• Discriminative model maximizing P(Expr. | Seq.)
• Sequence data helps explain expression patterns