Drosophila modENCODE Data Integration
Manolis Kellis, on behalf of the modENCODE Analysis Working Group (AWG) and the modENCODE Data Analysis Center (DAC)
Broad Institute of MIT and Harvard
MIT Computer Science & Artificial Intelligence Laboratory

mod/ENCODE (a.k.a. everything you wanted to know about gene regulation but were afraid to ask)

The challenge ahead
• Binding sites of every developmental regulator
• Sequence motifs for every regulator
• Annotations & images for all expression patterns (Dorsal-Ventral, Anterior-Posterior)
• Expression domain primitives reveal underlying logic
• Understand regulatory logic specifying development
[Figure labels: CTCF, check; GAF, check; Su(Hw), check; BEAF-32, variant; CP190, novel; Mod(mdg4), novel]

The components of genomes and gene regulation
• Goal: A systems-level understanding of genomes and gene regulation:
  • The regulators: TFs, GFs, miRNAs, their specificities
  • The regions: enhancers, promoters, insulators
  • The targets: individual regulatory motif instances
  • The grammars: combinations predictive of tissue-specific activity
  • The parts list = Building blocks of gene regulation
• Our tools: Comparative genomics & large-scale experimental datasets.
  • Evolutionary signatures for promoter/enhancer/3’UTR motif annotation
  • Chromatin signatures for integrating histone modification datasets
  • Sequence signatures associated with TF binding, chromatin, dynamics
  • Infer regulatory networks, their temporal and spatial dynamics
  • Integrate diverse datasets

Outline
• Annotate regulatory regions
  • Promoters, enhancers, insulators
• Annotate chromatin states
  • De novo learning of chromatin mark combinations
• Predict TF/Chromatin binding
  • Sequence -> TFs -> Chromatin -> Expression
• Infer regulatory networks
  • Integrate motifs, expression, chromatin
• Predictive models of gene expression
  • Chromatin/expression time-course
  • Embryo expression domains

Annotate Regulatory Regions
Promoters, enhancers, insulators

TFs and Chromatin together define enhancer regions
[Figure: panels "Enrichment in individual features" and "Combinations of features improve performance". Axes: fraction of predictions that are true (precision) vs. number of true enhancers recovered (recall). Legend: All features + conservation; All features; TFs + Remodeling; TFs; Chromatin marks + Remodeling factors]
• Evaluate predictive power of TFs/GFs/chromatin marks in recovery of known enhancers (REDfly, Furlong)
• Combinations across features show maximum performance
• New enhancers also supported by patterning, motifs
Rachel Sealfon, Chris Bristow

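The precision and recall axes used on this slide can be illustrated with a minimal sketch. The region identifiers and the helper name `precision_recall` are hypothetical, and the toy sets are made up for illustration; the slide does not specify how the actual evaluation was implemented. Note that a predicted region absent from the known set is counted here as a false positive, although it could be a genuinely novel enhancer.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for predicted vs. known enhancer IDs."""
    predicted, known = set(predicted), set(known)
    true_positives = len(predicted & known)
    # Precision: fraction of predictions that are known enhancers.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of known enhancers that were recovered.
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical toy data: four known enhancers, three predictions.
known_enhancers = {"enh1", "enh2", "enh3", "enh4"}
predicted_regions = {"enh1", "enh2", "novel1"}

precision, recall = precision_recall(predicted_regions, known_enhancers)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints: precision=0.67, recall=0.50
```

Sweeping a score threshold and recomputing both values at each cutoff would trace out a curve like the ones the figure compares across feature sets.
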
Chromatin marks reveal novel/refined promoters
[Figure: chromatin-based promoter prediction compared with previously annotated TSS; positive and negative datasets. Labels: Transcript support from multiple stages; No CAGE/RACE evidence]
• Chromatin-based annotation of active promoter regions
• Reveal microRNA precursors, lowly-expressed genes, alternate starts
• Reveal promoter regions even in absence of CAGE/RACE datasets
• Combine shape and intensity information from six chromatin marks, CBP, and PolII
• Predictions confirmed w/ TSS expression, even when CAGE/RACE data is missing
Chris Bristow

Annotate Chromatin States
De novo learning of mark combinations

De novo chromatin states from mark combinations
• Learn de novo significant combinations of chromatin marks
• Reveal functional elements, even without looking at sequence
• Use for genome annotation
• Use for studying regulation dynamics in different cell types
[Figure labels: Promoter states; Transcribed states; Active Intergenic; Repressed]
Jason Ernst

Each chromatin state associated w/ distinct function
[Figure: 20 different chromatin states with tentative annotations. Panels: Frequency of each chromatin mark; Annotation enrichments]
• Reveals several classes of promoters, enhancers
• Distinct marks in transcripts, exons/introns, 5’/3’ UTRs
• Distinguish inactive, repressed, heterochromatin
Jason Ernst, Gary Karpen

Positional enrichments of each chromatin state
Jason Ernst, Gary Karpen

Functional enrichments of different chromatin states
• Developmental patterning regulators enriched in specific states
• Different general factors associated with active/repressed states
• Insulator proteins associated with wide range of chromatin marks
• Replication origins associated with promoter/enhancer regions
• Specific regulatory motifs associated with enhancer/repressed regions
[Figure labels: DV regulators; AP regulators; General TFs; Insulators; Replication; Motifs]
Analysis: Jason Ernst, Pouya Kheradpour
Data: David MacAlpine, Kevin White, Gary Karpen

Predictive models of TF/Chromatin
Sequence -> TFs -> Chromatin -> Expression

Transcription Factor binding highly combinatorial
• Extensive cross-enrichment suggests cross-talk between motifs of different TFs
• Enriched and depleted motifs predictive of TF binding
• TF binding prediction increases with motif combinations
• Both synergistic and antagonistic effects
[Figure: panels "Transcription factor binding" and "Motif enrichment" (axis: fold enrichment). Motif sets compared: All above/below 1.5-fold; Top/bottom 5 motifs; All above 1.5-fold; Known motif; Top 5 motifs; Top motif; All motifs]
Pouya Kheradpour, Rachel Sealfon

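The fold-enrichment measure and the 1.5-fold cutoff mentioned on this slide can be sketched as follows. This is only an illustration under stated assumptions: the motif names and counts are invented, the function names are hypothetical, and treating "below 1.5-fold" as a fold enrichment of at most 1/1.5 is an interpretation rather than something the slide states.

```python
def fold_enrichment(hits_bound, n_bound, hits_background, n_background):
    """Ratio of motif frequency in bound regions to frequency in background."""
    if hits_background == 0 or n_bound == 0:
        raise ValueError("need non-zero background hits and bound regions")
    # Equivalent to (hits_bound / n_bound) / (hits_background / n_background).
    return (hits_bound * n_background) / (hits_background * n_bound)


def split_by_cutoff(folds, cutoff=1.5):
    """Split motifs into enriched (>= cutoff) and depleted (<= 1/cutoff)."""
    enriched = sorted(m for m, f in folds.items() if f >= cutoff)
    depleted = sorted(m for m, f in folds.items() if f <= 1.0 / cutoff)
    return enriched, depleted


# Hypothetical counts: (hits in bound, bound regions, hits in background,
# background regions) for three made-up motifs.
counts = {
    "motifA": (30, 100, 10, 100),
    "motifB": (5, 100, 10, 100),
    "motifC": (10, 100, 10, 100),
}
folds = {name: fold_enrichment(*c) for name, c in counts.items()}
enriched, depleted = split_by_cutoff(folds)
print(folds)
print("enriched:", enriched, "depleted:", depleted)
```

Both lists could then serve as features when predicting binding, in line with the slide's point that enriched and depleted motifs are each informative.
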
Combinations of TFs predictive of chromatin states
• Spatial clustering of TF combinations
• Compare to chromatin states (clusters of chromatin marks)
• TF sets and chromatin states are highly predictive of each other
[Figure annotations: AP-state 60-fold enriched in enhancers; Trx in enhancer states; Polycomb states enriched for enhancers; Ubiquitous genes enriched for multiple states; BEAF/Chro in TSS for ubiquitous genes; Strong Su(Hw) in Negative outside promoter states]
Jason Ernst, Chris Bristow

Chromatin strong predictor of expression state, not level
• Gene expression level distribution largely bimodal
• Predict presence/absence: chromatin marks in promoter region are a very strong predictor (AUC>0.98)
• Predict expression magnitude: only ~60% of variation explained by promoter marks
  • Many other levels of regulation
Peter Kharchenko, Peter Park

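For readers unfamiliar with the AUC figure quoted on this slide, here is a minimal sketch of how an area under the ROC curve can be computed from scores and expressed/not-expressed labels, using the pairwise-ranking definition. The scores and labels are invented toy values and the function name is hypothetical; this is not the analysis pipeline behind the reported number.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical promoter-mark scores and expressed (1) / silent (0) labels.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(scores, labels)
print(f"AUC = {toy_auc:.3f}")
# prints: AUC = 0.889
```

An AUC near 1 means expressed genes almost always receive higher scores than silent ones, which is a statement about the on/off state and says nothing about how well expression magnitude is predicted.
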
Inferring regulatory networks
Integrate motifs, expression, chromatin

3. Data integration for improved network prediction
[Figure labels: TF, Target]
• Input features used:
  • Conserved TF motif in target
  • ChIP binding of TF in target
  • TF/target co-chromatin marks
  • TF/target co-expression
• Training set:
  • Edges found in REDfly network
• Test set:
  • Cross-validation
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias

Integration improves precision and recall
[Figure: panels "Comparison of integration methods" and "Comparison of individual features". Annotations: ~10% recovery at ~40% precision; ~60% recovery at ~20% precision]
• Linear/logistic regression best, similar to each other -> use logistic regression
• Predictive power of individual features:
  • Best: Evolutionarily-conserved motifs
  • Next: chromatin time-course, ChIP-chip for TFs
  • Next: chromatin cell-lines, expression data (RNA-seq and microarrays)
• Conclusion: Experimental datasets together dramatically improve performance
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias

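The idea of combining several evidence types into one edge score with logistic regression can be sketched as below. Everything specific here is an assumption made for illustration: the four feature columns mirror the input features listed on the previous slide, but the toy values, labels, learning rate, and function names are invented, and the cross-validation step used in the actual analysis is omitted for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Edge probability; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], features)))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


# Hypothetical TF-target pairs. Columns: conserved motif (0/1), ChIP binding
# (0/1), co-chromatin score, co-expression score. Label 1 = known edge.
rows = [
    [1, 1, 0.8, 0.9], [1, 0, 0.6, 0.7], [1, 1, 0.7, 0.4], [0, 1, 0.5, 0.6],
    [0, 0, 0.2, 0.1], [0, 0, 0.4, 0.3], [1, 0, 0.1, 0.2], [0, 1, 0.3, 0.2],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_logistic(rows, labels)
strong = predict(weights, [1, 1, 0.9, 0.9])  # all evidence present
weak = predict(weights, [0, 0, 0.1, 0.1])    # little evidence
print(f"strong-evidence edge: {strong:.2f}, weak-evidence edge: {weak:.2f}")
```

Ranking all candidate TF-target pairs by this probability and thresholding at different cutoffs is one way to obtain recovery-versus-precision trade-offs like those annotated in the figure.
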
Initial regulatory network for an animal genome
• ChIP-grade quality
  • Similar functional enrichment
  • High sens.
  • High spec.
• Systems-level
  • 81% of Transc. Factors
  • 86% of microRNAs
  • 8k + 2k targets
  • 46k connections
• Lessons learned
  • Pre- and post- are correlated (hihi/lolo)
  • Regulators are heavily targeted, feedback loop
Pouya Kheradpour, Sushmita Roy, Alex Stark

Predictive models of gene regulation
Chromatin/expression time-course
Embryo expression domains

1. Chromatin time-course reveals stage regulators
[Figure: H3K27me3; axis: fold enrichment or over expression]
• abd-A motif is enriched in new H3K27me3 regions at L2
• Coincides with a drop in the expression of abd-A
• Model: sites gain H3K27me3 as abd-A binding lost
• Additional intriguing stories found, to be explored
Pouya Kheradpour

2. Predicting changes in time-series expression
• Integrate TF-target motif associations with time-course
• Predict positive/negative regulators at each split
• Notice: Adf1 targets appear positively then negatively regulated. Consistent with changes in Adf1 expression (not an input to model)
[Figure: expression time-series with splits annotated by predicted regulators. Regulator labels include Adf1, Trl, Vnd, Tin, Abd-A, Hmx, CG11085, CG34031, En, Mad, Grh, Btd, Abd-B, Ftz, Antp, E2F, Dref, Kr, exex, h, gt, sna, esg, byn, Inv, Twi, … Annotations: Adf1 activator is OFF (targets not induced); Adf1 activator is ON (targets induced)]
Jason Ernst

Predictive power of inferred network
• Predict target expression as linear comb of TFs, fit wi
• Future: can motif grammars predict weights directly?
[Figure: Target, Prediction, and Coefficients panels for Snail in the embryo, stages 4 to 6. Regulator labels: bap, en, Mef2, tin, twi; weights w0 to w5]
Charlie Frogner, Tom Morgan, Lorenzo Rosasco

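The "linear combination of TFs" model on this slide can be written as target = w0 + w1·TF1 + w2·TF2 + …, with the weights fitted to observed expression. A minimal sketch of such a fit by ordinary least squares follows. The choice of ordinary least squares, the two unnamed regulators, and the toy values are assumptions for illustration; the slide does not say which fitting procedure or regularization was used.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: regulators are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def fit_weights(tf_levels, target):
    """Least-squares weights [w0, w1, ...] via the normal equations."""
    rows = [[1.0] + list(levels) for levels in tf_levels]  # 1.0 carries w0
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical expression of two regulators at five embryo positions, and a
# target generated as 1.0 + 2.0 * tf_a - 1.0 * tf_b (no noise).
tf_levels = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
target = [1.0, 3.0, 0.0, 2.0, 4.0]

w0, w_a, w_b = fit_weights(tf_levels, target)
print(f"w0={w0:.2f}, w_a={w_a:.2f}, w_b={w_b:.2f}")
```

In this toy setup a positive weight plays the role of an activator and a negative weight the role of a repressor; with real image-derived expression data, noise and correlated regulators would likely call for regularization.
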
Additional examples: striped, changing coeffs
[Figure: Target, Prediction, and Coefficients panels in the embryo for two targets: Hunchback, stages 4 to 6, and slp1, stages 4 to 6. Regulator labels: sna, Adf1, Trl, twi, slp1, hb, Mef2, cad, bcd, prd, pan; weights w0 to w6]
Charlie Frogner, Tom Morgan, Lorenzo Rosasco

Drosophila modENCODE Analysis Group
Sue Celniker, Brenton Graveley, Steve Brenner, Michael Brent, Gary Karpen, Sarah Elgin, Mitzi Kuroda, Vince Pirrotta, Peter Park, Peter Kharchenko, Michael Tolstorukov, Eric Bishop, Kevin White, Casey Brown, Nicolas Negre, Nick Bild, Bob Grossman, Eric Lai, Nicolas Robine, David MacAlpine, Matthew Eaton, Steve Henikoff, Peter Bickel, Ben Brown, Lincoln Stein Group, Suzanna Lewis, Gos Micklem, Nicole Washington, EO Stinson, Marc Perry, Peter Ruzanov, Chris Bristow, Pouya Kheradpour, Rachel Sealfon, Jason Ernst, Mike Lin, Stefan Washietl
Networks group: Rogerio Candeias, Daniel Marbach, Patrick Meyer, Sushmita Roy
Image analysis: Tom Morgan, Charlie Frogner, Lorenzo Rosasco
Fly modENCODE AWG