Learning the cis regulatory code by predictive modeling of gene regulation (MEDUSA)

MEDUSA is a predictive modeling approach that learns the cis-regulatory code by identifying motifs and regulators that predict differential expression of target genes across different experimental conditions.

Presentation Transcript


  1. Learning the cis regulatory code by predictive modeling of gene regulation (MEDUSA). Christina Leslie, Center for Computational Learning Systems, Columbia University, NY, USA. http://www.cs.columbia.edu/compbio/medusa

  2. Transcriptional Regulation Nuclear membrane

  4. Transcriptional Regulation Nuclear membrane Binding site/motif CCG__CCG

  5. Transcriptional Regulation Nuclear membrane Binding site/motif CCG__CCG Genome-wide mRNA transcript data (e.g. microarrays)

  6. Transcriptional Regulation Learning problems: • Understand which regulators control which target genes • Discover motifs representing regulatory elements Nuclear membrane Binding site/motif CCG__CCG

  7. Previous work: Clustering • Cluster-first motif discovery • Cluster genes by expression profile, annotation, … to find potentially coregulated genes • Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …) (Spellman et al. 1998)

  8. Previous work: “Structure learning” • Graphical models (and other methods) • Learn structure of “regulatory network”, “regulatory modules”, etc. • Fit interpretable model to training data • Model small number of genes or clusters of genes • Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction (Pe’er et al. 2001) (Segal et al. 2003, 2004)

  9. Our work: “Predictive modeling” • MEDUSA = Motif Element Discrimination Using Sequence Agglomeration What is the prediction problem? • Predict up/down regulation of target genes under different experimental conditions Key ideas: • Learn motifs and identify regulators that predict differential expression in different contexts → mechanistic inputs • Obtain single model for all genes and all experiments: context-specific, no clusters, no parameter tuning • Accurate predictions on test data M. Middendorf, A. Kundaje, M. Shah, Y. Freund, C. Wiggins, C. Leslie. Motif Discovery through Predictive Modeling of Gene Regulation. RECOMB 2005.

  10. MEDUSA: Different view of training data Learn regulatory program that makes genome-wide, context-specific predictions for differential (up/down) expression of target genes

  11. MEDUSA – Set up Target gene analysis, important regulators TPK1, USV1, AFR1, XBP1, …

  12. Training data – Features regulator expression promoter sequence label feature vector
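The feature layout on this slide can be sketched in a few lines. This is a minimal illustration rather than the MEDUSA code: the function name `build_examples`, the dictionary-based inputs, and the toy data are assumptions made for the example, and skipping baseline (0) labels is likewise an assumption of the sketch. Each remaining (gene, experiment) pair becomes one example whose features are motif presence in the gene's promoter and the discretized regulator states in that experiment.

```python
from typing import Dict, List, Tuple


def build_examples(
    promoters: Dict[str, str],
    regulator_states: Dict[str, Dict[str, int]],
    labels: Dict[Tuple[str, str], int],
    motifs: List[str],
):
    """Return one (gene, experiment, motif_presence, regulator_state, label)
    tuple per (gene, experiment) pair with a nonzero label."""
    examples = []
    for (gene, experiment), label in sorted(labels.items()):
        if label == 0:  # assumption: baseline examples are not used for training
            continue
        presence = {m: m in promoters[gene] for m in motifs}
        examples.append(
            (gene, experiment, presence, regulator_states[experiment], label)
        )
    return examples


# Toy data (hypothetical): two promoters, two experiments, two candidate motifs.
promoters = {"geneA": "AGCTATGCCATCG", "geneB": "TTTACGGCAAATT"}
regulator_states = {
    "heat": {"USV1": 1, "TPK1": -1},
    "cold": {"USV1": 0, "TPK1": 1},
}
labels = {("geneA", "heat"): 1, ("geneB", "heat"): -1, ("geneA", "cold"): 0}
examples = build_examples(promoters, regulator_states, labels, ["AGCTATG", "GCAAATT"])
```

Keeping the promoter features (per gene) and the regulator features (per experiment) separate until a rule is evaluated avoids materializing every motif-regulator product up front.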

  13. Boosting (Freund & Schapire 1995)

  14. Boosting (Freund & Schapire 1995) distribution over training data

  15. Boosting (Freund & Schapire 1995) distribution over training data Minimize exponential loss function weak rule

  16. Boosting (Freund & Schapire 1995) distribution over training data weak rule updated weights
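The loop pictured on these boosting slides (distribution over training data, weak rule, updated weights) can be sketched with the textbook discrete AdaBoost update. The slides do not state which boosting variant MEDUSA uses, so the ±1 predictions and the function name `adaboost_round` are assumptions of this example.

```python
import math
from typing import List, Tuple


def adaboost_round(
    weights: List[float], labels: List[int], predictions: List[int]
) -> Tuple[float, List[float]]:
    """One discrete AdaBoost round with labels and predictions in {-1, +1}.

    Returns the weak rule's vote alpha and the renormalized example weights.
    """
    # Weighted error of the weak rule under the current distribution.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # Exponential-loss update: misclassified examples gain weight.
    updated = [
        w * math.exp(-alpha * y * h)
        for w, y, h in zip(weights, labels, predictions)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Four equally weighted examples; the weak rule gets the last one wrong.
alpha, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, -1], [1, 1, -1, 1]
)
```

After the update, the single misclassified example carries half of the total weight, which is what pushes the next weak rule to focus on it.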

  19. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT…

  20. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG

  21. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG, GCTATGC

  22. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG, GCTATGC, CTATGCC

  24. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG, GCTATGC, CTATGCC dimers (gapped elements) TTT_AAA

  25. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… k-mers (k≤7) AGCTATG, GCTATGC, CTATGCC dimers (gapped elements) TTT_AAA, GCTA_GCTA
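Enumerating the candidate sequence features from a promoter can be sketched as below. Only the bound k≤7 comes from the slides; the lower bound on k, the word length and gap range for dimers, and the function names are assumptions of this example. The sketch writes one underscore per gap position, so a two-base gap appears as `TTT__AAA`, whereas the slide's single underscore may stand for a gap of any length.

```python
from typing import Set


def enumerate_kmers(seq: str, k_min: int = 2, k_max: int = 7) -> Set[str]:
    """All distinct substrings of length k_min..k_max (k_max=7 as on the slide)."""
    return {
        seq[i:i + k]
        for k in range(k_min, k_max + 1)
        for i in range(len(seq) - k + 1)
    }


def enumerate_dimers(seq: str, word_len: int = 3, max_gap: int = 3) -> Set[str]:
    """Gapped elements: two words of word_len separated by 1..max_gap bases.

    word_len and max_gap are illustrative defaults, not values from the talk.
    """
    dimers = set()
    for gap in range(1, max_gap + 1):
        span = 2 * word_len + gap
        for i in range(len(seq) - span + 1):
            left = seq[i:i + word_len]
            right = seq[i + word_len + gap:i + span]
            dimers.add(left + "_" * gap + right)
    return dimers


# Fragment of the promoter shown on the slide.
promoter = "GCTACTTTATAAAGGG"
kmers = enumerate_kmers(promoter)
dimers = enumerate_dimers(promoter)
```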

  27. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… Regulator expression k-mers (k≤7) AGCTATG, GCTATGC, CTATGCC dimers (gapped elements) TTT_AAA, GCTA_GCTA Is AGCTATG present and USV1 up? Is AGCTATG present and USV1 down? Is GCTATGC present and USV1 up? Is GCTATGC present and TPK1 up? … try all motif-regulator pairs as weak rules …

  28. MEDUSA’s weak learner …AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT… Regulator expression k-mers (k≤7) AGCTATG, GCTATGC, CTATGCC dimers (gapped elements) TTT_AAA, GCTA_GCTA Is AGCTATG present and USV1 up? Is AGCTATG present and USV1 down? Is GCTATGC present and USV1 up? Is GCTATGC present and TPK1 up? … try all motif-regulator pairs as weak rules … Minimizes boosting loss: Is GCTATGC present and USV1 up?
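Trying all motif-regulator pairs and keeping the one that minimizes boosting loss can be sketched as follows. The slides only say that the chosen rule minimizes boosting loss; the formula used here, Z = W_abstain + 2·sqrt(W_pos·W_neg), is the standard loss for abstaining rules in confidence-rated boosting and serves as a stand-in. Treating a rule as abstaining when its condition is false, and the names `rule_loss` and `best_rule`, are assumptions of this example.

```python
import math
from itertools import product


def rule_loss(examples, weights, motif, regulator, state):
    """Boosting loss Z of the rule 'motif present AND regulator in state'.

    examples: list of (motif_presence dict, regulator_state dict, label).
    """
    w_pos = w_neg = w_abstain = 0.0
    for (presence, reg_state, label), w in zip(examples, weights):
        fires = presence.get(motif, False) and reg_state.get(regulator) == state
        if not fires:
            w_abstain += w  # rule says nothing about this example
        elif label > 0:
            w_pos += w
        else:
            w_neg += w
    return w_abstain + 2.0 * math.sqrt(w_pos * w_neg)


def best_rule(examples, weights, motifs, regulators):
    """Exhaustively score every (motif, regulator, up/down) candidate."""
    candidates = product(motifs, regulators, (+1, -1))
    return min(candidates, key=lambda c: rule_loss(examples, weights, *c))


# Toy data (hypothetical): four equally weighted (gene, experiment) examples.
examples = [
    ({"GCTATGC": True}, {"USV1": 1, "TPK1": -1}, +1),
    ({"GCTATGC": True}, {"USV1": 1, "TPK1": 1}, +1),
    ({"AGCTATG": True}, {"USV1": 1, "TPK1": 1}, -1),
    ({}, {"USV1": -1, "TPK1": 1}, -1),
]
weights = [0.25] * 4
chosen = best_rule(examples, weights, ["AGCTATG", "GCTATGC"], ["USV1", "TPK1"])
```

On this toy data the selected rule is "GCTATGC present and USV1 up", since it fires on two examples that share a label and so has the lowest loss of the eight candidates.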

  29. Hierarchical sequence agglomeration Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting loss

  30. Hierarchical sequence agglomeration Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting loss Agglomerate GCTATGC, GCAATGC, GGTATGC, CCTAAGC, GCTATTT … … GGTATGG PSSMs … …

  31. Hierarchical sequence agglomeration Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting loss Optimize over offsets when merging k-mers/PSSMs: - - GCTATGC GCTATTT - - GCTATGC, GCAATGC, GGTATGC, CCTAAGC, GCTATTT … … GGTATGG PSSMs … …

  32. Hierarchical sequence agglomeration Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting loss GCTATGC, GCAATGC, GGTATGC, CCTAAGC, GCTATTT … … GGTATGG PSSMs … …

  33. Hierarchical sequence agglomeration Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting loss Is [PSSM] present and USV1 up? Is [PSSM] present and USV1 up? Is [PSSM] present and USV1 up? … GCTATGC, GCAATGC, GGTATGC, CCTAAGC, GCTATTT … … GGTATGG PSSMs … …

  34. Hierarchical sequence agglomeration Is GCTATGC present and USV1 up? Is GCAATGC present and USV1 up? Is TCTATGC present and USV1 up? Is GCTTTGC present and USV1 up? … boosting loss minimize boosting loss → final weak rule Is [PSSM] present and USV1 up? Is [PSSM] present and USV1 up? Is [PSSM] present and USV1 up? … GCTATGC, GCAATGC, GGTATGC, CCTAAGC, GCTATTT … … GGTATGG PSSMs … …
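One merge step of the agglomeration, including the optimization over offsets, can be sketched as follows. The slides do not say how similarity between k-mers or PSSMs is measured, so counting identical aligned columns is a stand-in chosen for simplicity; the function names and the `max_shift` parameter are also assumptions of this example. The merged result is a list of per-column base counts, which is the raw material for a PSSM.

```python
from collections import Counter
from typing import List, Tuple


def align(a: str, b: str, offset: int) -> Tuple[str, str]:
    """Pad with '-' so that b starts `offset` columns after a (may be negative)."""
    if offset >= 0:
        b = "-" * offset + b
    else:
        a = "-" * (-offset) + a
    width = max(len(a), len(b))
    return a.ljust(width, "-"), b.ljust(width, "-")


def best_offset(a: str, b: str, max_shift: int = 2) -> int:
    """Offset in [-max_shift, max_shift] with the most identical columns."""
    def matches(offset: int) -> int:
        x, y = align(a, b, offset)
        return sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return max(range(-max_shift, max_shift + 1), key=matches)


def merge_to_counts(a: str, b: str, max_shift: int = 2) -> List[Counter]:
    """Merge two k-mers at their best offset into per-column base counts."""
    x, y = align(a, b, best_offset(a, b, max_shift))
    return [Counter(c for c in (p, q) if c != "-") for p, q in zip(x, y)]


# Two overlapping 7-mers from the weak-learner slides.
offset = best_offset("GCTATGC", "CTATGCC")
counts = merge_to_counts("GCTATGC", "CTATGCC")
```

Repeating this step on the closest remaining pair, with merged count matrices taking the place of single k-mers, would produce the hierarchy of candidate PSSMs that the slides then score by boosting loss.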

  35. MEDUSA strong rule • Combine weak rules into a tree structure • Alternating decision tree = margin-based generalization of decision trees [Freund & Mason 1999] • Lower nodes are conditionally dependent on higher nodes → can possibly reveal combinatorial interactions • Able to reveal motifs specific to subsets of target genes • Able to learn any Boolean function
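How an alternating decision tree turns weak rules into a prediction can be sketched as below: the score is the root value plus the values of every prediction node reached, and its sign gives the up/down call. The `Splitter` class, the example rule values, and the zero contribution for a false condition are assumptions of this illustration, not details taken from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Splitter:
    """A weak rule attached below a prediction node."""
    condition: Callable[[Dict], bool]
    value_if_true: float
    value_if_false: float = 0.0  # assumption: rule abstains when false
    children_if_true: List["Splitter"] = field(default_factory=list)
    children_if_false: List["Splitter"] = field(default_factory=list)


def adt_score(root_value: float, splitters: List[Splitter], example: Dict) -> float:
    """Sum the prediction values along every path the example satisfies."""
    score = root_value
    for s in splitters:
        if s.condition(example):
            score += s.value_if_true + adt_score(0.0, s.children_if_true, example)
        else:
            score += s.value_if_false + adt_score(0.0, s.children_if_false, example)
    return score


def rule(motif: str, regulator: str, state: int) -> Callable[[Dict], bool]:
    return lambda ex: motif in ex["motifs"] and ex["regs"].get(regulator) == state


# Hypothetical tree: the second rule is only evaluated when the first holds.
child = Splitter(rule("TTT_AAA", "TPK1", -1), value_if_true=-0.3)
tree = [
    Splitter(rule("GCTATGC", "USV1", +1), value_if_true=0.8, children_if_true=[child]),
    Splitter(rule("AGCTATG", "USV1", -1), value_if_true=-0.5),
]
example = {"motifs": {"GCTATGC", "TTT_AAA"}, "regs": {"USV1": 1, "TPK1": -1}}
score = adt_score(0.1, tree, example)
```

Because the child rule is reached only through its parent, a nonzero child value is evidence that the two motif-regulator conditions matter in combination, which is the conditional dependence the slide refers to.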

  36. Yeast Environmental Stress Response • Gasch et al. (2000) dataset, 173 microarrays, 13 environmental stresses • ~5500 target genes, 475 regulators (237 TF + 250 SM) • 500 bp upstream promoter sequences • Binning into +1/0/-1 expression levels based on wildtype vs. wildtype noise
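The binning step can be sketched as a symmetric threshold on log expression ratios. The slide says the cutoff is derived from wildtype vs. wildtype noise but gives no value, so the threshold is left as a parameter here and the number used in the usage line is purely hypothetical.

```python
from typing import List


def discretize(log_ratios: List[float], threshold: float) -> List[int]:
    """Map each log ratio to +1 (up), -1 (down), or 0 (within noise)."""
    return [
        1 if x > threshold else -1 if x < -threshold else 0
        for x in log_ratios
    ]


# threshold=1.0 is a placeholder, not the value used in the talk.
states = discretize([1.5, -0.2, -2.0, 0.9], threshold=1.0)
```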

  37. Statistical validation • 10-fold cross-validation (held-out experiments), ~60,000 (gene, experiment) training examples, 700 iterations • (N_k-mers + N_dimers + N_PSSMs) × N_reg × 2 ≈ 10^7 possible weak rules at every node • MEDUSA’s motifs give better prediction accuracy on held-out experiments than database motifs
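Cross-validation on held-out experiments means that folds are formed over experiments rather than over individual (gene, experiment) examples, so no test example shares an experiment with the training set. A minimal sketch, assuming examples are stored as (gene, experiment, label) tuples; the function names, the shuffling seed, and the placeholder experiment names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def experiment_folds(
    experiments: Sequence[str], n_folds: int = 10, seed: int = 0
) -> List[List[str]]:
    """Shuffle experiments and deal them into n_folds disjoint groups."""
    shuffled = list(experiments)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def split_examples(
    examples: Sequence[Tuple[str, str, int]], held_out: Sequence[str]
):
    """Partition (gene, experiment, label) examples by held-out experiments."""
    held = set(held_out)
    train = [ex for ex in examples if ex[1] not in held]
    test = [ex for ex in examples if ex[1] in held]
    return train, test


# 173 experiments as in the dataset slide; names are placeholders.
experiments = [f"exp{i}" for i in range(173)]
folds = experiment_folds(experiments)
examples = [("geneA", e, 1) for e in experiments] + [("geneB", e, -1) for e in experiments]
train, test = split_examples(examples, folds[0])
```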

  38. Yeast ESR: Biological Validation Universal stress repressor motif STRE element

  39. Yeast ESR: Biological Validation Important regulators identified by MEDUSA Cellular localization of MSN2/4 Segal et al. 2003 Universal stress repressor

  40. Visualizing MEDUSA motifs [Figure: motifs numbered 1, 2, 3, 5, 8, 14, 16; sequence labels AAATTT and TAAGGG]

  41. Biological validation – Context-specific analysis • Restrict regulatory program to particular target genes T, experimental conditions E → smaller model • Further statistical pruning of features using margin-based score: • Identify most significant context-specific regulators and motifs for target set

  42. Biological validation – Context-specific analysis • Example: oxygen sensing and regulation in yeast (collaborator: Li Zhang)

  43. Biological validation – Context-specific analysis • Example: oxygen and heme inducible targets

  44. Biological validation – Network inference • Regulator-motif associations in nodes can have different meanings: direct binding, indirect effect, co-occurrence • Need other data to confirm binding relationship between regulator and target (e.g. ChIP-chip) • Still, can determine statistically significant regulator-target relationships from regulation program [Figure: three diagrams labeled Direct binding, Indirect effect, Co-occurrence, with nodes P, Mp, TF, M]

  45. Biological validation – Network inference • Example: oxygen sensing and regulatory network

  46. Discussion: What does “predictive” mean? At least 2 usages: • Makes accurate quantitative predictions • Can assess predictions statistically, i.e. on test data • Gives us confidence that model contains biologically relevant information vs. • Generates biological hypotheses • Without statistical validation, can only evaluate quality of hypotheses through experiments • Issues: How much of model is correct? How many false positives? Is a network “edge” a meaningful prediction? (Cf. DREAM initiative)

  47. Discussion: “Predictive” modeling • “Manifesto” • We’re interested in hypothesis generation, but still must give statistical validation on test data, i.e. show that you’re not overfitting • Not enough to show that model is non-random, e.g. good p-values for functional enrichment • Possible goal: move towards making useful predictions for actual wet-lab experiments (e.g. fewer input variables in model) • MEDUSA: statistically predictive model, can still interpret to extract biological hypotheses

  48. Ongoing MEDUSA-related projects • Oxygen sensing and regulation in yeast (collaborator: Li Zhang, Public Health @ Columbia) • Regulation of and by microRNAs in humans (collaborators: Sander group, Sloan Kettering) • Sequence information controlling tissue-specific alternative splicing (collaborator: Larry Chasin, Biology @ Columbia) • Integration of phosphorylation (“kinome”) data to reconstruct signaling pathways • New Java MEDUSA software package – soon to be released http://www.cs.columbia.edu/compbio/medusa

  49. Thanks • Manuel Middendorf (Physics) • Anshul Kundaje (CS) • David Quigley (DBMI) • Steve Lianoglou (CS) • Xuejing Li (Physics) • Mihir Shah (CS) • Marta Arias (CCLS) • Chris Wiggins (APAM) • Yoav Freund (CS@UCSD) Funding: NIH (MAGNet NCBC grant)

  50. Visualizing MEDUSA motifs • Pruning based on feature dependence statistic:
