Regulatory element discovery for developmental time series

Christina Leslie • Computational Biology Program, Sloan-Kettering Institute, Memorial Sloan-Kettering Cancer Center • Joint work with Xuejing Li, Chris Wiggins, Valerie Reinke • http://cbio.mskcc.org

Regulatory networks in development • Reinke lab: genome-wide expression for C. elegans developmental time series + germ cell/gametogenesis mutants • Problem: decipher regulatory networks governing germline- and sex-regulated genes

Previous work: MEDUSA in yeast • Predict up/down expression of target genes from promoter + regulator expression • Learns from a set of mRNA expression experiments without clustering • Problem: high correlation of nearby time points, many regulator profiles

Sequence to expression profile • Can we learn a mapping from promoter sequence to the full expression trajectory (with some level of statistical significance)? • Retain some properties of MEDUSA: no clustering of expression profiles; learn motifs de novo from promoters by building them up from k-mers (e.g. …AGCTATGCCATCGACTGCTCCA…)
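As an illustration of the "motif vector (k-mer counts)" representation, here is a minimal sketch that counts overlapping k-mers in a promoter fragment. The function names and the choice k = 2 are hypothetical and chosen only to keep the example small; the slides do not state how k-mers were enumerated or which k was used.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def motif_vector(seq, k):
    """Return (vocabulary, counts) with k-mers in lexicographic order over ACGT."""
    counts = kmer_counts(seq, k)
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return vocab, [counts.get(kmer, 0) for kmer in vocab]


# Promoter fragment shown on the slide; k = 2 is an illustrative assumption.
promoter = "AGCTATGCCATCGACTGCTCCA"
vocab, x = motif_vector(promoter, 2)
```

Stacking one such vector per gene gives the rows of a gene-by-motif matrix. In practice one would probably also count the reverse complement, but that is an assumption beyond what the slides say.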

Regression problem • X: G × M motif matrix; row g is the motif vector (k-mer counts) for gene g • Y: G × E expression matrix; row g is the expression profile for gene g • Idea: learn latent factors T = XW that “explain” Y • Then regress X ≈ T P^T, Y ≈ T Q^T, or equivalently Y ≈ XB where B = W Q^T • Columns w_i of W = weight vectors; columns of P, Q = loadings

First step: PLS regression • Sequentially build latent factors t_i = X w_i: • Maximize covariance between factors and Y • Constrain t_1, …, t_K to be uncorrelated • SIMPLS: for i = 1, …, K (in the 1D case), choose w_i to maximize this covariance subject to these constraints
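For reference, the SIMPLS criterion for a single response vector y (the 1D case) is commonly written as below. This is the textbook form, assuming column-centred X and y and a unit-norm constraint on the weight vector; it is offered as a sketch and may differ in notation from the formula on the original slide.

```latex
\[
w_i \;=\; \arg\max_{w}\; \bigl(w^{T} X^{T} y\bigr)^{2}
\quad \text{subject to} \quad
w^{T} w = 1, \qquad
w^{T} X^{T} X\, w_j = 0 \;\; \text{for all } j < i .
\]
```

With t_i = X w_i, the second constraint says t_i^T t_j = 0, which is the requirement that the latent factors be uncorrelated.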

Equivalent formulation • Learn latent factors t_i = X w_i and u_i = Y c_i for both predictor and response variables • w_i and c_i chosen to maximize Cov(t_i, u_i) for i = 1, …, K, subject to constraints on w_i and c_i • w_i: motif weight vector • c_i: expression weight vector
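The covariance-maximizing pair of weight vectors can be sketched for the first factor as follows. Under the assumption that w and c are constrained to unit norm and that X and Y are column-centred, the maximizer of Cov(Xw, Yc) is the leading singular-vector pair of X^T Y, which the code finds by power iteration. The helper names and the toy matrices are hypothetical; this is not the authors' implementation, and later factors (which need deflation) are not covered.

```python
import math


def center_columns(A):
    """Subtract each column's mean so covariances reduce to inner products."""
    n = len(A)
    means = [sum(row[j] for row in A) / n for j in range(len(A[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in A]


def matvec(A, v):
    """Compute A v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def mat_t_vec(A, v):
    """Compute A^T v."""
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]


def unit(v):
    """Scale v to unit Euclidean norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def first_pls_pair(X, Y, n_iter=200):
    """Leading pair (w, c) maximizing Cov(Xw, Yc) with ||w|| = ||c|| = 1."""
    X, Y = center_columns(X), center_columns(Y)
    c = unit([1.0] * len(Y[0]))            # arbitrary starting direction
    for _ in range(n_iter):
        u = matvec(Y, c)                   # u = Y c
        w = unit(mat_t_vec(X, u))          # w proportional to X^T Y c
        t = matvec(X, w)                   # t = X w
        c = unit(mat_t_vec(Y, t))          # c proportional to Y^T X w
    return w, c, matvec(X, w), matvec(Y, c)


# Toy data (hypothetical): 4 genes, 3 motifs, 2 time points.
# The first motif count drives expression up at time 1 and down at time 2.
X = [[2, 0, 1], [0, 1, 1], [3, 1, 0], [1, 0, 0]]
Y = [[2.1, -1.9], [0.0, 0.1], [3.0, -3.1], [0.9, -1.0]]
w, c, t, u = first_pls_pair(X, Y)
cov_tu = sum(a * b for a, b in zip(t, u)) / (len(t) - 1)
```

On this toy input the first motif receives the largest absolute weight in w, and the two entries of c have opposite signs, mirroring the up/down pattern built into Y.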

Next steps: sparsity, graph Laplacian • For regularization and interpretability of weight vectors, want • sparsity in w: most components equal to 0 • smoothness in w: define a graph on the set of k-mers, with an edge k ~ l if the corresponding k-mers are close in Hamming distance
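The smoothness idea can be sketched as follows: connect k-mers within a small Hamming distance, form the graph Laplacian L, and use w^T L w, which equals the sum of (w_k - w_l)^2 over edges, as the penalty. The threshold of one mismatch, the unweighted edges, and the four example k-mers are assumptions made for illustration; the slides only say "close in Hamming distance".

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))


def kmer_graph_edges(kmers, max_dist=1):
    """Edges (i, j), i < j, between k-mers at Hamming distance <= max_dist."""
    return [(i, j) for i, j in combinations(range(len(kmers)), 2)
            if hamming(kmers[i], kmers[j]) <= max_dist]


def laplacian(n, edges):
    """Unweighted graph Laplacian L = D - A as a list of lists."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L


def smoothness_penalty(w, L):
    """Quadratic form w^T L w; small when neighbouring k-mers share weights."""
    n = len(w)
    return sum(w[i] * L[i][j] * w[j] for i in range(n) for j in range(n))


# Hypothetical k-mer set: three GATAA-like variants and one unrelated k-mer.
kmers = ["GATAA", "GATAG", "TATAA", "ACGTG"]
edges = kmer_graph_edges(kmers)
L = laplacian(len(kmers), edges)
smooth = smoothness_penalty([1.0, 1.0, 1.0, 0.0], L)   # neighbours agree
rough = smoothness_penalty([1.0, -1.0, 1.0, 0.0], L)   # one neighbour disagrees
```

A weight vector that assigns similar weights to near-identical k-mers incurs no penalty, while one that flips sign between neighbours is penalized, which is what encourages clusters of related k-mers to be selected together.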

Preliminary results: worm time series • Reinke data: ~9000 genes, 12 time points (3 replicates), wild-type germline development • Gene sets, from mutant expression data: • Sperm genes: high expression in spermatogenesis • Oocyte genes: high expression in oogenesis • Motif matrix: filter k-mers based on expected counts

Standard PLS • 10-fold cross-validation on held-out genes
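Cross-validation over genes, as opposed to over time points, can be sketched as below: genes are shuffled once, split into ten folds, and each fold is held out in turn. The function names, the fixed seed, and the gene identifiers are hypothetical; the slides do not describe how the folds were drawn.

```python
import random


def gene_folds(gene_ids, n_folds=10, seed=0):
    """Shuffle the genes once and deal them into n_folds disjoint folds."""
    ids = list(gene_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def train_test_splits(gene_ids, n_folds=10, seed=0):
    """Yield (train_genes, held_out_genes), holding out each fold once."""
    folds = gene_folds(gene_ids, n_folds, seed)
    for held_out in folds:
        held = set(held_out)
        train = [g for fold in folds for g in fold if g not in held]
        yield train, held_out


# Hypothetical gene identifiers, for illustration only.
genes = [f"gene{i}" for i in range(25)]
splits = list(train_test_splits(genes))
```

A model would be fit on the rows of X and Y for the training genes of each split and scored on the held-out rows, so every gene contributes to the test error exactly once.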

Regularized PLS • 10-fold cross-validation on held-out genes

Regularized PLS • Sperm/oocyte gene sets: largest chi-square reduction for 3rd/1st latent factor

Interpretation of factor weights • To infer motifs relevant for an expression pattern: • Latent factors t_i = X w_i and u_i = Y c_i for both predictor and response variables • w_i and c_i chosen to maximize Cov(t_i, u_i) • c_i gives weights over time points: interpret as an expression pattern • w_i gives weights over motifs: highly weighted motifs are relevant for this expression pattern

Sperm genes • c_3 correlated with sperm gene expression, consistent with drop in chi-square

Motif graph for sperm genes • Top 50 k-mer graph for w_3; clusters around GATAA (ELT-1) and ACGTG (bHLH)

Oocyte genes • Oocyte genes correlate with c_1 pattern

Oocyte motif map • Top 50 k-mer graph for w_1, log(p) vs weight

Some related work • Zhang et al., 2008: PCA in Y for motif discovery • Naughton et al., 2006: algorithmic motif search using graph representation • Beer and Tavazoie, 2004; Segal et al., 2002: sequence to expression via clustering
