Regulatory element discovery for developmental time series • Computational Biology Program, Sloan-Kettering Institute, Memorial Sloan-Kettering Cancer Center • Joint work with Xuejing Li, Chris Wiggins, Valerie Reinke • Christina Leslie • http://cbio.mskcc.org
Regulatory networks in development • Reinke lab: genome-wide expression for C. elegans developmental time series + germ cell/gametogenesis mutants • Problem: decipher regulatory networks governing germline- and sex-regulated genes
Previous work: MEDUSA in yeast • Predict up/down expression of target genes from promoter + regulator expression • Learns from a set of mRNA expression experiments without clustering • Problem: high correlation of nearby time points, many regulator profiles
Sequence to expression profile • Can we learn a mapping from promoter sequence to the full expression trajectory (with some level of statistical significance)? • Retain some properties of MEDUSA: • No clustering of expression profiles • Learn motifs de novo from promoters by building them up from k-mers in the sequence: …AGCTATGCCATCGACTGCTCCA…
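The k-mer representation of a promoter can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the function names `kmer_counts` and `motif_matrix`, the choice k = 3, and the second toy sequence are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_counts(promoter, k):
    """Count every overlapping k-mer in one promoter sequence."""
    promoter = promoter.upper()
    return Counter(promoter[i:i + k] for i in range(len(promoter) - k + 1))


def motif_matrix(promoters, k):
    """Build a genes x k-mers count matrix (one row per promoter).

    Columns follow `vocab`, the list of all 4**k DNA k-mers.
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    rows = []
    for seq in promoters:
        counts = kmer_counts(seq, k)
        rows.append([counts[kmer] for kmer in vocab])
    return vocab, rows


# Toy input: the sequence fragment from the slide plus a made-up second promoter.
vocab, X = motif_matrix(["AGCTATGCCATCGACTGCTCCA", "GATAAGATAA"], k=3)
```

Each row of `X` is the motif vector for one gene; in practice the columns would be filtered before regression.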
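Regression problem • $X$ ($G \times M$): row $g$ is the motif vector (k-mer counts) for gene $g$ • $Y$ ($G \times E$): row $g$ is the expression profile for gene $g$ • Idea: learn latent factors $T = XW$ that “explain” $Y$ • Then regress $X \approx TP^{\top}$, $Y \approx TQ^{\top}$, or equivalently $Y \approx XB$ where $B = WQ^{\top}$ • Columns $w_i$ of $W$ = weight vectors; columns of $P$, $Q$ = loadings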
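To see why the latent-factor model amounts to a single linear map from motifs to expression, substitute the definition of the factors into the response model. This only restates the slide's symbols, under the assumption that $X$ is $G \times M$, $W$ is $M \times K$ and $Q$ is $E \times K$ for $K$ latent factors:

```latex
\[
Y \;\approx\; T Q^{\top} \;=\; (XW)\,Q^{\top} \;=\; X\,(W Q^{\top}) \;=\; X B,
\qquad B = W Q^{\top} \in \mathbb{R}^{M \times E}.
\]
```

With genes as rows, the coefficient matrix multiplies $X$ on the right, so each column of $B$ gives motif weights for one experiment.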
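First step: PLS regression • Sequentially build latent factors $t_i = Xw_i$: • Maximize covariance between factors and $Y$ • Constrain $t_1, \dots, t_K$ to be uncorrelated • SIMPLS, in the 1D case (single response $y$), for $i = 1, \dots, K$: $w_i = \arg\max_{w} \operatorname{Cov}(Xw, y)$ subject to $w^{\top} w = 1$ and $w^{\top} X^{\top} X w_j = 0$ for $j < i$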
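The sequential construction for a single response can be sketched in NumPy. Note the assumption: this uses the classical deflation-based PLS1 recursion rather than the SIMPLS algorithm named on the slide; the function name `pls1` and the toy data are illustrative only.

```python
import numpy as np


def pls1(X, y, n_factors):
    """Single-response PLS by deflation.

    Returns weight vectors W (columns), scores T (columns), and
    regression coefficients b for the centered predictors.
    """
    X = X - X.mean(axis=0)            # center predictors
    y = y - y.mean()                  # center response
    Xk = X.copy()
    W, T, P, q = [], [], [], []
    for _ in range(n_factors):
        w = Xk.T @ y                  # direction of maximal covariance with y
        w /= np.linalg.norm(w)        # unit-norm weight vector
        t = Xk @ w                    # latent factor (score vector)
        tt = t @ t
        p = Xk.T @ t / tt             # loading of X on this factor
        q.append(y @ t / tt)          # loading of y on this factor
        Xk = Xk - np.outer(t, p)      # deflate: later factors are uncorrelated with t
        W.append(w)
        T.append(t)
        P.append(p)
    W, T, P, q = np.array(W).T, np.array(T).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return W, T, b


# Toy data (random, for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 8))
y_toy = X_toy @ rng.normal(size=8) + 0.1 * rng.normal(size=50)
W, T, b = pls1(X_toy, y_toy, n_factors=3)
```

The columns of `T` come out mutually orthogonal, which is the "uncorrelated factors" constraint for centered data.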
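Equivalent formulation • Learn latent factors $t_i = Xw_i$ and $u_i = Yc_i$ for both predictor and response variables • $w_i$ (motif weight vector) and $c_i$ (expression weight vector) chosen to maximize $\operatorname{Cov}(t_i, u_i)$ • For $i = 1, \dots, K$: $(w_i, c_i) = \arg\max_{w, c} \operatorname{Cov}(Xw, Yc)$ subject to $w^{\top} w = 1$, $c^{\top} c = 1$, and $w^{\top} X^{\top} X w_j = 0$ for $j < i$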
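For the first factor, the covariance-maximizing pair of unit-norm weight vectors is given by the leading singular vectors of the cross-product matrix of the centered data. The sketch below illustrates that standard fact; the function name `first_weight_pair` and the random toy matrices are assumptions for the example, and later factors (which need the uncorrelatedness constraint) are not covered.

```python
import numpy as np


def first_weight_pair(X, Y):
    """Return unit vectors (w, c) maximizing Cov(Xw, Yc), and that covariance."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Leading singular triplet of Xc^T Yc solves max w^T Xc^T Yc c with |w| = |c| = 1.
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    w, c = U[:, 0], Vt[0]
    cov = s[0] / (X.shape[0] - 1)     # sample covariance of t = Xc w and u = Yc c
    return w, c, cov


# Toy data (random, for illustration only): 40 genes, 6 motifs, 4 time points.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(40, 6))
Y_toy = rng.normal(size=(40, 4))
w1, c1, cov1 = first_weight_pair(X_toy, Y_toy)
```

Here `w1` plays the role of the motif weight vector and `c1` the expression weight vector.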
Next steps: sparsity, graph Laplacian • For regularization and interpretability of weight vectors, want: • sparsity in $w$: most components equal to 0 • smoothness in $w$: define a graph on the set of k-mers, with an edge $k \sim l$ if the corresponding k-mers are close in Hamming distance
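The k-mer graph and its Laplacian can be sketched as below. The slide only says "close in Hamming distance", so the threshold `max_dist=1`, the function names, and the four example k-mers are assumptions. The smoothness penalty shown is the usual quadratic form $w^{\top} L w$, which equals the sum of $(w_k - w_l)^2$ over edges.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))


def kmer_laplacian(kmers, max_dist=1):
    """Graph Laplacian L = D - A, with an edge when Hamming distance <= max_dist."""
    n = len(kmers)
    L = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(kmers[i], kmers[j]) <= max_dist:
                L[i][j] = L[j][i] = -1   # adjacency entry (negated)
                L[i][i] += 1             # degree of node i
                L[j][j] += 1             # degree of node j
    return L


def smoothness_penalty(w, L):
    """Quadratic form w^T L w; small when neighboring k-mers have similar weights."""
    n = len(w)
    return sum(w[i] * L[i][j] * w[j] for i in range(n) for j in range(n))


kmers = ["GATAA", "GATAG", "TATAA", "ACGTG"]
L = kmer_laplacian(kmers)
```

A weight vector that is constant across a cluster of similar k-mers incurs no penalty, while one that singles out a k-mer from its neighbors does.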
Preliminary results: worm time series • Reinke data: ~9000 genes, 12 time points (3 replicates), wild type germline development • Gene sets, from mutant expression data: • Sperm genes: high expression in spermatogenesis • Oocyte genes: high expression in oogenesis • Motif matrix: filter k-mers based on expected counts
Standard PLS • 10-fold c.v. on held-out genes
Regularized PLS • 10-fold c.v. on held-out genes
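Cross-validation on held-out genes means the folds partition the rows (genes) of the data, so every gene is predicted by a model that never saw it. A minimal sketch of such a split follows; the function names, the fixed seed, and the toy gene count are assumptions, and the model fitting itself is left out.

```python
import random


def gene_folds(n_genes, n_folds=10, seed=0):
    """Shuffle gene indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_genes))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def train_test_splits(n_genes, n_folds=10, seed=0):
    """Yield (train, test) index lists; each fold is held out exactly once."""
    folds = gene_folds(n_genes, n_folds, seed)
    for f, test in enumerate(folds):
        train = [g for k, fold in enumerate(folds) if k != f for g in fold]
        yield train, test


# Toy size; the real data set has roughly 9000 genes.
splits = list(train_test_splits(95))
```

For each split, a model would be fit on the `train` genes and scored on the `test` genes.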
Regularized PLS • Sperm/oocyte gene sets: largest chi-square reduction for 3rd/1st latent factor
Interpretation of factor weights • To infer motifs relevant for an expression pattern: • Latent factors $t_i = Xw_i$ and $u_i = Yc_i$ for both predictor and response variables • $w_i$ and $c_i$ chosen to maximize $\operatorname{Cov}(t_i, u_i)$ • $c_i$ gives weights over time points: interpret as expression pattern • $w_i$ gives weights over motifs: highly weighted motifs are relevant for this expression pattern
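Reading motifs off a weight vector comes down to ranking k-mers by their weight. The sketch below ranks by absolute weight, which is an assumption (the slides only say "highly weighted"); the function name `top_motifs` and the toy weights are made up for illustration and are not results from the talk.

```python
def top_motifs(kmers, w, n=50):
    """Return the n (k-mer, weight) pairs with the largest absolute weight."""
    ranked = sorted(zip(kmers, w), key=lambda kw: abs(kw[1]), reverse=True)
    return ranked[:n]


# Toy weights, invented for the example.
example_kmers = ["GATAA", "ACGTG", "TTTTT", "CCCCC"]
example_w = [0.9, -0.7, 0.01, 0.0]
top2 = top_motifs(example_kmers, example_w, n=2)
```

The selected k-mers could then be connected by Hamming distance to look for clusters around a shared core.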
Sperm genes • c3 correlated with sperm gene expression, consistent with drop in chi-square
Motif graph for sperm genes • Top 50 k-mer graph for w3, clusters around GATAA (ELT-1) and ACGTG (bHLH)
Oocyte genes • Oocyte genes correlate with c1 pattern
Oocyte motif map • Top 50 k-mer graph for w1, log(p) vs weight
Some related work • Zhang et al, 2008: PCA in Y for motif discovery • Naughton et al, 2006: algorithmic motif search using graph representation • Beer and Tavazoie, 2004; Segal et al, 2002: sequence to expression via clustering