
CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling

Qing Zhou and Wing Wong. Slides by Qiaozhu Mei and Hong Cheng. Presented by Saurabh Sinha.


Presentation Transcript


  1. CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling • Qing Zhou and Wing Wong • Slides by Qiaozhu Mei and Hong Cheng • Presented by Saurabh Sinha

  2. Existing Methods • Existing motif discovery methods • Experimental methods • DNase footprinting and gel-mobility shift assay • Computational methods • EM algorithm • Gibbs sampler • Word enumeration • Dictionary model • Many useful TF motifs have been found, but many important TF motifs remain undiscovered.

  3. Cis-regulatory Modules (CRMs) • Observation: Most eukaryotic genes are controlled by cis-regulatory modules (CRMs), each consisting of multiple TF-binding sites (TFBSs). • When no prior knowledge of the TFs is available, we must resort to de novo motif discovery algorithms.

  4. CRM Discovery and Motif Estimation • Greater sensitivity and specificity can be achieved for motif discovery by considering the colocalization of different TFBSs • Search for modules and motifs simultaneously. • Module discovery and motif estimation are tightly coupled • Motif patterns and binding sites are essential for predicting regulatory modules; • Discovery of modules greatly improves the performance of motif detection.

  5. Method • Goal: search for binding sites for K different TFs within the CRMs of a given set of sequences • A Hierarchical Mixture (HMx) model to generate the sequence: • 1st level: the sequences are viewed as a mixture of CRMs, each of length l, and pure background sequences outside the modules; • 2nd level: each module is modeled as a mixture of motifs and within-module background. • Bayesian inference to estimate locations of modules, TFBSs and motif patterns based on the joint posterior distribution.

  6. HMx Model as a Stochastic Process • Treat the HMx model as a stochastic machine that generates sequences. • Starting from the first sequence position, make a series of random decisions: either initiate a module of length l or generate a letter from the background model. • Inside a module, if a site for the kth motif is initiated at position n, generate w_k letters from its PWM and place them at [n, n+w_k-1]; otherwise generate a letter from the background. • After reaching the end of the current module, decide whether to sample from the background or initiate a new module.
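The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the argument layout, and the use of a zero-order (independent-letter) background are assumptions made for brevity (the slides specify a first-order Markov background), and a motif that would not fit inside the remaining module falls back to a background letter.

```python
import random

def generate_hmx_sequence(length, module_len, r, q, pwms, background, rng):
    """Generate one sequence from a simplified HMx-style process.

    r          -- probability of starting a module at a position
    q          -- q[0]: within-module background, q[k]: start a site of motif k
    pwms       -- pwms[k-1] is a list of columns; each column holds A,C,G,T probs
    background -- A,C,G,T probabilities (zero-order simplification, an assumption)
    """
    alphabet = "ACGT"
    seq, module_starts, sites = [], [], []
    n = 0
    while n < length:
        # First level: start a module (if it still fits) or emit background.
        if length - n >= module_len and rng.random() < r:
            module_starts.append(n)
            end = n + module_len
            while n < end:
                # Second level: choose a motif site or within-module background.
                k = rng.choices(range(len(q)), weights=q)[0]
                if k > 0 and n + len(pwms[k - 1]) <= end:
                    for column in pwms[k - 1]:
                        seq.append(rng.choices(alphabet, weights=column)[0])
                    sites.append((k, n))
                    n += len(pwms[k - 1])
                else:
                    seq.append(rng.choices(alphabet, weights=background)[0])
                    n += 1
        else:
            seq.append(rng.choices(alphabet, weights=background)[0])
            n += 1
    return "".join(seq), module_starts, sites
```

Calling it with a seeded `random.Random` instance makes the draw reproducible; the returned module starts and `(motif, position)` pairs play the role of the hidden indicators M and A.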

  7. HMx Illustration • (A) Unaligned motif sites (motifs considered independently) • (B) Aligned motif sites represented by a multinomial model (representation of a motif) • (C) Cis-regulatory regions of coregulated genes (modules and motifs considered hierarchically)

  8. Inside the Model • Observed data: S • S - set of sequences • Model variables: Ψ • θ_0 - first-order Markov chain that generates the background • θ_k - product multinomial parameters (PWM) for motif k • Θ = (θ_0, θ_1, θ_2, …, θ_K) • r - probability of a module start • q_k - probability of starting a site for motif k • q = (q_0, q_1, …, q_K) • w_k - width of motif k • W = (w_1, w_2, …, w_K) • Hidden variables (missing data): M, A • M - indicators for module starts • S(M): sequence inside modules; S(M^c): background sequence outside modules • A_k - indicators for start positions of sites of motif k • A = (A_0, A_1, …, A_K) • Model parameters: • l - length of modules • K - number of motifs

  9. Inside the Model (cont.) • Under the HMx model, the complete sequence likelihood with M and A given is: [equation not preserved in the transcript] • The joint posterior distribution is: [equation not preserved in the transcript] • Priors: • π(θ_k | w_k): a Dirichlet distribution with parameter β_k • π(q): a Dirichlet distribution with parameter α • π(w_k): Poisson(w_0) • π(r): Beta(a, b)
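The overall shape of the joint posterior follows from Bayes' rule and the priors listed on this slide. The expression below is a generic sketch of that factorization, not the slide's exact formula; Ψ = (Θ, q, W, r) denotes the full parameter set and Θ the background and PWM parameters, symbols chosen here for illustration.

```latex
P(\Psi, M, A \mid S) \;\propto\;
  \underbrace{P(S \mid M, A, \Theta, W)}_{\text{complete-data likelihood}}
  \; P(M \mid r) \; P(A \mid M, q, W)
  \; \pi(\Theta \mid W) \, \pi(q) \, \pi(W) \, \pi(r)
```

Each factor involves only a few of the unknowns, which is what makes cycling between sampling the parameters and sampling (M, A) practical.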

  10. Bayesian Inference • Problem: how to estimate Ψ = (Θ, q, W, r) • Treat M and A as missing data and use the Gibbs sampler to perform Bayesian inference. • Starting from a random initialization, CisModule iteratively cycles through two steps: parameter update and module-motif detection. • Given the current modules and motif sites (M and A), update all the parameters: sample Ψ from the conditional distribution [Ψ | M, A, S] • Given the current values of the parameters, sample modules and motif sites from the conditional distribution [M, A | Ψ, S]

  11. Sampling Ψ given M and A • Ψ = (Θ, q, r, W) = parameters of the model • Align the binding sites of each motif and calculate a PWM from the alignment to obtain samples of Θ • q (motif transition probabilities) is derived from the total number of sites of each motif • r (module probability) is derived from the number of modules prescribed by M • W (motif widths) is sampled by a Metropolis strategy
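The parameter-update step can be illustrated with standard conjugate updates. The sketch below assumes a symmetric Dirichlet pseudocount for each PWM column and a Beta(a, b) prior for r; the function names, the pseudocount value, and the count arguments are hypothetical, and the Metropolis update of the motif widths W is not shown.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_pwm(aligned_sites, rng, pseudocount=1.0):
    """Sample a PWM (one A,C,G,T distribution per column) from aligned sites."""
    alphabet = "ACGT"
    width = len(aligned_sites[0])
    pwm = []
    for j in range(width):
        # Count each base in column j of the alignment.
        counts = [sum(site[j] == base for site in aligned_sites) for base in alphabet]
        # Posterior for a column is Dirichlet(counts + prior pseudocounts).
        pwm.append(sample_dirichlet([c + pseudocount for c in counts], rng))
    return pwm

def sample_module_prob(n_module_starts, n_background_steps, rng, a=1.0, b=1.0):
    """Sample r from its Beta posterior given how often a module was started."""
    return rng.betavariate(a + n_module_starts, b + n_background_steps)
```

For example, `sample_pwm(["ACGT", "ACGA", "ACGT"], random.Random(1))` returns four columns, each a probability vector concentrated on the bases seen in that column of the alignment.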

  12. Sampling M and A given Ψ • Need to pick (M, A) with probability Pr(M, A | Ψ, S) • Use “forward summation” to compute the partial likelihoods A_n(Ψ) and B_n(Ψ) • Then use “backward sampling” to generate the module indicators (i.e., M) and the site indicators (i.e., A)

  13. Forward Summation • [recursion equations not preserved in the transcript] • The module term in the recursion is the probability of observing a length-l segment given that it is within a module.
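Since the recursion itself is not preserved in the transcript, the following is a hedged reconstruction of a forward summation over module and background decisions, written in log space. It assumes A_n is the probability of the first n letters with position n closing a module and B_n the same with position n drawn from the background; `log_bg` and `log_module` are hypothetical callables supplied by the caller (in the full model the module term would itself sum over motif-site placements), and the background is treated as zero-order for simplicity.

```python
import math

def log_add(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

def forward_summation(seq, module_len, r, log_bg, log_module):
    """Return (log_a, log_b), indexed 0..len(seq).

    log_a[n] -- log prob. of seq[:n] with position n the last letter of a module
    log_b[n] -- log prob. of seq[:n] with position n generated by the background
    """
    length = len(seq)
    log_a = [-math.inf] * (length + 1)
    log_b = [-math.inf] * (length + 1)
    log_f = [-math.inf] * (length + 1)  # log(A_n + B_n): prefix probability
    log_f[0] = 0.0
    for n in range(1, length + 1):
        # Background step: no module started, one background letter emitted.
        log_b[n] = log_f[n - 1] + math.log(1.0 - r) + log_bg(seq[n - 1])
        # Module step: a module of length module_len ends exactly at n.
        if n >= module_len:
            segment = seq[n - module_len:n]
            log_a[n] = log_f[n - module_len] + math.log(r) + log_module(segment)
        log_f[n] = log_add(log_a[n], log_b[n])
    return log_a, log_b
```

Working in log space avoids underflow on long sequences; `log_add(log_a[-1], log_b[-1])` gives the log-likelihood of the whole sequence under this simplified model.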

  14. Backward Sampling • Start from n = L, the last sequence position. • At position n, decide whether it • (i) is the last position of a module, or • (ii) is from the background. • The probabilities of these events are proportional to A_n(Ψ) and B_n(Ψ), respectively. • Depending on whether event (i) or (ii) is chosen, move to position n-l or n-1. • Repeat the binary decision process. • In this way, generate all the module indicators. • Then generate the motif indicators in a similar manner.
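The backward pass described on this slide can be sketched as follows. The function assumes two precomputed lists of log forward quantities, `log_a` and `log_b`, indexed from 0 to L (names and layout are assumptions for illustration), and it returns 0-based module start positions; sampling the motif indicators inside each module would follow the same pattern and is not shown.

```python
import math
import random

def backward_sample_modules(log_a, log_b, module_len, rng):
    """Sample module positions by walking backward from the last position.

    log_a[n] / log_b[n] are log forward quantities for "position n closes a
    module" and "position n is background", for n = 0..L.
    """
    n = len(log_a) - 1
    module_starts = []
    while n > 0:
        if log_a[n] == -math.inf:
            p_module = 0.0
        else:
            diff = log_b[n] - log_a[n]
            # P(module ends at n) = A_n / (A_n + B_n), guarded against overflow.
            p_module = 0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff))
        if rng.random() < p_module:
            module_starts.append(n - module_len)  # event (i): jump back l letters
            n -= module_len
        else:
            n -= 1                                # event (ii): one background letter
    return sorted(module_starts)
```

Repeated calls with the same forward quantities give different draws of M, which is what the Gibbs sampler needs at each iteration.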

  15. Algorithm Illustration • Sample M, A from the conditional distribution [M, A | Ψ, S]: given Ψ, how to decode the sequence? Two-phase sampling. • Sample Ψ from the conditional distribution [Ψ | M, A, S]: given M and A, how to estimate Ψ? Alignments!

  16. Results • Simulation Studies • Motifs: E2F, YY1 and c-MYC • Background sequences are generated by a first-order Markov chain. • Module Predictions • Total length: 2,009 and 4,108 bp on average; excess rates: 0.5% and 2.7% • Coverage of true sites: 84.3% and 94.0% • Motif discovery • Comparison with MEME and BioProspector (BP) • Improvement over MEME and BP

  17. Output Result • The model parameters can be estimated from the samples of the joint posterior distribution. • The top w_k-mers, i.e. those most frequently sampled as sites for the kth motif, are aligned as output sites. • The modules are inferred from the marginal posterior probability of each sequence position being sampled as within a module. • The positions where this probability is > 0.5 are output as modules.
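The module-calling rule on this slide (marginal posterior probability above 0.5) can be sketched directly. The input format, one 0/1 list per Gibbs sample marking which positions fell inside a module, and the function name are assumptions for illustration.

```python
def call_modules(samples, threshold=0.5):
    """Estimate per-position module probabilities and threshold them.

    samples -- list of equal-length 0/1 lists, one per posterior sample
    Returns (marginal, intervals) with intervals as half-open (start, end).
    """
    n_samples = len(samples)
    length = len(samples[0])
    # Marginal posterior: fraction of samples in which each position is in a module.
    marginal = [sum(s[i] for s in samples) / n_samples for i in range(length)]
    intervals = []
    start = None
    for i, p in enumerate(marginal):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, length))
    return marginal, intervals
```

With the three samples `[0, 1, 1, 0]`, `[0, 1, 1, 1]` and `[0, 0, 1, 1]`, positions 1 to 3 each exceed 0.5, so the single reported module is the half-open interval `(1, 4)`.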

  18. Simulation Results

  19. Homotypic Regulatory Modules in Drosophila • Motifs • Bicoid (Bcd), Hunchback (Hb) and Krüppel (Kr) • Results

  20. Muscle-Specific Regulatory Regions • Motifs • Mef-2, TEF and SRF • Results

  21. Discussion • HMx model • Captures the spatial correlation between different binding sites • CisModule • A Bayesian module sampler that infers the motif modules and the binding sites for a set of TFs • May be trapped in local modes. • Needs multiple trials. • Can use available prior information

  22. Future Work • Incorporate information from comparative genomics into CisModule. • Greater prior probabilities for modules and sites can be assigned to regions that are highly conserved across species of appropriate evolutionary distances. • The HMx model captures the colocalization tendency of cooperating TFBSs but not their order or precise spacing. • Additional refinements to the model may improve performance.
