CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling

Qing Zhou and Wing Wong • Slides by Qiaozhu Mei and Hong Cheng • Presented by Saurabh Sinha

Existing Methods • Existing motif discovery methods • Experimental methods • DNase footprinting and gel-mobility shift assay • Computational methods • EM algorithm • Gibbs sampler • Word enumeration • Dictionary model • A good number of useful TF motifs found, but still many important TF motifs unexplored.

Cis-regulatory Modules (CRMs) • Observation: Most eukaryotic genes are controlled by cis-regulatory modules (CRMs), each consisting of multiple TF-binding sites (TFBSs). • When no prior knowledge of the TFs is available, we must resort to de novo motif discovery algorithms.

CRM Discovery and Motif Estimation • Greater sensitivity and specificity can be achieved for motif discovery by considering the colocalization of different TFBSs • Search for modules and motifs simultaneously. • Module discovery and motif estimation are tightly coupled • Motif patterns and binding sites are essential for predicting regulatory modules; • Discovery of modules will greatly improve the performance of motif detection.

Method • Goal: search for binding sites for K different TFs within the CRMs of a given set of sequences • A Hierarchical Mixture (HMx) model to generate the sequence: • 1st level: the sequences are viewed as a mixture of CRMs, each of length l, and pure background sequences outside the modules; • 2nd level: each module is modeled as a mixture of motifs and within-module background. • Bayesian inference to estimate locations of modules, TFBSs and motif patterns based on the joint posterior distribution.

HMx Model as a Stochastic Process • Treat the HMx model as a stochastic machine that generates sequences. • Starting from the first sequence position, make a series of random decisions on whether to initiate a module of length l or generate a letter from the background model. • Inside a module, if a site for the kth motif is initiated at position n, generate wk letters from its PWM and place them at [n, n+wk-1]; otherwise generate a letter from the background. • After reaching the end of the current module, decide whether to sample from the background or initiate a new module.
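The generative process on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `generate_hmx` and its arguments are hypothetical, the background is zero-order (the slides use a first-order Markov chain), q0 is assumed to be the probability of emitting a within-module background letter, and a motif site that would run past the module end is replaced by a background letter.

```python
import random

ALPHABET = "ACGT"

def sample_letter(probs, rng):
    """Draw one nucleotide from a length-4 probability vector (order A, C, G, T)."""
    return rng.choices(ALPHABET, weights=probs, k=1)[0]

def generate_hmx(length, r, l, q, pwms, bg, rng):
    """Generate one sequence from a simplified HMx process.

    r    : probability of starting a module at a position outside modules
    l    : module length
    q    : [q0, q1, ..., qK]; q0 is assumed to mean within-module background
    pwms : list of K PWMs, each a list of w_k probability vectors
    bg   : zero-order background probabilities (simplifying assumption)
    Returns (sequence, module_starts, sites), with sites as (start, k) pairs.
    """
    seq, module_starts, sites = [], [], []
    while len(seq) < length:
        # First level: start a module (if it still fits) or emit background.
        if length - len(seq) >= l and rng.random() < r:
            module_starts.append(len(seq))
            end = len(seq) + l
            # Second level: inside the module, mix motif sites and background.
            while len(seq) < end:
                k = rng.choices(range(len(q)), weights=q, k=1)[0]
                if k > 0 and len(seq) + len(pwms[k - 1]) <= end:
                    sites.append((len(seq), k))
                    seq.extend(sample_letter(col, rng) for col in pwms[k - 1])
                else:
                    seq.append(sample_letter(bg, rng))
        else:
            seq.append(sample_letter(bg, rng))
    return "".join(seq), module_starts, sites
```

For example, `generate_hmx(300, 0.05, 20, [0.8, 0.1, 0.1], pwms, [0.25] * 4, random.Random(0))` returns a 300-letter sequence together with the module starts and motif sites that produced it, which is the hidden structure (M and A) the sampler later tries to recover.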

HMx Illustration (A) Unaligned motif sites (Consider motifs independently) (B) Aligned motif sites represented by a multinomial model (representation of a motif) (C) Cis-regulatory regions of coregulated genes (consider modules and motifs in a hierarchical manner)

Inside the Model • Data observed: S • S - set of sequences • Model variables: Ψ • θ0 - first-order Markov chain to generate background • θk - product multinomial parameters (PWM) for motif k • Θ = (θ0, θ1, θ2, …, θK) • r - probability of a module start • qk - probability of starting a site for motif k • q = (q0, q1, …, qK) • wk - width of motif k • W = (w1, w2, …, wK) • Hidden variables (missing data): M, A • M - indicators for module starts • S(M): sequence inside modules; S(Mc): background sequence outside modules • Ak - indicators for start positions of sites of motif k • A = (A0, A1, …, AK) • Model parameters: • l - length of modules • K - number of motifs

Inside the Model (cont.) • Under the HMx model, the complete sequence likelihood with M and A given is P(S | M, A, Ψ). • The joint posterior distribution is P(Ψ, M, A | S) ∝ P(S | M, A, Ψ) P(M, A | Ψ) π(Ψ). • Priors: • π(θk | wk): a Dirichlet distribution with parameter βk • π(q): a Dirichlet distribution with parameter α • π(wk): Poisson(w0) • π(r): Beta(a, b)

Bayesian Inference • Problem: how to estimate Ψ = (Θ, q, W, r) • Regard M and A as missing data and use the Gibbs sampler to perform Bayesian inference. • With a random initialization, the CisModule algorithm iteratively cycles through the steps of parameter update and module-motif detection. • Given the current modules and motif sites (M and A), update all the parameters: sample Ψ from the conditional distribution [Ψ | M, A, S] • Given the current values of the parameters, sample modules and motif sites from the conditional distribution [M, A | Ψ, S]

Sampling Ψ given M and A • Ψ = (Θ, q, r, W) = parameters of the model • Align the binding sites of each motif and calculate a PWM from them to get samples of Θ • q (motif transition probabilities) derived from the total number of sites of each motif • r (module probability) derived from the number of modules prescribed by M • W (motif widths) sampled by a Metropolis strategy
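The parameter update can be illustrated with conjugate draws: with Dirichlet and Beta priors, the conditional distributions of the PWM columns and of r are again Dirichlet and Beta. The sketch below is an assumption-laden illustration rather than the CisModule code: the helper names are hypothetical, a single pseudocount stands in for the Dirichlet prior parameter, and counting each background position outside modules as one "no module" decision for r is a simplification.

```python
import random

ALPHABET = "ACGT"

def sample_dirichlet(params, rng):
    """Draw from Dirichlet(params) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def sample_pwm(aligned_sites, pseudocount, rng):
    """Sample a PWM column by column from Dirichlet(pseudocount + counts).

    aligned_sites: equal-length strings, the current sites of one motif.
    """
    width = len(aligned_sites[0])
    pwm = []
    for j in range(width):
        counts = [sum(site[j] == c for site in aligned_sites) for c in ALPHABET]
        pwm.append(sample_dirichlet([pseudocount + n for n in counts], rng))
    return pwm

def sample_r(n_modules, n_background, a, b, rng):
    """Sample the module-start probability from Beta(a + starts, b + non-starts)."""
    return rng.betavariate(a + n_modules, b + n_background)
```

The motif-start probabilities q could be drawn the same way, with `sample_dirichlet` applied to the prior parameter plus the number of sites of each motif. The widths W are not covered here, since the slides sample them with a separate Metropolis step.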

Sampling M and A given Ψ • Need to sample (M, A) with probability Pr(M, A | Ψ, S) • Use “forward summation” to compute the required probabilities • Then use “backward sampling” to generate the module indicators (i.e., M) and the site indicators (i.e., A)

Forward Summation • Forward summation is a recursion over sequence positions; one of its terms is the probability of observing a length-l segment given that it is within a module.

Backward Sampling • Start from n = L. • At position n, decide whether position n • (i) is the last position of a module, or • (ii) is from the background. • The probabilities of these events are proportional to An(Ψ) and Bn(Ψ), respectively. • Depending on whether event (i) or (ii) is chosen, move to position n-l or n-1. • Repeat the binary decision process. • In this way, generate all the module indicators. • Then generate the motif indicators in a similar manner.
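The module-level forward summation and backward sampling can be sketched as follows. This is a hedged illustration, not the paper's exact recursion: the forms A[n] = r · f[n-l] · P(segment | module) and B[n] = (1-r) · f[n-1] · P(letter | background) are an assumed reading of the slides, the background is zero-order, and `module_prob` is a hypothetical caller-supplied function standing in for the second-level mixture over motif sites. Probabilities are kept in plain floats, so long sequences would need scaling or log-space arithmetic.

```python
import random

def forward_summation(seq, r, l, bg, module_prob):
    """Compute f[n], the probability of seq[:n] summed over module layouts.

    A[n]: position n is the last position of a module.
    B[n]: position n is a background letter.
    bg is a dict of zero-order background probabilities (assumption).
    """
    L = len(seq)
    f = [0.0] * (L + 1)
    A = [0.0] * (L + 1)
    B = [0.0] * (L + 1)
    f[0] = 1.0
    for n in range(1, L + 1):
        B[n] = (1.0 - r) * f[n - 1] * bg[seq[n - 1]]
        if n >= l:
            A[n] = r * f[n - l] * module_prob(seq[n - l:n])
        f[n] = A[n] + B[n]
    return f, A, B

def backward_sampling(A, B, l, rng):
    """Sample module start positions (0-based) by walking back from n = L."""
    starts = []
    n = len(A) - 1
    while n > 0:
        # Event (i) with probability A[n] / (A[n] + B[n]), else event (ii).
        if rng.random() < A[n] / (A[n] + B[n]):
            starts.append(n - l)
            n -= l
        else:
            n -= 1
    return sorted(starts)
```

Calling `forward_summation` once and `backward_sampling` repeatedly with the same A and B yields independent draws of the module indicators for fixed parameters. Motif indicators inside each sampled module would be drawn by an analogous forward-backward pass.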

Algorithm Illustration • Sample M, A from the conditional probability [M, A | Ψ, S]: given Ψ, how to decode the sequence? Two-phase sampling. • Sample Ψ from the conditional probability [Ψ | M, A, S]: given M and A, how to estimate Ψ? Alignments!

Results • Simulation Studies • Motifs: E2F, YY1 and c_MYC • Background sequences are generated by a first-order Markov chain. • Module Predictions • Total length, 2,009 and 4,108 bp on average, excess rates 0.5% and 2.7% • Coverage of true sites, 84.3% and 94.0% • Motif discovery • Comparison with MEME and BioProspector (BP) • Improvement over MEME and BP

Output Result • By using the samples from the joint posterior distribution, the model parameters can be estimated. • The top wk-mers that are most frequently sampled as sites for the kth motif are aligned as output sites. • The modules are inferred from the marginal posterior probability of each sequence position being sampled as within a module. • The positions where this probability is greater than 0.5 are output as modules.
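The module-calling rule above (report positions whose marginal posterior probability exceeds 0.5) can be sketched in Python. The function names and the representation of each Gibbs sample as a list of 0-based module start positions are assumptions made for this illustration.

```python
def module_posterior(samples, seq_len, l):
    """Estimate, per position, the fraction of samples that place it in a module.

    samples: one list of module start positions per Gibbs iteration.
    """
    counts = [0] * seq_len
    for starts in samples:
        for s in starts:
            for n in range(s, min(s + l, seq_len)):
                counts[n] += 1
    return [c / len(samples) for c in counts]

def call_modules(posterior, threshold=0.5):
    """Return half-open intervals (start, end) where posterior > threshold."""
    regions, start = [], None
    for n, p in enumerate(posterior):
        if p > threshold and start is None:
            start = n
        elif p <= threshold and start is not None:
            regions.append((start, n))
            start = None
    if start is not None:
        regions.append((start, len(posterior)))
    return regions
```

With four samples `[[2], [2], [3], []]`, a sequence of length 10 and l = 4, positions 3 to 5 are covered in three of four samples, so `call_modules` reports the single region `(3, 6)`. Position 2 sits exactly at 0.5 and is excluded by the strict inequality.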

Simulation Results

Homotypic Regulatory Modules in Drosophila • Motifs • Bicoid (Bcd), Hunchback (Hb) and Kruppel (Kr) • Results

Muscle-Specific Regulatory Regions • Motifs • Mef-2, TEF and SRF • Results

Discussion • HMx model • Captures the spatial correlation between different binding sites • CisModule • A Bayesian module sampler to infer the motif modules and the binding sites for a set of TFs • May be trapped in local modes. • Needs multiple trials. • Can use available prior information

Future Work • Incorporate information from comparative genomics into CisModule. • Greater prior probabilities for modules and sites can be assigned to regions that are highly conserved across species of appropriate evolutionary distances. • The HMx model captures the colocalization tendency of cooperating TFBSs but not their order or precise spacing. • Additional refinements to the model may improve its performance.
