Special Topics in Genomics: Cis-regulatory Modules and Phylogenetic Footprinting
Cis-regulatory Modules and Module Discovery. The slides for module discovery are provided by Prof. Qing Zhou @ UCLA.
Background: motif (weight matrix), motif discovery, mixture modeling
Difficulties in motif discovery in higher organisms
• Upstream sequences are longer.
• Motifs are less conserved and shorter.
• Background sequence structures are more complicated.
• To address these problems, build more biological knowledge into the model:
1) module structure
2) multiple-species conservation
Cis-regulatory module
• Combinatorial control of genes: cis-regulatory modules
CisModule: modeling module structure (Zhou and Wong, PNAS 2004)
• Module structure: consider co-localization of motif sites.
• Hierarchical mixture modeling. K: # of motifs
[Figure labels: Motif 1, Motif 2, Motif 3; B, M, S]
Parameters and missing data
• Missing data problem.
Given:
  K   # of motifs
  l   Module length
Observed data:
  S   Set of sequences
Missing data:
  M   Indicators for a module start
  A   Indicators for a motif site start
Parameters Ψ:
  Background model
  Weight matrices for motifs
  W   Motif widths
  r   Probability of a module start
  q   Probability of starting a motif site
Bayesian inference by posterior sampling
• Parameter update: given M and A,
1) Infer Θ from aligned sites.
2) Update r, q and W.
• Module-motif detection: given Θ, r, q, and W,
1) Sample modules.
2) Within each module, sample motif sites.
[Figure: sequence regions labeled M=0, M=1, M=0; aligned sites TTTGC, TATCC, CTTGC, TTTAC, GTTGC]
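To make the "infer Θ from aligned sites" step concrete, here is a minimal sketch that turns the five aligned sites shown on the slide into a weight matrix. It assumes a symmetric Dirichlet prior, so each entry is the posterior mean (count + pseudocount) / (N + 4 × pseudocount); the function name `weight_matrix` and the pseudocount value are illustrative choices, not taken from the slides. A full posterior sampler would draw Θ from the Dirichlet posterior rather than use its mean.

```python
from collections import Counter

def weight_matrix(sites, alphabet="ACGT", pseudocount=0.5):
    """Column-wise base frequencies of aligned sites (posterior mean under
    a symmetric Dirichlet prior; the pseudocount value is an assumption)."""
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must be aligned"
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for j in range(width):
        counts = Counter(s[j] for s in sites)  # missing bases count as 0
        matrix.append({b: (counts[b] + pseudocount) / total for b in alphabet})
    return matrix

# Aligned sites copied from the slide.
sites = ["TTTGC", "TATCC", "CTTGC", "TTTAC", "GTTGC"]
pwm = weight_matrix(sites)
for j, column in enumerate(pwm):
    print(j, {b: round(p, 3) for b, p in column.items()})
```

Each column of `pwm` sums to 1, and the pseudocount keeps unseen bases from getting probability zero.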
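Module sampling
• Want to sample from P(M | S, Ψ); need to calculate [formula not preserved].
• Denote [formula not preserved].
• Forward summation: [formula not preserved; the slide labels background and module terms].
Module sampling
• How to calculate [formula not preserved].
• Backward sampling.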
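The forward-summation and backward-sampling idea named on these two slides can be sketched as follows. The slides' own formulas are not available here, so the recursion below is an assumption: modules have fixed length l, each position independently starts a module with probability r, `bg_lik[i]` is the likelihood of base i under the background model, and `mod_lik[i]` is the likelihood of the segment starting at i under the module model. Both likelihood arrays are hypothetical precomputed inputs; in a real module model the module likelihood would presumably itself sum over motif-site placements.

```python
import random

def forward_sum(bg_lik, mod_lik, r, l):
    """F[n] = likelihood of the first n positions, summed over all
    non-overlapping module placements (assumed recursion)."""
    L = len(bg_lik)
    F = [0.0] * (L + 1)
    F[0] = 1.0
    for n in range(1, L + 1):
        # Position n-1 is background ...
        F[n] = (1 - r) * bg_lik[n - 1] * F[n - 1]
        # ... or a module of length l ends exactly at position n-1.
        if n >= l:
            F[n] += r * mod_lik[n - l] * F[n - l]
    return F

def backward_sample(F, bg_lik, mod_lik, r, l, rng):
    """Draw module-start indicators M from P(M | S, parameters),
    walking from the end of the sequence back to the start."""
    L = len(bg_lik)
    M = [0] * L
    n = L
    while n > 0:
        p_mod = r * mod_lik[n - l] * F[n - l] / F[n] if n >= l else 0.0
        if rng.random() < p_mod:
            M[n - l] = 1   # a module starts at n-l and covers n-l .. n-1
            n -= l
        else:
            n -= 1         # position n-1 is background
    return M

# Toy input in which only one segmentation has non-zero likelihood.
bg = [1.0, 0.0, 0.0, 1.0]
mod = [0.0, 1.0, 0.0]        # one entry per possible module start
F = forward_sum(bg, mod, r=0.5, l=2)
M = backward_sample(F, bg, mod, r=0.5, l=2, rng=random.Random(0))
print(F, M)
```

For sequences of realistic length the products underflow, so a practical version would work with scaled or log-space quantities.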
Posterior inference • Motif sites: marginal posterior probability of being a motif start position > 0.5. • Modules: marginal posterior probability of being within a module > 0.5.
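The 0.5 rule above can be applied directly to the sampler output. The sketch below assumes the sampler has stored one 0/1 indicator vector per iteration (for example, "position i is a motif start" or "position i is within a module"); the function names and the toy samples are illustrative.

```python
def marginal_posterior(samples):
    """Per-position fraction of posterior samples in which the indicator is 1."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def call_positions(samples, threshold=0.5):
    """Positions whose estimated marginal posterior probability exceeds the threshold."""
    return [i for i, p in enumerate(marginal_posterior(samples)) if p > threshold]

# Four hypothetical posterior samples over a length-4 sequence.
samples = [
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]
print(marginal_posterior(samples))  # [0.25, 0.75, 0.5, 0.0]
print(call_positions(samples))      # [1]
```

Position 2 has probability exactly 0.5 and is not called, because the rule on the slide is a strict inequality.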
Simulation study • Generate 30 data sets independently, each contains: 1) 20 sequences, each of length 1000; 2) 25 modules, with length 150; 3) each module contains 1 E2F site, 1 YY1 site, and 1 cMyc site.
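One simulated data set of this shape could be generated roughly as below. This is a sketch under several assumptions that are not in the slides: background bases are uniform i.i.d., modules sit on a 150-bp grid so that they never overlap, the three sites are planted in a fixed order (one per third of the module), and the motif strings are arbitrary placeholders rather than real E2F, YY1 or cMyc binding sites.

```python
import random

# Placeholder site strings; NOT the real E2F / YY1 / cMyc motifs.
placeholder_motifs = {"E2F": "AAAAAAAA", "YY1": "CCCCCCCC", "cMyc": "GGGGGGGG"}

def simulate_dataset(motifs, n_seq=20, seq_len=1000, n_modules=25,
                     module_len=150, seed=0):
    """Return (sequences, truth); truth lists (sequence index, site start, motif name)."""
    rng = random.Random(seed)
    seqs = [[rng.choice("ACGT") for _ in range(seq_len)] for _ in range(n_seq)]
    # Non-overlapping candidate module slots on a fixed grid (simplification).
    slots = [(s, k) for s in range(n_seq) for k in range(seq_len // module_len)]
    part = module_len // len(motifs)   # each motif gets its own sub-window
    truth = []
    for s, k in rng.sample(slots, n_modules):
        start = k * module_len
        for j, (name, site) in enumerate(motifs.items()):
            offset = start + j * part + rng.randrange(part - len(site) + 1)
            seqs[s][offset:offset + len(site)] = site   # plant the site
            truth.append((s, offset, name))
    return ["".join(x) for x in seqs], truth

seqs, truth = simulate_dataset(placeholder_motifs)
print(len(seqs), len(truth))
```

Repeating the call with 30 different seeds would give 30 independent data sets, matching the count on the slide.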
Example: Discovery of tissue-specific modules in Ciona
• The Sidow lab collected 21 genes that are co-expressed during the development of muscle tissue in Ciona.
• Want to find motifs and modules in the upstream sequences (average length = 1330 bp) of these genes.
• Found 3 motifs in 28 modules (4860 bp). Are they real motifs that determine gene expression?
Experimental validation • Positive element: the shortest sufficient, non-overlapping sequence that drives strong expression in muscle; average length 289 bp.
Experimental validation • 70% of our predicted motif sites are located in the positive elements!
Other tools • Gibbs Module Sampler (Thompson et al. Genome Res. 2004) • EMCMODULE (Gupta and Liu, PNAS, 2005)
Functional elements tend to be conserved across species. For example, exons are conserved because of selection pressure, whereas introns and intergenic regions are less likely to be conserved.
Phylogenetic footprinting (Miller et al., Annu. Rev. Genomics Hum. Genet. 2004)
Incorporating cross-species conservation into motif discovery
• A threshold method (Wasserman et al., Nature Genetics, 2000)
STEP 1: construct a cross-species alignment
STEP 2: compute a conservation measure from the alignment
STEP 3: filter out non-conserved regions
STEP 4: apply a Gibbs motif sampler to the conserved regions of the target genome
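Steps 2 and 3 of the threshold method can be illustrated with a small sketch. It assumes a pairwise alignment given as two equal-length gapped strings, uses the fraction of identical columns in a sliding window as the conservation measure, and masks target bases below a cutoff with `N`. The window size and the 0.7 cutoff are arbitrary illustrative values, not the ones used in the cited paper.

```python
def window_identity(target, other, window):
    """Fraction of identical, non-gap columns in a window centred on each column."""
    assert len(target) == len(other), "sequences must be aligned"
    match = [1 if a == b and a != "-" else 0 for a, b in zip(target, other)]
    half = window // 2
    scores = []
    for i in range(len(match)):
        lo, hi = max(0, i - half), min(len(match), i + half + 1)
        scores.append(sum(match[lo:hi]) / (hi - lo))
    return scores

def mask_nonconserved(target, other, window=3, threshold=0.7):
    """Replace target bases in poorly conserved windows with 'N';
    alignment gap columns in the target are dropped."""
    scores = window_identity(target, other, window)
    return "".join(c if s >= threshold else "N"
                   for c, s in zip(target, scores) if c != "-")

masked = mask_nonconserved("AAAAAAAAAA", "AAAAACCCCC")
print(masked)  # AAAANNNNNN
```

A motif sampler run on `masked` would then only see the conserved part of the target sequence, provided it is set up to skip `N` positions.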
Phylogenetic footprinting & motif discovery
• CompareProspector (Liu Y. et al., Genome Res. 2004)
STEP 1: construct a cross-species alignment
STEP 2: compute a conservation measure (window percent identity, WPID) from the alignment
STEP 3: multiply the likelihood ratio at each position by the corresponding WPID, so that the likelihood landscape is changed to favor conserved sites
STEP 4: apply a Gibbs motif sampler based algorithm
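The effect of step 3 can be seen on a toy example. The sketch below assumes the per-position motif/background likelihood ratios and the WPID values (as fractions between 0 and 1) have already been computed; the function name and the numbers are made up for illustration and this is not CompareProspector's actual code.

```python
def conservation_weighted_scores(likelihood_ratios, wpid):
    """Scale each position's likelihood ratio by its window percent identity."""
    assert len(likelihood_ratios) == len(wpid)
    return [lr * w for lr, w in zip(likelihood_ratios, wpid)]

# Two candidate sites with the same raw likelihood ratio, one weaker site.
likelihood_ratios = [4.0, 4.0, 1.0]
wpid = [0.2, 0.9, 1.0]   # position 0 lies in a poorly conserved window

weighted = conservation_weighted_scores(likelihood_ratios, wpid)
best = max(range(len(weighted)), key=weighted.__getitem__)
print(weighted, best)
```

Positions 0 and 1 are tied on sequence evidence alone; after weighting, the conserved position 1 is preferred, which is the sense in which the landscape is changed to favor conserved sites.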
Phylogenetic footprinting & motif discovery
• Evolutionary model based approaches
EMnEM (Moses et al. 2004)
PhyME (Sinha et al. 2004)
PhyloGibbs (Siddharthan et al. 2005)
Tree Sampler (Li and Wong, 2005)
Incorporating cross-species conservation into motif discovery
• PhyloCon (Wang and Stormo, Bioinformatics, 2003)
STEP 1: construct alignment among orthologous sequences;
STEP 2: convert conserved regions into profiles;
STEP 3: use profiles in the first sequence as seeds;
STEP 4: find matches of each seed in the second sequence;
STEP 5: update seeds;
STEP 6: repeat steps 2 and 3 for all sequences.
Phylogenetic footprinting & module discovery • Multimodule (Zhou and Wong, The Annals of Applied Statistics, 2007)
Multimodule • Module structure of each sequence is modeled by an HMM. • Couple HMMs via multiple alignment: Aligned states are coupled and collapsed into one common state. • Uncoupled states: similar to single species model. • Coupled states: evolutionary model.
Comparing with other methods • Three data sets with experimental validation reported previously, which contain 9 known motifs with 152 validated sites. • CompareProspector (Liu et al. 2004): conservation score • PhyloCon (Wang and Stormo 2003): progressive alignment of profiles • EMnEM (Moses et al. 2004): Phylogenetic motif discovery • CisModule (Zhou and Wong 2004): Single-species module discovery.
Comparing with other methods (# of known sites = 152)