A probabilistic method to detect regulatory modules
Saurabh Sinha et al., ISMB 2003
Presented by Tian Xia

Outline
• Objective
  - To detect regulatory modules (clusters of binding sites)
• Motivation
  - Discovery of modules is crucial for understanding the connection between genes and organism diversity
• Method
  - HMM and an EM algorithm, given PWMs of a set of transcription factors known to work together
• New features
  - Take motif correlations into account
  - Use phylogenetic comparisons to highlight a module

Basics
• Objective: module detection
[Figure: schematic labeled GENE, PROTEIN, Binding Sites, MODULE]

Basics
• Objective: module detection
• Binding sites
  - are short
  - are similar to each other
  - have some variability
• Different transcription factors have different-looking binding sites ("motifs")
[Figure: example binding sites near a GENE: ACAGTGA, AGAGTGA, ACAGAGA, ACCCGTT, ACCGGTT]

Basics
• Operation
  - Score one DNA sequence S with a set W of motifs.
  - Proceed from one end to the other, checking successive sequence "windows" of a fixed length L.
  - Each window is scored. Output the series of scores.
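
The window-by-window operation above can be sketched as follows. This is only an illustration: the function names, the `step` parameter and the placeholder scorer `gc_fraction` are assumptions made for the example, and the placeholder is not the HMM-based score described in the later slides.

```python
from typing import Callable, List, Tuple


def score_windows(seq: str, window_len: int, step: int,
                  score_fn: Callable[[str], float]) -> List[Tuple[int, float]]:
    """Return (start, score) for each window of length `window_len` in `seq`."""
    if window_len <= 0 or step <= 0:
        raise ValueError("window_len and step must be positive")
    scores = []
    # Slide from one end of the sequence to the other.
    for start in range(0, len(seq) - window_len + 1, step):
        window = seq[start:start + window_len]
        scores.append((start, score_fn(window)))
    return scores


def gc_fraction(window: str) -> float:
    # Placeholder scorer for illustration only (hypothetical, not from the slides).
    return sum(base in "GC" for base in window) / len(window)


if __name__ == "__main__":
    for start, score in score_windows("ACAGTGACCGGTT", 5, 2, gc_fraction):
        print(start, round(score, 2))
```

Any function that maps a window to a number can be passed as `score_fn`, so a likelihood-ratio scorer could be plugged in without changing the loop.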

Algorithm: HMM0
• Assumption
  - Sequence S is generated by an HMM
• Model parameter
  - Given:
    - a set of motifs W (their PWMs)
    - a background motif $w_b$ (length 1; the sampling probability of a base depends upon the previous k bases in the sequence)
  - Hidden:
    - Transition probabilities $\{p_i\}$
    - $\Pr(w_j \to w_i) = p_i$ (independent of $w_j$)

Algorithm: HMM0
• The Process
  - At each step, choose either a $w_i$ from W, or the background motif $w_b$
  - The choice is dictated by $\{p_i\}$
  - Once a motif w is chosen, sample a sequence from the PWM of w and append it at the end of S
  - Proceed to the next step
  - Stop when the length of S reaches L
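
The generative process above can be simulated with a short sketch. The toy PWM `m1`, the transition values in `TRANS`, and the use of an order-0 background (k = 0) are assumptions chosen for illustration. The sketch also truncates the last motif if it overshoots L, which is a simplification the slides do not specify.

```python
import random
from typing import Dict, List, Tuple

PWM = List[Dict[str, float]]  # one {base: probability} dict per motif column

# Hypothetical toy parameters, not taken from the slides.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
MOTIFS: Dict[str, PWM] = {
    "bg": [UNIFORM],  # background motif w_b of length 1 (order k = 0 assumed)
    "m1": [{"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03},
           {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
           {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}],
}
TRANS = {"bg": 0.9, "m1": 0.1}  # transition probabilities p_i


def sample_from_pwm(pwm: PWM, rng: random.Random) -> str:
    """Sample one base per column of the PWM."""
    return "".join(
        rng.choices(list(col), weights=list(col.values()))[0] for col in pwm
    )


def generate(motifs: Dict[str, PWM], trans: Dict[str, float], length: int,
             rng: random.Random) -> Tuple[str, List[str]]:
    """Sample a sequence S of the given length together with its parse T."""
    names = list(trans)
    weights = [trans[name] for name in names]
    seq, parse = "", []
    while len(seq) < length:
        # The choice depends only on p_i, not on the previously chosen motif.
        name = rng.choices(names, weights=weights)[0]
        seq += sample_from_pwm(motifs[name], rng)
        parse.append(name)
    return seq[:length], parse  # truncation is a simplification


if __name__ == "__main__":
    s, t = generate(MOTIFS, TRANS, 30, random.Random(0))
    print(s)
    print(t)
```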

Algorithm: HMM0
• Parse (T)
  - The sequence of motifs chosen in the successive steps of the process is called a parse
• $\Pr(S, T \mid \theta)$
  - For each parse $T$ of the sequence $S$
• $\Pr(S \mid \theta) = \sum_T \Pr(S, T \mid \theta)$
  - The probability that $S$ is generated by an HMM with parameter $\theta$

Algorithm: HMM0
• $\Pr(S \mid \theta) = \sum_T \Pr(S, T \mid \theta)$
  - The probability that $S$ is generated by an HMM with parameter $\theta$
• $\Pr(S \mid \theta_b)$
  - The probability that $S$ is generated using only the background motif $w_b$

Algorithm: HMM0
• The score of the sequence S
  - Log likelihood ratio
  - How likely it is that S differs from background
  - How likely it is that S is generated by an HMM

Algorithm: HMM0
• Score: $F(S, \theta) = \log \frac{\Pr(S \mid \theta)}{\Pr(S \mid \theta_b)}$
• Train the hidden parameters $\{p_i\}$ to maximize $F(S, \theta)$
  - Baum-Welch algorithm
  - Expectation-Maximization search (converges to a local maximum)
  - Dynamic programming to compute $\log \Pr(S \mid \theta)$
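
The dynamic-programming step can be sketched with a forward recursion over sequence positions: the likelihood of a prefix is the sum, over every motif that could end there, of the likelihood of the shorter prefix times the transition and emission probabilities. The sketch below assumes an order-0 background, fixed (already trained) transition probabilities, and parses that tile the window exactly. The function names are made up for this example, and Baum-Welch training itself is not shown.

```python
import math
from typing import Dict, List

PWM = List[Dict[str, float]]  # one {base: probability} dict per motif column


def emit_prob(pwm: PWM, site: str) -> float:
    """Probability of sampling `site` from the PWM, column by column."""
    p = 1.0
    for col, base in zip(pwm, site):
        p *= col[base]
    return p


def log_likelihood(seq: str, motifs: Dict[str, PWM],
                   trans: Dict[str, float]) -> float:
    """log Pr(S | theta): log of the sum over all parses T of Pr(S, T | theta)."""
    n = len(seq)
    log_f = [-math.inf] * (n + 1)  # log_f[i] covers the prefix seq[:i]
    log_f[0] = 0.0
    for end in range(1, n + 1):
        terms = []
        for name, pwm in motifs.items():
            start = end - len(pwm)
            if start < 0 or log_f[start] == -math.inf:
                continue
            p = trans[name] * emit_prob(pwm, seq[start:end])
            if p > 0.0:
                terms.append(log_f[start] + math.log(p))
        if terms:
            m = max(terms)  # log-sum-exp for numerical stability
            log_f[end] = m + math.log(sum(math.exp(t - m) for t in terms))
    return log_f[n]


def score(seq: str, motifs: Dict[str, PWM], trans: Dict[str, float],
          background_name: str = "bg") -> float:
    """Log likelihood ratio F(S, theta) against a background-only model."""
    bg_only = {background_name: motifs[background_name]}
    bg_trans = {background_name: 1.0}
    return log_likelihood(seq, motifs, trans) - log_likelihood(seq, bg_only, bg_trans)
```

Each position looks back once per motif, so the cost is proportional to the window length times the number of motifs (times the motif length for the emission product).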

Multiple Species
• Objective: module detection
• More data: genomes of multiple (closely related) species are available
[Figure: EVOLUTIONARY TREE and CONSERVED BLOCKS for the sequences below]
species1 GCGTGATCGAGCTATAACGGAA
species2 CTGTGATCGTCGGGTAACGCCC
species3 TGGTGATCGGAACCCCTAACGA
species4 AAGTGATCGATTATCCTAACGT

Multiple Species
• Conserved regions (evolved from a common ancestral sequence)
• Four binding sites that evolved from the same ancestral site
• $\Pr(\text{species1} \mid \theta)$ versus $\Pr(\text{species1}, \text{species2}, \text{species3}, \text{species4} \mid \theta)$
[Figure: species1 alone, and species1 to species4 with conserved regions]

Multiple Species
• One binding site, independent of the others
• $\Pr(\text{species1}, \text{species2}, \text{species3}, \text{species4} \mid \theta)$
[Figure: species1 to species4]

Multiple Species
• Look out for both kinds of binding sites; treat them appropriately in the model
[Figure: species1 to species4]

Multiple Species
• Sequences from multiple species provide extra information that can improve module detection
• Two steps
  - Identify conserved blocks by sequence alignment algorithms
  - Define a homologous window between species and score the window

Multiple Species
• Step 1: Identify conserved blocks by sequence alignment algorithms
  - Two species: Lagan (Brudno et al., 2003)
  - More than two: DiAlign (Morgenstern et al., 1998)

Multiple Species
• Step 2
  - Define a homologous window
  - For two species A and B, it includes
    - A set of non-overlapping subsequences $\{x_1, x_2, \ldots, x_k\}$ aligned with similar subsequences $\{y_1, y_2, \ldots, y_k\}$
    - All subsequences of $S_A$ outside of its aligned regions
    - All subsequences of $S_B$ between $y_i$ and $y_{i+1}$

Multiple Species
• Step 2
  - Score aligned blocks as a unit in the homologous window
  - Aligned block: derived from a common ancestor
  - Use the same weight matrix for the common ancestor and all descendants

Score an aligned block in the homologous window
[Figure: ancestor a evolving over time t into descendants d1 and d2]
• Generalize to a set of subsequences: $\Pr(s_1, s_2 \mid w)$
• Example: two species, sites $d_1$ and $d_2$ in a conserved block
• Short time limit ($t \approx 0$):
  - $a = d_1 = d_2$
  - $\Pr(d_1, d_2 \mid W) = \Pr(a \mid W)$
• Long time limit ($t \approx \infty$):
  - $\Pr(d_1, d_2 \mid W) = \Pr(d_1 \mid W) \Pr(d_2 \mid W)$
• Interpolate between these two limits

Score an aligned block in the homologous window
[Figure: ancestor a evolving over time t into descendants d1 and d2]
• $\Pr(d_1, d_2 \mid W) = \sum_a \Pr(a \mid W) \prod_{i \in \{1,2\}} \Pr(d_i \mid a, W, t)$
• $\Pr(d_i \mid a, W, t)$
  - depends on time $t$
  - depends on motif $W$
• More specifically
  - Use $\Pr(s_1, s_2 \mid w)$ just as $\Pr(s \mid w)$ in HMM0
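
The sum over ancestral bases can be made concrete with a small sketch. The slides do not give the form of $\Pr(d_i \mid a, W, t)$, so the substitution model here is an assumption: with probability `mu = exp(-t)` a descendant keeps the ancestral base, and otherwise its base is resampled from the same PWM column. Under that assumption the two limits stated on the slide are recovered. The function names are hypothetical.

```python
import math
from typing import Dict, List

PWM = List[Dict[str, float]]  # one {base: probability} dict per motif column


def column_prob(col: Dict[str, float], d1: str, d2: str, t: float) -> float:
    """Pr(d1, d2 | column, t), summing over the unknown ancestral base a."""
    mu = math.exp(-t)  # assumed probability that a descendant base is unchanged
    total = 0.0
    for a, p_a in col.items():
        # Assumed model: keep the ancestor's base, or resample from the column.
        p1 = mu * (d1 == a) + (1.0 - mu) * col[d1]
        p2 = mu * (d2 == a) + (1.0 - mu) * col[d2]
        total += p_a * p1 * p2
    return total


def block_prob(pwm: PWM, s1: str, s2: str, t: float) -> float:
    """Probability of two aligned sites under one weight matrix."""
    p = 1.0
    for col, d1, d2 in zip(pwm, s1, s2):
        p *= column_prob(col, d1, d2, t)
    return p
```

At `t = 0` two identical bases score like a single site and mismatched bases score zero; for large `t` the value approaches the product of the two independent site probabilities.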

Multiple Species
• Step 2
  - Score an aligned block in the homologous window
  - Score an unaligned subsequence in the homologous window
    - Use HMM0
  - Sum over all aligned blocks and unaligned regions

Multiple Species
• Results
  - Comparison of the discrimination of modules by SSPECIES and MSPECIES

Motif Correlation
• Motifs are correlated both in order and in spacing
• In HMM0, motifs are chosen independently: $\Pr(w_j \to w_i) = p_i$
• Add to $\theta$ a correlated transition probability $p_{ij}$
• The previous non-background motif placed is "remembered"

Motif Correlation
• "History-conscious HMM" (hcHMM)
• Model parameter $\theta$: include $p_{ij}$ for all pairs of motifs?
  - No (overfitting)
  - $p_{ij}$ is added to $\theta$ only if there is evidence for a correlation in occurrences of $w_i$ and $w_j$

Motif Correlation
• $p_{ij}$ is added to $\theta$ when $Z_{ij}$ and $E_{ij}$ are above some thresholds
• $A_{ij}(S)$: the average number of times $w_j$ follows $w_i$, over all parses of $S$
• $E_{ij}$ and $\sigma_{ij}$: expectation and standard deviation of the random variable $A_{ij}(X)$, over all sequences $X$ of length $L$
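
A rough sketch of this selection rule follows. The definition of $Z_{ij}$ is not legible on the slide, so the code assumes the usual z-score, $Z_{ij} = (A_{ij}(S) - E_{ij}) / \sigma_{ij}$. It also assumes that the expectation and standard deviation are estimated from values of $A_{ij}(X)$ computed on sampled sequences, which may differ from how the authors obtain them. The threshold names `z_min` and `e_min` are hypothetical.

```python
import statistics
from typing import Sequence


def is_correlated(a_obs: float, a_null: Sequence[float],
                  z_min: float, e_min: float) -> bool:
    """Decide whether p_ij should be added to the model parameters.

    a_obs  : A_ij(S) for the sequence being scored
    a_null : A_ij(X) values for sampled sequences X of the same length
    """
    e_ij = statistics.fmean(a_null)       # estimate of the expectation E_ij
    sigma_ij = statistics.pstdev(a_null)  # estimate of the standard deviation
    if sigma_ij == 0.0:
        return False  # no spread in the null values, so no z-score
    z_ij = (a_obs - e_ij) / sigma_ij      # assumed definition of Z_ij
    return z_ij > z_min and e_ij > e_min
```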

Motif Correlation
• If Corr(i, j) = true: $\Pr(i \to j) = p_{ij}$
• If Corr(i, j) = false:
• The model parameter $\theta$ now includes
  - $\{p_i\}$, $\{p_{ij}\}$, $W$
• Time complexity of each iteration of hcHMM training: $O(L|W|^2)$ vs. HMM0: $O(L|W|)$

Motif Correlation
• Results
  - Performance of Stubb (hcHMM) on gap gene upstream region

Motif Correlation
• Results
  - Advantage of hcHMM over HMM0 in detecting modules

Implementation Issues
• Stubb system
• Windows in the neighborhood of a high-scoring window are also high-scoring
  - Suppress all overlapping windows
• Strand bias
  - Pre-processing: counts in both directions
• Background motif
  - Context window
• Alignment computation for conserved blocks
  - Lagan & DiAlign
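
One way to suppress overlapping windows is a greedy pass that keeps the best-scoring window and discards any window overlapping one already kept. The slides do not say how Stubb does this, so the sketch below is an assumption, not the system's actual procedure, and the function name is hypothetical.

```python
from typing import List, Tuple


def suppress_overlaps(windows: List[Tuple[int, float]],
                      window_len: int) -> List[Tuple[int, float]]:
    """Keep high-scoring windows that do not overlap a better window.

    `windows` holds (start, score) pairs for windows of equal length.
    """
    kept: List[Tuple[int, float]] = []
    # Visit windows from best to worst score.
    for start, score in sorted(windows, key=lambda w: w[1], reverse=True):
        # Two equal-length windows overlap when their starts differ by < length.
        if all(abs(start - kept_start) >= window_len for kept_start, _ in kept):
            kept.append((start, score))
    return sorted(kept)  # report in sequence order
```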