A probabilistic method to detect regulatory modules

Saurabh Sinha et al., ISMB 2003. Presented by Tian Xia.

Outline • Objective • To detect regulatory modules (clusters of binding sites) • Motivation • Discovery of modules is crucial for understanding the connection between genes and organism diversity • Method • HMM and an EM algorithm, given the position weight matrices (PWMs) of a set of transcription factors known to work together • New features • Take motif correlations into account • Use phylogenetic comparisons to highlight a module

Basics • Objective: module detection [Figure: diagram labeled GENE, PROTEIN, Binding Sites, MODULE]

Basics • Objective: module detection • Binding sites are short, similar to each other, and show some variability • Different transcription factors have different-looking binding sites (“motifs”) [Figure: example binding-site sequences ACAGTGA, AGAGTGA, ACAGAGA, ACCCGTT, ACCGGTT placed around a GENE]

Basics • Operation • Score one DNA sequence S with a set W of motifs. • Proceed from one end to the other, check successive sequence “windows” of a fixed length L. • Each window is scored. Output the series of scores.
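The window scan described on this slide can be sketched as follows. This is a minimal illustration, not the Stubb implementation: the names `score_windows` and `gc_fraction` are hypothetical, the `shift` parameter is an assumption (the slide only says "successive" windows), and the GC-fraction score is a stand-in for the real window score.

```python
from typing import Callable, List, Tuple


def score_windows(
    seq: str,
    window_len: int,
    score_fn: Callable[[str], float],
    shift: int = 1,
) -> List[Tuple[int, float]]:
    """Score every window of length window_len, moving left to right."""
    scores = []
    for start in range(0, len(seq) - window_len + 1, shift):
        window = seq[start:start + window_len]
        scores.append((start, score_fn(window)))
    return scores


def gc_fraction(window: str) -> float:
    """Placeholder score (hypothetical): fraction of G or C bases."""
    return sum(base in "GC" for base in window) / len(window)


# One (start, score) pair per window; the list is the "series of scores".
window_scores = score_windows("ACGTGGCC", window_len=4, score_fn=gc_fraction)
```

Passing the scoring function as an argument keeps the scan independent of how a window is scored, so the placeholder can be swapped for a likelihood-based score.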

Algorithm: HMM0 • Assumption • The sequence $S$ is generated by an HMM • Model parameters • Given: • a set of motifs $W$ (their PWMs) • a background motif $w_b$ of length 1 (the sampling probability of a base depends on the previous $k$ bases in the sequence) • Hidden: • Transition probabilities $\{p_i\}$ • $\Pr(w_j \to w_i) = p_i$ (independent of $w_j$)
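A background motif whose base probabilities depend on the previous k bases is a k-th order Markov model. The sketch below shows one way such a model could be estimated from a training sequence. The function names and the pseudocount smoothing are assumptions made for illustration; the slides do not say how the background is estimated.

```python
from collections import defaultdict
from typing import Dict

BASES = "ACGT"


def count_contexts(training_seq: str, k: int) -> Dict[str, Dict[str, int]]:
    """Count how often each base follows each length-k context."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(training_seq)):
        counts[training_seq[i - k:i]][training_seq[i]] += 1
    return counts


def background_prob(counts, context: str, base: str, pseudocount: float = 1.0) -> float:
    """Pr(base | previous k bases), smoothed with a pseudocount (an assumption)."""
    seen = counts.get(context, {})
    total = sum(seen.values()) + pseudocount * len(BASES)
    return (seen.get(base, 0) + pseudocount) / total


counts = count_contexts("ACACAC", k=1)
p_c_after_a = background_prob(counts, "A", "C")  # (3 + 1) / (3 + 4)
```

The pseudocount keeps the probability of an unseen context uniform rather than undefined.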

Algorithm: HMM0 • The Process • At each step, choose either a motif $w_i$ from $W$ or the background motif $w_b$ • The choice is dictated by $\{p_i\}$ • Once a motif $w$ is chosen, sample a sequence from the PWM of $w$ and append it to the end of $S$ • Proceed to the next step • Stop when the length of $S$ reaches $L$
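The generative process above can be simulated directly. The following is a sketch under simplifying assumptions: the background is treated as a one-column PWM (order 0 rather than order k), a PWM is represented as a list of base-to-probability dictionaries, and the output is truncated if the last motif overshoots the target length. All names are hypothetical.

```python
import random

BASES = "ACGT"


def sample_from_pwm(pwm, rng):
    """Draw one site from a PWM given as a list of {base: prob} columns."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in pwm
    )


def generate(motifs, trans, length, rng):
    """Emit motifs chosen with probabilities trans until `length` is reached.

    Returns the sequence and the parse (the list of motif names chosen).
    """
    names = list(motifs)
    weights = [trans[name] for name in names]
    seq, parse = "", []
    while len(seq) < length:
        name = rng.choices(names, weights=weights)[0]
        seq += sample_from_pwm(motifs[name], rng)
        parse.append(name)
    return seq[:length], parse  # truncation is an assumption of this sketch


demo_motifs = {
    "bg": [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}],
    "m": [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
          {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}],
}
demo_seq, demo_parse = generate(demo_motifs, {"bg": 0.8, "m": 0.2}, 20, random.Random(0))
```

The returned parse is exactly the hidden information that the scoring step later sums over.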

Algorithm: HMM0 • Parse ($T$): • the sequence of motifs chosen in the successive steps of the process is called a parse • $\Pr(S, T \mid \theta)$ • The joint probability of the sequence $S$ and one parse $T$ of it • $\Pr(S \mid \theta) = \sum_T \Pr(S, T \mid \theta)$ • The probability that $S$ is generated by an HMM with parameters $\theta$

Algorithm: HMM0 • Pr (S | ) = T Pr (S, T | ) • The probability that S is generated by an HMM with parameter  • Pr (S |b ) • The probability that S is generated using only background motifwb

Algorithm: HMM0 • The score of the sequence $S$ • Log-likelihood ratio • How likely it is that $S$ differs from background • How likely it is that $S$ is generated by an HMM
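A log-likelihood ratio of this kind can be computed with a forward recursion that sums over all parses. The sketch below assumes an order-0 background stored as a one-column PWM, works with plain probabilities (so it would underflow on long windows), and writes the model parameters as a dictionary `trans` of transition probabilities. The function names are hypothetical.

```python
import math


def pwm_prob(pwm, site):
    """Probability of emitting `site` from a PWM (list of {base: prob} columns)."""
    p = 1.0
    for col, base in zip(pwm, site):
        p *= col[base]
    return p


def seq_prob(seq, motifs, trans):
    """Sum over all parses: f[i] is the probability of generating seq[:i]."""
    f = [0.0] * (len(seq) + 1)
    f[0] = 1.0
    for i in range(1, len(seq) + 1):
        for name, pwm in motifs.items():
            k = len(pwm)
            if k <= i:  # the motif must end exactly at position i
                f[i] += f[i - k] * trans[name] * pwm_prob(pwm, seq[i - k:i])
    return f[-1]


def llr_score(seq, motifs, trans, bg_name="bg"):
    """log of Pr(seq | full model) over Pr(seq | background only)."""
    p_model = seq_prob(seq, motifs, trans)
    p_background = seq_prob(seq, {bg_name: motifs[bg_name]}, {bg_name: 1.0})
    return math.log(p_model / p_background)


demo_motifs = {
    "bg": [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}],
    "m": [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
          {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}],
}
demo_trans = {"bg": 0.5, "m": 0.5}
score_with_site = llr_score("AC", demo_motifs, demo_trans)     # positive
score_without_site = llr_score("GG", demo_motifs, demo_trans)  # negative
```

For realistic window lengths the recursion would normally be carried out in log space or with per-position scaling.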

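Algorithm: HMM0 • Score: $F(S, \theta) = \log \frac{\Pr(S \mid \theta)}{\Pr(S \mid \theta_b)}$ • Train the hidden parameters $\{p_i\}$ to maximize $F(S, \theta)$ • Baum-Welch algorithm • Expectation-Maximization search (finds a local optimum) • Dynamic programming to compute $\log \Pr(S \mid \theta)$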
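One EM update of the transition probabilities, in the spirit of Baum-Welch, can be sketched with forward and backward tables. This is an illustration under assumptions, not the authors' code: the background is an order-0, one-column PWM, plain probabilities are used instead of log space, and the names `em_step` and `train` are hypothetical.

```python
def pwm_prob(pwm, site):
    """Probability of emitting `site` from a PWM (list of {base: prob} columns)."""
    p = 1.0
    for col, base in zip(pwm, site):
        p *= col[base]
    return p


def em_step(seq, motifs, trans):
    """Return (updated transition probabilities, likelihood under `trans`)."""
    n = len(seq)
    f = [0.0] * (n + 1)  # f[i]: probability of generating seq[:i]
    b = [0.0] * (n + 1)  # b[i]: probability of generating seq[i:]
    f[0], b[n] = 1.0, 1.0
    for i in range(1, n + 1):
        for name, pwm in motifs.items():
            k = len(pwm)
            if k <= i:
                f[i] += f[i - k] * trans[name] * pwm_prob(pwm, seq[i - k:i])
    for i in range(n - 1, -1, -1):
        for name, pwm in motifs.items():
            k = len(pwm)
            if i + k <= n:
                b[i] += trans[name] * pwm_prob(pwm, seq[i:i + k]) * b[i + k]
    # E-step: expected number of times each motif is used, over all parses.
    counts = {name: 0.0 for name in motifs}
    for i in range(n):
        for name, pwm in motifs.items():
            k = len(pwm)
            if i + k <= n:
                site_p = pwm_prob(pwm, seq[i:i + k])
                counts[name] += f[i] * trans[name] * site_p * b[i + k] / f[n]
    # M-step: renormalize the expected counts.
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}, f[n]


def train(seq, motifs, trans, iterations=50):
    """Repeat the EM update a fixed number of times (a simple stopping rule)."""
    for _ in range(iterations):
        trans, _ = em_step(seq, motifs, trans)
    return trans


demo_motifs = {
    "bg": [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}],
    "m": [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
          {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}],
}
trained = train("ACACAC", demo_motifs, {"bg": 0.5, "m": 0.5})
```

Each update cannot decrease the likelihood, but the search may stop at a local optimum, so the starting values of the transition probabilities matter.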

Multiple Species • Objective: module detection • More data: genomes of multiple, closely related species are available [Figure: four example sequences (species1 GCGTGATCGAGCTATAACGGAA, species2 CTGTGATCGTCGGGTAACGCCC, species3 TGGTGATCGGAACCCCTAACGA, species4 AAGTGATCGATTATCCTAACGT) shown with an EVOLUTIONARY TREE and the CONSERVED BLOCKS marked]

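Multiple Species • Conserved regions (evolved from a common ancestral sequence) • Four binding sites that evolved from the same ancestral site • Single-species probability: $\Pr(\text{species1} \mid \theta)$ • Joint probability over the four species: $\Pr(\text{species1}, \text{species2}, \text{species3}, \text{species4} \mid \theta)$ [Figure: sequences of species1 to species4 with the conserved regions marked]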

Multiple Species • One binding site, independent of the others • $\Pr(\text{species1}, \text{species2}, \text{species3}, \text{species4} \mid \theta)$ [Figure: sequences of species1 to species4 with a single unaligned binding site]

Multiple Species • Look out for both kinds of binding sites and treat them appropriately in the model [Figure: sequences of species1 to species4]

Multiple Species • Multiple species • The extra information can improve module detection • Two steps • Identify conserved blocks by sequence alignment algorithms • Define a homologous window between species and score the window

Multiple Species • Step 1 Identify conserved blocks by sequence alignment algorithms • Two species: Lagan (Brudno et al., 2003) • More than two: DiAlign (Morgenstern et al., 1998)

Multiple Species • Step 2 • Define a homologous window • For two species A and B, it includes • A set of non-overlapping subsequences $\{x_1, x_2, \ldots, x_k\}$ aligned with similar subsequences $\{y_1, y_2, \ldots, y_k\}$ • All subsequences of $S_A$ outside of its aligned regions • All subsequences of $S_B$ between $y_i$ and $y_{i+1}$

Multiple Species • Step 2 • Score aligned blocks as a unit in the homologous window • Aligned block: derived from a common ancestor • Use the same weight matrix for the common ancestor and all descendants

Score an aligned block in the homologous window • Generalize to a set of subsequences: $\Pr(s_1, s_2 \mid w)$ • Example: two species, sites $d_1$ and $d_2$ in a conserved block • Short time limit ($t \to 0$): • $a = d_1 = d_2$ • $\Pr(d_1, d_2 \mid W) = \Pr(a \mid W)$ • Long time limit ($t \to \infty$): • $\Pr(d_1, d_2 \mid W) = \Pr(d_1 \mid W) \, \Pr(d_2 \mid W)$ • Interpolate between these two limits [Figure: ancestral site a evolving over time t into sites d1 and d2]

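Score an aligned block in the homologous window • $\Pr(d_1, d_2 \mid W) = \sum_a \Pr(a \mid W) \prod_{i \in \{1,2\}} \Pr(d_i \mid a, W, t)$ • $\Pr(d_i \mid a, W, t)$ • depends on the time $t$ • depends on the motif $W$ • More specifically • Use the aligned-block probability just as $\Pr(s \mid w)$ is used in HMM0 [Figure: ancestral site a evolving over time t into sites d1 and d2]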
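The sum over ancestral sites can be evaluated column by column once a form for Pr(d_i | a, W, t) is chosen. The sketch below assumes a simple substitution model that is not taken from the slides: at each position the descendant keeps the ancestral base with probability `mu`, and otherwise draws a fresh base from the same PWM column. Under that assumption `mu = 1` plays the role of t near 0 and `mu = 0` the role of very large t. All names are hypothetical.

```python
BASES = "ACGT"


def column_joint(col, x1, x2, mu):
    """Pr(x1, x2 | column): sum over the ancestral base a."""
    total = 0.0
    for a in BASES:
        term = col[a]  # Pr(a | column)
        for x in (x1, x2):
            kept = mu if x == a else 0.0     # base inherited unchanged
            resampled = (1.0 - mu) * col[x]  # base redrawn from the column
            term *= kept + resampled
        total += term
    return total


def aligned_pair_prob(pwm, d1, d2, mu):
    """Joint probability of two aligned sites under one PWM."""
    p = 1.0
    for col, x1, x2 in zip(pwm, d1, d2):
        p *= column_joint(col, x1, x2, mu)
    return p


demo_col = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
same_at_zero_time = column_joint(demo_col, "A", "A", mu=1.0)     # Pr(a | W) limit
independent_limit = column_joint(demo_col, "A", "C", mu=0.0)     # product limit
intermediate = column_joint(demo_col, "A", "A", mu=0.5)          # in between
```

With this choice the two limiting cases on the slide are recovered exactly, and intermediate values of `mu` interpolate between them.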

Multiple Species • Step 2 • Score an aligned block in the homologous window • Score an unaligned subsequence in the homologous window • Use HMM0 • Sum over all aligned blocks and unaligned regions

Multiple Species • Results • Comparison of the discrimination of modules by SSPECIES and MSPECIES

Motif Correlation • Motifs are correlated both in order and in spacing • In HMM0, motifs are chosen independently: $\Pr(w_j \to w_i) = p_i$ • Add to $\theta$ a correlated transition probability $p_{ij}$ • The previous non-background motif placed is ‘remembered’

Motif Correlation • ‘History-conscious HMM’ (hcHMM) • Model parameter $\theta$: include $p_{ij}$ for all pairs of motifs? • No (overfitting) • $p_{ij}$ is added to $\theta$ only if there is evidence for a correlation in the occurrences of $w_i$ and $w_j$

Motif Correlation • $p_{ij}$ is added to $\theta$ when $Z_{ij}$ and $E_{ij}$ are above some thresholds • $A_{ij}(S)$: the average number of times $w_j$ follows $w_i$, over all parses of $S$ • $E_{ij}$ and $\sigma_{ij}$: expectation and standard deviation of the random variable $A_{ij}(X)$, over all sequences $X$ of length $L$
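The threshold test can be sketched as a z-score check. The slide does not define Z_ij, so the sketch assumes the usual form, (A_ij(S) minus its expectation) divided by its standard deviation. It also assumes the expectation and standard deviation are estimated from A_ij values computed on a sample of random sequences. The function name and both threshold values are hypothetical.

```python
import statistics
from typing import Sequence


def is_correlated(
    a_ij_observed: float,
    a_ij_random: Sequence[float],
    z_threshold: float = 2.0,  # hypothetical threshold
    e_threshold: float = 1.0,  # hypothetical threshold
) -> bool:
    """Decide whether to add a correlated transition for the pair (i, j)."""
    e_ij = statistics.mean(a_ij_random)
    sigma_ij = statistics.stdev(a_ij_random)  # sample standard deviation
    if sigma_ij == 0:
        return False  # no spread in the null sample, z-score undefined
    z_ij = (a_ij_observed - e_ij) / sigma_ij
    return z_ij > z_threshold and e_ij > e_threshold


# Null sample with mean 2 and standard deviation 1.
strong_pair = is_correlated(5.0, [1.0, 2.0, 3.0])  # z = 3
weak_pair = is_correlated(3.0, [1.0, 2.0, 3.0])    # z = 1
```

Requiring both conditions guards against pairs that look significant only because they are expected to occur very rarely.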

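Motif Correlation • If $\mathrm{Corr}(i, j)$ = true: $\Pr(i \to j) = p_{ij}$ • If $\mathrm{Corr}(i, j)$ = false: [formula not preserved in the transcript] • Now, the model parameter $\theta$ includes • $\{p_i\}$, $\{p_{ij}\}$, $W$ • Time complexity of each iteration of hcHMM training: $O(L|W|^2)$ vs. HMM0: $O(L|W|)$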

Motif Correlations • Results • Performance of Stubb (hcHMM) on gap gene upstream region

Motif Correlation • Results • advantage of hcHMM over HMM0 in detecting modules

Implementation Issues • Stubb system • Windows in the neighborhood of a high-scoring window are also high-scoring • Suppress all overlapping windows • Strand bias • Pre-processing: counts in both directions • Background motif • Context window • Alignment computation for conserved blocks • Lagan & DiAlign
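Suppressing overlapping windows can be done greedily: keep the best-scoring window, drop every window that overlaps it, and repeat. The sketch below is one plausible reading of that bullet, not a description of how Stubb does it. The function name is hypothetical and windows are represented as (start, score) pairs of a fixed length.

```python
from typing import List, Tuple


def suppress_overlaps(
    scored: List[Tuple[int, float]], window_len: int
) -> List[Tuple[int, float]]:
    """Keep high-scoring windows that do not overlap an already kept window."""
    kept: List[Tuple[int, float]] = []
    for start, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        # Two windows of equal length overlap when their starts are closer
        # than the window length.
        if all(abs(start - kept_start) >= window_len for kept_start, _ in kept):
            kept.append((start, score))
    return sorted(kept)  # report in sequence order


peaks = suppress_overlaps(
    [(0, 1.0), (2, 5.0), (4, 3.0), (10, 2.0)], window_len=5
)
```

Here the windows starting at 0 and 4 are dropped because they overlap the top-scoring window at 2, while the window at 10 survives.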
