CS5263 Bioinformatics

CS5263 Bioinformatics Probabilistic modeling approaches for motif finding

Motif representation • Collection of exact words • {ACGTTAC, ACGCTAC, AGGTGAC, …} • Consensus sequence (with wild cards) • {AcGTgTtAC} • {ASGTKTKAC} S=C/G, K=G/T (IUPAC code) • Position specific weight matrices (PWMs)

Sequence Logo

Finding Motifs

Classification of approaches • Combinatorial search • Based on enumeration of words and computing word similarities • Analogy to DP for sequence alignment • Probabilistic modeling • Construct models to distinguish motifs vs non-motifs • Analogy to HMM for sequence alignment

Combinatorial motif finding Given a set of sequences S = {x1, …, xn} • A motif W is a consensus string w1…wK • Find motif W* with “best” match to x1, …, xn Definition of “best”: d(W, xi) = min hamming dist. between W and a word in xi d(W, S) = i d(W, xi) W* = argmin( d(W, S) )

Exhaustive searches 1. Pattern-driven algorithm: For W = AA…A to TT…T (4K possibilities) Find d( W, S ) Report W* = argmin( d(W, S) ) Running time: O( K N 4K ) (where N = i |xi|) Guaranteed to find the optimal solution.

Exhaustive searches 2. Sample-driven algorithm: For W = a K-long word in some xi Find d( W, S ) Report W* = argmin( d( W, S ) ) OR Report a local improvement of W* Running time: O( K N2 )

WEEDER: algorithm sketch • A list containing all eligible nodes: with at most α mismatches to P • For each node, remember #mismatches accumulated (e), and bit vector (B) for seq occ, e.g. [011100010] • Bit OR all B’s to get seq occurrence for P • Suppose #occ >= m • Pattern still valid • Now add a letter Current pattern P, |P| < K ACGTT # mismatches (e, B) Seq occ

WEEDER: algorithm sketch • Simple extension: no branches. • No change to B • e may increase by 1 or no change • Drop node if e > α • Branches: replace a node with its child nodes • Drop if e > α • B may change • Re-do Bit OR using all B’s • Try a different char if #occ < m • Report P when |P| = K Current pattern P ACGTTA (e, B)

Probabilistic modeling approaches for motif finding

Probabilistic modeling approaches • A motif model • Usually a PWM • M = (Pij), i = 1..4, j = 1..k, k: motif length • A background model • Usually the distribution of base frequencies in the genome (or other selected subsets of sequences) • B = (bi), i = 1..4 • A word can be generated by M or B

Expectation-Maximization • For any word W, • P(W | M) = PW[1] 1 PW[2] 2…PW[K] K • P(W | B) = bW[1] bW[2] …bW[K] • Let  = P(M), i.e., the probability for any word to be generated by M. • Then P(B) = 1 -  • Can compute the posterior probability P(M|W) and P(B|W) • P(M|W) ~ P(W|M) *  • P(B|W) ~ P(W|B) * (1-)

Expectation-Maximization Initialize: Randomly assign each word to M or B • Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M, , B Iterate until converge: • E-step: Zxy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M,  given Z (B usually fixed)

Expectation-Maximization • E-step: Zxy = P(M | X[y..y+k-1]) for all x and y • M-step: re-estimate M,  given Z position 1 1 Initialize E-step 5 5 probability 9 9 M-step

MEME • Multiple EM for Motif Elicitation • Bailey and Elkan, UCSD • http://meme.sdsc.edu/ • Multiple starting points • Multiple modes: ZOOPS, OOPS, TCM

Gibbs Sampling • Another very useful technique for estimating missing parameters • EM is deterministic • Often trapped by local optima • Gibbs sampling: stochastic behavior to avoid local optima

Gibbs sampling Initialize: Randomly assign each word to M or B • Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise • Estimate parameters M, B,  Iterate: • Randomly remove a sequence X* from S • Recalculate model parameters using S \ X* • Compute Zx*y for X* • Sample a y* from Zx*y. • Let Zx*y = 1 for y = y* and 0 otherwise

Gibbs Sampling • Gibbs sampling: sample one position according to probability • Update prediction of one training sequence at a time • Viterbi: always take the highest • EM: take weighted average position probability Sampling Simultaneously update predictions of all sequences

Better background model • Repeat DNA can be confused as motif • Especially low-complexity CACACA… AAAAA, etc. • Solution: more elaborate background model • Higher-order Markov model 0th order: B = { pA, pC, pG, pT } 1st order: B = { P(A|A), P(A|C), …, P(T|T) } … Kth order: B = { P(X | b1…bK); X, bi{A,C,G,T} } Has been applied to EM and Gibbs (up to 3rd order)

Gibbs sampling motif finders • Gibbs Sampler • First appeared as: Larence et.al. Science 262(5131):208-214. • Continually developed and updated. webpage • The newest version: Thompson et. al. Nucleic Acids Res. 35 (s2):W232-W237 • AlignACE • Hughes et al., J. of Mol Bio, 2000 10;296(5):1205-14. • Allow don’t care positions • Additional tools to scan motifs on new seqs, and to compare and group motifs • BioProspector, X. Liu et. al. PSB 2001 , an improvement of AlignACE • Liu, Brutlag and Liu. Pac Symp Biocomput. 2001;:127-38. • Allow two-block motifs • Consider higher-order markov models

Limits of Motif Finders 0 • Given upstream regions of coregulated genes: • Increasing length makes motif finding harder – random motifs clutter the true ones • Decreasing length makes motif finding harder – true motif missing in some sequences ??? gene

Challenging problem • (k, d)-motif challenge problem • Many algorithms fail at (15, 4)-motif for n = 20 and L = 600 • Combinatorial algorithms usually work better on challenge problem • However, they are usually designed to find (k, d)-motifs • Performance in real data varies d mutations n = 20 k L = 600

(15, 4)-motif • Information content: 11.7 bits • ~ 6mers. Expected occurrence 1 per 3k bp

Actual Results by MEME llr = 163 E-value = 3.2e+005 llr = 177 E-value = 1.5e+006 llr = 88 E-value = 2.5e+005

Motif finding in practice • Now we’ve found some good looking motifs • This is probably the easiest step • What to do next? • Are they real? • How do we find more instances in the rest of the genome? • What are their functional meaning? • Motifs => regulatory networks

How to make sense of the motifs? • Each program usually reports a number of motifs (tens to hundreds) • Many motifs are variations of each other • Each program also report some different ones • Each program has its own way of scoring motifs • Best scored motifs often not interesting • AAAAAAAA • ACACACAC • TATATATAT

How to make sense of the motifs? • Combine results from different algorithms usually helpful • Ones that appeared multiple times are probably more interesting • Except simple repeats like AAAAA or ATATATATA • Cluster motifs into groups. • Compare with known motifs in database • TRANSFAC • JASPAR • YPD (yeast promoter database)

Strategies to improve results • How to tell real motifs (functional) from noises? Statistical test of significance. • Enrichment in target sequences vs background sequences Background set B Target set T Assumed to contain a common motif, P Assumed to not contain P, or with very low frequency Ideal case: every sequence in T has P, no sequence in B has P

Statistical test for significance Background set + target set B + T P • If n / N >> m / M • P is enriched (over-represented) in T • Statistical significance? • If we randomly draw Nsequences from (B+T), how likely we will see at least n sequences having P? Target set T M N P appeared in n sequences P appeared in m sequences

Hypergeometric distribution • A box with M balls (seqs), of which m are red (with motifs), and the rest are blue (without motifs). • Red ball: sequences with motifs • Blue ball: sequences without motifs • We randomly draw N balls (seqs) from the box • What’s the probability we’ll see n red balls? # of choices to have n red balls Total # of choices to draw N balls

Cumulative hypergeometric test for motif significance • We are interested in: if we randomly pick m balls, how likely that we’ll see at leastn red balls? Null hypothesis: our selection is random. Alternative hypothesis: our selection favored red balls. When prob is small, we reject the null hypothesis. Equivalent: we accept the alternative hypothesis (The number of red balls is larger than expected).

Example • Yeast genome has 6000 genes • Select 50 genes believed to be co-regulated by a common TF • Found a motif from the promoter seqs of these 50 genes • The motif appears in 20 of these 50 genes • In the rest of the genome, 100 genes have this motif • M = 6000, N = 50, m = 100+20 = 120, n = 20 • Intuitively: • m/M = 120/6000=1/50. (1 out 50 genes has the motif) • N = 50, would expect only 1 gene in the target set to have the motif • 20-fold enrichment • P-value = cHyperGeom(20; 6000, 50, 120) = 6 x 10-22 • This motif is significantly enriched in the set of genes

ROC curve for motif significance • Motif is usually a PWM • Any word will have a score • Typical scoring function: Log (P(W | M) / P(W | B)) • W: a word. • M: a PWM. • B: background model • To determine whether motif M occurred in a sequence, a cutoff has to be decided • Different cutoffs give different # of occurrences • Stringent cutoff: low occurrence in both + and - sequences • Loose cutoff: high occurrence in both + and - sequences • It may be better to look at a range of cutoffs

ROC curve for motif significance Background set + target set B + T P • With different score cutoff, will have different m and n • Assume you want to use P to classify T and B • Sensitivity: n / N • Specificity: (M-N-m+n) / (M-N) • False Positive Rate = 1 – specificity: (m – n) / (M-N) • With decreasing cutoff, sensitivity , FPR  Target set T M N Given a score cutoff Appeared in n sequences Appeared in m sequences

ROC curve for motif significance A good cutoff ROC-AUC: area under curve. 1: the best. 0.5: random. Motif 1 is more enriched in motif 2. 1 Lowest cutoff. Every sequence has the motif. Sensitivity = 1. specificity = 0. sensitivity Motif 1 Motif 2 Random 0 1-specificity 0 1 Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.

Other strategies • Cross-validation • Randomly divide sequences into 10 sets, hold 1 set for test. • Do motif finding on 9 sets. Does the motif also appear in the testing set? • Phylogenetic conservation information • Does a motif also appears in the homologous genes of another species? • Strongest evidence • However, will not be able to find species-specific ones

Other strategies • Finding motif modules • Will two motifs always appear in the same gene? • Location preference • Some motifs appear to be in certain location • E.g., within 50-150bp upstream to transcription start • If a detect motif has strong positional bias, may be a sign of its function • Evidence from other types of data sources • Do the genes having the motif always have similar activities (gene expression levels) across different conditions? • Interact with the same set of proteins? • Similar functions? • etc.

To search for new instances • Usually many false positives • Score cutoff is critical • Can estimate a score cutoff from the “true” binding sites Motif finding Scoring function Log (P(W | M) / P(W | B)) A set of scores for the “true” sites. Take mean - std as a cutoff. (or a cutoff such that the majority of “true” sites can be predicted).

To search for new instances • Use other information, such as positional biases of motifs to restrict the regions that a motif may appear • Use gene expression data to help: the genes having the true motif should have similar activities • Risk of circular reasoning: most likely this is how you get the initial sequences to do motif finding • Phylogenetic conservation is the key

References • D’haeseleer P (2006) What are DNA sequence motifs? NATURE BIOTECHNOLOGY, 24 (4):423-425 • D’haeseleer P (2006) How does DNA sequence motif discovery work? NATURE BIOTECHNOLOGY, 24 (8):959-961 • MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2(4): e36 • Lawrence CE et. al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment, Science, 262(5131):208-214

CS5263 Bioinformatics