230 likes | 400 Views
Variants of HMMs. Higher-order HMMs. How do we model “memory” larger than one time point? P( i+1 = l | i = k) a kl P( i+1 = l | i = k, i -1 = j) a jkl … A second order HMM with K states is equivalent to a first order HMM with K 2 states. a HHT. state HH. state HT.
E N D
Higher-order HMMs • How do we model “memory” larger than one time point? • P(i+1 = l | i = k) akl • P(i+1 = l | i = k, i -1 = j) ajkl • … • A second order HMM with K states is equivalent to a first order HMM with K2 states aHHT state HH state HT aHT(prev = H) aHT(prev = T) aHTH state H state T aHTT aTHH aTHT state TH state TT aTH(prev = H) aTH(prev = T) aTTH
Modeling the Duration of States 1-p Length distribution of region X: E[lX] = 1/(1-p) • Geometric distribution, with mean 1/(1-p) This is a significant disadvantage of HMMs Several solutions exist for modeling different length distributions X Y p q 1-q
Sol’n 1: Chain several states p 1-p X Y X X q 1-q Disadvantage: Still very inflexible lX = C + geometric with mean 1/(1-p)
Sol’n 2: Negative binomial distribution Duration in X: m turns, where • During first m – 1 turns, exactly n – 1 arrows to next state are followed • During mth turn, an arrow to next state is followed m – 1 m – 1 P(lX = m) = n – 1 (1 – p)n-1+1p(m-1)-(n-1) = n – 1 (1 – p)npm-n p p p 1 – p 1 – p 1 – p Y X X X ……
Example: genes in prokaryotes • EasyGene: Prokaryotic gene-finder Larsen TS, Krogh A • Negative binomial with n = 3
Solution 3: Duration modeling Upon entering a state: • Choose duration d, according to probability distribution • Generate d letters according to emission probs • Take a transition to next state according to transition probs Disadvantage: Increase in complexity: Time: O(D2) -- Why? Space: O(D) where D = maximum duration of state X
A state model for alignment M (+1,+1) Alignments correspond 1-to-1 with sequences of states M, I, J I (+1, 0) J (0, +1) -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
Let’s score the transitions s(xi, yj) M (+1,+1) Alignments correspond 1-to-1 with sequences of states M, I, J s(xi, yj) s(xi, yj) -d -d I (+1, 0) J (0, +1) -e -e -e -e -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII
How do we find optimal alignment according to this model? Dynamic Programming: M(i, j): Optimal alignment of x1…xi to y1…yjending in M I(i, j): Optimal alignment of x1…xi to y1…yj ending in I J(i, j): Optimal alignment of x1…xi to y1…yjending in J The score is additive, therefore we can apply DP recurrence formulas
Needleman Wunsch with affine gaps – state version Initialization: M(0,0) = 0; M(i,0) = M(0,j) = -, for i, j > 0 I(i,0) = d + ie; J(0,j) = d + je Iteration: M(i – 1, j – 1) M(i, j) = s(xi, yj) + max I(i – 1, j – 1) J(i – 1, j – 1) e + I(i – 1, j) I(i, j) = max e + J(i, j – 1) d + M(i – 1, j – 1) e + I(i – 1, j) J(i, j) = max e + J(i, j – 1) d + M(i – 1, j – 1) Termination: Optimal alignment given by max { M(m, n), I(m, n), J(m, n) }
Probabilistic interpretation of an alignment An alignment is a hypothesis that the two sequences are related by evolution Goal: Produce the most likely alignment Assert the likelihood that the sequences are indeed related
1 – 2 – M P(xi, yj) 1 – 2 – 1 – 2 – I P(xi) J P(yj) A Pair HMM for alignments BEGIN 1 – 2 – I M J END
A Pair HMM for unaligned sequences Model R 1 - 1 - P(x, y | R) = (1 – )m P(x1)…P(xm) (1 – )n P(y1)…P(yn) = 2(1 – )m+n i P(xi) j P(yj) J P(yj) I P(xi) END BEGIN END BEGIN 1 - 1 -
To compare ALIGNMENT vs. RANDOM hypothesis 1 – 2 – Every pair of letters contributes: • (1 – 2 – ) P(xi, yj) when matched • P(xi) P(yj) when gapped • (1 – )2P(xi) P(yj) in random model Focus on comparison of P(xi, yj) vs. P(xi) P(yj) M P(xi, yj) 1 – 2 – 1 – 2 – I P(xi) J P(yj) 1 - 1 - J P(yj) I P(xi) END BEGIN END BEGIN 1 - 1 -
To compare ALIGNMENT vs. RANDOM hypothesis Idea: We will divide alignment score by the random score, and take logarithms Let P(xi, yj) (1 – 2 – ) s(xi, yj) = log ––––––––––– + log ––––––––––– P(xi) P(yj) (1 – )2 (1 – 2 – ) P(xi) d = – log –––––––––––––––––––– (1 – ) (1 – 2 – ) P(xi) P(xi) e = – log ––––––––––– (1 – ) P(xi) Every letter b in random model contributes (1 – ) P(b)
The meaning of alignment scores Because , , are small, and , are very small, P(xi, yj) (1 – 2 – ) P(xi, yj) s(xi, yj) = log ––––––––– + log –––––––––– log –––––––– + log(1 – 2) P(xi) P(yj) (1 – )2P(xi) P(yj) (1 – – ) 1 – d = – log –––––––––––––––––– – log –––––– (1 – ) (1 – 2 – ) 1 – 2 e = – log ––––––– – log (1 – )
The meaning of alignment scores • The Viterbi algorithm for Pair HMMs corresponds exactly to the Needleman-Wunschalgorithm with affine gaps • However, now we need to score alignment with parameters that add up to probability distributions • 1/mean length of next gap • 1/mean arrival time of next gap • affine gaps decouple arrival time with length • 1/mean length of aligned sequences (set to ~0) • 1/mean length of unaligned sequences (set to ~0)
The meaning of alignment scores Match/mismatch scores: P(xi, yj) s(a, b) log ––––––––––– (let’s ignore log(1 – 2) for the moment – assume no gaps) P(xi) P(yj) Example: Say DNA regions between human and mouse have average conservation of 50% Then P(A,A) = P(C,C) = P(G,G) = P(T,T) = 1/8 (so they sum to ½) P(A,C) = P(A,G) =……= P(T,G) = 1/24 (24 mismatches, sum to ½) Say P(A) = P(C) = P(G) = P(T) = ¼ log [ (1/8) / (1/4 * 1/4) ] = log 2 = 1, for match Then, s(a, b) = log [ (1/24) / (1/4 * 1/4) ] = log 16/24 = -0.585 Cutoff similarity that scores 0: s*1 – (1 – s)*0.585 = 0 According to this model, a 37.5%-conserved sequence with no gaps would score on average 0.375 * 1 – 0.725 * 0.585 = 0 Why? 37.5% is between the 50% conservation model, and the random 25% conservation model !
Substitution matrices A more meaningful way to assign match/mismatch scores For protein sequences, different substitutions have dramatically different frequencies! PAM Matrices: • Start from a curated set of very similar protein sequences • Construct ancestral sequences (using parsimony) • Calculate Aab: frequency of letters a and b interchanging • Calculate Bab = P(b|a) = Aab/(c≤dAcd) • Adjust matrix B so that a,bqa qb Bab= 0.01 PAM(1) • Let PAM(N) = [PAM(1)]N -- Common PAM(250)
Substitution Matrices BLOSUM matrices: • Start from BLOCKS database (curated, gap-free alignments) • Cluster sequences according to > X% identity • Calculate Aab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes • Estimate P(a) = (b Aab)/(c≤d Acd); P(a, b) = Aab/(c≤d Acd)
BLOSUM matrices BLOSUM 50 BLOSUM 62 (The two are scaled differently)