1 / 61

CS5263 Bioinformatics

CS5263 Bioinformatics. Lecture 12: Hidden Markov Models and applications. Project ideas. Implement an HMM including Viterbi decoding, posterior decoding and Baum-Welch learning Construct a model Generate sequences with the model Given labels, estimate parameters Given parameters, decode

cain
Download Presentation

CS5263 Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS5263 Bioinformatics Lecture 12: Hidden Markov Models and applications

  2. Project ideas • Implement an HMM including Viterbi decoding, posterior decoding and Baum-Welch learning • Construct a model • Generate sequences with the model • Given labels, estimate parameters • Given parameters, decode • Given nothing, learn the parameters and decode

  3. Project ideas • Implement a progressive multiple sequence alignment with iterative refinement • Use an inferred phylo-genetic tree • Affine gap penalty? • Compare with results in protein families? • Compare with HMM-based?

  4. Project ideas • Implement a combinatorial motif finder • Fast enumeration using suffix tree? • Statistical evaluation • Word clustering? • Test on simulated data – can you find known motifs embedded in sequences • Test on real data – find motifs in some real promoter sequences and compare with what is known about those genes

  5. Project ideas • Pick a paper about some algorithm and implement it • Do your own experiments • Or pick a topic and do a survey

  6. Problems in HMM • Decoding • Predict the state of each symbol • Most probable path • Most probable state for each position: posterior decoding • Evaluation • The probability that a sequence is generated by a model • Basis for posterior decoding • Learning • Decode without knowing model parameters • Estimate parameters without knowing states

  7. Review of last lecture

  8. Decoding Input: HMM & transition and emission parameters, and a sequence Output: the state of each position on the sequence ?

  9. Decoding • Solution 1: find the most probable path • Algorithm: Viterbi

  10. HMM for loaded/fair dice aFF = 0.95 aFL = 0.05 aLL = 0.95 Fair LOADED eF(1) = 1/6 eF(2) = 1/6 eF(3) = 1/6 eF(4) = 1/6 eF(5) = 1/6 eF(6) = 1/6 eL(1) = 1/10 eL(2) = 1/10 eL(3) = 1/10 eL(4) = 1/10 eL(5) = 1/10 eL(6) = 1/2 aLF = 0.05 Transition probability Emission probability Probability of a path is the product of transition probabilities and emission probabilities on the path

  11. HMM unrolled x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 F F F F F F F F F F B L L L L L L L L L L Edge weight w(F, L) = log (aFL) Node weight r(F, x) = log (eF(x)) • Find a path with the following objective: • Maximize the product of transition and emission probabilities  Maximize the sum of weights • Strategy: Dynamic Programming

  12. FSA interpretation aFF= 0.05 aFL= 0.05 aLL= 0.95 w(F,F) = 2.3 w(F,L) = -0.7 w(L,L) = 2.3 Fair LOADED Fair LOADED aLF= 0.05 eL(1) = 1/10 eL(2) = 1/10 eL(3) = 1/10 eL(4) = 1/10 eL(5) = 1/10 eL(6) = 1/2 r(F,1) = 0.5 r(F,2) = 0.5 r(F,3) = 0.5 r(F,4) = 0.5 r(F,5) = 0.5 r(F,6) = 0.5 r(L,1) = 0 r(L,2) = 0 r(L,3) = 0 r(L,4) = 0 r(L,5) = 0 r(L,6) = 1.6 eF(1) = 1/6 eF(2) = 1/6 eF(3) = 1/6 eF(4) = 1/6 eF(5) = 1/6 eF(6) = 1/6 w(L,F) = -0.7 P(L, i+1) = max { P(L, i) aLL eL(xi+1), P(F, i) aFL eL(xi+1)} P(F, i+1) = max { P(L, i) aLF eF(xi+1), P(F, i) aFF eF(xi+1)} V(L, i+1) = max { V(L, i) + W(L, L) + r(L, xi+1), V(F, i) + W(F, L) + r(L, xi+1)} V(F, i+1) = max { V(L, i) + W(L, F) + r(F, xi+1), V(F, i) + W(F, F) + r(F, xi+1)}

  13. More general cases 1 2 K states Completely connected (possibly with 0 transition probabilities) Each state has a set of emission probabilities (emission probabilities may be 0 for some symbols in some states) 3 K … …

  14. HMM unrolled x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 B 3 3 3 3 3 3 3 3 3 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . k k k k k k k k k k

  15. Recurrence V(1,i) + w(1, j) + r(j, xi+1), V(2,i) + w(2, j) + r(j, xi+1), V(j, i+1) = max V(3,i) + w(3, j) + r(j, xi+1), …… V(k,i) + w(k, j) + r(j, xi+1) Or simply: V(j, i+1) = Maxl {V(l,i) + w(l, j) + r(j, xi+1)}

  16. The Viterbi Algorithm x1 x2 x3 ………………………………………..xN Time: O(K2N) Space: O(KN) State 1 2 Vj(i) K

  17. Problems with Viterbi decoding • The most probable path not necessarily the only interesting one • Single optimal vs multiple sub-optimal • Global optimal vs local optimal

  18. For example • Probability for path 1-2-5-7 is 0.4 • Probability for path 1-3-6-7 is 0.3 • Probability for path 1-4-6-7 is 0.3 • The most probable state at step 2 is 6 • 0.4 goes through 5 • 0.6 goes through 6

  19. Another example • The dishonest casino • Say x = 12341623162616364616234161221341 • Most probable path:  = FF……F • However: marked letters more likely to be L than unmarked letters

  20. Posterior decoding • Viterbi finds the path with the highest probability • Posterior decoding ^i = argmaxkP(i = k | x) • Need to know

  21. Posterior decoding • In order to do posterior decoding, we need to know the probability of a sequence given a model, since • This is called the evaluation problem • The solution: Forward-backward algorithm

  22. The forward algorithm

  23. Relation between Forward and Viterbi VITERBI Initialization: P0(0) = 1 Pk(0) = 0, for all k > 0 Iteration: Pk(i) = ek(xi) maxjPj(i-1) ajk Termination: Prob(x, *) = maxkPk(N) FORWARD Initialization: f0(0) = 1 fk(0) = 0, for all k > 0 Iteration: fk(i) = ek(xi) j fj(i-1) ajk Termination: Prob(x) = k fk(N)

  24. 1 This does not include the emission probability of xi

  25. Forward-backward algorithm • fk(i): prob to get to pos i in state k and emit xi • bk(i): prob from i to end, given i is in state k • What is fk(i) bk(i)? • Answer:

  26. The forward-backward algorithm • Compute fk(i), for each state k and pos i • Compute bk(i), for each state k and pos I • Compute P(x) = kfk(N) • Compute P(i=k | x) = fk(i) * bk(i) / P(x)

  27. Sequence state  Forward probabilities Backward probabilities / P(X) Space: O(KN) Time: O(K2N) P(i=k | x)

  28. The Forward-backward algorithm • Posterior decoding ^i = argmaxkP(i = k | x) • Confidence level for the assignment • Similarly idea can be used to compute P(i = k, i+1 = l | x): the probability that a particular transition is used

  29. For example If P(fair) > 0.5, the roll is more likely to be generated by a fair die than a loaded die

  30. Posterior decoding • Sometimes may not give a valid path

  31. Today • Learning • Practical issues in HMM learning

  32. What if a new genome comes? We just sequenced the porcupine genome We know CpG islands play the same role in this genome However, we have no known CpG islands for porcupines We suspect the frequency and characteristics of CpG islands are quite different in porcupines How do we adjust the parameters in our model? - LEARNING

  33. Learning • When the states are known • We’ve already done that • Estimate parameters from labeled data (known CpG or non-CpG) • “supervised” learning • Frequency counting is called “maximum likelihood parameter estimation” • The parameters you found will maximizes the likelihood of your data under the model • When the states are unknown • Estimate parameters without labeled data • “unsupervised” learning

  34. Basic idea • We estimate our “best guess” on the model parameters θ • We use θ to predict the unknown labels • We re-estimate a new set of θ • Repeat 2 & 3 Two ways

  35. Viterbi Traininggiven θ estimate π; then re-estimate θ • Make initial estimates (guess) of parameters θ • Find Viterbi path π for each training sequence • Count transitions/emissions on those paths, getting new θ • Repeat 2&3 • Not rigorously optimizing desired likelihood, but still useful & commonly used. • (Arguably good if you’re doing Viterbi decoding.)

  36. Baum-Welch Traininggiven θ, estimate π ensemble; then re-estimate θ • Instead of estimating the new θ from the most probable path  • We can re-estimate θ from all possible paths • For example, according to Viterbi, pos i is in state k and pos (i+1) is in state l • This contributes 1 count towards the frequency that transition kl is used • In Baum-Welch, this transition is counted only partially, according to the probability that this transition is taken by some path • Similar for emission

  37. Question • How to compute P(i = k, i = l | X)? • Evaluation problem • Solvable with the backward-forward algorithm

  38. Answer P(i = k, i+1 = l, x) = P(i = k, x1…xi) * akl * el(xi+1) * P(i+2 = q, xi+3…xn) = fk(i) * akl * el(xi+1) * bl(i+1)

  39. Estimated # of kl transition: New transition probabilities: Estimated # of symbol t emitted in state k: New emission probabilities:

  40. Why is this working? • Proof is very technical (chap 11 in Durbin book) • But basically, • The backward-forward algorithm computes the likelihood of the data P(X | θ) • When we re-estimate θ, we maximize P(X | θ). • Effect: in each iteration, the likelihood of the sequence will be improved • Therefore, guaranteed to converge (not necessarily to a global optimal) • Viterbi training is also guaranteed to converge: every iteration we improve the probability of the most probable path

  41. Expectation-maximization (EM) • Baum-Welch algorithm is a special case of the expectation-maximization (EM) algorithm, a widely used technique in statistics for learning parameters from unlabeled data • E-step: compute the expectation (e.g. prob for each pos to be in a certain state) • M-step: maximum-likelihood parameter estimation • We’ll see EM and similar techniques again in motif finding • k-means clustering is a special case of EM

  42. Does it actually work? • Depend on: • Nature of the problem • Quality of model (architecture) • Size of training data • Selection of initial parameters

  43. Initial parameters • May come from prior knowledge • If no prior knowledge, use multiple sets of random parameters • Each one ends up in a local maxima • Hopefully one will lead to the correct answer

  44. HMM summary • Viterbi – best single path • Forward – sum over all paths • Backward – similar • Baum-Welch – training via EM and forward-backward • Viterbi – another “EM”, but Viterbi-based

  45. HMM structure • We have assumed a fully connected structure • In practice, many transitions are impossible or have very low probabilities • “Let the model find out for itself” which transition to use • Almost never work in practice • Poor model even with plenty of training data • Too many local optima when not constrained • Most successful HMMs are based on knowledge • Model topology should have an interpretation • Some standard topology to choose in typical situations • Define the model as well as you can • Model surgery based on data

  46. Duration modeling • For any sub-path, the probability consists of two components • The product of emission probabilities • Depend on symbols and state path • The product of transition probabilities • Depend on state path

  47. Duration modeling • Model a stretch of DNA for which the distribution does not change for a certain length • The simplest model implies that P(length = L) = (1-p)pL-1 • i.e., length follows geometric distribution • Not always appropriate P Duration: the number of steps that a state is used consecutively without visiting other states p s 1-p L

  48. Duration models P s s s s 1-p Min, then geometric P P P P s s s s 1-p 1-p 1-p 1-p Negative binominal

  49. Explicit duration modeling Exon Intron Intergenic P(A | I) = 0.3 P(C | I) = 0.2 P(G | I) = 0.2 P(T | I) = 0.3 P L Empirical intron length distribution

More Related