Lecture 6. Hidden Markov Models

  1. Lecture 6. Hidden Markov Models The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics

  2. Lecture outline • Motivating example • Hidden Markov model basics • The three problems related to HMM • Computing data likelihood • Forward algorithm • Backward algorithm • Using a model • Viterbi algorithm • Learning a model • Baum-Welch algorithm • Applications • Gene finding • Protein domains CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  3. Part 1 Motivating Example

  4. Sequence patterns • In the lectures so far, we have been focusing on the similarity between sequences without caring much what kind of sequences they are • Optimal alignments • Dynamic programming • Suffix trie/tree/array, BWT • Heuristic alignments • FASTA • BLAST • Sequence assembly • Now we want to look more deeply into their meanings, starting with their sequence patterns CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  5. Gene structure • Recall the structure of a typical eukaryotic (higher organisms, with a cell nucleus) gene: • [Figure: structure of a gene. Image credit: Wikipedia] CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  6. From sequences to meanings • Suppose you are given the following DNA sequence: • TCGTGTCTAGTCGGATATAGCTATGCATCTAGCTGAGGCGATCTGAGCGGATCGATGCTAGGGCGATCGGAGCTAGCTGAGCTAGCTAGCTGAGCGCTAGCGAGCGTACGAGCGATCGAGCGAGTCTAGCGAGCGATTCTAGCGATCGAGCGTCTACGATCGTATGCTAGCTAGGGCTAGCATGCGGATCTATCGAGCGGCTAT • Some questions: • How likely is it to be part of a gene? • If it is a gene, where are the • Exons and introns? • Coding sequence (open reading frame) and untranslated regions? CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  7. Useful information • Some patterns that could be useful for answering the questions: • An intron starts with a “donor site”, usually with sequence GT • An intron ends with an “acceptor site”, usually with sequence AG • The coding sequence is composed of codons (nucleotide triplets) • It should start with the start codon ATG • It should end with a stop codon • Some amino acids (and thus the codons) are more common • Some amino acids frequently appear together • Difficulty: There are exceptions and many false positives Image credit: http://svmcompbio.tuebingen.mpg.de/img/splice.png , Wikipedia CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  8. Constructing and using a gene model • General approach to answering the questions: • Identify typical sequence patterns on known genes with annotated exons, introns, etc. • Provide an abstraction of the patterns in the form of a (statistical) model • Use the model to evaluate the potential that an input sequence is part of a gene • We are going to study one type of commonly used model for this purpose: the hidden Markov model (HMM) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  9. Illustration of a model • Learning a model for a sequence pattern: • Training examples: • AAACCCCAA • ACCCCCAA • AAAAACAA • AACCCCCAA • Model (abstraction): • Some A’s, followed by some C’s, followed by two A’s • Application: look for new instances • CGAAACCGGCATCGGACCAACTTCAACCGTCACG This model is based on exact rules (a subsequence either matches or does not match the pattern), while the type of models we are going to study next is based on probabilities (a subsequence matches the pattern with a certain probability) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
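
The exact-rule version of this pattern can be expressed directly as a regular expression. A minimal Python sketch (not part of the original slides; the pattern string below simply encodes "some A's, then some C's, then two A's"):

```python
import re

# The exact-rule model "some A's, then some C's, then two A's" as a regular expression.
pattern = re.compile(r"A+C+AA")

# The query sequence from the slide.
sequence = "CGAAACCGGCATCGGACCAACTTCAACCGTCACG"
for match in pattern.finditer(sequence):
    # Each hit is a subsequence that matches the rule exactly.
    print(match.start(), match.group())
```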

  10. Illustration of a probabilistic model • Learning a model for a sequence pattern: • Training examples: • AACAAAC • ACCAAACA • ACAAAC • AACAAAC • Possible models: • Model 1: For every position, 70% chance A, 30% chance C • Model 2: For every position, 10% chance A, 90% chance C • Which model do you think is more likely to be correct? CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
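
To see why one model fits the training examples better than the other, one can simply compute the probability of the data under each model. A small Python sketch of this comparison (an illustration, not from the slides; each position is treated as independent, as the two candidate models assume):

```python
def sequence_likelihood(seq, p_a, p_c):
    """Probability of seq if each position is independently A with prob p_a and C with prob p_c."""
    prob = 1.0
    for ch in seq:
        prob *= p_a if ch == "A" else p_c
    return prob

training = ["AACAAAC", "ACCAAACA", "ACAAAC", "AACAAAC"]

for name, (p_a, p_c) in [("Model 1", (0.7, 0.3)), ("Model 2", (0.1, 0.9))]:
    total = 1.0
    for seq in training:
        total *= sequence_likelihood(seq, p_a, p_c)
    print(name, total)   # Model 1 assigns the mostly-A training data a much higher probability
```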

  11. Part 2 Hidden Markov Model Basics

  12. Sequence of events • Suppose we play this game: • I have two coins in my pocket. • Coin A is fair, with 50% chance of head (H) and 50% chance of tail (T) in a toss. • Coin B is unfair, with 25% chance of H and 75% chance of T in a toss. • Now I do this: • Take out one of the coins from my pocket (in a way that will be described later) without letting you know which one it is. • Throw the coin and let you see the outcome. • Repeat #1 and #2 a few times (each time I could use a different coin) • Then you need to guess which coin I used each time. Example modified from one provided in Chapter 12 of Ewens and Grant, Statistical Methods in Bioinformatics (2nd Edition), Springer (2005). CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  13. Example • Suppose you see this sequence of outcomes: • T, H, T • Which coins did I use each time? • Impossible to tell, because each time both coins could have produced the outcome • A more sensible question: Which “coin sequence” is most likely, given the observed “outcome sequence”? • Need to first know something about how I chose the coin • Why is it related to gene modeling? • Suppose you observe ACCGATAGCTTAC... and you want to guess at each position, whether I used the “exon coin” or the “intron coin” to produce the nucleotide... CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  14. Terminology and symbols • To facilitate our discussion, we need to define some terms and symbols: • O is a sequence of observed outcomes. Let m be the length of O, then the m outcomes are written as o1, o2, ..., om • In our example, m=3 and O=THT • S is a set of possible states, written as s1, s2, ..., sN • In our example, each state is a coin, s1=A and s2=B • Q is a sequence of m states that produced the m corresponding observations, written as q1, q2, ..., qm. The actual value of each qt should be one of the N states si • In our example, Q could be AAB, BAB, etc. CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  15. Terminology and symbols • For each state si, the probability that a particular outcome o is produced (“emitted”) is written as ei(o) • In our example, these “emission probabilities” can be summarized by the following table: eA(H) = 0.5, eA(T) = 0.5; eB(H) = 0.25, eB(T) = 0.75 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  16. Terminology and symbols • How did I determine which coin to use each time? • First time: Determined randomly based on fixed initial probabilities, πi for state si, i.e., πi = Pr(q1=si) • In our example, suppose π1=Pr(q1=A)=0.5 and π2=Pr(q1=B)=0.5, i.e., I just randomly took one from my pocket with equal chance • Second time onwards: Determined randomly based on the coins I actually used before • That is, in general I could determine qt based on q1, q2, ..., qt-1 • If I determined qt based on qt-1 only, it is said that I assumed a first-order Markov model • If I determined qt based on qt-1 and qt-2 only, it is said that I assumed a second-order Markov model • And so on for higher orders CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  17. Terminology and symbols • In our example, we assume a first-order Markov model • Given that si was chosen at time t, the probability of choosing state sj at time t+1 is called the transition probability from si to sj • It is written as pij, which is defined as pij = Pr(qt+1=sj | qt=si) • These probability values do not change over time • In our example, suppose we have the following transition probabilities: pAA = 0.9, pAB = 0.1; pBA = 0.8, pBB = 0.2 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  18. Terminology and symbols • A hidden Markov model is fully described by the parameters λ = (Π, P, E) • A capital letter denotes the whole set of the corresponding lower-case variables. For example, P is the set that contains the variables p11, p12, ..., p1N, p21, ..., pNN. • Sometimes we also use the set notation: P = {pij}, where i and j are state indices, with the value ranges 1 ≤ i ≤ N and 1 ≤ j ≤ N. CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  19. Summary of symbols • States: • si: The i-th possible state. There are N possible states in total • S = {si} • Model parameters: • πi: Initial probability of starting in state si • pij: Transition probability from state si to state sj • ei(o): Probability of emitting o at state si • Π = {πi} • P = {pij} • E = {ei(o)} • λ = (Π, P, E) • Hidden states: • qt: Unknown state at time t. There are m time points in total • Q = {qt} • Observed outcomes: • ot: Observed outcome at time t • O = {ot} CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  20. Graphical representation • From my perspective: the model as a state diagram, plus a possible run in which the hidden states are visible and which produced the outcomes T, H, T • From your perspective: the same model, but the states of the run are hidden (?, ?, ?) and only the outcomes T, H, T are seen • Note: The most likely state sequence statistics-wise is not necessarily the actual state sequence • [Figure: state diagram with initial probabilities 0.5/0.5, transitions A→A 0.9, A→B 0.1, B→A 0.8, B→B 0.2, and emissions A: H 0.5 / T 0.5, B: H 0.25 / T 0.75] CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
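
The coin game itself is easy to simulate. Below is a minimal Python sketch (an illustration, not from the slides) that generates a run from the model; the transition values 0.9/0.1 out of A and 0.8/0.2 out of B are the ones shown in the diagram above:

```python
import random

# The example HMM from the lecture.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
emission = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}

def sample(prob_dict):
    """Draw one key from a dictionary of probabilities."""
    r, cumulative = random.random(), 0.0
    for key, p in prob_dict.items():
        cumulative += p
        if r < cumulative:
            return key
    return key  # guard against floating-point rounding

def simulate(m):
    """Generate a hidden state sequence Q and an observed outcome sequence O of length m."""
    states, outcomes = [], []
    for t in range(m):
        state = sample(initial) if t == 0 else sample(transition[states[-1]])
        states.append(state)
        outcomes.append(sample(emission[state]))
    return states, outcomes

print(simulate(3))  # e.g. (['B', 'A', 'A'], ['T', 'H', 'T']); you only get to see the outcomes
```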

  21. Linking back to gene model • See the close correspondence between our example and the gene finding scenario using the defined terminology: • Coming back to our question: “Given the outcome sequence T, H, T, which coin sequence was most likely?” • Using our terminology, we can restate it as “Given the observed outcome sequence O and the model parameters λ, what state sequence Q was most likely among all possible state sequences?” • We assume the initial and transition probabilities are known • Since the states were not observed (i.e., “hidden”), the model is a “hidden” Markov model • Using the symbols defined, we can write the question as finding arg maxQ Pr(Q|O,λ) (“the Q such that Pr(Q|O,λ) is maximized”) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  22. Basic probabilities • Pr(X)Pr(Y|X) = Pr(X and Y) • If there is a 20% chance of rain tomorrow, and whenever it rains, there is a 60% chance that the temperature will drop, then there is a 0.2*0.6=0.12 chance that tomorrow it will both rain and have a temperature drop • Capital letters mean it is true for all values of X and Y • Can also write Pr(X=x)Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y • Law of total probability: Pr(X) = Σy Pr(X and Y=y) (the summation considers all possible values of Y) • If there is • A 0.12 chance that it will both rain and have a temperature drop tomorrow, and • A 0.08 chance that it will both rain and not have a temperature drop tomorrow • Then there is a 0.12+0.08 = 0.2 chance that it will rain tomorrow • Bayes’ rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y) when Pr(Y) ≠ 0 • Because Pr(X|Y)Pr(Y) = Pr(Y|X)Pr(X) = Pr(X and Y) • Similarly, Pr(X|Y,Z) = Pr(Y|X,Z)Pr(X|Z)/Pr(Y|Z) when Pr(Y|Z) ≠ 0 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
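
As a quick check, the rain example can be worked through in code. The 10% chance of a temperature drop on a rain-free day below is a made-up number, added only so that Bayes' rule has a complete denominator in this sketch:

```python
# The rain example in numbers: product rule, law of total probability and Bayes' rule.
p_rain = 0.2
p_drop_given_rain = 0.6

p_rain_and_drop = p_rain * p_drop_given_rain            # 0.12 (product rule)
p_rain_and_no_drop = p_rain * (1 - p_drop_given_rain)   # 0.08
p_rain_total = p_rain_and_drop + p_rain_and_no_drop     # 0.2 (law of total probability)

# Hypothetical extra assumption: a 10% chance of a temperature drop on a day without rain.
p_drop = p_rain_and_drop + (1 - p_rain) * 0.1            # Pr(drop) by total probability
p_rain_given_drop = p_drop_given_rain * p_rain / p_drop  # Bayes' rule
print(p_rain_total, p_drop, p_rain_given_drop)
```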

  23. Finding the most likely state sequence • Back to our question: • Pr(Q|O,λ) = Pr(O|Q,λ)Pr(Q|λ) / Pr(O|λ) [Bayes’ rule] • If we want to know which Q would maximize the above, we do not need to care about the part that does not change for different Q, i.e., Pr(O|λ) • Therefore, we only need to compare Pr(O|Q,λ)Pr(Q|λ) for different values of Q • In other words, arg maxQ Pr(Q|O,λ) = arg maxQ Pr(O|Q,λ)Pr(Q|λ) • Why do we want to write it in this way? • Because we cannot compute Pr(Q|O,λ) directly, but we can compute Pr(O|Q,λ) and Pr(Q|λ), as we will see CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  24. How to find the Q that maximizes Pr(O|Q,λ)Pr(Q|λ)? • Simplest way: Try all possibilities • HMM parameters λ: initial probabilities π1=Pr(q1=A)=0.5 and π2=Pr(q1=B)=0.5, plus the emission probabilities E and transition probabilities P given earlier • Observed outcomes: O=THT • [Table: Pr(O|Q,λ)Pr(Q|λ) for every possible state sequence Q, with the most likely Q marked] CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
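
A brute-force sketch of "try all possibilities" in Python (an illustration, not from the slides). It enumerates every state sequence Q, computes Pr(O|Q,λ)Pr(Q|λ), and reports the most likely Q; the parameter values are the ones used in this example:

```python
from itertools import product

# The lecture's coin HMM (A: 0.9 stay / 0.1 to B; B: 0.8 to A / 0.2 stay).
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
emission = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}

def joint_probability(states, outcomes):
    """Pr(O|Q, lambda) * Pr(Q|lambda) for one state sequence Q."""
    prob = initial[states[0]] * emission[states[0]][outcomes[0]]
    for t in range(1, len(outcomes)):
        prob *= transition[states[t - 1]][states[t]] * emission[states[t]][outcomes[t]]
    return prob

outcomes = "THT"
table = {q: joint_probability(q, outcomes) for q in product("AB", repeat=len(outcomes))}
for q, p in sorted(table.items(), key=lambda kv: -kv[1]):
    print("".join(q), p)
print("Most likely Q:", "".join(max(table, key=table.get)))
print("Pr(O|lambda) =", sum(table.values()))
```

Summing the same table over all Q gives Pr(O|λ), which here works out to 0.1515625, matching the forward-algorithm example later in the lecture.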

  25. Another example • HMM parameters λ: the same as before • Observed outcomes: O=HHH • [Table: Pr(O|Q,λ)Pr(Q|λ) for every possible state sequence Q, with the most likely Q marked] CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  26. How much time does the calculation take? • For each Q: compute the product of • m emission probabilities • 1 initial probability • m-1 transition probabilities • Total: 2m factors per Q, which is OK • Number of possible Q’s: N^m, which is not OK • How to get the answer using less time? • Reuse partial results → dynamic programming! CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  27. Part 3 The Three Problems Related to HMM

  28. The three problems • There are three fundamental problems related to hidden Markov models: • Evaluating data likelihood: Given a fully specified HMM (i.e., with known λ), compute the probability of getting a sequence of outcomes • Pr(O|λ) • Using a model: Given a fully specified HMM and a sequence of outcomes, determine the most likely sequence of states (the problem we have studied) • Learning a model: Given a set of possible states S and a set of multiple outcome sequences (without knowing λ), determine the parameters of an HMM that maximize the data likelihood CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  29. Problem 1: Evaluating data likelihood • Given a fully specified HMM, compute the probability of getting a sequence of outcomes • Pr(O|λ) • Since the actual sequence of states is unknown, we need to sum over them all: Pr(O|λ) = ΣQ Pr(O|Q,λ)Pr(Q|λ) • Pr(O|Q,λ) is simply a product of emission probabilities • Pr(Q|λ) is simply a product of initial and transition probabilities • However, the problem is again the exponential number of possible state sequences Q that need to be summed • Now we study dynamic programming algorithms that solve this problem efficiently CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  30. The forward algorithm • First we define α(t,i) = Pr(o1, o2, ..., ot, qt=si | λ), which is the probability that we are at state si in step t, and we have observed o1, o2, ..., ot • The problem of computing α(t,i) can be solved by solving some sub-problems: • If we were at state sj in step t-1, we could reach the current situation by transiting from state sj to state si, and then emitting ot • Since we do not know which state we were at in step t-1, we sum over all possibilities CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  31. Derivation of recursive formula CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  32. The forward algorithm (cont’d) • Recall the definition: α(t,i) = Pr(o1, o2, ..., ot, qt=si | λ) • Now we treat α as a dynamic programming table to calculate Pr(O|λ) – calculate the probabilities forward: • Initialization: When t=1, the table entries can be computed directly for any i: α(1,i) = Pr(o1, q1=si | λ) = Pr(q1=si | λ) Pr(o1 | q1=si, λ) = πi ei(o1) • Recursion: For any i, α(t,i) can be computed from all the α(t-1,j)’s: α(t,i) = [Σj α(t-1,j) pji] ei(ot) • The final value of Pr(O|λ) can be computed from the α(m,j)’s: Pr(O|λ) = Σj α(m,j) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  33. Example • Observed outcomes: O=THT • To compute: Pr(O=THT|λ) • The α table: α(t,i) = Pr(o1, o2, ..., ot, qt=si | λ), with α(1,i) = πi ei(o1) • HMM parameters λ: the same initial, transition and emission probabilities as before • [Table: the filled α table for O=THT] • Final answer: 0.128125 + 0.0234375 = 0.1515625 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
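
A minimal Python sketch of the forward algorithm applied to this example (an illustration, not from the slides); it fills the α table column by column and reproduces the final answer 0.1515625:

```python
# The coin HMM used throughout the lecture.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
emission = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def forward(outcomes):
    """Fill the alpha table; alpha[t-1][i] corresponds to the lecture's alpha(t, i)."""
    m = len(outcomes)
    alpha = [{} for _ in range(m)]
    for i in states:                       # initialization: alpha(1,i) = pi_i * e_i(o1)
        alpha[0][i] = initial[i] * emission[i][outcomes[0]]
    for t in range(1, m):                  # recursion over the remaining steps
        for i in states:
            alpha[t][i] = sum(alpha[t - 1][j] * transition[j][i] for j in states) \
                          * emission[i][outcomes[t]]
    return alpha

alpha = forward("THT")
for t, row in enumerate(alpha, start=1):
    print(t, row)
print("Pr(O|lambda) =", sum(alpha[-1].values()))   # 0.128125 + 0.0234375 = 0.1515625
```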

  34. Time and space analysis • The table contains N × m entries (N is the number of states, m is the length of the observed sequence) • Computing the value of each entry requires summing N numbers • Each of these numbers is produced by a constant number of multiplications • The total amount of time required is proportional to N^2 × m • Since α(t,i) depends only on the α(t-1,j)’s, we can throw away everything from two or more steps before. The working space needed is thus proportional to N • Plus N|U| for storing the emission probabilities (where U is the set of possible outcomes), N for the initial probabilities, and N^2 for the transition probabilities of the HMM CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
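
A sketch of the space-saving observation above: keeping only the previous column of the α table during the forward pass (an illustration, not from the slides):

```python
# Forward pass that keeps only the previous alpha column, so working space is proportional to N.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
emission = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def forward_likelihood(outcomes):
    previous = {i: initial[i] * emission[i][outcomes[0]] for i in states}
    for t in range(1, len(outcomes)):
        current = {i: sum(previous[j] * transition[j][i] for j in states)
                      * emission[i][outcomes[t]]
                   for i in states}
        previous = current              # the column for step t-1 is no longer needed
    return sum(previous.values())       # Pr(O | lambda)

print(forward_likelihood("THT"))        # 0.1515625, as before
```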

  35. The backward algorithm • For the same problem, there could be multiple solutions • Here we can find the data likelihood by another algorithm • First we define β(t,i) = Pr(ot+1, ot+2, ..., om | qt=si, λ) • β(t,i) can be computed from the β(t+1,j)’s: β(t,i) = Σj pij ej(ot+1) β(t+1,j) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  36. Derivation of recursive formula CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  37. The backward algorithm • β(t,i) = Pr(ot+1, ot+2, ..., om | qt=si, λ) • Now we treat β as a dynamic programming table to calculate Pr(O|λ) – calculate the probabilities backward: • Initialization: β(m,i) = 1 for every i • Recursion: β(t,i) = Σj pij ej(ot+1) β(t+1,j) • Final answer: Pr(O|λ) = Σi πi ei(o1) β(1,i) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
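
A minimal Python sketch of the backward algorithm on the same example (an illustration, not from the slides); its final answer agrees with the one obtained by the forward algorithm:

```python
# The coin HMM used throughout the lecture.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
emission = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def backward(outcomes):
    """Fill the beta table; beta[t-1][i] corresponds to the lecture's beta(t, i)."""
    m = len(outcomes)
    beta = [{} for _ in range(m)]
    for i in states:                       # initialization: beta(m,i) = 1
        beta[m - 1][i] = 1.0
    for t in range(m - 2, -1, -1):         # recursion, filled backwards
        for i in states:
            beta[t][i] = sum(transition[i][j] * emission[j][outcomes[t + 1]] * beta[t + 1][j]
                             for j in states)
    return beta

outcomes = "THT"
beta = backward(outcomes)
likelihood = sum(initial[i] * emission[i][outcomes[0]] * beta[0][i] for i in states)
print(likelihood)   # same value as the forward algorithm, 0.1515625
```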

  38. Problem 2: Using a model • Given a fully specified HMM and a sequence of outcomes, determine the most likely state sequence • We have already learned how to do it by listing all possible state sequences. Now we want to do it faster (using dynamic programming) • The first observation is that instead of maximizing the conditional probability, we can maximize the joint probability as well: arg maxQ Pr(Q|O,λ) = arg maxQ Pr(Q,O|λ) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  39. The Viterbi algorithm • Again, we want to break down the original problem into simpler sub-problems: • Consider states and observations up to step t • Assume we know the state at step t is si • Then we can solve each sub-problem recursively by solving its sub-problems • We will first find the maximum probability maxQ Pr(Q,O|λ), then find a state sequence Q that produces this probability CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  40. Find the maximum probability • As discussed, we define a dynamic programming table δ for the sub-problems: δ(t,i) = max over q1, ..., qt-1 of Pr(q1, ..., qt-1, qt=si, o1, ..., ot | λ) • Initialization: δ(1,i) = πi ei(o1) • Recursion: δ(t,i) = [maxj δ(t-1,j) pji] ei(ot) • Final answer: maxQ Pr(Q,O|λ) = maxi δ(m,i) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  41. Derivation of recursive formula CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  42. Trace back • At each step, δ(t,i) is determined by taking the maximum of δ(t-1,j) pji over the different j’s • We can imagine drawing “red arrows” from the maximizing (t-1,j) cells to the (t,i) cells. Trace back is then simply following the red arrows backwards, starting from the cell with the largest δ(m,i) • [Figure: the δ table with rows i=1, 2, ..., N and columns t=1, 2, ..., m] CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
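
A minimal Python sketch of the Viterbi algorithm with trace back (an illustration, not from the slides); the backpointer table plays the role of the "red arrows":

```python
# The coin HMM used throughout the lecture.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
emission = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def viterbi(outcomes):
    m = len(outcomes)
    delta = [{} for _ in range(m)]         # delta[t][i]: best joint probability ending in state i at step t+1
    backpointer = [{} for _ in range(m)]   # the "red arrows" used for trace back
    for i in states:
        delta[0][i] = initial[i] * emission[i][outcomes[0]]
    for t in range(1, m):
        for i in states:
            best_j = max(states, key=lambda j: delta[t - 1][j] * transition[j][i])
            delta[t][i] = delta[t - 1][best_j] * transition[best_j][i] * emission[i][outcomes[t]]
            backpointer[t][i] = best_j
    last = max(states, key=lambda i: delta[m - 1][i])
    path = [last]
    for t in range(m - 1, 0, -1):          # follow the red arrows backwards
        path.append(backpointer[t][path[-1]])
    return list(reversed(path)), delta[m - 1][last]

print(viterbi("THT"))   # the most likely state sequence and its joint probability with O
```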

  43. Problem 3: Learning a model • Given a set of possible states S and a set of outcome sequences, determine the parameters of an HMM that maximize the data likelihood • It is a difficult problem with a large “search space” (possible combinations of parameter values) • We will study a heuristic method CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  44. The Baum-Welch algorithm • Assume we have a set of observed outcome sequences: • {O(d)} = {O(1), O(2), ...} • Each element is a sequence of outcomes: O(d) = o1(d), o2(d), ..., om(d) • Correspondingly, there is a set of hidden state sequences: • {Q(d)} = {Q(1), Q(2), ...} • Each element is a sequence of states: Q(d) = q1(d), q2(d), ..., qm(d) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  45. The Baum-Welch algorithm • High-level procedure: • Assign some initial values to each parameter variable πi, pij and ei(ok) • Update the values based on current estimates: • Set πi to the expected fraction of times that the first state is si • Set pij to the expected fraction of times that the next state is sj given that the current state is si • Set ei(ok) to the expected fraction of times that the outcome ok is emitted given that the current state is si • Repeat #2 until the increase in Pr({O(d)} | λ) between two iterations is smaller than a threshold • It can be proved that Pr({O(d)} | λ) never decreases • However, there is no guarantee that the final set of parameter values is one that maximizes Pr({O(d)} | λ) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  46. The Baum-Welch algorithm • We will not discuss all the details of the algorithm, but just one interesting part • The updates in #2 require this quantity: Pr(qt(d)=si, qt+1(d)=sj | O(d), λ) = Pr(qt(d)=si, qt+1(d)=sj, O(d) | λ) / Pr(O(d) | λ), which is the probability that in the d-th sequence, the t-th state was si and the (t+1)-th state was sj, given the d-th outcome sequence and the current estimates of the parameter values λ • We have learned how to compute the denominator, Pr(O(d) | λ), efficiently (Problem 1) CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  47. Forward and backward variables • How about the numerator? • To simplify the notation, let’s suppose there is only one outcome sequence, O • The numerator is then (not showing the derivation): α(t,i) β(t+1,j) pij ej(ot+1) • We have learned how to compute the first two terms (the “forward and backward variables”) • The last two terms can be read directly from the current parameter estimates • When the probability of a hidden state is estimated using both the outcomes before and after it, the process is called “smoothing” CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
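
Putting the forward and backward variables together, this smoothing quantity can be sketched in Python as follows (an illustration, not from the slides; the forward and backward passes are recomputed inline so the snippet is self-contained):

```python
# Sketch of the quantity used in the Baum-Welch updates for one outcome sequence:
#   xi_t(i,j) = alpha(t,i) * beta(t+1,j) * p_ij * e_j(o_{t+1}) / Pr(O | lambda)
# The parameters are the coin HMM used throughout the lecture.
initial = {"A": 0.5, "B": 0.5}
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.8, "B": 0.2}}
emission = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}
states = ["A", "B"]

def forward_backward(outcomes):
    """Compute the alpha and beta tables (0-based indexing) for one outcome sequence."""
    m = len(outcomes)
    alpha = [{i: initial[i] * emission[i][outcomes[0]] for i in states}]
    for t in range(1, m):
        alpha.append({i: sum(alpha[t - 1][j] * transition[j][i] for j in states)
                         * emission[i][outcomes[t]] for i in states})
    beta = [dict() for _ in range(m)]
    beta[m - 1] = {i: 1.0 for i in states}
    for t in range(m - 2, -1, -1):
        beta[t] = {i: sum(transition[i][j] * emission[j][outcomes[t + 1]] * beta[t + 1][j]
                          for j in states) for i in states}
    return alpha, beta

def expected_transitions(outcomes):
    """xi[t][(i, j)] = Pr(q_t = s_i, q_{t+1} = s_j | O, lambda) -- the smoothing quantity."""
    alpha, beta = forward_backward(outcomes)
    likelihood = sum(alpha[-1].values())     # the denominator Pr(O | lambda), from Problem 1
    return [{(i, j): alpha[t][i] * beta[t + 1][j] * transition[i][j]
                     * emission[j][outcomes[t + 1]] / likelihood
             for i in states for j in states}
            for t in range(len(outcomes) - 1)]

print(expected_transitions("THT"))
```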

  48. Part 4 Applications

  49. Applying HMMs to biological sequences • In a real situation, the first task is to learn a model, from either • Data with outcomes only (e.g., sequences that contain genes), OR • Data with both outcomes and states (e.g., annotated genes with known exons, introns, etc.) • Then the model is used to scan unannotated sequences to identify those that match the model well CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014

  50. Biological applications • Genes • Protein domains • Protein families • Protein binding sites on DNA • Sequence profiles • RNA structures • ... CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2014
