Lecture 5: Hidden Markov Model
Bioinformatics
Dr. Aladdin Hamwieh, Khalid Al-shamaa, Abdulqader Jighly
Aleppo University, Faculty of Technical Engineering, Department of Biotechnology
2010-2011
Gene prediction: Methods
Gene prediction can be based upon:
• Coding statistics (statistical approach)
• Gene structure (statistical approach)
• Comparison (similarity-based approach)
Gene prediction: Coding statistics
• Coding regions of a sequence have different, non-random properties compared with non-coding regions, for example:
• GC content
• Codon bias (codon usage)
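As a minimal illustration of these two statistics (not from the original slides; the function names are chosen here purely for illustration), the sketch below computes GC content and codon usage for a DNA string:

```python
# Sketch: two coding statistics used by gene predictors.
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_usage(seq: str) -> Counter:
    """Count non-overlapping codons in reading frame 0."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

print(gc_content("ATGGCGTGA"))   # 0.5555...
print(codon_usage("ATGGCGTGA"))  # Counter({'ATG': 1, 'GCG': 1, 'TGA': 1})
```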
Markov Model
• A Markov model is a process which moves from state to state, depending only on the previous n states.
• For example: calculating the probability of this sequence of weather states over one week in March: Sunny, Sunny, Cloudy, Rainy, Rainy, Sunny, Cloudy.
• If today is Cloudy, tomorrow is more likely to be Rainy.
• In March, a week is more likely to start with a Sunny day than with any other state.
• And so on.
Markov Model
Transition probabilities (weather today → weather tomorrow):

                  Sunny   Cloudy   Rainy
  Sunny           0.5     0.25     0.25
  Cloudy          0.25    0.375    0.375
  Rainy           0.25    0.625    0.125
Example (using the transition table above, with initial probability π(Sunny) = 0.6):

P(Sunny, Sunny, Cloudy, Rainy | Model)
= π(Sunny) × P(Sunny | Sunny) × P(Cloudy | Sunny) × P(Rainy | Cloudy)
= 0.6 × 0.5 × 0.25 × 0.375
= 0.0281
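A minimal Python sketch of this computation. The transition table comes from the slide and π(Sunny) = 0.6 from the worked example; the other two initial probabilities are assumed here purely for illustration:

```python
# Weather Markov chain from the slides.
pi = {"Sunny": 0.6, "Cloudy": 0.3, "Rainy": 0.1}  # only pi(Sunny) given; rest assumed

A = {  # A[today][tomorrow], rows of the transition table above
    "Sunny":  {"Sunny": 0.5,  "Cloudy": 0.25,  "Rainy": 0.25},
    "Cloudy": {"Sunny": 0.25, "Cloudy": 0.375, "Rainy": 0.375},
    "Rainy":  {"Sunny": 0.25, "Cloudy": 0.625, "Rainy": 0.125},
}

def sequence_probability(states):
    """P(s1, ..., sn) = pi(s1) * product over i of A[s_i][s_{i+1}]."""
    p = pi[states[0]]
    for today, tomorrow in zip(states, states[1:]):
        p *= A[today][tomorrow]
    return p

print(sequence_probability(["Sunny", "Sunny", "Cloudy", "Rainy"]))  # 0.028125
```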
Hidden Markov Models • States are not observable • Observations are probabilistic functions of state • State transitions are still probabilistic
CG Islands and the “Fair Bet Casino”
• The CG islands problem can be modeled after a problem named “The Fair Bet Casino”.
• The game is to flip a coin, with only two possible outcomes: Head or Tail.
• The Fair coin gives Heads and Tails with the same probability, ½.
• The Biased coin gives Heads with probability ¾ (and thus Tails with probability ¼).
The “Fair Bet Casino” (cont’d)
• Thus, we define the probabilities:
• P(H|F) = P(T|F) = ½
• P(H|B) = ¾, P(T|B) = ¼
• The crooked dealer switches between the Fair and Biased coins with probability 10%.
HMM for Fair Bet Casino (cont’d)
[Figure: HMM model for the Fair Bet Casino problem.]
HMM Parameters
Σ: set of emission characters. Examples:
• Σ = {H, T} for coin tossing
• Σ = {1, 2, 3, 4, 5, 6} for dice tossing
• Σ = {A, C, G, T} for DNA sequences
Q: set of hidden states, each emitting symbols from Σ. Examples:
• Q = {F, B} for coin tossing
• Q = {Non-coding, Coding, Regulatory} for sequences
HMM Parameters (cont’d)
A = (akl): a |Q| × |Q| matrix of probabilities of changing from state k to state l.
aFF = 0.9, aFB = 0.1, aBF = 0.1, aBB = 0.9
E = (ek(b)): a |Q| × |Σ| matrix of probabilities of emitting symbol b while in state k.
eF(T) = ½, eF(H) = ½, eB(T) = ¼, eB(H) = ¾
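Written out as plain Python dictionaries, the Fair Bet Casino parameters above look like this (a sketch; the variable names are ours):

```python
states  = ["F", "B"]  # F = fair coin, B = biased coin
symbols = ["H", "T"]

A = {  # transitions: the dealer switches coins with probability 0.1
    "F": {"F": 0.9, "B": 0.1},
    "B": {"F": 0.1, "B": 0.9},
}
E = {  # emissions: probability of H or T from each coin
    "F": {"H": 0.5,  "T": 0.5},
    "B": {"H": 0.75, "T": 0.25},
}
```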
HMM
[Figure: an urn-model HMM; at each turn (i-th, i+1-th, …) one of the hidden states Q1, Q2, Q3 emits one of the observable colors Yellow, Red, Green, or Blue.]
The three basic problems of HMMs
Problem 1 (Evaluation): Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), compute P(Σ | M).
Problem 2 (Decoding): Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), how do we choose a corresponding state sequence Q = q1 q2 … qT which best “explains” the observations?
Problem 3 (Learning): How do we adjust the model parameters Π, A, E to maximize P(Σ | Π, A, E)?
The three basic problems of HMMs
Problem 1: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), compute P(Σ | M).
For example: P([Figure: a sequence of colored balls] | M).
Problem 1: Probability of an Observation Sequence
• What is P(Σ | M)?
• The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
• Naive computation is very expensive: given T observations and N states, there are N^T possible state sequences.
• Even a small HMM, e.g. T = 10 and N = 10, yields 10 billion different paths.
• The solution to this problem (and to Problem 2) is dynamic programming.
Problem 1: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), compute P(Σ | M). Solution: the forward algorithm.

Example: P([Figure: a sequence of colored balls] | M). At each step, every incoming path through Q1, Q2, Q3 contributes its previous forward value times a transition probability times an emission probability, and the contributions are summed:

• 0.15 × 0.1 × 0.25 = 0.00375
• 0.03 × 0.4 × 0.1 = 0.0012
• 0.065 × 0.2 × 0.65 = 0.00845
Sum = 0.0134
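A minimal forward-algorithm sketch, applied here to the Fair Bet Casino model from earlier (the slides do not give initial probabilities Π for that model, so a uniform start is assumed):

```python
def forward(obs, states, pi, A, E):
    """P(obs | M), summing over all hidden state paths."""
    # Initialization: f[k] = pi(k) * e_k(first symbol)
    f = {k: pi[k] * E[k][obs[0]] for k in states}
    # Recursion: f'[l] = e_l(x) * sum_k f[k] * a_kl
    for x in obs[1:]:
        f = {l: E[l][x] * sum(f[k] * A[k][l] for k in states)
             for l in states}
    # Termination: sum over the final states
    return sum(f.values())

# Fair Bet Casino model (pi assumed uniform)
states = ["F", "B"]
pi = {"F": 0.5, "B": 0.5}
A = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
E = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}
print(forward("HHTH", states, pi, A, E))
```

The loop runs in O(T · N²) time rather than enumerating all N^T paths, which is exactly the dynamic-programming saving described above.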
The three basic problems of HMMs
Problem 2: Given an observation sequence Σ = O1 O2 … OT and a model M = (Π, A, E), how do we choose a corresponding state sequence Q = q1 q2 … qT which best “explains” the observations?
For example: what is the most probable state sequence Q1 Q2 Q3 Q4 given the observation? [Figure: four unknown states Q? above four observed ball colors.]
Problem 2: Decoding
• The solution to Problem 1 efficiently computes the sum of the probabilities of all paths through an HMM.
• For Problem 2, we instead want to find the single path with the highest probability.
Example: the same three contributions as before, but for decoding we take the largest instead of the sum:

• 0.15 × 0.1 × 0.25 = 0.00375
• 0.03 × 0.4 × 0.1 = 0.0012
• 0.065 × 0.2 × 0.65 = 0.00845 ← the largest
The most probable path is therefore the one contributing 0.00845.
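The corresponding Viterbi sketch for Problem 2 replaces the sum with a max and keeps back-pointers, returning the single most probable state path (same model as before, with the same assumed uniform start):

```python
def viterbi(obs, states, pi, A, E):
    """Return (most probable state path, its probability)."""
    # v[l] = (best probability of any path ending in state l, that path)
    v = {k: (pi[k] * E[k][obs[0]], [k]) for k in states}
    for x in obs[1:]:
        v = {l: max(((p * A[k][l] * E[l][x], path + [l])
                     for k, (p, path) in v.items()),
                    key=lambda t: t[0])
             for l in states}
    prob, path = max(v.values(), key=lambda t: t[0])
    return path, prob

states = ["F", "B"]
pi = {"F": 0.5, "B": 0.5}
A = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
E = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.75, "T": 0.25}}
path, prob = viterbi("HHHHHTTT", states, pi, A, E)
print(path, prob)  # long head runs favor B, tail runs favor F
```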
How is it connected to Gene prediction?
[Figure: the same urn-model HMM as before; hidden states Q1, Q2, Q3 emit one of the colors Yellow, Red, Green, or Blue at each turn (i-th turn, i+1-th turn, …).]
How is it connected to Gene prediction?
Observed sequence: T T G A G T G G A A T C T A G C C C C A G A G C T T A A G C T A G C T A G C T
Hidden states: each nucleotide is labelled as Exon, Intron, or UTR.
Hidden Markov Models (HMM) for gene prediction
• Basic probabilistic model of gene structure.
Hidden states:
• 5‘: 5‘ UTR
• EI: Initial Exon
• E: Exon
• FE: Final Exon
• SE: Single Exon
• I: Intron
• 3‘: 3‘ UTR
Signals:
• B: Begin sequence
• S: Start translation
• A: Acceptor site (AG)
• D: Donor site (GT)
• T: Stop translation
• F: End sequence
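As a sketch of this topology only (the slides give the states and signals but not the transition probabilities, and the exact adjacency is our reading of the diagram, so treat it as an assumption), the allowed moves between hidden states could be encoded like this:

```python
# Which hidden state may follow which (self-loops let a state span many
# nucleotides). Signals mark the boundaries: S = start translation,
# D = donor site (GT), A = acceptor site (AG), T = stop translation.
allowed = {
    "5'": {"5'", "EI", "SE"},  # 5' UTR until translation starts (S)
    "EI": {"EI", "I"},         # initial exon ends at a donor site (D)
    "I":  {"I", "E", "FE"},    # intron ends at an acceptor site (A)
    "E":  {"E", "I"},          # internal exon leads back into an intron
    "FE": {"FE", "3'"},        # final exon ends at the stop codon (T)
    "SE": {"SE", "3'"},        # single-exon gene: no introns at all
    "3'": {"3'"},              # 3' UTR until the end of sequence (F)
}
```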
Eukaryotic Genes Features
[Figure: eukaryotic gene structure, annotated with ATG start codons, GT donor and AG acceptor sites flanking the introns, and TAG stop codons.]