Markov Chains and Hidden Markov Models
Marjolijn Elsinga & Elze de Groot
Andrei A. Markov • Born: 14 June 1856 in Ryazan, Russia • Died: 20 July 1922 in Petrograd, Russia • Graduate of Saint Petersburg University (1878) • Work: number theory and analysis, continued fractions, limits of integrals, approximation theory and the convergence of series
Today's topics • Markov chains • Hidden Markov models - Viterbi Algorithm - Forward Algorithm - Backward Algorithm - Posterior Probabilities
Markov Chains (1) • Emitting states: every state emits one symbol, so the sequence can be read off the states directly
Markov Chains (2) • Transition probabilities: a_st = P(x_i = t | x_{i-1} = s) • Probability of the sequence: P(x) = P(x_L | x_{L-1}) · P(x_{L-1} | x_{L-2}) · … · P(x_2 | x_1) · P(x_1)
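As an illustration, a minimal Python sketch of scoring a sequence under a first-order Markov chain; the transition values and the uniform initial distribution below are made-up placeholders, not the CpG numbers:

    import math

    # Placeholder transition table a_st = P(x_i = t | x_{i-1} = s); each row sums to 1.
    trans = {('A','A'): .30, ('A','C'): .20, ('A','G'): .30, ('A','T'): .20,
             ('C','A'): .15, ('C','C'): .35, ('C','G'): .30, ('C','T'): .20,
             ('G','A'): .20, ('G','C'): .30, ('G','G'): .30, ('G','T'): .20,
             ('T','A'): .20, ('T','C'): .25, ('T','G'): .30, ('T','T'): .25}
    init = {b: 0.25 for b in "ACGT"}     # assumed uniform P(x_1)

    def sequence_log_prob(x):
        # log P(x) = log P(x_1) + sum_i log a_{x_{i-1} x_i}
        logp = math.log(init[x[0]])
        for s, t in zip(x, x[1:]):
            logp += math.log(trans[(s, t)])
        return logp

    print(sequence_log_prob("CGCG"))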
Key property of Markov Chains • The probability of a symbol x_i depends only on the value of the preceding symbol x_{i-1}: P(x_i | x_{i-1}, …, x_1) = P(x_i | x_{i-1})
Begin and End states • Silent states: they emit no symbol and model the two ends of the sequence
Example: CpG Islands • CpG = Cytosine – phosphodiester bond – Guanine • 100 – 1000 bases long • Cytosine is modified by methylation • Methylation is suppressed in short stretches of the genome (start regions of genes) • High chance of mutation into a thymine (T)
Two questions • How would we decide if a short stretch of genomic sequence comes from a CpG island or not? • How would we find, given a long piece of sequence, the CpG islands in it, if there are any?
Discrimination • 48 putative CpG islands are extracted • Derive 2 models - regions labelled as CpG island ('+' model) - regions from the remainder ('-' model) • Transition probabilities are set by maximum likelihood: a_st^+ = c_st^+ / Σ_t' c_st'^+ - where c_st^+ is the number of times letter t follows letter s in the '+' regions
Maximum Likelihood Estimators • Each row sums to 1 • Tables are asymmetric
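A sketch of how these estimators could be computed, with two short made-up strings standing in for the labelled '+' regions:

    from collections import Counter

    def estimate_transitions(regions, alphabet="ACGT"):
        # c_st: number of times letter t follows letter s in the training regions
        c = Counter()
        for seq in regions:
            c.update(zip(seq, seq[1:]))
        # Maximum likelihood: a_st = c_st / sum_t' c_st'  (each row sums to 1)
        a = {}
        for s in alphabet:
            row = sum(c[(s, t)] for t in alphabet)
            for t in alphabet:
                a[(s, t)] = c[(s, t)] / row if row else 0.0
        return a

    plus_model = estimate_transitions(["CGCGCGGC", "GCCGCGCG"])  # hypothetical data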
Log-odds ratio • To discriminate, score a sequence x by S(x) = log( P(x | model +) / P(x | model -) ) = Σ_i log( a^+_{x_{i-1} x_i} / a^-_{x_{i-1} x_i} ) • S(x) > 0: x is more likely to be a CpG island
Discrimination shown
Simulation: '+' model
Simulation: '-' model
Today's topics • Markov chains • Hidden Markov models - Viterbi Algorithm - Forward Algorithm - Backward Algorithm - Posterior Probabilities
Hidden Markov Models (HMM) (1) • No one-to-one correspondence between states and symbols • No longer possible to tell which state the model was in when it emitted x_i • Transition probability from state k to l: a_kl = P(π_i = l | π_{i-1} = k) • π_i is the ith state in the path (state sequence)
Hidden Markov Models (HMM) (2) • Begin state: a_0k • End state: a_k0 • In the CpG islands example: states A+, C+, G+, T+ and A-, C-, G-, T-, each emitting the corresponding symbol A, C, G or T
Hidden Markov Models (HMM) (3) • We need a new set of parameters because we have decoupled symbols from states • Probability that symbol b is seen when in state k: e_k(b) = P(x_i = b | π_i = k)
Example: dishonest casino (1) • Fair die and loaded die • Loaded die: probability 0.5 of a 6 and probability 0.1 for 1-5 • Switch from fair to loaded: probability 0.05 • Switch back: probability 0.1
Dishonest casino (2) • Emission probabilities: e_fair(b) = 1/6 for b = 1,…,6; e_loaded(6) = 0.5 and e_loaded(b) = 0.1 for b = 1,…,5 • An HMM is a model that generates or emits sequences
Dishonest casino (3) • Hidden: you don't know if the die is fair or loaded • Joint probability of observed sequence x and state sequence π: P(x, π) = a_{0 π_1} Π_{i=1}^{L} e_{π_i}(x_i) · a_{π_i π_{i+1}} (with π_{L+1} = 0, the end state)
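A sketch of this joint probability in Python, using the casino numbers from the slides; the 0.5/0.5 start distribution and the omitted end transitions are assumptions:

    states = ("F", "L")                                    # fair, loaded
    a = {("F","F"): 0.95, ("F","L"): 0.05,                 # switch to loaded: 0.05
         ("L","F"): 0.10, ("L","L"): 0.90}                 # switch back: 0.1
    e = {"F": {r: 1/6 for r in range(1, 7)},               # fair die
         "L": {1: .1, 2: .1, 3: .1, 4: .1, 5: .1, 6: .5}}  # loaded die
    start = {"F": 0.5, "L": 0.5}                           # assumed uniform

    def joint_prob(x, pi):
        # P(x, pi) = P(pi_1) * e_{pi_1}(x_1) * prod_i a_{pi_{i-1} pi_i} * e_{pi_i}(x_i)
        p = start[pi[0]] * e[pi[0]][x[0]]
        for i in range(1, len(x)):
            p *= a[(pi[i-1], pi[i])] * e[pi[i]][x[i]]
        return p

    print(joint_prob([1, 6, 6], ["F", "L", "L"]))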
Three algorithms • What is the most probable path for generating a given sequence? Viterbi Algorithm • How likely is a given sequence? Forward Algorithm • How can we learn the HMM parameters given a set of sequences? Forward-Backward (Baum-Welch) Algorithm
Viterbi Algorithm • CGCG can be generated in different ways, with different probabilities • Choose the path with the highest probability • The most probable path can be found recursively
Viterbi Algorithm (2) • v_k(i) = probability of the most probable path ending in state k with observation x_i
Viterbi Algorithm (3) • Initialisation: v_0(0) = 1, v_k(0) = 0 for k > 0 • Recursion: v_l(i) = e_l(x_i) · max_k ( v_k(i-1) · a_kl ), keeping a pointer ptr_i(l) = argmax_k ( v_k(i-1) · a_kl ) • Termination: P(x, π*) = max_k ( v_k(L) · a_k0 ) • The most probable path π* is recovered by following the pointers back
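A compact sketch of this recursion, reusing the casino parameters defined above; the end transitions are dropped (treated as 1), which is an assumption:

    def viterbi(x):
        # v[k] = probability of the most probable path ending in state k
        v = {k: start[k] * e[k][x[0]] for k in states}
        ptrs = []
        for obs in x[1:]:
            prev, v, back = v, {}, {}
            for l in states:
                best = max(states, key=lambda k: prev[k] * a[(k, l)])
                v[l] = e[l][obs] * prev[best] * a[(best, l)]
                back[l] = best
            ptrs.append(back)
        # Termination and traceback of the most probable path pi*
        last = max(states, key=lambda k: v[k])
        path = [last]
        for back in reversed(ptrs):
            path.append(back[path[-1]])
        return list(reversed(path)), v[last]

    print(viterbi([1, 6, 6, 6, 1]))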
Viterbi Algorithm • Most probable path for CGCG
Viterbi Algorithm • Result with casino example
Three algorithms • What is the most probable path for generating a given sequence? Viterbi Algorithm • How likely is a given sequence? Forward Algorithm • How can we learn the HMM parameters given a set of sequences? Forward-Backward (Baum-Welch) Algorithm
Forward Algorithm (1) • Probability over all possible paths: P(x) = Σ_π P(x, π) • The number of possible paths increases exponentially with the length of the sequence • The forward algorithm enables us to compute this efficiently
Forward Algorithm (2) • Replace the maximisation steps of the Viterbi algorithm with sums • Probability of the observed sequence up to and including x_i, requiring π_i = k: f_k(i) = P(x_1 … x_i, π_i = k)
Forward Algorithm (3) • Initialisation: f_0(0) = 1, f_k(0) = 0 for k > 0 • Recursion: f_l(i) = e_l(x_i) · Σ_k f_k(i-1) · a_kl • Termination: P(x) = Σ_k f_k(L) · a_k0
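The same loop with max replaced by a sum, as a sketch (again reusing the casino parameters and treating end transitions as 1); it returns the whole table so the backward pass further on can reuse it:

    def forward(x):
        # f[i][k] = P(x_1..x_i, pi_i = k); one dict per position, 0-based here
        f = [{k: start[k] * e[k][x[0]] for k in states}]
        for obs in x[1:]:
            prev = f[-1]
            f.append({l: e[l][obs] * sum(prev[k] * a[(k, l)] for k in states)
                      for l in states})
        return f

    rolls = [1, 6, 6, 6, 1]
    print(sum(forward(rolls)[-1].values()))   # P(x): sum of the final column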
Three algorithms • What is the most probable path for generating a given sequence? Viterbi Algorithm • How likely is a given sequence? Forward Algorithm • How can we learn the HMM parameters given a set of sequences? Forward-Backward (Baum-Welch) Algorithm
Backward Algorithm (1) • Probability of the rest of the observed sequence after x_i, requiring π_i = k: b_k(i) = P(x_{i+1} … x_L | π_i = k)
Disadvantage of these algorithms • Multiplying many probabilities gives very small numbers, which can lead to underflow errors on the computer • This can be solved by running the algorithms in log space, calculating log(v_l(i)) (see the sketch below)
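In log space the Viterbi products become plain sums of logs, while the forward algorithm's sums need a stable log-sum-exp; a sketch of the latter:

    import math

    def log_sum_exp(vals):
        # Stable log(sum(exp(v) for v in vals)), used wherever the forward
        # algorithm adds probabilities; Viterbi only needs max() of log values.
        m = max(vals)
        if m == float("-inf"):
            return m
        return m + math.log(sum(math.exp(v - m) for v in vals))

    print(log_sum_exp([math.log(0.5), math.log(0.25)]))    # log(0.75)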
Backward Algorithm (2) • Initialisation: b_k(L) = a_k0 for all k • Recursion: b_k(i) = Σ_l a_kl · e_l(x_{i+1}) · b_l(i+1) • Termination: P(x) = Σ_l a_0l · e_l(x_1) · b_l(1)
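A matching sketch of the backward pass, with the end transitions once more treated as 1:

    def backward(x):
        # b[i][k] = P(x_{i+1}..x_L | pi_i = k); final column initialised to 1
        cols = [{k: 1.0 for k in states}]
        for i in range(len(x) - 1, 0, -1):
            nxt = cols[0]
            cols.insert(0, {k: sum(a[(k, l)] * e[l][x[i]] * nxt[l]
                                   for l in states) for k in states})
        return cols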
Posterior State Probability (1) • Probability that observation x_i came from state k, given the observed sequence • Posterior probability of state k at time i when the emitted sequence is known: P(π_i = k | x)
Posterior State Probability (2) • First calculate the probability of producing the entire observed sequence with the ith symbol being produced by state k • P(x, π_i = k) = f_k(i) · b_k(i)
Posterior State Probability (3) • The posterior probabilities will then be: P(π_i = k | x) = f_k(i) · b_k(i) / P(x) • P(x) is the result of the forward or backward calculation
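Putting the two passes together, a sketch of the posterior computation that reuses the forward and backward functions above:

    def posterior(x):
        f, b = forward(x), backward(x)
        px = sum(f[-1].values())                 # P(x) from the forward pass
        # P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)
        return [{k: f[i][k] * b[i][k] / px for k in states}
                for i in range(len(x))]

    for i, col in enumerate(posterior([1, 6, 6, 6, 1])):
        print(i, col)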
Posterior Probabilities (4) • For the casino example
Two questions • How would we decide if a short stretch of genomic sequence comes from a CpG island or not? • How would we find, given a long piece of sequence, the CpG islands in it, if there are any?
Prediction of CpG islands • First way: Viterbi Algorithm - Find the most probable path through the model - When this path goes through the '+' states, a CpG island is predicted
Prediction of CpG islands • Second way: Posterior Decoding - function: G(i|x) = Σ_k P(π_i = k | x) · g(k) - g(k) = 1 for k ∈ {A+, C+, G+, T+} - g(k) = 0 for k ∈ {A-, C-, G-, T-} - G(i|x) is the posterior probability, according to the model, that base i is in a CpG island (see the sketch below)
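As a sketch, G(i|x) drops out of the posterior function above like this, assuming states, a, e and start have been set up for the eight CpG states rather than the casino:

    PLUS = {"A+", "C+", "G+", "T+"}              # states with g(k) = 1

    def cpg_track(x):
        # G(i|x) = sum_k P(pi_i = k | x) * g(k): posterior probability that
        # base i lies in a CpG island
        return [sum(p[k] for k in p if k in PLUS) for p in posterior(x)]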
Summary (1) • A Markov chain is a collection of states where each state depends only on the state before it • A hidden Markov model is a model in which the state sequence is 'hidden'
Summary (2) • Most probable path: Viterbi algorithm • How likely is a given sequence: forward algorithm • Posterior state probability: forward and backward algorithms (used to find the most probable state for an observation)