Algorithms in Computational Biology: Markov Chains and Hidden Markov Models
Example: CpG Islands • Dinucleotide CG (written CpG to distinguish it from a C-G base pair) • C within CG is typically methylated • Methyl-C is more likely to mutate to T • CpG dinucleotides are therefore rarer in the genome than would be expected from the independent probabilities of C and G • The methylation process is suppressed in short stretches of the genome • More CpG dinucleotides are found in promoter regions of genes • CpG islands are regions with many CpGs • Typically a few hundred to a few thousand bases long
Questions about CpG Islands • Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not? • Given a long piece of sequence, how would we find the CpG islands in it, if there are any?
Markov Chains [diagram: a four-state Markov chain over the DNA alphabet, with states A, C, G, T and transition arrows between the states]
Key Property of a Markov Chain • The probability of each symbol xi depends only on the value of the preceding symbol xi-1
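Written out (the formula itself is not reproduced on the slide), the Markov property lets the probability of a whole sequence factor into a product of transition probabilities:

```latex
P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1)
     = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i},
\qquad a_{st} = P(x_i = t \mid x_{i-1} = s).
```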
Modeling the Beginning and End of Sequences [diagram: the four-state chain A, C, G, T extended with a begin state B and an end state E]
Using Markov Chains for Discrimination • A CpG island model (+) and a non-CpG island model (-), each with its own transition probability table [tables not shown]
Cont' • For discrimination, the log-odds ratio of the two models is calculated: S(x) = log [ P(x | model +) / P(x | model -) ] = Σi log( a+xi-1xi / a-xi-1xi ), where a+ and a- are the transition probabilities of the CpG island and non-CpG island models
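As a concrete sketch (not from the slides), the log-odds score can be computed directly from two transition tables; the numbers below are illustrative placeholders, not the trained values from the slide's tables:

```python
import math

def log_odds(seq, a_plus, a_minus):
    """Length-normalized log-odds score of seq under two first-order Markov chains.

    a_plus[(s, t)] and a_minus[(s, t)] are transition probabilities P(t | s) for the
    CpG-island (+) and non-island (-) models; positive scores favour the + model.
    """
    score = sum(math.log2(a_plus[(s, t)] / a_minus[(s, t)])
                for s, t in zip(seq, seq[1:]))
    return score / len(seq)

# Toy numbers for illustration only; real values are estimated by counting
# dinucleotides in known CpG-island and non-island training sequences.
bases = "ACGT"
a_minus = {(s, t): 0.25 for s in bases for t in bases}
a_plus = dict(a_minus)
a_plus[("C", "G")] = 0.40                      # CpG islands are enriched for C followed by G
for t in "ACT":
    a_plus[("C", t)] = 0.20

print(log_odds("CGCGCGAT", a_plus, a_minus))   # positive score suggests a CpG island
```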
Histogram of Length-Normalized Scores [figure: two histograms of length-normalized log-odds scores, one for non-CpG islands and one for CpG islands]
Locating CpG Islands in a DNA Sequence Input: A long DNA sequence X = (x1, x2, …, xL) Output: CpG islands along X • Use the two Markov chain models • Calculate the log-odds score for a window of length k (e.g., 100) sliding along the sequence • A total of L-k+1 scores will be computed and plotted • CpG islands will stand out with positive values
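A window-scanning sketch built on the log_odds helper above, using the window length and positive-score rule described on the slide:

```python
def window_scores(seq, a_plus, a_minus, k=100):
    """Length-normalized log-odds score for every window of length k along seq.

    Returns L - k + 1 scores; windows with positive scores are candidate CpG islands.
    """
    return [log_odds(seq[i:i + k], a_plus, a_minus)
            for i in range(len(seq) - k + 1)]
```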
Problems with Markov Chain Models in Locating CpG Islands • CpG islands have sharp boundaries, which fixed-length windows blur • CpG islands have variable lengths, so no single window size fits all of them These problems can be better addressed by building a single model for the entire sequence that incorporates both Markov chains
Formal Definition of an HMM • A hidden Markov model is a triplet M = (Σ, Q, Θ), where • Σ is an alphabet of symbols • Q is a finite set of states, capable of emitting symbols from the alphabet • Θ is a set of probabilities, comprised of • State transition probabilities, denoted by akl for each k, l ∈ Q • Emission probabilities, denoted by ek(b) for each state k ∈ Q and symbol b ∈ Σ
Cont' • π (state sequence or path) • π = (π1, π2, …, πL) • Follows a simple Markov chain (the probability of a state depends only on the previous state) • State transition probability • akl = P{πi = l | πi-1 = k} • Emission probability • Given a sequence X = (x1, x2, …, xL), the emission probability ek(b) is defined as: ek(b) = P{xi = b | πi = k} • The probability that the sequence X was generated by M along the path π is: P(x, π) = a0π1 ∏i=1..L eπi(xi) aπiπi+1, with πL+1 = 0 (the end state)
An HMM for Detecting CpG Islands in a Long DNA Sequence • Alphabet: Σ = {A, C, G, T} • States: Q = {A+, C+, G+, T+, A-, C-, G-, T-} • Emissions: states A+ and A- emit A, C+ and C- emit C, G+ and G- emit G, T+ and T- emit T • The emission probability of each state X+ and X- is 1 for emitting symbol X and 0 for emitting any other symbol (a special feature of this HMM)
Transition Matrix for CpG Island HMM [8×8 matrix not shown] p is the probability of staying in a CpG island and q is the probability of staying in a non-CpG island; schematically, transitions among + states carry the factor p, transitions from + to - states the factor 1-p, transitions among - states the factor q, and transitions from - to + states the factor 1-q
Occasionally Dishonest Casino Dealer • In a casino, a dealer uses a fair die most of the time, but occasionally he switches to a loaded die. The loaded die has a probability of 0.5 for a six and a probability of 0.1 for each of the numbers one to five. The dealer switches from the fair to the loaded die with probability 0.05 before each roll, and switches back with probability 0.1. • In each state of the Markov process the outcomes of a roll have different probabilities, so the process can be modeled using an HMM
HMM for the Occasionally Dishonest Casino Dealer • Q = {F, L} (fair, loaded) • Σ = {1, 2, 3, 4, 5, 6} What is hidden? Which die (fair or loaded) produced each roll; only the sequence of rolls is observed
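One way to write this model down as plain data (a sketch; the start distribution a0k is an assumption, since the slide does not say how the first state is chosen):

```python
states = ['F', 'L']                      # F = fair die, L = loaded die
alphabet = [1, 2, 3, 4, 5, 6]

# a[k][l] = probability of moving from state k to state l before the next roll;
# 0 is a silent begin state.
a = {
    0:   {'F': 0.5,  'L': 0.5},          # assumed uniform start
    'F': {'F': 0.95, 'L': 0.05},
    'L': {'F': 0.10, 'L': 0.90},
}

# e[k][b] = probability that state k emits roll b.
e = {
    'F': {b: 1 / 6 for b in alphabet},
    'L': {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}
```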
HMMs Generate Sequences • Generate a sequence via an HMM • Choose π1 according to the probabilities a0i • An observation (x1) is emitted according to the probabilities eπ1 • Choose π2 according to the probabilities aπ1i • An observation (x2) is emitted according to the probabilities eπ2 • And so forth… • P(x) is the probability that sequence x was generated by the model • The joint probability of an observed sequence x and a state sequence π: P(x, π) = a0π1 ∏i=1..L eπi(xi) aπiπi+1
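A minimal generator following exactly this recipe, assuming the casino dictionaries a and e from the previous sketch; no end state is modeled, so the length is fixed in advance:

```python
import random

def generate(a, e, length):
    """Sample (observations, path): pick the next state from a[current], then
    emit a symbol from e[state], repeating for the requested length."""
    xs, path = [], []
    state = random.choices(list(a[0]), weights=list(a[0].values()))[0]
    for _ in range(length):
        path.append(state)
        symbols, probs = zip(*e[state].items())
        xs.append(random.choices(symbols, weights=probs)[0])
        state = random.choices(list(a[state]), weights=list(a[state].values()))[0]
    return xs, path

rolls, true_path = generate(a, e, 10)
```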
Most Probable State Path • A CpG island example • The sequence CGCG can be emitted by, for example: (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C+, G-) • Which state sequence is most likely for the observation? • The most probable path is defined as: π* = argmaxπ P(x, π) • If the probability vk(i) of the most probable path ending in state k with observation xi is known for all states k, then vl(i+1) can be computed recursively: vl(i+1) = el(xi+1) maxk( vk(i) akl )
Finding the Most Probable Path Using the Viterbi Algorithm Initialization (i = 0): v0(0) = 1, vk(0) = 0 for k > 0 Recursion (i = 1…L): vl(i) = el(xi) maxk( vk(i-1) akl ); ptri(l) = argmaxk( vk(i-1) akl ) Termination: P(x, π*) = maxk( vk(L) ak0 ); πL* = argmaxk( vk(L) ak0 ) Traceback (i = L…1): πi-1* = ptri(πi*)
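A small Viterbi sketch for the casino model above. Probabilities are multiplied directly for readability (a practical implementation works in log space, as discussed later), and termination is simplified because this model has no end state:

```python
def viterbi(x, states, a, e):
    """Most probable state path for observations x, following the slide's recursion."""
    v = [{l: a[0][l] * e[l][x[0]] for l in states}]     # initialization
    ptr = [{}]
    for i in range(1, len(x)):                          # recursion
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] * a[k][l])
            ptr[i][l] = best_k
            v[i][l] = e[l][x[i]] * v[i - 1][best_k] * a[best_k][l]
    last = max(states, key=lambda k: v[-1][k])          # termination (no end state)
    path = [last]
    for i in range(len(x) - 1, 0, -1):                  # traceback
        path.append(ptr[i][path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([1, 6, 6, 2, 6, 6, 6, 3], states, a, e))
```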
Viterbi Example [table: the Viterbi variables vk(i) and the most probable path for the sequence CGCG under the CpG island HMM]
Sequence of Die Rolls Predicted by Viterbi Algorithm
Finding the Probability of a Sequence for an HMM: the Forward Algorithm Definitions: fk(i) = P(x1 … xi, πi = k), the probability of the observed sequence up to and including xi, requiring that πi = k Algorithm: Initialization (i = 0): f0(0) = 1, fk(0) = 0 for k > 0 Recursion (i = 1…L): fl(i) = el(xi) Σk fk(i-1) akl Termination: P(x) = Σk fk(L) ak0
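The same recursion as a code sketch, reusing the casino model; with no explicit end state, the termination step simply sums over the states at the last position:

```python
def forward(x, states, a, e):
    """Forward algorithm: f[i][k] = P(x[0..i], state at position i is k)."""
    f = [{k: a[0][k] * e[k][x[0]] for k in states}]                 # initialization
    for i in range(1, len(x)):                                      # recursion
        f.append({l: e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in states)
                  for l in states})
    return f, sum(f[-1][k] for k in states)                         # P(x)
```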
Posterior State Probability • We want to know the most probable state for an observation xi • We need to find the probability P(πi = k | x) that observation xi came from each state k, given the entire observed sequence • Writing P(x, πi = k) = fk(i) bk(i), where bk(i) = P(xi+1 … xL | πi = k) is computed by the backward algorithm, gives P(πi = k | x) = fk(i) bk(i) / P(x)
Finding bk(i) Using the Backward Algorithm Initialization (i = L): bk(L) = ak0 for all k Recursion (i = L-1, …, 1): bk(i) = Σl akl el(xi+1) bl(i+1) Termination: P(x) = Σl a0l el(x1) bl(1)
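A matching backward sketch under the same assumptions as the forward code above (no end state, so bk(L) is initialized to 1 rather than ak0):

```python
def backward(x, states, a, e):
    """Backward algorithm: b[i][k] = P(x[i+1..end] | state at position i is k)."""
    b = [dict() for _ in x]
    b[-1] = {k: 1.0 for k in states}                                # initialization
    for i in range(len(x) - 2, -1, -1):                             # recursion
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b
```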
Posterior Decoding • Approach 1: define the sequence of states π̂i = argmaxk P(πi = k | x), i.e., pick the most probable state at each position • Approach 2: for a function g(k) on the states, define G(i | x) = Σk P(πi = k | x) g(k) • E.g., to find the posterior probability according to the model that base i is in a CpG island, let g(k) = 1 for k ∈ {A+, C+, G+, T+} and g(k) = 0 for k ∈ {A-, C-, G-, T-}; then G(i | x) is precisely that posterior probability
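Combining the two passes gives the posterior state probabilities and both decoding approaches; a short sketch using the forward and backward helpers above on the casino model:

```python
def posteriors(x, states, a, e):
    """P(state at i is k | x) = f_k(i) * b_k(i) / P(x)."""
    f, px = forward(x, states, a, e)
    b = backward(x, states, a, e)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

post = posteriors([1, 6, 6, 2, 6, 6, 6, 3], states, a, e)
decoded = [max(p, key=p.get) for p in post]    # approach 1: most probable state per position
p_loaded = [p['L'] for p in post]              # approach 2 with g(L) = 1, g(F) = 0
```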
Use of Posterior Decoding Shaded areas show when the roll was generated by the loaded die
Parameter Estimation for HMMs • Model specification • Structure design • What states there are and how they are connected • Assignment of parameter values • Transition probabilities akl • Emission probabilities ek(b) • Estimation framework • Training sequences x1, …, xn • Work in log space
Estimation When the State Sequence Is Known • Count Akl, the number of times transition k → l is used, and Ek(b), the number of times symbol b is emitted from state k, in the training data • The maximum likelihood estimates are then akl = Akl / Σl' Akl' and ek(b) = Ek(b) / Σb' Ek(b') • Pseudocounts can be added to the counts to avoid zero probabilities when training data are sparse
Estimation When Paths Are Unknown • Baum (1971) • Calculate Akl and Ek(b) as the expected number of times each transition or emission is used, given the training sequences • Subject to local maxima • The result depends on the starting values of the parameters • The probability that akl is used at position i in sequence x is: P(πi = k, πi+1 = l | x, θ) = fk(i) akl el(xi+1) bl(i+1) / P(x)
Expected Transition and Emission Counts • The expected number of times that akl is used is obtained by summing over all positions and over all training sequences: Akl = Σj (1 / P(xj)) Σi fk(i) akl el(xi+1) bl(i+1), where f and b are computed on sequence xj • The expected number of times that letter b appears in state k: Ek(b) = Σj (1 / P(xj)) Σ{i: xi = b} fk(i) bk(i)
Baum-Welch Training (EM Algorithm) Initialization: Pick arbitrary model parameters Recurrence: • Set all the A and E variables to their pseudocount values r (or to zero) • For each sequence j = 1 … n: calculate fk(i) for sequence j using the forward algorithm, calculate bk(i) for sequence j using the backward algorithm, and add the contribution of sequence j to A and E • Calculate the new model parameters • Calculate the new log likelihood of the model Termination: Stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded
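A compressed Baum-Welch sketch built on the forward and backward helpers above. To keep it short it makes two simplifying assumptions not on the slide: the start distribution is not re-estimated, and it runs a fixed number of iterations instead of monitoring the change in log likelihood:

```python
def baum_welch(seqs, states, alphabet, a, e, n_iter=50, pseudocount=1e-3):
    """Accumulate expected transition (A) and emission (E) counts with
    forward/backward, then renormalize to get new parameters."""
    for _ in range(n_iter):
        A = {k: {l: pseudocount for l in states} for k in states}
        E = {k: {s: pseudocount for s in alphabet} for k in states}
        for x in seqs:
            f, px = forward(x, states, a, e)
            b = backward(x, states, a, e)
            for i in range(len(x)):
                for k in states:
                    E[k][x[i]] += f[i][k] * b[i][k] / px
                    if i + 1 < len(x):
                        for l in states:
                            A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
        a = {0: a[0], **{k: {l: A[k][l] / sum(A[k].values()) for l in states}
                         for k in states}}
        e = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in states}
    return a, e
```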
Modeling of Labeled Sequences • HMMs can be used to predict the labeling of unannotated sequences • Training of HMMs • Separately train the model for CpG islands and the model for non-CpG islands • Combine them into a larger HMM • Tedious, especially if there are more than two classes involved • It would be nicer to estimate everything at once • The training set includes all classes (e.g., CpG islands and non-CpG islands) • Each sequence is labeled with the corresponding classes • Let y = y1, …, yL be the labels on the observation x = x1, …, xL
Cont' • The model can be estimated with a slight modification of the Baum-Welch algorithm • Allow only valid paths through the model • A valid path is one where the state labels and sequence labels agree, i.e., πi has label yi • During the forward and backward algorithms this corresponds to setting fl(i) = 0 and bl(i) = 0 for all states l with a label different from yi
Discriminative Estimation • When modeling labeled sequences, the following likelihood is maximized: θML = argmaxθ Σj log P(xj, yj | θ) • Since obtaining a good prediction of y is our primary interest, it is preferable to maximize the following conditional maximum likelihood: θCML = argmaxθ Σj log P(yj | xj, θ) = argmaxθ Σj log [ P(xj, yj | θ) / P(xj | θ) ], where the numerator is the probability calculated by the forward algorithm for the labeled sequences and the denominator is the probability calculated by the forward algorithm disregarding all the labels
HMM Model Structure • Choice of model topology • A fully connected model causes local maxima • In practice, successful HMMs are constructed by carefully deciding which transitions are allowed in the model, based on knowledge about the problem under investigation • Duration modeling • With a single state, the probability of a given length decays exponentially (a geometric distribution): P(L) = (1-p) p^(L-1) (p: self-transition probability; 1-p: probability of leaving the state) • To model a more complex length distribution, introduce several states with the same emission distribution over residues and transitions between each other [diagram: a chain of states, each with self-transition probability p and probability 1-p of moving on] • E.g., a negative binomial length distribution, as shown below
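For the chain of n identical states in the diagram, each with self-transition probability p, the total length spent in the chain follows a negative binomial distribution (a standard result, not spelled out on the slide):

```latex
P(L) = \binom{L-1}{n-1}\, p^{\,L-n} (1-p)^{n}, \qquad L \ge n,
```

which, for n > 1, is peaked rather than monotonically decreasing.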
Numerical Stability of HMM Algorithms • Probabilities become vanishingly small when many probabilities are multiplied together in the Viterbi, forward and backward algorithms • Consequences • Underflow errors • The program may crash • The program may keep running and produce arbitrary wrong numbers
Improving Numerical Stability • Log transform: work with log probabilities, so products become sums (most natural for the Viterbi algorithm) • Scaling of probabilities: for each i define a scaling variable si and divide the forward and backward variables by it, so that the rescaled values stay within the floating-point range (typically used for the forward and backward algorithms)
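As a sketch of the log-transform idea applied to the forward algorithm (reusing the casino dictionaries from earlier; the log-sum-exp helper that keeps the sums stable is an implementation detail not mentioned on the slide):

```python
import math

def log_forward(x, states, a, e):
    """Forward algorithm in log space: products become sums, and sums of
    probabilities are combined with a log-sum-exp to avoid underflow."""
    def logsumexp(vals):
        m = max(vals)
        if m == -math.inf:
            return m
        return m + math.log(sum(math.exp(v - m) for v in vals))

    la = {k: {l: math.log(p) if p > 0 else -math.inf for l, p in row.items()}
          for k, row in a.items()}
    le = {k: {s: math.log(p) if p > 0 else -math.inf for s, p in row.items()}
          for k, row in e.items()}
    lf = [{k: la[0][k] + le[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        lf.append({l: le[l][x[i]] + logsumexp([lf[i - 1][k] + la[k][l] for k in states])
                   for l in states})
    return lf, logsumexp([lf[-1][k] for k in states])
```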