CSCE555 Bioinformatics

CSCE555 Bioinformatics Lecture 6: Hidden Markov Models • Meeting: MW 4:00PM-5:15PM, SWGN2A21 • Instructor: Dr. Jianjun Hu • Course page: http://www.scigen.org/csce555 • HAPPY CHINESE NEW YEAR • University of South Carolina, Department of Computer Science and Engineering, 2008, www.cse.sc.edu

Roadmap • Probabilistic Models of Sequences • Introduction to HMM • Profile HMMs as MSA models • Measuring Similarity between a Sequence and a Profile HMM • Summary

Multiple Sequence Alignment • Alignment containing multiple DNA / protein sequences • Look for conserved regions → similar function • Example:
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

Probabilistic Model: Position-specific scoring matrices (PSSM) • Limitations of PSSM?
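To make the PSSM idea concrete, here is a minimal sketch that builds a log-odds matrix from ungapped sequences and scores a window against it. The uniform background, the pseudocount of 1, and the example sequences are assumptions for illustration, not values from the lecture.

```python
import math
from collections import Counter

def build_pssm(seqs, alphabet="ACGT", pseudocount=1.0):
    """Log-odds PSSM from equal-length, ungapped sequences (uniform background assumed)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("a PSSM needs ungapped sequences of equal length")
    background = 1.0 / len(alphabet)
    total = len(seqs) + pseudocount * len(alphabet)
    pssm = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)  # symbol counts in column i
        pssm.append({a: math.log2((counts[a] + pseudocount) / total / background)
                     for a in alphabet})
    return pssm

def score(pssm, window):
    """Sum of per-position log-odds scores for a window of the same length."""
    return sum(column[c] for column, c in zip(pssm, window))

# Hypothetical training sequences (not from the lecture).
pssm = build_pssm(["ACGT", "ACGA", "ACGT"])
print(score(pssm, "ACGT"), score(pssm, "TGCA"))
```

The equal-length check hints at the limitation the slide asks about: every scored window must have exactly as many positions as the matrix has columns, so an insertion or deletion of variable length has no place in the model.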

Difficulty in biological sequences • Variation in a family of sequences • Gaps of variable lengths • Conserved segments with different degrees • PSSM cannot handle variable-length gaps • Need a statistical sequence model

Regular Expressions Model • Regular expressions • Protein spelling is much more free than English spelling • [AT] [CG] [AC] [ACGT]* A [TG] [GC] • Limitation of the regular expression model?
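The pattern on this slide can be tried directly with Python's `re` module. This is a small sketch; the test sequences are made up for illustration.

```python
import re

# Pattern from the slide: [AT] [CG] [AC] [ACGT]* A [TG] [GC]
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def matches_motif(seq: str) -> bool:
    """Return True if the whole sequence is accepted by the pattern."""
    return MOTIF.fullmatch(seq) is not None

# Hypothetical example sequences (not from the lecture).
for seq in ["ACAATG", "TCACGTATC", "GCAATG"]:
    print(seq, matches_motif(seq))
```

The answer is only accept or reject: a sequence made of the most common residues at every position and one made of rare residues are treated alike, and a single mismatch rejects the sequence outright. That is one way to see the limitation the slide asks about.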

Hidden Markov Model (HMM) • HMM is: • Statistical model • Well suited for many tasks in molecular biology • Using HMM in molecular biology • Probabilistic profile (profile HMM) • From a family of proteins, for searching a database for other members of the family • Resemble the profile and weight matrix methods • Grammatical structure • Gene finding • Recognize signals • Prediction (must follow the rules of a gene)

Detect Cheating in a Coin Toss Game • Fair and biased coins could be used • Question: is it possible to determine whether a biased coin has been used, based only on the output sequence of heads and tails? • HTTTHTHTHTTHHHHTHTHTHTHHHHTHT

Example: Fair Coin Toss • Consider the single-coin scenario • We could model the process producing the sequence of H’s and T’s as a Markov model with two states, H and T, and equal transition probabilities: every transition (H→H, H→T, T→H, T→T) has probability 0.5 • Only one fair coin is used here

Example: Fair and Biased Coins • Consider the scenario where there are two coins: a fair coin and a biased coin • The visible states do not correspond to the hidden states • Visible state: output of H or T • Hidden state: which coin was tossed • HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
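One way to approach the cheating question is to compare how probable the toss sequence is under a single fair coin and under a two-coin hidden Markov model. The sketch below uses the forward recursion to sum over all hidden coin choices. All numeric parameters are hypothetical, since the lecture gives none for this model.

```python
# Hypothetical parameters: the lecture does not give numbers for the two-coin model.
STATES = ("fair", "biased")
START = {"fair": 0.5, "biased": 0.5}
# TRANS[r][s] = P(next coin is s | current coin is r)
TRANS = {"fair": {"fair": 0.9, "biased": 0.1},
         "biased": {"fair": 0.1, "biased": 0.9}}
# EMIT[s][x] = P(toss shows x | coin s was tossed)
EMIT = {"fair": {"H": 0.5, "T": 0.5},
        "biased": {"H": 0.8, "T": 0.2}}

def forward(obs, states, start, trans, emit):
    """Total probability of obs, summed over all hidden state paths."""
    # f[s] = probability of the tosses seen so far with the current coin being s
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        f = {s: emit[s][symbol] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())

tosses = "HTTTHTHTHTTHHHHTHTHTHTHHHHTHT"  # sequence from the slide
p_two_coins = forward(tosses, STATES, START, TRANS, EMIT)
p_fair_only = 0.5 ** len(tosses)
print(p_two_coins, p_fair_only)
```

Comparing the two printed probabilities gives a likelihood-based answer, but the result depends entirely on the assumed switching and bias values.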

Hidden Markov Models

Ingredients of an HMM • Collection of states: $\{S_1, S_2, \dots, S_N\}$ • State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$ • Initial state distribution: $\pi_i = P(q_1 = S_i)$ • Observations: $\{O_1, O_2, \dots, O_M\}$ • Observation probabilities: $B_j(k) = P(v_t = O_k \mid q_t = S_j)$

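Ingredients of Our HMM • States: $\{S_{\text{sunny}}, S_{\text{rainy}}, S_{\text{snowy}}\}$ • State transition probabilities (transition matrix): $A$ = [matrix shown on the original slide] • Initial state distribution: $\pi = (0.7, 0.25, 0.05)$ • Observations: $\{O_1, O_2, \dots, O_M\}$ • Observation probabilities (emission matrix): $B$ = [matrix shown on the original slide]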

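Probability of a Sequence of Events • $P(O) = P(O_{\text{gloves}}, O_{\text{gloves}}, O_{\text{umbrella}}, \dots, O_{\text{umbrella}}) = \sum_{\text{all } Q} P(O \mid Q)\, P(Q) = \sum_{q_1, \dots, q_7} P(O \mid q_1, \dots, q_7)\, P(q_1, \dots, q_7) = 0.7 \times 0.86 \times 0.32 \times 0.14 \times 0.6 + \dots$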
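The sum over all state paths can be written out literally. The sketch below enumerates every path Q for the weather model and adds up P(O | Q) P(Q). Only the initial distribution (0.7, 0.25, 0.05) comes from the slide; the transition and emission values, and the third observation "sunglasses", are hypothetical placeholders because the original matrices are not available here.

```python
from itertools import product

STATES = ("sunny", "rainy", "snowy")
PI = {"sunny": 0.70, "rainy": 0.25, "snowy": 0.05}  # from the slide
# Hypothetical transition matrix: A[prev][cur] = P(cur tomorrow | prev today)
A = {"sunny": {"sunny": 0.80, "rainy": 0.15, "snowy": 0.05},
     "rainy": {"sunny": 0.40, "rainy": 0.50, "snowy": 0.10},
     "snowy": {"sunny": 0.30, "rainy": 0.20, "snowy": 0.50}}
# Hypothetical emission matrix: B[state][observation]
B = {"sunny": {"gloves": 0.1, "umbrella": 0.1, "sunglasses": 0.8},
     "rainy": {"gloves": 0.2, "umbrella": 0.7, "sunglasses": 0.1},
     "snowy": {"gloves": 0.8, "umbrella": 0.1, "sunglasses": 0.1}}

def path_probability(path, obs):
    """P(O | Q) P(Q) for a single state path Q."""
    p = PI[path[0]] * B[path[0]][obs[0]]
    for prev, cur, symbol in zip(path, path[1:], obs[1:]):
        p *= A[prev][cur] * B[cur][symbol]
    return p

def sequence_probability(obs):
    """P(O): sum of P(O | Q) P(Q) over every possible state path Q."""
    return sum(path_probability(q, obs)
               for q in product(STATES, repeat=len(obs)))

print(sequence_probability(["gloves", "gloves", "umbrella"]))
```

With three states and seven observations this enumerates 3^7 = 2187 paths, which is still feasible. The count grows exponentially with sequence length, which is why the forward algorithm is used in practice.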

Typical HMM Problems • Annotation: given a model M and an observed string S, what is the most probable path through M generating S? • Classification: given a model M and an observed string S, what is the total probability of S under M? • Consensus: given a model M, what is the string having the highest probability under M? • Training: given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings
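The annotation problem is usually solved with the Viterbi algorithm. Below is a minimal sketch in log space with backpointers, run on a two-coin model whose parameters are hypothetical. It assumes all probabilities are nonzero so that the logarithms are defined.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return (log probability, path) of the most probable state path for obs."""
    # score[s] = best log probability of any path that ends in state s
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # one dict per step: best predecessor of each state
    for symbol in obs[1:]:
        prev_score, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev_score[r] + math.log(trans[r][s]))
            pointers[s] = best
            score[s] = (prev_score[best] + math.log(trans[best][s])
                        + math.log(emit[s][symbol]))
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

# Hypothetical two-coin model: F = fair coin, B = biased coin.
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}  # TRANS[from][to]
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

log_p, path = viterbi("HHHHTTTT", STATES, START, TRANS, EMIT)
print(log_p, "".join(path))
```

Replacing the `max` over predecessors with a sum of probabilities turns the same recursion into the forward algorithm, which answers the classification question.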

HMM Profiles as Sequence Models • Given a multiple alignment of sequences, we can use an HMM to model the sequences • Each column of the alignment may be represented by a hidden state that produced that column • Insertions and deletions may be represented by other states

Profile HMMs • HMM with a structure that in a natural way allows position-dependent gap penalties • Main states • model the columns of the alignment • Insert states • model highly variable regions • Delete states • to jump over one or more columns • i.e. to model the situation when just a few of the sequences have a “-” in the multiple alignment at a position

HMM Sequences Continued

Profile HMM Example • Consider the six sequences shown below • A multiple sequence alignment of these sequences is the first step towards inducing the hidden Markov model
SEQ1 G C C C A
SEQ2 A G C
SEQ3 A A G C
SEQ4 A G A A
SEQ5 A A A C
SEQ6 A G C

Profile HMM Topology • The topology of the HMM is established using the consensus sequence • The structure of a profile HMM is shown in the slide figure • Square boxes represent match states • Diamonds represent insert states • Circles represent delete states

Profile HMM Example Continued • The aligned columns correspond to either emissions from the match state or to emissions from the insert state • The consensus columns are used to define the match states M1,M2,M3 for the HMM • After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology

Transition Probabilities • The values of the transition probabilities are computed from the frequency of each transition, counted as each sequence is traced through the model • The model parameters are computed using the state transition sequences shown in the slide figure

Transition Probabilities Continued • The frequency of each of the transitions and the corresponding emission probabilities are shown below
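Estimating transition probabilities by frequency amounts to counting how often each state follows another and normalising per source state. The sketch below does this; the state paths are hypothetical stand-ins, since the paths from the slide figure are not reproduced here.

```python
from collections import Counter, defaultdict

def estimate_transitions(paths):
    """Turn observed state paths into transition probabilities by counting."""
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())  # all transitions leaving src
        probs[src] = {dst: n / total for dst, n in outgoing.items()}
    return probs

# Hypothetical state paths through a three-column profile HMM
# (M = match, I = insert, D = delete); not the paths from the lecture.
PATHS = [
    ["BEGIN", "M1", "M2", "M3", "END"],
    ["BEGIN", "M1", "I1", "M2", "M3", "END"],
    ["BEGIN", "D1", "M2", "M3", "END"],
    ["BEGIN", "M1", "M2", "D3", "END"],
]

probs = estimate_transitions(PATHS)
for src, outgoing in probs.items():
    print(src, outgoing)
```

Transitions that never occur in the training paths get no entry at all, which is equivalent to probability zero. Real profile HMM tools usually add pseudocounts to avoid that.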

Emission Probabilities • The emission probability is computed using the formula: [formula shown on the original slide] • The emission probabilities specify, for state k, the probability of emitting each of the |Σ| symbols of the alphabet Σ

Emission Probabilities Continued • The emission probability for each state is computed as shown below:
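A common way to estimate the emission probabilities of a match state is to take the relative frequency of each symbol in the corresponding alignment column. The sketch below does that, with an optional pseudocount. The pseudocount and the example column are assumptions, and this is not necessarily the exact formula used on the slide.

```python
from collections import Counter

ALPHABET = "ACGT"

def emission_probabilities(column, pseudocount=0.0):
    """Relative symbol frequencies in one alignment column; gaps are skipped."""
    counts = Counter(c for c in column if c in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    if total == 0:
        raise ValueError("column has no residues and no pseudocount was given")
    return {a: (counts[a] + pseudocount) / total for a in ALPHABET}

# Hypothetical column: three A's, one G, and one gap.
print(emission_probabilities("AAGA-"))
print(emission_probabilities("AAGA-", pseudocount=1.0))
```

Without a pseudocount, any symbol absent from the column gets probability zero, so a single unseen residue would make a whole sequence impossible under the model.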

Searching the Profile HMM • Sequences can be searched against the HMM to detect whether or not they belong to the family of sequences described by the profile HMM • With global alignment, the most probable alignment of a sequence to the model, and its probability, can be determined using the Viterbi algorithm • The full probability of a sequence aligning to the profile HMM is determined using the forward algorithm

How Well Does a Sequence Fit a Model? • The probability depends on the length of the sequence • So the raw probability is not suitable to use as a score

Length-independent Score • Log-odds score • The logarithm of the ratio between the probability of the sequence under the model and its probability under a null model

Length-independent Score • HMM using log-odds
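The log-odds score can be computed as a difference of log probabilities. The sketch below assumes a null model that emits each residue independently from a background distribution, and takes the model probability as an input. The uniform background and the numbers in the example are illustrative only.

```python
import math

def log_odds(model_prob, seq, background):
    """log( P(seq | model) / P(seq | null) ) with an independent-residue null model."""
    null_log_prob = sum(math.log(background[c]) for c in seq)
    return math.log(model_prob) - null_log_prob

# Hypothetical uniform background over DNA.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# A sequence the model finds 8 times more probable than the null model does.
seq = "ACGT"
print(log_odds(8 * 0.25 ** len(seq), seq, UNIFORM))
```

Both probabilities shrink as the sequence gets longer, so their ratio is far less sensitive to length than the raw probability. A positive score means the model explains the sequence better than the null model does.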

Summary • HMM • How to build a profile HMM • Scoring the fit between a sequence and an HMM

Next Lecture • Gene-finding • Reading: • Textbook (CG) chapter 4 • Textbook (EB) chapter 8
