CSCE555 Bioinformatics. Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555. HAPPY CHINESE NEW YEAR. University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.
Roadmap • Probabilistic Models of Sequences • Introduction to HMM • Profile HMMs as MSA models • Measuring Similarity between Sequence and HMM Profile model • Summary
Multiple Sequence Alignment • Alignment containing multiple DNA / protein sequences • Look for conserved regions → similar function • Example:
#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Opossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT
Probabilistic Model: Position-specific scoring matrices (PSSM) • Limitations of PSSM?
Difficulty in biological sequences • Variation in a family of sequences • Gaps of variable lengths • Conserved segments with different degrees • PSSM cannot handle variable-length gaps • Need a statistical sequence model
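To make the PSSM idea and its limitation concrete, here is a minimal sketch that builds a log-odds PSSM from gap-free, equal-length sequences. The function names, the pseudocount of 1, and the uniform background are assumptions chosen for illustration; the input rows are the first nine columns of the rat, mouse, rabbit, and human sequences from the alignment slide.

```python
import math
from collections import Counter

def build_pssm(seqs, alphabet="ACGT", pseudocount=1.0, background=None):
    """Return one dict per column: symbol -> log2(column frequency / background)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("a PSSM needs equal-length, gap-free sequences")
    background = background or {a: 1.0 / len(alphabet) for a in alphabet}
    total = len(seqs) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        pssm.append({a: math.log2((counts[a] + pseudocount) / total / background[a])
                     for a in alphabet})
    return pssm

def score(pssm, seq):
    """Sum the per-column log-odds of seq; seq must have the PSSM's length."""
    return sum(col[ch] for col, ch in zip(pssm, seq))

# First nine columns of the rat, mouse, rabbit, and human rows (no gaps there).
seqs = ["ATGGTGCAC", "ATGGTGCAC", "ATGGTGCAT", "ATGGTGCAC"]
pssm = build_pssm(seqs)
print(score(pssm, "ATGGTGCAC"), score(pssm, "CCCCCCCCC"))
```

Each column is scored independently and the sequence length is fixed, which is why a plain PSSM has no way to express a gap of variable length.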
Regular Expressions Model • Regular expressions • Protein spelling is much more free than English spelling • [AT] [CG] [AC] [ACGT]* A [TG] [GC] • Limitation of the regular expression model?
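The pattern on this slide can be tried directly with Python's `re` module. This is a small sketch; the helper name `matches` and the test strings are illustrative assumptions.

```python
import re

# The slide's pattern, written without the spaces between character classes.
PATTERN = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

def matches(seq):
    """True if the whole sequence is accepted by the pattern."""
    return PATTERN.fullmatch(seq) is not None

for seq in ["ACAATG", "TGCTAGG", "ACGATG"]:
    print(seq, matches(seq))
```

The answer is only accept or reject. The pattern cannot say that one accepted sequence is more typical of the family than another, which is the gap a probabilistic model fills.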
Roadmap • Probabilistic Models of Sequences • Introduction to HMM • Profile HMMs as MSA models • Measuring Similarity between Sequence and HMM Profile model • Summary
Hidden Markov Model (HMM) • HMM is: • Statistical model • Well suited for many tasks in molecular biology • Using HMM in molecular biology • Probabilistic profile (profile HMM) • From a family of proteins, for searching a database for other members of the family • Resemble the profile and weight matrix methods • Grammatical structure • Gene finding • Recognize signals • Prediction (must follow the rules of a gene)
Detect Cheating in Coin Toss Game • Fair and biased coins could be used • Question: is it possible to determine whether a biased coin has been used, based only on the observed sequence of heads and tails? • HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
Example: Fair Coin Toss • Consider the single-coin scenario • We could model the process producing the sequence of H's and T's as a Markov model with two states, H and T, and equal transition probabilities: each of H→H, H→T, T→H, and T→T has probability 0.5 • Only one fair coin is used here
Example: Fair and Biased Coins • Consider the scenario where there are two coins: a fair coin and a biased coin • The visible states do not correspond to the hidden states • Visible state: output of H or T • Hidden state: which coin was tossed • HTTTHTHTHTTHHHHTHTHTHTHHHHTHT
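A short simulation may help separate the hidden and visible layers. This is a sketch under assumed parameters: the switching probability of 0.1 and the biased coin's heads probability of 0.8 are not given on the slides, and `toss_two_coins` is a hypothetical helper name.

```python
import random

def toss_two_coins(n, p_switch=0.1, p_heads_biased=0.8, seed=0):
    """Simulate n tosses; return (hidden coin per toss, visible H/T per toss)."""
    rng = random.Random(seed)
    p_heads = {"F": 0.5, "B": p_heads_biased}  # F = fair coin, B = biased coin
    state = rng.choice("FB")
    states, tosses = [], []
    for _ in range(n):
        states.append(state)
        tosses.append("H" if rng.random() < p_heads[state] else "T")
        if rng.random() < p_switch:  # occasionally swap coins
            state = "B" if state == "F" else "F"
    return "".join(states), "".join(tosses)

hidden, visible = toss_two_coins(29)
print(visible)  # what an observer sees
print(hidden)   # which coin produced each toss (normally unknown)
```

An observer receives only the second string; recovering the first from it is the inference problem an HMM addresses.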
Ingredients of an HMM • Collection of states: $\{S_1, S_2, \dots, S_N\}$ • State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$ • Initial state distribution: $\pi_i = P(q_1 = S_i)$ • Observations: $\{O_1, O_2, \dots, O_M\}$ • Observation probabilities: $B_j(k) = P(v_t = O_k \mid q_t = S_j)$
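The five ingredients map naturally onto a small container type. The sketch below is one possible layout, not the course's code: the class name `HMM`, the field names, and every number in the two-coin example are assumptions. Transitions are stored as `trans[source][destination]`, so the slide's $A_{ij}$ (destination index first) corresponds to `trans[S_j][S_i]`.

```python
import math
from dataclasses import dataclass

@dataclass
class HMM:
    states: list   # hidden states S_1..S_N
    symbols: list  # observable symbols O_1..O_M
    start: dict    # start[s]    = P(q_1 = s)
    trans: dict    # trans[s][r] = P(q_{t+1} = r | q_t = s)
    emit: dict     # emit[s][o]  = P(v_t = o | q_t = s)

    def validate(self):
        """Raise ValueError unless every distribution covers its keys and sums to 1."""
        def check(dist, keys, label):
            if set(dist) != set(keys) or not math.isclose(sum(dist.values()), 1.0):
                raise ValueError(f"{label} is not a probability distribution")
        check(self.start, self.states, "start")
        for s in self.states:
            check(self.trans[s], self.states, f"trans[{s}]")
            check(self.emit[s], self.symbols, f"emit[{s}]")
        return True

# Hypothetical two-coin model; all numbers are illustrative assumptions.
coin_hmm = HMM(
    states=["fair", "biased"],
    symbols=["H", "T"],
    start={"fair": 0.5, "biased": 0.5},
    trans={"fair": {"fair": 0.9, "biased": 0.1},
           "biased": {"fair": 0.1, "biased": 0.9}},
    emit={"fair": {"H": 0.5, "T": 0.5},
          "biased": {"H": 0.8, "T": 0.2}},
)
coin_hmm.validate()
```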
Ingredients of Our HMM • States: $\{S_{sunny}, S_{rainy}, S_{snowy}\}$ • State transition probabilities (transition matrix): A = (matrix shown on the slide) • Initial state distribution: $\pi = (0.7,\ 0.25,\ 0.05)$ • Observations: $\{O_1, O_2, \dots, O_M\}$ • Observation probabilities (emission matrix): B = (matrix shown on the slide)
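Probability of a Sequence of Events • $P(O) = P(O_{gloves}, O_{gloves}, O_{umbrella}, \dots, O_{umbrella}) = \sum_{\text{all } Q} P(O \mid Q)\,P(Q) = \sum_{q_1,\dots,q_7} P(O \mid q_1,\dots,q_7)\,P(q_1,\dots,q_7) = 0.7 \times 0.86 \times 0.32 \times 0.14 \times 0.6 + \dots$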
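The sum over all state paths can be computed literally, or far more cheaply with the forward algorithm. The sketch below does both and checks that they agree. Only the initial distribution (0.7, 0.25, 0.05) comes from the slides; the transition and emission values, the third symbol "nothing", and the seven-item observation sequence are hypothetical stand-ins for the matrices and sequence that are not reproduced in this transcript.

```python
import math
from itertools import product

states = ["sunny", "rainy", "snowy"]
start = {"sunny": 0.7, "rainy": 0.25, "snowy": 0.05}  # from the slide
# Hypothetical values: trans[s][r] = P(next = r | current = s)
trans = {"sunny": {"sunny": 0.80, "rainy": 0.15, "snowy": 0.05},
         "rainy": {"sunny": 0.38, "rainy": 0.60, "snowy": 0.02},
         "snowy": {"sunny": 0.75, "rainy": 0.05, "snowy": 0.20}}
# Hypothetical values: emit[s][o] = P(observe o | state s)
emit = {"sunny": {"gloves": 0.05, "umbrella": 0.05, "nothing": 0.90},
        "rainy": {"gloves": 0.10, "umbrella": 0.80, "nothing": 0.10},
        "snowy": {"gloves": 0.70, "umbrella": 0.20, "nothing": 0.10}}

def brute_force(obs):
    """Sum P(O | Q) P(Q) over every state path Q (N**T terms)."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, o in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][o]
        total += p
    return total

def forward(obs):
    """Same quantity via dynamic programming (on the order of T * N**2 steps)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

obs = ["gloves", "gloves", "umbrella", "nothing", "umbrella", "umbrella", "umbrella"]
assert math.isclose(brute_force(obs), forward(obs))
```

With three states and seven observations the brute-force sum has 3^7 = 2187 terms, which is still feasible; the forward recursion is what keeps the computation tractable for realistic sequence lengths.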
Typical HMM Problems • Annotation: given a model M and an observed string S, what is the most probable path through M generating S? • Classification: given a model M and an observed string S, what is the total probability of S under M? • Consensus: given a model M, what is the string having the highest probability under M? • Training: given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings
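The annotation problem is usually solved with the Viterbi algorithm. Below is a minimal log-space sketch applied to a two-coin model; the model parameters (stay probability 0.9, biased heads probability 0.9) are assumptions chosen for illustration, and the code assumes all probabilities are nonzero so that the logarithms are defined.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return (most probable state path, its log joint probability with obs)."""
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            pointers[s] = best
        back.append(pointers)
    last = max(states, key=score.get)
    path = [last]
    for pointers in reversed(back):  # trace the best path backwards
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]

# Hypothetical two-coin model: F = fair, B = biased.
states = ["F", "B"]
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}

path, logp = viterbi("HHHHHHHH", states, start, trans, emit)
print(path, logp)
```

Replacing the `max` over predecessors with a sum (and working with probabilities rather than a single best path) turns the same recursion into the forward algorithm used for the classification problem.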
Roadmap • Probabilistic Models of Sequences • Introduction to HMM • Profile HMMs as MSA models • Measuring Similarity between Sequence and HMM Profile model • Summary
HMM Profiles as Sequence Models • Given the multiple alignment of sequences, we can use HMM to model the sequences • Each column of the alignment may be represented by a hidden state that produced that column • Insertions and deletions may be represented by other states
Profile HMMs • HMM with a structure that in a natural way allows position-dependent gap penalties • Main states • model the columns of the alignment • Insert states • model highly variable regions • Delete states • to jump over one or more columns • i.e. to model the situation when just a few of the sequences have a “-” in the multiple alignment at a position
Profile HMM Example • Consider the six sequences shown below • A multiple sequence alignment of these sequences is the first step in the process of inducing the hidden Markov model
SEQ1 G C C C A
SEQ2 A G C
SEQ3 A A G C
SEQ4 A G A A
SEQ5 A A A C
SEQ6 A G C
Profile HMM Topology • The topology of the HMM is established using the consensus sequence • The structure of a profile HMM is shown in the slide figure • Square boxes represent match states • Diamonds represent insert states • Circles represent delete states
Profile HMM Example Continued • The aligned columns correspond to either emissions from the match state or to emissions from the insert state • The consensus columns are used to define the match states M1,M2,M3 for the HMM • After defining the match states, the corresponding insert and delete states are used to define the complete HMM topology
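One common way to pick the consensus columns is to treat a column as a match column when fewer than half of its rows are gaps; whether the slides use exactly this rule is an assumption. The sketch below applies it to a hypothetical toy alignment (not the six-sequence example, whose aligned form is not reproduced here) and converts each row into its state path.

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Flag each column True (match column) if its gap fraction is below the cutoff."""
    n = len(alignment)
    width = len(alignment[0])
    return [sum(row[c] == "-" for row in alignment) / n < max_gap_fraction
            for c in range(width)]

def state_path(row, is_match):
    """Translate one aligned row into BEGIN, M/I/D states, END."""
    path, k = ["BEGIN"], 0
    for ch, m in zip(row, is_match):
        if m:
            k += 1
            path.append(f"M{k}" if ch != "-" else f"D{k}")  # residue vs. gap
        elif ch != "-":
            path.append(f"I{k}")  # residue in a non-consensus column
    path.append("END")
    return path

# Hypothetical toy alignment, for illustration only.
toy = ["AG-C", "AGAC", "A--C", "TG-C", "AG-A"]
is_match = match_columns(toy)
for row in toy:
    print(row, state_path(row, is_match))
```

In the toy alignment the third column is mostly gaps, so it becomes an insert position and the model has three match states, M1 to M3.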
Transition Probabilities • The values of the transition probabilities are computed from the frequency of each transition as the sequences are considered one by one • The model parameters are computed using the state transition sequences shown in the slide figure
Transition Probabilities Continued • The frequency of each of the transitions and the corresponding emission probabilities are shown below
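Frequency-based transition estimation amounts to counting consecutive state pairs and normalizing per source state. The sketch below uses hypothetical state paths for a three-match-state model; they are not the paths from the slide figure, and `transition_probs` is an illustrative name.

```python
from collections import Counter

def transition_probs(paths):
    """Estimate P(dst | src) as count(src -> dst) / count(src -> anything)."""
    counts = Counter()
    for path in paths:
        counts.update(zip(path, path[1:]))  # consecutive state pairs
    totals = Counter()
    for (src, _dst), c in counts.items():
        totals[src] += c
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}

# Hypothetical state paths, one per training sequence.
paths = [
    ["BEGIN", "M1", "M2", "M3", "END"],
    ["BEGIN", "M1", "M2", "I2", "M3", "END"],
    ["BEGIN", "M1", "D2", "M3", "END"],
    ["BEGIN", "M1", "M2", "M3", "END"],
    ["BEGIN", "M1", "M2", "M3", "END"],
]
probs = transition_probs(paths)
print(probs[("M1", "M2")], probs[("M1", "D2")])
```

Transitions that never occur in the training paths get no entry at all, that is, probability zero. Practical profile HMMs usually add pseudocounts so that unseen transitions remain possible.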
Emission Probabilities • The emission probability is computed using the formula on the slide • The emission probability specifies the probability of emitting each of the $|\Sigma|$ symbols of the alphabet $\Sigma$ in state k
Emission Probabilities Continued • The emission probability for each state is computed as shown below:
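The slide's formula is not reproduced in this transcript. A widely used estimator, shown here as an assumption rather than as the slide's exact formula, is the observed count of a symbol in a state plus a pseudocount, divided by the total count plus the pseudocount times the alphabet size.

```python
from collections import Counter

def emission_probs(residues, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(symbol | state) from the residues emitted by that state.

    residues: the symbols assigned to one state (gap characters excluded).
    """
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Hypothetical residues observed in match state M1 of a toy alignment.
m1 = emission_probs("AAATA")
print(m1)
```

With a pseudocount of 1 the four A's and one T give P(A) = 5/9 and P(T) = 2/9, while the unseen C and G each keep 1/9 instead of dropping to zero.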
Searching the Profile HMM • Sequences can be searched against the HMM to detect whether or not they belong to the family of sequences described by the profile HMM • Using a global alignment, the most probable alignment of a sequence to the model, and its probability, can be determined using the Viterbi algorithm • The full probability of a sequence aligning to the profile HMM is determined using the forward algorithm
How Well Does a Sequence Fit a Model? • The probability of a sequence depends on its length • The raw probability is therefore not suitable as a score
Length-independent Score • Log-odds score • The logarithm of the probability of the sequence divided by the probability according to a null model
Length-independent Score • HMM using log-odds
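The log-odds score can be sketched in a few lines. The uniform background and the model probabilities below are hypothetical; in practice the model's log-probability would come from the forward or Viterbi algorithm.

```python
import math

def null_log2_prob(seq, background):
    """log2 probability of seq under a null model with independent positions."""
    return sum(math.log2(background[ch]) for ch in seq)

def log_odds(model_log2_prob, seq, background):
    """log2 of P(seq | model) / P(seq | null), in bits."""
    return model_log2_prob - null_log2_prob(seq, background)

background = {a: 0.25 for a in "ACGT"}  # assumed uniform null model

# Hypothetical model that assigns probability 0.5 to every residue.
short_score = log_odds(math.log2(0.5 ** 3), "ACG", background)
long_score = log_odds(math.log2(0.5 ** 6), "ACGACG", background)
print(short_score, long_score)
```

The raw probabilities (0.125 and about 0.016) shrink as the sequence grows, whereas the log-odds score stays at one bit per residue in this toy case, since the null model's own length penalty is subtracted out.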
Summary • HMM • How to build Profile HMM model • Scoring Fit between Sequence and HMM model
Next Lecture • Gene-finding • Reading: • Textbook (CG) chapter 4 • Textbook (EB) chapter 8