Hidden Markov Models. What are they good for?

Morten Nielsen, CBS

Absolutely nothing!

Objectives
• Introduce Hidden Markov models and understand that they are just weight matrices with gaps
• See the beauty of sequence profiles
• Position specific scoring matrices (PSSMs)
• Understand what biological problems are best described using HMMs
• And which are not!

Outline
• What is an HMM
• What are they good for?
• How to construct an HMM
• How to “score” a sequence to an HMM
• Viterbi decoding
• HMMs that made a difference
  • Profile HMMs
  • TMHMM
• Links to HMM packages

Markov Models
• A model with no memory
• What I decide depends only on “state” now, not on what I have learned in the past
• No dependence on i-1, i-2 …

A Markov model?
• No memory
• Model generates numbers
  • 312453666641
The unfair casino: Loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1
Fair: 1:1/6 2:1/6 3:1/6 4:1/6 5:1/6 6:1/6; stay fair: 0.95; switch to loaded: 0.05
Loaded: 1:1/10 2:1/10 3:1/10 4:1/10 5:1/10 6:1/2; stay loaded: 0.9; switch to fair: 0.10
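A minimal sketch of the unfair casino as a generator, using the emission and switching probabilities from the slide. The function name `roll_casino`, the `seed` parameter and the choice to start with the fair die are assumptions made for illustration.

```python
import random

# Emission probabilities for faces 1-6 of each die, as on the slide.
EMIT = {
    "F": [1 / 6] * 6,
    "L": [1 / 10] * 5 + [1 / 2],
}
# Probability of switching to the other die after a roll.
SWITCH = {"F": 0.05, "L": 0.10}


def roll_casino(n, seed=None, start="F"):
    """Return (rolls, states) for n rolls; `start` is an assumed initial die."""
    rng = random.Random(seed)
    state, rolls, states = start, [], []
    for _ in range(n):
        # The roll depends only on the current state (no memory).
        rolls.append(rng.choices(range(1, 7), weights=EMIT[state])[0])
        states.append(state)
        if rng.random() < SWITCH[state]:
            state = "L" if state == "F" else "F"
    return "".join(map(str, rolls)), "".join(states)


rolls, states = roll_casino(12, seed=1)
print(rolls)   # what an observer sees
print(states)  # the state sequence the observer does not see
```

Only the first printed line corresponds to what the casino shows; the second line is the path that decoding tries to recover.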

Why hidden?
• Model generates numbers
  • 312453666641
• Does not tell which die was used
• Alignment (decoding) can give the most probable solution/path (Viterbi)
  • FFFFFFLLLLLL
• Or most probable set of states
  • FFFFFFLLLLLL
The unfair casino: Loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1

HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Example from A. Krogh
• Core of alignment: the core region defines the number of states in the HMM (red)
• Insertion and deletion statistics are derived from the non-core part of the alignment (black)

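HMM construction
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
• 5 residues in the gap region: A, 2xC, T, G
• 5 transitions in gap region
  • C out, G out
  • A-C, C-T, T out
• Out transition 3/5
• Stay transition 2/5
[Figure: the resulting HMM. The match states for the six core columns emit A .8/T .2, C .8/G .2, A .8/C .2, A 1., T .8/G .2 and C .8/G .2. The insert state emits A .2, C .4, G .2, T .2; it is entered with probability .6 and skipped with .4, with stay .4 and out .6. All other transitions are 1.]
ACA---ATG: 0.8 x 1 x 0.8 x 1 x 0.8 x 0.4 x 1 x 1 x 0.8 x 1 x 0.2 = 3.3 x 10^-2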

Align sequence to HMM
ACA---ATG: 0.8 x 1 x 0.8 x 1 x 0.8 x 0.4 x 1 x 1 x 0.8 x 1 x 0.2 = 3.3 x 10^-2
TCAACTATC: 0.2 x 1 x 0.8 x 1 x 0.8 x 0.6 x 0.2 x 0.4 x 0.4 x 0.4 x 0.2 x 0.6 x 1 x 1 x 0.8 x 1 x 0.8 = 0.0075 x 10^-2
ACAC--AGC = 1.2 x 10^-2
Consensus:
ACAC--ATC = 4.7 x 10^-2
ACA---ATC = 13.1 x 10^-2
Exceptional:
TGCT--AGG = 0.0023 x 10^-2
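The products above can be reproduced with a short sketch. It assumes the match-state emissions are the column frequencies of the five-sequence example alignment, that there is a single insert state after the third core column, and that every transition not listed is 1; the names `MATCH`, `INSERT` and `path_probability` are made up for this illustration.

```python
from math import prod

# Match-state emissions for the six core columns (columns 1-3 and 7-9).
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}  # the single insert state
ENTER, SKIP, STAY, LEAVE = 0.6, 0.4, 0.4, 0.6        # transitions around it


def path_probability(aligned):
    """Probability of one 9-column aligned row, e.g. 'ACA---ATG'."""
    head, gap, tail = aligned[:3], aligned[3:6].replace("-", ""), aligned[6:]
    factors = [MATCH[i].get(c, 0.0) for i, c in enumerate(head + tail)]
    if gap:
        factors.append(ENTER)
        factors += [INSERT[c] for c in gap]   # one emission per inserted residue
        factors += [STAY] * (len(gap) - 1)    # self-loops inside the insert state
        factors.append(LEAVE)
    else:
        factors.append(SKIP)
    return prod(factors)


for row in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACAC--ATC", "ACA---ATC", "TGCT--AGG"]:
    print(row, f"{path_probability(row):.2e}")
```

Transitions with probability 1 are left out of the product because they do not change it.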

Align sequence to HMM - Null model
• Score depends strongly on length
• Null model is a random model. For length L the score is 0.25^L
• Log-odds score for sequence S: Log(P(S)/0.25^L)
• Positive score means more likely than Null model
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = -0.97
Note!
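As a small worked sketch of the log-odds score: the listed values are reproduced when the logarithm is the natural logarithm, which is inferred from the numbers rather than stated on the slide. The function name `log_odds` and its parameters are illustrative.

```python
from math import log


def log_odds(p_model, length, p_null=0.25):
    """log(P(S) / p_null**length), computed in log space (natural log assumed)."""
    return log(p_model) - length * log(p_null)


# P(S) values taken from the products for ACA---ATG (L = 6) and TGCT--AGG (L = 7).
print(round(log_odds(0.032768, 6), 1))
print(round(log_odds(2.304e-5, 7), 2))
```

The length L counts residues only, so gap characters in the aligned row do not contribute to the null model.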

Model decoding (Viterbi)
Example: 1245666. What was the series of dice used to generate this output?
Log model
Fair: 1:-0.78 2:-0.78 3:-0.78 4:-0.78 5:-0.78 6:-0.78; stay fair: -0.02; switch to loaded: -1.3
Loaded: 1:-1 2:-1 3:-1 4:-1 5:-1 6:-0.3; stay loaded: -0.05; switch to fair: -1

Dynamic programming: computation of scores
[Figure: alignment matrix for the sequences TCGCA and TCCA, with one cell marked x]
Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).
=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.
Each new score is found by choosing the maximum of three possibilities.
For each square in matrix: keep track of where best score came from.
Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.
score(x,y) = max of:
  score(x,y-1) - gap-penalty
  score(x-1,y-1) + substitution-score(x,y)
  score(x-1,y) - gap-penalty
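The recursion can be sketched as follows. The scoring values (match +1, mismatch -1, gap penalty 1) and the function name `global_alignment` are assumptions for illustration; the slide does not give them. The two sequences are the ones labelling the matrix in the figure.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap_penalty=1):
    """Fill the score matrix row by row; return (final score, pointer matrix)."""
    rows, cols = len(b) + 1, len(a) + 1
    score = [[0] * cols for _ in range(rows)]
    came_from = [[None] * cols for _ in range(rows)]
    for x in range(1, cols):           # first row: only gaps
        score[0][x], came_from[0][x] = -gap_penalty * x, "left"
    for y in range(1, rows):           # first column: only gaps
        score[y][0], came_from[y][0] = -gap_penalty * y, "up"
    for y in range(1, rows):
        for x in range(1, cols):
            substitution = match if a[x - 1] == b[y - 1] else mismatch
            # Maximum of the three possibilities, remembering where it came from.
            score[y][x], came_from[y][x] = max(
                (score[y - 1][x - 1] + substitution, "diag"),
                (score[y - 1][x] - gap_penalty, "up"),
                (score[y][x - 1] - gap_penalty, "left"),
            )
    return score[-1][-1], came_from


best, pointers = global_alignment("TCGCA", "TCCA")
print(best)
```

Following the pointers back from the lower right corner recovers the alignment itself; that traceback is left out here to keep the sketch short.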

Model decoding (Viterbi)
Example: 1245666. Series of dice is FFFFLLL
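A minimal Viterbi sketch for this example, working with log10 probabilities as in the log model. It assumes the game starts with the fair die, which the slide does not state; under that assumption the decoded path is the one given above, whereas with equal start probabilities the all-Loaded path scores slightly higher for these numbers. The names `viterbi`, `START` and `lg` are illustrative.

```python
from math import inf, log10

STATES = "FL"
EMIT = {
    "F": {str(k): 1 / 6 for k in range(1, 7)},
    "L": {**{str(k): 1 / 10 for k in range(1, 6)}, "6": 1 / 2},
}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
START = {"F": 1.0, "L": 0.0}  # assumption: the casino starts with the fair die


def lg(p):
    """log10 that maps probability 0 to minus infinity."""
    return log10(p) if p > 0 else -inf


def viterbi(rolls):
    """Return (most probable state path, its log10 score) for a string of rolls."""
    score = {s: lg(START[s]) + lg(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for r in rolls[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda q: prev[q] + lg(TRANS[q][s]))
            ptr[s] = best  # keep track of where the best score came from
            score[s] = prev[best] + lg(TRANS[best][s]) + lg(EMIT[s][r])
        back.append(ptr)
    state = max(STATES, key=score.get)
    best_score, path = score[state], [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path)), best_score


path, logp = viterbi("1245666")
print(path, round(logp, 2))
```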

HMMs and weight matrices
• In the case of un-gapped alignments HMMs become simple weight matrices

HMM construction
[Figure: the example HMM from before, with the insert state crossed out (X).]

HMM construction
[Figure: the example HMM without the insert state; every transition is 1.]
ACA---ATG
sco = 0.8 x 1 x 0.8 x 1 x 0.8 x 1 x 1 x 1 x 0.8 x 1 x 0.2 = 8.2 x 10^-2
or
Log-sco = log(0.8) + log(0.8) + log(0.8) + log(1) + log(0.8) + log(0.2)
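With every transition equal to 1, scoring reduces to a weight-matrix lookup. The sketch below assumes the six columns are the match-state emissions of the example HMM; `WEIGHT_MATRIX` and `log_score` are illustrative names.

```python
from math import exp, inf, log

# Emission probabilities of the six match states (assumed from alignment
# columns 1-3 and 7-9); with no insert state, all transitions are 1.
WEIGHT_MATRIX = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]


def log_score(seq):
    """Sum of log emission probabilities for an ungapped sequence."""
    if len(seq) != len(WEIGHT_MATRIX):
        raise ValueError("sequence length must equal the number of columns")
    total = 0.0
    for column, residue in zip(WEIGHT_MATRIX, seq):
        p = column.get(residue, 0.0)
        total += log(p) if p > 0 else -inf
    return total


print(exp(log_score("ACAATG")))  # product of the emission probabilities
```

Residues never seen in a column get probability 0 here, which is the problem pseudo counts are meant to address.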

HMMs and weight matrices
• In the case of un-gapped alignments HMMs become simple weight matrices
• To achieve high performance, the emission frequencies are estimated using the techniques of
  • Sequence weighting
  • Pseudo counts
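To illustrate the idea of pseudo counts only, here is a sketch using plain add-one (Laplace) pseudo counts on a single alignment column. This is a swapped-in simplification: the slides do not say which pseudo count scheme is used, and sequence weighting is not covered. The function name `column_frequencies` and the `pseudo` parameter are assumptions.

```python
from collections import Counter


def column_frequencies(column, alphabet="ACGT", pseudo=1.0):
    """Emission frequencies for one column, with `pseudo` added to every count."""
    counts = Counter(c for c in column if c in alphabet)  # gaps are ignored
    total = sum(counts.values()) + pseudo * len(alphabet)
    return {a: (counts[a] + pseudo) / total for a in alphabet}


# First column of the five-sequence example alignment: A, T, A, A, A.
print(column_frequencies("ATAAA", pseudo=0.0))  # raw frequencies
print(column_frequencies("ATAAA"))              # no residue gets probability 0
```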

HMMs. What are they good for?
• Weight matrices do not deal with insertions and deletions
• In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension
• An HMM is a natural framework where insertions/deletions are dealt with explicitly

Profile HMMs
• Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner
• Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix)
• Profile HMMs are ideally suited to describe such position specific variations

What goes wrong when Blast fails?
• Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches in the two protein sequences

Alignment scoring matrices
• BLOSUM62 score matrix. Fg=1. Ng=0?
• Score = 2+6+6+4-1 = 17
LAGDS
I-GDS
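The score can be recomputed with a small sketch. It includes only the four BLOSUM62 entries that appear in the sum, and it reads Fg and Ng as the penalties for the first gap position and for each following gap position, which is an interpretation of the abbreviations. The names `BLOSUM62_SUBSET`, `substitution` and `alignment_score` are illustrative.

```python
# Only the BLOSUM62 entries needed for this example (values as in the sum above).
BLOSUM62_SUBSET = {("I", "L"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def substitution(x, y):
    """Symmetric lookup; raises KeyError for pairs outside the subset."""
    return BLOSUM62_SUBSET[tuple(sorted((x, y)))]


def alignment_score(row_a, row_b, first_gap=1, next_gap=0):
    """Score two aligned rows of equal length; '-' marks a gap."""
    score, in_gap = 0, False
    for x, y in zip(row_a, row_b):
        if "-" in (x, y):
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            score += substitution(x, y)
            in_gap = False
    return score


print(alignment_score("LAGDS", "I-GDS"))
```

The same substitution values are used at every position, which is exactly the limitation that sequence profiles remove.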

What goes wrong when Blast fails?
• Conventional sequence alignment uses a (BLOSUM) scoring matrix to identify amino acid matches in the two protein sequences
• This scoring matrix is identical at all positions in the protein sequence!
EVVFIGDSLVQLMHQC
X X X X X X
AGDS.GGGDS

When Blast works! 1PLC._ 1PLB._

When Blast fails! 1PLC._ 1PMY._

Sequence profiles
• In reality not all positions in a protein are equally likely to mutate
• Some amino acids (active sites) are highly conserved, and the score for mismatch must be very high
• Other amino acids can mutate almost for free, and the score for mismatch should be lower than the BLOSUM score
• Sequence profiles can capture these differences

Profile HMMs
ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
Annotations in the figure: Non-conserved; Insertion; Conserved; Deletion; Must have a G; Anything can match
Core: Position with < 2 gaps
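The core rule (a position with fewer than 2 gaps) can be sketched in a few lines. For brevity the example uses the five-sequence DNA alignment from earlier in the lecture rather than the eight protein sequences above; the function name `core_columns` and the `max_gaps` parameter are illustrative.

```python
def core_columns(alignment, max_gaps=1):
    """Indices of columns with at most `max_gaps` gaps (fewer than 2 by default)."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if column.count("-") <= max_gaps
    ]


alignment = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
print(core_columns(alignment))  # 0-based indices of the core positions
```

Each core column becomes a match state; the remaining columns feed the insert statistics.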

HMM vs. alignment
• Detailed description of core
  • Conserved/variable positions
• Price for insertions/deletions varies at different locations in sequence
• These features cannot be captured in conventional alignments

Profile-profile scoring matrix 1K7C.A 1WAB._

Profile HMMs
• All M/D pairs must be visited once
Sequence: L1 - Y2 A3 V4 R5 - I6
State path: P1 D2 P3 P4 I4 P5 D6 P7
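A sketch of how an aligned row maps to such a state path. It assumes P denotes a match state, D a delete state and I an insert state, and that the fifth alignment column (the V) is a non-core column; both are inferred from the example, not stated on the slide. The function name `state_path` and the `is_core` mask are illustrative.

```python
def state_path(row, is_core):
    """Map an aligned row to profile-HMM states, one match/delete per core column."""
    path, position = [], 0
    for residue, core in zip(row, is_core):
        if core:
            position += 1
            # Every core position is visited exactly once: match or delete.
            path.append(f"{'D' if residue == '-' else 'P'}{position}")
        elif residue != "-":
            # Residues in non-core columns are emitted by the insert state
            # that follows the last core position.
            path.append(f"I{position}")
    return path


mask = [True, True, True, True, False, True, True, True]
print(" ".join(state_path("L-YAVR-I", mask)))
```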

Example. Sequence profiles
• Alignment of protein sequences 1PLC._ and 1GYC.A
  • E-value > 1000
• Profile alignment
  • Align 1PLC._ against Swiss-Prot
  • Make position specific weight matrix from alignment
  • Use this matrix to align 1PLC._ against 1GYC.A
  • E-value < 10^-22. Rmsd = 3.3 Å

Example continued
Score = 97.1 bits (241), Expect = 9e-22
Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%)
Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56
F + G++ N+ + +G + +
Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79
Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98
A G +F G + ++ G+ G V
Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126
Rmsd = 3.3 Å
[Figure: model in red, structure in blue]

HMMs. What are they good for II
• Transmembrane helix proteins
TMHMM. A. Krogh, 2001

Gene Finding

HMM packages
• HMMER (http://hmmer.wustl.edu/)
  • S.R. Eddy, WashU St. Louis. Freely available.
• SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
  • R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
• META-MEME (http://metameme.sdsc.edu/)
  • William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
• NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
  • Freely available to academia, nominal license fee for commercial users.
  • Allows HMM architecture construction.
• EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/)
  • Webserver for Gibbs sampling of protein sequences
