
Hidden Markov Models: What are they good for?



Presentation Transcript


  1. Hidden Markov Models: What are they good for? Morten Nielsen, CBS

  2. Absolutely nothing!

  3. Objectives • Introduce Hidden Markov models and understand that they are just weight matrices with gaps • See the beauty of sequence profiles • Position specific scoring matrices (PSSMs) • Understand what biological problems are best described using HMM’s • And which are not!

  4. Outline • What is an HMM • What are they good for? • How to construct an HMM • How to “score” a sequence to an HMM • Viterbi decoding • HMM’s that made a difference • Profile HMMs • TMHMM • Links to HMM packages

  5. Markov Models • A model with no memory • What I decide depends only on “state” now, not on what I have learned in the past • No dependence on i-1, i-2 …

  6. A Markov model? • No memory • Model generates numbers • 312453666641 • The unfair casino: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1 [Figure: two states. Fair: stay 0.95, to Loaded 0.05; emits 1-6 with probability 1/6 each. Loaded: stay 0.9, to Fair 0.10; emits 1-5 with probability 1/10 each and 6 with probability 1/2.]

  7. Why hidden? • Model generates numbers • 312453666641 • Does not tell which die was used • Alignment (decoding) can give the most probable solution/path (Viterbi) • FFFFFFLLLLLL • Or most probable set of states • FFFFFFLLLLLL • The unfair casino: loaded die p(6) = 0.5; switch fair to loaded: 0.05; switch loaded to fair: 0.1 [Figure: two states. Fair: stay 0.95, to Loaded 0.05; emits 1-6 with probability 1/6 each. Loaded: stay 0.9, to Fair 0.10; emits 1-5 with probability 1/10 each and 6 with probability 1/2.]
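To make the "hidden" part concrete, here is a minimal Python sketch of the unfair casino as a generator. The transition and emission probabilities are the ones on the slide; the function name `roll_casino`, the fixed seed, and the choice to start with the fair die are assumptions made for illustration. The observer would see only `rolls`; `states` is the hidden path.

```python
import random

FACES = "123456"
EMIT = {
    "F": [1 / 6] * 6,          # fair die: every face 1/6
    "L": [0.1] * 5 + [0.5],    # loaded die: p(6) = 0.5, others 1/10
}
SWITCH = {"F": 0.05, "L": 0.10}  # probability of changing die after a roll


def roll_casino(n, seed=0):
    """Generate n rolls plus the hidden F/L state behind each roll."""
    rng = random.Random(seed)
    state = "F"  # assumption: the game starts with the fair die
    rolls, states = [], []
    for _ in range(n):
        rolls.append(rng.choices(FACES, weights=EMIT[state])[0])
        states.append(state)
        # Markov property: the next state depends only on the current one.
        if rng.random() < SWITCH[state]:
            state = "L" if state == "F" else "F"
    return "".join(rolls), "".join(states)


rolls, states = roll_casino(12)
print(rolls)   # what the player sees
print(states)  # what the player does not see
```

Decoding is the reverse problem: given only the first string, recover a plausible version of the second.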

  8. HMM (a simple example) • Example from A. Krogh • ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC • Core region defines the number of states in the HMM (red) • Insertion and deletion statistics are derived from the non-core part of the alignment (black) [Figure label: Core of alignment]

  9. HMM construction • 5 matches. A, 2xC, T, G • 5 transitions in gap region • C out, G out • A-C, C-T, T out • Out transition 3/5 • Stay transition 2/5 • ACA---ATG TCAACTATC ACAC--AGC AGA---ATC ACCG--ATC [Figure: HMM with three match states, one insert state, then three more match states. Match emissions: A .8/T .2; C .8/G .2; A .8/C .2; A 1; T .8/G .2; C .8/G .2. Insert emissions: A .2, C .4, G .2, T .2. Transitions: into insert .6, past insert .4, stay in insert .4, out of insert .6; all others 1.] • ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2
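The counting on this slide can be sketched in a few lines of Python. This is only an illustration of plain frequency counting on the five example sequences: the variable names and the hard-coded choice of columns 4-6 as the gap region are assumptions, and no pseudo counts or sequence weighting are applied.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
GAP_REGION = slice(3, 6)  # assumption: alignment columns 4-6 are the non-core part


def column_frequencies(col):
    """Emission frequencies of one core (match) column."""
    counts = Counter(seq[col] for seq in ALIGNMENT)
    return {base: counts[base] / len(ALIGNMENT) for base in "ACGT"}


core_columns = [c for c in range(9) if not 3 <= c < 6]
match_emissions = [column_frequencies(c) for c in core_columns]

# Residues each sequence places in the gap region: '', 'ACT', 'C', '', 'G'.
inserts = [seq[GAP_REGION].replace("-", "") for seq in ALIGNMENT]
insert_counts = Counter("".join(inserts))
n_inserted = sum(insert_counts.values())          # 5 residues: A, 2xC, T, G
insert_emissions = {b: insert_counts[b] / n_inserted for b in "ACGT"}

# 3 of 5 sequences enter the insert state.
enter_insert = sum(1 for s in inserts if s) / len(ALIGNMENT)
# Of the 5 transitions leaving an inserted residue, 2 stay (A-C, C-T), 3 go out.
stay_insert = sum(len(s) - 1 for s in inserts if s) / n_inserted
leave_insert = 1 - stay_insert

print(match_emissions[0])
print(insert_emissions, enter_insert, stay_insert, leave_insert)
```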

  10. Align sequence to HMM • ACA---ATG 0.8x1x0.8x1x0.8x0.4x1x1x0.8x1x0.2 = 3.3x10^-2 • TCAACTATC 0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075x10^-2 • ACAC--AGC = 1.2x10^-2 • Consensus: ACAC--ATC = 4.7x10^-2, ACA---ATC = 13.1x10^-2 • Exceptional: TGCT--AGG = 0.0023x10^-2
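The products above can be reproduced with a short sketch. The probabilities are the ones read off the example model; the function name `path_probability` and the convention of passing the sequence in its 9-column aligned form (so the path through the model is already fixed) are assumptions for illustration. Match-to-match transitions are 1 and are therefore left out of the product.

```python
# Emission probabilities of the six match states (unlisted bases have probability 0).
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
TO_INSERT, SKIP_INSERT = 0.6, 0.4   # transitions out of the third match state
STAY_INSERT, LEAVE_INSERT = 0.4, 0.6


def path_probability(aligned):
    """Probability of a 9-column aligned sequence, e.g. 'ACA---ATG'."""
    core = aligned[:3] + aligned[6:]
    inserted = aligned[3:6].replace("-", "")
    p = 1.0
    for emissions, base in zip(MATCH, core):
        p *= emissions.get(base, 0.0)
    if inserted:
        p *= TO_INSERT * STAY_INSERT ** (len(inserted) - 1) * LEAVE_INSERT
        for base in inserted:
            p *= INSERT[base]
    else:
        p *= SKIP_INSERT
    return p


for seq in ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "ACAC--ATC", "ACA---ATC", "TGCT--AGG"]:
    print(seq, f"{path_probability(seq) * 100:.4f} x 10^-2")
```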

  11. Align sequence to HMM - Null model • Score depends strongly on length • Null model is a random model. For length L the score is 0.25^L • Log-odds score for sequence S: Log( P(S)/0.25^L ) • Positive score means more likely than Null model • ACA---ATG = 4.9 • TCAACTATC = 3.0 • ACAC--AGC = 5.3 • AGA---ATC = 4.9 • ACCG--ATC = 4.6 • Consensus: ACAC--ATC = 6.7, ACA---ATC = 6.3 • Exceptional: TGCT--AGG = -0.97
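A small sketch of the log-odds calculation follows. Two details are inferred rather than stated on the slide: the listed scores come out when the natural logarithm is used, and when L counts residues only (gap characters excluded). The function name `log_odds` is an assumption; the input probabilities are the rounded values from the previous slide.

```python
import math


def log_odds(p_model, length, background=0.25):
    """Natural-log odds of the HMM probability against a uniform null model."""
    return math.log(p_model / background ** length)


# (aligned sequence, P(S) as rounded on the previous slide)
examples = [
    ("ACA---ATG", 3.3e-2),
    ("TCAACTATC", 0.0075e-2),
    ("ACAC--AGC", 1.2e-2),
    ("TGCT--AGG", 0.0023e-2),
]
for aligned, p in examples:
    length = len(aligned.replace("-", ""))  # residues only, gaps do not count
    print(aligned, round(log_odds(p, length), 2))
```

Dividing by 0.25^L removes most of the length dependence, which is why the 9-residue sequence TCAACTATC ends up with a positive score despite its very small raw probability.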

  12. Model decoding (Viterbi) • Example: 1245666. What was the series of dice used to generate this output? [Figure: log model (log10 of the casino probabilities). Fair: stay -0.02, to Loaded -1.3; emissions 1:-0.78 2:-0.78 3:-0.78 4:-0.78 5:-0.78 6:-0.78. Loaded: stay -0.05, to Fair -1; emissions 1:-1 2:-1 3:-1 4:-1 5:-1 6:-0.3.]

  13. Dynamic programming: computation of scores • Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”). • => Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities. • Each new score is found by choosing the maximum of three possibilities. • For each square in matrix: keep track of where best score came from. • Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner. • score(x,y) = max of: score(x,y-1) - gap-penalty; score(x-1,y-1) + substitution-score(x,y); score(x-1,y) - gap-penalty [Figure: alignment matrix; axis letters T C G C A and T C C A]
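The recursion on this slide can be sketched as a small global alignment routine in Python. The scoring values (match +1, mismatch -1, gap penalty 1) and the function name `global_align` are assumptions chosen for illustration; the slide does not give a substitution matrix.

```python
def global_align(a, b, match=1, mismatch=-1, gap=1):
    """Fill the score matrix row by row, then trace back from the lower right."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = -gap * i, "up"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = -gap * j, "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The three ways of reaching cell (i, j).
            candidates = [
                (score[i - 1][j - 1] + sub, "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            # Keep the best score and remember where it came from.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("TCGCA", "TCCA"))
```

Viterbi decoding of an HMM uses the same idea: each cell keeps the best score over its possible predecessors plus a back-pointer.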

  14. Model decoding (Viterbi) • Example: 1245666. What was the series of dice used to generate this output? [Figure: log model (log10 of the casino probabilities). Fair: stay -0.02, to Loaded -1.3; emissions 1:-0.78 2:-0.78 3:-0.78 4:-0.78 5:-0.78 6:-0.78. Loaded: stay -0.05, to Fair -1; emissions 1:-1 2:-1 3:-1 4:-1 5:-1 6:-0.3.]

  15. Model decoding (Viterbi) [Figure: log model as on slide 14.]

  16. Model decoding (Viterbi) • Identify what series of dice was used to generate this output [Figure: log model as on slide 14.]

  17. Model decoding (Viterbi) • Series of dice is FFFFLLL [Figure: log model as on slide 14.]
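The decoding of 1245666 can be sketched in Python as follows. Transition and emission probabilities are those of the unfair casino; scores are kept as log10 values, as in the log model. The slides do not state a start distribution, so this sketch assumes the game begins with the fair die; that assumption, like the function name `viterbi`, is mine, and the decoded path depends on it.

```python
import math

STATES = ("F", "L")
START = {"F": 1.0, "L": 0.0}  # assumption: the first roll uses the fair die
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}


def log10(p):
    return math.log10(p) if p > 0 else float("-inf")


def viterbi(rolls):
    """Return the most probable state path and its log10 probability."""
    scores = [{s: log10(START[s]) + log10(EMIT[s][rolls[0]]) for s in STATES}]
    pointers = []
    for roll in rolls[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda q: scores[-1][q] + log10(TRANS[q][s]))
            row[s] = scores[-1][prev] + log10(TRANS[prev][s]) + log10(EMIT[s][roll])
            ptr[s] = prev
        scores.append(row)
        pointers.append(ptr)

    # Trace back from the best final state.
    last = max(STATES, key=lambda s: scores[-1][s])
    path = [last]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), scores[-1][last]


path, log_p = viterbi("1245666")
print(path, round(log_p, 2))
```

Under the stated start assumption the sketch returns the path FFFFLLL, in agreement with the slide.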

  18. HMM’s and weight matrices • In the case of un-gapped alignments HMM’s become simple weight matrices

  19. HMM construction [Figure: the HMM of slide 9, with an X marked over the insert-state part of the diagram.]

  20. HMM construction [Figure: the same HMM without the insert state; all transitions 1. Match emissions: A .8/T .2; C .8/G .2; A .8/C .2; A 1; T .8/G .2; C .8/G .2.] • ACA---ATG sco = 0.8x1x0.8x1x0.8x1x1x1x0.8x1x0.2 = 8.2x10^-2 or Log-sco = log(0.8)+log(0.8)+log(0.8)+log(1)+log(0.8)+log(0.2)
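Once every transition is 1, scoring reduces to a weight matrix lookup: one log term per position, summed. A minimal sketch, assuming the six match-state emission probabilities of the example model and natural logarithms (the names `WEIGHT_MATRIX` and `log_score` are made up for illustration):

```python
import math

# One column per core position; bases not listed were never observed.
WEIGHT_MATRIX = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]


def log_score(seq):
    """Sum of log emission probabilities for an ungapped 6-residue sequence."""
    total = 0.0
    for column, base in zip(WEIGHT_MATRIX, seq):
        p = column.get(base, 0.0)
        if p == 0.0:
            return float("-inf")  # an unobserved base makes the product zero
        total += math.log(p)
    return total


s = log_score("ACAATG")
print(round(s, 3), round(math.exp(s), 5))
```

The `-inf` branch shows why raw frequencies are rarely used as they are: a single unobserved base rules out the whole sequence.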

  21. HMM’s and weight matrices • In the case of un-gapped alignments HMM’s become simple weight matrices • To achieve high performance, the emission frequencies are estimated using the techniques of • Sequence weighting • Pseudo counts
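As a rough illustration of what a pseudo count does, the sketch below adds a constant to every count before normalising. This add-one style scheme is only the simplest possible choice and is an assumption here; the slides do not say which pseudo count method is meant, and sequence weighting is not shown at all. The function name and the `pseudo` parameter are hypothetical.

```python
from collections import Counter


def emission_with_pseudocounts(column, alphabet="ACGT", pseudo=1.0):
    """Emission frequencies for one alignment column with a flat pseudo count."""
    counts = Counter(column)
    total = len(column) + pseudo * len(alphabet)
    return {base: (counts[base] + pseudo) / total for base in alphabet}


# First core column of the example alignment: A, T, A, A, A.
print(emission_with_pseudocounts("ATAAA"))
```

With five observations and a pseudo count of 1, A drops from 0.8 to 5/9 and the unobserved C and G each receive 1/9, so no sequence gets a probability of exactly zero.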

  22. HMMs. What are they good for? • Weight matrices do not deal with insertions and deletions • In alignments, this is done in an ad-hoc manner by optimization of the two gap penalties for first gap and gap extension • An HMM is a natural framework where insertions/deletions are dealt with explicitly

  23. Profile HMM’s • Alignments based on conventional scoring matrices (BLOSUM62) score all positions in a sequence in an equal manner • Some positions are highly conserved, some are highly variable (more than what is described in the BLOSUM matrix) • Profile HMM’s are ideally suited to describe such position specific variations

  24. What goes wrong when Blast fails? • Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences

  25. Alignment scoring matrices • Blosum62 score matrix. Fg=1. Ng=0?

  26. Alignment scoring matrices • Blosum62 score matrix. Fg=1. Ng=0? • Score =2+6+6+4-1=17 LAGDS I-GDS
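The sum on this slide can be reproduced with a short sketch. Only the four BLOSUM62 entries needed for this example are included, and "Fg" and "Ng" are read here as the penalties for the first gap position and for each following gap position; that reading, like the function name `score_alignment`, is an assumption.

```python
# The BLOSUM62 entries used in the example (the matrix is symmetric).
BLOSUM62_SUBSET = {("L", "I"): 2, ("G", "G"): 6, ("D", "D"): 6, ("S", "S"): 4}


def substitution(a, b):
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # raises KeyError for pairs not in the subset


def score_alignment(top, bottom, first_gap=1, next_gap=0):
    """Score two equal-length aligned strings, '-' marking a gap."""
    total, in_gap = 0, False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            total += substitution(a, b)
            in_gap = False
    return total


print(score_alignment("LAGDS", "I-GDS"))  # 2 - 1 + 6 + 6 + 4
```

The same substitution score is applied wherever a residue pair occurs in the sequence, which is the limitation the following slides address.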

  27. What goes wrong when Blast fails? • Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences • This scoring matrix is identical at all positions in the protein sequence! [Figure: EVVFIGDSLVQLMHQC and AGDS.GGGDS, with positions marked X]

  28. When Blast works! 1PLC._ 1PLB._

  29. When Blast fails! 1PLC._ 1PMY._

  30. Sequence profiles • In reality not all positions in a protein are equally likely to mutate • Some amino acids (active sites) are highly conserved, and the score for mismatch must be very high • Other amino acids can mutate almost for free, and the score for mismatch should be lower than the BLOSUM score • Sequence profiles can capture these differences

  31. Profile HMM’s [Figure labels: Non-conserved; Insertion; Conserved; Deletion; Must have a G; Anything can match] ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI • Core: Position with < 2 gaps

  32. HMM vs. alignment • Detailed description of core • Conserved/variable positions • Price for insertions/deletions varies at different locations in sequence • These features cannot be captured in conventional alignments

  33. Profile-profile scoring matrix 1K7C.A 1WAB._

  34. Profile HMM’s All M/D pairs must be visited once L1-Y2A3V4R5-I6 P1D2P3P4I4P5D6P7

  35. Example. Sequence profiles • Alignment of protein sequences 1PLC._ and 1GYC.A • E-value > 1000 • Profile alignment • Align 1PLC._ against Swiss-Prot • Make position specific weight matrix from alignment • Use this matrix to align 1PLC._ against 1GYC.A • E-value < 10^-22. Rmsd=3.3

  36. Example continued Score = 97.1 bits (241), Expect = 9e-22 Identities = 13/107 (12%), Positives = 27/107 (25%), Gaps = 17/107 (15%) Query: 3 ADDGSLAFVPSEFSISPGEKI------VFKNNAGFPHNIVFDEDSIPSGVDASKIS 56 F + G++ N+ + +G + + Sbjct: 26 ------VFPSPLITGKKGDRFQLNVVDTLTNHTMLKSTSIHWHGFFQAGTNWADGP 79 Query: 57 MSEEDLLNAKGETFEVAL---SNKGEYSFYCSP--HQGAGMVGKVTV 98 A G +F G + ++ G+ G V Sbjct: 80 AFVNQCPIASGHSFLYDFHVPDQAGTFWYHSHLSTQYCDGLRGPFVV 126 Rmsd=3.3 Å Model red Structure blue

  37. HMMs. What are they good for II • Transmembrane helix proteins

  38. HMMs. What are they good for II • Transmembrane helix proteins TMHMM. A. Krogh, 2001

  39. Gene Finding

  40. HMM packages • HMMER (http://hmmer.wustl.edu/) • S.R. Eddy, WashU St. Louis. Freely available. • SAM (http://www.cse.ucsc.edu/research/compbio/sam.html) • R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users. • META-MEME (http://metameme.sdsc.edu/) • William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search. • NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html) • Freely available to academia, nominal license fee for commercial users. • Allows HMM architecture construction. • EasyGibbs (http://www.cbs.dtu.dk/biotools/EasyGibbs/) • Webserver for Gibbs sampling of protein sequences
