
Introduction to

Explore the essentials of hidden Markov models (HMMs) in bioinformatics, including transition matrices, emission probabilities, and how HMMs can detect genes. Learn how profile HMMs summarize protein alignments and how HMMs detect introns and exons.

smorin

Presentation Transcript


  1. Introduction to Bioinformatics

  2. Mini Exam 3

  3. Mini Exam • Take a pencil and a piece of paper. • Please do not sit too close to your neighbour. • There are three questions. You have 15 minutes in total to write down short but clear answers. • When you are ready, please hand in your answers at the desk in front.

  4. Mini Exam 3 ANSWERS

  5. Introduction to Bioinformatics. LECTURE 4: Hidden Markov Models * Chapter 4: The boulevard of broken dreams

  6. 4.1 The nose knows • In 2004 Richard Axel and Linda Buck received the Nobel Prize for elucidating the olfactory system. • Odorant receptors (ORs) sense certain molecules outside the cell and signal inside the cell • ORs contain 7 transmembrane domains • ORs form the single largest gene family in the human genome, with about 1000 genes, as in mouse, rat and dog • Most human OR genes have become pseudogenes: we lost smell due to vision

  7. 4.2 Hidden Markov models • In 1989 Gary Churchill introduced the use of HMMs for DNA segmentation. • CENTRAL IDEAS: • The string is generated by a system • The system can be in a number of distinct states • The system can change between states with probability T • In each state the system emits symbols to the string with probability E

  8. 4.2 Hidden Markov models • [Figure: three-state HMM. STATE 1 moves to STATE 2 with probability T(1,2) and STATE 2 moves to STATE 3 with probability T(2,3); each state has its own emission probabilities A: pA, T: pT, C: pC, G: pG.] • s = TTCACTGTGAACGATCCGA CCAGTACTACG ACGTTGCCAAAGCGCTTAT • h = 1111111111111111111111112222222222222333333333333333333333333
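The generative picture on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `sample_hmm` and all numeric values below are hypothetical, chosen to mimic a three-state left-to-right model, and are not taken from the lecture.

```python
import random

def sample_hmm(init, T, E, n, seed=0):
    """Generate a symbol string s and hidden state path h of length n.

    init[k]  : probability of starting in state k
    T[k][l]  : probability of moving from state k to state l
    E[k][b]  : probability that state k emits symbol b
    """
    rng = random.Random(seed)
    states = list(init)
    s, h = [], []
    state = rng.choices(states, weights=[init[k] for k in states])[0]
    for _ in range(n):
        symbols = list(E[state])
        # Emit one symbol from the current state, then record the state.
        s.append(rng.choices(symbols, weights=[E[state][b] for b in symbols])[0])
        h.append(state)
        # Move to the next state according to the transition matrix.
        state = rng.choices(states, weights=[T[state][l] for l in states])[0]
    return "".join(s), h

# Hypothetical parameters (assumptions for illustration only).
init = {1: 1.0, 2: 0.0, 3: 0.0}
T = {1: {1: 0.9, 2: 0.1, 3: 0.0},
     2: {1: 0.0, 2: 0.9, 3: 0.1},
     3: {1: 0.0, 2: 0.0, 3: 1.0}}
E = {1: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
     2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     3: {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

s, h = sample_hmm(init, T, E, n=60)
print(s)
print("".join(str(k) for k in h))
```

Only `s` would be observed in practice; recovering `h` from `s` is the segmentation problem that the rest of the lecture addresses.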

  9. HMM essentials • TRANSITION MATRIX = the probability of a state change: T(k,l) = P(h(i)=l | h(i-1)=k) • EMISSION PROBABILITY = symbol probability distribution in a certain state: E(k,b) = P(s(i)=b | h(i)=k)

  10. HMM essentials • INITIAL PROBABILITY of a state: T(0,k) = P(h(1)=k), where 0 denotes the start state • sequence of the states visited: h = h(1) h(2) ... h(n) • sequence of the generated symbols: s = s(1) s(2) ... s(n)

  11. HMM essentials • Probability of the hidden states h: P(h) = T(0,h(1)) * ∏_{i=2..n} T(h(i-1),h(i)) • Probability of the generated symbol string s given the hidden states h: P(s|h) = ∏_{i=1..n} E(h(i),s(i))

  12. HMM essentials • Joint probability of symbol string s and hidden states h: P(s,h) = P(s|h) * P(h)

  13. HMM essentials • Theorem of total probability: P(s) = ∑_h P(s,h) • Most likely sequence: h* = argmax_h P(h|s) = argmax_h P(s,h)
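To make the joint probability, the total probability and the most likely state sequence concrete, here is a minimal brute-force sketch. The two-state model (`x`, `y`) over the alphabet `AB` and all its numbers are hypothetical. Enumerating every path costs |H|^n operations, so this only works for very short strings.

```python
from itertools import product

def joint_probability(s, h, init, T, E):
    """P(s, h): initial probability, then one transition and one emission per step."""
    p = init[h[0]] * E[h[0]][s[0]]
    for i in range(1, len(s)):
        p *= T[h[i - 1]][h[i]] * E[h[i]][s[i]]
    return p

def brute_force(s, states, init, T, E):
    """Return (P(s), most likely path h*, P(s, h*)) by enumerating all paths."""
    total, best_h, best_p = 0.0, None, 0.0
    for h in product(states, repeat=len(s)):
        p = joint_probability(s, h, init, T, E)
        total += p                      # theorem of total probability
        if p > best_p:                  # keep the most likely path so far
            best_h, best_p = h, p
    return total, best_h, best_p

# Hypothetical toy model (assumption, not from the lecture).
states = ["x", "y"]
init = {"x": 0.6, "y": 0.4}
T = {"x": {"x": 0.7, "y": 0.3}, "y": {"x": 0.4, "y": 0.6}}
E = {"x": {"A": 0.5, "B": 0.5}, "y": {"A": 0.1, "B": 0.9}}

total, best_h, best_p = brute_force("AB", states, init, T, E)
print(total, best_h, best_p)
```

For the string `AB` the four paths contribute 0.105, 0.081, 0.008 and 0.0216, so P(s) is 0.2156 and the most likely path is (x, x).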

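  14. EXAMPLE 4.2: Change points in lambda phage • [Figure: two-state HMM with states CG RICH and AT RICH.] • Each state stays where it is with probability 0.9998 and switches to the other state with probability 0.0002 • CG RICH emissions: A: 0.2462, C: 0.2476, G: 0.2985, T: 0.2077 • AT RICH emissions: A: 0.2700, C: 0.2084, G: 0.1981, T: 0.3236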
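A short calculation shows what these numbers imply. The time spent in a state before leaving it is geometrically distributed, so a stay probability of 0.9998 corresponds to segments of about 5000 bases on average. The sketch below uses the parameters of Example 4.2; the helper names are hypothetical.

```python
# Parameters copied from Example 4.2.
T = {"CG": {"CG": 0.9998, "AT": 0.0002},
     "AT": {"AT": 0.9998, "CG": 0.0002}}
E = {"CG": {"A": 0.2462, "C": 0.2476, "G": 0.2985, "T": 0.2077},
     "AT": {"A": 0.2700, "C": 0.2084, "G": 0.1981, "T": 0.3236}}

def expected_segment_length(stay_probability):
    """Mean of a geometric distribution: expected number of symbols before a switch."""
    return 1.0 / (1.0 - stay_probability)

def gc_content(emission):
    """Probability that a state emits C or G."""
    return emission["C"] + emission["G"]

for state in ("CG", "AT"):
    print(state,
          round(expected_segment_length(T[state][state])),
          round(gc_content(E[state]), 4))
```

The C+G fraction is about 0.546 in the CG RICH state and about 0.407 in the AT RICH state. The AT RICH emissions as printed on the slide sum to 1.0001, presumably because of rounding.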

  16. 4.3 Profile hidden Markov models • Characterize sets of homologous genes and proteins based on common patterns in their sequence. • Classic approach: multiple alignments of all elements in the family • Position Specific Scoring Matrices (PSSMs) • PSSMs cannot handle variable lengths or gaps • Profile HMMs (pHMMs) can do this

  17. 4.3 Profile hidden Markov models • See Figure 4.4 for a pHMM for a multiple alignment of: • VIVALASVEGAS • VIVADA-VI--S • VIVADALL--AS

  18. 4.3 Profile hidden Markov models • Profile HMMs (pHMMs) summarize the salient features of a protein alignment in a single model • pHMMs can also be used to produce multiple alignments

  19. 4.4 Finding genes with hidden Markov models • HMMs are better at detecting genes than sequence alignment • HMMs can detect introns and exons • Downside: HMMs are computationally much more demanding!

  20. 4.5 Case study: odorant receptors • The 7-transmembrane (7-TM) G-protein coupled receptors
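A first step towards a profile HMM for this alignment is a per-column summary. The sketch below counts residues in each column and marks a column as a match state when at most half of its entries are gaps. That threshold is a common heuristic and an assumption here; it is not taken from the lecture or from Figure 4.4.

```python
from collections import Counter

alignment = ["VIVALASVEGAS",
             "VIVADA-VI--S",
             "VIVADALL--AS"]

def column_profile(alignment, max_gap_fraction=0.5):
    """Summarize each alignment column: residue counts and match/insert label."""
    n_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):          # iterate over columns
        gaps = column.count("-")
        counts = Counter(c for c in column if c != "-")
        profile.append({
            "is_match": gaps / n_seqs <= max_gap_fraction,
            "counts": counts,
        })
    return profile

profile = column_profile(alignment)
for index, col in enumerate(profile):
    label = "match " if col["is_match"] else "insert"
    print(index, label, dict(col["counts"]))
```

With this threshold, 11 of the 12 columns become match states. Only the column that holds G in the first sequence and gaps in the other two is treated as an insertion.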

  21. EXAMPLE 4.7: odorant receptors • [Figure: two-state HMM with states IN and OUT and transition probabilities P(IN-IN), P(IN-OUT), P(OUT-OUT), P(OUT-IN); each state has an amino-acid emission table, shown as A: 15, R: 11, ..., V: 31.]

  22. 4.6 Algorithms for HMM computations • Probability of the sequence under the given model is: P(s) = ∑_h P(s,h) • the most probable state sequence is: h* = argmax_h P(s,h)

  23. The VITERBI dynamic programming algorithm • Given a sequence s of length n and an HMM with parameters (T,E): • 1. Create table V of size |H|x(n+1); • 2. Initialize i=0; V(0,0)=1; V(k,0)=0 for k>0; • 3. For i=1:n, compute each entry using the recursive relation: V(j,i) = E(j,s(i)) * max_k {V(k,i-1)*T(k,j)}, and store pointer(i,j) = argmax_k {V(k,i-1)*T(k,j)} • 4. OUTPUT: P(s,h*) = max_k {V(k,n)} • 5. Start of trace-back: h*(n) = argmax_k {V(k,n)} • 6. Trace-back: for i=n:1, use h*(i-1) = pointer(i,h*(i)) • 7. OUTPUT: h*
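The recursion and trace-back can be sketched in Python as follows. This version works with log probabilities to avoid numerical underflow on long sequences, and it replaces the explicit start state 0 with an initial distribution `init`. Both are implementation choices, not part of the slide. The model parameters come from Example 4.2; the uniform initial distribution is an assumption. All probabilities must be strictly positive, because the code takes their logarithm.

```python
import math

def viterbi(s, states, init, T, E):
    """Return (log P(s, h*), h*) for the most probable state path h*."""
    # V[i][k]: best log probability of any path that ends in state k at position i.
    V = [{k: math.log(init[k]) + math.log(E[k][s[0]]) for k in states}]
    pointer = [{}]
    for i in range(1, len(s)):
        V.append({})
        pointer.append({})
        for j in states:
            # Best predecessor k for state j at position i.
            best_k = max(states, key=lambda k: V[i - 1][k] + math.log(T[k][j]))
            V[i][j] = (V[i - 1][best_k] + math.log(T[best_k][j])
                       + math.log(E[j][s[i]]))
            pointer[i][j] = best_k
    # Trace-back from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for i in range(len(s) - 1, 0, -1):
        path.append(pointer[i][path[-1]])
    path.reverse()
    return V[-1][last], path

states = ["CG", "AT"]
init = {"CG": 0.5, "AT": 0.5}            # assumption: uniform start
T = {"CG": {"CG": 0.9998, "AT": 0.0002},
     "AT": {"AT": 0.9998, "CG": 0.0002}}
E = {"CG": {"A": 0.2462, "C": 0.2476, "G": 0.2985, "T": 0.2077},
     "AT": {"A": 0.2700, "C": 0.2084, "G": 0.1981, "T": 0.3236}}

log_p, path = viterbi("GGCGCGGC", states, init, T, E)
print(log_p, path)
```

Because switching states costs a factor of 0.0002, a short string such as this one is assigned to a single state. Change points only appear when a long enough stretch favours the other state.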

  24. The FORWARD algorithm • Given a sequence s of length n and an HMM with parameters (T,E): • 1. Create table F of size |H|x(n+1); • 2. Initialize i=0; F(0,0)=1; F(k,0)=0 for k>0; • 3. For i=1:n, compute each entry using the recursive relation: F(j,i) = E(j,s(i)) * ∑_k {F(k,i-1)*T(k,j)} • 4. OUTPUT: P(s) = ∑_k {F(k,n)}
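The forward recursion has the same shape as Viterbi, with the maximum replaced by a sum and no pointers. Below is a minimal sketch on a hypothetical two-state model over the alphabet `AB`; all numbers are made up for illustration. It uses an initial distribution `init` in place of the start state 0 and works with plain probabilities. For long sequences you would rescale each column or work in log space to avoid underflow.

```python
from itertools import product

def forward(s, states, init, T, E):
    """Return P(s), the probability of s summed over all hidden state paths."""
    # F[k]: probability of emitting s[0..i] and being in state k at position i.
    F = {k: init[k] * E[k][s[0]] for k in states}
    for symbol in s[1:]:
        F = {j: E[j][symbol] * sum(F[k] * T[k][j] for k in states)
             for j in states}
    return sum(F.values())

# Hypothetical toy model (assumption, not from the lecture).
states = ["x", "y"]
init = {"x": 0.6, "y": 0.4}
T = {"x": {"x": 0.7, "y": 0.3}, "y": {"x": 0.4, "y": 0.6}}
E = {"x": {"A": 0.5, "B": 0.5}, "y": {"A": 0.1, "B": 0.9}}

p_ab = forward("AB", states, init, T, E)
# Sanity check: the probabilities of all strings of a fixed length sum to 1.
total = sum(forward("".join(t), states, init, T, E)
            for t in product("AB", repeat=3))
print(p_ab, total)
```

The cost is proportional to n·|H|², compared with |H|^n for summing over every path explicitly.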

  25. The EM (Expectation Maximization) algorithm • Given a sequence s and an HMM with unknown (T,E): • 1. Initialize h, E and T; • 2. Given s and h, estimate E and T just by counting the symbols and state changes; • 3. Given s, E and T, estimate h, e.g. with the Viterbi algorithm; • 4. Repeat steps 2 and 3 until some criterion is met.

  26. EXAMPLE: finding genes with VEIL

  27. EXAMPLE: finding genes with VEIL • The Viterbi Exon-Intron Locator (VEIL) was developed by John Henderson, Steven Salzberg, and Ken Fasman at Johns Hopkins University. • Gene finder with a modular structure: • Uses an HMM made up of sub-HMMs, each describing a different part of the sequence: upstream noncoding DNA, exon, intron, … • Assumes test data starts and ends with noncoding DNA and contains exactly one gene. • Uses biological knowledge to “hardwire” part of the HMM, e.g. start and stop codons, splice sites.
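Step 2, estimating T and E by counting, can be sketched as below. The function name and the pseudocount are assumptions: the pseudocount is not mentioned on the slide, but it keeps every probability positive and avoids division by zero for states that are never visited. Alternating this step with a Viterbi decoding of h gives the loop described above.

```python
def estimate_by_counting(s, h, states, alphabet, pseudocount=1.0):
    """Estimate transition matrix T and emission matrix E from a labelled sequence."""
    t_counts = {k: {l: pseudocount for l in states} for k in states}
    e_counts = {k: {b: pseudocount for b in alphabet} for k in states}
    for i, (state, symbol) in enumerate(zip(h, s)):
        e_counts[state][symbol] += 1          # state emitted this symbol
        if i > 0:
            t_counts[h[i - 1]][state] += 1    # previous state moved to this one

    def normalize(row):
        total = sum(row.values())
        return {key: value / total for key, value in row.items()}

    T = {k: normalize(t_counts[k]) for k in states}
    E = {k: normalize(e_counts[k]) for k in states}
    return T, E

# Tiny hypothetical example: three symbols with a known state path.
T, E = estimate_by_counting("AAB", ["x", "x", "y"],
                            states=["x", "y"], alphabet="AB")
print(T)
print(E)
```

Here state x emitted A twice, so with a pseudocount of 1 its estimated emission probability for A is (2+1)/(2+2) = 0.75.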

  28. The exon sub-model

  29. Other submodels • The start codon model is very simple: [Figure: Upstream, then the hardwired states a, t, g, then Exon] • The splice junctions are also quite simple and can be hardwired (here is the 5’ splice site): [Figure not preserved]

  30. The overall model • [Figure, labels as shown: Start codon, Stop codon, Downstream, Upstream, Exon, 3’ splice site, 5’ polyA site, intron, 5’ splice site] • For more details, see J. Henderson, S.L. Salzberg, and K. Fasman (1997) Journal of Computational Biology 4:2, 127-141.

  31. END of LECTURE 4
