Applications of HMMs in Computational Biology

This article discusses the use of Hidden Markov Models (HMMs) in computational biology, specifically for gene finding and protein classification. It explains how HMMs can be applied to locate genes within DNA sequences and to predict protein families from amino acid sequences. It also discusses how to assess the accuracy of a trained model, including which data and which metrics to use.


Presentation Transcript


  1. Applications of HMMs in Computational Biology • BMI/CS 576 • www.biostat.wisc.edu/bmi576.html • Colin Dewey • cdewey@biostat.wisc.edu • Fall 2010

  2. The Gene Finding Task • Given: an uncharacterized DNA sequence • Do: locate the genes in the sequence, including the coordinates of individual exons and introns • Image from the UCSC Genome Browser, http://genome.ucsc.edu/

  3. Eukaryotic Gene Structure

  4. The GENSCAN HMM for Eukaryotic Gene Finding [Burge & Karlin ’97] • Each shape represents a functional unit of a gene or genomic region • Pairs of intron/exon units represent the different ways an intron can interrupt a coding sequence (after the 1st, 2nd, or 3rd base in a codon) • A complementary submodel (not shown) detects genes on the opposite DNA strand

  5. The GENSCAN HMM • for each sequence type, GENSCAN models the length distribution and the sequence composition • length distribution models vary depending on sequence type: nonparametric (using histograms), parametric (using geometric distributions), or fixed-length • sequence composition models vary depending on type: 5th-order inhomogeneous, 5th-order homogeneous, 0th- and 1st-order inhomogeneous, or tree-structured variable memory
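To make the parametric versus nonparametric choice concrete, here is a minimal sketch comparing a geometric length model with a histogram. A state with a self-transition implies geometrically distributed durations, which is why other shapes need a histogram. The function names and toy lengths are assumptions made for illustration; they are not GENSCAN's code or data.

```python
from collections import Counter

def geometric_pmf(length, p_stay):
    """P(duration == length) for a state with self-transition probability p_stay."""
    if length < 1:
        return 0.0
    return (p_stay ** (length - 1)) * (1.0 - p_stay)

def histogram_pmf(observed_lengths):
    """Nonparametric length model: relative frequency of each observed length."""
    counts = Counter(observed_lengths)
    total = len(observed_lengths)
    return {length: n / total for length, n in counts.items()}

# Toy exon lengths, made up for illustration only.
toy_lengths = [120, 120, 150, 90, 120, 150]
hist = histogram_pmf(toy_lengths)

# A geometric distribution has mean 1 / (1 - p_stay); match it to the toy mean.
mean_len = sum(toy_lengths) / len(toy_lengths)
p_stay = 1.0 - 1.0 / mean_len

print("histogram P(120):", hist[120])
print("geometric P(120):", geometric_pmf(120, p_stay))
```

The geometric model always puts its highest probability on length 1 and decays from there, so it cannot represent a peaked length distribution, whereas the histogram can.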

  6. Human Intron & Exon Lengths

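  7. Representing Exons in GENSCAN • for exons, GENSCAN uses histograms to represent exon lengths and 5th-order, inhomogeneous Markov models to represent exon sequences • 5th-order, inhomogeneous models can represent statistics about pairs of neighboring codons: each base is conditioned on the 5 preceding bases, so together they span two codons, and the parameters depend on the position within the codon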
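The idea of a 5th-order, inhomogeneous Markov model can be sketched as follows: the probability of each base depends on the 5 preceding bases and on the codon position of the base being emitted. This is an illustrative sketch only. The names `train` and `log_prob`, the pseudocount smoothing, and the assumption that every sequence starts at codon position 0 are all choices made here, not details taken from GENSCAN.

```python
import math
from collections import defaultdict

ORDER = 5
ALPHABET = "ACGT"

def train(coding_seqs, pseudocount=1.0):
    """Estimate P(base | codon position, previous 5 bases) from in-frame sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in coding_seqs:
        for i in range(ORDER, len(seq)):
            frame = i % 3                      # codon position of the emitted base
            context = seq[i - ORDER:i]         # the 5 preceding bases
            counts[(frame, context)][seq[i]] += 1.0

    def prob(frame, context, base):
        row = counts.get((frame, context), {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def log_prob(seq, prob):
    """Log-probability of seq[ORDER:] given its first ORDER bases."""
    return sum(math.log(prob(i % 3, seq[i - ORDER:i], seq[i]))
               for i in range(ORDER, len(seq)))

# Toy training data, made up for illustration only.
prob = train(["ATGGCCATGGCCATGGCCATGGCC"])
print(log_prob("ATGGCCATGGCC", prob))   # resembles the training data
print(log_prob("ATGGCAATGGCA", prob))   # contains unseen hexamers
```

A homogeneous model would drop the `frame` component of the key and share one set of parameters across all codon positions.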

  8. Inference with the Gene-Finding HMM • Given: an uncharacterized DNA sequence • Do: find the most probable path through the model for the sequence • This path will specify the coordinates of the predicted genes (including intron and exon boundaries) • The Viterbi algorithm is used to compute this path
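The most-probable-path computation can be sketched with a generic log-space Viterbi implementation. The two-state model below ("N" for noncoding, "C" for a GC-rich coding-like state) and all its probabilities are made-up assumptions for illustration; GENSCAN's actual model has many more states and explicit length distributions.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return (best log-probability, most probable state path) for obs."""
    # v[s] = best log-probability of any path that ends in state s so far
    v = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for symbol in obs[1:]:
        new_v, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: v[r] + log_trans[r][s])
            new_v[s] = v[best_prev] + log_trans[best_prev][s] + log_emit[s][symbol]
            ptr[s] = best_prev
        backptrs.append(ptr)
        v = new_v
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return v[last], path

def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(p) for k, p in table.items()}

# Hypothetical toy parameters, not taken from any real gene finder.
states = ["N", "C"]
log_init = logs({"N": 0.5, "C": 0.5})
log_trans = {"N": logs({"N": 0.9, "C": 0.1}), "C": logs({"N": 0.1, "C": 0.9})}
log_emit = {
    "N": logs({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
    "C": logs({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
}

score, path = viterbi("ATATGCGCGCGCATAT", states, log_init, log_trans, log_emit)
print(score, "".join(path))
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale inputs.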

  9. Parsing a DNA Sequence • The Viterbi path represents a parse of a given sequence, predicting exons, introns, etc. • ACCGTTACGTGTCATTCTACGTGATCATCGGATCCTAGAATCATCGATCCGTGCGATCGATCGGATTAGCTAGCTTAGCTAGGAGAGCATCGATCGGATCGAGGAGGAGCCTATATAAATCAA

  10. Assessing the Accuracy of a Trained Model • two issues: What data should we use? Which metrics should we use? • Can we measure accuracy on the data set that was used to train the model? • NO! This will result in accuracy estimates that are biased (too high).

  11. Assessing the Accuracy of a Trained Model • need to have a test set that is disjoint from the training set • more generally, can use cross validation • 3-fold CV illustrated; figure from http://gepas.bioinfo.cipf.es/
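A minimal sketch of how k-fold cross validation partitions the data: each item lands in exactly one test fold, and the model for that fold is trained on everything else. The function name `k_fold_splits` is hypothetical, and the sketch does not shuffle; in practice the items are usually shuffled first.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k disjoint, near-equal folds."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 3-fold CV over 10 items, matching the 3-fold case illustrated on the slide.
for train, test in k_fold_splits(10, 3):
    print("train:", train, "test:", test)
```

Accuracy is then measured on each test fold with a model trained only on the matching training indices, and the per-fold results are combined.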

  12. Accuracy • confusion matrix (rows: predicted class; columns: actual class)
                           actual positive        actual negative
      predicted positive   true positives (TP)    false positives (FP)
      predicted negative   false negatives (FN)   true negatives (TN)

  13. Accuracy Metrics • slide annotation: sometimes specificity is defined this way • confusion matrix (rows: predicted class; columns: actual class)
                           actual positive        actual negative
      predicted positive   true positives (TP)    false positives (FP)
      predicted negative   false negatives (FN)   true negatives (TN)
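The formulas from these two slides did not survive in the transcript, so the sketch below uses the standard textbook definitions computed from the four confusion-matrix counts. It is an assumption that the slide's note about specificity refers to TP / (TP + FP), a quantity the gene-finding literature often calls specificity and that is elsewhere called precision. The counts are made up.

```python
def accuracy_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (all counts must be > 0 here)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,       # fraction of all predictions that are correct
        "sensitivity": tp / (tp + fn),       # also called recall
        "specificity": tn / (tn + fp),       # true negative rate
        # Assumption: this is the alternative "specificity" used in gene finding.
        "precision": tp / (tp + fp),
    }

# Toy counts, for illustration only.
metrics = accuracy_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When negatives vastly outnumber positives, as with coding versus noncoding nucleotides in a genome, TN / (TN + FP) stays close to 1 for almost any predictor, which is one reason TP / (TP + FP) is often reported instead.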

  14. Accuracy of GENSCAN (and TWINSCAN) • genes exactly correct? • exons exactly correct? • nucleotides correct? • Figure from Flicek et al., Genome Research, 2003

  15. The Protein Classification Task • Given: amino-acid sequence of a protein • Do: predict the family to which it belongs • GDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVCVLAHHFGKEFTPPVQAAYAKVVAGVANALAHKYH

  16. Alignment of Globin Family Proteins • The sequences in a family may vary in length • Some positions are more conserved than others

  17. Profile HMMs • profile HMMs are commonly used to model families of sequences • Match states (m1, m2, m3) represent key conserved positions • Insert states (i0, i1, i2, i3) account for extra characters in some sequences • Delete states (d1, d2, d3) are silent; they account for characters missing in some sequences • Insert and match states have emission distributions over sequence characters (values shown in the figure: A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01) • The model also has start and end states

  18. Multiple Alignment of SH3 Domain • Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

  19. A Profile HMM Trained for the SH3 Domain • figure labels: match states, insert states, delete states (silent) • Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

  20. Profile HMMs • To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest (families shown in the figure: α-amylase, β-amylase, β-glucanase, β-galactosidase) • Given a sequence x, use Bayes’ rule to make the classification
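The Bayes' rule step can be sketched as follows: for each family c, P(c | x) is proportional to P(x | c) P(c), where P(x | c) would come from that family's profile HMM (for example via the forward algorithm). The sketch takes the per-family log-likelihoods as given. The function name, the numbers, and the uniform priors are assumptions made for illustration.

```python
import math

def posterior_over_families(log_likelihoods, priors):
    """Apply Bayes' rule in log space: P(c | x) = P(x | c) P(c) / sum_c' P(x | c') P(c')."""
    log_joint = {f: log_likelihoods[f] + math.log(priors[f]) for f in log_likelihoods}
    # Log-sum-exp for the normalizer, to avoid underflow with very small likelihoods.
    m = max(log_joint.values())
    log_evidence = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {f: math.exp(v - log_evidence) for f, v in log_joint.items()}

# Hypothetical log P(x | family) values; real ones would come from trained profile HMMs.
log_likelihoods = {"alpha-amylase": -310.0, "beta-amylase": -305.0, "beta-glucanase": -320.0}
priors = {family: 1.0 / 3.0 for family in log_likelihoods}

posterior = posterior_over_families(log_likelihoods, priors)
predicted = max(posterior, key=posterior.get)
print(predicted, posterior)
```

With uniform priors the predicted family is simply the one whose model assigns x the highest likelihood; non-uniform priors shift the decision toward more common families.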

  21. Profile HMM Accuracy • classifying 2447 proteins into 33 families • x-axis represents the median # of negative sequences that score as high as a positive sequence for a given family’s model • figure labels: profile HMM-based methods, BLAST-based methods • Figure from Jaakkola et al., ISMB 1999

  22. Other Issues in Markov Models • there are many interesting variants and extensions of the models/algorithms we considered here (some of these are covered in BMI/CS 776): • separating length/composition distributions with semi-Markov models • modeling multiple sequences with pair HMMs • learning the structure of HMMs • going up the Chomsky hierarchy: stochastic context-free grammars • discriminative learning algorithms (e.g., as in conditional random fields) • etc.
