
Hidden Markov Models in Bioinformatics



Presentation Transcript


  1. Hidden Markov Models in Bioinformatics • Example Domain: Gene Finding • Colin Cherry • colinc@cs

  2. To recap last episode… • Hidden Markov Models (HMMs) • Protein Family Characterization • Profile HMMs for protein family characterization • How profile HMMs can do homology search

  3. ...picking up where we left off • Profile HMMs were good to start with • Today’s goal: Introduce HMMs as general tools in bioinformatics • I will use the problem of Gene Finding as an example of an “ideal” HMM problem domain

  4. Learning Objectives • When I’m done you should know: • When is an HMM a good fit for a problem space? • What materials are needed before work can begin with an HMM? • What are the advantages and disadvantages of using HMMs? • What are the general objectives and challenges in the gene finding task?

  5. Outline • HMMs as Statistical Models • The Gene Finding task at a glance • Good problems for HMMs • HMM Advantages • HMM Disadvantages • Gene Finding Examples

  6. Statistical Models • Definition: • Any mathematical construct that attempts to parameterize a random process • Example: A normal distribution • Assumptions • Parameters • Estimation • Usage • HMMs are just a little more complicated…
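
The normal-distribution example on this slide can be made concrete with a minimal Python sketch. It assumes the observations are independent draws from one normal distribution (the assumption), describes them with a mean and standard deviation (the parameters), fits both from data (estimation), and evaluates the density of a new value (usage). The function names are illustrative and not part of the original talk.

```python
import math
import statistics


def fit_normal(samples):
    """Estimation: maximum-likelihood mean and standard deviation."""
    mu = statistics.fmean(samples)
    # pstdev divides by n, which is the maximum-likelihood estimate
    sigma = statistics.pstdev(samples, mu)
    return mu, sigma


def normal_pdf(x, mu, sigma):
    """Usage: density of x under the fitted model."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    mu, sigma = fit_normal([2, 4, 4, 4, 5, 5, 7, 9])
    print(mu, sigma, normal_pdf(6.0, mu, sigma))
```

An HMM follows the same four steps; it simply has many more parameters than two.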

  7. HMM Assumptions • Observations are ordered • Random process can be represented by a stochastic finite state machine with emitting states.

  8. HMM Parameters • Using the weather example • Modeling daily weather for a year • Observations look like: Ra Ra Su Su Su Ra… • Lots of parameters • One for each table entry • Represented in two tables: • One for emissions • One for transitions
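
As a rough illustration of the two tables, here is a Python sketch of a weather HMM. The slide only gives the observation symbols (Ra, Su); the hidden states "High" and "Low" and every probability below are invented for illustration and do not come from the talk.

```python
# Hypothetical hidden states; the slide does not name them.
STATES = ("High", "Low")
SYMBOLS = ("Ra", "Su")

# Transition table: TRANSITIONS[a][b] = P(next state is b | current state is a)
TRANSITIONS = {
    "High": {"High": 0.8, "Low": 0.2},
    "Low": {"High": 0.3, "Low": 0.7},
}

# Emission table: EMISSIONS[s][o] = P(observing o | state is s)
EMISSIONS = {
    "High": {"Ra": 0.1, "Su": 0.9},
    "Low": {"Ra": 0.6, "Su": 0.4},
}


def rows_are_distributions(table):
    """Every row of a table must sum to 1."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in table.values())


def table_entries(table):
    """One parameter per table entry."""
    return sum(len(row) for row in table.values())


def free_parameters(table):
    """Each row sums to 1, so one entry per row is fixed by the others."""
    return sum(len(row) - 1 for row in table.values())
```

Even this two-state, two-symbol toy has eight table entries; a gene-finding architecture with dozens of states has far more, which is why the amount of training data matters later on.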

  9. HMM Estimation • Called training, it falls under machine learning • Feed an architecture (given in advance) a set of observation sequences • The training process will iteratively alter its parameters to fit the training set • The trained model will assign the training sequences high probability

  10. HMM Usage • Two major tasks • Evaluate the probability of an observation sequence given the model (Forward) • Find the most likely path through the model for a given observation sequence (Viterbi)
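
Both tasks can be sketched in a few lines of Python. The model below is a made-up two-state weather HMM (states, start distribution, and probabilities are all assumptions for illustration), and both functions assume a non-empty observation sequence. `forward` sums over all paths; `viterbi` keeps only the best path into each state.

```python
# Illustrative model; none of these numbers come from the talk.
START = {"High": 0.5, "Low": 0.5}
TRANS = {
    "High": {"High": 0.8, "Low": 0.2},
    "Low": {"High": 0.3, "Low": 0.7},
}
EMIT = {
    "High": {"Ra": 0.1, "Su": 0.9},
    "Low": {"Ra": 0.6, "Su": 0.4},
}


def forward(obs):
    """Probability of the observation sequence given the model."""
    # alpha[s] = P(observations so far, current state is s)
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in START}
    for symbol in obs[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in alpha)
            for s in START
        }
    return sum(alpha.values())


def viterbi(obs):
    """Most likely state path, returned as (probability, path)."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in START}
    for symbol in obs[1:]:
        step = {}
        for s in START:
            prob, path = max(
                ((best[r][0] * TRANS[r][s], best[r][1]) for r in START),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][symbol], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


if __name__ == "__main__":
    week = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
    print(forward(week))
    print(viterbi(week))
```

For long sequences a practical implementation would work with log probabilities to avoid numerical underflow; that is left out here to keep the sketch short.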

  11. Gene Finding (An Ideal HMM Domain) • Our Objective: • To find the coding and non-coding regions of an unlabeled string of DNA nucleotides • Our Motivation: • Assist in the annotation of genomic data produced by genome sequencing methods • Gain insight into the mechanisms involved in transcription, splicing and other processes

  12. Gene Finding Terminology • A string of DNA nucleotides containing a gene will have separate regions (lines): • Introns – non-coding regions within a gene • Exons – coding regions • Separated by functional sites (boxes) • Start and stop codons • Splice sites – acceptors and donors

  13. Gene Finding Challenges • Need the correct reading frame • Introns can interrupt an exon in mid-codon • There is no hard and fast rule for identifying donor and acceptor splice sites • Signals are very weak
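
To see why the reading frame matters, here is a deliberately naive Python sketch that looks for a start codon (ATG) followed by an in-frame stop codon (TAA, TAG, or TGA) in each of the three forward frames. It ignores introns, splice sites, and the reverse strand, so it is not a gene finder; it only shows that the same string reads as different codons depending on the frame. The function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def codons(dna, frame):
    """Split dna into complete codons starting at offset frame (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]


def open_reading_frames(dna):
    """Return (start, end) pairs for ATG...stop runs; end is exclusive."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                found.append((start, i + 3))
                start = None
    return found


if __name__ == "__main__":
    dna = "CCATGAAATAGCC"
    for frame in range(3):
        print(frame, codons(dna, frame))
    print(open_reading_frames(dna))
```

Because an intron can interrupt an exon in mid-codon, the frame of the next exon depends on how many bases of the split codon came before the intron, which is exactly what a simple scan like this cannot track.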

  14. What makes a good HMM problem space? • Characteristics: • Classification problems • There are two main types of output from an HMM: • Scoring of sequences (Protein Family Modeling) • Labeling of observations within a sequence (Gene Finding)

  15. HMM Problem Characteristics Continued • The observations in a sequence should have a clear and meaningful order • Unordered observations will not map easily to states • It’s beneficial, but not necessary, for the observations to follow some sort of grammar • Makes it easier to design an architecture • Gene Finding • Protein Family Modeling

  16. HMM Requirements • So you’ve decided you want to build an HMM. Here’s what you need: • An architecture • Probably the hardest part • Should be biologically sound & easy to interpret • A well-defined success measure • Necessary for any form of machine learning

  17. HMM Requirements Continued • Training data • Labeled or unlabeled – it depends • You do not always need a labeled training set to do observation labeling, but it helps • Amount of training data needed is: • Directly proportional to the number of free parameters in the model • Inversely proportional to the size of the training sequences
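
When the training set is labeled, one simple way to estimate the two tables is to count transitions and emissions and normalise each row. The Python sketch below shows that idea; the pseudocount smoothing and the single-letter labels (E for exon, I for intron) are assumptions added for illustration, and unlabeled training would need an iterative procedure instead.

```python
from collections import Counter, defaultdict


def estimate_from_labeled(pairs, pseudocount=1.0):
    """Estimate (transitions, emissions) from (observations, labels) pairs.

    Each pair holds two equal-length sequences: the observed symbols and
    the state label assigned to each symbol.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    states, symbols = set(), set()
    for observations, labels in pairs:
        states.update(labels)
        symbols.update(observations)
        for symbol, state in zip(observations, labels):
            emit_counts[state][symbol] += 1
        for prev, nxt in zip(labels, labels[1:]):
            trans_counts[prev][nxt] += 1

    def normalise(counts, outcomes):
        # Turn each row of counts into a probability distribution.
        table = {}
        for state in states:
            total = sum(counts[state].values()) + pseudocount * len(outcomes)
            table[state] = {
                o: (counts[state][o] + pseudocount) / total for o in outcomes
            }
        return table

    return normalise(trans_counts, states), normalise(emit_counts, symbols)


if __name__ == "__main__":
    transitions, emissions = estimate_from_labeled([("ACGTAC", "EEEIII")])
    print(transitions)
    print(emissions)
```

Every table entry needs enough counts behind it to be trustworthy, which is one way to read the slide's point that more free parameters call for more training data.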

  18. Why HMMs might be a good fit for Gene Finding • Classification: Classifying observations within a sequence • Order: A DNA sequence is a set of ordered observations • Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here: • Success measure: # of complete exons correctly labeled • Training data: Available from various genome annotation projects

  19. HMM Advantages • Statistical Grounding • Statisticians are comfortable with the theory behind hidden Markov models • Freedom to manipulate the training and verification processes • Mathematical / theoretical analysis of the results and processes • HMMs are still very powerful modeling tools – far more powerful than many statistical methods

  20. HMM Advantages continued • Modularity • HMMs can be combined into larger HMMs • Transparency of the Model • Assuming an architecture with a good design • People can read the model and make sense of it • The model itself can help increase understanding

  21. HMM Advantages continued • Incorporation of Prior Knowledge • Incorporate prior knowledge into the architecture • Initialize the model close to something believed to be correct • Use prior knowledge to constrain training process

  22. How does Gene Finding make use of HMM advantages? • Statistics: • Many systems alter the training process to better suit their success measure • Modularity: • Almost all systems use a combination of models, each individually trained for each gene region • Prior Knowledge: • A fair amount of prior biological knowledge is built into each architecture

  23. HMM Disadvantages • Markov Chains • States are supposed to be independent • P(y) must be independent of P(x), and vice versa • This usually isn’t true • Can get around it when relationships are local • Not good for RNA folding problems
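
The independence requirement is usually written as the first-order Markov assumption. The notation here is an assumption, not taken from the slides: $\pi_1, \dots, \pi_L$ is the hidden state path and $x_1, \dots, x_L$ the observations. Each state depends only on the state before it, and each observation depends only on the state that emits it:

```latex
\begin{align*}
P(\pi_i \mid \pi_1, \dots, \pi_{i-1}) &= P(\pi_i \mid \pi_{i-1}) \\
P(x_i \mid \pi_1, \dots, \pi_i,\; x_1, \dots, x_{i-1}) &= P(x_i \mid \pi_i) \\
P(x, \pi) &= P(\pi_1)\, P(x_1 \mid \pi_1) \prod_{i=2}^{L} P(\pi_i \mid \pi_{i-1})\, P(x_i \mid \pi_i)
\end{align*}
```

Read this way, the RNA folding remark follows directly: base pairing links positions that can be far apart in the sequence, and no factor in the product above connects $x_i$ to a distant $x_j$.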

  24. HMM Disadvantages continued • Standard Machine Learning Problems • Watch out for local maxima • Model may not converge to a truly optimal parameter set for a given training set • Avoid over-fitting • You’re only as good as your training set • More training is not always good

  25. HMM Disadvantages continued • Speed!!! • Almost everything one does in an HMM involves: “enumerating all possible paths through the model” • There are efficient ways to do this • Still slow in comparison to other methods
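
The gap between literally enumerating paths and doing it efficiently can be shown with a small Python sketch. The two-state model is invented for illustration. `brute_force_likelihood` visits every one of the N^L state paths for N states and L observations, while `forward_likelihood` gets the same number by dynamic programming in roughly N^2 * L steps.

```python
from itertools import product

# Illustrative model; none of these numbers come from the talk.
START = {"High": 0.5, "Low": 0.5}
TRANS = {
    "High": {"High": 0.8, "Low": 0.2},
    "Low": {"High": 0.3, "Low": 0.7},
}
EMIT = {
    "High": {"Ra": 0.1, "Su": 0.9},
    "Low": {"Ra": 0.6, "Su": 0.4},
}


def brute_force_likelihood(obs):
    """Sum the joint probability over every possible state path."""
    total = 0.0
    for path in product(list(START), repeat=len(obs)):
        p = START[path[0]] * EMIT[path[0]][obs[0]]
        for prev, cur, symbol in zip(path, path[1:], obs[1:]):
            p *= TRANS[prev][cur] * EMIT[cur][symbol]
        total += p
    return total


def forward_likelihood(obs):
    """Same quantity, reusing partial sums instead of listing paths."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in START}
    for symbol in obs[1:]:
        alpha = {
            s: EMIT[s][symbol] * sum(alpha[r] * TRANS[r][s] for r in alpha)
            for s in START
        }
    return sum(alpha.values())


def path_count(n_states, length):
    """Number of paths the brute-force version has to visit."""
    return n_states ** length


if __name__ == "__main__":
    obs = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
    print(brute_force_likelihood(obs), forward_likelihood(obs))
    print(path_count(len(START), len(obs)))
```

Even the efficient version grows with the square of the number of states, so large architectures run over genome-length sequences remain expensive.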

  26. HMM Gene Finders: VEIL • A straight HMM Gene Finder • Takes advantage of grammatical structure and modular design • Uses many states that can only emit one symbol to get around state independence

  27. HMM Gene Finders: HMMGene • Uses an extended HMM called a CHMM • CHMM = HMM with classes • Takes full advantage of being able to modify the statistical algorithms • Uses high-order states • Trains everything at once

  28. HMM Gene Finders: Genie • Uses a generalized HMM (GHMM) • Edges in model are complete HMMs • States can be any arbitrary program • States are actually neural networks specially designed for signal finding

  29. Conclusions • HMMs have problems where they excel, and problems where they do not • You should consider using one if: • Problem can be phrased as classification • Observations are ordered • The observations follow some sort of grammatical structure (optional)

  30. Conclusions • Advantages: • Statistics • Modularity • Transparency • Prior Knowledge • Disadvantages: • State independence • Over-fitting • Local maxima • Speed

  31. Some final words… • Lots of problems can be phrased as classification problems • Homology search, sequence alignment • If an HMM does not fit, there are all sorts of other methods to try with ML/AI: • Neural Networks, Decision Trees, Probabilistic Reasoning and Support Vector Machines have all been applied to Bioinformatics

  32. Questions • Any questions?
