Hidden Markov Models in Bioinformatics

Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry colinc@cs

To recap last episode… • Hidden Markov Models (HMMs) • Protein Family Characterization • Profile HMMs for protein family characterization • How profile HMMs can do homology search

...picking up where we left off • Profile HMMs were good to start with • Today’s goal: Introduce HMMs as general tools in bioinformatics • I will use the problem of Gene Finding as an example of an “ideal” HMM problem domain

Learning Objectives • When I’m done you should know: • When is an HMM a good fit for a problem space? • What materials are needed before work can begin with an HMM? • What are the advantages and disadvantages of using HMMs? • What are the general objectives and challenges in the gene finding task?

Outline • HMMs as Statistical Models • The Gene Finding task at a glance • Good problems for HMMs • HMM Advantages • HMM Disadvantages • Gene Finding Examples

Statistical Models • Definition: • Any mathematical construct that attempts to parameterize a random process • Example: A normal distribution • Assumptions • Parameters • Estimation • Usage • HMMs are just a little more complicated…

HMM Assumptions • Observations are ordered • Random process can be represented by a stochastic finite state machine with emitting states.

HMM Parameters • Using weather example • Modeling daily weather for a year • Ra Ra Su Su Su Ra.. • Lots of parameters • One for each table entry • Represented in two tables. • One for emissions • One for transitions

HMM Estimation • Called training, it falls under machine learning • Feed an architecture (given in advance) a set of observation sequences • The training process will iteratively alter its parameters to fit the training set • The trained model will assign the training sequences high probability

HMM Usage • Two major tasks • Evaluate the probability of an observation sequence given the model (Forward) • Find the most likely path through the model for a given observation sequence (Viterbi)

Gene Finding(An Ideal HMM Domain) • Our Objective: • To find the coding and non-coding regions of an unlabeled string of DNA nucleotides • Our Motivation: • Assist in the annotation of genomic data produced by genome sequencing methods • Gain insight into the mechanisms involved in transcription, splicing and other processes

Gene Finding Terminology • A string of DNA nucleotides containing a gene will have separate regions (lines): • Introns – non-coding regions within a gene • Exons – coding regions • Separated by functional sites (boxes) • Start and stop codons • Splice sites – acceptors and donors

Gene Finding Challenges • Need the correct reading frame • Introns can interrupt an exon in mid-codon • There is no hard and fast rule for identifying donor and acceptor splice sites • Signals are very weak

What makes a good HMM problem space? Characteristics: • Classification problems There are two main types of output from an HMM: • Scoring of sequences • (Protein family modeling) • Labeling of observations within a sequence • (Gene Finding)

HMM Problem CharacteristicsContinued • The observations in a sequence should have a clear, and meaningful order • Unordered observations will not map easily to states • It’s beneficial, but not necessary for the observations follow some sort of grammar • Makes it easier to design an architecture • Gene Finding • Protein Family Modeling

HMM Requirements So you’ve decided you want to build an HMM, here’s what you need: • An architecture • Probably the hardest part • Should be biologically sound & easy to interpret • A well-defined success measure • Necessary for any form of machine learning

HMM Requirements Continued • Training data • Labeled or unlabeled – it depends • You do not always need a labeled training set to do observation labeling, but it helps • Amount of training data needed is: • Directly proportional to the number of free parameters in the model • Inversely proportional to the size of the training sequences

Why HMMs might be a good fit for Gene Finding • Classification: Classifying observations within a sequence • Order: A DNA sequence is a set of ordered observations • Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is right here: • Success measure: # of complete exons correctly labeled • Training data: Available from various genome annotation projects

HMM Advantages • Statistical Grounding • Statisticians are comfortable with the theory behind hidden Markov models • Freedom to manipulate the training and verification processes • Mathematical / theoretical analysis of the results and processes • HMMs are still very powerful modeling tools – far more powerful than many statistical methods

HMM Advantages continued • Modularity • HMMs can be combined into larger HMMs • Transparency of the Model • Assuming an architecture with a good design • People can read the model and make sense of it • The model itself can help increase understanding

HMM Advantages continued • Incorporation of Prior Knowledge • Incorporate prior knowledge into the architecture • Initialize the model close to something believed to be correct • Use prior knowledge to constrain training process

How does Gene Finding make use of HMM advantages? • Statistics: • Many systems alter the training process to better suit their success measure • Modularity: • Almost all systems use a combination of models, each individually trained for each gene region • Prior Knowledge: • A fair amount of prior biological knowledge is built into each architecture

HMM Disadvantages • Markov Chains • States are supposed to be independent • P(y) must be independent of P(x), and vice versa • This usually isn’t true • Can get around it when relationships are local • Not good for RNA folding problems P(x) P(y) …

HMM Disadvantagescontinued • Standard Machine Learning Problems • Watch out for local maxima • Model may not converge to a truly optimal parameter set for a given training set • Avoid over-fitting • You’re only as good as your training set • More training is not always good

HMM Disadvantagescontinued • Speed!!! • Almost everything one does in an HMM involves: “enumerating all possible paths through the model” • There are efficient ways to do this • Still slow in comparison to other methods

HMM Gene Finders:VEIL • A straight HMM Gene Finder • Takes advantage of grammatical structure and modular design • Uses many states that can only emit one symbol to get around state independence

HMM Gene Finders:HMMGene • Uses an extended HMM called a CHMM • CHMM = HMM with classes • Takes full advantage of being able to modify the statistical algorithms • Uses high-order states • Trains everything at once

HMM Gene Finders:Genie • Uses a generalized HMM (GHMM) • Edges in model are complete HMMs • States can be any arbitrary program • States are actually neural networks specially designed for signal finding

Conclusions • HMMs have problems where they excel, and problems where they do not • You should consider using one if: • Problem can be phrased as classification • Observations are ordered • The observations follow some sort of grammatical structure (optional)

Advantages: Statistics Modularity Transparency Prior Knowledge Disadvantages: State independence Over-fitting Local Maximums Speed Conclusions

Some final words… • Lots of problems can be phrased as classification problems • Homology search, sequence alignment • If an HMM does not fit, there’s all sorts of other methods to try with ML/AI: • Neural Networks, Decision Trees Probabilistic Reasoning and Support Vector Machines have all been applied to Bioinformatics

Questions Any Questions?

Hidden Markov Models in Bioinformatics