Hidden Markov Models in Bioinformatics Example Domains: Gene Finding & Protein Family Modeling
5 Second Overview • Today’s goal: Introduce HMMs as general tools in bioinformatics • I will use the problem of Gene Finding as an example of an “ideal” HMM problem domain • I will use the problem of Protein Family Modeling as an example of a clever way to fit HMMs to a problem
Learning Objectives • When I’m done you should know: • When is an HMM a good fit for a problem space? • What materials are needed before work can begin with an HMM? • What are the advantages and disadvantages of using HMMs?
Outline • HMMs as Statistical Models • The example tasks at a glance • Good problems for HMMs • HMM Advantages • HMM Disadvantages • Gene Finding Examples
Statistical Models • Definition: • Any mathematical construct that attempts to parameterize a random process • Example: A normal distribution • Assumptions • Parameters • Estimation • Usage • HMMs are just a little more complicated…
HMM Assumptions • Observations are ordered • Random process can be represented by a stochastic finite state machine with emitting states.
HMM Parameters • Using weather example • Modeling daily weather for a year • Ra Ra Su Su Su Ra.. • Lots of parameters • One for each table entry • Represented in two tables. • One for emissions • One for transitions
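The two tables on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the model from the talk: the hidden states ("Low" and "High" pressure) and every probability below are assumptions chosen only to make the example concrete.

```python
# Hypothetical two-state weather HMM. State names and all numbers are
# invented for illustration; they do not come from the slides.
states = ["Low", "High"]   # hidden states (assumed: pressure systems)
symbols = ["Ra", "Su"]     # observable symbols: rain, sun

# Transition table: transitions[s][t] = P(next state is t | current state is s)
transitions = {
    "Low":  {"Low": 0.7, "High": 0.3},
    "High": {"Low": 0.2, "High": 0.8},
}

# Emission table: emissions[s][o] = P(observing o | current state is s)
emissions = {
    "Low":  {"Ra": 0.8, "Su": 0.2},
    "High": {"Ra": 0.1, "Su": 0.9},
}

def rows_are_distributions(table):
    """Every row of a table must be a probability distribution."""
    return all(abs(sum(row.values()) - 1.0) < 1e-9 for row in table.values())

# One parameter per table entry, as the slide says.
n_parameters = (sum(len(row) for row in transitions.values())
                + sum(len(row) for row in emissions.values()))
print(n_parameters)
```

Each row must sum to one, so the number of values that can vary freely is smaller than the number of table entries.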
HMM Estimation • Called training, it falls under machine learning • Feed an architecture (given in advance) a set of observation sequences • The training process will iteratively alter its parameters to fit the training set • The trained model will assign the training sequences high probability
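The iterative training described here is commonly done with the Baum-Welch (EM) algorithm; the slide does not name the method, so treat that choice as an assumption. Below is a minimal sketch of one re-estimation step for a fixed architecture. The starting parameters and the two training sequences (0 = rain, 1 = sun) are invented, and the code skips the scaling or log-space arithmetic a real implementation would need for long sequences.

```python
import math

def forward_table(obs, init, trans, emit):
    """alpha[t][i] = P(obs[0..t], state at time t is i)."""
    n = len(init)
    alpha = [[init[i] * emit[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[j][o] * sum(prev[i] * trans[i][j] for i in range(n))
                      for j in range(n)])
    return alpha

def backward_table(obs, trans, emit):
    """beta[t][i] = P(obs[t+1..] | state at time t is i)."""
    n = len(trans)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def log_likelihood(sequences, init, trans, emit):
    return sum(math.log(sum(forward_table(obs, init, trans, emit)[-1]))
               for obs in sequences)

def baum_welch_step(sequences, init, trans, emit):
    """One EM update. Assumes every sequence has at least two observations."""
    n, m = len(trans), len(emit[0])
    init_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * m for _ in range(n)]
    for obs in sequences:
        alpha = forward_table(obs, init, trans, emit)
        beta = backward_table(obs, trans, emit)
        likelihood = sum(alpha[-1])
        for t, o in enumerate(obs):
            for i in range(n):
                # Expected number of visits to state i at time t.
                gamma = alpha[t][i] * beta[t][i] / likelihood
                emit_acc[i][o] += gamma
                if t == 0:
                    init_acc[i] += gamma
                if t + 1 < len(obs):
                    for j in range(n):
                        # Expected number of i -> j transitions at time t.
                        trans_acc[i][j] += (alpha[t][i] * trans[i][j]
                                            * emit[j][obs[t + 1]]
                                            * beta[t + 1][j] / likelihood)
    def normalize(row):
        total = sum(row)
        return [x / total for x in row]
    return (normalize(init_acc),
            [normalize(r) for r in trans_acc],
            [normalize(r) for r in emit_acc])

# Invented starting point and training data (0 = rain, 1 = sun).
init = [0.5, 0.5]
trans = [[0.6, 0.4], [0.3, 0.7]]
emit = [[0.7, 0.3], [0.2, 0.8]]
sequences = [[0, 0, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0, 1]]

before = log_likelihood(sequences, init, trans, emit)
new_init, new_trans, new_emit = baum_welch_step(sequences, init, trans, emit)
after = log_likelihood(sequences, new_init, new_trans, new_emit)
print(before, after)
```

Repeating the step until the log-likelihood stops improving gives the "iteratively alter its parameters" loop; each step cannot lower the likelihood of the training set.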
HMM Usage • Two major tasks • Evaluate the probability of an observation sequence given the model (Forward) • Find the most likely path through the model for a given observation sequence (Viterbi)
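Both tasks can be sketched compactly. The code below is an illustrative implementation of the Forward and Viterbi recursions on a made-up two-state weather model; the state names and probabilities are assumptions, and plain probabilities are used for readability where a practical implementation would work in log space.

```python
# Hypothetical model: hidden states and numbers are invented for illustration.
states = ["Low", "High"]
symbols = ["Ra", "Su"]
init = {"Low": 0.5, "High": 0.5}
trans = {"Low": {"Low": 0.7, "High": 0.3},
         "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Ra": 0.8, "Su": 0.2},
        "High": {"Ra": 0.1, "Su": 0.9}}

def forward(obs, states, init, trans, emit):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {t: emit[t][o] * sum(alpha[s] * trans[s][t] for s in states)
                 for t in states}
    return sum(alpha.values())

def viterbi(obs, states, init, trans, emit):
    """Most likely state path and its joint probability with the observations."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (init[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for t in states:
            prob, path = max(((best[s][0] * trans[s][t], best[s][1])
                              for s in states), key=lambda x: x[0])
            new_best[t] = (prob * emit[t][o], path + [t])
        best = new_best
    prob, path = max(best.values(), key=lambda x: x[0])
    return path, prob

observations = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
sequence_probability = forward(observations, states, init, trans, emit)
best_path, best_path_probability = viterbi(observations, states, init, trans, emit)
print(sequence_probability)
print(best_path, best_path_probability)
```

Forward sums over paths and Viterbi takes the maximum, so the Viterbi probability can never exceed the Forward probability for the same sequence.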
Gene Finding (An Ideal HMM Domain) • Our Objective: • To find the coding and non-coding regions of an unlabeled string of DNA nucleotides • Our Motivation: • Assist in the annotation of genomic data produced by genome sequencing methods • Gain insight into the mechanisms involved in transcription, splicing and other processes
Gene Finding Terminology • A string of DNA nucleotides containing a gene will have separate regions (lines): • Introns – non-coding regions within a gene • Exons – coding regions • Separated by functional sites (boxes) • Start and stop codons • Splice sites – acceptors and donors
Gene Finding Challenges • Need the correct reading frame • Introns can interrupt an exon in mid-codon • There is no hard and fast rule for identifying donor and acceptor splice sites • Signals are very weak
Protein Family Modeling (A clever fit of HMMs) • I have a protein sequence. • What family is it in? • Can you give me a quick alignment to the other members of the family? • These amino acids here, do they match the family’s consensus positions, or are they inserts?
Profile HMM • Square: Match (consensus) state • Diamond: Insert state – notice the loops • Circle: Delete state – allows you to “jump” a match
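The wiring between the three state types can be written down explicitly. The sketch below builds the allowed transitions for a profile HMM with a given number of consensus positions. It assumes the common textbook layout in which every state may move to the next match, the next delete, or the insert at its own position; the slide's figure is not available, so the exact set of edges and the `Begin`/`End` state names are assumptions.

```python
def profile_hmm_topology(length):
    """Map each state name to the list of states it may transition to.

    M = match (consensus), I = insert, D = delete; the number is the
    consensus position. Begin acts as position 0.
    """
    successors = {}
    for k in range(length + 1):
        if k < length:
            forward_moves = [f"M{k + 1}", f"D{k + 1}"]
        else:
            forward_moves = ["End"]
        insert = f"I{k}"
        match_name = "Begin" if k == 0 else f"M{k}"
        successors[match_name] = forward_moves + [insert]
        # The insert state loops on itself, so inserts can be any length.
        successors[insert] = forward_moves + [insert]
        if k >= 1:
            # A delete state emits nothing: it skips match position k.
            successors[f"D{k}"] = forward_moves + [insert]
    return successors

topology = profile_hmm_topology(3)
for state, targets in topology.items():
    print(state, "->", targets)
```

A path such as Begin, M1, D2, M3, End aligns a sequence that lacks consensus position 2, while visiting I1 several times aligns a sequence with extra residues after position 1.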
What makes a good HMM problem space? Characteristics: • Classification problems There are two main types of output from an HMM: • Scoring of sequences • (Protein family modeling) • Labeling of observations within a sequence • (Gene Finding)
HMM Problem Characteristics Continued • The observations in a sequence should have a clear and meaningful order • Unordered observations will not map easily to states • It’s beneficial, but not necessary, for the observations to follow some sort of grammar • Makes it easier to design an architecture • Gene Finding • Protein Family Modeling
HMM Requirements So you’ve decided you want to build an HMM, here’s what you need: • An architecture • Probably the hardest part • Should be biologically sound & easy to interpret • A well-defined success measure • Necessary for any form of machine learning
HMM Requirements Continued • Training data • Labeled or unlabeled – it depends • You do not always need a labeled training set to do observation labeling, but it helps • Amount of training data needed is: • Directly proportional to the number of free parameters in the model • Inversely proportional to the size of the training sequences
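To get a feel for how quickly the number of free parameters grows, here is a rough counting sketch. It assumes a fully connected model with discrete emissions and a free initial distribution; a constrained architecture, such as a gene finder or a profile HMM, allows far fewer transitions and therefore has fewer free parameters than this upper bound.

```python
def free_parameters(n_states, n_symbols):
    """Free parameters of a fully connected discrete HMM (an upper bound).

    Each probability row sums to one, so a row with k entries contributes
    only k - 1 free values.
    """
    transition = n_states * (n_states - 1)
    emission = n_states * (n_symbols - 1)
    initial = n_states - 1
    return transition + emission + initial

# Illustrative alphabet sizes: 4 nucleotides for DNA, 20 amino acids for proteins.
print(free_parameters(3, 4))
print(free_parameters(3, 20))
```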
Why HMMs might be a good fit for Gene Finding • Classification: Classifying observations within a sequence • Order: A DNA sequence is a set of ordered observations • Grammar / Architecture: Our grammatical structure (and the beginnings of our architecture) is the gene structure itself: exons and introns separated by functional sites • Success measure: # of complete exons correctly labeled • Training data: Available from various genome annotation projects
Why HMMs can be made to fit Protein Family Modeling • Classification: What model fits a sequence best? • Order: An amino acid sequence is well ordered • Grammar: Any two matches can be separated by a series of inserts and deletes… okay, maybe the word “grammar” is a bit of a stretch • Success Measure: How many sequences can we correctly label after training?
HMM Advantages • Statistical Grounding • Statisticians are comfortable with the theory behind hidden Markov models • Freedom to manipulate the training and verification processes • Mathematical / theoretical analysis of the results and processes • HMMs are still very powerful modeling tools – far more powerful than many statistical methods
HMM Advantages continued • Modularity • HMMs can be combined into larger HMMs • Transparency of the Model • Assuming an architecture with a good design • People can read the model and make sense of it • The model itself can help increase understanding
HMM Advantages continued • Incorporation of Prior Knowledge • Incorporate prior knowledge into the architecture • Initialize the model close to something believed to be correct • Use prior knowledge to constrain training process
How does Gene Finding make use of HMM advantages? • Statistics: • Many systems alter the training process to better suit their success measure • Modularity: • Almost all systems use a combination of models, each individually trained for each gene region • Prior Knowledge: • A fair amount of prior biological knowledge is built into each architecture
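HMM Disadvantages • Markov Chains • States are supposed to be independent of history • Each transition depends only on the current state, and each emission depends only on the state that emits it • So for two observations x and y, P(y) cannot depend directly on x, and vice versa • This usually isn’t true of real data • Can get around it when relationships are local • Not good for RNA folding problems, where distant positions pair with each other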
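The independence assumption on this slide can be stated as a factorization of the joint probability. The notation is an assumption introduced here, not taken from the slides: x_t is the observation at position t and \pi_t is the hidden state that emitted it.

```latex
P(x_1,\dots,x_T,\ \pi_1,\dots,\pi_T)
  = P(\pi_1)\,P(x_1 \mid \pi_1)\,
    \prod_{t=2}^{T} P(\pi_t \mid \pi_{t-1})\,P(x_t \mid \pi_t)
```

Every factor involves only a state and its immediate predecessor, or a state and its own emission. A dependency between two distant positions, such as a base pair in an RNA stem, has no factor that can represent it.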
HMM Disadvantages continued • …and then there are the standard Machine Learning Problems • Watch out for local maxima • Model may not converge to a truly optimal parameter set for a given training set • Avoid over-fitting • You’re only as good as your training set • More training is not always good
HMM Disadvantages continued • Speed!!! • Almost everything one does in an HMM involves: “enumerating all possible paths through the model” • There are efficient ways to do this • Still slow in comparison to other methods
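The gap between literal path enumeration and the efficient alternative can be shown directly. The sketch below computes the probability of one observation sequence twice: by summing over every possible state path, and by the Forward recursion, which is one of the efficient methods. The two-state model and its numbers are invented for illustration.

```python
from itertools import product

# Hypothetical model; names and numbers are assumptions for illustration.
states = ["Low", "High"]
init = {"Low": 0.5, "High": 0.5}
trans = {"Low": {"Low": 0.7, "High": 0.3},
         "High": {"Low": 0.2, "High": 0.8}}
emit = {"Low": {"Ra": 0.8, "Su": 0.2},
        "High": {"Ra": 0.1, "Su": 0.9}}

def brute_force_probability(obs):
    """Sum the joint probability over every state path: N**T paths."""
    total, n_paths = 0.0, 0
    for path in product(states, repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, o in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][o]
        total += p
        n_paths += 1
    return total, n_paths

def forward_probability(obs):
    """Same quantity by dynamic programming: about N*N*T multiplications."""
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {t: emit[t][o] * sum(alpha[s] * trans[s][t] for s in states)
                 for t in states}
    return sum(alpha.values())

observations = ["Ra", "Ra", "Su", "Su", "Su", "Ra"]
brute, n_paths = brute_force_probability(observations)
fast = forward_probability(observations)
print(n_paths, brute, fast)
```

With N states and T observations the enumeration visits N**T paths, while the recursion does work proportional to N*N*T. The recursion is still quadratic in the number of states, which is why large models remain slow on genome-scale input.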
HMM Gene Finders: VEIL • A straight HMM Gene Finder • Takes advantage of grammatical structure and modular design • Uses many states that can only emit one symbol to get around state independence
HMM Gene Finders: HMMGene • Uses an extended HMM called a CHMM • CHMM = HMM with classes • Takes full advantage of being able to modify the statistical algorithms • Uses high-order states • Trains everything at once
HMM Gene Finders: Genie • Uses a generalized HMM (GHMM) • Edges in model are complete HMMs • States can be any arbitrary program • States are actually neural networks specially designed for signal finding
Conclusions • HMMs have problems where they excel, and problems where they do not • You should consider using one if: • Problem can be phrased as classification • Observations are ordered • The observations follow some sort of grammatical structure (optional)
Conclusions • Advantages: • Statistics • Modularity • Transparency • Prior Knowledge • Disadvantages: • State independence • Over-fitting • Local maxima • Speed
Some final words… • Lots of problems can be phrased as classification problems • Homology search • Build a model of the sequence with a few close homologs, and use the model to search for more distant homologs • Sequence alignment • Align all of these sequences to the model that represents their family
Questions Any Questions?