
Hidden Markov Models For DNA Sequence Alignment

Rich Burns. CS 790 – Bioinformatics, Spring 2001, Wright State University.


Presentation Transcript


  1. Hidden Markov Models For DNA Sequence Alignment Rich Burns CS 790 – Bioinformatics Spring 2001 Wright State University CS 790 Spring 2001

  2. Presentation Outline • Introduction • What do we want to know? • Why do we want to know? • How do we find this out? • Hidden Markov Models (HMM) • What is a HMM? • How does a HMM work?

  3. Presentation Outline 2 • Ground Up Examples • Regular Expressions • Motif Example • Profile HMM • Sequence Alignment • References

  4. What do we want to know? • Missing Children • Database searching for other members of a sequence family • Multiple Alignments • Align a set of sequences and score their fit to the family

  5. Why do we want to know? • “The most important contribution of computational biology has been in the development of methods for extracting information from the biopolymer sequence databases via sequence comparison, characterization and classification… Sequence alignment methodology is central to all of these methods” [Liu et al. 1999]

  6. Why do we want to know? • Sequence comparison methods help reveal • Information about structure and function • Information about the process of molecular evolution

  7. How do we find out? • Hidden Markov Models

  8. What is a HMM? • A statistical model (Customizable) • Describes a series of observations by a hidden stochastic process • Defines a probability distribution over possible sequences • Our case: • Observations = nucleotides • Series = sequence of observations

  9. How does a HMM work? • Consider regular expressions • Grep • Consider the following DNA motif:

       A C A - - - A T G
       T C A A C T A T C
       A C A C - - A G C
       A G A - - - A T C
       A C C G - - A T C

  10. How does a HMM work? (Regular Expression) • [AT] [CG] [AC] [ACGT]* A [TG] [GC]

       A C A - - - A T G
       T C A A C T A T C
       A C A C - - A G C
       A G A - - - A T C
       A C C G - - A T C

  11. How does a HMM work? • The regular expression can: • Determine if the sequence in question fits the criteria of the search or not • The regular expression cannot: • Determine how well the sequence in question fits the criteria of the search
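
The yes/no nature of the regular expression can be seen in a short Python sketch. The pattern is the one from the slides with the spaces removed; the helper name `fits` and the choice of `TGCTAGG` as an atypical sequence are illustrative assumptions, not part of the original deck.

```python
import re

# Regular expression from the slides: three character classes, an insertion
# of any length (possibly empty), then three more positions.
MOTIF = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

# The five family members with the gap characters removed.
family = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]

# An atypical sequence that still satisfies every character class.
unusual = "TGCTAGG"

def fits(seq):
    """Return True if the whole sequence matches the motif (hypothetical helper)."""
    return MOTIF.fullmatch(seq) is not None

for seq in family + [unusual]:
    # Every sequence, typical or not, gets the same answer: it matches.
    print(seq, fits(seq))
```

Both the typical family members and the atypical sequence are accepted, so the expression gives no indication of which one fits the family better.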

  12. Deriving the HMM • Deriving the HMM from a known alignment • Statistics • Each column in the alignment generates a state • Count the occurrence of [ATGC] in each column to determine probabilities for each state • Insertions are trickier

  13. Deriving the HMM

       A C A - - - A T G
       T C A A C T A T C
       A C A C - - A G C
       A G A - - - A T C
       A C C G - - A T C
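
The counting step can be sketched in Python as follows. This is a minimal illustration under two assumptions that the slides do not spell out: the six gap-free columns become main states, and the three gapped columns are pooled into a single insert state. All variable and function names are hypothetical.

```python
from collections import Counter

# The five aligned sequences from the slide ("-" marks a gap).
alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]

# Assumption: gap-free columns become main states, and the three gapped
# columns are pooled into one insert state.
MATCH_COLUMNS = [0, 1, 2, 6, 7, 8]
INSERT_COLUMNS = [3, 4, 5]

def frequencies(symbols):
    """Relative frequency of each nucleotide in an iterable of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

# One emission distribution per main state.
match_emissions = [
    frequencies(row[col] for row in alignment) for col in MATCH_COLUMNS
]

# Inserted nucleotides of each sequence, with gap characters dropped.
inserted = [
    "".join(row[col] for col in INSERT_COLUMNS).replace("-", "")
    for row in alignment
]
insert_emissions = frequencies("".join(inserted))

# Fraction of sequences that enter the insert state at all.
enter_insert = sum(1 for s in inserted if s) / len(alignment)

# Inside the insert state, each inserted nucleotide is followed either by
# another insertion (stay) or by the next main state (leave).
stays = sum(len(s) - 1 for s in inserted if s)
stay_insert = stays / sum(len(s) for s in inserted)

print(match_emissions[0], insert_emissions, enter_insert, stay_insert)
```

Under these assumptions the first column gives A = 0.8 and T = 0.2, three of the five sequences enter the insert state (0.6), and an inserted nucleotide is followed by another insertion with probability 0.4.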

  14. Using the HMM • Remember the goal: • How well does the given sequence fit the family? • Let’s try it • Exceptional Sequence: T G C T - - A G G • Consensus Sequence: A C A C - - A T C

  15. Using the HMM • Exceptional Sequence • P(TGCT- -AGG) = (.2*1)*(.2*1)*(.2*.6)*(.2*.6)*(1*1)*(.2*1)*(.2) ≈ 0.0023e-2 • Consensus Sequence • P(ACAC- -ATC) = (.8*1)*(.8*1)*(.8*.6)*(.4*.6)*(1*1)*(.8*1)*(.8) ≈ 4.7e-2
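
The two products can be checked with a small Python sketch. The emission and transition values are the ones obtained by counting in the five-sequence alignment; the model layout (three main states, one insert state, three more main states) and the function name `path_probability` are assumptions made for illustration.

```python
# Emission probabilities of the six main states, counted from the alignment.
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
# Emission probabilities of the single insert state.
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
ENTER_INSERT = 0.6   # main state 3 -> insert state
STAY_INSERT = 0.4    # insert state -> insert state

def path_probability(first3, inserted, last3):
    """Probability of a sequence split into its three aligned parts."""
    p = 1.0
    for state, base in zip(MATCH[:3], first3):
        p *= state.get(base, 0.0)
    if inserted:
        p *= ENTER_INSERT
        for base in inserted:
            p *= INSERT[base]
        # Stay in the insert state between insertions, then leave it.
        p *= STAY_INSERT ** (len(inserted) - 1) * (1 - STAY_INSERT)
    else:
        p *= 1 - ENTER_INSERT
    for state, base in zip(MATCH[3:], last3):
        p *= state.get(base, 0.0)
    return p

p_consensus = path_probability("ACA", "C", "ATC")
p_exceptional = path_probability("TGC", "T", "AGG")
print(p_consensus, p_exceptional)
```

The consensus sequence comes out roughly two thousand times more probable than the exceptional one, which is the graded answer the regular expression could not give.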

  16. Using the HMM

  17. Problem with Probability • Exceptional Sequence • P(TGCT- -AGG) = (.2*1)*(.2*1)*(.2*.6)*(.2*.6)*(1*1)*(.2*1)*(.2) ≈ 0.0023e-2 • Consensus Sequence • P(ACAC- -ATC) = (.8*1)*(.8*1)*(.8*.6)*(.4*.6)*(1*1)*(.8*1)*(.8) ≈ 4.7e-2 • Sequence length dependent • Not always a good score to use • Penalizes insertions – favors deletions • Bias – who’s to say that insertions are bad and deletions are good? • Log-odds

  18. Log-odds • Log-odds is computed as: $\text{log-odds}(S) = \log \frac{P(S)}{0.25^L} = \log P(S) - L \log 0.25$ • $P(S)$ – same as before • $0.25^L$ – null model, where $L$ is the length of the sequence • Considers the overall sequence of nucleotides as random • Better estimate – use overall frequency of nucleotides in the organism’s genome

  19. Log-odds • Consensus Sequence • LO(ACACATC) = 1.16+0+1.16+0+1.16-0.51+0.47-0.51+1.39+0+1.16+0+1.16 = 6.64
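
A minimal Python sketch of the log-odds score follows, assuming natural logarithms (which is what the per-state terms on the slide, such as 1.16 = ln(0.8/0.25), imply) and a uniform null model. The function name `log_odds` is hypothetical.

```python
import math

def log_odds(p_sequence, length, p_null=0.25):
    """Log of P(S) divided by the null-model probability p_null ** length."""
    return math.log(p_sequence) - length * math.log(p_null)

# Probabilities of the two example sequences, written out as on the slides.
p_consensus = 0.8*1 * 0.8*1 * 0.8*0.6 * 0.4*0.6 * 1*1 * 0.8*1 * 0.8
p_exceptional = 0.2*1 * 0.2*1 * 0.2*0.6 * 0.2*0.6 * 1*1 * 0.2*1 * 0.2

# Both sequences contain 7 nucleotides once the gaps are removed.
lo_consensus = log_odds(p_consensus, 7)
lo_exceptional = log_odds(p_exceptional, 7)
print(round(lo_consensus, 2), round(lo_exceptional, 2))
```

Computed without intermediate rounding, the consensus score is about 6.65; the 6.64 on the slide comes from summing terms that were each rounded to two decimals. The exceptional sequence scores below zero, meaning the null model explains it slightly better than the family model does.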

  20. Profile HMM • Much more complex structure • No numerical example • Designed to allow position dependent gap penalties

  21. Profile HMM • Bottom Row: main (match) states • Middle Row: insert states • Top Row: delete states

  22. A drawback and pseudocounts • Dangerous to estimate a probability distribution from just a few examples • All professors are interested in bioinformatics • All computers run Windows • Pseudocount = fake count • Pretend you saw a nucleotide in a position even though it wasn’t there – allows for the small possibility that something other than what you have observed may occur

  23. How pseudocounts help • Suppose you have only the first 2 sequences and you are looking at sequence 4 • P(4) = .5*1*0*1*… = 0

       A C A - - - A T G
       T C A A C T A T C
       A C A C - - A G C
       A G A - - - A T C
       A C C G - - A T C

      • When in fact we already know that sequence 4 is part of the same family
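
The effect can be sketched in Python for the second alignment column, where sequence 4 has a G that neither of the first two sequences shows. The pseudocount of 1 per nucleotide is an assumed value chosen for illustration (the slides do not state one), and `column_probabilities` is a hypothetical helper.

```python
from collections import Counter

def column_probabilities(column, pseudocount=0.0):
    """Estimate nucleotide probabilities for one alignment column.

    The pseudocount is added to the count of each of the four nucleotides.
    """
    counts = Counter(column)
    total = len(column) + 4 * pseudocount
    return {base: (counts[base] + pseudocount) / total for base in "ACGT"}

# Only the first two sequences of the alignment are available.
first_two = ["ACA---ATG", "TCAACTATC"]
second_column = [seq[1] for seq in first_two]   # both are "C"

raw = column_probabilities(second_column)
smoothed = column_probabilities(second_column, pseudocount=1.0)

# Sequence 4 (A G A - - - A T C) has a G in this column.
print(raw["G"], smoothed["G"])
```

Without the pseudocount the G has probability 0, which forces the whole product for sequence 4 to 0; with it, the G keeps a small non-zero probability (1/6 here) and the sequence can still be scored.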

  24. Multiple alignments from unaligned sequences • Start with a model of random probabilities • Or a reasonable guess if it is available • Align the sequences to the model • Use the alignment to improve the probabilities • May lead to a slightly different alignment • Iterate; stop when the alignment fails to change

  25. Multiple alignment algorithms • Viterbi • Forward-Backward • Baum-Welch
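
Of the three algorithms, Viterbi is the simplest to show: it finds the single most probable path of hidden states for an observed sequence. The sketch below is a generic log-space version; the two-state "AT_rich"/"GC_rich" model and all of its probabilities are invented purely for illustration and are not taken from the slides or from a profile HMM.

```python
import math

def viterbi(observations, states, start, trans, emit):
    """Return (log-probability, path) of the most probable state path."""
    # best[s]: log-probability of the best path so far that ends in state s.
    best = {
        s: math.log(start[s]) + math.log(emit[s][observations[0]])
        for s in states
    }
    back = []  # one dict of back-pointers per observation after the first
    for symbol in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of state s for this step.
            source = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            best[s] = (prev[source] + math.log(trans[source][s])
                       + math.log(emit[s][symbol]))
            pointers[s] = source
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path

# Hypothetical toy model: two hidden states with different base composition.
states = ["AT_rich", "GC_rich"]
start = {"AT_rich": 0.5, "GC_rich": 0.5}
trans = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
emit = {
    "AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

score, path = viterbi("AATTGGCC", states, start, trans, emit)
print(path)
```

Working in log space avoids numerical underflow on long sequences; this version assumes every probability in the model is non-zero, since `math.log(0)` raises an error.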

  26. Advantages of HMMs • Built on a formal probabilistic basis • Can use Bayesian probability theory to guide the scoring parameters • Probability theory allows a HMM to be trained from unaligned sequences if alignment not known or trusted • Consistent theory behind gap/insertion penalties • Less skill and intervention needed to train a good HMM vs. hand constructed profile • Can make libraries of hundreds of profile HMMs and apply them on a large scale (whole genome)

  27. Drawbacks of HMMs • Do not capture higher-order correlations • Assumes identity of a particular position is independent of the identity of all other positions • Scoring by probability / Training heuristics • pseudocounts

  28. References • Salzberg et al., Computational Methods in Molecular Biology (Chapter 4: Krogh), Elsevier, 1998 • http://www.cs.jhu.edu/~salzberg/compbio-book.html • http://www.cbs.dtu.dk/krogh/refs.html • Rabiner, L. R., A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE (1989) 77, 257-286 • Good introduction to HMMs • Krogh et al., Hidden Markov Models in Computational Biology (applications to protein modeling). J. Mol. Biol. (1994) 235, 1501-1531 • Krogh is a name I saw a lot in this area • Liu S., et al., Markovian Structures in Biological Sequence Alignments. Journal of the American Statistical Association, March 1999, Vol. 94, No. 445 • HMMER user’s guide – http://be.embnet.org/HMMERman/node9.html
