
HMM for CpG Islands


Presentation Transcript


  1. HMM for CpG Islands Arti Kelkar Pete Rossetti Peter Warren

  2. HMM for CpG Islands • HMM history • General background • Three Fundamental problems • Evaluation • Decoding • Training

  3. HMM for CpG Islands • HMM Applications • Bioinformatics • Non-Bioinformatics • CpG Islands Problem • CpG Islands • Definition • Why interesting • Hidden Markov Model for CpG • What’s Hidden • Mathematica Implementation • Training • Decoding

  4. Andrei Andreyevich Markov (1856-1922)

  5. AA Markov • Early 1900s • Markov conceives “Markov chains” including a proof of the Central Limit theorem for Markov Chains • Studies with Chebyshev and takes over his classes at Univ. of St. Petersburg • 1913 • Russian government celebrates the 300th anniversary of the House of Romanov • AA Markov organizes a counter-celebration – the 200th anniversary of Bernoulli’s Law of Large Numbers

  6. HMM – History • 1960s • Use of HMMs developed by a cold-war era research team in a classified program at the Communication Research Division of the Institute for Defense Analyses. (Oscar Rothaus). • 1970s • HMM work is de-classified and is soon being used in many peaceful applications.

  7. Markov Chain • Sunny yesterday • ==> 0.5 probability that it will be sunny today, and 0.25 each that it will be cloudy or rainy
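A minimal Python sketch of the weather chain on this slide. Only the "sunny" row (0.5, 0.25, 0.25) comes from the slide; the "cloudy" and "rainy" rows, and the names `TRANSITIONS` and `step`, are assumptions made up for illustration.

```python
# Weather Markov chain: tomorrow depends only on today.
# Only the "sunny" row is taken from the slide; the other two rows
# are illustrative assumptions.
STATES = ["sunny", "cloudy", "rainy"]
TRANSITIONS = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},  # assumed
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},   # assumed
}


def step(dist):
    """Advance a probability distribution over states by one day."""
    return {
        nxt: sum(dist[cur] * TRANSITIONS[cur][nxt] for cur in STATES)
        for nxt in STATES
    }


yesterday = {"sunny": 1.0, "cloudy": 0.0, "rainy": 0.0}
today = step(yesterday)      # reproduces the "sunny" row of the table
tomorrow = step(today)       # two-step forecast under the assumed rows
print(today)
print(tomorrow)
```

Applying `step` repeatedly gives the forecast any number of days ahead, which is all a plain (non-hidden) Markov chain needs.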

  8. Hidden Markov Model

  9. HMM Definition • Hidden Markov Model is a triplet (Π, A, B) • Π Vector of initial state probabilities • A Matrix of state transition probabilities • B Matrix of observation probabilities • N Number of hidden states in the model • M Number of observation symbols
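One way to hold the triplet (Π, A, B) in code, as a sketch only. The class name `HMM`, the method `shape`, and the toy numbers are assumptions for illustration; the slide itself specifies only the meaning of Π, A, B, N and M.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    pi: List[float]        # Π: initial state probabilities, length N
    A: List[List[float]]   # A: state transition probabilities, N x N
    B: List[List[float]]   # B: observation probabilities, N x M

    def shape(self, tol: float = 1e-9):
        """Check dimensions and row sums, then return (N, M)."""
        n = len(self.pi)
        m = len(self.B[0])
        if len(self.A) != n or len(self.B) != n:
            raise ValueError("A and B need one row per hidden state")
        if any(len(r) != n for r in self.A) or any(len(r) != m for r in self.B):
            raise ValueError("A must be N x N and B must be N x M")
        if any(abs(sum(r) - 1.0) > tol for r in [self.pi] + self.A + self.B):
            raise ValueError("every probability row must sum to 1")
        return n, m


# Toy model (made-up numbers): 2 hidden states, 4 observation symbols.
toy = HMM(
    pi=[0.5, 0.5],
    A=[[0.9, 0.1], [0.2, 0.8]],
    B=[[0.25, 0.25, 0.25, 0.25], [0.1, 0.4, 0.4, 0.1]],
)
N, M = toy.shape()
print(N, M)
```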

  10. HMM – Three Problems • Evaluation • Decoding • Training

  11. HMM - Overview Evaluation Problem • Given a set of HMMs (HMM 1, HMM 2, HMM 3, …, HMM n), which is the one most likely to have produced the observation sequence? • Observation sequence: GACGAAACCCTGTCTCTATTTATCC • The Forward Algorithm computes the probability of the sequence under each HMM; the HMM with the maximum probability is chosen
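The evaluation step can be sketched in Python as follows. This is a scaled forward algorithm written for illustration, not the presenters' Mathematica code; the two candidate models in `MODELS` and all their numbers are made up, and only the observation sequence comes from the slide.

```python
import math

ALPHABET = "ACGT"


def encode(seq):
    """Map a DNA string to a list of symbol indices."""
    return [ALPHABET.index(ch) for ch in seq]


def forward_log_likelihood(obs, pi, A, B):
    """Return log P(obs | model) using the forward algorithm with scaling."""
    n = len(pi)
    alpha = [pi[k] * B[k][obs[0]] for k in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Sum over all predecessor states, then emit symbol t.
            alpha = [
                B[l][obs[t]] * sum(alpha[k] * A[k][l] for k in range(n))
                for l in range(n)
            ]
        scale = sum(alpha)            # rescale to avoid numeric underflow
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_like


# Two hypothetical candidate models, each given as (pi, A, B).
MODELS = {
    "uniform": ([1.0], [[1.0]], [[0.25, 0.25, 0.25, 0.25]]),
    "two_state": (
        [0.5, 0.5],
        [[0.9, 0.1], [0.1, 0.9]],
        [[0.15, 0.35, 0.35, 0.15], [0.35, 0.15, 0.15, 0.35]],
    ),
}

obs = encode("GACGAAACCCTGTCTCTATTTATCC")
scores = {name: forward_log_likelihood(obs, *m) for name, m in MODELS.items()}
best = max(scores, key=scores.get)   # model most likely to have produced obs
print(scores, best)
```

Working with log-likelihoods keeps the comparison stable for long sequences, where the raw probabilities would underflow.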

  12. HMM - Overview Decoding Problem • States A+,C+,G+,T+,A-,C-,G-,T- • Obs seq: A G C G C • [Figure: trellis with one column of the eight states for each of the five observed symbols]

  13. HMM - Overview Training Problem • From raw sequence data… to transition probabilities • [Figure: block of raw DNA sequence (AATAGAGAGGTTCGACTCTGCATTTCCC…) leading to an 8 × 8 transition table over the states A+ C+ G+ T+ A- C- G- T-] • How?

  14. HMM - Applications BioInformatics • DNA Sequence analysis • Protein family profiling • Prediction of protein folding • Prediction of genes • Horizontal gene transfer • Radiation hybrid mapping, linkage analysis • Prediction of DNA functional sites. • CpG island prediction • Splicing signals prediction

  15. HMM - Applications Non-BioInformatics • Speech Recognition • Vehicle Trajectory Projection • Gesture Learning for Human-Robot Interface • Positron Emission Tomography (PET) • Optical Signal Detection • Digital Communications • Music Analysis

  16. Some HMM-based Bioinformatics Resources • PROBE www.ncbi.nlm.nih.gov/ • BLOCKS www.blocks.fhcrc.org/ • META-MEME www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html • SAM www.cse.ucsc.edu/research/compbio/sam.html • HMMER hmmer.wustl.edu/ • HMMpro www.netid.com/ • GENEWISE www.sanger.ac.uk/Software/Wise2/ • PSI-BLAST www.ncbi.nlm.nih.gov/BLAST/newblast.html • PFAM www.sanger.ac.uk/Pfam/

  17. HMM for CpG Islands CpG ISLANDS • “CpG” means “C precedes G” • Not CG base pairs

  18. HMM for CpG Islands • Nucleotides - 4 bases in DNA: • A (Adenine) • C (Cytosine) • G (Guanine) • T (Thymine)

  19. HMM for CpG Islands What’s a “CpG Island”? • CG-poor regions: P(CG) ~ 0.07 • CG-rich region: P(CG) ~ 0.25 • [Figure: DNA strand with the promoter region and the gene coding region marked]

  20. HMM for CpG Islands Why the difference? • Away from gene regions: • The C in CG dinucleotides is usually methylated • Methylation inhibits gene transcription • These CGs tend to mutate to TG • Near promoter and coding regions: • Methylation is suppressed: • CGs remain CGs • Makes transcription easier!

  21. HMM for CpG Islands Motivation: • CpG-rich regions are associated with genes which are frequently transcribed. • Helps to understand gene expression related to location in genome.

  22. HMM for CpG Islands Motivation: • Q: Why an HMM? • It can answer the questions: • Short sequence: does it come from a CpG island or not? • Long sequence: where are the CpG islands? • So, what’s a good model? • Well, we need states for ISLAND bases and NON-ISLAND bases …

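  23. HMM for CpG Islands Straight Markov Models • CpG Island (+): START → A+, C+, G+, T+ → END, where each state emits its own base with probability 1 (A+: P(A) = 1, C+: P(C) = 1, G+: P(G) = 1, T+: P(T) = 1) • CpG NON-Island (-): START → A-, C-, G-, T- → END, likewise with A-: P(A) = 1, C-: P(C) = 1, G-: P(G) = 1, T-: P(T) = 1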

  24. HMM for CpG Islands Combined Hidden Markov Model • One START and one END for the whole model • CpG Island states: A+ P(A) = 1, C+ P(C) = 1, G+ P(G) = 1, T+ P(T) = 1 • CpG NON-Island states: A- P(A) = 1, C- P(C) = 1, G- P(G) = 1, T- P(T) = 1
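A sketch of how the two four-state chains might be merged into one eight-state model. Everything numeric here is an assumption: the tables are uniform except for the C row of the non-island table, which is shaped to echo the P(CG) ~ 0.07 figure quoted earlier in the talk, and the switching probability `p_switch` and the function `combine` are hypothetical. Real tables would come from training.

```python
BASES = "ACGT"
STATES = [b + "+" for b in BASES] + [b + "-" for b in BASES]


def uniform_rows():
    """4 x 4 table with every transition equally likely (placeholder)."""
    return {a: {b: 0.25 for b in BASES} for a in BASES}


# Hypothetical within-region tables: PLUS for islands, MINUS for non-islands.
PLUS = uniform_rows()
MINUS = uniform_rows()
MINUS["C"] = {"A": 0.31, "C": 0.31, "G": 0.07, "T": 0.31}  # C->G is rare


def combine(plus, minus, p_switch=0.1):
    """Build the 8 x 8 transition table of the combined model.

    Assumption: with probability p_switch the chain changes region, and the
    next base is then drawn from the table of the region being entered.
    """
    tables = {"+": plus, "-": minus}
    trans = {}
    for k in STATES:
        trans[k] = {}
        for l in STATES:
            weight = (1 - p_switch) if k[1] == l[1] else p_switch
            trans[k][l] = weight * tables[l[1]][k[0]][l[0]]
    return trans


# Each state emits its own base with probability 1, as on the slide.
EMIT = {s: {b: 1.0 if s[0] == b else 0.0 for b in BASES} for s in STATES}
TRANS = combine(PLUS, MINUS)
print(TRANS["C+"]["G+"], TRANS["C-"]["G-"])
```

Because emissions are deterministic, the observed base fixes the letter of the state, and only the "+" or "-" label remains to be inferred.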

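  25. HMM for CpG Islands What’s “hidden”? • Visible: the observed bases A, C, G, T • Hidden: whether each base was produced by a CpG Island state (A+, C+, G+, T+) or a CpG NON-Island state (A-, C-, G-, T-) • [Figure: combined model between START and END, with the observed bases drawn above the two groups of states]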

  26. HMM for CpG Islands The Three Problems • (Evaluation – not in CpG Islands) • Training • Decoding

  27. HMM for CpG Islands Training Problem • Input: CG-RICH sequences and CG-POOR sequences [Figure: two blocks of raw DNA sequence, one per class] • Output: transition probabilities between the states A+ C+ G+ T+ A- C- G- T- • HOW? ML (maximum likelihood) or Forward/Backward algorithm
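When the training sequences are already labelled as CG-rich or CG-poor, the maximum-likelihood route reduces to counting adjacent base pairs. The sketch below assumes that setting; the function name `estimate_transitions`, the pseudocount, and the choice to skip the ambiguous base N are illustrative decisions, and the two input strings are just short prefixes of the sequence blocks shown on the slide.

```python
BASES = "ACGT"


def estimate_transitions(seqs, pseudocount=1.0):
    """Maximum-likelihood transition table from labelled sequences.

    Counts each adjacent pair (a, b), adds a pseudocount so that unseen
    pairs keep a non-zero probability, and normalises every row.
    Pairs containing a symbol outside ACGT (such as N) are skipped.
    """
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            if a in counts and b in counts:
                counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }


# One table per class: "+" from CG-rich data, "-" from CG-poor data.
cg_rich = ["AATAGAGAGGTTCGACTCTGCATTTCCCAAATACG"]
cg_poor = ["GGTTCGACTCTGCATTTCCCAAATACGTAATGCTTACGG"]
plus = estimate_transitions(cg_rich)
minus = estimate_transitions(cg_poor)
print(plus["C"]["G"], minus["C"]["G"])
```

With unlabelled data the counts are not available directly, which is where the Forward/Backward algorithm named on the slide comes in.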

  28. HMM for CpG Islands Decoding Problem Viterbi Algorithm • Decoding: find the meaning of an observation sequence by recovering the underlying states. • Hidden states A+,C+,G+,T+,A-,C-,G-,T- • Observation sequence CGCGA • Candidate state sequences C+,G+,C+,G+,A+ or C-,G-,C-,G-,A- or C+,G-,C+,G-,A+ • Most Probable Path C+,G+,C+,G+,A+

  29. HMM for CpG Islands Decoding Problem II Viterbi Algorithm • Hidden Markov model: state set $S$, transition probabilities $a_{kl}$, symbol alphabet $\Sigma$, emission probabilities $e_l(x)$. • Observed symbol sequence $E = x_1, \ldots, x_n$. • Find: the most probable path of states that resulted in symbol sequence $E$. • Let $v_k(i)$ be the partial probability of the most probable path for the symbols $x_1, x_2, \ldots, x_i$ ending in state $k$. Then: $v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right)$
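The recurrence above translates almost line for line into code. The sketch below works in log space and runs on a made-up eight-state CpG model: the transition numbers, the switching probability of 0.1 and the uniform start distribution are all assumptions, not the trained values used in the talk.

```python
import math

BASES = "ACGT"
STATES = [b + "+" for b in BASES] + [b + "-" for b in BASES]


def log(p):
    """Logarithm that maps probability 0 to -infinity instead of failing."""
    return math.log(p) if p > 0 else float("-inf")


def build_model(p_switch=0.1):
    """Hypothetical parameters: uniform tables except a rare C->G in '-'."""
    plus = {a: {b: 0.25 for b in BASES} for a in BASES}
    minus = {a: {b: 0.25 for b in BASES} for a in BASES}
    minus["C"] = {"A": 0.31, "C": 0.31, "G": 0.07, "T": 0.31}
    tables = {"+": plus, "-": minus}
    trans = {}
    for k in STATES:
        for l in STATES:
            weight = (1 - p_switch) if k[1] == l[1] else p_switch
            trans[k, l] = weight * tables[l[1]][k[0]][l[0]]   # a_kl
    start = {k: 1.0 / len(STATES) for k in STATES}
    return start, trans


def emit(state, symbol):
    """e_l(x): each state emits its own base with probability 1."""
    return 1.0 if state[0] == symbol else 0.0


def viterbi(obs, start, trans):
    """Return (most probable state path, its log probability)."""
    v = {k: log(start[k]) + log(emit(k, obs[0])) for k in STATES}
    back = []
    for x in obs[1:]:
        prev, v, ptr = v, {}, {}
        for l in STATES:
            # v_l(i+1) = e_l(x_{i+1}) * max_k v_k(i) * a_kl, in logs
            best = max(STATES, key=lambda k: prev[k] + log(trans[k, l]))
            v[l] = log(emit(l, x)) + prev[best] + log(trans[best, l])
            ptr[l] = best
        back.append(ptr)
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):        # follow the back-pointers
        path.append(ptr[path[-1]])
    return path[::-1], v[last]


start, trans = build_model()
path, logp = viterbi("CGCGA", start, trans)
print(path)  # ['C+', 'G+', 'C+', 'G+', 'A+'] under these assumed parameters
```

Under these assumed numbers the all-island path wins because every C to G step is cheap inside an island and expensive outside it, while each change of region costs an extra factor.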

  30. HMM for CpG Islands Decoding Problem III • [Figure: trellis with one column of the eight states A+, C+, G+, T+, A-, C-, G-, T- for each symbol of the observation sequence A G C G C]

  31. HMM for CpG Islands Decoding Problem III Summary • Computationally no more expensive than the forward algorithm (a max replaces the sum), and far cheaper than enumerating every possible path. • The largest partial probability at the final position is the probability of the most probable path. • Decision of best path is based on the whole sequence, not an individual observation.

  32. HMM for CpG Islands Now, on to our Mathematica implementation…

  33. HMM for CpG Islands References… • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids”, Cambridge University Press, 1998, chapters 3 and 5. • A. Krogh, M. Brown, I. Saira Mian, Kimmen Sjolander, and David Haussler, “Hidden Markov Models in Computational Biology: Applications to Protein Modeling”, J. Mol. Biol. (1994) 235, 1501-1531. • L. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989. • On-line tutorial: http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
