DNA Analysis, Part II
Amir Golnabi
ENGS 112, Spring 2008

What we saw in Part I:
- Markov Chain
- DNA and Modeling
- Markovian Models for DNA Sequences
- HMM for DNA Sequences

Part II:
- DNA Methylation and CpG islands
- Markov Chain Model
- Hidden Markov Model
- Finding the State Path
- Parameter Estimation for HMMs
- References

1. DNA Methylation and CpG islands
- The CG dinucleotide (a C followed by a G on the same strand) in the human genome:
  - The cytosine is typically modified by methylation.
  - Methyl-C has a high chance of mutating into a T.
  - As a result, CG dinucleotides are rarer in the genome than the separate frequencies of C and G would predict.
- Methylation is suppressed in short stretches of the genome, such as around the promoters or start regions of many genes.
  - These stretches contain more CG dinucleotides: CpG islands.
  - The "p" indicates that the "C" and the "G" are connected by a phosphodiester bond.
- Two questions:
  1. Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
  2. Given a long piece of sequence, how would we find the CpG islands in it?
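
2. Given a short stretch of genomic sequence, how would we decide whether it comes from a CpG island?
- Markov chain:
  - Transition probabilities: $a_{st} = P(x_i = t \mid x_{i-1} = s)$
  - Probability of a sequence $x$ of length $L$: $P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
  - Beginning and end of sequences: modeled by adding silent begin and end states, which emit no symbol. $P(x_1)$ is then the transition probability from the begin state to $x_1$.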
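
The product of transition probabilities that gives the probability of a sequence under a Markov chain can be sketched in a few lines of Python. The transition table `A` and the uniform first-symbol distribution `INITIAL` are hypothetical values chosen for illustration; they are not the values from the slides.

```python
# Hypothetical transition probabilities a[s][t]; illustrative values only.
A = {
    "A": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "T": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
}
# Assumed distribution of the first symbol (plays the role of the
# transitions out of the silent begin state).
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def sequence_probability(x, a=A, initial=INITIAL):
    """Return P(x) = P(x_1) * prod_i a[x_{i-1}][x_i] for a non-empty sequence x."""
    p = initial[x[0]]
    for s, t in zip(x, x[1:]):
        p *= a[s][t]
    return p


print(sequence_probability("CGCG"))  # 0.25 * 0.3 * 0.3 * 0.3
```

For long sequences the product underflows quickly, so in practice one would usually sum log probabilities instead.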
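
Transition probabilities using the maximum likelihood estimator for CpG islands:
- Two Markov chain models:
  - CpG islands (the '+' model)
  - Remainder of the sequence (the '-' model)
- Estimator for the '+' model (and analogously for the '-' model): $a^{+}_{st} = c^{+}_{st} / \sum_{t'} c^{+}_{st'}$, where $c^{+}_{st}$ is the number of times letter $t$ follows letter $s$ in the '+' training sequences.
- Properties of the resulting tables of transition frequencies:
  - Each row sums to 1.
  - The tables are asymmetric.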
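
As a minimal sketch of maximum likelihood estimation by counting, the function below tallies how often each letter follows each other letter and normalizes every row. The training sequences are made-up toy data, and the `pseudocount` parameter is an added assumption (not from the slides) that keeps unseen transitions from getting probability zero.

```python
ALPHABET = "ACGT"


def estimate_transitions(sequences, pseudocount=0.0):
    """Estimate a[s][t] = c[s][t] / sum_t' c[s][t'] from training sequences."""
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in sequences:
        # Count every adjacent pair (s, t) in the sequence.
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    probs = {}
    for s in ALPHABET:
        total = sum(counts[s].values())
        # A row with no observations is left as all zeros.
        probs[s] = {t: (counts[s][t] / total if total > 0 else 0.0)
                    for t in ALPHABET}
    return probs


# Hypothetical toy training set standing in for labeled CpG island sequences.
plus_training = ["CGCGGCGC", "GCGCGCCG"]
a_plus = estimate_transitions(plus_training, pseudocount=1.0)
print(a_plus["C"]["G"])  # 7/11: six observed C->G pairs plus one pseudocount
```

The same function applied to sequences from outside the islands would give the '-' table.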
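
To use this model for discrimination:
- Log-odds ratio: $S(x) = \log \frac{P(x \mid \text{model } +)}{P(x \mid \text{model } -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$
  - $x$ is the sequence, with $x_0$ denoting the silent begin state.
  - $\beta_{x_{i-1} x_i}$ is the log likelihood ratio of the corresponding transition probabilities.
  - $a^{+}_{st}$ and $a^{-}_{st}$ are the corresponding transition probabilities of the '+' and '-' models.
- Figure: histogram of the length-normalized scores $S(x)$ for all the sequences (~60,000 nucleotides).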
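
The log-odds score can be sketched as follows. Both transition tables are hypothetical illustrative values, not the tables from the slides, and the sketch makes two simplifying assumptions: the begin-state term is dropped (as if both models had the same first-symbol distribution), and the logarithm is taken base 2 so that scores are in bits.

```python
import math

# Hypothetical '+' and '-' transition tables; illustrative values only.
A_PLUS = {
    "A": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "T": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
}
A_MINUS = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.25, "C": 0.25, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
}


def log_odds(x, a_plus=A_PLUS, a_minus=A_MINUS):
    """Sum of log2(a+ / a-) over consecutive symbol pairs of x."""
    return sum(math.log2(a_plus[s][t] / a_minus[s][t])
               for s, t in zip(x, x[1:]))


def normalized_score(x):
    """Length-normalized score in bits per nucleotide."""
    return log_odds(x) / len(x)


print(normalized_score("CGCG"))  # positive: looks more like the '+' model
print(normalized_score("TATA"))  # negative: looks more like the '-' model
```

A positive score favors the CpG island model and a negative score favors the background model; dividing by the length makes scores of sequences of different lengths comparable.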
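
3. Given a long piece of sequence, how would we find the CpG islands in it?
- Use a single model for the entire sequence that incorporates both Markov chains: a hidden Markov model (HMM).
  - The transition probabilities within each set ('+' or '-') are similar to those of the original chain.
  - There is a small chance of switching between '+' and '-' regions.
  - There is no one-to-one correspondence between states and symbols: each symbol can be emitted by two states, for example C by C+ or C-.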
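
- Sequence of states (the path $\pi$):
  - Transition probabilities: $a_{st} = P(\pi_i = t \mid \pi_{i-1} = s)$
  - The state sequence is hidden in an HMM.
- Sequence of symbols:
  - Emission probabilities: $e_s(b) = P(x_i = b \mid \pi_i = s)$, the probability that symbol $b$ is seen in state $s$.
  - The emission probabilities in the CpG island model are all 0 or 1.
- A sequence can be generated from an HMM as follows:
  1. A state $\pi_1$ is chosen according to $a_{0 \pi_1}$, where 0 denotes the begin state.
  2. In $\pi_1$, an observation $x_1$ is emitted according to $e_{\pi_1}$.
  3. A new state $\pi_2$ is chosen according to $a_{\pi_1 \pi_2}$, and so forth.
  The result is a sequence of random observations $x$.
- $P(x)$ is the probability that $x$ was generated by the model.
- Joint probability of an observed sequence $x$ and a state sequence $\pi$: $P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$, with $\pi_{L+1} = 0$ (the end state).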
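
The generation procedure and the joint probability can be sketched for an eight-state CpG model. Everything numeric here is an assumption made for illustration: the within-region tables `PLUS` and `MINUS`, the switching probability `P_SWITCH`, the way the eight-state transitions are assembled from them, the uniform choice of the first state, and the omission of an explicit end state.

```python
import random

BASES = "ACGT"
# Hypothetical within-region transition rows (column order A, C, G, T).
PLUS = {"A": [0.2, 0.3, 0.4, 0.1], "C": [0.2, 0.3, 0.3, 0.2],
        "G": [0.2, 0.3, 0.4, 0.1], "T": [0.1, 0.4, 0.3, 0.2]}
MINUS = {"A": [0.3, 0.2, 0.3, 0.2], "C": [0.3, 0.3, 0.1, 0.3],
         "G": [0.25, 0.25, 0.3, 0.2], "T": [0.2, 0.2, 0.3, 0.3]}
P_SWITCH = 0.05  # assumed small chance of switching between + and - regions
STATES = [b + sign for sign in "+-" for b in BASES]  # A+, C+, ..., T-


def transition(s, t):
    """a_st: stay in the region with prob 1 - P_SWITCH, else switch."""
    table = PLUS if t[1] == "+" else MINUS
    within = table[s[0]][BASES.index(t[0])]
    return within * ((1 - P_SWITCH) if s[1] == t[1] else P_SWITCH)


def emission(s, b):
    """e_s(b): each state emits exactly its own base, so this is 0 or 1."""
    return 1.0 if s[0] == b else 0.0


def generate(length, rng):
    """Sample a path and the symbol sequence it emits."""
    path = [rng.choice(STATES)]  # assumed uniform a_{0, pi_1}
    while len(path) < length:
        weights = [transition(path[-1], t) for t in STATES]
        path.append(rng.choices(STATES, weights=weights)[0])
    return "".join(s[0] for s in path), path


def joint_probability(x, path):
    """P(x, pi) without the end-state term."""
    p = 1.0 / len(STATES)
    for i, (b, s) in enumerate(zip(x, path)):
        if i > 0:
            p *= transition(path[i - 1], s)
        p *= emission(s, b)
    return p


x, path = generate(12, random.Random(0))
print(x, path, joint_probability(x, path))
```

Because the emissions are deterministic, the symbol sequence can be read directly off the path, but the path cannot be read off the symbols.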
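
- Example: the probability of the sequence CGCG being emitted by the state sequence (C+, G-, C-, G+):
  $P(x, \pi) = a_{0,C+} \times 1 \times a_{C+,G-} \times 1 \times a_{G-,C-} \times 1 \times a_{C-,G+} \times 1 \times a_{G+,0}$
- The joint probability is not very useful in practice because the path is not known.

4. Finding the State Path
- Path estimation:
  - By finding the most likely path: the Viterbi algorithm
  - Forward or backward algorithm
- Example: the CpG model generating the symbol sequence CGCG
  - Candidate state sequences: (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C-, G+)
  - (C+, G-, C-, G+): switches back and forth between '+' and '-', and each switch has a small probability.
  - (C-, G-, C-, G-): small probability of CG in the '-' group.
  - (C+, G+, C+, G+): the most probable path.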
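
A minimal sketch of the Viterbi algorithm for the CGCG example is given below, working in log space to avoid underflow. The model parameters are hypothetical: `PLUS`, `MINUS`, and `P_SWITCH` are illustrative values, the first state is assumed to be chosen uniformly, and there is no explicit end state.

```python
import math

BASES = "ACGT"
# Hypothetical within-region transition rows (column order A, C, G, T).
PLUS = {"A": [0.2, 0.3, 0.4, 0.1], "C": [0.2, 0.3, 0.3, 0.2],
        "G": [0.2, 0.3, 0.4, 0.1], "T": [0.1, 0.4, 0.3, 0.2]}
MINUS = {"A": [0.3, 0.2, 0.3, 0.2], "C": [0.3, 0.3, 0.1, 0.3],
         "G": [0.25, 0.25, 0.3, 0.2], "T": [0.2, 0.2, 0.3, 0.3]}
P_SWITCH = 0.05  # assumed small chance of switching between + and - regions
STATES = [b + sign for sign in "+-" for b in BASES]


def log_transition(s, t):
    """log a_st for the assumed eight-state model."""
    table = PLUS if t[1] == "+" else MINUS
    within = table[s[0]][BASES.index(t[0])]
    return math.log(within * ((1 - P_SWITCH) if s[1] == t[1] else P_SWITCH))


def viterbi(x):
    """Return the most probable state path for a non-empty ACGT string x."""
    neg_inf = float("-inf")
    start = math.log(1.0 / len(STATES))
    # A state can only explain x[0] if it emits that base (emissions are 0 or 1).
    v = {s: (start if s[0] == x[0] else neg_inf) for s in STATES}
    pointers = []
    for b in x[1:]:
        new_v, ptr = {}, {}
        for t in STATES:
            if t[0] != b:
                new_v[t] = neg_inf
                continue
            best = max(STATES, key=lambda s: v[s] + log_transition(s, t))
            new_v[t] = v[best] + log_transition(best, t)
            ptr[t] = best
        pointers.append(ptr)
        v = new_v
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: v[s])]
    for ptr in reversed(pointers):
        path.append(ptr[path[-1]])
    return path[::-1]


print(viterbi("CGCG"))  # ['C+', 'G+', 'C+', 'G+']
```

With these assumed parameters the all-'+' path wins for CGCG, in line with the example above, while a sequence such as TATA is assigned entirely to '-' states.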

5. Parameter Estimation for HMMs
- Building an HMM involves two steps:
  - Design the structure: the states and their connections.
  - Assign the parameter values: the transition and emission probabilities.
- Training methods: Baum-Welch training and Viterbi training

7. References
- Bandyopadhyay, Sanghamitra. Gene Identification: Classical and Computational Intelligence Approach. 38 vols. IEEE, Jan. 2008.
- Durbin, R., S. Eddy, and A. Krogh. Biological Sequence Analysis. Cambridge: Cambridge University Press, 1998.
- Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001.
- Birney, E. "Hidden Markov Models in Biological Sequence Analysis." July 2001.
- Haussler, David, David Kulp, Martin Reese, and Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA."
- Boufounos, Petros, Sameh El-Difrawy, and Dan Ehrlich. "Hidden Markov Models for DNA Sequencing."