DNA Analysis
Amir Golnabi
ENGS 112, Spring 2008
Outline:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. Hidden Markov Models (HMM)
5. HMM for DNA Sequences
6. Future Work
7. References
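1. Markov Chain
- Alphabet: $S = \{s_1, \dots, s_J\}$. The elements of $S$ are called states, and $S$ is the state space.
- Notation: a sequence of random variables $X_0, X_1, \dots, X_n, \dots$, each taking values in $S$.
- A sequence of random variables $(X_n)_{n \ge 0}$ is called a Markov chain (MC) if, for all $n \ge 1$ and all $j_0, \dots, j_n \in S$,
  $$P(X_n = j_n \mid X_0 = j_0, \dots, X_{n-1} = j_{n-1}) = P(X_n = j_n \mid X_{n-1} = j_{n-1}).$$
- The conditional probability of a future event depends only upon the immediate past event.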
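1. Markov Chain (cont.)
- Conditional probability: for random variables $X_n$ taking values in the state space $S$ and states $i, j \in S$, $p_{ij} = P(X_n = j \mid X_{n-1} = i)$, taken here not to depend on $n$.
- Transition matrix: $P = (p_{ij})_{i, j \in S}$.
- Property: $p_{ij} \ge 0$ and $\sum_{j \in S} p_{ij} = 1$ for every $i \in S$.
- Higher-order Markov chains: the next state depends on more than one preceding state.
- Second-order MC: for all $n \ge 2$,
  $$P(X_n = j_n \mid X_0 = j_0, \dots, X_{n-1} = j_{n-1}) = P(X_n = j_n \mid X_{n-2} = j_{n-2}, X_{n-1} = j_{n-1}).$$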
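The transition-matrix property and the Markov property can be sketched in a few lines of Python. This is only an illustration: the state space is taken to be the four DNA bases, and the numbers in `P` are hypothetical values chosen so that each row sums to 1. They are not estimates from any real data.

```python
import random

# Assumed state space for illustration: the four DNA bases.
STATES = ["A", "C", "G", "T"]

# Hypothetical transition matrix: P[i][j] = P(X_n = j | X_{n-1} = i).
P = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.5, "G": 0.3, "T": 0.1},
    "G": {"A": 0.1, "C": 0.3, "G": 0.5, "T": 0.1},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def is_stochastic(matrix, tol=1e-9):
    """Check the transition-matrix property: entries >= 0, every row sums to 1."""
    return all(
        all(p >= 0 for p in row.values()) and abs(sum(row.values()) - 1.0) < tol
        for row in matrix.values()
    )


def sample_chain(matrix, start, length, rng):
    """Draw a path; each step looks only at the current state (first order)."""
    path = [start]
    while len(path) < length:
        row = matrix[path[-1]]
        path.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(path)


rng = random.Random(0)  # fixed seed so repeated runs give the same path
sequence = sample_chain(P, "A", 30, rng)
print(is_stochastic(P), sequence)
```

A second-order chain could be handled with the same functions by using pairs of bases as the states, at the cost of a 16-row matrix.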
2. DNA and Modeling
- Bases: {A, T, C, G}
- Complementary strands: it suffices to give the sequence of bases in a single strand.
- Sequences are always read from the 5’ end to the 3’ end.
- DNA -> mRNA -> proteins (transcription and translation)
- Codons: triples of bases that code for amino acids; 61 coding codons + 3 ‘stop’ codons
- A specific sequence of codons -> gene
- Chromosomes -> genome
- Exons: coding portions of genes
- Introns: non-coding regions
- Goal: to determine the nucleotide sequence of entire genomes
3. Markov Chains for DNA Sequences
- Nucleotides are chained linearly, one by one -> local dependence between the bases and their neighbors.
- Markov chains offer computationally effective ways of expressing the various frequencies and local dependencies.
- The bases of the alphabet {A, T, C, G} are not uniformly distributed in any sequence, and the composition varies within and between sequences.
- The probability of finding a particular base at one position can depend not only on the immediately adjacent bases but also on several more distant bases upstream or downstream -> higher-order (heterogeneous) Markov models.
- Gene finding: Markov models of coding and non-coding regions are used to classify segments as either exons or introns.
- Segmentation: decomposing DNA sequences into homogeneous regions -> Hidden Markov Models.
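As a rough illustration of how a first-order Markov chain expresses base frequencies and local dependencies, the sketch below counts single bases and adjacent base pairs in a sequence and turns the counts into estimated probabilities. The function names are hypothetical, and the input is the short example sequence from the HMM slide of this deck. With only 42 bases the estimates are very noisy and are meant only to show the mechanics.

```python
from collections import Counter

BASES = "ACGT"


def base_frequencies(seq):
    """Relative frequency of each base (the composition of the sequence)."""
    counts = Counter(seq)
    return {b: counts[b] / len(seq) for b in BASES}


def estimate_transitions(seq):
    """Estimate P(next base | current base) from counts of adjacent pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))
    matrix = {}
    for a in BASES:
        total = sum(pair_counts[(a, b)] for b in BASES)
        # A base that never has a successor in seq gets an all-zero row.
        matrix[a] = {
            b: (pair_counts[(a, b)] / total if total else 0.0) for b in BASES
        }
    return matrix


dna = "TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA"
print(base_frequencies(dna))
for base, row in estimate_transitions(dna).items():
    print(base, {b: round(p, 2) for b, p in row.items()})
```

Fitting one such matrix to known coding regions and another to non-coding regions, then comparing the likelihood of a new segment under each, is one simple way to approach the classification task mentioned above.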
4. Hidden Markov Models (HMM)
- A stochastic process generated by two interrelated probabilistic mechanisms:
  - an underlying Markov chain with a finite number of states
  - a set of random functions, each associated with its respective state
- State changes follow the transition matrix.
- Only the output of the random functions can be observed; the states themselves are hidden.
- Advantage: HMMs allow local characteristics of molecular sequences to be modeled and predicted within a rigorous statistical framework, and they allow knowledge from prior investigations to be incorporated into the analysis.
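The two interrelated mechanisms can be made concrete with a small generative sketch: a hidden chain moves between states according to a transition matrix, and at each step the current state emits one observable symbol. The state names `s1` and `s2` and all probabilities below are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical hidden-state transition probabilities (each row sums to 1).
TRANSITIONS = {
    "s1": {"s1": 0.9, "s2": 0.1},
    "s2": {"s1": 0.2, "s2": 0.8},
}

# Hypothetical emission distributions: the "random function" of each state.
EMISSIONS = {
    "s1": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "s2": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}


def draw(dist, rng):
    """Draw one key of dist with probability proportional to its value."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]


def sample_hmm(length, start, rng):
    """Return (hidden states, observed symbols) for a run of the given length."""
    hidden, observed = [], []
    state = start
    for _ in range(length):
        hidden.append(state)
        observed.append(draw(EMISSIONS[state], rng))  # visible output
        state = draw(TRANSITIONS[state], rng)         # hidden state change
    return hidden, "".join(observed)


hidden, observed = sample_hmm(40, "s1", random.Random(1))
print(observed)  # an analyst would see only this string, not `hidden`
```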
5. HMM for DNA Sequences
- Every nucleotide in a DNA sequence belongs to either a “Normal” region (N) or a GC-rich region (R).
- The two types are not randomly interspersed: the sequence contains larger stretches of N.
- Example of such a sequence:
  NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN
- States of the HMM: {N, R}
- A possible DNA sequence with this underlying state sequence:
  TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA
- This is not a typical random collection of nucleotides: the GC content is 83% in the R regions vs. 23% in the N regions.
- HMM: identifies these types of features in sequences.
- It can capture both the patchiness of N and R and the different compositional frequencies within the two categories.
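The GC percentages quoted for the example can be checked directly, and the same numbers can serve as emission probabilities in a small decoding sketch. The code below first computes the GC fraction per state from the two example strings, then runs the Viterbi algorithm, a standard HMM decoding method that the slides do not name, to recover a most probable N/R labeling. The transition and start probabilities are assumptions made up for illustration, as is the even split of each GC fraction between G and C.

```python
import math

STATE_PATH = "NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN"
DNA = "TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA"


def gc_fraction_by_state(dna, path):
    """Fraction of G/C bases among the positions labelled with each state."""
    result = {}
    for state in sorted(set(path)):
        bases = [b for b, s in zip(dna, path) if s == state]
        result[state] = sum(b in "GC" for b in bases) / len(bases)
    return result


def viterbi(dna, states, start, trans, emit):
    """Most probable hidden-state path for dna, computed in log space."""
    scores = {s: math.log(start[s]) + math.log(emit[s][dna[0]]) for s in states}
    backpointers = []
    for base in dna[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: scores[r] + math.log(trans[r][s]))
            new_scores[s] = (
                scores[prev] + math.log(trans[prev][s]) + math.log(emit[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


gc = gc_fraction_by_state(DNA, STATE_PATH)
print({s: round(100 * f) for s, f in gc.items()})  # GC percentage per state

# Emissions: split each state's GC fraction evenly over G/C, the rest over A/T.
emit = {
    s: {"G": f / 2, "C": f / 2, "A": (1 - f) / 2, "T": (1 - f) / 2}
    for s, f in gc.items()
}
# Assumed values: long N stretches, somewhat shorter R stretches.
trans = {"N": {"N": 0.9, "R": 0.1}, "R": {"N": 0.2, "R": 0.8}}
start = {"N": 0.5, "R": 0.5}

decoded = viterbi(DNA, "NR", start, trans, emit)
print(decoded)
print(STATE_PATH)
```

Because the parameters are partly invented and the sequence is short, the decoded labeling need not match the labeling on the slide position by position. The point is that an HMM uses both the composition of each region and the tendency of regions to persist.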
6. Future Work
- Better and deeper understanding of HMMs
- Different applications of HMMs, such as segmentation of DNA sequences and gene finding
- Build an automaton for a simple case

7. References
- Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden: Kluwer Academic, 2001.
- Birney, E. "Hidden Markov models in biological sequence analysis". July 2001.
- Haussler, David, David Kulp, Martin Reese, Frank Eeckman. "A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA".
- Boufounos, Petros, Sameh El-Difrawy, Dan Ehrlich. "Hidden Markov Models for DNA Sequencing".