220 likes | 309 Views
Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev Institute des Hautes Etudes Scientifique, Bures-sur-Yvette. Markov chain models. Transition probabilities = Frequencies of N-grams … AGGTC G ATC … …A GGTCG A TC … …AG GTCGA T C …. f AAA. f AAC.
E N D
Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev Institute des Hautes Etudes Scientifique, Bures-sur-Yvette
Markov chain models Transition probabilities = Frequencies of N-grams …AGGTCGATC … …AGGTCGATC … …AGGTCGATC …
fAAA fAAC = fijk, i,j,k in [A,C,G,T] … fGGG Sliding window AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC width W
fijk(1) fijk correct frame fijk(2) Protein-coding sequences AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT bacterial gene
=G “Shadow” genes TCCAGCTTATGAGGCATAACTGTTTACTGAGGCCATACTGTACTGTTAGGTTGTACTGTTA AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT shadow gene ,
non-coding When we can detect genes (by their content)? • When non-coding regions are very different in base composition (e.g., different GC-content) • When distances between the phases are large: ,
Simple experiment • Only the forward strands of genomes are used for triplet counting • Every p positions in the sequence, open a window (x-W/2,x+W/2,) of size W and centered at position x • Every window, starting from the first base-pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets fijk are calculated • The dataset consists of N = [L/p]points, where L is the entire length of the sequence • Every data point Xi={xis}corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets s = 1,…,64 ,
1st Principal axis Maximal dispersion 2nd principal axis Principal Component Analysis ,
Caulobacter crescentus (GenBank NC_002696) ,
Helicobacter pylori (GenBank NC_000921) ,
Saccharomyces cerevisiae chromosome IV ,
Model sequences: (random codon usage) ,
Model sequences: (random codon usage+ 50% of frequencies are set to 0) ,
Assessment , Completely blind prediction
W = 51 W = 252 W = 2000 W = 900 Dependence on window size ,
State of art: GLIMMER strategy • Use MM of 5th order (hexamers) • Use interpolation for transition probabilities • Use long ORF (>500bp) as learning dataset • Problems: • The number of hexamers to be evaluatedis still big • Applicable only for collected genomes of good quality (<1frameshift/1000bp) ,
What can we learn from this game? • Learning can be replaced with self-learning • Bacterial gene-finders work relatively well, when concentration of coding sequences is high • Correlations in the order of codons are small • Codon usage is approximately the same along the genome • The method presented allows self-learning on piecesof even uncollected DNA (>150 bp) • The method gives alternative to HMM view on the problem of gene recognition ,
Acknowledgements Professor Alexander Gorban Professor Misha Gromov , My coordinates: http://www.ihes.fr/~zinovyev