1 / 22

Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev

Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev Institute des Hautes Etudes Scientifique, Bures-sur-Yvette. Markov chain models. Transition probabilities = Frequencies of N-grams … AGGTC G ATC … …A GGTCG A TC … …AG GTCGA T C …. f AAA. f AAC.

shauna
Download Presentation

Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev Institute des Hautes Etudes Scientifique, Bures-sur-Yvette

  2. Markov chain models Transition probabilities = Frequencies of N-grams …AGGTCGATC … …AGGTCGATC … …AGGTCGATC …

  3. fAAA fAAC = fijk, i,j,k in [A,C,G,T] … fGGG Sliding window AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC AGGTCGATCAATCCGTATTGACAATCCAATCCGTATTGACATGACAATCCAACATGACAATC width W

  4. fijk(1) fijk correct frame fijk(2) Protein-coding sequences AGGTCGATGAATCCGTATTGACAAATGAATCCGTAATGACATGACAATCCAACATGACAAT bacterial gene

  5. =G “Shadow” genes TCCAGCTTATGAGGCATAACTGTTTACTGAGGCCATACTGTACTGTTAGGTTGTACTGTTA AGGTCGAATACTCCGTATTGACAAATGACTCCGGTATGACATGACAATCCAACATGACAAT shadow gene ,

  6. non-coding When we can detect genes (by their content)? • When non-coding regions are very different in base composition (e.g., different GC-content) • When distances between the phases are large: ,

  7. Simple experiment • Only the forward strands of genomes are used for triplet counting • Every p positions in the sequence, open a window (x-W/2,x+W/2,) of size W and centered at position x • Every window, starting from the first base-pair, is divided into W/3 non-overlapping triplets, and the frequencies of all triplets fijk are calculated • The dataset consists of N = [L/p]points, where L is the entire length of the sequence • Every data point Xi={xis}corresponds to one window and has 64 coordinates, corresponding to the frequencies of all possible triplets s = 1,…,64 ,

  8. 1st Principal axis Maximal dispersion 2nd principal axis Principal Component Analysis ,

  9. ViDaExpert tool ,

  10. Caulobacter crescentus (GenBank NC_002696) ,

  11. “Path” of sliding window ,

  12. Helicobacter pylori (GenBank NC_000921) ,

  13. Saccharomyces cerevisiae chromosome IV ,

  14. Model sequences: (random codon usage) ,

  15. Model sequences: (random codon usage+ 50% of frequencies are set to 0) ,

  16. Graph of coding phase ,

  17. Assessment , Completely blind prediction

  18. Dependence on window size ,

  19. W = 51 W = 252 W = 2000 W = 900 Dependence on window size ,

  20. State of art: GLIMMER strategy • Use MM of 5th order (hexamers) • Use interpolation for transition probabilities • Use long ORF (>500bp) as learning dataset • Problems: • The number of hexamers to be evaluatedis still big • Applicable only for collected genomes of good quality (<1frameshift/1000bp) ,

  21. What can we learn from this game? • Learning can be replaced with self-learning • Bacterial gene-finders work relatively well, when concentration of coding sequences is high • Correlations in the order of codons are small • Codon usage is approximately the same along the genome • The method presented allows self-learning on piecesof even uncollected DNA (>150 bp) • The method gives alternative to HMM view on the problem of gene recognition ,

  22. Acknowledgements Professor Alexander Gorban Professor Misha Gromov , My coordinates: http://www.ihes.fr/~zinovyev

More Related