
Hierarchical Cluster Structures and Symmetries in Genomic Sequences


Presentation Transcript


  1. Hierarchical Cluster Structures and Symmetries in Genomic Sequences. Andrei Zinovyev, Institut des Hautes Études Scientifiques, Math@Bio group of M. Gromov

  2. Plan of the talk Genomic sequences: geometric approach, clustering • Genomic sequence as text • Basic 7-cluster structure • Global structure of codon frequencies • Internal structure of codon frequencies • Applications

  3. Introduction: frequency dictionaries

  4. Genomic sequence as a text in an unknown language: ..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
     Frequency dictionaries:
     N = 4 = 4^1 (single letters): t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
     N = 16 = 4^2 (doublets): ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
     N = 64 = 4^3 (triplets): tag ggr cgc acg tgg tga gct gat gct agg
     N = 256 = 4^4 (quadruplets): tagg grcg cacg tggt gagc tgat gcta gggr
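The frequency dictionaries on this slide can be sketched in a few lines of Python. This is only an illustration, not the author's code: the function name `frequency_dictionary` is hypothetical, and skipping words that contain letters outside a, c, g, t (such as the `r` in the slide's example text) is an assumption made here.

```python
from collections import Counter
from itertools import product

ALPHABET = "acgt"

def frequency_dictionary(text, k):
    """Frequencies of non-overlapping words of length k (hypothetical helper)."""
    # Cut the text into consecutive, non-overlapping words of length k.
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    # Assumption: words with letters outside the 4-letter alphabet are skipped.
    words = [w for w in words if set(w) <= set(ALPHABET)]
    counts = Counter(words)
    total = sum(counts.values())
    # One entry per possible word, so the dictionary always has N = 4**k keys.
    keys = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return {key: (counts[key] / total if total else 0.0) for key in keys}

text = "tagggrcgcacgtggtgagctgatgctagggrcgacgtgg"
dictionaries = {k: frequency_dictionary(text, k) for k in (1, 2, 3, 4)}
for k, d in dictionaries.items():
    print(k, len(d))
```

Each dictionary is a vector of N = 4^k non-negative numbers that sum to one, which is what makes the geometric treatment on the following slides possible.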

  5. From text to geometry. Sequence (10^7): cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
     Fragments (length ~300-400, 3000-4000 fragments): cgtggtgagctgatgctagggrcgcac, ggtgagctgatgctagggrcgcacact, tgagctgatgctagggrcgcacaattc, gtgagctgatgctagggrcgcacggtg, ……, gagctgatgctagggrcgcacaagtga
     Each fragment becomes a point in R^N.
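A minimal sketch of the text-to-geometry step, assuming a sliding window and triplet frequencies (N = 64). The window width of 300 follows the fragment length quoted on the slide; the step of 30, the random test sequence, and the names `triplet_vector` and `fragments_to_points` are assumptions introduced for illustration.

```python
import numpy as np
from itertools import product

TRIPLETS = ["".join(p) for p in product("acgt", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets in one fragment (64 numbers)."""
    v = np.zeros(64)
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # triplets with other letters are skipped
            v[j] += 1
    s = v.sum()
    return v / s if s else v

def fragments_to_points(sequence, width=300, step=30):
    """Slide a window along the sequence; each window gives a point in R^64."""
    starts = range(0, len(sequence) - width + 1, step)
    return np.array([triplet_vector(sequence[s:s + width]) for s in starts])

# Random sequence used purely as a stand-in for a real genome.
rng = np.random.default_rng(0)
sequence = "".join(rng.choice(list("acgt"), size=3000))
points = fragments_to_points(sequence)
print(points.shape)
```

With a real genome the rows of `points` form the cloud of 3000-4000 points that the later slides visualize; a random sequence like the one above carries no cluster structure.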

  6. Method of visualization: principal components analysis (PCA). R^N → R^2 (PCA plot)
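The projection from R^N to R^2 can be sketched with a plain singular value decomposition. This is a generic PCA, not necessarily the exact procedure used for the plots in the talk; the function name `pca_2d` and the random input data are assumptions.

```python
import numpy as np

def pca_2d(points):
    """Project rows of `points` onto their first two principal components."""
    centered = points - points.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T, singular_values

# Stand-in data: 200 random frequency vectors in R^64.
rng = np.random.default_rng(0)
points = rng.random((200, 64))
points /= points.sum(axis=1, keepdims=True)

xy, singular_values = pca_2d(points)
print(xy.shape)
```

The two columns of `xy` are the coordinates one would scatter-plot to obtain a PCA plot of the fragment cloud.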

  7. Chapter 1 Basic 7-cluster structure (level 1 of non-randomness)

  8. Caulobacter crescentus: singles (N=4), doublets (N=16), triplets (N=64), quadruplets (N=256). !!! The information in a genomic sequence is encoded by non-overlapping triplets

  9. First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc

  10. Basic 7-cluster structure
      gtgagctgatgctagggrcgcacgtggtgagc
        gct gat gct agg grc gca cgt
        ctg atg cta ggg rcg cac gtg
        tga tgc tag ggr cgc acg tgg
      gtgaatcggtgggtgaqtgtgctgctatgagc
        atc ggt ggg tga gtg tgc tgc
        tcg gtg ggt gag tgt gct gct
        cgg tgg gtg agt gtg ctg ctg
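The three shifted triplet readings shown on this slide can be reproduced with a short helper. The name `split_triplets` is hypothetical; the input is the stretch of the slide's first sequence that the three rows are read from.

```python
def split_triplets(sequence, phase):
    """Cut a sequence into non-overlapping triplets starting at `phase` (0, 1 or 2)."""
    return [sequence[i:i + 3] for i in range(phase, len(sequence) - 2, 3)]

sequence = "gctgatgctagggrcgcacgtgg"
phases = [split_triplets(sequence, p) for p in (0, 1, 2)]
for row in phases:
    print(" ".join(row))
```

The same text yields three different triplet sequences depending on where reading starts, so a window that falls on a coding region can land in any of three phase-dependent triplet distributions.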

  11. Non-coding parts Point mutations: insertions, deletions a gtgagctgatgctagggr cgcacgaat

  12. Mean-field approximation for triplet frequencies. F_IJK: frequency of triplet IJK (I, J, K ∈ {A, C, G, T}): F_AAA, F_AAT, F_AAC, …, F_GGC, F_GGG: 64 numbers. Letter frequency + correlations: 12 numbers
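A hedged sketch of what a mean-field approximation of triplet frequencies looks like, assuming the usual product form: the 64 numbers F_IJK are replaced by the product of three position-specific letter frequencies, 12 numbers in total. The product form, the function name `mean_field`, and the random input are assumptions made for illustration.

```python
import numpy as np

def mean_field(F):
    """Position-specific letter frequencies of a 4x4x4 triplet table and their product."""
    P1 = F.sum(axis=(1, 2))  # frequency of each letter at position 1
    P2 = F.sum(axis=(0, 2))  # frequency of each letter at position 2
    P3 = F.sum(axis=(0, 1))  # frequency of each letter at position 3
    # Assumed mean-field table: M[i, j, k] = P1[i] * P2[j] * P3[k]
    M = np.einsum("i,j,k->ijk", P1, P2, P3)
    return P1, P2, P3, M

# Stand-in triplet frequency table (random, normalized to sum to one).
rng = np.random.default_rng(1)
F = rng.random((4, 4, 4))
F /= F.sum()

P1, P2, P3, M = mean_field(F)
print(np.abs(F - M).max())  # size of the part not captured by the product
```

The difference `F - M` is the correlation part: whatever the 12 position-specific frequencies cannot explain.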

  13. Why hexagonal symmetry? GC-content = P_C + P_G. Labels on the plot: -+0, 0+-, +0-, -0+, +-0, 0-+

  14. Chapter 2 Global structure of codon frequencies (143 complete bacterial genomes)

  15. Genome codon usage and mean-field approximation. Correct frameshift: … ggtga ATG gat gct agg … gtc gca cgc TAA tgagct. Codon usage: 64 frequencies F_IJK. Mean-field approximation: 12 frequencies P_I^1, P_J^2, P_K^3

  16. Global structure of codon frequencies (archaea, eubacteria)

  17. P_I^j are linear functions of GC-content

  18. Four symmetry types of the basic 7-cluster structure (eubacteria): perpendicular triangles, parallel triangles, degenerated, flower-like

  19. Chapter 3 Internal structure of codon frequencies (level 2 of non-randomness)

  20. Second level of hierarchy ?

  21. Distribution of genes in R^64 (plot labels: function 1, function 2, function 3)

  22. Fast-growing bacteria. Genes of class I (most genes), class II (highly expressed), class III (unusual), class IV (hydrophobic proteins). Plot labels: I, II, III, IV

  23. Escherichia coli. Genes of class I (most genes), class II (highly expressed), class III (unusual), class IV (hydrophobic proteins)

  24. Chapter 4 Applications

  25. Computational gene prediction. Accuracy >90%

  26. Protein expression optimization. Gene sequence S, protein A → gene sequence S’, same protein A, higher expression. Plot labels: I, II, III, IV

  27. Web site: cluster structures in genomic sequences, http://www.ihes.fr/~zinovyev/7clusters

  28. Papers
      Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.
      Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology, V.3, 0039.
      Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

  29. People. Dr. Tanya Popova, Institute of Computational Modeling, Russia. Professor Alexander Gorban, University of Leicester, UK
