Hierarchical Cluster Structures and Symmetries in Genomic Sequences
Andrei Zinovyev
Institut des Hautes Études Scientifiques
Math@Bio group of M. Gromov

Plan of the talk
Genomic sequences: geometric approach, clustering
• Genomic sequence as text
• Basic 7-cluster structure
• Global structure of codon frequencies
• Internal structure of codon frequencies
• Applications
Introduction Frequency dictionaries
Genomic sequence as a text in an unknown language
..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…

tagggrcgcacgtggtgagctgatgctagggrcgacgtgg
gggrcgccacgttggtgagctgatgctagggrcgacgtgg
agggrcgcacgtggtgagctgatgctagggrcgacgtggc
tagggrcgcacgtggtgagctgatgctaggg

Frequency dictionaries:
• N = 4 = 4^1: t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
• N = 16 = 4^2: ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
• N = 64 = 4^3: tag ggr cgc acg tgg tga gct gat gct agg
• N = 256 = 4^4: tagg grcg cacg tggt gagc tgat gcta gggr
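The splitting shown on this slide can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `frequency_dictionary` and its `step` parameter are assumptions introduced here. With `step` equal to the word length the text is read as non-overlapping words, as on the slide. Note that the sample text contains the ambiguity letter `r` (a or g), so 4^k is the dictionary size only for a pure four-letter text.

```python
from collections import Counter

def frequency_dictionary(text, k, step=None):
    """Relative frequencies of k-letter words read from `text`.

    step=k (the default) reads non-overlapping words;
    step=1 would read overlapping ones.
    """
    step = k if step is None else step
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, step)]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Sample text from the slide (note the ambiguity letter 'r').
text = "tagggrcgcacgtggtgagctgatgctaggg"
for k in (1, 2, 3, 4):
    d = frequency_dictionary(text, k)
    # k, upper bound 4^k on dictionary size, number of distinct words seen
    print(k, 4 ** k, len(d))
```

For k = 3 the words are exactly the ten triplets listed on the slide, and `gct` occurs twice among them.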
From text to geometry
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc (10^7)

Fragments of length ~300-400:
cgtggtgagctgatgctagggrcgcac
ggtgagctgatgctagggrcgcacact
tgagctgatgctagggrcgcacaattc
gtgagctgatgctagggrcgcacggtg
……
gagctgatgctagggrcgcacaagtga

3000-4000 fragments → R^N
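A rough sketch of this step, for the triplet dictionary (N = 64): a window slides along the sequence and each fragment becomes a vector of triplet frequencies. The window width of 300 comes from the slide; the step of 100, the function names, and the choice to skip triplets containing ambiguous letters are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "acgt"
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]  # 64 words
INDEX = {word: i for i, word in enumerate(TRIPLETS)}

def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets in one fragment (a point in R^64)."""
    vec = [0.0] * 64
    n = 0
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # assumption: skip words with ambiguous letters such as 'r'
            vec[j] += 1.0
            n += 1
    return [v / n for v in vec] if n else vec

def fragments_to_points(sequence, width=300, step=100):
    """Slide a window along the sequence; each window becomes one point."""
    return [triplet_vector(sequence[s:s + width])
            for s in range(0, len(sequence) - width + 1, step)]

# Toy periodic text: the window start decides which triplet is seen.
points = fragments_to_points("acg" * 200)
print(len(points), len(points[0]))
```

On the toy text, the window starting at position 0 sees only `acg`, while the window starting at position 100 sees only `cga`: the same text gives different points depending on the phase of the window.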
Method of visualization: principal components analysis
R^N → R^2 (PCA plot)
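The projection from R^N to R^2 can be sketched without external libraries. The version below finds the first two principal components by power iteration with deflation; this is one standard way to compute PCA and is chosen here only to keep the example self-contained, not because the slides specify it. The function name `pca_2d` and its parameters are hypothetical.

```python
import random

def pca_2d(points, iterations=200, seed=0):
    """Project points in R^N onto their first two principal components."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        norm = dot(v, v) ** 0.5
        return [a / norm for a in v] if norm > 0 else v

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = normalize([rng.uniform(-1, 1) for _ in range(dim)])
        for _ in range(iterations):
            # Apply the (unnormalized) covariance matrix implicitly: w = X^T (X v)
            scores = [dot(row, v) for row in centered]
            w = [sum(s * row[j] for s, row in zip(scores, centered))
                 for j in range(dim)]
            # Deflation: remove the directions already found
            for c in components:
                proj = dot(w, c)
                w = [a - proj * b for a, b in zip(w, c)]
            v = normalize(w)
        components.append(v)
    # Coordinates of every point in the plane of the two components
    return [[dot(row, c) for c in components] for row in centered]

plane = pca_2d([[2, 0, 0], [-2, 0, 0], [0, 1, 0], [0, -1, 0]])
print(plane)
```

The sign of each component is arbitrary, so a plot produced this way may be mirrored relative to one produced by another PCA routine.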
Chapter 1 Basic 7-cluster structure (level 1 of non-randomness)
Caulobacter crescentus
[Figure panels: singles N=4, doublets N=16, triplets N=64, quadruplets N=256]
!!! The information in a genomic sequence is encoded by non-overlapping triplets
First explanation
cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
Basic 7-cluster structure

gtgagctgatgctagggrcgcacgtggtgagc
gct gat gct agg grc gca cgt
ctg atg cta ggg rcg cac gtg
tga tgc tag ggr cgc acg tgg

gtgaatcggtgggtgagtgtgctgctatgagc
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
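The three rows under each sequence are the same text read with three different phase shifts. A small sketch of that reading is given below; the function names are introduced here for illustration. Following the papers listed at the end of the talk, the seven clusters are taken to be three phases on one strand, three on the complementary strand, and the non-coding parts. The complement table maps the ambiguity letter `r` (a or g) to `y` (c or t).

```python
# Complement table, including the ambiguity letters r (a or g) and y (c or t).
COMPLEMENT = str.maketrans("acgtry", "tgcayr")

def reverse_complement(seq):
    """The complementary strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]

def reading_frames(seq):
    """The three ways to split a text into non-overlapping triplets."""
    return [[seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]
            for shift in range(3)]

def six_readings(seq):
    """Three phases on the given strand plus three on the complementary strand."""
    return reading_frames(seq) + reading_frames(reverse_complement(seq))

seq = "gtgagctgatgctagggrcgcacgtggtgagc"  # first sequence on the slide
for frame in reading_frames(seq):
    print(" ".join(frame))
```

The second printed row contains `gct gat gct agg grc gca cgt`, the third `ctg atg cta ggg rcg cac gtg`, and the first `tga tgc tag ggr cgc acg tgg`, matching the three rows shown on the slide.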
Non-coding parts
Point mutations: insertions, deletions
a gtgagctgatgctagggr cgcacgaat
Mean-field approximation for triplet frequencies
F_IJK: frequency of triplet IJK (I, J, K ∈ {A, C, G, T}):
F_AAA, F_AAT, F_AAC, …, F_GGC, F_GGG: 64 numbers
letter frequency + correlations: 12 numbers
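A minimal sketch of how 12 numbers can stand in for 64, assuming the usual product form of a mean-field approximation, F_IJK ≈ P^1_I · P^2_J · P^3_K, where P^m_X is the frequency of letter X at position m of a triplet (3 positions × 4 letters = 12 numbers). The function names and the toy codon list are illustrative only.

```python
from collections import Counter
from itertools import product

LETTERS = "acgt"

def position_frequencies(codons):
    """Twelve numbers: frequency of each letter at triplet positions 1, 2, 3."""
    n = len(codons)
    freqs = []
    for m in range(3):
        counts = Counter(c[m] for c in codons)
        freqs.append({letter: counts[letter] / n for letter in LETTERS})
    return freqs

def mean_field(codons):
    """Sixty-four numbers rebuilt from the twelve, assuming the product form."""
    p1, p2, p3 = position_frequencies(codons)
    return {i + j + k: p1[i] * p2[j] * p3[k]
            for i, j, k in product(LETTERS, repeat=3)}

codons = ["atg", "atg", "acg", "ctg"]            # toy data, not from the talk
observed = Counter(codons)["atg"] / len(codons)  # direct triplet frequency
approx = mean_field(codons)["atg"]               # product of position frequencies
print(observed, approx)
```

The gap between the observed frequency and the product is what the approximation ignores: correlations between letters inside a triplet.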
Why hexagonal symmetry?
GC-content = P_C + P_G
[Figure labels: -+0, 0+-, +0-, -0+, +-0, 0-+]
Chapter 2 Global structure of codon frequencies (143 complete bacterial genomes)
Genome codon usage and mean-field approximation
correct frameshift
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct: 12 frequencies P^1_I, P^2_J, P^3_K
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct: 64 frequencies F_IJK

Global structure of codon frequencies
[Figure labels: archaea, eubacteria]

Four symmetry types of the basic 7-cluster structure (eubacteria)
• flower-like
• perpendicular triangles
• parallel triangles
• degenerated
Chapter 3 Internal structure of codon frequencies (level 2 of non-randomness)
Distribution of genes
R^64
[Figure labels: function 1, function 2, function 3]
Fast-growing bacteria
• Genes of class I (most genes)
• Genes of class II (highly expressed)
• Genes of class III (unusual)
• Genes of class IV (hydrophobic proteins)

Escherichia coli
• Genes of class I (most genes)
• Genes of class II (highly expressed)
• Genes of class III (unusual)
• Genes of class IV (hydrophobic proteins)
Chapter 4 Applications
Computational gene prediction
Accuracy >90%
Protein expression optimization
gene sequence S, protein A
gene sequence S’, same protein A, higher expression
[Figure labels: I, II, III, IV]
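The slide does not say how S’ is obtained from S. One common approach, shown here purely as an assumption, is to replace each codon with the synonymous codon used most often in a reference set of highly expressed genes, which leaves the protein unchanged. The function names and the toy sequences are hypothetical; the codon table is the standard genetic code.

```python
from collections import Counter

# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def codons(seq):
    """Split a sequence into triplets; trailing letters are dropped."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def translate(seq):
    """Protein encoded by seq (assumes only A, C, G, T, read in frame)."""
    return "".join(CODON_TABLE[c] for c in codons(seq))

def recode(gene, reference_genes):
    """Swap each codon for the synonymous codon most frequent in the reference set.

    Assumption for illustration: reference_genes are highly expressed genes.
    Codons whose amino acid never occurs in the reference are left unchanged.
    """
    usage = Counter(c for g in reference_genes for c in codons(g))
    best = {}
    for codon, count in usage.items():
        aa = CODON_TABLE[codon]
        if aa not in best or count > usage[best[aa]]:
            best[aa] = codon
    return "".join(best.get(CODON_TABLE[c], c) for c in codons(gene))

gene = "ATGTTAAAGTTGTAA"            # toy sequence S
reference = ["CTGCTGCTTAAAAAAAAG"]  # toy reference set
recoded = recode(gene, reference)   # toy sequence S'
print(recoded, translate(gene) == translate(recoded))
```

Whether such a recoding actually raises expression depends on the organism and on which gene class supplies the reference codon usage; the sketch only guarantees that the protein sequence is preserved.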
Web site: cluster structures in genomic sequences
http://www.ihes.fr/~zinovyev/7clusters

Papers
• Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. 2004. arXiv e-print.
• Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology, V.3, 0039.
• Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

People
• Dr. Tanya Popova, Institute of Computational Modeling, Russia
• Professor Alexander Gorban, University of Leicester, UK