Seven clusters and four types of symmetry in microbial genomes. Andrei Zinovyev (Bioinformatics service; Math@Bio group of M. Gromov); Tatyana Popova (R&D Centre in Biberach, Germany); Alexander Gorban (Centre for Mathematical Modelling)
Genomic sequence as a text in an unknown language
..cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc…
Frequency dictionaries:
$N = 4 = 4^1$ (singles): t a g g g r c g c a c g t g g t g a g c t g a t g c t a g g g
$N = 16 = 4^2$ (doublets): ta gg gr cg ca cg tg gt ga gc tg at gc ta gg
$N = 64 = 4^3$ (triplets): tag ggr cgc acg tgg tga gct gat gct agg
$N = 256 = 4^4$ (quadruplets): tagg grcg cacg tggt gagc tgat gcta gggr
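A frequency dictionary of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `frequency_dictionary` is hypothetical, and words are read without overlap starting from the first letter, as in the splits shown on the slide.

```python
from collections import Counter

def frequency_dictionary(text, k):
    """Count non-overlapping words of length k, reading from the first letter."""
    words = [text[i:i + k] for i in range(0, len(text) - k + 1, k)]
    return Counter(words)

# Example string from the slide; letters outside a, c, g, t (here "r")
# are counted as-is.
text = "tagggrcgcacgtggtgagctgatgctaggg"
for k in (1, 2, 3, 4):
    d = frequency_dictionary(text, k)
    # 4 ** k is the largest possible dictionary size N for a 4-letter alphabet.
    print(k, 4 ** k, sum(d.values()), d.most_common(3))
```

A trailing piece shorter than `k` is dropped, so the word counts depend on where reading starts.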
From text to geometry
Sequence ($10^7$ letters): cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
Fragments of length ~200-400:
cgtggtgagctgatgctagggrcgcac
ggtgagctgatgctagggrcgcacact
tgagctgatgctagggrcgcacaattc
gtgagctgatgctagggrcgcacggtg
……
gagctgatgctagggrcgcacaagtga
10000-20000 fragments, each a point in $\mathbb{R}^N$
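The step from text to geometry can be sketched as follows: cut the sequence into fragments and turn each fragment into a vector of word frequencies. This is a hedged sketch for triplets ($N = 64$). The names `fragments` and `triplet_vector`, the toy sequence, and the step of 12 letters between fragments are assumptions for illustration; the slide gives only the fragment length and the number of fragments.

```python
from itertools import product

ALPHABET = "acgt"
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]  # 64 words
INDEX = {word: i for i, word in enumerate(TRIPLETS)}

def fragments(sequence, width, step):
    """Yield fragments of fixed width, taken every `step` letters."""
    for start in range(0, len(sequence) - width + 1, step):
        yield sequence[start:start + width]

def triplet_vector(fragment):
    """Map a fragment to a point in R^64: frequencies of non-overlapping triplets."""
    counts = [0] * 64
    for i in range(0, len(fragment) - 2, 3):
        word = fragment[i:i + 3]
        if word in INDEX:  # skip words containing letters outside a, c, g, t
            counts[INDEX[word]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy sequence (assumption): a short repeated string standing in for a genome.
sequence = "cgtggtgagctgatgctagggacgcacgtggtgagctgatgctaggg" * 20
points = [triplet_vector(f) for f in fragments(sequence, width=300, step=12)]
print(len(points), len(points[0]))
```

Each row of `points` is one fragment; the whole list is the data cloud that the visualization step works on.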
Method of visualization: principal components analysis. Data in $\mathbb{R}^N$ are shown as a PCA plot in $\mathbb{R}^2$
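A minimal sketch of the projection from $\mathbb{R}^N$ to $\mathbb{R}^2$, assuming NumPy is available. The function name `pca_project` and the synthetic data are illustrative assumptions; the slides do not say how the PCA was computed, and here it is done through a singular value decomposition of the centered data.

```python
import numpy as np

def pca_project(points, n_components=2):
    """Project rows of `points` (fragments x N) onto the first principal components."""
    X = np.asarray(points, dtype=float)
    X = X - X.mean(axis=0)  # center the cloud
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

rng = np.random.default_rng(0)
# Toy data (assumption): points in R^64 that vary mostly along two hidden directions.
latent = rng.normal(size=(500, 2)) * [5.0, 2.0]
mixing = rng.normal(size=(2, 64))
cloud = latent @ mixing + 0.01 * rng.normal(size=(500, 64))
plane = pca_project(cloud)
print(plane.shape)
```

With real data, `cloud` would be the matrix of word-frequency vectors, one row per fragment.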
Caulobacter crescentus. PCA plots: singles ($N=4$), doublets ($N=16$), triplets ($N=64$) !!!, quadruplets ($N=256$). The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)
First explanation cgtggtgagctgatgctagggrcgcacgtggtgagctgatgctagggrcgacgtggtgagctgatgctagggrcgc
Basic 7-cluster structure
gtgagctgatgctagggrcgcacgtggtgagc
read in three frame shifts:
gct gat gct agg grc gca cgt
ctg atg cta ggg rcg cac gtg
tga tgc tag ggr cgc acg tgg
gtgaatcggtgggtgagtgtgctgctatgagc
read in three frame shifts:
atc ggt ggg tga gtg tgc tgc
tcg gtg ggt gag tgt gct gct
cgg tgg gtg agt gtg ctg ctg
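The three triplet readings of one fragment differ only in where reading starts. A small sketch (the function name `reading_frames` is hypothetical) reproduces the first set of splits shown on this slide:

```python
def reading_frames(sequence):
    """Split a sequence into non-overlapping triplets for each of the three shifts."""
    frames = []
    for shift in range(3):
        s = sequence[shift:]
        frames.append([s[i:i + 3] for i in range(0, len(s) - 2, 3)])
    return frames

# Substring of the example sequence on the slide.
for frame in reading_frames("gctgatgctagggrcgcacgtgg"):
    print(" ".join(frame))
# gct gat gct agg grc gca cgt
# ctg atg cta ggg rcg cac gtg
# tga tgc tag ggr cgc acg tgg
```

Because the three shifts give different triplet lists, the same stretch of text generally gives three different frequency vectors.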
Non-coding parts. Point mutations (insertions, deletions): a gtgagctgatgctagggr cgcacgaat
Seven classes vs seven clusters (GenScan, TIGR, Georgia Institute of Technology, Stanford)
Computational gene prediction: accuracy >90%
Mean-field approximation for triplet frequencies. $F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers. Position-specific letter frequencies (12 numbers) + correlations
Why hexagonal symmetry? GC-content $= P_C + P_G$. Six sign patterns: -+0, 0+-, +0-, -0+, +-0, 0-+
Genome codon usage and mean-field approximation (correct frameshift)
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct : 64 frequencies $F_{IJK}$
… ggtgaATG gat gct agg … gtc gca cgc TAAtgagct : 12 frequencies $P^1_I$, $P^2_J$, $P^3_K$
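The relation between the 64 codon frequencies and the 12 position-specific frequencies can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the toy gene is assembled from the codons visible on the slide, and the position-specific frequencies are taken to be the marginals of the codon frequencies.

```python
from collections import Counter
from itertools import product

def codon_frequencies(coding_sequence):
    """Frequencies F[IJK] of non-overlapping triplets read in the correct frame."""
    codons = [coding_sequence[i:i + 3] for i in range(0, len(coding_sequence) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def position_frequencies(F):
    """12 frequencies P[position][letter]: marginals of F over the other two positions."""
    P = [Counter(), Counter(), Counter()]
    for codon, f in F.items():
        for position, letter in enumerate(codon):
            P[position][letter] += f
    return P

def mean_field(P):
    """Mean-field triplet frequencies: product of position-specific frequencies."""
    return {i + j + k: P[0][i] * P[1][j] * P[2][k]
            for i, j, k in product("acgt", repeat=3)}

# Toy gene (assumption): the codons shown on the slide, joined together.
gene = "atggatgctagggtcgcacgctaa"
F = codon_frequencies(gene)
P = position_frequencies(F)
M = mean_field(P)
# What the product of marginals does not capture is the "correlations" term.
correlations = {codon: F.get(codon, 0.0) - M[codon] for codon in M}
print(max(abs(c) for c in correlations.values()))
```

On a real genome, `gene` would be replaced by the concatenated coding sequences, so that `F` is the genome codon usage.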
$P_I^j$ (frequency of letter $I$ at codon position $j$) are linear functions of GC-content. Panels: eubacteria, archaea
THE MYSTERY OF TWO STRAIGHT LINES ??? $\mathbb{R}^{64}$, $\mathbb{R}^{12}$: $F_{IJK} = P^1_I P^2_J P^3_K$ + correlations
Four symmetry types of the basic 7-cluster structure (eubacteria): perpendicular triangles, parallel triangles, degenerated, flower-like
B. halodurans (GC=44%), F. nucleatum (GC=27%), E. coli (GC=51%), S. coelicolor (GC=72%)
Web-site: cluster structures in genomic sequences, http://www.ihes.fr/~zinovyev/7clusters
Human genome (chr19): triplets, doublets, singles; non-repetitive sequences vs repetitive sequences
Letter frequencies (3 dimensions): Purine-Pyrimidine (33%): a g vs t c; Amino-Keto (17%): a c vs g t; GC-content (50%): c g vs a t
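The four letter frequencies sum to one, which leaves three independent dimensions. A hedged sketch of the three contrasts named on this slide follows; the function names and the choice to express two of them as differences between the paired groups are assumptions made for illustration.

```python
def letter_frequencies(sequence):
    """Frequencies of a, c, g, t; any other letters are ignored."""
    s = sequence.lower()
    counts = {x: s.count(x) for x in "acgt"}
    total = sum(counts.values())
    return {x: n / total for x, n in counts.items()}

def contrasts(p):
    """Three contrasts between the letter groups named on the slide."""
    return {
        "gc_content": p["g"] + p["c"],
        "purine_minus_pyrimidine": (p["a"] + p["g"]) - (p["c"] + p["t"]),
        "amino_minus_keto": (p["a"] + p["c"]) - (p["g"] + p["t"]),
    }

# Fragment taken from the example sequences in the slides.
p = letter_frequencies("cgtggtgagctgatgctaggg")
print(contrasts(p))
```

Computed per fragment, these three numbers give each fragment a position in a three-dimensional space of letter frequencies.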
Good non-linear 2D representation (elastic principal manifolds). Plot labels: T, A, G, C; scale from 0% to 100%
Measuring densities (plots labeled G, A, C, T)
Contrasting density distribution (two ideas) • Noise is Gaussian • Noise is smooth
Contrasted density (plots labeled G, A, C, T)
Excluding repeats (plots labeled G, A, C, T)
Papers (type Zinovyev in Google)
• Gorban A, Zinovyev A. PCA deciphers genome. 2005. arXiv preprint.
• Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.
• Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.
• Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.
• Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).
People: Dr. Tanya Popova (Institute of Computational Modeling, Russia); Professor Alexander Gorban (University of Leicester, UK)