A Temporal Profile for Animal Transmembrane Gene Duplication (Insights into the Coupling of Duplication and Macroevolution) Presented by Guohui Ding R&D, SIBS, CAS
Background • Gene Duplication[1,2] • The predominant mechanism by which genes with new functions and associated phenotypic novelties arise • Several models try to explain the process of gene duplication • Positive selection may play a key role in neo/subfunctionalization (?). This gives a chance to study the interplay of physical and biotic factors. • Macroevolution[3] • The dynamics of evolution above the species level • Biogeographic/geochemical/palaeontological/ecological data (e.g. fossil data, ocean chemistry data) • TM proteins[4] • At least one transmembrane helix • Functions such as active transport, ion flow, energy transduction, and signal transduction • Information exchange between the cell and the environment
Gene Duplication • Accumulation of mutations • Environment/genetic background selection Genetics. 1999 Apr;151(4):1531-45.
Macroevolution: Evidence/Data • Science. 2002 Aug 16;297(5584):1137-42. Review. • Nature. 2005 Mar 10;434(7030):208-10. • Nature. 2000 Mar 9;404(6774):177-80. • Science. 2000 Dec 1;290(5497):1758-61.
TM Proteins • …… • Ancient: they originated very early in the history of life • Not a good choice for studying evolutionary theory, but perhaps a suitable model for illustrating the interaction between environment and life Nature. 2004 Oct 21;431(7011):913.
The Question and The Logic • What does the temporal profile of animal TM gene duplications look like? Is it a uniform distribution? If not, what scenarios can explain the distribution? (Null hypothesis: neutral theory) • Do the large-scale cycles and patterns found in the Phanerozoic fossil record leave some imprint on the temporal profile of TM gene duplication, as they might if macroevolution is adjusted through microevolution, i.e. through genes? How important is gene duplication to speciation?[1,5] (Just a little extrapolation.) Do duplication events synchronize with speciation or with origination/extinction? When they are asynchronous, what does that tell us? • Can this logic/method be applied to understanding macroevolution? In general, sequence data are far more readily attainable than fossil data, so this offers a second approach. The logic: duplicates are selected by the environment → more duplicates at a given time imply more “diversity” in the environment at that time
Methods • TM protein prediction • Family construction • Estimation of molecular time scale • Duplication events detection • Data processing
TM protein prediction • Data • NCBI Reference Sequence (RefSeq) Database (Release 7, September 12, 2004) • 13 eukaryotic genomes (61 bacterial genomes, 11 archaebacterial genomes) • Transmembrane topology model prediction • ConPred II • Identification of TM proteins • At least one transmembrane helix Nucleic Acids Res. 2004 Jul 1;32:W390-3.
Family construction • Detection and masking of widespread, typically repetitive domains. • Filtering by SEG and all-to-all comparison of protein sequences using the gapped BLAST program with default settings. • E value threshold determined from the overall distribution of E values over the entire protein space. • Detection of transitive best hits. • Single-linkage clustering on the best hits to obtain the symmetrical best hits. • Removal of fragment sequences. • Single-linkage clustering again. • Detection of clusters that have no cut-edges (bridges). • Detection of clusters with at least one triangle of mutually consistent, genome-specific best hits (BeTs). • Iterative multiple alignment. • Detection of triangles of mutually consistent, genome-specific best hits. • Case-by-case analysis of each candidate family.
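The two clustering steps in the list above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the `best_hit` dictionary, the protein names, and the helper names `symmetrical_best_hits` and `single_linkage` are all hypothetical, and the real input would come from the all-to-all BLAST comparison.

```python
from collections import defaultdict

def symmetrical_best_hits(best_hit):
    """Return the pairs (a, b) where a's best hit is b and b's best hit is a."""
    pairs = set()
    for a, b in best_hit.items():
        if a != b and best_hit.get(b) == a:
            pairs.add(tuple(sorted((a, b))))
    return pairs

def single_linkage(items, edges):
    """Single-linkage clustering: any edge merges two clusters (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for x in items:
        clusters[find(x)].add(x)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical best-hit table: p3's best hit is p2, but not the reverse.
best_hit = {"p1": "p2", "p2": "p1", "p3": "p2", "p4": "p5", "p5": "p4"}
pairs = symmetrical_best_hits(best_hit)
families = single_linkage(list(best_hit), pairs)
print(families)
```

Clustering only on symmetrical pairs keeps one-directional hits such as `p3` out of a family until later steps decide where they belong.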
About E value • Based on the overall distribution of expectation values over the entire protein space. • The distribution shown may be thought of as the average distribution of E values for a ‘typical’ protein sequence used as a query. • The steep slope at high E values indicates a rapid growth in the number of sequences that are unrelated to the query sequence. • Every sequence has its own distribution; only the threshold derived from the averaged distribution is reliable. • In my work, the deviation from a straight line starts around 1e-5. Proteins. 1999 Nov 15;37(3):360-78.
About transitive/symmetrical best hits • A threshold of 1e-5 for the E value of HSPs. • HSPs that are not compatible with a global alignment are removed. • The remaining HSPs cover at least 80% of the proteins’ length. • Their similarity is greater than or equal to 50%. • Both sequences are complete. Genome Res. 2000 Mar;10(3):379-85.
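The criteria on this slide amount to a simple pairwise filter. The sketch below assumes that the HSPs incompatible with a global alignment have already been removed and that coverage must hold for both proteins; the `PairSummary` record and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PairSummary:
    """Summary of the remaining HSPs between two proteins (hypothetical record)."""
    evalue: float
    coverage_a: float   # fraction of protein A covered by the remaining HSPs
    coverage_b: float   # fraction of protein B covered by the remaining HSPs
    similarity: float   # fraction of similar residues over the aligned region
    complete_a: bool    # False if sequence A is a fragment
    complete_b: bool

def passes_filter(pair, max_evalue=1e-5, min_coverage=0.80, min_similarity=0.50):
    """Apply the thresholds listed on the slide to one protein pair."""
    return (
        pair.evalue <= max_evalue
        and min(pair.coverage_a, pair.coverage_b) >= min_coverage
        and pair.similarity >= min_similarity
        and pair.complete_a
        and pair.complete_b
    )

good = PairSummary(1e-30, 0.95, 0.88, 0.62, True, True)
short = PairSummary(1e-30, 0.95, 0.40, 0.62, True, True)
print(passes_filter(good), passes_filter(short))
```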
About clusters that have no cut-edges • Detects densely connected regions in large protein-protein similarity networks. • Used to split large families. (Figure: cut edges)
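A cut-edge (bridge) is an edge whose removal disconnects the network. One standard way to find them is a depth-first search with low-link values; the sketch below uses that approach on a hypothetical similarity graph and is not necessarily the method used in the talk.

```python
def find_bridges(adjacency):
    """Return the cut-edges of an undirected simple graph.

    adjacency maps each node to a list of its neighbours. The search is
    recursive, so very large networks would need an iterative version.
    """
    order = {}      # discovery index of each node
    low = {}        # lowest discovery index reachable from the node's subtree
    bridges = set()
    counter = [0]

    def visit(node, parent):
        order[node] = low[node] = counter[0]
        counter[0] += 1
        for nxt in adjacency[node]:
            if nxt == parent:
                continue
            if nxt in order:
                low[node] = min(low[node], order[nxt])
            else:
                visit(nxt, node)
                low[node] = min(low[node], low[nxt])
                if low[nxt] > order[node]:
                    bridges.add(tuple(sorted((node, nxt))))

    for node in adjacency:
        if node not in order:
            visit(node, None)
    return bridges

# Two densely connected triangles joined by the single edge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
print(find_bridges(graph))
```

Removing the reported edge splits the example into its two triangles, which is the sense in which bridge detection can split a large family.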
About triangles of mutually consistent, genome-specific best hits (BeTs) • Triangle • Mutually consistent • Genome-specific Science. 1997 Oct 24;278(5338):631-7. Review
Iterative multiple alignment • Multiple sequence alignment with CLUSTAL W (1.83) using default settings • Bootstrapping with 500 replicates • If a tree branch’s bootstrap value is less than 50%, break the branch to get two subfamilies. • Repeat the multiple sequence alignment on each subfamily’s members until no branch in the family has a bootstrap value below 50% (11 iterations were needed).
Estimation of molecular time scale • Inference of phylogenetic tree • Calibration time • Maximum likelihood estimation of protein divergence times
Inference of phylogenetic tree • Neighbor-Joining method with Poisson distance • Prokaryotic or other non-animal sequences are used as the outgroup to find the root. In the absence of an outgroup sequence, the root is placed at the midpoint of the longest path connecting two proteins (midpoint rooting). • Software: LINTREE by N. Takezaki Mol Biol Evol. 1987 Jul;4(4):406-25.
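For reference, the Poisson-corrected distance between two aligned protein sequences is d = -ln(1 - p), where p is the proportion of differing sites. The sketch below is a minimal illustration; skipping gapped columns and the function name `poisson_distance` are assumptions made here, not details taken from LINTREE.

```python
import math

def poisson_distance(seq_a, seq_b, gap="-"):
    """Poisson-corrected distance d = -ln(1 - p) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue  # assumption: columns with a gap are ignored
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differing / compared
    if p >= 1.0:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)

print(poisson_distance("MKTA", "MKSA"))  # one of four sites differs, p = 0.25
```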
Calibration time (figure: 1399_1.trees) • Several calibration points • Mouse-rat: 41 Mya • Primate-rodent: 91 Mya • Mammal-bird: 310 Mya • Vertebrate-Drosophila: 993 Mya • Vertebrate-nematode: 1177 Mya • Animal-plant-fungi: 1576 Mya • The calibration points are mapped onto the phylogenetic tree manually • Each orthologous group is marked with an evolutionary rate group Nat Rev Genet. 2002 Nov;3(11):838-49. Review. Trends Genet. 2003 Apr;19(4):200-6. To: ppt21
Maximum likelihood estimation of protein divergence times • An empirical model is specified: mtREV24.dat. • The gamma shape parameter is estimated by the software itself. • Both the global clock and the local clock are used (as a robustness test). • Software: PAML 3.14 by Ziheng Yang Syst. Biol. 52(5):705-716, 2003
Global clock vs. Local clock (Figure panels: global clock; local clock)
Global clock vs. Local clock • The Pearson correlation coefficient is 0.7439441 (p < 2.2e-16). (Figure legend: the line y = x; the regression line for local clock vs. global clock.)
Duplication events detection • Outparalogs[6]: paralogs in the given lineage that evolved by gene duplications that happened before the radiation (speciation) event. • Each orthologous group associated with a duplication event must have at least two paralogs from different species. • Gene families that were sharply in conflict with the uncontested animal phylogeny were excluded. • We identified 1651 duplication events in the final data set of 786 gene families. Every duplication event was annotated with the time point at which it happened. • Because 31 duplication events were dated older than 4.5 Gya, we kept the time points of only the remaining 1620 duplication events. See: ppt 17
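One common way to flag duplication nodes in a rooted gene tree is the species-overlap rule: a node is a duplication if its two child clades share at least one species. The sketch below uses that rule as a stand-in, since the slide does not spell out the detection procedure. Requiring at least two species in each child clade is one reading of the "paralogs from different species" criterion, and the tree encoding (`"species|gene"` strings, nested 2-tuples) is hypothetical.

```python
def species_of(leaf):
    """Leaves are labelled 'species|gene' in this hypothetical encoding."""
    return leaf.split("|")[0]

def find_duplications(tree, min_species=2):
    """Return the child species sets of every node flagged as a duplication.

    A node is flagged when its two child clades share a species (species
    overlap) and each child clade contains at least min_species species.
    """
    found = []

    def walk(node):
        if isinstance(node, str):
            return {species_of(node)}
        left, right = node
        left_species = walk(left)
        right_species = walk(right)
        if (left_species & right_species
                and len(left_species) >= min_species
                and len(right_species) >= min_species):
            found.append((sorted(left_species), sorted(right_species)))
        return left_species | right_species

    walk(tree)
    return found

# The human/mouse duplication predates the human-mouse split (outparalogs);
# the two fly genes are a lineage-specific duplication and are not counted.
tree = (
    ("fly|a1", "fly|a2"),
    (("human|A1", "mouse|A1"), ("human|A2", "mouse|A2")),
)
print(find_duplications(tree))
```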
Distribution of the Taxonomy • 100% mouse • 97% rat • 92% human • 60% chicken • 27% fly • 23% worm • 10% cress • 6% fission yeast • 6% baker’s yeast
Data processing • Overall distribution • Duplication and the extinction/origination • Periodogram analysis (FFT)
Overall distribution • Kernel density estimates with a Gaussian kernel. (Figure legend: control)
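A Gaussian kernel density estimate places a normal curve on every data point and averages them. The sketch below shows the calculation in plain Python; the bandwidth and the sample time points are hypothetical, since the slide does not state the bandwidth that was used.

```python
import math

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each point of grid."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]

# Hypothetical duplication times in Gya and an evaluation grid from -1 to 3.
times = [0.5, 0.6, 1.8]
step = 0.01
grid = [-1.0 + i * step for i in range(401)]
density = gaussian_kde(times, grid, bandwidth=0.1)
area = sum(density) * step  # Riemann sum; should be close to 1
print(round(area, 3))
```

The choice of bandwidth controls how many peaks appear in the profile, so any comparison of peaks with extinction events depends on it.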
Result … • About the control • Randomly sample 1620 time points, without replacement, from all the nodes marked with a time point to generate a distribution. • Repeat 10,000 times to get 10,000 randomly generated profiles. • An average distribution is obtained from the generated distributions by taking the mean of every bin. (The red line in the graph is this average random distribution.) • The distance/correlation between each randomly generated or observed distribution and the average distribution is calculated. From the distribution of these distances/correlations, p << 0.00001. • By ~2.75 Gya, the observed distribution deviates from the control. We use only the data after 2.75 Gya in what follows. • Strikingly, the overall distribution of duplications after 2.75 Gya is not uniform. (D = 0.5318, p < 2.2e-16, Kolmogorov-Smirnov test) • The distribution of the data conforms to a random walk. • A random walk is a model of the form x_t = x_{t-1} + ε_t • The sequence of ε is obtained and a KS test for uniformity is applied to it. As D = 0.9982, p = 0.2730, we can’t reject the null hypothesis. (Note: the statistics here are wrong. The test actually run at the time was ks.test(x, max(x), min(x)). The proper test should be ks.test(x, ‘punif’, max(x), min(x)), but that test does not pass. Alternatively, use Box.test() to test for white noise.)
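The control described above can be sketched as a resampling loop: draw time points without replacement, bin them, and average the bins over all repeats. The node times, bin edges, and sample sizes below are small hypothetical stand-ins for the 1620 points and 10,000 repeats in the talk.

```python
import random

def histogram(values, edges):
    """Count values per bin; the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1] or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

def control_profile(all_node_times, sample_size, edges, repeats, seed=0):
    """Average binned profile over random samples drawn without replacement."""
    rng = random.Random(seed)
    totals = [0.0] * (len(edges) - 1)
    for _ in range(repeats):
        sample = rng.sample(all_node_times, sample_size)  # without replacement
        for i, count in enumerate(histogram(sample, edges)):
            totals[i] += count
    return [t / repeats for t in totals]

# Hypothetical dated nodes (Gya) and 1-Gyr bins.
node_times = [0.1 * i for i in range(50)]
edges = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
profile = control_profile(node_times, sample_size=20, edges=edges, repeats=200)
print(profile)
```

An observed profile can then be compared with this average, for example through the distance or correlation mentioned on the slide.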
Discussion … • ~2.75 Gya is a very important time point in the rise of atmospheric oxygen. Two scenarios surround this question[7]. Our data show that something changed at ~2.75 Gya, consistent with the evolution of oxygenic photosynthesis by 2.7 Gya, which is supported by organic biomarker and stable carbon isotope evidence. In this scenario, TM gene duplication increased whenever the oxygen content of the air changed (e.g. flowering plants (~0.146 Gya), plastids (~1.58 Gya), mitochondria (~1.8 Gya), etc.). • Two Great Oxidation Events[8]: 2.0 ~ 2.4 Gya; 0.55 ~ 0.8 Gya • Snowball Earth[9]: 0.58 ~ 0.75 Gya • The emergence of the plastid/mitochondrion may have played an important role in TM protein evolution, since organelles add more membrane structure. The rise of complex multicellular life (1 ~ 1.5 Gya) is also a cause[10]. • The rate of TM protein duplication is non-uniform. This is consistent with the finding that both large- and small-scale duplications occur in evolution. • The random-walk model of the distribution suggests that either these variables were correlated with environmental variables that follow a random walk, or so many mechanisms were affecting these variables, in different ways, that the resultant trends appear random.[11]
Result … • About the extinctions • Early Cambrian (512 Mya) • End-Ordovician (439 Mya) • Frasnian-Famennian (376 Mya) • End-Permian (251 Mya) • End-Triassic (206 Mya) • Cretaceous-Tertiary (65 Mya) • Almost every major mass extinction corresponds to a duplication peak, but two peaks have no corresponding extinction record. • Based on the fossil data for marine animals, origination/extinction rates were computed by linear interpolation at the appropriate times. The correlations between origination/extinction rates and duplication number were then calculated. • Extinction rates display a positive correlation with the duplication profile, but it is not significant. (r = 0.0259369, p = 0.5483) (r = 0.07933089, p = 0.4144) • Origination rates show a significant negative correlation with the duplication profile. (r = -0.1546602, p = 0.0003174) (r = -0.1230396, p = 0.2046) • For diversity (r = -0.3018349, p = 0.0015) • Kernel density estimates. (Genetics 147:1965-1975)
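The interpolation and correlation step can be sketched as below: fossil rates known at a few stage ages are linearly interpolated at the times where duplication counts are available, and a Pearson coefficient is then computed. All numbers here are hypothetical toy values, not the data behind the r and p values on the slide, and the significance test is omitted.

```python
import math
from bisect import bisect_right

def interpolate(xs, ys, x):
    """Linear interpolation of (xs, ys) at x; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the sampled range")
    j = bisect_right(xs, x)
    if j == len(xs):
        return ys[-1]
    x0, x1 = xs[j - 1], xs[j]
    y0, y1 = ys[j - 1], ys[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical fossil rates at stage ages (Mya) and duplication counts per bin.
stage_ages = [0.0, 100.0, 200.0, 300.0]
extinction_rate = [0.10, 0.30, 0.20, 0.40]
bin_times = [50.0, 150.0, 250.0]
duplication_counts = [12, 20, 25]

rates_at_bins = [interpolate(stage_ages, extinction_rate, t) for t in bin_times]
r = pearson_r(rates_at_bins, duplication_counts)
print(rates_at_bins, r)
```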
Discussion … • An amusing but plausible model (creation by extinction) (divergent resolution?) • When the environment changes dramatically, the populations of most species become smaller or even go extinct (extinction). In the gene duplication model, sudden and varied positive selection will fix more new duplicates through neo/subfunctionalization. On the other hand, a change that is deleterious to a gene’s function readily escapes purifying selection in a small population[12]. In the population, both redundancy and robustness increase[13]. So the genome structure is not an optimized one, but it is good for survival (note: TM proteins mostly belong to the dosage-sensitive genes). When the environment levels off, the population will increase and migrate. A redundant genome will then subfunctionalize some duplicates. At this time, most new species emerge. (Is this one possible logic connecting duplication, extinction, and origination?) • The correlation analysis between origination/extinction rates and the duplication profile may need more data, but the results are still informative. • Life is not only a passive process, especially at the level of the ecosystem. (About the two conflicts in the figure) (consistent with the evolution of oxygenic photosynthesis) • ~0.3 Gya, gymnosperms begin to diversify widely. • ~0.13 Gya, angiosperms evolve flowers, structures that attract insects and other animals to spread pollen. The evolution of the angiosperms caused a major burst of animal evolution. Nature, vol 400, 58-61 (for flowering plants)
Periodogram analysis (FFT) a=56.72948; b=-51.95965; c=12.79374; d=0.07294
FFT … Amplitude = 0.15357741 • Period = 0.06230346 Gyr • Phase = 1.09601472 (radians) • Accounts for 8.5% of the variance (> 5%).
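The quantities on this slide (amplitude, period, phase, share of variance) can all be read off a discrete Fourier transform of the detrended, evenly sampled series. The sketch below assumes a cosine phase convention, y ≈ A·cos(2πt/T + φ), which may differ from the convention used in the talk, and it runs on a synthetic series rather than the duplication data.

```python
import cmath
import math

def dominant_cycle(series, dt):
    """Strongest sinusoid in an evenly sampled series, found by a plain DFT.

    Returns amplitude A, period (in units of dt), phase of A*cos(2*pi*t/T + phase)
    and the fraction of the series variance carried by that sinusoid.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    variance = sum(v * v for v in centered) / n
    if variance == 0.0:
        raise ValueError("series is constant")
    best = None
    for m in range(1, (n + 1) // 2):  # skip the mean (m = 0) and Nyquist terms
        coeff = sum(
            v * cmath.exp(-2j * math.pi * m * k / n) for k, v in enumerate(centered)
        )
        amplitude = 2.0 * abs(coeff) / n
        if best is None or amplitude > best["amplitude"]:
            best = {
                "amplitude": amplitude,
                "period": n * dt / m,
                "phase": cmath.phase(coeff),
                "variance_fraction": (amplitude ** 2 / 2.0) / variance,
            }
    return best

# Synthetic check: 4 full cycles over 64 samples spaced 0.01 apart.
n_points, dt = 64, 0.01
synthetic = [0.15 * math.cos(2 * math.pi * 4 * k / n_points + 1.1) for k in range(n_points)]
cycle = dominant_cycle(synthetic, dt)
print(cycle)
```

On real data the detrending step comes first, and the result depends on the trend function chosen.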
FFT … Models R/W, Monte Carlo simulation • R model: P = 0.1294 (α = 0.05) • W model: P = 0.0138 (α = 0.05) (Figure legend: control; observation)
Result … • A 0.062 Gyr cycle is evident in the Phanerozoic in the Fourier spectrum, but the random-walk null hypothesis can’t be rejected. (R: p = 0.1294; W: p = 0.0138; V: 8.52%) • Several others: 0.0912 Gyr (0.2956/0.0039, 10.63%); 0.0275 Gyr (0.0071/0.1037, 4.22%); 0.0162 Gyr (1e-4/0.1047). • Ten thousand Monte Carlo simulations were done. • Overall periodogram after ~2.75 Gya • Not a well-posed question: it is difficult to choose an appropriate trend function to detrend the data. • The phase of the 62-Myr cycle differs between fossil diversity and duplication. • 5.21 (radians) - 1.1 (radians) = 4.1 (radians) = 1.305π Nature. 2005 Mar 10;434(7030):208-10
Discussion … • The 62-million-year wave is surprisingly strong and, so far, there is no good explanation for it (the wave from the GOD ^_^). We have detected it in an independent data set by applying the same trend functions. Is it a chicken-and-egg question? • It raises fundamental questions about life and the environment. What causes it? • We offer a second way to discuss this question. • About the phase shift • 1.305π ≠ π. In my story it should be 1.5π, but that is not the case. • The phase shift indicates asynchrony between the duplication profile and genus diversity. Nature. 2005 Mar 10;434(7030):208-10
Some references • [1] Jianzhi Zhang. Evolution by gene duplication: an update. TRENDS in Ecology and Evolution 18, 292-298 (2003). • [2] Michael Lynch & Vaishali Katju. The altered evolutionary trajectories of gene duplicates. TRENDS in Genetics 20, 544-549 (2004). • [3] David Jablonski. The interplay of physical and biotic factors in macroevolution. Evolution on Planet Earth (book). • [4] U Lehnert, Y Xia et al. Computational analysis of membrane proteins: genomic occurrence, structure prediction and helix interactions. Quarterly Reviews of Biophysics (in press). • [5] Lynch M & Conery JS. The evolutionary fate and consequences of duplicate genes. Science 290(5494), 1151-5 (2000). • [6] Sonnhammer EL, Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18(12), 619-20 (2002). • [7] Canfield DE, Habicht KS, Thamdrup B. The Archean sulfur cycle and the early history of atmospheric oxygen. Science. 2000 Apr 28;288(5466):658-61. • [8] Hayes JM. Biogeochemistry: a lowdown on oxygen. Nature. 2002 May 9;417(6885):127-8. • [9] Hoffman PF, Kaufman AJ, Halverson GP, Schrag DP. A Neoproterozoic snowball earth. Science. 1998 Aug 28;281(5381):1342-6. • [10] Hedges SB, Blair JE, Venturi ML, Shoe JL. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol Biol. 2004 Jan 28;4(1):2. • [11] Cornette JL, Lieberman BS. Random walks in the history of life. Proc Natl Acad Sci U S A. 2004 Jan 6;101(1):187-91. • [12] Sidow A. Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev. 1996 Dec;6(6):715-22. • [13] Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis RW, Li WH. Role of duplicate genes in genetic robustness against null mutations. Nature. 2003 Jan 2;421(6918):63-6.
… • Function clustering • The methodology discussion • ……
Acknowledgements • Dr Qi Wang Prof Yixue Li • Dr Qi Liu Prof Gang Pei • Ziliang Qian Prof Tieliu Shi • Yongzhang Zhu • Guang Li • PeiLin Jia • Changzheng Dong • Fudong Yu • ……