Minimal Model for the Growth and Evolution of Genomes
International Symposium on Frontiers of Science, Tsinghua University, Beijing, 2002 June 17-19
HC LEE
National Central University, Computational Biology Laboratory
Plan of Presentation
• The Human Genome Project
• Life Science in silico
• Some statistical properties of genomes
• Models for evolution and growth of genomes
• Some preliminary results
• Discussion
Genome: the Book of Life written in four letters
• DNA: a polymer of nucleotides
• Nucleotide: backbone + base
• Four types of bases: A, C, G, T (the four letters)
• Gene: coded sequence of bases
• Genome: set of all genes; set of all chromosomes (packaged pairs of DNA strands with double-helix structure)
CBL@NCU
The Human Genome Project
• 1984 to 1986: first proposed at US DOE
• 1988: endorsed by US National Research Council
  • creation of genetic, physical and sequence maps of the human genome
  • parallel efforts in key model organisms: bacteria, yeast, worms, flies and mice
  • development of supporting technology
  • ethical, legal and social issues (ELSI)
• 1990: Human Genome Project (NHGRI)
• Later: UK, France, Japan, Germany, China
Genome data exploded after 1995
[Figure: growth of sequenced genome data in GenBank, as of 2002 January 13; vertical axis: millions of sequences]
First working draft of the Human Genome
Sequencing of the first working draft of the human genome was published in February 2001
• Nature 409, 860-921 (February 15, 2001)
• Science 291, 1304-1351 (February 16, 2001)
Many completed genomes
• 1995-2002: Bacteria (about 75 organisms); 0.5-5 Mb; hundreds to 2000 genes
• 1996 April: Yeast (Saccharomyces cerevisiae); 12 Mb; 5,500 genes
• 1998 Dec.: Worm (Caenorhabditis elegans); 97 Mb; 19,000 genes
• 2000 March: Fly (Drosophila melanogaster); 137 Mb; 13,500 genes
• 2000 Dec.: Mustard (Arabidopsis thaliana); 125 Mb; 25,498 genes
• 2001 Feb.: Human (Homo sapiens); 3000 Mb; 35,000 to 40,000 genes
New way to do Life Science Research
• in vivo: in living organisms
• in vitro: in the test tube
• in silico: in the computer
Life Science in silico
[biology] + [computer-science] + [math & physics] + [sequence data] = Life Science in silico
“It is much easier to teach biology to people from a math, physics or computer-science background than to teach a biologist how to code well.” - Nature, February 15, 2001, p963
Structure of genome is complex
• Many levels: genes, intergenic regions, regulatory sections
• Gene: network of introns and exons
• Genome: network of genes
• Random mutation
• Genes are products of the “blind watchmaker”
• Once made, a gene is repeatedly copied
  • paralogues, orthologues and pseudogenes
• Genes are protected against rapid mutation
Genome as text
• Genome is a text of four letters: A, C, G, T
• Frequencies of k-mers characterize the whole genome
• E.g. counting frequencies of 7-mers with a “sliding window”: each time the window reads GTTACCC, N(GTTACCC) = N(GTTACCC) + 1
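The sliding-window count described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk; the function name `count_kmers` and the toy sequence are assumptions made for the example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every k-mer seen by a window of width k sliding one base at a time."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] += 1  # N(kmer) = N(kmer) + 1
    return counts


# Toy sequence (hypothetical): GTTACCC occurs twice.
toy = "GTTACCCGTTACCC"
counts = count_kmers(toy, 7)
print(counts["GTTACCC"])
```

A sequence of length L yields L - k + 1 windows, so for a genome of about 1 Mb the counts of all 6-mers sum to roughly one million.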
8-mer portraits of bacterial genomes
[Figure: 8-mer portraits of bacterial genomes]
Hao, Lee, Zhang (2000)
Constructing Tree of Life with k-mers (16S rRNA, 35 organisms)
[Figure: tree with three branches labelled Bacteria, Archaea and Eukarya; A. aeolicus and T. maritima are marked. Black tree: distribution of 8-mers. Red tree: sequence alignment.]
LF Luo, FM Ji, LC Hsieh, HC Lee (2001)
Textual statistics of genomes: almost random, but NOT TRIVIALLY so
• Distribution of GC content
  • GC content across the genome is correlated with the density of coding regions
• Distribution of frequencies of k-mers
  • Characterizes whole genomes
  • Same in coding and non-coding regions
  • Typically 10 to 15 times wider than expected for a random sequence
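If the genome grows randomly by single nucleotides, the distribution of k-mer frequencies is Poisson
• Poisson: $P(f=k) = \lambda^k e^{-\lambda}/k!$, with $\langle f \rangle = \lambda$ and D (standard deviation) $= \langle f \rangle^{1/2}$
• Gamma: $G(f) = f^{a-1} e^{-f/b} / (b^a \Gamma(a))$, with $\langle f \rangle = ab$ and $D = a^{1/2} b$
• Random single nucleotide: D = 15.5
• E. coli: a = 3.05, b = 80.0; D = 140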
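As a quick check of the widths quoted on this slide, the sketch below evaluates the Poisson and gamma standard deviations from the formulas above. It assumes 6-mers in a genome normalized to 1,000,000 bases (the normalization used later in the talk); the variable names are chosen for the example only.

```python
import math

genome_length = 1_000_000  # assumed normalization: 1 Mb
k = 6

# Mean frequency of a 6-mer if all 4**6 words were equally likely.
mean_freq = genome_length / 4 ** k

# Poisson: standard deviation is the square root of the mean.
poisson_sd = math.sqrt(mean_freq)

# Gamma fit quoted for E. coli: mean = a*b, standard deviation = sqrt(a)*b.
a, b = 3.05, 80.0
gamma_mean = a * b
gamma_sd = math.sqrt(a) * b

print(f"mean frequency  = {mean_freq:.1f}")
print(f"Poisson D       = {poisson_sd:.1f}")
print(f"gamma mean, D   = {gamma_mean:.1f}, {gamma_sd:.1f}")
```

With these inputs the gamma fit has the same mean as the Poisson case (about 244) but a width close to 140, roughly nine times the Poisson width.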
Non-uniform nucleotide composition breaks the n-mer Poisson distribution into n+1 peaks
Given [at]/[cg] = 70/30: if the mean frequency is 244, then the mean frequency of 6-mers with k a's or t's and 6-k c's or g's is
$f_k = 244\,(0.7)^k (0.3)^{6-k} / (0.5)^6$

k    f_k
0    11.4
1    26.6
2    62.0
3    144
4    337
5    787
6    1837

[Figure: number of 6-mers vs frequency of 6-mers, for a random single-nucleotide sequence and for M. jannaschii; the peaks of the random sequence are labelled with the f_k values above]
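The peak positions in the table follow directly from the formula for f_k. The sketch below recomputes them and also lists how many distinct 6-mers fall in each peak; the function name `peak_frequencies` and the binomial count of words per peak are additions for illustration, not taken from the slide.

```python
import math


def peak_frequencies(mean_freq=244.0, at_fraction=0.7, n=6):
    """Return (k, number of n-mers with k a/t's, expected frequency f_k) for k = 0..n."""
    cg_fraction = 1.0 - at_fraction
    rows = []
    for k in range(n + 1):
        f_k = mean_freq * at_fraction ** k * cg_fraction ** (n - k) / 0.5 ** n
        # Choose which k positions hold a/t, then 2 letter choices at every position.
        n_words = math.comb(n, k) * 2 ** n
        rows.append((k, n_words, f_k))
    return rows


for k, n_words, f_k in peak_frequencies():
    print(f"k={k}  words={n_words:5d}  f_k={f_k:7.1f}")
```

The word counts sum to 4096, and the frequency averaged over all words returns the mean of 244, so the composition bias spreads the single Poisson peak into seven peaks without changing the overall mean.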
Standard deviation of the distribution of GC content in the human genome is 15 times larger than expected for a random sequence
International Human Genome Sequencing Consortium, Nature 409 (2001) 860-921
Distribution is broad gamma rather than Poisson
• Narrow Poisson: few objects (k-mers) distributed into many boxes (long genome)
• Gamma
  • power-law rise and exponential tail
• Broad gamma: many objects distributed into few boxes
  • many more entries (k-mers, GC contents, etc.) with very high or very low frequencies
• Problem: the number of distinct k-mers ($4^k$, k < 10) is much less than the genome length (> 1 M for bacteria)
  • E.g. for 6-mers in a 1 M genome, the mean frequency is 1,000,000/$4^6$ = 1,000,000/4096 = 244
How does a genome evolve and grow?
• Evolve by random mutation
  • replacement, insertion, deletion
• Plus selection
  • affects only coding regions
  • not globally important
• Cannot grow by random mutation alone
  • Otherwise Poisson distribution
• Must grow to long length while retaining statistical characteristics of a SHORT genome
The genome mutates and copies itself
• 50%, probably much more, of the human genome is composed of repeats
• Many traces of repeats obliterated by mutation
• Lower organisms may have longer genomes
• Five types of repeats
  • transposable elements
  • processed pseudogenes
  • simple k-mer repeats
  • segmental duplications (10-300 kb)
  • (large) blocks of tandemly repeated sequences
A Conjecture on Genome Growth
• Random early growth
• Followed by
  • random self-copying and
  • random mutation
Self-copying: a strategy for retaining and reusing hard-to-come-by coded sequences (i.e. genes)
The Model
• The genome grows by random single-base addition from nothing to an initial length much shorter than the final length
• Thereafter the genome evolves by random mutation and random self-copying, with a fixed frequency ratio
The Model (continued)
• Mutation is standard single-point mutation: replacement, insertion and deletion
• Random self-copying
  • random selection of the site of the copied segment
  • weighted random selection of the length of the copied segment
  • random selection of the insertion site of the copied segment
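The two-phase model described on these slides can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the code used for the results in the talk: the function names, the default parameter values, the equal odds for the three mutation types, and the exponential weighting of segment length with scale `s` are all choices made for the example.

```python
import random

BASES = "ACGT"


def point_mutation(genome, rng):
    """Apply one single-point mutation in place (types chosen with equal odds: assumption)."""
    kind = rng.choice(("replace", "insert", "delete"))
    pos = rng.randrange(len(genome))
    if kind == "replace":
        genome[pos] = rng.choice(BASES)
    elif kind == "insert":
        genome.insert(pos, rng.choice(BASES))
    elif len(genome) > 1:  # never delete the last remaining base
        del genome[pos]


def self_copy(genome, rng, s):
    """Copy a random segment and insert the copy at a random site."""
    n = len(genome)
    # Assumed length weighting: exponential with scale s, capped at the genome length.
    length = min(n, 1 + int(rng.expovariate(1.0 / s)))
    start = rng.randrange(n - length + 1)
    segment = genome[start:start + length]
    site = rng.randrange(n + 1)
    genome[site:site] = segment


def grow_genome(initial_len=1000, final_len=20000, h=500, s=1500.0, seed=0):
    """Random early growth, then mutation and self-copying at a ratio of h mutations per copy."""
    rng = random.Random(seed)
    genome = [rng.choice(BASES) for _ in range(initial_len)]
    while len(genome) < final_len:
        if rng.random() < h / (h + 1.0):
            point_mutation(genome, rng)
        else:
            self_copy(genome, rng, s)
    return "".join(genome)


genome = grow_genome()
print(len(genome))
```

The defaults are deliberately small so the sketch runs in a moment; the final length and length scale would need to be raised substantially to approach the regime discussed in the talk.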
An explicit one-parameter model
• l is the copied segment length
• 0 < y < 1 is a random number
• l =
Some results
• Distribution of 6-mer frequency
• Starting genome length 1000
• Final genome length 1 million
• Mutation to self-copying event ratio 100 < h < 4000
• Length scale for copied segments 2500 < s < 100 K
• Compared with E. coli (4.5 Mbp), B. subtilis (4.2 Mbp), M. jannaschii (1.7 Mbp) (all normalized to 1 Mbp)
E. coli, [at]/[cg] = 50/50
[Figure, two panels: number of 6-mers vs frequency of 6-mers]
• E. coli vs random: D = 140, 15.5 (genome, model)
• E. coli vs mutation + repeat (ratio 500:1, s = 15K): D = 140, 144
B. subtilis, [at]/[cg] = 60/40
[Figure, two panels: number of 6-mers vs frequency of 6-mers]
• B. subtilis vs random: D = 167, 79 (genome, model)
• B. subtilis vs mutation + repeat (ratio 600:1, s = 15K): D = 167, 169
M. jannaschii, [at]/[cg] = 70/30
[Figure, two panels: number of 6-mers vs frequency of 6-mers]
• M. jannaschii vs random: D = 320, 265 (genome, model)
• M. jannaschii vs mutation + repeat (ratio 600:1, s = 15K): D = 320, 321
Gamma distribution reproduces higher moments

Organism / model                     [at]/[gc]   a      b      D(2)   D(3)   D(4)   D(5)
E. coli                              50/50                     140    147    213    252
  gamma distribution                             3.05   80.0   140    146    208    243
  random w/o self-copy (Poisson)                               15.6   3.6    20.7   10
  w/ self-copy (h = 500, s = 15K)                              144    148    212    247
B. subtilis                          60/40                     168    223    316    400
  gamma distribution                             2.12   115    168    186    261    310
  random w/o self-copy (Poisson/7)                             79     68     109    117
  w/ self-copy (h = 600, s = 15K)                              169    194    266    311
M. jannaschii                        70/30                     320    465    650    810
  gamma distribution                             0.58   418    320    439    609    767
  random w/o self-copy (Poisson/7)                             264    369    500    603
  w/ self-copy (h = 600, s = 15K)                              321    462    635    783

Gamma distribution: $D(x) = x^{a-1} b^{-a} e^{-x/b} / \Gamma(a)$
$D(n) = \langle (x - \langle x \rangle)^n \rangle^{1/n}$; $\langle x \rangle = 244 = ab$; $D(2) = a^{1/2} b$
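The "gamma distribution" rows of the table can be checked from a and b alone. The sketch below uses the standard cumulants of a gamma distribution with shape a and scale b, κ_n = a b^n (n-1)!, and the usual relations between cumulants and central moments; the function name `gamma_moment_widths` is an assumption made for the example.

```python
import math


def gamma_moment_widths(a, b):
    """Return D(n) = (n-th central moment)**(1/n) for n = 2..5 of a gamma(a, b) distribution."""
    # Cumulants of the gamma distribution: kappa_n = a * b**n * (n-1)!
    kappa = {n: a * b ** n * math.factorial(n - 1) for n in range(2, 6)}
    # Central moments expressed through cumulants.
    mu = {
        2: kappa[2],
        3: kappa[3],
        4: kappa[4] + 3 * kappa[2] ** 2,
        5: kappa[5] + 10 * kappa[3] * kappa[2],
    }
    return {n: mu[n] ** (1.0 / n) for n in mu}


for name, a, b in [("E. coli", 3.05, 80.0),
                   ("B. subtilis", 2.12, 115.0),
                   ("M. jannaschii", 0.58, 418.0)]:
    widths = gamma_moment_widths(a, b)
    print(name, [round(widths[n]) for n in (2, 3, 4, 5)])
```

For the E. coli fit this gives values close to the tabulated 140, 146, 208 and 243; small differences in the last digit are expected because a and b are quoted to only three significant figures.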
Result sensitive to values of two parameters
• Mutation to self-copying event ratio h
  • for bacterial genomes, 200 < h ~ 0.04 s < 800
• If h >> 800 (at s ~ 15K)
  • too many mutations
  • gets long genome with Poisson distribution
• If h << 200 (at s ~ 15K)
  • too much self-copying
  • too few mutations
  • gets multiple copies of random short (initial) genome (distribution too wide)
Mutation to self-copy ratio is 500 +/- 100
[Figure, five panels: h = 100, 250, 500, 2000, 4000]
• Mutation/self-copy = h
• Scale of repeat length = s = 15K
• $P(l)/P(l') = e^{-(l-l')/s}$
• [at]:[cg] = 70:30 (genome-like)
Result sensitive to values of two parameters (cont'd)
• Length scale s for copied segments
  • s ~ 10 K to 25 K for bacterial genomes
• If s << 5 K (at h ~ 600)
  • genome grows too slowly
  • too many mutations
  • gets long genome with Poisson distribution
• If s >> 25 K (at h ~ 600)
  • genome grows too quickly
  • too few mutations
  • gets multiple copies of random short (initial) genome (distribution too wide)
Scale of repeat length cannot be too short
[Figure, five panels: s = 0.5K, 2.5K, 15K, 50K, 1000K]
• Scale of repeat length = s
• $P(l)/P(l') = e^{-(l-l')/s}$
• Mutation/self-copy = h = 500
• [at]:[cg] = 70:30 (genome-like)
Summary & Discussion
• Genomes have overabundances of extremely frequent and extremely rare oligos (EFEROs)
• Genomes have statistical properties of very SHORT sequences
• Suggests genome grew by mutation and random self-copying
• Minimal model with two parameters, length scale and event ratio, explains the frequency of occurrence of k-mers (oligos) very well
Darwinian gradualism or Punctuated Equilibrium?
• Palaeontologists have long debated the mode of evolution
  • Gradual evolution of Classical Darwinism (species varieties; Dawkins et al.)
  • Change by spurts, as in “Punctuated Equilibrium” (Burgess shale, missing link; Gould et al.)
• Minimal model already accommodates the two competing modes of evolution
  • Mutation: Classical Darwinism
  • Self-copying of long sequences: Punctuated Equilibrium
• Seems Nature uses both modes
A peek at the Universal Ancestor
• Since extremely frequent and extremely rare oligos (EFEROs) are the remnants of the early sequence, they characterize the common ancestor of phylogenetically related genomes
• Should be able to use the set of EFEROs in whole genomes to construct phylogenetic trees of whole genomes
• At each node of the tree would be an ancestor sequence characterized by a set of EFEROs
• The ancestor of Life would be characterized by the minimum set of EFEROs
Outlook
• Punctuated equilibrium
  • more evidence in textual detail?
• Time scale for evolution
  • time scale from mutation to self-copy ratio?
  • length scale of repeats verifiable?
  • more textual detail needed to refine model?
• Universal Ancestor
  • Can we build a good tree using EFEROs?
  • Does a Universal Ancestor exist in terms of its EFEROs?
  • If so, can we reconstruct the sequence of the Universal Ancestor?
The End
Thank you, everyone!
All computation by 謝立青
CBL webpage: www.phy.ncu.edu.tw/hclee/index_eng.htm
Preprint: www.phy.ncu.edu.tw/hclee/preprints/gro_prsub.pdf