This presentation discusses the Human Genome Project, statistical properties of genomes, models for evolution and growth of genomes, and preliminary results. It also explores the concept of the genome as a text of four letters and the distribution of nucleotide frequencies.
Minimal Model for the Growth and Evolution of Genomes • International Symposium on Frontiers of Science, Tsinghua University, Beijing, 2002 June 17-19 • HC Lee, Computational Biology Laboratory, National Central University
Plan of Presentation • The Human Genome Project • Life Science in silico • Some statistical properties of genomes • Models for evolution and growth of genomes • Some preliminary results • Discussion
Genome - Book of Life written in four letters • DNA - a polymer of nucleotides • Nucleotide - backbone + base • Four types of bases: A, C, G, T (the four letters) • Gene - coded sequence of bases • Genome - set of all genes; set of all chromosomes (packaged pairs of DNA strands with double-helix structure) CBL@NCU
The Human Genome Project • 1984 to 1986 – first proposed at US DOE • 1988 – endorsed by US National Research Council • creation of genetic, physical and sequence maps of the human genome • parallel efforts in key model organisms: bacteria, yeast, worms, flies and mice • development of supporting technology • ethical, legal and social issues (ELSI) • 1990 – Human Genome Project (NHGRI) • Later – UK, France, Japan, Germany, China
Genome data exploded after 1995 • Growth of sequenced genome data exploded after 1995 (GenBank, as of 2002 January 13) • [Figure: millions of sequences in GenBank over time]
First working draft of Human Genome • Sequencing of the first working draft of the human genome published in February 2001 • Nature 409, 860-921 (15 February 2001) • Science 291, 1304-1351 (16 February 2001)
Many completed genomes
• 1995-2002 – Bacteria (about 75 organisms): 0.5-5 Mb; hundreds to 2000 genes
• 1996 April – Yeast (Saccharomyces cerevisiae): 12 Mb, 5,500 genes
• 1998 Dec. – Worm (Caenorhabditis elegans): 97 Mb, 19,000 genes
• 2000 March – Fly (Drosophila melanogaster): 137 Mb, 13,500 genes
• 2000 Dec. – Mustard (Arabidopsis thaliana): 125 Mb, 25,498 genes
• 2001 Feb. – Human (Homo sapiens): 3000 Mb, 35,000~40,000 genes
New way to do Life Science Research • in vivo (in living organisms) • in vitro (in the test tube) • in silico (in the computer)
Life Science in silico • [biology] + [computer science] + [math & physics] + [sequence data] = Life Science in silico • “It is much easier to teach biology to people from a math, physics or computer-science background than to teach a biologist how to code well.” - Nature, February 15, 2001, p963
Structure of genome is complex • Many levels – genes, intergenic region, regulatory sections • Gene – network of introns and exons • Genome – network of genes • Random mutation • Genes are products of “blind watchmaker” • Once made, gene is repeatedly copied • paralogues, orthologues and pseudogenes • Genes are protected against rapid mutation
Genome as text • Genome is a text of four letters: A, C, G, T • Frequencies of k-mers characterize the whole genome • E.g. counting frequencies of 7-mers with a “sliding window”: each time the window reads GTTACCC, N(GTTACCC) = N(GTTACCC) + 1
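The sliding-window count on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: the function name `count_kmers` and the toy sequence are made up here, and skipping windows that contain letters other than A, C, G, T is an assumption.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping k-mer by sliding a width-k window one base at a time."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # Assumption: windows with ambiguous bases (not A, C, G, T) are skipped.
        if set(window) <= set("ACGT"):
            counts[window] += 1  # N(window) = N(window) + 1
    return counts

toy_genome = "GTTACCCGTTACCCA"  # made-up sequence, not real data
counts = count_kmers(toy_genome, 7)
print(counts["GTTACCC"])
```

The window advances by one base, so a sequence of length L yields L - k + 1 overlapping k-mers.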
8-mer portraits of bacterial genomes • Hao, Lee, Zhang (2000)
Constructing Tree of Life with k-mers (16S rRNA, 35 organisms) • LF Luo, FM Ji, LC Hsieh, HC Lee (2001) • [Figure: trees over Bacteria, Eukarya and Archaea, with A. aeolicus and T. maritima labelled] • Black tree: distribution of 8-mers. Red tree: sequence alignment.
Textual statistics of genome almost random but NOT TRIVIALLY so • Distribution of GC content • GC content across genome correlated with density of coding regions • Distribution of frequencies of k-mers • Characterizes whole genomes • Same in coding and non-coding regions • Typically 10~15 times wider than normal
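If genome grows randomly by single nucleotide then distribution is Poisson
• Poisson: $P(f=k) = \lambda^k e^{-\lambda}/k!$, with $\langle f \rangle = \lambda$ and standard deviation $D = \langle f \rangle^{1/2}$
• Gamma: $G(f) = f^{a-1} e^{-f/b} / \left(b^a\,\Gamma(a)\right)$, with $\langle f \rangle = ab$ and $D = a^{1/2} b$
• Random single nucleotide: D = 15.5
• E. coli: a = 3.05, b = 80.0; D = 140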
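As a quick check of the numbers on this slide, the sketch below evaluates the Poisson and gamma widths from the stated formulas. The function names are illustrative, and `gamma_params_from_moments` (a method-of-moments inversion of the two gamma relations) is added here for illustration; the slide does not say how a and b were fitted.

```python
import math

def poisson_std(mean):
    """Standard deviation of a Poisson distribution: D = <f>^(1/2)."""
    return math.sqrt(mean)

def gamma_mean_std(a, b):
    """Mean and standard deviation of a gamma distribution: <f> = a*b, D = sqrt(a)*b."""
    return a * b, math.sqrt(a) * b

def gamma_params_from_moments(mean, std):
    """Invert <f> = a*b and D = sqrt(a)*b for (a, b); an assumed fitting recipe."""
    a = (mean / std) ** 2
    b = std ** 2 / mean
    return a, b

print(poisson_std(244))                 # close to the quoted 15.5
print(gamma_mean_std(3.05, 80.0))       # mean 244, width close to 140
print(gamma_params_from_moments(244, 140))
```

With a mean 6-mer frequency of 244, the Poisson width is about 15.6, while the gamma parameters quoted for E. coli give a width of about 140, roughly nine times broader.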
Non-uniform nucleotide composition breaks the n-mer Poisson distribution into n+1 peaks • Given [at]/[cg] = 70/30 and a mean frequency of 244, the mean frequency of 6-mers with k A's or T's and 6-k C's or G's is $f_k = 244\,(0.7)^k (0.3)^{6-k} / (0.5)^6$

| k | f_k |
|---|------|
| 0 | 11.4 |
| 1 | 26.6 |
| 2 | 62.0 |
| 3 | 144 |
| 4 | 337 |
| 5 | 787 |
| 6 | 1837 |

[Figure: number of 6-mers vs frequency of 6-mers for M. jannaschii and for random single-nucleotide growth, with peaks at the f_k values above]
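The peak positions follow directly from the formula on this slide. The short sketch below recomputes them; the function name and default arguments are illustrative choices, not part of the original material.

```python
from math import comb

def peak_frequency(k_at, n=6, mean=244.0, p_at=0.7):
    """Mean frequency of n-mers containing k_at A/T bases and n - k_at C/G bases."""
    p_cg = 1.0 - p_at
    return mean * p_at ** k_at * p_cg ** (n - k_at) / 0.5 ** n

for k in range(7):
    print(k, round(peak_frequency(k), 1))

# Consistency check: there are comb(6, k) * 2**6 six-mers with k A/T bases,
# so the frequency averaged over all 4**6 six-mers is still the overall mean.
average = sum(comb(6, k) * 2 ** 6 * peak_frequency(k) for k in range(7)) / 4 ** 6
print(round(average, 1))
```

The seven peaks span more than two orders of magnitude, which is why a skewed base composition alone already broadens the distribution well beyond a single Poisson peak.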
Standard deviation of distribution of GC content in Human Genome is 15 times wider than normal • The Human Genome, International consortium, Nature 409 (2001) 860-921
Distribution is broad gamma rather than Poisson • Narrow Poisson - few objects (k-mers) distributed into many boxes (long genome) • Gamma - power-law rise and exponential tail • Broad gamma - many objects distributed into few boxes • many more entries (k-mers, GC contents, etc.) with very high or very low frequencies • Problem: number of k-mers ($4^k$, k < 10) is much less than genome length (> 1 M for bacteria) • E.g. mean frequency of a 6-mer in a 1 Mb genome: $1\,\mathrm{M}/4^6$ = 1,000,000/4096 = 244
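The arithmetic behind the mean frequency can be repeated for other k. The sketch below assumes all $4^k$ k-mers are equally likely and uses a 1 Mb genome, as in the example on this slide; the function name is illustrative.

```python
import math

def mean_kmer_frequency(genome_length, k):
    """Average occurrences per k-mer if all 4**k k-mers are equally likely."""
    return genome_length / 4 ** k

for k in (4, 6, 8):
    mean = mean_kmer_frequency(1_000_000, k)
    # For a Poisson distribution the standard deviation is sqrt(mean).
    print(k, 4 ** k, round(mean, 1), round(math.sqrt(mean), 1))
```

For k = 6 this gives the mean of 244 used throughout the talk, with a Poisson width of only about 16.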
How does a genome evolve and grow? • Evolve by random mutation • replacement, insertion, deletion • Plus selection • affects only coding regions • not globally important • Cannot grow by random mutation alone • Otherwise Poisson distribution • Must grow to long length while retaining statistical characteristics of SHORT genome
The genome mutates and copies itself • 50%, probably much more, of human genome composed of repeats • Many traces of repeats obliterated by mutation • Lower organisms may have longer genomes • Five types of repeats • transposable elements; processed pseudogenes; simple k-mer repeats; segmental duplications (10-300 kb); (large) blocks of tandemly repeated sequences
A Conjecture on Genome Growth • Random early growth • Followed by • random self-copying and • random mutation Self copying – strategy for retaining and multiple usage of hard-to-come-by coded sequences (i.e. genes)
The Model • The genome grows by random single base addition from nothing to an initial length much shorter than final length • Thereafter the genome evolves by random mutation and random self-copying, with a fixed frequency ratio
The Model (continued) • Mutation is standard single point mutation: replacement, insertion and deletion • Random self-copying • random selection of site of copied segment • weighted random selection of length of copied segment • random selection of insertion site of copied segment
An explicit one-parameter model • l is the copied segment length • 0 < y < 1 is a random number • l =
Some results • Distribution of 6-mer frequency • Starting genome length 1000 • Final genome length 1 million • Mutation to self-copying event ratio 100 < h < 4000 • Length scale for copied segments 2500 < s < 100 K • Compared with E. coli (4.5 Mbp), B. subtilis (4.2 Mbp), M. jannaschii (1.7 Mbp) (all normalized to 1 Mbp)
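A toy version of the model described on the preceding slides might look like the sketch below. It is not the code used for the results: the function names are made up, the three mutation types are assumed to be equally likely, bases are drawn uniformly, and the copied length is drawn from an exponential distribution with scale s, which is one choice consistent with the ratio P(l)/P(l') = exp{-(l-l')/s} quoted on the parameter-scan slides. The final sequence is truncated to the target length.

```python
import math
import random

BASES = "ACGT"

def sample_copy_length(s, rng):
    """Assumed form: exponential length with scale s, l = -s * ln(y), 0 < y <= 1."""
    y = 1.0 - rng.random()  # shifts [0, 1) to (0, 1] so log(0) cannot occur
    return max(1, int(-s * math.log(y)))

def grow_genome(initial_length, final_length, h, s, seed=0):
    rng = random.Random(seed)
    # Phase 1: random single-base addition up to the initial length.
    genome = [rng.choice(BASES) for _ in range(initial_length)]
    # Phase 2: on average h point mutations per self-copying event.
    while len(genome) < final_length:
        if rng.random() < 1.0 / (h + 1.0):
            # Self-copy: random source site, random length, random insertion site.
            length = min(sample_copy_length(s, rng), len(genome))
            start = rng.randrange(len(genome) - length + 1)
            segment = genome[start:start + length]
            site = rng.randrange(len(genome) + 1)
            genome[site:site] = segment
        else:
            # Point mutation: replacement, insertion or deletion (assumed equal odds).
            kind = rng.choice(("replace", "insert", "delete"))
            site = rng.randrange(len(genome))
            if kind == "replace":
                genome[site] = rng.choice(BASES)
            elif kind == "insert":
                genome.insert(site, rng.choice(BASES))
            elif len(genome) > 1:
                del genome[site]
    return "".join(genome[:final_length])

toy = grow_genome(initial_length=200, final_length=5000, h=50, s=500, seed=1)
print(len(toy))
```

The small parameters keep the run fast. The slide's settings (initial length 1000, final length 1 million, h of several hundred, s of about 15K) would need a more efficient sequence representation than a Python list.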
E. coli, [at]/[cg] = 50/50 • E. coli vs mutation + repeat (ratio 500:1, s = 15K): D = 140, 144 (genome, model) • E. coli vs random: D = 140, 15.5 • [Figure: number of 6-mers vs frequency of 6-mers]
B. subtilis, [at]/[cg] = 60/40 • B. subtilis vs mutation + repeat (ratio 600:1, s = 15K): D = 167, 169 (genome, model) • B. subtilis vs random: D = 167, 79 • [Figure: number of 6-mers vs frequency of 6-mers]
M. jannaschii, [at]/[cg] = 70/30 • M. jannaschii vs mutation + repeat (ratio 600:1, s = 15K): D = 320, 321 (genome, model) • M. jannaschii vs random: D = 320, 265 • [Figure: number of 6-mers vs frequency of 6-mers]
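Gamma function reproduces higher moments

| Organism / model | [at]/[gc] | a | b | D(2) | D(3) | D(4) | D(5) |
|---|---|---|---|---|---|---|---|
| E. coli | 50/50 | | | 140 | 147 | 213 | 252 |
| gamma distribution | | 3.05 | 80.0 | 140 | 146 | 208 | 243 |
| random w/o self-copy (Poisson) | | | | 15.6 | 3.6 | 20.7 | 10 |
| w/ self-copy (h = 500, s = 15K) | | | | 144 | 148 | 212 | 247 |
| B. subtilis | 60/40 | | | 168 | 223 | 316 | 400 |
| gamma distribution | | 2.12 | 115 | 168 | 186 | 261 | 310 |
| random w/o self-copy (Poisson/7) | | | | 79 | 68 | 109 | 117 |
| w/ self-copy (h = 600, s = 15K) | | | | 169 | 194 | 266 | 311 |
| M. jannaschii | 70/30 | | | 320 | 465 | 650 | 810 |
| gamma distribution | | 0.58 | 418 | 320 | 439 | 609 | 767 |
| random w/o self-copy (Poisson/7) | | | | 264 | 369 | 500 | 603 |
| w/ self-copy (h = 600, s = 15K) | | | | 321 | 462 | 635 | 783 |

Gamma distribution: $D(x) = x^{a-1} b^{-a} \exp(-x/b)/\Gamma(a)$ • $D(n) = \langle (x - \langle x \rangle)^n \rangle^{1/n}$ • $\langle x \rangle = 244 = ab$ • $D(2) = a^{1/2} b$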
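The "gamma distribution" rows can be recomputed from a and b alone. The sketch below uses a recurrence for the central moments of a gamma distribution, $\mu_{n+1} = n\,b\,(\mu_n + a\,b\,\mu_{n-1})$, which reproduces the familiar closed forms $\mu_2 = ab^2$ and $\mu_3 = 2ab^3$. The function names are illustrative, and this is a check of the tabulated values rather than the original computation.

```python
def gamma_central_moments(a, b, n_max):
    """Central moments mu_0..mu_n_max of a gamma distribution (shape a, scale b)."""
    mu = [1.0, 0.0]  # mu_0 = 1, mu_1 = 0
    for n in range(1, n_max):
        mu.append(n * b * (mu[n] + a * b * mu[n - 1]))
    return mu

def gamma_widths(a, b, orders=(2, 3, 4, 5)):
    """D(n) = (n-th central moment)^(1/n) for each requested order n."""
    mu = gamma_central_moments(a, b, max(orders))
    return {n: mu[n] ** (1.0 / n) for n in orders}

for name, a, b in (("E. coli", 3.05, 80.0), ("B. subtilis", 2.12, 115.0)):
    widths = gamma_widths(a, b)
    print(name, {n: round(d) for n, d in widths.items()})
```

For E. coli (a = 3.05, b = 80.0) this yields values close to the tabulated 140, 146, 208 and 243.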
Result sensitive to values of two parameters • Mutation to self-copying event ratio h • for bacterial genomes, 200 < h ~ 0.04 s < 800 • If h >> 800 (at s ~ 15K) • too many mutations • gets long genome with Poisson distribution • If h << 200 (at s ~ 15K) • too much self-copying • too few mutations • gets multiple copies of random short (initial) genome (distribution too wide)
Mutation to self-copy ratio is 500 +/- 100 • [Figure panels: h = 100, h = 250, h = 500, h = 2000, h = 4000] • Mutation/self-copy = h • Scale of repeat length = s = 15K • $P(l)/P(l') = \exp\{-(l-l')/s\}$ • [at]:[cg] = 70:30 (genome-like)
Result sensitive to values of two parameters (cont’d) • Length scale s for copied segments • s ~ 10 K to 25 K for bacterial genomes • If s << 5 K (at h ~ 600) • genome grows too slowly • too many mutations • gets long genome with Poisson distribution • If s >> 25 K (at h ~ 600) • genome grows too quickly • too few mutations • gets multiple copies of random short (initial) genome (distribution too wide)
Scale of repeat length cannot be too short • [Figure panels: s = 0.5K, s = 2.5K, s = 15K, s = 50K, s = 1000K] • Scale of repeat length = s • $P(l)/P(l') = \exp\{-(l-l')/s\}$ • Mutation/self-copy = h = 500 • [at]:[cg] = 70:30 (genome-like)
Summary & Discussion • Genomes have overabundances of extremely frequent and extremely rare oligos (EFEROs) • Genomes have statistical properties of very SHORT sequences • Suggests genome grew by mutation and random self-copying • Minimal model with two parameters – length scale and event ratio - explains frequency of occurrence of k-mers (oligos) very well
Darwinian gradualism or Punctuated equilibrium? • Palaeontologists have long debated the mode of evolution • Gradual evolution of Classical Darwinism (species varieties; Dawkins et al.) • Change by spurts, as in “Punctuated Equilibrium” (Burgess shale, missing link; Gould et al.) • Minimal model already accommodates two competing modes of evolution • Mutation - Classical Darwinism • Self-copying of long sequences - Punctuated Equilibrium • Seems Nature uses both modes
A peek at the Universal Ancestor • Since extremely frequent and extremely rare oligos (EFEROs) are the remnants of the early sequence, they characterize the common ancestor of phylogenetically related genomes • Should be able to use the set of EFEROs in whole genomes to construct phylogenetic trees of whole genomes • At each node of the tree would be an ancestor sequence characterized by a set of EFEROs • The ancestor of Life would be characterized by the minimum set of EFEROs
Outlook • Punctuated equilibrium • more evidence in textual detail? • Time scale for evolution • time scale from mutation to self-copy ratio? • length scale of repeats verifiable? • more textual detail needed to refine model? • Universal Ancestor • Can we build a good tree using EFEROs? • Does a Universal Ancestor exist in terms of its EFEROs? • If so, can we reconstruct the sequence of the Universal Ancestor?
The End • Thank you, everyone! • All computation by 謝立青 • CBL webpage: www.phy.ncu.edu.tw/hclee/index_eng.htm • Preprint: www.phy.ncu.edu.tw/hclee/preprints/gro_prsub.pdf