This article explores the idea that evolution seeks to maximize biodiversity within the constraints of time and energy. It applies this concept to various biological systems, such as DNA replication, protein synthesis, sexual reproduction, and speciation. Claude Shannon's mathematical theory of communication is used to understand the transmission of information in these systems. The article also discusses the importance of equiprobability in maximizing the transmission rate and applies this concept to DNA replication.
Entropy Driven Evolution: Why DNA is Coded in 4 Bases and Reproduction Takes 2 Sexes?
Bo Deng
Department of Mathematics, UNL
IIT, 14 Feb. 2011
http://www.math.unl.edu/~bdeng1
Working Hypothesis
Evolution is driven to maximize biodiversity against constraints in time and energy across all biological scales.
Applied to all informational systems:
• DNA Replication
• Protein Synthesis
• Sexual Reproduction
• Speciation to Phylogenetic Tree
• Ecological Community
• Animal Brain
• Consciousness
• Language
• Social, Economic, Political Structures
C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October, 1948.
Claude E. Shannon (1916-2001)
[Figure: Channel]
What is Information? And What Matters the Most?
All about choices
[Figure: Internet, Transmission Speed Comparison]
Mathematical Measure of Information: What is in a bit?
One Bit = One Binary Digit (bit unit: 0 or 1)
• Dead Channel: transmits only one kind of symbol all the time, e.g. 0000… Each symbol carries 0 bits of information.
• Live Channel: transmits one of many possible symbols each time, e.g. 011101… in a binary channel. Each transmitted symbol is either 0 or 1, so each symbol contains 1 bit of information.
Pop Quiz: How many bits are in a quaternary symbol (1, 2, 3, 4)? Or in a symbol from an alphabet of n letters (1, 2, 3, …, n)?
Answer: $H_4 = 2$ bits and $H_n = \log_2 n$ bits, respectively, because $4 = 2^{\log_2 4}$ and $n = 2^{\log_2 n}$. In other words, the number of binary sequences of length $\log_2 n$ equals the number of choices, $n$.
Key Assumption: Each transmitted symbol is just one of n equally probable choices.
Ex: { a, b, c, d } = { 00, 01, 10, 11 }
What is in the transmission rate?
• Let $t_k$ be the time needed to transmit symbol $k$.
• Then the average transmission time per symbol is $T_n = (t_1 + t_2 + t_3 + \dots + t_n)/n$.
• And the mean rate is $R_n = H_n / T_n = n \log_2 n / (t_1 + t_2 + t_3 + \dots + t_n)$.
• The definition implicitly assumes that all symbols occur with equal probability.
• Why? Is that reasonable?
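As a worked instance of the mean-rate definition, consider a hypothetical binary channel (n = 2) whose two symbols take 1 and 2 seconds to send. These transmission times are illustrative assumptions, not measurements from the talk.

```latex
\begin{align*}
H_2 &= \log_2 2 = 1 \text{ bit}, \\
T_2 &= \frac{t_1 + t_2}{2} = \frac{1 + 2}{2} = 1.5 \text{ sec}, \\
R_2 &= \frac{H_2}{T_2} = \frac{1}{1.5} \approx 0.667 \text{ bits/sec}.
\end{align*}
```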
All-purpose Channel
Recall: $R_n = H_n / T_n = n \log_2 n / (t_1 + t_2 + t_3 + \dots + t_n)$
• Internet message types: video, audio, pictures, spam, etc.
• Each has a different frequency distribution in the encoding symbols.
Example of Possible Non-equiprobability:
• If we knew all video files that have ever been transmitted over the internet, we could make an accurate frequency table: say $p_1$ for Symbol 1, $p_2$ for Symbol 2, etc., and $p_n$ for Symbol n.
• Each transmitted Symbol 1 is just one choice out of $1/p_1$ many possible choices, and therefore Symbol 1 contains $\log_2 (1/p_1)$ bits of information, since $1/p_1 = 2^{\log_2 (1/p_1)}$ (that is, $1/p_1$ is the number of binary sequences of length $\log_2 (1/p_1)$).
• Similarly, Symbol k contains $\log_2 (1/p_k)$ bits of information.
• The average bits per symbol for our video-only source is $H(p) = p_1 \log_2 (1/p_1) + \dots + p_n \log_2 (1/p_n)$.
Example: Pick a marble from a bag of 2 blue and 5 red marbles.
• Probability of picking a blue marble: $p_{blue} = 2/7$
• Number of choices for each blue marble picked: $1/p_{blue} = 7/2 = 3.5$
Important fact (Equiprobability): $H(p) = p_1 \log_2 (1/p_1) + \dots + p_n \log_2 (1/p_n) \le H_n = \log_2 n$
Conclusion: For an all-purpose channel, the mean rate is calculated not for any particular source entropy but for the maximal source entropy, $H_n$, which is reached with the equiprobable distribution of the transmitted symbols.
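The entropy formula and the equiprobability bound above can be checked numerically. The following is a minimal sketch, using the marble bag from the slide as a two-symbol source; the function name `entropy_bits` is made up for illustration.

```python
import math

def entropy_bits(probs):
    """Average bits per symbol: H(p) = sum of p_k * log2(1/p_k)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Symbols with p_k = 0 never occur and contribute nothing to the average.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Marble bag from the example: 2 blue and 5 red marbles.
p_blue, p_red = 2 / 7, 5 / 7
bits_blue = math.log2(1 / p_blue)          # information in one blue draw
h_marbles = entropy_bits([p_blue, p_red])  # average bits per draw
h_max = math.log2(2)                       # H_2, the equiprobable maximum

print(f"bits in a blue draw: {bits_blue:.4f}")
print(f"H(p) for the bag:    {h_marbles:.4f}")
print(f"H_2 upper bound:     {h_max:.4f}")
```

A blue draw carries more than one bit because it is the rarer outcome, yet the average over both colors stays below the one bit reached by a 50/50 source.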
Example
• Encoding states:
  Symbols: 1, 2, 3, …, n
  Transmission times: $t_1, t_2, t_3, \dots, t_n$
• Assume: $t_1 = 1$ sec, $t_2 = 2$ sec, $t_3 = 3$ sec, …, $t_n = n$ sec
Then, since $t_1 + t_2 + \dots + t_n = n(n+1)/2$,
$R_n = H_n / T_n = n \log_2 n / (t_1 + t_2 + t_3 + \dots + t_n) = 2 \log_2 n / (n+1)$
Design Criterion: Choose n so that $R_n = H_n / T_n$ is the largest!
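The design criterion can be tried out directly on this example. The sketch below scans alphabet sizes 2 through 10 (the scan range is an arbitrary choice for illustration) with $t_k = k$ seconds and reports the size with the highest mean rate.

```python
import math

def mean_rate_linear(n):
    """R_n = n*log2(n) / (t_1 + ... + t_n) with t_k = k seconds."""
    total_time = sum(range(1, n + 1))  # equals n(n+1)/2
    return n * math.log2(n) / total_time

# Scan a small, arbitrary range of alphabet sizes.
rates = {n: mean_rate_linear(n) for n in range(2, 11)}
best_n = max(rates, key=rates.get)

for n, r in rates.items():
    print(f"n = {n:2d}: R_n = {r:.4f} bits/sec")
print("largest mean rate at n =", best_n)
```

Under these assumed times the rate rises from n = 2, peaks at n = 4, and then declines, because the growing transmission time outpaces the logarithmic gain in bits per symbol.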
DNA Replication
James D. Watson (1928 -), Francis Crick (1916 - 2004), Molecular structure of nucleic acids, Nature, 171 (1953), pp. 737-738.
http://www.mun.ca/biology/scarr/An11_01_DNA_replication.mov
Deoxyribonucleic Acid: A (adenine), T (thymine), C (cytosine), G (guanine)
Communication Model for DNA Replication
Fact:
• DNA replication is the same for all genomes
• Replication is a sequential process, one base at a time
Observation:
• Each species' genome is an information source
• A genome upon replication is a transmitted message
Conceptual Model: DNA replication is an all-purpose channel
Question: Why 4 bases: A, T, C, G?
Replication Mean Rate: $R_n = H_n / T_n$ (per-base diversity rate)
Assumption:
• Weaker chemical bonds take longer to replicate (Heisenberg's Uncertainty Principle: $\Delta t \, \Delta E \sim$ constant)
• Pairing times of high-energy bonds are ignored (as a first-order approximation for the pairing time)
• $t_A = t_T$ = pairing time of one H…O bond = $t_0$
• $t_G = t_C$ = pairing time of two H…O bonds = $2 t_0$
• $t_5 = t_6$ = pairing time of three H…O bonds = $3 t_0$, etc. (by Watson and Crick's base pairing principle)
Time scale of a single hydrogen bond pairing: $4 \times 10^{-15}$ sec.
The Result
Let k = # of base pairs and n = # of bases. Then n = 2k.
Since $t_{2m-1} = t_{2m} = m t_0$ for m = 1, 2, …, k,
$R_n = H_n / T_n = \log_2 n / [2(t_1 + t_3 + \dots + t_{2k-1})/n] = \log_2 n / [(n/2 + 1) \, t_0 / 2]$
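The rate formula above can be evaluated for a few even alphabet sizes to see where it peaks. This is a minimal sketch with $t_0$ set to 1 (an arbitrary unit) and a scan range chosen only for illustration.

```python
import math

def replication_rate(n, t0=1.0):
    """R_n for n bases (n even), where bases 2m-1 and 2m both pair in m*t0."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number of bases, at least 2")
    k = n // 2  # number of base pairs
    mean_time = 2 * sum(m * t0 for m in range(1, k + 1)) / n  # T_n
    return math.log2(n) / mean_time

# Rates in units of bits per t0 for 2, 4, ..., 12 bases.
rates = {n: replication_rate(n) for n in range(2, 13, 2)}
best_n = max(rates, key=rates.get)

for n, r in rates.items():
    print(f"n = {n:2d}: R_n = {r:.4f}, R_n / R_4 = {r / rates[4]:.4f}")
print("largest replication rate at n =", best_n)
```

With these pairing times the four-base alphabet gives the highest per-base diversity rate; two bases reach 75% of it and six bases fall a few percent short.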
[Figure; labeled value: 1.8267]
A further refined model predicts $1.65 < t_{C,G} / t_{A,T} < 3$.
$R_4$ = the optimal rate
2 Sexes Problem
Sexual reproduction is a process of information exchange.
Reproduction Mean Ratio: $S_n = H_n / E_n$
Assumption:
• Information payoff per crossover base for n sexes: $H_n = \log_2 n$
• 1:1 sex ratio with M members for each sex
• The cost $E_n$ of sexual reproduction in energy and time is inversely proportional to the probability that a reproductive group of n members has exactly one member of each sex
• A reproductive group is formed by random encounter
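One way to make these assumptions concrete is sketched below. The formulas here are assumptions, not taken from the talk: a group of n individuals is assumed to be drawn uniformly without replacement from a population of n sexes with M members each, and the proportionality constant in the cost is set to 1, so $E_n$ is simply the reciprocal of the matching probability.

```python
import math

def reproductive_probability(n, members_per_sex):
    """Chance that a random group of n has exactly one member of each sex.

    Assumed model: uniform draw without replacement from n*M individuals.
    """
    m = members_per_sex
    return m ** n / math.comb(n * m, n)

def large_population_probability(n):
    """Limit of the probability above as M grows: n! / n**n."""
    return math.factorial(n) / n ** n

def payoff_to_cost(n):
    """S_n = H_n / E_n with E_n = 1 / probability (constant assumed to be 1)."""
    return math.log2(n) * large_population_probability(n)

ratios = {n: payoff_to_cost(n) for n in range(1, 7)}
best_n = max(ratios, key=ratios.get)

for n, s in ratios.items():
    print(f"n = {n}: S_n = {s:.4f}")
print("largest payoff-to-cost ratio at n =", best_n)
```

Under this assumed model one sex yields no information exchange, and the chance of assembling a complete group falls off so quickly with n that the logarithmic payoff cannot compensate beyond two sexes.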
[Equations missing; slide labels only]
• Reproductive Probability
• Reproductive Group in k Tries
• Expected Tries for One Reproductive Group
• Expected Tries for One Reproductive Group for Large Population
• Genetic Entropy Exchange without Sexual but Existential Cost
Multiparous Strategy
• Multiparous Entropy
• Multiparous Cost
• Multiparous Entropy to Cost Ratio
• With Mixed (Random & Wedlock) Cost
Baseline: a = 2, n = 4
n     R_n / R_4     Slower by     Evolutionary set-back by
2     < 0.75        > 25%         > 1 billion yrs
6     < 0.98        > 2%          > 80 million yrs
Discussions
Evolutionary Clock Set-back with 3 Sexes: [equation missing]
• Life on Earth could not have evolved faster and had a richer diversity at the same time
• Consistent with the Darwinian theory of survival of the fittest, but at the molecular level
Question: Was the origin of life driven by informational selection?
The Role of Mathematics
• Why is the per-base diversity measured by $H_n = \log_2 n$ or $H(p) = \sum_k p_k \log_2 (1/p_k)$? Because information is additive: $\log_2 \frac{1}{p_1 p_2} = \log_2 \frac{1}{p_1} + \log_2 \frac{1}{p_2}$
• Mathematics is driven by open problems
• Science is driven by existing solutions
• Mathematical modeling is to discover the mathematics to which Nature fits as a solution
• Exception to the rule is the rule in biology
Acknowledgements
• Dr. Reg Garrett, Department of Biology, University of Virginia, regarding the GC transcription elongation problem
• Dr. David Ussery, Center for Biological Sequence Analysis, Technical University of Denmark, on most base frequency data
• Dr. Daniel Smith, Department of Biology, Oregon State University, regarding the base frequencies of P. ubique
• Dr. Tony Joern, Department of Biology, UNL, Kansas State University
• Dr. Etsuko Moriyama, the Beadle Center for Genetics Research, University of Nebraska-Lincoln
• Dr. Hideaki Moriyama, Dr. Xiao-Cheng Zhen, Department of Chemistry, University of Nebraska-Lincoln
• Irakli Loladze, David Logan, Department of Mathematics, UNL
The show of life is on your DNA channel.
We are consumers of reproductive entropy.
* Base frequency for the chromosome 14 which has the largest d.
Viruses take advantage of the replication system by having near-maximal per-base diversity entropy and having their hosts do the replication for them.
To Maximize Stationary Entropy: $H(p) = p_1 \log_2 (1/p_1) + \dots + p_n \log_2 (1/p_n)$
Others have to scramble with individual and absolute channel capacities, i.e.,
Objective: Max. $R(p) = H(p) / T(p)$
Subject to: $p_1 + p_2 + \dots + p_n = 1$, $p_k > 0$
Optimization Result:
• $p_A = p_T$, $p_G = p_C$
• $p_G = p_A^{a}$, $a = t_{G,C} / t_{A,T}$
• $K = \max R(p) = \log_2 (1/p_A) / t_{A,T}$
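The optimization result can be turned into numbers once a pairing-time ratio is chosen. The sketch below makes two assumptions that the slide does not state: the mean time is $T(p) = \sum_k p_k t_k$, and $a = 2$ is used purely as an example. Combining $p_A = p_T$, $p_G = p_C = p_A^a$ with the normalization constraint gives $2 p_A + 2 p_A^a = 1$, which is solved here by bisection.

```python
import math

def optimal_base_distribution(a, t_at=1.0, tol=1e-14):
    """Solve 2*p_A + 2*p_A**a = 1 by bisection; return (p_A, p_G, K)."""
    lo, hi = 0.0, 0.5  # left side minus 1 is negative at 0, positive at 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 2 * mid + 2 * mid ** a - 1 > 0:
            hi = mid
        else:
            lo = mid
    p_a = (lo + hi) / 2
    p_g = p_a ** a
    capacity = math.log2(1 / p_a) / t_at  # K from the optimization result
    return p_a, p_g, capacity

def rate(probs, times):
    """R(p) = H(p) / T(p), assuming T(p) is the probability-weighted mean time."""
    entropy = sum(p * math.log2(1 / p) for p in probs)
    mean_time = sum(p * tk for p, tk in zip(probs, times))
    return entropy / mean_time

p_a, p_g, k = optimal_base_distribution(a=2.0)
times = [1.0, 1.0, 2.0, 2.0]  # A, T, G, C in units of t_{A,T}
r_opt = rate([p_a, p_a, p_g, p_g], times)
r_uniform = rate([0.25] * 4, times)

print(f"p_A = p_T = {p_a:.4f}, p_G = p_C = {p_g:.4f}")
print(f"K = {k:.4f}, R at optimum = {r_opt:.4f}, R at equiprobability = {r_uniform:.4f}")
```

Under these assumptions the rate at the computed distribution matches K and exceeds the equiprobable rate, which is consistent with the optimum favoring the faster-pairing bases.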