This article explores the emergence and evolution of biological codes, such as the genetic code, proteins, and DNA. It discusses how these codes are optimized for efficient information flow and adapt to inherent noise through selection and mutation. Theories on the origin of the genetic code, such as the load minimization hypothesis and frozen accident hypothesis, are also examined.
The Birth of Smooth Biological Codes in a Rough Evolutionary World Shalev Itzkovitz, Guy Shinar, Uri Alon T T
Biological codes are information channels or maps with a natural ‘fitness’ measure. • Codes evolve and are selected according to their fitness or ‘smoothness’. • The emergence of a code is a phase transition in an information channel. • The topology of errors (noise) governs the emergent code.
Biological codes are (often) maps • A biological code is a mapping between two sets of molecules: • Transcription network: proteins → DNA binding sites • Protein-protein recognition: immune system… • Protein synthesis (the genetic code): DNA → proteins
Information flows from DNA to RNA to proteins through the genetic code • DNA: ACGGAGGTACCC (4 letters) • RNA: ACGGAGGUACCC (4 letters) • Protein: Thr Glu Val Pro (20 letters) • The 20 letters are the amino acids. Proteins are amino acid polymers.
Each of the 20 amino acids has a specific chemistry • Amino acid = backbone + specific side group. • Some amino acids are hydrophilic, others hydrophobic, basic, acidic… • The diversity of amino acids allows proteins to perform a wide variety of functions efficiently.
Each of the 20 amino acids is encoded by a triplet of RNA letters • Genetic code = mapping of triplets to amino acids. • $64 = 4^3$ triplet codons encode only 20 amino acids (degeneracy). • Only 48 discernible codons due to the U-C “wobble” at the 3rd base. • Example: ACG → Thr, GAG → Glu, GUA → Val, CCC → Pro.
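The counts on this slide can be reproduced with a short sketch. The table string below is the standard genetic code written in U, C, A, G order with one-letter amino-acid symbols ("*" marks a stop codon). The helper names (`translate`, `wobble_class`) are illustrative and not taken from the slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code; codons enumerated with the first base varying slowest.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(rna):
    """Translate an RNA string codon by codon into one-letter amino-acid symbols."""
    return "".join(CODE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))

def wobble_class(codon):
    """Merge U and C at the third base (the U-C wobble); 'Y' stands for U or C."""
    return codon[:2] + ("Y" if codon[2] in "UC" else codon[2])

n_codons = len(CODE)                                     # 4^3 triplets
n_discernible = len({wobble_class(c) for c in CODE})     # codons after merging U/C
n_amino = len(set(CODE.values()) - {"*"})                # amino acids, stops excluded
peptide = translate("ACGGAGGUACCC")                      # T, E, V, P = Thr, Glu, Val, Pro

print(n_codons, n_discernible, n_amino, peptide)
```

Merging U and C at the third position leaves 4 × 4 × 3 = 48 classes, which is the graph K4 × K4 × K3 used later in the talk.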
The genetic code is smooth, degenerate and compact • Redundancy: only 20 amino acids for 48 discernible codons. • Degeneracy: mostly in the 3rd base. • Close codons are separated by a single letter (Hamming distance = 1). • Smoothness: close codons encode chemically similar amino acids (hydrophobic xUx, hydrophilic xAx). • Compactness: a single contiguous domain for each amino acid. • The code is highly nonrandom (“one in a million” [Haig & Hurst]). • Figure legend: shades, lighter (darker) = low (high) polarity; letters, black (white) = hydrophobic (hydrophilic), yellow = medium [Knight, Freeland, Landweber].
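Two of these claims can be checked directly on the standard code table: that degeneracy sits mostly in the 3rd base, and that xUx codons encode only a small hydrophobic family. This is a minimal sketch. The table string is the standard genetic code in U, C, A, G order, and the helper `neighbors` is a hypothetical name introduced here.

```python
from itertools import product

BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def neighbors(codon, pos):
    """Codons at Hamming distance 1 that differ from `codon` at position `pos`."""
    return [codon[:pos] + b + codon[pos + 1:] for b in BASES if b != codon[pos]]

# Number of single-letter changes (ordered pairs) that keep the encoded symbol,
# counted separately for the 1st, 2nd and 3rd base.
synonymous = [sum(CODE[c] == CODE[n] for c in CODE for n in neighbors(c, pos))
              for pos in range(3)]

# Amino acids encoded by codons with U in the middle position (xUx).
middle_u = {aa for c, aa in CODE.items() if c[1] == "U"}

print(synonymous, sorted(middle_u))
```

The xUx set comes out as Phe, Leu, Ile, Met and Val (F, L, I, M, V), all hydrophobic, and the synonymous count is largest at the third position.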
Biological codes evolve(d) to cope with inherent noise • Messages are written in molecular words that are read and interpreted by other molecules, which calculate the response, etc. • Typical energy scale ~ a few $k_B T$. • Thermal noise → errors. • Information channels adapt to errors through evolutionary selection and mutation. • Some errors (mutations) are essential to evolution …
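The code is an information channel with an average distortion • Channel: amino acid α → encoding U → codon i → misreading W → codon j → decoding V → amino acid β, with distortion $D_{\alpha\beta}$. • $H_{UV} = \sum_{\text{paths}} P_{\alpha i j \beta} D_{\alpha\beta} = \sum_{\alpha,i,j,\beta} P_{\alpha} U_{\alpha i} W_{ij} V_{j\beta} D_{\alpha\beta}$ • U and V are binary matrices that determine the code. • W is the stochastic misreading (noise) matrix.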
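The distortion sum can be evaluated directly as a tensor contraction. The sketch below uses a toy channel with 2 amino acids and 4 codons; all numbers are illustrative assumptions, not values from the talk. It compares a "smooth" decoding, in which codons that are misread into each other encode the same amino acid, with a scrambled one.

```python
import numpy as np

def error_load(P, U, W, V, D):
    """H_UV = sum over alpha, i, j, beta of P[a] U[a,i] W[i,j] V[j,b] D[a,b]."""
    return float(np.einsum("a,ai,ij,jb,ab->", P, U, W, V, D))

eps = 0.1                                   # assumed misreading probability
P = np.array([0.5, 0.5])                    # amino-acid usage
D = np.array([[0.0, 1.0], [1.0, 0.0]])      # chemical distance between amino acids
# Codons 0<->1 and 2<->3 are misread into each other with probability eps.
W = np.array([[1 - eps, eps, 0, 0],
              [eps, 1 - eps, 0, 0],
              [0, 0, 1 - eps, eps],
              [0, 0, eps, 1 - eps]])
# Encoder: amino acid 0 is written as codon 0, amino acid 1 as codon 2.
U = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0]])
# Smooth decoder: neighboring codons decode to the same amino acid.
V_smooth = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
# Scrambled decoder: neighboring codons decode to different amino acids.
V_scrambled = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

H_smooth = error_load(P, U, W, V_smooth, D)
H_scrambled = error_load(P, U, W, V_scrambled, D)
print(H_smooth, H_scrambled)
```

In this toy case the smooth code absorbs every misreading, while the scrambled code pays the full distance on each error, so its error-load equals the misreading probability.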
A fitter code is one with less distortion • The ‘error-load’ H measures the difference between the desired and the reproduced amino acids. • H is a natural measure for the fitness of the code. • For better codes the encoding U and the decoding V are optimized with respect to the reading W. • The decoded amino acids must be diverse enough to map diverse chemical properties. • However, to minimize the impact of errors it is preferable to decode fewer amino acids.
Theories on the origin of the code: frozen accident or optimization? • Load minimization hypothesis: Darwinian dynamics optimize the code to minimize errors in information flow (due to mutations and misreading) [Sonneborn, Zuckerkandl & Pauling… 1965]. • Frozen accident hypothesis: any change in the code affects all the proteins in the cell and would therefore be too harmful. Life began with very few amino acids; new amino acids were added until eventually the code froze in its present form [Crick 1968].
Variant codes - evidence for ongoing optimization of the code • Variants of the “universal” genetic code in many organisms [Osawa, Jukes 1992]. • All variants use the same twenty amino-acids (universal invariant?) • Continuity - Most changes are to a neighboring amino-acid. (‘hydrodynamic’ flow ?)
Codes compete by their error-load • A one-letter change in DNA can change one amino acid in one protein. If the new amino acid is similar to the original, the upset is minimal. • The organism with the smallest error-load takes over the population. • But with a relatively small population and high noise levels in protein synthesis, selection forces are weak compared with random drift (selection « drift).
The code’s evolution reaches a steady state • Small effective population and strong drift. • The population is in detailed balance and therefore P(fitness) ~ exp(fitness/T) [Lässig, Sella & Hirsh]. • A smaller population is hotter: $T \sim 1/N_{\mathrm{eff}}$. • The Boltzmann probability $P_{UV} \sim \exp(-H_{UV}/T)$ minimizes a ‘free energy’ $F = \langle H \rangle - TS = \sum H_{UV} P_{UV} + T \sum P_{UV} \log P_{UV}$. • F is used to optimize information channels …
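The statement that the Boltzmann distribution minimizes the free energy can be checked numerically on a handful of candidate codes. This is a sketch under stated assumptions: the four error-loads and the temperature below are hypothetical values, and `boltzmann` and `free_energy` are names introduced here.

```python
import numpy as np

def boltzmann(H, T):
    """P ~ exp(-H/T), normalized over all candidate codes."""
    weights = np.exp(-(H - H.min()) / T)    # shift by H.min() for numerical safety
    return weights / weights.sum()

def free_energy(P, H, T):
    """F = <H> - T S = sum(H P) + T sum(P log P)."""
    P = np.asarray(P, dtype=float)
    nz = P > 0                               # 0 log 0 is taken as 0
    return float(np.dot(H, P) + T * np.sum(P[nz] * np.log(P[nz])))

H = np.array([0.0, 0.1, 0.3, 0.8])           # hypothetical error-loads of four codes
T = 0.2                                      # hypothetical 'temperature' ~ 1/N_eff

P_star = boltzmann(H, T)
F_star = free_energy(P_star, H, T)

# Compare with 1000 random probability vectors: none should have lower F.
rng = np.random.default_rng(0)
trials = rng.dirichlet(np.ones(len(H)), size=1000)
F_best_trial = min(free_energy(P, H, T) for P in trials)

# At very high T all codes become equally probable.
P_hot = boltzmann(H, 1e6)
print(F_star, F_best_trial, P_hot)
```

The minimum equals $-T \log Z$ with $Z = \sum \exp(-H_{UV}/T)$, and the high-temperature limit reproduces the uniform, non-coding state.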
At high T no code is chosen • At high T (small populations) the Boltzmann distribution implies that all codes are equally probable: $\langle U_{\alpha i} \rangle = 1/N_C$. • The natural order parameter is $u_{\alpha i} = \langle U_{\alpha i} \rangle - 1/N_C$. • At high T the state is random and ‘non-coding’: $u_{\alpha i} = 0$. • The stability of F is determined by w, the preference of the reading, $w = W - 1/N_C$, and by d, the normalized chemical distance matrix: $\delta F \sim u^{t} (T\, I_{\delta} \times I_{w} - w^{2} \times d)\, u$.
A code emerges at a phase transition • When T is decreased below $T_c$, an inhomogeneous coding state appears: $\delta F \sim u^{t} (T\, I_{\delta} \times I_{w} - w^{2} \times d)\, u$. • Critical temperature: $T_c = \lambda_{w}^{2} \times \lambda_{d}$. • The code is the mode $u_{\alpha i}$ of F that corresponds to these maximal eigenvalues. • $T_c$ increases with the accuracy of reading w. • The phase transition is continuous (2nd order). • Analogous phase transitions occur in information channels.
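The stability criterion can be illustrated on a toy channel. The sketch below assumes that the products in the quadratic form are Kronecker (tensor) products between codon space and amino-acid space, and that $\lambda_w^2$ and $\lambda_d$ are the largest eigenvalues of $w^2$ and d. The ring-shaped misreading matrix and the 3 × 3 distance matrix are hypothetical inputs chosen only for illustration.

```python
import numpy as np

def reading_preference(eps, n_codons=4):
    """w = W - 1/N_C for a toy ring of codons, each misread into either
    neighbor with probability eps."""
    eye = np.eye(n_codons)
    ring = np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)
    W = (1 - 2 * eps) * eye + eps * ring
    return W - 1.0 / n_codons

def critical_temperature(w, d):
    """T_c = (largest eigenvalue of w^2) x (largest eigenvalue of d)."""
    return float(np.max(np.linalg.eigvalsh(w) ** 2) * np.max(np.linalg.eigvalsh(d)))

def lowest_mode(T, w, d):
    """Smallest eigenvalue of T*I - kron(w^2, d); negative means F is unstable."""
    n = w.shape[0] * d.shape[0]
    return float(np.linalg.eigvalsh(T * np.eye(n) - np.kron(w @ w, d)).min())

# Hypothetical normalized chemical distances between three amino acids.
d = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.8],
              [0.5, 0.8, 0.0]])
w = reading_preference(eps=0.05)
Tc = critical_temperature(w, d)

above = lowest_mode(1.01 * Tc, w, d)   # non-coding state is stable
below = lowest_mode(0.99 * Tc, w, d)   # a coding mode becomes unstable
Tc_accurate = critical_temperature(reading_preference(eps=0.01), d)
print(Tc, above, below, Tc_accurate)
```

Just above $T_c$ every mode has positive curvature; just below, the mode built from the top eigenvectors of w and d goes soft. A more accurate reading (smaller eps) gives a larger $T_c$.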
Why twenty amino acids? • The code is the mode $u_{\alpha i}$ that minimizes the free energy. • This mode corresponds to the maximal eigenvalue of w. • Knowledge of w at the phase transition yields the code. • What can we say without such knowledge? (Why 20?) • More amino acids: more sensitivity to errors. • Fewer amino acids: reduced functionality of proteins. • Historical mechanisms: freezing, biosynthetic, etc. • Twenty as a topological feature of a generic evolutionary phase transition?
The probable errors define the graph and the topology of the genetic code • Graph = codon vertices + one-letter-difference edges (Hamming = 1): K4 × K4 × K3. • [Figure: the three factors, with bases U, C, A, G at the 1st and 2nd positions and X (U/C), A, G at the 3rd]
Topology and genus of a simpler code • A doublet code with 3 bases is embedded on a torus. • Each codon has d = 4 neighbors. • V = vertices, E = edges, F = faces. • Euler’s characteristic: χ = V - E + F. • Euler genus (# holes): γ = 1 - (1/2) χ. • Faces are quadrilateral mutation cycles: F = V (d/4) = 9; E = V (d/2) = 18. • With V = 9 this gives χ = 0 and γ = 1, a torus.
The genetic code graph is holey K4X K4XK3 • The 48-codon graph : • Each codon has degree d = 3+3+2 = 8 therefore • E = 48 (d/2) = 192 edges • F = 48 (d/4) = 96 faces • The Euler characteristic is χ = V – E + F = -48 and • Euler’s genus is γ = 1 - (1/2) χ = 25 (24 holes + Klein) • Embedding by group Automorphism analysis • Can one hear the shape of The code? K
The genetic code has a spectrum • $u_{\alpha i}$ is the average preference of codon i to encode amino acid α. • Every mode corresponds to an amino acid → number of modes = number of amino acids. • The misreading w is actually the graph Laplacian: $w = -(\Delta - \Delta_{\mathrm{random}})$, where $\Delta_{ij} = -W_{ij}$ for $i \neq j$ and $\Delta_{ii} = \sum_{j \neq i} W_{ij}$. • Δ measures the difference between codons and their neighbors, a natural measure for the error-load. • The maximal mode of w is the 2nd eigenmode of Δ. • Courant’s theorem: the $u_{\alpha i}$ have a single maximum → a single contiguous domain for each amino acid.
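As an illustration of the Laplacian construction, the sketch below builds Δ from a misreading matrix W on the 48-codon graph K4 × K4 × K3 and inspects the low end of its spectrum. It assumes a uniform misreading probability `eps` to each one-letter neighbor, which is a simplification made here and not a statement from the talk.

```python
import numpy as np
from itertools import product

def hamming_adjacency(alphabet_sizes):
    """Adjacency matrix of codons that differ in exactly one letter."""
    verts = list(product(*(range(n) for n in alphabet_sizes)))
    return np.array([[int(sum(x != y for x, y in zip(a, b)) == 1) for b in verts]
                     for a in verts])

def laplacian_from_misreading(W):
    """Delta_ij = -W_ij for i != j, Delta_ii = sum over j != i of W_ij."""
    off_diagonal = W - np.diag(np.diag(W))
    return np.diag(off_diagonal.sum(axis=1)) - off_diagonal

eps = 0.01                                     # assumed misreading probability per neighbor
A = hamming_adjacency((4, 4, 3))
W = eps * A + np.diag(1 - eps * A.sum(axis=1)) # stochastic matrix: rows sum to 1
Delta = laplacian_from_misreading(W)
eigvals = np.linalg.eigvalsh(Delta)            # ascending order

print(eigvals[:4])
```

The lowest eigenvalue is 0 (the uniform, non-coding mode). The next one, 3·eps here, belongs to the slowest-varying non-uniform modes, which are the candidates for the first coding mode.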
Topology optimizes amino-acid assignment in compact domains • The modes $u_{\alpha i}$ have single compact domains with one maximum and one minimum (Courant’s theorem). • Compact organization reduces the impact of errors. • Single domain in any direction (linearity): $\sum_{\alpha} n_{\alpha} u_{\alpha i}$. • Embedding in $R^{N-1}$ is tight → the code graph contains the complete graph $K_N$ [Banchoff 1965, Colin de Verdière 1987]. • Number of amino acids = N = chr(γ).
The coloring number of the code graph is an upper limit for the number of amino acids • What is the minimal number of colors required in a map so that no two adjacent regions have the same color? • The coloring number is a topological invariant and therefore a function of the genus alone. • Heawood’s conjecture [Ringel & Youngs, Appel & Haken].
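The Heawood bound gives the coloring number of a surface of genus γ as $\lfloor (7 + \sqrt{1 + 48\gamma})/2 \rfloor$. A minimal sketch follows, using integer arithmetic to avoid rounding problems; the function name is introduced here for illustration.

```python
from math import isqrt

def heawood_number(genus):
    """Coloring number floor((7 + sqrt(1 + 48*genus)) / 2) of a genus-`genus` surface."""
    return (7 + isqrt(1 + 48 * genus)) // 2

sphere = heawood_number(0)        # four-color theorem
torus = heawood_number(1)         # the doublet code with 3 bases lives on a torus
code_graph = heawood_number(25)   # genus quoted for the 48-codon graph
print(sphere, torus, code_graph)
```

With the genus of 25 quoted for the 48-codon graph, the bound comes out as 20.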
The genetic code coevolves with increasing accuracy of translation • A path for the evolution of codes: from early codes with higher codon degeneracy and fewer amino acids to lower-degeneracy codes with more amino acids. • Preliminary simulations. • Twenty amino acids is invariant even in variant codes; the 21st and 22nd amino acids are context dependent. • [Figure label: K4 × K4]
Summary • The 64-codon triplet code is patterned and degenerate; it maps only 20 amino acids. • The governing evolutionary dynamics is an interplay between protein diversity and error penalty, described by a stochastic diffusion equation. • The 1st excited state of this diffusive mapping dynamics on the high-genus surface of the code yields an ordered pattern of 20 amino acids (20 = the coloring number of the graph). • Topology + dynamics → coloring (?)
The transcription network is a code that relates DNA sites and binding proteins • Reading DNA to synthesize proteins is controlled by a system of protein-DNA interactions (transcription net). • Presence/absence of a transcription factor may repress/enhance synthesis of a protein from a nearby gene. • The transcription network is actually a code that relates proteins with their DNA targets. • Like the genetic code, transcription is subject to evolutionary forces and adapts to minimize errors. • [Figure labels: Pol, TF, DNA]
Probable recognition errors define the binding sequence space • Overlap and continuity; sphere packing (Shannon). • Analogy with the genetic code: transcription factor (TF) ↔ amino acid (AA), binding site ↔ codon. • Typical binding site: 6 base pairs = 12 bits. • Hamming = 1 graph: $K_4^{6}$ → 4096 ‘codons’.
Probable recognition errors define the binding sequence space • Coloring number estimate: $v = 4^{L}$ ($L = 6$), $e \sim 4^{L} (3/2) L$, $f \sim 4^{L} (3/4) L$ → $\gamma \sim 4^{L} (3/8) L$. • The coloring number is chr(γ) ~ 300.
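The estimate can be worked through numerically for L = 6. The sketch assumes the same quadrilateral-face counting used for the genetic code graph (degree d = 3L, E = V d/2, F = V d/4) and applies the Heawood formula. It evaluates both the exact genus from V - E + F and the leading-order estimate $4^{L}(3/8)L$ that drops the vertex term.

```python
from math import isqrt

def heawood_number(genus):
    """Coloring number floor((7 + sqrt(1 + 48*genus)) / 2)."""
    return (7 + isqrt(1 + 48 * genus)) // 2

L = 6                        # binding-site length in base pairs
V = 4 ** L                   # number of possible binding sequences
d = 3 * L                    # one-letter neighbors of each sequence
E = V * d // 2               # = 4^L (3/2) L
F = V * d // 4               # = 4^L (3/4) L, assuming quadrilateral faces
chi = V - E + F
genus_exact = 1 - chi // 2
genus_leading = V * 3 * L // 8           # 4^L (3/8) L, vertex term neglected

chr_exact = heawood_number(genus_exact)
chr_leading = heawood_number(genus_leading)
print(V, E, F, genus_exact, chr_exact, genus_leading, chr_leading)
```

Both versions land at roughly 300 colors, in line with the order-of-magnitude figure on the slide.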
???? • Why does the code exhaust the coloring limit? • Other population dynamics models (‘quasi-species’). • Glassy ‘almost-frozen’ dynamics? • The necessity of the wobble (64/48)? 25 acids? • A generic phase transition scenario that does not depend finely on missing details of the evolutionary pathway. • Although not much is known about the primordial environment, minimal assumptions about the topology of probable errors can yield characteristics of biological codes. • In particular, the number of twenty amino acids in the present picture is reminiscent of a ‘shell magic number’.
Shalev Itzkovitz, Guy Shinar, Uri Alon, Guy Sella, J.-P. Eckmann, Elisha Moses