The Birth of Smooth Biological Codes in a Rough Evolutionary World

Shalev Itzkovitz, Guy Shinar, Uri Alon

Biological codes are information channels or maps with a natural ‘fitness’ measure.
• Codes evolve and are selected according to their fitness or ‘smoothness’.
• The emergence of a code is a phase transition in an information channel.
• The topology of errors (noise) governs the emergent code.

Biological codes are (often) maps
[Figure: the genetic code, DNA → Proteins]
• A biological code is a mapping between two sets of molecules:
• Transcription net: Proteins → DNA binding sites
• Protein-protein recognition: immune system…
• Protein synthesis: DNA → Proteins

Information flows from DNA to RNA to proteins through the genetic code
DNA: ACGGAGGTACCC (4 letters)
RNA: ACGGAGGUACCC (4 letters)
Protein: Thr Glu Val Pro (20 letters)
• The 20 letters are the amino acids.
• Proteins are amino acid polymers.
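
The DNA → RNA → protein example on this slide can be traced in a few lines. This is only a sketch: the table `SLIDE_CODONS` holds just the four codons shown on the slide (with their standard genetic code assignments), and the helper names `transcribe` and `translate` are illustrative.

```python
# Codon table restricted to the four codons shown on the slide;
# a full table would have 64 entries.
SLIDE_CODONS = {"ACG": "Thr", "GAG": "Glu", "GUA": "Val", "CCC": "Pro"}

def transcribe(dna: str) -> str:
    """Copy the DNA coding strand into RNA by replacing T with U."""
    return dna.replace("T", "U")

def translate(rna: str, table: dict) -> list:
    """Read the RNA in non-overlapping triplets and look each one up."""
    usable = len(rna) - len(rna) % 3          # ignore an incomplete trailing codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [table[c] for c in codons]

rna = transcribe("ACGGAGGTACCC")
protein = translate(rna, SLIDE_CODONS)
print(rna)      # ACGGAGGUACCC
print(protein)  # ['Thr', 'Glu', 'Val', 'Pro']
```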

Each of the 20 amino acids has specific chemistry
• Amino acid = backbone + specific side group.
• Some amino acids are hydrophilic, others hydrophobic, basic, acidic…
• The diversity of amino acids allows proteins to perform a wide variety of functions efficiently.

Each of the 20 amino acids is encoded by a triplet of RNA letters
• Genetic code = mapping of triplets to amino acids.
• 64 = 4^3 triplet codons encode only 20 amino acids (degeneracy).
• Only 48 discernible codons, due to the U-C “wobble” at the 3rd base.
Example: ACG → Thr, GAG → Glu, GUA → Val, CCC → Pro
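
The two codon counts on this slide follow from simple enumeration. The sketch below assumes the wobble can be modeled by merging U and C at the third base into a single symbol (written `Y` here, a label chosen for illustration).

```python
from itertools import product

BASES = "UCAG"

# All triplets over a 4-letter alphabet: 4**3 = 64 codons.
codons = ["".join(t) for t in product(BASES, repeat=3)]

def wobble_class(codon: str) -> str:
    """Merge U and C at the third base into one symbol 'Y' (assumed wobble model)."""
    third = "Y" if codon[2] in "UC" else codon[2]
    return codon[:2] + third

discernible = {wobble_class(c) for c in codons}
print(len(codons), len(discernible))  # 64 48
```

The count 48 = 4 × 4 × 3 is what later motivates describing the codon graph as K4 × K4 × K3.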

The genetic code is smooth, degenerate and compact
• Redundancy – only 20 of 48.
• Degeneracy – mostly in the 3rd base.
• Close codons are separated by a single letter (Hamming distance = 1).
• Smoothness – close codons encode chemically similar amino acids (hydrophobic xUx, hydrophilic xAx).
• Compactness – a single contiguous domain for each amino-acid.
• The code is highly nonrandom (“one in a million” [Haig & Hurst]).
[Figure: codon table. Shades: lighter (darker) = low (high) polarity. Letters: black (white) = hydrophobic (hydrophilic), yellow = medium. [Knight, Freeland, Landweber]]

Biological codes evolve(d) to cope with inherent noise
• Messages are written in molecular words that are read and interpreted by other molecules, which calculate the response, etc.
• Typical energy scale ~ a few kBT.
• Thermal noise → errors.
• Information channels adapt to errors through evolutionary selection and mutation.
• Some errors (mutations) are essential to evolution…

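The code is an information channel with an average distortion
[Figure: channel diagram, α → i (encoding U), i → j (misreading W), j → β (decoding V)]
• Distortion: $H_{UV} = \sum_{\text{paths}} P_{\alpha i j \beta}\, D_{\alpha\beta} = \sum_{\alpha,i,j,\beta} P_{\alpha}\, U_{\alpha i}\, W_{ij}\, V_{j\beta}\, D_{\alpha\beta}$
• U and V are binary matrices that determine the code.
• W is the stochastic misreading (noise) matrix.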
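
The distortion sum can be evaluated directly for a toy channel. The sketch below is illustrative only: the four-codon ring, the misreading rate `eps`, the two amino acids, the distance matrix `D` and the weights `P` are all assumptions, not values from the talk. `P` is chosen so that the total path weight sums to one.

```python
def error_load(P, U, W, V, D):
    """H = sum over a, i, j, b of P[a] * U[a][i] * W[i][j] * V[j][b] * D[a][b]."""
    n_aa, n_codons = len(U), len(W)
    total = 0.0
    for a in range(n_aa):
        for i in range(n_codons):
            for j in range(n_codons):
                for b in range(n_aa):
                    total += P[a] * U[a][i] * W[i][j] * V[j][b] * D[a][b]
    return total

def ring_noise(n, eps):
    """Toy misreading matrix: codon i is read as i+1 or i-1 (mod n) with probability eps each."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 1.0 - 2 * eps
        W[i][(i + 1) % n] = eps
        W[i][(i - 1) % n] = eps
    return W

def code_from_assignment(assignment, n_aa):
    """Build binary U (amino acid -> codons) and V (codon -> amino acid) from a codon labelling."""
    U = [[1 if aa == a else 0 for aa in assignment] for a in range(n_aa)]
    V = [[1 if aa == b else 0 for b in range(n_aa)] for aa in assignment]
    return U, V

P = [0.25, 0.25]              # hypothetical weights: two codons per amino acid
D = [[0, 1], [1, 0]]          # hypothetical chemical distance
W = ring_noise(4, 0.05)

U_s, V_s = code_from_assignment([0, 0, 1, 1], 2)   # contiguous ("smooth") code
U_r, V_r = code_from_assignment([0, 1, 0, 1], 2)   # alternating ("rough") code
H_smooth = error_load(P, U_s, W, V_s, D)
H_rough = error_load(P, U_r, W, V_r, D)
print(round(H_smooth, 3), round(H_rough, 3))  # 0.05 0.1
```

In this toy, grouping synonymous codons next to each other halves the error-load, because half of the misreadings stay within the same amino acid.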

A fitter code is one with less distortion
• The ‘error-load’ H measures the difference between the desired and the reproduced amino-acids.
• H is a natural measure of the fitness of the code.
• In better codes, the encoding U and the decoding V are optimized with respect to the reading W.
• The decoded amino-acids must be diverse enough to map diverse chemical properties.
• However, to minimize the impact of errors it is preferable to decode fewer amino-acids.

Theories on the origin of the code: frozen accident or optimization?
• Load minimization hypothesis: Darwinian dynamics optimize the code to minimize errors in information flow (due to mutations, misreading). [Sonneborn, Zuckerkandl & Pauling… 1965]
• Frozen accident hypothesis: any change in the code affects all the proteins in the cell and would therefore be too harmful. Life began with very few amino-acids; new amino-acids were added until eventually the code became frozen in its present form. [Crick 1968]

Variant codes: evidence for ongoing optimization of the code
• Variants of the “universal” genetic code exist in many organisms [Osawa, Jukes 1992].
• All variants use the same twenty amino-acids (universal invariant?).
• Continuity: most changes are to a neighboring amino-acid (‘hydrodynamic’ flow?).

Biological codes are information channels or maps with a natural ‘fitness’ measure.
• Codes evolve and are selected according to their fitness.
• The emergence of a code is a phase transition in an information channel.
• The topology of errors (noise) governs the emergent code.

Codes compete by their error-load
• A one-letter change in DNA can change one amino acid in one protein. If the new amino acid is similar to the original, the upset is minimal.
• The organism with the smallest error-load takes over the population.
• Relatively small population and high noise levels in protein synthesis: weak selection forces ≪ random drift.

Code evolution reaches a steady state
• Small effective population and strong drift.
• The population is in detailed balance and therefore P(fitness) ~ exp(fitness/T) [Lassig, Sella & Hirsh].
• A smaller population is hotter: $T \sim 1/N_{\mathrm{eff}}$.
• The Boltzmann probability $P_{UV} \sim \exp(-H_{UV}/T)$ minimizes a ‘free energy’ $F = \langle H \rangle - TS = \sum H_{UV} P_{UV} + T \sum P_{UV} \log P_{UV}$.
• F is used to optimize information channels…
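
That the Boltzmann distribution minimizes F = ⟨H⟩ − TS can be checked numerically on a small example. The three error-loads in `H` and the temperature `T` below are hypothetical values chosen for illustration.

```python
import math

def boltzmann(H, T):
    """Boltzmann weights P_c proportional to exp(-H_c / T), normalized to sum to 1."""
    weights = [math.exp(-h / T) for h in H]
    Z = sum(weights)
    return [w / Z for w in weights]

def free_energy(P, H, T):
    """F = sum_c H_c P_c + T * sum_c P_c log P_c (terms with P_c = 0 contribute 0)."""
    energy = sum(h * p for h, p in zip(H, P))
    neg_entropy = sum(p * math.log(p) for p in P if p > 0)
    return energy + T * neg_entropy

H = [0.05, 0.10, 0.20]   # hypothetical error-loads of three candidate codes
T = 0.05                 # hypothetical temperature (T ~ 1/N_eff on the slide)

P_star = boltzmann(H, T)
F_star = free_energy(P_star, H, T)

# Two alternative distributions for comparison.
uniform = [1 / 3] * 3        # "no code chosen"
greedy = [1.0, 0.0, 0.0]     # all weight on the lowest error-load
print(F_star <= free_energy(uniform, H, T))  # True
print(F_star <= free_energy(greedy, H, T))   # True
```

At the minimum the free energy equals −T log Z, where Z is the normalization sum inside `boltzmann`.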

At high T no code is chosen
• At high T (small populations) Boltzmann implies that all codes are equally probable: $\langle U_{\alpha i} \rangle = 1/N_C$.
• The natural order parameter is $u_{\alpha i} = \langle U_{\alpha i} \rangle - 1/N_C$.
• At high T the state is random, ‘non-coding’: $u_{\alpha i} = 0$.
• Stability of F is determined by $\delta F \sim u^{t}\,(T\, I_{\delta} \times I_{w} - w^{2} \times d)\,u$, where w is the preference of the reading, $w = W - 1/N_C$, and d is the normalized chemical distance matrix.

Code emerges at a phase transition
• When T is decreased below $T_c$, an inhomogeneous coding state appears: $\delta F \sim u^{t}\,(T\, I_{\delta} \times I_{w} - w^{2} \times d)\,u$.
• Critical temperature: $T_c = \lambda_w^{2} \times \lambda_d$.
• The code is the mode $u_{\alpha i}$ of F that corresponds to these maximal eigenvalues.
• $T_c$ increases with the accuracy of reading w.
• The phase transition is continuous (2nd order).
• Analogous phase transition in information channels.
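
The step from the stability form to the critical temperature can be spelled out as follows. This is a sketch under assumptions not stated on the slide: the "×" between matrices is read as a Kronecker (tensor) product, the first term is read as T times the identity on the joint space, and w and d are taken to be symmetric with eigenpairs $(\lambda_w, e_w)$ and $(\lambda_d, e_d)$.

```latex
% Assumptions: "\times" between matrices is a Kronecker product;
% w e_w = \lambda_w e_w and d e_d = \lambda_d e_d with w, d symmetric.
\begin{aligned}
u &= e_w \otimes e_d, \\
\delta F &\sim u^{t}\left(T\,I - w^{2}\otimes d\right)u
          = \left(T - \lambda_w^{2}\,\lambda_d\right)\lVert u\rVert^{2}, \\
\delta F &< 0 \iff T < \lambda_w^{2}\,\lambda_d .
\end{aligned}
```

Lowering T from the non-coding state, the first mode to become unstable is the one with the largest product, which gives $T_c = \lambda_w^{2}\,\lambda_d$ evaluated at the maximal eigenvalues.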

Why twenty amino-acids?
• The code is the mode $u_{\alpha i}$ that minimizes the free energy.
• This mode corresponds to the maximal w-eigenvalue.
• Knowledge of w at the phase transition yields the code.
• What can we say without such knowledge? (Why 20?)
• More amino-acids: more sensitivity to errors.
• Fewer amino-acids reduce the functionality of proteins.
• Historical mechanisms: freezing, biosynthetic, etc.
• Twenty as a topological feature of a generic evolutionary phase transition?

The probable errors define the graph and the topology of the genetic code
[Figure: codon graph K4 × K4 × K3, with 1st base {U, C, A, G}, 2nd base {U, C, A, G}, 3rd base {UC, A, G}]
• Graph = codon vertices + one-letter-difference edges (Hamming distance = 1).

Topology and genus of a simpler code
[Figure: doublet code graph K3 × K3, bases {U, C, A} at each position]
• A doublet code with 3 bases is embedded on a torus.
• Each codon has d = 4 neighbors.
• V = vertices, E = edges, F = faces.
• Euler characteristic: χ = V – E + F.
• Euler genus (# holes): γ = 1 - (1/2) χ.
• Faces are quadrilateral mutation cycles: F = V (d/4) = 9; E = V (d/2) = 18.
• Hence χ = 9 – 18 + 9 = 0 and γ = 1, the torus.

The genetic code graph is holey
• The 48-codon graph: K4 × K4 × K3.
• Each codon has degree d = 3 + 3 + 2 = 8, therefore
• E = 48 (d/2) = 192 edges
• F = 48 (d/4) = 96 faces
• The Euler characteristic is χ = V – E + F = -48 and
• the Euler genus is γ = 1 - (1/2) χ = 25 (24 holes + Klein).
• Embedding by group automorphism analysis.
• Can one hear the shape of the code?
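
The vertex, edge and face counts for both the doublet code and the 48-codon graph can be reproduced by building the graphs explicitly. This sketch takes one thing on trust from the slides: that every face is a quadrilateral, so F = V·d/4. The function names are illustrative.

```python
from itertools import product

def hamming_graph(alphabet_sizes):
    """Vertices are words; edges join words that differ in exactly one position."""
    vertices = list(product(*[range(n) for n in alphabet_sizes]))
    edges = set()
    for v in vertices:
        for pos, n in enumerate(alphabet_sizes):
            for letter in range(n):
                if letter != v[pos]:
                    w = v[:pos] + (letter,) + v[pos + 1:]
                    edges.add(frozenset((v, w)))
    return vertices, edges

def euler_data(alphabet_sizes):
    """Return (V, d, E, F, chi, gamma) for the Hamming graph of the given word shape."""
    vertices, edges = hamming_graph(alphabet_sizes)
    V, E = len(vertices), len(edges)
    d = 2 * E // V            # the graph is regular, so degree = 2E / V
    F = V * d // 4            # assumes all faces are quadrilaterals, as on the slides
    chi = V - E + F
    gamma = (2 - chi) // 2    # gamma = 1 - chi/2; chi is even in both examples
    return V, d, E, F, chi, gamma

print(euler_data((3, 3)))     # (9, 4, 18, 9, 0, 1)
print(euler_data((4, 4, 3)))  # (48, 8, 192, 96, -48, 25)
```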

The genetic code has a spectrum
• $u_{\alpha i}$ is the average preference of codon i to encode α.
• Every mode corresponds to an amino-acid → number of modes = number of amino-acids.
• Misreading w is actually the graph Laplacian: $w = -(\Delta - \Delta_{\text{random}})$, where $\Delta_{ij} = -W_{ij}$ and $\Delta_{ii} = \sum_{j \neq i} W_{ij}$.
• Δ measures the difference between codons and their neighbors, a natural measure for error load.
• The maximal mode of w is the 2nd eigenmode of Δ.
• Courant’s theorem: $u_{\alpha i}$ have a single maximum → single contiguous domain for each amino-acid.
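
The Laplacian defined on this slide can be applied to explicit trial modes on the 48-codon graph. The sketch below is a toy check, assuming uniform misreading rates W_ij = 1 between codons at Hamming distance 1 (the slides do not specify W). Under that assumption the smallest nonzero eigenvalue of Δ is 3, carried by modes that vary only with the third base.

```python
from itertools import product

SIZES = (4, 4, 3)   # letters per position of the 48-codon graph K4 x K4 x K3
CODONS = list(product(*[range(n) for n in SIZES]))

def neighbors(c):
    """Codons differing from c in exactly one position (Hamming distance 1)."""
    for pos, n in enumerate(SIZES):
        for letter in range(n):
            if letter != c[pos]:
                yield c[:pos] + (letter,) + c[pos + 1:]

def apply_laplacian(u):
    """(Delta u)_i = sum_j W_ij (u_i - u_j), with W_ij = 1 on neighbours (assumed uniform)."""
    return {c: sum(u[c] - u[n] for n in neighbors(c)) for c in CODONS}

def eigenvalue_if_eigenvector(u):
    """Return lam if Delta u == lam * u on every codon, else None."""
    Lu = apply_laplacian(u)
    ratios = {Lu[c] / u[c] for c in CODONS if u[c] != 0}
    zeros_ok = all(Lu[c] == 0 for c in CODONS if u[c] == 0)
    if len(ratios) == 1 and zeros_ok:
        return ratios.pop()
    return None

third = {c: (1, -1, 0)[c[2]] for c in CODONS}      # varies only with the 3rd base
first = {c: (1, -1, 0, 0)[c[0]] for c in CODONS}   # varies only with the 1st base
print(eigenvalue_if_eigenvector(third), eigenvalue_if_eigenvector(first))  # 3.0 4.0
```

With non-uniform W the eigenvalues shift, so this only illustrates how the graph structure fixes the shape of the low modes.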

Topology optimizes amino-acid assignment into compact domains
• The modes $u_{\alpha i}$ have single compact domains with one maximum and one minimum (Courant’s theorem).
• Compact organization reduces the impact of errors.
• Single domain in any direction (linearity): $\sum_{\alpha} n_{\alpha} u_{\alpha i}$.
• The embedding in $\mathbb{R}^{N-1}$ is tight → the code graph contains the complete graph $K_N$ [Banchoff 1965, Colin de Verdière 1987].
• Number of amino-acids = N = chr(γ).

The coloring number of the code graph is an upper limit for the number of amino-acids
• What is the minimal number of colors required in a map so that no two adjacent regions have the same color?
• The coloring number is a topological invariant and therefore a function of the genus solely.
• Heawood’s conjecture [Ringel & Youngs, Appel & Haken]
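
The coloring number as a function of genus can be evaluated with the Heawood formula. The sketch below assumes chr(γ) is the standard Heawood bound ⌊(7 + √(49 − 24χ))/2⌋, rewritten with the slides' γ = 1 − χ/2 so that 49 − 24χ = 1 + 48γ. The function name is illustrative.

```python
import math

def heawood_number(gamma: int) -> int:
    """floor((7 + sqrt(1 + 48*gamma)) / 2), with gamma = 1 - chi/2 as on the slides."""
    # isqrt keeps the arithmetic exact; flooring the root first gives the same result.
    return (7 + math.isqrt(1 + 48 * gamma)) // 2

for gamma in (0, 1, 25):
    print(gamma, heawood_number(gamma))
# 0 4    (sphere: four-colour theorem)
# 1 7    (torus)
# 25 20  (surface of the 48-codon graph)
```

The bound is attained on every surface except the Klein bottle, which needs only 6 colors.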

The genetic code coevolves with increasing accuracy of translation
[Figure label: K4 × K4]
• A path for the evolution of codes: from early codes with higher codon degeneracy and fewer amino acids to lower-degeneracy codes with more amino acids.
• Preliminary simulations.
• Twenty amino acids is invariant even in variant codes. The 21st and 22nd amino acids are context dependent.

Summary
• The 64-codon triplet code is patterned and degenerate; it maps only 20 amino acids.
• The governing evolutionary dynamics is an interplay between protein diversity and error penalty, described by a stochastic diffusion equation.
• The 1st excited state of this diffusive mapping dynamics on the high-genus surface of the code yields a pattern of 20 ordered amino acids (20 = the coloring number of the graph).
• Topology + dynamics → Coloring (?)

The transcription network is a code that relates DNA sites and binding proteins
[Figure labels: Pol, TF, DNA]
• Reading DNA to synthesize proteins is controlled by a system of protein-DNA interactions (transcription net).
• The presence or absence of a transcription factor may repress or enhance synthesis of a protein from a nearby gene.
• The transcription network is actually a code that relates proteins with their DNA targets.
• Like the genetic code, transcription is subject to evolutionary forces and adapts to minimize errors.

Probable recognition errors define the binding sequence space
• Overlap and continuity; sphere packing (Shannon).
• Analogy with the genetic code: TF ↔ AA, codon ↔ binding site.
• Typical binding site: 6 base pairs = 12 bits.
• Hamming distance = 1 edges: $K_4^{6}$ → 4096 ‘codons’.

Probable recognition errors define the binding sequence space
• Coloring number estimate: $v = 4^{L}$ (L = 6), $e \sim 4^{L}(3/2)L$, $f \sim 4^{L}(3/4)L$ → $\gamma \sim 4^{L}(3/8)L$.
• The coloring number chr(γ) ~ 300.
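
The estimate chr(γ) ~ 300 can be reproduced from the counts on this slide. The sketch below assumes quadrilateral faces and the Heawood form of chr(γ), as for the genetic code graph, and keeps the lower-order terms that the slide's asymptotic γ drops. The function name is illustrative.

```python
import math

def binding_site_coloring_estimate(L: int):
    """Return (v, e, f, gamma, coloring bound) for binding sites of length L."""
    v = 4 ** L            # all DNA words of length L
    d = 3 * L             # each word has 3 one-letter variants per position
    e = v * d // 2        # every edge is shared by two vertices
    f = v * d // 4        # assumes quadrilateral faces
    chi = v - e + f
    gamma = (2 - chi) // 2                                   # gamma = 1 - chi/2 (chi is even for L = 6)
    chromatic_bound = (7 + math.isqrt(1 + 48 * gamma)) // 2  # Heawood bound (assumed form of chr)
    return v, e, f, gamma, chromatic_bound

print(binding_site_coloring_estimate(6))  # (4096, 36864, 18432, 7169, 296)
```

The exact γ = 7169 is smaller than the asymptotic 4^L(3/8)L = 9216, but both give a coloring bound of order 300.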

????
• Why does the code exhaust the coloring limit?
• Other population dynamics models (‘quasi-species’).
• Glassy ‘almost-frozen’ dynamics?
• The necessity of the wobble (64/48)? 25 acids?
• A generic phase transition scenario that does not depend finely on missing details of the evolutionary pathway.
• Although not much is known about the primordial environment, minimal assumptions about the topology of probable errors can yield characteristics of biological codes.
• In particular, the number of twenty amino-acids in the present picture is reminiscent of a ‘shell magic number’.

Shalev Itzkovitz, Guy Shinar, Uri Alon, Guy Sella, J.-P. Eckmann, Elisha Moses
