Quantifying contributions of mutations and homologous recombination to E. coli genomic diversity

Sergei Maslov, Department of Biosciences, Brookhaven National Laboratory, New York

Bacterial genome evolution happens in cooperation with phages

Variation between E. coli strains • Comparison of B vs K-12 strains of E. coli: FW Studier, P Daegelen, RE Lenski, S Maslov, JF Kim, JMB (2009) • Pan-genome of E. coli: M Touchon et al., PLoS Genetics (2009) • Copy and Insert • Copy and Replace

Usual suspects are there but do not explain heterogeneity • Negative correlation with protein abundance: 2.5% of variation, P-value = 10^-5 • Positive correlation with distance from origin of replication: 0.4% of variation, P-value = 10^-2

High SNP numbers are clustered along the chromosome

Clonal Recombined

P. Dixit, T. Y. Pang, F. W. Studier, S. Maslov, PNAS, submitted (2013)

Clonal regions vs. recombined regions • Ratio of SNPs introduced by recombination to SNPs introduced by clonal mutations: r/μ = 6 ± 1

Strains: K-12 vs. ETEC-H10407, HS, O157-H7-Sakai • Neutral model: mutations and recombinations among 70 “genes”, population of 10^4; C. Fraser et al. (2007) and (2009)
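A minimal sketch of a neutral model of this general kind, assuming a Wright-Fisher population in which each genome is a list of allele labels (one per "gene"). It is not the implementation used by Fraser et al. or by the authors: the per-generation mutation and recombination probabilities, the exponential acceptance rule and its `divergence_scale` parameter, and the small default population size (chosen so that pure Python runs quickly) are all illustrative assumptions.

```python
import math
import random


def pairwise_divergence(genome_a, genome_b):
    """Fraction of loci at which two genomes carry different alleles."""
    return sum(x != y for x, y in zip(genome_a, genome_b)) / len(genome_a)


def simulate_neutral_model(pop_size=200, n_loci=70, generations=100,
                           mutation_prob=0.01, recombination_prob=0.05,
                           divergence_scale=0.2, seed=0):
    """Toy neutral model; all parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    population = [[0] * n_loci for _ in range(pop_size)]
    next_allele = 1
    for _ in range(generations):
        # Wright-Fisher resampling: every offspring copies a random parent.
        population = [list(rng.choice(population)) for _ in range(pop_size)]
        for genome in population:
            # Mutation: one random locus receives a brand-new allele.
            if rng.random() < mutation_prob:
                genome[rng.randrange(n_loci)] = next_allele
                next_allele += 1
            # Recombination: copy one locus from a random donor; acceptance
            # decays exponentially with donor-recipient divergence (assumed rule).
            if rng.random() < recombination_prob:
                donor = rng.choice(population)
                divergence = pairwise_divergence(genome, donor)
                if rng.random() < math.exp(-divergence / divergence_scale):
                    locus = rng.randrange(n_loci)
                    genome[locus] = donor[locus]
    return population


population = simulate_neutral_model(pop_size=50, generations=30)
print(pairwise_divergence(population[0], population[1]))
```

Tracking allele labels instead of full sequences keeps the sketch short; a closer analogue of the published model would track per-gene sequence divergence.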

Phase transition at Δc = 1.5%


Why an exponential tail? • Time to coalescence: Prob(t) = (1/N_e)(1 − 1/N_e)^(t−1) ≈ (1/N_e) exp(−t/N_e) → exponential slope = 1/(2μN_e), or 1/θ • Population size N_e = (1 ± 0.1) × 10^9, consistent with earlier estimates
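The coalescence argument can be checked numerically. The sketch below compares the exact geometric probability with its exponential approximation, and computes the slope 1/(2μNe) that follows if the divergence between two lineages that coalesced t generations ago is taken to be 2μt. The function names and the numerical values are illustrative assumptions, not estimates from the talk.

```python
import math


def coalescence_prob(t, n_e):
    """Exact probability that two lineages first coalesce t generations back."""
    return (1.0 / n_e) * (1.0 - 1.0 / n_e) ** (t - 1)


def coalescence_prob_exp(t, n_e):
    """Exponential approximation, valid for large n_e."""
    return math.exp(-t / n_e) / n_e


def exponential_slope(mu, n_e):
    """Decay rate of P(divergence) when divergence = 2 * mu * t, i.e. 1/theta."""
    return 1.0 / (2.0 * mu * n_e)


# Illustrative values only.
n_e = 1000
for t in (1, 500, 2000):
    print(t, coalescence_prob(t, n_e), coalescence_prob_exp(t, n_e))
print(exponential_slope(mu=1e-3, n_e=n_e))
```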

Why N_e << N? • Phages: • But there are phages that cross species boundaries • Also, the slope is similar for different species • Restriction-modification systems: • Recombined segments are not continuous [Milkman R, Bridges MM. Genetics 1990] • Recombination efficiency: • Need 20-30 identical bases to start recombination • Our slope predicts 60 bases, which roughly matches 30 at the beginning and 30 at the end • Species are defined by recombination

Are our 30+ strains a representative sample? • Fully sequenced genomes: • 1000s of genes (unbiased and complete) • 10s of strains (biased) • MLST data: • 10s of genes (biased) • 1000s of strains (unbiased, I hope) • Database: http://mlst.ucc.ie • ∼3000 E. coli strains • 7 short regions of ~500 base pairs each in housekeeping genes

Legend: MLST (•), genomes (--)

Is it really phages? 1 kb: gene length K-12 to B comparison Phage capacity: 20 kb Other strains up to 40 kb

Does the neutral model explain everything? • At 3 standard deviations: • 19 1-kb regions are supervariable • 29 1-kb regions are superconserved

Collaborators & funding • Bill Studier (BNL) • Purushottam Dixit (BNL) • Tin Yau Pang (Stony Brook) • Rich Lenski (Michigan State) • Patrick Daegelen (France) • Jinhyun Kim (Korea) • DOE Systems Biology Knowledgebase (KBase) • Adam Arkin (Berkeley) • Rick Stevens (Argonne) • Bob Cottingham (Oak Ridge) • Mark Gerstein (Yale) • Doreen Ware (Cold Spring Harbor) • Mike Schatz (Cold Spring Harbor) • Dave Weston (ORNL) • 60+ other collaborators

Thank you!

Genes encoded in bacterial genomes ~ Packages installed on Linux computers

Complex systems have many components • Genes (Bacteria) • Software packages (Linux OS) • Components do not work alone: they need to be assembled to work • In individual systems only a subset of components is used • Genome (Bacteria) – bag of genes • Computer (Linux OS) – installed packages • Components have vastly different frequencies of use

IKEA: has many components Justin Pollard, http://www.designboom.com

They need to be assembled to work Justin Pollard, http://www.designboom.com

Different frequencies of use: common vs. rare

What determines the frequency of use? • Popularity, AKA preferential attachment: • Frequency ~ self-amplifying popularity • Relevant for social systems: WWW links, Facebook friendships, scientific citations • Functional role: • Frequency ~ breadth or importance of the functional role • Relevant for biological and technological systems, where selection adjusts undeserved popularity

Empirical data on component frequencies • Bacterial genomes (eggnog.embl.de): • 500 sequenced prokaryotic genomes • 44,000 orthologous gene families • Linux packages (popcon.ubuntu.com): • 200,000 Linux packages installed on 2,000,000 individual computers • Binary tables: component is either present or not in a given system
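A small sketch of how a component's frequency can be read off such a binary presence/absence table. The table is stored here as a mapping from system to the set of components it contains; the genome and gene names are hypothetical placeholders, not data from eggNOG or popcon.

```python
def component_frequencies(systems):
    """Return, for each component, the fraction of systems that contain it.

    `systems` maps a system name (genome or computer) to the set of
    components (gene families or packages) present in it.
    """
    counts = {}
    for components in systems.values():
        for component in components:
            counts[component] = counts.get(component, 0) + 1
    n_systems = len(systems)
    return {component: k / n_systems for component, k in counts.items()}


# Hypothetical toy table: four genomes, three gene families.
toy_table = {
    "genome_A": {"rpsA", "lacZ"},
    "genome_B": {"rpsA"},
    "genome_C": {"rpsA", "lacZ", "orfX"},
    "genome_D": {"rpsA"},
}
frequencies = component_frequencies(toy_table)
print(sorted(frequencies.items(), key=lambda item: -item[1]))
# rpsA is present everywhere (f = 1.0), lacZ in half (0.5), orfX in one (0.25)
```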

Frequency distributions • Categories: ORFans, Cloud, Shell, Core • P(f) ~ f^-1.5, except for the top √N “universal” components with f ~ 1

How to quantify functional importance? • Components do not work alone • Breadth/importance ~ component is needed for proper functioning of other components • Dependency network: • A → B means A depends on B for its function • Formalized for Linux software packages • For metabolic enzymes, given by upstream-downstream positions in pathways • Frequency ~ dependency degree, K_dep • K_dep = the total number of components that directly or indirectly depend on the selected one
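The dependency degree defined above can be computed by reversing the dependency edges and counting everything reachable from a component. This is a minimal sketch under that reading of the definition; the package names in the toy example are hypothetical.

```python
from collections import deque


def dependency_degrees(depends_on):
    """Return Kdep for every component.

    `depends_on` maps each component A to the set of components B that A
    directly depends on (edges A -> B). Kdep of B is the number of components
    that depend on B directly or indirectly.
    """
    # Reverse the edges: for each B, who depends on it directly?
    dependents = {component: set() for component in depends_on}
    for a, targets in depends_on.items():
        for b in targets:
            dependents.setdefault(b, set()).add(a)

    k_dep = {}
    for start in dependents:
        seen = set()
        queue = deque([start])
        while queue:  # breadth-first search over reversed edges
            node = queue.popleft()
            for dependent in dependents.get(node, ()):
                if dependent not in seen and dependent != start:
                    seen.add(dependent)
                    queue.append(dependent)
        k_dep[start] = len(seen)
    return k_dep


# Hypothetical toy network: app -> libgui -> libc, and app -> libc.
toy_dependencies = {"app": {"libgui", "libc"}, "libgui": {"libc"}, "libc": set()}
print(dependency_degrees(toy_dependencies))
# libc has Kdep = 2, libgui has Kdep = 1, app has Kdep = 0
```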

Frequency is positively correlated with functional importance • Correlation coefficient ~0.4 for both Linux and genes • Could be improved by using weighted dependency degree

Tree-like metabolic network • TCA cycle • Example nodes: K_dep = 15, K_dep = 5

Dependency degree distribution on a critical branching tree • P(K) ~ K^-1.5 for a critical branching tree • Paradox: K_max^-0.5 ~ 1/N → K_max = N^2 > N • Answer: parent tree size imposes a cutoff: there will be √N “core” nodes with K_max = N • These are present in almost all systems (ribosomal genes or core metabolic enzymes) • Need a new model: in a tree D = 1, while in real systems D ~ 2 > 1

Dependency network evolution • New components added gradually over time • New component depends on D existing components selected randomly • K_dep(t) ~ (t/N)^-D • P(K_dep(t) > K) = P(t/N < K^(-1/D)) = K^(-1/D) • P(K_dep) = K_dep^-(1+1/D) = K_dep^-1.5 for D = 2 • N_universal = N^((D-1)/D) = N^0.5 for D = 2
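A rough simulation sketch of this growth rule: components are added one at a time, and each new one depends on D components chosen uniformly at random among those already present. The function names and the network size are illustrative assumptions, and the code makes no claim to reproduce the figures in the talk; it only shows how Kdep can be measured in such a model.

```python
import random


def grow_dependency_network(n_components, d=2, seed=0):
    """Component i depends on min(d, i) distinct, randomly chosen older components."""
    rng = random.Random(seed)
    depends_on = [[]]  # the first component has nothing to depend on
    for new in range(1, n_components):
        depends_on.append(rng.sample(range(new), min(d, new)))
    return depends_on


def dependency_degrees(depends_on):
    """Kdep[i] = number of components that directly or indirectly depend on i."""
    closure = []  # closure[j] = everything component j depends on
    k_dep = [0] * len(depends_on)
    for deps in depends_on:
        reachable = set(deps)
        for dep in deps:
            reachable |= closure[dep]  # dep is older, so its closure exists
        closure.append(reachable)
        for node in reachable:
            k_dep[node] += 1
    return k_dep


network = grow_dependency_network(n_components=1000, d=2)
k_dep = dependency_degrees(network)
# Older components tend to have larger Kdep than recent ones.
print(k_dep[:5], k_dep[-5:])
```

Because every edge points from a newer component to an older one, the transitive dependencies can be accumulated in a single pass in order of arrival.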

K_dep decreases with layer number • Linux vs. model with D = 2

Zipf plot for K_dep distributions • Metabolic enzymes vs. model • Linux vs. model

Frequency distributions • Categories: ORFans, Cloud, Shell, Core • P(f) ~ f^-1.5, except for the top √N “universal” components with f ~ 1

Why should we care about P(f)?

Metagenomes and pan-genomes • For P(f) ~ f^-1.5: (pan-genome size) ~ (# of samples)^0.5 • The Human Microbiome Project Consortium, Nature (2012)
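The square-root scaling can be checked numerically under one simplifying assumption that is not stated on the slide: each component with frequency f is present in each sample independently with probability f, so it shows up in at least one of n samples with probability 1 − (1 − f)^n. The sketch below integrates this over a power-law density with exponent −1.5; `n_components`, `f_min` and the grid size are illustrative choices.

```python
import math


def expected_pangenome_size(n_samples, n_components=100_000,
                            f_min=1e-6, grid=2000):
    """Expected number of distinct components seen in n_samples systems.

    Frequencies follow p(f) ~ f**-1.5 on [f_min, 1]. The integral is taken
    on a log-spaced grid (u = ln f, df = f du) with the trapezoid rule.
    """
    norm = 2.0 * (f_min ** -0.5 - 1.0)  # integral of f**-1.5 over [f_min, 1]
    log_min = math.log(f_min)
    du = -log_min / grid
    total = 0.0
    for i in range(grid + 1):
        f = min(math.exp(log_min + i * du), 1.0)
        weight = 0.5 if i in (0, grid) else 1.0
        seen_prob = 1.0 - (1.0 - f) ** n_samples
        total += weight * f ** -0.5 * seen_prob * du
    return n_components * total / norm


for n in (25, 100, 400):
    print(n, round(expected_pangenome_size(n)))
```

Quadrupling the number of samples should roughly double the expected size as long as 1 << n << 1/f_min; outside that range the cutoffs at f = 1 and f = f_min bend the curve.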

Pan-genome of E. coli strains M Touchon et al. PLoS Genetics (2009)

Genome evolution in E. coli • Studier FW, Daegelen P, Lenski RE, Maslov S, Kim JF, J. Mol. Biol. (2009) • P. Dixit, T. Y. Pang, Studier FW, Maslov S, submitted (2013)

How many transcription factors does an organism need? • Regulator genes vs. worker genes • S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009) • TY Pang, S. Maslov, PLoS Comp Bio (2011)

Figure adapted from S. Maslov, TY Pang, K. Sneppen, S. Krishna, PNAS (2009) • N_R ~ N_G^2 • N_R/N_G ~ N_G

Cyril Northcote Parkinson (1909-1993) • “… bureaucracy grew by 5-7% per year ‘irrespective of any variation in the amount of work (if any) to be done.’” • Why? “An official wants to multiply subordinates, not rivals.” “Officials make work for each other.” so that “Work expands so as to fill the time available for its completion.” • Is this what happens in bacterial genomes? Probably not!

Economies of scale in bacterial evolution • N_R = N_G^2/80,000 → N_G/N_R = 80,000/N_G • Economies of scale: as the genome gets larger, new pathways get shorter
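As a quick worked example of the quadratic scaling, the sketch below evaluates the number of regulators and the number of genes per regulator for a few genome sizes. The constant 80,000 comes from the slide; the genome sizes are round illustrative numbers, not measurements.

```python
def regulators(n_genes, scale=80_000):
    """Number of regulators predicted by N_R = N_G**2 / scale."""
    return n_genes ** 2 / scale


def genes_per_regulator(n_genes, scale=80_000):
    """N_G / N_R = scale / N_G: shrinks as the genome grows."""
    return scale / n_genes


for n_genes in (1000, 4000, 8000):  # illustrative genome sizes
    print(n_genes, regulators(n_genes), genes_per_regulator(n_genes))
# 1000 12.5 80.0
# 4000 200.0 20.0
# 8000 800.0 10.0
```

In this reading, a regulator added to a larger genome controls fewer genes, which is the sense in which new pathways get shorter.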

Horizontal gene transfer: entire pathways could be added in one step • Redundant enzymes are removed • Central metabolic core → anabolic pathways → biomass production • Figure label: nutrient

Minimal metabolic pathways from reactions in KEGG database • Axes: N_R vs. N_G • Adapted from “scope-expansion” algorithm by R. Heinrich et al. • (# of pathways or their regulators) ~ (# of enzymes)^2
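For orientation, the basic idea of scope expansion can be sketched as follows: starting from a set of seed metabolites, repeatedly fire every reaction whose substrates are all available and add its products to the scope. This shows only the generic forward expansion, not the adapted minimal-pathway procedure used for the figure, and the reactions and metabolite names are hypothetical rather than taken from KEGG.

```python
def scope_expansion(reactions, seed_metabolites):
    """Return (scope, used_reactions) reachable from the seed metabolites.

    `reactions` maps a reaction name to a (substrates, products) pair of sets.
    """
    scope = set(seed_metabolites)
    used = []
    changed = True
    while changed:
        changed = False
        for name, (substrates, products) in reactions.items():
            # A reaction can fire once all of its substrates are in the scope.
            if name not in used and set(substrates) <= scope:
                used.append(name)
                scope |= set(products)
                changed = True
    return scope, used


# Hypothetical toy network.
toy_reactions = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"B", "C"}, {"D"}),
    "r3": ({"E"}, {"F"}),  # never fires: E is not reachable
}
scope, used = scope_expansion(toy_reactions, seed_metabolites={"A", "C"})
print(sorted(scope), used)
# ['A', 'B', 'C', 'D'] ['r1', 'r2']
```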
