Conformation Networks: an Application to Protein Folding

Zoltán Toroczkai, Erzsébet Ravasz (Center for Nonlinear Studies); Gnana Gnanakaran (T-10, Theoretical Biology and Biophysics); Los Alamos National Laboratory

Proteins • the most complex molecules in nature • globular or fibrous • basic functional units of a cell • chains of amino acids (50 – 10^3) • peptide bonds link the backbone Native state • unique 3D structure (native physiological conditions) • biological function • fold in nanoseconds to minutes • about 1000 known 3D structures: X-ray crystallography, NMR

Myoglobin: 153 residues, mol. weight = 17181 [D], 1260 atoms. Main function: primary oxygen storage and carrier in muscle tissue. It contains a heme (iron-containing porphyrin) group in the center. C34H32N4O4FeHO

Protein conformations Amino-acid • defined by dihedral angles • 2 angles with 2-3 local minima of the torsion energy • N monomers → about 10^N different conformations

Levinthal’s paradox • Anfinsen: thermodynamic hypothesis • native state is at the global minimum of the free energy Epstein, Goldberger, & Anfinsen, Cold Spring Harbor Symp. Quant. Biol. 28, 439 (1963) • Levinthal’s paradox, 1968 • finding the native state by random sampling is not possible • 40-monomer polypeptide (about 10^40 conformations), 10^13 conf/s → 3×10^19 years to sample all; universe ~ 2×10^10 years old Levinthal, J. Chim. Phys. 65, 44-45 (1968) Wetlaufer, P.N.A.S. 70, 691 (1973) • nucleation • folding pathways
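The Levinthal estimate can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide (10 conformations per monomer, 40 monomers, 10^13 conformations sampled per second); the length of a year in seconds is an assumption of this example.

```python
# Levinthal estimate with the slide's numbers.
N_MONOMERS = 40
CONFS_PER_MONOMER = 10                  # "about 10^N conformations" for N monomers
SAMPLING_RATE = 1e13                    # conformations sampled per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year
UNIVERSE_AGE_YEARS = 2e10               # age quoted on the slide

total_conformations = CONFS_PER_MONOMER ** N_MONOMERS
years_to_sample = total_conformations / SAMPLING_RATE / SECONDS_PER_YEAR
universe_ages = years_to_sample / UNIVERSE_AGE_YEARS

print(f"exhaustive sampling: {years_to_sample:.1e} years")
print(f"that is {universe_ages:.1e} times the age of the universe")
```

The result is of order 10^19 years, which is why random sampling cannot be the folding mechanism.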

Free energy landscapes • Bryngelson & Wolynes, 1987 • free energy landscape Bryngelson & Wolynes, P.N.A.S. 84, 7524 (1987) • a random hetero-polymer typically does NOT fold • Experiment: • random sequences • GLU, ARG, LEU • 80-100 amino-acids • ~ 95% did not fold in a stable manner Davidson & Sauer, P.N.A.S. 91, 2146 (1994)

Funnels Given any amino-acid sequence: can we tell if it is a good folder? • experiments (X-ray, NMR) • molecular dynamics simulations • homology modeling: difficult and slow. Energy funnels • Leopold, Montal & Onuchic, 1992 • many folding pathways Leopold, Montal & Onuchic, P.N.A.S. 89, 8721 (1992)

Molecular dynamics • State of the art • supercomputer (LANL) • Ribosome in explicit solvent: • targeted MD • 2.64×10^6 atoms (2.5×10^5 + water) • Q machine, 768 processors • 260 days of simulation (event: 2 ns), ~ 10^16 times slower Sanbonmatsu, Joseph & Tung, P.N.A.S. 102, 15854 (2005) • distributed computing (Stanford, Folding@home) • more than 100,000 CPUs • simulation of complete folding event • BBA5, 23-residue, implicit water • 10,000 CPU days/folding event (~1 μs) Shirts & Pande, Science 290, 1903 (2000); Snow, Nguyen, Pande, Gruebele, Nature 420, 102 (2002)
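The "times slower" figure for the ribosome run follows directly from the two durations on the slide. A minimal check, assuming the comparison is simply wall-clock time divided by simulated time:

```python
# Ratio of wall-clock time to simulated time for the ribosome run.
SECONDS_PER_DAY = 24 * 3600

wall_clock_seconds = 260 * SECONDS_PER_DAY   # 260 days of simulation
simulated_seconds = 2e-9                     # the simulated event lasts 2 ns

slowdown = wall_clock_seconds / simulated_seconds
print(f"slowdown factor: {slowdown:.1e}")
```

The ratio comes out at order 10^16, in line with the slide.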

Configuration networks • Protein conformations • dihedral angles have few preferred values: helix, sheet, other [Figure: Ramachandran map of PDB structures] Ramachandran & Sasisekharan, J. Mol. Biol. 7, 95 (1963) • NODE: configuration • LINK: change of one degree of freedom (angle) • refinement of angle values → continuous case

Why networks? • VERY LARGE: 100 monomers → 10^100 nodes. However: Generic features of folding are determined by STATISTICAL properties of the configuration network • degree distribution • average distance • clustering • degree correlations • toolkit from network research • captures the high dimensionality Albert & Barabási, Rev. Mod. Phys. 74, 67 (2002); Newman, SIAM Rev. 45, 167 (2003) • faster algorithms to simulate folding events • pre-screening synthetic proteins • insights into misfolding

A real example • The Protein Folding Network: F. Rao, A. Caflisch, J. Mol. Biol. 342, 299 (2004) • beta3s: 20 monomers, antiparallel beta sheets • MD simulation, implicit water • 330K, equilibrium folded ↔ random coil • NODE -- 8 letters / AA (local secondary struct) • LINK -- 2ps transition

Its native conformation has been studied by NMR experiments: De Alba et al., Prot. Sci. 8, 854 (1999). Beta3s in aqueous solution forms a monomeric triple-stranded antiparallel beta sheet in equilibrium with the denatured state. • Simulations @ 330K • The average folding time from the denatured state ~ 83 ns • The average unfolding time ~ 83 ns • Simulation time ~ 12.6 μs • Coordinates saved every 20 ps (5×10^5 snapshots in 10 μs) • Secondary structures: H, G, I, E, B, T, S, - (α-helix, 3_10 helix, π-helix, extended, isolated β-bridge, hydrogen-bonded turn, bend and unstructured). • The native state: -EEEESSEEEEEESSEEEE- • There are approx. 8^18 ≈ 10^16 conformations. • Nodes: conformations; links: transitions.
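The node and link definitions above can be illustrated with a toy trajectory. This is a sketch under stated assumptions, not the procedure of the cited paper: the snapshots are invented, links are treated as undirected, and the count of conformations assumes the two terminal residues are always unstructured, so that 18 positions each take one of the 8 letters.

```python
from collections import Counter

ALPHABET = "HGIEBTS-"              # the 8 secondary-structure letters
NATIVE = "-EEEESSEEEEEESSEEEE-"    # native-state string from the slide


def count_conformations(n_free_residues=18):
    """Number of strings when each free residue takes any of the 8 letters."""
    return len(ALPHABET) ** n_free_residues


def build_network(snapshots):
    """Nodes are distinct strings; links join strings seen in consecutive snapshots."""
    visits = Counter(snapshots)
    links = Counter()
    for a, b in zip(snapshots, snapshots[1:]):
        if a != b:                              # staying in place creates no link
            links[frozenset((a, b))] += 1       # assumption: undirected links
    neighbours = {s: set() for s in visits}
    for link in links:
        a, b = link
        neighbours[a].add(b)
        neighbours[b].add(a)
    degree = {s: len(nbrs) for s, nbrs in neighbours.items()}
    return visits, links, degree


# Hypothetical snapshots, invented for illustration only.
FRAYED_END = "-EEEESSEEEEEESSEEE--"
TURN_FORM = "-EEEETTEEEEEESSEEEE-"
toy_trajectory = [NATIVE, NATIVE, FRAYED_END, NATIVE, TURN_FORM, TURN_FORM, NATIVE]

visits, links, degree = build_network(toy_trajectory)
print(f"{count_conformations():.1e} possible strings")
print(f"{len(visits)} nodes, {len(links)} links, native degree {degree[NATIVE]}")
```

On a real trajectory the degree dictionary is what the degree distribution of the folding network would be computed from.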

Scale-free network Many real-world networks are scale free • hubs [Figure: degree distributions for beta3s and for the randomized chain] • co-authorship (γ = 1 - 2.5) • citations (γ = 3) • sexual contacts (γ = 3.4) • movie actors (γ = 2.3) • Internet (γ = 2.4) • World Wide Web (γ = 2.1/2.5) • Genetic regulation (γ = 1.3) • Protein-protein interactions (γ = 2.4) • Metabolic pathways (γ = 2.2) • Food webs (γ = 1.1) Barabási & Albert, Science 286, 509 (1999) Many reasons behind SF topology • Why is the protein network scale free? • Why does the randomized chain have a similar degree distribution? • Why is γ = -2?

Robot arm networks [Figure: configuration networks for n = 0, 1, 2, with nodes labeled by strings of the digits 0, 1, 2 such as 00, 01, 021] • Steric constraints? • missing nodes • missing links • n-dimensional hypercube • binomial degree distribution Homogeneous Swiss cheese
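A robot arm network of this kind can be sketched in a few lines. The example below makes two assumptions that are not on the slide: a joint may jump from any of its 3 positions to any other in a single step, and steric constraints are mimicked by deleting nodes independently at random (the hypothetical parameter `keep_prob`). Without deletions every node has the same degree; with deletions the degrees spread out around a single typical value, which is the homogeneous "Swiss cheese" picture.

```python
import itertools
import random
from collections import Counter


def robot_arm_network(n, states=3, keep_prob=1.0, seed=0):
    """Nodes: strings of n joint positions; links: change of exactly one joint."""
    rng = random.Random(seed)
    allowed = {
        config
        for config in itertools.product(range(states), repeat=n)
        if rng.random() < keep_prob          # random deletion stands in for steric clashes
    }
    adjacency = {config: [] for config in allowed}
    for config in allowed:
        for pos in range(n):
            for s in range(states):
                if s == config[pos]:
                    continue
                neighbour = config[:pos] + (s,) + config[pos + 1:]
                if neighbour in allowed:     # links to missing nodes are missing too
                    adjacency[config].append(neighbour)
    return adjacency


def degree_histogram(adjacency):
    """Map each degree value to the number of nodes that have it."""
    return dict(sorted(Counter(len(nbrs) for nbrs in adjacency.values()).items()))


full = robot_arm_network(n=4)
holes = robot_arm_network(n=4, keep_prob=0.7, seed=1)
print("no constraints:", degree_histogram(full))
print("30% of nodes removed:", degree_histogram(holes))
```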

A bead-chain model • Beads on a chain in 3D: robot arm model • similar to Cα protein models • rod-rod angle • 3 positions around axis Honeycutt & Thirumalai, Biopolymers 32, 695 (1992) [Figure: N=6, angle = 90°; N=18, angle = 120°, state 2212112212111122] • Homogeneous network

Another example: L = 7, angle = 75°, r = 0.25 [Figure: the “00100” state; legend: allowed state, forbidden state]

Adding monomers not only increases the number of nodes in the network but also its dimensionality!! The combined effect is small-world.

Shortcuts in Folding Space

The “dilemma” SCALE FREE • from polypeptide MD simulations • beta3s • randomized version HOMOGENEOUS • from studies of conformation networks • bead chain • robot arm ?

Gradient Networks Gradients of a scalar (temperature, concentration, potential, etc.) induce flows (heat, particles, currents, etc.). Naturally, gradients will induce flows on networks as well. Ex.: Load balancing in parallel computation and packet routing on the internet Y. Rabani, A. Sinclair and R. Wanka, Proc. 39th Symp. On Foundations of Computer Science (FOCS), 1998:“Local Divergence of Markov Chains and the Analysis of Iterative Load-balancing Schemes” References: Z. T. and K.E. Bassler, “Jamming is Limited in Scale-free Networks”, Nature, 428, 716 (2004) Z. T., B. Kozma, K.E. Bassler, N.W. Hengartner and G. Korniss “Gradient Networks”, http://www.arxiv.org/cond-mat/0408262

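Setup: Let G = G(V, E) be an undirected graph, which we call the substrate network. The vertex set: $V = \{0, 1, \dots, N-1\}$. The edge set: $E \subseteq \{(i, j) \mid i, j \in V\}$. A simple representation of E is via the N × N adjacency (or incidence) matrix A: $a_{ij} = 1$ if $(i, j) \in E$ and $a_{ij} = 0$ otherwise. (1) Let us consider a scalar field $h = \{h_0, h_1, \dots, h_{N-1}\}$ on the vertices. Set of nearest neighbor nodes on G of i: $S_i^{(1)} = \{ j \in V \mid a_{ij} = 1 \}$.

Definition 1 The gradient $\nabla h(i)$ of the field {h} in node i is a directed edge: $\nabla h(i) = (i, \mu(i))$ (2) which points from i to that nearest neighbor on G for which the increase in the scalar is the largest, i.e.: $\mu(i) = \arg\max_{j \in S_i^{(1)} \cup \{i\}} h_j$ (3) The weight associated with edge (i, μ) is given by $|\nabla h(i)| = h_{\mu(i)} - h_i$. If μ(i) = i, the gradient is a self-loop: a loop through i with zero weight. Definition 2 The set F of directed gradient edges on G together with the vertex set V forms the gradient network: $\nabla G = \nabla G(V, F)$. If (3) admits more than one solution, then the gradient in i is degenerate.

In the following we will only consider scalar fields with non-degenerate gradients. This means that (3) has exactly one solution for every node i. Theorem 1 Non-degenerate gradient networks form forests. Proof (sketch): every node has exactly one outgoing gradient edge, and every gradient edge other than a self-loop leads to a node with a strictly larger scalar. A cycle would therefore have to be a directed cycle, along which h increases strictly and yet returns to its starting value, a contradiction.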

[Figure: example substrate network with a scalar value on each node, and the resulting gradient network] Theorem 2 The number of trees in this forest = number of local maxima of {h} on G.
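Definition 1 and the two theorems can be checked numerically on a small example. The sketch below is illustrative only: the substrate (a ring with random chords) and all function names are choices made for this example, and the scalars are i.i.d. uniform random numbers, so ties (degenerate gradients) essentially never occur.

```python
import random


def random_substrate(n, extra_edges, rng):
    """Ring of n nodes plus random chords; an arbitrary connected substrate."""
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        j = (i + 1) % n
        adjacency[i].add(j)
        adjacency[j].add(i)
    for _ in range(extra_edges):
        a, b = rng.sample(range(n), 2)
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def gradient_network(adjacency, h):
    """mu[i] is the node with the largest scalar among i and its neighbours."""
    return {i: max([i, *adjacency[i]], key=h.__getitem__) for i in adjacency}


def count_trees(mu):
    """Follow gradient edges uphill from every node; each root (self-loop) is one tree."""
    roots = set()
    for i in mu:
        while mu[i] != i:
            i = mu[i]
        roots.add(i)
    return len(roots)


def count_local_maxima(adjacency, h):
    """Nodes whose scalar exceeds that of all their neighbours."""
    return sum(all(h[i] > h[j] for j in adjacency[i]) for i in adjacency)


rng = random.Random(1)
substrate = random_substrate(200, 300, rng)
h = [rng.random() for _ in range(200)]
mu = gradient_network(substrate, h)

n_trees = count_trees(mu)
n_maxima = count_local_maxima(substrate, h)
print(f"trees: {n_trees}, local maxima: {n_maxima}")
```

The uphill walk in `count_trees` always terminates because the scalar strictly increases along gradient edges, which is the same argument that rules out cycles.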

For Erdős - Rényi random graph substrates with i.i.d random numbers as scalars, the in-degree distribution is:
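The in-degree distribution for this case can also be estimated by direct simulation. The following sketch assumes that the in-degree of a node counts incoming gradient edges from other nodes only (self-loops excluded); the graph size, link probability, and function name are arbitrary choices for illustration.

```python
import random
from collections import Counter


def er_gradient_in_degree_distribution(n, p, seed=0):
    """Empirical in-degree distribution of the gradient network on G(n, p)."""
    rng = random.Random(seed)

    # Erdős-Rényi substrate: each pair of nodes is linked with probability p.
    adjacency = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adjacency[i].append(j)
                adjacency[j].append(i)

    # i.i.d. random scalars on the nodes.
    h = [rng.random() for _ in range(n)]

    # Each node sends one gradient edge to the largest scalar in its neighbourhood.
    in_degree = [0] * n
    for i in range(n):
        target = max([i, *adjacency[i]], key=h.__getitem__)
        if target != i:                  # assumption: self-loops are not counted
            in_degree[target] += 1

    histogram = Counter(in_degree)
    return {l: histogram[l] / n for l in sorted(histogram)}


distribution = er_gradient_in_degree_distribution(n=1000, p=0.01)
for l, fraction in distribution.items():
    print(l, fraction)
```

Plotting the fractions against the in-degree on log-log axes, and averaging over many seeds, is the natural way to compare such a simulation with the analytic result.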

The Configuration model A. Clauset, C. Moore, Z.T., E. Lopez, to be published.

Generating functions: K-th Power of a Ring

Power law with exponent -3 [plot label: 2K+l]

The energy landscape • Energy associated with each node (configuration) • the gradient network • most favorable transitions • T=0 backbone of the flow • MD simulation • tracks the flow network • biased walk close to the gradient network • trees • basins of local minima What generates γ = -2? The REM generates an exponent of -1.

Model ingredients • k, E increase from constrained (folded: small k_conf, lower energy) to loose (random coil: large k_conf, higher energy) • A network model of configuration spaces • network topology • homogeneous • degree correlations • how to associate energies

Random geometric graph R=0.113, <k>=20 • Energy proportional to connectivity: E ∝ k • random geometric graph Dall & Christensen, Phys. Rev. E 66, 026121 (2002) • in higher D: similar to hypercube with holes • degree correlations
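A random geometric graph with these parameters is easy to generate. The sketch below assumes points placed uniformly in the unit square with periodic boundaries, and N = 500 nodes, chosen here so that N·π·R² ≈ 20; neither choice is stated on the slide. It assigns each node an energy equal to its degree (proportionality constant 1) and measures the degree-degree correlation across links.

```python
import random


def random_geometric_graph(n, radius, seed=0):
    """Link points of the unit square (periodic boundaries) closer than radius."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            dx = abs(points[i][0] - points[j][0])
            dy = abs(points[i][1] - points[j][1])
            dx = min(dx, 1.0 - dx)       # shortest separation on the torus
            dy = min(dy, 1.0 - dy)
            if dx * dx + dy * dy < radius * radius:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency


def degree_correlation(adjacency):
    """Pearson correlation between the degrees found at the two ends of a link."""
    pairs = [
        (len(adjacency[i]), len(adjacency[j]))
        for i in adjacency
        for j in adjacency[i]            # every link is counted in both directions
    ]
    mean = sum(x for x, _ in pairs) / len(pairs)
    cov = sum((x - mean) * (y - mean) for x, y in pairs) / len(pairs)
    var = sum((x - mean) ** 2 for x, _ in pairs) / len(pairs)
    return cov / var


R = 0.113
N = 500                                  # assumption: gives N * pi * R^2 of about 20
graph = random_geometric_graph(N, R)

mean_degree = sum(len(nbrs) for nbrs in graph.values()) / N
energy = {i: len(nbrs) for i, nbrs in graph.items()}   # E proportional to k
r = degree_correlation(graph)
print(f"<k> = {mean_degree:.1f}, degree correlation = {r:.2f}")
```

A positive correlation means that highly connected (high-energy) configurations tend to neighbor other highly connected ones, which is the degree-correlation ingredient named on the slide.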

N=30000, <k> = 1000, d=2.

Exponent is -2. Two essential ingredients: • k1-k2 (degree-degree) correlations • <E> monotonic with k

Bead-chain model Lennard-Jones potential Repulsive Attractive • more realistic model: bead-chain • configuration network • excluded volume • energy: Lennard-Jones

L = 30, angle = 75°

The case of the α-helix: AKA peptide • ALA: orange • LYS: blue • TYR: green MD simulations, no water.

The MD traced network T = 400 More than one simulation 3 different runs: yellow, red and green The role of temperature

Conclusions • A network approach was introduced to study sterically constrained conformations of bead-chain-like objects. • This network approach is based on the “statistical dogma” stating that generic features must be the result of statistical properties of the networks and should not depend on details. • Protein conformation dynamics happens in high dimensional spaces that are not adequately described by simplistic reaction coordinates. • The dynamics performs a locally biased sampling of the full conformational network. For low enough temperatures the sampled network is a gradient graph, which is typically a scale-free structure. • The -2 degree exponent appears at and below the temperature where the basins of the local energy minima become kinetically disconnected. • Understanding the protein folding network has the potential of leading to faster simulation algorithms towards closing the gap between nature’s speed and ours. Coming up: conditions on side chain distributions for the existence of funneled energy landscapes.
