The Inverse Protein Folding Problem*

Canada-China Industrial Workshop, 2005, Hong Kong Baptist University • Arvind Gupta, Simon Fraser University • May 24, 2005 • *Joint work with J. Manuch, C. Mead, L. Stacho, B. Bhattacharyya, X. Huang

Outline • Background • Forces in Protein Folding • Hydrophobic-Polar Model • Protein Databank • Determining Attributes of the Ideal Lattice • Future Steps

DNA • Genetic code • A “string” of nucleotides over the alphabet {A, C, G, T} • Codes for all proteins • Self-replicating

Proteins • A “string” over 20 amino acids • In solvent will fold into a unique 3D spatial structure with minimal energy

Protein Structure • Structure determines protein function. • Proteins normally are in an aqueous environment • Proteins are globular.

Proteins in the body • Proteins are involved in all processes in the body. Examples: insulin, hemoglobin

Proteins and diseases M. Thorpe, Protein Folding, HIV and Drug Design, Physics and Technology Forefronts (2003).

Forward Protein Folding Problem • Identify the protein structure for a specific amino acid sequence. MAGWTRLS.. • Central open problem in biology • NP-hard under most models

Inverse Protein Folding Problem • Given a structure (or a functionality) identify an amino acid sequence whose fold will be that structure (exhibit that functionality). • Crucial problem in drug design. • NP-hard under most models.

Forces acting on Proteins • Hydrogen Bonding • Van der Waals interactions • Ion pairing • Disulfide bonds • Intrinsic properties (conformational preference) • Hydrophobicity: the dominant force in protein folding (Dill, 1990) • Hydro (water) • philic (loving) • phobic (fearing)

Hydrophobic Interactions • Each amino acid can be classified as either hydrophobic or hydrophilic (polar) • Hydrophobic [polar] amino acids are in a higher [lower] energy state in an aqueous environment.

Hydrophobic – Polar (HP) Model • Introduced by Dill (1985) and Chan (1985) • “0” for polar; “1” for hydrophobic • Protein sequence embedded on a lattice • Each amino acid occupies exactly one cell • Interactions occur across adjacent cells • Empty lattice cells contain water • Given a protein, find the fold that maximizes the number of hydrophobic interactions (the native fold). • I.e., given a 0-1 string, embed it onto a lattice so that the number of adjacent pairs of 1’s is maximized.
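
The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `hp_contacts`, the coordinate convention, and the example fold `FOLD` are assumptions made here. The string `010010010010` is the base-case protein quoted later in the talk; the fold was constructed by hand for this example.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def hp_contacts(seq: str, fold: List[Coord]) -> int:
    """Count hydrophobic interactions of a 0-1 string embedded on the 2D square lattice.

    A contact is a pair of 1's in adjacent cells that are not neighbours along the chain.
    """
    if len(seq) != len(fold) or len(set(fold)) != len(fold):
        raise ValueError("each residue must occupy its own lattice cell")
    for (x1, y1), (x2, y2) in zip(fold, fold[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must occupy adjacent cells")

    index_at = {cell: i for i, cell in enumerate(fold)}
    contacts = 0
    for i, (x, y) in enumerate(fold):
        if seq[i] != "1":
            continue
        # Looking only right and up counts every adjacent pair exactly once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and seq[j] == "1" and abs(i - j) > 1:
                contacts += 1
    return contacts


# Hypothetical example: the four 1's form a 2x2 core, the 0's loop around it.
SEQ = "010010010010"
FOLD = [(-1, 0), (0, 0), (0, -1), (1, -1), (1, 0), (2, 0),
        (2, 1), (1, 1), (1, 2), (0, 2), (0, 1), (-1, 1)]
print(hp_contacts(SEQ, FOLD))  # 4
```

Storing the fold as explicit coordinates rather than a string of moves keeps the adjacency test trivial, at the cost of having to validate the chain.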

The 2-D Square Lattice • [Figure: legend showing a hydrophobic “1”, a polar “0”, a peptide bond, and a hydrophobic interaction, followed by an example fold.]

Inverse protein folding • Problem: For a given shape find a protein (amino acid string) with a native fold approximating the shape. • Example.

Constructible structures Theorem: For any constructible structure S, there exists a protein p(S) with a native fold exactly filling the structure S. • Proof by induction: • Base case: p(S)=010010010010

Constructible structures (proof, continued) • Proof by induction: • Inductive case: [figure]

Constructible structures (proof, continued) • Folds are saturated: every hydrophobic “1” is involved in two hydrophobic interactions • Saturated implies native
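
A saturation check is easy to state in code. The sketch below is illustrative only; `contacts_per_residue`, `is_saturated`, and the example fold are names and data assumed here. The reason saturation implies nativeness on the square lattice is that a non-terminal residue has two of its four neighbouring cells taken by its chain neighbours, so it can take part in at most two contacts; a fold in which every 1 reaches that bound cannot be beaten (for strings that begin and end with 0).

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def contacts_per_residue(seq: str, fold: List[Coord]) -> Dict[int, int]:
    """For every 1 in seq, count the non-consecutive 1's in adjacent lattice cells."""
    index_at = {cell: i for i, cell in enumerate(fold)}
    counts = {i: 0 for i, ch in enumerate(seq) if ch == "1"}
    for i in counts:
        x, y = fold[i]
        for cell in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            j = index_at.get(cell)
            if j is not None and abs(i - j) > 1 and seq[j] == "1":
                counts[i] += 1
    return counts


def is_saturated(seq: str, fold: List[Coord]) -> bool:
    """True if every hydrophobic residue has exactly two hydrophobic interactions."""
    return all(c == 2 for c in contacts_per_residue(seq, fold).values())


# Hypothetical hand-built fold of the base-case string with a 2x2 hydrophobic core.
SEQ = "010010010010"
FOLD = [(-1, 0), (0, 0), (0, -1), (1, -1), (1, 0), (2, 0),
        (2, 1), (1, 1), (1, 2), (0, 2), (0, 1), (-1, 1)]
print(is_saturated(SEQ, FOLD))
```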

Stability of proteins • A protein is stable if it has a unique native fold (fold with minimal energy). • Most natural proteins are stable. • The protein in our example is not stable: it has 82 native folds in total!
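
For short strings, stability can be checked by brute force. The following sketch enumerates all self-avoiding folds on the 2D square lattice and keeps those with the maximum number of hydrophobic contacts. It is an assumption-laden illustration, not the method used in the talk: the function names are invented here, folds are counted up to rotation and reflection of the lattice, and the running time grows exponentially with the string length.

```python
from typing import List, Tuple

Coord = Tuple[int, int]
STEPS = ((1, 0), (0, 1), (-1, 0), (0, -1))


def count_contacts(seq: str, fold: List[Coord]) -> int:
    """Number of adjacent, non-consecutive pairs of 1's in the fold."""
    index_at = {cell: i for i, cell in enumerate(fold)}
    total = 0
    for i, (x, y) in enumerate(fold):
        if seq[i] != "1":
            continue
        for cell in ((x + 1, y), (x, y + 1)):  # each pair is seen once
            j = index_at.get(cell)
            if j is not None and seq[j] == "1" and abs(i - j) > 1:
                total += 1
    return total


def native_folds(seq: str) -> Tuple[int, List[List[Coord]]]:
    """Return (maximum contacts, all folds attaining it), up to lattice symmetry."""
    if not seq:
        return 0, []
    best, folds = -1, []

    def extend(fold: List[Coord], used: set, turned: bool) -> None:
        nonlocal best, folds
        if len(fold) == len(seq):
            score = count_contacts(seq, fold)
            if score > best:
                best, folds = score, [list(fold)]
            elif score == best:
                folds.append(list(fold))
            return
        x, y = fold[-1]
        for dx, dy in STEPS:
            if len(fold) == 1 and (dx, dy) != (1, 0):
                continue  # fix the first step: removes rotations
            if not turned and dy == -1:
                continue  # first turn must go up: removes reflections
            cell = (x + dx, y + dy)
            if cell in used:
                continue
            fold.append(cell)
            used.add(cell)
            extend(fold, used, turned or dy != 0)
            used.remove(cell)
            fold.pop()

    extend([(0, 0)], {(0, 0)}, False)
    return best, folds


def is_stable(seq: str) -> bool:
    """A string is stable here if exactly one fold (up to symmetry) is native."""
    return len(native_folds(seq)[1]) == 1


print(native_folds("1001"))
```

Whether a count obtained this way matches the figure quoted on the slide depends on how symmetric folds are counted there, which the slide does not state.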

Stability of proteins • Conjecture: For any constructible structure S, the protein p(S) is stable. • Tested for >20,000 constructible structures. • Mathematically proved for two simple infinite classes of constructible structures, L0 and L1. [Figures: examples of L0 and L1 structures.]

Boundary squares • Diagonal frame: the smallest diagonal rectangle containing all hydrophobic “1”s. • Boundary square: a hydrophobic “1” lying on the border of the diagonal frame. [Figure: example with 5 boundary squares.]

Boundary squares • Useful for finding the last tile of a constructible structure. • A saturated fold has at least 4 of them. • Lemma. Let p = 0{0,1}*0 be a protein string containing none of 11, 000, or 10101 as a substring. For every saturated fold of p, each boundary square not adjacent to a terminal is the main square of a corner-closed core.

Proof for L0 structures • Take a saturated fold of p(S), S ∈ L0. • It has at least 4 boundary squares, and at least 2 of them are not adjacent to a terminal (the first or the last amino acid). • By the Lemma, each is contained in a corner-closed core, i.e., it is the highlighted (red) 1 of a substring 1001001 of the protein string. • In p(S) = 0(10010)^n(01001)^n0, there are only two occurrences of the substring 1001001, and they overlap. • Hence, the cores match each other and form a fully-closed core (closed on 3 sides): the last tile. • Cut off the last tile and apply induction.
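
The counting step of this argument can be checked mechanically for small n. The sketch below assumes the reading p(S) = 0(10010)^n(01001)^n0 of the formula on the slide (for n = 1 this gives the base-case string 010010010010); the helper names `p_L0` and `occurrences` are invented for the illustration.

```python
from typing import List


def p_L0(n: int) -> str:
    """Protein string for an L0 structure, read as 0 (10010)^n (01001)^n 0."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return "0" + "10010" * n + "01001" * n + "0"


def occurrences(text: str, pattern: str) -> List[int]:
    """Start indices of all (possibly overlapping) occurrences of pattern in text."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


for n in range(1, 5):
    p = p_L0(n)
    # The Lemma's hypotheses: no 11, 000 or 10101 anywhere in the string.
    assert not any(bad in p for bad in ("11", "000", "10101"))
    print(n, p, occurrences(p, "1001001"))
```

For each n tried, the two start indices differ by 3, so the occurrences share four characters, which is the overlap the proof relies on.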

L1 structures are more complex • p(S) = 0(10010)^n 010 (10010)^m (01001)^m 01 (01001)^(n-1) 0 • p(S) contains one occurrence of the substring 10101 (so the Lemma cannot be applied directly) and three occurrences of 1001001 (so two corner-closed cores do not imply a fully-closed core).
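
The two substring counts quoted for L1 can be reproduced with a short script. The exponent placement in `p_L1` is a reconstruction of the flattened formula on the slide and should be treated as an assumption; under this reading the quoted counts come out for n ≥ 2 and m ≥ 1 (for n = 1 the final block is empty and 10101 does not occur). Function names are invented here.

```python
def p_L1(n: int, m: int) -> str:
    """Assumed reading: 0 (10010)^n 010 (10010)^m (01001)^m 01 (01001)^(n-1) 0."""
    if n < 1 or m < 1:
        raise ValueError("n and m must be at least 1")
    return ("0" + "10010" * n + "010" + "10010" * m
            + "01001" * m + "01" + "01001" * (n - 1) + "0")


def count_overlapping(text: str, pattern: str) -> int:
    """Number of occurrences of pattern in text, counting overlapping ones."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))


def satisfies_lemma_hypotheses(p: str) -> bool:
    """p = 0{0,1}*0 with none of 11, 000, 10101 as a substring."""
    return (len(p) >= 2 and p[0] == "0" and p[-1] == "0"
            and not any(bad in p for bad in ("11", "000", "10101")))


for n in (2, 3):
    for m in (1, 2):
        p = p_L1(n, m)
        print(n, m,
              count_overlapping(p, "10101"),
              count_overlapping(p, "1001001"),
              satisfies_lemma_hypotheses(p))
```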

Choosing a Lattice • 2D is easier • Fewer options for combinatorial case analysis • More visually intuitive • Torsion angles describe protein mainchain • 3D is more relevant • More biologically relevant • More representative of actual protein structures • Directly applicable to known protein structures

Protein Data Bank (PDB) • Worldwide repository for 3-D biological macromolecular structure data • Contains 30,857 known protein structures (as of May 17, 2005) • Structures derived using different techniques • Nuclear Magnetic Resonance spectroscopy • X-ray crystallography • PDB “known structures” are really models of the structure of a protein

Determining Ideal Lattice Attributes • Should all edges of the lattice be identical in length? • How should distances between non-adjacent lattice points behave? • What angles should the lattice have? • How regular should the lattice be? Use PDB statistics to answer these questions

Assemble a Set of Proteins • Create a subset of good-quality protein structures from the PDB: • Protein structures generated using X-ray diffraction • High-resolution structures (≤ 1.75 Å) • Model fits the experimental data well • Result: 3704 protein structures in the subset

Q1: Uniform Edge Length? • Overall distribution of consecutive residue distances: consecutive residues are consistently about 3.8 Å apart. • Answer to Question 1: All lattice edges should have a uniform length of 3.8 Å.

Q2: Non-adjacent Vertex Distances? • Overall distribution of non-consecutive residue distances: • minimum distance: 3.06 Å • only 10 distances < 3.5 Å • 1813 distances < 3.8 Å • (out of 426 billion pairs) • Answer to Question 2: Non-adjacent vertices should be at least 3.8 Å apart.
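
The two statistics behind Questions 1 and 2 are simple to compute from a list of residue coordinates. The sketch below is a hedged illustration: it assumes each residue is represented by a single point (for instance its Cα atom, which the slides do not state explicitly for the distance measurements), and the four-point trace is toy data, not taken from the PDB.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def consecutive_distances(trace: List[Point]) -> List[float]:
    """Distances between residues i and i+1 (the lattice edge lengths)."""
    return [math.dist(a, b) for a, b in zip(trace, trace[1:])]


def min_nonconsecutive_distance(trace: List[Point]) -> float:
    """Smallest distance between residues that are not chain neighbours."""
    return min(math.dist(trace[i], trace[j])
               for i, j in combinations(range(len(trace)), 2)
               if j - i > 1)


# Toy trace (hypothetical): four points on a square with 3.8 Å sides.
toy_trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
print(consecutive_distances(toy_trace))
print(min_nonconsecutive_distance(toy_trace))
```

The pairwise loop is quadratic in chain length, which is fine for a single protein; statistics over hundreds of billions of pairs would call for a spatial index or a vectorized implementation.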

Q3: Lattice Angles? [Figures: one amino acid; an amino acid chain.]

Q3: Lattice Angles? • Overall distribution of Cα angles: • Calculate Cα angles: the angle formed by three consecutive Cα atoms • Group results by the amino acid type of the middle residue • Bimodal distribution: • Sharp peak at 90° • Shallow peak at 120°
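
The Cα angle defined above is the angle at the middle atom between the directions to its two chain neighbours. A minimal sketch, with function names and test points chosen here for illustration:

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def ca_angle(prev: Point, mid: Point, nxt: Point) -> float:
    """Angle in degrees at `mid` formed by three consecutive C-alpha positions."""
    u = [p - m for p, m in zip(prev, mid)]
    v = [n - m for n, m in zip(nxt, mid)]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def ca_angles(trace: List[Point]) -> List[float]:
    """One angle per interior residue of a C-alpha trace."""
    return [ca_angle(a, b, c) for a, b, c in zip(trace, trace[1:], trace[2:])]


print(ca_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```

Grouping by the middle residue type, as the slide describes, would then amount to keying each angle by the amino acid at the corresponding interior position.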

Q3: Lattice Angles? • Some differences appear for Cα angles around certain amino acids. Shown: proline, phenylalanine, aspartic acid

Q4: Lattice Regularity? • Determine the average coordinate root mean square deviation (c-RMS) between each original PDB structure and its lattice approximation (over the entire 3704-structure PDB subset) • a_i = coordinates of the lattice vertex corresponding to b_i • b_i = coordinates of residue i in the protein X-ray structure
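
The formula itself did not survive in the slide text. The sketch below uses the usual coordinate RMS definition, the square root of the mean of |a_i − b_i|² over all residues, which is an assumption about what the slide showed. It also assumes both coordinate sets are already expressed in the same frame; any fitting of the lattice to the structure is outside this example.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def c_rms(lattice: List[Point], protein: List[Point]) -> float:
    """Coordinate RMS deviation between lattice vertices a_i and residues b_i."""
    if not lattice or len(lattice) != len(protein):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(lattice, protein))
    return math.sqrt(squared / len(lattice))


# Toy data (hypothetical): every lattice vertex is 1 Å away from its residue.
residues = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
vertices = [(x + 1.0, y, z) for x, y, z in residues]
print(c_rms(vertices, residues))
```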

Q4: Lattice Regularity? • Periodic Lattices: Cubic and Face-Centered-Cubic (FCC) • Randomized Lattices: Shift each vertex in periodic lattices by a random value from normal (0, 0.0025) distribution, preserve edges • De Novo Random Lattices: Generate random nodes and edges, maintain average degree and edge length of periodic lattices

Q4: Lattice Regularity? • Average c-RMS values generally increase as the randomization of the lattices increases • Answer to Question 4: Periodic lattices approximate protein structure better than random lattices of the same degree

Results: Ideal Lattice Attributes • Uniform edge lengths of 3.8 Å • Minimum distance between any two vertices of 3.8 Å • Supporting mainly 90° and 120° angles • Periodic in structure
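
The angle criterion can be tested against a candidate lattice by listing the angles that two distinct edges at one vertex can form. The sketch below does this for the cubic and face-centered cubic (FCC) lattices using their standard nearest-neighbour vectors. The function name is invented here, and the vectors are unscaled, since angles do not depend on the 3.8 Å edge length.

```python
import math
from itertools import permutations, product
from typing import List, Sequence, Tuple

Vector = Tuple[int, int, int]

# Nearest-neighbour directions: 6 for the cubic lattice, 12 for FCC.
CUBIC: List[Vector] = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                       (0, -1, 0), (0, 0, 1), (0, 0, -1)]
FCC: List[Vector] = sorted({p for s1, s2 in product((1, -1), repeat=2)
                            for p in permutations((s1, s2, 0))})


def bond_angles(neighbours: Sequence[Vector]) -> List[int]:
    """Angles (in whole degrees) between two distinct edges meeting at a vertex."""
    angles = set()
    for u in neighbours:
        for v in neighbours:
            if u == v:
                continue  # a self-avoiding chain cannot return along the same edge
            dot = sum(a * b for a, b in zip(u, v))
            norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
            cosine = max(-1.0, min(1.0, dot / norm))
            angles.add(round(math.degrees(math.acos(cosine))))
    return sorted(angles)


print("cubic:", bond_angles(CUBIC))
print("FCC:  ", bond_angles(FCC))
```

The cubic lattice offers 90° but not 120°, while FCC offers both (along with 60° and 180°), which is consistent with the interest in FCC-based candidates.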

Candidate lattices (space-filling) cubic hex. prism truncated octahedron truncated tetrahedron cuboctahedron

Candidate lattices (vector-based) Face-centered cubic (FCC) Side+FCC (S+FCC) Extended FCC (e-FCC)

RMS comparison of lattices

Angle comparison of lattices

Future • Investigate candidate lattices to determine an ideal lattice for inverse protein folding • Mathematically prove that the ideal lattice can generate stable sequences for specified protein shapes within the HP model • Attempt to assign specific amino acids to lattice sites

Future • Investigate protein sequences generated by the model for stability and folding properties. • Incorporate other protein folding forces • Hydrogen Bonding • Van der Waals interactions • Intrinsic properties (conformational preference) • Ion pairing • Disulfide bonds

Questions?
