Applications of Voronoi tessellations in protein structure prediction and analysis Brendan McConkey Department of Biology University of Waterloo

Q. Du, V. Faber, M. Gunzburger (1999) Centroidal Voronoi Tessellations: Applications and Algorithms. SIAM Review 41(4):637-676

Gravitational influence of stars. Descartes. 1644. http://www.snibbe.com/scott

Wigner-Seitz cells.
Soap bubbles in frame. Fig. 52 from Soap Bubbles, Their Colors and Forces which Mold Them. C.V. Boys.
The distribution of McDonald's restaurants in San Francisco.
Sources: http://www.snibbe.com/scott http://www.chembio.uoguelph.ca/educmat/chm729/wscells/start.htm

Part of a dragonfly's wing. Fig. 162. From On Growth and Form . D'Arcy Thompson. "Reticulum Plasmatique." Fig. 321. From On Growth and Form . D'Arcy Thompson. http://www.snibbe.com/scott

Frogs' eggs showing various partitionings of first eight cells. Fig. 257. From On Growth and Form . D'Arcy Thompson. http://www.snibbe.com/scott

Applications in protein structure analysis:
• scoring functions for protein folding (statistical assessment of contacts within a protein)
• generation of protein Voronoi contact maps (2D targets for structure prediction)
• calculating surfaces, areas, and volumes of atoms and amino acid residues

Structure prediction methods
• The structure prediction problem may be divided into two related tasks:
• A search procedure
  - comparative modeling
  - ab initio prediction
• An energetic or scoring function
  - physicochemical potentials
  - statistical potentials

What determines the structure of a protein?
• energetics: the structure should have a minimum energy
• amino acid sequence
• topology of the protein
• environment (solvation, membrane interactions)
• constitutive ligands (ions, heme groups, …)
• interactions with other proteins, cofactors, ligands
• folding of cytosolic proteins is largely driven by desolvation
• That an amino acid sequence can spontaneously form a functional protein implies that the structure is robust to small changes (the structure is in a low energy conformation, and will return to this conformation if perturbed)

Protein folding energy landscape
• the protein energy landscape is complex, with many local minima
• believed to have a funnel-like shape, with the global minimum representing the native structure
Image from http://bioinfo.mshri.on.ca/

Scoring functions
• Energetic functions
• $E_{total} = E_{bonds} + E_{angles} + E_{dihedrals} + E_{van\ der\ Waals} + E_{electrostatics} + E_{solvation} + \dots$
• Knowledge-based functions (e.g. statistical pairwise distance potentials)
• Residue-residue contact potentials
• Each method type often uses training sets (protein structures solved by experimental methods) to estimate parameters.

Development of an atom-atom contact scoring function
• Advantages of contact-based scoring:
• can treat the solvent accessible surface as an atomic contact, eliminating the need to add corrective terms
• solvation energy is proportional to the solvent contact area (Eisenberg, 1986)
• hydrophobic interactions are largely due to desolvation, so are correlated with loss of solvent contact area
• knowledge-based statistical methodology may be applied to contact areas as well as inter-atomic distances
• Contact scores require a reliable quantification of inter-atomic contacts. This can be done using a Voronoi tessellation.

Defining atom contacts: Voronoi tessellations
Original method: given a set of points in a plane, the plane is divided into polygonal regions with one region per point, each region containing the locations closer to its point than to any other (Voronoi, 1908). This may be applied to protein structures in three dimensions, and can quantify atom volumes and packing efficiencies for internal atoms (Richards et al, 1974; Tsai et al, 1999).
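As a minimal illustration of the planar definition (not the procedure used in this talk), the sketch below assigns grid points in the unit square to their nearest generating point, which is the defining property of a Voronoi cell, and estimates each cell's area by counting. The site coordinates, the grid resolution, and the function names are made up for illustration.

```python
import math

def nearest_site(point, sites):
    """Return the index of the site closest to `point`, i.e. its Voronoi cell."""
    return min(range(len(sites)), key=lambda k: math.dist(point, sites[k]))

def estimate_cell_areas(sites, n=200):
    """Estimate Voronoi cell areas inside the unit square on an n x n grid."""
    counts = [0] * len(sites)
    for ix in range(n):
        for iy in range(n):
            # Sample the centre of each grid square.
            p = ((ix + 0.5) / n, (iy + 0.5) / n)
            counts[nearest_site(p, sites)] += 1
    return [c / (n * n) for c in counts]

# Two hypothetical generating points, symmetric about x = 0.5,
# so each cell should cover half of the square.
sites = [(0.25, 0.5), (0.75, 0.5)]
areas = estimate_cell_areas(sites)
```

A grid estimate like this is only approximate in general; exact tessellations compute the polygon (or polyhedron) boundaries directly.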

A constrained Voronoi procedure
• Applied to atom-atom and atom-solvent contacts within proteins:
• the solvent accessible surface needs to be calculated (shown in blue)
• atom-atom contacts should be limited to within ~2 atom diameters
• contact areas should not be dependent on the size of polyhedra
A rapid and exact analytical procedure for calculating volumes, contacts, and solvent accessibility has been developed using this method, termed a constrained Voronoi algorithm.

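Integration of Voronoi tessellations with Solvent Accessible Surface
Plane of separation, where $p_{ij}$ is the distance from the centre of atom i to the plane, $d_{ij}$ is the distance between atoms i and j, $r_i$ and $r_j$ are the atom radii, and $r_w$ is the solvent radius:
• bisecting plane: $p_{ij} = d_{ij} / 2$
• radical plane: $p_{ij} = [d_{ij}^2 + r_i^2 - r_j^2] / (2 d_{ij})$
• extended radical plane: $p_{ij} = [d_{ij}^2 + (r_i + r_w)^2 - (r_j + r_w)^2] / (2 d_{ij})$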
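The three plane placements can be compared numerically with a short sketch. It assumes that p_ij is measured from the centre of atom i along the line to atom j, which is why the radical-plane expressions are divided by 2·d_ij so that all three agree when the radii are equal. The default solvent radius of 1.4 Å is a commonly used water probe value and is an assumption here, as are the function names.

```python
def bisecting_plane(d_ij):
    """Distance from atom i to the plane midway between atoms i and j."""
    return d_ij / 2.0

def radical_plane(d_ij, r_i, r_j):
    """Distance from atom i to the radical plane of two spheres of radii r_i, r_j."""
    return (d_ij**2 + r_i**2 - r_j**2) / (2.0 * d_ij)

def extended_radical_plane(d_ij, r_i, r_j, r_w=1.4):
    """Radical plane of the two spheres after expanding each radius by r_w.

    r_w = 1.4 (Angstroms) is an assumed water probe radius.
    """
    return radical_plane(d_ij, r_i + r_w, r_j + r_w)

# Equal radii: all three definitions place the plane at the midpoint.
same = (bisecting_plane(4.0), radical_plane(4.0, 1.7, 1.7),
        extended_radical_plane(4.0, 1.7, 1.7))

# Unequal radii: the plane shifts toward the smaller atom, and the
# extended radical plane shifts further than the plain radical plane.
shifted = (radical_plane(4.0, 2.0, 1.0), extended_radical_plane(4.0, 2.0, 1.0))
```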

Calculation of atom-atom contacts
• to remove the dependency on polyhedra size, the angular contact area is used
• contact area is quantified by projecting the polyhedron faces onto the surface of a sphere
• calculated as a sum of spherical triangles and arc segments
• provides an exact and continuous estimate of atom-atom contacts
• permits solvent contacts to be treated as atom contacts
• approximates loss of solvent accessible surface on folding
(Figure labels: CA1, CA2, CA3, CA4, CA5)
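The spherical-triangle part of this calculation can be sketched as follows. This is an illustrative sketch, not the constrained Voronoi code itself: it projects a convex planar face onto a sphere centred at the origin by fan triangulation, using the Van Oosterom and Strackee formula for the solid angle of a triangle. It does not handle the arc segments that arise where a face is clipped by the sphere, and all function names are hypothetical.

```python
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def triangle_solid_angle(a, b, c):
    """Solid angle (steradians) subtended at the origin by triangle abc."""
    a, b, c = _unit(a), _unit(b), _unit(c)
    num = abs(_dot(a, _cross(b, c)))
    den = 1.0 + _dot(a, b) + _dot(b, c) + _dot(c, a)
    return 2.0 * math.atan2(num, den)

def projected_face_area(vertices, radius):
    """Area of a convex face projected onto a sphere of `radius` at the origin."""
    omega = sum(triangle_solid_angle(vertices[0], vertices[k], vertices[k + 1])
                for k in range(1, len(vertices) - 1))
    return omega * radius**2

# One face of a cube centred on the origin covers one sixth of the sphere.
cube_face = [(1, 1, 1), (-1, 1, 1), (-1, -1, 1), (1, -1, 1)]
area = projected_face_area(cube_face, radius=1.0)
```

Because the result depends only on the directions of the face vertices, it is independent of how far the face lies from the atom centre, which is the property the angular contact area is meant to provide.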

Sample atom-atom contact frequencies
(Figure: contact frequency versus contact area in Å², 0 to 20, for N---O, N---Cb, Cb---Cb and N---N contacts; curves labelled 153l, 1ads, 1mcp.)

Calculation of scoring function
• uses 167 residue-specific atom types plus the solvent accessible surface, for a total of 168 contact types
• scores generated from a non-redundant database of 648 proteins
• A contact potential $e$ is calculated for each of the 167 × 168 possible contacts
• Corrected for atom distributions within proteins
• The score of a protein structure is determined by calculating all non-bonded contacts within the structure, multiplying each contact area by its contact potential, and summing:
$\mathrm{Score} = \sum e_{i(j)} \, \mathrm{Area}_{i(j)}$
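A minimal sketch of the final scoring step, assuming the contact potentials are already available as a lookup table keyed by a pair of contact types. The atom-type labels, potential values, and contact areas below are invented for illustration and are not taken from the talk.

```python
def contact_score(contacts, potential):
    """Sum potential * area over all non-bonded contacts.

    contacts:  iterable of (type_i, type_j, area) tuples
    potential: dict mapping (type_i, type_j) to a contact potential e
    """
    return sum(potential[(ti, tj)] * area for ti, tj, area in contacts)

# Hypothetical potentials (negative = favourable) and contact areas in square Angstroms.
potential = {
    ("leu CD1", "val CG2"): -0.5,
    ("lys NZ", "solvent"): -0.2,
    ("leu CD1", "solvent"): 0.8,
}
contacts = [
    ("leu CD1", "val CG2", 10.0),
    ("lys NZ", "solvent", 20.0),
    ("leu CD1", "solvent", 5.0),
]
score = contact_score(contacts, potential)
```

Treating solvent as just another contact type means solvent exposure needs no separate term in the sum.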

A few words on the reference state...
• reference states (expected distributions) have a large influence on statistical scoring functions
• here, the unfolded protein (maximum possible solvent contact) is used as a reference state
• results in a closed system, with a fixed amount of solvent
• statistics are independent of the size of the protein
• consistent with the idea that protein folding is largely due to hydrophobic interactions

Results - contact potential array
(Figure: 14 × 14 array of contact potentials between atom-type groups 1 to 14; colour scale from -3 to 3.)
Atom-type groups:
1. backbone Ca, backbone C, backbone N, backbone O
2. val Cg2, phe Cd1, phe Cd2, tyr Cd1, tyr Cd2, val Cg1, ile Cg2, leu Cd1, ile Cd1, leu Cd2, phe Cz, phe Ce2, phe Ce1, trp Ch2, trp Cz2, tyr Ce1, tyr Ce2, trp Cd1, met Ce, met Sd, trp Cz3, trp Ce3
3. val Cb, leu Cb, phe Cb, tyr Cb, trp Cb, met Cb, met Cg, ile Cg1, ile Cb, leu Cg, cys Cb, cys Cg
4. tyr Oh, his Ce1, his Ne2, arg Cd, arg Ne, lys Cg, gln Cg, asn Cb, ser Cb, asp Cb, glu Cb, gln Cb, thr Cb, arg Cg, arg Cb, lys Cb, pro Cd, his Cd2, thr Cg2, pro Cb, pro Cg, glu Cg, asn Cg, trp Ne1, his Nd1, asp Cg
5. tyr Cg, phe Cg, trp Cg, trp Cd2
6. tyr Cz, trp Ce2
7. his Cb, ala Cb
8. his Cg
9. glu Cd, gln Cd, arg Cz
10. thr Og1, ser Og, asn Nd2, gln Ne2, arg Nh1, arg Nh2, lys Cd, lys Ce
11. lys Nz
12. glu Oe2, glu Oe1, asp Od2, asp Od1
13. gln Oe1, asn Od1
14. Solvent

Testing of scoring functions
To provide independent tests of protein folding potentials, several groups have created decoy sets: misfolded models of proteins of known structure. An effective scoring function should be able to distinguish native structures from the decoys, and ideally select near-native structures as well. (Decoy sets with corresponding X-ray structures and less than 10% difference in number of atoms were used.)
Decoy sets and sources:
• EMBL, CASP1: http://prostar.carb.nist.gov (J. Moult, U. of Maryland)
• 4state, lattice_ssfit, lmds: http://dd.stanford.edu (M. Levitt, Stanford U.)
• Rosetta: http://depts.washington.edu.bakerpg (D. Baker, U. of Washington)
• CASP4: http://predictioncenter.llnl.gov/CASP4 (Lawrence Livermore National Laboratory)

Testing of scoring functions
(Figure: contact scores for the 1ctf decoy set (4state decoys); Score/atom, from 4 to -12, versus Ca rmsd in Angstroms, from 0 to 10.)

Histograms of native (red) and decoy (blue) scores for the Rosetta decoy monomers
Rank of native structure: 1acf 1/1000, 1aa2 1/1000, 1orc 1/1000, 1msi 29/1000, 1pal 1/1000, 1r69 1/1000, 1who 1/1000, 4fgf 1/1000, 5pti 1/1000, 5icb 9/1000

Histograms of native (red) and decoy (blue) scores for the Rosetta decoy oligomers
Rank of native structure: 1csp 1/1000, 1ctf 1/1000, 1ail 1/1000, 1bdo 1/1000, 1pdo 1/1000, 1kte 1/1000, 1erv 1/1000, 1gvp 1/1000, 1utg 1/1000, 1vls 1/1000, 2acy 1/1000, 1ris 1/1000, 2fha 1/1000

Comparisons with existing scoring functions
• comparisons were made as Z-scores and percent of Rank 1 native structures
• 4-state, lattice_ssfit, and lmds decoy sets (Samudrala and Levitt, 1999)
• 23 proteins, 250-2000 decoys per protein
$Z\text{-score} = \dfrac{S_{native} - \left(\sum_i S_i^{decoy} / n\right)}{\sigma_{decoy}}$
Scoring functions compared:
• HL: Hinds and Levitt, 1992
• BT: Betancourt and Thirumalai, 1999
• GKS: Godzik, Kolinski, Skolnick, 1995
• MJ: Miyazawa and Jernigan, 1996
• TE: Tobi and Elber, 2000
• BJ: Bahar and Jernigan, 1997
• MSE: McConkey, Sobolev, Edelman 2003
(Figure: average Z-score and % Rank 1 native structures for each scoring function.)
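The two comparison measures can be sketched as below. This is an illustrative sketch under stated assumptions: lower scores are taken to be better, the decoy standard deviation is the population standard deviation, and the decoy scores are invented. Under these assumptions a well-separated native structure gives a negative value; take the magnitude if a positive separation is wanted.

```python
import statistics

def z_score(native, decoys):
    """(S_native - mean decoy score) / standard deviation of decoy scores."""
    mean = statistics.fmean(decoys)
    sd = statistics.pstdev(decoys)  # population standard deviation (assumed)
    return (native - mean) / sd

def native_rank(native, decoys):
    """Rank of the native score among the decoys, with 1 = best (lowest) score."""
    return 1 + sum(1 for s in decoys if s < native)

# Hypothetical scores for one decoy set.
decoys = [-2.0, -1.0, 0.0, 1.0, 2.0]
native = -6.0
z = z_score(native, decoys)
rank = native_rank(native, decoys)
```

A "Rank 1" result in the sense used here corresponds to `native_rank` returning 1.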

Sample decoy sets from CASP4
(Figure: four panels, T0111 (1e9i), T0117 (1j90), T0125 (1gak) and T0123 (1exs), each plotting Score (-% native) versus C-alpha RMSD in Å, from 0 to about 25.)

Summary of decoy set testing
Performance of the atom-atom contact scoring function on decoy sets. Z-score is the distance from the native structures to the mean of the decoy set, measured in standard deviations.

Decoy set       Avg. # decoys   Sub-units:         Sub-units:     4° (native):       4° (native):
                per target      rank 1 solutions   avg. Z-score   rank 1 solutions   avg. Z-score
EMBL            1               25/25              n/a            25/25              n/a
CASP1           7               5/6                2.38           6/6                3.72
4state          665             7/7                3.86           7/7                4.08
lattice_ssfit   2000            8/8                8.17           8/8                9.21
lmds            453             6/8                4.96           8/8                7.80
CASP4           53              21/25              2.60           24/25*             3.01
Rosetta         1042            19/23              3.64           21/23*             4.38
Total                           101/112                           109/112

* missed structures: CASP4 - 1exs; Rosetta - 1msi, 5icb.

Summary of atom-atom contact scoring
• the Voronoi tessellation permits a precise and continuous quantification of atom-atom contacts
• the contact scoring function qualitatively resembles energetic interaction potentials
• the scoring function has a very high success rate for recognition of correctly folded protein structures, and has greater accuracy than other currently available scoring functions
• Native protein structures could be identified in 97% of the decoy sets tested

Observations from all-atom potential
• backbone atoms behave similarly, independent of residue type
• the statistical potential is less accurate for backbone atoms due to severe topology constraints (e.g. C--N interaction)
• backbone N and O are almost always H-bonded or solvent exposed
• there is a reasonably strong effect of neighboring atoms on the potential (e.g. Lysine NZ and Lysine CE)

But...
• the contact potential is still an all-atom potential
• requires all atoms to be positioned for a structure to be scored
• does not readily permit simplification of folding algorithms
• a simplified potential would be useful in the initial stages of protein folding
The same methodology used to create the all-atom potential has been used to create a simplified folding potential.

First attempt at simplification:
• reduce number of contact types from 168 to less than 30
• use residue types to define united atom types
• assume backbone atoms behave similarly
• GLY is treated as part of backbone
• implicitly includes interactions with solvent
• initial function remains area dependent

One possibility is a residue-residue potential: a beads-on-a-string model of the amino acid chain. Unfortunately, this approach has had only moderate success in the past.

A variation of beads on a string: a united-atom model of the amino acid chain
• backbone interactions are ignored (assumed to be hydrogen bonded)
• approximates a residue-residue contact potential

A United Atom potential
• uses the constrained Voronoi procedure as before, with a reduced number of atom types and excluding backbone interactions
• counting contacts between side-chains (i.e. excluding backbone atoms) may better model certain interactions within proteins
• e.g. interactions in beta-sheets:

United Atom potential #1
• the UA potential was compared to the all-atom potential using the Rosetta decoy set
• Area dependence was used: $\mathrm{Score} = \sum e_{i(j)} \, \mathrm{Area}_{i(j)}$
(Figure values: 21/23, 19/23, 19/23, 17/23)

United Atom potentials
• the initial United Atom function is still dependent on calculating contact areas between amino acid residues, so relies on knowledge of the position of side chains
• a binary potential (residues in contact or not) would be more useful, as it doesn’t require coordinate information
• A binary contact potential was developed, where sidechains were considered in contact if they shared > 8 Å² contact area.
• Solvent contact was also enumerated, with 10-30 Å² = 1 contact, 30-50 Å² = 2 contacts, etc.
• binary potential tested using Rosetta decoys
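The binarisation rules can be written down as a short sketch. The 8 Å² threshold and the 20 Å² solvent bins come from the slide; the handling of exact bin boundaries (30 Å² counted as 2 contacts) and of areas below 10 Å² (counted as 0) is an assumption, as are the function names.

```python
SIDECHAIN_THRESHOLD = 8.0  # square Angstroms, from the slide

def in_contact(area):
    """Two sidechains are in contact if they share more than 8 A^2."""
    return area > SIDECHAIN_THRESHOLD

def solvent_contacts(area):
    """Convert solvent contact area to a contact count.

    10-30 A^2 -> 1, 30-50 A^2 -> 2, and so on in 20 A^2 steps.
    Areas below 10 A^2 are counted as 0 (assumed).
    """
    if area < 10.0:
        return 0
    return 1 + int((area - 10.0) // 20.0)

# Hypothetical solvent contact areas and their contact counts.
examples = [solvent_contacts(a) for a in (9.9, 10.0, 29.9, 30.0, 55.0)]
```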

Binary United Atom potential
Sample data: 1msi from the Rosetta decoy set
(Figure: Score versus C-alpha RMSD for 1msi, all-atom potential and binary-UA potential.)

Binary United Atom potential
• The binary UA potential recognized the native structure in all test sets except one (5pti, rank 2/1000)
(Figure values: 22/23, 21/23, 20/23, 19/23, 19/23, 17/23)

Applications to Protein Contact maps
(Figure: 12 Å Cα contact map for 2csn, casein kinase-1. Smith et al, 1997)
• contact maps specify both secondary structure and inter-residue contacts
• a detailed contact map provides sufficient information to reconstruct a 3-D structure
• generation of a large set of feasible contact maps can reproduce near native structures

Protein structure prediction: Contact maps
• Some issues with distance-based contact maps:
• typically use Cα-Cα distances, so depend on an appropriate choice of cutoff
• a short cutoff distance biases the map towards contacts within secondary structures
• a longer cutoff distance results in more contacts, and a noisy data set
• Cα atoms in close proximity may have little interaction, e.g. n, n+2 residues in an alpha helix
• contact with solvent is not readily integrated
• A tessellation procedure based on residue sidechains can circumvent some of these issues
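The cutoff sensitivity of a distance-based map can be seen in a small sketch. It assumes one representative coordinate per residue (for example the Cα position); the coordinates are invented, the 12 Å default follows the contact map shown earlier in the talk, and the function name is hypothetical.

```python
import math

def distance_contact_map(coords, cutoff=12.0, min_separation=1):
    """Symmetric boolean contact map: True where two residues lie within `cutoff`.

    coords: one (x, y, z) tuple per residue, in Angstroms.
    min_separation: smallest sequence separation |i - j| that is reported.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

def count_contacts(cmap):
    """Number of residue pairs in contact (upper triangle only)."""
    return sum(cmap[i][j] for i in range(len(cmap)) for j in range(i + 1, len(cmap)))

# Hypothetical coordinates: three residues spaced 3.8 A apart, one far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
short_map = distance_contact_map(coords, cutoff=5.0)
long_map = distance_contact_map(coords, cutoff=12.0)
```

Raising the cutoff adds the pair (0, 2) even though nothing about the geometry says those residues interact, which is the noise problem the bullets describe.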

Voronoi Contact maps
• similar to Cα-Cα distance-based maps
• uses a tessellation procedure to determine if residues are in contact
• contacts can be subdivided by type:
  - sidechain contacts
  - backbone contacts
  - both sidechain and backbone
• results in recognizable patterns of interaction within and between secondary structures
• it is possible to integrate solvent contact into this scheme as well

Voronoi Contact maps

Voronoi Contact maps C distance map Voronoi map

Voronoi Contact map feature recognition
Using the contact preferences from residue-residue scores, it is possible to recognize regions of secondary structure, and interactions between secondary structure elements.
(Figure panels: alpha helix; alpha-alpha; antiparallel beta-beta; beta-alpha; parallel beta-beta.)

Future work
• Further refinement of binary contact scoring functions
  - incorporate different contact types
  - beta sheet vs. alpha helix
• Development of search procedures to explore contact map space
• Other unrelated stuff
  - proteomics
  - gene expression and divergence
  - physicochemical pattern recognition

Thanks! ....questions?
