Gregory M. Reck PhD Dissertation Defense

Adaptation and Use of Four-body Statistical Potential to Examine Thermodynamic Properties of Proteins Gregory M. Reck PhD Dissertation Defense

Topics • Background & Approach • Adaptation of tessellation programs for hydration • Derivation of potentials from hydrated proteins • Identifying relationships between stability and potential

Background • A group of diseases have been associated with the formation of amyloid deposits in tissue, each has been linked to a specific protein precursor (typically a mutant) that appears to misfold • The misfolding leads to protein aggregation, formation of fibrils, and eventually insoluble amyloid; the resulting pathologies vary widely • No apparent sequence or structure similarity among precursors, but the resulting amyloids have a similar cross-beta structure • Some model proteins (non-disease) can be induced to form amyloid fibril structures through destabilization • If instability is a significant factor in the aggregation process for globular precursors, then a knowledge-based tool capable of reporting on stability features of a protein may be a valuable research tool

Approach • Use the tessellation-based statistical potential • Can be characterized as an indicator of the sequence-structure compatibility of a protein • Can provide a potential profile along the primary sequence that may identify key features • Begin with some adaptations to the model aimed at improving stability relationships • Previous work has had limited success in defining the relationship between protein stability and the protein total potential • Evaluate the correlation between potential and stability • Test run the revised potential on some protein/s with known amyloid characteristics

The Delaunay simplices (DS) are constructed from the VC -- red triangles that connect the C points of the 3 VC’s that meet at VC vertices The C’s at the vertices of the DS form groups of “nearest neighbors”, triplets in 2-D and quadruplets in 3-D 2-D Overview of Tessellation • Begin with set of points (coordinates) that represent amino acid positions in the protein (red spots, C) • Build a Voronoi cell (VC) around each point, so that all points in the cell are closer to the C at the center than any other (blue dashed lines) • The vertices of the VC are points that represent the intersection of 3 VC • (Note: the VC at the boundary points are not closed) List of quadruplets (from the 3-D PDB file) is the primary output

Four-body Statistical Potential Derivation • Define a statistical potential q that represents observed frequency of occurrence of a given quadruplet frequency of random occurrence of a given quadruplet where i, j, k, l are amino acid residues • The observed frequency is determined by tessellating a large nonredundant set of representative proteins (normalized) • The frequency of random occurrence is pijkl = caiajakal where the a’s are the frequencies of individual amino acid residues, and c is the permutation factor to account for replicated residue types in a quadruplet where n is the number of distinct residue types in a quadruplet, and tiis the number of amino acids of type i • Thus qijklrepresents the likelihood of finding four particular residues in one simplex • The potential function is the aggregate function of the values of all possible qijkl

Four-body Statistical Potential Parameters • Total potential of a study protein is the sum of the individual q values of each of the N quadruplets in the protein • Residue potential for a given residue is the sum of the individual q values of each quadruplet that includes that residue, thus it represents the local environment surrounding that residue • Potential profile is a vector associated with the primary sequence of the protein where each vector component is the residue potential for the amino acid residue at that position (shown below for barnase)

Adaptations of the Statistical Potential • Model the protein environment - modify the approach to provide an explicit (rather than implicit) representation of the surrounding environment, typically water • Eliminate distorted simplex elements at the protein surface that introduce unlikely neighbors (without arbitrary edge length cutoffs) Computationally hydrating the protein can resolve both issues

Computational Hydration • Used SOLVATE* to place water at representative positions around the subject protein • Initial grid placement, short van der Waals optimization, no electrostatics • Minimum 5Å shell thickness • Isolated groups identified

Discarded waters First hydration shell waters Voronoi cells Crevice in protein surface Delaunay simplex edges Tessellating Hydrated Proteins • HOH treated as another residue position and tessellated • Quadruplets with 4 HOH are discarded • HOH fills surface contours and interrupts the long distorted simplices • Internal voids may be filled, based on steric considerations

Aspects of SOLVATE HOH Placements(based on a reference set of 1353 proteins) (a) Number of HOH placements (b) Characteristic edge lengths (c) Separation between SOLVATE HOH and X-ray crystal HOH • Examined internal and surface (crevice) water placements (used DOWSER) • SOLVATE isolated HOH groups - full vs. bulk hydration • Resolved to include internal HOH, and use “full” hydration

Residue Classification Using Tessellation • Interest in classifying each residue as either surface or buried • Compared 2 strategies, both based on tessellation • Simplex face match method (unhydrated technique) • Water group method (uses hydrated protein information) • Comparison parameters: • Accessible surface area (NACCESS, Hubbard) • Residue mean depth (DPX, Pintar) • C-alpha circular variance (Simulaid, Mezei) • Water coordination number ratio

No face match S Face match U B 1 2 0 2 2 0 1 Tessellation-based Residue Classification (Surface or Core) • Simplex face match method • Simplices at the surface will have at least one face that is not matched by another simplex face, just find unmatched • Residues on unmatched faces are “surface”, residues connected to them by a simplex edge are “undersurface”, remainder are “buried” • Water group method • Separate simplices surrounding a residue into groups based on HOH content (0, 1, 2, or 3 HOH), count the number of simplices in each. • Plot the distribution of the number of HOH molecules in each group, fit a linear regression. • If the slope is positive, the residue is a “surface” residue, otherwise “buried.” • The value of the slope is the Water Group Parameter.

Cross comparison of 2 methods • Water group method classifies more residues as ‘surface’ than the simplex face match method • Both methods show distinctive statistical distributions using conventional topological parameters (ASA, depth, circular variance) • The water group method is applied in the subsequent development of the statistical potentials

Deriving a Statistical Potential with Hydrated Proteins • Given a representative method for hydration, need to incorporate the hydrated protein simplices into a potential function • Issue: Null condition of a random distribution of residues certainly does not apply when HOH is added • Developed and examined three approaches: • 21st residue: simply define HOH as a residue • Split Potential (SP): develop separate potentials for the surface and core • Water Group (WG): modify the denominator (expected frequency) to incorporate the probability of water differently • Reference set of 1353 proteins selected using PISCES*, max 30% sequence identity, < 2.2 Å resolution

Strategies 1 & 2: 21st and SP • The 21st residue potential function is computed and used in the same manner as for unhydrated proteins: observed frequency of occurrence of a given quadruplet frequency of random occurrence of a given quadruplet • The Split Potential function: • Splits the simplices into 2 groups based on residue classification of surface or core using the Water Group method just described (not mutually exclusive) • Also uses above expression to compute a potential function for each • The Split Potential function must be applied differently: • For SP, the total potential is the sum of the residue potentials rather than the sum of the simplex potentials • Thus the SP total potential may be up to 4 times as large • But the residual profiles will remain comparable

Strategy 3: Water Group Potential Function • In this strategy, the denominator is changed to separate the probability of the HOH content or ‘water group’ of the quadruplet from the probability of the residues: where mX is number of HOH in ijkl and nR is number of residues in ijkl (mX + nR = 4) is the observed probability of the water group (0X, 1X, 2X or 3X), is the product of the observed probabilities of the nR residues in the simplex is the permutation factor for the residues in the specific water group • The permutation factor depends on the number of natural residues in the simplex, for n residues: where r is the number of residues and t is the count of type i residues in the simplex • The probability of the individual residues is based on the total residue count

Comparison of Potentials • No hydration yields 8855 possible quads, 21st residue has 10,625 • Values of 21st and SP are elevated, compared to WG and unhydrated • Fine structure at low values are water groups • Absolute values of CM higher than CA

Evaluating the potentials: decoy discrimination • Tests the ability of the model to identify a native protein from a large set of non-native decoy structures. Sets are drawn from modeling activities and are generally plausible (but incorrect) structures. • Four decoy sets were drawn from ‘Decoys-R-Us’ (1) 4_state_reduced, (2) fisa, (3) lmds, (4) lattice_ssfit • Potentials were computed for entire set (including native) and ranked by potential score, highest score is presumed to be most compatible sequence-structure fit, hopefully the native structure. • Metrics: • Rank of the native protein structure, 1 is best • 0 is worst, more negative is better • Log score index = , 0 is worst and 1 is best • Compared: unhydrated, 3 hydrated strategies, CA, CM, bulk, full

Comparison with Reported Models

Stability Datasets & Metrics • Stability data were initially identified using ProTherm* database • Focused on 3 model proteins with extensive single point mutant stability data • Barnase (132 mutants, denaturant stability ddG, kcal/mol) • Staphylococcal nuclease (498 mutants, denat.) • T4 Lysozyme (377 mutants sorted into 4 pH groups, thermal stability dTM , deg. C) • Also examined conservative vs. non-conservative stability correlations using the Dayhoff groups: {A,S,T,G,P}, {V,L,I,M}, {R,K,H}, {D,E,N,Q,}, {F,Y,W}, {C} • Correlation metrics: (1) Pearson r correlation coefficient, (2) R2 regression metric

Computational Mutagenesis:Determine change in potential associated with a mutation • Determine the protein potential in the usual way: tessellate to get the simplices, then sum potentials for all simplices in the protein to get the w.t. protein potential Assumption: No change in residue coordinates associated with mutation • Change the identify of the target residue to the mutant in every simplex where it appears, then the mutant potential is the sum of the new simplex set • Residual potential = (mutant total potential - w.t. total potential) • Residual profile is the analogous calculation done for each residue position along the sequence.

Stability Correlation Coefficients for Barnase and S. nuclease(Observed change in stability vs. computed change in potential)

T4 LysozymeStabilityCorrelations

Specific Residue Types in S. nuclease

Summary of the Stability Correlations • Significant correlations were observed with all 3 study proteins, but not using all three potentials - barnase was the most difficult and correlated with only CA and SP, T4 lysozyme gave the highest correlation values • For the 3 potential functions, the SP potential achieved the highest levels of correlation and CA was comparable on many groups • Correlations were seen with both conservative and non-conservative substitutions • Correlations were also seen with several specific residue groups • Apparent that there is a relationship between mutational change in stability and the corresponding single-valued residual score

Application of Machine Learning (ML) Tools • Objective is to identify stability content in the mutant residual profiles • Used four supervised ML classifiers available in the Weka* environment • Decision tree (DT): J48(C4.5), Quinlan* • Random Forests (RF): multiple DT with random elements and voting, Breiman* • Support Vector (SV): SMO, nonlinear transform to extend linear techniques, Platt* • Regression SV: SMOreg enables numeric classes, linear or polynomial transforms, Smola* • Focused initially on S. nuclease mutant residual data

Residual Profile Format for ML Tools • Input to ML tools is set of instance vectors, the elements of each vector are the attributes and the class of the instance is included (supervised) • The residual profile of each mutant was converted into an instance vector by assigning the value of the profile at each residue position to a corresponding element (attribute) in the instance vector • In some cases, additional attributes were included: w.t. & mutant identity & position, ASA, Dayhoff class (C/NC), structure (H,S,T,C) • Full attribute set • NoResid: removed the residual profile and total • NoID: removed the w.t. and mutant identity and location • ResidOnly: removed everything except the residual profile and total • Stability data were discretized into 2, 3 and 4 classes (~ equal size)

ML Classification Procedures • Typically a small portion of the instances are removed, the ML tool is trained on the remainder, then tested on the withheld data - this may be a fixed ratio, such as 66% training/34% test • Alternatively a 10-fold cross validation (10fCV) divides instances in 10 parts, then successively withholds 1 part for test and trains on the remainder 10 times, finally computes performance on the aggregate predictions; for small data sets can use ‘leave 1 out’ strategy, equivalent to n-fold CV where n = no. instances • When the number of instances for various classes are skewed, the actual classes in the training set should be stratified to reflect the class distribution, • Weka also provides a strategy in which the instances can be weighted according to a cost matrix to compensate for skew during the training process

ML Performance Metrics • Success rate: • Kappa statistic: subtracts the expected random predictions from the Success Rate numerator & denominator; ranges from 1 - perfect score to 0 - random predictor • ROC curve: unit square plot of the TP Rate vs. FP Rate for ranked instances; insensitive to skew; • Area under ROC (AUC): • Single-valued indicator, equivalent to Wilcoxon or Mann-Whitney test of ranks • Probability that randomly chosen + instance will rank higher than a randomly chosen - instance; • AUC = 1 for a perfect predictor, 0.5 for a random classifier

Performance Comparison of 3 MethodsBased on 10 reps of 10-fold CV • Tools were run with ‘full’ attribute content and the stability data was split into 2 stability classes, either larger or smaller than ddG = -1.2 kcal/mol, error bars indicate the standard deviation across the 10 repetitions • The RF method is significantly better in AUC over the other two, and the SV is slightly ahead of DT in AUC and Kappa (not sig.) • All three potentials perform essentially the same

Effect of Number of Classes Based on 10 reps of 10-fold CV • These data are all with the ‘residual only’ attribute set using data from the SP potential, except for the black circles that show the ‘full’ attribute set for comparison • Increasing the classes clearly reduces the accuracy of the prediction • Here the DT outperforms the SV tool, but RF still shows highest values especially for the AUC

Random Forests model of Transthyretin Amyloid Behavior • A number of mutations of human transthyretin (TTR) have been identified that are associated with specific amyloid pathologies, as well as a limited number that are not amyloidogenic • Residual profiles of 77 amyloid-related and 12 non-amyloid-related mutants were used to train an RF model • Since the input data are highly skewed, a cost matrix was applied to weight the input data to balance the classes; the FP/FN cost ratio was increased until the FP/TN ratio changed • The performance of the model was evaluated and the model was used to predict amyloid behavior for hypothetical TTR mutants

TTR model performance • CA consistently shows better metrics than hydrated potentials • Cost matrix was over-driven to force change in FP [0,1;25,0] • Model performance levels are much lower than observed with S. nuclease, but still well above random levels - Kappa increased to 0.45 with cost ratios of 1:15

Prediction of TTR mutant amyloid behavior based on RF model using SP residuals Mutant array with unweighted model Model trained using 25:1 cost ratio w.t. residue Predicted amyloid True non-amyloid mutant Predicted non-amyloid True amyloid mutant

Conclusions • Computational hydration with a representative water shell yields a more consistent tessellation • Tessellation of hydrated proteins provides a unique topological residue classification method • Hydration (with H atom addition) using either the Water Group or the Split Potential strategy improves the decoy discrimination capabilities of the tessellation-based potential, to be competitive with other KB methods • Demonstrated distinct correlations of stability and residual potential for several proteins and several classes of residues • ML tools showed that information in the residual profile can be used to classify single point mutants by their stability change • An RF model demonstrated reasonable performance in predicting amyloid characteristics of human transthyretin mutants

Committee Dr. Vaisman, Director Dr. Willett Dr. Jamison Dr. Born Bioinformatics Staff Glenda Wilson Chris Ryan Structural Bioinformatics Group Dr. Masso Dr. Mathe Dr. Taylor Dr. Carr NASA Office of Educational Services Acknowledgements

Publications • Papers: • Reck, G.M. & Vaisman, I.I. A novel approach to protein-water interaction characteristics using computational geometry, Proteins (submitted). • Reck, G.M. & Vaisman, I.I. Nearest-neighbor contact potentials derived from Delaunay tessellation of hydrated proteins, Biophysical Journal (submitted). • Reck, G.M. & Vaisman, I.I. Use of statistical potentials derived from Delaunay tessellation to characterize changes in protein stability due to single point mutations, (manuscript in preparation). • Conferences: • Reck, G.M. & Vaisman, I.I. An Examination of Protein Stability Using Delaunay Tessellation that Includes Surface Hydration Effects, Intelligent Systems for Molecular Biology (ISMB), Detroit, MI, June 25-29, 2005. • Reck, G.M. & Vaisman, I.I. Use of Four-body Statistical Potential to Correlate Stability Changes in Protein G, International Conference on Structural Genomics (ICSG), Washington Hilton Hotel, Washington, DC, Nov. 17-21, 2004.

Availability of tools and results • Resources (protein list, reference set of hydrated proteins, etc.,) will be available at http://binf.gmu.edu/greck • Plan an interactive site for hydrating and tessellating proteins

ROC Curve Comparison of 3 ML Methods • All models were run with a full set of attributes. • Data for the DT and RF curves are from a single 10fCV, while the SV was generated by varying the mis-classification costs over a number of runs • The curves are quite consistent for DT and RF

Effect of Attribute Content Based on 10 reps of 10-fold CV • Data are shown for the SP potential only, error bars are std dev over 10 reps • Added content in the full instance vectors yields better performance across all metrics • Again, the RF tool yields the highest values, particularly AUC

S. nuclease Non-conservative Mutants 27 61 121 148 Bartlett’s test indicates unequal variances among the bins (groups) Welch’s test, similar to t test with unequal variances rejects H0: equal means Pairwise t test with unpooled s.d. indicates that not all group means are significantly different

Gregory M. Reck PhD Dissertation Defense