Protein Structure Prediction

Protein Structure Prediction Xiaole Shirley Liu And Jun Liu STAT115

Protein Structure Prediction Ram Samudrala University of Washington

Outline • Motivations and introduction • Protein 2nd structure prediction • Protein 3D structure prediction • CASP • Homology modeling • Fold recognition • ab initio prediction • Manual vs automation • Structural genomics STAT115

Sequence determines structure, structure determines function Most proteins can fold by itself very quickly Folded structure: lowest energy state Protein Structure STAT115

Protein Structure • Main forces for considerations • Steric complementarity • Secondary structure preferences (satisfy H bonds) • Hydrophobic/polar patterning • Electrostatics

Rationale for understanding protein structure and function structure determination structure prediction Protein structure - three dimensional - complicated - mediates function homology rational mutagenesis biochemical analysis model studies Protein sequence -large numbers of sequences, including whole genomes ? Protein function - rational drug design and treatment of disease - protein and genetic engineering - build networks to model cellular pathways - study organismal function and evolution

Protein Databases • SwissProt: protein knowledgebase • PDB: Protein Data Bank, 3D structure STAT115

View Protein Structure • Free interactive viewers • Download 3D coordinate file from PDB • Quick and dirty: • VRML • Rasmol • Chime • More powerful • Swiss-PdbViewer

Compare Protein Structures • Structure is more conserved than sequence • Why compare? • Detect evolutionary relationships • Identify recurring structural motifs • Predicting function based on structure • Assess predicted structures • Protein structure comparison and classification • Manual: SCOP • Automated: DALI

3.6 Å 2.9 Å NK-lysin (1nkl) Bacteriocin T102/as48 (1e68) T102 best model Compare protein structures • Need ways to determine if two protein structures are related and to compare predicted • models to experimental structures • Commonly used measure is the root mean square deviation (RMSD)of the Cartesian • atoms between two structures after optimalsuperposition (McLachlan, 1979): • Usually use Caatoms • Other measures include contact maps and torsion angle RMSDs

SCOP • Compare protein structure, identify recurring structural motifs, predict function • A. Murzin et al, 1995 • Manual classification • A few folds are highly populated • 5 folds contain 20% of all homologous superfamilies • Some folds are multifunctional STAT115

Determine Protein Structure • X-ray crystallography (gold standard) • Grow crystals, rate limiting, relies on the repeating structure of a crystalline lattice • Collect a diffraction pattern • Map to real space electron density, build and refine structural model • Painstaking and time consuming STAT115

Protein Structure Prediction • Since AA sequence determines structure, can we predict protein structure from its AA sequence? = predicting the three angles, unlimited DoF! • Physical properties that determine fold • Rigidity of the protein backbone • Interactions among amino acids, including • Electrostatic interactions • van der Waals forces • Volume constraints • Hydrogen, disulfide bonds • Interactions of amino acids with water STAT115

Protein folding landscape Large multi-dimensional space of changing conformations J=10-3 s unfolded barrier height molten globule DG* native J=10-8 s free energy folding reaction

Protein primary structure twenty types of amino acids two amino acids join by forming a peptide bond R R H O H C H H N OH Cα Cα Cα OH N C N C H H O O H H R each residue in the amino acid main chain has two degrees of freedom (f and y) R R H O H O H H c c y f y f C N C f N f Cα Cα Cα Cα N C N C y y c c H H O H O H R R the amino acid side chains can have up to four degrees of freedom (c1-4)

2nd Structure Prediction •  helix,  sheet, turn/loop STAT115

2nd Structure Prediction • Chou-Fasman 1974 • Base on 15 proteins (2473 AAs) of known conformation, determine P, Pfrom  0.5-1.5 • Empirical rules for 2nd struct nucleation • 4 H or h out of 6 AA, extends to both dir, P > 1.03, P > P, no  breakers • 3 H or h out of 5 AA, extends to both dir, P > 1.05, P > P, no  breakers • Have ~50-60% accuracy STAT115

P and P STAT115

2nd Structure Prediction • Garnier, Osguthorpe, Robson, 1978 • Assumption: each AA influenced by flanking positions • GOR scoring tables (problem: limited dataset) • Add scores, assign 2nd with highest score STAT115

2nd Structure Prediction • D. Eisenberg, 1986 • Plot hydrophobicity as function of sequence position, look for periodic repeats • Period = 3-4 AA,  (3.6 aa / turn) • Period = 2 AA,  sheet • Best overall JPRED by Geoffrey Barton, use many different approaches, get consensus • Overall accuracy: 72.9% STAT115

3D Protein Structure Prediction • CASP contest: Critical Assessment of Structure Prediction • Biannual meeting since 1994 at Asilomar, CA • Experimentalists: before CASP, submit sequence of to-be-solved structure to central repository • Predictors: download sequence and minimal information, make predictions in three categories • Assessors: automatic programs and experts to evaluate predictions quality STAT115

CASP Category I • Homology Modeling (sequences with high homology to sequences of known structure) • Given a sequence with homology > 25-30% with known structure in PDB, use known structure as starting point to create a model of the 3D structure of the sequence • Takes advantage of knowledge of a closely related protein. Use sequence alignment techniques to establish correspondences between known “template” and unknown. STAT115

CASP Category II • Fold recognition (sequences with no sequence identity (<= 30%) to sequences of known structure • Given the sequence, and a set of folds observed in PDB, see if any of the sequences could adopt one of the known folds • Takes advantage of knowledge of existing structures, and principles by which they are stabilized (favorable interactions) STAT115

CASP Category III • Ab initio prediction (no known homology with any sequence of known structure) • Given only the sequence, predict the 3D structure from “first principles”, based on energetic or statistical principles • Secondary structure prediction and multiple alignment techniques used to predict features of these molecules. Then, some method necessary for assembling 3D structure. STAT115

Structure Prediction Evaluation • Hydrophobic core similar? • 2nd struct identified? • Energy: minimized? H-bond contacts? • Compare with solved crystal structure: gold standard STAT115

Comparative modelling of protein structure scan align KDHPFGFAVPTKNPDGTMNLMNWECAIP KDPPAGIGAPQDN----QNIMLWNAVIP ** * * * * * * * ** … … build initial model construct non-conserved side chains and main chains refine

Homology Modeling Results • When sequence homology is > 70%, high resolution models are possible (< 3 Å RMSD) • MODELLER (Sali et al) • Find homologous proteins with known structure and align • Collect distance distributions between atoms in known protein structures • Use these distributions to compute positions for equivalent atoms in alignment • Refine using energetics STAT115

Homology Modeling Results • Many places can go wrong: • Bad template - it doesn’t have the same structure as the target after all • Bad alignment (a very common problem) • Good alignment to good template still gives wrong local structure • Bad loop construction • Bad side chain positioning STAT115

Homology Modeling Results • Use of sensitive multiple alignment (e.g. PSI-BLAST) techniques helped get best alignments • Sophisticated energy minimization techniques do not dramatically improve upon initial guess STAT115

Fold Recognition Results • Also called protein threading • Given new sequence and library of known folds, find best alignment of sequence to each fold, returned the most favorable one STAT115

Fold Recognition with Dynamic Programming • Environmental class for each AA based on known folds (buried status, polarity, 2nd struct) STAT115

Protein Folding with Dynamic Programming • D. Eisenburg 1994 • Align sequence to each fold (a string of environmental classes) • Advantages: fast and works pretty well • Disadvantages: do not consider AA contacts STAT115

Fold Recognition Results • Each predictor can submit N top hits • Every predictor does well on something • Common folds (more examples) are easier to recognize • Fold recognition was the surprise performer at CASP1. Incremental progress at CASP2, CASP3, CASP4… STAT115

Fold Recognition Results • Alignment (seq to fold) is a big problem STAT115

ab initio • Predict interresidue contacts and then compute structure (mild success) • Simplified energy term + reduced search space (phi/psi or lattice) (moderate success) • Creative ways to memorize sequence  structure correlations in short segments from the PDB, and use these to model new structures: ROSETTA STAT115

select Ab initio prediction of protein structure sample conformational space such that native-like conformations are found hard to design functions that are not fooled by non-native conformations (“decoys”) astronomically large number of conformations 5 states/100 residues = 5100 = 1070

Sampling conformational space – continuous approaches energy • Most work in the field • Molecular dynamics • Continuous energy minimization (follow a valley) • Monte Carlo simulation • Genetic Algorithms • Like real polypeptide folding process • Cannot be sure if native-like conformations are sampled

Molecular dynamics • Force = -dU/dx (slope of potential U); acceleration, force =m ×a(t) • All atoms are moving so forces between atoms are complicated functions of time • Analytical solution for x(t) and v(t) is impossible; numerical solution is trivial • Atoms move for very short times of 10-15 seconds or 0.001 picoseconds (ps) • x(t+Dt) = x(t) + v(t)Dt + [4a(t) – a(t-Dt)] Dt2/6 • v(t+Dt) = v(t) + [2a(t+Dt)+5a(t)-a(t-Dt)] Dt/6 • Ukinetic = ½ Σ mivi(t)2 = ½ n KBT • Total energy (Upotential + Ukinetic) must not change with time acceleration old velocity old position new position new velocity n is number of coordinates (not atoms)

starting conformation energy deep minimum number of steps Energy minimization • For a given protein, the energy depends on thousands of x,y,z Cartesian atomic coordinates; reaching a deep minimum is not trivial • Furthermore, we want to minimize the free energy, not just the potential energy.

Monte Carlo Simulation • Propose moves in torsion or Cartesian conformation space • Evaluate energy after every move, compute E • Accept the new conformation based on • If run infinite time, the simulated conformation follows the Boltzmann distribution • Many variations, including simulated annealing and other heuristic approaches.

Scoring/energy functions • Need a way to select native-like conformations from non-native ones • Physics-based functions: electrostatics, van der Waals, solvation, bond/angle terms. • Knowledge-based scoring functions: • Derive information about atomic properties from a database of experimentally determined conformations • Common parameters include pairwise atomic distances and amino acid burial/exposure.

Rosetta • D. Baker, U. Wash • Break sequence into short segments (7-9 AA) • Sample 3D from library of known segment structures, parallel computation • Use simulated annealing (metropolis-type algorithm) for global optimization • Propose a change, if better energy, take; otherwise take at smaller probability • Create 1000 structures, cluster and choose one representative from each cluster to submit STAT115

Manual Improvements and Automation • Very often manual examination could improve prediction • Catch errors • Need domain knowledge • A. Murzin’s success at CASP2 • CAFASP: Critical Assessment of Fully Automated Structure Prediction • Murzin Can’t play!! • MetaServers: combine different methods to get consensus STAT115

CAFASP Evaluation STAT115

Structural Genomics • With more and more solved structures and novel folds, computational protein structure prediction is going to improve • Structural genomics: • Worldwide initiative to high throughput determine many protein structures • Especially, solve structures that have no homology STAT115

Summary • Protein structures: 1st, 2nd, 3rd, 4th • Different DB: SwissProt, PDB and SCOP • Determine structure: X-ray crystallography • Protein structure prediction: • 2nd structure prediction • Homology modeling • Fold recognition • Ab initio • Evaluation: energy, RMSD, etc • CASP and CAFASP contest • Manual improvement and combination of computational approaches work better • Structural Genomics, still very difficult problem… STAT115

Acknowledgement • Amy Keating • Michael Yaffe • Mark Craven • Russ Altman STAT115

Protein Structure Prediction