Can protein model accuracy be identified?

Can protein model accuracy be identified? NO! Morten Nielsen, CBS, BioCentrum, DTU

Identification of Protein-model accuracy • Why is it important? • What is accuracy • RMSD, fraction correct,… • Protein model correctness/quality • Procheck, Whatif, ProsaII, Verify3d • Prediction of protein model accuracy • ProQ server

Why is it so important? • Reliable fold recognition • P-value, E-value, Z-score… • Tells you if you should believe in the fold!! • Alignment (model construction) • No obvious method to estimate reliability of alignment • Number of gaps, length of gaps • Amino acids in protein core and loops • % id is too conservative • Many low homology models are accurate, and some high homology models are wrong • Correct fold, wrong alignment => Terrible model • How to gain confidence in a protein model?

Model accuracy. Swiss-Model. 1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)

What is protein model accuracy? • Model quality (correctness) • Does the model look like a protein? • Hydrophobic residues in core, hydrophilic on surface • Backbone geometry (phi/psi angles, bond lengths) • Amino acid environment • A "correct" (protein-like) model can still be completely wrong • Structure accuracy (if we know the answer) • RMSD • Fraction of correctly modeled residues

Amino acid environment • 1,000,000 different protein sequences (Swissprot) • 10,000 different solved protein structures (PDB) • 600 different protein folds => Typical amino acid environment

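Model accuracy • Blue: model • Yellow: structure • dij: distance between corresponding residues of model and structure • Fraction correct = Nc/N • Nc = number correct (dij < 4 Å) • N = total number of residues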
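A minimal sketch of the two accuracy measures named on this slide. It assumes the model and the structure are already superimposed and that residues are matched one to one; the coordinates are made-up toy values and the function names are hypothetical, not taken from the presentation.

```python
import math

def per_residue_distances(model, structure):
    """Distance between each model residue and its matched structure residue."""
    if len(model) != len(structure):
        raise ValueError("model and structure must have the same number of residues")
    return [math.dist(m, s) for m, s in zip(model, structure)]

def rmsd(distances):
    """Root mean square deviation over all matched residues."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def fraction_correct(distances, cutoff=4.0):
    """Nc/N, where Nc counts residues closer than the cutoff (4 Å on the slide)."""
    n_correct = sum(1 for d in distances if d < cutoff)
    return n_correct / len(distances)

# Toy coordinates (hypothetical): deviations of 0, 1, 3 and 6 Å.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
structure = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 3.0), (11.4, 6.0, 0.0)]

d = per_residue_distances(model, structure)
print(fraction_correct(d))   # 0.75
print(round(rmsd(d), 2))     # 3.39
```

The single 6 Å outlier dominates the RMSD, while the fraction correct only records that one residue out of four is misplaced.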

Evaluation of model quality • Check for proper protein stereochemistry • ProCheck (http://biotech.ebi.ac.uk:8400/cgi-bin/sendquery) • Ramachandran plot, bond-length, … • Whatif (http://www.cmbi.kun.nl/gv/servers/WIWWWI) • Packing quality • Both web-servers • Fitness of sequence to structure • ProsaII (http://lore.came.sbg.ac.at/Services/prosa.html) • Program runs on Linux and Unix • Verify3D (http://www.doe-mbi.ucla.edu/Services/Verify_3D/) • Web-server

ProCheck. Peptide backbone geometry • Peptide planes • Cα N C Cα • Dihedral angles ψ, φ • ψ, φ = 180 degrees • β strand • ψ, φ = -60 degrees • α helix • From speedy.st-and.ac.uk/.../lectures/ 3014/lecture/dars1.htm
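The backbone dihedrals can be computed from four consecutive atom positions: φ uses C(i-1), N(i), Cα(i), C(i) and ψ uses N(i), Cα(i), C(i), N(i+1). The following is a sketch using the standard atan2 formulation; the helper names and the test points are illustrative assumptions, not part of ProCheck.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, in (-180, 180]."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy points (hypothetical): the outer bonds are perpendicular around the z axis.
print(round(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                     (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))))  # 90
```

A fully extended (trans) arrangement gives ±180 degrees and a cis arrangement gives 0.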

Ramachandran plot • B. Beta strand • A. Right handed helix • L. Left handed helix • Color coding • White. Disallowed • Red. Most favorable • Yellow. Allowed region • Glycine triangles

Find the wrong structure 1PLC Electron transport protein 1RIP Ribosomal protein.

Procheck. Bond length 1plc

1plc

What-if. Fine packing Quality • Statistical description of local chemical environment in high quality protein structures • Superimpose tryptophans and find average local environment. Same for other amino acids • Full atom model G. Vriend and C. Sander, 1992

Example. Casp Model T0133 • T0133 Casp5 target • Modeled by X3M (CPHModels-2.0, Lund O., 2002) • RMSD=7.3

Casp Model - Fine packing quality
BB: Backbone, SC: Sidechain

---Residue-----  State  AllAll   BB-BB   BB-SC   SC-BB   SC-SC
-------------------------------------------------------------------------
  1 ILE (  33 )    2    -0.737  -0.462   0.331  -1.312  -0.865
  2 SER (  34 )    2    -0.241   0.209  -0.021  -1.437  -1.421
…..
245 ALA ( 296 )    2    -1.919  -1.770  -1.264   0.000   0.000
246 GLU ( 297 )    3    -1.384  -0.641  -1.400   0.070  -1.132
247 HIS ( 298 )    3    -1.476  -1.211  -1.736  -0.874  -1.427
============================================================
All contacts   : Average = -0.459  Z-score = -3.05
BB-BB contacts : Average = -0.155  Z-score = -1.14
BB-SC contacts : Average = -0.445  Z-score = -2.94
SC-BB contacts : Average = -0.221  Z-score = -1.39
SC-SC contacts : Average = -0.701  Z-score = -4.10
============================================================
Average protein values ("Z-score for all contacts") can be read as follows:
-5.0  Guaranteed wrong structure.
      Bad structure or poor model
-3.0  Probably bad structure or unrefined model.
      Doubtful structure or model
-2.0  Structure OK or good model.
      Good structures
 0.0  Good structures.
 2.0  Good structures.
      Unusually Good structures
 4.0  Probably a strange model of a perfect helix

Bad model

T0133 structure - Fine packing quality

---Residue-----    State  AllAll   BB-BB   BB-SC   SC-BB   SC-SC
-------------------------------------------------------------------------
 18 ILE (  33 ) A    2     0.781   1.018  -0.116   0.661  -0.291
 19 SER (  34 ) A    2     1.435   1.467   0.077   2.284   0.134
…..
281 ALA ( 296 ) A    2    -2.272  -2.504  -0.404   0.000   0.000
282 GLU ( 297 ) A    2    -0.778  -1.601  -1.256   0.137   1.471
283 HIS ( 298 ) A    3    -0.836  -0.801  -0.948  -1.094   0.351
============================================================
All contacts   : Average =  0.001  Z-score = -0.04
BB-BB contacts : Average = -0.040  Z-score = -0.40
BB-SC contacts : Average =  0.139  Z-score =  0.90
SC-BB contacts : Average = -0.196  Z-score = -1.23
SC-SC contacts : Average = -0.024  Z-score =  0.02
============================================================
Average protein values ("Z-score for all contacts") can be read as follows:
-5.0  Guaranteed wrong structure.
      Bad structure or poor model
-3.0  Probably bad structure or unrefined model.
      Doubtful structure or model
-2.0  Structure OK or good model.
      Good structures
 0.0  Good structures.
 2.0  Good structures.
      Unusually Good structures
 4.0  Probably a strange model of a perfect helix

Good model
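The two "Z-score for all contacts" values (-3.05 for the model, -0.04 for the crystal structure) can be read against the legend printed with both What-If outputs. The sketch below assumes a simple rule that is not part of What-If itself: pick the legend entry whose anchor value is nearest to the observed Z-score. The function name is hypothetical.

```python
# Legend anchors copied from the What-If output shown above.
WHATIF_SCALE = [
    (-5.0, "Guaranteed wrong structure"),
    (-3.0, "Probably bad structure or unrefined model"),
    (-2.0, "Structure OK or good model"),
    (0.0, "Good structures"),
    (2.0, "Good structures"),
    (4.0, "Probably a strange model of a perfect helix"),
]

def interpret_z(z):
    """Return the legend text of the anchor closest to z (assumed rule)."""
    _, label = min(WHATIF_SCALE, key=lambda item: abs(item[0] - z))
    return label

for name, z in [("T0133 model", -3.05), ("T0133 structure", -0.04)]:
    print(f"{name}: Z = {z:+.2f} -> {interpret_z(z)}")
```

Under this reading the model falls at "Probably bad structure or unrefined model" and the crystal structure at "Good structures", matching the verdicts on the two slides.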

ProsaII (Potential of Mean Force). Likelihood of amino acid packing • Method developed by Manfred Sippl, 1993 • Works for Cα-models • For high quality protein structures, estimate nearest neighbor counts for all aa • E = -log(P(N|a)/P(N)) • Hydrophobic residues tend to have many neighbors (buried) • Hydrophilic residues tend to have fewer neighbors (exposed) • Finding a hydrophilic aa with many nearest neighbors (NN) can indicate a wrong model • Figure: Exposure potential for D (D is a charged aa) • Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.
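A sketch of how the exposure potential E = -log(P(N|a)/P(N)) could be estimated from counts. The observations are made-up toy data, the natural logarithm is an assumption, and `exposure_potential` is a hypothetical name rather than a ProsaII function.

```python
import math
from collections import Counter

def exposure_potential(observations):
    """observations: list of (amino_acid, neighbor_count) pairs from known structures.

    Returns {(amino_acid, neighbor_count): E} for every observed combination.
    """
    pair_counts = Counter(observations)
    aa_counts = Counter(a for a, _ in observations)
    n_counts = Counter(n for _, n in observations)
    total = len(observations)
    energies = {}
    for (a, n), count in pair_counts.items():
        p_n_given_a = count / aa_counts[a]   # P(N|a)
        p_n = n_counts[n] / total            # P(N)
        energies[(a, n)] = -math.log(p_n_given_a / p_n)
    return energies

# Toy data (hypothetical): L is mostly buried (12 neighbors), D mostly exposed (4).
obs = [("L", 12)] * 3 + [("L", 4)] + [("D", 12)] + [("D", 4)] * 3
E = exposure_potential(obs)
print(round(E[("L", 12)], 3))  # negative: favorable
print(round(E[("D", 12)], 3))  # positive: unfavorable
```

A buried D gets a positive energy here, which is the signal the slide describes: a hydrophilic residue with many neighbors hints at a wrong model.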

ProsaII (Potential of Mean Force). Likelihood of amino acid packing • E = -log(P(r|a,b,s)/P(r|s)) • a, b: amino acid types; s: sequence separation; r: spatial distance • If D and E are close in sequence (s=3), then they prefer to be close in space, r ~ 5.5 Å • Hydrogen bonds? • Figure: Pair potential for D, E at s=3 • Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.
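The pair potential compares the distance distribution of one residue pair at a given sequence separation with the distribution of all pairs at that separation. A minimal sketch, assuming 1 Å distance bins, natural logarithm, ordered pairs, and invented toy distances; `pair_potential` is a hypothetical name.

```python
import math
from collections import Counter

def pair_potential(samples, a, b, s, bin_width=1.0):
    """samples: iterable of (aa1, aa2, separation, distance_in_angstrom).

    Returns {bin_start: E} with E = -log(P(r|a,b,s) / P(r|s)).
    """
    specific = Counter()    # distance histogram for the pair (a, b) at separation s
    reference = Counter()   # distance histogram for all pairs at separation s
    for aa1, aa2, sep, r in samples:
        if sep != s:
            continue
        r_bin = int(r // bin_width)
        reference[r_bin] += 1
        if (aa1, aa2) == (a, b):
            specific[r_bin] += 1
    n_spec = sum(specific.values())
    n_ref = sum(reference.values())
    return {
        r_bin * bin_width: -math.log((count / n_spec) / (reference[r_bin] / n_ref))
        for r_bin, count in specific.items()
    }

# Toy data (hypothetical): D-E pairs at s=3 cluster near 5.5 Å, other pairs near 9 Å.
samples = [("D", "E", 3, r) for r in (5.5, 5.2, 5.8, 9.1)]
samples += [("A", "A", 3, r) for r in (5.5, 9.0, 9.5, 9.9)]
pot = pair_potential(samples, "D", "E", 3)
print({r: round(e, 3) for r, e in sorted(pot.items())})
```

The 5 to 6 Å bin comes out negative (preferred) and the 9 to 10 Å bin positive for the D, E pair.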

Verify 3D (Eisenberg et al. 1997) • Closely related to ProsaII exposure potential. • How well does aa fit its local environment (hydrophobic/hydrophilic) • T0133 Casp5 target • Modeled by X3M (Lund, O., 2002) • RMSD=7.3 • Red: Crystal structure • Blue: Model

Model T0133. Verify 3D Sequence has poor match to structure

ProQ. Prediction of Model accuracy • Neural network to identify correct protein models. • B. Wallner and Arne Elofsson, 2003 • http://www.sbc.su.se/~bjorn/ProQ • Input, a pdb structure/model • Output, accuracy measure • LGscore • Maxsub score

ProQ • Input to neural net • Atom-atom contacts • C, N, O • How often is C in contact with N? • Residue-residue contacts • How often is E in contact with D? • Solvent accessibility surface • Average exposure of L’s • Secondary structure prediction • How consistent is prediction with model?
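One of the inputs listed above, residue-residue contact frequency, might be computed along these lines. This is only an illustration: the 6 Å cutoff, the minimum sequence separation, the use of one point per residue, and the function name are all assumptions and do not describe ProQ's actual definitions.

```python
import math
from collections import Counter
from itertools import combinations

def contact_fractions(residues, cutoff=6.0, min_separation=2):
    """residues: list of (amino_acid, (x, y, z)) in sequence order.

    Returns the fraction of all contacts made by each unordered residue-type pair.
    """
    contacts = Counter()
    for (i, (aa_i, xyz_i)), (j, (aa_j, xyz_j)) in combinations(enumerate(residues), 2):
        if j - i < min_separation:
            continue  # skip residues adjacent in sequence
        if math.dist(xyz_i, xyz_j) < cutoff:
            contacts[tuple(sorted((aa_i, aa_j)))] += 1
    total = sum(contacts.values())
    return {pair: n / total for pair, n in contacts.items()} if total else {}

# Toy chain (hypothetical coordinates in Å).
chain = [
    ("E", (0.0, 0.0, 0.0)),
    ("A", (3.8, 0.0, 0.0)),
    ("D", (3.8, 3.8, 0.0)),
    ("L", (0.0, 3.8, 0.0)),
]
print(contact_fractions(chain))
```

Here "how often is E in contact with D?" becomes the value stored under the key ("D", "E"); a vector of such fractions is the kind of feature a neural network could take as input.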

Casp model T0113

Structure 1RIP

LiveBench data • 11000 Models • 220 targets • Modeled by Pcons • Incorrect model • LGscore < 1.5 • Maxsub < 0.1

Conclusions • Correct protein models cannot (yet!) reliably be identified!! • Many methods from the protein crystallography world are useful to identify wrong models • Bad models can however pass all filters • ProQ is a first attempt at an “accuracy prediction server” • Can integrate information from many sources • The future will show if this approach can provide reliable prediction of model accuracy
