Can protein model accuracy be identified? NO! Morten Nielsen, CBS, BioCentrum, DTU
Identification of protein model accuracy • Why is it important? • What is accuracy? • RMSD, fraction correct,… • Protein model correctness/quality • Procheck, Whatif, ProsaII, Verify3d • Prediction of protein model accuracy • ProQ server
Why is it so important? • Reliable fold recognition • P-value, E-value, Z-score… • Tells you whether you should believe in the fold! • Alignment (model construction) • No obvious method to estimate the reliability of an alignment • Number of gaps, length of gaps • Amino acids in protein core and loops • Percent sequence identity is too conservative • Many low-homology models are accurate, and some high-homology models are wrong • Correct fold, wrong alignment => terrible model • How to gain confidence in a protein model?
Model accuracy. Swiss-model: 1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)
What is protein model accuracy? • Model quality (correctness) • Does the model look like a protein? • Hydrophobic residues in core, hydrophilic on surface • Backbone geometry (phi/psi angles, bond lengths) • Amino acid environment • A "correct" (protein-like) model can be completely wrong • Structure accuracy (if we know the answer) • RMSD • Fraction of correctly modeled residues
Amino acid environment • 1,000,000 different protein sequences (Swissprot) • 10,000 different solved protein structures (PDB) • 600 different protein folds => Typical amino acid environment
Model accuracy • Figure: model in blue, structure in yellow, with the distance dij marked • Fraction correct = Nc/N • Nc = number correct (dij < 4 Å)
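The two accuracy measures named on these slides can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: it assumes the model and the structure are already superimposed, that the coordinate lists pair up equivalent atoms (for example Cα atoms) in the same order, and that N is the number of paired atoms. The function names and the toy coordinates are invented for the example.

```python
from math import dist, sqrt


def paired_distances(model, structure):
    """Distances between equivalent atoms of a superimposed model and structure."""
    if len(model) != len(structure) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    return [dist(m, s) for m, s in zip(model, structure)]


def rmsd(model, structure):
    """Root-mean-square deviation over all paired atoms (same unit as the input)."""
    d = paired_distances(model, structure)
    return sqrt(sum(x * x for x in d) / len(d))


def fraction_correct(model, structure, cutoff=4.0):
    """Nc / N, where Nc counts atoms closer than `cutoff` (4 A on the slide)."""
    d = paired_distances(model, structure)
    return sum(1 for x in d if x < cutoff) / len(d)


# Invented toy coordinates: two atoms match exactly, two are displaced.
toy_model = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
toy_structure = [(0, 0, 0), (1, 0, 0), (2, 3, 0), (3, 0, 5)]
print(fraction_correct(toy_model, toy_structure))  # 0.75
print(rmsd(toy_model, toy_structure))
```

In the toy case three of four atoms fall inside the 4 Å cutoff, so a single badly placed atom lowers the fraction correct only slightly while it dominates the RMSD.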
Evaluation of model quality • Check for proper protein stereochemistry • ProCheck (http://biotech.ebi.ac.uk:8400/cgi-bin/sendquery) • Ramachandran plot, bond lengths, … • Whatif (http://www.cmbi.kun.nl/gv/servers/WIWWWI) • Packing quality • Both are web servers • Fitness of sequence to structure • ProsaII (http://lore.came.sbg.ac.at/Services/prosa.html) • Program runs on Linux and Unix • Verify3D (http://www.doe-mbi.ucla.edu/Services/Verify_3D/) • Web server
ProCheck: peptide backbone geometry • Peptide planes • Cα-N-C-Cα • Dihedral angles ψ, φ • ψ, φ ≈ 180 degrees • β strand • ψ, φ ≈ -60 degrees • α helix • From speedy.st-and.ac.uk/.../lectures/3014/lecture/dars1.htm
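The backbone dihedral angles can be computed directly from four atom positions. The sketch below uses the standard definitions, φ from C(i-1), N(i), Cα(i), C(i) and ψ from N(i), Cα(i), C(i), N(i+1). The helper names are invented for the example, and this is a generic geometric calculation, not ProCheck's own code.

```python
from math import atan2, degrees, sqrt


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = sqrt(_dot(b1, b1))
    if norm == 0.0:
        raise ValueError("p1 and p2 must be distinct points")
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return degrees(atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# For residue i (atom coordinates as 3-tuples, names hypothetical):
#   phi = dihedral(c_prev, n, ca, c)
#   psi = dihedral(n, ca, c, n_next)
```

Plotting the resulting (φ, ψ) pair for every residue gives the Ramachandran plot that ProCheck reports.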
Ramachandran plot • Regions • B: Beta strand • A: Right-handed helix • L: Left-handed helix • Color coding • White: Disallowed • Red: Most favorable • Yellow: Allowed region • Glycines are shown as triangles
Find the wrong structure • 1PLC: Electron transport protein • 1RIP: Ribosomal protein
What-if. Fine packing Quality • Statistical description of local chemical environment in high quality protein structures • Superimpose tryptophans and find average local environment. Same for other amino acids • Full atom model G. Vriend and C. Sander, 1992
Example. CASP model T0133 • T0133 CASP5 target • Modeled by X3M (CPHModels-2.0, Lund O., 2002) • RMSD = 7.3 Å
Casp Model - Fine packing quality
BB: Backbone
SC: Sidechain

---Residue-----    State   AllAll    BB-BB    BB-SC    SC-BB    SC-SC
-------------------------------------------------------------------------
  1 ILE (  33 )      2     -0.737   -0.462    0.331   -1.312   -0.865
  2 SER (  34 )      2     -0.241    0.209   -0.021   -1.437   -1.421
…..
245 ALA ( 296 )      2     -1.919   -1.770   -1.264    0.000    0.000
246 GLU ( 297 )      3     -1.384   -0.641   -1.400    0.070   -1.132
247 HIS ( 298 )      3     -1.476   -1.211   -1.736   -0.874   -1.427
============================================================
All contacts   : Average = -0.459   Z-score = -3.05
BB-BB contacts : Average = -0.155   Z-score = -1.14
BB-SC contacts : Average = -0.445   Z-score = -2.94
SC-BB contacts : Average = -0.221   Z-score = -1.39
SC-SC contacts : Average = -0.701   Z-score = -4.10
============================================================
Average protein values ("Z-score for all contacts") can be read as follows:
-5.0   Guaranteed wrong structure.
       Bad structure or poor model
-3.0   Probably bad structure or unrefined model.
       Doubtful structure or model
-2.0   Structure OK or good model.
       Good structures
 0.0   Good structures.
 2.0   Good structures.
       Unusually Good structures
 4.0   Probably a strange model of a perfect helix

Bad model
T0133 structure - Fine packing quality

---Residue-----      State   AllAll    BB-BB    BB-SC    SC-BB    SC-SC
-------------------------------------------------------------------------
 18 ILE (  33 ) A      2      0.781    1.018   -0.116    0.661   -0.291
 19 SER (  34 ) A      2      1.435    1.467    0.077    2.284    0.134
…..
281 ALA ( 296 ) A      2     -2.272   -2.504   -0.404    0.000    0.000
282 GLU ( 297 ) A      2     -0.778   -1.601   -1.256    0.137    1.471
283 HIS ( 298 ) A      3     -0.836   -0.801   -0.948   -1.094    0.351
============================================================
All contacts   : Average =  0.001   Z-score = -0.04
BB-BB contacts : Average = -0.040   Z-score = -0.40
BB-SC contacts : Average =  0.139   Z-score =  0.90
SC-BB contacts : Average = -0.196   Z-score = -1.23
SC-SC contacts : Average = -0.024   Z-score =  0.02
============================================================
Average protein values ("Z-score for all contacts") can be read as follows:
-5.0   Guaranteed wrong structure.
       Bad structure or poor model
-3.0   Probably bad structure or unrefined model.
       Doubtful structure or model
-2.0   Structure OK or good model.
       Good structures
 0.0   Good structures.
 2.0   Good structures.
       Unusually Good structures
 4.0   Probably a strange model of a perfect helix

Good model
ProsaII (Potential of Mean Force): likelihood of amino acid packing • Method developed by Manfred Sippl, 1993 • Works for Cα models • From high-quality protein structures, estimate nearest-neighbor counts N for every amino acid type a • E = -log(P(N|a)/P(N)) • Hydrophobic residues tend to have many neighbors (buried) • Hydrophilic residues tend to have fewer neighbors (exposed) • Finding a hydrophilic amino acid with many nearest neighbors can indicate a wrong model • Figure: exposure potential for D (D is a charged amino acid) • Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.
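The exposure potential E = -log(P(N|a)/P(N)) can be estimated from simple counts. The sketch below is an illustration under stated assumptions, not ProsaII's implementation: it uses the natural logarithm (the slide does not give the base), takes one (amino acid, neighbor count) pair per residue, applies no smoothing for rare combinations, and runs on invented toy data.

```python
from collections import Counter
from math import log


def exposure_potential(observations):
    """Return {(aa, n): E} with E = -ln(P(n | aa) / P(n)).

    observations: iterable of (amino_acid, neighbor_count) pairs,
    one per residue in a set of reference structures.
    Only combinations that actually occur get an energy.
    """
    observations = list(observations)
    if not observations:
        raise ValueError("need at least one observation")
    pair_counts = Counter(observations)                 # counts of (aa, n)
    aa_counts = Counter(aa for aa, _ in observations)   # counts of aa
    n_counts = Counter(n for _, n in observations)      # counts of n
    total = len(observations)

    energies = {}
    for (aa, n), count in pair_counts.items():
        p_n_given_aa = count / aa_counts[aa]
        p_n = n_counts[n] / total
        energies[(aa, n)] = -log(p_n_given_aa / p_n)
    return energies


# Invented toy data: leucine (L) mostly buried, aspartate (D) mostly exposed.
toy = [("L", 10)] * 3 + [("L", 2)] + [("D", 2)] * 3 + [("D", 10)]
energies = exposure_potential(toy)
print(energies[("L", 10)])  # negative: buried leucine is favorable
print(energies[("D", 10)])  # positive: buried aspartate is unfavorable
```

Summing such per-residue energies along a model gives a crude score in which buried hydrophilic residues push the total up.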
ProsaII (Potential of Mean Force): likelihood of amino acid packing • E = -log(P(r|a,b,s)/P(r|s)) • a, b: amino acid types; s: sequence separation; r: spatial distance • If D and E are close in sequence (s=3), then they prefer to be close in space, r ≈ 5.5 Å • Hydrogen bonds? • Figure: pair potential for D, E at s=3 • Sippl, M.J. (1990) J. Mol. Biol. 213, 859-883.
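Written out with explicit indices, the pair potential on this slide can be read as follows. This is a sketch under the assumption that a and b are the two amino acid types, s their separation in sequence and r their distance in space, as the figure labels suggest; the base of the logarithm is not stated on the slide.

```latex
E_{ab}^{s}(r) = -\log \frac{P(r \mid a, b, s)}{P(r \mid s)}
```

The energy is negative when the pair a, b is found at distance r more often than an average residue pair with the same sequence separation, and positive when it is found there less often.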
Verify 3D (Eisenberg et al. 1997) • Closely related to the ProsaII exposure potential • How well does each amino acid fit its local environment (hydrophobic/hydrophilic)? • T0133 CASP5 target • Modeled by X3M (Lund, O., 2002) • RMSD = 7.3 Å • Figure: crystal structure in red, model in blue
Model T0133. Verify 3D • Sequence has poor match to structure
ProQ. Prediction of model accuracy • Neural network to identify correct protein models • B. Wallner and Arne Elofsson, 2003 • http://www.sbc.su.se/~bjorn/ProQ • Input: a PDB structure/model • Output: accuracy measure • LGscore • Maxsub score
ProQ • Input to neural net • Atom-atom contacts • C, N, O • How often is C in contact with N? • Residue-residue contacts • How often is E in contact with D? • Solvent accessibility surface • Average exposure of L’s • Secondary structure prediction • How consistent is prediction with model?
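A question such as "How often is E in contact with D?" can be turned into a numeric feature by counting contacts per residue-type pair. The sketch below is a loose illustration, not ProQ's actual feature definition: the single representative point per residue, the 6 Å cutoff, the function name and the toy coordinates are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import dist


def contact_fractions(residues, cutoff=6.0):
    """Fraction of all contacts contributed by each residue-type pair.

    residues: list of (residue_type, (x, y, z)) with one representative
    point per residue. Two residues are in contact when their points are
    closer than `cutoff`. Pair keys are sorted, so ("D", "E") covers both
    orders.
    """
    counts = Counter()
    for (type_a, xyz_a), (type_b, xyz_b) in combinations(residues, 2):
        if dist(xyz_a, xyz_b) < cutoff:
            counts[tuple(sorted((type_a, type_b)))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


# Invented toy model: one E-D contact and one L-L contact, far apart.
toy_residues = [
    ("E", (0.0, 0.0, 0.0)),
    ("D", (3.0, 0.0, 0.0)),
    ("L", (20.0, 0.0, 0.0)),
    ("L", (22.0, 0.0, 0.0)),
]
print(contact_fractions(toy_residues))
```

A vector of such fractions has a fixed length regardless of protein size, which is what makes it usable as input to a neural network.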
LiveBench data • 11000 models • 220 targets • Modeled by Pcons • Incorrect model • LGscore < 1.5 • Maxsub < 0.1
Conclusions • Correct protein models cannot (yet!) be reliably identified! • Many methods from the protein crystallography world are useful for identifying wrong models • Bad models can, however, pass all filters • ProQ is a first attempt at an "accuracy prediction server" • Can integrate information from many sources • The future will show whether this approach can provide reliable prediction of model accuracy