Crystallography -- Lecture 22 Refinement and Validation
Refinement
Initial model to final model. Steps after initial modeling:
(1) Rigid body refinement.
(2) Density modification.
(3) Difference maps.
(4) Least squares, protein coordinates + overall B-factor.
(5) Add waters, ions. More least squares.
(6) Least squares, protein coordinates + atomic B-factors.
(7) Least squares, multiple occupancy and anisotropic B-factors.
(8) Validation. Publication!
Rigid body refinement
(1) Rigid body refinement. Used only after molecular replacement, to get the precise orientation of the molecule relative to the crystal axes. The whole molecule is treated as a rigid group. The model may be cut into domains; if so, each domain is rigid-body refined.
Density modification
(2) Density modification. Coordinate-free refinement. The map is modified directly, then new phases are calculated. This step may be skipped for good starting models.
Cycle: initial phases -> map -> modified map -> Fc’s and new phases -> Fo’s and (new) phases -> map.
Solvent flattening: make the water part of the map flat. (1) Draw an envelope around the protein part. (2) Set solvent ρ to <ρ> and back transform.
Skeletonization: (1) Calculate map. (2) Skeletonize the map. (3) Make the skeleton “protein-like”. (4) Back transform the skeleton. Protein-like means: (a) no cycles, (b) no islands.
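The solvent-flattening step can be sketched on a toy one-dimensional "map". This is only an illustration: the density values and the protein mask are made up, and a real program would work on a 3D grid with an envelope derived from the map itself.

```python
import numpy as np

def flatten_solvent(rho, protein_mask):
    """Return a copy of rho with every solvent point set to the mean solvent density."""
    flattened = rho.copy()
    solvent = ~protein_mask
    flattened[solvent] = rho[solvent].mean()
    return flattened

# Hypothetical 1D map: protein in the middle, noisy solvent at both ends.
rho = np.array([0.1, -0.2, 0.3, 2.0, 3.0, 2.5, 0.0, 0.2])
protein_mask = np.array([False, False, False, True, True, True, False, False])

modified = flatten_solvent(rho, protein_mask)

# "Back transform": Fourier coefficients of the modified map give new phases.
new_phases = np.angle(np.fft.fft(modified))
```

The protein region is left untouched; only the points outside the envelope change.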
Difference maps
(3) Difference maps are used throughout the refinement process after a model has been built. (X) means “map calculated using amplitudes X”.
(Fo-Fc) = Difference map. Fc is calculated from the coordinates. This map shows missing or wrongly placed atoms.
(2Fo-Fc) = A “native” map (Fo) plus a difference map (Fo-Fc). This map should look like the corrected model.
Omit map = Difference map or 2Fo-Fc map calculated after removing suspicious coordinates. Removes the “phase bias” density that results from least-squares refinement using wrong coordinates.
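As a rough sketch, the two sets of map coefficients combine observed amplitudes with calculated phases. The function name and the toy numbers are hypothetical, and Fo is assumed to be already on the same scale as Fc.

```python
import numpy as np

def map_coefficients(f_obs, f_calc_complex):
    """Return (Fo-Fc) and (2Fo-Fc) coefficients, both carrying the calculated phases."""
    phase_factor = np.exp(1j * np.angle(f_calc_complex))
    f_calc = np.abs(f_calc_complex)
    difference = (f_obs - f_calc) * phase_factor
    two_fo_fc = (2.0 * f_obs - f_calc) * phase_factor
    return difference, two_fo_fc

# Two made-up reflections: observed amplitudes and complex calculated structure factors.
f_obs = np.array([10.0, 5.0])
f_calc_complex = np.array([8.0 * np.exp(1j * np.pi / 2), 5.0 + 0.0j])

difference, two_fo_fc = map_coefficients(f_obs, f_calc_complex)
```

A Fourier transform of either coefficient set would give the corresponding map.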
Omit maps
Two inhibitor peptides in two different crystals of the protease thrombin. The inhibitor coordinates were omitted from the model before calculating Fc. Then maps were made using Fo-Fc amplitudes and Fc phases. (stereo images)
Féthière et al., Protein Science (1996), 5: 1174-1183.
Least-squares refinement
(4) Least squares, protein coordinates + overall B-factor.
• The partial derivative of the R-factor with respect to each atomic position can be calculated, because we know how the amplitudes change with a change in coordinates.
• A 3D derivative is a “gradient”. Each atom is moved downhill along the gradient.
• “Restraints” may be imposed to maintain good stereochemistry.
Restraint types: bond lengths, bond angles, torsion angles, van der Waals, planar groups.
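A toy version of restrained downhill refinement is sketched below. It is not a crystallographic target: the "data" term is a stand-in that pulls two atoms toward made-up target positions, and the ideal bond length, restraint weight, and step size are all hypothetical.

```python
import numpy as np

IDEAL_BOND = 1.53   # hypothetical ideal bond length (Å)
WEIGHT = 2.0        # hypothetical restraint weight

def energy_and_gradient(x, targets):
    """x, targets: (2, 3) arrays. Returns the total residual and its gradient."""
    data_term = np.sum((x - targets) ** 2)       # stand-in for the X-ray residual
    grad = 2.0 * (x - targets)
    r = x[1] - x[0]
    length = np.linalg.norm(r)
    geom_term = WEIGHT * (length - IDEAL_BOND) ** 2
    g_bond = 2.0 * WEIGHT * (length - IDEAL_BOND) * r / length
    grad[1] += g_bond
    grad[0] -= g_bond
    return data_term + geom_term, grad

targets = np.array([[0.0, 0.0, 0.0], [1.8, 0.0, 0.0]])
x = np.array([[0.3, 0.2, 0.0], [1.5, -0.3, 0.1]])

start_energy, _ = energy_and_gradient(x, targets)
for _ in range(500):
    _, grad = energy_and_gradient(x, targets)
    x = x - 0.05 * grad                          # move each atom downhill
final_energy, _ = energy_and_gradient(x, targets)
bond = np.linalg.norm(x[1] - x[0])
```

The refined bond length ends up between the ideal value and the 1.8 Å separation of the targets, which is the compromise a restraint (as opposed to a constraint) is meant to produce.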
Stereochemical constraints
Constraints reduce the effective number of parameters.
• Bond lengths, bond angles, and planar groups may be fixed (frozen) to their ideal values during refinement.
• Using constraints, Ser has 3 parameters, Phe 4, and Arg 6.
• There are an average of 3.5 torsion angles per residue.
• Papain has ~700 torsion angle parameters.
• data/parameter ratio = 25,000/700 ≈ 36
Adding waters, ions
(5) Add waters, ions. More least squares.
Calculate a difference map. Place a water (just an oxygen) at a positive density peak if (1) there is no atom already there, (2) there is an atom nearby, and (3) the density or its shape does not suggest an ion or ligand.
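Criteria (1) and (2) can be sketched as simple distance tests. The cutoffs of 2.4 Å and 3.5 Å are assumptions chosen for illustration, not values from the lecture, and criterion (3) still needs a human (or a more careful program) to inspect the density.

```python
import numpy as np

def pick_waters(peaks, atoms, min_dist=2.4, max_dist=3.5):
    """Keep difference-map peaks that are not on an atom but have an atom nearby."""
    waters = []
    for peak in peaks:
        nearest = np.linalg.norm(atoms - peak, axis=1).min()
        if min_dist <= nearest <= max_dist:
            waters.append(peak)
    return np.array(waters)

# Made-up model atoms and difference-map peak positions (Å).
atoms = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
peaks = np.array([
    [0.5, 0.0, 0.0],   # sits on an existing atom: rejected
    [2.8, 0.0, 0.0],   # hydrogen-bonding distance from an atom: accepted
    [5.0, 5.0, 5.0],   # isolated, no atom nearby: rejected
])

waters = pick_waters(peaks, atoms)
```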
Atomic B-factor refinement
(6) Least squares, protein coordinates + atomic B-factors.
B = “temperature factor” = Gaussian d^-2-dependent scale factor.
[Gaussian equation and its Fourier transform: slide equations not recovered]
The derivative of the R-factor with respect to B can be calculated, since B affects the amplitudes. Because the high-resolution amplitudes depend on B more than the low-resolution amplitudes, high resolution (2.5 Å or better) is required to refine atomic B-factors.
Restraint: Atoms that are bonded to each other should not have large differences in B.
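The resolution dependence can be illustrated with the usual isotropic temperature-factor form, exp(-B/(4d^2)), applied to an amplitude. That form is assumed here because the slide equations did not survive in this transcript; the B value is arbitrary.

```python
import math

def b_scale(b_factor, d_spacing):
    """Amplitude scale factor exp(-B / (4 d^2)) for an isotropic B (Å^2) at spacing d (Å)."""
    return math.exp(-b_factor / (4.0 * d_spacing ** 2))

low_res = b_scale(20.0, 4.0)    # 4 Å reflection
high_res = b_scale(20.0, 2.0)   # 2 Å reflection
```

The 2 Å amplitude is damped far more strongly than the 4 Å amplitude, which is why the high-resolution data carry most of the information about B.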
Multiple Occupancy
(7) Least squares, multiple occupancy and anisotropic B-factors. Only possible with high-resolution data and a high-quality model.
Some atoms (Ser or Val sidechains) may have more than one location. Multiple alternative locations may be defined for these cases.
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C
PDB “ATOM” lines showing altloc indicators (A or B) in column 17 and occupancy in cols 55-60.
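Because the altloc letter is fused to the residue name, these records have to be read by fixed columns rather than by splitting on whitespace. A minimal sketch, assuming standard PDB column positions and using three of the lines above (trailing fields trimmed):

```python
from collections import defaultdict

PDB_LINES = [
    "ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85",
    "ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88",
    "ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41",
]

def occupancy_by_atom(lines):
    """Sum occupancies and collect altloc letters per (residue number, atom name)."""
    totals = defaultdict(float)
    altlocs = defaultdict(list)
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        name = line[12:16].strip()        # columns 13-16
        altloc = line[16].strip()         # column 17
        res_seq = int(line[22:26])        # columns 23-26
        occupancy = float(line[54:60])    # columns 55-60
        key = (res_seq, name)
        totals[key] += occupancy
        altlocs[key].append(altloc)
    return dict(totals), dict(altlocs)

totals, altlocs = occupancy_by_atom(PDB_LINES)
```

For a well-formed model the occupancies of the alternative locations of one atom should sum to 1.00, as they do here for CB (0.28 + 0.72).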
Anisotropic B-factors
(7) Least squares, multiple occupancy and anisotropic B-factors.
Atom motions are probably not isotropic. The cloud of density for each atom can be better modeled by an ellipsoidal Gaussian. (6 parameters)
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    107  N   GLY    13      12.681  37.302 -25.211 1.000 15.56           N
ANISOU  107  N   GLY    13     2406   1892   1614    198    519   -328       N
ATOM    108  CA  GLY    13      11.982  37.996 -26.241 1.000 16.92           C
ANISOU  108  CA  GLY    13     2748   2004   1679    -21    155   -419       C
ATOM    109  C   GLY    13      11.678  39.447 -26.008 1.000 15.73           C
ANISOU  109  C   GLY    13     2555   1955   1468     87    357   -109       C
ATOM    110  O   GLY    13      11.444  40.201 -26.971 1.000 20.93           O
ANISOU  110  O   GLY    13     3837   2505   1611    164   -121    189       O
ATOM    111  N   ASN    14      11.608  39.863 -24.755 1.000 13.68           N
ANISOU  111  N   ASN    14     2059   1674   1462     27    244    -96       N
PDB “ANISOU” lines follow “ATOM” or “HETATM” lines.
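The link between the six ANISOU numbers and the isotropic B on the ATOM line can be checked by hand. This sketch assumes the usual PDB convention that ANISOU values are U(ij) in units of 10^-4 Å^2 and that the equivalent isotropic B is 8π² times the mean of the three diagonal terms.

```python
import math

def b_equivalent(u11, u22, u33):
    """Equivalent isotropic B (Å^2) from ANISOU diagonal terms given in 1e-4 Å^2."""
    u_eq = (u11 + u22 + u33) / 3.0 * 1.0e-4
    return 8.0 * math.pi ** 2 * u_eq

# Diagonal terms of atom 107 (N of GLY 13) from the ANISOU record above.
b_eq = b_equivalent(2406, 1892, 1614)
```

The result agrees closely with the 15.56 listed in the B-factor column of the matching ATOM line.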
Molecular dynamics w/ X-ray refinement
MD samples conformational space while maintaining good geometry (low residual in restraints).
E = (residual of restraints) + (R-factor)
dE/dxi is calculated for each atom i, then atom i is moved downhill. Random vectors are added, proportional to temperature T.
The simulated annealing MD method: (1) start the simulation “hot”; (2) “cool” slowly, trapping the structure in the lowest minimum.
“X-PLOR”, Axel Brünger et al.
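The hot-then-cool idea can be sketched on a one-dimensional toy energy with a false minimum and a true one. Nothing here comes from X-PLOR: the energy function, cooling schedule, step size, and kick size are all hypothetical, and real refinement moves thousands of atoms in 3D.

```python
import random

def energy(x):
    """Toy energy: local minimum near x = +1, lower minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def gradient(x):
    return 4.0 * x * (x * x - 1.0) + 0.3

def anneal(x, t_start=1.0, n_steps=5000, step=0.01, seed=0):
    """Downhill moves plus random kicks proportional to a slowly falling temperature."""
    rng = random.Random(seed)
    best_x, best_e = x, energy(x)
    for i in range(n_steps):
        temperature = t_start * (1.0 - i / n_steps)    # linear cooling
        kick = 0.1 * temperature * rng.gauss(0.0, 1.0)
        x = x - step * gradient(x) + kick
        x = max(-3.0, min(3.0, x))                     # keep the toy walk bounded
        if energy(x) < best_e:
            best_x, best_e = x, energy(x)
    return best_x, best_e

start = 1.0                      # begin in the false (local) minimum
best_x, best_e = anneal(start)
```

With the temperature set to zero the walk reduces to plain gradient descent and cannot leave the local minimum it starts in; the random kicks are what give the method a chance to cross the barrier.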
Radius of convergence
= How far away from the truth can the model be, and still find the truth?
(Figure: total residual plotted against parameter space.)
The radius of convergence depends on data & method. More data = fewer false (local) minima. Better method = one that can overcome local minima.
The final model www.rcsb.org
Sources of error
• Error is broadly defined as the difference between your model and reality.
• Sources of error can be in the data (the crystal itself or the processing of the data) or in the molecular model.
• If the model is at fault, errors may be localized to certain parts of a model, or spread throughout.
Sources of error in crystal structures
Data: X-rays, Crystal, Detector. Model.
X-rays: polarization, variable flux, collimation, filtering/monochromator.
Experimental sources of error
Polarization. Solution: zonal scaling. Scale factors are calculated in evenly-sampled zones of reciprocal space.
(Figure labels: vertical graphite monochromator; horizontally polarized X-rays; weaker scatter vertically.)
Experimental sources of error
Variable flux (flux varies with time t): a problem for synchrotron X-rays. Solution: use an external flux meter. Scaling.
Collimation: a large collimator means high background, large spots, and spot overlap if cell dimensions are large. A small collimator means longer exposures.
Variable wavelength: spots may be radially smeared. Solution: use a monochromator instead of direct X-rays.
Sources of error in crystal structures
Data: X-rays, Crystal, Detector. Model.
Crystal:
mosaicity: get a better crystal
twinning: separate multiple crystals
absorption: clean and dry the crystal
decay: freeze the crystal
non-isomorphism: give up, start over
Sources of error in crystal structures
Data: X-rays, Crystal, Detector. Model.
Detector:
saturation limit: shorter exposures
machining: sue
pixel size: back up, you’re too close
Computational sources of error
Data: X-rays, Crystal, Detector. Model.
Model: data/parameter ratio, phase bias, bad geometry.
Diagnostics: A Luzzati or σA plot will estimate errors. Real-space R. Omit maps, 2Fo-Fc maps. PROCHECK.
Cross-validation: The free R-factor
The R-factor measures the residual difference between observed and calculated amplitudes. Free R is summed over a “test set”. Test set data were not used for refinement. Free R asks: “How well does your model predict the data it hasn’t been fit to?”
Note: T = independent test set of F’s.
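A minimal sketch of the two residuals, using the standard definition R = Σ| |Fo| - |Fc| | / Σ|Fo|. The amplitudes and the choice of test reflections are made up, and Fc is assumed to be already scaled to Fo.

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum(| |Fo| - |Fc| |) / sum(|Fo|) over the reflections passed in."""
    return np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs)

f_obs = np.array([100.0, 80.0, 60.0, 40.0, 20.0])
f_calc = np.array([95.0, 85.0, 58.0, 45.0, 18.0])
in_test_set = np.array([False, False, False, True, True])   # the set T

r_work = r_factor(f_obs[~in_test_set], f_calc[~in_test_set])
r_free = r_factor(f_obs[in_test_set], f_calc[in_test_set])
```

The only difference between the two numbers is which reflections enter the sums; the test reflections must be set aside before refinement starts and never used to fit the model.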
What is over-fitting?
If you have three points, you can fit them to a quadratic equation (3 parameters) with zero residual, but is it right?
(Figure: observed data with the fitted curve; calculated R-factor = 0.000!!)
Fitting unseen data, as a test
The fit is correct if additional data, not used in fitting the curve, fall on the curve. A low residual in the “test set” validates the fit.
(Figure label: residual ≠ 0)
Cross-validation
Means: measuring the residual on data (a “test set”) that were not used to refine (or fit) the model. The residual on test data is likely to be small if the data/parameter ratio is large.
(Figure label: a line has 2 parameters)
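The three-point example can be reproduced numerically. The points below are invented and lie roughly on a straight line; a fourth point is held back as the test set.

```python
import numpy as np

# Three "observed" points used for fitting, plus one held-out test point.
x_fit = np.array([0.0, 1.0, 2.0])
y_fit = np.array([0.1, 0.9, 2.2])
x_test, y_test = 4.0, 4.0

quad = np.polyfit(x_fit, y_fit, 2)   # 3 parameters for 3 points
line = np.polyfit(x_fit, y_fit, 1)   # 2 parameters for 3 points

quad_fit_residual = np.sum(np.abs(np.polyval(quad, x_fit) - y_fit))
line_fit_residual = np.sum(np.abs(np.polyval(line, x_fit) - y_fit))

quad_test_residual = abs(np.polyval(quad, x_test) - y_test)
line_test_residual = abs(np.polyval(line, x_test) - y_test)
```

The quadratic reproduces the fitted points essentially exactly, yet it misses the held-out point by far more than the line does. A zero residual on the working data says nothing by itself; the residual on the unseen point is what exposes the over-fit.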
Parameters versus Data
Example from Drenth, Ch 13: The papain crystal structure has 25,000 reflections. Papain has 2000 non-H atoms times 4 parameters each (x, y, z, B), which equals 8000 parameters.
data/parameters = 25,000/8000 ≈ 3 <-- this is too small!
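The arithmetic, side by side with the torsion-angle count of ~700 quoted on the stereochemical constraints slide, shows how much constraints improve the ratio. The variable names are just for illustration.

```python
reflections = 25_000
atoms = 2_000
params_per_atom = 4          # x, y, z, B
torsion_params = 700         # approximate torsion-angle count for papain

free_atom_ratio = reflections / (atoms * params_per_atom)
torsion_ratio = reflections / torsion_params
```

Going from free atoms to torsion angles raises the data/parameter ratio from about 3 to about 36.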
Phase error
Every reflection has a phase error, which is the difference between the calculated phase and the true (unknown) phase. The free R-factor correlates with phase error.
(Figure: free R plotted against <phase error>)
Thought experiment What is the phase error for 4Å resolution reflections if the average coordinate error is 1Å?
Coordinate error causes phase error
If the error in atomic position is 1 Å, and the Bragg plane separation is 4 Å, then the error in phase is ≤ (1/4)*360° = 90°.
If the error is a Gaussian in real space, then the phase error is also a Gaussian. (The projection of a 3D Gaussian on the normal to the Bragg planes is a 1D Gaussian.)
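The same bound can be evaluated at other resolutions. The function name is hypothetical; the formula is the (error / plane spacing) * 360° estimate from the slide.

```python
def max_phase_error_deg(coord_error, d_spacing):
    """Upper bound on phase error (degrees) for a coordinate error and Bragg spacing in Å."""
    return 360.0 * coord_error / d_spacing

at_4A = max_phase_error_deg(1.0, 4.0)
at_2A = max_phase_error_deg(1.0, 2.0)
```

The same 1 Å coordinate error that costs at most a quarter cycle at 4 Å can cost half a cycle at 2 Å, so high-resolution reflections are the most sensitive to coordinate error.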
Luzzati plot
Data are divided into shells in S (=1/d). The R-factor for each shell is calculated and plotted. The plot is matched to the theoretical R vs. S for a model with randomly distributed errors = e.
P.S. Luzzati did this in 1952, long before computers!
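The first half of the procedure, R-factor per resolution shell, can be sketched as below. The reflections are invented, the shells are simply equal-width bins in S, and the comparison against Luzzati's theoretical curves is not attempted.

```python
import numpy as np

def r_by_shell(d_spacing, f_obs, f_calc, n_shells=2):
    """Return shell edges in S = 1/d and the R-factor within each shell."""
    s = 1.0 / d_spacing
    edges = np.linspace(s.min(), s.max(), n_shells + 1)
    shell = np.clip(np.digitize(s, edges) - 1, 0, n_shells - 1)
    r_values = []
    for k in range(n_shells):
        members = shell == k
        if not members.any():
            r_values.append(float("nan"))      # empty shell
            continue
        r_values.append(np.sum(np.abs(f_obs[members] - f_calc[members]))
                        / np.sum(f_obs[members]))
    return edges, np.array(r_values)

d = np.array([10.0, 5.0, 4.0, 2.5, 2.0, 2.0])
f_obs = np.array([100.0, 90.0, 80.0, 50.0, 40.0, 30.0])
f_calc = np.array([98.0, 92.0, 79.0, 45.0, 46.0, 33.0])

edges, r_shells = r_by_shell(d, f_obs, f_calc)
```

In this toy set R rises with S, which is the general trend a Luzzati plot displays; the steepness of the rise is what gets matched to a coordinate error.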
Map evaluator: Real space R-factor
Reciprocal space R: [equation not recovered]
Real space R: electron density “residual”, summed over real space position r. [equation not recovered]
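One commonly used form of the real-space R-factor is Σ|ρobs - ρcalc| / Σ|ρobs + ρcalc|, summed over the grid points around a residue. That form is assumed here, since the slide's own equation is missing from this transcript, and the density values are made up.

```python
import numpy as np

def real_space_r(rho_obs, rho_calc):
    """Real-space residual between observed and model density on the same grid points."""
    return np.sum(np.abs(rho_obs - rho_calc)) / np.sum(np.abs(rho_obs + rho_calc))

# Hypothetical density samples around one residue.
rho_obs = np.array([1.0, 2.0, 3.0])
rho_calc = np.array([1.0, 1.0, 3.0])

rsr = real_space_r(rho_obs, rho_calc)
perfect = real_space_r(rho_obs, rho_obs)
```

Because the sum runs over a local region of the map, the value can be reported per residue, unlike the reciprocal-space R, which is a single number for the whole model.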
Real space R-factor as a diagnostic High B-factors or real-space R may indicate places where the model is locally wrong.
In-class exercise: PROCHECK
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
To run PROCHECK on MODLAB machines:
validation -f 8dfr.pdb -o 0
(-o 0 [zero] means PDB format. This is the default, so you can omit it.)
Read procheck.out using the vi editor, or jot, or the more command. It has a summary of the output files, including their names.
Use “showps” to look at .ps files:
showps xxxxx.ps
Ramachandran angle regions are:
(A,B,L) most favored (red)
(a,b,l,p) allowed (yellow)
(~a,~b,~l,~p) generously allowed (beige?)
disallowed (white)