An introduction to the concept of likelihood, its application in crystallography, and automation in the program Phaser. The slides cover how likelihood measures the consistency of a model with the data, how model parameters are optimised against it, and how it is used in molecular replacement and SAD phasing once the unknown phases have been eliminated.
Likelihood and automation in Phaser R J Read, Department of Haematology Cambridge Institute for Medical Research
Likelihood and automation in Phaser • Likelihood • background, use in crystallography • Molecular replacement • SAD phasing • log-likelihood-gradient maps • SAD phasing from partial model • bootstrapping from MR solution • iterative phasing
Concept of likelihood • Likelihood with dice • Roll 2, 3, 1, 1. Which die: 4, 6, 8 or 10 sides? • p(4) = 1/4⁴ = 1/256 • p(6) = 1/6⁴ = 1/1296 • p(8) = 1/8⁴ = 1/4096 • p(10) = 1/10⁴ = 1/10000 • 3-sided die? p(3) = 1/3⁴ = 1/81
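The dice example can be checked with a few lines of Python. This is a minimal sketch, assuming fair dice and independent rolls; the function name `dice_likelihood` is illustrative and does not come from the slides. A roll larger than the number of faces gives a likelihood of zero, which is why the four rolls rule out any die with fewer than three faces.

```python
from fractions import Fraction

def dice_likelihood(sides, rolls):
    """Probability of observing `rolls` from a fair die with `sides` faces."""
    if any(r < 1 or r > sides for r in rolls):
        return Fraction(0)  # an impossible roll rules this die out
    return Fraction(1, sides) ** len(rolls)

rolls = [2, 3, 1, 1]
# Candidate dice from the slide, including the 3-sided die it asks about.
likelihoods = {n: dice_likelihood(n, rolls) for n in (3, 4, 6, 8, 10)}
best = max(likelihoods, key=likelihoods.get)

for n, p in likelihoods.items():
    print(f"{n}-sided die: likelihood = {p}")
print("most likely die:", best)
```

Among the candidates, the die with the fewest faces that can still produce every observed roll has the highest likelihood.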
Principle of maximum likelihood • How consistent is the model with the data? • What is the probability that the data would be measured if the model were correct? • Optimise model by adjusting parameters in probability distribution • parameters include variances (sources of error)
Illustration of likelihood • Random data with Gaussian distribution (plotted on an axis from 2 to 8) • Mean? Variance? • Model parameters: mean = μ, variance = σ
Illustration of likelihood • μ=4, σ=1 • μ=6, σ=1
Illustration of likelihood • μ=5, σ=0.5 • μ=5, σ=2
Illustration of likelihood • μ=5, σ=1
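The comparison in these illustrations can be reproduced numerically. The sketch below scores the five (mean, spread) pairs from the slides against a small sample. Two assumptions are made here: the sample values are hypothetical (the data plotted on the slides are not available), and the spread parameter is treated as a standard deviation.

```python
import math

def gaussian_log_likelihood(data, mean, sigma):
    """Sum of log N(x; mean, sigma^2) over all observations."""
    n = len(data)
    sum_sq = sum((x - mean) ** 2 for x in data)
    return (-n * math.log(sigma)
            - 0.5 * n * math.log(2 * math.pi)
            - sum_sq / (2 * sigma ** 2))

# Hypothetical sample, NOT the data plotted on the slides.
data = [3.4, 4.1, 4.6, 5.0, 5.2, 5.7, 6.1, 6.5]

# (mean, spread) pairs taken from the slides; spread assumed to be a standard deviation.
trials = [(4.0, 1.0), (6.0, 1.0), (5.0, 0.5), (5.0, 2.0), (5.0, 1.0)]
scores = {t: gaussian_log_likelihood(data, *t) for t in trials}
best = max(scores, key=scores.get)

for (mean, sigma), ll in scores.items():
    print(f"mean={mean}, sigma={sigma}: log-likelihood = {ll:.3f}")
print("best of the five trials:", best)
```

A wrong mean is penalised through the squared deviations, a spread that is too narrow through the same term divided by a small variance, and a spread that is too wide through the normalisation term.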
Least squares and likelihood • Most experiments have multiple sources of error: Gaussian error in observations • Central Limit Theorem • Likelihood for Gaussians = least squares
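The last bullet can be made explicit with a short derivation. The notation is introduced here for illustration and is not from the slides: observations x_i, model predictions x_i^calc(θ) depending on parameters θ, and Gaussian errors with known standard deviations σ_i.

```latex
\begin{align}
L(\theta) &= \prod_i \frac{1}{\sigma_i\sqrt{2\pi}}
  \exp\!\left[-\frac{\bigl(x_i - x_i^{\mathrm{calc}}(\theta)\bigr)^2}{2\sigma_i^2}\right] \\
-\ln L(\theta) &= \sum_i \frac{\bigl(x_i - x_i^{\mathrm{calc}}(\theta)\bigr)^2}{2\sigma_i^2}
  + \sum_i \ln\!\bigl(\sigma_i\sqrt{2\pi}\bigr)
\end{align}
```

When the σ_i are fixed, the second sum does not depend on θ, so maximising the likelihood is the same as minimising the weighted sum of squared residuals. The equivalence no longer holds once the variances are themselves parameters, or once the error distribution is not Gaussian.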
Why not least squares in crystallography? • Gaussian error for observations • Error in predicting observation generally includes difference between structure factors • this is Gaussian in phased difference • e.g. F vs. FC from model, FP vs. FPH • Phased error usually dominates • eliminating the unknown phase changes the probability distribution
Applying likelihood to crystallography • Find probability distribution for observations • start from structure factor probabilities • eliminate unknown phase angles • Adjust parameters to optimise likelihood Applications: • calculating model phase probabilities • structure refinement • experimental phasing (isomorphous/anomalous) • likelihood-based molecular replacement
The Central Limit Theorem • Probability distribution of a sum of independent random variables tends to be Gaussian • regardless of distributions of variables in sum • Conditions: • sufficient number of independent random variables • none may dominate the distribution • Centroid (mean) of Gaussian is sum of centroids • Variance of Gaussian is sum of variances
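The last two bullets can be checked by simulation. This is a minimal sketch, assuming a sum of twelve independent variables drawn from two non-Gaussian distributions (uniform and exponential). The choice of distributions, the sample size and the seed are arbitrary illustrations.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def one_sum():
    """Sum of 12 independent variables from two non-Gaussian distributions."""
    uniforms = sum(rng.random() for _ in range(6))              # each: mean 1/2, variance 1/12
    exponentials = sum(rng.expovariate(2.0) for _ in range(6))  # each: mean 1/2, variance 1/4
    return uniforms + exponentials

# Centroid of the sum is the sum of centroids; variance is the sum of variances.
expected_mean = 6 * 0.5 + 6 * 0.5
expected_variance = 6 / 12 + 6 / 4

sums = [one_sum() for _ in range(20000)]
sample_mean = statistics.fmean(sums)
sample_variance = statistics.pvariance(sums)

print(f"mean:     expected {expected_mean:.3f}, simulated {sample_mean:.3f}")
print(f"variance: expected {expected_variance:.3f}, simulated {sample_variance:.3f}")
```

With only twelve terms the distribution of the sum is still somewhat skewed by the exponential terms, which illustrates the condition that a sufficient number of variables is needed.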
Effect of atomic errors • Atomic errors give “boomerang” distribution of possible atomic contributions • Portion of atomic contribution is correct • [Figure labels: Bragg plane]
Structure factor with coordinate errors • Same direction as the sum of the atomic f • but shorter by a factor D, 0 < D < 1 • D = f(resolution) • Central Limit Theorem • Gaussian distribution for the total summed F • σΔ = f(resolution) • [Figure labels: FC, D·FC, σΔ, F]
Amplitude probability distribution • Integrate over unknown phase angle to get Rice (Luzzati, Sim, Srinivasan) distribution
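The phase integration can be verified numerically. The sketch below assumes the acentric case, with the true structure factor distributed as a two-dimensional Gaussian centred on D·FC and with total variance `sigma_sq` (the expected squared distance from D·FC). That convention, the function names and the test values are assumptions made for illustration. Integrating over the phase should reproduce the closed-form Rice distribution, which involves the modified Bessel function I0.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function I0 from its power series sum_k (x/2)^(2k) / (k!)^2."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2) ** 2 / (k + 1) ** 2
    return total

def rice_pdf(f_obs, d_fc, sigma_sq):
    """Closed-form amplitude distribution (acentric case)."""
    return (2 * f_obs / sigma_sq
            * math.exp(-(f_obs ** 2 + d_fc ** 2) / sigma_sq)
            * bessel_i0(2 * f_obs * d_fc / sigma_sq))

def phase_integrated_pdf(f_obs, d_fc, sigma_sq, steps=720):
    """Integrate the complex Gaussian over the unknown phase angle numerically."""
    d_alpha = 2 * math.pi / steps
    total = 0.0
    for j in range(steps):
        alpha = j * d_alpha
        # Squared distance between F = f_obs*exp(i*alpha) and the centre D*FC (cosine rule).
        dist_sq = f_obs ** 2 + d_fc ** 2 - 2 * f_obs * d_fc * math.cos(alpha)
        total += math.exp(-dist_sq / sigma_sq)
    # f_obs is the Jacobian of the change from (real, imaginary) to (amplitude, phase).
    return f_obs / (math.pi * sigma_sq) * total * d_alpha

f_obs, d_fc, sigma_sq = 1.2, 0.8, 0.5  # arbitrary illustrative values
print("closed form:      ", rice_pdf(f_obs, d_fc, sigma_sq))
print("phase integration:", phase_integrated_pdf(f_obs, d_fc, sigma_sq))
```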
Rotation likelihood function • What structure factors could be obtained from an oriented model? • add up contributions from symmetry-related molecules, but unknown relative phase
Likelihood-based molecular replacement • Molecular replacement likelihood functions • account for expected coordinate error in model • account for missing components • exploit knowledge from partial solution • More sensitive than previous methods • succeeds with more distant homologues • succeeds with more components to find
Likelihood and automation • Automated decisions require reliable scores • Likelihood provides semi-absolute score • compare different models against same data • likelihood should increase for better model • more accurate, more complete or more detailed
Programming for automation • Phaser developed in C++ • Different modes of operation • modes can call other modes • Functions exported to Python • run Phaser from Python scripts that can use functionality from other packages • e.g. AutoMR wizard in Phenix
[Flowchart: automated molecular replacement in Phaser. Steps: Anisotropy Correction (selected data, then all data) • Fast Rotation Function for the 1st model (RF peak selection criteria) • Fast Translation Function (TF peak selection criteria) • Packing (packing criteria) • Refinement and Phasing • best RF and TF solutions for the 1st model • loop over models for the 2nd and subsequent models • loop over space groups • best solutions for complete structure, best space group • output: .pdb files, .mtz files, .sol files]
A31P mutant of ROP: four helix bundle • Originally solved by 23-dimensional Monte Carlo search with four copies of poly-Ala helix • space group C2 • helix = 15% of protein • Glykos & Kokkinidis (2003) • Can be solved in minutes by Phaser
Data to 2.9 Å • Anisotropy 15.4 Å² • Number of solutions after each step, listed as Helix 1 / Helix 2 / Helix 3 / Helix 4 (* = best) • RF/TF: 24 (12*) / 307 (283*) / 6 (1*) / 32 (20*) • Pack: 3 (1*) / 68 (64*) / 6 (1*) / 22 (17*) • Refined: 3 (1*) / 24 (2*) / 6 (1*) / 8 (1*) • Output: .pdb files, .mtz files
Pushing the limits of molecular replacement • Investigate the use of smaller fragments • helices, subdomains • Extend the limits of homology (David Baker) • use ab initio models from Rosetta • (Qian et al., Nature 450: 259-264, 2007) • improve homology modeling before MR • increase convergence radius for refinement after MR • pilot project: angiotensinogen • Apply concepts to NMR structure solution (Ernest Laue)
Likelihood-based SAD phasing • Conventional SAD phasing uses a least-squares term • New SAD likelihood function developed using multivariate statistics
SAD likelihood function • Fix structure factors calculated from model • Factor joint probability into two parts • Integrate out unknown phases, α⁺ and α⁻
Intuitive understanding of SAD phasing • [Figure labels: expected value of F⁻* (H⁻*); expected difference between F⁺ and F⁻*]
Intuitive understanding of SAD phasing • [Figure labels: expected difference between F⁺ and F⁻*; expected value of F⁻* (H⁻*)] • Total likelihood is integral of the product of the two distributions under the black circle
Absolute scaling • SAD target uses real (partial structure) scattering and anomalous scattering • best results if f’’ known precisely • helps to have data on absolute scale • use BEST data from Sasha Popov • average intensities as function of resolution • get Wilson B-factor, absolute scale • have to define composition of crystal
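The idea of getting a scale factor and an overall B-factor from average intensities in resolution shells can be sketched with a classical Wilson plot. This is a stand-in for illustration, not the BEST-curve procedure named on the slide. It assumes the mean observed intensity follows k · Σf² · exp(−2B(sin θ/λ)²), with Σf² (the summed squared scattering factors of the crystal contents) treated as independent of resolution, which is a simplification. All numbers below are synthetic.

```python
import math

def wilson_fit(d_spacings, mean_intensities, sum_f_sq):
    """Straight-line fit of ln(<I> / sum f^2) against (sin(theta)/lambda)^2.

    Returns (scale k, Wilson B-factor in A^2).
    """
    xs = [1.0 / (4.0 * d * d) for d in d_spacings]  # (sin theta / lambda)^2 = 1 / (4 d^2)
    ys = [math.log(i / sum_f_sq) for i in mean_intensities]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope / 2.0  # slope = -2B, intercept = ln k

# Synthetic, noise-free shells generated from known values (hypothetical numbers).
true_k, true_b, sum_f_sq = 0.02, 30.0, 1.0e5
shells = [4.0, 3.5, 3.2, 3.0, 2.9]  # shell d-spacings in Angstrom
intensities = [true_k * sum_f_sq * math.exp(-2 * true_b / (4 * d * d)) for d in shells]

k, b = wilson_fit(shells, intensities, sum_f_sq)
print(f"scale k = {k:.4f}, Wilson B = {b:.2f} A^2")
```

The need for Σf² in the fit is why the composition of the crystal has to be defined before the data can be placed on an absolute scale.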
Breakdown of Friedel’s law • Friedel’s law breaks down for mixture of scatterers differing in real:anomalous ratio • SAD target can distinguish hand for model with mixture of scattering types
SAD log-likelihood gradient (LLG) map • Compute derivative of log-likelihood with respect to heavy atom structure factor • opposite phase shifts for plus and minus hands • Fourier transform gives map of where likelihood target would like to see changes in anomalous scatterer model • Very sensitive to minor sites • picks up sites identified as water molecules in refined structures determined by halide soaks
Locating anomalous scatterers in model solved by MR • Structure of thyroxine-binding globulin • Thyroxine doesn’t bind in accepted site • only 2.8 Å resolution, but thyroxine contains 4 iodine atoms • data collected at Daresbury SRS with λ = 0.979 Å • f’’ ≈ 3e • Compare conventional model-phased anomalous difference map with Phaser LLG map
[Figure: mol 1 and mol 2, comparing the anomalous difference (Dano) map at 3.5σ with the LLG map at 5.5σ]
Iterative model-building with SAD • Nitrate reductase structure • integral membrane protein, 1976 residues • contains 21 Fe atoms, 1 Mo, 113 S • solved by Natalie Strynadka, using combination of Fe-MAD, MIRAS • Fe peak SAD data • find 11 “Fe” sites with phenix.hyss • several are super-sites of Fe4S4 clusters • phase and complete adding Fe with Phaser • total of 38 sites, some of which are S atoms • still ghosts of super-sites
Round 1 of iterative model-building • Improve phases by density modification • Build with ARP/wARP (Resolve also works…) • 798 residues, 18 docked in sequence • LLG completion in Phaser, using partial polyAla model • Fe sites are now perfectly resolved
Convergence of iterative model-building • LLG maps are better than random at identifying atom type • resolve any ambiguities by refined occupancy • Converges after 5 cycles • anomalous scatterer model from Phaser has 21 Fe, 1 Mo, 84 of 113 S • 1392/1976 residues, 731 docked in sequence • Could do better by preserving anomalous scatterers in refined models, refining against SAD likelihood target
Automation of SAD phasing • Functions are all available from Python • part of AutoSol wizard in Phenix • could run directly from HySS • will be a refinement target for phenix.refine • can run from HAPPy (CCP4) • Log-likelihood-gradient completion • look for one or several types of scatterer • start from MR model or partial substructure • analyse map to add sites, make atoms anisotropic • delete atoms that fade away • repeat to convergence
Future plans for experimental phasing • Account for translational NCS • Bi-wavelength anomalous diffraction (BAD) phasing • MIRAS • Account for radiation damage
Contributors • Molecular replacement • Airlie McCoy, Laurent Storoni • SAD phasing • Raj Pannu, Airlie McCoy, Laurent Storoni • BEST data • Sasha Popov • ccp4i GUI • Anne Baker, Peter Briggs • PHENIX collaboration • Ralf Grosse-Kunstleve, Nigel Moriarty, Paul Adams • Tom Terwilliger (Wizards)
Sponsors • Wellcome Trust • crystallographic theory and methods • structures of proteins relevant to pathogenesis • NIH • PHENIX package for automated crystallography • Paul Adams, Tom Terwilliger, David & Jane Richardson • implementation of likelihood-based methods • CCP4 • GUI development for Beast and Phaser