620 likes | 712 Views
Optimisation in Refinement. Garib N Murshudov Chemistry Department University of York. Outline. Introduction to optimisation techniques MX refinement Fast gradient calculation Information matrix for different likelihood Fast approximate Hessian and information matrix
E N D
Optimisation in Refinement Garib N Murshudov Chemistry Department University of York IUCR Computing School
Outline • Introduction to optimisation techniques • MX refinement • Fast gradient calculation • Information matrix for different likelihood • Fast approximate Hessian and information matrix • Constrained optimisation and stabilisation IUCR Computing School
Introduction Most of the problems in scientific research can be brought to problem of optimisation (maximisation or minimisation). Formulation of the problem in a general form: L(x) --> min Ci(x) = 0 i=1,k Bj(x)>0, j=1,m If m=0, k=0 then the problem is called unconstrained optimisation, otherwise it is constrained optimisation IUCR Computing School
Introduction Part 1) Unconstrained optimisation: gradients, Hessian, information matrix Part 2) Constrained optimisation and stabilisation IUCR Computing School
Crystallographic refinement The form of the function in crystallographic refinement has the form: L(p)=LX(p)+wLG(p) Where LX(p) is the -loglikelihood and LG(p) is the -log of prior probability distribution - restraints. It is only one the possible formulations It uses Bayesian statistics. Other formulation is also possible. For example: stat physical energy, constrained optimisaton IUCR Computing School
-loglikelihood -loglikelihood depends on the assumptions about experimental data, crystal contents and parameters. For example with the assumptions that all observations are independent (e.g. no twinning), there is no anomolous scatterers and no phase information available, for acentric reflections it becomes: And for centric reflections: All parameters (scale, other overall and atomic) are inside |Fc| and Note that these are loglikelihood of multiples of chi-squared distribution with degree of freedom 2 and 1 IUCR Computing School
-loglikelihood Sometimes it is easier to start with more general formulation. By changing the first and/or the second term inside the integral we can get functions for twin refinement, SAD refinement. By playing with the second central moments (variance) and overall parameters in |Fc| we can get functions for pseudo-translation, modulated crystals. For ML twin refinement refmac uses this form of the likelihood IUCR Computing School
Geometric (prior) term Geometric (prior) term depends on the amount of information about parameters (e.g. B values, xyz) we have, we are willing (e.g. NCS restraints) and we are allowed (software dependent) to use. It has the form: Note that some of these restraints (e.g. bonds, angles) may be used as constraints also IUCR Computing School
So, simple refinement can be thought of as unconstrained optimisation. There is (at least) one hidden parameter - weights on X-ray. It could be adjusted using ideas from constrained optimisation (Part 2) IUCR Computing School
Introduction: Optimisation methods Optimisation methods roughly can be divided into two groups: • Stochastic • Detrministic IUCR Computing School
Stochastic • Monte-Carlo and its variations • Genetic algorithms an its variants • Molecular dynamics IUCR Computing School
Deterministic • Steepest descent, conjugate gradient and its variants • Newton-Raphson method • Quasi-Newton methods • Tunnelling type algorithms for global optimisation (it can be used with stochastic methods also) IUCR Computing School
Steepest descent: algorithm • Initialise parameters - p0, assign k=0 • Calculate gradient - gk • Find the minimum of the function along this gradient (i.e. minimise (L(pk - *gk)) • Update the parameters pk+1= pk - *gk • If shifts are too small or the change of the function is too small then exit • Assign k=k+1, go to step 2 Unless parameters are uncorrelated (i.e. the second mixed derivatives are very small) and are comparable in values this method can be painfully slow. Even approximate minimum is not guaranteed. IUCR Computing School
Conjugate gradient: algorithm • Initialise parameters, p0, k=0 • Calculate gradients g0 • Assign s0 = -g0 • Find the minimum L(pk+sk) • Update parameters pk+1= pk+ksk • If converged exit • Calculate new gradient - gk+1 • Calculate: k= (gk+1 ,gk+1- gk)/(gk ,gk) • Find new direction sk+1= -gk+1 + ksk • Go to step 4 IUCR Computing School
Conjugate gradient: Cont Some notes: • It is the most popular technique • In some sense search is done in?? • Variation of search direction calculation is possible • If it is used for multidimensional nonlinear function minimisation then it may be necessary to restart the process after certain number of cyclels • It may not be ideal choice if parameters are wildly different values (e.g. B values and xyz) • When used for linear equation solutionwith symmetric and positive matrix then step 4 in the previous algorithms becomes calculation of k =(Hk pk + gk,sk)/(sk,Hksk) IUCR Computing School
(Quasi-)Newton 1: algorithm • Initialise parameters: p0, k=0 • Calculate gradients - gk and (approximate) second derivatives Hk. It could be diagonal • Solve the equation Hks=-gk • Find the minimum L(pk+sk) and denote pk+1= pk+sk • Check the convergence • Update “Hessian” Hk+1. Options: 1) second derivative matrix, 2) information matrix, 3) BFGS type where Hk+1depends on Hk, gradients and shifts • Go to step 3 IUCR Computing School
(Quasi-)Newton 2: algorithm • Initialise parameters: p0, k=0 • Calculate gradients - gk and (approximate) inversion of second derivatives Bk. It could be diagonal • Calculate sk =- Bk gk • Find the minimum L(pk+sk) and denote pk+1= pk+sk • If converged exit • Update “inverse Hessian” Bk+1. Options: 1) inversion of second derivative matrix, 2) inversion of information matrix, 3) BFGS type where Bk+1depends on Bk, gradients and shifts • Go to step 3 Note 1: If you are working with sparse matrix then be careful: usually Hessian (also known as interaction matrix) is more sparse than its inverse (also known as covariance matrix). It may result in unreasonable shifts. Note 2: In general conjugate gradient and this technique with diagonal initial values are similar. Note 3. Preconditioning may be difficult IUCR Computing School
Tunnelling type minimisation: algorithm • Define L0=L, k=0 • Minimise Lk. Denote minimum pk and the value by Lkv • Modify the function: Lk+1=Lk+f(p-pk). Where f is a bell shaped function (e.g. Gaussian) • Solve the equation Lk+1= Lkv • If no solution then exit • Increment k by 1 and go to step 2 IUCR Computing School
Note on local minimisers In good optimisers local minimisers form a module in a larger iterations. For example in tunnelling type optimisers, unconstrained optimisers etc. Conjugate gradient can also be used as a part of (quasi-)Newton methods. In these methods one needs to solve the linear equation and it can be solved using conjugate gradient (see below) IUCR Computing School
Newton-Raphson method Taylor expansion of the objective function f(x) around a working xk H Second-derivative matrix of f(x) s Shift vector to be applied to xk g Gradient of f(x) Hs=-g xk sk Hk-1 xk+1 sk+1 Hk+1,gk+1 sk+2 xk+2 Hk+2,gk+2 xB Macromolecules pose special problems Hk,gk IUCR Computing School
Macromolecules H in isotropic refinement has 4N4N elements Direct calculation time NelNrefl 2500 atoms 100 000 000 elements FFT methods [Agarwal, 1978] [Murshudov et al., 1997] [Tronrud, 1999] [Urzhumtsev & Lunin, 2001] time c1Nel + c2NrefllogNrefl H in anisotropic refinement has 9N9N elements 2500 atoms 506 250 000 elements The calculation and storage of H (H-1)is very expensive IUCR Computing School
Fast gradient calculation The form of the function as sum of the individual components is not needed for gradient. In case of SAD slight modification is needed Lh - component of the likelihood depends only on the reflection with Miller index h xij - j-th parameter of i-th atom i - i-th atoms IUCR Computing School
Gradients of likelihood We need to calculate the gradients of likelihood wrt structure factors. In a general form: If the second term is Gaussian then calculations reduce to IUCR Computing School
Gradients: real space fit For model building gradient has particularly simple form. In this case minimisation tries to flatten the difference map. IUCR Computing School
Gradients: least-squares In this case difference map with calculated phases is flattened IUCR Computing School
Gradients: Maximum likelihood In this case weighted difference map is flattened IUCR Computing School
Approximations to Hessians sparse matrix [Templeton, 1999] diagonal matrix steepest descent full matrix The magnitude of matrix elements decreaseswith the lenghtening of the interatomic distance IUCR Computing School
Sparse matrix in refinement • The set of the contacting atoms. Denote this set by list. These include all pairs of atoms related by restraints: bonds, angles, torsions, vdws, ncs etc • Size of each parameter (for positional 3, for B values 1, for aniso U values 6, for occupancies 1) • Design the matrix Hii are quadratic matrices. They reflect Interaction of atom with itself Hij are rectangular matrices. They reflect interaction between contacting atoms The number of the elements need to be calculated: IUCR Computing School
Preconditioner We need to solve Hs = -g When parameter values are wildly different (e.g. B values and xyz) then this equation can be numerically ill-conditioned and few parameters may dominate. To avoid this, preconditioning could be used. The equation can be solved in three steps: 1) Precondition 2) Solve 3) remove conditioning. The algorithm PTHPP-1s = -PTg H1s1 = -g1 s=Ps1 H1=PTHP g1=PTg IUCR Computing School
Algorithm • Calculate preconditioner • Calculate H1 and g1 • Solve the linear equation (e.g. using conjugate gradient) • Remove conditioner IUCR Computing School
Preconditioner for refinement Inversion and square root should be understood as pseudo-inversion and that for matrices. If any singularities due to a single atom then they can be removed here. The resulting matrix will have unit (block)diagonal terms IUCR Computing School
Fisher’s method of scoring (scoring method) The objective function f(x) is the likelihood L = -log(o;p) Hs=-g Is=-g Score vector Observed information matrix [Fisher, 1925] Fisher’s information matrix Positive semidefinite IUCR Computing School
Example of information Assume that the distribution is von Mise. I.e the distribution of observation (o) given model (p) is P(o;p)=exp(Xcos(o-p))/I0(X) Where Ik(X) is k-th order modified Bessel function of the second kind Then the -loglikeihood function L(o;m)=-Xcos(o-p)-log(K) Gradient: g(m)=Xsin(o-p) Observed information H=Xcos(o-p) - could be negative as well as positive (from -X to X) Expected Information I=X I1(X)/I0(X) It is never negative. It can be 0 for uniform distribution only, i.e. X=0 IUCR Computing School
Example: Cont If parameters are too far from optimal (e.g. around p=o+) then method using the second derivative (observed information) would not go downhill. If expected information is used then shifts are always downhill. Near to the maximum, observed information is X and expected information is m X where m = I1(X)/I0(X) < 1. So shifts calculated using observed information is larger than shifts calculated using expected information therefore oscillation around the maximum is possible. IUCR Computing School
Information matrix and observations Information matrix does not depend on the actual values of observations. It depends on their conditional distribution only. This fact could be used for planning experiment. Note that in case of least-squares (I.e. gaussian distribution of obervations given model parameters) the normal matrix is expected as well as observed information matrix IUCR Computing School
Information matrix and change of variables If variables are changed to new set of variables there is a simple relation between old and new information matrix As a result information matrix can be calculated for “convenient” variables and then converted to desired ones. For example in crystallography it can be calculated for calculated structure factors first. IUCR Computing School
Fisher’s information in crystallography Lh = L-h Fh = F-h* (no anomalous scatterers) I0 I1 Likelihood weighting factor Wh I2 IUCR Computing School
Approximate informaion matrix IUCR Computing School
Information matrix Let us consider the information matrix: Only component dependent on observation is Ws Different type of refinement uses different likelihood function and therefore different Ws. Let us consider some of them IUCR Computing School
Information: real-space fit In this case differences between “observed” and calculated electron density is minimised There is no dependence on observations. So all components of information matrix can be tabulated only once. IUCR Computing School
Information: least-squares In this case the distribution of observation given calculated structure factors is Gaussian: Again there is no dependence on observations. This formulations can be modified by addition of weights for each observation. Then dependence of information matrix will be on these weights only. For the most cases modified (or iteratively weighted) least-squares approximates more sophisticated maximum likelihood very well. If weights depend on the cycle of refinement then calculation can be done at each cycle, otherwise only once. IUCR Computing School
Information: least-squares Note that expected information removes potentially dangerous term dependent on |Fc|-|Fo| and makes the matrix positive IUCR Computing School
Information: Maximum likelihood The simplest likelihood function has a form (acentric only): Corresponding term for information matrix: These integrals or corresponding form using first derivatives can be calculated using Gaussian integration IUCR Computing School
Information: maximum likelihood Alternative (simpler?) way of calculations (refmac uses this form): Again Gaussian integration can be used. Note the the result of this integral is always non-negative. Refmac uses Gaussian integration designed for this case. IUCR Computing School
Integral approximation of I [Agarwal, 1978] [Dodson, 1981] [Templeton, 1999] IUCR Computing School
Analytical I versus integral I IUCR Computing School
Block diagonal version Because of sharp decrease of the components of the information matrix vs the distance between atoms and there are almost no atoms closer than 1.2A, using only diagonal terms of the information matrix works very well. Their calculation is extremely fast. Much faster than gradient calculations. Calculation of the sparse information matrix with pretabulation is also extremely fast IUCR Computing School
Fast evaluation of I Two-step procedure Tabulation step – a limited set of integralsare tabulated in a convenient coordinate system Rotation step – the matrix element in the crystal system is calculated from the tabulated values using a rotation matrix IUCR Computing School
Summary • Gradients can be calculated very fast using FFT • Use of Fisher information matrix improves behaviour of optimisation • The integral approximation allows the calculation of a sparse Fisher’s information in ‘no time’ • The scoring method using sparse fast-evaluated Fisher’s information has been implemented in the program REFMAC5 • The method works satisfactorily IUCR Computing School
Part 2Constrained optimisation and stabilisation IUCR Computing School