640 likes | 710 Views
Robotics Algorithms for the Study of Protein Structure and Motion. Jean-Claude Latombe Computer Science Department Stanford University. Protein. Long sequence of amino-acids (dozens to thousands), from a dictionary of 20 distinct amino-acids. Central Dogma of Molecular Biology.
E N D
Robotics Algorithms for the Study of Protein Structure and Motion Jean-Claude Latombe Computer Science DepartmentStanford University
Protein Long sequence of amino-acids (dozens to thousands), from a dictionary of 20 distinct amino-acids
Central Dogma of Molecular Biology Physiological conditions: aqueous solution, 37°C, pH 7, atmospheric pressure
Why Proteins? • They are the workhorses of living organisms • They perform many vital functions, e.g.: • catalysis of reactions • storage of energy • transmission of signals • building blocks of muscles • They raise challenging computational issues • Large molecules (100s to several 1000s of atoms) • Made of building blocks drawn from a small “dictionary” • Unusual kinematic structure • They are associated with many critical problems • Folded structure determination • Global and local structural similarities • Prediction of folding and binding motions
peptide group side-chain group f-y Kinematic Linkage Model
Two problems • Structure determination from electron density maps • Inverse kinematics techniques [Itay Lotan, Henry van den Bedem, Ashley Deacon (Joint Center for Structural Genomics)] • Energy maintenance during Monte Carlo simulation • Collision detection techniques [Itay Lotan, Fabian Schwarzer, and Danny Halperin (Tel Aviv University)]
NMR spectrometry Structure Determination/Prediction • Experimental tools • Computational tools • Homology, threading • Molecular dynamics X-ray crystallography
Protein Data Bank Only about 10% of structures have been determined for known protein sequences Protein Structure Initiative (PSI) 1990 250 new structures 1999 2500 new structures 2000 >20,000 structures total 2004 ~30,000 structures total
Automated Model Building • Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID • 1.0Å < d < 2.3Å ~ 90% completeness • 2.3Å ≤ d < 3.0Å ~ 67% completeness (varies widely)1 1.0Å 3.0Å JCSG: 43% of data sets 2.3Å • Manually completing a model: • Labor intensive, time consuming • Existing tools are highly interactive Model completion is high-throughput bottleneck 1Badger (2003) Acta Cryst. D59
The Completion Problem • Input: • Electron-density map • Partial structure • Two anchor residues • Amino-acid sequence of missing fragment (typically 4 – 15 residues long) • Output: • Few candidate conformation(s) of fragment that • Respect the closure constraint (IK) • Maximize match with electron-density map
T IK Problem • Input: • Closed kinematic chain with n > 6 degrees of freedom • Relative positions/orientations X of end frames • Target functionT(Q) → R • Output: • Joint angles Q that • Achieve closure • OptimizeT
Related Work Biology/Crystallography • Exact IK solvers • Wedemeyer & Scheraga ’99 • Coutsias et al. ’04 • Optimization IK solvers • Fine et al. ’86 • Canutescu & Dunbrack Jr. ’03 • Ab-initio loop closure • Fiser et al. ’00 • Kolodny et al. ’03 • Database search loop closure • Jones & Thirup ’86 • Van Vlijman & Karplus ’97 • Semi-automatic tools • Jones & Kjeldgaard ’97 • Oldfield ’01 Robotics/Computer Science • Exact IK solvers • Manocha & Canny ’94 • Manocha et al. ’95 • Optimization IK solvers • Wang & Chen ’91 • Redundant manipulators • Khatib ’87 • Burdick ’89 • Motion planning for closed loops • Han & Amato ’00 • Yakey et al. ’01 • Cortes et al. ’02, ’04
Two-Stage IK Method • Candidate generations Closed fragments • Candidate refinement Optimize fit with EDM
Stage 1: Candidate Generation • Generate random conformation of fragment (only one end attached to anchor) • Close fragment (i.e., bring other end to second anchor) using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)
moving end fixed end Closure Distance Closure Distance: A.A. Canutescu and R.L. Dunbrack Jr.Cyclic coordinate descent: A robotics algorithm for protein loop closure. Prot. Sci. 12:963–972, 2003. Compute + bias toward EDM + avoid steric clashes
dq3 dq2 (q1,q2,q3) dq1 Stage 2: Candidate Refinement • Target function T (Q)measuring quality of the fit with the EDM • Minimize T while retaining closure • Closed conformations lie on a self-motion manifold of lower dimension Null space 1-D manifold
y = T(Q) Closure and Null Space • dX = J dQ, where J is the 6n Jacobian matrix (n > 6) • Null space {dQ | J dQ = 0} has dim = n – 6 • N: orthonormal basis of null space • Pseudo-inverse J+ such that JJ+ = I • dQ = J+dX + NNTy
0 NT (n-6) basis N of null space s1 s2 Gram-Schmidt orthogonalization s6 Computation of J+ and N SVD of J S66 dX U66 VT6n dQ = J+ = V S+ UT where S+=diag[1/si]
Refinement Procedure Repeat until minimum is reached: • Compute J, J+ and N at current Q • Compute T at current Q(analytical expression of T + linear-time recursive computation [Abe et al., Comput. Chem., 1984]) • Move along dQ = J+dX + NNT T until minimum is reached or closure is broken + Monte Carlo + simulated annealing protocol to deal with local minima
Monte Carlo Optimization Repeat: • Perform a random move of the fragment: • either by picking a random direction in null space • or by using an exact IK solver over 6 dofs [Coutsias et al, 2004] ( big jumps) • Minimize T(Q) • Accept move with Metropolis-criterion probability ~exp(-DT/Temp)
Tests #1: Artificial Gaps • TM1621 (234 residues) and TM0423 (376 residues), SCOP classification a/b • Complete structures (gold standard) resolved with EDM at 1.6Å resolution • Compute EDM at 2, 2.5, and 2.8Å resolution • Remove fragments and rebuild
TM1621 103 Fragments from TM1621 at 2.5Å Short Fragments: 100% < 1.0Å aaRMSD Long Fragments: 12: 96% < 1.0Å aaRMSD 15: 88% < 1.0Å aaRMSD Produced by H. van den Bedem
Comparison Across Resolutions Resolution = 2.0Å Resolution = 2.5Å Resolution = 2.8Å
Example: TM0423 PDB: 1KQ3, 376 res. 2.0Å resolution 12 residue gap Best: 0.3Å aaRMSD
Tests #2: True Gaps • Structure computed by RESOLVE • Gaps completed independently (gold standard) • Example: TM1742 (271 residues) • 2.4Å resolution; 5 gaps left by RESOLVE Produced by H. van den Bedem
TM0813 PDB: 1J5X, 342 res. 2.8Å resolution 12 residue gap GLU-83 GLY-96
TM0813 PDB: 1J5X, 342 res. 2.8Å resolution 12 residue gap Best 0.6Å aaRMSD GLU-83 GLY-96
TM1621 • Green: manually completed conformation • Cyan: conformation computed by stage 1 • Magenta: conformation computed by stage 2 • The aaRMSD improved by 2.4Å to 0.31Å
Alr1529 D72-D78 resolution: 2.0Å initial model: ARP/wARP contour: 1.0s PDB: 1VJG aaRMSD: 0.33Å
TM0542 • Top-scoring fragment in cyan • Manually completed fragment in green • Residues A259 and A260 are flipped
A B Current/Future Work • Software actively being used at the JCSG • What about multi-modal loops?
A323 Hist A316 Ser • TM0755: data at 1.8Å • 8-residue fragment crystallized in 2 conformations • Overlapping density: Difficult to interpret manually Algorithm successfully identified and built both conformations
A B Current/Future Work • Software actively being used at the JCSG • What about multi-modal loops? • Fuzziness in EDM can then be exploited • Use EDM to infer probability measure over the conformation space of the loop
Amylosucrase J. Cortés, T. Siméon, M. Renaud-Siméon, and V. Tran. J. Comp. Chemistry, 25:956-967, 2004
Energy maintenance during Monte Carlo simulation joint work with Itay Lotan, Fabian Schwarzer, and Dan Halperin11 Computer Science Department, Tel Aviv University
Monte Carlo Simulation (MCS) • Random walk through conformation space • At each attempted step: • Perturb current conformation at random • Accept step with probability: • The conformations generated by an arbitrarily long MCS are Boltzman distributed, i.e., #conformations in V ~
Monte Carlo Simulation (MCS) • Used to: • sample meaningful distributions of conformations • generate energetically plausible motion pathways • A simulation run may consist of millions of steps energy must be evaluated frequently Problem: How to maintain energy efficiently?
Energy Function • E = S bonded terms + S non-bonded terms+S solvation terms • Bonded terms- O(n) • Non-bonded terms- E.g., e.g. Van der Waals and electrostatic- Depend on distances between pairs of atoms-O(n2) Expensive to compute • Solvation terms-Mayrequire computing molecular surface
Non-Bonded Terms • Energy terms go to 0 when distance increases Cutoff distance (6 - 12Å) • vdW forces prevent atoms from bunching up Only O(n) interacting pairs[Halperin&Overmars 98] Problem: How to find interacting pairswithout enumerating all atom pairs?
dcutoff Grid Method • Subdivide 3-space into cubic cells • Compute cell that contains each atom center • Represent grid as hashtable
dcutoff Grid Method • Θ(n) time to build grid • O(1) time to find interactive pairs for each atom • Θ(n) to find all interactive pairs of atoms [Halperin&Overmars, 98] • Asymptotically optimal in worst-case
0 Number k of DOF changes 20 5 30 10 Can we do better on average? • Few DOFs are changed at each MC step simulationof 100,000 attempted steps
Can we do better on average? • Few DOFs are changed at each MC step • Proteins are long chain kinematics • Long sub-chains stay rigid at each step • Many partial energy sums remain constant Problem: How to retrieve the unchanged partial sums?
Hierarchical Collision Checking • Widely used technique in robotics/graphics to approximate distances between objects • Pre-computation of bounding-volume hierarchy • How to update this hierarchy if the objects deform
Two New Data Structures • ChainTree Fast detection of interacting atom pairs • EnergyTree Retrieval of unchanged partial energy sums
TNO TJK TAB ChainTree(Twofold Hierarchy: BVs + Transforms) joints
Updating the ChainTree Update path to root: • Recompute transforms that “shortcut” the DOF change • Recompute BVs that contain the DOF change • O(k log(n/k)) work for k changes