V25 – protein docking, FFT

V25 – protein docking, FFT. Fast Fourier Transform. Matching densities. Intuitively, we want to compute the overlap of the two densities after placing the two lattices on top of each other. But what does 'on top of each other' mean in mathematical terms?

Presentation Transcript


  1. V25 – protein docking, FFT. Fast Fourier Transform. Bioinformatics III

  2. Matching densities. Intuitively, we want to compute the overlap of the two densities after placing the two lattices on top of each other. But what does 'on top of each other' mean in mathematical terms? The two lattices can be oriented relative to each other with 6 degrees of freedom: 3 for translation along x, y, and z, and 3 for rotation by the angles $\varphi$, $\theta$, and $\psi$. Among all these possibilities, one wishes to identify the relative orientation $(x, y, z, \varphi, \theta, \psi)$ that minimizes the sum of squared differences between the two densities, $\sum_{\text{grid}} \left( A - T_{x,y,z} R_{\varphi,\theta,\psi} B \right)^2$. Here, $R_{\varphi,\theta,\psi}$ is a three-dimensional rotation matrix and $T_{x,y,z}$ is a translation operator that translates molecule B to the position $(x, y, z)$. Minimizing the sum of squared errors is equivalent to maximizing the linear cross-correlation of A and B for a given translation vector $(x, y, z)$ and rotation $(\varphi, \theta, \psi)$.
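
The equivalence stated on this slide can be checked by expanding the square. The following is a sketch only: the grid indices $l, m, n$ and the shorthand $B'$ for the rotated and translated copy of B are notational assumptions, not taken from the slide.

```latex
\begin{align}
E(x,y,z,\varphi,\theta,\psi)
  &= \sum_{l,m,n} \bigl( A_{l,m,n} - B'_{l,m,n} \bigr)^2,
  \qquad B' = T_{x,y,z}\, R_{\varphi,\theta,\psi}\, B \\
  &= \sum_{l,m,n} A_{l,m,n}^2
   + \sum_{l,m,n} B'^{\,2}_{l,m,n}
   - 2 \sum_{l,m,n} A_{l,m,n}\, B'_{l,m,n}
\end{align}
```

The first sum does not depend on the pose at all. The second is approximately constant as long as B stays inside the grid, because a rigid motion only moves density around (up to interpolation error). What remains is $-2$ times the cross-correlation $C = \sum_{l,m,n} A_{l,m,n} B'_{l,m,n}$, so the smallest $E$ is found where $C$ is largest.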

  3. What is the complexity of computing this correlation? Task: compute the linear cross-correlation of A and B for a given translation vector $(x, y, z)$ and rotation $(\varphi, \theta, \psi)$. Let $R$ be the set of all possible rotations of molecule B. Certain combinations of rotations around the 3 Euler angles lead to the same final orientation; these can be omitted, which gives a minimal set of rotations $R'$. For a grid with $N$ points along each axis, computing each value $C_{x,y,z}$ takes $O(N^3)$ operations, leading to a total time of $O(N^6)$ for the translational component of $C$ over all grid points and $O(|R'| \, N^6)$ for the complete algorithm.
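
To make the $O(N^6)$ count concrete, here is a minimal sketch of the direct evaluation for one fixed rotation. It assumes cubic grids and cyclic wrap-around of the shifted grid (a simplification; the slide speaks of the linear correlation), and the function name `direct_correlation` is made up for this illustration.

```python
import numpy as np

def direct_correlation(a, b):
    """Cyclic cross-correlation C[x,y,z] = sum_{l,m,n} a[l,m,n] * b[l+x,m+y,n+z].

    There are N^3 shifts and each one costs an O(N^3) sum: O(N^6) in total.
    Assumption: a and b are cubic grids of the same size.
    """
    n = a.shape[0]
    c = np.empty((n, n, n))
    for x in range(n):
        for y in range(n):
            for z in range(n):
                # np.roll with a negative shift gives shifted[l,m,n] = b[l+x,m+y,n+z]
                shifted = np.roll(b, shift=(-x, -y, -z), axis=(0, 1, 2))
                c[x, y, z] = np.sum(a * shifted)
    return c

rng = np.random.default_rng(0)
N = 6
a = rng.random((N, N, N))
b = rng.random((N, N, N))
c = direct_correlation(a, b)

# Number of multiplications: N^3 shifts times N^3 grid points per shift.
multiplications = N**3 * N**3
print(c.shape, multiplications)
```

Even for this toy grid with $N = 6$ the count is already 46656 multiplications per rotation, which is why the direct approach does not scale to realistic grid sizes.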

  4. Fast Fourier Transform. Discrete Fourier transform of a function from a finite number of its sampled points. Suppose that we have $N$ consecutive sampled values $h_k \equiv h(t_k)$, $t_k \equiv k\Delta$, $k = 0, 1, \dots, N-1$, so that the sampling interval is $\Delta$. Let us assume that $N$ is even. The discrete Fourier transform of the $N$ points $h_k$ is $H_n \equiv \sum_{k=0}^{N-1} h_k \, e^{2\pi i k n / N}$. The formula for the discrete inverse Fourier transform, which recovers the set of $h_k$'s exactly from the $H_n$'s, is $h_k = \frac{1}{N} \sum_{n=0}^{N-1} H_n \, e^{-2\pi i k n / N}$. (After: Numerical Recipes.)

  5. Fast Fourier Transform. How much computation is involved in computing the discrete Fourier transform of $N$ points? Until the mid-1960s, the standard answer was this: define $W$ as the complex number $W \equiv e^{2\pi i / N}$. Then we can write $H_n = \sum_{k=0}^{N-1} W^{nk} h_k$. The vector of $h_k$'s is multiplied by a matrix whose $(n,k)$th element is the constant $W$ to the power $n \times k$. The matrix multiplication produces a vector whose components are the $H_n$'s. This matrix multiplication requires $N^2$ complex multiplications, plus a smaller number of operations to generate the required powers of $W$. So the discrete Fourier transform appears to be an $O(N^2)$ process.
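
The matrix view can be written down directly. This is a minimal sketch, assuming the sign convention of Numerical Recipes (positive exponent in the forward transform); NumPy's `np.fft.fft` uses the opposite sign, so the comparison below is against `N * np.fft.ifft`. The function names are made up for the illustration.

```python
import numpy as np

def dft_matrix(n):
    """N x N matrix whose (n, k) element is W**(n*k), with W = exp(2*pi*i/N)."""
    idx = np.arange(n)
    return np.exp(2j * np.pi * np.outer(idx, idx) / n)

def dft(h):
    """Forward transform H_n = sum_k W**(n*k) * h_k: N^2 complex multiplications."""
    h = np.asarray(h, dtype=complex)
    return dft_matrix(len(h)) @ h

def inverse_dft(H):
    """Inverse transform h_k = (1/N) * sum_n W**(-n*k) * H_n."""
    H = np.asarray(H, dtype=complex)
    n = len(H)
    # W**(-n*k) is the complex conjugate of W**(n*k).
    return np.conj(dft_matrix(n)) @ H / n

h = np.array([1.0, 2.0, 3.0, 4.0])
H = dft(h)
print(np.allclose(inverse_dft(H), h))             # round trip recovers h
print(np.allclose(H, len(h) * np.fft.ifft(h)))    # same values as NumPy, other sign convention
```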

  6. Fast Fourier Transform. However, the discrete Fourier transform (in 1 dimension) can be computed in $O(N \log_2 N)$ operations by an algorithm called the Fast Fourier Transform. With $N = 10^6$, the difference between $O(N \log_2 N)$ and $O(N^2)$ is roughly 30 CPU seconds against 2 CPU weeks! The FFT algorithm became generally known in the mid-1960s from the work of J.W. Cooley and J.W. Tukey. In fact, efficient methods to compute discrete Fourier transforms had been independently discovered many times, starting with Gauss in 1805.
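
The order of magnitude of these two numbers can be reproduced with a short calculation. The cost of one microsecond per operation is an assumption made for this sketch, and constant factors are ignored, so only the rough scale should be compared with the slide.

```python
import math

N = 10**6
SECONDS_PER_OP = 1e-6  # assumption: one operation per microsecond

direct_ops = N**2              # matrix multiplication: N^2 operations
fft_ops = N * math.log2(N)     # FFT: N * log2(N) operations

direct_days = direct_ops * SECONDS_PER_OP / 86400
fft_seconds = fft_ops * SECONDS_PER_OP
print(f"direct: {direct_days:.1f} days, FFT: {fft_seconds:.1f} s")
```

Under this assumption the direct method needs about 11.6 days and the FFT about 20 seconds, the same scale as the "2 CPU weeks" and "30 CPU seconds" quoted above.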

  7. FFT by Danielson and Lanczos (1942). Danielson and Lanczos showed that a discrete Fourier transform of length $N$ can be rewritten as the sum of two discrete Fourier transforms, each of length $N/2$. One of the two is formed from the even-numbered points of the original $N$, the other from the odd-numbered points: $F_k = \sum_{j=0}^{N-1} e^{2\pi i j k / N} f_j = \sum_{j=0}^{N/2-1} e^{2\pi i k (2j) / N} f_{2j} + \sum_{j=0}^{N/2-1} e^{2\pi i k (2j+1) / N} f_{2j+1} = F_k^e + W^k F_k^o$. $W$ is the same constant as before. $F_k^e$: $k$-th component of the Fourier transform of length $N/2$ formed from the even components of the original $f_j$'s. $F_k^o$: $k$-th component of the Fourier transform of length $N/2$ formed from the odd components of the original $f_j$'s.

  8. FFT by Danielson and Lanczos (1942). The wonderful property of the Danielson-Lanczos lemma is that it can be used recursively. Having reduced the problem of computing $F_k$ to that of computing $F_k^e$ and $F_k^o$, we can do the same reduction of $F_k^e$ to the problem of computing the transform of its $N/4$ even-numbered input data and $N/4$ odd-numbered data. We can continue applying the DL lemma until we have subdivided the data all the way down to transforms of length 1 (assuming that $N$ is a power of 2). What is the Fourier transform of length one? It is just the identity operation that copies its one input number into its one output slot. For every pattern of $\log_2 N$ e's and o's, there is a one-point transform that is just one of the input numbers $f_n$: $F_k^{eoeeoeo \cdots oee} = f_n$ for some $n$.
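
The recursive use of the lemma translates almost literally into code. The sketch below assumes the input length is a power of 2 and uses the positive-exponent convention of the earlier slides; `fft_recursive` is a name chosen for this illustration. It is written for clarity, not speed.

```python
import cmath

def fft_recursive(f):
    """F_k = sum_j f_j * exp(2*pi*i*j*k/N), computed with the Danielson-Lanczos lemma."""
    n = len(f)
    if n == 1:
        # A transform of length one just copies its input.
        return [complex(f[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_recursive(f[0::2])   # F^e, length N/2
    odd = fft_recursive(f[1::2])    # F^o, length N/2
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(2j * cmath.pi * k / n) * odd[k]   # W^k * F^o_k
        result[k] = even[k] + twiddle
        # F^e and F^o are periodic in k with period N/2, and W^(k + N/2) = -W^k.
        result[k + n // 2] = even[k] - twiddle
    return result

samples = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.5, 7.0]
spectrum = fft_recursive(samples)
print(spectrum[0])   # the k = 0 component is the plain sum of the samples
```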

  9. FFT by Danielson and Lanczos (1942). The next trick is to figure out which value of $n$ corresponds to which pattern of e's and o's in $F_k^{eoeeoeo \cdots oee} = f_n$. Answer: reverse the pattern of e's and o's, then let e = 0 and o = 1, and you will have, in binary, the value of $n$. Idea: this works because the successive subdivisions of the data into even and odd are tests of successive low-order (least significant) bits of $n$. This idea of bit reversal can be exploited in a very clever way which, along with the DL lemma, makes the FFT practical: suppose we take the original vector of data $f_j$ and rearrange it into bit-reversed order, so that the individual numbers are in the order not of $j$, but of the number obtained by bit-reversing $j$.
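
A small sketch of both statements on this slide: turning a pattern of e's and o's into the index $n$, and listing the bit-reversed order for an array of length 8. The helper names are hypothetical.

```python
def index_from_pattern(pattern):
    """Map a pattern such as 'eoo' to n: reverse it, then read e as 0 and o as 1."""
    binary = pattern[::-1].replace("e", "0").replace("o", "1")
    return int(binary, 2)

def bit_reverse(j, bits):
    """Return the number obtained by reversing the lowest `bits` bits of j."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (j & 1)
        j >>= 1
    return result

# First split 'e' means the lowest bit of n is 0, the next two 'o' set bits 1 and 2.
print(index_from_pattern("eoo"))

# Bit-reversed ordering of the indices 0..7 (3 bits).
order = [bit_reverse(j, 3) for j in range(8)]
print(order)
```

For length 8 the reordering only swaps the pairs (1, 4) and (3, 6); all other positions map to themselves.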

  10. FFT by Danielson and Lanczos (1942). Figure: reordering an array (here of length 8) by bit reversal, (a) between two arrays, versus (b) in place. Once the data are in bit-reversed order, the points as given are the one-point transforms. We combine adjacent pairs to get two-point transforms, then combine adjacent pairs of pairs to get 4-point transforms, and so on until the first and second halves of the whole data set are combined into the final transform. Each combination takes of order $N$ operations, and there are $\log_2 N$ combinations, so the whole algorithm is of order $N \log_2 N$. This, then, is the structure of an FFT algorithm.
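
Putting bit reversal and the pairwise combination together gives an in-place transform. This is a sketch under the same assumptions as before (length a power of 2, positive-exponent convention); it is not the Numerical Recipes routine, and `fft_in_place` is a name made up here.

```python
import cmath

def fft_in_place(data):
    """Iterative FFT: bit-reversal reordering, then log2(N) passes of O(N) work each."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1

    # Step 1: rearrange the data into bit-reversed order, swapping in place.
    for j in range(n):
        r, tmp = 0, j
        for _ in range(bits):
            r = (r << 1) | (tmp & 1)
            tmp >>= 1
        if r > j:
            data[j], data[r] = data[r], data[j]

    # Step 2: combine transforms of length size/2 into transforms of length size.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                even = data[start + k]
                odd = w * data[start + k + half]
                data[start + k] = even + odd
                data[start + k + half] = even - odd
                w *= w_step
        size *= 2
    return data

signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.5, 7.0]
spectrum = fft_in_place(list(signal))   # pass a copy so `signal` is left untouched
print(spectrum[0])
```

The outer `while` loop runs $\log_2 N$ times and each pass touches every element once, which is where the $N \log_2 N$ count comes from.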

  11–14. Faster than FFT Shape Matching? Bioinformatics 23, 427 (2007)

  15. Prediction of Assemblies from Pairwise Docking. CombDock: the first fully automated approach for predicting hetero multimolecular assemblies based only on structural models of their protein subunits. The problem appears more difficult than the pairwise docking problem; it is NP-hard. Idea: exploit the additional geometric constraints inherent in the combinatorial problem. Input: a set of protein structural models. Unlike in a 3D puzzle, where two connected pieces of the solution match perfectly, we would like to tolerate some extent of penetration because of the flexible nature of proteins. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  16. Pairwise docking: Katchalski-Katzir algorithm; FTDOCK. Discretize proteins A and B on a grid. Every node is assigned a value. Use the FFT to compute the correlation efficiently. Output: solutions with the best surface complementarity. Gabb et al., J. Mol. Biol. (1997)
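
How the FFT replaces the explicit sum over all translations can be sketched with NumPy. The grids below are random numbers rather than real protein grids, the correlation is cyclic (no zero padding), and `fft_correlation` is a name chosen for this example, not part of FTDOCK.

```python
import numpy as np

def fft_correlation(a, b):
    """Cyclic cross-correlation C[x,y,z] = sum a[l,m,n] * b[l+x,m+y,n+z] for real grids.

    By the correlation theorem, C = IFFT( conj(FFT(a)) * FFT(b) ),
    which costs O(N^3 log N) instead of O(N^6).
    """
    fa = np.fft.fftn(a)
    fb = np.fft.fftn(b)
    return np.fft.ifftn(np.conj(fa) * fb).real

rng = np.random.default_rng(1)
N = 8
a = rng.random((N, N, N))

# Toy test case: b is a copy of a, translated by a known shift.
true_shift = (2, 5, 1)
b = np.roll(a, shift=true_shift, axis=(0, 1, 2))

c = fft_correlation(a, b)
best_shift = tuple(int(i) for i in np.unravel_index(np.argmax(c), c.shape))
print(best_shift)   # the correlation peak sits at the translation applied above
```

In a docking run this would be repeated once per rotation of molecule B, keeping the highest-scoring translations from each correlation grid.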

  17. (1) All pairs docking module. The module takes N protein structures as input and predicts their pairwise interactions. Perform pairwise docking for each of the N(N − 1)/2 pairs of proteins. Keep the K best solutions for each pair of proteins. Since pairwise docking is a difficult problem, the correct solution may only be found among the first few hundred solutions, so K should be set reasonably high. Here, K was varied from dozens to hundreds. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  18. (2) Combinatorial assembly module. Input: N subunits and N(N − 1)/2 sets of K scored transformations. These are the candidate interactions. Reduction to a spanning tree: build a weighted graph representing the input, in which each structural unit is a vertex, each transformation is an edge connecting the corresponding vertices, and the edge weight is the score of the transformation. Since the input contains K transformations for each pair of subunits, we have a complete graph with K parallel edges between each pair of vertices. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  19. (2) Combinatorial assembly module. For two subunits, each candidate complex is represented by an edge and its two vertices. In the case of N structural units, a candidate complex is represented by a spanning tree, i.e. a subgraph of the input graph that connects all vertices and has no cycles. Each spanning tree of the input graph represents a complex of all the input structural units, so the problem of finding complexes is equivalent to finding spanning trees. The number of spanning trees in a complete graph with no parallel edges is $N^{N-2}$ (Cayley's formula). Since the input graph has K parallel edges between each pair of vertices, and each of the N − 1 edges of a tree can be chosen in K ways, the number of spanning trees is $N^{N-2} K^{N-1}$. Exhaustive searches are therefore infeasible. Inbar et al., J. Mol. Biol. 349, 435 (2005)
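
A quick calculation shows how fast this count grows. The values N = 7 and N = 10 correspond to the subunit counts of the arp2/3 complex and RNA polymerase II mentioned later in the talk; K = 100 is an illustrative choice within the "dozens to hundreds" range, not a value taken from the paper.

```python
def count_spanning_trees(n_subunits, k_transformations):
    """Cayley's formula N^(N-2), times K choices for each of the N-1 tree edges."""
    return n_subunits ** (n_subunits - 2) * k_transformations ** (n_subunits - 1)

for n in (3, 7, 10):
    print(n, count_spanning_trees(n, 100))
```

With K = 100, seven subunits already give about $1.7 \times 10^{16}$ candidate trees and ten subunits give $10^{26}$.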

  20. (2) Combinatorial assembly module: algorithm. The algorithm uses 2 basic principles: (1) hierarchical construction of the spanning tree and (2) greedy selection of subtrees. Different trees share common subtrees, so trees with n vertices are generated by connecting two smaller trees (that were previously generated) with an input edge. Thus, the common parts of different trees are generated only once. When connecting subtrees, only the inter-subtree constraints are validated: one needs to check for severe penetrations only between pairs of subunits that belong to different subtrees. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  21. (2) Combinatorial assembly module: algorithm. Stage 1: the algorithm constructs trees of size 1. Each tree contains a single vertex that represents a subunit. Stage i: the tree complexes that consist of exactly i vertices (subunits) are generated by connecting two trees generated at lower stages with an input edge (transformation). Tree complexes that fulfil the penetration constraint are kept for the next stages. Because it is impractical to search all valid spanning trees, the algorithm performs a greedy selection of subtrees: for each subset of vertices, it keeps only the D best-scoring valid trees that connect them. The score of a tree is the sum of its edge weights. Inbar et al., J. Mol. Biol. 349, 435 (2005)
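
The staged construction can be sketched on an abstract graph. This is an illustration of the idea only, not the CombDock implementation: the edge format `(u, v, k, weight)`, the function name `assemble`, and the `is_valid` callback standing in for the penetration check are all assumptions of this sketch, and no 3D geometry is involved.

```python
from collections import defaultdict

def assemble(n, edges, d, is_valid=lambda tree1, tree2, edge: True):
    """Hierarchical, greedy construction of spanning trees.

    n        : number of subunits (vertices 0..n-1)
    edges    : list of (u, v, k, weight); k numbers the parallel edges of a pair
    d        : number of best-scoring trees kept per vertex subset
    is_valid : hypothetical placeholder for the inter-subtree penetration check
    Returns a list of (score, frozenset_of_edges), best first.
    """
    # Stage 1: one tree per subunit, with no edges and score 0.
    stages = {1: {frozenset([v]): [(0.0, frozenset())] for v in range(n)}}

    for size in range(2, n + 1):
        candidates = defaultdict(dict)  # vertex set -> {edge set: score}
        for small in range(1, size // 2 + 1):
            large = size - small
            for vs1, trees1 in stages[small].items():
                for vs2, trees2 in stages[large].items():
                    if vs1 & vs2:
                        continue  # subtrees must not share a subunit
                    for edge in edges:
                        u, v, _, w = edge
                        joins = (u in vs1 and v in vs2) or (u in vs2 and v in vs1)
                        if not joins:
                            continue
                        for s1, e1 in trees1:
                            for s2, e2 in trees2:
                                if is_valid((vs1, e1), (vs2, e2), edge):
                                    # Keying by edge set generates each tree only once.
                                    candidates[vs1 | vs2][e1 | e2 | {edge}] = s1 + s2 + w
        # Greedy selection: keep only the d best trees for each vertex subset.
        stages[size] = {
            vs: sorted(((s, es) for es, s in trees.items()), key=lambda t: -t[0])[:d]
            for vs, trees in candidates.items()
        }
    return stages[n].get(frozenset(range(n)), [])

# Toy input: 3 subunits, two candidate transformations for the pair (0, 1).
edges_example = [(0, 1, 0, 5.0), (0, 1, 1, 3.0), (1, 2, 0, 4.0), (0, 2, 0, 1.0)]
best = assemble(3, edges_example, d=2)
for score, tree in best:
    print(score, sorted(tree))
```

In this toy input the two best trees both use the edge between subunits 1 and 2 and differ only in which of the two parallel edges joins subunits 0 and 1.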

  22. Flowchart. www.cs.tau.ac.il/~inbaryuv/combdoc/

  23. Example. The construction of the third-best scoring solution of the arp2/3 complex (RMSD 1.2 Å). The combinatorial assembly algorithm is hierarchical: at the first stage, each complex consists of a single subunit. At the ith stage it constructs complexes that consist of i subunits by connecting complexes of smaller size using one of the input candidate transformations. The arp2/3 complex consists of seven subunits, shown at the top. This figure presents only the complexes of the different stages that are relevant to the construction of the third-best scoring solution (at the bottom of the figure). Along with each complex is its corresponding subgraph, where the vertices represent the subunits and the edges represent the pairwise interactions that were used to construct the complex. In each graph, the red edge represents the transformation of the current stage, while blue edges represent transformations of previous stages. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  24. Final scoring. The geometric score evaluates the shape complementarity between the subunits: check distances between surface points on adjacent subunits. Close surface points increase the score, penetrating surface points decrease it. The physico-chemical component of the final score counts the number of surface points that belong to non-polar atoms, which gives an estimate of the hydrophobic effect. Clustering of solutions: (1) compute contact maps between subunits: a bit array with N(N − 1) entries. If two subunits are in contact within the complex, set the corresponding bit to 1, and to 0 otherwise. (2) Superimpose complexes that have the same contact map and compute the RMSD between Cα atoms. If this distance is less than a threshold, consider the complexes as members of one cluster. For each cluster, keep only the complex with the highest score. Inbar et al., J. Mol. Biol. 349, 435 (2005)
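
The clustering step might look as follows. This is a sketch under several assumptions: each solution is a dictionary with hypothetical keys `contacts` (the contact map, stored as a set of subunit pairs), `ca` (an array of Cα coordinates) and `score`; superposition uses the standard Kabsch algorithm; and clusters are formed greedily around the best-scoring member, which is one possible reading of the slide, not necessarily what CombDock does.

```python
import numpy as np

def superposed_rmsd(p, q):
    """RMSD between two (n, 3) coordinate sets after optimal superposition (Kabsch)."""
    p0 = p - p.mean(axis=0)
    q0 = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p0.T @ q0)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = (rot @ p0.T).T - q0
    return float(np.sqrt((diff ** 2).sum() / len(p)))

def cluster_solutions(solutions, threshold):
    """Keep the best-scoring complex of each cluster.

    Two complexes fall into one cluster if they have the same contact map
    and their superposed RMSD is below `threshold`.
    """
    representatives = []
    for sol in sorted(solutions, key=lambda s: s["score"], reverse=True):
        duplicate = any(
            rep["contacts"] == sol["contacts"]
            and superposed_rmsd(rep["ca"], sol["ca"]) < threshold
            for rep in representatives
        )
        if not duplicate:
            representatives.append(sol)
    return representatives

# Toy data: a rotated and translated copy, a stretched copy, and a different contact map.
rng = np.random.default_rng(2)
coords = rng.random((10, 3)) * 10.0
angle = np.pi / 6
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
moved = coords @ rot_z.T + np.array([1.0, 2.0, 3.0])

map_a = frozenset({(0, 1), (1, 2)})
map_b = frozenset({(0, 1), (0, 2)})
solutions = [
    {"contacts": map_a, "ca": coords, "score": 10.0},
    {"contacts": map_a, "ca": moved, "score": 8.0},         # same complex, other frame
    {"contacts": map_a, "ca": coords * 3.0, "score": 7.0},  # same contacts, different shape
    {"contacts": map_b, "ca": coords, "score": 5.0},        # different contact map
]
kept = cluster_solutions(solutions, threshold=1.0)
print([s["score"] for s in kept])
```

The rigidly moved copy is merged into the cluster of the best solution, while the stretched copy and the solution with a different contact map remain as separate representatives.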

  25. Performance for known complexes. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  26. Method works with different contact topologies. The near-native solutions for two complexes with different contact topologies. Left: CombDock solution. Right: solution superposed on the crystal structure (gray thinner lines). (a) The sixth-best scoring solution for the IκBα/NF-κB complex from unbound input, RMSD 1.9 Å. The p65 subunit was extracted from a homodimer structure (PDB 1BFT). The structure used for the IκBα subunit was generated by MODELLER6 v2 using bcl-3 (PDB 1k1b) as the template structure. (b) The second-best scoring solution of the VHL/elonginC/elonginB complex (PDB 1vcb), with an RMSD of 0.5 Å. Each complex consists of three subunits but, while in the IκBα/NF-κB complex all the subunits are in contact with each other, in the VHL/elonginC/elonginB complex elonginC is the core of the complex (in yellow), and VHL (in blue) and elonginB (in red) are not in contact. The algorithm was able to predict a near-native solution for both complexes regardless of their contact topologies. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  27. Examples of large complexes. Left: CombDock solution. Right: solution superposed on the crystal structure (gray thinner lines). The solutions are: (a) the third-best scoring assembly of the seven subunits of the arp2/3 complex, RMSD 1.2 Å; (b) the best-ranked complex of the ten subunits of RNA polymerase II, RMSD 1.4 Å. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  28. Discussion of CombDock. For the five different targets, CombDock predicted at least one near-native solution and ranked it in the top ten for both bound and unbound cases. Problem in evaluating performance: full sets of "unbound" structures are not available for complexes with a higher number of subunits. It is unlikely that this version of the algorithm (using rigid protein conformations) will be able to correctly assemble such complexes if the input subunits undergo significant conformational changes. A future version should therefore include hinge-bending movements of protein subunits. Inbar et al., J. Mol. Biol. 349, 435 (2005)

  29–30. Alber et al., Nature 450, 683 (2007)
