
Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations


Presentation Transcript


  1. Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations Fabian Schwarzer, Itay Lotan

  2. Motivation • SRS (Stochastic Roadmap Simulation): sample conformations, then create edges between “neighboring” conformations • Ab-initio structure prediction: generate a large decoy set, then cluster based on similarity When the number of conformations is large, finding neighboring (similar) conformations is costly

  3. Similarity Measures • Given the backbone Cα atom positions of two conformations – how similar are they? • Hard to define when comparing two different proteins • Straightforward when comparing two conformations of the same protein.

  4. Similarity Measures • We are interested in comparing conformations of the same protein • Hence there is a trivial correspondence between the two point sets • The two most common measures are: • cRMS deviation • dRMS deviation

  5. cRMS cRMS(P, Q) = min_T sqrt( (1/n) Σᵢ ||pᵢ − T(qᵢ)||² ), where T is the rigid-body transform that optimally aligns P and Q • cRMS is a metric, but the space is not Euclidean • There is a closed-form solution for T • Complexity is linear in the number of points (plus a 4×4 eigenvector computation)
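A minimal numpy sketch of cRMS. The slide's closed form for T is a 4×4 quaternion eigenproblem; this sketch uses the equivalent SVD-based (Kabsch) solution, which is an implementation choice here, not the authors' code:

```python
import numpy as np

def crms(P, Q):
    """cRMS between two (n, 3) coordinate arrays of the same protein."""
    # Center both point sets so the optimal T reduces to a pure rotation.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # SVD of the 3x3 covariance matrix gives the optimal rotation (Kabsch).
    U, S, Vt = np.linalg.svd(Q.T @ P)
    # Guard against a reflection (determinant -1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # RMS of the residuals after rotating Q onto P.
    return np.sqrt(np.mean(np.sum((P - Q @ R.T) ** 2, axis=1)))
```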

  6. dRMS dRMS(P, Q) = sqrt( (2 / (n(n−1))) Σ_{i&lt;j} (D_P(i,j) − D_Q(i,j))² ), where D is the internal-distance matrix, D(i,j) = ||pᵢ − pⱼ|| • A metric over a Euclidean space • Complexity is quadratic in the number of points (size of protein)
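A corresponding sketch of dRMS (the normalization over the upper triangle of D matches the formula above; other constants appear in the literature):

```python
import numpy as np
from scipy.spatial.distance import pdist

def drms(P, Q):
    """dRMS between two (n, 3) coordinate arrays of the same protein."""
    # pdist returns the condensed upper triangle of the internal-distance
    # matrix D: the n(n-1)/2 pairwise distances, hence the quadratic cost.
    return np.sqrt(np.mean((pdist(P) - pdist(Q)) ** 2))
```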

  7. k Nearest Neighbors • Find the k nearest neighbors of every conformation in the set • Currently the fastest algorithm in practice for high dimensionality is brute force (see the sketch below): for each conformation q in the set, compute the distance to all other conformations and keep the k nearest • Complexity is O(n² log k)
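A sketch of the brute-force search. The slide's O(n² log k) corresponds to maintaining a size-k heap per query; this version uses numpy's argpartition instead (O(n² + nk log k) overall), which is a substitution on my part:

```python
import numpy as np

def knn_brute_force(X, k):
    """All-pairs k-NN over an (n, d) array of vectorized conformations.
    Returns an (n, k) array of neighbor indices, nearest first."""
    n = X.shape[0]
    nbrs = np.empty((n, k), dtype=int)
    for i in range(n):
        d2 = np.sum((X - X[i]) ** 2, axis=1)  # squared distances to everyone
        d2[i] = np.inf                        # exclude the query itself
        part = np.argpartition(d2, k)[:k]     # k smallest, unordered, O(n)
        nbrs[i] = part[np.argsort(d2[part])]  # order them, O(k log k)
    return nbrs
```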

  8. k Nearest Neighbors • The literature has a number of efficient nearest-neighbor algorithms; kd-trees are the most prevalent • We cannot use these algorithms directly: they require a Euclidean space (ruling out cRMS) and are not efficient in high dimensions (ruling out the full dRMS representation) We reduce the dimensionality of dRMS to make kd-trees applicable.

  9. Uniform Simplification • Cut the sequence into m equal subsequences • Average the coordinates of the Cα atoms in each subsequence • Use the averaged coordinates aᵢ when computing cRMS and dRMS (see the sketch below) [Figure: chain with averaged points a0, a1, …, am]
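The averaging step in numpy. How leftover residues are distributed when n is not divisible by m is an assumption (the slides do not specify); np.array_split simply makes the first chunks one element longer:

```python
import numpy as np

def simplify(coords, m):
    """Average the C-alpha coordinates within m equal subsequences.
    coords is an (n, 3) array; returns the (m, 3) array a_0 .. a_{m-1}."""
    return np.vstack([c.mean(axis=0) for c in np.array_split(coords, m)])
```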

  10. Uniform Simplification - Results • There is a high correlation between the full and the averaged representations under both cRMS and dRMS: • Proteins with 60–75 AA: r > 0.95 for m > 12 • Protein with 374 AA: r > 0.95 for m > 16 Even with m = 12, the dimensionality of the internal-distance matrix used by dRMS (66, the 12·11/2 pairwise distances) is too high for a kd-tree to be used. Further reduction is needed.

  11. Proteins (length in amino acids): 4PTI (58), 1CTF (68), 1R69 (63), 1HTB (374)

  12. Further Reduction using SVD • We apply SVD to the reduced distance matrices (stacked as vectors) • We project the reduced matrices onto the most important singular vectors to further reduce the dimensionality (see the sketch below)
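A sketch of the stacking-and-projection recipe. The slides give only the outline; centering the rows before the SVD and the helper name are assumptions:

```python
import numpy as np
from scipy.spatial.distance import pdist

def reduced_vectors(conformations, m, r):
    """Stack the m(m-1)/2 averaged internal distances of each conformation
    and project onto the top r right singular vectors."""
    X = np.vstack([
        pdist(np.vstack([c.mean(axis=0) for c in np.array_split(p, m)]))
        for p in conformations
    ])
    Xc = X - X.mean(axis=0)      # centering before SVD is an assumption
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:r].T         # (n, r) reduced representation
```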

  13. Further Reduction - Results • Averaging before creating the internal-distance vector makes SVD feasible • For proteins with 60–75 AA, dRMS using only 20 parameters was highly correlated (r > 0.90) with dRMS using the full representation • 20 dimensions are not too many for kd-trees

  14. Finding k Nearest Neighbors • We tested the actual ability of the reduced representation to find NNs • 80 of the 100 true NNs (under dRMS) were found using the reduced representation of decoy sets • Results are better (90 of 100) when the data set contains uniformly sampled conformations • The maximal relative error was 10%–20% (0.5Å–1.5Å) • The average relative error was < 5%

  15. Using kd-trees • We used the ANN implementation (the University of Maryland kd-tree software) • The data set contained 100,000 conformations • We find the 100 NNs of each conformation (see the sketch below)
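ANN is a C++ library, so as a stand-in here is the same experiment expressed with SciPy's cKDTree; the random data and this particular API are my substitutions, not the authors' setup:

```python
import numpy as np
from scipy.spatial import cKDTree

# 100,000 conformations, each reduced to 20 parameters (random stand-in data).
X = np.random.rand(100_000, 20)
tree = cKDTree(X)
# Query with k = 101: the nearest hit for each point is the point itself.
dists, idx = tree.query(X, k=101)
neighbors = idx[:, 1:]   # the 100 nearest neighbors of every conformation
```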

  16. Why Does Averaging Work? • The mean distance of the i’th point from the origin is O(N^0.5), and its standard deviation is also O(N^0.5) • There is a very high correlation between dRMS using the full distance vector and dRMS using only the distances between “highly” separated points • The distortion added by averaging has a mean of 0 and a standard deviation of O(n^0.5) (illustrated below)
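An illustrative check of the distortion claim, assuming an idealized random-walk chain (real backbones are not ideal random walks, and the parameters n, b, and trials are arbitrary):

```python
import numpy as np

# Model the backbone as a 3-D random walk, average blocks of b points at the
# two ends, and measure how much the end-to-end distance changes.
rng = np.random.default_rng(0)
n, b, trials = 300, 5, 2000
dist_full, distortion = [], []
for _ in range(trials):
    chain = np.cumsum(rng.normal(size=(n, 3)), axis=0)
    full = np.linalg.norm(chain[-1] - chain[0])
    avgd = np.linalg.norm(chain[-b:].mean(axis=0) - chain[:b].mean(axis=0))
    dist_full.append(full)
    distortion.append(avgd - full)
d = np.array(distortion)
print(f"typical distance: {np.mean(dist_full):.1f}")             # grows like n^0.5
print(f"distortion mean: {d.mean():.2f}, stdev: {d.std():.2f}")  # small by comparison
```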

  17. Conjecture: The important differences between two conformations are found in the distances between “highly” separated points. These distances are large and are therefore only slightly distorted by averaging.
