500 likes | 757 Views
Conformational Space. Conformational Space. Conformation of a molecule: specification of the relative positions of all atoms in 3D-space, Typical parameterizations : List of coordinates of atom centers List of torsional angles (e.g., the f - y - c for a protein)
E N D
Conformational Space • Conformation of a molecule: specification of the relative positions of all atoms in 3D-space, • Typical parameterizations: • List of coordinates of atom centers • List of torsional angles (e.g., the f-y-cfor a protein) • Conformational space:Space of all conformations
qj qi qN-1 q2 qN q1 Conformational Space
q q q q q 1 3 0 n 4 Conformational Space
q 2 q q q q q t(t) 1 n 0 4 3 Relation to Robotics/Graphics Configuration space
Need for a Metric • Simulation and sampling techniques can produce millions of conformations • Which conformations are similar? • Which ones are close to the folded one? • Do some conformations form small clusters (e.g. key intermediates while folding)?
Metric in Conformational Space • A metric over conformational space C is a function: d: c,c’ C d(c,c’) +{0}such that: • d(c,c’) = 0 c = c’ (non-degeneracy) • d(c,c’) = d(c’,c) (symmetry) • d(c,c’) + d(c’,c”) d(c,c”) (triangle inequality)
But not all metrics are “good” • Euclidean metric: d(c,c’) = Si=1,...,n(|fi-fi’|2+ |yi-yi’|2)
Metric in Conformational Space • A “good” metric should measure how well the atoms in two conformations can be aligned • Usual metrics: cRMSD, dRMSD
RMSD • Given two sets of n points in 3 A = {a1,…,an} and B = {b1,…,bn} • The RMSD between A and B is:RMSD(A,B) = [(1/n)Si=1,…,n||ai-bi||2]1/2 where ||ai-bi|| denotes the Euclidean distance between ai and bi in 3 • RMSD(A,B) = 0 iff ai = bi for all i
cRMSD • Molecule M with n atoms a1,…,an • Two conformations c and c’ of M • ai(c) is position of ai when M is at c • cRMSD(c,c’) is the minimized RMSD between the two sets of atom centers: minT[(1/n)Si=1,…,n||ai(c)– T(ai(c’))||2]1/2 where the minimization is over all possible rigid-body transform T
cRMSD • cRMSD verifies triangle inequality • cRMSD takes linear time to compute • Often, cRMSD is restricted to a subset of atoms, e.g., the Ca atoms on a protein’s backbone
Representation Restricted to Ca Atoms - The positions of AA residue centers (Cα atoms) mainly determine the structure of a protein. - In structural comparison, people usually work only on the backbone of Cα atoms, and neglect the other atoms. Protein 1tph
Possible project: Design a method for efficiently finding nearest neighbors in a sampled conformation space of a protein, using the cRMSD metric.
dRMSD • Molecule M with n atoms a1,…,an • Two conformations c and c’ of M • {dij(c)}: nn symmetrical intra-molecular distance matrix in M at c • dRMD(c, c’) is :[(1/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c)– dij(c’))2]1/2 • {dij} is usually restricted to a subset of atoms, e.g., the Ca atoms on a protein’s backbone
Intra-Molecular Distance Matrix Distances between Ca pairs of a protein with 142 residues. Darker squares represent shorter distances.
45 40 85 1 Intra-Molecular Distance Matrix Distances between Ca pairs of a protein with 142 residues. Darker squares represent shorter distances.
dRMSD • Molecule M with n atoms a1,…,an • Two conformations c and c’ of M • {dij(c)}: nn symmetrical intra-molecular distance matrix in M at c • dRMSD(c, c’) =[(2/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c)– dij(c’))2]1/2 • {dij} is usually restricted to a subset of atoms, e.g., the Ca atoms on a protein’s backbone
dRMSD • Molecule M with n atoms a1,…,an • Two conformations c and c’ of M • {dij(c)}: nn symmetrical intra-molecular distance matrix in M at c • dRMSD(c, c’) =[(2/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c)– dij(c’))2]1/2 • {dij} is usually restricted to a subset of atoms, e.g., the Ca atoms on a protein’s backbone • Advantage: No aligning transform • Drawback: Takes quadratic time to compute
Is dRMSD a metric? • dRMSD(c, c’) = [(2/n(n-1))Si=1,…,n-1Sj=i+1,…,n(dij(c)– dij(c’))2]1/2 is a metric in the n(n-1)/2-dimensional space, where a conformation c is represented by {dij(c)} • But, in this representation, the same point represents both a conformation and its mirror image
k-Nearest-Neighbors Problem Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c(w.r.t. cRMSD, dRMSD, other metric) Can be done in time O(N(log k + L)) where: -N = size of S- L = time to compare two conformations
k-Nearest-Neighbors Problem The total time needed to compute the k nearest neighbors of every conformation in S is O(N2(log k + L))Much too long for large datasets where N ranges from 10,000’s to millions!!! Can be improved by: 1. Reducing L 2. More efficient algorithm (e.g., kd-tree)
kd-Tree In a d-dimensional space, where d>2, range searching for a point takes O(dn1-1/d)
k-Nearest-Neighbors Problem Idea: simplify protein’s description
Assume that each conformation is described by the coordinates of the n Ca atoms cRMSD O(n) time dRMSD O(n2) time
ci cj This representation is highly redundant • Proximity along the chain entails spatial proximity • Atoms can’t bunch up, hence far away atoms along the chain are on average spatially distant
m-Averaged Approximation • Cut the backbone into fragments of m Ca atoms • Replace each fragment by the centroid of the m Ca atoms • Simplified cRMSD and dRMSD 3n coordinates 3n/mcoordinates
Evaluation: Test Sets[Lotan and Schwarzer, 2003] • 8 diverse proteins (54 -76 residues) • Decoy setsof N =10,000 conformations from the Park-Levitt set [Park et al, 1997] Correlation: Higher correlation for random sets ( greater savings)
Further Reduction for dRMSD • Stack m-averaged distance matrices as vectors of a matrix A
N A r Vector ai of elements of distance matrix of ith conformation (i = 1 to N)
Further Reduction for dRMSD • Stack m-averaged distance matrices as vectors of a matrix A • Compute the SVD A = UDVT
Diagonal matrix Orthonormal(rotation) matrix SVD Decomposition N A(rxN) U(rxr) D(rxr) VT(rxN) = r Vector aj of elements of distance matrix of jth conformation (j = 1 to N)
SVD Decomposition N s1 s2 sr A(rxN) U(rxr) VT(rxN) 0 = r 0 Vector aj of elements of distance matrix of jth conformation (j = 1 to N) Diagonal matrix s1 s2 ... sr 0(singular values) Orthonormal(rotation) matrix
vjT vkT vi and vj are orthogonal unit Nx1 vectors SVD Decomposition N A(rxN) U(rxr) D(rxr) VT(rxN) = r Vector aj of elements of distance matrix of jth conformation (j = 1 to N) Diagonal matrix Matrix withorthonormal rows Orthonormal(rotation) matrix
y Representation ofA in space (X,Y) X does not depend on thecoordinate system! Y r-dimensional space x SVD Decomposition N A(rxN) U(rxr) D(rxr) VT(rxN) = r
s1 s2 s3 sr v1T v2T SVD Decomposition N A(rxN) U(rxr) D(rxr) VT(rxN) = r ||s1v1|| ||s2v2|| ...
s1 s2 s3 sr v1T v2T SVD Decomposition N A(rxN) U(rxr) D(rxr) VT(rxN) = r p principal components vpT
SVD Decomposition N A(rxN) U(rxr) D(rxr) VT(rxN) = r s1 s2 sp v1T v2T p principal components vpT 0
Further Reduction for dRMSD • Stack m-averaged distance matrices as vectors of a matrix A • Compute the SVD A = UDVT • Project onto p principal components
between dRMSD and is reduced to summing up 12 to 20 terms(instead of ~ 80 to 200, since the proteins have 54 to 76 amino acids) Correlation
Complexity of SVD • SVD of rxN matrix, where N > r, takes O(r2N) time • Here r ~ (n/m)2 • So, time complexity is O(n4N) • Would be too costly without m-averaging
Evaluation for 1CTF Decoy Sets[Lotan and Schwarzer, 2003] • N = 100,000, k = 100, 4-averaging, 16 PCs • 70% correct, with furthest NN off by 20% • Brute-force: 84 h • Brute-force + m-averaging: 4.8 h • Brute-force + m-averaging + PC: 41 min • kD-tree + m-averaging + PC: 19 min • Speedup greater than x200 • 6k approximate NNs contain all true k NNs • Use m-averaging and PC reduction as fast filters