230 likes | 241 Views
Amit Satsangi amit@cs.ualberta.ca. Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search Karakoc E, Cherkasov A., and Sahinalp S.C. Background and Focus.
E N D
Amit Satsangi amit@cs.ualberta.ca Novel Approaches for Small Bio-molecule Classification and Structural Similarity SearchKarakoc E, Cherkasov A., and Sahinalp S.C. CMPUT 605
CMPUT 605 Background and Focus • Identification of molecules that play an active role in regulation of biological processes or disease states (Aspirin) • Structural similarity Similar biological and/or physico-chemical properties (Maggiora et al.) • Classification of probe compound (unknown bioactivity) • Similarity search amongst compounds with known bioactivity
CMPUT 605 Background and Focus • Determining similarity distance measures (SDM) • Using SDM for classification of compounds—k-NN classification • Efficient data structures for fast similarity search—DMVP trees (an improvement over SCVP trees used previously)
CMPUT 605 Outline • Similarity measures • Classification techniques • k-NN classifier • DMVP tree • Results, Observations and Conclusion
CMPUT 605 Similarity between Molecules • Structural Similarity—doubly bonded C pair, existence of aromatic atom etc. (Used in structural similarity search engines) • Similarity of chemical descriptors—atomic wt., hydrophobicity, charge, density etc. (Used in QSAR* tools) *Quantitative Structure-Activity Relationship
CMPUT 605 Similarity Measures • Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y: • X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases) • Range of Tanimoto coefficient: [0, 1]
CMPUT 605 Similarity measures • Tanimoto Dist. Measure: DT(X,Y) = 1 –T(X,Y) • Minkowski distance (LP): • Real valued data possible
CMPUT 605 Classification Techniques • Multiple Linear Regression (MLR) • Linear Discriminant Analysis (LDA) • Artifical Neural Networks (ANN) • Support Vector Machines (SVM) • k-nearest Neighbor (k-NN) classification not used previously.
CMPUT 605 Distance-based Classification • Compounds—s & r • S & R respective descriptor arrays • If D(S,R) is small then bioactivity levels of s & r are similar • Notion of distance classification of new compounds • Distance measure == metric (conditions) e.g. Hamming Distance, Tanimoto distance etc.
CMPUT 605 k-nn Classification • Given Bioactivity • To Find Distance measure that separates active and inactive compounds for the training set N-dimensional plane • Problem Easy
CMPUT 605 k-nn Classification • Given Bioactivity • To Find Distance measure that separates active and inactive compounds for the training set N-dimensional plane • Problem NP-hard • Solution Use Genetic Algorithms, heuristic linear search to find the plane
QSAR approach • Uses a linear combination of descriptors • Assigns a weight to each dimension , W [0,1] • Weighted Minkowski distance of order 1 • Only binary classification considered (A/I) • Methods are general CMPUT 605
CMPUT 605 Parameter Optimization
CMPUT 605 k-NN Classifier • Set of data elements: {X1, … Xn} • Query element: Y • Range query Find Xi such that D(Y,Xi) < R1 (user defined) • k-nn query Find k items such that their distance to Y is as small as possible
CMPUT 605 Data structures: VP-Trees • Vantage Point (VP) tree • Choose an arbitrary data point (called Vantage Point) • Binary tree—recursively partitions the dataset into two equal sized subsets • Zero in on the nearest neighbor
CMPUT 605 Efficient data structures: SCVP Trees • Space Covering Vantage Point tree • Multiple vantage points chosen at each level • No more a binary tree—multiple branches at each internal node • Multiple inner partitions—hope is that each data point lies in atleast one inner partition
CMPUT 605 DMVP Tree • Memory requirements of SCVP tree can be large—redundancy of data elements • Deterministic selection of Vantage points • VP minimization—NP-Hard • Minimization == Weighted set cover problem • Use of greedy Algorithm: O(log l); l<n • Approximates the min number of VP’s
CMPUT 605 Experiments • Five types of bioactivities viz. being antibiotic (520), bacterial metabolite (562), human metabolite(1104), drug(958), drug-like(1202) • 62 dimensional descriptor array (30 QSAR & 32 physico-chemical properties) • k=1 i.e. one NN • Comparison with LDA, MLR, ANN • 70% data used for training • wL1 distance is calculated in all cases
CMPUT 605 Experimental Results • Table 1 shows that in almost all cases in terms of accuracy, and T_P, T_N, F_P etc. k-NN does better than LDA and MLR • ANN beats k-NN on almost all counts • Pruning—more than 80% in each kind of bioactivity (over brute-force search) • Key point – k-NN classifier is faster • More than 100 times faster than ANN
CMPUT 605 Experimental Results • Can calculate the level of bioactivity instead of a YES/NO • The value of the weights provides insights into the importance of descriptors for each bioactivity
CMPUT 605 Observations & Conclusion • Bacterial metabolites & antimicrobial drugs overlap (confirmation) • Human metabolites display distinctive properties • QSAR models for drugs + human metabolites dominated by few descriptors • These descriptors favored by drug developers and natural evolution
CMPUT 605 Observations & Conclusion • Classification results from k-NN can help rationalize the design and discovery of drugs • DMVP tree improves the space utilization of the program • Provides a means for fast similarity search • Data structure can be applied to any metric distance like wLp and Tanimoto distance
Thank You For Your Attention! CMPUT 605