Amit Satsangi, amit@cs.ualberta.ca
Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search
Karakoc E., Cherkasov A., and Sahinalp S.C.
CMPUT 605

Background and Focus
• Identification of molecules that play an active role in the regulation of biological processes or disease states (e.g., Aspirin)
• Structural similarity → similar biological and/or physico-chemical properties (Maggiora et al.)
• Classification of a probe compound (unknown bioactivity)
• Similarity search amongst compounds with known bioactivity

Background and Focus
• Determining similarity distance measures (SDM)
• Using SDM for classification of compounds: k-NN classification
• Efficient data structures for fast similarity search: DMVP trees (an improvement over the SCVP trees used previously)

Outline
• Similarity measures
• Classification techniques
• k-NN classifier
• DMVP tree
• Results, Observations and Conclusion

Similarity between Molecules
• Structural similarity: doubly bonded C pair, existence of an aromatic atom, etc. (used in structural similarity search engines)
• Similarity of chemical descriptors: atomic weight, hydrophobicity, charge, density, etc. (used in QSAR* tools)
*Quantitative Structure-Activity Relationship

Similarity Measures
• Tanimoto coefficient T(X,Y): given two descriptor sets X & Y, T(X,Y) = |X ∩ Y| / |X ∪ Y|
• X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases)
• Range of Tanimoto coefficient: [0, 1]
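
To make the bit-vector definition concrete, here is a minimal Python sketch of the Tanimoto coefficient. The function name, the treatment of two all-zero vectors, and the example vectors are assumptions made for illustration; they do not come from the slides.

```python
def tanimoto_coefficient(x, y):
    """Tanimoto coefficient of two equal-length bit-vectors (sequences of 0/1)."""
    if len(x) != len(y):
        raise ValueError("bit-vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # bits set in both vectors
    either = sum(1 for a, b in zip(x, y) if a or b)   # bits set in at least one
    # Assumed convention: two all-zero vectors are treated as identical.
    return 1.0 if either == 0 else both / either


x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(tanimoto_coefficient(x, y))  # 2 shared bits / 4 set bits = 0.5
```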

Similarity Measures
• Tanimoto distance measure: DT(X,Y) = 1 - T(X,Y)
• Minkowski distance (Lp): Lp(X,Y) = (Σi |Xi - Yi|^p)^(1/p)
• Real-valued data possible
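
The two distances on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and sample values are assumptions, and the Tanimoto part assumes 0/1 bit-vectors while the Minkowski part accepts real-valued descriptors.

```python
def tanimoto_distance(x, y):
    """1 - T(X, Y) for two equal-length bit-vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    t = 1.0 if either == 0 else both / either   # assumed convention for all-zero input
    return 1.0 - t


def minkowski_distance(x, y, p=2):
    """Minkowski (Lp) distance between two real-valued descriptor arrays."""
    if p < 1:
        raise ValueError("p must be at least 1 for Lp to be a metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


print(tanimoto_distance([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))  # 0.5
print(minkowski_distance([0.0, 3.0], [4.0, 0.0], p=1))      # 7.0
print(minkowski_distance([0.0, 3.0], [4.0, 0.0], p=2))      # 5.0
```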

Classification Techniques
• Multiple Linear Regression (MLR)
• Linear Discriminant Analysis (LDA)
• Artificial Neural Networks (ANN)
• Support Vector Machines (SVM)
• k-nearest neighbor (k-NN) classification: not used previously

Distance-based Classification
• Compounds: s & r
• S & R: respective descriptor arrays
• If D(S,R) is small, then the bioactivity levels of s & r are similar
• Notion of distance → classification of new compounds
• Distance measure == metric (conditions), e.g. Hamming distance, Tanimoto distance, etc.
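
The metric conditions referred to above are the usual ones: non-negativity, zero distance only between identical points, symmetry, and the triangle inequality. The sketch below checks them empirically for the Hamming distance on a small sample. The helper names are hypothetical, and passing the check on a finite sample is evidence, not a proof.

```python
from itertools import product


def hamming_distance(x, y):
    """Number of positions at which two equal-length vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def looks_like_metric(dist, points, tol=1e-12):
    """Check the metric conditions on a finite sample of points."""
    for a, b in product(points, repeat=2):
        d = dist(a, b)
        if d < 0:                         # non-negativity
            return False
        if (d == 0) != (a == b):          # zero distance only for identical points
            return False
        if abs(d - dist(b, a)) > tol:     # symmetry
            return False
    for a, b, c in product(points, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:   # triangle inequality
            return False
    return True


points = [tuple(bits) for bits in product([0, 1], repeat=3)]
print(looks_like_metric(hamming_distance, points))  # True
```

A squared difference such as `(a - b) ** 2` fails the same check on the points 0, 1, 2 because it violates the triangle inequality, which is why not every dissimilarity score can be used with metric-based index structures.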

k-NN Classification
• Given → Bioactivity
• To find → Distance measure that separates active and inactive compounds for the training set (N-dimensional plane)
• Problem → Easy

k-NN Classification
• Given → Bioactivity
• To find → Distance measure that separates active and inactive compounds for the training set (N-dimensional plane)
• Problem → NP-hard
• Solution → Use genetic algorithms or heuristic linear search to find the plane

QSAR Approach
• Uses a linear combination of descriptors
• Assigns a weight to each dimension, W ∈ [0,1]
• Weighted Minkowski distance of order 1
• Only binary classification considered (A/I)
• Methods are general
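
A minimal sketch of the weighted L1 (Minkowski order 1) distance with one weight per descriptor in [0, 1], together with a simple search over the weights. The search shown is plain hill climbing, chosen for brevity as a stand-in; it is not the genetic algorithm or heuristic search used in the paper. The function names, the leave-one-out scoring, and the toy data are all assumptions made for illustration.

```python
import random


def weighted_l1(x, y, w):
    """Weighted Minkowski distance of order 1: sum of w_i * |x_i - y_i|."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))


def loo_accuracy(data, labels, w):
    """Leave-one-out 1-NN accuracy on the training set under weighted L1."""
    correct = 0
    for i, x in enumerate(data):
        nearest = min((j for j in range(len(data)) if j != i),
                      key=lambda j: weighted_l1(x, data[j], w))
        correct += labels[nearest] == labels[i]
    return correct / len(data)


def hill_climb_weights(data, labels, steps=200, seed=0):
    """Resample one weight at a time; keep the change if accuracy does not drop."""
    rng = random.Random(seed)
    dim = len(data[0])
    best_w = [1.0] * dim
    best_acc = loo_accuracy(data, labels, best_w)
    for _ in range(steps):
        cand = list(best_w)
        cand[rng.randrange(dim)] = rng.random()   # new weight in [0, 1)
        acc = loo_accuracy(data, labels, cand)
        if acc >= best_acc:
            best_w, best_acc = cand, acc
    return best_w, best_acc


# Toy data (made up): descriptor 0 separates active (A) from inactive (I),
# descriptor 1 is noise with a much larger scale.
data = [(0.0, 5.0), (0.1, 0.0), (0.2, 9.0), (1.0, 5.1), (1.1, 0.2), (1.2, 9.1)]
labels = ["I", "I", "I", "A", "A", "A"]
weights, acc = hill_climb_weights(data, labels)
print(weights, acc)
```

With unit weights the noisy descriptor dominates the distance, so the search is expected to push its weight down; the learned weights then hint at which descriptors matter for the activity.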

Parameter Optimization

k-NN Classifier
• Set of data elements: {X1, …, Xn}
• Query element: Y
• Range query → Find Xi such that D(Y,Xi) < R1 (user defined)
• k-NN query → Find k items such that their distance to Y is as small as possible
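
The two query types can be written as brute-force scans, which is the baseline that tree-based indexes try to beat. This is an illustrative sketch: the function names, the majority-vote classifier, and the sample points are assumptions, and any metric can be passed in as `dist`.

```python
import heapq
from collections import Counter


def l1(x, y):
    """Unweighted L1 distance, used here only as an example metric."""
    return sum(abs(a - b) for a, b in zip(x, y))


def range_query(data, y, radius, dist):
    """All data elements X_i with D(Y, X_i) < radius."""
    return [x for x in data if dist(y, x) < radius]


def knn_query(data, y, k, dist):
    """The k data elements closest to the query Y."""
    return heapq.nsmallest(k, data, key=lambda x: dist(y, x))


def knn_classify(data, labels, y, k, dist):
    """Label the query by majority vote among its k nearest neighbours."""
    nearest = heapq.nsmallest(k, range(len(data)), key=lambda i: dist(y, data[i]))
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


data = [(0, 0), (1, 0), (5, 5), (6, 5)]
labels = ["I", "I", "A", "A"]
query = (0.4, 0)
print(range_query(data, query, 1.0, l1))         # [(0, 0), (1, 0)]
print(knn_query(data, query, 1, l1))             # [(0, 0)]
print(knn_classify(data, labels, query, 1, l1))  # I
```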

Data Structures: VP Trees
• Vantage Point (VP) tree
• Choose an arbitrary data point (called the vantage point)
• Binary tree: recursively partitions the dataset into two equal-sized subsets
• Zero in on the nearest neighbor
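
A basic VP tree can be sketched as follows: pick a vantage point, split the remaining points at the median distance from it, and recurse. During search, the triangle inequality lets whole subtrees be skipped. This is a textbook-style sketch under stated assumptions, not the implementation from the paper; the class and function names are hypothetical, and the distance passed in must be a metric.

```python
import random
import statistics


def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build_vp_tree(points, dist, rng=None):
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))  # arbitrary vantage point
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in points]
    radius = statistics.median(dists)
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest_neighbor(node, query, dist, best=None):
    """Return (distance, point) of the nearest stored point to the query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Search the side that contains the query first.
    if d <= node.radius:
        first, second = node.inside, node.outside
    else:
        first, second = node.outside, node.inside
    best = nearest_neighbor(first, query, dist, best)
    # By the triangle inequality, the other side can hold a closer point
    # only if the current best ball reaches across the partition boundary.
    if abs(d - node.radius) <= best[0]:
        best = nearest_neighbor(second, query, dist, best)
    return best


points = [(0, 0), (1, 5), (4, 4), (7, 1), (9, 9)]
tree = build_vp_tree(points, l1)
print(nearest_neighbor(tree, (8, 8), l1))  # (2, (9, 9))
```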

Efficient Data Structures: SCVP Trees
• Space Covering Vantage Point tree
• Multiple vantage points chosen at each level
• No longer a binary tree: multiple branches at each internal node
• Multiple inner partitions: the hope is that each data point lies in at least one inner partition

DMVP Tree
• Memory requirements of the SCVP tree can be large: redundancy of data elements
• Deterministic selection of vantage points
• VP minimization: NP-hard
• Minimization == weighted set cover problem
• Greedy algorithm: O(log l) approximation, l < n
• Approximates the minimum number of VPs
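
The greedy heuristic for weighted set cover repeatedly picks the set with the lowest weight per newly covered element. The sketch below is the standard greedy algorithm, not code from the paper. As an assumption for illustration, each candidate vantage point is modelled as the set of data points it would cover, and all names and numbers are made up.

```python
def greedy_weighted_set_cover(universe, subsets, weights):
    """Pick subsets greedily by weight per newly covered element."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name, members in subsets.items():
            gain = len(uncovered & members)
            if gain == 0:
                continue
            ratio = weights[name] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical example: data points 1..6, four candidate vantage points.
universe = {1, 2, 3, 4, 5, 6}
subsets = {
    "vp_a": {1, 2, 3, 4},
    "vp_b": {3, 4, 5},
    "vp_c": {5, 6},
    "vp_d": {1, 6},
}
weights = {"vp_a": 1.0, "vp_b": 1.0, "vp_c": 1.0, "vp_d": 1.0}
print(greedy_weighted_set_cover(universe, subsets, weights))  # ['vp_a', 'vp_c']
```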

Experiments
• Five types of bioactivities, viz. being antibiotic (520), bacterial metabolite (562), human metabolite (1104), drug (958), drug-like (1202)
• 62-dimensional descriptor array (30 QSAR & 32 physico-chemical properties)
• k = 1, i.e. one nearest neighbor
• Comparison with LDA, MLR, ANN
• 70% of the data used for training
• wL1 distance is calculated in all cases

Experimental Results
• Table 1 shows that k-NN does better than LDA and MLR in almost all cases in terms of accuracy, T_P, T_N, F_P, etc.
• ANN beats k-NN on almost all counts
• Pruning: more than 80% in each kind of bioactivity (over brute-force search)
• Key point: the k-NN classifier is faster
• More than 100 times faster than ANN
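
For reference, the quantities T_P, T_N, F_P and accuracy can be computed from predicted and actual labels as sketched below. The labels in the example are made up for illustration and are not taken from Table 1; the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive="A"):
    """Count true/false positives and negatives for a binary (A/I) task."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Fraction of compounds classified correctly."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


actual = ["A", "A", "I", "I", "A", "I"]
predicted = ["A", "I", "I", "A", "A", "I"]
counts = confusion_counts(actual, predicted)
print(counts)            # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
print(accuracy(counts))  # 4 correct out of 6
```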

Experimental Results
• Can calculate the level of bioactivity instead of a YES/NO
• The values of the weights provide insights into the importance of descriptors for each bioactivity

Observations & Conclusion
• Bacterial metabolites & antimicrobial drugs overlap (confirmation)
• Human metabolites display distinctive properties
• QSAR models for drugs + human metabolites are dominated by a few descriptors
• These descriptors are favored by drug developers and natural evolution

Observations & Conclusion
• Classification results from k-NN can help rationalize the design and discovery of drugs
• DMVP tree improves the space utilization of the program
• Provides a means for fast similarity search
• Data structure can be applied to any metric distance, such as wLp and Tanimoto distance

Thank You For Your Attention!
