440 likes | 575 Views
EXTENDED NEAREST NEIGHBOR CLASSIFICATION METHODS FOR PREDICTING SMALL MOLECULE ACTIVITY. Farhad Hormozdiari Lab for Computational Biology, Simon Fraser University. Outline. Small Molecule Similarity Measure Classification Kernel Methods Nearest Neighbor classifier
E N D
EXTENDED NEAREST NEIGHBOR CLASSIFICATION METHODS FOR PREDICTING SMALL MOLECULE ACTIVITY Farhad Hormozdiari Lab for Computational Biology, Simon Fraser University
Outline • Small Molecule • Similarity Measure • Classification • Kernel Methods • Nearest Neighbor classifier • Centroid based Nearest Neighbor • Distance / Metric Learning • Results
What are small molecules ? • Chemical compounds with small molecular mass • Important in the synthesis and maintenance of larger molecules (DNA, RNA and proteins). • High potential as medicine. • Increasing number of databases: PubChem, ChemDB, ChemBank… • Standard task in in silico drug discovery: Classifying an compound with unknown activity
Representation of small molecules • Chemical (Conventional) Descriptors: A(x)=(25, 0.24, 1, 12.3,….., 5, 2.12,……..) • Chemical structures represented by labeled graphs
Classification methods for small molecules • Artificial Neural Networks (ANN) • Support Vector Machine (SVM) • K-Nearest Neighbor Classification • Recent works focused on Kernel Methods
SVM (Support Vector Machine) • Φ(x) fixed feature transformation • tnϵ{1,-1} • Find a decision boundary • Y(x) = WT Φ(x) + b • Goal to maximize the distance • Dist= • Quadratic programming
Recent works on small molecule classification • Mariginalized Kernel (MK) • Tsuda et.al 2002, Kashima et al. (ICML 2003) • Features are number of labeled paths of random walks • Improved Mariginalized Kernel • Mahe et al. (ICML 2005) • Avoid totters (walks that visit a node which was visited in two previous stages)
Recent works on small molecule Classification • Swamidass et al. (Bioinformatics 2003) • Kernels based on 3D Euclidean coordinates of atoms • One histogram per pair of atom labels • Similarity between histograms • Cao et al. (ISMB 2008, Bioinformatics 2010) • Use Maximum Common Substructure (MCS) as a measure of similarity • Randomly pick ”basis” compounds • Features of a molecule are MCS between that molecule and all basis compounds
Nearest Neighbor Classification • Nearest Neighbor (NN) Classification • The label of a molecule is predicted based on ones of its nearest neighbors • NN Error < 2*Bayes error (Cover et al. 1967) • One of most used classifiers in small molecule classification because of its simplicity
Nearest Neighbor Classification Drawbacks • Speed/Memory • Distances to all traning set points should be computed • All the traning set is stored in the memory • Overfitting
Centroid based Nearest Neighbor (CBNN) Classificatrion • CBNN Classification • Centroids are picked from each class • Bioactivity of a small molecule is predicted based on its nearest centroids • CBNN tackle NN drawbacks
Centroid Selection • Hart et al., 1968 introduced Condensed NN Classification • Initially, the set of centroids S includes one point • Iteratively go through each remaining point p, if its nearest neighbor in S has the opposite class, p is added to S • Fast condensed NN Classification (Angiulli et al., ICML 2005) • S is assigned to medoids of each class • For each point in S their Voronoi cell is build • In each Voronoi cell if there exist a point from different class is added to S
w v u Centroid Selection • Gabriel Graph (Gabriel et al. 1969,1980) • There exist an edge between two points u,v • If for any point w dist(u,v) < min{dist(u,w),dist(w,v)} • After the graph is built, connected nodes from different classes are selected Removed link
Centroid Selection • Relative Neighborhood Graph (Toussiant et al. 1980) • There exists an edge between two points u,v if • for any point w, dist(u,v) < max{dist(u,w),dist(w,v)} • After the graph is built, connected nodes from different classes are selected
Combinatorial Centroid Selection • Combinatorial Centroid Selection(CCS) • Given a training set of points (compounds) where distances satisfy triangle inequalities • Asked to find the minimum number of centroids (selected compounds) such that for each point, its nearest centroid is from same class • For simplicity, we only deal with binaryclassification i.e. C1 first class and C2 second class.
CCS Complexity • k-CCS problem • Asked to select a set of points with cardinality less than k such that for each point, its nearest centroid is from same class • k-CCS is NP-Complete • K-Dominating Set (k-DS): given a graph G(V,E), ask whether there exists V' ⊆ V, |V'| ≤ k and each node v∊V either exist in V' or it is adjecent to a node in V' • k-DS ≤p k-CCS • This reduction states no approximation better than O(log n) exists for CCS unless P = NP
Integer Linear Program Solution • Notations: • To minimize the number of chosen points or compounds (called centroids)
Integer Linear Program Solution • Ensure that for every pair of compounds i of class 1 and j of class 2, if j is chosen as a centroid, a compound k of class 1within the radius of between i and j should be chosen as a centroid as well.
Integer Linear Program Solution • Ensure that for each class there is a compound chosen as a centroid
Fixed Size Neighborhood Solution • ILP solution suffers from • Huge size • due to pairwise constraints among points • Potential trivial solution • Propose a relaxed version of ILP • Reduce the number of constraints • for each point p within the radius equal to the distance from p to its k-th nearest neighbor of the different class there must be one centroid of same class of p • We will call this method CCNN1
Special case of CCS • When the majority of the compounds do not exhibit the bioactivity of interest • All compounds that exhibit bioactivity of interest are picked as centroids • We minimize the number of compounds chosen from compounds that does not exhibit the activity of interest
Special case of CCS • It can be reduced to Set Cover • O(logn)-approximation algorithm • Set Cover problem • Given a Universal Set (U) and a collection of subsets (C) from U. Goal is to pick the minimum number of sets from C which cover all the elements in U. • NP-Complete • Greedy Algorithm • Pick the set which cover the maximum number of uncoverd elements from the universal set • We will call this method CCNN2
Experimental Results - Datasets • Mutageniticy dataset • includes aromatic and hetero-aromatic nitro compounds that are tested for mutagenicity on Salmonella • 188 compounds with positive levels of log mutagenicity • 63 negative examples • Drug dataset includes • 958 drug compounds • 6550 non-drug compounds including antibiotics, human, bacterial, plant, fungal metabolites and drug-like compounds
Experimental Results - Descriptors • The structures of the compounds have been used • 30 3D inductive QSAR descriptors by Cherkasov et al. 2005 • 32 conventional QSAR by MOE: • Number of basic atoms • Number of bonds • ….
Comparison with other CBNN based methods • Drug dataset
Comparison with small molecule classication methods • Mutag Data set
Learning the Metric Space Emre Karakoc, Artem Cherkasov, S.Cenk Sahinalp (ISMB 2006)
Quantitative Structure-Activity Relationship(QSAR) • Similarity measure • Minkowski distance • Each feature is equally significant • But some features should be more significant and some less • Weighted Minkowski distance
Main Idea • Can weighted Minkowski be useful? • Reduce the number of features. • PCA • Increase the accuracy • How to learn the right W? • Decrease the within-class distance • Increase the between-class dist.
Learn the optimal W • Given the training set T let • Active set • Inactive set • Min f(T) • f(T) =
Learn the optimal W (cont.) • Min f(T) • s.t
Metric Learning • Weinberger et al. NIPS 2006 • Semidefinite program • D(xi,xj) = (xi-xj)TM(xi-xj) where M = LTL • s.t. M > 0 • The difference between between-class and within-class distances is pre-fixed • It aims to compute the “best” M
Classification of new compounds • Input: • Distances of new compound Q to the ones in the data-set • Assumption: • Bioactivity level of Q is likely to be similar to its close neighbors • kNN classifier estimate the bioactivity of Q: • The majority bioactivity among its k-nearest neighbors
Querying a compound • Naïve Method • O(S) which S is the number element in database. • Binary search tree • Vantage Point (VP) tree (Uhlmann 1991) • Binary tree that recursively partition data space using distances of data points to randomly picked vantage point.
VP-Tree • Internal nodes: (Xvp, M, Rptr, Lptr) M: median distance of among d(Xvp, Xi) for all Xi in the space partitioned. Xvp: Vantage point. • Leaves: references to data points
Proximity search in VP-tree • Given a query point q, metric distance d(.,.) and a proximity radius r • Goal is to find all points x where d(x,q) < r • If d(q,Xvp) – r < M recursively search the inner partition • If d(q, Xvp) + r > M recursively search the outer partition • Else search both
Can we do better? • Select multiple vantage points at each level • Space Covering VP (SCVP) Trees (Sahinalp et.al 2003) • Increasing the chance of inclusion of query in one of the inner partitions.
Can we do much better? • Instead of selecting random vantage points select them more intelligently • Deterministic Multiple Vantage Point (DMVP) Tree • Select minimum number of multiple vantage points that cover the entire data collection (OVPS problem) • Better space utilization (Optimal redundancy) • OVPS problem is NP-hard for any wLp
Conclusion • NN is powerful classifier • Small molecule classification • NN problem • CBNN • CCNN1 and CCNN2 • Distance learning • Accuracy • DMVP tree
Future work • Further investigation of possible approximation algorithms for selecting centroids • Combining CCNN (selecting centroids) with metric learning • Ideally the problem formulation should ask to ensure the NN of each point in the training set is in the same class with that point • Adapt CCNN to work with regression datasets
References • Phuong Dao*, Farhad Hormozdiari*, Hossien Jowhari, Kendall Byler, Artem Cherkasov, S. Cenk Sahinalp, Improved Small Molecule Activity Determination via Centroid Nearest Neighbors Classification, CSB 2008. • Emre Karakoc, Artem Cherkasov, S. Cenk Sahinalp Distance Based Algorithm for small Biomolecule Classification and Structural Similarity Search, ISMB 2006 • Lurii Sushko et.al. Applicability domains for classification problems: benchmarking of distance to models for AMES mutagenicity set, J. Chemical Informatics 2010.
Acknowledgments • Cenk Sahinalp • Artem Cherkasov • Zehra Cataltepe • Emre Karakoc • Phuong Dao • Hossien Jowhari • Kendall Byler • All members of Lab