EXTENDED NEAREST NEIGHBOR CLASSIFICATION METHODS FOR PREDICTING SMALL MOLECULE ACTIVITY

Farhad Hormozdiari, Lab for Computational Biology, Simon Fraser University

Outline • Small Molecule • Similarity Measure • Classification • Kernel Methods • Nearest Neighbor classifier • Centroid based Nearest Neighbor • Distance / Metric Learning • Results

What are small molecules? • Chemical compounds with small molecular mass • Important in the synthesis and maintenance of larger molecules (DNA, RNA and proteins) • High potential as medicines • Increasing number of databases: PubChem, ChemDB, ChemBank… • Standard task in in silico drug discovery: classifying a compound with unknown activity

Representation of small molecules • Chemical (conventional) descriptors: A(x) = (25, 0.24, 1, 12.3, …, 5, 2.12, …) • Chemical structures represented by labeled graphs

Classification methods for small molecules • Artificial Neural Networks (ANN) • Support Vector Machine (SVM) • K-Nearest Neighbor Classification • Recent work has focused on kernel methods

SVM (Support Vector Machine) • Φ(x): fixed feature transformation • t_n ∈ {1, −1} • Find a decision boundary • y(x) = w^T Φ(x) + b • Goal: maximize the distance from the boundary to the closest training points (the margin) • Dist = t_n y(x_n) / ‖w‖ • Solved by quadratic programming
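
For reference, the bullets above correspond to the usual textbook maximum-margin formulation. The slide's own formula was not recoverable, so the block below is the standard form under the slide's symbols (Φ, t_n, w, b), not necessarily the exact expression that was shown.

```latex
\begin{aligned}
\text{distance of } x_n \text{ to the boundary:}\quad
  & \frac{t_n\, y(x_n)}{\lVert w \rVert}
    = \frac{t_n\,\bigl(w^{T}\Phi(x_n) + b\bigr)}{\lVert w \rVert} \\
\text{equivalent quadratic program:}\quad
  & \min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
    \quad \text{s.t.}\quad t_n\,\bigl(w^{T}\Phi(x_n) + b\bigr) \ge 1 \quad \forall n
\end{aligned}
```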

Recent work on small molecule classification • Marginalized Kernel (MK) • Tsuda et al. 2002, Kashima et al. (ICML 2003) • Features are the numbers of labeled paths produced by random walks • Improved Marginalized Kernel • Mahe et al. (ICML 2005) • Avoids totters (walks that immediately return to the node visited two steps earlier)

Recent work on small molecule classification • Swamidass et al. (Bioinformatics 2003) • Kernels based on 3D Euclidean coordinates of atoms • One histogram per pair of atom labels • Similarity between histograms • Cao et al. (ISMB 2008, Bioinformatics 2010) • Use Maximum Common Substructure (MCS) as a measure of similarity • Randomly pick "basis" compounds • Features of a molecule are the MCS between that molecule and each basis compound

Nearest Neighbor Classification • Nearest Neighbor (NN) Classification • The label of a molecule is predicted from the labels of its nearest neighbors • Asymptotic NN error ≤ 2 × Bayes error (Cover and Hart 1967) • One of the most used classifiers in small molecule classification because of its simplicity

Nearest Neighbor Classification Drawbacks • Speed/Memory • Distances to all training set points must be computed • The whole training set is stored in memory • Overfitting

Centroid based Nearest Neighbor (CBNN) Classification • CBNN Classification • Centroids are picked from each class • Bioactivity of a small molecule is predicted based on its nearest centroids • CBNN tackles the NN drawbacks

Centroid Selection • Hart (1968) introduced Condensed NN classification • Initially, the set of centroids S contains a single point • Iteratively go through each remaining point p; if its nearest neighbor in S belongs to the other class, p is added to S • Fast Condensed NN classification (Angiulli, ICML 2005) • S is initialized to the medoids of each class • For each point in S, its Voronoi cell is built • If a Voronoi cell contains a point from a different class, such a point is added to S
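
The condensing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function name `condensed_nn`, the choice of the first point as the initial centroid, and the use of Euclidean distance (`math.dist`) are all assumptions made for the example.

```python
import math

def condensed_nn(points, labels, dist=math.dist):
    """Hart-style condensing: grow S until every point's nearest member of S has its class."""
    selected = [0]          # assumption: start from the first point
    changed = True
    while changed:          # repeat passes until a full pass adds nothing
        changed = False
        for i in range(len(points)):
            if i in selected:
                continue
            # nearest already-selected centroid of point i
            nearest = min(selected, key=lambda s: dist(points[i], points[s]))
            if labels[nearest] != labels[i]:
                selected.append(i)   # misclassified by S, so keep it as a centroid
                changed = True
    return selected

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
centroids = condensed_nn(points, labels)
```

On this toy input two centroids (one per cluster) are enough, which is the memory saving that CBNN aims for.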

Centroid Selection • Gabriel Graph (Gabriel et al. 1969, 1980) • There exists an edge between two points u, v if • for every other point w, dist(u,v)² < dist(u,w)² + dist(w,v)² (no w lies inside the ball with diameter uv) • After the graph is built, connected nodes from different classes are selected

Centroid Selection • Relative Neighborhood Graph (Toussaint 1980) • There exists an edge between two points u, v if • for every other point w, dist(u,v) < max{dist(u,w), dist(w,v)} • After the graph is built, connected nodes from different classes are selected
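
A brute-force sketch of the Relative Neighborhood Graph selection, using the edge condition stated on this slide. The names `rng_edges` and `boundary_points` and the Euclidean default distance are illustrative assumptions; the cubic running time is fine for a toy example but not for large compound sets.

```python
import math
from itertools import combinations

def rng_edges(points, dist=math.dist):
    """Edges (u, v) with dist(u,v) < max(dist(u,w), dist(w,v)) for every other point w."""
    n = len(points)
    edges = []
    for u, v in combinations(range(n), 2):
        d_uv = dist(points[u], points[v])
        if all(d_uv < max(dist(points[u], points[w]), dist(points[w], points[v]))
               for w in range(n) if w not in (u, v)):
            edges.append((u, v))
    return edges

def boundary_points(points, labels, dist=math.dist):
    """Endpoints of graph edges that connect two different classes."""
    chosen = set()
    for u, v in rng_edges(points, dist):
        if labels[u] != labels[v]:
            chosen.update((u, v))
    return sorted(chosen)

points = [(0, 0), (1, 0), (2, 0), (3, 0)]
labels = [0, 0, 1, 1]
edges = rng_edges(points)
selected = boundary_points(points, labels)
```

Only the two points that face each other across the class boundary are kept as centroids here.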

Combinatorial Centroid Selection • Combinatorial Centroid Selection (CCS) • Given a training set of points (compounds) whose distances satisfy the triangle inequality • Asked to find the minimum number of centroids (selected compounds) such that for each point, its nearest centroid is from the same class • For simplicity, we only deal with binary classification, i.e. C1 is the first class and C2 the second class.

CCS Complexity • k-CCS problem • Asked to select a set of at most k points such that for each point, its nearest centroid is from the same class • k-CCS is NP-complete • k-Dominating Set (k-DS): given a graph G(V,E), ask whether there exists V' ⊆ V with |V'| ≤ k such that each node v ∈ V either is in V' or is adjacent to a node in V' • k-DS ≤p k-CCS • This reduction implies that no approximation better than O(log n) exists for CCS unless P = NP

Integer Linear Program Solution • Notations: • To minimize the number of chosen points or compounds (called centroids)

Integer Linear Program Solution • Ensure that for every pair of compounds i of class 1 and j of class 2, if j is chosen as a centroid, then a compound k of class 1 that lies within distance d(i, j) of i is chosen as a centroid as well.

Integer Linear Program Solution • Ensure that for each class there is a compound chosen as a centroid
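
The formulas on the three ILP slides did not survive, so the following is one plausible way to write the program from the verbal constraints, not the authors' exact formulation. The binary variable x_i (1 if compound i is chosen as a centroid), the distance d(·,·), the strict inequality in the radius, and the symmetric second constraint are all assumptions; C1 and C2 are the two classes introduced earlier.

```latex
\begin{aligned}
\min\ & \sum_{i \in C_1 \cup C_2} x_i \\
\text{s.t.}\ & x_j \le \sum_{k \in C_1,\ d(i,k) < d(i,j)} x_k
      && \forall\, i \in C_1,\ j \in C_2 \\
& x_i \le \sum_{k \in C_2,\ d(j,k) < d(j,i)} x_k
      && \forall\, j \in C_2,\ i \in C_1 \\
& \sum_{i \in C_1} x_i \ge 1, \qquad \sum_{j \in C_2} x_j \ge 1 \\
& x_i \in \{0, 1\} && \forall\, i \in C_1 \cup C_2
\end{aligned}
```

Written this way, the number of pairwise constraints grows with |C1|·|C2|.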

Fixed Size Neighborhood Solution • ILP solution suffers from • Huge size • due to pairwise constraints among points • Potential trivial solution • Propose a relaxed version of the ILP • Reduce the number of constraints • For each point p, at least one centroid of p's own class must lie within the radius equal to the distance from p to its k-th nearest neighbor of the other class • We will call this method CCNN1

Special case of CCS • When the majority of the compounds do not exhibit the bioactivity of interest • All compounds that exhibit the bioactivity of interest are picked as centroids • We minimize the number of centroids chosen among the compounds that do not exhibit the activity of interest

Special case of CCS • It can be reduced to Set Cover • O(log n)-approximation algorithm • Set Cover problem • Given a universal set U and a collection C of subsets of U, the goal is to pick the minimum number of sets from C that cover all the elements of U • NP-complete • Greedy Algorithm • Repeatedly pick the set that covers the maximum number of uncovered elements of the universal set • We will call this method CCNN2
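
The greedy rule for Set Cover named above can be sketched as follows. This shows only the generic greedy algorithm; how the compounds are mapped to the universe and the subsets in CCNN2 is not given on the slide and is not reproduced here. The function name `greedy_set_cover` is illustrative.

```python
def greedy_set_cover(universe, subsets):
    """Return indices of subsets picked greedily until every element is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the subset covering the most still-uncovered elements
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        if not uncovered & subsets[best]:
            raise ValueError("the given subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

universe = {1, 2, 3, 4, 5}
subsets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
picked = greedy_set_cover(universe, subsets)
```

Ties are broken in favor of the subset that appears first, which is an arbitrary choice.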

Experimental Results - Datasets • Mutagenicity dataset • includes aromatic and hetero-aromatic nitro compounds that are tested for mutagenicity on Salmonella • 188 compounds with positive levels of log mutagenicity • 63 negative examples • Drug dataset includes • 958 drug compounds • 6550 non-drug compounds including antibiotics, human, bacterial, plant, fungal metabolites and drug-like compounds

Experimental Results - Descriptors • The structures of the compounds have been used • 30 3D inductive QSAR descriptors by Cherkasov et al. 2005 • 32 conventional QSAR descriptors by MOE: • Number of basic atoms • Number of bonds • …

Comparison with other CBNN based methods • Drug dataset

Comparison with small molecule classification methods • Mutagenicity dataset

Comparison with small molecule classification methods • Drug dataset

Learning the Metric Space • Emre Karakoc, Artem Cherkasov, S. Cenk Sahinalp (ISMB 2006)

Quantitative Structure-Activity Relationship (QSAR) • Similarity measure • Minkowski distance • Every feature is treated as equally significant • But some features should count more and others less • Weighted Minkowski distance
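
A small sketch of a weighted Minkowski distance over descriptor vectors. The convention used here, (Σ w_i |x_i − y_i|^p)^(1/p) with one non-negative weight per feature, is an assumption: the slide does not show where the weights enter the formula, and other conventions place them inside the power. The function name is illustrative.

```python
def weighted_minkowski(x, y, weights, p=2):
    """(sum_i w_i * |x_i - y_i|**p) ** (1/p); all weights equal to 1 gives plain Minkowski."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w * abs(a - b) ** p for w, a, b in zip(weights, x, y)) ** (1.0 / p)

plain = weighted_minkowski((0, 0), (3, 4), (1, 1))           # both features count
first_only = weighted_minkowski((0, 0), (3, 4), (1, 0))      # second feature ignored
manhattan = weighted_minkowski((1, 2), (4, 6), (2, 1), p=1)  # weighted L1
```

Setting a weight to zero removes that feature from the comparison, which is how a learned weight vector can also act as feature selection.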

Main Idea • Can weighted Minkowski be useful? • Reduce the number of features • PCA • Increase the accuracy • How to learn the right W? • Decrease the within-class distance • Increase the between-class distance

Learn the optimal W • Given the training set T let • Active set • Inactive set • Min f(T) • f(T) =

Learn the optimal W (cont.) • Min f(T) • s.t

Metric Learning • Weinberger et al. NIPS 2006 • Semidefinite program • D(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j), where M = L^T L • s.t. M ⪰ 0 (positive semidefinite) • The margin between between-class and within-class distances is fixed in advance • It aims to compute the "best" M

Classification of new compounds • Input: • Distances from the new compound Q to the compounds in the dataset • Assumption: • The bioactivity level of Q is likely to be similar to that of its close neighbors • The kNN classifier estimates the bioactivity of Q as: • The majority bioactivity among its k nearest neighbors
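
The majority vote described above, sketched under the slide's assumption that the distances from Q to every compound in the dataset are already available. The function name `knn_predict` and the string labels are illustrative; ties are resolved by whichever label `Counter` encountered first, which is an arbitrary choice.

```python
from collections import Counter

def knn_predict(distances, labels, k=3):
    """distances[i] is the distance from the query Q to training compound i."""
    # indices of the k closest training compounds
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

distances = [0.9, 0.2, 0.4, 0.3, 1.5]
labels = ["inactive", "active", "active", "inactive", "inactive"]
prediction = knn_predict(distances, labels, k=3)
```

With k = 3 the neighbors at distances 0.2, 0.3 and 0.4 vote two to one for "active"; using all five compounds would flip the answer, which shows how much the result depends on k.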

Querying a compound • Naïve method • O(S), where S is the number of elements in the database • Binary search tree • Vantage Point (VP) tree (Uhlmann 1991) • Binary tree that recursively partitions the data space using the distances of the data points to a randomly picked vantage point

VP-Tree • Internal nodes: (Xvp, M, Rptr, Lptr) • Xvp: vantage point • M: median of the distances d(Xvp, Xi) over all Xi in the partitioned space • Leaves: references to data points

Proximity search in VP-tree • Given a query point q, a metric distance d(.,.) and a proximity radius r • Goal is to find all points x where d(x,q) < r • If d(q, Xvp) + r < M, recursively search only the inner partition • If d(q, Xvp) – r > M, recursively search only the outer partition • Otherwise search both
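
A compact sketch of building a VP-tree and running a proximity search, with pruning justified by the triangle inequality. It simplifies the node layout described earlier: each node stores its vantage point directly and there are no separate leaf buckets. The names `VPNode`, `build_vp_tree` and `range_search`, the Euclidean default distance, and the fixed random seed are assumptions made for the example.

```python
import math
import random
import statistics

class VPNode:
    def __init__(self, vantage, radius, inner, outer):
        self.vantage = vantage  # Xvp
        self.radius = radius    # M: median distance from Xvp to the other points
        self.inner = inner      # points with d(Xvp, x) <= M
        self.outer = outer      # points with d(Xvp, x) > M

def build_vp_tree(points, dist=math.dist, rng=None):
    if rng is None:
        rng = random.Random(0)  # fixed seed so the example is reproducible
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))  # random vantage point
    if not points:
        return VPNode(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in points]
    median = statistics.median(dists)
    inner = [p for p, d in zip(points, dists) if d <= median]
    outer = [p for p, d in zip(points, dists) if d > median]
    return VPNode(vantage, median,
                  build_vp_tree(inner, dist, rng),
                  build_vp_tree(outer, dist, rng))

def range_search(node, q, r, dist=math.dist, found=None):
    """Collect every stored point x with d(x, q) < r."""
    if found is None:
        found = []
    if node is None:
        return found
    d = dist(q, node.vantage)
    if d < r:
        found.append(node.vantage)
    if d - r <= node.radius:  # query ball may overlap the inner partition
        range_search(node.inner, q, r, dist, found)
    if d + r > node.radius:   # query ball may overlap the outer partition
        range_search(node.outer, q, r, dist, found)
    return found

grid = [(x, y) for x in range(5) for y in range(5)]
tree = build_vp_tree(grid)
query, radius = (2.1, 1.9), 1.5
hits = range_search(tree, query, radius)
```

A subtree is skipped only when the triangle inequality guarantees it cannot contain a match, so the result is identical to a linear scan while typically computing fewer distances.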

Can we do better? • Select multiple vantage points at each level • Space Covering VP (SCVP) Trees (Sahinalp et al. 2003) • Increases the chance that the query falls entirely within one of the inner partitions

Can we do much better? • Instead of selecting vantage points at random, select them more intelligently • Deterministic Multiple Vantage Point (DMVP) Tree • Select the minimum number of vantage points that cover the entire data collection (OVPS problem) • Better space utilization (optimal redundancy) • The OVPS problem is NP-hard for any weighted L_p metric (wLp)

Conclusion • NN is a powerful classifier • Small molecule classification • NN problem • CBNN • CCNN1 and CCNN2 • Distance learning • Accuracy • DMVP tree

Future work • Further investigation of possible approximation algorithms for selecting centroids • Combining CCNN (selecting centroids) with metric learning • Ideally the problem formulation should ensure that the NN of each point in the training set is in the same class as that point • Adapt CCNN to work with regression datasets

References • Phuong Dao*, Farhad Hormozdiari*, Hossein Jowhari, Kendall Byler, Artem Cherkasov, S. Cenk Sahinalp. Improved Small Molecule Activity Determination via Centroid Nearest Neighbors Classification. CSB 2008. • Emre Karakoc, Artem Cherkasov, S. Cenk Sahinalp. Distance Based Algorithm for Small Biomolecule Classification and Structural Similarity Search. ISMB 2006. • Iurii Sushko et al. Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set. J. Chemical Informatics 2010.

Acknowledgments • Cenk Sahinalp • Artem Cherkasov • Zehra Cataltepe • Emre Karakoc • Phuong Dao • Hossein Jowhari • Kendall Byler • All members of the lab

Questions
