Active Learning Strategies for Compound Screening
Megon Walker¹ and Simon Kasif¹,²
¹Bioinformatics Program, Boston University
²Department of Biomedical Engineering, Boston University
229th ACS National Meeting, March 13-17, 2005, San Diego, CA
Outline • Introduction to active learning for compound screening • Objectives and performance criteria • Algorithms and procedures • Thrombin dataset results • Preliminary conclusions
Introduction: drug discovery • drug discovery is an iterative process • goal: to identify many target-binding compounds with minimal screening iterations • [Diagram: iterative screening cycle with stages labeled compounds, descriptors, screening, selection]
Introduction: supervised learning • input: data set with positive and negative examples • output: a classifier f such that, for each example x, f(x) = +1 if x is positive and f(x) = -1 if x is negative • standard learning: classifier trains on a static training set; train, then test • active learning: classifier chooses data points for its training set; the classifier "requests" labels; iterative rounds of training and testing
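The iterative "request labels, retrain" loop above can be sketched in Python. This is a minimal illustration, not the talk's implementation; the names (`active_learning`, `select`, `batch_size`) are mine, and the pluggable `select` function stands in for the selection strategies discussed later.

```python
# Sketch of an active-learning loop: grow a labeled training set by
# querying a batch of labels per round from an unlabeled pool.

def active_learning(labels, batch_size, select, rounds):
    """`select` ranks the still-unlabeled indices (a real strategy would
    consult a trained committee; here it is an arbitrary callable)."""
    labeled = {}                                  # index -> queried label
    unlabeled = set(range(len(labels)))
    for _ in range(rounds):
        batch = select(sorted(unlabeled))[:batch_size]
        for i in batch:                           # "request" these labels
            labeled[i] = labels[i]
            unlabeled.discard(i)
        if not unlabeled:
            break
    return labeled

# Trivial strategy (pool order) standing in for P(active)/uncertainty/density:
picked = active_learning([1, -1, 1, -1, 1, -1, 1, -1],
                         batch_size=2, select=lambda idx: idx, rounds=3)
```

After three rounds of two queries each, six of the eight pool compounds have been labeled; a real run would retrain the classifier committee between rounds.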
Introduction: active learning & compound screening • Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998: 1-9. • Warmuth et al. J Chem Inf Comput Sci. 2003;43:667-673. • [Figure: compounds chosen in the 1st and 2nd query rounds]
Objectives • exploration: an accurate model of activity, measured by sensitivity • exploitation: hit performance and enrichment factor (EF)
Methods: datasets • 632 DuPont thrombin-targeting compounds: 149 actives, 483 inactives • a binary feature vector for each compound: shape-based features and pharmacophore features (139,351 features) • retrospective data • 200 features selected by mutual information (MI) w.r.t. activity labels; mean MI = 0.126 • [Flowchart: input data files → pick training and testing data for next round of cross validation → select 1st batch randomly or by chemist; later batches by sample selection (P(active), uncertainty, density) → query training set batch labels → train classifier committee on labeled training set subsamples → predict compound labels by committee weighted majority vote → repeat until all training set labels are queried and cross validation is completed → accuracy and performance statistics] • Warmuth et al. J Chem Inf Comput Sci. 2003;43(2):667-73. • Eksterowicz et al. J Mol Graph Model. 2002;20(6):469-77. • Putta et al. J Chem Inf Comput Sci. 2002;42(5):1230-40. • KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
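Ranking binary features by mutual information with the activity labels, as described above, can be sketched as follows. The function names are mine, and this is a minimal illustration for binary features and labels, not the talk's pipeline.

```python
import math

def mutual_information(feature, label):
    """Mutual information (in bits) between a binary feature column
    and binary activity labels, from their empirical distributions."""
    n = len(label)
    joint = {}
    for f, y in zip(feature, label):
        joint[(f, y)] = joint.get((f, y), 0) + 1
    # marginal distributions of the feature and the label
    pf = {v: sum(c for (f, _), c in joint.items() if f == v) / n for v in (0, 1)}
    py = {v: sum(c for (_, y), c in joint.items() if y == v) / n for v in (0, 1)}
    return sum(c / n * math.log2((c / n) / (pf[f] * py[y]))
               for (f, y), c in joint.items())

def top_k_features(matrix, labels, k):
    """Rank feature columns by MI with the labels and keep the top k."""
    mi = [(mutual_information(col, labels), j)
          for j, col in enumerate(zip(*matrix))]
    return [j for _, j in sorted(mi, reverse=True)[:k]]
```

A feature identical to the labels scores 1 bit; an independent one scores 0, so selecting the top 200 columns this way matches the slide's MI filter in spirit.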
Methods: cross validation • 5-fold cross validation
Methods: perceptron • given: binary input vector x, weight vector w, threshold value T, learning rate n, classification t ∈ {+1, -1} • TEST: predict +1 if w·x > T, and -1 otherwise • TRAIN: if classified correctly, do nothing; if misclassified, update w ← w + n·t·x
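The perceptron step can be sketched directly from the definitions above. This assumes the standard perceptron rule (the exact equations appeared as images in the original slides); the function names are mine.

```python
def perceptron_predict(w, x, T):
    """TEST: classify +1 if the weighted sum w.x exceeds threshold T, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > T else -1

def perceptron_update(w, x, t, T, n):
    """TRAIN: if x is misclassified, move the weights by n * t * x;
    if classified correctly, leave them unchanged."""
    if perceptron_predict(w, x, T) != t:
        return [wi + n * t * xi for wi, xi in zip(w, x)]
    return w
```

One mistake-driven update on a positive example pushes the weighted sum above the threshold; a subsequent correct classification leaves the weights untouched.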
Methods: classifier committees • bagging: uniform sampling distribution • boosting: compounds misclassified by classifier #1 are more likely to be resampled by classifier #2
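The two resampling schemes above can be sketched as follows: bagging draws each committee member's training subsample uniformly with replacement, while boosting reweights toward mistakes. The doubling factor is illustrative, not taken from the talk.

```python
import random

def bagging_sample(data, k, rng):
    """Bagging: draw k examples uniformly at random with replacement."""
    return [data[rng.randrange(len(data))] for _ in range(k)]

def boosting_weights(weights, misclassified, factor=2.0):
    """Boosting-style reweighting: examples the previous classifier got
    wrong become `factor` times more likely to be resampled; the
    weights are renormalized to sum to 1."""
    new = [w * factor if m else w for w, m in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]
```

Starting from uniform weights over four compounds, one misclassification shifts its sampling probability from 0.25 to 0.4 while the rest shrink proportionally.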
Methods: weighted voting • a weighted vote of all committee classifiers predicts each compound's activity label
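The weighted majority vote can be sketched in a few lines; a minimal version assuming +1/-1 predictions and per-classifier weights (the weighting scheme itself is not specified on the slide).

```python
def committee_vote(predictions, weights):
    """Each classifier votes +1 (active) or -1 (inactive); votes count
    with the classifier's weight, and the sign of the weighted sum is
    the committee's predicted label."""
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score > 0 else -1
```

With weights (0.2, 0.3, 0.4), two light voters can outvote one heavy voter only when their combined weight is larger, which is exactly the behavior the weighted sum encodes.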
Methods: sample selection strategies • P(active): select compounds predicted active with highest probability by the committee • uncertainty: select compounds on which the committee disagrees most strongly • density with respect to actives: select compounds most similar to previously labeled or predicted actives • Tanimoto similarity metric: given compound bitstrings A and B, with a = # bits on in A, b = # bits on in B, and c = # bits on in both A and B, similarity = c / (a + b - c)
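The Tanimoto metric defined above translates directly to code; a short sketch over 0/1 bit vectors.

```python
def tanimoto(A, B):
    """Tanimoto similarity of two equal-length bit vectors:
    c / (a + b - c), with a, b = bits on in A, B and c = bits on in both."""
    a, b = sum(A), sum(B)
    c = sum(x & y for x, y in zip(A, B))
    return c / (a + b - c)
```

Identical fingerprints score 1.0; sharing 2 of the 3 bits in A and 2 bits in B gives 2 / (3 + 2 - 2) = 2/3.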
Methods: performance criteria • Hit Performance: actives recovered as screening proceeds • Enrichment Factor (EF): hit rate among selected compounds relative to the hit rate in the whole library • Sensitivity: TP / (TP + FN)
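Assuming the standard definitions (the slide's formulas were shown as images), sensitivity and the enrichment factor can be computed as:

```python
def sensitivity(tp, fn):
    """Fraction of all true actives that were recovered: TP / (TP + FN)."""
    return tp / (tp + fn)

def enrichment_factor(hits, n_selected, n_actives, n_compounds):
    """Hit rate among the selected compounds divided by the hit rate
    of the whole library (EF = 1 means no better than random)."""
    return (hits / n_selected) / (n_actives / n_compounds)
```

For example, with the thrombin set's 149 actives out of 632 compounds, a 100-compound selection containing 50 actives would give EF = 0.5 / (149/632) ≈ 2.12 (a hypothetical selection, for illustration only).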
Results: sensitivity • uncertainty sampling achieves the highest testing set sensitivity initially • no significant increase in testing set sensitivity thereafter
Results: bagging vs. boosting • boosting: training set true positives (TP) climb faster and converge higher • but boosting overfits to the training data
Conclusions • sample selection • bagging vs. boosting • committee vs. single classifier • testing set sensitivity • trade-off between exploration and exploitation