Active Learning Strategies for Compound Screening
Megon Walker¹ and Simon Kasif¹,²
¹Bioinformatics Program, Boston University
²Department of Biomedical Engineering, Boston University
229th ACS National Meeting, March 13-17, 2005, San Diego, CA
Outline • Introduction to active learning for compound screening • Objectives and performance criteria • Algorithms and procedures • Thrombin dataset results • Preliminary conclusions
Introduction: drug discovery • drug discovery is an iterative process • goal: to identify many target-binding compounds with minimal screening iterations • [Diagram: iterative screening cycle with stages labeled compounds, descriptors, screening, selection]
Introduction: supervised learning • input: data set with positive and negative examples • output: a classifier f such that, for each example x, f(x) = +1 if x is positive and f(x) = -1 if x is negative • standard learning: classifier trains on a static training set; train, then test • active learning: classifier chooses data points for its training set; the classifier "requests" labels; iterative rounds of training and testing
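The iterative "request labels, retrain" loop above can be sketched in Python. This is a minimal illustration, not the talk's implementation; the names (`active_learning`, `select`, `batch_size`) are mine, and the pluggable `select` function stands in for the selection strategies discussed later.

```python
# Sketch of an active-learning loop: grow a labeled training set by
# querying a batch of labels per round from an unlabeled pool.

def active_learning(labels, batch_size, select, rounds):
    """`select` ranks the still-unlabeled indices (a real strategy would
    consult a trained committee; here it is an arbitrary callable)."""
    labeled = {}                                  # index -> queried label
    unlabeled = set(range(len(labels)))
    for _ in range(rounds):
        batch = select(sorted(unlabeled))[:batch_size]
        for i in batch:                           # "request" these labels
            labeled[i] = labels[i]
            unlabeled.discard(i)
        if not unlabeled:
            break
    return labeled

# Trivial strategy (pool order) standing in for P(active)/uncertainty/density:
picked = active_learning([1, -1, 1, -1, 1, -1, 1, -1],
                         batch_size=2, select=lambda idx: idx, rounds=3)
```

After three rounds of two queries each, six of the eight pool compounds have been labeled; a real run would retrain the classifier committee between rounds.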
Introduction: active learning & compound screening • Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998: 1-9. • Warmuth et al. J Chem Inf Comput Sci. 2003;43:667-673. • [Figure: compounds chosen in the 1st and 2nd query rounds]
Objectives • exploration: an accurate model of activity, measured by sensitivity • exploitation: hit performance and enrichment factor (EF)
Methods: datasets • 632 DuPont thrombin-targeting compounds: 149 actives, 483 inactives • a binary feature vector for each compound: shape-based features and pharmacophore features (139,351 features) • retrospective data • 200 features selected by mutual information (MI) w.r.t. activity labels; mean MI = 0.126 • [Flowchart: input data files → pick training and testing data for next round of cross validation → select 1st batch randomly or by chemist; later batches by sample selection (P(active), uncertainty, density) → query training set batch labels → train classifier committee on labeled training set subsamples → predict compound labels by committee weighted majority vote → repeat until all training set labels are queried and cross validation is completed → accuracy and performance statistics] • Warmuth et al. J Chem Inf Comput Sci. 2003;43(2):667-73. • Eksterowicz et al. J Mol Graph Model. 2002;20(6):469-77. • Putta et al. J Chem Inf Comput Sci. 2002;42(5):1230-40. • KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
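Ranking binary features by mutual information with the activity labels, as described above, can be sketched as follows. The function names are mine, and this is a minimal illustration for binary features and labels, not the talk's pipeline.

```python
import math

def mutual_information(feature, label):
    """Mutual information (in bits) between a binary feature column
    and binary activity labels, from their empirical distributions."""
    n = len(label)
    joint = {}
    for f, y in zip(feature, label):
        joint[(f, y)] = joint.get((f, y), 0) + 1
    # marginal distributions of the feature and the label
    pf = {v: sum(c for (f, _), c in joint.items() if f == v) / n for v in (0, 1)}
    py = {v: sum(c for (_, y), c in joint.items() if y == v) / n for v in (0, 1)}
    return sum(c / n * math.log2((c / n) / (pf[f] * py[y]))
               for (f, y), c in joint.items())

def top_k_features(matrix, labels, k):
    """Rank feature columns by MI with the labels and keep the top k."""
    mi = [(mutual_information(col, labels), j)
          for j, col in enumerate(zip(*matrix))]
    return [j for _, j in sorted(mi, reverse=True)[:k]]
```

A feature identical to the labels scores 1 bit; an independent one scores 0, so selecting the top 200 columns this way matches the slide's MI filter in spirit.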
Methods: cross validation • 5-fold cross validation
Methods: perceptron • given: binary input vector x, weight vector w, threshold value T, learning rate n, classification t ∈ {+1, -1} • TEST: predict +1 if w·x > T, and -1 otherwise • TRAIN: if classified correctly, do nothing; if misclassified, update w ← w + n·t·x
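The perceptron step can be sketched directly from the definitions above. This assumes the standard perceptron rule (the exact equations appeared as images in the original slides); the function names are mine.

```python
def perceptron_predict(w, x, T):
    """TEST: classify +1 if the weighted sum w.x exceeds threshold T, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > T else -1

def perceptron_update(w, x, t, T, n):
    """TRAIN: if x is misclassified, move the weights by n * t * x;
    if classified correctly, leave them unchanged."""
    if perceptron_predict(w, x, T) != t:
        return [wi + n * t * xi for wi, xi in zip(w, x)]
    return w
```

One mistake-driven update on a positive example pushes the weighted sum above the threshold; a subsequent correct classification leaves the weights untouched.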
Methods: classifier committees • bagging: uniform sampling distribution • boosting: compounds misclassified by classifier #1 are more likely to be resampled by classifier #2
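The two resampling schemes above can be sketched as follows: bagging draws each committee member's training subsample uniformly with replacement, while boosting reweights toward mistakes. The doubling factor is illustrative, not taken from the talk.

```python
import random

def bagging_sample(data, k, rng):
    """Bagging: draw k examples uniformly at random with replacement."""
    return [data[rng.randrange(len(data))] for _ in range(k)]

def boosting_weights(weights, misclassified, factor=2.0):
    """Boosting-style reweighting: examples the previous classifier got
    wrong become `factor` times more likely to be resampled; the
    weights are renormalized to sum to 1."""
    new = [w * factor if m else w for w, m in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]
```

Starting from uniform weights over four compounds, one misclassification shifts its sampling probability from 0.25 to 0.4 while the rest shrink proportionally.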
Methods: weighted voting • a weighted vote of all committee classifiers predicts each compound's activity label
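The weighted majority vote can be sketched in a few lines; a minimal version assuming +1/-1 predictions and per-classifier weights (the weighting scheme itself is not specified on the slide).

```python
def committee_vote(predictions, weights):
    """Each classifier votes +1 (active) or -1 (inactive); votes count
    with the classifier's weight, and the sign of the weighted sum is
    the committee's predicted label."""
    score = sum(w * p for w, p in zip(weights, predictions))
    return 1 if score > 0 else -1
```

With weights (0.2, 0.3, 0.4), two light voters can outvote one heavy voter only when their combined weight is larger, which is exactly the behavior the weighted sum encodes.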
Methods: sample selection strategies • P(active): select compounds predicted active with highest probability by the committee • uncertainty: select compounds on which the committee disagrees most strongly • density with respect to actives: select compounds most similar to previously labeled or predicted actives • Tanimoto similarity metric: given compound bitstrings A and B, with a = # bits on in A, b = # bits on in B, and c = # bits on in both A and B, similarity = c / (a + b - c)
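The Tanimoto metric defined above translates directly to code; a short sketch over 0/1 bit vectors.

```python
def tanimoto(A, B):
    """Tanimoto similarity of two equal-length bit vectors:
    c / (a + b - c), with a, b = bits on in A, B and c = bits on in both."""
    a, b = sum(A), sum(B)
    c = sum(x & y for x, y in zip(A, B))
    return c / (a + b - c)
```

Identical fingerprints score 1.0; sharing 2 of the 3 bits in A and 2 bits in B gives 2 / (3 + 2 - 2) = 2/3.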
Methods: performance criteria • Hit Performance: actives recovered as screening proceeds • Enrichment Factor (EF): hit rate among selected compounds relative to the hit rate in the whole library • Sensitivity: TP / (TP + FN)
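Assuming the standard definitions (the slide's formulas were shown as images), sensitivity and the enrichment factor can be computed as:

```python
def sensitivity(tp, fn):
    """Fraction of all true actives that were recovered: TP / (TP + FN)."""
    return tp / (tp + fn)

def enrichment_factor(hits, n_selected, n_actives, n_compounds):
    """Hit rate among the selected compounds divided by the hit rate
    of the whole library (EF = 1 means no better than random)."""
    return (hits / n_selected) / (n_actives / n_compounds)
```

For example, with the thrombin set's 149 actives out of 632 compounds, a 100-compound selection containing 50 actives would give EF = 0.5 / (149/632) ≈ 2.12 (a hypothetical selection, for illustration only).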
Results: sensitivity • uncertainty sampling achieves the highest testing set sensitivity initially • no significant increase in testing set sensitivity thereafter
Results: bagging vs. boosting • boosting: training set true positives (TP) climb faster and converge higher • but boosting overfits to the training data
Conclusions • sample selection • bagging vs. boosting • committee vs. single classifier • testing set sensitivity • trade-off between exploration and exploitation