Indiana University Collaboration with Scripps Florida
Rajarshi Guha
IU Chemical Informatics and Cyberinfrastructure Collaboratory
7 February 2007
Project Overview
• Broad goal
  • Develop toxicity prediction models for the screening data generated by Scripps, FL, and make the models easily accessible in multiple ways
• Approach
  • Use a large, curated dataset for training (LeadScope) and investigate multiple model types
  • Handle the unbalanced nature of the problem
  • Consider multiple species
  • Consider multiple modes of administration
  • Deploy models in multiple formats
Data Preprocessing
• Obtained ~102K SMILES from Scripps, FL
• Features
  • Divided into mouse, rat and human LD50 values
  • LeadScope fragment keys were provided
• Our current models consider the mouse data (~46K compounds), due to size issues (see the subsetting sketch below)
• We also calculated 1052-bit BCI fingerprints
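A minimal sketch of the species split in R, assuming a hypothetical tab-delimited export with smiles, species, and toxClass columns (the file name and column names are ours, not the actual Scripps format):

    ## Hypothetical export; the real file layout from Scripps differs.
    tox <- read.delim("scripps_ld50.txt", stringsAsFactors = FALSE)

    ## Split the ~102K records by species and work with the mouse subset.
    bySpecies <- split(tox, tox$species)
    mouse     <- bySpecies[["mouse"]]   # ~46K compounds
    table(mouse$toxClass)               # toxic vs. non-toxic class counts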
Modeling Strategy
• Minimize feature selection
• Investigate oversampling, undersampling, and ensemble approaches to alleviate the unbalanced classification problem
• Take multiple species into account
• Include a measure of applicability
• Suggest alternative models
Random Forests
• Selected random forests because
  • no feature selection is required
  • they are resistant to overfitting
  • variable importance lets us do feature selection for other models
• Initial models used BCI fingerprints (a fitting sketch follows below)
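A sketch of fitting such a model with the randomForest package; since the BCI fingerprints cannot be reproduced here, a synthetic 0/1 matrix stands in for the 1052-bit fingerprints:

    library(randomForest)

    ## Synthetic stand-in for a 1052-bit fingerprint matrix.
    set.seed(42)
    n   <- 500
    fps <- matrix(rbinom(n * 1052, 1, 0.1), nrow = n)
    cls <- factor(sample(c("toxic", "nontoxic"), n, replace = TRUE))

    ## No feature selection needed; the OOB error is an unbiased estimate.
    model <- randomForest(x = fps, y = cls, ntree = 500, importance = TRUE)
    print(model)            # confusion matrix and OOB error rate
    head(importance(model)) # per-bit variable importance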
Random Forest - Results
• Considered all the toxics (1832 compounds) and an equal number of randomly selected non-toxics (undersampling sketch below)
• Overall % correct classification: 85% (out-of-bag)
• Repeated runs (to avoid selection bias) gave similar results
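The balanced training set can be built by undersampling the non-toxics; a sketch, continuing with the synthetic fps and cls objects above:

    ## Keep every toxic compound, sample an equal number of non-toxics.
    toxIdx    <- which(cls == "toxic")
    nontoxIdx <- sample(which(cls == "nontoxic"), length(toxIdx))
    balanced  <- c(toxIdx, nontoxIdx)

    rf <- randomForest(fps[balanced, ], cls[balanced], ntree = 500)
    rf$confusion   # OOB confusion matrix for the balanced model

Repeating the draw of non-toxics with fresh random samples is what guards against selection bias from any single split.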
Random Forest - Results
• We considered an ensemble of 10 models
  • Fixed prediction set
  • Each training set had all the toxics and randomly selected non-toxics
• Average % correct (prediction set): 86%
• Using the majority vote of the ensemble: 87%
• The ensemble also provides a confidence measure (voting sketch below)
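One way to realize the 10-model ensemble, its majority vote, and the vote-based confidence (a sketch reusing the synthetic objects above; the actual construction may differ):

    ## Ten forests, each on a fresh balanced sample of the training data.
    ensemble <- lapply(1:10, function(i) {
      idx <- c(toxIdx, sample(which(cls == "nontoxic"), length(toxIdx)))
      randomForest(fps[idx, ], cls[idx], ntree = 500)
    })

    ## Hypothetical fixed prediction set (synthetic fingerprints again).
    newFps <- matrix(rbinom(50 * 1052, 1, 0.1), nrow = 50)

    ## Majority vote; the vote fraction doubles as a confidence score.
    votes <- sapply(ensemble, function(m) as.character(predict(m, newFps)))
    pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
    conf  <- apply(votes, 1, function(v) max(table(v)) / length(v))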
Naïve Bayes Models
• We also considered a Naïve Bayes model
  • With all 1052 bits, % correct = 74%
• Need to perform feature selection! (sketch below)
  • Get the 100 most important bits from each of the 10 random forest models
  • Take the union of these sets (166 unique bits)
• Using the 166-bit reduced set, % correct = 72%
• We get very similar performance using only 166 of the original 1052 bits (~16% of the fingerprint)
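A sketch of the bit selection and the Naïve Bayes fit, using e1071::naiveBayes and the synthetic ensemble above (variable names are ours):

    library(e1071)

    ## Union of the 100 most important bits from each of the 10 forests.
    topBits <- unique(unlist(lapply(ensemble, function(m) {
      imp <- importance(m)[, "MeanDecreaseGini"]
      order(imp, decreasing = TRUE)[1:100]
    })))

    ## Naïve Bayes on the reduced bit set, with bits treated as factors.
    reduced <- data.frame(lapply(as.data.frame(fps[, topBits]), factor))
    nb      <- naiveBayes(reduced, cls)
    nbPred  <- predict(nb, reduced)
    mean(nbPred == cls)   # resubstitution accuracy, for illustration only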
What Do the 166 Bits Mean?
• Each bit represents a fragment
• Manual analysis indicates that they match previously known toxicity-indicating fragments
• Examples (a matching sketch follows)
  • [*]-;!@[a]:[a]:[a]-;!@[A]-;!@[*]
  • [S,s]!@C!@C!@C!@C!@[C,c]
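Such SMARTS patterns can be checked against a structure with the rcdk package; a sketch (the example SMILES is arbitrary, chosen to contain an acyclic S-C-C-C-C-C chain):

    library(rcdk)

    ## An arbitrary thioether that should match the second pattern above.
    mol <- parse.smiles("CSCCCCC")[[1]]
    matches("[S,s]!@C!@C!@C!@C!@[C,c]", mol)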
LeadScope Keys
• We also considered LeadScope keys
• Since there are ~27K possible keys, we applied a reduction procedure based on frequency of occurrence (sketched below)
  • This gave 966 keys
• Rebuilt the RF and NB (using 150 keys) models
  • RF performance is slightly worse (84% correct)
  • NB performance is slightly better (78% correct)
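A sketch of one frequency-based reduction, assuming the keys arrive as a compounds-by-keys 0/1 matrix; the cutoffs are illustrative, since the actual procedure was not detailed:

    ## Hypothetical 0/1 matrix, one column per LeadScope key.
    keys <- matrix(rbinom(500 * 1000, 1, 0.05), nrow = 500)

    ## Drop keys occurring in almost no or almost all compounds.
    freq <- colMeans(keys)
    keysReduced <- keys[, freq > 0.01 & freq < 0.95]
    ncol(keysReduced)   # the real procedure cut ~27K keys down to 966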
Model Deployment
• Models were developed in R
• Saved to binary format and deployed in the R web service infrastructure
  • The R binary model file can also be downloaded and run locally if desired (see the sketch below)
• Currently, access is via a web service
  • Clients are in progress
  • Accepts a SMILES string and returns a toxicity prediction and a confidence score
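The binary model file is ordinary R serialization, so local use is a save/load pair; a sketch (the file and object names are ours):

    ## Server side: serialize the fitted ensemble to a binary file.
    save(ensemble, file = "scrippsTox.Rda")

    ## Client side: reload the ensemble and predict on new fingerprints.
    load("scrippsTox.Rda")
    votes <- sapply(ensemble, function(m) as.character(predict(m, newFps)))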
R Infrastructure
• Currently provides access to
  • model routines (OLS, CNN, RF)
  • plotting
  • sampling distributions
• We can also provide access to pre-packaged scripts/models
  • pkCalc: a web service interface to a pharmacokinetics program (translated from Matlab to R)
  • ScrippsTox: a web service interface to the RF ensemble model
R Infrastructure
• Some issues remain when handling prebuilt models
  • How to identify descriptors (OWL dictionaries?)
  • How to generically obtain model descriptions, summaries, etc.
What’s Coming ...
• Use multiple species data simultaneously (indicator variables)
• Build individual models for individual species
  • Use similarity to indicate which model may be more suitable
• Develop a web page client to access the deployed model
• Work on model ...
  • description (provenance, comments, stats)
  • details (where to get the descriptors from)