Indiana University Collaboration with Scripps Florida
Rajarshi Guha
IU Chemical Informatics and Cyberinfrastructure Collaboratory
7 February 2007
Project Overview
• Broad goal
  • Develop toxicity prediction models for the screening data generated by Scripps, FL, and make the models easily accessible in multiple ways
• Approach
  • Use a large, curated dataset for training (LeadScope) and investigate multiple model types
  • Handle the unbalanced nature of the problem
  • Consider multiple species
  • Consider multiple modes of administration
  • Deploy models in multiple formats
Data Preprocessing
• Obtained ~102K SMILES from Scripps, FL
• Features
  • Divided into mouse, rat and human LD50 values
  • LeadScope fragment keys were provided
• Our current models consider the mouse data (~46K compounds), due to size issues (see the subsetting sketch below)
• We also calculated 1052-bit BCI fingerprints
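A minimal sketch of the species split in R, assuming a hypothetical tab-delimited export with smiles, species, and toxClass columns (the file name and column names are ours, not the actual Scripps format):

    ## Hypothetical export; the real file layout from Scripps differs.
    tox <- read.delim("scripps_ld50.txt", stringsAsFactors = FALSE)

    ## Split the ~102K records by species and work with the mouse subset.
    bySpecies <- split(tox, tox$species)
    mouse     <- bySpecies[["mouse"]]   # ~46K compounds
    table(mouse$toxClass)               # toxic vs. non-toxic class counts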
Modeling Strategy
• Minimize feature selection
• Investigate oversampling, undersampling, and ensemble approaches to alleviate the unbalanced classification problem
• Take multiple species into account
• Include a measure of applicability
• Suggest alternative models
Random Forests
• Selected random forests because
  • no feature selection is required
  • they are resistant to overfitting
  • variable importance lets us do feature selection for other models
• Initial models used BCI fingerprints (a fitting sketch follows below)
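A sketch of fitting such a model with the randomForest package; since the BCI fingerprints cannot be reproduced here, a synthetic 0/1 matrix stands in for the 1052-bit fingerprints:

    library(randomForest)

    ## Synthetic stand-in for a 1052-bit fingerprint matrix.
    set.seed(42)
    n   <- 500
    fps <- matrix(rbinom(n * 1052, 1, 0.1), nrow = n)
    cls <- factor(sample(c("toxic", "nontoxic"), n, replace = TRUE))

    ## No feature selection needed; the OOB error is an unbiased estimate.
    model <- randomForest(x = fps, y = cls, ntree = 500, importance = TRUE)
    print(model)            # confusion matrix and OOB error rate
    head(importance(model)) # per-bit variable importance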
Random Forest - Results
• Considered all the toxics (1832 compounds) and an equal number of randomly selected non-toxics (undersampling sketch below)
• Overall % correct classification: 85% (out-of-bag)
• Repeated runs (to avoid selection bias) gave similar results
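The balanced training set can be built by undersampling the non-toxics; a sketch, continuing with the synthetic fps and cls objects above:

    ## Keep every toxic compound, sample an equal number of non-toxics.
    toxIdx    <- which(cls == "toxic")
    nontoxIdx <- sample(which(cls == "nontoxic"), length(toxIdx))
    balanced  <- c(toxIdx, nontoxIdx)

    rf <- randomForest(fps[balanced, ], cls[balanced], ntree = 500)
    rf$confusion   # OOB confusion matrix for the balanced model

Repeating the draw of non-toxics with fresh random samples is what guards against selection bias from any single split.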
Random Forest - Results
• We considered an ensemble of 10 models
  • Fixed prediction set
  • Each training set had all the toxics and randomly selected non-toxics
• Average % correct (prediction set): 86%
• Using the majority vote of the ensemble: 87%
• The ensemble also provides a confidence measure (voting sketch below)
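One way to realize the 10-model ensemble, its majority vote, and the vote-based confidence (a sketch reusing the synthetic objects above; the actual construction may differ):

    ## Ten forests, each on a fresh balanced sample of the training data.
    ensemble <- lapply(1:10, function(i) {
      idx <- c(toxIdx, sample(which(cls == "nontoxic"), length(toxIdx)))
      randomForest(fps[idx, ], cls[idx], ntree = 500)
    })

    ## Hypothetical fixed prediction set (synthetic fingerprints again).
    newFps <- matrix(rbinom(50 * 1052, 1, 0.1), nrow = 50)

    ## Majority vote; the vote fraction doubles as a confidence score.
    votes <- sapply(ensemble, function(m) as.character(predict(m, newFps)))
    pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
    conf  <- apply(votes, 1, function(v) max(table(v)) / length(v))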
Naïve Bayes Models
• We also considered a Naïve Bayes model
  • With all 1052 bits, % correct = 74%
• Need to perform feature selection! (sketch below)
  • Get the 100 most important bits from each of the 10 random forest models
  • Take the union of these sets (166 unique bits)
• Using the 166-bit reduced set, % correct = 72%
• We get very similar performance using only 166 of the original 1052 bits (~16% of the fingerprint)
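A sketch of the bit selection and the Naïve Bayes fit, using e1071::naiveBayes and the synthetic ensemble above (variable names are ours):

    library(e1071)

    ## Union of the 100 most important bits from each of the 10 forests.
    topBits <- unique(unlist(lapply(ensemble, function(m) {
      imp <- importance(m)[, "MeanDecreaseGini"]
      order(imp, decreasing = TRUE)[1:100]
    })))

    ## Naïve Bayes on the reduced bit set, with bits treated as factors.
    reduced <- data.frame(lapply(as.data.frame(fps[, topBits]), factor))
    nb      <- naiveBayes(reduced, cls)
    nbPred  <- predict(nb, reduced)
    mean(nbPred == cls)   # resubstitution accuracy, for illustration only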
What Do the 166 Bits Mean?
• Each bit represents a fragment
• Manual analysis indicates that they match previously known toxicity-indicating fragments
• Examples (a matching sketch follows)
  • [*]-;!@[a]:[a]:[a]-;!@[A]-;!@[*]
  • [S,s]!@C!@C!@C!@C!@[C,c]
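Such SMARTS patterns can be checked against a structure with the rcdk package; a sketch (the example SMILES is arbitrary, chosen to contain an acyclic S-C-C-C-C-C chain):

    library(rcdk)

    ## An arbitrary thioether that should match the second pattern above.
    mol <- parse.smiles("CSCCCCC")[[1]]
    matches("[S,s]!@C!@C!@C!@C!@[C,c]", mol)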
LeadScope Keys
• We also considered LeadScope keys
• Since there are ~27K possible keys, we applied a reduction procedure based on frequency of occurrence (sketched below)
  • This gave 966 keys
• Rebuilt the RF and NB (using 150 keys) models
  • RF performance is slightly worse (84% correct)
  • NB performance is slightly better (78% correct)
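A sketch of one frequency-based reduction, assuming the keys arrive as a compounds-by-keys 0/1 matrix; the cutoffs are illustrative, since the actual procedure was not detailed:

    ## Hypothetical 0/1 matrix, one column per LeadScope key.
    keys <- matrix(rbinom(500 * 1000, 1, 0.05), nrow = 500)

    ## Drop keys occurring in almost no or almost all compounds.
    freq <- colMeans(keys)
    keysReduced <- keys[, freq > 0.01 & freq < 0.95]
    ncol(keysReduced)   # the real procedure cut ~27K keys down to 966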
Model Deployment
• Models were developed in R
• Saved to binary format and deployed in the R web service infrastructure
  • The R binary model file can also be downloaded and run locally if desired (see the sketch below)
• Currently, access is via a web service
  • Clients are in progress
  • Accepts a SMILES string and returns a toxicity prediction and a confidence score
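The binary model file is ordinary R serialization, so local use is a save/load pair; a sketch (the file and object names are ours):

    ## Server side: serialize the fitted ensemble to a binary file.
    save(ensemble, file = "scrippsTox.Rda")

    ## Client side: reload the ensemble and predict on new fingerprints.
    load("scrippsTox.Rda")
    votes <- sapply(ensemble, function(m) as.character(predict(m, newFps)))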
R Infrastructure
• Currently provides access to
  • model routines (OLS, CNN, RF)
  • plotting
  • sampling distributions
• We can also provide access to pre-packaged scripts/models
  • pkCalc: a web service interface to a pharmacokinetics program (translated from Matlab to R)
  • ScrippsTox: a web service interface to the RF ensemble model
R Infrastructure
• Some issues remain when handling prebuilt models
  • How to identify descriptors (OWL dictionaries?)
  • How to generically obtain model descriptions, summaries, etc.
What’s Coming ...
• Use multiple species data simultaneously (indicator variables)
• Build individual models for individual species
  • Use similarity to indicate which model may be more suitable
• Develop a web page client to access the deployed model
• Work on model ...
  • description (provenance, comments, stats)
  • details (where to get the descriptors from)