
Indiana University Collaboration with Scripps Florida


Presentation Transcript


1. Indiana University Collaboration with Scripps Florida
Rajarshi Guha
IU Chemical Informatics and Cyberinfrastructure Collaboratory
7 February 2007

2. Project Overview
• Broad goal
  • Develop toxicity prediction models for the screening data generated by Scripps, FL, and make the models easily accessible in multiple ways
• Approach
  • Use a large, curated dataset (LeadScope) for training and investigate multiple model types
  • Handle the unbalanced nature of the classification problem
  • Consider multiple species
  • Consider multiple modes of administration
  • Deploy the models in multiple formats

3. Data Preprocessing
• Obtained ~102K SMILES from Scripps, FL
• Features
  • Divided into mouse, rat and human LD50 values
  • LeadScope fragment keys were provided
• Our current models consider the mouse data (~46K compounds), due to size issues
• We also calculated 1052-bit BCI fingerprints (a stand-in workflow is sketched below)
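
The BCI fingerprint software is proprietary, so the sketch below is an illustration only: it shows the same kind of step with the open-source rcdk package, parsing the SMILES and building a 0/1 compound-by-bit matrix. The file name and the fingerprint type are assumptions, not the BCI keys actually used.

```r
## Illustrative stand-in for the fingerprint step (NOT the BCI 1052-bit keys):
## parse SMILES with rcdk and build a binary compound-by-bit matrix.
library(rcdk)         # SMILES parsing, CDK fingerprints
library(fingerprint)  # fingerprint utilities (fp.to.matrix)

smiles <- readLines("mouse_ld50.smi")              # hypothetical input file
mols   <- parse.smiles(smiles)                     # list of CDK molecule objects
mols   <- mols[!sapply(mols, is.null)]             # drop any unparseable SMILES
fps    <- lapply(mols, get.fingerprint, type = "standard")
X      <- fp.to.matrix(fps)                        # rows = compounds, cols = bits (0/1)
colnames(X) <- paste0("bit", seq_len(ncol(X)))     # name the bits for later models
dim(X)
```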

4. Modeling Strategy
• Minimize feature selection
• Investigate oversampling, undersampling and ensemble approaches to alleviate the unbalanced classification problem (undersampling is sketched below)
• Take multiple species into account
• Include a measure of applicability
• Suggest alternative models
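
A minimal sketch of the undersampling idea, assuming a descriptor matrix X and a two-level label vector y (both names are placeholders): keep every toxic compound and draw an equal-sized random sample of the non-toxics.

```r
## Undersampling sketch: balance the training set by keeping all toxics
## and an equal number of randomly chosen non-toxics.
set.seed(1)
tox.idx    <- which(y == "toxic")
nontox.idx <- sample(which(y == "nontoxic"), length(tox.idx))
train.idx  <- c(tox.idx, nontox.idx)

X.bal <- X[train.idx, ]                 # balanced descriptor matrix
y.bal <- factor(y[train.idx])           # balanced class labels
table(y.bal)                            # equal counts of toxic / nontoxic
```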

5. Random Forests
• Selected random forests because
  • no feature selection is required
  • they are resistant to overfitting
  • variable importance lets us do feature selection for other models
• Initial models used the BCI fingerprints
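
A minimal sketch of this step with the randomForest package, reusing the balanced X.bal and y.bal objects from the sampling sketch above; the settings are assumptions, not the original code.

```r
## Fit a random forest on the balanced fingerprint data; no feature
## selection is needed, and the out-of-bag error gives an internal
## estimate of classification accuracy.
library(randomForest)

rf <- randomForest(x = X.bal, y = y.bal, ntree = 500, importance = TRUE)
rf$confusion                            # out-of-bag confusion matrix

## Variable importance, used later to pick bits for other model types
imp <- importance(rf, type = 2)         # mean decrease in Gini index
head(order(imp, decreasing = TRUE))     # indices of the most important bits
```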

6. Random Forest - Results
• Considered all the toxics (1832 compounds) and an equal number of randomly selected non-toxics
• Overall % correct classification: 85% (out-of-bag)
• Repeated runs (to avoid selection bias) gave similar results

7. Random Forest - Results
• We considered an ensemble of 10 models
  • Fixed prediction set
  • Each training set had all the toxics and randomly selected non-toxics
• Average % correct (prediction set): 86%
• Using the majority vote of the ensemble: 87%
• The ensemble also provides a confidence measure (sketched below)
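
A sketch of the 10-model ensemble under the same assumptions (X, y and a fixed prediction-set matrix X.pred are placeholders): each forest keeps all the toxics plus a fresh random draw of non-toxics, and the ensemble's majority vote supplies both the prediction and a simple confidence score.

```r
## Build 10 forests, each on all toxics + a new random sample of non-toxics.
library(randomForest)
set.seed(2)
ensemble <- lapply(1:10, function(i) {
  idx <- c(which(y == "toxic"),
           sample(which(y == "nontoxic"), sum(y == "toxic")))
  randomForest(X[idx, ], factor(y[idx]), ntree = 500)
})

## Majority vote over the fixed prediction set; the vote fraction acts as
## a confidence measure for each prediction.
votes <- sapply(ensemble, function(m) as.character(predict(m, X.pred)))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
conf  <- apply(votes, 1, function(v) max(table(v)) / length(v))
```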

8. Naïve Bayes Models
• We also considered a Naïve Bayes model
  • With all 1052 bits, % correct = 74%
• Need to perform feature selection! (sketched below)
  • Get the 100 most important bits from each of the 10 random forest models
  • Get the unique set (166 bits)
• Using the 166-bit reduced set, % correct = 72%
• We get very similar performance with ~10% of the original fingerprint
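
A sketch of the feature-selection and Naïve Bayes steps, reusing the ensemble and balanced data from the sketches above; e1071's naiveBayes is used here as a stand-in for whatever implementation was actually used.

```r
## Take the 100 most important bits from each of the 10 forests and keep
## the union of those bits (166 in the talk).
top.bits <- lapply(ensemble, function(m) {
  imp <- importance(m, type = 2)            # mean decrease in Gini
  order(imp, decreasing = TRUE)[1:100]
})
keep <- sort(unique(unlist(top.bits)))

## Fit Naïve Bayes on the reduced fingerprint; bits are converted to factors
## so they are modelled as categorical rather than Gaussian variables.
library(e1071)
X.red <- as.data.frame(lapply(as.data.frame(X.bal[, keep]), factor))
nb    <- naiveBayes(X.red, y.bal)
mean(predict(nb, X.red) == y.bal)           # resubstitution % correct, illustration only
```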

9. What Do the 166 Bits Mean?
• Each bit represents a fragment
• Manual analysis indicates that they match previously known toxicity-indicating fragments
• Examples
  • [*]-;!@[a]:[a]:[a]-;!@[A]-;!@[*]
  • [S,s]!@C!@C!@C!@C!@[C,c]
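
For readers who want to test a structure against one of these fragments, here is a small sketch using the SMARTS matcher in the rcdk package; the test SMILES is an arbitrary example, not a compound from the dataset.

```r
## Check whether a molecule contains the second fragment listed above.
library(rcdk)

mol <- parse.smiles("CSCCCCC(=O)O")[[1]]       # arbitrary example structure
matches("[S,s]!@C!@C!@C!@C!@[C,c]", mol)       # TRUE when the fragment is present
```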

10. LeadScope Keys
• We also considered LeadScope keys
• Since there are ~27K possible keys, we applied a reduction procedure based on frequency of occurrence (sketched below)
  • Gave 966 keys
• Rebuilt the RF and NB (using 150 keys) models
  • RF performance is slightly worse (84% correct)
  • NB performance is slightly better (78% correct)
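
The slide does not give the exact reduction procedure; the sketch below shows one plausible frequency-based filter, assuming a compounds-by-keys 0/1 matrix called keys and illustrative cut-offs.

```r
## Drop LeadScope keys that occur in almost no compounds or in almost all
## of them; only the remaining keys are used for modelling.
freq      <- colSums(keys) / nrow(keys)           # fraction of compounds with each key
keep.keys <- which(freq >= 0.01 & freq <= 0.99)   # cut-offs are illustrative
keys.red  <- keys[, keep.keys]
ncol(keys.red)                                    # the talk reports 966 keys after reduction
```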

11. Model Deployment
• Models were developed in R
• Saved to binary format and deployed in the R web service infrastructure
  • The R binary model file can also be downloaded and run locally if desired
• Currently, access is via a web service
  • Clients are in progress
  • Accepts a SMILES string and returns a toxicity prediction and confidence score
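
A minimal sketch of the R side of this step: the fitted ensemble is written out in R's binary format, and the web service (or a local user) loads it before predicting. The file name is a placeholder.

```r
## Serialize the fitted ensemble to an R binary file for deployment.
save(ensemble, file = "ScrippsTox-rf-ensemble.Rda")

## Later, inside the web service or on a local machine:
load("ScrippsTox-rf-ensemble.Rda")   # restores the `ensemble` object
## fingerprint the incoming SMILES (see the preprocessing sketch) and run
## the majority-vote prediction to return a class and a confidence score
```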

12. R Infrastructure
• Currently provides access to
  • model routines (OLS, CNN, RF)
  • plotting
  • sampling distributions
• We can also provide access to pre-packaged scripts/models
  • pkCalc – web service interface to pharmacokinetics program (translated to R from Matlab)
  • ScrippsTox – web service interface to the RF ensemble model

13. R Infrastructure
• Some issues remain when handling prebuilt models
  • How to identify descriptors (OWL dictionaries?)
  • How to generically obtain model descriptions, summaries, etc.

14. What’s Coming ...
• Use data from multiple species simultaneously (indicator variables, sketched below)
• Build individual models for individual species
  • Use similarity to indicate which model may be more suitable
• Develop a web page client to access the deployed model
• Work on model ...
  • description (provenance, comments, stats)
  • details (where to get the descriptors from)
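
A small sketch of the indicator-variable idea in the first bullet, assuming a factor named species with one entry per row of the descriptor matrix X: one 0/1 column per species is appended so mouse, rat and human records can be pooled into a single model.

```r
## `species` is assumed to be a factor ("mouse", "rat", "human"), one entry
## per compound; model.matrix turns it into 0/1 indicator columns.
X.species <- cbind(X, model.matrix(~ species - 1))
head(colnames(X.species))
```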
