Carolina Exploratory Center for Cheminformatics Research (CECCR)
(pronounced [Sē:kêr])
Focus on improving hit rates of screening campaigns
Funded by p20 HG003898

Carolina Exploratory Center for Cheminformatics Research (CECCR)
(funded by p20 HG003898, 500K direct/2 years)
• Director: Alexander Tropsha, PI, Director (School of Pharmacy)
• UNC: Weifan Zheng (NCCU), Assoc. Director; Sasha Golbraikh (MedChem); Yufeng Liu (OR/Stat); Steve Marron (OR/Stat); Diane Pozefsky (CS); Wei Wang (CS)
• VCU-MCV: Monti Kier; Glen Kellogg
• UIC: Jie Liang
• UT Austin: Bob Pearlman
• Software support group: Scitegic; CCG; OpenEye
• Advisory Group: Yvonne Martin (Abbott; Chair); Jim Wikel (Lilly, ret); Dimitris Agrafiotis (J&J); Robert Pacifici (HiQ); Chris Waller (Pfizer); Marti Head (GSK)
• Consultants: Eric Toone (Duke); Bryan Roth (CWUUNC)

Screening data analysis?
• Why Bother?
• Who Cares?
• And So What?
• we cannot screen every important chemical entity
• screening hit rate is very low
• experimentalists need tools for data handling
• there is a need for accurate predictive models of (historic) data (QSAR analysis) that guide the (future) experiment
• modern QSAR modeling can serve as a decision support tool
• rigorous models afford biological data imputation

Average MLSCN Hit Rates
Note: 4 of 35 assays for NCGC account for 92% of their hits. Without these screens, the hit rate for NCGC = 0.004.

Our approach: Focusing on Validated Predictions, not Statistics or Interpretation
• Input: small SAR dataset; large external database/library
• QSAR Magic
• Output: small number of computational hits
• Real Test
• Large fraction are confirmed actives

Predictive QSAR Modeling Workflow*
• Original Dataset
• Split into Training, Test, and External Validation Sets
• Multiple Training Sets and Multiple Test Sets
• Combi-QSAR Modeling
• Y-Randomization
• Activity Prediction
• Only accept models that have q² > 0.6, R² > 0.6, etc.
• Validated Predictive Models with High Internal & External Accuracy
• External Validation Using Applicability Domain
• Database Screening Using Applicability Domain
• Experimental Validation
*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:… Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.

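The model acceptance step of this workflow (q² > 0.6, R² > 0.6, plus Y-randomization) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the plain kNN regressor, the synthetic two-descriptor dataset, the function names, and the choice to compute R² as the squared Pearson correlation between observed and predicted test-set activities. It is not the CECCR implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query activity as the mean of its k nearest training neighbors."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def q2_loo(X, y, k=3):
    """Leave-one-out cross-validated q2 = 1 - PRESS / total sum of squares."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # hold out compound i
        preds[i] = knn_predict(X[mask], y[mask], X[i:i + 1], k)[0]
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

def r2_external(y_true, y_pred):
    """Assumed definition: squared Pearson correlation on the test set."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

def accept_model(X_train, y_train, X_test, y_test, k=3, threshold=0.6):
    """Accept only if both internal (q2) and test-set (R2) accuracy pass the threshold."""
    q2 = q2_loo(X_train, y_train, k)
    r2 = r2_external(y_test, knn_predict(X_train, y_train, X_test, k))
    return bool(q2 > threshold and r2 > threshold), q2, r2

def y_randomization_q2(X, y, n_rounds=20, k=3, seed=0):
    """Rebuild the model on shuffled activities; a sound model should beat all of these."""
    rng = np.random.default_rng(seed)
    return np.array([q2_loo(X, rng.permutation(y), k) for _ in range(n_rounds)])

# Synthetic data (hypothetical): activity depends linearly on two descriptors plus noise.
rng = np.random.default_rng(42)
X = rng.random((80, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0.0, 0.05, size=80)
X_train, y_train, X_test, y_test = X[:60], y[:60], X[60:], y[60:]

accepted, q2, r2 = accept_model(X_train, y_train, X_test, y_test)
randomized_q2 = y_randomization_q2(X_train, y_train)
print(f"q2={q2:.2f} R2={r2:.2f} accepted={accepted} "
      f"best randomized q2={randomized_q2.max():.2f}")
```

If the q² obtained with shuffled activities approaches the q² of the real model, the apparent fit is likely a chance correlation and the model would be rejected regardless of the thresholds.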
Application of the Predictive QSAR Workflow to Anticonvulsants*
Model building:
• 48 anticonvulsants (tested at NIH)
• ~760 kNN QSAR models
• Acceptance criteria
• 10 best models
Database mining:
• Ca. 255,000 chemicals in DBs; 293,000 chemicals in DBs
• Mining DBs using probes and a similarity cutoff
• 4334 hits
• Predictions with 10 QSAR models using applicability domain
• 50 consensus (common) hits
• 22 compounds submitted to chemists
• 9 compounds selected based on synthetic considerations
• NIH testing
• 7 compounds active
*Shen, M., et al. J. Med. Chem., 2002, 45, 2811-2823; Shen, M., et al. 2004, 47, 2356-2364.

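The screening funnel on this slide (a similarity cutoff against probe compounds, followed by consensus among the accepted models) can be sketched as below. This is a hedged illustration only: the Tanimoto coefficient on bit-set fingerprints is a stand-in similarity measure chosen for brevity, and the compound names, fingerprints, predicted activities, and thresholds are all made up.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity_filter(probes, library, cutoff):
    """Keep library compounds whose best similarity to any probe meets the cutoff."""
    return {
        name for name, fp in library.items()
        if max(tanimoto(fp, probe) for probe in probes) >= cutoff
    }

def consensus_hits(candidates, model_predictions, active_threshold):
    """Keep candidates that every model predicts at or above the activity threshold.

    Each model is a dict of compound name -> predicted activity. A missing entry
    means the compound is outside that model's applicability domain, which
    excludes it from the consensus.
    """
    hits = set()
    for name in candidates:
        preds = [model.get(name) for model in model_predictions]
        if all(p is not None and p >= active_threshold for p in preds):
            hits.add(name)
    return hits

# Hypothetical probes (known actives) and a tiny screening library.
probes = [{1, 2, 3, 4}, {2, 3, 5, 8}]
library = {
    "cmpd_a": {1, 2, 3, 4, 9},   # similarity 0.8 to the first probe
    "cmpd_b": {2, 3, 5},         # similarity 0.75 to the second probe
    "cmpd_c": {10, 11, 12},      # shares no bits with either probe
    "cmpd_d": {1, 2, 3, 8},      # best similarity 0.6, below the cutoff
}
candidates = similarity_filter(probes, library, cutoff=0.7)

# Hypothetical predictions from three accepted models.
models = [
    {"cmpd_a": 7.2, "cmpd_b": 6.9, "cmpd_d": 8.0},
    {"cmpd_a": 7.5, "cmpd_b": 5.1},
    {"cmpd_a": 6.8},             # cmpd_b is outside this model's domain
]
hits = consensus_hits(candidates, models, active_threshold=6.5)
print(sorted(candidates), sorted(hits))
```

Requiring agreement from every model is the strictest form of consensus; it trades recall for a short list of hits that is cheap to test experimentally.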
Application of the Predictive QSAR Workflow to D1 antagonists
• 48 D1 Antagonists
• Divide into Training and Test Sets
• Multiple Training Sets and Multiple Test Sets
• Variable Selection QSAR Models
• Y-Randomization
• Activity Prediction
• Only accept models that have R² (training) > 0.6 and R² (test) > 0.6
• Validated Predictive Models with High Internal & External Accuracy
• Screen 700,000 Compounds
• 54 Hits predicted as D1 Ligands
Oloff, S., Mailman, R.B., and Tropsha, A. J. Med. Chem., 2005, 48, 7322-32.

Database virtual screening hits that had previously been characterized as D1 ligands
Experimental Validation of Database Screening Predictions
Prediction Algorithm | % Models Predicted | Predicted -Log(K0.5) | Std. Dev. of Prediction | Actual -Log(K0.5)
SVM                  | 51                 | 7.5                  | 0.84                    | 5.5
kNN                  | 34                 | 8.1                  | 0.2                     | 6.0

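The "% Models Predicted" and "Std. Dev. of Prediction" columns suggest that each compound is scored only by the models whose applicability domain covers it, and that the reported value is a consensus over those models. A minimal sketch of that bookkeeping follows. The distance-based domain rule (mean nearest-neighbor distance in the training set plus z standard deviations), the function names, and the toy training sets are assumptions for illustration, not details taken from the slide.

```python
import statistics
import numpy as np

def ad_threshold(X_train, z=0.5):
    """Assumed domain cutoff: mean + z * std of nearest-neighbor distances in training."""
    d = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # ignore each compound's distance to itself
    nearest = d.min(axis=1)
    return nearest.mean() + z * nearest.std()

def predict_if_in_domain(X_train, y_train, x, k=2, z=0.5):
    """Return the kNN prediction, or None if x lies outside the applicability domain."""
    dist = np.linalg.norm(X_train - x, axis=1)
    if dist.min() > ad_threshold(X_train, z):
        return None
    nearest = np.argsort(dist)[:k]
    return float(y_train[nearest].mean())

def consensus_summary(predictions):
    """Percent of models that made a prediction, plus mean and sample std of those values."""
    covered = [p for p in predictions if p is not None]
    pct = 100.0 * len(covered) / len(predictions)
    mean = statistics.mean(covered) if covered else None
    sd = statistics.stdev(covered) if len(covered) > 1 else 0.0
    return pct, mean, sd

# Three hypothetical models, each defined by its own training set.
training_sets = [
    (np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.array([5.0, 6.0, 6.0, 7.0])),
    (np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]), np.array([5.0, 7.0, 7.0])),
    (np.array([[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]), np.array([8.0, 8.0, 9.0])),
]
query = np.array([0.2, 0.1])  # close to the first two training sets, far from the third

predictions = [predict_if_in_domain(X, y, query) for X, y in training_sets]
pct, mean, sd = consensus_summary(predictions)
print(predictions, round(pct, 1), mean, round(sd, 3))
```

A low percentage of covering models or a large standard deviation both signal a less reliable consensus value for that compound.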
Application of Predictive QSAR Workflow to HDAC Inhibitors
• 59 HDAC Inhibitors
• Validation Set (9 compounds) and Remaining Subset (50 compounds)
• Multiple Training Sets and Multiple Test Sets
• Variable Selection QSAR Models
• Y-Randomization
• 1385 accepted models that have R² (training) > 0.60 and R² (test) > 0.60
• 70 Validated Predictive Models with High Internal & External Accuracy
• Screen 3,100,000 Compounds
• 27 Hits predicted as HDAC inhibitors
• 4 validated experimentally*
• 2 are µM inhibitors
*collaboration with Bryan Roth, UNC

Experimental validation of HDAC computational hits (data from Bryan Roth’s lab)
Application of Predictive QSAR Workflow to GPR40 Agonists
• 53 GPR40 Agonists
• Training and Test Set: 45 compounds
• External Validation Set: 8 compounds
• Multiple Training Sets and Multiple Test Sets
• Combi-QSAR Modeling
• Y-Randomization
• Only accept models that have both q² > 0.75 and R² > 0.75
• Activity Prediction
• Validation of Predictive Models with High External Accuracy
• Screen 9,500,000 Compounds
• 226 Hits after similarity search
• 48 Multiple Hits Predicted as Potent Agonists

Summary
• Focus on accurate prediction of external datasets is more critical than accurate fitting of existing data
  • validation!!!
  • applicability domain
  • consensus prediction using all acceptable models
  • experimental validation of a small number of computational hits
• Predictive QSAR workflow with extensive validation affords statistically significant models
  • reliable property predictors
  • decision support tools in selecting experimental screening sets
  • biological data imputation (data imputation is the substitution of estimated values for missing or inconsistent data items (fields); the substituted values are intended to create a data record that does not fail edits)

CECCR: Infrastructure for Research, Support, and Training
• Cheminformatics Research:
  • Chemical descriptors
  • QSAR modeling techniques
  • Molecular similarity/diversity metrics
  • Visualization of multidimensional datasets
  • Software development and deployment: C-ChemBench
• Application Research:
  • Predictive model development (QSAR, pharmacophores) for specific end-points, including ADMETox, leading to predictors
  • Virtual screening of chemical libraries using predictors to identify hits
  • Library design
  • Target-specific annotation of compounds in chemical libraries (CECCR-Base)
• Target Audience:
  • Computational chemists: advanced model development tools
  • Chemists: library design
  • Biologists: specialized predictors and annotated compounds for experimental validation