Empirical Validation of the Effectiveness of Chemical Descriptors in Data Mining
Kirk Simmons
DuPont Crop Protection, Stine-Haskell Research Center
1090 Elkton Road, Newark, DE 19711
kirk.a.simmons@usa.dupont.com
The Study
• Purpose
• Strategy
• Methods
• Metrics
• Results
• Practical Application
• Conclusions
Purpose
• Chemical Structure Conference (1996) – Holland
  • Data mining/similarity methodologies reported
  • Used numerous descriptor sets
  • No standard datasets
  • Comparisons difficult
• Comparative study of chemical descriptors across varied biology
Strategy
• Systematically evaluate descriptors within a compound dataset across multiple biological endpoints
• All compounds have experimentally measured endpoints
• Diversity of biological endpoints
  • In-Vitro (receptor affinity, enzyme inhibition)
  • In-Vivo (insect mortality)
• Explored nine common descriptor sets
• Train and then use model to forecast a validation set
Methods
• Four In-Vitro assays
  • 48K compound dataset for training
  • Corporate database for validation
• Two In-Vivo assays
  • 75-100K compound datasets
  • Randomly divided into training and validation subsets
• Recursive Partitioning - analytic method
  • Appropriate method for HTS data
  • Selected statistically conservative inputs (p-tail < 0.01)
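
The workflow on this slide (random training/validation split, then recursive partitioning with a conservative significance cutoff) can be sketched roughly as follows. This is an illustrative sketch, not the study's implementation: the chi-square split test on binary descriptors, the `min_size` stopping rule, all function names, and the synthetic data are assumptions made for the example. Only the p < 0.01 cutoff and the random split come from the slide.

```python
import random

# Chi-square critical value for 1 degree of freedom at p = 0.01.
CHI2_CRITICAL_P01 = 6.635

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def best_split(rows, labels, n_features):
    """Return (feature index, statistic) for the binary descriptor that best separates actives."""
    best = (None, 0.0)
    for j in range(n_features):
        a = sum(1 for r, y in zip(rows, labels) if r[j] and y)          # bit on, active
        b = sum(1 for r, y in zip(rows, labels) if r[j] and not y)      # bit on, inactive
        c = sum(1 for r, y in zip(rows, labels) if not r[j] and y)      # bit off, active
        d = sum(1 for r, y in zip(rows, labels) if not r[j] and not y)  # bit off, inactive
        stat = chi_square_2x2(a, b, c, d)
        if stat > best[1]:
            best = (j, stat)
    return best

def grow_tree(rows, labels, n_features, min_size=25):
    """Partition recursively; stop when the best split is not significant at p < 0.01."""
    node = {"n": len(labels),
            "active_rate": sum(labels) / len(labels) if labels else 0.0}
    if len(labels) < 2 * min_size:
        return node
    j, stat = best_split(rows, labels, n_features)
    if j is None or stat < CHI2_CRITICAL_P01:
        return node
    on = [(r, y) for r, y in zip(rows, labels) if r[j]]
    off = [(r, y) for r, y in zip(rows, labels) if not r[j]]
    if len(on) < min_size or len(off) < min_size:
        return node  # simplification: do not fall back to the next-best descriptor
    node["feature"] = j
    node["on"] = grow_tree([r for r, _ in on], [y for _, y in on], n_features, min_size)
    node["off"] = grow_tree([r for r, _ in off], [y for _, y in off], n_features, min_size)
    return node

def predict(node, row):
    """Follow the splits down to a leaf and return its observed active rate."""
    while "feature" in node:
        node = node["on"] if row[node["feature"]] else node["off"]
    return node["active_rate"]

# Synthetic data for illustration only: 8 binary descriptors, activity driven by bits 0 and 3.
rng = random.Random(0)
rows, labels = [], []
for _ in range(2000):
    bits = [rng.randint(0, 1) for _ in range(8)]
    p_active = 0.6 if bits[0] and bits[3] else 0.02
    rows.append(bits)
    labels.append(1 if rng.random() < p_active else 0)

# Random division into training and validation subsets.
order = list(range(len(rows)))
rng.shuffle(order)
train_idx, valid_idx = order[:1000], order[1000:]
tree = grow_tree([rows[i] for i in train_idx], [labels[i] for i in train_idx], n_features=8)

# Forecast the validation subset and compare the hit rate of the selected compounds to baseline.
selected = [i for i in valid_idx if predict(tree, rows[i]) > 0.3]
hit_rate = sum(labels[i] for i in selected) / max(len(selected), 1)
base_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(f"validation hit rate in selected set: {hit_rate:.2f} (baseline {base_rate:.2f})")
```

Continuous descriptors would need a different split test (for example a t-test on a threshold), which this sketch does not attempt.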
Metrics
• 4-way Interaction
  • Analytic Method, Compound Set, Biology, and Descriptors
• Efficiency of analysis (Lift Chart)
  • Efficiency = Fraction of Actives found / Fraction of Dataset tested
  • Rewards efficiency only
• Effectiveness of analysis (Composite Score)
  • Composite Score = Fraction of Actives found × Efficiency
  • Rewards efficiency as well as completeness
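
The two metrics on this slide can be written out as a small calculation. The function and parameter names below are assumptions chosen for the example, and the compound counts are hypothetical rather than taken from the study.

```python
def efficiency(actives_found, total_actives, compounds_tested, total_compounds):
    """Lift: fraction of actives found divided by fraction of the dataset tested."""
    fraction_actives = actives_found / total_actives
    fraction_tested = compounds_tested / total_compounds
    return fraction_actives / fraction_tested

def composite_score(actives_found, total_actives, compounds_tested, total_compounds):
    """Effectiveness: fraction of actives found multiplied by efficiency."""
    fraction_actives = actives_found / total_actives
    return fraction_actives * efficiency(
        actives_found, total_actives, compounds_tested, total_compounds
    )

# Hypothetical example: 1000 compounds contain 40 actives;
# a model selects 125 compounds, 20 of which are active.
lift = efficiency(20, 40, 125, 1000)        # 0.5 / 0.125 = 4.0
score = composite_score(20, 40, 125, 1000)  # 0.5 * 4.0 = 2.0
print(lift, score)
```

A model that finds a handful of actives in a tiny selection can have a high lift but a low composite score, which is why the slide describes the composite score as rewarding completeness as well as efficiency.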
Practical Application
• RP-based models using screening data on 3 targets
  • Activity treated as active/inactive
  • DiverseSolutions® BCUT descriptors
• RP models used to forecast vendor compounds (1M)
• Selected compounds purchased/screened
• Hit-rates improved 530% over training sets
• New structures and improved activity
Conclusions
• Not all chemical descriptors are equally effective
  • Whole-molecule property-based descriptors less effective
  • Chemical feature-based descriptors appear more effective
• Training model effectiveness
  • Averaged 28% of theory
  • Room for 4-fold improvement
• Validation model effectiveness
  • Averaged 16% of theory
  • Room for 6-fold improvement
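
The "room for improvement" figures follow from the percentages if the headroom is read as the reciprocal of the fraction of theoretical effectiveness achieved. That reading is an assumption, as is the helper name below, but it reproduces the rounded 4-fold and 6-fold values on the slide.

```python
def headroom(fraction_of_theory):
    """Fold improvement still available if a model reaches this fraction of the theoretical maximum."""
    return 1.0 / fraction_of_theory

print(f"training:   {headroom(0.28):.2f}-fold")   # 1 / 0.28, roughly 4-fold
print(f"validation: {headroom(0.16):.2f}-fold")   # 1 / 0.16, roughly 6-fold
```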
Acknowledgements
• Dr. Linrong Yang, FMC Corporation
  • Completed the work
• FMC Corporation
  • Release of the results
• Prof. Peter Willett, University of Sheffield
• Prof. Alex Tropsha, University of North Carolina
• Prof. Doug Hawkins, University of Minnesota
• DuPont Corporation