570 likes | 614 Views
Explore theoretical chemistry and informatics methods in predicting aqueous solubility of druglike molecules. Learn about CheqSol method, thermodynamic cycles, and Random Forest machine learning for accurate solubility estimations. Compare theoretical and empirical modeling techniques for better drug discovery outcomes.
E N D
Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules Dr John Mitchell University of St Andrews
1. Solubility is an important issue in drug discovery and a major source of attrition This is expensive for the pharma industry A good model for predicting the solubility of druglike molecules would be very valuable.
How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.
Theoretical Chemistry • Calculations and simulations based on real physics. • Calculations are either quantum mechanical or use parameters derived from quantum mechanics. • Attempt to model or simulate reality. • Usually Low Throughput.
Dataset • The thermodynamically most stable polymorph was selected where possible. • All have experimental crystal structures. • All have experimental logS. • 10 have experimental ΔGsub and ΔGhydr • (circled in red).
CheqSol Method Supersaturated Solution 8 Intrinsic solubility values Subsaturated Solution ● First precipitation – Kinetic Solubility (Not in Equilibrium) ● Thermodynamic Solubility through “Chasing Equilibrium”- Intrinsic Solubility (In Equilibrium) Supersaturation Factor SSF = Skin – S0 In Solution Powder Repeatability better than 0.05 log units ●We continue “Chasing equilibrium” until a specified number of crossing points have been reached ● A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution; no change in pH, gradient zero, no re-dissolving nor precipitating…. SOLUTION IS IN EQUILIBRIUM “CheqSol” * A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5), 979-983
Thermodynamic Cycle Gas Solution Crystal
Sublimation Free Energy Gas Crystal
Sublimation Free Energy Gas (rigid molecule approximation) Crystal
Sublimation Free Energy Gas Crystal Calculating ΔGsub is a standard procedure in crystal structure prediction
Theoretical method for crystal Gsub from lattice energy & a phonon entropy term; DMACRYS using B3LYP/6-31G** multipoles and FIT repulsion-dispersion potential.
Results for ΔGsub Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.
A 46 compound set has a larger error, mostly due to some large outliers. Error statistics vary with dataset.
Thermodynamic Cycle Gas Solution Crystal
Hydration Free Energy We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important etc.
Theoretical method for aqueous solution Ghydr from Reference Interaction Site Model with Universal Correction (3DRISM-KH/UC).
Reference Interaction Site Model (RISM) • Combines features of explicit and implicit solvent models. • Solvent density is modelled, but no explicit molecular coordinates or dynamics. ~45 CPU mins per compound
Reference Interaction Site Model (RISM) Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the Reference Interaction Site Model. Journal of Chemical Physics, 2010. 133(4): p. 044104-11.
Results for ΔGhyd Perhaps surprisingly, error in Ghyd is smaller than in Gsub.
logS from Thermodynamic Cycle Gas Solution Crystal Add the two terms to get ΔGsol and hence logS.
Conclusions: Solubility from Theory • Must calculate Gsub & Ghyd separately; • RISM is efficient & fairly accurate for Ghyd; • Experimental data for Gsub & Ghyd sparse and errors may be large; • Dataset size and composition make comparisons of methods hard; • Not yet matched accuracy of informatics.
Informatics and Empirical Models • In general, informatics methods represent phenomena mathematically, but not in a physics-based way. • Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model. • Do not attempt to simulate reality. • Usually High Throughput.
What Error is Acceptable? • For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. • A RMSE > 1.0 logS unit is probably unacceptable. • This corresponds to an error range of 4.0 to 5.7 kJ/mol in Gsol.
What Error is Acceptable? • A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; • The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?
Random Forest Machine Learning Method
Random Forest: Solubility Results RMSE(oob)=0.68 r2(oob)=0.90 Bias(oob)=0.01 RMSE(te)=0.69 r2(te)=0.89 Bias(te)=-0.04 Ntrain = 658; Ntest = 300 DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)
SVM: Solubility Results et al., RMSE(te)=0.94 r2(te)=0.79 Ntrain = 150 + 50; Ntest = 87
100 Compound Cross-Validation Theoretical energies don’t seem to improve descriptor models.
Replicating Solubility Challenge (post hoc) CDK descriptors: RF, RF,PLS, SVM RMSE(te)=1.09; 1.00; 0.89;1.08 r2(te)=0.39; 0.49; 0.58; 0.41 10; 12; 12;13/28 correct within 0.5 logS units Ntrain 94; Ntest 28
Replicating Solubility Challenge (post hoc) CDK descriptors: RF, RF,PLS, SVM Although the test dataset is small, it is a standard set. Ntrain 94; Ntest 28
Conclusions: Solubility from Informatics • Experimental data: errors unknown, but limit possible accuracy of models; • CheqSol - step in right direction; • Dataset size and composition hinder comparisons of methods; • Solubility Challenge – step in right direction.
2. Protein Target Prediction • Which protein does a given molecule bind to? • Virtual Screening • Multiple endpoint drugs - polypharmacology • New targets for existing drugs • Prediction of adverse drug reactions (ADR) • Computational toxicology
Predicted Protein Targets Actually we are predicting closely target-related MDDR classes • Selection of 233 classes from the MDL Drug Data Report • ~90,000 molecules • 15 independent 50%/50% splits into training/test set
Predicted Protein Targets Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)
ProteinTargetPrediction •Givenaspecificcompound,isitpossibletopredict computationally its biologicalinteractions with protein targets? •Veryimportantfor •Insilicoscreening(timeandmoneyefficient) •off-targetprediction(sideeffects) •Canbeusedforidentifyingsubstanceswithperformance- enhancingpotential Drugdiscovery:Predictingpromiscuity,AndrewL.Hopkins,Nature462,167-168(12November2009),doi:10.1038/462167a
SubstancesProhibitedinSports •WADApublishesandmaintainsaprohibitedlistof banned compounds,updatedevery6months •Substancesaresplitintothreemaincategories: Substancesprohibitedatalltimes (inandoutofcompetition) S0.Non-Approvedsubstances S1.AnabolicAgents S2.Peptidehormones,Growth FactorsandRelatedSubstances S3.Beta-2Agonists S4.HormoneAntagonistsand Modulators S5.DiureticsandOtherMasking Agents Substancesprohibitedincompetition S6.Stimulants S7.Narcotics S8.Cannabinoids S9.Glucocorticosteroids Substancesprohibitedinparticular sports P1.Alcoholwithaviolationthreshold of0.10g/L.(Archery,Karateetc) P2.Beta-BlockersprohibitedIn- Competitiononly(Bridge,Curling, Darts,Wrestling,Archeryetc.)
Methodology Stanozolol Database Rank 1 2 3 Class AnabolicAgents VitaminD Glucocorticoids
ChEMBL-Activities •Eachcompoundhas experimentaldatafora numberoftargets •ActivitydatabasedonIC50, EC50,Ki,Kdetc. •Someactivitiesjustlabelled “inactive”or“active” •Eachcompoundcanhave morethanonerecordfora giventarget
FilteringtheCheMBLFamilies •Eachofthe8,845targetshasa numberofcompoundsassigned •Notallcompoundshaveactual dataonthetargetorareactive •Wefiltered eachofthefamilies accordingto rules defining “active” and “inactive” •Therulesweredecidedby visualinspectionofdistributions •Rules •IC50 •≤50000nMactive&>50000nMinactive •Ki •<20000nMactive&≥20000nMinactive •Kd •≤10000nMactive&>10000nMinactive •EC50 •≤40000nMactive&>40000nMinactive •ED50 •≤10000nMactive&>10000nMinactive •Potency •≤10000nMactive&>10000nMinactive •Activity •≥40%active&<40%inactive •Inhibition •≥45%active&<45%inactive
Example: Distributions ofKi Index Index Index
RefinedFamilies • Filteredfamiliesconsist ofcompounds with significant experimentalactivities againstthe relevanttargets. • Manytargetshave distinctgroups ofligandswith differentscaffolds. • Maybebecausethereismore than onebindingsite,orbecause differentscaffoldscanfitthesame site. •Splittingsuchafamilyintosmaller groupsbasedonligandstructurewill allowustoidentifythedifferentsets ofligands. ???
RefinedFamilies-PFClust WeselectedthePFClustalgorithmbecauseitisa parameterfreeclusteringalgorithmanddoesnot requireanykindofparametertuning. PFClust:Anovelparameterfreeclusteringalgorithm.MavridisL,NathN,MitchellJBO.BMCBioinformatics2013,14:213.
DatabaseRefinement Original Families Refined Families •3563 •783690 •19639 •616600 Rule Filtering Database Clustering Database Compounds Compounds 5443 Families 1366460 Compounds Predictingtheproteintargetsforathleticperformance-enhancingsubstances.MavridisL,MitchellJBO.JCheminformatics 2013,5:31.
DatabaseRefinement-Validation •MonteCarloCross-Validation •Thethreeversionsofthedatabase wereexamined(Original,Filteredand Refined) •10%ofeachfamilywererandomly removedandusedasqueries •Ifthetoppredictionwasthefamily thatthequerywasamemberof,aTP wouldbecounted;ifnot,aFP •AverageMatthewsCorrelation Coefficient(MCC) •Original:0.02 •Filtered:0.03 •Refined:0.66 2.58%(6.61%) 66.98%(87.25%) 3.18% (7.21%) TopHit(Topfour)