Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules

Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules Dr John Mitchell University of St Andrews

1. Solubility is an important issue in drug discovery and a major source of attrition This is expensive for the pharma industry A good model for predicting the solubility of druglike molecules would be very valuable.

How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.

Theoretical Chemistry • Calculations and simulations based on real physics. • Calculations are either quantum mechanical or use parameters derived from quantum mechanics. • Attempt to model or simulate reality. • Usually Low Throughput.

Dataset • The thermodynamically most stable polymorph was selected where possible. • All have experimental crystal structures. • All have experimental logS. • 10 have experimental ΔGsub and ΔGhydr • (circled in red).

CheqSol Method Supersaturated Solution 8 Intrinsic solubility values Subsaturated Solution ● First precipitation – Kinetic Solubility (Not in Equilibrium) ● Thermodynamic Solubility through “Chasing Equilibrium”- Intrinsic Solubility (In Equilibrium) Supersaturation Factor SSF = Skin – S0 In Solution Powder Repeatability better than 0.05 log units ●We continue “Chasing equilibrium” until a specified number of crossing points have been reached ● A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution; no change in pH, gradient zero, no re-dissolving nor precipitating…. SOLUTION IS IN EQUILIBRIUM “CheqSol” * A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5), 979-983

Thermodynamic Cycle

Thermodynamic Cycle Gas Solution Crystal

Sublimation Free Energy Gas Crystal

Sublimation Free Energy Gas (rigid molecule approximation) Crystal

Sublimation Free Energy Gas Crystal Calculating ΔGsub is a standard procedure in crystal structure prediction

Theoretical method for crystal Gsub from lattice energy & a phonon entropy term; DMACRYS using B3LYP/6-31G** multipoles and FIT repulsion-dispersion potential.

Results for ΔGsub Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.

A 46 compound set has a larger error, mostly due to some large outliers. Error statistics vary with dataset.

Thermodynamic Cycle Gas Solution Crystal

Hydration Free Energy We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important etc.

Theoretical method for aqueous solution Ghydr from Reference Interaction Site Model with Universal Correction (3DRISM-KH/UC).

Reference Interaction Site Model (RISM) • Combines features of explicit and implicit solvent models. • Solvent density is modelled, but no explicit molecular coordinates or dynamics. ~45 CPU mins per compound

Reference Interaction Site Model (RISM) Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the Reference Interaction Site Model. Journal of Chemical Physics, 2010. 133(4): p. 044104-11.

Results for ΔGhyd Perhaps surprisingly, error in Ghyd is smaller than in Gsub.

logS from Thermodynamic Cycle Gas Solution Crystal Add the two terms to get ΔGsol and hence logS.

Results for ΔGsol

Conclusions: Solubility from Theory • Must calculate Gsub & Ghyd separately; • RISM is efficient & fairly accurate for Ghyd; • Experimental data for Gsub & Ghyd sparse and errors may be large; • Dataset size and composition make comparisons of methods hard; • Not yet matched accuracy of informatics.

Informatics and Empirical Models • In general, informatics methods represent phenomena mathematically, but not in a physics-based way. • Inputs and output model are based on an empirically parameterised equation or more elaborate mathematical model. • Do not attempt to simulate reality. • Usually High Throughput.

What Error is Acceptable? • For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. • A RMSE > 1.0 logS unit is probably unacceptable. • This corresponds to an error range of 4.0 to 5.7 kJ/mol in Gsol.

What Error is Acceptable? • A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; • The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?

Random Forest Machine Learning Method

Random Forest: Solubility Results RMSE(oob)=0.68 r2(oob)=0.90 Bias(oob)=0.01 RMSE(te)=0.69 r2(te)=0.89 Bias(te)=-0.04 Ntrain = 658; Ntest = 300 DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)

Support Vector Machine

SVM: Solubility Results et al., RMSE(te)=0.94 r2(te)=0.79 Ntrain = 150 + 50; Ntest = 87

100 Compound Cross-Validation Theoretical energies don’t seem to improve descriptor models.

Replicating Solubility Challenge (post hoc) CDK descriptors: RF, RF,PLS, SVM RMSE(te)=1.09; 1.00; 0.89;1.08 r2(te)=0.39; 0.49; 0.58; 0.41 10; 12; 12;13/28 correct within 0.5 logS units Ntrain 94; Ntest 28

Replicating Solubility Challenge (post hoc) CDK descriptors: RF, RF,PLS, SVM Although the test dataset is small, it is a standard set. Ntrain 94; Ntest 28

Conclusions: Solubility from Informatics • Experimental data: errors unknown, but limit possible accuracy of models; • CheqSol - step in right direction; • Dataset size and composition hinder comparisons of methods; • Solubility Challenge – step in right direction.

2. Protein Target Prediction • Which protein does a given molecule bind to? • Virtual Screening • Multiple endpoint drugs - polypharmacology • New targets for existing drugs • Prediction of adverse drug reactions (ADR) • Computational toxicology

Predicted Protein Targets Actually we are predicting closely target-related MDDR classes • Selection of 233 classes from the MDL Drug Data Report • ~90,000 molecules • 15 independent 50%/50% splits into training/test set

Predicted Protein Targets Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)

ProteinTargetPrediction •Givenaspecificcompound,isitpossibletopredict computationally its biologicalinteractions with protein targets? •Veryimportantfor •Insilicoscreening(timeandmoneyefficient) •off-targetprediction(sideeffects) •Canbeusedforidentifyingsubstanceswithperformance- enhancingpotential Drugdiscovery:Predictingpromiscuity,AndrewL.Hopkins,Nature462,167-168(12November2009),doi:10.1038/462167a

SubstancesProhibitedinSports •WADApublishesandmaintainsaprohibitedlistof banned compounds,updatedevery6months •Substancesaresplitintothreemaincategories: Substancesprohibitedatalltimes (inandoutofcompetition) S0.Non-Approvedsubstances S1.AnabolicAgents S2.Peptidehormones,Growth FactorsandRelatedSubstances S3.Beta-2Agonists S4.HormoneAntagonistsand Modulators S5.DiureticsandOtherMasking Agents Substancesprohibitedincompetition S6.Stimulants S7.Narcotics S8.Cannabinoids S9.Glucocorticosteroids Substancesprohibitedinparticular sports P1.Alcoholwithaviolationthreshold of0.10g/L.(Archery,Karateetc) P2.Beta-BlockersprohibitedIn- Competitiononly(Bridge,Curling, Darts,Wrestling,Archeryetc.)

Methodology Stanozolol Database Rank 1 2 3 Class AnabolicAgents VitaminD Glucocorticoids

ChEMBL-Activities •Eachcompoundhas experimentaldatafora numberoftargets •ActivitydatabasedonIC50, EC50,Ki,Kdetc. •Someactivitiesjustlabelled “inactive”or“active” •Eachcompoundcanhave morethanonerecordfora giventarget

FilteringtheCheMBLFamilies •Eachofthe8,845targetshasa numberofcompoundsassigned •Notallcompoundshaveactual dataonthetargetorareactive •Wefiltered eachofthefamilies accordingto rules defining “active” and “inactive” •Therulesweredecidedby visualinspectionofdistributions •Rules •IC50 •≤50000nMactive&>50000nMinactive •Ki •<20000nMactive&≥20000nMinactive •Kd •≤10000nMactive&>10000nMinactive •EC50 •≤40000nMactive&>40000nMinactive •ED50 •≤10000nMactive&>10000nMinactive •Potency •≤10000nMactive&>10000nMinactive •Activity •≥40%active&<40%inactive •Inhibition •≥45%active&<45%inactive

Example: Distributions ofKi Index Index Index

RefinedFamilies • Filteredfamiliesconsist ofcompounds with significant experimentalactivities againstthe relevanttargets. • Manytargetshave distinctgroups ofligandswith differentscaffolds. • Maybebecausethereismore than onebindingsite,orbecause differentscaffoldscanfitthesame site. •Splittingsuchafamilyintosmaller groupsbasedonligandstructurewill allowustoidentifythedifferentsets ofligands. ???

RefinedFamilies-PFClust WeselectedthePFClustalgorithmbecauseitisa parameterfreeclusteringalgorithmanddoesnot requireanykindofparametertuning. PFClust:Anovelparameterfreeclusteringalgorithm.MavridisL,NathN,MitchellJBO.BMCBioinformatics2013,14:213.

DatabaseRefinement Original Families Refined Families •3563 •783690 •19639 •616600 Rule Filtering Database Clustering Database Compounds Compounds 5443 Families 1366460 Compounds Predictingtheproteintargetsforathleticperformance-enhancingsubstances.MavridisL,MitchellJBO.JCheminformatics 2013,5:31.

DatabaseRefinement-Validation •MonteCarloCross-Validation •Thethreeversionsofthedatabase wereexamined(Original,Filteredand Refined) •10%ofeachfamilywererandomly removedandusedasqueries •Ifthetoppredictionwasthefamily thatthequerywasamemberof,aTP wouldbecounted;ifnot,aFP •AverageMatthewsCorrelation Coefficient(MCC) •Original:0.02 •Filtered:0.03 •Refined:0.66 2.58%(6.61%) 66.98%(87.25%) 3.18% (7.21%) TopHit(Topfour)

Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules

Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules

Presentation Transcript

Dependency Parsing: Machine Learning Approaches

New Machine Learning Approaches for Level-1 Trigger

Approaches to learning

Seizure prediction and machine learning

Approaches to learning

Cloud Computing for Chemical Property Prediction

Machine Learning Based Models for Time Series Prediction

An Integrated Machine Learning Approach to Stroke Prediction

Approaches to Machine Translation

Gene Prediction approaches

Machine Learning Algorithms for Protein Structure Prediction

Approaches to Learning

Introduction: statistical and machine learning based approaches to neurobiology

Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules

Disulfide Connectivity Prediction Using Machine Learning Approaches

PROBABILISTIC AND LOGIC APPROACHES TO MACHINE LEARNING AND DATA MINING

Approaches to Learning

Stock Market Prediction Using Machine Learning | Machine Learning Tutorial | Simplilearn