Know More Before You Score: An Analysis of Structure-Based Virtual Screening Protocols
• Structure-Based Virtual Screening (SBVS) is a proven technique for lead discovery
• Still many areas for improvement
• Efforts generally focussed on scoring function
  • Often with little consideration of the assumptions underpinning SBVS
• Here we consider a number of these processes in detail from the perspective of our primary SBVS tool (DOCK)
  • Ligand conformational search protocols
  • Varying site points definitions
  • Alteration of DOCK variables that directly affect sampling
• Determine their impact on hit enrichment and search speed
• Analyze implications for future research

Ligand Flexibility Studies: Strategy
• SBVS is CPU intensive
  • Conformational searching of the ligand is clearly important
  • Sampling is limited to allow search completion in a reasonable time frame
• Test required to compare different conformational sampling methods
  • Ability to reproduce the bioactive conformation tested
• 145 ligands from a 1995 analysis of PDB complexes (Gschwend, UCSF, unpublished)
• 30 compound subset chosen for analysis: selection based on visual and numerical inspection of diversity in ligand flexibility and functionality
• Relatively small sample of molecules used, many peptidic in nature
  • Peptidic moieties are among the better parameterized systems, so this is in some ways a best case scenario

Ligand Flexibility Studies: Procedure
• Multiple sampling techniques chosen: Catalyst-best / Catalyst-fast / Confort / Omega / DOCK
• Variety of sampling levels
• Starting from the Concord structure, conformers generated and superimposed onto the PDB ligand conformation
• Conformation with lowest heavy atom RMS to the PDB ligand conformation used as quality measure

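The quality measure in the procedure above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the conformers have already been superimposed onto the PDB ligand and share its atom order, and the function names and the `(element, (x, y, z))` atom layout are hypothetical.

```python
import math

def heavy_atom_rms(reference, conformer):
    """RMS deviation over heavy (non-hydrogen) atoms.

    Assumes both structures are already superimposed and list the same
    atoms in the same order as (element, (x, y, z)) tuples.
    """
    if len(reference) != len(conformer):
        raise ValueError("structures must contain the same number of atoms")
    squared = []
    for (el_ref, xyz_ref), (el_conf, xyz_conf) in zip(reference, conformer):
        if el_ref != el_conf:
            raise ValueError("atom order mismatch between structures")
        if el_ref == "H":
            continue  # hydrogens are excluded from the quality measure
        squared.append(sum((a - b) ** 2 for a, b in zip(xyz_ref, xyz_conf)))
    if not squared:
        raise ValueError("no heavy atoms to compare")
    return math.sqrt(sum(squared) / len(squared))

def best_conformer(reference, conformers):
    """Return (index, rms) of the conformer closest to the reference."""
    rms_values = [heavy_atom_rms(reference, c) for c in conformers]
    best = min(range(len(rms_values)), key=rms_values.__getitem__)
    return best, rms_values[best]
```

Because hydrogens are skipped, a conformer can score well here while its hydrogen positions still deviate, which is the caveat raised in the visual analysis.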
Ligand Flexibility Studies: Search Settings Employed
• DOCK: conformation_cutoff_factor=3/5/10, clash_overlap=0.7 (times vdW radius), with customized rules for bond increment settings
• Confort: Rough (0.10 kcal) convergence, diverse conformer selection, boat ring search on; sampling at 5/10 confs per single bond + 500 max
• Catalyst-best / Catalyst-fast: default settings; sampling at 5/10 confs per single bond + 100 max
• Omega: defaults + RMS_CUTOFF=1.0, GP_ENERGY_WINDOW=5.0; sampling at 100 max
• In addition, Concord-generated structures and Sybyl-minimized ligand X-ray structures also analyzed as “controls”

Ligand Flexibility Results: The Pain Gain Ratio
• Does extra noise introduced to scoring functions outweigh this improvement?
• Is it worth the extra CPU?

Ligand Flexibility Results: Visual Analysis
[Figure panels: RMS=0.90, RMS=0.65]
• Even at lower RMS, deviation in hydrogen positions is an issue
• As RMS rises (0.9) we begin to see more significant deviations in heavy atom positions, large enough to possibly prove troublesome to standard force fields

Ligand Flexibility Results: Visual Analysis
[Figure panels: RMS=2.19, RMS=1.55]
• As RMS rises further, hydrogen bond mapping begins to partially break down
• Significant deviation begins to be seen, although general shape complementarity is still reasonable
• DOCKing tricky; pharmacophore searches possible with loose tolerances, although site point vector definitions (DISCO / Catalyst) are a no-no

Ligand Flexibility: Conclusions
• At current sampling levels used in virtual screening
  • Rough search techniques perform comparably to more exhaustive methods
  • DOCK performs quite well, and Catalyst-fast does slightly better than the comparable Catalyst-best run
• Results highlight the need for “forgiving” scoring functions and pharmacophore constraint tolerances (especially for flexible molecules)
• Generating function directly from crystal structure data may not be optimum
  • Use the conformation closest to the biologically relevant structure with chosen sampling technique
• May be better to ignore more flexible molecules when possible (~>8 bonds)
• Analysis of a more extensive data set might provide a basis for determining if optimum sampling settings exist (Best/Omega/Confort)
  • Coarseness of poling values, for example

Structure-Based Search Protocols: An Analysis of DOCK
• Working within the current DOCK paradigm, which search protocols provide the optimum search criteria?
  • Site point definitions
  • Alteration of sampling variables
  • Different scoring grids
• Comparisons illustrated for 5 test systems with diverse active data sets
• Analysis based on ranking within a list that includes ~10000 “noise” compounds
  • “Random” selection within bounds of size and flexibility distribution seen in in-house database

Structure-Based Search Protocols: DOCK variables
• Contains many variables that affect performance
• Ligand sampling within the site being the primary variant
  nodes                        3/4
  distance_tolerance           0.5/1.0
  distance_minimum             3.0
  bump_filter                  4
  conformation_cutoff_factor   5
  clash_overlap                0.7
  maximum_orientations         500/5000

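One way to organize the alternatives listed above is to expand the slash-separated values into a grid of candidate protocols. The sketch below is illustrative only: the slides do not state that every combination was run, the helper names are hypothetical, and the "name value" output merely mirrors the listing on the slide rather than a verified DOCK input file.

```python
from itertools import product

# Values taken from the slide; variables with two alternatives are varied.
FIXED = {
    "distance_minimum": 3.0,
    "bump_filter": 4,
    "conformation_cutoff_factor": 5,
    "clash_overlap": 0.7,
}
VARIED = {
    "nodes": [3, 4],
    "distance_tolerance": [0.5, 1.0],
    "maximum_orientations": [500, 5000],
}

def protocol_grid(fixed, varied):
    """Return one settings dict per combination of the varied values."""
    names = list(varied)
    grid = []
    for values in product(*(varied[name] for name in names)):
        settings = dict(fixed)
        settings.update(zip(names, values))
        grid.append(settings)
    return grid

def as_keyword_lines(settings):
    """Format settings as 'name value' lines (layout is an assumption)."""
    return "\n".join(f"{name:<28}{value}" for name, value in settings.items())

if __name__ == "__main__":
    for number, settings in enumerate(protocol_grid(FIXED, VARIED), start=1):
        print(f"# protocol {number}")
        print(as_keyword_lines(settings))
```

Three two-valued variables give eight candidate protocols, each of which can then be benchmarked for hit enrichment and search speed.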
Structure-Based Search Protocols: DOCK and pharmacophoric constraints
• It is possible to assign fairly sophisticated pharmacophoric (henceforth also known as chemical) definitions
Current types: heavy atom, donor, acceptor, hydrophobe, aromatic, aromatic_hydrophobic, acid, base, donor_and_acceptor, special (e.g. metal chelator)

name acid
# deprotonated carboxyl
definition O.co2 ( C )
# tetrazole
definition N.pl3 ( H ) ( N.2 ( N.2 ( N.2 ( C.2 ) ) ) )
definition N.pl3 ( H ) ( N.2 ( N.2 ( C.2 ( N.2 ) ) ) )
definition N.2 ( N.2 ( N.2 ( C.2 ( N.pl3 ( H ) ) ) ) )
definition N.2 ( N.2 ( C.2 ( N.pl3 ( H ) ( N.2 ) ) ) )
definition N.2 ( C.2 ( N.2 ( N.pl3 ( H ) ( N.2 ) ) ) )
definition N.2 ( N.2 ( C.2 ( N.2 ( N.pl3 ( H ) ) ) ) )
definition N.2 ( N.pl3 ( H ) ( N.2 ( N.2 ( C.2 ) ) ) )
# acyl sulphonamide
definition N.am ( S ( 2 O.2 ) ) ( C.2 ( O.2 ) )
definition O.2 ( C.2 ( N.am ( H ) ( S ( 2 O.2 ) ) ) )
definition O.2 ( S ( O.2 ) ( N.am ( H ) ( C.2 ( O.2 ) ) ) )

Structure-Based Search Protocols: Site Points Used in Kinase Search
[Figure labels]
• Region 1 ( + 4): acceptor / donor
• Region 2: Hydrophobic + 2 donors
• Region 3: Hydrophobic / Any heavy atom

Structure-Based Search Protocols: Test Sets and Site Points Used
• Sphgen used to generate site points for “generic” DOCK searches
• Pharmacophore points derived from a mixture of non-data set bound ligands and in-house programs that process GRID maps and Connolly surfaces (plus plenty of human intervention)
• Active data sets broken down into chemotypes to prevent the problem of common analogue bias, an underappreciated issue in all validations

Results - fatty acid binding protein 1
[Chart: No. of hits after 500 compounds processed, from 28 actives / 8 chemotypes]
• 7 chemotypes located by at least one search
• Missing chemotype is a citrazinate: not covered in chemical definitions, easy to fix, another advantage over electrostatics

Results - Overall: Compounds processed for 50% Chemotype Coverage for All Systems
• Addition of critical region constraint alone worsens results
  • 500 orientations per conformer too few for search: leads to premature termination of docking analysis for many ligands
• Generic searches with addition of conformational flexibility show little improvement relative to rigid search
  • Signal to noise issues
• Adding chemical in addition to critical constraints provides best balance for sampling parameters
  • Still requires reasonable tolerances and a forgiving scoring function for optimum results
• Rigid conformer screens perform quite well in generic search mode
  • One system contains predominantly rigid chemotypes; two others require a predominantly extended conformation for binding
  • On addition of critical and chemical constraints, inability of rigid search to adapt to more exacting requirements severely compromises results

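The "compounds processed for 50% chemotype coverage" metric can be sketched as below. This is an assumed reading of the metric, not the authors' script: a chemotype counts as covered once any one of its members appears in the ranked list, and the function name and input layout are hypothetical.

```python
import math

def compounds_for_coverage(ranked_ids, chemotype_of, fraction=0.5):
    """Count how many ranked compounds must be processed before the given
    fraction of chemotypes has at least one member located.

    ranked_ids   -- compound ids, best score first (actives and noise mixed)
    chemotype_of -- maps each active id to its chemotype label
    Returns None if the coverage level is never reached.
    """
    total = len(set(chemotype_of.values()))
    if total == 0:
        raise ValueError("no chemotypes defined")
    needed = math.ceil(fraction * total)
    found = set()
    for processed, compound_id in enumerate(ranked_ids, start=1):
        chemotype = chemotype_of.get(compound_id)
        if chemotype is None:
            continue  # noise compound
        found.add(chemotype)  # further analogues of a found chemotype add nothing
        if len(found) >= needed:
            return processed
    return None

if __name__ == "__main__":
    actives = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D"}
    ranking = ["n1", "a1", "n2", "a2", "n3", "b1", "c1", "n4", "d1"]
    print(compounds_for_coverage(ranking, actives))
```

In the toy ranking, two actives are found within the first four compounds, but both belong to chemotype A, so 50% coverage is only reached at the sixth compound. Counting chemotypes rather than raw hits is what guards against common analogue bias.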
Results: Sample Hit Rate Comparisons
• Kinase sites tend to be highly mobile
  • Forgiving DOCK scoring function more appropriate
• Fatty acid active site deep and fairly rigid
  • Prometheus gives at least comparable performance to DOCK, even with more simplistic constraints

Results: Sample Hit Rate Comparisons
• Illustrates how addition of constraints can allow performance of simplistic scoring functions to surpass those deemed more sophisticated

Results: Sample Hit Rate Comparisons
• Removing highly flexible molecules from the search reduces the noise at the top of the hit list
• In a database of 250000, the top 100 of a ~10000 compound test list (the top 1%) becomes the top 2500
  • Could be crucial when only small data sets can be assayed
• Smaller molecules generally make better leads

Conclusions: The hypothesis hypothesis
• Sampling choices have a profound effect on SBVS results
• For maximum impact with current methodology, scoring functions should either
  • Be designed/utilized with these limitations in mind
    • Forgiving / targeted at less flexible molecules
  • Improve results by such a high degree that additional sampling (and CPU) is warranted
• In the meantime, utility of pharmacophoric hypotheses {critical region(s) with pharmacophoric constraints} is clear
  • Better results faster
  • Less sensitivity to model coarseness
  • Allows constraints exploiting known structural biology
• Key to optimum use is balancing constraints and tolerances to ensure sufficient sampling
  • Benchmarking with known ligands is one way to do this

Acknowledgements
• Thank you to my BMS CADD colleagues