Measuring Molecular Diversity: A Shell Game?

PARKE-DAVIS
Molecular Diversity - A Shell Game?
Experiments in Measuring Molecular Diversity
C. John Blankley
Daylight User Group Meeting MUG ’97
Laguna Beach, CA
February 25-28, 1997
Parke-Davis Pharmaceutical Research
Division of Warner Lambert Company
2800 Plymouth Road
Ann Arbor, MI 48105

Basic Concepts
What do we mean by molecular diversity?
Structural Diversity
• templates / scaffolds / backbones
• functional groups / fragments
• bridges / bioisosteres
• aromatic / aliphatic
• geometry (shape, chirality, connectivity, spatial disposition)
Property Diversity
• lipophilicity
• acid/base
• H-bonding
• dipolarity
• charge
• size

Basic Concepts (cont.)
Parameters/metrics/descriptors: continuous, discrete, categorical
Structural Descriptors
• topological indices
• molecular fingerprints
• atom / group / fragment counts
• molecular dimensions (volume, area, moments)
• distances between key groups (atom pairs, pharmacophores)
Property Descriptors
• log P
• pKa
• molecular orbital indices
• charge
• spectroscopic data
• molecular fields
Composite descriptors (principal properties)
Similarity/dissimilarity metrics

Issues in diversity
• basis for comparison
• perspective
  • macro or micro
  • expansionary or inclusionary: expanding space, filling holes, increasing density
• biologically relevant vs. chemical - correlative
• how much diversity is necessary, possible, desirable (random, bias)
• concordance between quantitative and qualitative notions
• tailor to information available and purpose required

Types of structural diversity
• “Global” or macro diversity
  implies neither consistent nor significant similar features
• “Local” or micro diversity (variety)
  implies one or more consistent common features
  • template or scaffold
  • common functional group

Small (< ca. 500) datasets of practical interest
• Building block datasets (functional group based)
• Combinatorial arrays (template based)
FAQs:
• select a diverse subset(s) for screening
• is one dataset “more diverse” than another
• what subset will represent the diversity of a library
• what compounds will increase/extend the diversity of an existing set

Other small datasets
• SAR datasets
  • defined by potency/selectivity
  • defined by activity for an enzyme/receptor family or subtype
• Benchmark datasets
  • miscellaneous drugs
  • miscellaneous chemical compounds
  • 20 natural amino acids
  • 400 natural dipeptides

Questions of a structural diversity measure
• How does it perform for different “types” of diversity? (suitability; sensitivity)
• How does it accord with measures of property diversity?
• Is there a “saturation” effect?
• How does it behave on partition or combination of datasets?
• How can it be validated?
• How does it accord with chemists’ visual perceptions?
• Can it capture unperceived aspects of diversity?

Questions for a given dataset
• Which class?
• Extent of common feature in dataset
• Variation around common feature
  • template variation
  • appendage variation
• Outliers and their influence

Possibilities for quantitation
• Statistics on bit counts
  • univariate measures
  • comparisons to “mean”; e.g., modal
• Statistics on fragment counts
• Statistics on dissimilarities
  • all or partial pairwise
  • comparison to “mean”; e.g., modal, centroid
• Parametric on topological indices

Some proposed database diversity measures
• mean pairwise dissimilarity (Willett et al.)
• “self-similarity” (Tripos)
  (mean similarity to 1st nearest neighbor)
• maximum # bits set (E. Martin et al.)
  (union bit set or modal fingerprint @ t = 0)
• “diversity density” (E. Martin et al.)
  (# bits per molecular mass)
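Three of the measures listed above can be illustrated on toy data. This is a minimal sketch, not the cited authors' implementations: fingerprints are assumed to be Python sets of on-bit positions, similarity is assumed to be Tanimoto, and all function names are hypothetical.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def mean_pairwise_dissimilarity(fps):
    """Average of (1 - Tanimoto) over all distinct pairs."""
    pairs = list(combinations(fps, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

def self_similarity(fps):
    """Mean similarity of each compound to its first nearest neighbor."""
    nearest = [
        max(tanimoto(fp, other) for j, other in enumerate(fps) if j != i)
        for i, fp in enumerate(fps)
    ]
    return sum(nearest) / len(nearest)

def union_bit_count(fps):
    """Number of distinct bits set anywhere in the dataset."""
    return len(set().union(*fps))

# Toy fingerprints (assumed data, for illustration only)
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 6, 7, 8}]
mpd = mean_pairwise_dissimilarity(fps)
ss = self_similarity(fps)
nbits = union_bit_count(fps)
print(round(mpd, 4), round(ss, 4), nbits)
```

A dataset with a high mean pairwise dissimilarity can still have a high self-similarity if its compounds come in tight pairs, which is one reason the two numbers are worth reporting together.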

Stigmata
• extracts similarity of bit strings with flexible threshold
• low bits at high stringency - not much in common
• high bits at low stringency - much variety
• plateau bits at intermediate stringencies - significant common element
• similarity to common element - large range signals diversity of dataset

Diversity measures and modal fingerprints
• modal fingerprint  degree of similarity
• existing metrics in Stigmata (Daylight fingerprints)
  • modp, msim; alab; Rminfp, Rmaxfp
• new metrics
  • alab_av (regional or partial similarity(?)); Rfp = Rmaxfp / Rminfp

Extend threshold analysis
• t ≈ 0 (2 compounds)
• maximal modal (# non-unique bits) (≤ total # bits set)
• concept of maximal, median and minimal modal
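The threshold idea can be sketched as follows. This assumes a modal fingerprint keeps every bit that is set in at least a fraction t of the compounds, with a floor of two compounds so that the lowest threshold yields the non-unique bits. That definition and the function name are assumptions made for illustration, not the Stigmata implementation.

```python
from collections import Counter
import math

def modal_fingerprint(fps, t):
    """Bits set in at least a fraction t of the fingerprints (assumed definition).

    A floor of two compounds is applied, so at t near 0 only bits shared by
    at least two compounds (the non-unique bits) are kept.
    """
    counts = Counter(bit for fp in fps for bit in fp)
    needed = max(2, math.ceil(t * len(fps)))
    return {bit for bit, c in counts.items() if c >= needed}

# Toy fingerprints as sets of on-bit positions (assumed data)
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 6, 7},
    {1, 8, 9, 10},
    {1, 2, 11, 12},
]
maximal = modal_fingerprint(fps, 0.0)  # non-unique bits
median = modal_fingerprint(fps, 0.5)   # bits in at least half the compounds
minimal = modal_fingerprint(fps, 1.0)  # bits common to every compound
print(len(maximal), len(median), len(minimal))
```

The maximal modal can never hold more bits than the union of all fingerprints, matching the bound noted on the slide.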

Relative vs. absolute
• Rminfp (Rifp(Rm)) = f(modp, msim)
  1/Rm = modp + modp/msim - 1
• # bits set for modal fp
  bcom = #bits(i) x Rm(i)
• Thus:
  • maxbcom = modal # bits @ t ≈ 0
  • medbcom = modal # bits @ t = 0.5
  • minbcom = modal # bits @ t = 1.0
  • Rtmax = maxbcom / minbcom
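The relation 1/Rm = modp + modp/msim - 1 can be checked numerically under one reading of the symbols. The sketch below assumes modp is the fraction of modal bits present in compound i, msim is the Tanimoto similarity between compound i and the modal fingerprint, and Rm is the ratio of modal bits to compound bits. These definitions are assumptions, not taken from the Stigmata documentation. With them, bcom = #bits(i) x Rm(i) recovers the size of the modal fingerprint.

```python
def modal_stats(compound, modal):
    """Return (modp, msim) for one compound against a modal fingerprint.

    Assumed definitions: modp = common / modal bits,
    msim = common / union (Tanimoto). Requires at least one common bit.
    """
    common = len(compound & modal)
    modp = common / len(modal)
    msim = common / len(compound | modal)
    return modp, msim

def rm_from_stats(modp, msim):
    """Invert 1/Rm = modp + modp/msim - 1."""
    return 1.0 / (modp + modp / msim - 1.0)

# Toy example (assumed data): 6 compound bits, 4 modal bits, 3 in common
compound = {1, 2, 3, 4, 5, 6}
modal = {1, 2, 3, 7}
modp, msim = modal_stats(compound, modal)
rm = rm_from_stats(modp, msim)
bcom = len(compound) * rm  # modal bit count recovered from per-compound values
print(modp, round(msim, 4), round(rm, 4), round(bcom, 4))
```

Because bcom is the same for every compound in a dataset at a given threshold, it can be computed from any single compound that shares at least one bit with the modal fingerprint.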

Other measures derived from bits or similarities
• Average # bits
• Mean similarities (msim) at t ≈ 0, 0.5, 1.0
• fraction of pairwise similarities > 0.85 or < 0.50
• mean distance from dataset “centroid”
• standard deviations or coefficients of variation
• normalize by
  • dataset size
  • average bits
  • molecular mass
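Two of these measures, the fraction of pairwise similarities beyond a cutoff and the coefficient of variation of bit counts, are simple enough to sketch. Fingerprints are again assumed to be sets of on-bit positions with Tanimoto similarity; the function names and toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def fraction_above(values, cutoff):
    """Share of values strictly greater than cutoff."""
    return sum(v > cutoff for v in values) / len(values)

def fraction_below(values, cutoff):
    """Share of values strictly less than cutoff."""
    return sum(v < cutoff for v in values) / len(values)

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

# Toy fingerprints (assumed data): two near-duplicates and one outlier
fps = [set(range(1, 9)), set(range(1, 8)), {1, 2, 3, 4, 20, 21, 22, 23}]
sims = [tanimoto(a, b) for a, b in combinations(fps, 2)]
high = fraction_above(sims, 0.85)
low = fraction_below(sims, 0.50)
cv_bits = coefficient_of_variation([len(fp) for fp in fps])
print(round(high, 3), round(low, 3), round(cv_bits, 4))
```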

dataset          type1  type2     N   av_mwt  source
BIOLOGICAL
kappa            act    p        59   409.46  CIPSLINE
nonkappa         act    n        97   441.95  CIPSLINE
pipopiate        act    p        48   382.84
opiate           act    p        32   383.37
piperidine       act    SS       18   374.76
peptide_op       act    p        24   612.41
D2_ag            act    p        33   249.94  Seeman et al.
D2_antag         act    n        25   373.97  Seeman et al.
renin_hisleu     act    SS      112   752.66  CIPSLINE
ci976            sar    p        74   410.47  Roth et al.
acathet          sar    p        41   453.82  White et al.
BUILDING BLOCK
bbd_phnco        bbd    T       129   185.51  ACD
bbd_arncs        bbd    SS      202   216.40  ACD
aa20             bbd    T        20   136.92
bbd_allaa        bbd    T       651   218.63  ACD
COMBINATORIAL
dipeptides       comb   T       400   255.83
dhydantoin       comb   T        40   242.47  deWitt et al.
benzodiaz1       comb   SS       40   312.37  deWitt et al.
benzodiaz2       comb   SS      160   382.43  Ellman et al.
REFERENCE
newtopdrugs_95   misc   n        56   395.35  Med. Ad. News
introdrugs_9295  misc   n       141   403.51  Ann. Repts. Med. Chem.
ACDrandom2       misc   n        51   181.29  ACD
ACDrandom3       misc   n        51   221.37  ACD

                 -------- modal bit # --------   ------ compound bit # ------
dataset             max  median    min   Rtmax     min    max     mean    Rfp
BIOLOGICAL
kappa              1052     169     43   24.47     163    734   249.30   4.50
nonkappa           1467     164      7  209.33      88    689   305.42   7.83
pipopiate          1219     276     47   25.94     138    689   378.63   4.99
opiate              451     404     88   11.69     164    689   439.71   4.20
piperidine          570     169     75    7.60     138    368   249.61   2.67
peptide_op          425     170     40   10.63      88    329   206.96   3.74
D2_ag               606     145     43   14.09      80    390   203.82   4.87
D2_antag            841     115     33   25.48     165    327   235.01   1.98
renin_hisleu        998     278    179    5.58     249    466   337.04   1.87
ci976               371     138     68    5.46     117    222   152.50   1.90
acathet             695     195     93    7.47     167    303   245.80   1.81
BUILDING BLOCK
bbd_phnco           433      63     61    7.10      61    136    92.48   2.23
bbd_arncs           883      64     45   19.64      61    286   107.37   4.69
aa20                155      43     34    4.56      34    171    67.00   5.03
bbd_allaa          1785      55     34   52.50      34    429   127.10  12.62
COMBINATORIAL
dipeptides          448      78     53    8.45      53    269   126.53   5.08
dhydantoin          437     129     71    6.15      71    323   176.70   4.55
benzodiaz1          586     220    183    3.20     208    410   284.06   1.97
benzodiaz2          533     294    222    2.40     229    431   308.92   1.88
REFERENCE
newtopdrugs_95     1629      80      4  407.75      52    640   254.85  12.31
introdrugs_9295    1983      84      4  495.75      44    709   265.80  16.11
ACDrandom2          592      19      0   >1000      23    166    73.75   7.22
ACDrandom3         1067      38      0   >1000      23    497   130.08  21.61
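The two ratio columns can be reproduced from the bit counts for a few rows. This is a sketch under two assumptions: Rtmax is taken as maximal modal bits over minimal modal bits, as defined earlier, and Rfp is taken as the largest compound bit count over the smallest, which is consistent with the tabulated values. The function names are hypothetical.

```python
def rtmax(modal_max, modal_min):
    """maxbcom / minbcom; infinite when no bit is common to all compounds."""
    return modal_max / modal_min if modal_min else float("inf")

def rfp(compound_min, compound_max):
    """Ratio of the largest to the smallest compound bit count (assumed reading of Rfp)."""
    return compound_max / compound_min

# (modal max, modal min, compound min, compound max), copied from the table
rows = {
    "kappa": (1052, 43, 163, 734),
    "benzodiaz2": (533, 222, 229, 431),
    "ACDrandom2": (592, 0, 23, 166),
}

for name, (m_max, m_min, c_min, c_max) in rows.items():
    print(name, round(rtmax(m_max, m_min), 2), round(rfp(c_min, c_max), 2))
```

For the random ACD sets the minimal modal fingerprint is empty, so the ratio is unbounded; the table reports these cases as >1000.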

                 ------ average msim ------   centroid
dataset           t ≈ 0   t = 0.5    t = 1    mean distance
BIOLOGICAL
kappa              0.23      0.56     0.18     0.34
nonkappa           0.21      0.42     0.03     0.48
pipopiate          0.31      0.50     0.15     0.40
opiate             0.42      0.62     0.24     0.26
piperidine         0.39      0.54     0.32     0.36
peptide_op         0.45      0.68     0.21     0.23
D2_ag              0.32      0.55     0.25     0.35
D2_antag           0.26      0.34     0.15     0.53
renin_hisleu       0.33      0.75     0.54     0.12
ci976              0.40      0.75     0.45     0.14
acathet            0.34      0.64     0.38     0.25
BUILDING BLOCK
bbd_phNCO          0.21      0.68     0.67     0.20
bbd_arNCS          0.12      0.61     0.45     0.28
aa20               0.35      0.70     0.60     0.25
bbd_allaa          0.07      0.41     0.35     0.49
COMBINATORIAL
dipeptides         0.28      0.62     0.48     0.27
dhydantoin         0.40      0.58     0.47     0.28
benzodiaz1         0.48      0.76     0.66     0.12
benzodiaz2         0.58      0.81     0.73     0.07
REFERENCE
newtopdrugs_95     0.15      0.23     0.02     0.68
introdrugs_9295    0.22      0.24     0.04     0.68
ACDrandom2         0.11      0.19     0.00     0.74
ACDrandom3         0.11      0.20     0.00     0.75

dataset             mps   ss_nn1    mdd   mfrgs_BCI
BIOLOGICAL
kappa              0.45     0.83   0.62      117.69
nonkappa           0.33     0.77   0.74      110.47
pipopiate          0.40     0.82   0.98      122.52
opiate             0.53     0.86   1.13      133.13
piperidine         0.45     0.72   0.66      229.94
peptide_op         0.57     0.79   0.35      277.92
D2_ag              0.46     0.88   0.80       78.48
D2_antag           0.32     0.71   0.64      195.64
renin_hisleu       0.67     0.90   0.45      112.26
ci976              0.67     0.94   0.37       48.01
acathet            0.55     0.89   0.54      120.10
BUILDING BLOCK
bbd_phnco          0.58     0.88   0.51       23.87
bbd_arncs          0.49     0.89   0.50       32.87
aa20               0.57     0.81   0.48       78.05
bbd_allaa          0.31     0.87   0.57       21.79
COMBINATORIAL
dipeptides         0.51     0.96   0.49        7.77
dhydantoin         0.52     0.92   0.72       56.83
benzodiaz1         0.68     0.94   0.92       59.68
benzodiaz2         0.75     0.98   0.81       16.64
REFERENCE
newtopdrugs_95     0.19     0.48   0.72      238.73
introdrugs_9295    0.19     0.50   0.70      142.89
ACDrandom2         0.16     0.49   0.44      115.45
ACDrandom3         0.15     0.38   0.58      156.53

Principal Components (n = 23, k = 18)
EigenValue:    9.19    4.35    1.85    1.15    0.44
Percent:      51.04   24.14   10.27    6.40    2.44
CumPercent:   51.04   75.18   85.45   91.86   94.30
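The percentages follow from the eigenvalues if the analysis was run on the correlation matrix of the k = 18 variables, so that the total variance equals k. That is an assumption; the slide does not state it. A short check with the rounded eigenvalues reproduces the tabulated percentages to within rounding.

```python
eigenvalues = [9.19, 4.35, 1.85, 1.15, 0.44]  # from the slide, rounded
k = 18  # number of variables; total variance of a correlation matrix (assumed)

# Percent of variance explained by each component
percent = [100.0 * ev / k for ev in eigenvalues]

# Running total of explained variance
cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

for ev, p, c in zip(eigenvalues, percent, cumulative):
    print(f"{ev:5.2f}  {p:6.2f}  {c:6.2f}")
```

The first four components have eigenvalues above 1 and together cover about 92% of the variance, which matches the four columns of the rotated factor pattern.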

Rotated Factor Pattern (dissimilarities and ln)
Variable      Factor 1  Factor 2  Factor 3  Factor 4
N                -0.04     -0.12     -0.95     -0.06
av_mwt            0.16      0.12      0.17      0.94
Rfp              -0.84      0.06     -0.21     -0.14
ln_Rtmax         -0.98     -0.01     -0.01     -0.01
mdsim0           -0.77     -0.27     -0.38     -0.10
mdsim5           -0.97      0.04      0.02     -0.08
mdsim1           -0.88      0.15      0.23      0.19
ln_mxbcom        -0.65      0.34     -0.42      0.45
ln_mdbcom         0.56      0.63      0.14      0.50
ln_mnbcom         0.93      0.18     -0.12      0.17
minbits           0.60      0.40      0.15      0.54
maxbits          -0.39      0.77     -0.06      0.39
avbits            0.14      0.81      0.18      0.51
mpds             -0.98      0.02     -0.02     -0.10
msds_nn1         -0.91     -0.08      0.30     -0.02
mcentr_dist      -0.98      0.00      0.04     -0.11
mdd               0.03      0.97      0.10     -0.18
ln_mBCI          -0.45      0.12      0.72      0.41

[Figure: Datasets Plotted vs. First Two Rotated Factors. Axes: Factor 1 and Factor 2; arrow indicates increasing diversity.]

[Figure: Datasets plotted vs. mean modal dissimilarity (t=0.5) and average # bits. Axes: mdsim_5 and avbits; arrow indicates increasing diversity.]

[Figure: Relative diversity by dataset type (groups along x axis). Measure plotted: msds_nn1; arrow indicates increasing diversity.]

[Figure: Relative diversity by dataset type (groups along x axis). Measure plotted: mean centr_dist; arrow indicates increasing diversity.]

[Figure: Relative diversity by dataset type (groups along x axis). Measure plotted: ln_mxbcom; arrow indicates increasing diversity.]

Directions
• behavior of metrics; e.g.
  • on combining or subsetting datasets
• other similarity functions - same or different?
• calibration of metrics with chemists’ perception
• other fingerprints
  • BCI, MACCS, Tripos
• 3D information
• use of modal similarities as dataset parameters for correlation or classification
• how to discover the congruence between molecular similarity and biological function

Conclusions to date
• Different metrics can capture different aspects of structural diversity
• One metric will not suffice to provide adequate discrimination for all different types of diversity
• Modal fingerprints and similarities may prove to be useful additions to the measurement of diversity

Acknowledgments
• Parke-Davis Biomolecular Structure and Drug Design
  Christine Humblet
• Daylight CIS, Inc.
  Norah Shemetulskis
  David Weininger
  Jeremy Yang

Correlations among diversity measures (n = 23)
Variable             N    av_mwt       Rfp  ln_Rtmax    mdsim0    mdsim5    mdsim1 ln_mxbcom ln_mdbcom ln_mnbcom
N                 1.00
av_mwt           -0.22      1.00
Rfp               0.24     -0.25      1.00
ln_Rtmax          0.04     -0.15      0.82      1.00
mdsim0            0.38     -0.36      0.63      0.76      1.00
mdsim5            0.04     -0.23      0.77      0.94      0.73      1.00
mdsim1           -0.18      0.10      0.60      0.85      0.54      0.86      1.00
ln_mxbcom         0.30      0.26      0.55      0.63      0.56      0.61      0.58      1.00
ln_mdbcom        -0.26      0.66     -0.52     -0.56     -0.72     -0.56     -0.23      0.01      1.00
ln_mnbcom         0.06      0.28     -0.80     -0.96     -0.71     -0.89     -0.79     -0.40      0.70      1.00
minbits          -0.27      0.63     -0.62     -0.57     -0.65     -0.57     -0.38     -0.03      0.83      0.70
maxbits          -0.06      0.39      0.38      0.37      0.08      0.32      0.52      0.68      0.45     -0.17
avbits           -0.28      0.64     -0.19     -0.13     -0.45     -0.14      0.14      0.32      0.86      0.33
mpds              0.08     -0.26      0.78      0.94      0.78      0.99      0.86      0.61     -0.58     -0.90
msds_nn1         -0.23     -0.11      0.76      0.89      0.56      0.88      0.78      0.42     -0.54     -0.91
centr_dist        0.03     -0.26      0.80      0.95      0.74      0.99      0.85      0.57     -0.59     -0.92
mdd              -0.19     -0.04      0.01     -0.03     -0.29      0.04      0.10      0.21      0.54      0.17
ln_BCI           -0.67      0.43      0.18      0.38      0.04      0.42      0.65      0.23      0.13     -0.36

Correlations (cont.)
Variable       minbits   maxbits    avbits      mpds  msds_nn1 centr_dist      mdd    ln_BCI
minbits           1.00
maxbits           0.22      1.00
avbits            0.71      0.74      1.00
mpds              0.62      0.33     -0.17      1.00
msds_nn1         -0.49      0.26     -0.13      0.87      1.00
centr_dist        0.60      0.30     -0.17      0.99      0.91       1.00
mdd               0.34      0.63      0.73      0.02     -0.07       0.01     1.00
ln_BCI            0.08      0.38      0.36      0.40      0.59       0.43     0.12      1.00
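One entry of the correlation matrix can be recomputed from the per-dataset values tabulated earlier. The sketch below assumes mpds is 1 - mps (mean pairwise dissimilarity as the complement of mean pairwise similarity) and uses an ordinary Pearson coefficient. Both are assumptions about how the slide's numbers were produced, and the helper name is hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# mps and mean centroid distance for the 23 datasets, in table order
mps = [0.45, 0.33, 0.40, 0.53, 0.45, 0.57, 0.46, 0.32, 0.67, 0.67, 0.55,
       0.58, 0.49, 0.57, 0.31,
       0.51, 0.52, 0.68, 0.75,
       0.19, 0.19, 0.16, 0.15]
centr_dist = [0.34, 0.48, 0.40, 0.26, 0.36, 0.23, 0.35, 0.53, 0.12, 0.14, 0.25,
              0.20, 0.28, 0.25, 0.49,
              0.27, 0.28, 0.12, 0.07,
              0.68, 0.68, 0.74, 0.75]

mpds = [1.0 - s for s in mps]  # assumed conversion to dissimilarity
r = pearson(mpds, centr_dist)
print(round(r, 2))  # the slide reports 0.99 for mpds vs. centr_dist
```

A coefficient this close to 1 suggests the two measures carry nearly the same information for these datasets, so only one of them is needed as a descriptor.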

[Figure: Datasets plotted vs. mean pairwise dissimilarity and average # bits. Axes: mpds and avbits; arrow indicates increasing diversity.]

[Figure: Datasets plotted vs. mean centroid distance and average # bits. Axes: mcentr_dist and avbits; arrow indicates increasing diversity.]
