Selecting Diverse Sets of Compounds C371 Fall 2004
Review • Similar Property Principle: If structurally similar compounds are likely to exhibit similar activity, then maximum coverage of the activity space should be achieved by selecting a structurally diverse set of compounds.
Techniques • High-Throughput Screening (HTS) • Combinatorial Chemistry • Early attempts led to large libraries, but little variability in the molecules created • Need a way to identify subsets of compounds for synthesis, purchase, or testing
Chemical Diversity • No unambiguous definition • Need to quantify the degree of diversity of a subset of compounds • Four main approaches: • Cluster analysis • Dissimilarity-based methods • Cell-based methods • Use of optimization techniques
CLUSTER ANALYSIS • Aim is to divide a group of objects into clusters so that objects within a cluster are similar to one another, but dissimilar to objects in other clusters • Many algorithms for doing this • Hierarchical methods seem to perform better than non-hierarchical ones • Sometimes called a “distance-based” approach to compound selection, because distance is measured between pairs of compounds
Key Steps in Cluster Analysis • Generate descriptors for each compound • Calculate the similarity or distance between all compounds • Use a clustering algorithm to group the compounds • Select a representative subset by taking one or more compounds from each cluster
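The last step above, taking a representative from each cluster, can be sketched as follows. The slides do not say how the representative is chosen; picking the medoid (the member with the smallest total distance to the other members) is an assumption made here, and `pick_representatives` and the sample data are hypothetical names for illustration only.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def pick_representatives(clusters, distance=euclidean):
    """Return one compound id per cluster.

    Each cluster is a list of (compound_id, descriptor_vector) pairs.
    Assumption: the representative is the medoid, i.e. the member whose
    summed distance to all other members of the cluster is smallest.
    """
    representatives = []
    for members in clusters:
        medoid = min(
            members,
            key=lambda m: sum(distance(m[1], other[1]) for other in members),
        )
        representatives.append(medoid[0])
    return representatives


# Hypothetical clusters with one-dimensional descriptors.
clusters = [
    [("a", (0.0,)), ("b", (1.0,)), ("c", (5.0,))],
    [("d", (10.0,))],
]
print(pick_representatives(clusters))  # ['b', 'd']
```

Other choices, such as a random member or the compound closest to the cluster centroid, fit the same step equally well.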
“Distance” • 1 − S, where S is the similarity coefficient: used when molecules are represented by binary descriptors • Euclidean distance: used when molecules are represented by physicochemical properties
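A minimal sketch of the two distance measures, assuming the similarity coefficient S is the Tanimoto coefficient and that a binary fingerprint is stored as the set of its "on" bit positions. The function names and the sample values are illustrative, not taken from the slides.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention assumed here: two empty fingerprints match
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def tanimoto_distance(fp_a, fp_b):
    """Distance as 1 - S for binary descriptors."""
    return 1.0 - tanimoto(fp_a, fp_b)


def euclidean_distance(x, y):
    """Euclidean distance for vectors of physicochemical properties."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two hypothetical fingerprints sharing 2 of 5 distinct bits: S = 0.4.
print(tanimoto_distance({1, 4, 7, 9}, {1, 4, 8}))
# Two hypothetical property vectors.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

In practice, property vectors are usually scaled before a Euclidean distance is taken, since otherwise a wide-ranging property such as molecular weight dominates the result.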
Characteristics of Clustering Methods • Non-overlapping: each object belongs to exactly one cluster (most methods use this approach); these divide into hierarchical and non-hierarchical methods • Overlapping: an object can be in more than one cluster • Efficiency and effectiveness issues: some approaches have very intensive computational requirements
Hierarchical Clustering • Produces a hierarchy in which clusters increase in size, from each compound in its own cluster (a singleton) at one extreme to all compounds in a single cluster at the other • Agglomerative methods start at the bottom and merge similar clusters • Ward’s method: clusters are formed to minimize the variance (i.e., the sum of the squared deviations from the mean) • Others: centroid method and the median method • Divisive hierarchical clustering starts with all compounds in a single cluster and partitions the data
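A naive sketch of agglomerative clustering with a Ward-style merge criterion, for illustration only. It assumes compounds are described by numeric property vectors and uses the standard expression for the increase in the within-cluster sum of squares when two clusters are merged. The function names are hypothetical, and the repeated search over all cluster pairs is far too slow for real compound collections.

```python
def centroid(points):
    """Mean vector of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def ward_cost(cluster_a, cluster_b):
    """Increase in the within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(cluster_a), len(cluster_b)
    return na * nb / (na + nb) * squared_gap


def agglomerate(points, n_clusters):
    """Merge singletons bottom-up until n_clusters clusters remain."""
    clusters = [[p] for p in points]  # start with every compound as a singleton
    while len(clusters) > max(1, n_clusters):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the cheapest pair
        del clusters[j]
    return clusters


# Hypothetical two-property descriptors forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(agglomerate(points, 2))
```

Stopping at `n_clusters` corresponds to cutting the hierarchy at one level; recording every merge instead would give the full dendrogram.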
Selecting the Appropriate Number of Clusters • Need a cutoff value at which the clusters are examined • Jaccard statistic of two clusters, C1 and C2: $J(C_1, C_2) = \frac{a}{a + b + c}$, where a is the number of compounds found in both clusters, b is the number found in cluster 1 but not in cluster 2, and c is the number found in cluster 2 but not in cluster 1 • Same as the Tanimoto coefficient
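The Jaccard statistic as defined on this slide can be computed directly from set operations. The sketch below assumes each cluster is a set of compound identifiers; the identifiers are made up for the example.

```python
def jaccard(cluster_1, cluster_2):
    """Jaccard statistic a / (a + b + c) for two clusters given as sets."""
    a = len(cluster_1 & cluster_2)  # compounds found in both clusters
    b = len(cluster_1 - cluster_2)  # in cluster 1 but not in cluster 2
    c = len(cluster_2 - cluster_1)  # in cluster 2 but not in cluster 1
    if a + b + c == 0:
        raise ValueError("the statistic is undefined for two empty clusters")
    return a / (a + b + c)


# Hypothetical clusters: 2 shared compounds, 1 only in C1, 2 only in C2.
c1 = {"m1", "m2", "m3"}
c2 = {"m2", "m3", "m4", "m5"}
print(jaccard(c1, c2))  # 0.4
```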
Non-Hierarchical Clustering • Compounds are clustered without forming a hierarchical relationship • Methods: • single-pass: assigns a compound to a cluster according to a cut-off value • Problem: results are not reproducible, i.e., they depend on the order in which the molecules are processed • nearest neighbor: Jarvis-Patrick clustering • relocation: K-means
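The order dependence of single-pass clustering is easy to reproduce. The sketch below assumes a "leader" variant, in which each compound is compared with the first member of every existing cluster and joins the first one within the cut-off. The slides do not specify this detail, and other variants compare against the cluster centroid instead.

```python
def single_pass(items, cutoff, distance):
    """Single-pass clustering: one scan, no reassignment.

    Assumption: an item joins the first cluster whose first member (its
    'leader') lies within `cutoff`; otherwise it starts a new cluster.
    """
    clusters = []
    for item in items:
        for cluster in clusters:
            if distance(item, cluster[0]) <= cutoff:
                cluster.append(item)
                break
        else:
            clusters.append([item])  # no cluster was close enough
    return clusters


def gap(x, y):
    """Distance between two one-dimensional descriptor values."""
    return abs(x - y)


# The same three hypothetical compounds, presented in two different orders.
print(single_pass([0.0, 1.0, 2.0], 1.0, gap))  # [[0.0, 1.0], [2.0]]
print(single_pass([1.0, 0.0, 2.0], 1.0, gap))  # [[1.0, 0.0, 2.0]]
```

With the first ordering the compound at 2.0 is too far from the leader at 0.0 and forms its own cluster; with the second, the leader at 1.0 is within the cut-off of both other compounds.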
DISSIMILARITY-BASED SELECTION METHODS • Attempt to identify a diverse set of compounds directly • Based on calculating distances or dissimilarities between compounds
Basic Algorithm for Dissimilarity-Based Selection Methods • Decide on a desired size, n, of a final subset • Select a compound and place it in the subset • Calculate the dissimilarity between each of the other compounds and those in the subset • Choose the next compound as the one most dissimilar to those in the subset • If fewer than n in the subset, repeat the calculation of the dissimilarity until n is achieved • Complexity varies as the square of n
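The steps above leave open what "most dissimilar to those in the subset" means. The sketch below assumes the MaxMin rule: the next compound is the one whose distance to its nearest already-selected compound is largest. It also assumes the first compound is simply the one at `seed_index` and that `n` is at least 1; all names are illustrative.

```python
def maxmin_select(compounds, n, distance, seed_index=0):
    """Pick up to n compound indices by MaxMin dissimilarity selection."""
    selected = [seed_index]
    # min_dist[i]: distance from compound i to its nearest selected compound.
    min_dist = [distance(c, compounds[seed_index]) for c in compounds]
    while len(selected) < min(n, len(compounds)):
        candidates = [i for i in range(len(compounds)) if i not in selected]
        # MaxMin: take the candidate farthest from its nearest selected one.
        chosen = max(candidates, key=lambda i: min_dist[i])
        selected.append(chosen)
        # Only the newest pick can change a compound's nearest selected one.
        for i, c in enumerate(compounds):
            min_dist[i] = min(min_dist[i], distance(c, compounds[chosen]))
    return selected


def gap(x, y):
    """Distance between two one-dimensional descriptor values."""
    return abs(x - y)


# Hypothetical one-dimensional descriptors.
values = [0.0, 1.0, 2.0, 10.0, 5.0]
print(maxmin_select(values, 3, gap))  # [0, 3, 4]
```

Replacing the minimum with a sum over the selected compounds gives the MaxSum variant of the same loop.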
CELL-BASED METHODS • Operate within a pre-defined low-dimensional chemistry space that does not depend on the particular set of molecules being examined • Compounds are allocated to cells according to their molecular properties • Methods are very fast, with a time complexity of O(N), but restricted to low-dimensional spaces • Good for very large data sets • Example properties used to define the space: MW, logP, polarity, shape, hydrogen bonding, aromatic interactions
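A minimal sketch of cell-based partitioning, assuming a two-dimensional space of molecular weight and logP with equal-width bins. The bin ranges, compound names, and property values are invented for illustration, and out-of-range values are clamped to the edge cells.

```python
from collections import defaultdict


def cell_index(props, bins):
    """Map a property vector to a cell.

    `bins` holds one (low, high, n_bins) triple per property. Values that
    fall outside [low, high) are clamped into the first or last bin.
    """
    index = []
    for value, (low, high, n_bins) in zip(props, bins):
        width = (high - low) / n_bins
        k = int((value - low) // width)
        index.append(min(max(k, 0), n_bins - 1))
    return tuple(index)


def partition(compounds, bins):
    """Allocate every compound to its cell in one O(N) pass."""
    cells = defaultdict(list)
    for name, props in compounds.items():
        cells[cell_index(props, bins)].append(name)
    return cells


# Hypothetical space: MW in 5 bins over 0-500, logP in 4 bins over -2 to 6.
bins = [(0.0, 500.0, 5), (-2.0, 6.0, 4)]
compounds = {
    "A": (120.0, 0.5),
    "B": (150.0, 1.2),
    "C": (430.0, 4.9),
    "D": (610.0, -3.0),
}
cells = partition(compounds, bins)
print(dict(cells))
# One compound per occupied cell gives a diverse subset.
print([members[0] for members in cells.values()])  # ['A', 'C', 'D']
```

Because the cells are fixed in advance, empty cells are as informative as occupied ones: they mark regions of the space that the collection does not cover.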
BCUT Descriptors • Matrix representation of molecules • Atomic properties used for diagonal • Atomic charges, polarizabilities, hydrogen bonding • Connectivity used for the off-diagonals • 2D graph or interatomic distances from 3D
Partitioning Using Pharmacophore Keys • Each potential 3- or 4-point pharmacophore is considered to constitute a cell • A given molecule could be in more than one cell • Promiscuous molecules: those that contain a large number of pharmacophores, e.g., very flexible molecules
OPTIMIZATION METHODS • Techniques for sampling large sets of molecules • May want to spread the compounds evenly in space • Techniques: Monte Carlo, simulated annealing • Selective replacement
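As a rough sketch of how simulated annealing can be applied to subset selection: start from a random subset, repeatedly replace one member with an outside compound, and accept the swap if the diversity score improves, or with a temperature-dependent probability if it does not. The diversity score (sum of pairwise distances), the cooling schedule, and all parameter values are assumptions chosen for the example, not details from the slides.

```python
import math
import random


def diversity(subset, points, distance):
    """Sum of pairwise distances within the subset (one possible score)."""
    return sum(
        distance(points[i], points[j])
        for k, i in enumerate(subset)
        for j in subset[k + 1:]
    )


def anneal_select(points, n, distance, steps=2000,
                  t_start=1.0, t_end=0.01, seed=0):
    """Search for a diverse subset of size n (1 <= n <= len(points))."""
    rng = random.Random(seed)
    current = rng.sample(range(len(points)), n)
    score = diversity(current, points, distance)
    best, best_score = list(current), score
    for step in range(steps):
        # Geometric cooling from t_start towards t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        outside = [i for i in range(len(points)) if i not in current]
        if not outside:
            break  # the subset already contains every compound
        trial = list(current)
        trial[rng.randrange(n)] = rng.choice(outside)  # replace one member
        trial_score = diversity(trial, points, distance)
        delta = trial_score - score
        # Always accept improvements; sometimes accept a worse subset.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, score = trial, trial_score
            if score > best_score:
                best, best_score = list(current), score
    return best, best_score


def gap(x, y):
    """Distance between two one-dimensional descriptor values."""
    return abs(x - y)


# Hypothetical one-dimensional descriptors.
points = [0.0, 0.1, 0.2, 5.0, 9.9, 10.0]
subset, score = anneal_select(points, 3, gap, steps=500)
print(sorted(subset), score)
```

Setting the temperature so low that worse subsets are never accepted reduces this to a plain Monte Carlo search that keeps only improving swaps.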
CONCLUSIONS • Some research suggests that compounds with a Tanimoto similarity of 0.85 or higher have between a 30% and an 80% chance of sharing the same biological activity • No clear consensus on which screening approach is best • Faster computer techniques (e.g., parallel computing) may help • Descriptors used must be related to biological activity