Similarity and Diversity Alexandre Varnek, University of Strasbourg, France. What is similar?. Colour. Shape. Pattern. Size. Different „spaces“, classified by:. 16 diverse aldehydes. ...sorted by common scaffold. ...sorted by functional groups. The „Similarity Principle“ :
Similarity and Diversity
Alexandre Varnek, University of Strasbourg, France
Different „spaces“, classified by: Colour, Shape, Pattern, Size
The „Similarity Principle“: Structurally similar molecules are assumed to have similar biological properties. Example: compounds active at opioid receptors.
Structural Spectrum of Thrombin Inhibitors: structural similarity “fading away” …
[Figure: reference compounds and a series of inhibitors annotated with similarity values 0.56, 0.72, 0.53, 0.84, 0.67, 0.52, 0.82, 0.64, 0.39]
Key features in similarity/diversity calculations:
• Properties to describe elements (descriptors, fingerprints)
• Distance measure („metrics“)
N-Dimensional Descriptor Space
molecule $M_i = (\mathrm{descriptor}_1(i), \mathrm{descriptor}_2(i), \ldots, \mathrm{descriptor}_n(i))$
• Each chosen descriptor adds a dimension to the reference space
• Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule
[Figure: descriptor space with axes descriptor 1, descriptor 2, descriptor 3, …, descriptor n]
Chemical Reference Space
• Distance in chemical space is used as a measure of molecular “similarity” and “dissimilarity”
• “Molecular similarity” covers not only chemical similarity but also property similarity, including biological activity
[Figure: molecules A and B in descriptor space, separated by the distance $D_{AB}$]
Distance Metrics in n-D Space
• If two molecules have comparable values in all n descriptors of the space, they are located close to each other in the n-D space.
• How can “closeness” in space be defined as a measure of molecular similarity?
• Distance metrics provide this definition.
Descriptor-based Similarity
• When two molecules A and B are projected into an n-D space, two vectors, A and B, represent their descriptor values, respectively.
• $A = (a_1, a_2, \ldots, a_n)$
• $B = (b_1, b_2, \ldots, b_n)$
• The similarity between A and B, $S_{AB}$, is negatively correlated with the distance $D_{AB}$
• shorter distance ~ more similar molecules
• in the case of a normalized distance (within the value range [0,1]), similarity = 1 – distance
[Figure: molecules A, B and C in descriptor space; $D_{AB} > D_{BC}$, hence $S_{AB} < S_{BC}$]
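The relation similarity = 1 – distance can be sketched as follows. This is a minimal illustration, not the method of the slides: it assumes that descriptors are min-max scaled to [0,1] over the data set and that the Euclidean distance is divided by $\sqrt{n}$ to bring it into [0,1]. The descriptor values and the helper names `min_max_scale`, `normalized_distance` and `similarity` are hypothetical.

```python
import math

def min_max_scale(vectors):
    """Rescale every descriptor (column) to [0, 1] across the data set."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]

def normalized_distance(a, b):
    """Euclidean distance divided by sqrt(n); lies in [0, 1] for inputs in [0, 1]."""
    return math.dist(a, b) / math.sqrt(len(a))

def similarity(a, b):
    """Similarity as the complement of a normalized distance."""
    return 1.0 - normalized_distance(a, b)

# Hypothetical descriptor values for three molecules A, B and C
A, B, C = min_max_scale([[1.0, 10.0], [4.0, 40.0], [5.0, 30.0]])

print(normalized_distance(A, B) > normalized_distance(B, C))  # True
print(similarity(A, B) < similarity(B, C))                    # True
```

Other normalizations are possible; the only property used here is that the distance stays within [0,1].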
Metrics Properties
• The distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
• Symmetry: $d_{AB} = d_{BA}$
• Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$
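A distance function can be tested against these three properties on a set of sample points. The sketch below is only a numerical spot check on the points supplied, not a proof; the function name `is_metric` and the sample points are illustrative.

```python
import itertools
import math

def is_metric(points, dist, tol=1e-12):
    """Check the three metric properties on every triple of the given points."""
    for a, b, c in itertools.product(points, repeat=3):
        if dist(a, b) < 0 or dist(a, a) > tol:
            return False  # non-negativity and d_AA = 0
        if abs(dist(a, b) - dist(b, a)) > tol:
            return False  # symmetry
        if dist(a, b) > dist(a, c) + dist(c, b) + tol:
            return False  # triangle inequality
    return True

points = [(3, 0, 1), (5, 2, 0), (0, 0, 0)]
print(is_metric(points, math.dist))  # True for the Euclidean distance

# The squared Euclidean distance violates the triangle inequality
squared = lambda a, b: math.dist(a, b) ** 2
print(is_metric([(0,), (1,), (2,)], squared))  # False
```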
Euclidean Distance in n-D Space
• Given two n-dimensional vectors, A and B
• $A = (a_1, a_2, \ldots, a_n)$
• $B = (b_1, b_2, \ldots, b_n)$
• Euclidean distance $D_{AB}$ is defined as: $D_{AB} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
• Example:
• A = (3,0,1); B = (5,2,0)
• $D_{AB} = \sqrt{(3-5)^2 + (0-2)^2 + (1-0)^2} = \sqrt{9} = 3$
Manhattan Distance in n-D Space
• Given two n-dimensional vectors, A and B
• $A = (a_1, a_2, \ldots, a_n)$
• $B = (b_1, b_2, \ldots, b_n)$
• Manhattan distance $D_{AB}$ is defined as: $D_{AB} = \sum_{i=1}^{n} |a_i - b_i|$
• Example:
• A = (3,0,1); B = (5,2,0)
• $D_{AB} = |3-5| + |0-2| + |1-0| = 5$
Distance Measures („Metrics“):
• Euclidean distance: $[(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2]^{1/2} = (4^2 + 2^2)^{1/2} = 4.472$
• Manhattan (Hamming) distance: $|x_{11} - x_{21}| + |x_{12} - x_{22}| = 4 + 2 = 6$
• Sup distance: $\max(|x_{11} - x_{21}|, |x_{12} - x_{22}|) = \max(4, 2) = 4$
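The three distance measures can be written in a few lines of Python. The sketch reproduces the worked examples A = (3,0,1), B = (5,2,0) and the two-dimensional case with coordinate differences 4 and 2; the points `P` and `Q` are hypothetical, chosen only to give those differences.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences (city block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def sup_distance(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

A, B = (3, 0, 1), (5, 2, 0)
print(euclidean(A, B), manhattan(A, B))  # 3.0 5

# Hypothetical 2-D points with coordinate differences 4 and 2
P, Q = (1, 1), (5, 3)
print(round(euclidean(P, Q), 3), manhattan(P, Q), sup_distance(P, Q))  # 4.472 6 4
```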
Popular Similarity/Distance Coefficients
• Similarity metrics: Tanimoto coefficient, Dice coefficient, Cosine coefficient
• Distance metrics: Euclidean distance, Hamming distance, Soergel distance
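Tanimoto Coefficient (Tc)
• Definition (binary fingerprints): $Tc = \frac{c}{a + b - c}$, where a and b are the numbers of bits set in A and B, and c is the number of bits set in both
• value range: [0,1]
• Tc is also known as the Jaccard coefficient
• Tc is the most popular similarity coefficient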
Example Tc Calculation
• binary fingerprints A and B
• a = 4, b = 4, c = 2
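A minimal sketch of the calculation, assuming the convention of Leach and Gillet in which a and b count the bits set in A and B and c counts the bits set in both. The two bit strings are hypothetical, chosen only so that a = 4, b = 4 and c = 2.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    a = sum(fp_a)                               # bits set in A
    b = sum(fp_b)                               # bits set in B
    c = sum(x & y for x, y in zip(fp_a, fp_b))  # bits set in both
    if a + b - c == 0:
        return 1.0  # convention chosen here for two all-zero fingerprints
    return c / (a + b - c)

# Hypothetical bit strings with a = 4, b = 4, c = 2
fp_a = [1, 1, 1, 1, 0, 0, 0, 0]
fp_b = [0, 0, 1, 1, 1, 1, 0, 0]
print(round(tanimoto(fp_a, fp_b), 3))  # 0.333
```

Under this convention the example gives Tc = 2 / (4 + 4 - 2) = 1/3.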
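Dice Coefficient
• Definition (binary fingerprints): $Dice = \frac{2c}{a + b}$, where a and b are the numbers of bits set in A and B, and c is the number of bits set in both
• value range: [0,1]
• monotonic with the Tanimoto coefficient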
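The monotonic relation follows from the identity Dice = 2·Tc / (1 + Tc), which holds for the binary definitions with a, b the bits set in A and B and c the common bits. The sketch below checks the identity and the resulting identical ranking on a few hypothetical (a, b, c) triples.

```python
def tanimoto_counts(a, b, c):
    """Tanimoto coefficient from bit counts."""
    return c / (a + b - c)

def dice_counts(a, b, c):
    """Dice coefficient from bit counts."""
    return 2 * c / (a + b)

def dice_from_tanimoto(tc):
    """Dice expressed as a strictly increasing function of Tc."""
    return 2 * tc / (1 + tc)

# Hypothetical (a, b, c) bit counts for four fingerprint pairs
pairs = [(4, 4, 2), (6, 3, 3), (10, 10, 1), (5, 8, 5)]
by_tc = sorted(pairs, key=lambda p: tanimoto_counts(*p))
by_dice = sorted(pairs, key=lambda p: dice_counts(*p))
print(by_tc == by_dice)  # True: both coefficients give the same ranking
```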
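Cosine Coefficient
• Definition (binary fingerprints): $Cosine = \frac{c}{\sqrt{a \, b}}$, where a and b are the numbers of bits set in A and B, and c is the number of bits set in both
• Properties:
• value range: [0,1]
• correlated with the Tanimoto coefficient but not strictly monotonic with it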
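That the Cosine coefficient is not strictly monotonic with Tc can be seen from two fingerprint pairs that the two coefficients rank in opposite order. The (a, b, c) bit counts below are hypothetical and use the same convention (a, b: bits set in A and B; c: common bits).

```python
import math

def tanimoto_counts(a, b, c):
    """Tanimoto coefficient from bit counts."""
    return c / (a + b - c)

def cosine_counts(a, b, c):
    """Cosine coefficient from bit counts."""
    return c / math.sqrt(a * b)

pair_1 = (2, 18, 2)   # small fingerprint fully contained in a large one
pair_2 = (10, 10, 3)  # two equally sized fingerprints with modest overlap

print(tanimoto_counts(*pair_1) < tanimoto_counts(*pair_2))  # True
print(cosine_counts(*pair_1) > cosine_counts(*pair_2))      # True
```

Tc ranks the second pair as more similar, while the Cosine coefficient prefers the first.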
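Hamming Distance
• Definition (binary fingerprints): $D_{AB} = a + b - 2c$, the number of bit positions in which A and B differ
• value range: [0,N] (N, length of the fingerprint)
• also called Manhattan/City Block distance
Soergel Distance
• Definition (binary fingerprints): $D_{AB} = \frac{a + b - 2c}{a + b - c}$
• Properties:
• value range: [0,1]
• equivalent to (1 – Tc) for binary fingerprints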
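Both distances can be computed directly from the bit strings. The sketch assumes binary fingerprints of equal length and uses hypothetical bit strings with a = 4, b = 4 and c = 2; it also checks numerically that the Soergel distance equals 1 – Tc.

```python
def hamming(fp_a, fp_b):
    """Number of positions in which the two fingerprints differ."""
    return sum(x != y for x, y in zip(fp_a, fp_b))

def soergel(fp_a, fp_b):
    """Differing bits divided by the bits set in at least one fingerprint."""
    union = sum(x | y for x, y in zip(fp_a, fp_b))
    return hamming(fp_a, fp_b) / union if union else 0.0

def tanimoto(fp_a, fp_b):
    """Common bits divided by the bits set in at least one fingerprint."""
    common = sum(x & y for x, y in zip(fp_a, fp_b))
    union = sum(x | y for x, y in zip(fp_a, fp_b))
    return common / union if union else 1.0

# Hypothetical bit strings with a = 4, b = 4, c = 2
fp_a = [1, 1, 1, 1, 0, 0, 0, 0]
fp_b = [0, 0, 1, 1, 1, 1, 0, 0]
print(hamming(fp_a, fp_b))            # 4
print(round(soergel(fp_a, fp_b), 3))  # 0.667
```

The Hamming distance grows with the absolute number of differing bits, whereas the Soergel distance is normalized by the number of bits set in either fingerprint.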
Properties of Similarity and Distance Coefficients
Metric Properties
• (1) The distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
• (2) Symmetry: $d_{AB} = d_{BA}$
• (3) Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$
The Euclidean and Hamming distances and the complement of the Tanimoto coefficient for dichotomous (binary) variables, 1 – Tc, obey all properties. The complements of the Tanimoto (continuous variables), Dice and Cosine coefficients do not obey inequality (3).
Coefficients are monotonic if they produce the same similarity ranking.
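A small counterexample illustrates the triangle inequality statement for the complements 1 – S of the binary coefficients. Fingerprints are represented here as sets of "on" bit positions; the three sets are hypothetical. A single triple can disprove the inequality but cannot prove it, so the result for the Tanimoto complement is only consistent with the statement above.

```python
import math

def tanimoto(x, y):
    return len(x & y) / len(x | y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def obeys_triangle(sim, a, b, c, tol=1e-12):
    """Test d_AB <= d_AC + d_CB for the distance d = 1 - sim."""
    d = lambda x, y: 1.0 - sim(x, y)
    return d(a, b) <= d(a, c) + d(c, b) + tol

# Hypothetical fingerprints given as sets of "on" bit positions
A, B, C = {1}, {2}, {1, 2}

print(obeys_triangle(tanimoto, A, B, C))  # True
print(obeys_triangle(dice, A, B, C))      # False
print(obeys_triangle(cosine, A, B, C))    # False
```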
Similarity search
Using bit strings to encode molecular size. A biphenyl query is compared to a series of analogues of increasing size. The Tanimoto coefficient, which is shown next to the corresponding structure, decreases with increasing size until a limiting value is reached.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
Similarity search
Molecular similarity at a range of Tanimoto coefficient values.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
Similarity search
The distribution of Tanimoto coefficient values found in database searches with a range of query molecules of increasing size and complexity.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
Molecular Similarity
A comparison of the Soergel and Hamming distance values for two pairs of structures to illustrate the effect of molecular size.
A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003
Molecular Similarity
The maximum common subgraph (MCS) between the two molecules is in bold.
Similarity = Nbonds(MCS) / Nbonds(query)
A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003
How important is the choice of descriptors?
Inhibitors of acyl-CoA:cholesterol acyltransferase represented with MACCS (a), TGT (b), and Molprint2D (c) fingerprints.
Continuous SARs: gradual changes in structure result in moderate changes in activity
• “rolling hills” (G. Maggiora)
Discontinuous SARs: small changes in structure have dramatic effects on activity
• “cliffs” in activity landscapes
Structure-Activity Landscape Index: $SALI_{ij} = \Delta A_{ij} / \Delta S_{ij}$
$\Delta A_{ij}$ ($\Delta S_{ij}$) is the difference between activities (similarities) of molecules i and j
R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646
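A minimal sketch of the index, assuming the form used in the cited Guha paper, where the denominator is the dissimilarity 1 – sim(i, j) and the numerator is the absolute activity difference. The activity values (taken here as pIC50-like numbers) and the similarity are hypothetical.

```python
import math

def sali(activity_i, activity_j, similarity_ij):
    """SALI = |A_i - A_j| / (1 - sim_ij).

    For identical representations (sim_ij = 1) the index is unbounded; this
    sketch returns infinity when the activities differ and, as an arbitrary
    choice, 0.0 when they are equal.
    """
    delta_activity = abs(activity_i - activity_j)
    if similarity_ij >= 1.0:
        return math.inf if delta_activity > 0 else 0.0
    return delta_activity / (1.0 - similarity_ij)

# Hypothetical pair: activities 7.0 and 5.0, fingerprint similarity 0.9
print(round(sali(7.0, 5.0, 0.9), 6))  # 20.0

# Identical fingerprints but different activities: the extreme "cliff"
print(sali(7.0, 5.0, 1.0))            # inf
```

Large SALI values flag pairs that are structurally similar but differ strongly in activity.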
Discontinuous SARs: VEGFR-2 tyrosine kinase inhibitors
[Figure: a 6 nM compound and its analog with 2390 nM activity; MACCS Tc: 1.00]
small changes in structure have dramatic effects on activity
• “cliffs” in activity landscapes
• lead optimization, QSAR
bad news for molecular similarity analysis...
Example of a “Classical” Discontinuous SAR: adenosine deaminase inhibitors (MACCS Tanimoto similarity)
Any similarity method must recognize these compounds as being “similar” ...
Library design
Goal: to select a representative subset from a large database
Chemical Space
• Overlapping similarity radii: redundancy
• „Void“ regions: lack of information
Chemical Space
• No redundancy, no „voids“: optimally diverse compound library
Subset selection from the libraries
• Clustering
• Dissimilarity-based methods
• Cell-based methods
• Optimisation techniques
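As an illustration of a dissimilarity-based method, the sketch below uses a greedy MaxMin selection: starting from a seed compound, it repeatedly adds the compound whose distance to its nearest already-selected compound is largest. MaxMin is named here as one common choice, not as the method of the slides; the points, the seed and the function name `maxmin_select` are hypothetical.

```python
import math

def maxmin_select(points, k, dist=math.dist, seed_index=0):
    """Greedy MaxMin: add the point farthest from the current subset."""
    selected = [seed_index]
    # Distance from each point to its nearest already-selected point
    nearest = [dist(p, points[seed_index]) for p in points]
    while len(selected) < min(k, len(points)):
        remaining = [i for i in range(len(points)) if i not in selected]
        candidate = max(remaining, key=lambda i: nearest[i])
        selected.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(p, points[candidate]))
    return selected

# Hypothetical 2-D descriptor vectors: two tight pairs and one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 5.0)]
print(maxmin_select(points, 3))  # [0, 3, 4]
```

The selected subset takes one compound from each region and skips the near-duplicates, which is the "no redundancy, no voids" goal in miniature.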
What is clustering?
• Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group
• A technique to understand, simplify and interpret large amounts of multidimensional data
• Classification without labels (“unsupervised learning”)
Where is clustering used?
General: data mining, statistical data analysis, data compression, image segmentation, document classification (information retrieval)
Chemical:
• representative sample,
• subset selection,
• classification of new compounds
Overall strategy
• Select descriptors
• Generate descriptors for all items
• Scale descriptors
• Define similarity measure („metrics“)
• Apply an appropriate clustering method to group the items on the basis of the chosen descriptors and similarity measure
• Analyse results
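The steps above can be sketched end to end on a toy data set. The choices made here are assumptions for illustration only: two hypothetical descriptors (molecular weight and logP), autoscaling to zero mean and unit variance, the Euclidean distance, and single-link agglomerative clustering stopped at a distance threshold.

```python
import math
import statistics

def autoscale(matrix):
    """Scale each descriptor column to zero mean and unit standard deviation."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)]
            for row in matrix]

def single_link(points, threshold):
    """Merge the two closest clusters until their distance exceeds threshold."""
    clusters = [[i] for i in range(len(points))]

    def link(c1, c2):
        # Single link: distance between the closest members of two clusters
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, p, q = min((link(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        if d > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical descriptor table: rows = molecules, columns = (MW, logP)
descriptors = [[180.0, 1.2], [185.0, 1.3], [420.0, 4.8],
               [430.0, 5.0], [300.0, 3.0]]
scaled = autoscale(descriptors)
clusters = single_link(scaled, threshold=0.5)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Without the scaling step the molecular weight column would dominate the distance, which is why scaling precedes the choice of metric.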
Data Presentation
• Pattern matrix (molecules × descriptors): the library contains n molecules, each molecule is described by p descriptors
• Proximity matrix (molecules × molecules): $d_{ii} = 0$; $d_{ij} = d_{ji}$
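The step from the pattern matrix to the proximity matrix can be sketched as follows, assuming the Euclidean distance as the proximity measure. The pattern matrix is hypothetical, with n = 3 molecules and p = 3 descriptors.

```python
import math

def proximity_matrix(pattern):
    """n x n matrix of pairwise Euclidean distances from an n x p pattern matrix."""
    n = len(pattern)
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both halves at once so that d_ij = d_ji
            prox[i][j] = prox[j][i] = math.dist(pattern[i], pattern[j])
    return prox

# Hypothetical pattern matrix: rows = molecules, columns = descriptors
pattern = [[3, 0, 1], [5, 2, 0], [0, 0, 0]]
prox = proximity_matrix(pattern)
print(prox[0][1])  # 3.0
```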
Clustering methods
• Hierarchical
  • Agglomerative: Single Link, Complete Link, Group Average, Weighted Group Average, Centroid, Median, Ward
  • Divisive: Monothetic, Polythetic
• Non-hierarchical: Single Pass, Nearest Neighbour (Jarvis-Patrick), Mixture Model, Relocation, Topographic, Others