
Similarity and Diversity Alexandre Varnek, University of Strasbourg, France


Presentation Transcript


  1. Similarity and Diversity. Alexandre Varnek, University of Strasbourg, France

  2. What is similar?

  3. Different „spaces“, classified by: colour, shape, pattern, size

  4. 16 diverse aldehydes...

  5. ...sorted by common scaffold

  6. ...sorted by functional groups

  7. The „Similarity Principle“: structurally similar molecules are assumed to have similar biological properties. Example: compounds active at opioid receptors

  8. Structural Spectrum of Thrombin Inhibitors: structural similarity “fading away” from the reference compounds. Similarity values shown in the figure: 0.56, 0.72, 0.53, 0.84, 0.67, 0.52, 0.82, 0.64, 0.39

  9. Key features in similarity/diversity calculations: • Properties to describe elements (descriptors, fingerprints) • Distance measure („metrics“)

  10. N-Dimensional Descriptor Space: molecule $M_i = (\mathrm{descriptor}_1(i), \mathrm{descriptor}_2(i), \ldots, \mathrm{descriptor}_n(i))$ • Each chosen descriptor adds a dimension to the reference space • Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule (Figure axes: descriptor 1, descriptor 2, descriptor 3, …, descriptor n)

  11. Chemical Reference Space • Distance in chemical space is used as a measure of molecular “similarity” and “dissimilarity” • “Molecular similarity” covers not only chemical similarity but also property similarity, including biological activity (Figure: molecules A and B separated by the distance $D_{AB}$)

  12. Distance Metrics in n-D Space • If two molecules have comparable values for all n descriptors of the space, they are located close to each other in the n-D space • How can “closeness” in space be defined as a measure of molecular similarity? • With distance metrics

  13. Descriptor-based Similarity • When two molecules A and B are projected into an n-D space, two vectors, A and B, represent their descriptor values • $A = (a_1, a_2, \ldots, a_n)$ • $B = (b_1, b_2, \ldots, b_n)$ • The similarity between A and B, $S_{AB}$, is negatively correlated with the distance $D_{AB}$ • shorter distance ~ more similar molecules • in the case of a normalized distance (within value range [0,1]), similarity = 1 – distance • Figure example: $D_{AB} > D_{BC}$, hence $S_{AB} < S_{BC}$

  14. Metrics Properties • The distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$ • Symmetry: $d_{AB} = d_{BA}$ • Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$
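
The three properties above can be checked numerically for any candidate distance function. The following is a minimal sketch, not part of the original slides: the helper name `is_metric`, the tolerance and the three sample points are illustrative assumptions. A passing check on a few points does not prove that a function is a metric, but a single failing triple disproves it.

```python
from itertools import permutations
from math import isclose


def is_metric(points, dist, tol=1e-9):
    """Check the three metric properties of `dist` on the given points."""
    # Property 1: d(A, A) = 0 and d(A, B) >= 0
    for a in points:
        if not isclose(dist(a, a), 0.0, abs_tol=tol):
            return False
    for a, b in permutations(points, 2):
        d_ab = dist(a, b)
        if d_ab < -tol:
            return False
        # Property 2: symmetry, d(A, B) = d(B, A)
        if not isclose(d_ab, dist(b, a), abs_tol=tol):
            return False
    # Property 3: triangle inequality, d(A, B) <= d(A, C) + d(B, C)
    for a, b, c in permutations(points, 3):
        if dist(a, b) > dist(a, c) + dist(b, c) + tol:
            return False
    return True


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    # Deliberately not a metric: squaring breaks the triangle inequality.
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical one-descriptor "molecules" chosen only for illustration.
points = [(0.0,), (1.0,), (2.0,)]
print(is_metric(points, manhattan))          # True
print(is_metric(points, squared_euclidean))  # False, because 4 > 1 + 1
```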

  15. Euclidean Distance in n-D Space • Given two n-dimensional vectors, A and B • $A = (a_1, a_2, \ldots, a_n)$ • $B = (b_1, b_2, \ldots, b_n)$ • Euclidean distance $D_{AB}$ is defined as: $D_{AB} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$ • Example: • A = (3,0,1); B = (5,2,0) • $D_{AB} = \sqrt{(3-5)^2 + (0-2)^2 + (1-0)^2} = \sqrt{9} = 3$

  16. Manhattan Distance in n-D Space • Given two n-dimensional vectors, A and B • $A = (a_1, a_2, \ldots, a_n)$ • $B = (b_1, b_2, \ldots, b_n)$ • Manhattan distance $D_{AB}$ is defined as: $D_{AB} = \sum_{i=1}^{n} |a_i - b_i|$ • Example: • A = (3,0,1); B = (5,2,0) • $D_{AB} = |3-5| + |0-2| + |1-0| = 5$

  17. Distance Measures („Metrics“): • Euclidean distance: $[(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2]^{1/2} = (4^2 + 2^2)^{1/2} = 4.472$ • Manhattan (Hamming) distance: $|x_{11} - x_{21}| + |x_{12} - x_{22}| = 4 + 2 = 6$ • Sup distance: $\max(|x_{11} - x_{21}|, |x_{12} - x_{22}|) = \max(4, 2) = 4$
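
The three distance measures can be reproduced in a few lines. This is a minimal sketch in plain Python; the function names are illustrative, and the points `P` and `Q` are hypothetical coordinates chosen only so that the coordinate differences are 4 and 2, since the transcript does not preserve the actual points.

```python
from math import sqrt


def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sup_distance(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Vectors from the Euclidean and Manhattan examples
A, B = (3, 0, 1), (5, 2, 0)
print(euclidean(A, B))   # 3.0
print(manhattan(A, B))   # 5

# Hypothetical 2-D points with coordinate differences of 4 and 2
P, Q = (0, 0), (4, 2)
print(round(euclidean(P, Q), 3))  # 4.472
print(manhattan(P, Q))            # 6
print(sup_distance(P, Q))         # 4
```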

  18. Binary Fingerprint

  19. Popular Similarity/Distance Coefficients • Similarity metrics: • Tanimoto coefficient • Dice coefficient • Cosine coefficient • Distance metrics: • Euclidean distance • Hamming distance • Soergel distance

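  20. Tanimoto Coefficient (Tc) • Definition: $T_c = \frac{c}{a + b - c}$, where a and b are the numbers of bits set in fingerprints A and B, and c is the number of bits set in both • value range: [0,1] • Tc is also known as the Jaccard coefficient • Tc is the most popular similarity coefficient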

  21. Example Tc Calculation for binary fingerprints A and B: a = 4, b = 4, c = 2
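
A minimal sketch of a Tanimoto calculation on binary fingerprints follows. It assumes the common convention in which a and b count the bits set in A and B and c counts the bits set in both. The two 8-bit fingerprints are hypothetical, constructed only to give a = 4, b = 4, c = 2, because the fingerprints from the slide are not preserved in the transcript.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for equal-length 0/1 lists."""
    a = sum(fp_a)                                # bits set in A
    b = sum(fp_b)                                # bits set in B
    c = sum(x & y for x, y in zip(fp_a, fp_b))   # bits set in both
    if a + b - c == 0:
        # Assumption: two all-zero fingerprints are treated as identical.
        return 1.0
    return c / (a + b - c)


# Hypothetical fingerprints with a = 4, b = 4, c = 2
fp_a = [1, 1, 1, 1, 0, 0, 0, 0]
fp_b = [0, 0, 1, 1, 1, 1, 0, 0]
print(round(tanimoto(fp_a, fp_b), 3))  # 0.333
```

Under this convention the example counts give Tc = 2 / (4 + 4 - 2) = 1/3.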

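  22. Dice Coefficient • Definition: $\mathrm{Dice} = \frac{2c}{a + b}$, where a and b are the numbers of bits set in fingerprints A and B, and c is the number of bits set in both • value range: [0,1] • monotonic with the Tanimoto coefficient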

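  23. Cosine Coefficient • Definition: $\mathrm{Cosine} = \frac{c}{\sqrt{a\,b}}$, where a and b are the numbers of bits set in fingerprints A and B, and c is the number of bits set in both • Properties: • value range: [0,1] • correlated with the Tanimoto coefficient but not strictly monotonic with it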

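  24. Hamming Distance • Definition: $D_{\mathrm{Hamming}} = a + b - 2c$, the number of bit positions in which fingerprints A and B differ (a, b: bits set in A, B; c: bits set in both) • value range: [0,N] (N, length of the fingerprint) • also called Manhattan/City Block distance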

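  25. Soergel Distance • Definition: $D_{\mathrm{Soergel}} = \frac{a + b - 2c}{a + b - c}$ (a, b: bits set in fingerprints A, B; c: bits set in both) • Properties: • value range: [0,1] • equivalent to (1 – Tc) for binary fingerprints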
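
The coefficients listed above can be compared side by side on one pair of fingerprints. This sketch assumes the bit-count convention (a, b: bits set in A and B; c: bits set in both) and fingerprints with at least one bit set each; the fingerprints are hypothetical and the function names are illustrative.

```python
from math import isclose, sqrt


def bit_counts(fp_a, fp_b):
    """Return (a, b, c) for two equal-length 0/1 lists."""
    a = sum(fp_a)
    b = sum(fp_b)
    c = sum(x & y for x, y in zip(fp_a, fp_b))
    return a, b, c


def tanimoto(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return c / (a + b - c)


def dice(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return 2 * c / (a + b)


def cosine(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return c / sqrt(a * b)


def hamming(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return a + b - 2 * c          # number of differing bit positions


def soergel(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return (a + b - 2 * c) / (a + b - c)


# Hypothetical fingerprints with a = 4, b = 4, c = 2
fp_a = [1, 1, 1, 1, 0, 0, 0, 0]
fp_b = [0, 0, 1, 1, 1, 1, 0, 0]

print(dice(fp_a, fp_b))     # 0.5
print(cosine(fp_a, fp_b))   # 0.5
print(hamming(fp_a, fp_b))  # 4

# For binary fingerprints the Soergel distance equals 1 - Tc
print(isclose(soergel(fp_a, fp_b), 1 - tanimoto(fp_a, fp_b)))  # True
```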

  26. Similarity coefficients

  27. Properties of Similarity and Distance Coefficients. Metric properties: (1) the distance values are non-negative, $d_{AB} \ge 0$, with $d_{AA} = d_{BB} = 0$; (2) symmetry, $d_{AB} = d_{BA}$; (3) triangle inequality, $d_{AB} \le d_{AC} + d_{BC}$. The Euclidean and Hamming distances, and the complement of the Tanimoto coefficient (1 – Tc) for dichotomous variables, obey all three properties. The complements of the Dice and Cosine coefficients, and of the Tanimoto coefficient for non-dichotomous variables, do not obey inequality (3). Coefficients are monotonic with each other if they produce the same similarity ranking
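
A small counterexample makes the triangle-inequality statement concrete. In this sketch, fingerprints are represented as sets of on-bit positions and a similarity is converted to a distance as 1 – similarity; the three sets and the helper name `obeys_triangle` are illustrative assumptions. One triple can only disprove the inequality, so the `True` result for Tanimoto illustrates, but does not prove, that its complement is a metric for binary data.

```python
def tanimoto(x, y):
    """Tanimoto similarity of two non-empty sets of on-bit positions."""
    return len(x & y) / len(x | y)


def dice(x, y):
    """Dice similarity of two non-empty sets of on-bit positions."""
    return 2 * len(x & y) / (len(x) + len(y))


def obeys_triangle(similarity, a, b, c, tol=1e-12):
    """Test d(A, B) <= d(A, C) + d(B, C) with d = 1 - similarity."""
    def d(x, y):
        return 1.0 - similarity(x, y)
    return d(a, b) <= d(a, c) + d(b, c) + tol


# Hypothetical fingerprints: A and B share no bits, C contains both
A, B, C = {1}, {2}, {1, 2}

# 1 - Dice: d(A,B) = 1, but d(A,C) + d(B,C) = 1/3 + 1/3
print(obeys_triangle(dice, A, B, C))      # False

# 1 - Tc: d(A,B) = 1 and d(A,C) + d(B,C) = 1/2 + 1/2
print(obeys_triangle(tanimoto, A, B, C))  # True
```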

  28. Similarity search Using bit strings to encode molecular size. A biphenyl query is compared to a series of analogues of increasing size. The Tanimoto coefficient, which is shown next to the corresponding structure, decreases with increasing size, until a limiting value is reached. D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

  29. Similarity search Molecular similarity at a range of Tanimoto coefficient values D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

  30. Similarity search The distribution of Tanimoto coefficient values found in database searches with a range of query molecules of increasing size and complexity D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

  31. Molecular Similarity. A comparison of the Soergel and Hamming distance values for two pairs of structures to illustrate the effect of molecular size. A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003

  32. Molecular Similarity. The maximum common subgraph (MCS) between the two molecules is in bold. Similarity = Nbonds(MCS) / Nbonds(query). A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003

  33. Activity landscape

  34. How important is the choice of descriptors? Inhibitors of acyl-CoA:cholesterol acyltransferase represented with MACCS (a), TGT (b), and Molprint2D (c) fingerprints.

  35. Continuous SARs: gradual changes in structure result in moderate changes in activity • “rolling hills” (G. Maggiora). Discontinuous SARs: small changes in structure have dramatic effects on activity • “cliffs” in activity landscapes. Structure-Activity Landscape Index: $SALI_{ij} = \Delta A_{ij} / \Delta S_{ij}$, where $\Delta A_{ij}$ ($\Delta S_{ij}$) is the difference between activities (similarities) of molecules i and j. R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646
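
A minimal sketch of the SALI calculation follows. It assumes that the similarity term in the denominator is taken as 1 – sim(i, j), with sim being for instance a Tanimoto coefficient, and that activities are on a logarithmic scale such as pIC50. The function name, the handling of identical fingerprints and the numerical values are illustrative assumptions, not data from the slides.

```python
import math


def sali(activity_i, activity_j, similarity):
    """SALI = |A_i - A_j| / (1 - sim_ij); large values indicate a cliff."""
    delta_a = abs(activity_i - activity_j)
    delta_s = 1.0 - similarity
    if delta_s == 0.0:
        # Assumption: identical fingerprints with different activities
        # are reported as an infinitely steep cliff, equal ones as 0.
        return math.inf if delta_a > 0 else 0.0
    return delta_a / delta_s


# Hypothetical pair: pIC50 values of 8.2 and 5.6, Tanimoto similarity 0.9
print(round(sali(8.2, 5.6, 0.9), 1))   # 26.0

# Same activity gap, but dissimilar structures: no cliff
print(round(sali(8.2, 5.6, 0.3), 2))   # 3.71

# Fingerprint-identical pair with different activities
print(sali(8.2, 5.6, 1.0))             # inf
```

The last case shows why a pair with Tc = 1.00 and very different potencies is awkward for similarity-based analysis: the index diverges.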

  36. Discontinuous SARs: VEGFR-2 tyrosine kinase inhibitors. A 6 nM compound and its 2390 nM analog have a MACCS Tc of 1.00. Small changes in structure have dramatic effects on activity • “cliffs” in activity landscapes • lead optimization, QSAR • bad news for molecular similarity analysis...

  37. Example of a “Classical” Discontinuous SAR: adenosine deaminase inhibitors. Any similarity method must recognize these compounds as being “similar” ... (MACCS Tanimoto similarity)

  38. Library design. Goal: to select a representative subset from a large database

  39. Chemical Space: overlapping similarity radii → redundancy; „void“ regions → lack of information

  40. Chemical Space: „void“ regions → lack of information

  41. Chemical Space: no redundancy, no „voids“ → optimally diverse compound library

  42. Subset selection from the libraries • Clustering • Dissimilarity-based methods • Cell-based methods • Optimisation techniques
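
As an illustration of the dissimilarity-based family, here is a minimal sketch of MaxMin selection, one commonly used scheme of this kind. The slides name only the family, so the choice of MaxMin, the function name, the seed rule and the one-descriptor data are all assumptions made for illustration.

```python
def maxmin_select(points, k, dist, seed_index=0):
    """Pick k diverse items: each new pick maximises its distance
    to the nearest item that has already been selected."""
    selected = [seed_index]
    # Distance from every point to its nearest selected point so far
    nearest = [dist(p, points[seed_index]) for p in points]
    while len(selected) < min(k, len(points)):
        candidates = [i for i in range(len(points)) if i not in selected]
        nxt = max(candidates, key=lambda i: nearest[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(p, points[nxt]))
    return selected


# Hypothetical library described by a single descriptor value
library = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
picked = maxmin_select(library, k=3, dist=lambda a, b: abs(a - b))
print(picked)                        # [0, 5, 3]
print([library[i] for i in picked])  # [0.0, 10.0, 5.0]
```

The three picks spread across the descriptor range instead of clustering around 0, which is the behaviour a diverse subset should show.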

  43. Clustering in chemistry

  44. What is clustering? • Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group • A technique to understand, simplify and interpret large amounts of multidimensional data • Classification without labels (“unsupervised learning”)

  45. Where is clustering used? General: data mining, statistical data analysis, data compression, image segmentation, document classification (information retrieval). Chemical: • representative sample, • subset selection, • classification of new compounds

  46. Overall strategy • Select descriptors • Generate descriptors for all items • Scale descriptors • Define similarity measure („metrics“) • Apply an appropriate clustering method to group the items on the basis of the chosen descriptors and similarity measure • Analyse results

  47. Data Presentation. The library contains n molecules; each molecule is described by p descriptors • Pattern matrix: n molecules × p descriptors • Proximity matrix: n molecules × n molecules, with $d_{ii} = 0$; $d_{ij} = d_{ji}$
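
The step from a pattern matrix to a proximity matrix can be sketched as follows. The example assumes autoscaling (zero mean, unit variance per descriptor) as the scaling step and the Euclidean distance as the proximity measure; both are choices made for illustration, and the descriptor values are made up.

```python
from math import sqrt
from statistics import mean, pstdev


def autoscale(pattern):
    """Scale every descriptor column to zero mean and unit variance."""
    columns = list(zip(*pattern))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 so it becomes all zeros
    spreads = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, spreads)]
            for row in pattern]


def proximity_matrix(pattern):
    """Symmetric n x n matrix of Euclidean distances between rows."""
    n = len(pattern)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dij = sqrt(sum((x - y) ** 2
                           for x, y in zip(pattern[i], pattern[j])))
            d[i][j] = d[j][i] = dij
    return d


# Hypothetical pattern matrix: 3 molecules x 2 descriptors
pattern = [
    [180.2, 1.2],
    [250.3, 3.1],
    [94.1, 1.5],
]
scaled = autoscale(pattern)
prox = proximity_matrix(scaled)
for row in prox:
    print([round(v, 2) for v in row])
```

Scaling matters here because the first column has much larger raw values than the second and would otherwise dominate every distance.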

  48. Clustering methods • Hierarchical: Agglomerative (Single Link, Complete Link, Group Average, Weighted Group Average, Centroid, Median, Ward); Divisive (Monothetic, Polythetic) • Non-hierarchical: Single Pass, Nearest Neighbour (Jarvis-Patrick), Relocation, Mixture Model, Topographic, Others
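
To make one entry of this taxonomy concrete, the following is a minimal sketch of single-link agglomerative clustering on a precomputed proximity matrix. It is a naive illustration rather than an efficient implementation; the function name, the stopping rule (a fixed number of clusters) and the one-descriptor data are assumptions.

```python
def single_link(dist_matrix, n_clusters):
    """Merge the two closest clusters until n_clusters remain.
    Cluster distance = smallest distance between any two members."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(dist_matrix))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = min(dist_matrix[i][j]
                        for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical items described by one descriptor value each
values = [0.0, 0.2, 0.3, 5.0, 5.4]
matrix = [[abs(x - y) for y in values] for x in values]
print(single_link(matrix, n_clusters=2))  # [[0, 1, 2], [3, 4]]
```

Replacing `min` with `max` in the cluster-distance line would give complete link instead of single link.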
