
Molecular Similarity and Chemical Families: The Homogeneity Approach

C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck. 11th April, 2000. 2nd Sheffield Chemoinformatics Conference, Sheffield, UK.


Presentation Transcript


  1. Molecular Similarity and Chemical Families: The Homogeneity Approach
     C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
     11th April, 2000
     2nd Sheffield Chemoinformatics Conference, Sheffield, UK
     Bioreason, Inc

  2. Presentation Outline
     • Introduction
       • Molecular similarity
       • Observations on chemical data
     • Analyzing screening data
       • Using a traditional approach
     • The Homogeneity Approach
       • Definitions
       • Implementation and experimental results
     • Conclusions

  3. Molecular Similarity
     • Widely used throughout the drug discovery process
     • Sample applications:
       • Assessing the diversity of a chemical dataset
       • Picking a representative dataset from a compound library
       • Given a compound and a compound library, identifying the subset of similar compounds
       • Analyzing screening data
         • Major step: organizing screening data into chemical families

  4. Typical Drug Discovery Process
     [Flow diagram. Box labels: Start, Library, Assay, Screening (highlighted), Data, Data Analysis (highlighted), Further exploration, Chemistry, Drug Candidates]

  5. Technology Employed
     • Compound representation methods
       • Fingerprints/bit vectors, graph-based, ...
       • 2D keys vs. 3D keys, fragment-based vs. distance-based, ...
     • Similarity and distance measures
       • Tanimoto, Euclidean, ..., graph-based, ...
     • Clustering methods
     • Classification methods
     • Substructure searching/(sub)graph matching
     • ...
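
Note (illustrative sketch, not part of the presentation): the two measures named on this slide, Tanimoto and Euclidean, can be written for bit vectors roughly as follows. The fingerprints `fp_a` and `fp_b` are made-up 8-bit examples, and treating two all-zero vectors as identical is a convention chosen here, not something the slides specify.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto similarity between two equal-length bit vectors (lists of 0/1)."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both vectors
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Convention chosen here: two all-zero vectors count as identical.
    return both / either if either else 1.0


def euclidean(a, b):
    """Euclidean distance between two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 8-bit fingerprints, for illustration only.
fp_a = [1, 0, 1, 1, 1, 0, 0, 0]
fp_b = [1, 0, 1, 0, 1, 1, 0, 0]

print(tanimoto(fp_a, fp_b))   # 3 shared bits out of 5 set in either vector
print(euclidean(fp_a, fp_b))  # two differing bits, so the square root of 2
```

Tanimoto is a similarity (1.0 means identical bit patterns), while Euclidean is a distance (0.0 means identical), so the two cannot be compared directly without converting one of them.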

  6. Analyzing Chemical Compounds (1)
     [Figure: a dictionary of keys (O, N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...) is matched against a molecular structure to produce a bit vector such as 10111000001...]

  7. Analyzing Chemical Compounds (2)
     • Compounds are multi-domain:
       • Multiple occurrences of a key/substructure
       • Members of more than one chemical family

  8. Analyzing Chemical Compounds (3)
     Information loss! E.g. “how” does a key hit?
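
Note (illustrative sketch, not part of the presentation): the information loss described on the three slides above can be seen in a toy key-based fingerprint. The key dictionary `KEYS` and the fragment lists are hypothetical stand-ins; a real system would match substructure patterns against the molecular graph rather than look up labels.

```python
from collections import Counter

# Hypothetical key dictionary: each key is just a fragment label here.
KEYS = ["N-C-O", "Q-N", "CH3", "benzene", "NH"]


def key_counts(fragments):
    """Count how many times each dictionary key occurs in a molecule."""
    counts = Counter(fragments)
    return [counts[key] for key in KEYS]


def to_bits(counts):
    """Collapse occurrence counts into a presence/absence bit string."""
    return "".join("1" if c > 0 else "0" for c in counts)


# Two made-up molecules, described by the fragments they contain.
mol_a = ["benzene", "NH", "CH3"]
mol_b = ["benzene", "benzene", "NH", "CH3", "CH3", "CH3"]

print(key_counts(mol_a), to_bits(key_counts(mol_a)))
print(key_counts(mol_b), to_bits(key_counts(mol_b)))
```

Both molecules receive the same bit string even though the second contains two rings and three methyl groups. The bit vector records that a key hit, but not how often or where.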

  9. Dataset Used
     • Derived from the NCI anti-HIV program
       • Latest release, Oct. 99: 43,382 compounds
       • Cell-based, EC50 (effective concentration at which the test compound protects the cells by 50%)
     • Pre-processing:
       • Molecular weight <= 500
       • Multiple EC50 values for some compounds; kept the highest concentration
       • 33,245 compounds left
     • Activities: converted from molar concentrations to -log
     • Activity threshold used: 5.5
     • Training set size (actives): 503
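
Note (illustrative sketch, not part of the presentation): the pre-processing steps listed on this slide might look roughly like this. Several details are assumptions: the record layout `(compound_id, mol_weight, ec50_molar)`, the use of a base-10 logarithm, treating the threshold as inclusive, and skipping non-positive EC50 values. The sample records are invented.

```python
from math import log10

MAX_MOL_WEIGHT = 500.0     # from the slide: molecular weight <= 500
ACTIVITY_THRESHOLD = 5.5   # from the slide; inclusive comparison is an assumption


def preprocess(records):
    """Map compound id -> -log10(EC50), keeping the highest EC50 per compound.

    records: iterable of (compound_id, mol_weight, ec50_molar) tuples
    (a hypothetical layout chosen for this sketch).
    """
    highest = {}
    for cid, mol_weight, ec50 in records:
        if mol_weight > MAX_MOL_WEIGHT:
            continue  # molecular weight filter
        if ec50 <= 0:
            continue  # guard added here: the logarithm is undefined
        # Keep the highest concentration, i.e. the most conservative activity.
        highest[cid] = max(ec50, highest.get(cid, ec50))
    return {cid: -log10(ec50) for cid, ec50 in highest.items()}


def actives(activities):
    """Compound ids whose activity reaches the threshold."""
    return sorted(cid for cid, act in activities.items() if act >= ACTIVITY_THRESHOLD)


# Invented records, for illustration only.
sample = [
    ("A", 320.0, 1e-6),
    ("A", 320.0, 1e-7),   # duplicate measurement; the higher EC50 (1e-6) is kept
    ("B", 450.0, 1e-5),   # activity of about 5.0, below the threshold
    ("C", 612.0, 1e-8),   # removed by the molecular weight filter
]
activities = preprocess(sample)
print(activities)
print(actives(activities))
```

Keeping the highest concentration means a compound is judged by its weakest measurement, so a compound only counts as active if every retained measurement supports it.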

  10. Analyzing Screening Data: Typical Approach
      • Goal: data reduction
        • To a manageable size
        • In an organized fashion
        • With minimal information loss
      • Represent molecules as vectors, often binary
      • Similarity/distance measure
      • Clustering algorithm
      • Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)

  11. Hierarchical Agglomerative Clustering Method
      • NCI-HIV dataset
        • 503-compound subset based on activity
      • Clustered using Ward's method, Euclidean distance, and bit vectors obtained by applying MACCS-like keys
      • Cluster level selection using the Kelley method
      • Results:
        • 70 (meta)clusters
        • Complete coverage of the dataset, no singletons!
        • Average metacluster size: 7.2 compounds
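
Note (illustrative sketch, not part of the presentation): the slides do not show how the Kelley level selection was implemented. As the method is commonly described, each level of the hierarchy gets a penalty equal to its number of clusters plus its average within-cluster spread rescaled to the range 1 to N-1, and the level with the smallest penalty is chosen. The sketch below assumes the hierarchy and the per-level average spreads have already been computed; the function name and the `levels` values are invented.

```python
def kelley_level(levels, n_items):
    """Pick the number of clusters with the lowest Kelley-style penalty.

    levels:  list of (n_clusters, avg_spread) pairs, one per hierarchy level,
             where avg_spread is the mean within-cluster spread at that level.
    n_items: number of clustered items (N).
    """
    spreads = [spread for _, spread in levels]
    lo, hi = min(spreads), max(spreads)
    span = hi - lo
    best = None
    for n_clusters, spread in levels:
        # Rescale the spread to [1, N-1] so it is comparable to a cluster count.
        norm = 1.0 + (n_items - 2) * (spread - lo) / span if span else 1.0
        penalty = norm + n_clusters
        if best is None or penalty < best[0]:
            best = (penalty, n_clusters)
    return best[1]


# Invented spreads for a 10-item hierarchy, for illustration only.
levels = [(1, 4.0), (2, 2.5), (3, 1.0), (5, 0.8), (8, 0.5)]
print(kelley_level(levels, n_items=10))  # selects the 3-cluster level
```

The penalty trades tightness against cluster count: few clusters are loose and many clusters are tight, and the minimum marks the level where further splitting stops paying for itself.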

  12. Method Evaluation - Chemists
      • Results validation by comparing to known truth:
        • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
        • Smaller, less well-represented families were not always detected, e.g. stilbenes, ...
      • Results validation by assessing their quality:
        • On average, chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
        • The remaining clusters (~2/3) were difficult to interpret:
          • Compounds that shouldn’t be in some clusters
          • Compounds that should have been in some clusters (misclassified or not)
          • Clusters made up of dissimilar/diverse compounds
        • Experts were puzzled by the absence of singletons

  13. Method Evaluation - Computational
      • Analyzed 70 groups of compounds:
        • Simple method:
          • Average nearest neighbor distance within a set of compounds
          • Distance computed using the bit vectors of the compounds
        • 43/70: fairly low average nearest neighbor distance
        • 22/70: moderate average nearest neighbor distance
        • 5/70: quite high average nearest neighbor distance
      • Overall, most of the groups had low diversity; this was expected, since the metaclusters were built using bit vectors
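
Note (illustrative sketch, not part of the presentation): the average nearest neighbor distance within a group can be computed as below. The slide only says the distance is computed from the bit vectors; Euclidean distance is assumed here because it is the measure used for the clustering. The `tight` and `loose` groups are invented 4-bit examples.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length bit vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def avg_nearest_neighbor_distance(fingerprints):
    """Mean, over all members, of the distance to the closest other member."""
    if len(fingerprints) < 2:
        raise ValueError("need at least two fingerprints")
    total = 0.0
    for i, fp in enumerate(fingerprints):
        total += min(
            euclidean(fp, other)
            for j, other in enumerate(fingerprints)
            if j != i
        )
    return total / len(fingerprints)


# Invented 4-bit fingerprints, for illustration only.
tight = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 1, 0]]
loose = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]

print(avg_nearest_neighbor_distance(tight))
print(avg_nearest_neighbor_distance(loose))
```

A low value only says that every member has at least one close neighbor in fingerprint space. It says nothing about whether the members share a scaffold, which is the gap the rest of the talk addresses.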

  14. The problem
      • Confusing?
        • The method functioned correctly from a computational perspective
        • But the results were not as satisfying to the human expert
      • Clustering results often don’t:
        • Match expectations
        • Make chemical sense
      • Why?
        • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
        • No chemical “common sense” influences the clustering process

  15. The road ahead… (1)
      • What is the end goal of screening data analysis?
        • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
      • How are we attempting to do it?
        • Clustering and classification methods using vector encodings of molecules
      • But:
        • Clustering only gives groups of compounds that have similar vector representations, and
        • A successful classification session requires that one knows the chemical families of interest a priori

  16. The road ahead… (2)
      • So, what do we do now that we are aware of the loose coupling between traditionally obtained clusters and human experts’ expectations?
        • Discover what the experts want
        • Adapt our process so that results match expectations

  17. Definitions
      • Chemical family:
        • A set of highly similar compounds sharing a common scaffold; alternatively, a set of compounds with high homogeneity
      • Homogeneity:
        • High structural similarity
        • Based not only on similarity of molecular vectors but also on the presence of a significant common scaffold
      • Scaffold:
        • A substructure defined as a specific configuration of atom types and bond types

  18. Processing traditional method results
      • Processing the results of traditional methods:
        • Is easier than a complete re-design/re-implementation
        • Will “remove” results that are not chemically sensible
        • Will make life easier for human analysts by letting them focus on easily recognizable and interpretable pieces of knowledge
      • Approach:
        • Compute structural homogeneity on the results of traditional methods and use it to construct “chemically sensible” methods for selecting the important compound groups

  19. Identifying Scaffolds
      • Maximum Common Substructure (MCS) extraction:
        • Using our own extremely fast and efficient implementations
      • Highlights of analysis:
        • 7 out of 70 compound sets: common scaffold size < 2!
        • 5 MCSs appeared multiple times
          • Range: 2-6, mostly benzene rings
        • A total of 53 different scaffolds
      • MCS size:
        • Ranged from fewer than 2 atoms to more than 14 atoms

  20. Introducing Homogeneity
      • Cluster homogeneity:
        • Fingerprint homogeneity:
          • Overall quite good average nearest neighbor distance
        • Structural homogeneity:
          • Used: # of atoms in MCS / avg. # of atoms in set molecules
          • Structural homogeneity threshold: 1/3
            • MCS covering at least a third of the average molecule size
      • Results:
        • 23/70 clusters below threshold
        • 47 above threshold
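
Note (illustrative sketch, not part of the presentation): the structural homogeneity ratio and its 1/3 threshold from this slide translate directly into code. The MCS size is taken as an input because the MCS extraction itself is not described in the slides. Whether atom counts include hydrogens is not stated, so the counts below are simply "atoms" as supplied by the caller, and the example clusters are invented.

```python
STRUCTURAL_HOMOGENEITY_THRESHOLD = 1 / 3  # from the slide


def structural_homogeneity(mcs_atoms, molecule_atom_counts):
    """Number of atoms in the MCS divided by the average molecule size."""
    if not molecule_atom_counts:
        raise ValueError("cluster has no molecules")
    avg_atoms = sum(molecule_atom_counts) / len(molecule_atom_counts)
    return mcs_atoms / avg_atoms


def is_structurally_homogeneous(mcs_atoms, molecule_atom_counts):
    """True if the MCS covers at least a third of the average molecule size."""
    ratio = structural_homogeneity(mcs_atoms, molecule_atom_counts)
    return ratio >= STRUCTURAL_HOMOGENEITY_THRESHOLD


# Invented clusters, for illustration only.
# A 6-atom ring shared by molecules averaging 24 atoms: ratio 0.25.
print(is_structurally_homogeneous(6, [20, 24, 28]))
# A 12-atom scaffold shared by molecules averaging 20 atoms: ratio 0.6.
print(is_structurally_homogeneous(12, [18, 20, 22]))
```

The first cluster shows why a shared benzene ring alone does not make a chemical family under this definition: the common part is too small relative to the molecules that contain it.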

  21. Method Assessment (1)
      • Results were used to assign priority to clusters:
        • Low priority - low likelihood of chemical sense:
          • Clusters with small scaffolds, low structural homogeneity
          • Clusters with insignificant scaffolds, low-to-moderate structural homogeneity
        • High priority - high likelihood of chemical sense:
          • Well-defined clusters with high structural homogeneity and big, significant scaffolds
      • The approach did make life easier for human analysts
        • Ability to find important information faster

  22. Method Assessment (2)
      • Prioritization assessment:
        • The 23 non-structurally homogeneous clusters were uninteresting to chemists
        • The 47 structurally homogeneous clusters included all those (20-30) previously approved by chemists as chemical families
      • However, experts complained about:
        • The low information content of the clustering results
          • Too many clusters, too little knowledge
        • The amount of information never found!
          • High-priority clusters contained only 2/3 of the compounds analyzed!
          • Clusters approved as chemical families, from which knowledge could be derived easily, contained only 1/3 of the compounds!
          • Known knowledge was never found

  23. The road ahead… (3)
      • Do traditionally obtained clusters relate to chemical families?
      • Do we need a different approach?
        • Introduce chemically “aware” methods
          • No simple clustering methods
          • Take structural homogeneity into account
          • Accommodate the multi-domain nature of molecules
          • Present results in a format that facilitates interpretation and knowledge discovery by chemists

  24. A different approach: Can it work?
      • We have been working on “chemically aware” screening data analysis methods
      • Results on the same dataset with a typical Bioreason analysis:
        • 102 classes, all with high structural homogeneity
        • All classes were easy to interpret
        • Only 10% of classes were not interesting to chemists (~50 compounds)
        • 47 singletons (~10% of dataset)
      • Information content much higher than with the traditional approach:
        • 90% of compounds placed in homogeneous clusters (vs. 66% with the traditional method)
        • 80% of compounds placed in clusters approved as structural families (vs. 34% with the traditional method)
        • Multi-domain nature is accommodated
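
Note (illustrative sketch, not part of the presentation): the rounded percentages on this slide can be checked against the counts given earlier in the talk (503 active compounds, 70 traditional metaclusters). The check assumes each compound is counted once, which is only approximate because the multi-domain approach lets a compound belong to more than one class, and it takes the "~50 compounds" figure at face value.

```python
N_ACTIVES = 503            # training set size from the dataset slide
N_SINGLETONS = 47          # from this slide
N_UNINTERESTING = 50       # "~50 compounds" in classes chemists found uninteresting
N_TRADITIONAL_CLUSTERS = 70


def percent(part, whole):
    """Share of `whole` represented by `part`, in percent."""
    return 100.0 * part / whole


singleton_pct = percent(N_SINGLETONS, N_ACTIVES)
in_homogeneous_pct = percent(N_ACTIVES - N_SINGLETONS, N_ACTIVES)
in_approved_pct = percent(N_ACTIVES - N_SINGLETONS - N_UNINTERESTING, N_ACTIVES)
avg_traditional_cluster_size = N_ACTIVES / N_TRADITIONAL_CLUSTERS

print(round(singleton_pct, 1))
print(round(in_homogeneous_pct, 1))
print(round(in_approved_pct, 1))
print(round(avg_traditional_cluster_size, 1))
```

Under these assumptions the figures are mutually consistent: about 9% singletons, about 91% of compounds in homogeneous classes, about 81% in approved classes, and an average traditional metacluster size of about 7.2.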

  25. Conclusions
      • Molecular fingerprint similarity does not guarantee high structural similarity between molecules
      • Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
      • As a consequence, relations (including clusters) obtained via traditional methods often don’t make chemical sense
      • Structural homogeneity may be employed to enable the formation of clusters and the identification of chemical relations closer to chemists’ expectations

  26. Acknowledgements
      • Patricia Bacha
      • Bobi Den Hartog
      • Info:
        • nicolaou@bioreason.com
        • www.bioreason.com
