Molecular Similarity and Chemical Families: The Homogeneity Approach
C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
11th April, 2000
2nd Sheffield Chemoinformatics Conference, Sheffield, UK
Bioreason, Inc

Presentation Outline
• Introduction
• Molecular similarity
• Observations on chemical data
• Analyzing screening data
• Using a traditional approach
• The Homogeneity Approach
• Definitions
• Implementation and experimental results
• Conclusions

Molecular Similarity
• Widely used throughout the drug discovery process
• Sample applications:
  • Assessing the diversity of a chemical dataset
  • Picking a representative dataset from a compound library
  • Given a compound and a compound library, identifying the subset of similar compounds
  • Analyzing screening data
    • Major step: organizing screening data into chemical families

Typical Drug Discovery Process
[Diagram. Recoverable labels: Start, Chemistry, Library, Assay, Screening (highlighted), Data, Data Analysis (highlighted), Further exploration, Drug Candidates]

Technology Employed
• Compound representation methods
  • Fingerprints/bit vectors, graph-based, ...
  • 2D keys vs. 3D keys, fragment-based vs. distance-based, ...
• Similarity and distance measures
  • Tanimoto, Euclidean, ..., graph-based, ...
• Clustering methods
• Classification methods
• Substructure searching/(sub)graph matching
• ...

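The two measures named on this slide can be sketched for binary fingerprints as follows. This is a minimal illustration that assumes fingerprints are stored as equal-length strings of "0" and "1"; the function names and the string encoding are choices made for this example, not taken from the talk.

```python
from math import sqrt


def tanimoto(a: str, b: str) -> float:
    """Tanimoto similarity: bits set in both / bits set in either."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = sum(1 for x, y in zip(a, b) if x == "1" or y == "1")
    # Convention assumed here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


def euclidean(a: str, b: str) -> float:
    """Euclidean distance; for bit vectors this is sqrt(number of differing bits)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sqrt(sum(1 for x, y in zip(a, b) if x != y))
```

Tanimoto is a similarity (1.0 means identical set bits), while Euclidean is a distance (0.0 means identical vectors), so the two are read in opposite directions.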
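Analyzing Chemical Compounds (1)
• Dictionary of keys: O, N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...
[Figure: an example structure (atom labels H, N, N, O, H, O) encoded against the dictionary as the bit vector 10111000001...]
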
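A toy sketch of how a dictionary of keys turns a compound into a bit vector. Everything here is a simplification for illustration: the four keys, the `KEYS` list, and the representation of a molecule as a dict of precomputed feature counts are assumptions. Real structural keys require substructure matching on the molecular graph.

```python
# Hypothetical mini-dictionary: each key is a (label, predicate) pair evaluated
# on a dict of precomputed feature counts for one molecule.
KEYS = [
    ("O", lambda m: m.get("O", 0) >= 1),        # at least one oxygen
    ("N > 1", lambda m: m.get("N", 0) > 1),     # more than one nitrogen
    ("NH", lambda m: m.get("NH", 0) >= 1),      # at least one N-H group
    ("CH3 > 1", lambda m: m.get("CH3", 0) > 1), # more than one methyl group
]


def fingerprint(feature_counts: dict) -> str:
    """Return one bit per key, in dictionary order: '1' if the key hits."""
    return "".join("1" if hit(feature_counts) else "0" for _, hit in KEYS)


mol_a = {"O": 2, "N": 2, "NH": 2, "CH3": 0}
mol_b = {"O": 1, "N": 3, "NH": 1, "CH3": 1}
print(fingerprint(mol_a), fingerprint(mol_b))
```

Note that `mol_a` and `mol_b` have different feature counts yet produce the same bit string, because each bit only records whether a key hit.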
Analyzing Chemical Compounds (2)
• Compounds are multi-domain:
  • multiple occurrences of a key/substructure
  • members of more than one chemical family

Analyzing Chemical Compounds (3)
• Information loss! E.g. the bit vector does not record “how” a key hits

Dataset Used
• Derived from the NCI anti-HIV program
• Latest release, Oct. 99: 43,382 compounds
• Cell-based, EC50 (effective concentration at which the test compound protects the cells by 50%)
• Pre-processing:
  • Molecular weight <= 500
  • Some compounds had multiple EC50 values; kept the highest concentration
  • 33,245 compounds left
• Activities: converted from molar concentrations to -log
• Activity threshold used: 5.5
• Training set size (actives): 503

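The pre-processing steps on this slide can be sketched as below. The record layout `(compound_id, mol_weight, ec50_molar)`, the function names, the base-10 logarithm, and treating "active" as activity >= 5.5 (rather than strictly greater) are assumptions made for this example.

```python
from math import log10

MAX_MOL_WEIGHT = 500.0      # molecular weight cutoff from the slide
ACTIVITY_THRESHOLD = 5.5    # threshold on -log10(EC50 in mol/L)


def preprocess(records):
    """Map compound id -> activity (-log10 of EC50) after filtering.

    records: iterable of (compound_id, mol_weight, ec50_molar) tuples,
    with ec50_molar > 0. A compound may appear more than once.
    """
    highest = {}
    for cid, mol_weight, ec50 in records:
        if mol_weight > MAX_MOL_WEIGHT:
            continue  # drop compounds heavier than the cutoff
        # Several EC50 values for one compound: keep the highest concentration.
        if cid not in highest or ec50 > highest[cid]:
            highest[cid] = ec50
    return {cid: -log10(ec50) for cid, ec50 in highest.items()}


def actives(activities):
    """Ids whose activity meets the threshold (>= is an assumption)."""
    return sorted(cid for cid, act in activities.items() if act >= ACTIVITY_THRESHOLD)
```

Keeping the highest concentration is the conservative choice: it yields the lowest activity value for a compound, so a compound is only labeled active if even its weakest measurement passes the threshold.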
Analyzing Screening Data: Typical Approach
• Goal: data reduction
  • To a manageable size
  • In an organized fashion
  • With minimal information loss
• Represent molecules as vectors, often binary
• Similarity/distance measure
• Clustering algorithm
• Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)

Hierarchical Agglomerative Clustering Method
• NCI-HIV dataset
• 503-compound subset selected by activity
• Clustered using Ward's method with Euclidean distance on bit vectors obtained by applying MACCS-like keys
• Cluster level selection using the Kelley method
• Results:
  • 70 (meta)clusters
  • Complete coverage of the dataset, no singletons!
  • Average metacluster size: 7.2 compounds

Method Evaluation - Chemists
• Results validation by comparison to known truth:
  • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
  • Smaller, less well-represented families were not always detected, e.g. stilbenes, ...
• Results validation by assessing their quality:
  • On average, chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
  • The remaining clusters (~2/3) were difficult to interpret:
    • Compounds that shouldn’t be in some clusters
    • Compounds missing from clusters they should have been in (misclassified or not)
    • Clusters made of dissimilar/diverse compounds
  • Experts were puzzled by the absence of singletons

Method Evaluation - Computational
• Analyzed the 70 groups of compounds
• Simple method:
  • Average nearest neighbor distance within a set of compounds
  • Distance computed using the bit vectors of the compounds
• Results:
  • 43/70: low average nearest neighbor distance
  • 22/70: moderate average nearest neighbor distance
  • 5/70: high average nearest neighbor distance
• Overall, most of the groups had low diversity; this is expected since the metaclusters were built using bit vectors

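The simple diversity measure described above can be sketched as follows. The slide does not say which bit-vector distance was used for this step; Euclidean distance (as in the clustering) is assumed here, and the function names are illustrative.

```python
from math import sqrt


def bit_distance(a: str, b: str) -> float:
    """Euclidean distance between two equal-length bit strings (assumed metric)."""
    return sqrt(sum(1 for x, y in zip(a, b) if x != y))


def avg_nearest_neighbor_distance(fingerprints) -> float:
    """Mean, over all compounds, of the distance to the closest other compound."""
    fps = list(fingerprints)
    if len(fps) < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i, fp in enumerate(fps):
        # Nearest neighbor of compound i among all other compounds in the set.
        total += min(bit_distance(fp, other) for j, other in enumerate(fps) if j != i)
    return total / len(fps)
```

A low value means every compound has at least one close neighbor in fingerprint space. It does not by itself mean the whole set is tight, since a set made of several tight pairs far from each other also scores low.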
The problem
• Confusing?
  • The method functioned correctly from a computational perspective
  • But the results were not as satisfying to the human expert
• Clustering results often don’t:
  • match expectations
  • make chemical sense
• Why?
  • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
  • No chemical “common sense” influences the clustering process

The road ahead… (1)
• What is the end goal of screening data analysis?
  • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
• How are we attempting to do it?
  • Clustering and classification methods using vector encoding representations of molecules
• But:
  • clustering only gives groups of compounds that have similar vector representations, and
  • a successful classification session requires that one knows the chemical families of interest a priori

The road ahead… (2)
• So, what do we do now that we are aware of the loose coupling between clusters obtained traditionally and human experts’ expectations?
  • Discover what the experts want
  • Adapt our process to match results and expectations

Definitions
• Chemical family:
  • A set of highly similar compounds sharing a common scaffold; in other words, a set of compounds with high homogeneity
• Homogeneity:
  • High structural similarity
  • Based not only on similarity of molecular vectors but also on the presence of a significant common scaffold
• Scaffold:
  • A substructure defined as a specific configuration of atom types and bond types

Processing traditional method results
• Processing the results of traditional methods:
  • Is easier than a complete re-design/re-implementation
  • Will “remove” results that are not chemically sensible
  • Will make life easier for human analysts by allowing them to focus on easily recognizable and interpretable pieces of knowledge
• Approach:
  • Compute and use structural homogeneity on the results of traditional methods
  • In effect, construct “chemically sensible” methods for selecting the important compound groups

Identifying Scaffolds
• Maximum Common Substructure (MCS) extraction:
  • Using our own fast, efficient implementations
• Highlights of analysis:
  • 7 out of 70 compound sets: common scaffold size < 2!
  • 5 MCSs appeared multiple times
    • Range: 2-6, mostly benzene rings
  • A total of 53 different scaffolds
• MCS size:
  • Ranged from less than 2 atoms to greater than 14 atoms

Introducing Homogeneity
• Cluster homogeneity:
  • Fingerprint homogeneity:
    • Overall, quite good average nearest neighbor distance
  • Structural homogeneity:
    • Defined as: (number of atoms in the MCS) / (average number of atoms per molecule in the set)
• Structural homogeneity threshold: 1/3
  • MCS covering at least a third of the average molecule size
• Results:
  • 23/70 clusters below threshold
  • 47/70 clusters above threshold

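The structural homogeneity ratio and its 1/3 threshold can be written down directly. This sketch takes the MCS size as an input rather than computing it; the function names are illustrative, and whether hydrogens are counted as atoms is not stated on the slide, so the inputs are assumed to use one consistent convention.

```python
STRUCTURAL_HOMOGENEITY_THRESHOLD = 1 / 3  # threshold from the slide


def structural_homogeneity(mcs_atoms: int, molecule_atom_counts) -> float:
    """Number of atoms in the MCS divided by the average atoms per molecule."""
    counts = list(molecule_atom_counts)
    if not counts:
        raise ValueError("the compound set is empty")
    average_size = sum(counts) / len(counts)
    return mcs_atoms / average_size


def is_structurally_homogeneous(mcs_atoms: int, molecule_atom_counts,
                                threshold: float = STRUCTURAL_HOMOGENEITY_THRESHOLD) -> bool:
    """True if the MCS covers at least `threshold` of the average molecule size."""
    return structural_homogeneity(mcs_atoms, molecule_atom_counts) >= threshold


# A 6-atom scaffold (e.g. a benzene ring) shared by molecules averaging 20 atoms
print(structural_homogeneity(6, [18, 20, 22]))
# A 14-atom scaffold shared by the same set
print(structural_homogeneity(14, [18, 20, 22]))
```

With these example sizes, a bare benzene ring covers less than a third of the average molecule and falls below the threshold, while the 14-atom scaffold passes.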
Method Assessment (1)
• Results were used to assign priority to clusters:
  • Low priority - low likelihood of chemical sense:
    • clusters with small scaffolds and low structural homogeneity
    • clusters with insignificant scaffolds and low-to-moderate structural homogeneity
  • High priority - high likelihood of chemical sense:
    • well-defined clusters with high structural homogeneity and big, significant scaffolds
• The approach did make life easier for human analysts
  • Ability to find important information faster

Method Assessment (2)
• Prioritization assessment:
  • The 23 clusters that were not structurally homogeneous were uninteresting to chemists
  • The 47 structurally homogeneous clusters included all those (20-30) previously approved by chemists as chemical families
• However, experts complained about:
  • The low information content of the clustering results
    • Too many clusters, too little knowledge
  • The amount of information never found!
    • High priority clusters contained only 2/3 of the compounds analyzed!
    • Clusters approved as chemical families, from which knowledge could be derived easily, contained only 1/3 of the compounds!
    • Known knowledge was never found

The road ahead… (3)
• Do traditionally obtained clusters relate to chemical families?
• Do we need a different approach?
• Introduce chemically “aware” methods:
  • No simple clustering methods
  • Take into account structural homogeneity
  • Accommodate the multi-domain nature of molecules
  • Present results in a format that facilitates interpretation and knowledge discovery by chemists

A different approach: Can it work?
• We have been working on “chemically aware” screening data analysis methods
• Results on the same dataset with a typical Bioreason analysis:
  • 102 classes, all with high structural homogeneity
  • All classes were easy to interpret
  • Only 10% of classes not interesting to chemists (~50 compounds)
  • 47 singletons (~10% of dataset)
• Information content much higher than with the traditional approach:
  • 90% of compounds placed in homogeneous clusters (vs. 66% with the traditional method)
  • 80% of compounds placed in clusters approved as structural families (vs. 34% with the traditional method)
• Multi-domain nature is accommodated

Conclusions
• Molecular fingerprint similarity does not guarantee high structural molecular similarity
• Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
• As a consequence, relations, including clusters, obtained via traditional methods often don’t make chemical sense
• Structural homogeneity may be employed to enable the formation of clusters and the identification of chemical relations closer to chemists’ expectations

Acknowledgements
• Patricia Bacha
• Bobi Den Hartog
• Info:
  • nicolaou@bioreason.com
  • www.bioreason.com