Fusing database rankings in similarity-based virtual screening • Peter Willett, University of Sheffield
Overview • Similarity-based virtual screening • Combination of similarity rankings • Similarity fusion • Group fusion • Comparison of fusion rules
Drug discovery • The pharmaceutical industry has been one of the great success stories of scientific research, discovering a range of novel drugs for important therapeutic areas • The computer has revolutionised how the industry uses chemical (and increasingly biological) information • Many of these developments are within the discipline we now know as chemoinformatics • “Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information” (G. Paris at a 1999 ACS meeting, quoted at http://www.warr.com/warrzone.htm) • Focus on structural information (2D or 3D) cf bioinformatics
Virtual screening • Chemoinformatics covers a wide range of techniques • Here, focus on virtual screening of existing public and in-house databases • Tools to rank compounds in order of decreasing probability of activity • The top-ranked molecules are then prioritised for biological screening • A range of virtual screening methods available, with similarity searching being one of the best established and most widely used
Similarity searching • Use of a similarity measure to quantify the resemblance between an active reference (or target) structure and each database structure • Given a reference structure find molecules in a database that are most similar to it (“give me ten more like this”) • Compare the reference structure with each database structure and measure the similarity • Sort the database in order of decreasing similarity • Display the top-ranked structures (“nearest neighbours”) to the searcher
2D similarity searching • [Figure: a reference structure and database structures retrieved by a 2D similarity search; the structure drawings are not recoverable]
Rationale for similarity searching • [Figure: structures of morphine, codeine and heroin] • The similar property principle states that structurally similar molecules tend to have similar properties • Given a known active reference structure, a similarity search of a database can be used to identify further molecules for testing • NB many exceptions to the similar property principle
Similarity measures • A similarity measure has two principal components • A structure representation • Characterise reference and database structures to enable rapid comparison • A similarity coefficient to compare two representations • Quantitative measure of the resemblance of these characterisations • The most common measure is based on the use of 2D fingerprints and the Tanimoto coefficient (as in previous example)
Fingerprints • A simple, but approximate, representation that encodes the presence of fragment substructures in a bit-string or fingerprint • Cf keywords indexing textual documents • Each bit in the bit-string (binary vector) records the presence (“1”) or absence (“0”) of a particular fragment in the molecule. • Typical length is a few hundred or few thousand bits • Two fingerprints are regarded as similar if they have many common bits set
Tanimoto coefficient • Tanimoto coefficient for binary bit strings: T = C / (R + D − C) • C bits set in common between Reference and Database structures • R bits set in Reference structure • D bits set in Database structure • More complex form for use with non-binary data, e.g., physicochemical property vectors • Many other similarity coefficients exist
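As an illustration of the definitions on this slide, the following minimal sketch computes the standard binary Tanimoto coefficient, T = C / (R + D - C), and uses it to rank a toy database. It assumes each fingerprint is held as a Python set of "on" bit positions; the names `tanimoto`, `similarity_search` and the molecule labels are hypothetical and not taken from the talk.

```python
def tanimoto(ref_bits, db_bits):
    """Binary Tanimoto coefficient for fingerprints stored as sets of on-bit positions."""
    c = len(ref_bits & db_bits)   # C: bits set in common
    r = len(ref_bits)             # R: bits set in the reference structure
    d = len(db_bits)              # D: bits set in the database structure
    if r + d - c == 0:
        return 0.0                # assumed convention when both fingerprints are empty
    return c / (r + d - c)


def similarity_search(reference, database):
    """Return (name, similarity) pairs sorted by decreasing similarity."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints (illustrative only, not real molecules).
reference = {1, 4, 7, 9}
database = {
    "mol_a": {1, 4, 7, 9, 12},
    "mol_b": {2, 3, 9},
    "mol_c": {1, 4},
}
ranking = similarity_search(reference, database)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The top of `ranking` corresponds to the "nearest neighbours" that would be shown to the searcher.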
Data fusion: I • Many comparisons of effectiveness using different screening methods (e.g., different coefficients, different fingerprints, 2D or 3D methods) • Sheridan and Kearsley, Drug Discov. Today, 7, 2002, 903 • “We have come to regard looking for ‘the best’ way of searching chemical databases as a futile exercise. In both retrospective and prospective studies, different methods select different subsets of actives for the same biological activity and the same method might work better on some activities than others” • Different types of coefficient and different types of representation reflect different molecular characteristics, so may enhance search performance by using more than one similarity measure
Data fusion: II • Use of ideas from textual information retrieval (IR) given analogies between the two domains • Documents, keywords with highly skewed frequency distributions, and relevance to a query • Molecules, fragments with highly skewed frequency distributions, and activity against a specific biological target • IR-like fusion first studied in the late Nineties • Generate multiple rankings from the same reference structure using different similarity measures (similarity fusion) • Found to give improved performance over use of a single similarity measure (more consistent, or even better than best individual) • Later work in chemoinformatics • Generate multiple rankings from different reference structures using the same similarity measure (group fusion)
Similarity fusion • Conventional similarity searching yields a single database ranking • Work in IR on the “Authority Effect” • Experiments in TREC show that documents retrieved by multiple search engines more likely to be relevant to a query than if retrieved by a single search engine • Does the Effect also apply in chemoinformatics? • Extensive virtual screening experiments to investigate whether structures retrieved by multiple virtual screening methods more likely to be active than if retrieved by a single method
Experimental details: I • Test collection methodology analogous to that used in IR • Use of MDDR (ca. 102K structures) and WOMBAT (ca. 130K structures) databases • Sets of molecules with known biological activities (several hundred known actives in each class) • Simulated virtual screening using an active as the reference structure • How many of the top-ranked molecules from a search are also active?
Experimental details: II • Sets of 25 searches for each reference structure: • 5 different similarity coefficients (Tanimoto, cosine, Euclidean distance, Forbes, Russell-Rao) • 5 different fingerprints (MDL, BCI, Daylight, Unity and ECFP_4) • Apply cut-off to take, e.g., top-1% of a ranking • Numbers of molecules, and numbers of active molecules, retrieved by 1, 2, ..., 24, 25 searches • Average over different reference structures for each activity class, and over different activity classes
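The counting step in this experiment can be sketched roughly as follows. The sketch assumes each search is available as a list of molecule identifiers in decreasing order of similarity, and that the set of known actives is given; the function names and the toy data (three searches, a 30% cut-off) are hypothetical and chosen only to keep the example small.

```python
from collections import Counter


def top_percent(ranking, percent):
    """Identifiers in the top `percent` of a ranked list (at least one molecule)."""
    cutoff = max(1, len(ranking) * percent // 100)
    return set(ranking[:cutoff])


def retrieval_counts(rankings, percent):
    """For each molecule, the number of searches that retrieve it above the cut-off."""
    counts = Counter()
    for ranking in rankings:
        counts.update(top_percent(ranking, percent))
    return counts


def activity_by_count(counts, actives):
    """Map k -> (molecules retrieved by exactly k searches, fraction of them active)."""
    totals = {}
    for mol, k in counts.items():
        n, hits = totals.get(k, (0, 0))
        totals[k] = (n + 1, hits + (mol in actives))
    return {k: (n, hits / n) for k, (n, hits) in sorted(totals.items())}


# Toy data: three searches over ten molecules, two of which are active.
rankings = [
    ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    ["a", "c", "e", "b", "d", "f", "g", "h", "i", "j"],
    ["a", "b", "f", "c", "d", "e", "g", "h", "i", "j"],
]
actives = {"a", "b"}
counts = retrieval_counts(rankings, percent=30)
summary = activity_by_count(counts, actives)
for k, (n, frac) in summary.items():
    print(f"retrieved by {k} searches: {n} molecules, {frac:.0%} active")
```

In the study itself the same tallies would be averaged over reference structures and activity classes; that averaging is omitted here.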
Retrieval of molecules: WOMBAT top-1% searches (average over classes) Zipf-like distribution
Retrieval of active molecules: WOMBAT top-1% searches (average over classes)
Similarity fusion: conclusions • Using multiple searches hence results in: • Rapid decrease in the numbers of molecules retrieved • Rapid increase in the percentage of those retrieved molecules that are active • Multiple searches could hence increase the effectiveness of similarity-based virtual screening • Provides empirical basis for similarity fusion (but very simple fusion rule). What about group fusion?
Use of group fusion: I • [Figure: separate database rankings produced from Reference 1, Reference 2 and Reference 3]
[Figure: the rankings for Reference 1, Reference 2 and Reference 3 after truncation to the required rank]
Group fusion • Use of MDDR database (ca. 102K structures) • Measured numbers of actives retrieved in top-5% of ranking • Group fusion searches using ten actives picked at random • Comparison with the average of all the individual actives for each activity class • Comparison with the best single active for each activity class • Use of Unity and ECFP_4 fingerprints • Group fusion markedly out-performs the use of individual reference structures • Best results obtained using combination of scores and the MAX rule (see later) • Hert et al., J. Chem. Inf. Comput. Sci., 44, 2004, 1177
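A rough sketch of group fusion with the MAX rule on scores, together with the recall measure used to compare it against single-reference searches, might look like this. It assumes the similarity scores for each reference structure have already been computed and stored as a dictionary; the function names are hypothetical, and the toy scores are deliberately contrived so that the effect is visible. They are not results from the study.

```python
def rank_by_score(scores):
    """Molecule identifiers in decreasing order of score (ties broken by name)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def fuse_scores_max(score_lists):
    """Group fusion: score each molecule by its maximum similarity to any reference."""
    molecules = set().union(*score_lists)
    fused = {m: max(s.get(m, 0.0) for s in score_lists) for m in molecules}
    return rank_by_score(fused)


def recall_at_percent(ranking, actives, percent):
    """Fraction of the known actives found in the top `percent` of a ranking."""
    cutoff = max(1, len(ranking) * percent // 100)
    return len(set(ranking[:cutoff]) & actives) / len(actives)


# Contrived toy scores for three reference structures (not real data).
ref_a = {"m1": 0.90, "m2": 0.30, "m3": 0.20, "m4": 0.80, "m5": 0.70, "m6": 0.10}
ref_b = {"m1": 0.40, "m2": 0.95, "m3": 0.30, "m4": 0.60, "m5": 0.20, "m6": 0.50}
ref_c = {"m1": 0.30, "m2": 0.20, "m3": 0.85, "m4": 0.10, "m5": 0.50, "m6": 0.40}
actives = {"m1", "m2", "m3"}

fused_ranking = fuse_scores_max([ref_a, ref_b, ref_c])
single_recall = recall_at_percent(rank_by_score(ref_a), actives, percent=50)
fused_recall = recall_at_percent(fused_ranking, actives, percent=50)
print(fused_ranking, single_recall, fused_recall)
```

A 50% cut-off is used only because the toy database has six molecules; the slide reports recall in the top 5% of a database of roughly 102K structures.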
Group fusion: average over 11 activity classes • [Figure: bar chart of recall at 5% (%) for Unity and ECFP_4 fingerprints, comparing single similarity (average), single similarity (maximum) and data fusion (scores, MAX)]
Fusion rules • Given multiple input rankings, a fusion rule outputs a single, combined ranking • The rankings can be either the computed similarity values or the resulting rank positions • Work in IR and chemoinformatics has used simple arithmetical operations to combine rankings (though many other, more complex types of rule available): • CombMAX for similarity data • CombSUM for rank data • Detailed comparison of a range of rules
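Fusion rules for the x-th database structure • S_i(x) is the value for x in the i-th of n input rankings (a similarity score or a rank position) and R_i(x) is its rank position • CombMAX = max{S_1(x), S_2(x), ..., S_i(x), ..., S_n(x)} • Also CombMIN • CombSUM = Σ_i S_i(x) • Also CombMED and other averages, using all or just some of the rankings • CombRKP = Σ_i (1/R_i(x)) • Used only with rank data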
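The three rules on this slide can be sketched side by side as below. This is an illustration under stated assumptions rather than the implementation used in the experiments: every input list is assumed to cover the same molecules, CombMAX and CombSUM are applied here to similarity scores, CombRKP to rank positions (rank 1 is the most similar), and the helper names are hypothetical.

```python
def ranks_from_scores(scores):
    """Convert {molecule: similarity} into {molecule: rank}, rank 1 = most similar."""
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {m: i for i, m in enumerate(ordered, start=1)}


def comb_max(score_lists, x):
    """CombMAX: largest score of x over all input lists."""
    return max(s[x] for s in score_lists)


def comb_sum(score_lists, x):
    """CombSUM: sum of the scores of x over all input lists."""
    return sum(s[x] for s in score_lists)


def comb_rkp(rank_lists, x):
    """CombRKP: sum of reciprocal rank positions of x."""
    return sum(1.0 / r[x] for r in rank_lists)


def fuse(lists, rule):
    """Apply a fusion rule to every molecule; return ids by decreasing fused value."""
    fused = {x: rule(lists, x) for x in lists[0]}
    return sorted(fused, key=lambda x: (-fused[x], x))


# Two toy score lists over the same three molecules.
s1 = {"a": 0.9, "b": 0.5, "c": 0.4}
s2 = {"a": 0.1, "b": 0.6, "c": 0.5}
rank_lists = [ranks_from_scores(s1), ranks_from_scores(s2)]

by_max = fuse([s1, s2], comb_max)
by_sum = fuse([s1, s2], comb_sum)
by_rkp = fuse(rank_lists, comb_rkp)
print(by_max, by_sum, by_rkp)
```

On this toy input CombMAX favours "a", which scores highly in one list only, whereas CombSUM and CombRKP favour "b", which does reasonably well in both.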
Very simple rules! • Other studies use supervised rules (logistic regression, belief theory etc) • But normally very limited training data (i.e., structures and bioactivity information) at the stage you want to use data fusion • If such data are available, other chemoinformatics approaches preferable
Experimental details • Searches carried out using • Similarity fusion and group fusion • Various percentages of the ranked database • 15 different fusion rules • Results show conclusively that best results (for both similarity fusion and group fusion) obtained when: • Use just the top 1-5% of each ranked list in the fusion • Use the CombRKP fusion rule on the ranked lists
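The recommended combination (truncate each ranked list, then apply CombRKP) could be sketched as follows. One detail is an assumption of this sketch rather than something stated on the slide: a molecule that falls outside the truncated portion of a list simply contributes nothing to the sum for that list. The function name and toy rankings are hypothetical.

```python
def truncated_comb_rkp(rankings, percent):
    """Fuse ranked lists with CombRKP using only the top `percent` of each list.

    Each ranking is a list of molecule ids in decreasing order of similarity.
    Molecules below the cut-off in a list add nothing for that list (assumption).
    """
    fused = {}
    for ranking in rankings:
        cutoff = max(1, len(ranking) * percent // 100)
        for rank, mol in enumerate(ranking[:cutoff], start=1):
            fused[mol] = fused.get(mol, 0.0) + 1.0 / rank
    # Highest fused score first; ties broken alphabetically for reproducibility.
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


# Toy rankings of ten molecules; a 30% cut-off keeps the top three of each.
rankings = [
    ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    ["b", "e", "a", "c", "d", "f", "g", "h", "i", "j"],
    ["f", "b", "g", "a", "c", "d", "e", "h", "i", "j"],
]
fused = truncated_comb_rkp(rankings, percent=30)
for mol, score in fused:
    print(f"{mol}: {score:.3f}")
```

Only molecules that appear above the cut-off in at least one list receive a fused score, so the output is shorter than the input rankings.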
Use of CombRKP: I • Virtual screening seeks to rank molecules in decreasing order of probability of activity • MDDR searches (J. Med. Chem., 48, 2005, 7049) show a hyperbola-like plot
Use of CombRKP: II • Fusion scores for CombRKP best approximate probability of activity, and hence CombRKP is likely to perform well • Results averaged over 200 MDDR searches
Conclusions • Similarity-based virtual screening using fingerprints well-established • Can enhance screening effectiveness by use of data fusion: • Combining the rankings from different similarity measures • Combining the rankings from different reference structures • Range of simple fusion rules available for this purpose
Acknowledgments • Organisations • Accelrys, Daylight Chemical Information Systems, Digital Chemistry, EPSRC, Government of Malaysia, Sunset Molecular, Royal Society, Tripos, Wolfson Foundation • People • Claire Ginn, Jerome Hert, John Holliday, Evangelos Kanoulas, Nurul Malim, Christoph Mueller, Naomie Salim