Methods for Virtual Screening and Scaffold Hopping for Chemical Compounds
Nikil Wale1, George Karypis1, Ian A. Watson2
1Department of Computer Science & Engineering, University of Minnesota. nwale@cs.umn.edu, karypis@cs.umn.edu
2Eli Lilly Inc., Lilly Research Labs, Indianapolis. watson_a_ian@lilly.com
Drug Development Pipeline (10^6 candidate compounds → 1 drug)
Choose a drug target → drug screening / lead development / lead optimization → small-scale production → laboratory and animal testing → file IND → production for clinical trials
Drug Screening
• Experimental: high-throughput screening, target-specific assays, …
• In silico: clustering, ranked retrieval, classification, docking, …
Ranked-Retrieval Problem
• Given a query compound, rank the compounds in the database based on how similar they are to the query in terms of their bioactivity.
[Figure: a query compound and a database of compounds, returned as a bioactivity-based ranking 1, 2, 3, …]
Knowledge / Information Retrieval / Data-Mining Based Approaches
Key Principle: Structure-Activity Relationship
• The properties of a chemical compound largely depend on its structure (structure-activity relationship, or SAR).
• Exploit structural similarity.
• Capturing structure → structural descriptors (descriptor-space).
Drawbacks of Structural Descriptor-Space Based Ranked-Retrieval
• Too much emphasis on structural similarity.
• The query, and subsequently the hits (top-ranked compounds), may have bad ADME properties.
• Structures may be toxic or promiscuous.
• To avoid these drawbacks, retrieve compounds that are structurally diverse and different from the query, and yet bioactive.
Scaffold-Hopping Problem
• Given a query compound, rank the compounds in the database (represented in a descriptor-space) based on how similar they are to the query in terms of their bioactivity, but as dissimilar as possible in terms of their structure.
[Figure: a query compound and a database of compounds, returned as a ranking 1, 2, 3, … of compounds with diverse structures but the same bioactivity]
A Hard Problem
• Runs counter to SAR.
• For a query, it is hard to distinguish a genuinely structurally different but active compound from an inactive compound.
• The definition of a scaffold-hop for a query is not clear or objective in many cases.
Examples
• COX2 (cyclooxygenase-2) inhibitors: Bextra, Celebrex, Vioxx
• PDE5 (phosphodiesterase type 5) inhibitors: Viagra, Levitra, Cialis
Methods for Scaffold-Hopping
• Based on indirect measures of similarity.
• Include information beyond structural similarity → allows structural diversity.
• Automatic relevance-feedback based methods.
• Graph-analysis based methods.
[Figure: two compounds with high structural (direct) similarity to the query, and a third with low structural similarity to the query but similarity by association]
Automatic Relevance Feedback Based Methods
• Based on indirect similarity derived from an automatic relevance-feedback mechanism.
[Figure: the query, the set of its three most structurally similar compounds, and a compound with low structural similarity to the query but high similarity to that set of nearest neighbors]
Automatic Relevance Feedback Based Methods: TopKAvg
• Rank the database by direct similarity to the query q and take the top-k compounds as the feedback set A (in the slide's example, A = {c39, c2, c13}).
• Re-rank every compound c in the database using sim_A(q, c) = α · sim(q, c) + (1 − α) · sim_avg(c, A).
(A minimal sketch of this re-ranking follows.)
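A minimal Python sketch of the TopKAvg-style re-ranking, assuming a generic direct-similarity function sim (e.g., Tanimoto over fingerprints); the function name and structure are illustrative, not the authors' implementation.

```python
# Hedged sketch of TopKAvg-style relevance-feedback re-ranking.
# `sim(a, b)` is any direct (descriptor-space) similarity, e.g. Tanimoto.

def topk_avg_rerank(query, database, sim, k=3, alpha=0.5):
    """Re-rank `database` against `query` using top-k relevance feedback."""
    # Step 1: rank by direct similarity and take the top-k compounds as set A.
    ranked = sorted(database, key=lambda c: sim(query, c), reverse=True)
    A = ranked[:k]

    # Step 2: re-rank using a blend of direct similarity and the average
    # similarity to the feedback set A.
    def fused_score(c):
        avg_to_A = sum(sim(c, a) for a in A) / len(A)
        return alpha * sim(query, c) + (1 - alpha) * avg_to_A

    return sorted(database, key=fused_score, reverse=True)
```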
Automatic Relevance Feedback Based Methods: ClustWt
• Rank the database by direct similarity to the query q and take the top-k compounds.
• Cluster the top-k compounds into m clusters; the clusters are ranked with respect to their similarity to the query, and the feedback set A is drawn from them (in the slide's example, A = {c8, c10, c13}).
• Re-rank every compound c in the database using sim_A(q, c) = α · sim(q, c) + (1 − α) · sim_avg(c, A).
Automatic Relevance Feedback Based Methods: BestSumDescSim and BestMaxDescSim
• Start with an empty feedback set A = {} and the full database D.
• Grow the feedback set iteratively: at each step pick the next compound c_next based on descriptor-space similarity, then set A = A ∪ {c_next} and D = D − {c_next}.
(One possible reading of the selection step is sketched below.)
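The slide does not spell out how c_next is chosen; the sketch below is one plausible reading inferred from the method names, where the candidate maximizing the sum (BestSum) or maximum (BestMax) of descriptor-space similarities to the query and the already-selected set is picked. The authors' exact scoring may differ.

```python
# Hedged sketch of a BestSum/BestMax-style greedy selection, inferred from the
# method names; the exact scoring used by the authors may differ.
# `sim(a, b)` is a direct descriptor-space similarity (e.g., Tanimoto).

def greedy_select(query, database, sim, k=3, fuse="sum"):
    """Greedily grow a feedback set A, one compound at a time."""
    A, D = [], list(database)
    while D and len(A) < k:
        def score(c):
            # Similarity to the query plus the already selected compounds,
            # fused by sum (BestSum) or max (BestMax).
            sims = [sim(c, query)] + [sim(c, a) for a in A]
            return sum(sims) if fuse == "sum" else max(sims)
        c_next = max(D, key=score)
        A.append(c_next)       # A = A ∪ {c_next}
        D.remove(c_next)       # D = D − {c_next}
    return A
```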
Methods for Scaffold-Hopping: Graph Analysis
• Based on an indirect measure of similarity derived from a nearest-neighbor graph over the compounds.
• The graph is formed using information on the neighborhood of each chemical compound.
• Example: c2 and c4 are strongly related by the metric of the number of paths of length 2 that connect them, even though they do not have a direct relation (a path of length 1).
Network Analysis Based Methods
• Rank every compound in the database, and the query, with respect to every other compound: for each compound c (and for the query q), rank all of D ∪ {q} − {c} by direct similarity and keep its top-k list.
[Figure: the top-k neighbor lists produced for q, c1, c2, and cj]
Network Analysis Based Methods: Neighbor Graphs
• Nearest-Neighbor graph (NG): there is an edge between two nodes (compounds) ci and cj if ci occurs in the list of top-k neighbors of cj, or vice versa.
• Mutual Nearest-Neighbor graph (MG): there is an edge between ci and cj if ci occurs in the list of top-k neighbors of cj, and vice versa.
Example – NG and MG (top-3 neighbor lists)
• Built from the same top-3 lists, the two graphs differ: adj_NG(q) = {c44, c5, c1} but adj_MG(q) = {c1}, and adj_NG(c2) = {cj, q, c1} but adj_MG(c2) = {cj, c1}.
[Figure: the NG and MG graphs constructed from the top-3 lists of q, c1, c2, c5, c18, c44, and cj]
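A minimal Python sketch of the NG/MG construction from top-k neighbor lists; the toy `topk` data is made up for illustration and is not the slide's actual lists.

```python
# Build the NG (one-directional neighbor) and MG (mutual neighbor) graphs
# from a mapping of compound id -> list of its k most similar compounds.

from itertools import combinations

def build_graphs(topk):
    """Return (NG, MG) as adjacency-set dictionaries."""
    nodes = list(topk)
    NG = {v: set() for v in nodes}
    MG = {v: set() for v in nodes}
    for u, v in combinations(nodes, 2):
        u_in_v = u in topk[v]
        v_in_u = v in topk[u]
        if u_in_v or v_in_u:          # NG: one direction is enough
            NG[u].add(v); NG[v].add(u)
        if u_in_v and v_in_u:         # MG: the relation must be mutual
            MG[u].add(v); MG[v].add(u)
    return NG, MG

# Toy usage
topk = {"q": ["c1", "c2"], "c1": ["q", "c2"], "c2": ["c1", "c3"], "c3": ["c2", "q"]}
NG, MG = build_graphs(topk)
```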
Network Analysis Based Methods: Graph-Based Similarity
• The graph-based (indirect) similarity between two compounds is computed from their neighborhoods in the NG or MG graph.
• This similarity can be used in conjunction with the Sum or Max search schemes.
• Four methods are derived from the graph-based similarity: BestSumNG, BestMaxNG, BestSumMG, BestMaxMG.
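The slide does not reproduce the similarity formula. As an illustration consistent with the earlier length-2-path intuition, here is a hedged sketch of one plausible shared-neighbor similarity (a Tanimoto coefficient over adjacency sets); the authors' exact definition may differ.

```python
# Illustrative shared-neighbor similarity over a neighbor graph (NG or MG),
# expressed as a Tanimoto coefficient over adjacency sets. This is one plausible
# instantiation of "similarity via length-2 paths", not necessarily the paper's.

def indirect_similarity(graph, ci, cj):
    """graph: dict mapping compound id -> set of adjacent compound ids."""
    ni, nj = graph[ci], graph[cj]
    union = ni | nj
    return len(ni & nj) / len(union) if union else 0.0

# Toy usage: c2 and c4 share two neighbors even without a direct edge.
graph = {"c2": {"c1", "c3"}, "c4": {"c1", "c3"}, "c1": {"c2", "c4"}, "c3": {"c2", "c4"}}
print(indirect_similarity(graph, "c2", "c4"))   # -> 1.0
```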
Experimental Methodology
• A combination of 6 target-specific datasets and 3 descriptor-spaces (GF, ECFP, ErG) was used, resulting in a total of 18 problems.
• Tanimoto similarity was used to measure all direct similarities.
• Standard retrieval (ranking the database by direct similarity to the query) was used as the baseline for comparison.
• Two related schemes, Turbo Max/Sum, are also compared.
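For reference, the Tanimoto coefficient over binary fingerprints is the ratio of shared on-bits to total on-bits; a minimal sketch, with fingerprints represented as Python sets of bit identifiers:

```python
# Tanimoto (Jaccard) similarity between two binary fingerprints, each
# represented as the set of identifiers of the bits/features that are on.

def tanimoto(fp_a, fp_b):
    intersection = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return intersection / union if union else 0.0

print(tanimoto({1, 4, 7, 9}, {1, 4, 8}))   # 2 shared / 5 total = 0.4
```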
Related Schemes: Turbo SumFusion / MaxFusion
• Rank the database by direct similarity to the query q and take the top-k compounds as A (in the slide's example, A = {c39, c2, c13}).
• Re-rank every compound c by fusing its similarities to the members of A:
  TurboSumFusion: sim_A(q, c) = sim(c, c39) + sim(c, c2) + sim(c, c13)
  TurboMaxFusion: sim_A(q, c) = max{sim(c, c39), sim(c, c2), sim(c, c13)}
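A minimal sketch of this fusion step, assuming the same generic direct-similarity function sim as above; the function name is illustrative.

```python
# Turbo Sum/Max fusion: take the query's top-k neighbors and score every
# database compound by its summed (or maximum) similarity to those neighbors.

def turbo_fusion_rerank(query, database, sim, k=3, fuse="sum"):
    neighbors = sorted(database, key=lambda c: sim(query, c), reverse=True)[:k]
    def score(c):
        sims = [sim(c, a) for a in neighbors]
        return sum(sims) if fuse == "sum" else max(sims)
    return sorted(database, key=score, reverse=True)
```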
Definition of Scaffold-Hops
• Scaffold-hops are defined for every active ai in a dataset.
• For every active, all other actives are ranked against it using path-based fingerprints. The bottom 50% of actives in this list form the scaffold-hops for ai.
[Figure: the actives ranked against ai, split into the top 50% and the bottom 50%; the bottom 50% are the scaffold-hops for ai]
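A hedged sketch of this definition for one active; the fingerprint representation and the tanimoto function passed in are illustrative assumptions.

```python
# Scaffold-hops for active `ai`: rank all other actives by path-based
# fingerprint similarity to `ai`; the bottom 50% of that ranking form the
# scaffold-hop set.

def scaffold_hops(ai, actives, fingerprint, tanimoto):
    """Return the set of scaffold-hops for active `ai`."""
    others = [a for a in actives if a != ai]
    ranked = sorted(others,
                    key=lambda a: tanimoto(fingerprint(ai), fingerprint(a)),
                    reverse=True)
    # Bottom half of the similarity ranking = structurally dissimilar actives.
    return set(ranked[len(ranked) // 2:])
```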
Performance Evaluation
• For each problem, every active in it was used as the query exactly once.
• Ranked-retrieval and scaffold-hopping performance is measured using uninterpolated precision in the top 50.
• Two methods are compared using the average of the log2 ratios of their uninterpolated precision over the 18 problems.
• Example: if the actives in the top 50 appear at ranks 1, 2, and 12, the uninterpolated precision is (1/1 + 2/2 + 3/12) / 50 = 0.045.
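A minimal sketch of this measure, reproducing the slide's worked example; the function name is illustrative.

```python
# Uninterpolated precision in the top-N positions: sum the precision at each
# rank where an active appears, then divide by N.

def uninterpolated_precision(is_active_at_rank, n=50):
    """is_active_at_rank: list of booleans for ranks 1..n (True = active)."""
    total, hits = 0.0, 0
    for rank, active in enumerate(is_active_at_rank[:n], start=1):
        if active:
            hits += 1
            total += hits / rank          # precision at this rank
    return total / n

# Matches the slide's example: actives at ranks 1, 2, and 12.
ranking = [r in (1, 2, 12) for r in range(1, 51)]
print(uninterpolated_precision(ranking))   # (1/1 + 2/2 + 3/12) / 50 = 0.045
```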
Results – Scaffold-Hopping Performance
[Figure: pairwise statistical-comparison matrix over StdRet, TurboSum, TurboMax, TopkAvg, ClustWt, BestSumDS, BestMaxDS, BestSumNG, BestMaxNG, BestSumMG, and BestMaxMG; each cell indicates whether the row method is statistically better than, worse than, or the same as the column method]
Results – Ranked-Retrieval Performance
[Figure: the same pairwise statistical-comparison matrix over StdRet, TurboSum, TurboMax, TopkAvg, ClustWt, BestSumDS, BestMaxDS, BestSumNG, BestMaxNG, BestSumMG, and BestMaxMG, for ranked-retrieval performance]
Conclusions & Future Work
• Indirect similarity, whether derived from the automatic relevance-feedback mechanism or from neighbor graphs, improves scaffold-hopping performance.
• Indirect-similarity based methods are more powerful than direct-similarity based measures and show significant improvements over the state of the art.
• Open problems: selecting the right value for the parameter k, and selecting the right descriptor-space.
Thanks to the Karypis Research group: Kevin DeRonne, Christopher Kauffman, Huzefa Rangwala, Xia Ning, Yevgeniy Podoylan.
THANK YOU! Questions?
www.cs.umn.edu/~karypis