A Multiobjective Approach to Combinatorial Library Design Val Gillet University of Sheffield, UK
Outline • SELECT • GA based program for combinatorial library design • Combinatorial subset selection in product-space • Multiobjective optimisation via weighted-sum fitness function • Limitations of a weighted-sum approach • MoSELECT • Multiobjective optimisation via MOGA
Library Design is a Multiobjective Optimisation Problem • Early HTS results disappointing • Low hit rates • Hits too lipophilic; too flexible; high molecular weights… • Diverse libraries • Distance-based/cell-based diversity • Bioavailability; cost; ease of synthesis… • Focused/targeted libraries • Similarity to known active; predicted active by QSAR model; fit to receptor site • Bioavailability; cost…
Product-Based Library Design • A two-component combinatorial library can be represented by a 2D array • A combinatorial subset can be defined by intersecting rows and columns of the array • Exploring all combinatorial subsets is equivalent to testing all permutations of the rows and columns of the array
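The array view of a two-component library can be sketched in a few lines of Python. This is only an illustration: the 4 × 5 library, the product labels, and the function names `combinatorial_subset` and `count_subsets` are hypothetical and do not come from SELECT.

```python
import math

def combinatorial_subset(library, rows, cols):
    """Return the products at the intersection of the chosen rows and columns."""
    return [[library[r][c] for c in cols] for r in rows]

def count_subsets(n_rows, n_cols, k_rows, k_cols):
    """Number of distinct k_rows x k_cols combinatorial subsets of the array."""
    return math.comb(n_rows, k_rows) * math.comb(n_cols, k_cols)

# Hypothetical 4 x 5 virtual library: product (i, j) is labelled "A{i}-B{j}".
library = [[f"A{i}-B{j}" for j in range(5)] for i in range(4)]

# A 2 x 3 subset: reactants 0 and 2 from pool A, reactants 1, 3 and 4 from pool B.
subset = combinatorial_subset(library, rows=[0, 2], cols=[1, 3, 4])
n_choices = count_subsets(4, 5, 2, 3)
```

Even for this toy array there are 60 possible 2 × 3 subsets; the count grows combinatorially with the pool sizes, which is why a stochastic search such as a GA is used instead of exhaustive enumeration.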
Selecting Combinatorial Subsets Using a GA • Chromosome encoding • each chromosome represents a combinatorial subset as an integer string • one partition for each reactant pool • the size of a partition equals the no. of reactants required from the corresponding pool • Crossover, mutation and roulette wheel parent selection are used to evolve new potential solutions [Figure: example chromosome for a 6 × 4 subset; R1 partition: 11 8 2 30 7 25; R2 partition: 10 1 19 18]
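The chromosome encoding and two of the operators named above can be sketched as follows. This is a minimal illustration under assumptions, not SELECT's implementation: the function names, the pool sizes, and the particular mutation rule (swap one reactant for an unused one from the same pool) are hypothetical, and crossover is omitted for brevity.

```python
import random

def random_chromosome(pool_sizes, picks, rng):
    """One partition per reactant pool; each partition holds distinct reactant indices."""
    return [rng.sample(range(n), k) for n, k in zip(pool_sizes, picks)]

def mutate(chromosome, pool_sizes, rng):
    """Assumed mutation: replace one reactant with an unused one from the same pool."""
    child = [list(partition) for partition in chromosome]
    pool = rng.randrange(len(child))
    unused = [r for r in range(pool_sizes[pool]) if r not in child[pool]]
    if unused:
        child[pool][rng.randrange(len(child[pool]))] = rng.choice(unused)
    return child

def roulette_select(population, fitnesses, rng):
    """Pick one parent with probability proportional to its non-negative fitness.

    At least one fitness must be positive, otherwise random.choices raises ValueError.
    """
    return rng.choices(population, weights=fitnesses, k=1)[0]

rng = random.Random(0)
# A 6 x 4 subset drawn from two hypothetical pools of 30 reactants each.
chrom = random_chromosome(pool_sizes=[30, 30], picks=[6, 4], rng=rng)
child = mutate(chrom, [30, 30], rng)
```

Keeping one partition per pool means every chromosome decodes to a valid combinatorial subset, so the operators never have to repair a solution.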
Multiobjective Optimisation in SELECT • Weighted-sum fitness function • enumerate the combinatorial library represented by a chromosome • calculate descriptors for molecules in the library • Objectives are scaled and user defined weights are applied
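A weighted-sum fitness of this kind can be sketched as below. The min-max scaling, the bounds, the example values, and the use of a negative weight to mark an objective that should be minimised are all assumptions made for illustration; the slides do not specify how SELECT scales its objectives.

```python
def scale(value, lo, hi):
    """Min-max scale a raw objective to [0, 1]; lo and hi are assumed bounds."""
    return (value - lo) / (hi - lo)

def weighted_sum_fitness(objectives, bounds, weights):
    """Combine scaled objectives into one score: f = w1*a + w2*b + ..."""
    return sum(
        w * scale(v, lo, hi)
        for v, (lo, hi), w in zip(objectives, bounds, weights)
    )

# Hypothetical library with diversity 0.57 and cost 120 (arbitrary units).
f = weighted_sum_fitness(
    objectives=[0.57, 120.0],
    bounds=[(0.5, 0.6), (100.0, 200.0)],
    weights=[1.0, -0.5],  # reward diversity, penalise cost
)
```

Changing the weights changes which library scores best, so the single solution returned by a run depends directly on a choice the user has to make in advance.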
Multiobjective Optimisation in SELECT cont. • Diversity indices • distance-based (e.g. sum of pairwise dissimilarities using Daylight fingerprints) • cell-based • Physical property terms • minimise the difference between the distribution in the library and some reference distribution, e.g. • “drug-like” profile derived from WDI • Cost: £ • minimise the cost of the library
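Two of these terms can be illustrated with a short sketch. It assumes fingerprints are represented as sets of on-bit positions, that dissimilarity is 1 minus the Tanimoto coefficient, and that the profile term is a sum of absolute differences between binned distributions; none of these details is stated on the slide, and the function names are hypothetical.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def sum_pairwise_dissimilarity(fingerprints):
    """Distance-based diversity: sum of (1 - Tanimoto) over all pairs of molecules."""
    return sum(1.0 - tanimoto(a, b) for a, b in combinations(fingerprints, 2))

def profile_difference(library_profile, reference_profile):
    """Sum of absolute differences between two binned property distributions."""
    return sum(abs(x - y) for x, y in zip(library_profile, reference_profile))

# Toy fingerprints and three-bin molecular weight profiles (fractions per bin).
fps = [{1, 2, 3}, {2, 3, 4}, {7}]
diversity = sum_pairwise_dissimilarity(fps)
delta_mw = profile_difference([0.2, 0.5, 0.3], [0.1, 0.6, 0.3])
```

Diversity is to be maximised and the profile difference minimised, which is one reason the two terms are awkward to combine in a single weighted sum.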
Library Enumeration in SELECT • Virtual library is enumerated upfront • ADEPT (A Daylight Enumeration and Profiling Tool) • Identify potential reactants • Filter out unwanted ones • Enumerate virtual library • Reaction Toolkit (Reaction transforms; MTZ language) • Descriptors are calculated upfront • Combinatorial subset accessed via fast lookup
Example: Amide Library • 10K virtual library: 100 amines × 100 carboxylic acids • 30 × 30 amide subsets • Reactant-based selection: diversity (Diversity 0.564) • Product-based selection: diversity & molecular weight profile (Diversity 0.573) • WDI: World Drugs Index [Figure: molecular weight profiles (percentage of compounds vs. molecular weight, 0 to 800) for WDI, the reactant-based library and the product-based library]
Limitations of a Weighted-Sum Fitness Function • Definition of fitness function difficult especially for different types of objectives • e.g. molecular weight profile and cost • Setting of weights is non-intuitive • Can result in regions of search space being obscured especially when objectives are in competition • Difficult to monitor progress since >1 objective to follow simultaneously • A single solution is found
Varying Weights in SELECT • Objectives are in competition, resulting in trade-offs • A family of alternative solutions exists, all of which are equivalent
Multiobjective Optimisation • Evolutionary algorithms, e.g., GAs • operate with a population of individuals • well suited to search for multiple solutions in parallel • readily adapted to deal with multiobjective optimisation • MOGA: MultiObjective Genetic Algorithm • Fonseca & Fleming. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 28(1), 1998, 26-37.
MOGA • Multiple objectives are handled independently without summation and without weights • A hyper-surface is mapped out in the search space • represents a continuum of solutions where all solutions are seen as equivalent • represents compromises or trade-offs between the various objectives • solutions are called non-dominated, or Pareto solutions. • A family of non-dominated solutions is sought rather than a single solution
Dominance & Pareto Ranking • A non-dominated individual is one where an improvement in one objective results in a deterioration in one or more of the other objectives when compared with the other individuals in the population • Pareto ranking: an individual’s rank corresponds to the number of individuals in the current population by which it is dominated [Figure: population plotted on objectives f1 and f2, each point labelled with its Pareto rank (0 for non-dominated individuals; others 1, 2 and 4); individuals A and B marked]
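Dominance and Pareto ranking as defined above can be sketched directly. The example assumes every objective is to be minimised; the five points are made up for illustration.

```python
def dominates(p, q):
    """True if p is no worse than q in every objective and strictly better in one.

    All objectives are assumed to be minimised.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pareto_ranks(points):
    """Rank of each individual = number of individuals that dominate it."""
    return [sum(dominates(q, p) for q in points) for p in points]

# Five hypothetical individuals described by two objectives (f1, f2).
points = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
ranks = pareto_ranks(points)  # [0, 0, 0, 1, 4]
```

The first three points trade f1 against f2 and are all non-dominated (rank 0); (3, 4) is dominated only by (2, 3), and (5, 5) is dominated by the other four.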
SELECT • Initialise population • Select parents • Apply genetic operators • Calculate objectives: a, b, c... • Apply fitness function f = w1a + w2b + w3c + ... • Rank based on fitness • Test for convergence • Single solution
MoSELECT* • Initialise population • Select parents • Apply genetic operators • Calculate objectives: a, b, c... • Calculate dominance: a, b, c • Rank using Pareto ranking: based on dominance • Test for convergence • Family of solutions
* Patent applied for
MoSELECT: Search Progress [Figure: population plotted after 0, 100, 1000 and 5000 iterations]
Family of Solutions • Each run of MoSELECT results in a family of solutions • Finding the same coverage of solutions using SELECT would require multiple runs using various combinations of weights • One run of MoSELECT takes the same CPU time as one run of SELECT [Figure: family of solutions after 5000 iterations, plotted as Diversity (0.574 to 0.594) against ΔMW (0.58 to 0.64)]
Focused Library: Aminothiazoles • α-bromoketones & thioureas extracted from ACD • ADEPT used to • filter reactants (MW < 300; RB < 8) • enumerate virtual library => 12850 products (74 α-bromoketones & 170 thioureas) • MoSELECT used to design 15×30 subsets optimised on • Similarity to a target compound (Daylight fingerprints) • Cost ($/g)
MoSELECT Solutions: 1 [Figure: population at 0 iterations and after 5000 iterations]
MoSELECT Solutions: 2 • Running MoSELECT with niching [Figure: solutions after 5000 iterations]
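The slides do not say how niching is implemented in MoSELECT. One common scheme in multiobjective GAs is fitness sharing in objective space, sketched below purely as an illustration: individuals in crowded regions have their fitness divided by a niche count, which encourages the population to spread along the trade-off surface. The triangular sharing function, the `radius` parameter, and the function names are assumptions.

```python
import math

def niche_counts(points, radius):
    """Sum a triangular sharing function over all individuals within `radius`.

    Each individual counts itself (distance 0), so every niche count is >= 1.
    """
    counts = []
    for p in points:
        total = 0.0
        for q in points:
            d = math.dist(p, q)
            if d < radius:
                total += 1.0 - d / radius
        counts.append(total)
    return counts

def shared_fitness(fitnesses, counts):
    """Penalise crowded individuals by dividing fitness by the niche count."""
    return [f / c for f, c in zip(fitnesses, counts)]

# Two close neighbours and one isolated individual in (scaled) objective space.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
counts = niche_counts(points, radius=0.2)
shared = shared_fitness([1.0, 1.0, 1.0], counts)
```

Here the isolated individual keeps its full fitness while the two neighbours are each penalised, so selection favours the less explored region.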
Moving to > 2 Objectives: Parallel Graph Representation • Each objective is scaled using the max and min values achieved when the objective is optimised independently [Figure: parallel graph of the solutions after 5000 iterations, with one axis per objective (Diversity, ΔMW)]
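The scaling step behind the parallel graph can be sketched as follows. The per-objective minima and maxima are assumed to come from separate single-objective runs, as the slide describes; the numbers and the function name are hypothetical.

```python
def scale_objectives(solutions, mins, maxs):
    """Min-max scale each objective so every solution plots on a common 0-1 axis."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(solution, mins, maxs)]
        for solution in solutions
    ]

# Two hypothetical solutions with two objectives each.
solutions = [(2.0, 30.0), (4.0, 10.0)]
mins = (0.0, 10.0)   # best/worst values found when each objective
maxs = (4.0, 50.0)   # is optimised on its own (assumed)
scaled = scale_objectives(solutions, mins, maxs)  # [[0.5, 0.5], [1.0, 0.0]]
```

Each scaled solution becomes one polyline across the objective axes, so crossing lines between two axes indicate objectives that are in competition.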
Focused Library: Amides • 100 × 100 virtual library • MoSELECT used to design 10 × 10 subsets • Objectives • Similarity to a target • Sum of similarities using Daylight fps • Predicted bioavailability • Each compound rated from 1 to 4 • Sum of ratings • Hydrogen bond profile • Rotatable bond profile
MoSELECT Solutions • Population size 50 • Iteration 5000 • Niching 30% • Number of solutions = 11 • CPU 53s (R12K 360 MHz)
Conclusions • Advantages of MoSELECT • a family of equivalent solutions is obtained in a single run with each solution representing one combinatorial library • this is achieved at vastly reduced computational cost compared to performing multiple runs of SELECT • no need to determine weights for objectives • optimisation of different types of objectives is readily achieved • visualisation of the search progress allows trade-offs between objectives to be observed • the user can make an informed choice on which solution(s) to explore
Acknowledgements • Illy Khatib, Peter Willett; Information Studies, University of Sheffield • Peter Fleming; Automatic Control and Systems Engineering, University of Sheffield • Darren Green, Andrew Leach; GlaxoSmithKline, UK • Funding by GlaxoSmithKline, UK • John Bradshaw; Daylight • Daylight for software support