570 likes | 792 Views
Laboratoire d’Informatique Fondamentale de Lille. Parallel Cooperative Optimization Research Group. Evolutionary Aglorithms for the Protein Structure Prediction Problem and Molecular Docking. A.-A. Tantar, N. Melab and E-G. Talbi {tantar, melab, talbi}@lifl.fr. INRIA DOLPHIN Project.
E N D
Laboratoire d’InformatiqueFondamentale de Lille Parallel Cooperative Optimization Research Group Evolutionary Aglorithms for theProtein Structure Prediction Problemand Molecular Docking A.-A. Tantar, N. Melab and E-G. Talbi {tantar, melab, talbi}@lifl.fr INRIA DOLPHIN Project Supported by the French Research Agency
Outline • Protein Structure Prediction, Molecular Docking: modeling and complexity analysis • Parallel Hybrid Metaheuristics for the PSP Problem • Grid experimentation on GRID5000 • Conclusion and Future Work
Protein Structure Prediction Problem E N E R G Y M I N I M I Z A T I O N Conformational Sampling Amino-acids ... ... A protein Native Conformation Protein Structure Prediction (PSP) ~ finding the ground-state (tertiary structure) conformation of a protein, given its amino-acid sequence - the primary structure
HIV-1 PROTEASE XK263 INHIBITOR Molecular Docking HIV-1 PROTEASE + XK263 RECEPTOR + LIGAND Molecular Docking ~ the prediction of the optimal bound conformation of two molecules exerting geometrical and chemical complementarity.
WEAK London Forces Van der Waals dipole-dipole stacking hydrogen bond STRONG ionic interactions Scoring Energy Function (A) Empirical Force Fields – Classical Mechanics • AMBER – Assisted Model Building with Energy Refinement • CHARMM – Chemistry at HARvard Molecular Mechanics • OPLS – Optimized Potentials for Liquid Simulations • GROMOS – GROningen MOlecular Simulation • SYBYL • D-Score - DOCK • F-Score - FlexX • G-Score - Gold • AutoDock/AutoGrid – suite of toos for automated docking • ... From recent resultsEmpirical Methods OUTPERFORM Ab Initio Methods (!!) • Neumaier, A. (1997). Molecular modeling of proteins and mathematical prediction of protein structure. SIAM Review, 39(3):407 – 460.
Scoring Energy Function (B) • AutoDock • DOCK - SYBYL/D-Score
Scoring Energy Function (C) • GOLD - SYBYL/G-Score • FlexX - SYBYL/F-Score • Renxiao Wang. Yipin Lu, and Shaomeng Wang, Comparative Evaluation of 11 Scoring Functions for Molecular Modeling, J. Med. Chem., 10.1021/jm0203783, 2003, 46, 2287-2303
Scoring Energy Function (D) • the factors model oscillating entities – forces simulated by interconnecting springs between atoms; offer an easy concept and fast energy evaluations. • constants derived from higher-level calculations (e.g. Ab initio) – difficult to obtain and to fit. • empirical force fields have the DRAWBACK of offering results not directly comparable with results obtained through another differently parameterized force field. • RMSD (Root Mean Squared Deviation) – typical distance measure between conformations.
AMINO-ACID TORSION-ANGLE BASED CHROMOZOME PSP modeling: Encoding of the conformations • Cartesian atomic coordinates representation • amino-acid-based encoding – hydrophobic/hydrophilic models • all heavy atoms coordinates • backbone C coordinates • backbone coordinates and side-chain centroid coordinates • torsion-angle based representation 286 165 316 . . . 7 69
Molecular Docking: Encoding of the conformations Model complexity High Complexity • flexible docking- both the ligand and the receptor are modeled as flexible molecules; limitations may be imposed • partially flexible docking – to some extent flexibility is modeled by focusing on the smaller molecule or by defining comprehensive regions of significance • rigid docking – extreme simplification; both the ligand and the receptor are rigid entities, no flexibility being allowed at any point Low Complexity + Multiple Potential Binding Sites AutoDock Representation - string of real valued genes • three Cartesian coordinates for ligand translation • four variables for defining a quaternion specifying ligand orientation • one real value for each ligand torsion • Garret M. Morris et al., Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function, Journal of Computational Chemistry, Vol. 19, No. 14, 1639-1662, 1998.
Complexity Analysis A molecule of 40 residues 10 conformations per residue 1040 conformations Levinthal’s Paradox 1014 conformations per second 1028 years 1011 local optima for the [met]-enkephalin pentapeptide • 75 atoms • Five amino-acids (Tyr-Gly-Gly-Phe-Met) • 22 variable backbone dihedral angles
Motivation and objectives • PSP and Molecular Docking can be modeled as a optimization problems … • … which are multi-modal • … and require a huge amount of resources for large proteins • Need of hybrid metaheuristics • Need oflarge scale parallelism (Grid computing) • $$$ : $150000-$250000 per X-ray structure • Study of two hybrid metaheuristics for PSP: • Genetic Algorithms and Simulated Annealing
Classification of the Existing Methods • Energy minimization vs. Geometry complementarity approaches • Ab initio, de nuovo electronic structure calculations • rely on quantum mechanics for determining different molecular characteristics • comprise no approximations and no a priori experimental data is required • high computational complexity => reduced size systems • Semi-empirical methods • make use of approximations as substitution for ab initio techniques • employ simplified models for electron-electron interactions • Empirical methods • rely upon molecular dynamics (classical mechanics based methods) • often the only applicable methods for large molecular systems, i.e. proteins and polymers • do not dissociate atoms into electrons and nuclei - indivisible entities • Continuous vs. Discretization - Combinatorial Optimization
slow folders isomerization, water dynamics bond vibration fastest folders typical folders 10-15 femto MD step 10-9 nano, MD long run 10-6 micro 100 seconds 10-12 pico 10-3 mili Classification – Algorithmic Overview (A) • Global Optimization – Energy Landscape Projections, Convex Global Underestimation, etc. • construct a metamodel of the energy surface - avoid most of the inconvenients determined by the funnel with bumps structure of the landscape • tend to require less computational time than classical techniques – a few orders of magnitude • apparently have a lower probability of remaining blocked in kinetic traps • employ suppositions regarding the landscape which may not extend to different classes • require the construction of a metamodel, underestimation, etc. which offers no optimality/corectness guarantee • Molecular Dynamics • offer accurate results as far as the force field model fits the real energy landscape • the method can be applied straightforward – apart the formalism, no parameters to be tuned • require extensive computational resources – may only be applied to reduced models • the design and the development of the formalism may be far from trivial • given the high computational complexity, stand-alone approaches cannot be practical
Classification – Algorithmic Overview (B) • Monte Carlo / Simulated Annealing / Metropolis-based algorihtms • do not require a large number of parameters – adaptive versions overcome to an extent the inconvenients determined by the initial specified values • offer a performance guarantee for ataining the optimal value • compared to gradient based methods, are less prone to getting traped in local minima • require a non feasible number of sampling points for having the performance guarantee • appropiate for local search optimization or for the refinement of specific areas • might employ specific distributions for performing the conformational sampling • depending on the nature of the method, it may pose important parallelization problems • Evolutionary algoritmhs – Genetic Algorithms • exhibit strong exploration characteristics while lacking intensification capabilities – hybridization approaches tend to be more appropiate • offer the base for complex parallelization techniques • require implementing different components, having different parameters • the design of the operators may require to consider structural aspects (i.e. xover operator should not mix angles from different amino-acids when performing the recombination, etc.) • do not offer any performance guarantee – statistical convergence
The sketch of an evolutionary algorithm Best individual Initialization Parents Evaluation Stop ? Selection Evolutionaryengine Genitors Replacement Recombination,mutation Offspring Evaluation
Generation of the Initial population • Seeding - information inserted from structural databases • ensure that the population is seeded with the substructures that lead to the optimal conformation • does not represent the desired approach as strucutural information may not be available in all the cases • Ab initio approach – no initial structural information is given • uniform random initialization of the dihedral angles – the most common option • depending on structural data different distributions may be employed for offering a bias • The size of the population • may have an impact on the convergence rate: • reduced number of chromosomes ->premature convergence • larger number of chromosomes ->extra computational time • should be related to the cardinality of the alphabet used for encoding, chromosome length, algorithm parameters, etc.; empirical, restricted to special cases - no rigorous specifications • DUSAN P. DJURDJEVIC, MARK J. BIGGS, Ab Initio Protein Fold Prediction Using Evolutionary Aglorithms: Influence of Design and Control Parameters on Performance, 10.1002/jcc.20440, Wiley InterScience, 2005.
Selection and Replacement (A) Selection pressure the probability of selecting the best chromosome as compared to the average selection probability of all the chromosomes Selection intensity population expected average fitness value after performing a selection step having as base a normalized Gaussian distribution Selection variance population expected variance of the fitness distribution after performing a selection step having as base a normalized Gaussian distribution Loss of diversity the percentage of non-selected individuals during the selection phase Bias the absolute difference between the expected reproduction probability of a specific chromozome and the chromosome’s normalized fitness Spread range of possible values for the offsprings of a given chromosome
Selection and Replacement (B) Selection strategy • fitness-proportionate selection • no bias, does not guarantee a minimum spread • rank-based selection • selection relative to rank and not to the chromosome’s real fitness value • stochastic tournament selection • tournament among randomly chosen individuals • selection pressure can be adjusted by modifying the size of the tournament • offer a constant selection pressure • uniform selection • selection bias towards sparse fitness levels and not towards best fit chromosomes • might not behave well under extensive genetic diversity conditions (i.e, initial phases)
Selection and Replacement (C) Replacement strategy • Global replacement vs. Local replacement • generational replacement – replace all the parents with the offsprings • uniform replacement - less offsprings than parents, replace at random • elitist replacement - less offsprings than parents, replace worst parents • fitness-based replacement - more offsprings than parents, insert only the best offsprings • random, first-in-first out – the simplest ones, do not exert strong performace characteristics • worst-fit replacement – the least fit chromosome(s) are replaced • exponential replacement – chromosomes are ranked from the least fit to the the most fit, the replacement process employing an associated probability distribution • ...
Recombination Operators • discrete recombination – exchange of chromosome values; for each locus in the offspring, the parent chromosomes have equal probabilities of contributing • intermediate recombination – offsprings obtained by interpolation of the parents Offspring[ loci ] = ui * A[ loci ] + ( 1-ui ) * B [ loci ], ui[ -, 1+ ], i 1,n • line recombination – particular case of intermediat recombination, a single uniformly random generated variable being considered Offspring[ loci ] = u * A[ loci ] + ( 1-u ) * B [ loci ], u[ -, 1+ ], i 1,n • one/multi point crossover – implies the generation of one/multiple cutting points marking the segments to be combined into offsprings • uniform crossover – each locus is consider as potential cutting point • ...
Mutation Operators • real valued mutation – the main parameters to adjust are the mutation step size and the mutation probability – usually chosen to be 1/n, n – no. of variables • variable step mutation – small step mutations are accepted with high probability while large step mutations are accepted with low probability M[ loci ] = M[ loci ] + si * ri * pi , 1,n • si{ -1, +1 } • ri = r * |Bi – Ai|, r ~ 0.1, M[ loci ] [Ai, Bi] • pi =2-u*k, u[ 0, 1 ] • step-size adaptation mutation – n step-sizes / one direction / n-directions; do not offer consistent improvements for extreme multi-modal / high noise functions • ...
∂1 ∂2 ∂i ∂n ... ... GA population ... ... ... Guided Mutation - Mutation + LS Angle Mutation ∂1 ∂2 ∂i' ∂n ... ... Local Search ∂'1 ∂'2 ∂'n Optimized Individual
Termination Criteria • preset number of fitness function evaluations / number of generations • termination once the evolution ceases • optimal value has been found (!!) • ...
∂i ∂j ∂k ∂i Distance - Diversification GA population Root Mean Squared Deviation ... ... CONVERGENCE ... ... Compute Distances between Conformations Levenberg-Marquadt DIVERSIFICATION RMSD(i,j) RMSD(i,k) RMSD(j,k) Transform Conformations ... ... ∂j ∂k • Ananth Ranganathan, The Levenberg-Marquardt Algorithm, citeseer.ist.psu.edu/638988.html,June 2004
Literature examples (A) • Dusan P. Djurdjevic, Mark J. Biggs, Ab Initio Protein Fold Prediction Using Evolutionary Algorithms: Influence of Design and Control Parameters on Performance, DOI 10.1002/jcc.20440, Wiley Interscience • Generational Evolutionary Algorithm • Steady-State Evolutionary Algorithm • tournament selection, single-point/multi-point and uniform crossover, uniform distribution mutation with angles within the range of 0-360, termination in case of lack of improvement for a specified number of generations • Christopher D. Rosin University of California, San Diego, A Comparison of Global and Local Search Methods in Drug Docking, Proceedings of the Seventh International Conference on Genetic Algorithms, ICGA, 1997 • Simulated Annealing – 100 tests with 1.5~1.8 billion function evaluations per test, linearly reduced temperature • Solis and Wet’s Algorithm (a class of local and global search algorithms) – part of the GA+LS hybrid, deviations chosen from a normal distribution - does not require gradient information • Genetic Algorithm+LS – local search applied to only 7% of the population in each generation; stochastic selection, two-points crossover, Cauchy-deviate mutation
Literature examples (B) • Garrett M. Morris et al, Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function, Journal of Computational Chemistry, Vol. 19, No. 14, 1639-1662, 1998. • Monte Carlo – applied on a pre-calculated grid of interaction energies • Simulated Annealing • Genetic Algorithm, Lamarckian Genetic Algorithm • uniform random value initialization (-180.0, 180.0), • maximum number of generations/function evaluations, • proportional elitist selection, constant-size population, • two-point crossover with breaks between genes, Cauchy distribution-based mutation, • local search at the end of each generation on a user-defined percentage of the population • each method is given ~1.5 million function evaluations / up to 41.5 minutes on a 200MHz Silicon Graphics MIPS with 128 Mb of RAM
Outline • Protein Structure Prediction (PSP): modeling and complexity analysis • Parallel Hybrid Metaheuristics for the PSP Problem • Grid experimentation on GRID5000 • Conclusion and Future Work
PARAllel and DIStributed Evolving Objectshttp://paradiseo.gforge.inria.fr/ ParadisEO for computational Grids PVM, PThreads MPI (LAM, CH) EO MO MOEO Condor-MW Globus • Message passing (MPI, PVM) • Clusters, Networks of Workstations, • Multi-programming (PThreads) • Shared Memory Multi-processors (SMP) • Parallel distributed computing • Clusters of SMPs (CLUMPS) • Grid computing • Condor-MW and Globus (MPICH-G2) Evolving Objects framework (EO) European project (Geneura Team, INRIA, LIACS) Transparent use http://eodev.sourceforge.net Our contributions • Multi-Objective EO (MOEO) for the design of multi-objective evolutionary algorithms • Moving Objects (MO) for the design of local search algorithms • ParadisEO for parallel hybrid metaheuristics • S. Cahon, N. Melab and E-G. Talbi. ParadisEO: A Framework for the Reusable Design of Parallel and Distributed Metaheuristics. Journal ofHeuristics, Elsevier Science, Vol.10(3), pages 357-380, May 2004.
Contributions ParadisEO-EO (A framework of Evolutionary Algorithms) ParadisEO -MOEO Techniquesrelated tomulti-objective optimization ParadisEO-PEO (parallel and distributed metaheuristics) ParadisEO-MO http://paradiseo.gforge.inria.fr Solution-basedmetaheuristics Parallelism (Partitioning solutions at several steps) Hill Climbing Simulated Annealing Cooperation/Hybridization(Algorithms exchange solutions)Ex. Island cooperation Tabu search
Hybrid metaheuristic Independently cooperating metaheuristics Encapsulated metaheuristics High level Low level Inheritance relationships Trivial ParadisEO distributed multi-agent model Relay Coevolutionary Concurrently executing metaheuristics Metaheuristics executed in sequence Development of hybrid metaheuristics • The deployment of concurrent independent/cooperative metaheuristics • The parallelization of a single step of the metaheuristic (based on distribution of the handled solutions) • The parallelization of the processing of a single solution Unifying view of the three parallel hierarchical levels E-G. Talbi, « A taxonomy of hybrid metaheuristics », Journal of Heuristics, 2002.
Design of several levels of parallelization/hybridization Processing of a single solution(Objective / Data partitioning) Independent walks, Multi-start model, Hybridization/Cooperationof metaheuristics Parallel evaluation ofthe neighborhood/population Scalability Heuristic Population / Neighborhood Solution
Parallel asynchronous hierarchical Lamarkian GA Cooperative GAs (Island model) Parallel evaluation of the population Low-level co-evolutionary hybridization
E.A. E.A. Migration E.A. E.A. Asynchronous Island Model
E.A. Solution Fullfitness The Parallel Evaluation of the Population
Synchronous Multi-Start Model Parallel Neighbourhood Exploration Gradient Local Search Optimization Parallel Hybrid Simulated Annealing Generate S0 k := 0 while Tk > Tthresholddo for s:=1 to nbSamples do Srand := randomMove( S0 ); ΔE := eval( Srand ) - eval( S0 ); ifΔE < 0 then S0 := Srand; else S0 := Srand with prob. 1.0 / ( 1.0 + eΔE/Tk); endif endfor k := k + 1; Adjust Tk; endwhile
Outline • Protein Structure Prediction (PSP): modeling and complexity analysis • Parallel Hybrid Metaheuristics for the PSP Problem • Grid experimentation on GRID5000 • Conclusion and Future Work
Deployment of ParadisEO-G4 1. Reservation of the desired nodes • Lille, Nice-Sophia Antipolis, Lyon, Nancy, Rennes: 400CPUs • exclusive reservation – no interference may occur; the processor are completely available during the reservation time. CLUSTER A 2. Select a master node for the Globus GRID 3. Configure the Globus GRID (certificates, user credentials, xinetd, postgresql, etc.) M6 M4 4. Deployment and execution – MPICH-G2 M5 CLUSTER C CLUSTER B M2 M2 M3 M3 GRID5000: A fully reconfigurable grid! The configuration phase relies on the deployment of pre-built Linux « images » having Globus and MPICH-G2 already installed.
1L2Y and α-Cyclodextrin Triptophan-Cage – Protein Data Bank ID: 1L2Y α-β-γ Cyclodextrin
Parallel asynchronous hierarchical Lamarkian GA - Cyclodextrin • Native energy at 242 kcal mol-1 MAX_GEN 100 POPULATION 300 CXOVER 0.95 CMUTATION 0.05 LS_XOVER 0.15 LS_MUTATION 0.05 XOVER 1.0 MUTATION 0.1 MIGRATION_R 15% MIGRATION_STEP 5Gen
Parallel Hybrid SA – 1L2Y & Cyclodextrin • the same number of individuals is sampled by both algorithms, GA and SA TRYPTOPHAN-CAGE (1L2Y) GA vs. SA CYCLODEXTRIN SA
Cyclodextrin Native Energy: 242.4 kcal mol-1 Tryptophan-cage Native Energy: 46.446 kcal mol-1 Experimental Results Cyclodextrin Min Max Avg StDev GA 2470 5845.27 3790.56 708.54 SA 1029.14 4593 2359.26 281.2 Random+LS 1204 6329.91 3132.8 800.85 GA+LS, no isl. GA+LS SA SA+LS Cyclodextrin 264.599 164.973 1029.14 904.046 1L2Y 93.5521 57.5447 201.37 86.3106 Min Max Avg StDev 1012.54 13471.4 4420.91 1521.67 Random Search 1L2Y GA+LS, Island GA+LS,No Ils. GA+Isl. GA Cyclodextrin 31m20s 29m32s 7m58s 7m24s 1L2Y 29m57s 31m17s 7m47s 7m45s
Outline • Protein Structure Prediction (PSP): modeling and complexity analysis • Parallel Hybrid Metaheuristics for the PSP Problem • Grid experimentation on GRID5000 • Conclusion and Future Work
Conclusion and Future Work (1) • The GA behaves better on the considered benchmarks, especially on Cyclodextrin • Bellow native-conformation energy results were obtained • The SA results are comparable to the results obtained by the GA with no hybridization • The SA allows for little intrinsic parallelism … • … but, it is not getting easily trapped in local minima (like directed LS, i.e. gradient LS)
Conclusion and Future Work (2) • Hierarchical multi-stage parallel models may prove to be efficient on the considered benchmarks • Hybridization schemes combining … • … the GA (strong exploration capabilities) • … and the SA (intensification + less prone to getting trapped in local minima than gradient methods)
ParadisEO-PEO Evolutionary Algorithm pop_eval ( population ); do { eoPop< EOT > offsprings; select ( population, offsprings ); transform ( offsprings ); pop_eval ( offsprings ); replace ( population, offsprings ); } while ( cont ( population ) ); • evaluation function - the designed class has to be derived from the peoPopEval class • peoSeqPopEval • peoParaPopEval • selection strategy – applied at each iteration in order to obtain the offspring population • transformation operators - crossover operator(s), mutation operator(s); peoEA requires a peoTransform derived object • peoSeqTransform • peoParaSGATransform • replacement strategy – for integrating the offsprings back into the initial population • continuation criterion - maximum number of generations, checkpoints, etc. peoEA ( eoContinue< EOT >& __cont, peoPopEval< EOT >& __pop_eval, eoSelect< EOT >& __select, peoTransform< EOT >& __transform, eoReplacement< EOT >& __replace );
ParadisEO-PEO EA Representation of the Individuals #include <EO.h> ... class Conformation : public EO< eoScalarFitness< double, greater< double > > > { ... int operator[]( const int& index ) const { return chromosome->getChromo( index+1 ); } ... Individu* chromosome; ... static Molecule* molecule; static Hamiltonian* hamiltonian; static Population* population; }; extern void pack( const dockingAtGrid::Conformation& conformation ); extern void unpack( dockingAtGrid::Conformation& conformation );
ParadisEO-PEO EA Representation – Packing & Unpacking void pack( const dockingAtGrid::Conformation& conformation ) { if ( conformation.invalid() ) pack( (unsigned int)0 ); else { pack( (unsigned int)1 ); pack( conformation.fitness() ); } for ( unsigned int index = 0; index < conformation.size(); index++ ) pack( conformation[ index ] ); } M I D L E W A R E Pack Data to Send void unpack( dockingAtGrid::Conformation& conformation ) { eoScalarFitness<double, std::greater<double> > fitness; unsigned int validFitness; unpack( validFitness ); if ( validFitness ) { unpack( fitness ); conformation.fitness( fitness ); } else { ... } } Unpack Recv. Data