Evolutionary Algorithms for the Protein Folding Problem Giuseppe Nicosia Department of Mathematics and Computer Science University of Catania DMI - Università di Catania
Talk Outline
• An overview of Evolutionary Algorithms
• The Protein Folding Problem
• Genetic Algorithms for the ab initio prediction
An overview of Evolutionary Algorithms
EAs are optimization methods based on an evolutionary metaphor that have proved effective in solving difficult problems.
Computational Intelligence and EAs
“Evolution is the natural way to program” (Thomas Ray)
Evolutionary Algorithms
1. Set of candidate solutions (individuals): Population.
2. Generating candidates by:
  • Reproduction: copying an individual.
  • Crossover: 2 parents → 2 children.
  • Mutation: 1 parent → 1 child.
3. Quality measure of individuals: Fitness function.
4. Survival-of-the-fittest principle.
Main components of EAs
1. Representation of individuals: Coding.
2. Evaluation method for individuals: Fitness.
3. Initialization procedure for the 1st generation.
4. Definition of variation operators (mutation and crossover).
5. Parent (mating) selection mechanism.
6. Survivor (environmental) selection mechanism.
7. Technical parameters (e.g. mutation rates, population size).
Mutation and Crossover
EAs manipulate partial solutions in their search for the overall optimal solution. These partial solutions, or "building blocks", correspond to sub-strings of a trial solution; in our case, to local sub-structures within the overall conformation.
"Optimal" Parameter Tuning
• Experimental tests.
• Adaptation based on measured quality.
• Self-adaptation based on evolution!
Constraint handling strategies [Michalewicz, Evolutionary Computation, 4(1), 1996]
• Repair strategy: whenever an infeasible solution is produced, "repair" it, i.e. find a feasible solution "close" to the infeasible one.
• Penalize strategy: admit infeasible individuals into the population, but penalize them by adding a suitable term to the energy.
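The penalize strategy can be illustrated with a minimal sketch. The constraint used here (no two elements may occupy the same lattice position), the function names, and the penalty weight are all assumptions chosen for illustration; they are not taken from the talk.

```python
def count_overlaps(positions):
    """Hypothetical constraint check: number of repeated lattice positions."""
    return len(positions) - len(set(positions))


def penalized_energy(energy, violations, weight=10.0):
    """Energy to be minimised: raw energy plus a penalty per violation.

    The weight is an assumed tuning parameter; it must be large enough
    that infeasible individuals rarely beat feasible ones.
    """
    return energy + weight * violations


feasible = [(0, 0), (1, 0), (1, 1)]
infeasible = [(0, 0), (1, 0), (0, 0)]   # first and last positions collide
print(penalized_energy(-2.0, count_overlaps(feasible)))
print(penalized_energy(-2.0, count_overlaps(infeasible)))
```

The infeasible individual stays in the population but is ranked worse, so selection can still exploit any good sub-structures it carries.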
The evolution Loop
Algorithm Outline
procedure EA; {
  t = 0;
  initialize population P(t);
  evaluate P(t);
  until (done) {
    t = t + 1;
    parent_selection P(t);
    recombine P(t);
    mutate P(t);
    evaluate P(t);
    survive P(t);
  }
}
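The outline above can be sketched in Python as follows. This is only one possible instantiation: binary tournament parent selection, truncation survival over parents plus children, the fixed generation count used as the "done" test, and the OneMax toy problem (maximise the number of 1 bits) are all assumptions made for the example.

```python
import random


def evolve(fitness, init, mutate, recombine, pop_size=20, generations=50, seed=0):
    """Generic EA loop: select parents, recombine, mutate, evaluate, survive."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]      # initialize P(0)
    for _ in range(generations):                            # until (done)
        # Parent selection: binary tournament (an assumed choice).
        parents = [max(rng.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):
            for child in recombine(a, b, rng):
                children.append(mutate(child, rng))
        # Survival: keep the pop_size fittest of parents and children.
        population = sorted(population + children, key=fitness,
                            reverse=True)[:pop_size]
    return max(population, key=fitness)


# Toy problem (OneMax) on 16-bit strings.
def init(rng):
    return [rng.randint(0, 1) for _ in range(16)]


def mutate(bits, rng):
    return [b ^ (rng.random() < 1 / 16) for b in bits]      # flip each bit w.p. 1/16


def recombine(a, b, rng):
    cut = rng.randrange(1, len(a))                          # one-point crossover
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


best = evolve(sum, init, mutate, recombine)
print(best, sum(best))
```

Fitness is recomputed on demand here for brevity; a practical implementation would cache the evaluation of each individual.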
Example
Evolutionary Programming
procedure EP; {
  t = 0;
  initialize population P(t);
  evaluate P(t);
  until (done) {
    t = t + 1;
    parent_selection P(t);
    mutate P(t);
    evaluate P(t);
    survive P(t);
  }
}
• Individuals are real-valued vectors, ordered lists, or graphs.
• All N individuals are selected to be parents and are then mutated, producing N children. These children are evaluated, and N survivors are chosen from the 2N individuals using a probabilistic function based on fitness (individuals with greater fitness have a higher chance of survival).
• Mutation is based on the representation used and is often adaptive. For example, when using a real-valued vector, each variable within an individual may have an adaptive mutation rate that is normally distributed.
• No recombination.
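One EP generation on real-valued vectors might look like the sketch below. The stochastic q-tournament used for survival is a common way to realise a "probabilistic function based on fitness", but it is an assumption here, as are the fixed mutation step sigma (the adaptive variant is omitted) and the function name.

```python
import random


def ep_generation(population, fitness, rng, sigma=0.1, q=5):
    """One EP step: N parents -> N mutated children -> N survivors out of 2N."""
    # Every individual is a parent and yields exactly one child (no recombination).
    children = [[x + rng.gauss(0.0, sigma) for x in ind] for ind in population]
    pool = population + children
    scores = [fitness(ind) for ind in pool]
    # Each candidate meets q randomly drawn opponents and counts its wins.
    wins = []
    for s in scores:
        opponents = [rng.choice(scores) for _ in range(q)]
        wins.append(sum(s >= o for o in opponents))
    # The N candidates with the most wins survive.
    ranked = sorted(range(len(pool)), key=lambda i: wins[i], reverse=True)
    return [pool[i] for i in ranked[:len(population)]]


rng = random.Random(1)
pop = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(10)]
sphere = lambda v: -sum(x * x for x in v)     # maximise, optimum at the origin
pop = ep_generation(pop, sphere, rng)
```

Because opponents are drawn at random, a weak individual can occasionally survive, which is what distinguishes this from purely deterministic truncation.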
Evolution Strategies
• ESs typically use real-valued vectors.
• Individuals are selected uniformly at random to be parents.
• Pairs of parents produce children via recombination. The number of children created is greater than N.
• Survival is deterministic, in one of two ways:
  • the N best children survive and replace the parents; or
  • the N best individuals among children and parents survive.
• As in EP, mutation is adaptive.
• Unlike EP, recombination does play an important role in ES, especially in adapting mutation.
procedure ES; {
  t = 0;
  initialize population P(t);
  evaluate P(t);
  until (done) {
    t = t + 1;
    parent_selection P(t);
    recombine P(t);
    mutate P(t);
    evaluate P(t);
    survive P(t);
  }
}
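The two deterministic survival schemes can be written in a few lines. In standard ES notation they are usually called (μ, λ) and (μ + λ) selection; the function names below are illustrative, and individuals are plain numbers only to keep the example short.

```python
def survive_comma(parents, children, fitness, n):
    """(mu, lambda): the n best children replace the parents entirely."""
    return sorted(children, key=fitness, reverse=True)[:n]


def survive_plus(parents, children, fitness, n):
    """(mu + lambda): the n best of parents and children together survive."""
    return sorted(parents + children, key=fitness, reverse=True)[:n]


parents = [9.0, 1.0]
children = [5.0, 4.0, 3.0]
identity = lambda x: x
print(survive_comma(parents, children, identity, 2))   # [5.0, 4.0]
print(survive_plus(parents, children, identity, 2))    # [9.0, 5.0]
```

Note that the comma scheme can lose the best individual found so far (9.0 above), whereas the plus scheme is elitist.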
Genetic Algorithms
procedure GA; {
  t = 0;
  initialize population P(t);
  evaluate P(t);
  until (done) {
    t = t + 1;
    parent_selection P(t);
    recombine P(t);
    mutate P(t);
    evaluate P(t);
    survive P(t);
  }
}
• GAs traditionally use a more domain-independent representation, namely bit strings.
• Parents are selected according to a probabilistic function based on relative fitness.
• N children are created via recombination from the N parents.
• The N children are mutated and survive, replacing the N parents in the population.
• The emphasis on mutation and crossover is opposite to that in EP.
• Mutation flips bits with some small probability (background operator).
• Recombination, on the other hand, is emphasized as the primary search operator.
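A single generation of such a traditional GA could be sketched as below. The function name, the default mutation rate, and the requirement of an even population size with strictly positive fitness values are assumptions of this sketch.

```python
import random


def ga_generation(population, fitness, rng, p_mut=0.001):
    """One generational GA step on bit strings (lists of 0/1)."""
    n = len(population)                       # assumed even
    # Fitness-proportional parent selection (fitness assumed positive).
    weights = [fitness(ind) for ind in population]
    parents = rng.choices(population, weights=weights, k=n)
    # One-point crossover on consecutive pairs of parents.
    children = []
    for a, b in zip(parents[::2], parents[1::2]):
        cut = rng.randrange(1, len(a))
        children.append(a[:cut] + b[cut:])
        children.append(b[:cut] + a[cut:])
    # Background mutation: flip each bit with small probability p_mut.
    return [[bit ^ (rng.random() < p_mut) for bit in c] for c in children]


rng = random.Random(0)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
ones_plus_one = lambda ind: 1 + sum(ind)      # strictly positive toy fitness
pop = ga_generation(pop, ones_plus_one, rng)
```

The children replace the parents unconditionally, so survival itself applies no selection pressure; all of it comes from parent selection.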
Genetic Programming 1/2
There are 5 major preparatory steps in using GP for a particular problem:
1) selection of the set of terminals (e.g., the actual variables of the problem, zero-argument functions, and random constants, if any);
2) selection of the set of functions;
3) identification of the evaluation function;
4) selection of the parameters of the system for controlling the run;
5) selection of the termination condition.
Each tree (program) is composed of functions and terminals appropriate to the particular problem domain; the set of all functions and terminals is selected a priori in such a way that some of the composed trees yield a solution. The initial population is composed of such trees. The evaluation function assigns a fitness value that measures the performance of a tree. The evaluation is based on a preselected set of test cases (fitness cases); in general, the fitness function returns the sum of distances between the correct and the obtained results over all test cases.
Genetic Programming 2/2
procedure GP; {
  t = 0;
  initialize population P(t);  /* randomly create an initial population of individual computer programs */
  evaluate P(t);               /* execute each program in the population and assign it a fitness value */
  until (done) {
    t = t + 1;
    parent_selection P(t);     /* select one or two program(s) with a probability based on fitness (with reselection allowed) */
    create P(t);               /* create new programs for the population by applying the following operations with specified probability */
      reproduction;            /* copy the selected program to the new population */
      crossover;               /* create new offspring programs for the new population by recombining randomly chosen parts from 2 selected programs */
      mutation;                /* create one new offspring program for the new population by randomly mutating a randomly chosen part of one selected program */
      architecture-altering ops; /* choose an architecture-altering operation from the available repertoire of such operations and create one new offspring program for the new population by applying it to the one selected program */
  }
}
Scaling
Suppose one has two search spaces. The first is described with a real-valued fitness function F. The second search space is described by a fitness function G that is equivalent to F^p, where p is some constant. The relative positions of peaks and valleys in the two search spaces correspond exactly. Only the relative heights differ (i.e., the vertical scale is different). Should our EA search both spaces in the same manner?
Ranking selection (ES, EP)
If we believe that the EA should search the two spaces in the same manner, then selection should be based only on the relative ordering of fitnesses: only the rank of individuals matters.
• ES
  • Parent selection is performed uniformly at random, with no regard to fitness.
  • Survival simply saves the N best individuals, which depends only on the relative ordering of fitnesses.
• EP
  • All individuals are selected to be parents. Each parent is mutated once, producing N children.
  • A probabilistic ranking mechanism chooses the N best individuals for survival from the union of the parents and children.
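A small check makes the point concrete: ranking is unchanged when a positive fitness F is replaced by a power of itself. The sample values and the exponent 3 are arbitrary choices for illustration.

```python
def ranking(fitnesses):
    """Indices of individuals ordered from fittest to least fit."""
    return sorted(range(len(fitnesses)), key=lambda i: fitnesses[i], reverse=True)


F = [0.5, 2.0, 1.5, 3.0]
G = [f ** 3 for f in F]          # same landscape, different vertical scale

print(ranking(F))                # [3, 1, 2, 0]
print(ranking(G))                # [3, 1, 2, 0]
```

Any selection scheme that consults only this ordering therefore behaves identically on both search spaces.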
Probabilistic selection mechanism (GA)
Many people in the GA community believe that F and G should be searched differently. Fitness-proportional selection is the probabilistic selection mechanism of the traditional GA.
• Parent selection is based on how fit an individual is with respect to the population average. For example, an individual with fitness twice the population average will tend to have twice as many children as an average individual.
• Survival, though, is not based on fitness, since the parents are automatically replaced by the children.
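Under fitness-proportional selection, the expected number of children of an individual is its fitness divided by the population average. The sketch below computes these expectations for a fitness F and for its square, to show that the two landscapes are indeed treated differently; the sample values are arbitrary.

```python
def expected_children(fitnesses):
    """Expected offspring count per individual: fitness / population average."""
    average = sum(fitnesses) / len(fitnesses)
    return [f / average for f in fitnesses]


F = [1.0, 2.0, 3.0, 2.0]
G = [f ** 2 for f in F]

print(expected_children(F))      # [0.5, 1.0, 1.5, 1.0]
print(expected_children(G))      # the best individual now expects 2.0 children
```

Squaring the fitness leaves the ranking untouched but raises the share of the best individual from 1.5 to 2.0 expected children, i.e. it increases selection pressure.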
Lacking the killer instinct
One problem with this latter approach is that, as the search continues, more and more individuals receive fitnesses with small relative differences. This lessens the selection pressure, slowing the progress of the search. This effect, often referred to as "lacking the killer instinct", can be compensated somewhat by scaling mechanisms that attempt to magnify relative differences as the search progresses.
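One simple scaling mechanism is windowing: subtract the worst fitness in the current population before computing selection probabilities. It is shown here only as an example of the idea; the talk does not say which scaling scheme is meant, and the function names and the small epsilon are assumptions.

```python
def selection_probs(fitnesses):
    """Fitness-proportional selection probabilities."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def window_scale(fitnesses, epsilon=1e-9):
    """Shift fitnesses so the worst individual is (almost) zero.

    epsilon keeps the total positive when all fitnesses are equal.
    """
    low = min(fitnesses)
    return [f - low + epsilon for f in fitnesses]


late_population = [100.0, 101.0, 102.0]      # nearly converged: tiny relative gaps
print(selection_probs(late_population))
print(selection_probs(window_scale(late_population)))
```

Before scaling, the three individuals are selected almost uniformly; after scaling, the best one receives about two thirds of the selections.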
Mutation and Adaptation
• GAs typically use mutation as a simple background operator, to ensure that a particular bit value is not lost forever. Mutation in GAs typically flips bits with a very low probability (e.g., 1 bit out of 1000).
• Mutation is far more important in ESs and EP. Instead of a global mutation rate, mutation probability distributions can be maintained for every variable of every individual. More importantly, ESs and EP encode the probability distributions as extra information within each individual, and allow this information to evolve as well (self-adaptation of mutation parameters while the space is being searched).
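Self-adaptation can be sketched by letting each individual carry one step size per variable and mutating the step sizes before the variables. The log-normal update and the learning rates below follow a convention widely used in the ES literature; they are assumptions of this sketch, not details given in the talk.

```python
import math
import random


def self_adaptive_mutate(x, sigmas, rng):
    """Mutate the step sizes first, then use them to mutate the variables."""
    n = len(x)
    tau_global = 1.0 / math.sqrt(2.0 * n)              # conventional learning rates
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(n))
    shared = tau_global * rng.gauss(0.0, 1.0)           # one draw shared by all variables
    new_sigmas = [s * math.exp(shared + tau_local * rng.gauss(0.0, 1.0))
                  for s in sigmas]
    new_x = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(x, new_sigmas)]
    return new_x, new_sigmas


rng = random.Random(0)
child, child_sigmas = self_adaptive_mutate([0.0] * 4, [1.0] * 4, rng)
```

Step sizes that happen to produce fit children are inherited along with those children, so useful mutation scales spread through the population without any external tuning.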
Recombination and Adaptation
• There are a number of recombination methods for ESs, all of which assume that the individuals are composed of real-valued variables. Either the values are exchanged or they are averaged. The ES community has also considered multi-parent versions of these operators.
• The GA community places primary emphasis on crossover.
  • One-point recombination inserts a cut-point within the two parents (e.g., between the 3rd and 4th variables, or bits). Then the information before the cut-point is swapped between the two parents.
  • Multi-point recombination is a generalization of this idea, introducing a higher number of cut-points. Information is then swapped between pairs of cut-points.
  • Uniform crossover, however, does not use cut-points, but simply uses a global parameter to indicate the likelihood that each variable should be exchanged between two parents.
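The three GA crossover operators described above can be sketched on lists as follows. Two-point crossover stands in for the multi-point case, and the function names and the p_swap parameter name are illustrative.

```python
import random


def one_point(a, b, cut):
    """Swap the segments before the cut-point."""
    return b[:cut] + a[cut:], a[:cut] + b[cut:]


def two_point(a, b, i, j):
    """Swap the segment between cut-points i and j."""
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def uniform(a, b, rng, p_swap=0.5):
    """Exchange each position independently with probability p_swap."""
    c, d = list(a), list(b)
    for k in range(len(a)):
        if rng.random() < p_swap:
            c[k], d[k] = d[k], c[k]
    return c, d


a, b = [0] * 6, [1] * 6
print(one_point(a, b, 3))        # ([1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1])
print(two_point(a, b, 2, 4))     # ([0, 0, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1])
print(uniform(a, b, random.Random(0)))
```

Cut-point operators preserve contiguous blocks of the parents, whereas uniform crossover mixes positions independently, so the two families impose different search biases.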
Representation
• Traditionally, GAs use bit strings. In theory, this representation makes the GA more problem independent. We can also see this as a more genotypic level of representation, since the individual is in some sense encoded in the bit string. Recently, however, the GA community has investigated more phenotypic representations, including vectors of real values, ordered lists, and neural networks.
• The ES and EP communities focus on real-valued vector representations, although the EP community has also used ordered-list and finite-state-automaton representations.
• Very little has been done in the way of adaptive representations.
Strength of the selection and population’s carrying capacity
• Strong selection refers to a selection mechanism that concentrates quickly on the best individuals, while weaker selection mechanisms allow poor individuals to survive (and produce children) for a longer period of time.
• Similarly, the population can be thought of as having a certain carrying capacity, which refers to the amount of information that the population can usefully maintain. A small population has less carrying capacity, which is usually adequate for simple problems. Larger populations, with larger carrying capacities, are often better for more difficult problems.
• Perhaps the evolutionary algorithm can adapt both selection pressure and the population size dynamically as it solves problems.
Accumulated payoff
• EP and ES usually have optimization as their goal. In other words, they are typically most interested in finding the best solution as quickly as possible.
• De Jong (1992) reminds us that GAs are not function optimizers per se, although they can be used as such. There is very little theory indicating how well GAs will perform optimization tasks. Instead, theory concentrates on what is referred to as accumulated payoff.
• The difference can be illustrated by considering financial investment planning over a period of time (e.g., you play the stock market). Instead of trying to find the best stock, you are trying to maximize your returns as the various stocks are sampled. Clearly the two goals are somewhat different, and maximizing the return may or may not also be a good heuristic for finding the best stock.
Fitness correlation
• Fitness correlation appears to be a measure of EA-hardness that places less emphasis on optimality (Manderick et al., 1991). It measures the correlation between the fitness of children and that of their parents. Manderick et al. found a strong relationship between GA performance and the strength of the correlations.
• Another possibility is problem modality. Problems that have many suboptimal solutions will, in general, be more difficult to search.
• Finally, this issue is also closely related to a concern of de Garis, which he refers to as evolvability. De Garis notes that often his systems do not evolve at all, namely, that fitness does not increase over time. The reasons for this are not clear and remain an important research topic.
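A rough way to estimate a parent-child fitness correlation is to sample parents, apply an operator, and compute the Pearson coefficient between the two fitness lists. The sketch below does this for a single-bit-flip mutation on the OneMax toy problem; the sampling scheme, the function names, and the test problem are assumptions and are not claimed to reproduce the procedure of Manderick et al.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fitness_correlation(fitness, sample_parent, make_child, rng, samples=200):
    """Correlation between parent fitness and the fitness of one child each."""
    parents = [sample_parent(rng) for _ in range(samples)]
    children = [make_child(p, rng) for p in parents]
    return pearson([fitness(p) for p in parents], [fitness(c) for c in children])


def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(32)]


def flip_one(bits, rng):
    child = list(bits)
    child[rng.randrange(len(child))] ^= 1     # flip a single random bit
    return child


r = fitness_correlation(sum, random_bits, flip_one, random.Random(0))
print(round(r, 3))
```

A value close to 1 indicates that the operator produces children whose fitness resembles that of their parents, which is the situation in which an EA is expected to make steady progress.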
Distributed EAs
Because of the inherent natural parallelism within an EA, much recent work has concentrated on the implementation of EAs on parallel machines. Typically each processor holds either one individual (in SIMD machines) or a subpopulation (in MIMD machines). Clearly, such implementations hold the promise of shorter execution times. More interesting are the evolutionary effects that can be naturally illustrated with parallel machines, namely speciation, niching, and punctuated equilibria (Belew and Booker, 1991).
Summary
• Selection serves to focus search into areas of high fitness.
• Other genetic operators (recombination and mutation) perturb the individuals, providing exploration in nearby areas.
• Recombination and mutation provide different search biases, which may or may not be appropriate for the task.
• The key to more robust EA systems probably lies in the adaptive selection of such genetic operators.
Hint: Gray Code
procedure binaryToGray {
  g[1] = b[1];
  for k = 2 to m do
    g[k] = b[k-1] XOR b[k];
}
Gray code is often used in GAs for mapping between a decimal number and a bit string. Mapping each digit of a decimal number to a string of four bits corresponds to choosing 10 strings from 16 possibilities. In Gray code, two neighbouring decimal digits are always represented by adjacent bit strings that differ in only one bit position.
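The conversion procedure translates directly into Python. The helper to_bits and the most-significant-bit-first list layout are assumptions made for the example.

```python
def binary_to_gray(bits):
    """Convert a binary bit list (most significant bit first) to Gray code."""
    return [bits[0]] + [bits[k - 1] ^ bits[k] for k in range(1, len(bits))]


def to_bits(n, width):
    """Hypothetical helper: integer -> list of bits, most significant first."""
    return [(n >> (width - 1 - i)) & 1 for i in range(width)]


for digit in range(10):
    print(digit, to_bits(digit, 4), binary_to_gray(to_bits(digit, 4)))
```

In plain binary, moving from 7 (0111) to 8 (1000) changes all four bits; in Gray code the same step changes only one, so a single bit-flip mutation can reach a neighbouring value.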
Advantages of EAs
• Widely applicable, also in cases where no (good) problem-specific techniques are available:
  • Multimodalities, discontinuities, constraints.
  • Noisy objective functions.
  • Multiple-criteria decision-making problems.
• No presumptions with respect to the problem space.
• Low development costs, i.e. costs to adapt to new problem spaces.
• The solutions of EAs have straightforward interpretations.
• They can be run interactively (online parameter adjustment).
Disadvantages of EAs
• No guarantee of finding optimal solutions within a finite amount of time; this is true of all global optimization methods.
• No complete theoretical basis (yet), but much progress is being made.
• Parameter tuning is largely based on trial and error (genetic algorithms); solution: self-adaptation (evolution strategies).
• Often computationally expensive; solution: parallelism.
Applications of EAs
• Optimization and Problem Solving;
• NP-Complete Problems;
• Protein Folding;
• Financial Forecasting;
• Automated Synthesis of Analog Electrical Circuits;
• Evolutionary Robotics;
• Evolvable Hardware;
• Modelling.
Turing’s vision
July 20, 2001
Turing’s three approaches for creating an intelligent computer program
One approach was logic-driven, while a second was knowledge-based. The third approach that Turing specifically identified in 1948 for achieving machine intelligence is “... the genetical or evolutionary search by which a combination of genes is looked for, the criterion being the survival value”.
Turing A. M., Intelligent machinery, Machine Intelligence, 1948.
In 1950 Turing described how evolution and natural selection might be used to create an intelligent program: “... we cannot expect to find a good child-machine at first attempt. One must experiment with teaching one such machine and see how well it learns. One can then try another and see if it is better or worse”.
Turing A. M., Computing machinery and intelligence, Mind, 1950.
Turing’s third approach & EAs
There is an obvious connection between this process and evolution, by the identifications:
• Structure of the child machine = Hereditary material;
• Changes of the child machine = Mutations;
• Judgment of the experimenter = Natural selection.
The above features of Turing's third approach to machine intelligence are common to the various forms of evolutionary computation developed over the past four decades.
Genes were easy: The Protein Folding Problem
Genomics, Transcriptomics, Proteomics
Reductionistic and synthetic approaches
Basic principles
Why are computer scientists interested in biology?
Scientific answer
• Biology is interesting as a domain for AI research (e.g. drug design).
• Biology provides a rich set of metaphors for thinking about intelligence: genetic algorithms, neural networks and Darwinian automata are but a few of the computational approaches to behavior based on biological ideas. There will, no doubt, be many more (e.g. Artificial Immune Systems).
Pragmatic answer
• “Gene sequencing’s Industrial Revolution” [IEEE Spectrum, November 2000].
• IBM predicts the IT market for biology will grow from $3.5 billion to more than $9 billion by 2003. The volume of life science data doubles every six months. [IEEE Spectrum, January 2001]
• “Golden rice to Bioinformatics” [Scientific American 2001].
• Biotechnology, BioXML, BioPerl, BioJava, Bio-inspired models, Biological data analysis … Bio-all.
The building blocks: the 20 natural amino acids
[Table columns: 3-letter code, single-letter code, residue name, charge, polar, hydrophobic.]
Proteins are necklaces of amino acids
The protein is a linear polymer of the 20 different kinds of amino acids, which are linked by peptide bonds. Protein sequence length: 20–4500 aa.
Hydrophobic & hydrophilic residues
• Hydrophobic residues tend to come together to form a compact core that excludes water. Because the environment inside cells is aqueous (primarily water), these hydrophobic residues tend to be on the inside of a protein rather than on its surface.
• Hydrophobicity is one of the key factors that determine how the chain of amino acids will fold up into an active protein (hydrophilic: attracted to water; hydrophobic: repelled by water).
• The polarity of a molecule refers to the degree to which its electrons are distributed asymmetrically. A non-polar molecule has a relatively even distribution of charge.
Primary structure
• The scaffold is always the same.
• The side-chain R determines the amino acid type.
Grand Challenge Problems in Bioinformatics [T. Lengauer, Informatics – 10 Years Back, 10 Years Ahead, LNCS 2000]
• Finding Genes in Genomic Sequences
• Protein Folding and Protein Structure Prediction
• Estimating the Free Energy of Biomolecules and their Complexes
• Simulating a Cell
The famed Protein Folding problem asks how the amino-acid sequence of a protein adopts its native three-dimensional structure under natural conditions (e.g. in aqueous solution, at neutral pH and room temperature).
Sequence → Structure → Function
While the nature of the fold is determined by the sequence, it is encoded in a very complicated manner. Thus, protein folding can be seen as a connection between the genome (sequence) and what the proteins actually do (their function).