Genetic Algorithms and Protein Folding
Based on a lecture by Dr. Steffen Schulze-Kremer
http://www.techfak.uni-bielefeld.de/bcd/Curric/ProtEn/proten.html
Genetic Algorithm: a heuristic method that operates on pieces of information much as nature operates on genes in the course of evolution.
• Individuals are represented by a linear string of letters of an alphabet (nucleotides in nature, bits in genetic algorithms).
• Individuals are allowed to mutate, cross over and reproduce.
• A fitness function evaluates individuals.
• Depending on the generation replacement mode, a subset of parents and offspring enters the next reproduction cycle.
• After a number of iterations the population consists of individuals that are well adapted in terms of the fitness function.
• It cannot be proven that the individuals of a final generation contain an optimal solution for the objective encoded in the fitness function.
I. Initialise a population of individuals.
• This can be done either randomly or with domain-specific background knowledge, so that the search starts from promising seed individuals. (Where available, the latter is always recommended.)
• Individuals are represented as a string of bits.
• A fitness function must be defined that takes an individual as input and returns a number (or a vector) that can be used as a measure of the quality (fitness) of that individual.
• The application should be formulated so that the desired solution to the problem coincides with the most successful individual according to the fitness function.
II. Evaluate all individuals of the initial population.
III. Generate new individuals. The reproduction probability for an individual is proportional to its relative fitness within the current generation.
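Step III can be illustrated with a minimal sketch of fitness-proportional ("roulette wheel") selection. The function names `select_parent` and `count_ones` are hypothetical, and the sketch assumes non-negative fitness values where larger means better; the lecture does not prescribe a particular implementation.

```python
import random

def select_parent(population, fitness, rng=random):
    """Pick one individual with probability proportional to its fitness.

    Assumption: fitness values are non-negative and larger is better.
    """
    scores = [fitness(ind) for ind in population]
    total = sum(scores)
    if total == 0:
        # No individual has any fitness: fall back to a uniform choice.
        return rng.choice(population)
    threshold = rng.uniform(0, total)
    running = 0.0
    for ind, score in zip(population, scores):
        running += score
        if running > threshold:
            return ind
    return population[-1]  # only reached if threshold == total

def count_ones(bits):
    """Toy fitness used for illustration: the number of 1 bits."""
    return bits.count("1")
```

An individual with twice the fitness of another occupies twice as much of the "wheel" and is therefore drawn about twice as often.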
Crossover
Two-point crossover:
0101001111000011010101011110111
1010101101011100101110001010101
Uniform crossover:
0101001111000011010101011110111
1010101101011100101110001010101
Genetic Operators:
• Mutation. Substitute one or more bits of an individual randomly by a new value (0 or 1).
• Variation. Change the bits in such a way that the number encoded by them is slightly incremented or decremented.
• Crossover. Exchange parts (single bits or strings of bits) of one individual with the corresponding parts of another individual. Originally, only one-point crossover was performed, but in theory one can use up to L - 1 different crossover sites (where L is the length of the individual).
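The operators above can be sketched on bit strings as follows. This is an illustrative sketch only: the function names, the per-bit mutation `rate` parameter and the 50% swap probability in uniform crossover are assumptions, not details taken from the lecture.

```python
import random

def mutate(bits, rate, rng=random):
    """With probability `rate` per position, substitute a random new value (0 or 1)."""
    return "".join(rng.choice("01") if rng.random() < rate else bit for bit in bits)

def vary(bits, rng=random):
    """Increment or decrement the encoded integer by one, clamped to the valid range."""
    value = int(bits, 2) + rng.choice((-1, 1))
    value = max(0, min(value, 2 ** len(bits) - 1))
    return format(value, "0{}b".format(len(bits)))

def two_point_crossover(a, b, rng=random):
    """Exchange the segment between two random crossover sites (needs length >= 3)."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform_crossover(a, b, rng=random):
    """Decide independently for every position whether the two parents swap bits."""
    pairs = [(x, y) if rng.random() < 0.5 else (y, x) for x, y in zip(a, b)]
    return "".join(p[0] for p in pairs), "".join(p[1] for p in pairs)
```

Both crossover variants only exchange material between the parents, so at every position the two children together carry exactly the two parental bits.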
IV. Select individuals for the new parent generation. Schemes:
1) The complete offspring is selected while all parents are discarded (original genetic algorithm). This is motivated by the biological model and is called total generation replacement.
2) The n best individuals (from old and new generation) are selected. This method is called elitist generation replacement.
V. Go back to step II until either a desired fitness value is reached or a predefined number of iterations has been performed.
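The two replacement schemes of step IV can be compared in a short sketch. The function names are hypothetical, and the sketch assumes a scalar fitness where larger is better.

```python
def total_replacement(parents, offspring):
    """Total generation replacement: keep all offspring, discard all parents."""
    return list(offspring)

def elitist_replacement(parents, offspring, fitness, n):
    """Elitist generation replacement: keep the n best of parents and offspring."""
    combined = list(parents) + list(offspring)
    return sorted(combined, key=fitness, reverse=True)[:n]
```

With elitist replacement the best fitness in the population can never decrease from one generation to the next, which is not guaranteed under total replacement.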
Init the first generation -> Evaluate -> Apply genetic operations -> Select the next generation -> (back to Evaluate)
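The whole cycle can be put together in a compact sketch. All parameter values (population size, mutation rate, number of generations), the use of one-point crossover and the choice of elitist replacement are assumptions made for illustration; `run_ga` is a hypothetical name.

```python
import random

def run_ga(fitness, length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    """Minimal GA loop on bit strings; returns (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    # Init the first generation (randomly, i.e. without background knowledge).
    pop = ["".join(rng.choice("01") for _ in range(length)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Evaluate.
        scores = [fitness(ind) for ind in pop]
        history.append(max(scores))
        # Apply genetic operations: fitness-proportional parent choice,
        # one-point crossover, then bitwise mutation.
        weights = [s + 1e-9 for s in scores]  # keeps all weights positive
        offspring = []
        while len(offspring) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            child = "".join(rng.choice("01") if rng.random() < rate else c for c in child)
            offspring.append(child)
        # Select the next generation (elitist generation replacement).
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    return pop[0], history
```

For example, `run_ga(lambda bits: bits.count("1"))` searches for a string with as many 1 bits as possible; as noted earlier, there is no guarantee that the final generation contains the optimum.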
Representation Formalism
• Hybrid approach: the genetic algorithm is configured to operate on numbers, not on bit strings as in the original genetic algorithm.
Disadvantages:
• The mathematical foundation of genetic algorithms holds only for binary representations, although some of the mathematical properties are also valid for a floating point representation.
• Binary representations run faster in many applications.
• An additional encoding/decoding process may be required to map numbers onto bit strings.
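To make the encoding/decoding step concrete, here is a minimal sketch that maps an angle in degrees onto a fixed-width bit string and back. The 8-bit width, the range [-180°, 180°) and the function names are assumptions chosen for illustration.

```python
def encode_angle(angle_deg, n_bits=8):
    """Map an angle in degrees onto an n_bits-wide bit string (range [-180, 180))."""
    steps = 2 ** n_bits
    index = int(((angle_deg + 180.0) % 360.0) / 360.0 * steps)
    return format(index, "0{}b".format(n_bits))

def decode_angle(bits):
    """Inverse mapping: return the lower edge of the interval the bit string encodes."""
    steps = 2 ** len(bits)
    return int(bits, 2) * 360.0 / steps - 180.0
```

With 8 bits the resolution is 360/256 ≈ 1.4°, so a round trip through `encode_angle` and `decode_angle` loses at most that much precision. This quantisation is one practical cost of a pure bit-string representation.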
Protein Structure Prediction
• Individuals: protein conformations
• Fitness function: force field
Representation
• Cartesian 3D coordinates are not a good choice -> representation by torsion angles
• The frequency of each torsion angle in intervals of 10° was determined, and the ten most frequently occurring intervals are made available for substitution of individual torsion angles by the MUTATE operator.
• At the beginning of the run, individuals were initialized either with a completely extended conformation, where all torsion angles are 180°, or by a random selection from the ten most frequently occurring intervals of each torsion angle.
• For the ω torsion angle the constant value of 180° was used because of the rigidity of the peptide bond between the atoms C_i and N_{i+1}.
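The initialisation and MUTATE behaviour described above can be sketched as follows. The interval tables in `FREQUENT` are hypothetical placeholder values, not the measured frequencies from the lecture; restricting the individual to phi, psi and omega, and using the interval midpoint as the substituted value, are likewise assumptions.

```python
import random

# HYPOTHETICAL placeholder data: left edges (in degrees) of the ten most
# frequent 10-degree intervals per torsion angle. Real values would come
# from a statistical analysis of known protein structures.
FREQUENT = {
    "phi": [-160, -150, -130, -120, -110, -100, -90, -80, -70, -60],
    "psi": [-60, -50, -40, -30, 120, 130, 140, 150, 160, 170],
}
OMEGA = 180.0  # kept constant because of the rigid peptide bond

def pick(name, rng):
    """Return the midpoint of a randomly chosen frequent interval (assumption)."""
    return rng.choice(FREQUENT[name]) + 5.0

def extended(n_residues):
    """Completely extended conformation: all torsion angles are 180 degrees."""
    return [{"phi": 180.0, "psi": 180.0, "omega": OMEGA} for _ in range(n_residues)]

def random_start(n_residues, rng=random):
    """Random selection from the most frequent intervals of each torsion angle."""
    return [{"phi": pick("phi", rng), "psi": pick("psi", rng), "omega": OMEGA}
            for _ in range(n_residues)]

def mutate(individual, rate, rng=random):
    """MUTATE sketch: substitute angles by values from the frequent intervals."""
    mutated = [dict(residue) for residue in individual]  # leave the parent intact
    for residue in mutated:
        for name in FREQUENT:  # omega is never touched
            if rng.random() < rate:
                residue[name] = pick(name, rng)
    return mutated
```

Restricting substitutions to frequently observed intervals keeps mutated conformations in regions of torsion-angle space that actually occur in proteins.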
Search Space
• Generally, molecules with n atoms have 3n - 6 degrees of freedom -> 100 residues * approximately 20 atoms per residue = 2000 atoms, i.e. 3 * 2000 - 6 = 5994 degrees of freedom.
• Systems of equations with this number of variables are analytically intractable today.
• Discrete approximation: (5 torsion angles per residue * 5 likely values per torsion angle)^100 = $25^{100}$ conformations.
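The arithmetic on this slide can be checked directly. The variable names are illustrative. Note that the slide counts 5 * 5 = 25 states per residue; counting every combination of five angles with five values each would give 5^5 states per residue, which is larger still.

```python
residues = 100
atoms_per_residue = 20  # approximate figure used on the slide

n_atoms = residues * atoms_per_residue
degrees_of_freedom = 3 * n_atoms - 6
print(degrees_of_freedom)  # 5994

states_per_residue = 5 * 5  # the slide's count: 5 angles * 5 likely values
conformations = states_per_residue ** residues
print(len(str(conformations)))  # 140 decimal digits, i.e. roughly 10^139 to 10^140
```

Even this coarse discretisation is far too large to enumerate, which is the motivation for a heuristic search such as a genetic algorithm.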
Fitness Function - Potential Energy
CHARMM energy function:
$E = E_{bond} + E_{\phi} + E_{tor} + E_{impr} + E_{vdW} + E_{el} + E_{H} + E_{cr} + E_{c\phi}$
• $E_{bond}$: bond length potential (set to const)
• $E_{\phi}$: bond angle potential (set to const)
• $E_{tor}$: torsion angle potential
• $E_{impr}$: improper torsion angle potential (set to const)
• $E_{vdW}$: van der Waals pair interactions
• $E_{el}$: electrostatic potential
• $E_{H}$: hydrogen bonds (set to const)
• $E_{cr}$, $E_{c\phi}$: interaction with the solvent (set to const -> in vacuum)
Simplified to:
$E = E_{tor} + E_{vdW} + E_{el}$
(Since there are no interactions with the solvent, there is not enough force to drive the protein to a compact folded state.)
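Simplified Energy Function
• Empirical relation between the number of residues and the diameter.
• $E = E_{tor} + E_{vdW} + E_{el} + E_{pe}$, with the torsion angle ($E_{tor}$), van der Waals ($E_{vdW}$) and electrostatic ($E_{el}$) terms plus $E_{pe}$, the pseudo-entropic term.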
First Test Protein: Crambin, 46 a.a.
Table 3. Steric Energies in the Last Generation
Table 2. R.m.s. Deviations to Native Crambin
• The genetic algorithm favoured individuals with the lowest total energy, which in this case was most easily achieved by optimising electrostatic contributions.
• Simple summation of different components has the disadvantage that components with larger numbers dominate the fitness function, whether or not they are of any significance for a particular conformation.
• In other words -> bad fitness function
Improvements
• Instead of using separate phi and psi value distributions, apply a phi-psi (2D) clustering procedure.
• Use a secondary structure prediction algorithm (70% accuracy).
• Specialised genetic operators: LOCAL TWIST (local conformation changes by performing the ring closure algorithm for polymers).
• The LOCAL TWIST operator led to significant improvements in prediction accuracy and also to a substantial decrease in overall computation time.
Improvements (2)
• Fitness function -> vector
• r.m.s. only for verification
Vector Fitness Function
Candidate selection for the next generation:
• If there is an individual that has better (i.e. lower) values in each fitness component than the other candidates, it is taken. This continues until no unambiguously better individuals are found.
• Then the worst individuals are removed, i.e. those with higher values in each fitness component than any other individual.
• The remaining set of individuals is heuristically reduced until the exact number of individuals for the next generation is reached. This is done by iteratively removing an individual with the worst fitness value in a randomly selected fitness component.
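One possible reading of this selection procedure is sketched below. Fitness vectors are plain tuples in which lower is better; the function names and the interpretation of "better in each component" as strict dominance over all other remaining candidates are assumptions, not the lecture's implementation.

```python
import random

def dominates(a, b):
    """True if fitness vector a is lower (better) than b in every component."""
    return all(x < y for x, y in zip(a, b))

def _find(pool, relation):
    """Index of the first candidate for which relation holds against all others."""
    for i, cand in enumerate(pool):
        if all(relation(cand, other) for j, other in enumerate(pool) if j != i):
            return i
    return None

def select_next_generation(candidates, n, rng=random):
    pool = list(candidates)
    chosen = []
    # Step 1: take unambiguously better individuals while they exist.
    while pool and len(chosen) < n:
        i = _find(pool, dominates)
        if i is None:
            break
        chosen.append(pool.pop(i))
    slots = n - len(chosen)
    # Step 2: remove individuals that are worse than all others in every component.
    while len(pool) > max(slots, 1):
        i = _find(pool, lambda cand, other: dominates(other, cand))
        if i is None:
            break
        pool.pop(i)
    # Step 3: heuristic reduction on a randomly selected fitness component.
    while len(pool) > slots:
        k = rng.randrange(len(pool[0]))
        worst = max(range(len(pool)), key=lambda i: pool[i][k])
        pool.pop(worst)
    return chosen + pool
```

For the candidates `(1, 1)`, `(2, 5)`, `(5, 2)` and `(9, 9)` with `n = 2`, the first is always kept and the last is always dropped; which of the two incomparable vectors in the middle survives depends on the randomly chosen component.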
Capability of Genetic Algorithms in General?
• Tests on other proteins (LOCAL TWIST and rms fitness) also gave close-to-native conformations (less than 3.0 Å).
• Conclusion: with an appropriate fitness function, the genetic algorithm achieves the desired results.
Test case – Crambin, 46 a.a.
I. Fitness vector: polar, , , , hydro, Crippen and solvent
• , hydro, Crippen and solvent decreased with rms
• polar, , misled the algorithm to a non-native conformation -> rms 6.27
II. Fitness vector: Crippen, clash, hydro and scatter + constraints on the secondary structures -> rms 4.36
• trypsin inhibitor -> 6.65
Conclusions
• Genetic algorithms proved to be an efficient search tool for 3-D representations of proteins. For a 3-D protein model with a simple, additive force field as fitness function and a rather small population, the genetic algorithm produced several individuals (i.e. protein conformations) of dissimilar topology, each with highly optimized fitness values.
• Given an appropriate fitness function, the genetic algorithm application described here finds the desired solution within only small deviations.
• The major problem lies in the fitness function. If there were one indicator, or a set of indicators, that returned 1 for a native protein conformation and 0 for a non-native one, one could expect the genetic algorithm approach to deliver reasonably accurate ab initio predictions. However, neither mathematical models nor empirical, semi-empirical or statistical force fields are yet accurate enough to reliably discriminate native from non-native conformations without additional constraints. Thus, the genetic algorithm produces (sub-)optimal conformations in a different sense than that of nativeness.
• Notice: the same problem (the fitness or scoring function) exists in the protein docking problem. The correct transformation (within 3-5 Å) is found in realistic time in almost all cases. However, assigning a high score to the native complex is problematic: a proper scoring function is not yet known.
Side Chain Placement rms 1.86