Predicting Natural RNA's using Evolutionary Computation
Schwartz Eyal, Berman Dror
Instructor: Dr. Danny Barash
Presentation outline
• RNA overview
• RNA secondary structure prediction
• Genetic Algorithm
• Using GA in our project
• Results
• A look into the future
RNA
A single-stranded nucleic acid made up of four nucleotides: adenine (A), guanine (G), cytosine (C), and uracil (U). Found in the nucleus and cytoplasm of cells, it plays an important role in protein synthesis and other chemical activities of the cell.
Types of RNA
There are several classes of RNA molecules:
• Messenger RNA (mRNA) is translated into protein by the joint action of transfer RNA (tRNA) and the ribosome.
• The ribosome is composed of numerous proteins and two major ribosomal RNA (rRNA) molecules.
• Other small RNAs (smRNA) exist, serving a great variety of purposes.
RNA secondary and tertiary structures • Stem-loops, hairpins, and other secondary structures can form by base pairing between distant complementary segments of an RNA molecule. • Interactions between the flexible loops may result in further folding to form tertiary structures such as the pseudoknot.
Predicting RNA secondary structure: RNA Folding by Energy Minimization
One approach to RNA structure prediction is to assign an energy to each base pair in a secondary structure. That is, there is a function e such that e(r_i, r_j) is the energy of the base pair formed by nucleotides r_i and r_j. The energy of the entire structure S is then given by the sum over its base pairs:
E(S) = \sum_{(i,j) \in S} e(r_i, r_j)
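The sum over base pairs can be sketched in a few lines. This is only an illustration: the pair energies below are made-up values (the slides do not give the actual ones), and the structure is assumed to be supplied as a list of paired positions.

```python
# Toy pair energies: hypothetical values, not taken from the presentation.
PAIR_ENERGY = {
    frozenset("GC"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}

def e(ri, rj):
    """Energy of one base pair; combinations that do not pair contribute 0."""
    return PAIR_ENERGY.get(frozenset((ri, rj)), 0.0)

def structure_energy(seq, pairs):
    """Sum e(r_i, r_j) over all base pairs (i, j) of a secondary structure."""
    return sum(e(seq[i], seq[j]) for i, j in pairs)

# A small hairpin: positions are 0-based, pairs are G-C, G-C and G-U.
seq = "GGGAAAUCC"
pairs = [(0, 8), (1, 7), (2, 6)]
print(structure_energy(seq, pairs))
```

Tools such as mfold use far richer energy models (stacking and loop terms), so this additive form is best read as the simplest version of the idea.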
Using the energy function
• Optimal folding, according to a criterion of lowest free energy, using the FOLD algorithm of Zuker and Stiegler: G = -46.5 kJ
• Suboptimal folding, using the same algorithm but imposing the biochemically mandated constraint that the adenines at positions 39 and 53 (shown in color) should not be base paired: G = -43.44 kJ
Tools we used to predict secondary structure:
• The Zuker Group: mfold
• Vienna RNA Package: RNAfold
Input: RNA sequence
Output: predicted structure, based on the lowest energy values for this sequence, and the energy values of the optimal and sub-optimal solutions.
What are we looking for?
Natural RNAs
Our goal is to predict natural RNAs using evolutionary computation.
[Figure: P5abc subdomain]
So… what is the problem?
If we are looking for RNAs that minimize a certain function, we have too many options. For a small RNA of 56 nucleotides there are 4^56 (about 5.2 × 10^33) possible sequences, far too many to search exhaustively.
Solution: a genetic algorithm
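The size of the search space is easy to check. The short calculation below counts all sequences of length 56 over a four-letter alphabet; the evaluation rate used for the time estimate is a purely hypothetical figure chosen for illustration.

```python
length = 56          # nucleotides per sequence
alphabet_size = 4    # A, C, G, U

n_sequences = alphabet_size ** length
print(n_sequences)
print(f"{n_sequences:.2e}")

# Hypothetical rate: one billion sequence evaluations per second.
evals_per_second = 1e9
seconds_per_year = 365.25 * 24 * 3600
years_needed = n_sequences / evals_per_second / seconds_per_year
print(f"{years_needed:.1e} years for an exhaustive scan")
```

Even at that optimistic rate an exhaustive scan is out of reach, which is the motivation for a heuristic search.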
Genetic Algorithm A genetic algorithm is an optimisation algorithm based on the mechanisms of Darwinian evolution which uses random mutation, crossover and selection procedures to breed better models or solutions from an originally random starting population or sample
Genetic Algorithm
1. [Start] Generate a random population of n chromosomes (suitable solutions for the problem)
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population
3. [New population] Create a new population by repeating the following steps:
• [Selection] Select two parent chromosomes from the population according to their fitness
• [Crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents.
• [Mutation] With a mutation probability, mutate the new offspring at each position in the chromosome
• [Accepting] Place the new offspring in the new population
4. [Replace] Use the newly generated population for a further run of the algorithm
5. [Test] If the end condition is satisfied, stop and return the best solution in the current population
6. [Loop] Go to step 2
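The steps above can be sketched as a compact loop. This is a minimal illustration, not the project's code: the fitness function is a placeholder (GC count), the crossover is assumed to be one-point, the end condition is a fixed number of generations, and all probabilities are hypothetical defaults.

```python
import random

ALPHABET = "ACGU"

def toy_fitness(seq):
    """Placeholder fitness (hypothetical): number of G and C nucleotides."""
    return seq.count("G") + seq.count("C")

def run_ga(fitness, n=30, length=56, generations=50,
           p_crossover=0.7, p_mutation=0.01, seed=0):
    rng = random.Random(seed)
    # [Start] random population of n chromosomes
    population = ["".join(rng.choice(ALPHABET) for _ in range(length))
                  for _ in range(n)]
    for _ in range(generations):                 # [Loop]
        # [Fitness] evaluate every chromosome (must be non-negative here)
        scores = [fitness(s) for s in population]
        weights = [s + 1e-9 for s in scores]     # avoid an all-zero wheel
        new_population = []
        while len(new_population) < n:           # [New population]
            # [Selection] fitness-proportional choice of two parents
            a, b = rng.choices(population, weights=weights, k=2)
            # [Crossover] one-point, applied with probability p_crossover
            if rng.random() < p_crossover:
                cut = rng.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # [Mutation] redraw each position with probability p_mutation
            # (the redraw may return the same base, leaving it unchanged)
            for child in (a, b):
                mutated = "".join(
                    rng.choice(ALPHABET) if rng.random() < p_mutation else base
                    for base in child)
                new_population.append(mutated)   # [Accepting]
        population = new_population[:n]          # [Replace]
    # [Test] the end condition here is simply the generation limit
    return max(population, key=fitness)

best = run_ga(toy_fitness)
print(best, toy_fitness(best))
```

Passing the fitness function as an argument keeps the loop independent of how a sequence is scored, so a different scoring function can be swapped in without touching the loop.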
Using GA in our Project: Population
Our population: a random group of RNAs, each consisting of 56 randomly chosen nucleotides.
Using GA in our Project: Selection
Parent chromosomes are selected from the population according to their fitness: the better the fitness, the greater the chance of being selected.
Technique: roulette wheel
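Roulette wheel selection gives each individual a slice of the wheel proportional to its fitness. The sketch below is one common way to implement it, assuming non-negative fitness values that are not all zero; the function name and the example values are illustrative only.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness.

    Assumes every fitness value is non-negative and at least one is positive.
    """
    cumulative = list(accumulate(fitnesses))    # right edge of each slice
    spin = rng.random() * cumulative[-1]        # where the wheel stops
    index = bisect_right(cumulative, spin)      # slice containing the stop
    return population[min(index, len(population) - 1)]

# Example: the second sequence holds three quarters of the wheel.
rng = random.Random(1)
picks = [roulette_select(["seqA", "seqB"], [1.0, 3.0], rng) for _ in range(1000)]
print(picks.count("seqB") / len(picks))
```

An individual with zero fitness gets a slice of zero width and is never selected, which is why a population whose fitness values are all zero has to be handled separately.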
Using GA in our Project: Cross-Over
• A certain probability exists that two selected organisms will actually breed.
• Organisms can mate or propagate into the next generation unchanged.
• Crossover results in two new child chromosomes, which are added to the new generation.
For example (cross-over):
accguaccgucugagccgguagaagccguaggggcaguaguc
accgucguaggggcaguagucgaagcaccgucugagccggua
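A minimal sketch of this step follows. The slides do not say which crossover scheme was used, so one-point crossover is an assumption here, as are the function name and the default probability; the two input strings are the sequences shown on the slide.

```python
import random

def one_point_crossover(parent_a, parent_b, p_crossover=0.7, rng=random):
    """Return two children.

    With probability p_crossover the parents swap tails at a random cut point;
    otherwise both propagate into the next generation unchanged.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have equal length")
    if rng.random() >= p_crossover:
        return parent_a, parent_b              # no breeding this time
    cut = rng.randrange(1, len(parent_a))      # cut strictly inside the sequence
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])

a = "accguaccgucugagccgguagaagccguaggggcaguaguc"
b = "accgucguaggggcaguagucgaagcaccgucugagccggua"
child_1, child_2 = one_point_crossover(a, b, p_crossover=1.0, rng=random.Random(0))
print(child_1)
print(child_2)
```

Whatever the cut point, every position of the two children carries the same pair of nucleotides as the parents at that position, so crossover only recombines existing material and never invents new bases.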
Using GA in our Project: Mutation
Types:
• Transition: purine ↔ purine (A ↔ G) or pyrimidine ↔ pyrimidine (C ↔ U)
• Transversion: purine ↔ pyrimidine
The transition/transversion rate is 2:1.
For example:
acguggcgaggugccggcuac
acgaggcgaggugucggcuac
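The 2:1 transition/transversion rate can be sketched as follows. The per-position mutation probability and the function names are hypothetical; the only values taken from the slide are the 2:1 ratio and the example pair of sequences.

```python
import random

TRANSITION = {"A": "G", "G": "A", "C": "U", "U": "C"}
TRANSVERSIONS = {"A": "CU", "G": "CU", "C": "AG", "U": "AG"}

def mutate(seq, p_mutation=0.01, ts_tv_ratio=2.0, rng=random):
    """Mutate each position with probability p_mutation (uppercase ACGU input).

    A mutation is a transition with probability ratio / (ratio + 1),
    i.e. 2/3 for the 2:1 rate, and a transversion otherwise.
    """
    p_transition = ts_tv_ratio / (ts_tv_ratio + 1.0)
    out = []
    for base in seq:
        if rng.random() < p_mutation:
            if rng.random() < p_transition:
                base = TRANSITION[base]
            else:
                base = rng.choice(TRANSVERSIONS[base])
        out.append(base)
    return "".join(out)

def classify(before, after):
    """Label a single-base change as 'transition' or 'transversion'."""
    if before == after:
        return None
    return "transition" if TRANSITION[before] == after else "transversion"

# The example pair from the slide, in uppercase.
before = "ACGUGGCGAGGUGCCGGCUAC"
after = "ACGAGGCGAGGUGUCGGCUAC"
changes = [(i, x, y, classify(x, y))
           for i, (x, y) in enumerate(zip(before, after)) if x != y]
print(changes)
```

Applied to the slide's example, the comparison finds two changed positions, one of each mutation type.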
Using GA in our Project: Elitism
• In each generation, a certain number of the fittest individuals are passed to the next generation unchanged.
• This principle has been shown to give better and faster results.
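Elitism can be sketched as a thin wrapper around whatever breeding step is in use. Here `breed` is a hypothetical callback standing in for selection, crossover and mutation, and higher fitness is assumed to be better.

```python
def next_generation_with_elitism(population, fitnesses, elite_size, breed):
    """Copy the elite_size fittest individuals unchanged, then fill the rest
    of the new generation with offspring produced by breed()."""
    ranked = sorted(zip(fitnesses, population),
                    key=lambda pair: pair[0], reverse=True)
    elite = [individual for _, individual in ranked[:elite_size]]
    offspring = []
    while len(elite) + len(offspring) < len(population):
        offspring.append(breed(population, fitnesses))
    return elite + offspring

# Toy example with a stand-in breed() that always returns the same child.
population = ["rna0", "rna1", "rna2", "rna3"]
fitnesses = [5.0, 33.04, 12.0, 24.06]
new_generation = next_generation_with_elitism(
    population, fitnesses, elite_size=2, breed=lambda pop, fit: "child")
print(new_generation)
```

The population size stays constant: every slot not taken by an elite individual is filled by breeding.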
Using GA in our Project: Elitism
Average fitness: 12.468
0 : GATGTCTCAAATGCAAAAACTTGCATCAGGTAGGTCAGGAGGTATTATTCATAGAA
1 : GCAATTACGTGGCAGTGCACAAAACATCTTCCAGCTCCATCGCGGTGAAGCCGCCA
2 : CACATTCTCGGGAGGCATTGTCGTTTAGACGCCTGAGTTTGCGGTATTTGCGATGT
3 : GGCGATACTGGCCCCTTTCGTAGGTTCTTTGCCAACTATGGCATGCTCAAATCGCA
4 : CGTACCGTCGACGTTAATTTAGAATATAGCAATTACAGAGAATGAGGAGGTGAATT
5 : AGTTTTTTGTATGACGAACAGTCACATGAGCCACAAATTTGTGATTTTTAACTCGC
6 : CCTGTATTCTTGGGCACTCAGAACAAGTCAAGCTAAATACGTTAGACTTGACGAGG
7 : ACCCCGTTCATCTTTGTGGCTTAGCAATAGCATTCCCCAGCTAATTGGCCTAATTG
8 : ATCACTCCGGGTTGCACCCAATGGACGCCCTCAACGTGTCCCAATGCATGCACTGG
9 : CATGGGTGGAAGTTTAAAATGCACTCCCATTCAGTGAGAGTCAGAAGCAGAGAATT
10 : CCAGATTACTGCCTAAAAGAAACATGGTGGGATTGTGCAAAGCGCCGCGCGGCTTA
11 : CCTATGAGCGGTTGTAACGGGATACCTTCGTGTTGTCGCGATCACCAGGGAAGTCA
12 : CATGGGACCTAGCGAGCGGTTGCCACCGAGGCGCTAAAGCTGAAAAGGGACCGGGG
13 : TACTGTCCCACCATGTGGAGTGACTCTCTCAGCCGAATCCTGGAGCTATTGGGTAC
14 : ATGAAGGGTAGATTCTCATTCGTAGGTACTCCGTCGGAACAGCACTTTTGGAAGAG
15 : ATGCGTGATATCATGAGAATTTGGCCGGTGATGTAAGGCCGAGGTCTCCTCATTGA
16 : AAGTGTGAGGCACGGTGAGCCCTGAAGTTAAAAGTTCGTTAAACGGCAGTGAACGA
17 : CCAACAAGGACAGATGCTATCCAAAGAATGAATAACACTTCATTAGCCGCCTGCTG
18 : TTGGGTGCTGGATCTACGTGACTGGAGCCCTACGGTCAAATTAGATTGCGAGTTAG
19 : AGTCAGGCAAACCAGATGGAGCGTAGCTCGCCAATATCCTCCCGGTGCCCCTGTTG
20 : CAGTGTATATTTACGGGTAAGTGAATTGTGCATTTCGAAGTACACAGTTGAGCGGC
21 : CCAAACCTAAAGACCACGAGGGCGACAGTGTCTTCTAGGATTTTAATCGTTCCATG
22 : GTACCTGATAATGGACCTCCTAGCACGCGCTAATCCTAGGAGCGACAGACTTCGCC
23 : TTTCCGCCGTTCTCTTTACTGCCGGCGATTCGGAATTCCCAAGTCCGACATTCCGA
24 : GAACTCTCGTCCCGGCGACTCTTGTGGCTACCACGTGGAACCCGTTACTCAAATTA
25 : GCCCCGTCTCACTAGCGTTCTTTGATTCTGCCTGGAACCTTCAGCGTTGTCCGATT
26 : TGAGACTTTGTTTAGGCGCTCAGTTTAGTTCTGCCGGCGCTCAGGGCTAGGCGCAG
27 : AAAAACTGGAAACGCAACTGTACTGACACCGCGGCGTAACCACGTGTTTGCGGGGA
28 : GTATATCGCGACTAGACAGAGCTGTAACGGCCCGAGCCAGACTTCGTGGCGATCGG
29 : CTAACCCTTCCATCTTGGGAACGGGCTCGCAAAAAGCCCCGGCCTAAGTGGTTAGG
First elite pick: RNA No. 12, fitness of 33.04
Second elite pick: RNA No. 25, fitness of 24.06
The danger: converging into a local minimum
Using GA in our Project: Fitness Function – Naïve Approach
Main idea: go for the lowest free energy value.
Fitness(RNA) = Min_Energy(RNA)
The results: RNAs with a very low energy value but without biological value.
Using GA in our Project: Fitness Function – Naïve Approach
Conclusions
• A fitness function based only on minimum energy tends to converge to unnatural structures.
• The output sequences consist mainly of C-G base pairs, which lead to very rigid, low-energy structures.
• The GA works well, BUT the fitness function is not suitable.
Using GA in our Project: Fitness Function – Different Approach
• Research has compared optimal and sub-optimal solutions.
• The results show that in natural RNAs:
• The energy of the best sub-optimal solution is about 95% of the energy of the optimal solution.
• Usually there are only a few stable sub-optimal solutions.
• The energy of the structure is low, yet it leaves a certain energetic freedom: it is not so low that the structure becomes rigid.
Using GA in our Project: Fitness Function – Different Approach
Building the fitness function:
• Combining the three conditions above, the core fitness function is built to converge toward natural RNA sequences.
• The parameters can be set so that each component has a different importance.
Using GA in our Project: Fitness Function – Different Approach
Based on three components
#1: Number of Structures
• The idea: natural RNA sequences have significantly fewer sub-optimal structures close to the optimal structure than random sequences do.
• Outcome: a sequence receives higher fitness as it converges toward having few structures within this range.
• Comment: usually more than one structure appears.
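Counting the structures close to the optimum can be sketched as below. The input is assumed to be a list of free energies (negative values, as a folding program would report them); the function name and the default window of 10 percent are assumptions made for this illustration.

```python
def count_structures_in_range(energies, percent=10.0):
    """Count structures whose free energy lies within `percent` of the optimum.

    Energies are assumed to be negative, so the optimum is the minimum and the
    cut-off is slightly less negative than it. The optimal structure itself
    is included in the count.
    """
    optimal = min(energies)
    threshold = optimal * (1.0 - percent / 100.0)   # e.g. -20.0 -> -18.0
    return sum(1 for energy in energies if energy <= threshold)

# Hypothetical energies in kcal/mole for four candidate structures.
energies = [-20.0, -19.0, -17.5, -12.0]
print(count_structures_in_range(energies))
```

A sequence with a small count has only a few competing folds near its ground state, which is the property this component rewards.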
Using GA in our Project: Fitness Function – Different Approach
#2: Minimum Energy Structure
• The idea: the ground-state free energies of natural RNA sequences are significantly lower than those of random sequences.
• Implementation: a sequence has higher fitness the lower the energy of its optimal structure.
• Caution: a structure needs to function, so it cannot be too rigid (see the naïve approach). We take this into account and try to weight this component in the right proportion.
Using GA in our Project: Fitness Function – Different Approach
#3: 5 percent ∆
• The idea: statistically, the first sub-optimal solution of a natural RNA has an energy of around 95 percent of the energy of the optimal structure.
• Implementation: a sequence has higher fitness the closer the energy of its first sub-optimal structure is to 95% of the optimal one:
|(95% of optimal energy) - (first sub-optimal energy)| ≈ 0
Using GA in our Project: Fitness Function – Different Approach
Combining the components:
Fitness(RNA) = P_A * (No. of sub-optimal solutions) + P_B * (Minimum energy) + P_C * |(95% of optimal) - (first sub-optimal)|
Each parameter reflects the relative importance of its component in the fitness function.
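The combined formula can be written down directly. The slides do not give the parameter values or their signs, so the weights below are hypothetical; they are chosen negative so that fewer structures, lower (more negative) energy and a smaller deviation from 95% all raise the fitness.

```python
def fitness(n_suboptimal, min_energy, first_suboptimal_energy, p_a, p_b, p_c):
    """Weighted sum of the three components, following the formula as written."""
    delta = abs(0.95 * min_energy - first_suboptimal_energy)
    return p_a * n_suboptimal + p_b * min_energy + p_c * delta

# Hypothetical weights, for illustration only.
P_A, P_B, P_C = -1.0, -1.0, -10.0

# Energies reported for Run #1: optimal -20.0 and first sub-optimal -19.0 kcal/mole.
run_1 = fitness(2, -20.0, -19.0, P_A, P_B, P_C)

# A made-up competitor with more structures and a larger deviation from 95%.
competitor = fitness(6, -20.0, -17.0, P_A, P_B, P_C)
print(run_1, competitor)
```

Changing the relative size of the three weights shifts the search, for example toward or away from very low energies.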
Using GA in our Project: Algorithm Implementation – Code
The project was implemented in C.
In each generation the program uses the mfold package to evaluate, for each sequence:
• The energy of the optimal structure
• The energies of all sub-optimal structures within 10 percent of the optimal
The program then:
• Sets the fitness of each sequence
• Creates the next generation of RNAs
Results
So… does it work?
[Figures: natural RNA (P5abc subdomain) and predicted RNA after 200 generations]
The truth is out there…
Results: Example Runs – Run #1
Run parameters:
• Number of RNAs in the population = 30
• Number of generations = 300
• RNA length (number of nucleotides) = 56
• Elite size = 2
Output sequence: GGCAGGATCGAAGTGCTCGACCTGTAACCCAGGTGTGCGTTGTGCCTAGCTAGGGG
Analyzing the sequence using mfold:
• Structure 1: initial dG = -20.0 kcal/mole
• Structure 2: initial dG = -19.0 kcal/mole
Assessment:
• 2 structures (best)
• 5% difference (best)
• Low-energy structure (average)
Conclusion: the GA has produced a sequence that fits our demands well.
Results Run #1 – Output Sequence Structures
Results: Run #2 – Evidence of Quick Convergence (Local Minima)
Run parameters:
• RNA length (number of nucleotides) = 56
• Number of RNAs in the population = 30
• Elite size = 3 (10%)
First examination: after 15 generations
Output sequence: TTATGTGAGACCGGGGGCATCAGCGAGTTGTGCTCCGACCGGTCTCTAGGGCGCGA
Analyzing the sequence using mfold:
• Structure 1: initial dG = -22.2 kcal/mole
• Structure 2: initial dG = -21.1 kcal/mole
Assessment:
• 2 structures (best)
• 5% difference (best)
• Low-energy structure (average)
Results: Run #2 – Structure After 15 Generations
Results: Run #2 – Same Run
Second examination: after 300 generations
Output sequence: TTATGTGAGGCCGGGGGCACCAGGAAGCTGTGCTTCGACCGGTCTCTAGGGCGCGA
Analyzing the sequence using mfold:
• Structure 1: initial dG = -23.0 kcal/mole
• Structure 2: initial dG = -21.9 kcal/mole
Assessment:
• 2 structures (best)
• 5% difference (best)
• Low-energy structure (better)
Conclusion: a high elite-group percentage may cause quick convergence into a local minimum.
Results Run #2 Structure After 300 Generations Quick Convergence Refinements
Results: Run #3 – Proportions Changed
De-emphasizing low energies: the energy component makes up just 15% of the fitness function.
Run parameters:
• Number of RNAs in the population = 40
• Number of generations = 300
• RNA length (number of nucleotides) = 56
• Elite size = 1
Output sequence: AGGGGAACACACAACAGGACCCCCGCGACCCATACCTTCATTAGTGCTTCCCTTGA
Analyzing the sequence using mfold:
• Structure 1: initial dG = -12.1 kcal/mole
• Structure 2: initial dG = -11.2 kcal/mole
Assessment:
• 2 structures (best)
• 7% difference (average)
• Low-energy structure (fits tRNA)
Conclusion: the GA has produced a sequence that fits well with average tRNA energy values.
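The percentage differences quoted for the runs can be re-derived from the mfold energies listed on the result slides. The helper below is only a convenience for that arithmetic; its name is an assumption.

```python
def percent_gap(optimal, first_suboptimal):
    """Gap between optimal and first sub-optimal energy, as a percentage
    of the optimal energy."""
    return 100.0 * (optimal - first_suboptimal) / optimal

# (optimal dG, first sub-optimal dG) in kcal/mole, as reported for each run.
runs = {
    "Run #1": (-20.0, -19.0),
    "Run #2, after 15 generations": (-22.2, -21.1),
    "Run #2, after 300 generations": (-23.0, -21.9),
    "Run #3": (-12.1, -11.2),
}

gaps = {name: percent_gap(opt, sub) for name, (opt, sub) in runs.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}%")
```

Runs #1 and #2 land close to the 5% target, while Run #3 comes out a little above 7%, consistent with the ratings given on the slides.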
Results Run #3 – Output Comparisons Predicted RNA after 200 Generations Natural RNA – tRNAPHE
Conclusions
• Predicting natural RNAs can be done quite well using evolutionary computation.
• Good results depend on a proven and balanced fitness function.
• When the fitness function has several components, the right relative proportion between them must be set.
Future Look
• Running the GA with different parameter values and analyzing the results
• Changing the heart of the program, the fitness function:
1. Structural changes caused by point mutations
2. RNA database as a key for constructing a new RNA
Our Thanks
Dr. Danny Barash, Nir Dromi, Assaf Avihoo, Adaya Cohen