Sequence Optimization For Synthetic Genes Using Genetic Algorithms
David Sigfredo Angulo¹, Rob Vogelbacher¹, Benjamin R. Capraro², Tobin Sosnick², Shohei Koide²
¹ School of Computer Science, Telecommunications and Information Systems, DePaul University
² Department of Biochemistry and Molecular Biology, The University of Chicago
Introduction • Genetic Algorithms: • Borrow ideas from the biology of genes • Build software that uses these stochastic methods to search through large search spaces • The resulting algorithm usually has nothing to do with genes • Designing Genes • This search space is huge • REALLY NOVEL IDEA: • Use Genetic Algorithms based on genes to design genes!!
Outline • Short biology Tutorial • DNA Sequence Generation • Why is the problem difficult? • IBG Gene Designer • Genetic Algorithm (GA) solution • Heuristics and Fitness Evaluation
First • Before the problem can be described • Must give some background biochemistry principles • Tutorial outline • DNA • Codons • Protein • Synthetic genes • What are they and what are they used for? • Restriction Enzymes • Expressing Proteins using Vectors
Transcription/Translation • Central Dogma of Molecular Biology • DNA → RNA: transcription (RNA polymerase) • RNA → Protein: translation (ribosomes)
DNA • Deoxyribonucleic acid • Each strand is made of nucleotides joined together • Strand backbone is made of sugar and phosphate molecules • Strands are connected by nitrogen-containing nucleotide bases • Two strands join, making a double helix
[Figure: levels of DNA packing • Short region of DNA double helix (2 nm) • "Beads on a string" form of chromatin (11 nm) • 30 nm chromatin fiber of packed nucleosomes (30 nm) • Section of chromosome in an extended form (300 nm) • Condensed section of chromosome (700 nm) • Entire mitotic chromosome (1100 nm)]
DNA • Four nucleotides: A, G, T, C
Short Biology Tutorial • Tutorial outline • DNA • Codons • Protein • Restriction Enzymes • Expressing Proteins using Vectors
DNA Sequence Generation: Codon to Amino Acid Translation • [Figure: codon table] http://campus.queens.edu/faculty/jannr/Genetics/images/codon.jpg
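The codon table on this slide can be written down in a few lines. The sketch below is not part of the original deck; it assumes the standard genetic code (the usual 64-letter amino acid string indexed by bases in T, C, A, G order), and the names `CODON_TABLE` and `translate` are made up for illustration.

```python
from itertools import product

# Standard genetic code, written as one letter per codon with the
# three bases enumerated in the order T, C, A, G. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGAATTCTAA"))  # ATG GAA TTC TAA
```

Reading the DNA directly (T instead of U) skips the RNA step, which is enough for looking up which amino acid a codon stands for.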
Short Biology Tutorial • Tutorial outline • DNA • Codons • Protein • Restriction Enzymes • Expressing Proteins using Vectors
Proteins • Amino acid chains fold into complex 3D structures • Functional properties depend on 3D structure • Usefulness depends on functional properties • E.g. designing drugs
Designed/Expressed Proteins Extremely Useful • Designed Proteins • Can be used to study protein structure • Can be used to study effects of other proteins • Can be designed to “knock out” other proteins • Can be designed to “block” the action of other proteins • Expressed proteins • Expressed in cow’s milk or chicken eggs • Can manufacture drugs on large scales in this way • E.g. insulin
Synthetic Genes • DNA sequences • “Backtranslated” from a novel protein or amino acid sequence • [Diagram: DNA → RNA (transcription, RNA polymerase) → Protein (translation, ribosomes)] • We’ll put the DNA for our designed protein into an organism (a vector) • Then that vector will make (express) our protein • But how do we get the DNA into an organism?
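As a rough illustration of "backtranslation", the sketch below picks one synonymous codon at random for each amino acid. It is only a naive starting point under the standard genetic code, not the GeneDesigner method; `SYNONYMS` and `back_translate` are hypothetical names.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of codons that encode it.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def back_translate(protein, rng=None):
    """Return one of the many DNA sequences that encode `protein`."""
    rng = rng or random.Random()
    return "".join(rng.choice(SYNONYMS[aa]) for aa in protein.upper())


dna = back_translate("MEF")
print(dna)  # starts with ATG; the remaining codons vary from run to run
```

Every call can return a different sequence for the same protein, which is exactly the freedom the optimizer later exploits.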
Short Biology Tutorial • Tutorial outline • DNA • Codons • Protein • Restriction Enzymes • Expressing Proteins using Vectors
Restriction Enzyme Digests • Watson and Crick, 1953 • It took about 20 more years before anything could be done with DNA • H. Smith (and others) made a discovery that allowed manipulation and deciphering of DNA • The discovery was that bacteria produce enzymes that introduce breaks in double-stranded DNA molecules whenever they encounter a specific string of nucleotides • These enzymes are called Restriction Enzymes • Restriction Enzymes can be used as precise scissors • They let biologists cut (and paste) portions of DNA
EcoRI • Recognition site: 5'-GAATTC-3' / 3'-CTTAAG-5' • After digestion by EcoRI: 5'-G AATTC-3' / 3'-CTTAA G-5' • EcoRI was one of the first Restriction Enzymes discovered • "Eco" because it was isolated from E. coli (Escherichia coli) • "R" because it came from strain RY13 • "I" because it was the first Restriction Enzyme identified in that strain • Now over 300 Restriction Enzymes known • EcoRI cleaves (restricts, digests) DNA • Between the G and A nucleotides • Only when it encounters them in the string 5'-GAATTC-3' • This string is called the restriction site
Sticky Ends • Recognition site: 5'-GAATTC-3' / 3'-CTTAAG-5' • After digestion by EcoRI: 5'-G AATTC-3' / 3'-CTTAA G-5' • Many restriction enzymes cut in such a way that some single-stranded DNA is left at both ends • These nucleotide sequences • Are complementary to each other • Are 5'-AATT-3' in the case of EcoRI • Can base pair with a complementary overhang on another fragment • Thus, are called "sticky ends" • Can temporarily hold two DNA strands together • The enzyme ligase will permanently join those strands • This is called ligation
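The cut described above can be mimicked on a string. This is a minimal sketch, not lab software: it models only the top strand, and `reverse_complement` and `digest` are illustrative names. The site and the cut position (between G and A) are the ones given on the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def digest(seq, site="GAATTC", cut_offset=1):
    """Split the top strand wherever `site` occurs.

    `cut_offset` is the number of bases of the site left on the upstream
    fragment; 1 means the cut falls between G and AATTC, as for EcoRI.
    """
    seq = seq.upper()
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


# The site reads the same on both strands, so both strands are cut
# and each fragment is left with a 5'-AATT-3' overhang.
print(reverse_complement("GAATTC"))
print(digest("AAGAATTCTT"))
```

Because the site equals its own reverse complement, the same function applied to the bottom strand cuts at the mirrored position, which is what leaves the matching overhangs.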
Short Biology Tutorial • Tutorial outline • DNA • Codons • Protein • Restriction Enzymes • Expressing Proteins using Vectors
Gene Synthesis: On the Lab Bench • Initial Sequence Construction • Oligonucleotides (short strands of DNA) are designed with complementary overlapping regions • These act as the “sticky ends” • Assembly PCR • Oligonucleotides and polymerase are mixed and placed in a thermocycler • Creates a contiguous DNA sequence from the component oligos
Gene Synthesis: On the Lab Bench (cont) • After PCR, the generated DNA sequence is cut with restriction enzymes • The expression host's plasmid is cut with restriction enzymes • The synthetic gene is inserted into the plasmid and the plasmid is repaired • Expression Vectors • Host organisms used to express the synthetic genes (make the protein) • Typically E. coli • Possibly chickens or cows • The expression vector can now express the protein coded for by the synthetic gene • A bit more complicated than described above!!!
Outline • Short biology Tutorial • DNA Sequence Generation • Why is the problem difficult? • IBG Gene Designer • Genetic Algorithm (GA) solution • Heuristics and Fitness Evaluation
DNA Sequence Generation: The Computational Problem • Why is the problem difficult? • Conflicting goals • Avoiding restriction sites • Maximizing codon preference • Thus, no simple deterministic algorithm satisfies every goal at once • Degeneracy (redundancy) of the DNA code: 64 codons, 20 (21) amino acids (see next slide) • Several synonymous codons are translated into the same amino acid • Synonymous codons per AA vary from one to six (about three on average: 61 sense codons for 20 amino acids) • Huge number of possible DNA sequences • On average about 2^N for a protein of N amino acids • Codon Preference • Varying levels of tRNA assembly components in organisms • Codon usage for a particular AA greatly influences protein expression • (continued)
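To get a feel for the size of the search space, the sketch below multiplies the number of synonymous codons for each residue of a protein. It assumes the standard genetic code; `DEGENERACY` and `count_encodings` are illustrative names, not part of the original tool.

```python
from collections import Counter
from math import prod

# One letter per codon of the standard genetic code ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of codons that encode each amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in AMINO_ACIDS if aa != "*")


def count_encodings(protein):
    """Number of distinct DNA sequences that encode `protein`.

    Letters that are not one of the 20 standard amino acids count as 0.
    """
    return prod(DEGENERACY[aa] for aa in protein.upper())


print(DEGENERACY["M"], DEGENERACY["L"])  # one codon vs. six codons
print(count_encodings("MLW"))            # only L contributes choices
print(count_encodings("LSR" * 10))       # 30 residues, already astronomically many
```

Even a short peptide made of six-codon residues has more encodings than could ever be enumerated, which is why the deck turns to a stochastic search.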
DNA Sequence Generation: Codon to Amino Acid Translation • [Figure: codon table] http://campus.queens.edu/faculty/jannr/Genetics/images/codon.jpg
DNA Sequence Generation: The Computational Problem (cont) • Why is the problem difficult? • (continued) • Restriction Enzymes • The vector will contain many restriction enzymes • If these cut up our DNA, we won’t express our proteins • We must design the DNA string using synonymous codons so that it contains none of their restriction sites • It is helpful to include some other restriction sites • We must design the DNA string using synonymous codons so that these are included • (continued)
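One way to picture "designing around" a restriction site is a greedy synonymous swap, sketched below. This is a simplified stand-in for illustration, not the GeneDesigner algorithm: it assumes the standard genetic code and a sequence whose length is a multiple of three, and `remove_sites` is a hypothetical helper.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def _swap_one(dna, site):
    """Try one synonymous codon swap that lowers the number of `site` hits."""
    pos = dna.find(site)
    first = pos // 3                    # first codon touched by the site
    last = (pos + len(site) - 1) // 3   # last codon touched by the site
    for idx in range(first, last + 1):
        codon = dna[3 * idx:3 * idx + 3]
        for alt in SYNONYMS[CODON_TABLE[codon]]:
            if alt == codon:
                continue
            candidate = dna[:3 * idx] + alt + dna[3 * idx + 3:]
            if candidate.count(site) < dna.count(site):
                return candidate
    return None


def remove_sites(dna, site):
    """Remove every occurrence of `site` without changing the protein.

    Returns None if some occurrence cannot be removed by a single swap.
    """
    dna = dna.upper()
    while site in dna:
        dna = _swap_one(dna, site)
        if dna is None:
            return None
    return dna


print(remove_sites("ATGGAATTCTAA", "GAATTC"))  # GAA (Glu) is replaced by GAG (Glu)
```

A greedy fix like this ignores the other goals (codon preference, sites that must be kept), which is the conflict the slides describe.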
DNA Sequence Generation: The Computational Problem (cont) • Why is the problem difficult? • (continued) • mRNA Secondary Structure • In prokaryotes, mRNA can fold into complex shapes • This inhibits protein creation • Oligonucleotide generation • Want a specific melting temperature so that the complex folding doesn’t take place • The “sticky ends” must have the same melting temperature so that they will bind together
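The slides do not say how melting temperatures are computed. As a stand-in, the sketch below uses the Wallace rule, a common rough estimate for short oligonucleotides (2 °C per A or T, 4 °C per G or C); `wallace_tm` and `tm_spread` are illustrative names.

```python
def wallace_tm(oligo):
    """Rough melting temperature in degrees C by the Wallace rule."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


def tm_spread(overlaps):
    """Difference between the highest and lowest Tm among overlap regions."""
    tms = [wallace_tm(o) for o in overlaps]
    return max(tms) - min(tms)


# Two made-up overlap regions: the GC-rich one melts at a higher temperature.
print(wallace_tm("GAATTC"), wallace_tm("GGGCCC"))
print(tm_spread(["GAATTC", "GGGCCC"]))
```

A design tool would try to keep the spread small so that all overlaps anneal under the same thermocycler conditions; more accurate models (for example nearest-neighbor thermodynamics) exist but are beyond this sketch.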
Outline • Short biology Tutorial • DNA Sequence Generation • Why is the problem difficult? • IBG Gene Designer • Genetic Algorithm (GA) solution • Heuristics and Fitness Evaluation
IBG GeneDesigner: Our Solution • IBG GeneDesigner
IBG GeneDesigner: Genetic Algorithm • Uses a Genetic Algorithm for sequence optimization • Tournament selection model • Uniform and single-point crossover (behind the scenes; not user-selectable at present) • Mutation causes codon “wobbling” • Sequence “fitness” determined by heuristic evaluation
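The operators named on this slide can be sketched generically. The code below is not the GeneDesigner source: it assumes an individual is a list of codons, that "wobbling" means replacing a codon with a random synonym, and that a higher fitness value is better. All function names are made up for illustration.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def tournament_select(population, fitness, k, rng):
    """Pick k individuals at random and return the fittest of them."""
    return max(rng.sample(population, k), key=fitness)


def single_point_crossover(a, b, rng):
    """Child takes codons from `a` up to a random point, then from `b`.

    Needs at least two codons per parent.
    """
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def uniform_crossover(a, b, rng):
    """Child takes each codon from either parent with equal probability."""
    return [rng.choice(pair) for pair in zip(a, b)]


def wobble_mutate(codons, rate, rng):
    """Replace each codon, with probability `rate`, by a random synonym."""
    return [
        rng.choice(SYNONYMS[CODON_TABLE[c]]) if rng.random() < rate else c
        for c in codons
    ]


rng = random.Random()
mother = ["ATG", "GAA", "TTC"]   # Met-Glu-Phe
father = ["ATG", "GAG", "TTT"]   # Met-Glu-Phe, different codons
child = wobble_mutate(uniform_crossover(mother, father, rng), 0.1, rng)
print("".join(child))
```

Because both parents encode the same protein and every operator works on whole codons, each child still encodes that protein; only the DNA spelling changes.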
IBG GeneDesigner: Fitness Evaluation • GeneDesigner heuristics • Manipulation of nucleotide percentages/ratios to reduce mRNA secondary structure formation • Inclusion and exclusion of restriction sites • Restriction sites requested for inclusion should occur only once • Matching of codon preference • Oligonucleotide generation • Fitness determined by melting points, start and end nucleotide
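A weighted penalty is one plausible way to combine heuristics like these into a single fitness value. The sketch below covers only three of them (G/C fraction, excluded sites, sites that should occur exactly once); the weights, the default target of 0.40 and the function names are assumptions, and the actual GeneDesigner scoring is not described in the slides.

```python
def gc_fraction(dna):
    """Fraction of bases that are G or C."""
    return (dna.count("G") + dna.count("C")) / len(dna)


def count_sites(dna, site):
    """Count occurrences of `site`, including overlapping ones."""
    return sum(
        1 for i in range(len(dna) - len(site) + 1) if dna.startswith(site, i)
    )


def fitness(dna, exclude, include, gc_target=0.40, weights=(1.0, 1.0, 1.0)):
    """Higher is better; 0 means every modeled heuristic is satisfied."""
    dna = dna.upper()
    w_gc, w_exclude, w_include = weights
    gc_penalty = abs(gc_fraction(dna) - gc_target)
    # Every hit of an unwanted site costs one unit.
    exclude_penalty = sum(count_sites(dna, s) for s in exclude)
    # A requested site should occur exactly once: zero or two hits both cost.
    include_penalty = sum(abs(count_sites(dna, s) - 1) for s in include)
    return -(w_gc * gc_penalty + w_exclude * exclude_penalty
             + w_include * include_penalty)


with_site = "ATGGAATTCTAA"
without_site = "ATGGAGTTCTAA"
print(fitness(with_site, exclude=["GAATTC"], include=[]))
print(fitness(without_site, exclude=["GAATTC"], include=[]))
```

Collapsing several goals into one number hides trade-offs between them, which is presumably why the future-work slides mention multi-objective optimization.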
IBG GeneDesigner: Future Work • Algorithm parameters • Systematically manipulate GA parameters to identify default values for sequence optimization • Population size • Number of generations • Mutation rate • Convergence criteria • Modify heuristic weighting scheme • Selection models • Experiment with alternative selection models (roulette wheel, elitism, limited population replacement)
IBG GeneDesigner: Future Work (cont) • Move algorithm to ECJ architecture • Use the Strength-Pareto multi-objective optimization algorithm • Create web-based version of application • Explore island model effects on optimization
Results • IBG GeneDesigner was used to generate a nucleotide sequence for the SH3 domain of α-spectrin¹ • The codon optimization option was set for expression in E. coli with a 40% G/C bias • We also used the application to generate four assembly PCR template oligonucleotide sequences to produce the protein coding sequence flanked by the desired restriction enzyme recognition sites • The calculated Tm values of the three overlapping regions were within 1.6 °C of each other, promoting similar annealing behavior between strands • Success of the reaction was confirmed by DNA sequencing of a pUC19 expression vector containing the PCR product cloned between restriction sites included in the gene design • Summary: Protein Made!!!
Acknowledgements • Graduate student who did much of the coding • Rob Vogelbacher • University of Chicago undergraduate who used it to build a protein • Benjamin R. Capraro • His advisor • Tobin Sosnick • Our collaborator at the University of Chicago • Shohei Koide