Predicting Expression Levels Using Codon Usage Bias


  1. Doug Raiford, Lesson 19: Predicting Expression Levels Using Codon Usage Bias

  2. Nice to be able to predict
     • Very expensive experiments already exist that measure expression
     • Predicting from sequence only would be nice
     Expression Prediction with CUB

  3. An example
     • Worked on a project that predicted metabolic efficiency
     • Tendency for organisms to utilize, where possible, less expensive amino acids
     • Tested by looking at expression vs. protein biosynthetic cost
     [Figure: plot with axes "Biosynthetic Cost" and "Protein production rate (expressivity)"]

  4. Early on noted biased usage
     • In highly expressed genes, extremely biased usage of certain codons
     • Leucine codons: CTA, CTC, CTG, CTT, TTA, TTG
     • One of the most highly expressed genes in Escherichia coli K12 has 9 CTG codons and zero of all other codons that code for leucine
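
A minimal sketch of how the leucine tally on this slide could be computed from a coding sequence. The helper names and the toy sequence are made up for illustration; the toy sequence is not the E. coli gene mentioned above.

```python
from collections import Counter

# The six leucine codons listed on the slide.
LEUCINE_CODONS = ("CTA", "CTC", "CTG", "CTT", "TTA", "TTG")


def count_codons(cds):
    """Count in-frame codons of a coding sequence (hypothetical helper)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def leucine_usage(cds):
    """Return the count of every leucine codon, including zeros."""
    counts = count_codons(cds)
    return {codon: counts.get(codon, 0) for codon in LEUCINE_CODONS}


# Toy sequence, invented for this example: ATG CTG CTG AAA CTG TAA
toy_gene = "ATGCTGCTGAAACTGTAA"
print(leucine_usage(toy_gene))
# {'CTA': 0, 'CTC': 0, 'CTG': 3, 'CTT': 0, 'TTA': 0, 'TTG': 0}
```

A gene with the extreme bias described on the slide would show all of its leucine counts in a single entry, as the toy gene does here.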

  5. Why?
     • Translational efficiency
     [Figure: a ribosome on an mRNA at a gc? (ala) codon, building the protein strand from a pool of 17 alanine tRNAs labeled gcu (11), gcc (4), gcg (1), and gca (1)]

  6. Question: how to harness this?
     • How would you use this biased usage to predict expression?
     • Frequency of preferred codons (FOP)
     • Just look at the most highly expressed genes
     • Either experimentally determined or genes known to be highly expressed
     • Calculate usage for all genes
     • Usage is predictive of expressivity
     [Figure: expression of genes 1, 2, 3, … N]
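
A minimal sketch of the FOP idea, assuming one common formulation: the fraction of a gene's codons that are preferred, counted only over amino acids for which a preferred codon has been named. The `PREFERRED` set and the toy genes are hypothetical; in practice the preferred codons would come from the known highly expressed genes.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split a coding sequence into in-frame codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def fop(cds, preferred):
    """Preferred codons / all codons of amino acids that have a preferred codon."""
    scored_amino_acids = {CODON_TABLE[c] for c in preferred}
    relevant = [c for c in split_codons(cds)
                if CODON_TABLE.get(c) in scored_amino_acids]
    if not relevant:
        return 0.0
    return sum(c in preferred for c in relevant) / len(relevant)


# Hypothetical preferred codons (one for Leu, one for Ala) and toy genes.
PREFERRED = {"CTG", "GCT"}
toy_genes = {
    "geneA": "ATGCTGCTGTTAGCTGCCTAA",  # 3 of 5 Leu/Ala codons are preferred
    "geneB": "ATGTTACTAGCCGCATAA",     # 0 of 4 Leu/Ala codons are preferred
}

# Rank genes by FOP as a stand-in for predicted expressivity.
ranking = sorted(toy_genes, key=lambda g: fop(toy_genes[g], PREFERRED), reverse=True)
print(ranking)
```

Genes at the top of `ranking` are the ones predicted to be most highly expressed under this measure.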

  7. Can it be automated?
     • That is, given sequence data only, can we determine probable expression levels?

             1 atgggttggt caatcatctg atttaatggg caaattttta aagatgcaca ttatatcagc
            61 aaaaaatcga acctgttggg tcttgcgcag ggtgccggac ttggcctagt tttgggcctc
           121 aagatgacga tcaaatgacg aaagcttgcc tggtcgaggg ttttttcaac cgtcgattgc
           181 gggagcgggg ttgtgcggcc gtatggcgga aatcgctatt cggttgagct gggacgatgg
           241 caggacgggg agcggtgcgc ttggacacgc aaacttggca ggaacagggg ctcgaaaccc
           301 ggtctccggg acgcacgcgc ggtgaaatca gccaggatga actggcgcac cagtggagcc
           361 gtgttcgcgg ccgacttcag gaagaaatcg gcgaggtcga gtaccgcaac tggttgcggc
           421 aagccgtgct gcatgggctc gacggcgatg aagtgactgt catgctgccg acccgcttcc
           481 tgcgtgactg ggtgaacaag gaatatggca acctgctgac cgcgttctgg caggccgaga
           541 acccggcggt acggcgcgtg gatatccgga cccggccggc cggcaccagc gagcgcgcgc
           601 ccgacctcgc cgaggtggag ccgaagaccg cgatcgcgcg gcccgccgcc gcggcgcgcc
           661 gcgaggccga ggaacgcccg gacatgagcg cgccgctcga cccgcgcttc acctttgata
           721 cattcgtggt cggcaagccg aacgaattcg cctatgcctg cgcgcgccgc gtcgccgacg

  8. First step
     • Look at the data as a matrix
     • Assume that the major force driving variance in codon usage is translational efficiency
     • Then highly expressed genes have high usage of preferred codons and low usage of non-preferred ones, while weakly expressed genes have more balanced usage (or even avoid preferred codons)
     • What does this sound like?

  9. PCA
     • Can find the axis of greatest variance
     • Genes are projected onto this axis
     • Highly expressed genes at one end and weakly expressed at the other
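
A minimal sketch of the projection step, assuming each gene is a row of codon usage frequencies. To stay dependency-free it finds the first principal axis by power iteration rather than a full eigendecomposition; the function names and the toy matrix are invented for illustration. The sign of a principal axis is arbitrary, so which end holds the highly expressed genes has to be decided separately, for example from genes known to be highly expressed.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so the variance is measured around zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_principal_component(rows, iterations=100, seed=0):
    """Return (axis, scores): the axis of greatest variance and each gene's projection."""
    centered = center_columns(rows)
    dim = len(centered[0])
    rng = random.Random(seed)
    # Random start: a uniform start vector can be orthogonal to every row
    # when each row of frequencies sums to one.
    axis = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        # One power-iteration step: multiply by X^T X without forming it.
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        new_axis = [sum(s * row[j] for s, row in zip(scores, centered))
                    for j in range(dim)]
        norm = math.sqrt(sum(a * a for a in new_axis))
        if norm == 0.0:
            break
        axis = [a / norm for a in new_axis]
    scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
    return axis, scores


# Toy matrix: rows are genes, columns are usage of three synonymous codons.
toy_usage = [
    [0.90, 0.05, 0.05],  # strongly biased
    [0.80, 0.10, 0.10],  # strongly biased
    [0.40, 0.30, 0.30],  # balanced
    [0.30, 0.35, 0.35],  # balanced
]
axis, scores = first_principal_component(toy_usage)
print(scores)  # biased genes share one sign, balanced genes the other
```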

  10. Factor analysis…
     • …finding which codons are preferred
     • If a codon’s usage is correlated with location on the PC…
     • That is, if genes at one end exhibit low usage and genes at the other exhibit high usage, it is probably a preferred codon
     [Figure: projection of genes on the first principal component, labeled "Correlated?" and "Probably a preferred codon"]
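
A minimal sketch of the correlation test described on this slide, assuming the principal-component scores have already been oriented so that highly expressed genes have positive values. The 0.7 threshold, the function names, and the toy numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def likely_preferred(usage_by_codon, pc1_scores, threshold=0.7):
    """Codons whose per-gene usage rises with position on the first PC."""
    return sorted(
        codon for codon, usage in usage_by_codon.items()
        if pearson(usage, pc1_scores) >= threshold
    )


# Toy data: four genes, already projected on the first principal component.
pc1_scores = [0.3, 0.2, -0.2, -0.3]
usage_by_codon = {
    "CTG": [0.90, 0.80, 0.40, 0.30],  # usage rises toward the positive end
    "TTA": [0.05, 0.10, 0.30, 0.35],  # usage falls toward the positive end
}
print(likely_preferred(usage_by_codon, pc1_scores))  # ['CTG']
```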

  11. Is this as accurate as we can get?
     • Region in the middle of the axis
     • Really more accurate to look at distance from the cluster

  12. Greedy algorithm
     • SCCI (Carbone et al.)
     • Looks for the most self-consistent set of genes
     [Figure label: "Search for these genes"]

  13. Algorithm
     • Looking for a subset of genes (reference set) that defines a bias to which they themselves adhere more strongly than the rest of the genes
     Steps:
     • Start with all genes as the reference set
     • Loop until the reference set size is 1%
       • Determine which codons are preferred
       • Determine average usage for all genes
       • Sort by adherence
       • Take genes in the top half to be the new reference set
       • Repeat
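
The steps above can be sketched as follows. This is a simplified illustration, not Carbone et al.'s implementation: it assumes CAI-style relative weights (each codon's count divided by the count of the most used synonymous codon in the reference set) and scores adherence as the geometric mean of those weights. The 0.5 pseudocount, the function names, and the toy genome are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_weights(reference_genes):
    """Weight of each codon relative to the most used synonymous codon."""
    counts = Counter(c for gene in reference_genes for c in gene)
    best = Counter()
    for codon, aa in CODON_TABLE.items():
        best[aa] = max(best[aa], counts[codon])
    # Pseudocount of 0.5 (an assumption) keeps unseen codons off zero.
    return {codon: max(counts[codon], 0.5) / max(best[aa], 0.5)
            for codon, aa in CODON_TABLE.items()}


def adherence(gene, weights):
    """Geometric mean of codon weights, ignoring stops and unknown codons."""
    logs = [math.log(weights[c]) for c in gene if CODON_TABLE.get(c, "*") != "*"]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


def greedy_reference_set(genes, fraction=0.01):
    """Halve the reference set until it holds `fraction` of the genes."""
    names = list(genes)
    target = max(1, math.ceil(fraction * len(names)))
    reference = names  # start with all genes
    while len(reference) > target:
        weights = codon_weights([genes[n] for n in reference])
        ranked = sorted(names, key=lambda n: adherence(genes[n], weights),
                        reverse=True)
        reference = ranked[:max(target, len(reference) // 2)]
    return reference


# Toy genome: genes are lists of codons; two genes are strongly biased.
mixed = ["CTA", "CTC", "CTG", "CTT", "TTA", "TTG", "GCA", "GCC", "GCG", "GCT"]
toy_genome = {f"lo{i}": list(mixed) for i in range(1, 7)}
toy_genome["hi1"] = ["CTG"] * 9 + ["GCT"] * 3
toy_genome["hi2"] = ["CTG"] * 8 + ["CTC"] + ["GCT"] * 3

# 25% is used here only because the toy genome has eight genes.
print(greedy_reference_set(toy_genome, fraction=0.25))  # ['hi1', 'hi2']
```

Because every round re-ranks all genes against the current reference set, the search follows whichever bias is strongest in the genome, whatever its cause.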

  14. Do these always work?
     • Do you think all organisms have translational efficiency bias?
     • How would you expect metabolic efficiency trends to look in organisms that do not?
     • Some actually exhibited significant and positive trends
     [Figure: plot with axes "Biosynthetic Cost" and "Protein production rate (expressivity)", trend marked "?"]

  15. What the heck?
     • What could cause a positive trend?
     • Organisms preferentially utilize the most expensive amino acids in the most highly expressed genes?
     • We decided the problem must be in our prediction of expressivity
     • Somehow we got it wrong; in fact, it seems we got it exactly opposite
     [Figure: plot with axes "Biosynthetic Cost" and "Protein production rate (expressivity)", trend marked "?"]

  16. What did the misbehavers look like?
     • Misbehavers were all high and low GC-content organisms
     • But how would this cause a positive trend?
     • Breakthrough came with Nostoc
     • Greedy algorithm was finding high AT-content genes that were on the opposite side of the 2D PCA codon usage space

  17. Search space
     • Algorithm is a search for self-consistent genes
     • What does the search space look like, and why did the algorithm get fooled?
     • Our lab was heavy into genetic algorithms (GAs)
     • Think of all optimization problems in terms of being a search
     • Fitness landscape
     • Carbone’s algorithm found the reference set associated with the dominant bias. What about the next most dominant?

  18. How to visualize?
     • How to arrange solutions along two axes (with fitness in a third)?
     • How to reduce the number of solutions?
     [Figure labels: "Number of possible solutions", "Axes"]

  19. Noticed that…
     • Reference sets tend to be proximal
     • If we choose nearest neighbors, we only have to calculate fitness for each gene
     • We already have a method for viewing gene placement in a 2D space: PCA
     • Elevated regions: highly self-consistent
     • AT-content ridge dominates the search space

  20. How can we use this?
     • How to fix the algorithm?
     • I modified the SCCI algorithm to avoid unbalanced GC-content regions
     • Push down

  21. In the neighborhood…
     • Greedy algorithm gets perfect self-consistency scores
     • Modified algorithm does not
     • Decided to try using a GA to improve
     [Figure: Parent One (g1 g2 g3 g4 g5 … gN) and Parent Two (g1 g2 g3 g4 g5 … gN), with a Mutate step, yield Child (g1 g2 g3 g4 g5 … gN)]
     "We can rebuild him. We have the technology. We have the capability to build the world's first bionic man. Steve Austin will be that man. Better than he was before. Better, stronger, faster."

  22. Multi-objective approach
     • Searched for a set of genes that were both
       • Self-consistent
       • And that identified a bias to which known highly expressed genes (HEGs) strongly adhered
     • Two objectives
       • Ranking of HEGs
       • Self-consistency

  23. Pareto front and fitness
     • For each solution, count the number of solutions that dominate it (are better in both dimensions)
     • The fewer there are, the higher the fitness
     • Solutions on the Pareto front: no other solution is better in both dimensions
     • Solutions on the front are given the highest fitness
     [Figure axes: ranking of HEGs, self-consistency]
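
A minimal sketch of the domination count, using the slide's definition (better in both dimensions), which is slightly stricter than the usual textbook definition of Pareto dominance. It assumes both objectives are scored so that higher is better; the `1 / (1 + count)` fitness mapping and the toy scores are hypothetical.

```python
def dominates(a, b):
    """True if solution a is better than b in both objectives (higher is better)."""
    return a[0] > b[0] and a[1] > b[1]


def domination_counts(solutions):
    """For each solution, count how many other solutions dominate it."""
    return [sum(dominates(other, s) for other in solutions) for s in solutions]


def pareto_fitness(solutions):
    """Fewer dominators means higher fitness; the front (count 0) gets 1.0."""
    return [1.0 / (1 + count) for count in domination_counts(solutions)]


# Toy scores: (ranking of HEGs, self-consistency), both scaled to [0, 1].
toy_solutions = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
print(domination_counts(toy_solutions))  # [0, 0, 0, 1, 4]
print(pareto_fitness(toy_solutions))     # [1.0, 1.0, 1.0, 0.5, 0.2]
```

The first three toy solutions trade one objective against the other without any of them being beaten on both, so together they form the front.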

  24. Turns out…
     • Solutions that identified a bias to which known highly expressed genes strongly adhered were by far the best
     • But the genes in the reference set we identified were not among the most highly expressed… yet the bias it discovered (the codon preferences it identified) yielded much better predictions of actual expressivity
     [Figure: best solutions, plotted by ranking of HEGs and self-consistency]

  25. Why? Not a better set of genes
     • We just found a better set of codon preferences
     • Why not directly search for codon preferences?
     • Reframe the problem
     • Instead of “given a set of known highly expressed genes, determine which codons they seem to prefer and use these preferences to rank the whole genome”
     • We asked “given a set of known highly expressed genes, which set of codon preferences (weights associated with each codon) yields a gene ranking with known highly expressed genes at the top?”
     [Figure labels: "Reframe the problem", "A better set of codons"]

  26. Reframed
     • Given a set of known highly expressed genes, which set of codon preferences (weights associated with each codon) yields a gene ranking with known highly expressed genes at the top?
     [Figure: Parent One (w1 w2 w3 w4 w5 … w59) and Parent Two (w1 w2 w3 w4 w5 … w59), with a Mutate step, yield Child (w1 w2 w3 w4 w5 … w59)]
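
A minimal sketch of a GA over codon weight vectors in the spirit of the parent/mutate/child figure. Everything beyond that figure is an assumption: uniform crossover, Gaussian mutation clamped to [0, 1], truncation selection, a gene score equal to the weighted sum of its codon usage, and a fitness equal to the negative mean rank of the known HEGs. The 59 weights presumably correspond to the codons that have synonymous alternatives (64 minus three stops, Met, and Trp).

```python
import random

N_WEIGHTS = 59  # one weight per codon, as in the slide's w1 ... w59


def crossover(parent_one, parent_two, rng):
    """Uniform crossover: each weight is copied from one parent at random."""
    return [rng.choice(pair) for pair in zip(parent_one, parent_two)]


def mutate(weights, rng, rate=0.05, scale=0.1):
    """Perturb some weights with Gaussian noise, clamped to [0, 1]."""
    return [
        min(1.0, max(0.0, w + rng.gauss(0.0, scale))) if rng.random() < rate else w
        for w in weights
    ]


def fitness(weights, codon_usage, known_hegs):
    """Negative mean rank of the known HEGs (rank 0 is the top); higher is better."""
    def score(name):
        return sum(u * w for u, w in zip(codon_usage[name], weights))
    ranked = sorted(codon_usage, key=score, reverse=True)
    ranks = [ranked.index(name) for name in known_hegs]
    return -sum(ranks) / len(ranks)


def evolve(codon_usage, known_hegs, n_weights=N_WEIGHTS,
           generations=30, pop_size=20, seed=0):
    """Evolve weight vectors that push the known HEGs toward the top."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_weights)]
                  for _ in range(pop_size)]

    def fit(weights):
        return fitness(weights, codon_usage, known_hegs)

    for _ in range(generations):
        population.sort(key=fit, reverse=True)
        parents = population[:pop_size // 2]  # truncation selection keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            one, two = rng.sample(parents, 2)
            children.append(mutate(crossover(one, two, rng), rng))
        population = parents + children
    return max(population, key=fit)


# Toy genome with three "codons" per gene; gene "a" is the known HEG.
toy_usage = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}
best = evolve(toy_usage, ["a"], n_weights=3, generations=5, pop_size=6)
print(best, fitness(best, toy_usage, ["a"]))
```

The search here is over the weights themselves, so the result is a set of codon preferences rather than a reference set of genes.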
