Predicting Expression Levels Using Codon Usage Bias

Doug Raiford, Lesson 19: Predicting Expression Levels Using Codon Usage Bias

Nice to be able to predict
• Very expensive experiments already measure expression
• Predicting from sequence alone would be nice
Expression Prediction with CUB

An example
• Worked on a project that predicted metabolic efficiency
• Metabolic efficiency: the tendency for organisms to use, where possible, less expensive amino acids
• Tested by looking at expression vs. protein biosynthetic cost
[Figure: plot with axes Biosynthetic Cost and Protein production rate (expressivity)]

Early on noted biased usage
• Leucine codons: CTA, CTC, CTG, CTT, TTA, TTG
• One of the most highly expressed genes in Escherichia coli K12 has 9 CTG codons and none of the other codons that code for leucine
• Highly expressed genes show extremely biased usage of certain codons

Why?
• Translational efficiency
[Figure: a ribosome on an mRNA of alanine codons (gcu, gcc, gcg, gca; gc? = ala), the growing protein strand of ala residues, and the tRNAs]

Question: how to harness this?
• How would you use this biased usage to predict expression?
• Frequency of preferred codons (FOP)
• Just look at the most highly expressed genes
• Either experimentally determined or genes known to be highly expressed
• Calculate usage for all genes
• Usage is predictive of expressivity
[Figure: expression of genes 1, 2, 3, … N]
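To make the FOP idea concrete, here is a minimal sketch in Python. It assumes the preferred codons are already known (for example, read off a handful of known highly expressed genes). The function names `codons` and `fop` and the toy leucine data are illustrative, not taken from the lecture.

```python
from collections import Counter

def codons(seq):
    """Split a coding sequence into complete, uppercase codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def fop(seq, preferred, synonymous):
    """Frequency of preferred codons.

    preferred:  set of codons assumed to be preferred
    synonymous: set of all codons that are scored (preferred or not)
    """
    counts = Counter(codons(seq))
    scored = sum(counts[c] for c in synonymous)
    if scored == 0:
        return 0.0
    return sum(counts[c] for c in preferred) / scored

# Hypothetical toy data: leucine only, with CTG assumed to be preferred.
LEU = {"CTA", "CTC", "CTG", "CTT", "TTA", "TTG"}
biased = "CTG" * 9               # like the E. coli gene described above
balanced = "CTACTCCTGCTTTTATTG"  # one of each leucine codon
print(fop(biased, {"CTG"}, LEU), fop(balanced, {"CTG"}, LEU))
```

A gene that uses only the preferred codon scores 1.0, and a gene that uses all six leucine codons equally scores 1/6.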

Can it be automated?
• That is, given sequence data only, can we determine probable expression levels?
  1 atgggttggt caatcatctg atttaatggg caaattttta aagatgcaca ttatatcagc
 61 aaaaaatcga acctgttggg tcttgcgcag ggtgccggac ttggcctagt tttgggcctc
121 aagatgacga tcaaatgacg aaagcttgcc tggtcgaggg ttttttcaac cgtcgattgc
181 gggagcgggg ttgtgcggcc gtatggcgga aatcgctatt cggttgagct gggacgatgg
241 caggacgggg agcggtgcgc ttggacacgc aaacttggca ggaacagggg ctcgaaaccc
301 ggtctccggg acgcacgcgc ggtgaaatca gccaggatga actggcgcac cagtggagcc
361 gtgttcgcgg ccgacttcag gaagaaatcg gcgaggtcga gtaccgcaac tggttgcggc
421 aagccgtgct gcatgggctc gacggcgatg aagtgactgt catgctgccg acccgcttcc
481 tgcgtgactg ggtgaacaag gaatatggca acctgctgac cgcgttctgg caggccgaga
541 acccggcggt acggcgcgtg gatatccgga cccggccggc cggcaccagc gagcgcgcgc
601 ccgacctcgc cgaggtggag ccgaagaccg cgatcgcgcg gcccgccgcc gcggcgcgcc
661 gcgaggccga ggaacgcccg gacatgagcg cgccgctcga cccgcgcttc acctttgata
721 cattcgtggt cggcaagccg aacgaattcg cctatgcctg cgcgcgccgc gtcgccgacg

First step
• Look at the data as a matrix (genes by codon usage)
• Assume that the major force driving variance in codon usage is translational efficiency
• Then highly expressed genes have high usage of preferred codons and low usage of non-preferred ones, while weakly expressed genes have more balanced usage (or even avoid preferred codons)
• What does this sound like?
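One way to build such a matrix is sketched below in Python. The slides do not say how usage is normalized. This sketch assumes each codon's count is divided by the total for its amino acid, and it keeps the 59 codons that have at least one synonym. The names `usage_row` and `usage_matrix` are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

FAMILIES = {}
for codon, aa in CODE.items():
    FAMILIES.setdefault(aa, []).append(codon)

# The 59 codons with at least one synonym (stops, Met and Trp are left out).
CODONS = sorted(c for c, aa in CODE.items()
                if aa != "*" and len(FAMILIES[aa]) > 1)

def usage_row(seq):
    """Usage of each codon relative to its synonymous family (assumed scheme)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    row = []
    for codon in CODONS:
        family_total = sum(counts[s] for s in FAMILIES[CODE[codon]])
        row.append(counts[codon] / family_total if family_total else 0.0)
    return row

def usage_matrix(genes):
    """genes: dict of name -> coding sequence. Returns (names, rows)."""
    names = sorted(genes)
    return names, [usage_row(genes[n]) for n in names]
```

Each row is one gene and each column one codon, so a gene made only of CTG leucine codons has 1.0 in the CTG column and 0.0 in the other five leucine columns.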

PCA
• Can find the axis of greatest variance
• Genes are projected onto this axis
• Highly expressed genes at one end and weakly expressed genes at the other
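A small sketch of the projection step, in plain Python so it has no dependencies. It finds the first principal axis by power iteration on the centered data, which is a stand-in for whatever PCA routine was actually used. The name `first_principal_axis` and the toy rows are assumptions for illustration.

```python
import math
import random

def first_principal_axis(rows, iterations=200, seed=0):
    """Return (axis, scores): the axis of greatest variance and the
    projection of every row (gene) onto it."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]

    def project(axis):
        return [sum(x * a for x, a in zip(row, axis)) for row in centered]

    rng = random.Random(seed)
    axis = [rng.random() - 0.5 for _ in range(m)]  # arbitrary starting direction
    for _ in range(iterations):
        scores = project(axis)
        # Multiply by X^T X without forming the covariance matrix.
        new = [sum(s * row[j] for s, row in zip(scores, centered))
               for j in range(m)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:  # no variance at all
            break
        axis = [v / norm for v in new]
    return axis, project(axis)

# Toy usage of two synonymous codons in four hypothetical genes.
rows = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8]]
axis, scores = first_principal_axis(rows)
print(scores)
```

The sign of the axis is arbitrary, so which end holds the highly expressed genes has to be settled separately, for example by checking where known highly expressed genes fall.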

Factor analysis…
• …finding which codons are preferred
• If a codon's usage is correlated with location on the PC…
• That is, if genes at one end exhibit low usage and genes at the other exhibit high usage
• …it is probably a preferred codon
[Figure: codon usage against the projection of genes on the first principal component. Correlated?]
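The correlation test can be sketched as follows. This assumes a plain Pearson correlation and a hypothetical cutoff of 0.7. Neither is stated in the slides, and `candidate_preferred` is an illustrative name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def candidate_preferred(usage, pc1_scores, codon_names, threshold=0.7):
    """Codons whose usage tracks position on the first principal component.

    usage:      rows are genes, columns are codons
    pc1_scores: projection of each gene on the first principal component
    threshold:  hypothetical cutoff on |r|
    """
    result = {}
    for j, name in enumerate(codon_names):
        r = pearson([row[j] for row in usage], pc1_scores)
        if abs(r) >= threshold:
            result[name] = r
    return result

# Toy data: CTG usage rises along the axis, GCT usage does not.
usage = [[0.9, 0.5], [0.6, 0.4], [0.3, 0.6], [0.1, 0.5]]
pc1 = [2.0, 1.0, -1.0, -2.0]
print(candidate_preferred(usage, pc1, ["CTG", "GCT"]))
```

A strongly correlated codon is only a candidate: the sign of r says which end of the axis favors it, not which end is highly expressed.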

Is this as accurate as we can get?
• Region in the middle
• Really more accurate to look at distance from the cluster

Greedy algorithm
• SCCI (Carbone et al.)
• Looks for the most self-consistent set of genes
[Figure: search for these genes]

Algorithm
• Looking for a subset of genes (the reference set) that defines a bias to which its own members adhere more strongly than the rest of the genes
• Start with all genes as the reference set
• Loop until the reference set is 1% of the genes:
  • Determine which codons are preferred
  • Determine average usage for all genes
  • Sort by adherence
  • Take the genes in the top half to be the new reference set
  • Repeat
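The loop above can be sketched in Python as follows. This is not Carbone et al.'s implementation. It assumes adherence is scored with a CAI-style geometric mean of relative-adaptiveness weights and that unused codons get a 0.5 pseudo-count. All function names are illustrative.

```python
import math
from collections import Counter

def weights(ref_counts, families):
    """Weight of each codon: its count in the reference set relative to the
    most used synonym. The 0.5 floor (an assumption) keeps logs finite."""
    w = {}
    for family in families:
        top = max(max(ref_counts[c] for c in family), 0.5)
        for c in family:
            w[c] = max(ref_counts[c], 0.5) / top
    return w

def adherence(counts, w):
    """Geometric mean of the weights of a gene's scored codons."""
    n = sum(counts[c] for c in w)
    if n == 0:
        return 0.0
    return math.exp(sum(counts[c] * math.log(w[c]) for c in w) / n)

def greedy_reference_set(genes, families, fraction=0.01):
    """genes: dict of name -> Counter of codon counts."""
    target = max(1, math.ceil(len(genes) * fraction))
    ref = list(genes)                      # start with all genes
    while len(ref) > target:
        pooled = Counter()
        for g in ref:                      # preferences of the current set
            pooled.update(genes[g])
        w = weights(pooled, families)
        ranked = sorted(genes, key=lambda g: adherence(genes[g], w),
                        reverse=True)      # sort all genes by adherence
        ref = ranked[:max(target, len(ref) // 2)]  # keep the top half
    return ref

# Hypothetical toy genome: two strongly biased genes, six nearly balanced.
toy = {"h1": Counter(CTG=18, CTA=2), "h2": Counter(CTG=17, CTA=3)}
toy.update({f"m{i}": Counter(CTG=11, CTA=9) for i in range(1, 7)})
print(greedy_reference_set(toy, [["CTG", "CTA"]], fraction=0.25))
```

On the toy data the two strongly biased genes end up as the reference set. The search simply follows whatever bias dominates the pooled counts.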

Do these always work?
• Do you think all organisms have translational efficiency bias?
• How would you expect metabolic efficiency trends to look in organisms that do not have it?
• Some actually exhibited significant and positive trends
[Figure: plot with axes Biosynthetic Cost and Protein production rate (expressivity), trend marked "?"]

What the heck?
• What could cause a positive trend?
• Organisms preferentially utilize the most expensive amino acids in the most highly expressed genes?
• We decided the problem must be in our prediction of expressivity
• Somehow we got it wrong; in fact, it seems we got it exactly opposite
[Figure: plot with axes Biosynthetic Cost and Protein production rate (expressivity), trend marked "?"]

What did the misbehavers look like?
• Misbehavers were all high and low GC-content organisms
• But how would this cause a positive trend?
• Breakthrough came with Nostoc
• The greedy algorithm was finding high AT-content genes that were on the opposite side of the PCA 2D codon usage space
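Checking whether a genome or a reference set is skewed in this way only needs a GC-content calculation. The sketch below uses helper names of our own choosing. The third-position variant (GC3) is included because synonymous codons mostly differ at the third position.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0

def gc3_content(seq):
    """GC fraction at third codon positions of a coding sequence."""
    return gc_content(seq[2::3])

print(gc_content("ATGC"), gc3_content("CTGCTA"))
```

AT-content is simply one minus the GC-content.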

Search space
• The algorithm is a search for self-consistent genes
• What does the search space look like, and why did the algorithm get fooled?
• Our lab was heavy into GAs (genetic algorithms)
• We think of all optimization problems in terms of being a search
• Fitness landscape
• Carbone's algorithm found the reference set associated with the dominant bias. What about the next most dominant?

How to visualize?
• How to arrange solutions along two axes (with fitness in a third)?
• How to reduce the number of solutions?
[Figure labels: number of possible solutions; axes]

Noticed that…
• Reference sets tend to be proximal
• If we choose nearest neighbors, we only have to calculate fitness once for each gene
• We already have a method for viewing gene placement in a 2D space: PCA
• Elevated regions: highly self-consistent
[Figure: AT-content ridge dominates the search space]

How can we use this?
• How to fix the algorithm?
• I modified the SCCI algorithm to avoid unbalanced GC-content regions
• Push these regions down

In the neighborhood…
• The greedy algorithm gets perfect self-consistency scores
• The modified algorithm does not
• Decided to try using a GA to improve it
[Figure: Parent One (g1 g2 g3 g4 g5 … gN) and Parent Two (g1 g2 g3 g4 g5 … gN) are combined and mutated to give a Child (g1 g2 g3 g4 g5 … gN)]
"We can rebuild him. We have the technology. We have the capability to build the world's first bionic man. Steve Austin will be that man. Better than he was before. Better, stronger, faster."

Multi-objective approach
• Searched for a set of genes that was both
  • Self-consistent
  • And that identified a bias to which known highly expressed genes (HEGs) strongly adhered
• Two objectives
  • Ranking of HEGs
  • Self-consistency

Pareto front and fitness
• For each solution, count the number of solutions that dominate it (better in both dimensions)
• The fewer there are, the higher the fitness
• Solutions on the Pareto front: no other solution is better in both dimensions
• Solutions on the front are given the highest fitness
[Figure: solutions plotted by ranking of HEGs and self-consistency]
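The domination count described above can be sketched in a few lines of Python. The sketch assumes both objectives are expressed so that higher is better, and it uses the slide's strict definition (better in both dimensions). The more common textbook definition only requires no worse in all objectives and better in at least one. The tuples are made-up scores.

```python
def dominates(a, b):
    """True if solution a is better than b in both objectives."""
    return a[0] > b[0] and a[1] > b[1]

def domination_counts(solutions):
    """For each solution, how many other solutions dominate it."""
    return [sum(dominates(other, s) for other in solutions) for s in solutions]

def pareto_front(solutions):
    """Solutions that no other solution dominates."""
    counts = domination_counts(solutions)
    return [s for s, n in zip(solutions, counts) if n == 0]

# Hypothetical (HEG ranking score, self-consistency score) pairs.
solutions = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
print(domination_counts(solutions))
print(pareto_front(solutions))
```

Fitness can then be any decreasing function of the count, so that the front (count 0) scores highest.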

Turns out…
• The solutions that identified a bias to which known highly expressed genes strongly adhered were by far the best
• But the genes in the reference set we identified were not among the most highly expressed… yet the bias it discovered (the codon preferences it identified) yielded much better predictions of actual expressivity
[Figure: best solutions, plotted by ranking of HEGs and self-consistency]

Why?
• Not a better set of genes
• We just found a better set of codon preferences
• Why not directly search for codon preferences?
• Reframe the problem
• Instead of "given a set of known highly expressed genes, determine which codons they seem to prefer and use these preferences to rank the whole genome"
• We asked "given a set of known highly expressed genes, which set of codon preferences (weights associated with each codon) yields a gene ranking with known highly expressed genes at the top?"

Reframed
• Given a set of known highly expressed genes, which set of codon preferences (weights associated with each codon) yields a gene ranking with known highly expressed genes at the top?
[Figure: Parent One (w1 w2 w3 w4 w5 … w59) and Parent Two (w1 w2 w3 w4 w5 … w59) are combined and mutated to give a Child (w1 w2 w3 w4 w5 … w59)]
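A minimal sketch of the reframed search, assuming a very simple genetic algorithm: uniform crossover, small random mutations, and survival of the top half. The scoring rule (a weighted sum of codon frequencies), the fitness (negative mean rank of the known highly expressed genes), and all parameter values are illustrative choices, not details from the lecture. A real run would use 59 weights.

```python
import random

def score(freqs, weights):
    """Gene score: codon frequencies weighted by the candidate preferences."""
    return sum(f * w for f, w in zip(freqs, weights))

def fitness(weights, genes, known_heg):
    """Negative mean rank of the known HEGs (rank 0 is the top)."""
    ranked = sorted(genes, key=lambda g: score(genes[g], weights), reverse=True)
    return -sum(ranked.index(g) for g in known_heg) / len(known_heg)

def crossover(a, b, rng):
    """Uniform crossover: each weight comes from one parent or the other."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(w, rng, rate=0.1, step=0.2):
    """Nudge some weights, keeping them within [0, 1]."""
    return [min(1.0, max(0.0, x + rng.uniform(-step, step)))
            if rng.random() < rate else x for x in w]

def evolve(genes, known_heg, n_weights, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, genes, known_heg), reverse=True)
        parents = pop[:pop_size // 2]          # the top half survives
        children = []
        while len(parents) + len(children) < pop_size:
            one, two = rng.sample(parents, 2)
            children.append(mutate(crossover(one, two, rng), rng))
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, genes, known_heg))

# Hypothetical toy genome with two codon frequencies per gene.
genes = {"heg1": [0.9, 0.1], "heg2": [0.8, 0.2], "g3": [0.5, 0.5],
         "g4": [0.4, 0.6], "g5": [0.3, 0.7]}
best = evolve(genes, ["heg1", "heg2"], n_weights=2, pop_size=10, generations=5)
print(best, fitness(best, genes, ["heg1", "heg2"]))
```

The search never asks which genes form a good reference set. It only asks which weights put the known highly expressed genes at the top of the ranking.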
