Correlating mRNA and protein abundance via genomic and proteomic characteristics. Dov Greenbaum Gerstein Lab Thesis Seminar April 21, 2004. outline. Why analyze mRNA and protein correlations Background Disparate Data Sources Correlating mRNA and Protein Results Other analyses
Correlating mRNA and protein abundance via genomic and proteomic characteristics Dov Greenbaum Gerstein Lab Thesis Seminar April 21, 2004
Outline • Why analyze mRNA and protein correlations • Background • Disparate Data Sources • Correlating mRNA and Protein • Results • Other analyses • Formalism – comparing genome, transcriptome and proteome in terms of broad categories • New Data Sets • Analysis via Broad Categories • Analysis of factors affecting correlations • Another reason to expect correlations • Expression and Protein Interactions
Why Correlate mRNA & Protein? • Harness mRNA and protein • Quantitative analysis of global mRNA levels is currently a preferred method for analyzing the state of cells and tissues. • mRNA level <= ? => protein level • Several methods that provide either absolute mRNA abundance or relative mRNA levels in comparative analyses are easy to apply. • * Fast * Very sensitive • Look, we have so much mRNA – why even bother with the protein?
Both mRNA and Protein Levels are necessary for complete analysis • Shown mathematically in Hatzimanikatis et al., Biotechnology 1999 • Combinations of RNA and protein detection approaches have recently aided in the identification of biomarkers in cancer (Hegde et al., Current Opinion in Biotech 2003)
Relationship between mRNA and Protein levels: $\frac{dP_i}{dt} = k_{s,i}\,\mathrm{mRNA}_i - k_{d,i}\,P_i$, where $k_{s,i}$ and $k_{d,i}$ are the protein synthesis and degradation rate constants, respectively. At steady state: $P_i = \frac{k_{s,i}\,\mathrm{mRNA}_i}{k_{d,i}}$. Unfortunately, the rate constants are difficult to determine, so let's look for correlations instead of deriving them.
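To make the rate equation on this slide concrete, here is a minimal Python sketch. It integrates dP/dt = ks * mRNA - kd * P with a simple forward-Euler loop and compares the result with the closed-form steady state P = ks * mRNA / kd. The function names and all numeric values are hypothetical, chosen only for illustration; they are not measured rate constants.

```python
def simulate_protein(mrna, k_s, k_d, p0=0.0, dt=0.01, steps=5000):
    """Integrate dP/dt = k_s * mRNA - k_d * P with forward Euler."""
    p = p0
    for _ in range(steps):
        p += dt * (k_s * mrna - k_d * p)
    return p


def steady_state_protein(mrna, k_s, k_d):
    """Closed-form steady state of the same equation: P = k_s * mRNA / k_d."""
    return k_s * mrna / k_d


# Hypothetical values (assumptions for illustration only).
mrna, k_s, k_d = 10.0, 2.0, 0.5
p_sim = simulate_protein(mrna, k_s, k_d)
p_ss = steady_state_protein(mrna, k_s, k_d)

# Same mRNA level, ten-fold slower degradation: ten-fold more protein.
p_slow = steady_state_protein(mrna, k_s, k_d / 10)

print(p_sim, p_ss, p_slow)
```

The last line of the sketch illustrates why identical mRNA levels need not give identical protein levels: the per-gene rate constants enter the steady state directly.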
Methods for determining mRNA expression: each has strengths and weaknesses
Methods for determining protein abundance • 2DE Gel Electrophoresis • (Klose, 1975; O’Farrell, 1975) • Multiple staining options • Small dynamic range • limited in what it can detect
Methods for determining protein abundance: ICAT • ICAT reagent: relative levels • VB dynamic range • Cannot detect post-translational modifications • Requires proteins to contain cysteine residues, and these residues must be in the region of a peptide that is produced during proteolytic cleavage
MudPIT • Really the only HT method that can detect PT modifications
Other Methods for determining protein abundance • DIGE: e.g. Cy3 vs Cy5 labeling; very big dynamic range • TAP tagging: Weissman & O’Shea (Oct 2003) • 2D-electrophoresis
Same mRNA levels yet protein data varied > 20X • N ~100, r = 0.9 • Protein quantification via measurement of radioactivity • Gygi et al., Molecular and Cellular Biology, 1999.
Same mRNA levels yet protein data varied > 20X • Do some ORFs bias the results? • 73 proteins (69%): R = 0.356
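The question of whether a few ORFs bias the correlation can be illustrated with a short Python sketch. It computes the Pearson coefficient on a full set and again after dropping the most abundant points. The numbers below are synthetic and invented purely for illustration; they are not the Gygi et al. data, and the `pearson_r` helper is a hypothetical name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic abundances (assumption): eight low-abundance ORFs with almost no
# mRNA/protein relationship, plus two very abundant ORFs.
mrna_low = [1, 2, 3, 4, 5, 6, 7, 8]
protein_low = [5, 1, 7, 2, 8, 3, 6, 4]
mrna_all = mrna_low + [100, 200]
protein_all = protein_low + [120, 180]

r_all = pearson_r(mrna_all, protein_all)   # dominated by the two abundant ORFs
r_low = pearson_r(mrna_low, protein_low)   # low-abundance subset only

print(round(r_all, 3), round(r_low, 3))
```

In this toy set the full-data coefficient is close to 1 while the low-abundance subset is only weakly correlated, which is the same qualitative pattern the slide describes. A rank-based coefficient such as Spearman's is one common way to reduce this sensitivity.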
mRNA vs Protein • r = 0.74 • Protein quantification via image analysis • Futcher et al., Molecular and Cellular Biology, 1999
Jury is out… Gygi et al: “This study revealed that transcript levels provide little predictive value with respect to the extent of protein expression.” Futcher et al: “there is a good correlation between protein abundance and mRNA abundance for the proteins that we have studied”.
mRNA vs Protein • Greenbaum et al., Bioinformatics 2001 • r = 0.67 • While mine isn’t the first: A) largest at the time: integrated the previous two results; B) first to integrate diverse data for analysis
3 Genes in Lung Adenocarcinomas: Op18, Annexin IV, and GAPD • r = 0.025 • Chen et al., Molecular & Cellular Proteomics, 2002.
Murine hematopoietic precursor MPRO: change in expression, 0 - 72 hr • R = 0.58 • ~80% of the genes are located in the first and third quadrants
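The quadrant statement can be read as a sign-agreement count: a gene falls in the first or third quadrant when its mRNA and protein changes have the same sign. A minimal Python sketch, assuming the changes are expressed as log ratios; the function name and the values are hypothetical and are not the MPRO measurements.

```python
def fraction_concordant(mrna_log_ratios, protein_log_ratios):
    """Fraction of genes whose mRNA and protein changes share a sign
    (first quadrant: both up; third quadrant: both down)."""
    pairs = list(zip(mrna_log_ratios, protein_log_ratios))
    concordant = sum(1 for m, p in pairs if m * p > 0)
    return concordant / len(pairs)


# Hypothetical log ratios (72 hr vs 0 hr) for five genes.
mrna_change = [1.2, -0.8, 0.5, -1.5, 0.3]
protein_change = [0.9, -0.4, -0.2, -1.1, 0.6]

frac = fraction_concordant(mrna_change, protein_change)
print(frac)
```

Genes with a change of exactly zero on either axis are counted as non-concordant here; that is a design choice of this sketch, not something stated on the slide.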
Ratios of wt+gal to wt-gal • ICAT vs microarray • N ~ 290, r = 0.6 • Ideker et al., Science, 2001
Yeast growth under two different media • r = 0.45, but almost 1.0 for the same loci in the same pathway • Washburn et al., PNAS 2003
Integrating multiple sources of information • “The challenge for computational biology is to provide methodologies for transforming high-throughput heterogeneous data sets into biological insights about the underlying mechanisms. Although high-throughput assays provide a global picture, the details are often noisy, hence conclusions should be supported by several types of observations. Integration of data from assays that examine cellular systems from different viewpoints (for instance, gene expression and protein-protein interactions) can lead to a more coherent reconstruction and reduce the effects of noise.” Nir Friedman, Science 2004 • Anyone who has worked with HT data knows: noise is huge!
Reference mRNA Sets • Young • Church • Samson • SAGE
Fitting Protein Data Original Set
mRNA vs Protein • Greenbaum et al., Bioinformatics 2001 • r = 0.67 • mRNA expression reference set: 3 Affy chip sets and SAGE, 6249 ORFs • Protein abundance reference set #1: two 2DE sets, all available data, 181 ORFs
Outliers (2 STDEV from the mean) • High protein: Metabolism (1), Energy (2) • Low protein: Prot. Syn. (5), Prot. Fate (6)
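One way to read the outlier criterion is as a residual test: fit a trend line to log mRNA vs log protein, then flag ORFs whose residual is more than two standard deviations from it. The Python sketch below assumes an ordinary least-squares line and log-transformed values; the slides do not spell out the exact procedure, and the ORF names and data are hypothetical.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_outliers(names, log_mrna, log_protein, n_sd=2.0):
    """Return (name, direction) for points more than n_sd residual SDs off the line."""
    slope, intercept = fit_line(log_mrna, log_protein)
    residuals = [y - (slope * x + intercept) for x, y in zip(log_mrna, log_protein)]
    sd = statistics.stdev(residuals)
    flagged = []
    for name, res in zip(names, residuals):
        if abs(res) > n_sd * sd:
            flagged.append((name, "high protein" if res > 0 else "low protein"))
    return flagged


# Hypothetical data: ten ORFs on a perfect trend, one pushed far above it.
names = [f"ORF{i}" for i in range(10)]
log_mrna = [float(i) for i in range(10)]
log_protein = list(log_mrna)
log_protein[4] += 5.0  # much more protein than its mRNA level suggests

outliers = flag_outliers(names, log_mrna, log_protein)
print(outliers)
```

Flagged ORFs could then be grouped by functional category to see which categories are over-represented among the high-protein and low-protein outliers.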
Later, larger datasets concurred with these results. Generally… • Protein synthesis (~35% of all protein synthesis genes) and Protein fate (folding, modification, destination) are more likely to have low protein vs mRNA than the general population • AA metabolism & Energy are 2X as likely to have high protein vs mRNA than the general population • Alcohol dehydrogenase is also a stress-induced protein in many organisms (Matton et al. 1990; An et al. 1991; Millar et al. 1994). Faster ramp up? • Alternatively, it is possible to look into mRNA stability as a factor. Many structures within mRNA are currently thought to influence stability, including, among others, stem loops, UTRs, premature stops, and uORFs (Klaff et al. 1996)
Non-Outliers. Generally… tight regulation by the cell • Only 3% of transcription-associated genes (n = 441) have significantly uncorrelated mRNA and protein levels (2 STDEV from the trend line) • Transcription-associated genes are 25% of the essential genes in yeast • Essential genes as a group have higher correlations than the general yeast population • 7% of cell-cycle-associated genes (n = 432) have significant non-correlation
Quick Summary • Why correlate mRNA and protein levels? • Merged disparate data sets • Distinct but complementary • Global correlations • Outliers are interesting: • Metabolism & Energy: relatively high protein levels • Protein Synthesis & Protein Fate: low protein levels
Data Set Size • ~170 ORFs: 2 DE-gel datasets • ~6,000 ORFs: 5 Affymetrix GeneChips + SAGE data • ~6,000 ORFs
Enrichments: $\Delta(F,[v,S],[w,G]) = \frac{\mu(F,[v,S]) - \mu(F,[w,G])}{\mu(F,[w,G])}$, where F is the feature and v and w are the weights (expression levels) of sets S and G
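The enrichment on this slide can be sketched in Python as a relative difference between two weighted quantities. The sketch assumes that the quantity attached to (F, [w, G]) is the w-weighted average of feature F over the ORFs in set G; that reading, the function names, and the example numbers are assumptions made for illustration.

```python
def weighted_mean(feature, weights):
    """Weighted average of a per-ORF feature; `weights` maps ORF -> weight
    (for example an expression level) and also defines the set of ORFs."""
    total = sum(weights.values())
    return sum(feature[orf] * w for orf, w in weights.items()) / total


def enrichment(feature, v, w):
    """Relative enrichment of the feature in set S (weights v)
    compared with the reference set G (weights w)."""
    mu_s = weighted_mean(feature, v)
    mu_g = weighted_mean(feature, w)
    return (mu_s - mu_g) / mu_g


# Hypothetical example: alanine fraction per ORF.
alanine = {"A": 0.10, "B": 0.05, "C": 0.05}
genome_weights = {"A": 1, "B": 1, "C": 1}   # genome: every ORF counts once
mrna_weights = {"A": 8, "B": 1, "C": 1}     # transcriptome: weight by mRNA level

e = enrichment(alanine, mrna_weights, genome_weights)
print(e)
```

With equal weights the reference is simply the genome-wide average, so a positive value means the feature is more common in the expressed population than in the genome.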
Visual Formalism • ~170 ORFs vs ~6,000 ORFs • Two different subsets of data because of limited size!
Depletion of Random Coil Secondary Structure • STABILITY • Concurrence with data from Perczel et al., Chemistry 2003, regarding stability of specific secondary structures
Enrichment of Amino Acids • STABILITY • Alanines, glycines, and valines result in more compact structures • More compact = more stable (e.g., thermophilic enzymes tend to be very compact)
Enrichment of Amino Acids Simple story: translatome is enriched in same way as transcriptome
Enrichment of Molecular Weights/Biomass • Abundant proteins are smaller = reduces cost • Effect of transcription: the yeast cell favors the expression of shorter ORFs over longer ones (as opposed to long lightweight ORFs – see MW of aa) • This selection is happening, for the most part, at the transcriptome level • Negative correlation between ORF length and mRNA expression, Jansen & Gerstein 2000 (and to a lesser degree with protein abundance)
Enrichment of Molecular Weights/Biomass • Abundant proteins are smaller = reduces cost • Effect of transcription • CONCURS with experimental results from Akashi, Genetics 2003 • See also: Akashi, Genetics 1996 & Moriyama and Powell, NAR 1998, who hypothesize that this trend exists in S. cerevisiae, D. melanogaster and E. coli (although probably not in C. elegans)
Depletion: Functional Categories • Transcription & Cell Growth • Molecular switches • Require only minimal expression
Enrichment of localization - BIAS? (Drawid & Gerstein, 2000)
Review • Formalism • Different gene sets b/c of limited data • Enrichments concur with experimental results
Fitting Protein Data: Newer Set • MudPIT data are fit first into mRNA space, then inverse fit back into protein space • Then each of the data sets is fit via least squares onto the Aebersold data set
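The final step on this slide, fitting each data set onto a reference set by least squares, might look roughly like the Python sketch below. It assumes a straight-line fit in log10 space over the ORFs the two sets share, then applies that line to every ORF in the data set being rescaled. The log transform, the function name, and the toy values are assumptions; the slide does not give the exact form of the fit, and the MudPIT round trip through mRNA space is not modeled here.

```python
import math


def fit_onto_reference(dataset, reference):
    """Rescale `dataset` (ORF -> abundance) onto the scale of `reference`
    using a least-squares line in log10 space over the shared ORFs."""
    shared = sorted(set(dataset) & set(reference))
    xs = [math.log10(dataset[orf]) for orf in shared]
    ys = [math.log10(reference[orf]) for orf in shared]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Apply the fitted line to every ORF, including those absent from the reference.
    return {orf: 10 ** (slope * math.log10(val) + intercept)
            for orf, val in dataset.items()}


# Toy values: the reference is exactly twice the data set on shared ORFs.
reference_set = {"a": 2.0, "b": 20.0, "c": 200.0}
other_set = {"a": 1.0, "b": 10.0, "c": 100.0, "d": 1000.0}

rescaled = fit_onto_reference(other_set, reference_set)
print(rescaled["d"])
```

Once every data set is on the reference scale, values for the same ORF can be combined, which is how a merged set can cover more ORFs than any single experiment.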
Global Correlation • mRNA set: 6249 ORFs • Protein set #2: two 2DE sets & two MudPIT sets, ~2000 ORFs
Functional Categories • Co-regulated proteins • High: ION TRANSPORT, INTERACTION WITH THE CELLULAR ENVIRONMENT, CELL FATE • Low: METABOLISM, FATE, CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM