Proc. Natl. Acad. Sci. USA 105: 21034-21038 (2008).
Discovery and revision of Arabidopsis genes by proteogenomics. Natalie E. Castellana, Samuel H. Payne, Zhouxin Shen, Mario Stanke,* Vineet Bafna, and Steven P. Briggs. University of California San Diego; *Institute for Microbiology and Genetics, Göttingen, Germany
Limitations of gene annotation • Based on evidence of transcripts. • Depends on gene-finding/protein-prediction algorithms. • How do we define genes? • Models suffer from errors in reading frame and exon definition. • Rare transcripts? Noise? • Arabidopsis is the best-annotated plant genome, and other plant genomes are annotated relative to Arabidopsis.
What did Castellana et al. do to detect gene model errors? • Isolated Arabidopsis proteins from different tissues. • Analyzed tryptic peptides by tandem mass spectrometry. • Determined sequences for 144,079 distinct peptides. • Confirmed gene models for 40% (12,769) of annotated genes (assuming a gene total of 31,922). • Found 18,024 novel peptides, suggesting 13% of the proteome was missing or incorrect. • Added or corrected 1,473 genes/proteins, leaving an estimated 1 to 4% of protein-coding genes unidentified.
Proteins • Protein extracts of four Arabidopsis organs (leaf, root, flower, silique) and cell culture MM2d. • Phosphoproteins from MM2d were enriched using TiO2. • Sodium orthovanadate (Na3VO4) was used as a phosphatase inhibitor. • Cysteines were reduced and alkylated. • Digested with trypsin. • Separated by high-resolution 3D-LC (RP1, SCX, RP2) in 45 runs, producing 144,079 tryptic peptides.
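The trypsin digestion step can be mimicked in silico. The sketch below is not the authors' pipeline; it assumes the common simplified rule that trypsin cleaves after K or R unless the next residue is P. The function name `tryptic_digest` and the `missed_cleavages` parameter are illustrative.

```python
import re

def tryptic_digest(protein, missed_cleavages=0):
    """Return in-silico tryptic peptides of a protein sequence.

    Assumed simplified rule: cleave after K or R, but not before P.
    """
    # Zero-width split points: preceded by K/R and not followed by P.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides

if __name__ == "__main__":
    print(tryptic_digest("MKRPAAKGG"))
    print(tryptic_digest("MKRPAAKGG", missed_cleavages=1))
```

Allowing one or two missed cleavages is a common choice when building a peptide search space, because digestion is rarely complete.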
Mass Spectrometry (MS) From Wikipedia. • Ionized molecules or molecule fragments are measured by their mass-to-charge ratios (m/z). • 1) The components of the sample are ionized, for example by an electron beam, which results in the formation of charged particles (ions). • 2) The ions are directed into electric and/or magnetic fields. • 3) The mass-to-charge ratio of the particles is computed from their motion as they transit through the electromagnetic fields. • 4) The ions, which in step 3 were sorted according to m/z, are detected.
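For peptides, the expected mass-to-charge ratio can be computed directly from the sequence. The sketch below assumes unmodified residues, standard monoisotopic residue masses, and protonation as the only source of charge; the function name `peptide_mz` is illustrative, and alkylated cysteines would need an extra mass offset that is not included here.

```python
# Monoisotopic residue masses in daltons (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one H2O per peptide (free N- and C-termini)
PROTON = 1.007276  # mass of each charge-carrying proton

def peptide_mz(sequence, charge):
    """m/z of a peptide ion carrying `charge` protons (assumed [M + zH]z+)."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge

if __name__ == "__main__":
    for z in (1, 2, 3):
        print(z, round(peptide_mz("SAMPLER", z), 4))
```

The same peptide appears at different m/z values depending on its charge state, which is why search software considers several charges per spectrum.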
Mass Spectrometers consist of three modules: 1) An ion source, which can convert gas phase sample molecules into ions (or, in the case of electrospray ionization, move ions that exist in solution into the gas phase); 2) a mass analyzer, which sorts the ions by their masses by applying electromagnetic fields; and 3) a detector, which measures the value of an indicator quantity and thus provides data for calculating the abundances of each ion present.
A quadrupole time-of-flight hybrid tandem mass spectrometer. Multiple stages of mass analysis separation can be accomplished with MS steps separated in space or time. In tandem mass spectrometry in space, the separation elements are physically separated. These elements can be sectors, transmission quadrupoles, or time-of-flight analyzers. ESI is electrospray ionization; MALDI is matrix-assisted laser desorption/ionization.
Workflow. Castellana N. E. et al. PNAS (2008) 105:21034-21038 ©2008 by National Academy of Sciences
Acquisition of Spectra • Peptides were charged by electrospray ionization. • Analyzed on an LTQ linear ion trap tandem mass spectrometer. • 21 million spectra were acquired. Data are archived in Tranche (http://tranche.proteomecommons.org). • Spectra were searched against three reference databases: TAIR 7, a six-frame translation of the genome, and ab initio gene predictions using AUGUSTUS and exon prediction.
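A six-frame translation reads the DNA in three offsets on each strand, so that peptides can be matched without relying on any gene model. The following is a minimal sketch using the standard genetic code; it is not the authors' database-building code, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Reverse-complement an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are kept as `*` here; a search database would typically split each frame at stops and keep only stretches long enough to contain a tryptic peptide.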
Number of assigned spectra, distinct peptides, and proteins in different samples and organs. Baerenfaller et al. (2008) Science 320: 938-941.

Plant tissue              Spectra    Distinct peptides   Proteins   Avg. mol. mass (kD)
Differentiated organs     465,836    64,219              10,902     54.6
  Roots                    71,516    27,546               6,125     55.0
    Roots 10 days          38,476    20,301               5,159     55.7
    Roots 23 days          33,040    16,984               4,466     54.3
  Leaves                   80,186    20,417               4,853     57.5
    Cotyledons             39,419    13,628               3,665     58.2
    Juvenile leaves        40,767    14,437               3,892     57.8
  Flowers                 147,650    33,192               7,040     57.4
    Flower buds            54,588    19,467               5,104     58.5
    Open flowers           57,861    20,205               5,215     59.0
    Carpels                35,201    13,393               3,946     56.7
  Siliques                 79,589    23,054               5,779     54.6
  Seeds                    86,895    13,901               3,789     54.7
Cell culture              324,345    49,842               8,698     57.3
  Dark                    149,051    34,551               6,547     59.7
  Light                   143,583    32,656               6,474     59.8
  Light; small             31,711    15,318               4,472     43.2
Total                     790,181    86,456              13,029     54.7
TAIR7                                                    27,029     45.9

65% of all peptides were detected in only one organ; 1.3% were identified in all organs.
Fig. S1. Discovery Curve, showing the number of distinct peptides matching to TAIR7 recovered as a function of the number of annotated spectra. The discovery curve is separated to show the contribution of each individual dataset.
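A discovery curve of this kind can be computed by counting how many distinct peptides have been seen after each annotated spectrum. The sketch below assumes the identifications are available as a simple ordered list of peptide sequences, one per spectrum; the function name and input format are illustrative, not taken from the paper.

```python
def discovery_curve(identifications):
    """Cumulative number of distinct peptides after each annotated spectrum.

    `identifications` is an iterable of peptide sequences, one per spectrum,
    in the order the spectra are considered (an assumed input format).
    """
    seen = set()
    curve = []
    for peptide in identifications:
        seen.add(peptide)
        curve.append(len(seen))
    return curve

if __name__ == "__main__":
    spectra = ["AK", "AK", "GR", "AK", "LK"]
    print(discovery_curve(spectra))
```

A curve that flattens indicates that additional spectra mostly re-identify known peptides; concatenating datasets in order shows how much each one contributes.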
Novel gene discovery. (A) A cluster of 13 uniquely located peptides that do not overlap a current gene model (Chr3). The prediction track shows the single-exon gene model produced by AUGUSTUS. (B) The predicted sequence shows strong homology to a thylakoid lumen family protein (sp|P82658|TL19_ARATH). It also shows strong similarity to proteins in both grapevine (emb|CAO40861.1, a hypothetical gene) and rice (Os08g0504500, a cDNA-derived gene). Castellana N. E. et al. PNAS 2008;105:21034-21038 ©2008 by National Academy of Sciences
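Finding such clusters amounts to an interval problem: keep peptide hits that overlap no annotated gene model, then group nearby hits. The sketch below is one simple way to do this on a single chromosome; the `max_gap` threshold, the coordinate format, and the function names are assumptions for illustration and are not the criteria used in the paper.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def novel_clusters(peptide_hits, gene_models, max_gap=1000):
    """Group peptide hits that overlap no gene model into positional clusters.

    peptide_hits, gene_models: lists of (start, end) on one chromosome.
    max_gap: hypothetical maximum distance between hits in the same cluster.
    """
    unexplained = sorted(
        hit for hit in peptide_hits
        if not any(overlaps(hit, gene) for gene in gene_models)
    )
    clusters = []
    for hit in unexplained:
        if clusters:
            cluster_end = max(end for _, end in clusters[-1])
            if hit[0] - cluster_end <= max_gap:
                clusters[-1].append(hit)
                continue
        clusters.append([hit])
    return clusters

if __name__ == "__main__":
    hits = [(100, 130), (150, 190), (5000, 5030), (900, 930)]
    genes = [(880, 1200)]
    print(novel_clusters(hits, genes))
```

Clusters with several independent peptides are stronger evidence for a missing gene than isolated hits, which is why a cluster of 13 peptides is a convincing case.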
Intergenic Regions. 64% of intergenic clusters overlap annotated pseudogenes or transposons. Annotated pseudogenes may be incorrectly truncated and have missing exons. Transposons may contain protein-coding genes unrelated to transposon activity (gene hitch-hiking). A large number (7,442) of small ORFs have been found as transcripts from intergenic regions*. 155 of these have predicted peptides. *Hanada et al. (2007) Genome Research 17:632-640.
Peptides overlapping a predicted transposable element gene. Five peptides overlap an annotated transposable element gene. The inferred protein is 56% identical to a ubiquitin-like protease. Castellana N. E. et al. PNAS 2008;105:21034-21038 ©2008 by National Academy of Sciences
Gene refinement: new exons, boundary changes, exon skipping, modified translation start and stop sites. A majority are novel exons: 60% are within introns, and 40% are in UTRs. 26 cases may actually be a single exon. Exon extension and shortening are equally frequent. Using the peptide evidence, AUGUSTUS predicts altered transcripts in 695 genes. In 130 cases, peptide variation indicates new isoforms.
Refined Gene Model. Four novel peptides map in the 5′ UTR and the first exon of a protein kinase. Castellana N. E. et al. PNAS 2008;105:21034-21038 ©2008 by National Academy of Sciences
New gene models from identified peptides. Baerenfaller et al. (2008) Science 320: 938-941.
Take home lessons MS is a powerful adjunct to genomics and transcriptomics. More precise definition of coding genes. Proteomics is becoming more quantitative and less expensive. MS can provide absolute protein quantitation. Likely to play an increasing role in “omic” research. Proteomics people will want more respect.
References • Katja Baerenfaller, Jonas Grossmann, Monica A. Grobei, Roger Hull, Mattias Hirsch-Hoffman, Shaul Yalovsky, Phillip Zimmermann, Ueli Grossniklaus, Wilhelm Gruissem, Sacha (2008). Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science 320: 938-941. • Stephen Tanner, Zhouxin Shen, Julio Ng, Liliana Florea, Roderic Guigó, Steven Briggs, and Vineet Bafna (2007). Improving gene annotation using peptide mass spectrometry. Genome Res. 17: 231-239. • Kousuke Hanada, Xu Zhang, Justin O. Borevitz, Wen-Hsiung Li, and Shin-Han Shiu (2007). A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection. Genome Res. 17: 632-640.