Ruedi Aebersold, Ph.D. Institute for Systems Biology Seattle, Washington

Data Collection and Analysis for High Throughput Quantitative Proteomics: Current Status and Challenges
Ruedi Aebersold, Ph.D.
Institute for Systems Biology, Seattle, Washington
email: raebersold@systemsbiology.org

Proteomics: The systematic (quantitative) analysis of the proteins expressed in a cell at a time
• Proteome as database: enumerate all the components of a proteome; proteome analyzed once
• Proteomics as biol. or clin. assay: detect dynamic changes in the proteome following external or internal perturbations; proteome analyzed multiple (infinite) times

Protein Identification Strategy
• I: Protein mixture; peptides; 1D, 2D, 3D peptide separation (figure axis: time, min)
• II: Tandem mass spectrum (Q1, Q2 collision cell, Q3; figure axis: m/z)
• III: Correlative sequence database searching (theoretical vs. acquired spectrum); protein identification

Accurate Quantitation Using Isotope Dilution
• Sample 1 (reference): incorporate stable light isotope
• Sample 2: incorporate stable heavy isotope
• Combine samples; analyze by mass spectrometer
• h/l analytes are chemically identical, giving an identical specific signal in MS
• Ratio of h/l signals indicates ratio of analytes

Isotope Coded Affinity Tags (ICAT)
• Reagent structure (figure): biotin tag, linker (heavy or light), thiol-reactive group
• Heavy reagent: d8-ICAT (X = deuterium)
• Light reagent: d0-ICAT (X = hydrogen)
• Detection of Cys-containing peptides and accurate quantification using stable isotope dilution

Quantitative proteomics by isotope labeling-LC-MS/MS
• Mixture 1 and Mixture 2: isotope-label (light, heavy)
• Combine and proteolyze; optional fractionation
• Avidin affinity enrichment
• Quantitation (figure: light/heavy pair, m/z axis 550 to 580) and protein identification (figure: MS/MS spectrum of NH2-EACDPLR-COOH, m/z axis 0 to 800)
• Compatible with any separation/fractionation method at protein/peptide level

Stable Isotope Labeling Strategies
• PROTEIN LABELING: metabolic stable isotope labeling; isotope tagging by chemical reaction; stable isotope incorporation via enzyme reaction (figure steps: label, digest)
• DATA COLLECTION: mass spectrometry
• DATA ANALYSIS: intensity vs. m/z spectra

Quantitative Proteomics Technology
• Protein identification: Automated peptide tandem mass spectrometry of complex peptide mixtures
• Protein quantification: Isotope dilution
• Selective chemical reactions: reduction of sample complexity; selective analyte isolation
Results: Identification of proteins in sample and quantitative profiles
Current capacity: ~1000 proteins per day/instrument
Total yeast lysate: ~2000 proteins identified and quantified
In 1991, all the world’s labs combined had identified just about 2000 genes

Current Limitations (and Potential Solutions)
• The efficiency problem
• The validation problem
• The biological inference problem

Standard Method for Complex Peptide Mixture Analysis: cation exchange, RP-HPLC, ESI-MS/MS

Proteome Analysis: The Analytical Challenges
Yeast Proteome
• Expected number of ORFs: 6118
• Expected number of tryptic peptides: ~350,000
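A peptide count like the one above is the kind of figure an in-silico digest of the predicted ORFs produces. The following is a minimal sketch of such a digest, assuming the commonly used rule that trypsin cleaves after K or R except before P, and ignoring missed cleavages. The function name `tryptic_digest` and the short test sequence are illustrative and not taken from the talk.

```python
def tryptic_digest(sequence, min_length=1):
    """Split a protein sequence into tryptic peptides.

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    Missed cleavages and modifications are not modeled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal peptide that does not end in K or R
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


if __name__ == "__main__":
    # Illustrative sequence only
    for peptide in tryptic_digest("VRTPEVDDEALEKFDKALK"):
        print(peptide)
```

Summing `len(tryptic_digest(orf))` over every ORF in a sequence database would give a proteome-wide count; the exact number depends on the length cutoff and on how missed cleavages are treated.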

Timepoint Samples from Yeast Cells Synchronously Transiting the Cell Cycle
• Synchronous timepoint samples compared to reference sample
• Asynchronous reference sample

Data Summary
Values shown in the figure: 1648, 1095, 1184, 1112, 892, 1523, 1055, 1140, 921, 1448, 1051, 871, 1713, 960, 1229
• 2735/6562 proteins quantified across all timepoints (42%)
• 696 proteins quantified in every experiment
• 1513 proteins quantified in at least one timepoint
• 34,400 peptides quantified on average per timepoint
• >1 million mass spectra collected

Pep3D (Xiao-jun Li et al., submitted)
Features: 2720

Features: 2720
CIDs: 1633
IDs: 363
ID/CID: 22%
ID/feature: 13%
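The two yields follow directly from the three counts. A short sketch of the arithmetic, using the numbers on the slide; the helper name `percent` is illustrative:

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


features = 2720   # peptide features detected in the LC-MS map
cids = 1633       # features selected for CID (MS/MS)
ids = 363         # CID spectra that led to an identification

print(f"CID/feature: {percent(cids, features):.1f}%")
print(f"ID/CID:      {percent(ids, cids):.1f}%")
print(f"ID/feature:  {percent(ids, features):.1f}%")
```

The first line of output, the fraction of features that were fragmented at all, is not on the slide but comes from the same counts.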

Possible Solutions
• Better separation technology
• Selective peptide isolation
• Smart precursor ion selection

Tryptic yeast digest separated by FFE-IEX or SAX
• 30 fractions collected and analyzed by capLC-MS/MS
• Overlap: same peptide identified in adjacent fractions

Figure values: 92%, 68%

Possible Solutions
• Better separation technology
• Selective peptide isolation: Zhang H, et al. Curr. Op. Chem. Biol. (2004) 8: 66-75; Aebersold R. Nature (2003) 422(6928): 115-6.
• Smart precursor ion selection: Griffin T, et al. Anal. Chem. (2003) 75: 867-74; Griffin et al. J. Am. Soc. Mass Spectrom. (2001) 12: 1238-46.

Summary: Efficiency Problem
• Only a (small) subset of peptides present is identified
• Current separation strategies do not have sufficient resolving power
• MS/MS of every peptide in every experiment is a bottleneck of current MS-based proteomics
• LC-ESI MS/MS wastes a high fraction of MS/MS cycles sequencing precursor ions that do not lead to a positive identification
• Most positive identifications are not informative in profiling experiments
• Smart precursor ion selection is required

Protein Identification by MS/MS
• Protein level: protein sample (A, B, C, D); protein identifications (A, B, C)
• Peptide level: peptide mixture; peptide identifications
• MS/MS spectrum level: MS/MS spectra
• Database search tools: Sequest, Mascot, SpectrumMill, etc.

OUTPUT FROM SEARCH ALGORITHM sort by search score

Threshold Model
• Sort by search score; results above the threshold are taken as “correct”, results below it as incorrect
• SEQUEST: Xcorr > 2.0, ΔCn > 0.1
• MASCOT: Score > 47
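A threshold filter of this kind is a plain cutoff on the search scores. The sketch below applies the SEQUEST cutoffs quoted on the slide, assuming the second value (written "Cn" on the slide) is the delta-Cn score. The `SequestHit` record and the example hits are hypothetical and only serve to show the mechanics.

```python
from dataclasses import dataclass


@dataclass
class SequestHit:
    """Hypothetical record for one spectrum's best database match."""
    spectrum: str
    peptide: str
    xcorr: float
    delta_cn: float


def passes_threshold(hit, min_xcorr=2.0, min_delta_cn=0.1):
    """Accept a hit only if both scores exceed their cutoffs."""
    return hit.xcorr > min_xcorr and hit.delta_cn > min_delta_cn


if __name__ == "__main__":
    # Made-up hits for illustration
    hits = [
        SequestHit("Spectrum 1", "LGEYGH", 4.5, 0.35),
        SequestHit("Spectrum 2", "FQSEEQ", 3.4, 0.05),
        SequestHit("Spectrum 3", "FLYQE", 1.3, 0.20),
    ]
    for hit in hits:
        label = "accepted" if passes_threshold(hit) else "rejected"
        print(hit.spectrum, hit.peptide, label)
```

Every accepted hit is treated as equally "correct", which is the limitation the probability-based scoring addresses.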

Difficulty Interpreting Protein Identifications Based on MS/MS
• Different search score thresholds used to filter data
• Unknown and variable false positive error rates
• No reliable measures of confidence

Statistical Model
Entire dataset (MS/MS spectrum, best match, database search score):
Spectrum      Peptide   Score
Spectrum 1    LGEYGH    4.5
Spectrum 2    FQSEEQ    3.4
Spectrum 3    FLYQE     1.3
…             …         …
Spectrum N    EIQKKF    2.2

Statistical Model
Entire dataset, with the probability that each assignment is correct:
Spectrum      Peptide   Score   Probability
Spectrum 1    LGEYGH    4.5     1.0
Spectrum 2    FQSEEQ    3.4     0.97
Spectrum 3    FLYQE     1.3     0.01
…             …         …       …
Spectrum N    EIQKKF    2.2     0.3
Figure: score distributions of incorrect and correct assignments, with p = 0.5 marked
Unsupervised learning: the EM mixture model algorithm learns the most likely distributions among correct and incorrect peptide assignments given the observed data
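The idea of learning the two score distributions from unlabeled data can be illustrated with a small expectation-maximization loop. This is a simplified sketch, not the published model: it assumes a single score per spectrum and fits two Gaussians, one for incorrect and one for correct assignments. All function names and the example scores are illustrative.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def probability_correct(score, model):
    """Posterior probability that a score belongs to the 'correct' component."""
    good = model["prior"] * normal_pdf(score, *model["good"])
    bad = (1.0 - model["prior"]) * normal_pdf(score, *model["bad"])
    return good / (good + bad) if good + bad > 0.0 else 0.5


def weighted_mean_sd(values, weights):
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component


def fit_mixture(scores, iterations=100):
    """Fit a two-Gaussian mixture to unlabeled scores with EM."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    spread = max((ordered[-1] - ordered[0]) / 4.0, 1e-3)
    # Initial guess: lower half incorrect, upper half correct
    model = {
        "prior": 0.5,
        "bad": (sum(ordered[:half]) / half, spread),
        "good": (sum(ordered[half:]) / (len(ordered) - half), spread),
    }
    for _ in range(iterations):
        # E-step: posterior of 'correct' for every score
        post = [probability_correct(s, model) for s in scores]
        if min(sum(post), len(post) - sum(post)) < 1e-9:
            break  # one component is empty; keep the last usable model
        # M-step: re-estimate the prior and both components
        model = {
            "prior": sum(post) / len(post),
            "good": weighted_mean_sd(scores, post),
            "bad": weighted_mean_sd(scores, [1.0 - p for p in post]),
        }
    return model


if __name__ == "__main__":
    scores = [0.8, 1.0, 1.1, 1.3, 1.2, 0.9, 3.4, 3.9, 4.5, 4.1, 3.6]
    model = fit_mixture(scores)
    for s in (1.3, 2.2, 3.4, 4.5):
        print(s, round(probability_correct(s, model), 3))
```

No labels are supplied; the split between correct and incorrect assignments emerges from the shape of the score distribution, which is what makes the approach unsupervised.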

Threshold Model: Bad Discrimination and Inconsistency
• Sensitivity: fraction of all correct results passing filter
• Error rate: fraction of all results passing filter that are incorrect
• Figure: SEQUEST thresholds (from literature) plotted relative to the ideal spot
• Test data: A. Keller et al. OMICS 6(2), 207 (2002)

Discriminating Power of Peptide Prophet
• Sensitivity: fraction of all correct results passing filter
• Error rate: fraction of all results passing filter that are incorrect
• Figure: SEQUEST thresholds (from literature) and the probability model plotted relative to the ideal spot
• Improved discrimination: more identifications (for the same error rate)
Keller et al. Anal. Chem. 2003
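Both quantities defined above can be computed for any filter when the true status of each result is known, as with a control data set. A minimal sketch, with a hypothetical function name and made-up labeled results:

```python
def sensitivity_and_error_rate(results, passes):
    """Evaluate a filter on labeled results.

    results: list of (score, is_correct) pairs
    passes:  function taking a score and returning True if it passes the filter
    """
    total_correct = sum(1 for _, ok in results if ok)
    passed = [(score, ok) for score, ok in results if passes(score)]
    passed_correct = sum(1 for _, ok in passed if ok)
    sensitivity = passed_correct / total_correct if total_correct else 0.0
    error_rate = (len(passed) - passed_correct) / len(passed) if passed else 0.0
    return sensitivity, error_rate


if __name__ == "__main__":
    # Made-up labeled results for illustration
    labeled = [(4.5, True), (3.4, True), (2.6, False),
               (2.2, False), (2.1, True), (1.3, False)]
    for cutoff in (2.0, 3.0):
        sens, err = sensitivity_and_error_rate(labeled, lambda s: s > cutoff)
        print(f"cutoff {cutoff}: sensitivity {sens:.2f}, error rate {err:.2f}")
```

Sweeping the cutoff traces out the sensitivity versus error-rate curve; the ideal spot is sensitivity 1 with error rate 0.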

Protein Identification
>sp|P02754|LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG) (ALLERGEN BOS D 5) - Bos taurus (Bovine).
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
• TPEVDDEALEK: p = 0.96
• TPEVDDEALEKFDK: p = 0.96
• KPTPEGDLEILLQK: p = 0.83
• LSFNPTQLEEQCHI: p = 0.65
• LSFNPTQLEEQCHI: p = 0.76
sp|P02754|LACB_BOVIN Probability = ???
ProteinProphet™ software combines probabilities of peptides assigned to MS/MS spectra to compute accurate probabilities that corresponding proteins are present
Nesvizhskii et al. Anal. Chem. (2003) 75: 4646-58.

Issues for Protein Identification
• Many peptides are present in more than a single database protein entry
  ProteinProphet apportions such peptides among all corresponding proteins to derive simplest list of proteins that explain observed peptides
• Peptides corresponding to ‘single-hit’ proteins are less likely to be correct than those corresponding to ‘multi-hit’ proteins
  ProteinProphet learns by how much peptide probabilities should be adjusted to reflect this protein grouping information
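One simple way to turn peptide probabilities into a protein probability is to ask how likely it is that at least one of the protein's peptide assignments is correct. The sketch below does only that, under the assumption that distinct peptides are independent evidence and that repeated identifications of the same peptide count once (highest probability kept). It deliberately leaves out the two adjustments listed above, the apportioning of shared peptides and the single-hit versus multi-hit correction, so it is not a reimplementation of ProteinProphet. The function name is illustrative.

```python
def protein_probability(peptide_hits):
    """Probability that at least one distinct peptide assignment is correct.

    peptide_hits: list of (peptide_sequence, probability) pairs for one protein.
    Assumes independence between distinct peptides (a simplification).
    """
    best = {}
    for peptide, p in peptide_hits:
        # Repeated identifications of one peptide are not independent evidence
        best[peptide] = max(p, best.get(peptide, 0.0))
    prob_all_wrong = 1.0
    for p in best.values():
        prob_all_wrong *= (1.0 - p)
    return 1.0 - prob_all_wrong


if __name__ == "__main__":
    # Peptide probabilities listed for beta-lactoglobulin (LACB_BOVIN)
    hits = [
        ("TPEVDDEALEK", 0.96),
        ("TPEVDDEALEKFDK", 0.96),
        ("KPTPEGDLEILLQK", 0.83),
        ("LSFNPTQLEEQCHI", 0.65),
        ("LSFNPTQLEEQCHI", 0.76),
    ]
    print(round(protein_probability(hits), 5))  # about 0.99993 with these inputs
```

With several high-probability peptides the protein probability approaches 1, while a protein supported by a single peptide never scores higher than that one peptide.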

Amplification of False Positive Error Rate from Peptide to Protein Level
• 10 peptide assignments (Peptide 1 to Peptide 10), 5 correct (+)
• Proteins in the sample (enriched for ‘multi-hit’ proteins): Prot A, Prot B
• Proteins not in the sample (enriched for ‘single hits’): 5 proteins
• Peptide level: 50% false positives
• Protein level: 71% false positives
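The jump from 50% to 71% is a counting effect: correct peptides cluster on a few proteins, while each incorrect peptide tends to point at a different protein. The sketch below reproduces the two rates. The totals (10 peptides, 5 correct, 2 proteins present, 5 absent) come from the slide; which peptide maps to which protein, and the names Prot C to Prot G, are assumptions made for illustration.

```python
def false_positive_rates(assignments):
    """Return (peptide-level, protein-level) false positive rates.

    assignments: list of (peptide, protein, peptide_is_correct) tuples.
    A protein counts as correct if at least one of its peptides is correct.
    """
    wrong_peptides = sum(1 for _, _, ok in assignments if not ok)
    proteins = {protein for _, protein, _ in assignments}
    correct_proteins = {protein for _, protein, ok in assignments if ok}
    wrong_proteins = proteins - correct_proteins
    return wrong_peptides / len(assignments), len(wrong_proteins) / len(proteins)


if __name__ == "__main__":
    # Assumed mapping: 5 correct peptides on 2 proteins, 5 incorrect single hits
    assignments = [
        ("Peptide 1", "Prot A", True),
        ("Peptide 2", "Prot A", True),
        ("Peptide 3", "Prot B", True),
        ("Peptide 4", "Prot B", True),
        ("Peptide 5", "Prot B", True),
        ("Peptide 6", "Prot C", False),
        ("Peptide 7", "Prot D", False),
        ("Peptide 8", "Prot E", False),
        ("Peptide 9", "Prot F", False),
        ("Peptide 10", "Prot G", False),
    ]
    peptide_fp, protein_fp = false_positive_rates(assignments)
    print(f"peptide level: {peptide_fp:.0%}, protein level: {protein_fp:.0%}")
```

Five wrong proteins out of seven gives 5/7, about 71%, even though only half of the peptide assignments are wrong.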

Serum Protein Identifications from Large-scale (~375 run) Experiment
Data filter                  # ids   # non-single hits   # single-hits
Publ. threshold model #1     2257    359                 1898
Publ. threshold model #2     2742    441                 2301
ProteinProphet, p ≥ 0.5      713     511                 202
(ProteinProphet predicted error rate: 7%)
Reference: H. Zhang et al., in prep

Consistency of Manual Validation of SEQUEST Search Results
Figure: search results by manual authenticator; categories: incorrect validation, correct validation, validation withheld

Data Analysis Pipeline
Tasks for a proteomic analysis pipeline, with tools:
• Suitable input: mzXML
• Peptide assignment: COMET, ProbID
• Validation: PeptideProphet
• Protein assignment: ProteinProphet
• Quantitation: ASAPRatio
• Interpretation: SBEAMS, Cytoscape

Data Analysis Summary
• Processing of data collected from different platforms, samples, experiments, operators requires transparent methods to score data
• Publication and relational database analysis require consistently scored data
• Tools assigning probability-based scores are essential
• Openly accessible, transparent (OS) tools bring in new talent and lead to community-improved tools
Nesvizhskii and Aebersold (2004) Drug Discov Today. 9:173-81
http://www.proteomecenter.org/software.php

Mock-treated vs. IFN-treated (Wei Yan et al.)
• ICAT label: C12/C13
• HPLC-MS/MS

Result table columns: Name, Cellular pathway, Probability, ASAPRatio Mean, ASAPRatio Std.
• 54 IFN-induced proteins (2-fold): 15 previously reported, 39 novel
• 23 IFN-repressed proteins (0.5-fold)
Identifications by fraction:
           S100   P100   P3    Sum    Unique ID
P ≥ 0.9    523    270    671   1464   1113
P ≥ 0.4    590    330    748   1668   1272
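Sorting proteins into induced and repressed groups is a cutoff on the IFN/mock abundance ratio. A minimal sketch using the 2-fold and 0.5-fold cutoffs from the slide; whether the cutoffs are inclusive is an assumption, and the function name and example ratios are illustrative.

```python
from collections import Counter


def classify_ratio(ratio, induced_cutoff=2.0, repressed_cutoff=0.5):
    """Classify an IFN/mock ratio (cutoffs assumed to be inclusive)."""
    if ratio >= induced_cutoff:
        return "induced"
    if ratio <= repressed_cutoff:
        return "repressed"
    return "unchanged"


if __name__ == "__main__":
    # Made-up ratios for illustration
    ratios = {"protein_1": 3.96, "protein_2": 1.40, "protein_3": 0.42}
    calls = {name: classify_ratio(r) for name, r in ratios.items()}
    print(calls)
    print(Counter(calls.values()))
```

In practice the standard deviation of each ratio would also be considered before calling a change, since a ratio of 2.1 ± 0.6 is weaker evidence than 2.1 ± 0.1.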

Lots of data - what does it mean?

Interferon (IFN) Pathway 2.215 ± 0.079 IFN / Mock PKR 3.963 ± 0.659 2’,5’-OAS 2.460 ± 0.076 Mx 2.359 ± 0.149 ADAR 1.398 ± 0.118 IRFs Not identified MHC β2-microglobulin (MHC I) 2.768 ± 0.583 2.219 ± 0.183 IFI-30 (MHC II) Katze et al (2002) 2: 675

GO Analysis of Interferon regulated proteins
Physiological process, by GO level (levels 3 to 12 shown):
• Level 3: Response to external stimulus; Response to stress; Death; Cell growth and/or maintenance; Metabolism; Pathogenesis
• Level 4: Cell death; Cell organization; Catabolism; Nitrogen metabolism; Transport; Cell growth
• Level 5: Cytoplasm organization; Nuclear organization; DNA metabolism; Defense response
• Level 6: Fatty acid metabolism; Amino acid metabolism; Immune response
• Other labels on the slide: Cell growth and/or maintenance; Metabolism; Cellular defense response

Islands of intense knowledge in ocean of unknown
Figure labels: hormone responses, cell motility, energy metabolism, transcription

Charting the path between landmarks
Figure labels: hormone responses, cell motility, energy metabolism, transcription, unassigned observations

Walking down the interaction map
Figure: interaction network with nodes A, B, C, D, E, F, G, H, I

First round of TAP-tagging: Identification of IGBP1 and TIP41 interactors (Anne-Claude Gingras)
• Tagged proteins: IGBP1, TIP41
• CCT complex: TCP1, CCT2, CCT3, CCT4, CCT5, CCT6A, CCT7, CCT8
• Catalytic subunits of PP2A-type phosphatases: PPP2CA, PPP2CB, PPP4C, PPP6C
• Uncharacterized proteins: PPP4R2*, PPP6R1*, PPP6R2A*
