RESULTS FROM THE HUPO PLASMA PROTEOME PROJECT: CORE DATASET OF 3020 PROTEINS

RESULTS FROM THE HUPO PLASMA PROTEOME PROJECT: CORE DATASET OF 3020 PROTEINS • US HUPO Symposium • 14 March, 2005 • Gilbert S. Omenn, David States, Dan Chan, Richard Simpson, Henning Hermjakob, and Sam Hanash, on behalf of the HUPO PPP Investigators

Protein DNA

LONGTERM SCIENTIFIC GOALS OF HUPO PPP • Comprehensive analysis of plasma and serum • protein constituents in people • Identification of biological sources of variation • within individuals over time, with validation of • biomarkers • Physiological: age, sex/menstrual cycle, exercise • Pathological: selected diseases/special cohorts • Pharmacological: common medications • 3. Determination of the extent of variation across • populations and within populations

Scheme Showing Aims and Linkages of the HUPO Plasma Proteome Project Serum vs Plasma Technology Platforms--Separation and Identification Reference Specimens HUPO HUMAN PLASMA PROTEOME PROJECT (PPP) Development & Validation of Biomarkers HUPO PPP Participating Labs Technology Vendors Liver and Brain Proteome Projects Omenn GS. The Human Proteome Organization Plasma Proteome Project Pilot Phase: Reference Specimens, Technology Platform Comparisons, and Standardized Data Submissions and Analyses. Proteomics 2004;4:1235-1240.

PPP TECHNICAL COMMITTEE STRUCTURE • Reference Specimens and Specimen Handling Issues • (Dan Chan, chair) • Technology Platforms & Protocols (Richard Simpson) • Database Development and Links with EBI (HUPO/PSI) • (David States, Henning Hermjakob) • Population Cohorts/Specimen Banks (Gerard Siest) • Education & Training Committee (Peipei Ping) • Executive Committee (including Partnerships) (Omenn)

SERUM AND PLASMA REFERENCE SPECIMENS • BD: specially prepared male/female pooled samples, divided into EDTA-, Heparin-, and Citrate-anti-coagulated Plasma and Serum (250 ul x4 of each). • BD clot activator. No protease inhibitors. Three separate ethnic pools prepared. Shipped frozen. • 2. Chinese Academy of Medical Sciences: Sets of three • plasmas + serum, similar to BD protocol. • 3. National Institute for Biological Standards & Control, • UK: citrate-anti-coagulated, freeze-dried plasma, from • 25 donors, prepared for Intl Soc Thrombosis & • Hemostasis, 1 ml aliquots/ampoules.

UPDATED SUMMARY OF PPP LABS • 31 Total Participating Labs (18 US, 13-International): • 9 – US Academic, 3 – US Federal, 6– US Corporate • 4 – Europe, 1– Israel, 6 – Asia, 2 – Australia • LC-MS/MS datasets from 18; MALDI-MS from 5; SELDI-MS from 8; antibody arrays/immunoassays from 4 • Number that analyzed various reference specimens: • 9 – UK NIBSC, 26 – BD b1, Caucasian-American • 9 – BD b2/b3, African- and Asian-American, 5 -CAMS

Arie Admon, Technion, Haifa, Israel Ruedi Aebersold, Institute for Systems Biology, Seattle William Hancock, Barnett Institute, Northeastern Univ Stan Hefta, Bristol-Myers Squibb, NJ Helmut Meyer, Ruhr University Bochum Gil Omenn/Sam Hanash/Phil Andrews/Mike Pisano, MI Young Ki Paik, Yonsei Research Center, Korea John Peltier, Myriad Proteomics Inc. Peipei Ping, UCLA Joel Pounds, Pacific Northwest Natl Lab Xiaohong Qian, Beijing Institute of Radiation Medicine Richard Simpson, Ludwig Institute for Cancer Research David Speicher, Wistar Institute Rong Wang, Mass Spec Proteomics Lab, Mount Sinai Valerie Wasinger, Univ of New South Wales Chi Yue Wu, Institute of Biol Chem, Acad Sinica, Taiwan Xiaohang Zhao, Natl Lab of Molecular Oncology, CAMS Robert Gerszten, Harvard/Erik Forsberg, Amersham-GE

Immunoassay Labs Brian Haab, Van Andel Research Institute Frank Vitzthum, Dade Behring, Marburg GMBH, Germany Mark Driscoll, Molecular Staging Inc Bernhard Geierstanger, Genomics Inst of Novartis Research Fdn MALDI-MS Labs AlexanderArchakov, Institute of Biomedical Chemistry, Moscow, Erik Forsberg, Amersham Biosciences, Uppsala, Sweden Young-ki Paik, Yonsei Proteome Research Center Akira Tsugita, Proteomics Research Lab, Tsukuba, Japan SELDI Labs Bao-Ling Adam, Univ of Georgia Alexander Archakov, Institute of Biomedical Chemistry, Moscow Dan Chan/Alex Rai, Johns Hopkins Kenneth Greis, Procter & Gamble Eastwood Leung, Genome Institute of Singapore Sandra McCutcheon-Maloney/Brett Chromy, Lawrence Livermore Lab William Morgan, Univ of Missouri-KC, Jean-Charles Sanchez, Geneva Proteomics Research Center Paul Stemmer, Wayne State University

SPECIFICATIONS FOR DATA SUBMISSION • Each lab was instructed (July, 2003) to provide • a) a detailed experimental protocol, to “push the limits” to detect low-abundance proteins • b) peptide sequences, rated as “high” or “lower” confidence, based on MS/MS criteria • c) protein IDs from IPI 2.21 (July 2003) and search engine used to align peptide sequences with proteins in human database • Later, we requested m/z peak lists and raw spectra (by CD or DVD); search parameters

CRITERIA FOR HIGH CONFIDENCE IDENTIFICATION OF PEPTIDES, ILLUSTRATED WITH SEQUEST Xcorr: singly-charged ion, >=1.9 doubly-charged ion, >=2.2 triply-charged ion, >=3.75 Delta Cn >= 0.1; Rsp <= 4 Fully tryptic

Database Design and Implementation • RDBMS • Stable, proven technology • Data validation • Commercial package • Microsoft SQL Server • Stable and supported • Full RDBMS functionality • Transactions • Referential integrity checks • Effective development tools • GUI • Cross-tab extension Laboratories Samples Methods IdentificationsLaboratorySampleMethodDatabase IDPeptides

University of Michigan David States Marcin Adamski Thomas Blackwell Rajasree Menon Yin Xu EBI – England Rolf Apweiler Henning Hermjakob Chris Taylor Nicky Mulder Sandra Orchard Ludwig Institute - Australia Richard Simpson Eugene Kapp James Eddes Institute for Systems Biology Jimmy Eng Alexey Nesvizhskii Technion/IBM, Ilan Beer Bioinformatics Acknowledgements

Integration Algorithm (Adamski et al) • Objectives: • o Integrate results from disparate instruments, search • engines, and specimens • o Evaluate concordance between results from • different laboratories • o Reduce ambiguity and redundancy of the identifications • o Select accession numbers of the most representative and • protein for those matching equally. • We designed a workflow that uses sequences of identified peptides, rather than submitted protein accession numbers.

Numbers of Proteins Identified (LC-MS/MS or FTICR-MS) • From 15,519 reported distinct protein IDs in IPI 2.21, chose representative protein for clusters: • (a) all protein IDs (high and lower conf) • 9504 = 1 or more peptide matches (>=6 aa) • 3020 = 2 or more peptide matches [1274 = 3+] • [2580 in plasma x3; 2353 in serum; 1913 in both] • (b) all protein IDs (high conf peptides, only) • 2852 = 1 or more peptide matches • http://www.bioinformatics.med.umich.edu/app1/ • MsSqlAccess [UM] and www.ebi.ac.uk/pride [EBI]

Distribution of protein identifications in function of peptides detected per protein 10,000 all identifications - left axis confirmed identifications 8,000 6,000 number of identified proteins 4,000 25% 75% 2,000 86% 91% 92% 97% 0 ≥ 1 ≥ 2 ≥ 3 ≥ 4 ≥ 5 ··· // ··· ≥ 10 number of peptides per protein detected across experiments and laboratories

ESTIMATION OF ERROR RATE Poisson Model (States/Blackwell) Ndb, total non-redundant protein entries in IPI v2.21 (49,924) Lambda, proportion of matches false-positives Upper bound: all 9504 FP, = 0.211 Lower bound: accept 1920 high confidence single-peptide-based protein IDs, reject 4864 lower confidence, = 0.146 Pr (true positives): 4 peptides, 0.99 3 peptides, 0.95-0.98 2 peptides, 0.70-0.85 Use 2+ peptides to obtain more representative dataset.

Virtual 2D gel Proteins detected with at least 2 peptides All Detected Proteins All proteins in IPI 2.21

INDEPENDENT ANALYSES FROM RAW SPECTRA (#IDs with 2+ peptides) • Core Dataset (18 datasets, 3020) • Mascot/Digger (Kapp, Australia, 18 datasets, 3178) • PepMiner (Beer, Israel, 8 large datasets, 2902) • (c) PeptideProphet/ProteinProphet (Eng, USA, 5 datasets, 508) • Plus alternative integration scheme with Sequest (Eddes, Australia, 18 datasets, 2344)

GREATEST RESOLUTION AND SENSITIVITY • The most extensive high-confidence yield was from combined use of immunoaffinity (“top-6”) depletion, 2 or 3-D high-resolution fractionation, and then ESI-MS/MS with ion-trap LTQ instrument. • LTQ gave several fold more IDs than did LCQ in same hands (B1-serum vs B1-heparin).

SPECIFIC OBSERVATIONS: DEPLETION • Many investigators depleted albumin and/or immunoglobulins • Several obtained Agilent immunoaffinity column to remove Top-6 proteins. • Much higher numbers of identifications after depletion. • Inadvertent removal of other proteins open issue: LC-MS/MS vs gels; “sponge” effect of albumin. • Feasible to assay both flow-through & bound fractions.

Example of Depletion Analysis Echan/Speicher • Immunoaffinity/Top-6 polyclonal (Agilent) • o Column for HPLC • o Spin column • Two-antibody spin column (Proteoprep); IgY • Cibacron Blue (for albumin) • Protein A or G (for immunoglobulins • Top-6 best: 85% of protein removed; least non-target removal (lots of fragments of top 6); • few “new” proteins on 2D gel despite 10-20X loading • Suggest depleting 12-20 proteins OR using multi-dimensional (microSol IEF) fractionation

Glycoprotein-Enriched Subproteomes[Hancock presentation this afternoon] Methods Lab 2Lab 11 Enrichment hydrazide chem lectin chrom’y Peptide Fxn SCX + RP RP Mass Spec qtof deca-xp Search engine Seq/ProteinProphet Sequest Protein IDs 222 83 in B1-serum [51 in common] Of total 254, 164 found among data from 11 other labs without glycoprotein enrichment.

A B First dimension fraction numbers (relative pI) and estimated MW of identified proteins. Left (A): 39 locations with complement component 3 precursor (C3); (B):14 locations with clusterin (CLU).

INFLUENCE OF ABUNDANCE • Using quantitative immunoassays and microarrays (generally unknown epitopes), we have found very high rates of detection of the more abundant proteins, less in the mid-range, and occasional detection of very low abundance proteins, as expected. • High correlation (r=0.9) between # peptides and measured concentrations.

log10(N)=0.365* log10(conc)-0.711; r2 = 0.92

Least Abundant Proteins Identified with two distinct peptides(pg/ml: range 200 pg/ml to 20 ng/ml) • Alpha fetoprotein 2.9E+-02 • TNF-R-8 3.3E+02 • TNF-ligand-6 1.5E+03 • PDGF-R alpha 4.6E+03 • Leukemia inhibitory factor receptor 5.0E+03 • MMP-2/gelatinase 8.8E+03 • EGFR 1.1E+04 • TIMP-1 1.4E+04 • IGFBP-2 1.5E+04 • Activated leukocyte adhesion mol 1.6E+04 • Selectin L [five labs;10 peptides] 1.7E+04

SPECIMEN VARIABLES • What evidence have we developed for choice of specimens for analysis? • Plasma preferred over serum • Citrate or EDTA preferred over Heparin for plasma • Protease inhibitors desirable, but complicated • Clot activator unnecessary (serum only) • Minimize freeze/thaw cycles (archives) • Avoid 4C step

SPECIMENS • The sets of four specimens from a given donor pool yielded rather similar numbers of proteins when analyzed identically. More fragmentation of serum. Little evidence of platelet in vitro contamination. • Quantitative immunoassays show generally 15-20% lower values for citrate-plasma, due to dilution and osmosis; no interference with or loss of identifiable proteins.

PROTEASES • Should anti-protease cocktails be used in specimen collection, or in a later step? • Advantages: reduce proteolytic degradation ex vivo; reduce complexity of peptides after tryptic digestion. • Disadvantages: adds peptides, as well as small molecules, to the mix to be analyzed; may covalently modify the proteins (ABESF does so).

BIOLOGICAL INSIGHTS • The proteins identified can be annotated by many methods. We have searched multiple databases, including Gene Ontology, Novartis Atlas, Online Mendelian Inheritance in Man (OMIM), incomplete or unidentified sequences in the human genome, microbial genomes, and protein domains. • Some examples follow.

Shown in the figure are the rates of occurrence of Gene Ontology terms in the HUPO PPP 3020 set relative to the frequency of occurrence of the same terms in the human genome. The solid line shows a linear regression estimate for the frequency that would be expected if the 3020 uniformly sampled the genome. The parallel dotted lines show 2 fold over and under representation relative to uniform sampling. The curved dashed lines show over and under representation by 3 standard deviations.

Over- and Under-Represented GO Terms • Over: extracellular, immune response, blood coagulation, lipid transport, complement activation, regulation of blood pressure; also, cytoskeletal proteins, receptors and transporters • Under: perception of smell (1 vs 25 expected); cation transporters, ribosomal proteins, G-protein coupled receptors, and nucleic acid binding proteins

OVER- AND UNDER-REPRESENTED DOMAINS IN INTERPRO FOR PPP vs FULL IPI DATASET • Over: EGF, intermediate filament protein, • sushi, thrombospondin, complement C1q, • and cysteine protease inhibitor • Under: Zinc finger (C2H2, B-box, RING), tyrosine protein phosphatase, tyrosine and serine/threonine protein kinases, helix-turn-helix motif, and IQ calmodulin binding region

GENE ONTOLOGY SPECIFIC TERMS • Over-represented in PPP 3020 (vs whole genome): “extracellular”, “immune response”, “blood coagulation”, “lipid transport”, “complement activation”, “regulation of blood pressure”, as expected; also: cytoskeletal proteins, receptors and transporters. • Proteins from most cellular locations and molecular processes are recognized. • Under-represented: “perception of smell” (1 vs 25 exp); cation transporters, ribosomal proteins, G-protein coupled receptors, and nucleic acid binding proteins.

InterPro Protein Domain Analysis • Compared with the whole human genome, the 3020 PPP proteins are: • Over-represented for EGF, intermediate filament protein, sushi, thrombospondin, complement C1q, and cysteine protease inhibitor, and • Under-represented Zinc finger (C2H2, B-box, RING), tyrosine protein phosphatase, tyrosine and serine/threonine protein kinases, helix-turn-helix motif, and IQ calmodulin binding region domains.

TRANSMEMBRANE AND SECRETED PROTEIN FEATURES • 1297 of 3020: • SwissProt Annotated ProFun Both • Transmembrane 230 151 104 • Secretion signal 373 420 358 • 1723 of 3020: ProFun Predicted • TM domain(s) 137 • Secretion signal 255

Cardiovascular-Related Proteins Biomarker Candidates in the PPP Database (Vondriska-Ping: presentation today) • Proteins characterized in eight groups: • Inflammation • Vascular • Signaling • Growth and differentiation • Cytoskeletal • Transcription factors • Channels • Receptors

PROTEINS FROM INHERITED CANCER DISORDERSLinking IPI IDs and Mendelian Inheritance in Man (OMIM)

IDENTIFICATION OF 94 NOVEL PEPTIDES USING WHOLE GENOME ORF SEARCH • States has enhanced the annotation of the Human Genome by identifying novel and cryptic genes not previously known to have protein products. Mass spectra peaklists from a subset of PPP labs were searched against all ORFs in NCBI Build 33 in all three reading frames and both strands, using X!Tandem. • A bonus of the PPP: protein to DNA mapping of the human genome!

COMPARISON WITH LITERATURE • Report#IDs#IPIin 3020in 9504 • Anderson 1175 990 316 471 • Shen [1682] 1842 213 526 • Chan 1444 1019 257 402 • Zhou 210 107 51 68

NEXT STEPS • We are in the homestretch on manuscripts from the Pilot Phase of the PPP for Special Issue of PROTEOMICS August 2005 & for Nature Biotech. • Plan potential future phases of PPP • a) Identify and perform critical experiments to support development of standardized procedures for specimens, fractionation, analysis. • b) Provide high-quality bioinformatics and database for plasma proteome datasets from all sources, assuring linkage with organ-proteomes. • c) Organize strategies, labs, and bioinformatics for large-scale studies, or play facilitation role.

DEFINE HIGH THROUGHPUT OPTIONS FOR LARGE-SCALE PROTEOMICS STUDIES (1) • Admon/Dongre: LC-MS with highly accurate mass and elution time parameters for peptide IDs • Combine with depletion; rely on very slow flow (2 hr) LC and accurate mass and elution characteristics for mass fingerprints, after building a high-quality mass x elution database.

DEFINE HIGH THROUGHPUT OPTIONS FOR LARGE-SCALE PROTEOMICS STUDIES (2) Mann, Beijing Congress (2004) Use MS (3) with FTICR for much greater precision of mass determination and for detection and localization of post-translational modifications. Probably convert to microarrays for high throughput of clinical and epidemiological specimens.

Genome-Wide Studies of Proteome (3) • Humphery-Smith (Proteomics 2004;4:2519-21) • Design and produce affinity ligands against conserved regions in each ORF for signal enrichment: antibodies, receptins, aptamers; sequence strings unencumbered by PTMs, uncleaved, near 5’ end, exposed at surface • Use ECL, rolling circles, isotopic labeling, and/or light scattering as readout technologies.

Large-Scale Proteomics Studies (4) • Aebersold (Nature 2003;422:115-116) • Go from discovery using MS to “browsing” using unique chemically-synthesized peptides tagged with heavy isotope for each gene and even each protein isoform. • Combine this standard peptide mixture with specimen fractions on sample plate for MS, examine double peaks (with the precise differential mass) in the ordered peptide array. • Try the method first on yeast.

HUPO PPP SUPPORT FROM NIH Trans-NIH Consortium Natl Cancer Inst: Div CancerPrevention; Div Cancer Treatment Natl Institute on Aging Natl Inst on Alcoholism & Alcohol Abuse Natl Inst on Diabetes, Digestive, & Kidney Diseases Natl Inst for Environmental Health Science Natl Inst for Neurologic Diseases & Stroke

RESULTS FROM THE HUPO PLASMA PROTEOME PROJECT: CORE DATASET OF 3020 PROTEINS