INTERNATIONAL COLLABORATION IN PROTEOMICS AND INFORMATICS

Bibliotheca Alexandrina, 9 October 2007 • Gilbert S. Omenn, M.D., Ph.D. • Center for Computational Medicine & Biology • Chair, HUPO Plasma Proteome Project • University of Michigan, Ann Arbor, MI, USA

It Is Such a Great Pleasure to Visit the Bibliotheca Alexandrina, One of the Wonders of the Modern World! • “The First Digital Library, from its Birth” • Facilitating International Collaboration in Science and Technology

Nearly-Complete Human Genome Sequence, 15-16 Feb 2001

We Live in a New World of Life Sciences • New Biology, New Technology: a “parts list” • Genome Expression Microarrays • Comparative Genomics + CNV + miRNA • Proteomics and Metabolomics • Bioinformatics & Computational Biology • Mechanism- & Evidence-Based Medicine: “What were you doing up to now?!” • Predictive, personalized, preventive, participatory healthcare and community health services

Key Components of the Vision of Biology As An Information Science • An avalanche of genomic information: validated SNPs, haplotype blocks, candidate genes/alleles, proteins, & metabolites associated with disease risk • Powerful computational methods • Effective linkages with better environmental and behavioral datasets for eco-genetic analyses • Credible privacy and confidentiality protections • Breakthrough tests, vaccines, drugs, behaviors, and regulatory actions to reduce health risks and cost-effectively treat patients globally

A Golden Age for the Public Health Sciences • Sequencing and analyzing the human genome is generating genetic information that must be linked with information about: • Nutrition and metabolism • Lifestyle behaviors • Diseases and medications • Microbial, chemical, physical exposures • Every discipline of the public health sciences is needed.

Definitions • Genetics is the scientific study of genes and their roles in health and disease, physiology, and evolution. • Genomics is a modern subset of the broader field of genetics, made feasible by remarkable advances in molecular biology, biotechnology, and computational sciences, to examine the entire complement of genes and their actions. • Global analyses permit us and require us to go beyond the known “lamp-posts” of individual gene associations and effects.

Proteins are the action molecules of the cell and the leading candidates for biomarkers—in tissues and in the blood. Proteins are coded for by genes. Understanding one protein can be a lifetime’s work! • Proteomics is the global analysis of proteins in cells or body fluids. Techniques for global analysis of proteins are advancing rapidly, especially for discovery of biomarkers for diagnosis, treatment, and prevention. • Metabolomics is the global analysis of metabolites. • Proteomics + metabolomics + epigenomics = “functional genomics”

[Figure labels: Protein, DNA]

Rationale for Proteomics • Proteins are much closer to the pathophysiologic changes and molecular targets for drugs than are mRNAs. • Changes in mRNA levels are clues, but they often correlate poorly with changes in the corresponding proteins. • Advances in fractionation of complex tissue and plasma protein mixtures, in mass spectrometry, and in curated databases of proteins help address complexity, dynamic range, and uncertainty of protein identifications.

A Vision For Proteomics • Multiple protein biomarkers discovered • Biomarkers combined on diagnostic chips • Detect organ location of cancers, for surgery or radiation • Detect mechanism of disease for chemotherapy, even if location unknown • Mechanistic, rather than “geographic” classification • Better efficacy/less toxicity for all types of patients

Status of Proteomics Assays • Many technology platforms of increasing sensitivity and resolution • Patterns or specific proteins still just biomarker candidates —most lack independent confirmation and coefficient of variation, let alone “validation” with standard clinical chemistry parameters of sensitivity, specificity, and especially positive predictive value • Approaches of clinical chemistry needed to guide further development of the field
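The emphasis on positive predictive value can be made concrete with a short calculation. The following is a minimal sketch, not part of the original slides: the function name and the example figures (95% sensitivity, 95% specificity, 1% prevalence) are hypothetical and chosen only to show how a seemingly accurate marker can still yield mostly false positives when the disease is rare.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Return PPV = TP / (TP + FP) for a test applied to a population.

    All three arguments are fractions between 0 and 1.
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical screening scenario (illustrative numbers, not from the slides).
ppv = positive_predictive_value(sensitivity=0.95, specificity=0.95,
                                prevalence=0.01)
print(f"PPV = {ppv:.3f}")
```

Under these assumed numbers, most positive results come from people without the disease, which is why PPV rather than sensitivity alone drives clinical usefulness.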

Barriers for Proteomic Cancer Biomarker Discovery in Plasma • Human cancers are very heterogeneous • Tumor proteins are in low abundance for early detection of cancers • Tumor proteins are greatly diluted upon release to ECF and blood • Plasma is an extraordinarily complex specimen dominated by high abundance proteins (50% by weight is albumin) • Knowledge of the plasma proteome is still limited

Outline of Lecture • Review of the vision, strategy, and output of the HUPO Human Plasma Proteome Project Pilot Phase • Objectives for the New Phase of the Plasma Proteome Project • Example of the power of computational tools and collaborations (if time)

HUPO • The international Human Proteome Organization (HUPO) was founded in 2001. Its aims are: • To advance the science of proteomics • To enhance training in proteomics • To build international initiatives by organ (liver, brain, kidney), biofluid (plasma, urine, CSF, saliva), and disease (cardiovascular, cancers), plus antibodies and data standards.

[Figure: Proteomics Interaction Map (Ruth McNally, sociologist)]

Samir Hanash, founding President of HUPO • Gil Omenn, leader of HUPO PPP

THE PLASMA PROTEOME • Advantages: The most available human specimen; the most comprehensive sample of tissue-derived proteins; the basis for a Disease Biomarkers Initiative tied to organ proteomes. • Specific Disadvantages: • Extreme complexity/enormous dynamic range • High risk of ex vivo modifications • Lack of highly standardized protocols • General Challenges: Inadequate appreciation of incomplete sampling by MS/MS; evolving annotations and unstable databases

Long-Term Scientific Goals of the HUPO Human Plasma Proteome Project • 1. Comprehensive analysis of plasma and serum protein constituents in people • 2. Identification of biological sources of variation within individuals over time, with validation of biomarkers: physiological (age, sex/menstrual cycle, exercise); pathological (selected diseases/special cohorts); pharmacological (common medications) • 3. Determination of the extent of variation across populations and within populations

[Figure: Scheme Showing Aims and Linkages of the HUPO Human Plasma Proteome Project (PPP), Pilot Phase. Elements: Serum vs Plasma; Technology Platforms (Separation and Identification); Reference Specimens; Development & Validation of Biomarkers; HUPO PPP Participating Labs; Technology Vendors; Liver and Brain Proteome, Antibody, Protein Stds Projects.] Omenn GS. The Human Proteome Organization Plasma Proteome Project Pilot Phase: Reference Specimens, Technology Platform Comparisons, and Standardized Data Submissions and Analyses. Proteomics 2004;4:1235-1240.

OUTPUT FROM PPP Pilot Phase • Special Issue Aug 2005, Proteomics, “Exploring the Human Plasma Proteome”: 28 papers—collaborative analyses and annotations, plus lab-specific analyses, and Wiley book (2006) • Publicly-accessible datasets: • www.ebi.ac.uk/pride [EBI] www.peptideatlas.org/repository [ISB] • www.bioinformatics.med.umich.edu/hupo/ppp • Additional papers are encouraged: • Nature Biotechnology 2006; 24:333-338 (States et al) • Genome Biology 2006;7:R35 (Fermin et al) • Proteomics 2006; 6: 5662-5673 (Omenn) • Numerous citations/comparisons of datasets

SERUM AND PLASMA REFERENCE SPECIMENS • 1. BD: specially prepared male/female pooled samples, divided into EDTA-, heparin-, and citrate-anticoagulated plasma and serum (250 µl x 4 of each). BD clot activator. No protease inhibitors. Three separate ethnic pools prepared. Shipped frozen. • 2. Chinese Academy of Medical Sciences: sets of three plasmas + serum, similar to BD protocol. • 3. National Institute for Biological Standards & Control, UK: citrate-anticoagulated, freeze-dried plasma from 25 donors, prepared for Intl Soc Thrombosis & Hemostasis, 1 ml aliquots/ampoules.

Specifications for Data Submission • Each of 55 labs agreed (July, 2003 Workshop) to provide, and 31 labs did provide: • a) a detailed experimental protocol, to “push the limits” to detect low-abundance proteins • b) peptide sequences, rated as “high” or “lower” confidence, based on MS/MS criteria • c) protein IDs from IPI 2.21 (July 2003) and search engine parameters used to align peptide sequences with proteins in human database • Later, we obtained m/z peak lists and raw spectra (by DVD) for independent analyses.

From Peptides to Genome Annotation • [Figure: workflow. Sample → extraction → Proteins → digestion → Peptides → LC-MS/MS → Mass Spectrum (m/z) → database search → table of Spectrum / Peptide / Probability (e.g., Spectrum 1, LGEYGH, 1.0; Spectrum N, EIQKKF, 0.3) → statistical filtering → SBEAMS / PeptideAtlas Database → BLAST against protein database → map to genome (e.g., peptide PAp00007336, chromosome X, start 132217318, end 132217368) → visualization in Genome Browser]
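The statistical filtering step in this workflow can be illustrated with a minimal sketch. The two records reuse the peptide/probability pairs shown on the slide (LGEYGH at 1.0, EIQKKF at 0.3); the class name, function names, and the 0.9 cutoff are assumptions made for illustration and are not the actual PeptideAtlas settings.

```python
from typing import List, NamedTuple


class PeptideSpectrumMatch(NamedTuple):
    spectrum_id: str
    peptide: str
    probability: float  # probability that the peptide assignment is correct


def filter_by_probability(psms: List[PeptideSpectrumMatch],
                          min_probability: float = 0.9) -> List[PeptideSpectrumMatch]:
    """Keep only matches at or above the cutoff (0.9 is an assumed value)."""
    return [psm for psm in psms if psm.probability >= min_probability]


def expected_false_positives(psms: List[PeptideSpectrumMatch]) -> float:
    """Estimate the number of wrong assignments as the sum of (1 - p)."""
    return sum(1.0 - psm.probability for psm in psms)


psms = [
    PeptideSpectrumMatch("Spectrum 1", "LGEYGH", 1.0),
    PeptideSpectrumMatch("Spectrum N", "EIQKKF", 0.3),
]
kept = filter_by_probability(psms)
print([psm.peptide for psm in kept], expected_false_positives(kept))
```

Summing (1 - p) over the retained matches gives a rough estimate of how many of them are expected to be incorrect, which is one way to choose the cutoff.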

Numbers of Proteins Identified (LC-MS/MS or FTICR-MS, 18 labs) • From 15,519 reported distinct protein IDs in IPI 2.21, we chose one representative/cluster: • (a) 9504 = 1 or more peptide matches • (b) 3020 = 2+ peptide matches (Core Dataset) • (c) 1274 = 3 or more peptide matches • 889 = follow-up high-stringency analysis with adjustments for protein length and multiple (43,000) comparisons in IPI v2.21 • (Nature Biotech 2006; 24:333-338)
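The tiers above (1 or more, 2 or more, 3 or more peptide matches) amount to counting distinct peptides per protein and applying increasingly strict cutoffs. A minimal sketch follows; the protein identifiers and peptide strings are invented toy data, and the function names are hypothetical rather than taken from the PPP analysis code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def peptides_per_protein(identifications: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each protein ID to its set of distinct peptide sequences."""
    peptides: Dict[str, Set[str]] = defaultdict(set)
    for protein_id, peptide in identifications:
        peptides[protein_id].add(peptide)
    return dict(peptides)


def count_by_stringency(identifications: Iterable[Tuple[str, str]],
                        thresholds: Tuple[int, ...] = (1, 2, 3)) -> Dict[int, int]:
    """Count proteins supported by at least t distinct peptides, for each t."""
    peptides = peptides_per_protein(identifications)
    return {t: sum(1 for seqs in peptides.values() if len(seqs) >= t)
            for t in thresholds}


# Toy input: (protein ID, peptide) pairs; a repeated peptide counts once.
toy = [
    ("P1", "AAAK"), ("P1", "CCCR"), ("P1", "DDDK"), ("P1", "AAAK"),
    ("P2", "EEEK"), ("P2", "FFFR"),
    ("P3", "GGGK"),
]
print(count_by_stringency(toy))
```

Raising the threshold trades coverage for confidence, which is the same trade-off behind the 9504, 3020, and 1274 figures on the slide.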

GREATEST RESOLUTION AND SENSITIVITY • The most extensive high-confidence yield was from combined methods of immunoaffinity (“top-6”) depletion, 2 or 3-D high-resolution fractionation, and then ESI-MS/MS with ion-trap LTQ instrument. • LTQ gave several fold more IDs (1168) than did LCQ (271) in same hands (B1-serum vs B1-heparin) and obtained multiple peptides for many proteins which had just one hit with LCQ.

SPECIFIC OBSERVATIONS: DEPLETION • Many investigators depleted albumin and/or immunoglobulins • Several were provided Agilent immunoaffinity column to remove “top-6” proteins • Much higher numbers of identifications after depletion if sufficient fractionation • Inadvertent removal of other proteins; “sponge” effect of albumin • Assay both flow-through & bound fractions

SPECIMEN VARIABLES • What evidence have we developed for choice of specimens for analysis? • Plasma preferred over serum: more consistent, less degradation • EDTA-plasma preferred over heparin-plasma (interferences) and citrate-plasma (dilution) • Clot activator is necessary only for serum • Minimize freeze/thaw cycles (archives) • Minimal evidence of platelet activation [4C] • Protease inhibitors desirable, but alter proteins

INFLUENCE OF ABUNDANCE • Using quantitative immunoassays and microarrays (generally unknown epitopes), we have found very high rates of detection of the more abundant proteins, less in the mid-range, and occasional detection of very low abundance proteins, as expected. • High correlation (r=0.9) between # peptides and measured concentrations

Least Abundant Proteins Identified with Two Distinct Peptides (pg/ml; range 200 pg/ml to 20 ng/ml) • Alpha fetoprotein 2.9E+02 • TNF-R-8 3.3E+02 • TNF-ligand-6 1.5E+03 • PDGF-R alpha 4.6E+03 • Leukemia inhibitory factor receptor 5.0E+03 • MMP-2/gelatinase 8.8E+03 • EGFR 1.1E+04 • TIMP-1 1.4E+04 • IGFBP-2 1.5E+04 • Activated leukocyte adhesion mol 1.6E+04 • Selectin L [five labs; 10 peptides] 1.7E+04

BIOLOGICAL INSIGHTS • The proteins identified can be annotated by many methods. We have searched multiple databases, including Gene Ontology, Novartis Atlas, Online Mendelian Inheritance in Man (OMIM), incomplete or unidentified sequences in the human genome, microbial genomes, InterPro protein domains, transmembrane domains, secretion signals. • See Proteomics 2005; 5:3226-3519; Wiley, 2006

GENE ONTOLOGY SPECIFIC TERMS • Over-represented in PPP 3020 (vs whole genome): “extracellular”, “immune response”, “blood coagulation”, “lipid transport”, “complement activation”, “regulation of blood pressure”, as expected; also: cytoskeletal proteins, receptors and transporters. • Proteins from most cellular locations and molecular processes are recognized. • Under-represented: “perception of smell” (1 vs 25 exp); cation transporters, ribosomal proteins, G-protein coupled receptors, and nucleic acid binding proteins.
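A comparison such as "1 vs 25 exp" rests on an expected count: the number of proteins with a given annotation that a random sample of the same size would contain. The sketch below shows that arithmetic and one common way to attach a probability to it, a hypergeometric lower tail. The slides do not state which test was used, so the choice of test, the function names, and the background figures (30,000 genes, 250 annotated) are assumptions for illustration only.

```python
from fractions import Fraction
from math import comb


def expected_count(sample_size: int, annotated_in_background: int,
                   background_size: int) -> float:
    """Expected number of annotated proteins in a random sample."""
    return sample_size * annotated_in_background / background_size


def hypergeom_lower_tail(observed: int, background_size: int,
                         annotated_in_background: int, sample_size: int) -> float:
    """P(X <= observed) when sampling without replacement from the background."""
    total = comb(background_size, sample_size)
    upper = min(observed, sample_size)
    favourable = sum(
        comb(annotated_in_background, i)
        * comb(background_size - annotated_in_background, sample_size - i)
        for i in range(upper + 1)
    )
    return float(Fraction(favourable, total))


# Hypothetical background: 30,000 genes, 250 carrying the annotation.
exp = expected_count(3020, 250, 30000)
p_under = hypergeom_lower_tail(1, 30000, 250, 3020)
print(f"expected = {exp:.1f}, P(X <= 1) = {p_under:.3g}")
```

A very small lower-tail probability indicates under-representation; the mirror-image upper tail would be used for over-represented terms.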

InterPro Protein Domain Analysis • Compared with the whole human genome, the 3020 PPP proteins are: • Over-represented for EGF, intermediate filament protein, sushi, thrombospondin, complement C1q, and cysteine protease inhibitor. • Under-represented: Zinc finger (C2H2, B-box, RING), tyrosine protein phosphatase, tyrosine and serine/threonine protein kinases, helix-turn-helix motif, and IQ calmodulin binding region domains.

TRANSMEMBRANE AND SECRETED PROTEIN FEATURES
1297 of 3020:
                     SwissProt Annotated   ProFun   Both
  Transmembrane      230                   151      104
  Secretion signal   373                   420      358
1723 of 3020:
                     ProFun Predicted
  TM domain(s)       137
  Secretion signal   255

Cardiovascular-Related Proteins Biomarker Candidates in the PPP Database • Proteins characterized in eight groups: • Inflammation • Vascular • Signaling • Growth and differentiation • Cytoskeletal • Transcription factors • Channels • Receptors

Comparison of Five Search Algorithms • Using PPP data, Kapp et al (Proteomics 2005) found Sequest and Spectrum Mill more sensitive and MASCOT, Sonar, and X!Tandem more specific for peptide identifications at specified false-positive rates. • Some investigators have reported using combinations of two or more search engines. Decision rules are necessary.

Can We Overcome the Idiosyncrasies of Individual Instruments and Laboratories? • Several informatics investigators approached the human PPP with an offer to re-analyze the complete MS/MS datasets using their own software and criteria from the raw spectra (or peaklists). • These analyses eliminated the heterogeneity of search algorithms, search parameters, and idiosyncrasies of individual labs. • The results are hard to compare, given different extent of analysis. However, each can be compared with the Core Dataset.

Independent Analyses from Raw Spectra (#IDs with 2+ peptides) • Core Dataset (18 datasets, 3020) • PepMiner (Beer, 8 large datasets, 2895) [1051 in 3020 dataset, + 700 in the 9504] • X!Tandem (Beavis/States, 18 datasets, 2678) [577 in the 3020; 218 in the 889] • PeptideProphet/ProteinProphet (Deutsch, 7+ datasets, 960) [479 in 3020] • Mascot/Digger (Kapp, Australia, 14 datasets, 513 with 1.4% error rate; ongoing analysis)

What is Required and Feasible to Enhance the Statistical Robustness of Findings? • Many complex proteomics analyses are done once, without replicates required to estimate coefficient of variation or other standard parameters for clinical chemistry use. • “Five to ten independent repetitions of the experiments are a must” [Hamacher et al, Proteomics in Drug Discovery, 2006]. • How should we determine how similar or different are samples A and B, or the results of methods X and Y? What decision rules apply? • We have a long way to go from discovery research to clinical applications.
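The coefficient of variation mentioned here can only be estimated from replicate measurements. A minimal sketch, using invented replicate values and a hypothetical function name:

```python
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Return the CV in percent: sample standard deviation divided by the mean."""
    if len(values) < 2:
        raise ValueError("at least two replicates are needed")
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100.0


# Five hypothetical replicate measurements of one protein (arbitrary units).
replicates = [10.0, 12.0, 11.0, 9.0, 8.0]
print(f"CV = {coefficient_of_variation(replicates):.1f}%")
```

A single run gives no standard deviation at all, which is why analyses done once cannot report this parameter.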

Comparison of 5 Published Reports on Plasma Proteins with HUPO PPP Datasets
  Report     # IDs    # IPI   in 3020   in 9504
  Anderson   1175     990     316       471
  Shen       [1682]   1842    213       526
  Chan       1444     1019    257       402
  Zhou       210      148     51        88
  Rose       405      287     142       159
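The counts in this comparison are easier to interpret as fractions of each report's IPI-mapped identifications. The sketch below uses the numbers from the slide; reading the columns as "# IPI", "in 3020", and "in 9504" is an interpretation of the flattened header, and the variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (number of IPI-mapped IDs, found in PPP-3020, found in PPP-9504), from the slide.
REPORTS: Dict[str, Tuple[int, int, int]] = {
    "Anderson": (990, 316, 471),
    "Shen": (1842, 213, 526),
    "Chan": (1019, 257, 402),
    "Zhou": (148, 51, 88),
    "Rose": (287, 142, 159),
}


def overlap_percentages(reports: Dict[str, Tuple[int, int, int]]) -> Dict[str, Tuple[float, float]]:
    """Percent of each report's IPI-mapped IDs found in the 3020 and 9504 sets."""
    return {
        name: (100.0 * in_core / n_ipi, 100.0 * in_all / n_ipi)
        for name, (n_ipi, in_core, in_all) in reports.items()
    }


percentages = overlap_percentages(REPORTS)
for name, (pct_core, pct_all) in percentages.items():
    print(f"{name:<9} {pct_core:5.1f}% in 3020   {pct_all:5.1f}% in 9504")
```

On this reading, no report has more than about half of its mapped identifications in the Core Dataset, which illustrates how incomplete sampling limits agreement between studies.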

Comparison of New Biofluid Proteome Findings with HUPO PPP-3020 Proteins
  Proteome   # Proteins   IPI 2.21   PPP-3020
  Urine      1543         910        293
  Tears      491          313        117
  Semen      923          560        180
• Refs from Matthias Mann Lab, Genome Biology, 2007, different IPI versions. • Comparison, Omenn, Proteomics-Clinical Applications (2007).

NEXT PHASE OF PPP (PPP-2) • Standard operating procedures (SOPs), including EDTA-plasma as standard specimen; replication and confirmation of results • Quantitation and subproteomes, using new methods and advanced instruments • Databases and robust bioinformatics • Clinical chem/disease-related studies

PPP-2 Research & Technology Thrusts • Learned a lot from Pilot Phase—plasma is a very complex specimen; no single platform sufficient; analyses currently far from comprehensive, let alone reproducible; now have improved data quality and informatics resources. • PPP-2: use multiple methods; focus on biomarker discovery; build upon already-funded laboratories and repositories.

Specific Technology Recommendations • N-Glycosite (proteotypic) peptide resource is a special subproteome likely to have high biomarker relevance. • Capture glycoproteins, digest with trypsin and PNGase F to yield N-linked glycopeptides. Choose one unique to each protein; a finite number; not all proteins. Use complementary lectin approach to characterize glycans. • Prepare isotope-labeled N-glyco-peptides for multiple uses as standards and to spike specimens.
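The trypsin and PNGase F steps can be mimicked in silico to see which peptides an N-glycosite capture would yield. This is a simplified sketch under stated assumptions: the protein sequence is invented, the function names are hypothetical, cleavage follows the usual rule (after K or R, not before P), glycosites are predicted from the N-X-S/T sequon (X not proline), and missed cleavages and sequons split across a cleavage site are ignored.

```python
import re
from typing import List

# N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
SEQUON = re.compile(r"N(?=[^P][ST])")


def tryptic_peptides(sequence: str) -> List[str]:
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]


def n_glycosites(sequence: str) -> List[int]:
    """Return 1-based positions of Asn residues inside a sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]


def pngase_f_product(peptide: str) -> str:
    """PNGase F releases the glycan and converts the modified Asn to Asp."""
    return SEQUON.sub("D", peptide)


protein = "MKNGSAKPNPTRLNVTK"  # invented sequence for illustration
glycopeptides = [pep for pep in tryptic_peptides(protein) if SEQUON.search(pep)]
print(n_glycosites(protein))
print([(pep, pngase_f_product(pep)) for pep in glycopeptides])
```

The Asn to Asp conversion shifts the peptide mass by roughly 0.98 Da, which is the signature searched for when confirming an N-glycosite from the fragment ion spectrum.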

N-Glycosites • Glycoproteins are enriched on cell surface, in secreted proteome and in plasma • Glycoproteins tend to be stable • Only few glycosites per protein: reduction in sample complexity (excludes albumin) • Inherent validation of N-glycosite by fragment ion spectrum • N-glycosite subproteome is probably the one easiest to completely map

[Figure: Glycopeptide Isolation. Capture → wash away non-glycoproteins → trypsin digestion → wash away non-glyco-peptides → PNGase F digestion (Asn → Asp) releases N-linked glycopeptides.] Zhang H., Li X.-J., Martin D.B. & Aebersold R. (2003) Nat Biotech 21: 660-666

[Figure: Flow chart of process. Boxes: Tissue Samples and Plasma Samples (Normal & Disease); Capture / Digestion; 'Glycopeptide' Fractions; LC-MS; LC/MS Maps; Data Analysis; Target Peptides; MRM; Targeted LC/MS/MS; Data Analysis]

Reducing Complexity: Glycoprotein-Enriched Subproteomes
  Methods         Lab 2                 Lab 11
  Enrichment      hydrazide chem        lectin chrom’y
  Peptide Fxn     SCX + RP              RP
  Mass Spec       qtof                  deca-xp
  Search engine   Seq/ProteinProphet    Sequest
  Protein IDs     222                   83
Protein IDs are in B1-serum [51 in common]. Of total 254, 164 found among data from 11 other labs without glycoprotein enrichment.

Technology Recommendations (cont’d) • Orbitrap and other advanced instruments with high mass accuracy and increased throughput • Multiple Reaction Monitoring (Q-Trap, triple quad): LOD <50 amol, 5 logs range, probably ng/ml range for GP • Extensive fractionation and newer labeling methods • Recruit several major labs; be open to volunteers • Determine interest in reference specimen • Make peptide standards available through PPP-2: post lists and make labeled compounds
