Mass Spec Proteomics HUPO-PSI & PRIDE

Phil Jones (pjones@ebi.ac.uk)
Proteomics Services Group
www.ebi.ac.uk

Positioning – The Technologies in Question

Classic: 2D-PAGE proteomics
[Workflow figure: protein extraction -> complex protein mixture -> 2D-PAGE separation (by pI and MW) -> tryptic digest -> MS analysis -> fragmentation -> MS/MS analysis]
Image source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm

New: peptide-centric identification (shotgun strategy)
[Workflow figure: protein extraction -> complex protein mixture -> enzymatic digest -> extremely complex peptide mixture -> separation / selection -> less complex peptide fractions -> MS analysis -> data-dependent MS/MS analyses]
Image source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm

Public Standards for Proteomics: HUPO Proteomics Standards Initiative

The HUPO Proteomics Standards Initiative
http://psidev.info
Mission:
• Develop minimal reporting guidelines
• Develop data representation standards (often XML formats)
• Develop annotation standards (ontologies and controlled vocabularies)
• Involve data producers, hardware vendors, database providers, software producers, publishers

What constitutes a PSI standard?
Four documents make up each individual standard:
• Formal requirements specification
• Minimal reporting requirements => MIAPE document
• XML data exchange format
• Domain-specific controlled vocabulary

MIAPE / MIMIx Guidelines

MIAPE & MIMIx
• MIAPE: Minimum Information About a Proteomics Experiment
• MIMIx: Minimum Information about a Molecular Interaction eXperiment
• Understand, qualify and reproduce
• Requirements to be enforced by journals, repositories, funders
• Compatibility with the PSI data formats

What is a MIAPE / MIMIx document?
It is:
• A checklist of information and data to provide when an experiment is reported (it is a content descriptor)
• An aid to assessing quality control (number of replicates, expected error rate)
It is not:
• A description of the way to run an experiment
• A description of HOW to represent data (e.g. "Use Excel to create a table with the following five columns:…")
• A guide to quality judgment

XML Data Exchange Formats

Available XML Exchange Formats
• mzData: mass spectrometry data
• mzML: replacement for mzData (since June 2008)
• analysisXML: mass spec. search engine output
• PSI-MI: molecular interactions (PPI)
• GelML: results of gel electrophoresis experiments
• GelInfoML: gel image analysis, manipulation and quantitation
• spML: GC, LC, centrifugation, capillary electrophoresis etc.

PSI – Mass Spectrometry Data Interchange: mzData -> mzML
mzData 1.05
• Established 4 years ago
• All major MS vendors generate mzData
• All major search engines consume mzData
• Data repositories accept mzData as input
• Commercial applications are built on mzData
mzML 1.0.0
• Completed document process on 1 June 2008
• Developed as a collaboration between PSI and ISB
• PSI-MS Working Group chaired by Eric Deutsch (ISB)
• Merges the best features of PSI’s mzData and ISB’s mzXML
• “We encourage the community to begin implementing mzML 1.0.0 [and] to phase out use of mzData and mzXML”

mzData -> mzML: beyond the deliverable
[Diagram: PSI formats (mzData, mzIdent) and ISB formats (mzXML, PepXML, ProtXML) converging on the joint formats mzML and analysisXML]

Details of mzML

Details of mzML: run

Details of mzML: cvParam and userParam
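
The slide itself showed a schema figure that is not preserved here. As a rough illustration of the idea, the sketch below reads cvParam and userParam elements from a hand-written, namespace-free fragment. The fragment is an assumption made for this example, not a complete or validated mzML file, and the accessions and names should be checked against the current PSI-MS controlled vocabulary before use.

```python
import xml.etree.ElementTree as ET

# Simplified fragment written for illustration only (assumption): a real
# mzML file declares an XML namespace and carries many more elements.
FRAGMENT = """
<spectrum id="scan=1" index="0" defaultArrayLength="0">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
  <cvParam cvRef="MS" accession="MS:1000580" name="MSn spectrum" value=""/>
  <userParam name="operator note" type="xsd:string" value="re-run after calibration"/>
</spectrum>
"""


def read_params(spectrum_xml):
    """Return (cv_params, user_params) for a single spectrum element."""
    spectrum = ET.fromstring(spectrum_xml.strip())
    # cvParam: the meaning comes from the controlled vocabulary accession.
    cv_params = {
        p.get("accession"): (p.get("name"), p.get("value"))
        for p in spectrum.findall("cvParam")
    }
    # userParam: free-form name/value pair with no CV term behind it.
    user_params = {
        p.get("name"): p.get("value") for p in spectrum.findall("userParam")
    }
    return cv_params, user_params


cv_params, user_params = read_params(FRAGMENT)
for accession, (name, value) in cv_params.items():
    print(accession, name, value)
print(user_params)
```

The design point is that a cvParam is looked up by accession, so a term can be renamed in the vocabulary without changing the XML schema, while a userParam is an escape hatch for values with no CV term.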

Details of mzML: spectrum

Details of mzML: chromatogram

PSI – Mass Spectrometry Data Interchange: analysisXML (Protein / Peptide Identifications)
• Will become a common format for mass spectrometry search engine output
• Provides support for multi-step analyses
• Merges previous efforts of HUPO-PSI with ISB

Controlled Vocabularies
• All interchange standards map to external CVs
• CVs are used to keep standards flexible and up to date, so the XML can stay frozen for as long as possible
• CVs assist in keeping curation consistent and database searching effective
• All CVs are maintained in OBO format and published on the Open Biomedical Ontologies website (http://www.obofoundry.org/)
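
Since the CVs are maintained in OBO format, a minimal sketch of reading [Term] stanzas from OBO text may help. It handles only the id, name and is_a tags, and the EX: identifiers in the sample are made up for this example rather than taken from any PSI vocabulary.

```python
# Hypothetical OBO snippet (assumption): identifiers and names are invented.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: instrument parameter

[Term]
id: EX:0000002
name: scan setting
is_a: EX:0000001 ! instrument parameter

[Typedef]
id: part_of
name: part_of
"""


def parse_obo_terms(text):
    """Parse [Term] stanzas into a dict keyed by term id."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza begins; anything other than [Term] is skipped.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines, blank lines, or non-Term stanzas
        tag, value = line.split(":", 1)
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! human-readable name" comment.
            current["is_a"].append(value.split("!", 1)[0].strip())
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
print(terms["EX:0000002"])
```

Because the data files refer to terms only by identifier, the vocabulary file can gain terms or revised names without any change to the XML documents that cite it.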

Available PSI Controlled Vocabularies
• PSI-MS: mass spectrometry data
• MI: molecular interactions
• PSI-MOD: protein modifications (PTMs)
• sepCV: sample processing and separations controlled vocabulary
• PI: “Proteomics Informatics” CV (accompanies analysisXML)
The four in bold are current and available from the OBO Foundry & the Ontology Lookup Service:
http://obofoundry.org/
http://www.ebi.ac.uk/ols

PRIDE: The Proteomics Identifications Database

The origin: availability versus accessibility
Proteomics data is only made available as arbitrarily formatted PDF tables, which carries important limitations:
• Source data (mass spectra) are not made available
• No peer review validation is possible
• Very little raw material is available for testing innovative in silico techniques
• Automated (re-)processing of the identifications is impossible

Science Supported by PRIDE
• Sample generation: origin of sample (hypothesis, organism, environment, preparation, paper citations)
• Sample processing, gel informatics: gels (1D/2D), columns, ‘chips’, other methods (images, gel type and ranges, band/spot coordinates, quantitation; stationary and mobile phases, flow rate, temperature, fractionation)
• Mass spectrometry (‘mzData’): machine type, ion source, voltages
• Mass spectrometry informatics: peak lists, database name + version, partial sequence, search parameters, search hits, accession numbers, quantitation
• Data dissemination and comparison (PRIDE): peak lists, protein and peptide identifications, post-translational modifications

Data in PRIDE
Current statistics:
• 831,764 protein identifications
• 4,947,353 peptide identifications (479,014 unique)
• 7,409,854 mass spectra
Large public datasets:
• HUPO Plasma Proteome Project
• HUPO Brain Proteome Project (including mass spectra)
• HUPO Liver Proteome Project (including mass spectra)
• Human Cerebrospinal Fluid (U Washington School of Medicine)
• Cellzome data set

PRIDE Overview
• Licence: Apache Licence, Version 2.0
• Data ownership remains with the submitter: 84% public, 16% private
• Data submission: Proteome Harvest Excel Data Submission Spreadsheet; direct XML submission using the PRIDE Core API; human curation (creation of XML in house)
• Presentation: web; DAS (Distributed Annotation Service)
• Core (data exchange, API & persistence): mzData XML (peak lists (MS), instrumentation, sample); PRIDE XML (identifications of proteins, peptides, PTMs)

A simplified schema of the PRIDE data store
[Schema diagram, entities: Project; Experiment; Protocol (ordered steps); Sample <<mzData>> (species, tissue, disease state, cellular component, developmental stage); Instrumentation & Associated Software <<mzData>>; Mass Spectra <<mzData>>; Protein Identifications; Peptide Identifications; Protein Modifications (PTMs)]
+ group-based access control system; reviewer access
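
The entity names on this slide can be sketched as nested records. The field names and the one-to-many nesting below are assumptions made for illustration; the diagram's actual multiplicities are not recoverable from the text, and this is not the PRIDE Core API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Modification:
    name: str
    position: int  # residue position within the peptide (assumed 1-based)


@dataclass
class PeptideIdentification:
    sequence: str
    spectrum_id: Optional[int] = None  # a peptide may lack a stored spectrum
    modifications: List[Modification] = field(default_factory=list)


@dataclass
class ProteinIdentification:
    accession: str
    peptides: List[PeptideIdentification] = field(default_factory=list)


@dataclass
class Experiment:
    title: str
    species: str
    protocol_steps: List[str] = field(default_factory=list)  # ordered steps
    proteins: List[ProteinIdentification] = field(default_factory=list)


@dataclass
class Project:
    name: str
    experiments: List[Experiment] = field(default_factory=list)


def summarize(project):
    """Count protein and peptide identifications across a project."""
    proteins = [p for e in project.experiments for p in e.proteins]
    peptides = [pep for p in proteins for pep in p.peptides]
    return {
        "proteins": len(proteins),
        "peptides": len(peptides),
        "unique_peptides": len({pep.sequence for pep in peptides}),
    }


# Made-up demo data: accessions and sequences are placeholders.
demo = Project("Demo project", [
    Experiment("Demo experiment", "Homo sapiens", ["reduction", "tryptic digest"], [
        ProteinIdentification("ACC_0001", [
            PeptideIdentification("PEPTIDEK", spectrum_id=1),
            PeptideIdentification("PEPTIDEK", spectrum_id=2,
                                  modifications=[Modification("oxidation", 3)]),
        ]),
        ProteinIdentification("ACC_0002", [PeptideIdentification("SAMPLER")]),
    ]),
])
summary = summarize(demo)
print(summary)
```

Counting peptides both in total and by distinct sequence mirrors the way repository statistics separate all peptide identifications from unique ones.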

THE LOOK OF PRIDE

PRIDE web interface – overview

PRIDE web interface – experiment and protein

PRIDE web interface – mass spectra

PRIDE web interface – project comparison

PRIDE BioMart: A Leap Forward in Query Capability

BioMart (http://www.biomart.org)
• A query-oriented data management system
• Developed by the EBI and CSHL
Powered by BioMart software:
• Central Server
• Ensembl
• HapMap
• dictyBase
• UniProt
• Reactome
• ArrayExpress
• WormBase
• Gramene
• GermOnLine
• DroSpeGe
• PRIDE

BioMart and PRIDE
Perform powerful and fast queries across large, complex data sets:
• Specify simple or complex filters involving multiple attributes of the data
• Specify precisely which attributes or ‘columns’ of data are included in the output
• Specify the format of the output, including:
  • HTML table (with links)
  • Excel spreadsheet
  • Tab-delimited file
  • Comma-separated format

Typical BioMart Usage
• Step 1 (Dataset): Choose your dataset
• Step 2 (Filters): Restrict your query
• Step 3 (Attributes): Specify what information you want to include in the output
• Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format
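
The four steps above map onto the XML query document that BioMart servers generally accept. The sketch below only builds such a document and does not contact any server. The dataset, filter and attribute names are hypothetical placeholders rather than the real PRIDE BioMart configuration, and the Query attributes follow the usual BioMart layout as an assumption.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes, formatter="TSV"):
    """Build a BioMart-style XML query string (dataset, filters, attributes)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,        # Step 4: chosen output format
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    # Step 1: choose the dataset.
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Step 2: restrict the query with filters.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Step 3: pick the output columns.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All names below are hypothetical placeholders.
query_xml = build_biomart_query(
    dataset="example_dataset",
    filters={"species_filter": "Homo sapiens"},
    attributes=["experiment_accession", "protein_accession"],
)
print(query_xml)
```

Keeping filters and attributes as separate arguments reflects the BioMart split between restricting rows and choosing columns.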

PRIDE BioMart – Dataset Page

PRIDE BioMart – Defining a Complex Filter

PRIDE BioMart – Selecting Output Fields

PRIDE BioMart – Retrieving Results

PRIDE BioMart – Output to Microsoft Excel

The Ontology Lookup Service: Intelligent Query for PRIDE and Beyond…

Ontologies – more than just a list of terms
• A vocabulary of terms (names for concepts)
  • Uses stable identifiers for each concept
• Definitions
  • Authoritative and unambiguous meaning for each concept and the context in which it should be used
• Defined logical relationships between terms
  • More complexity than a simple hierarchy: child terms can be related to more than one parent, and parent terms can have multiple children. The relationships themselves carry significance.
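
The point about multiple parents and typed relationships can be made concrete with a small graph walk. The terms and relations below are illustrative examples chosen for this sketch, not quotations from GO or any other ontology.

```python
# Each term maps to a list of (relationship, parent) pairs. A term may have
# several parents, so the structure is a graph rather than a simple tree.
PARENTS = {
    "mitochondrial membrane": [("is_a", "membrane"), ("part_of", "mitochondrion")],
    "membrane": [("is_a", "cellular component")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
}


def ancestors(term, parents, relations=("is_a", "part_of")):
    """Collect every ancestor reachable through the allowed relationships."""
    seen = set()
    stack = [term]
    while stack:
        for relation, parent in parents.get(stack.pop(), []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


all_ancestors = ancestors("mitochondrial membrane", PARENTS)
is_a_only = ancestors("mitochondrial membrane", PARENTS, relations=("is_a",))
print(sorted(all_ancestors))
print(sorted(is_a_only))
```

Restricting the walk to is_a gives a different answer from following every relationship, which is why the relationship type matters when an ontology is used to expand a query.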

What is OLS?
http://www.ebi.ac.uk/ontology-lookup/
• A unified, single point of query for over 54 ontologies (updated daily) and upwards of 530,000 terms
• A tool that offers online and programmatic access to query ontologies about:
  • Term names
  • Synonyms
  • Relationships
  • Annotations
  • Cross-references
• Reusable code components to integrate such functionality in other projects

The Use of Controlled Vocabularies and Ontologies in PRIDE
Controlled vocabularies / ontologies are required, and are used to define the search space:
• Species: Newt / NCBI Taxonomy ID
• Tissue / organ / cell type: BRENDA Tissue ontology, Cell Type ontology
• Sub-cellular component: GO
• Disease: Human Disease (DOID)
• Genotype: GO
• Sample processing: PSI Ontology
• Mass spectrometry: PSI-MS Ontology
• Protein modifications: PSI-MOD Ontology
• Terms that fit nowhere else!? PRIDE CV
[Slide label: OBO Ontologies]
