PRIDE

PRIDE: The Proteomics Identifications Database. Proteomics Services Group, www.ebi.ac.uk

TWO PATHS TO MASS SPECTROMETRY BASED PROTEOMICS

Classic: 2D-PAGE proteomics. Workflow: protein extraction → complex protein mixture → 2D-PAGE separation (by pI and MW) → tryptic digest → MS analysis → fragmentation → MS/MS analysis. Figure source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm

New: peptide-centric identification (shotgun strategy). Workflow: protein extraction → complex protein mixture → tryptic digest → extremely complex peptide mixture → separation → less complex peptide fractions → MS analysis → selection → data-dependent MS/MS analyses. Figure source: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm
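The tryptic digest step in both workflows can be illustrated in silico. This is a minimal sketch, assuming the commonly used simplified trypsin rule (cleave after K or R unless the next residue is P) and no missed cleavages; the function name and the example sequence are made up for illustration and are not part of PRIDE.

```python
import re
from typing import List


def tryptic_digest(sequence: str) -> List[str]:
    """Split a protein sequence into peptides using a simplified trypsin rule.

    Assumption: cleavage happens after K or R, except when the next
    residue is P. Missed cleavages are not modelled.
    """
    # Zero-width split point: preceded by K or R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence.upper())
    # A sequence ending in K or R yields a trailing empty piece; drop it.
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    for peptide in tryptic_digest("MKRPAKGR"):
        print(peptide)
```

Even this toy rule shows why the shotgun strategy produces an extremely complex peptide mixture: every protein in the sample contributes several peptides.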

Public Standards for Proteomics: HUPO Proteomics Standards Initiative

http://psidev.info HUPO Proteomics Standards Initiative (PSI) • Formal requirements specification • Minimal reporting requirements • MIAPE document (Minimum Information About a Proteomics Experiment) • Data representation standards • XML Data exchange format • Annotation standards • Controlled vocabulary (CV) and ontologies Requirements enforced by journals, repositories and funders

HUPO PSI - what is a MIAPE document? It is: • Checklist of information and data required when reporting an experiment • Helps assess quality control (number of replicates, expected error rate…) It is not: • Dictating how to run an experiment

HUPO-PSI - XML Data Exchange Formats • mzML: mass spectrometry data; merges mzData (HUPO-PSI) and mzXML (ISB) • analysisXML: mass spectrometry search engine output; provides support for multi-step analyses • PSI-MI: molecular interactions (PPI) • GelML: results of gel electrophoresis experiments • GelInfoML: gel image analysis, manipulation & quantitation • spML: GC, LC, centrifugation, capillary electrophoresis

HUPO-PSI - Annotation Standards • Controlled Vocabularies (CVs) keep curation consistent and database searching effective • CVs are maintained in OBO format • Published on the Open Biomedical Ontologies website (http://www.obofoundry.org/) • Available PSI Controlled Vocabularies: • PSI-MS (mass spectrometry data) • MI (molecular interactions) • PSI-MOD (protein modifications, PTMs) • sepCV (sample processing and separations CV) • PI (Proteomics Informatics CV, accompanies analysisXML)
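Since the CVs are maintained in OBO format, a short sketch of reading such a file may help. It assumes the usual OBO flat-file layout of `[Term]` stanzas with `tag: value` lines; the parser is deliberately minimal (no escaping, no header handling), and the `EX:` identifiers and term names in the sample are invented rather than taken from any PSI vocabulary.

```python
from typing import Dict, List

SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: instrument

[Term]
id: EX:0000002
name: ion trap
is_a: EX:0000001 ! instrument

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text: str) -> List[Dict[str, List[str]]]:
    """Return one dict per [Term] stanza, mapping each tag to its values."""
    terms: List[Dict[str, List[str]]] = []
    current = None  # the stanza being filled, or None outside a [Term]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop the trailing "! comment" that OBO allows after a value.
            value = value.split(" ! ")[0].strip()
            current.setdefault(tag, []).append(value)
    return terms


if __name__ == "__main__":
    for term in parse_obo_terms(SAMPLE_OBO):
        print(term["id"][0], term["name"][0], term.get("is_a", []))
```

Values are kept as lists because tags such as `is_a` may legitimately occur more than once in a stanza.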

PRIDE: The Proteomics Identifications Database

The origin: availability versus accessibility. Proteomics data is only made available as arbitrarily formatted PDF tables, which carries important limitations: • Source data (mass spectra) are not made available • No peer review validation is possible • Very little raw material for testing innovative in silico techniques is available • Automated (re-)processing of the identifications is impossible

Science Supported by PRIDE • Sample generation: origin of sample (hypothesis, organism, environment, preparation, paper citations) • Sample processing, gel informatics: gels (1D/2D), columns, ‘chips’, other methods (images, gel type and ranges, band/spot coordinates, quantitation; stationary and mobile phases, flow rate, temperature, fractionation) • Mass spectrometry (‘mzData’): machine type, ion source, voltages • Mass spectrometry informatics: peak lists, database name + version, partial sequence, search parameters, search hits, accession numbers, quantitation • Data dissemination and comparison (PRIDE): peak lists, protein and peptide identifications, post-translational modifications

Data In PRIDE Current Statistics: • 372,625 Protein Identifications • 2,150,515 Peptide Identifications (310,381 unique) • 2,599,562 Spectra Large Public Datasets: • HUPO Plasma Proteome Project • HUPO Brain Proteome Project (including mass spectra) • HUPO Liver Proteome Project (including mass spectra) • Human Cerebrospinal Fluid (U Washington School of Medicine). • Cellzome data set

PRIDE Overview • Licence: Apache Licence, Version 2.0 • Data ownership remains with submitter (84% public, 16% private) • Data submission: Proteome Harvest Excel data submission spreadsheet; direct XML submission using the PRIDE Core API; human curation (creation of XML in house) • Presentation: web; DAS (Distributed Annotation Service) • Core: data exchange API & persistence • mzData XML: peak lists (MS), instrumentation, sample • PRIDE XML: identifications of proteins, peptides, PTMs

A simplified schema of the PRIDE data store + group-based access control system; reviewer access. Entities in the diagram: Project; Experiment; Protocol (ordered steps); Sample <<mzData>> (species, tissue, disease state, cellular component, developmental stage); Instrumentation & Associated Software <<mzData>>; Mass Spectra <<mzData>>; Protein Identifications; Peptide Identifications; Protein Modifications (PTMs). Multiplicities shown on the associations: *, 1..*, 0..1.
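The schema can be approximated with a few plain data classes. This is a minimal sketch, not the PRIDE Core API: the class and field names are hypothetical, and the nesting (project → experiments → protein identifications → peptide identifications → optional spectrum and modifications) is one plausible reading of the flattened diagram rather than a confirmed layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Spectrum:
    spectrum_id: int
    mz: List[float]
    intensity: List[float]


@dataclass
class Modification:
    name: str
    position: int  # residue index within the peptide (assumed convention)


@dataclass
class PeptideIdentification:
    sequence: str
    spectrum: Optional[Spectrum] = None  # assumed 0..1 link to a spectrum
    modifications: List[Modification] = field(default_factory=list)


@dataclass
class ProteinIdentification:
    accession: str
    peptides: List[PeptideIdentification] = field(default_factory=list)


@dataclass
class Experiment:
    title: str
    sample: Dict[str, str]     # e.g. species, tissue, disease state
    protocol_steps: List[str]  # ordered steps
    spectra: List[Spectrum] = field(default_factory=list)
    proteins: List[ProteinIdentification] = field(default_factory=list)


@dataclass
class Project:
    name: str
    experiments: List[Experiment] = field(default_factory=list)


def unique_peptides(project: Project) -> Set[str]:
    """Collect the distinct peptide sequences identified in a project."""
    return {
        peptide.sequence
        for experiment in project.experiments
        for protein in experiment.proteins
        for peptide in protein.peptides
    }
```

A helper such as `unique_peptides` mirrors the distinction between total and unique peptide identifications reported in the statistics.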

THE LOOK OF PRIDE

PRIDE web interface – overview

PRIDE web interface – experiment and protein

PRIDE web interface – mass spectra

PRIDE web interface – project comparison

PRIDE BioMart A Leap Forward in Query Capability

BioMart (http://www.biomart.org): a query-oriented data management system developed by the EBI and CSHL. Powered by BioMart software: • Central Server • Ensembl • HapMap • Dictybase • UniProt • Reactome • ArrayExpress • WormBase • Gramene • GermOnLine • DroSpeGe • PRIDE

BioMart and PRIDE • Perform powerful and fast queries across large, complex data sets: • specify simple or complex filters involving multiple attributes of the data; • specify precisely which attributes or ‘columns’ of data are included in the output; • specify the format of the output, including: • HTML table (with links) • Excel spreadsheet • Tab-delimited file • Comma separated format

Typical BioMart Usage Step 1 (Dataset): Choose your dataset Step 2 (Filters): Restrict your query Step 3 (Attributes): Specify what information you want to include in the output Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format.
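The four steps map naturally onto a BioMart XML query document. The sketch below builds such a document with the standard library only. The overall `Query` / `Dataset` / `Filter` / `Attribute` layout follows the general shape of BioMart XML queries as I understand it; the dataset, filter and attribute names are hypothetical placeholders, not real PRIDE BioMart identifiers, and nothing is sent to a server.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_query(dataset: str, filters: Dict[str, str],
                attributes: List[str], formatter: str = "TSV") -> ET.Element:
    """Assemble a BioMart-style query element from the four usage steps."""
    # Step 4 (Results): the formatter selects the output format.
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
    })
    # Step 1 (Dataset): choose the dataset.
    dataset_el = ET.SubElement(query, "Dataset",
                               {"name": dataset, "interface": "default"})
    # Step 2 (Filters): restrict the query.
    for name, value in filters.items():
        ET.SubElement(dataset_el, "Filter", {"name": name, "value": value})
    # Step 3 (Attributes): choose the output columns.
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    return query


def to_xml(query: ET.Element) -> str:
    """Serialize the query with the XML prolog and DOCTYPE line."""
    prolog = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n'
    return prolog + ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    # All names below are made up for illustration.
    example = build_query(
        dataset="example_pride_dataset",
        filters={"example_species_filter": "Homo sapiens"},
        attributes=["example_experiment_accession", "example_protein_accession"],
    )
    print(to_xml(example))
```

Keeping filters and attributes as separate arguments reflects the interface itself, where restricting rows and choosing columns are independent steps.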

PRIDE BioMart – Dataset Page

PRIDE BioMart – Defining a Complex Filter

PRIDE BioMart – Selecting Output Fields

PRIDE BioMart – Retrieving Results

PRIDE BioMart – Output to Microsoft Excel

The Ontology Lookup Service: Intelligent Query for PRIDE and Beyond…

Ontologies – more than just a list of terms • A vocabulary of terms (names for concepts) • use stable identifiers for each concept • Definitions • Authoritative and unambiguous meaning for each concept and the context in which it should be used. • Defined logical relationships between terms • More complexity than a simple hierarchy. Child terms can be related to more than one parent and parent terms can have multiple children. Relationships themselves carry a significance.
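The point that an ontology is a graph rather than a simple hierarchy can be shown with a small sketch. The terms, identifiers and relationships below are invented for illustration; the structure (a child with more than one parent, and typed relationships such as `is_a` and `part_of`) is what matters.

```python
from typing import Dict, List, Optional, Set, Tuple

# term id -> list of (relationship type, parent id); all terms are made up.
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "EX:1": [],
    "EX:2": [("is_a", "EX:1")],
    "EX:3": [("is_a", "EX:1")],
    # A child with two parents, reached through different relationship types.
    "EX:4": [("is_a", "EX:2"), ("part_of", "EX:3")],
}


def ancestors(term: str,
              parents: Dict[str, List[Tuple[str, str]]],
              relations: Optional[Set[str]] = None) -> Set[str]:
    """Return every term reachable upwards from `term`.

    If `relations` is given, only edges of those types are followed,
    which is where the significance of the relationship itself shows.
    """
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in parents.get(current, []):
            if relations is not None and relation not in relations:
                continue
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("EX:4", PARENTS)))
    print(sorted(ancestors("EX:4", PARENTS, {"is_a"})))
```

Restricting the traversal to `is_a` gives a different ancestor set than following every relationship, so a query for "all children of X" has to state which relationships it means.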

http://www.ebi.ac.uk/ontology-lookup/ What is OLS? • A unified, single point of query for over 54 ontologies (updated daily) and upwards of 530,000 terms. • A tool that offers online and programmatic access to query ontologies about: • Term names • Synonyms • Relationships • Annotations • Cross-references • Reusable code components to integrate such functionality in other projects

The Use of Controlled Vocabularies and Ontologies in PRIDE. Required controlled vocabularies / ontologies define the search space: • Species: Newt / NCBI Taxonomy ID • Tissue / organ / cell type: BRENDA Tissue ontology, Cell Type ontology • Sub-cellular component: GO • Disease: Human Disease Ontology (DOID) • Genotype: GO • Sample Processing: PSI Ontology • Mass Spectrometry: PSI-MS Ontology • Protein Modifications: PSI-MOD Ontology • Terms that fit nowhere else: PRIDE CV. OBO Ontologies

Ontology Lookup Service (OLS) http://www.ebi.ac.uk/ols

The Protein Identifier Cross Reference Service: Solving the Protein Accession Problem in PRIDE

Why do you need ID mapping • Merging datasets to a common identifier space • Finding all aliases/synonyms for an identifier • (data integration – submissions!) • Mapping from secondary IDs to more recent primary IDs • (data “freshness”) • Preparing data sets for specific tools • Querying in various primary databases • (data format requirements)

Protein identifier mapping is hard • The basic problem: the same protein sequence is referred to by multiple accession numbers assigned by multiple databases. • No universal identifier scheme • Redundant databases: multiple identifiers for the same sequence in the same database • Unstable identifiers (e.g. gi numbers) • Obsolete and deleted identifiers (hypothetical proteins) • Different production cycles for major databases • Tools exist, but are limited in their database and species coverage and in their usability and availability. UniParc is a major component.
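The basic problem suggests one workable join key: the sequence itself. The sketch below groups accessions from different databases by a checksum of their sequence and maps one accession into another database. It is a toy illustration of the idea behind a sequence archive, not PICR's implementation; the record layout, database names and accessions are all invented, and SHA-256 is simply a convenient checksum choice here.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (database, accession, sequence); every value here is made up.
Record = Tuple[str, str, str]

RECORDS: List[Record] = [
    ("DB_A", "A001", "MKTAYIAK"),
    ("DB_B", "B9", "mktayiak"),   # same sequence, different case
    ("DB_B", "B10", "MKTAYIAK"),  # redundant entry in the same database
    ("DB_B", "B11", "MKTAYIAR"),  # different sequence
]


def sequence_key(sequence: str) -> str:
    """Checksum of the sequence, ignoring case and whitespace."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_index(records: Iterable[Record]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each sequence checksum to the (database, accession) pairs using it."""
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for database, accession, sequence in records:
        index[sequence_key(sequence)].add((database, accession))
    return index


def map_accession(accession: str, records: Iterable[Record],
                  target_db: str) -> List[str]:
    """Return accessions in `target_db` sharing a sequence with `accession`."""
    records = list(records)
    index = build_index(records)
    keys = {sequence_key(seq) for _, acc, seq in records if acc == accession}
    return sorted(acc for key in keys
                  for db, acc in index[key] if db == target_db)


if __name__ == "__main__":
    print(map_accession("A001", RECORDS, "DB_B"))
```

The sketch covers only identical sequences. Unstable, obsolete and deleted identifiers need history tracking on top of this, which is the harder part of the problem listed above.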

PICR: Home page (http://www.ebi.ac.uk/tools/picr) • Limit search by taxonomy (pessimistic) • Submit accessions OR sequences (FASTA), with a 500-entry interactive limit (no batch limit) • Choose to return all mappings or only active ones • Select output format • Select one or many databases to map to in one request • Run search

PICR Result Page – simple view. Legend: logical xref (hyperlinked); secondary identifier; active xref (hyperlinked); inactive xref

PICR Result Page – detailed view

PICR Result Page – XLS view

PICR in PRIDE

PRIDE PLAYS WELL WITH OTHER PROTEOMICS REPOSITORIES

PRIDE to Tranche: large (binary) files. Tranche, Falkner and Andrews, http://tranche.proteomecommons.org

PRIDE and World 2D PAGE: 2D-PAGE gels and gel spots

ProteomeXchange consortium • Sharing proteomics data between existing proteomics repositories • Includes PeptideAtlas, GPMDB, and PRIDE • Submission guidelines document finalized, currently being tested • Guidelines primarily deal with data types, formatting and reporting requirements • Both submitters and journals are quite interested and drive the process
