PRIDE

Presentation Transcript


  1. PRIDE The Proteomics Identifications Database Proteomics Services Group www.ebi.ac.uk

  2. TWO PATHS TO MASS SPECTROMETRY BASED PROTEOMICS

  3. Classic: 2D PAGE proteomics. Workflow: protein extraction → complex protein mixture → 2D-PAGE separation (by pI and MW) → tryptic digest → MS analysis → fragmentation → MS/MS analysis. Source shown on slide: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm

  4. New: peptide-centric identification (shotgun strategy). Workflow: protein extraction → complex protein mixture → tryptic digest → extremely complex peptide mixture → separation → less complex peptide fractions → MS analysis → selection → data-dependent MS/MS analyses. Source shown on slide: http://www.akh-wien.ac.at/biomed-research/htx/platweb1.htm

  5. Public Standards for Proteomics: HUPO Proteomics Standards Initiative

  6. http://psidev.info HUPO Proteomics Standards Initiative (PSI) • Formal requirements specification • Minimal reporting requirements • MIAPE document (Minimum Information About a Proteomics Experiment) • Data representation standards • XML Data exchange format • Annotation standards • Controlled vocabulary (CV) and ontologies Requirements enforced by journals, repositories and funders

  7. HUPO PSI - what is a MIAPE document? It is: • Checklist of information and data required when reporting an experiment • Helps assess quality control (number of replicates, expected error rate…) It is not: • Dictating how to run an experiment

  8. HUPO-PSI - XML Data Exchange Formats
     • mzML: mass spectrometry data; merges mzData (HUPO-PSI) and mzXML (ISB)
     • analysisXML: mass spectrometry search engine output; provides support for multi-step analyses
     • PSI-MI: molecular interactions (PPI)
     • GelML: results of gel electrophoresis experiments
     • GelInfoML: gel image analysis, manipulation & quantitation
     • spML: GC, LC, centrifugation, capillary electrophoresis

  9. HUPO-PSI - Annotation Standards • Controlled Vocabularies (CVs) keep curation consistent and database searching effective • CVs maintained in OBO format • Published on the Open Biomedical Ontologies website (http://www.obofoundry.org/) • Available PSI Controlled Vocabularies: • PSI-MS (mass spectrometry data) • MI (molecular interactions) • PSI-MOD (protein modifications, PTMs) • sepCV (sample processing and separations CV) • PI (Proteomics Informatics CV; accompanies analysisXML)
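As a rough illustration of the OBO flat-file layout in which these CVs are maintained, the sketch below reads `[Term]` stanzas into dictionaries. The sample term IDs and names (`EX:...`) are made-up placeholders rather than real PSI-MS entries, and the parser covers only a small subset of the format (no escapes, no typedefs).

```python
from collections import defaultdict

# Hypothetical sample; IDs and names are placeholders, not real PSI-MS terms.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: instrument

[Term]
id: EX:0000002
name: ion trap
is_a: EX:0000001 ! instrument
"""


def parse_obo_terms(text):
    """Return {term_id: {tag: [values]}} for every [Term] stanza in text."""
    terms = {}
    current = None      # tag/value map of the stanza being read
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            in_term = line == "[Term]"
            current = defaultdict(list) if in_term else None
            continue
        if not in_term or not line or ":" not in line:
            continue    # header lines, blank lines, non-term stanzas
        tag, value = line.split(":", 1)
        value = value.split(" ! ", 1)[0].strip()   # drop trailing "! comment"
        current[tag].append(value)
        if tag == "id":
            terms[value] = current
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
print(terms["EX:0000002"]["name"][0], "is_a", terms["EX:0000002"]["is_a"])
```

Keeping every tag as a list is a deliberate choice here, since tags such as `is_a` may legitimately occur more than once per term.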

  10. PRIDE: The Proteomics Identifications Database

  11. The origin: availability versus accessibility. Proteomics data is only made available as arbitrarily formatted PDF tables, which carries important limitations: • Source data (mass spectra) are not made available • No peer review validation possible • Very little raw material for testing innovative in silico techniques is available • Automated (re-)processing of the identifications is impossible

  12. Science Supported by PRIDE
     • Sample generation: origin of sample; hypothesis, organism, environment, preparation, paper citations
     • Sample processing, gel informatics: gels (1D/2D), columns, ‘chips’, other methods; images, gel type and ranges, band/spot coordinates, quantitation; stationary and mobile phases, flow rate, temperature, fractionation
     • Mass Spectrometry (‘mzData’): machine type, ion source, voltages
     • Mass Spectrometry Informatics: peak lists, database name + version, partial sequence, search parameters, search hits, accession numbers, quantitation
     • Data dissemination and comparison (PRIDE): peak lists, protein and peptide identifications, post-translational modifications

  13. Data In PRIDE Current Statistics: • 372,625 Protein Identifications • 2,150,515 Peptide Identifications (310,381 unique) • 2,599,562 Spectra Large Public Datasets: • HUPO Plasma Proteome Project • HUPO Brain Proteome Project (including mass spectra) • HUPO Liver Proteome Project (including mass spectra) • Human Cerebrospinal Fluid (U Washington School of Medicine). • Cellzome data set

  14. PRIDE Overview (architecture diagram)
     • Licence: Apache Licence, Version 2.0
     • Data ownership remains with submitter: 84% public, 16% private
     • Data submission: Proteome Harvest Excel data submission spreadsheet; direct XML submission using the PRIDE Core API; human curation (creation of XML in house)
     • Presentation: WEB; DAS (Distributed Annotation Service)
     • CORE: data exchange API & persistence
     • mzData XML: peak lists (MS), instrumentation, sample
     • PRIDE XML: identifications of proteins, peptides, PTMs

  15. A simplified schema of the PRIDE data store (UML diagram)
     • Entities: Project; Experiment; Protocol (ordered steps); Sample «mzData» (species, tissue, disease state, cellular component, developmental stage); Instrumentation & Associated Software «mzData»; Mass Spectra «mzData»; Protein Identifications; Peptide Identifications; Protein Modifications (PTMs)
     • Multiplicity labels shown on the associations: *, 1..*, 0..1
     • Plus: group-based access control system; reviewer access
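One way to picture the simplified schema is as nested record types. The sketch below is an illustration only: the class and field names are assumptions, the multiplicities are one plausible reading of the diagram, and none of it is taken from the PRIDE Core API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Spectrum:
    spectrum_id: int
    mz: List[float]
    intensity: List[float]


@dataclass
class Modification:
    name: str        # e.g. a PSI-MOD term name
    position: int    # 1-based residue position in the peptide


@dataclass
class PeptideIdentification:
    sequence: str
    spectrum: Optional[Spectrum] = None   # assumed 0..1: spectra may be absent
    modifications: List[Modification] = field(default_factory=list)


@dataclass
class ProteinIdentification:
    accession: str
    peptides: List[PeptideIdentification] = field(default_factory=list)


@dataclass
class Experiment:
    title: str
    species: str
    tissue: str
    protocol_steps: List[str] = field(default_factory=list)   # kept in order
    proteins: List[ProteinIdentification] = field(default_factory=list)

    def unique_peptides(self) -> Set[str]:
        """Distinct peptide sequences across all protein identifications."""
        return {p.sequence for prot in self.proteins for p in prot.peptides}


@dataclass
class Project:
    name: str
    experiments: List[Experiment] = field(default_factory=list)


# Made-up example data, for illustration only.
spec = Spectrum(1, [100.1, 200.2], [10.0, 5.0])
prot1 = ProteinIdentification("ACC_1", [
    PeptideIdentification("PEPTIDEK", spectrum=spec,
                          modifications=[Modification("example modification", 3)]),
    PeptideIdentification("SAMPLER"),
])
prot2 = ProteinIdentification("ACC_2", [PeptideIdentification("PEPTIDEK")])
exp = Experiment("Example experiment", "Homo sapiens", "plasma",
                 protocol_steps=["extraction", "digest", "LC-MS/MS"],
                 proteins=[prot1, prot2])
project = Project("Example project", [exp])
print(len(exp.unique_peptides()), "unique peptides")
```

The same peptide sequence can appear under several proteins, which is why identification counts and unique peptide counts differ.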

  16. THE LOOK OF PRIDE

  17. PRIDE web interface – overview

  18. PRIDE web interface – experiment and protein

  19. PRIDE web interface – mass spectra

  20. PRIDE web interface – project comparison

  21. PRIDE BioMart A Leap Forward in Query Capability

  22. BioMart (http://www.biomart.org) A query-oriented data management system, developed by the EBI and CSHL. Powered by BioMart software: • Central Server • Ensembl • HapMap • Dictybase • UniProt • Reactome • ArrayExpress • Wormbase • Gramene • GermOnLine • DroSpeGe • PRIDE

  23. BioMart and PRIDE • Perform powerful and fast queries across large, complex data sets: • specify simple or complex filters involving multiple attributes of the data; • specify precisely which attributes or ‘columns’ of data are included in the output; • specify the format of the output, including: • HTML table (with links) • Excel spreadsheet • Tab-delimited file • Comma separated format

  24. Typical BioMart Usage Step 1 (Dataset): Choose your dataset Step 2 (Filters): Restrict your query Step 3 (Attributes): Specify what information you want to include in the output Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format.
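The first three steps map naturally onto a query document. The sketch below assembles one with Python's standard library, assuming the general shape of BioMart's XML queries (a `Query` element wrapping a `Dataset` with `Filter` and `Attribute` children). The dataset, filter and attribute names are hypothetical placeholders, not actual PRIDE BioMart identifiers, and step 4 (sending the query to a mart server and reading the results) is left out.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes, formatter="TSV"):
    """Assemble a BioMart-style XML query string.

    dataset    -- step 1: name of the dataset to query
    filters    -- step 2: {filter_name: value} restrictions
    attributes -- step 3: list of output columns
    formatter  -- requested output format for step 4
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,
        "header": "1",
        "uniqueRows": "1",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical dataset, filter and attribute names, for illustration only.
xml_query = build_query(
    dataset="pride",
    filters={"species_filter": "Homo sapiens"},
    attributes=["experiment_accession", "protein_accession"],
)
print(xml_query)
```

Filters narrow which rows are returned and attributes choose the columns, so the two can be varied independently.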

  25. Typical BioMart Usage Step 1 (Dataset): Choose your dataset Step 2 (Filters): Restrict your query Step 3 (Attributes): Specify what information you want to include in the output Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format.

  26. PRIDE BioMart – Dataset Page

  27. Typical BioMart Usage Step 1 (Dataset): Choose your dataset Step 2 (Filters): Restrict your query Step 3 (Attributes): Specify what information you want to include in the output Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format.

  28. PRIDE BioMart – Defining a Complex Filter

  29. Typical BioMart Usage Step 1 (Dataset): Choose your dataset Step 2 (Filters): Restrict your query Step 3 (Attributes): Specify what information you want to include in the output Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format.

  30. PRIDE BioMart – Selecting Output Fields

  31. Typical BioMart Usage Step 1 (Dataset): Choose your dataset Step 2 (Filters): Restrict your query Step 3 (Attributes): Specify what information you want to include in the output Step 4 (Results): Preview (including a simple count) and output or download the results in your chosen format.

  32. PRIDE BioMart – Retrieving Results

  33. PRIDE BioMart – Output to Microsoft Excel

  34. The Ontology Lookup Service: Intelligent Query for PRIDE and Beyond…

  35. Ontologies – more than just a list of terms • A vocabulary of terms (names for concepts) • use stable identifiers for each concept • Definitions • Authoritative and unambiguous meaning for each concept and the context in which it should be used. • Defined logical relationships between terms • More complexity than a simple hierarchy. Child terms can be related to more than one parent and parent terms can have multiple children. Relationships themselves carry a significance.
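A minimal sketch of why an ontology is more than a hierarchy: terms sit in a directed graph, a child may have several parents, and each edge carries a relationship type. The terms below are illustrative and loosely modelled on cellular component names; they are not copied from any ontology release.

```python
# Hypothetical mini-ontology: child term -> list of (relationship, parent term).
RELATIONS = {
    "mitochondrial membrane": [("is_a", "membrane"), ("part_of", "mitochondrion")],
    "membrane": [("is_a", "cellular component")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, relations):
    """All terms reachable from `term` by following any relationship upwards."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for _relationship, parent in relations.get(node, []):
            if parent not in seen:      # a parent may be reached by several paths
                seen.add(parent)
                stack.append(parent)
    return seen


def descendants(term, relations):
    """All terms that have `term` among their ancestors."""
    return {t for t in relations if term in ancestors(t, relations)}


print(sorted(ancestors("mitochondrial membrane", RELATIONS)))
print(sorted(descendants("cellular component", RELATIONS)))
```

Searching on a parent term can then be expanded to all of its descendants, which is what makes ontology-backed queries broader than plain keyword matching.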

  36. http://www.ebi.ac.uk/ontology-lookup/ What is OLS? • A unified, single point of query for over 54 ontologies (updated daily) and upwards of 530,000 terms. • A tool that offers online and programmatic access to query ontologies about: • Term names • Synonyms • Relationships • Annotations • Cross-references • Reusable code components to integrate such functionality in other projects

  37. The Use of Controlled Vocabularies and Ontologies in PRIDE. PRIDE requires that controlled vocabularies / ontologies (OBO ontologies) are used to define the search space: • Species: Newt / NCBI Taxonomy ID • Tissue / organ / cell type: BRENDA Tissue ontology, Cell Type ontology • Sub-cellular component: GO • Disease: Human Disease Ontology (DOID) • Genotype: GO • Sample Processing: PSI Ontology • Mass Spectrometry: PSI-MS Ontology • Protein Modifications: PSI-MOD Ontology • Terms that fit nowhere else: PRIDE CV

  38. Ontology Lookup Service (OLS) http://www.ebi.ac.uk/ols

  39. The Protein Identifier Cross Reference Service: Solving the Protein Accession Problem in PRIDE

  40. Why do you need ID mapping? • Merging datasets to a common identifier space, finding all aliases/synonyms for an identifier (data integration, submissions!) • Mapping from secondary IDs to more recent primary IDs (data “freshness”) • Preparing data sets for specific tools, querying in various primary databases (data format requirements)

  41. Protein identifier mapping is hard • The basic problem: the same protein sequence is referred to by multiple accession numbers assigned by multiple databases. • No universal identifier scheme • Redundant databases: multiple identifiers for the same sequence in the same database • Unstable identifiers (e.g. gi numbers) • Obsolete and deleted identifiers (hypothetical proteins) • Different production cycles for major databases • Tools exist, but are limited in their database and species coverage and in their usability and availability. • UniParc is a major component
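The core idea behind sequence-based identifier mapping can be sketched as follows: accessions from different databases are grouped by a key derived from the sequence itself, so any accession can be translated into all others that share its sequence. This is an illustration of the principle only, not PICR's implementation; the database names, accessions and sequences are invented.

```python
import hashlib
from collections import defaultdict

# Hypothetical records: (database, accession, sequence, is_active).
RECORDS = [
    ("DB_A", "A0001", "MKTAYIAKQR", True),
    ("DB_B", "B_77", "MKTAYIAKQR", True),
    ("DB_B", "B_12", "MKTAYIAKQR", False),   # obsolete entry, same sequence
    ("DB_A", "A0002", "MSLLTEVETY", True),
]


def sequence_key(sequence):
    """Stable key for a sequence; identical sequences share one key."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def build_index(records):
    """Group accessions by sequence key and remember each accession's key."""
    by_seq = defaultdict(list)
    acc_to_key = {}
    for db, acc, seq, active in records:
        key = sequence_key(seq)
        by_seq[key].append((db, acc, active))
        acc_to_key[acc] = key
    return by_seq, acc_to_key


def map_accession(acc, by_seq, acc_to_key, targets=None, active_only=True):
    """Return (database, accession) pairs that share the sequence of `acc`."""
    key = acc_to_key.get(acc)
    if key is None:
        return []                      # unknown accession
    return [
        (db, other)
        for db, other, active in by_seq[key]
        if other != acc
        and (targets is None or db in targets)
        and (active or not active_only)
    ]


by_seq, acc_to_key = build_index(RECORDS)
print(map_accession("A0001", by_seq, acc_to_key))
print(map_accession("A0001", by_seq, acc_to_key, active_only=False))
```

The `active_only` switch mirrors the choice between returning all mappings and returning only active ones; keeping obsolete accessions in the index is what allows old identifiers to be mapped forward.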

  42. PICR: Home page http://www.ebi.ac.uk/tools/picr • Limit search by taxonomy (pessimistic) • Submit accessions OR sequences (FASTA) with 500 entry interactive limit (no batch limit) • Choose to return all mappings or only active ones • Select output format • Select one or many databases to map to in one request • Run search

  43. PICR Result Page – simple view. Labels on the screenshot: logical xref (hyperlinked), secondary identifier, active xref (hyperlinked), inactive xref

  44. PICR Result Page – detailed view

  45. PICR Result Page – XLS view

  46. PICR in PRIDE

  47. PRIDE PLAYS WELL WITH OTHER PROTEOMICS REPOSITORIES

  48. PRIDE to Tranche: large (binary) files. Tranche, Falkner and Andrews, http://tranche.proteomecommons.org

  49. PRIDE and World 2D PAGE: 2D-PAGE gels and gel spots

  50. ProteomExchange consortium • Sharing proteomics data between existing proteomics repositories • Includes PeptideAtlas, GPMDB, and PRIDE • Submission guidelines document finalized, currently being tested • Guidelines primarily deal with data types, formatting and reporting requirements • Both submitters and journals are quite interested and drive the process
