1 / 57

Proteomics repositories

Proteomics repositories. Dr. Juan Antonio VIZCAINO PRIDE Group coordinator. PRIDE team, Proteomics Services Group PANDA group European Bioinformatics Institute Hinxton , Cambridge United Kingdom. Overview …. Why sharing proteomics data? PRIDE in detail … Other proteomics repositories

zuriel
Download Presentation

Proteomics repositories

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Proteomics repositories Dr. Juan Antonio VIZCAINO PRIDE Group coordinator PRIDE team, Proteomics Services Group PANDA group European Bioinformatics Institute Hinxton, Cambridge United Kingdom

  2. Overview … • Why sharing proteomics data? • PRIDE in detail… • Other proteomics repositories • ProteomeXchange consortium

  3. THE RATIONALE BEHIND SHARING PROTEOMICS DATA

  4. Key technologies in modern biology Genomics DNA sequencing is a central technology for studying DNA Microarrays and RNA-seqare a central technology for studying RNA Mass spectrometry is a central technology for studying the proteome. Transcript- omics Proteomics

  5. Need of data sharing in the proteomics field

  6. Proteomics data sharing: why? • 1) Data producers are not always the best data analysts • Sharing of data allows analysts access to real data, and in turn allows • better analysis tools to be developed • 2) Meta-analysis of data can recycle previous findings for new tasks • Putting findings in the context of other findings increases their scope • 3) Sharing data allows independent review of the findings • When actual replication of an experiment is often impossible, a re-analysis or spot checks on the obtained data become vitally important • 4) Direct benefit for the field: fragmentation models, spectral libraries, ...

  7. How do we make this all happen? • Journal guidelines • Journal guidelines heavily influence the decisions taken by authors; by first requesting and subsequently mandating data submission to established repositories, they provide an important stick. • Funder support and guidelines • Funders contribute both sticks and carrots. The sticks lie in the grant application guidelines; they can require a plan for data management and dissemination. The carrot is in providing specific funding for this aspect of science. • Data repositories • The availability of reliable, freely available repositories is key; submission thresholds should be kept low and added value needs to be provided. Furthermore, feedback loops need to be established in order to ensure that accumulated data flows back to the user community. Repositories thus provide mostly carrots.

  8. WHAT CAN BE STORED IN PROTEOMICS REPOSITORIES

  9. MS proteomics: peptide IDs and protein IDs TDMDNQIVVSDYAQMDR LFDQAFGLPR AKPLMELIER DESTNVDMSLAQR DIVVQETMEDIDK NGMFFSTYDR GTAGNALMDGASQL sequence database peptides Search engine MS/MS spectra IPI00302927 IPI00025512 IPI00002478 IPI00185600 IPI00014537 IPI00298497 IPI00329236 IPI00002232 proteins

  10. Types of information stored • 1) Original experimental data recorded by the mass spectrometer (primary data)

  11. Primary data Binary data mzData XML-based files mzML mzXML .dta, .pkl, .mgf, .ms2 Peak lists

  12. Types of information stored • 1) Original experimental data recorded by the mass spectrometer (primary data) • 2) Identification results inferred from the original primary data

  13. Peptide and Protein Identifications mzIdentML, mascot .dat, sequest .out, SpectrumMill .spo pep.xml, prot.xml Only qualitative data!

  14. Types of information stored • 1) Original experimental data recorded by the mass spectrometer (primary data) • 2) Identification results inferred from the original primary data • 3) Experimental and technical metadata

  15. Controlled Vocabularies (CVs) Term Synonyms TOF T.O.F. 100173 time-of-flight time of flight

  16. Relationships between CV terms TOF T.O.F. 100123 mass analyzer 100101 mass spectrometer is-a part-of 100173 time-of-flight time of flight

  17. CVs, ontologies (here: PSI-MOD) http://www.ebi.ac.uk/ols

  18. Types of information stored • 1) Original experimental data recorded by the mass spectrometer (primary data) • 2) Identification results inferred from the original primary data • 3) Experimental and technical metadata • 4) Quantification information

  19. Wide variety of quantitative techniques…

  20. Quantification techniques Label free Gel-based quantitation approaches • Different philosophies • Very heterogeneous data formats • Techniques not very well established Very problematic data for proteomics repositories

  21. Types of information stored • 1) Original experimental data recorded by the mass spectrometer (primary data) • 2) Identification results inferred from the original primary data • 3) Experimental and technical metadata • 4) Quantification information

  22. PROTEOMICS REPOSITORIES

  23. Existing proteomics repositories • Main public repositories: • - PROteomicsIDEntifications database (PRIDE) • - Global Proteome Machine (GPMDB) • - Peptide Atlas • - Tranche • Smaller scale repositories, more specialized: • Among others: Human Proteinpedia, Genome Annotation Proteomics Pipeline (GAPP), MAPU, SwedCAD, PepSeeker, Open Proteomics Database, … • Very diverse: different aims, functionalities, …

  24. Other MS proteomics repositories Tranche Reprocesses data Reprocesses data No reprocessing No reprocessing Editorial control Editorial control No editorial control No editorial control Limited annotation Limited annotation Detailed annotation Limited annotation ?? 539 million peptides 302 million spectra ?? ?? ?? 63 million protein IDs 9.2 million protein IDs

  25. PeptideAtlas • Peptide identifications from MS/MS • Data are reprocessed using the popular Trans Proteomic Pipeline (TPP) • - Uses PeptideProphet to derive a probability for the correct identification for all contained peptides http://www.peptideatlas.org

  26. All peptides mapped to Ensembl using ProteinProphet (for human) Built by the Aebersold lab to help them find proteotypicpeptides Provides proteotypic peptide predictions Limited metadata Great support for targeted proteomics approaches (SRM/MRM) PeptideAtlas http://www.peptideatlas.org

  27. PeptideAtlas Builds available at present http://www.peptideatlas.org

  28. End point of the GPM proteomics pipeline, to aid in the process of validating peptide MS/MS spectra and protein coverage patterns. GPMDB http://gpmdb.thegpm.org/

  29. End point of the GPM proteomics pipeline, to aid in the process of validating peptide MS/MS spectra and protein coverage patterns. Data are reprocessed using the popular X!Tandemor X!Hunterspectral searching algorithm Also provides proteotypicpeptides GPMDB http://gpmdb.thegpm.org/

  30. Powerful visualization features Provides very limited annotation with GO, BTO Some support to targeted approaches is available GPMDB http://gpmdb.thegpm.org/

  31. Peer-to-peer distributed filesystem (original name: the DFS) Meant to securely store, and conveniently deliver large amounts of data Provides a highly specialized, but much needed niche service Has already been used by PRIDE to store certain large files Very limited annotation (metadata is not mandatory) Tranche http://tranche.proteomecommons.org

  32. Other MS proteomics repositories Tranche Reprocesses data Reprocesses data No reprocessing No reprocessing Editorial control Editorial control No editorial control No editorial control Limited annotation Limited annotation Detailed annotation Limited annotation ?? 539 million peptides 302 million spectra ?? ?? ?? 63 million protein IDs 9.2 million protein IDs

  33. The PRIDE database

  34. MS proteomics: Shot-gun/bottom-up approaches MS/MS analysis P R O T O C O L peptides sequence database proteins fragmentation MS analysis

  35. MS/MS analysis PRIDE database (www.ebi.ac.uk/pride) peptides sequence database proteins fragmentation PRIDE stores: 1) PeptideIDs 2) Protein IDs 3) Mass spectra as peak lists 4) Valuable additional metadata MS analysis

  36. PRIDE: why is it there? • Repository to support publications (proteomics MS derived data)

  37. Journal Submission Recommendations • Journal guidelines recommend now submission to proteomics repositories: • Proteomics • Nature Biotechnology • Nature Methods • Molecular and Cellular Proteomics • Funding agencies are enforcing public deposition of data to maximize the value of the funds provided

  38. MCP guidelines • New guidelines from MCP for data deposition: • For all proteins identified on the basis of ONE OR TWO unique peptide spectra, the ability to view annotated spectra for these identifications must be made available. This can be achieved in one of three ways: • Submission of spectra and search results to a public results repository that is equipped with a spectral viewer (e.g. PRIDE, Peptidome etc). This information will appear as a hyperlink in the published article… • Submission (with the manuscript) of spectra and search results in a file format that allows visualization of the spectra using a freely-available viewer. • Submission (with the manuscript) of annotated spectra in an ‘office’ or PDF format.

  39. PRIDE: why is it there? • Repository to support publications (proteomics MS derived data) • Source of proteomics data for other data resources PRIDE: reliable source of MS proteomics data for other resources

  40. Data content in PRIDE

  41. Data content in PRIDE

  42. PRIDE data content Protein IDs Peptide IDs

  43. THE LOOK OF PRIDE

  44. PRIDE web interface – overview PART_OF www.ebi.ac.uk/pride

  45. PRIDE web interface – experiment and protein

  46. PRIDE web interface – mass spectra

  47. PRIDE BioMart

  48. PRIDE AND OTHER REPOSITORIES: ProteomeXchange

  49. For sharing, superstructures must be built IntAct … BIND PRIDE … EMBL MINT GPMDB DIP interactions IMEx DDBJ Tranche NCBI PeptideAtlas mass spec ProteomeXchange sequence databases (INSDC) Often, multiple repositories will emerge more or less simultaneously in a particular field. By exchanging data, and by collaborating on data acquisition an increase in coverage as well as a more comprehensive dataset is obtained by each individual resource. Such superstructures do require additional infrastructure, however.

  50. ProteomeXchangeconsortium • Sharing proteomics data between existing proteomics repositories • Includes PeptideAtlas, GPMDB, and PRIDE, with data sharing infrastructure provided by different members • Submission guidelines document finalized, it was proven on three different datasets • ProteomeXchangeis primarily user-oriented: the idea is to provide a single point of submission, but multiple points of data visualization and analysis www.proteomexchange.org

More Related