Bioinformatics Committee Exploration of the sPRG2006 Study Data Sets

This study explores the sPRG2006 datasets to determine the best answer set of proteins, assess the consistency of analysis tools, and examine sources of variation in previous sPRG study results. A master protein list was generated, showing consistent results among experts using different analysis tools.



  1. Bioinformatics Committee Exploration of the sPRG2006 Study Data Sets ABRF2007, Tampa, FL

  2. Bioinformatics Committee (BIC)
  General Goals
  • Examine sPRG2006 datasets
  • Act as the foundation for an ABRF "iPRG" – Proteome Informatics Research Group
  Members
  • Jeffrey A. Kowalak - National Institutes of Health
  • William S. Lane - Harvard University
  • Alexey Nesvizhskii - University of Michigan
  • Brian C. Searle - Proteome Software
  • Sean L. Seymour - Applied Biosystems (chair)
  • David L. Tabb - Vanderbilt University

  3. Specific Questions of Interest
  • What is the best answer set of proteins for the sample?
    • Which 'bonus proteins' reported by respondents last year are correct?
    • Careful protein inference is assumed to be key in answering this.
    • A 'best answer' for the sample would allow these data sets to be used for future informatics tool development, testing, etc.
  • Can experts in informatics get consistent results using different analysis tools?
    • How similar are existing protein inference tools?
  • Can we dissect acquisition and informatics sources of variation in the quality of answers from last year's sPRG study?

  4. Analyzed Data: sPRG2006 Study
  • 78 labs responded to last year's sPRG Study
  • 24 labs provided raw data (via the Tranche repository)
  • Converted data to mzData, mzXML, .dta, and .mgf (an .mgf reader is sketched below)
  • 19 sets were ultimately examined, including:
    • Data from all major vendors' instruments
    • ESI and MALDI; 1D-LC, 2D-LC, and gel workflows
  • Only MS/MS data were considered.
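The .mgf (Mascot Generic Format) files mentioned above are plain text and straightforward to parse. Below is a minimal reader sketch; the file name in the usage comment is hypothetical, and real files may carry header keys beyond TITLE/PEPMASS/CHARGE.

```python
def read_mgf(path):
    """Yield MS/MS spectra from a Mascot Generic Format (.mgf) file.

    Each spectrum is returned as a dict of its header fields (TITLE,
    PEPMASS, CHARGE, ...) plus a 'peaks' list of (m/z, intensity) pairs.
    """
    spectrum = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line == "BEGIN IONS":
                spectrum = {"peaks": []}
            elif line == "END IONS":
                yield spectrum
                spectrum = None
            elif spectrum is not None and line:
                if "=" in line and not line[0].isdigit():
                    # Header line such as TITLE=... or PEPMASS=...
                    key, _, value = line.partition("=")
                    spectrum[key] = value
                else:
                    # Peak line: m/z and intensity, whitespace-separated
                    mz, intensity = line.split()[:2]
                    spectrum["peaks"].append((float(mz), float(intensity)))

# Usage (hypothetical file name):
# for spec in read_mgf("lab01_run1.mgf"):
#     print(spec.get("TITLE"), len(spec["peaks"]))
```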

  5. A Diversity of Analysis Tools Was Used

  6. Method for Finding Consensus
  • Stage 1: Free exploration of data sets
  • Stage 2: From inclusive to focused: BIC Candidates FASTA database
    • Included all accessions that anyone suggested might be correct:
      • Original respondents' protein lists
      • BIC additions from UniProt, IPI, NCBI, Sigma recombinant proteins, and a contaminant database
    • Searched a concatenated decoy database (see the sketch below)
  • Stage 3: Final searches on the BIC FASTA DB
    • Upon seeing rough results, we removed unnecessary non-human proteins.

  Composition of the BIC Candidates database:
  All Swiss-Prot human proteins: 15,637
  Strong IPI, NCBI candidates: 4
  Strong non-human candidates: 39
  Strong Sigma sequence candidates: 1
  Total: 15,681 (31,362 including reversed)
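Stage 2's concatenated decoy search doubles the database by appending a reversed copy of every target sequence (15,681 targets becoming 31,362 entries, per the table above). A minimal sketch of that construction follows; the file names and the REV_ accession prefix are illustrative, not necessarily what the BIC used.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

def write_concatenated_decoy(target_path, out_path, prefix="REV_"):
    """Write all target entries, then a reversed-sequence decoy for each."""
    with open(out_path, "w") as out:
        for header, seq in read_fasta(target_path):
            out.write(f">{header}\n{seq}\n")
        for header, seq in read_fasta(target_path):
            out.write(f">{prefix}{header}\n{seq[::-1]}\n")

# Hypothetical file names:
# write_concatenated_decoy("bic_candidates.fasta", "bic_candidates_decoy.fasta")
```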

  7. BIC Consensus Protein Annotation
  • Not everyone searched all 19 files – 84 searches in total
  • Of 247 total proteins that had any evidence:
    • 133 were vouched for by at least one person
    • Reduced to 104 by requiring a majority of the members analyzing the same lab's data (sketched below)
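The majority rule above can be made concrete. The sketch below is one plausible reading of that filter, assuming each detection is recorded as a (member, lab, protein) tuple; the exact voting scheme the BIC applied may differ.

```python
from collections import defaultdict

def majority_consensus(calls, members_per_lab):
    """Keep proteins vouched for by a strict majority of the BIC members
    who analyzed the same lab's data set.

    calls:           iterable of (member, lab, protein) detections
    members_per_lab: dict mapping lab -> number of members who searched it
    """
    votes = defaultdict(set)  # (lab, protein) -> set of members reporting it
    for member, lab, protein in calls:
        votes[(lab, protein)].add(member)

    consensus = set()
    for (lab, protein), voters in votes.items():
        if len(voters) > members_per_lab[lab] / 2:
            consensus.add(protein)
    return consensus
```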

  8. BIC Consensus Protein Annotation
  Total proteins detected in any lab's data: 104

  Expected: Intended 49 proteins: 49
  Expected: Additional human proteins: 20
  Expected: E. coli: 3
  Total desirable protein detections (originally in sample): 72

  Expected: Digestion enzyme: 2

  Variable: Human keratins: 12
  Variable: Sheep proteins (from wool): 5
  Variable: Trypanosome proteins: 9
  Variable: E. coli proteins: 4
  Total other detections (avoidable artifacts): 30

  9. Final Searches: See the BIC Poster for All Search Results
  [Figure: grid of search results – one block per participant lab, one row per putative protein detection, one column per analysis by one BIC member; each cell gives the peptide metric reported for the protein, with ambiguity among accession numbers preserved.]

  10. Final Searches: See the BIC Poster for All Search Results
  [Figure: example block for participant lab 28629, with a column per BIC member analysis; rows flag Trypanosome protein contaminants.]

  11. BIC Assessment of Data Set Quality and Consistency of Analysis
  • Not everyone counts peptides the same way.
  • However, trends were clearly consistent.
  • To assess consistency of our analyses and the fundamental quality of the acquired data sets, peptide measures were normalized.

  12. Normalization of Peptide Metrics across BIC Member Analyses
  • Considered 4 data sets.
  • Sum the peptide metric for each protein per BIC member.
  • Compute the mean value for each protein across BIC members.
  • Determine a normalization factor for each BIC member (sketched below).
  [Figure: peptide metrics for expected proteins 1-49.]
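Combined with the y = mx fit described on slide 22, the normalization reduces to a few array operations. The sketch below uses hypothetical peptide counts; rows are BIC members, columns are proteins summed over the 4 reference data sets.

```python
import numpy as np

# Hypothetical summed peptide metrics: one row per BIC member,
# one column per expected protein, totaled over the 4 reference data sets.
counts = np.array([
    [12.0, 30.0, 7.0, 45.0],
    [10.0, 28.0, 5.0, 40.0],
    [15.0, 35.0, 9.0, 50.0],
])

# Mean metric for each protein across BIC members.
group_mean = counts.mean(axis=0)

# Per-member correction factor from a least-squares fit of y = m*x,
# with x = that member's values and y = the group mean (see slide 22).
factors = (counts * group_mean).sum(axis=1) / (counts ** 2).sum(axis=1)

# Rescale each member's metrics toward the group consensus.
normalized = counts * factors[:, None]
```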

  13. Consistency of BIC on Individual Data Sets
  • For cross-validation, consider data sets not used to determine the normalization factors (a sketch of this check follows).
  [Figure: expected proteins 1-49.]
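The hold-out check is then just a matter of applying the factors learned on the reference sets to a data set they never saw; if the analyses are consistent, the member profiles should collapse onto one curve. A sketch with hypothetical numbers:

```python
import numpy as np

# Correction factors fit on the 4 reference data sets (hypothetical values).
factors = np.array([1.04, 1.12, 0.88])

# Raw per-protein metrics from one held-out data set, one row per member.
held_out = np.array([
    [ 9.0, 25.0, 4.0, 38.0],
    [ 8.0, 22.0, 3.0, 33.0],
    [12.0, 29.0, 6.0, 44.0],
])

# If the factors generalize, the rescaled rows should closely agree.
rescaled = held_out * factors[:, None]
```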

  14. Consistency of BIC on Individual Data Sets
  • Averaging all BIC measures for any single data set yields a consensus on the information content or quality of the data set (sketched below).
  [Figure: expected proteins 1-49, with error bars indicating variance across the BIC; …or, for simplicity, just the trend line.]
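The consensus curve and its error bars reduce to a mean and a spread across members after normalization. A self-contained sketch with hypothetical values:

```python
import numpy as np

# Normalized per-protein metrics for one data set, one row per BIC member
# (hypothetical values after applying each member's correction factor).
normalized = np.array([
    [11.8, 29.5, 6.9, 44.1],
    [11.2, 30.4, 6.1, 43.0],
    [12.5, 31.0, 7.4, 45.2],
])

consensus = normalized.mean(axis=0)          # consensus quality profile
error_bars = normalized.std(axis=0, ddof=1)  # spread across BIC members
```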

  15. BIC Consensus: Qualities of Acquired Data
  [Figure: expected proteins 1-49.]

  16. BIC Consensus: Qualities of Acquired Data
  [Figure: proteins 50-80.]

  17. BIC Consensus: Qualities of Acquired Data
  [Figure: E. coli proteins 100-145.]

  18. Re-grading of sPRG2006 Results • Using the annotation, the picture changes slightly.
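Re-grading is essentially set arithmetic between each submitted list and the 104-protein master list. The sketch below is illustrative; the categories mirror the annotation on slide 8, but the actual grading criteria are those on the BIC poster.

```python
def regrade(submitted, master, artifacts):
    """Re-grade a lab's submitted protein list against the BIC master list.

    submitted: set of accessions reported by the lab
    master:    set of accessions on the 104-protein master list
    artifacts: subset of the master list counted as avoidable contaminants
    """
    confirmed   = submitted & (master - artifacts)  # desirable detections
    contaminant = submitted & artifacts             # keratins, wool, etc.
    unsupported = submitted - master                # no consensus evidence
    return len(confirmed), len(contaminant), len(unsupported)
```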

  19. Conclusions
  • A master protein list has been produced for the sPRG2006 study sample.
  • Despite using very different tools, experts were able to get surprisingly consistent results.
    • A key to recognizing this consistency was careful alignment of reported proteins, preserving ambiguity among accession numbers where possible.
  • By doing our own informatics analyses, we were able to order the participants' data sets by our assessment of the quality of the acquired data.
    • We reconfirm the conclusion of last year's sPRG study that the expertise of the user seems to be the dominant factor in the quality of the acquired data.
    • There are both good and bad sets from high-res and low-res instruments, as well as newer and older instruments.
  • Re-grading submitted protein lists with our master protein list allowed confirmation of additional correct proteins reported by last year's respondents.
  • To what extent is expertise a factor in the consistency of informatics analyses?
  • www.abrf.org/sprg

  20. Acknowledgements
  • BIC Members
    • Jeffrey A. Kowalak - National Institutes of Health
    • William S. Lane - Harvard University
    • Alexey Nesvizhskii - University of Michigan
    • Brian C. Searle - Proteome Software
    • Sean L. Seymour - Applied Biosystems (chair)
    • David L. Tabb - Vanderbilt University
  • The sPRG
  • Jayson Falkner (Phil Andrews lab, UMichigan)
  • All sPRG2006 Study respondents who submitted their raw data!

  21. Appendix

  22. Normalization of Peptide Metrics across BIC Member Analyses • Comparison of the individual means to the group mean allows the derivation of a correction factor from a simple fit to y=mx.
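For a zero-intercept fit y = mx, minimizing the sum of squared residuals gives the closed form m = Σxy / Σx². A minimal sketch (the example values are hypothetical):

```python
def fit_through_origin(x, y):
    """Least-squares slope for y ≈ m*x with no intercept term.

    Setting the derivative of sum((y - m*x)**2) with respect to m
    to zero gives m = sum(x*y) / sum(x*x).
    """
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = sum(xi * xi for xi in x)
    return num / den

# e.g. one member's per-protein means vs. the group means:
# m = fit_through_origin([12, 30, 7], [12.3, 31.0, 7.0])
```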

  23. sPRG2006 Proteomics Standard (www.abrf.org/sprg)
  Note: UBC9_HUMAN - formerly UBE2I_HUMAN; SYHC_HUMAN - formerly SYH_HUMAN

  24. sPRG2006 Protein Identification

  25. Proteins Identified by 2 or More Labs

  26. Final Searches
  • Not all 19 data sets were searched by all 6 members of the BIC.
