Bioinformatics Committee Exploration of the sPRG2006 Study Data Sets

Bioinformatics Committee Exploration of the sPRG2006 Study Data Sets ABRF2007, Tampa, FL

Bioinformatics Committee (BIC)
General Goals
• Examine sPRG2006 datasets
• Act as the foundation for an ABRF "iPRG" – Proteome Informatics Research Group
Members
• Jeffrey A. Kowalak - National Institutes of Health
• William S. Lane - Harvard University
• Alexey Nesvizhskii - University of Michigan
• Brian C. Searle - Proteome Software
• Sean L. Seymour - Applied Biosystems (chair)
• David L. Tabb - Vanderbilt University

Specific Questions of Interest
• What is the best answer set of proteins for the sample?
• Which 'bonus proteins' reported by respondents last year are correct?
• Careful protein inference assumed to be key in answering this.
• A 'best answer' for the sample would allow these data sets to be used for future informatics tools development, testing, etc.
• Can experts in informatics get consistent results using different analysis tools?
• How similar are existing protein inference tools?
• Can we dissect acquisition and informatics sources of variation in the quality of answers from last year’s sPRG study?

Analyzed Data: sPRG2006 Study
• 78 labs responded to last year's sPRG Study
• 24 labs provided raw data (used Tranche repository)
• Converted data to mzData, mzXML, .dta, and .mgf
• 19 sets were ultimately examined
• These sets included:
  • Data from all major vendors' instruments
  • ESI and MALDI, 1D-LC, 2D-LC, and gel workflows.
• Only MS/MS data were considered.

A Diversity of Analysis Tools were Used

Method for Finding Consensus
• Stage 1: Free exploration of data sets
• Stage 2: From inclusive to focused: BIC Candidates FASTA database
  • Included all accessions that anyone suggested might be correct:
    • Original respondents’ protein lists.
    • BIC members' suggestions from UniProt, IPI, NCBI, Sigma recombinant proteins, and a contaminant database.
  • Searched concatenated decoy database.
• Stage 3: Final searches on BIC FASTA DB
  • Upon seeing rough results, we removed unnecessary non-human proteins.

Database contents                   Entries
All Swiss-Prot human proteins       15,637
Strong IPI, NCBI candidates              4
Strong non-human candidates             39
Strong Sigma sequence candidates         1
Total                               15,681 (31,362 including reversed)
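The "including reversed" total implies that each target sequence was paired with a reversed decoy. The following is a minimal sketch of how such a concatenated target-decoy list could be built. The `REV_` prefix, the function names, and the two toy sequences are illustrative assumptions, not the BIC's actual tooling or data.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    entries = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def add_reversed_decoys(entries, prefix="REV_"):
    """Return the targets followed by one reversed-sequence decoy per target.

    The prefix is a hypothetical naming convention chosen for this sketch.
    """
    decoys = [(prefix + header, seq[::-1]) for header, seq in entries]
    return entries + decoys


# Toy input: two made-up sequences, not real accessions.
targets = read_fasta(">P1 example\nMKTAYIAK\nQR\n>P2 example\nGAVLIM\n")
combined = add_reversed_decoys(targets)
```

With this construction the combined list is always exactly twice the size of the target list, which matches the 15,681 to 31,362 doubling reported on the slide.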

BIC Consensus Protein Annotation
• Not everyone searched all 19 files – total of 84 searches
• Of 247 total proteins that had any evidence:
  • 133 were vouched for by at least one person
  • Reduced to 104 by requiring a majority in same lab
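One possible reading of the two filters above, sketched in code: a protein is "vouched for" if any analyst reported it for any lab, and it survives the majority filter if, for at least one lab's data set, more than half of the analysts who searched that data set reported it. This interpretation, the function name, and the toy labels are assumptions made for illustration; the slide does not spell out the exact rule.

```python
from collections import defaultdict


def consensus_proteins(calls, searched):
    """Apply the two filters under the assumed interpretation.

    calls:    iterable of (lab, member, protein) tuples, one per vouch.
    searched: dict mapping lab -> set of members who searched that lab's data.
    Returns (vouched_by_anyone, majority_in_some_lab) as two sets of proteins.
    """
    votes = defaultdict(set)  # (lab, protein) -> members who vouched for it
    for lab, member, protein in calls:
        votes[(lab, protein)].add(member)

    vouched = {protein for (_lab, protein) in votes}
    majority = {
        protein
        for (lab, protein), members in votes.items()
        if len(members) > len(searched[lab]) / 2
    }
    return vouched, majority


# Hypothetical example: three analysts searched labA, two searched labB.
calls = [
    ("labA", "m1", "P1"), ("labA", "m2", "P1"), ("labA", "m3", "P1"),
    ("labA", "m1", "P2"),
    ("labB", "m1", "P3"), ("labB", "m2", "P3"),
]
searched = {"labA": {"m1", "m2", "m3"}, "labB": {"m1", "m2"}}
vouched, majority = consensus_proteins(calls, searched)
```

In this toy case P2 is vouched for by one analyst out of three, so it appears in the first set but not the second.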

BIC Consensus Protein Annotation

Total proteins detected in any lab’s data                   104
Expected: Intended 49 proteins                               49
Expected: Additional human proteins                          20
Expected: E.coli                                              3
Total desirable protein detections (originally in sample)    72
Expected: Digestion enzyme                                    2
Variable: Human keratins                                     12
Variable: Sheep proteins (from wool)                          5
Variable: Trypanosome proteins                                9
Variable: E.coli proteins                                     4
Total other detections (avoidable artifacts)                 30

Final Searches: See the BIC Poster for All Search Results
[Figure: layout of the search results, with callouts:]
• Block for each participant lab
• Each row is one putative protein detection
• Column per analysis by one BIC member
• Cell gives peptide metric reported for protein
• Ambiguity among Accession Numbers

Final Searches: See the BIC Poster for All Search Results
[Figure: search results for participant lab 28629, one column per BIC analysis and one row per protein, highlighting Trypanosome protein contaminants]

BIC Assessment of Data Set Quality and Consistency of Analysis
• Not everyone counts peptides the same way.
• However, trends were clearly consistent.
• To assess consistency of our analyses and the fundamental quality of the acquired data sets, peptide measures were normalized.

Normalization of Peptide Metrics across BIC Member Analyses
• Considered 4 data sets.
• Sum each protein per BIC.
• Compute the mean value for each protein across BIC.
• Determine a normalization factor for each BIC member.
[Plot: Expected Proteins 1-49]
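The steps above can be sketched as follows, assuming the per-member factor is the slope of a through-origin fit (y = mx, as described in the appendix) with x taken as the member's summed value for each protein and y as the group mean. The direction of the fit, the function name, and the toy numbers are assumptions for illustration only.

```python
def normalization_factors(totals):
    """Compute one scaling factor per BIC member.

    totals: dict mapping member -> dict mapping protein -> summed peptide
            metric (already summed over the data sets being considered).
    Returns dict mapping member -> slope m of the least-squares fit y = m * x,
    where x is the member's value and y is the mean across members.
    """
    # Only proteins reported by every member can be compared directly.
    proteins = sorted(set.intersection(*(set(t) for t in totals.values())))
    group_mean = {
        p: sum(t[p] for t in totals.values()) / len(totals) for p in proteins
    }

    factors = {}
    for member, t in totals.items():
        sxy = sum(t[p] * group_mean[p] for p in proteins)
        sxx = sum(t[p] ** 2 for p in proteins)
        if sxx == 0:
            raise ValueError(f"no non-zero values for member {member!r}")
        factors[member] = sxy / sxx
    return factors


# Hypothetical totals: member B counts peptides twice as generously as A.
totals = {
    "A": {"P1": 10.0, "P2": 20.0},
    "B": {"P1": 20.0, "P2": 40.0},
}
factors = normalization_factors(totals)
normalized = {
    member: {p: v * factors[member] for p, v in t.items()}
    for member, t in totals.items()
}
```

Multiplying each member's values by that member's factor puts differently counted peptide metrics on a common scale, so that trends can be compared across analyses.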

Consistency of BIC on Individual Data Sets
• For cross-validation, consider data sets not used to determine the normalization factors.
[Plot: Expected Proteins 1-49]

Consistency of BIC on Individual Data Sets
• Averaging all BIC measures for any single data set yields a consensus on the information content or quality of the data set.
[Plot: Expected Proteins 1-49, with error bars indicating variance across BIC]
[Plot: Expected Proteins 1-49 …or for simplicity, just the trend line.]
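As a rough illustration of the averaging step, the sketch below computes a per-protein mean and variance across analysts for a single data set. It assumes the values have already been normalized and that population variance is the intended spread measure; the slide does not say which variance estimator was used, and the numbers are made up.

```python
from statistics import mean, pvariance


def consensus_profile(normalized):
    """Average normalized peptide metrics across BIC members.

    normalized: dict mapping member -> list of normalized metrics, one per
                protein, with every list in the same protein order.
    Returns a list of (mean, variance) tuples, one per protein.
    """
    # zip(*...) regroups the per-member lists into per-protein columns.
    columns = zip(*normalized.values())
    return [(mean(col), pvariance(col)) for col in columns]


# Hypothetical normalized values for two proteins from two analysts.
profile = consensus_profile({"A": [10, 4], "B": [14, 4]})
```

The mean series corresponds to the consensus trend line, and the variance series to the error bars.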

BIC Consensus: Qualities of Acquired Data
[Plot: Expected Proteins 1-49]

BIC Consensus: Qualities of Acquired Data
[Plot: Proteins 50-80]

BIC Consensus: Qualities of Acquired Data
[Plot: E.coli Proteins 100-145]

Re-grading of sPRG2006 Results • Using the annotation, the picture changes slightly.

Conclusions
• A master protein list has been produced for the sPRG2006 study sample.
• Despite using very different tools, experts were able to get surprisingly consistent results.
• A key to recognizing this consistency was careful alignment of reported proteins preserving ambiguity among accession numbers where possible.
• By doing our own informatics analyses, we were able to order the participants’ data sets by our assessment of the quality of the acquired data.
• We reconfirm the conclusion of last year’s sPRG study that the expertise of the user seems to be the dominant factor in the quality of the acquired data.
• There are both good and bad sets from high res and low res instruments, as well as newer and older instruments.
• Re-grading submitted protein lists with our master protein list allowed confirmation of additional correct proteins reported by last year’s respondents.
• To what extent is expertise a factor in consistency of informatics analyses?
• www.abrf.org/sprg

Acknowledgements
• BIC Members
  • Jeffrey A. Kowalak - National Institutes of Health
  • William S. Lane - Harvard University
  • Alexey Nesvizhskii - University of Michigan
  • Brian C. Searle - Proteome Software
  • Sean L. Seymour - Applied Biosystems (chair)
  • David L. Tabb - Vanderbilt University
• The sPRG
• Jayson Falkner (Phil Andrews lab, UMichigan)
• All sPRG2006 Study respondents who submitted your raw data!

Appendix

Normalization of Peptide Metrics across BIC Member Analyses
• Comparison of the individual means to the group mean allows the derivation of a correction factor from a simple fit to y=mx.
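For reference, the slope of a least-squares fit constrained to pass through the origin has a closed form. The derivation below assumes ordinary least squares with $x_i$ as one member's value for protein $i$ and $y_i$ as the group mean for that protein; the slide does not state which fitting criterion or which axis assignment was actually used.

```latex
% Minimize the squared residuals of the model y = m x over n proteins.
S(m) = \sum_{i=1}^{n} \left( y_i - m x_i \right)^2

% Set the derivative with respect to m to zero.
\frac{dS}{dm} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i \right) = 0

% Solve for the correction factor m.
m = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}
```

Under these assumptions a member who reports systematically higher peptide metrics than the group gets a factor below 1, and a member who reports lower metrics gets a factor above 1.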

sPRG2006 Proteomics Standard
www.abrf.org/sprg
Note:
• UBC9_HUMAN - formerly UBE2I_HUMAN
• SYHC_HUMAN - formerly SYH_HUMAN

sPRG2006 Protein Identification

Proteins Identified by 2 or More Labs

Final Searches • Not all 19 data sets were searched by all 6 members of the BIC:
