Analysis of Complex Proteomic Datasets Using Scaffold

Free Scaffold Viewer can be downloaded at: www.proteomesoftware.com

Scaffold: Why do we need it?
Shotgun proteomics → analysis of complex mixtures
Whole cell extract: 10,000+ proteins, 600,000 peptides, 1.2 million spectra!
• Beyond the realm of manual interpretation
• How do we determine what is a valid protein identification?

Statistical Analysis Using Scaffold
• All search engines use different scoring algorithms → results cannot be compared directly
• Many search engines' results are described by more than one value
Examples:
Mascot → Ion Score and Identity Score
SEQUEST → XCorr and DeltaCn

Statistical Analysis Using Scaffold
Peptide Prophet*
• Creates a universal score (discriminant score) for the search engine result (e.g. XCorr and DeltaCn are compressed into one score for SEQUEST results, Ion Score and Identity Score for Mascot results)
• Plots a histogram of the discriminant scores and fits a bimodal distribution based on standard statistics to differentiate between correct and incorrect hits
• Computes the probability that the match is correct at a given discriminant score
*Nesvizhskii, A. I. et al., Anal. Chem. 2003, 75, 4646-4658
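The mixture-model idea on this slide can be sketched in a few lines. This is a simplified illustration, not Peptide Prophet's implementation: it assumes that both the "incorrect" and the "correct" discriminant scores follow normal distributions (an assumption made here for brevity), fits them with expectation-maximization, and then applies Bayes' rule to get the probability that a match with a given score is correct. The function names and the synthetic scores are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def probability_correct(model, score):
    """Posterior probability (Bayes' rule) that a match with this score is correct."""
    correct = model["weight_correct"] * normal_pdf(
        score, model["mean_correct"], model["sd_correct"])
    incorrect = (1.0 - model["weight_correct"]) * normal_pdf(
        score, model["mean_incorrect"], model["sd_incorrect"])
    total = correct + incorrect
    return correct / total if total > 0.0 else 0.5


def fit_mixture(scores, iterations=100):
    """Fit 'incorrect' and 'correct' normal components to the scores with EM."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    # Starting guess: lower half of the scores is 'incorrect', upper half 'correct'.
    model = {
        "weight_correct": 0.5,
        "mean_incorrect": sum(ordered[:half]) / half,
        "mean_correct": sum(ordered[half:]) / (len(ordered) - half),
        "sd_incorrect": 1.0,
        "sd_correct": 1.0,
    }
    for _ in range(iterations):
        # E-step: probability that each score came from the 'correct' component.
        post = [probability_correct(model, s) for s in scores]
        # M-step: re-estimate the weight, means, and standard deviations.
        n_c = max(sum(post), 1e-9)
        n_i = max(len(scores) - sum(post), 1e-9)
        mean_c = sum(p * s for p, s in zip(post, scores)) / n_c
        mean_i = sum((1 - p) * s for p, s in zip(post, scores)) / n_i
        var_c = sum(p * (s - mean_c) ** 2 for p, s in zip(post, scores)) / n_c
        var_i = sum((1 - p) * (s - mean_i) ** 2 for p, s in zip(post, scores)) / n_i
        model = {
            "weight_correct": n_c / len(scores),
            "mean_incorrect": mean_i,
            "mean_correct": mean_c,
            "sd_incorrect": max(math.sqrt(var_i), 1e-3),
            "sd_correct": max(math.sqrt(var_c), 1e-3),
        }
    return model


# Synthetic discriminant scores (made up for illustration, not real search results).
rng = random.Random(0)
scores = ([rng.gauss(-1.5, 1.0) for _ in range(600)]
          + [rng.gauss(3.5, 1.2) for _ in range(400)])
model = fit_mixture(scores)
for d in (-2.0, 1.0, 5.0):
    print(f"D = {d:+.1f}: P(correct) = {probability_correct(model, d):.3f}")
```

The fit is only meaningful when both populations are present in the data, since a single-mode histogram gives EM nothing to separate.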

Statistical Analysis Using Scaffold
[Figure: histogram of discriminant scores. x-axis: discriminant score (D), -3.9 to 7.3; y-axis: number of spectra in each bin, 0 to 200]

Statistical Analysis Using Scaffold
Assumes a mixture of standard statistical distributions
[Figure: histogram of discriminant scores (D) against number of spectra in each bin, with two fitted distributions labeled “incorrect” and “correct”]

Statistical Analysis Using Scaffold
[Figure: histogram of discriminant scores (D) against number of spectra in each bin, with the “incorrect” and “correct” distributions and a line marking the Peptide Probability Threshold]

Statistical Analysis Using Scaffold
One search engine may not be enough
[Figure: overlap of identifications among SEQUEST, X!Tandem, and Mascot; region values 9%, 22%, 4%, 34%, 19%, 7%, 5%. Source: www.proteomesoftware.com]

Statistical Analysis Using Scaffold
• Peptide Prophet statistics are applied separately for each search engine result (i.e. Mascot, SEQUEST, and X!Tandem)
• Scaffold Merger combines the peptide probabilities from each search engine to generate a protein probability
The probability of identifying a spectrum + the probability of agreement between search engines → protein probability
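How peptide probabilities can roll up into a protein probability is easiest to see with the simplest possible model. The sketch below is not Scaffold's merging algorithm: it assumes the peptide identifications are independent pieces of evidence, so the protein is present unless every one of its peptide assignments is wrong. The function name is hypothetical. Published protein-inference models such as ProteinProphet also adjust each peptide probability for the number of sibling peptides, which this sketch omits.

```python
def protein_probability(peptide_probabilities):
    """Probability that at least one peptide assignment is correct.

    Assumes independent peptide evidence (a simplification).
    """
    prob_all_wrong = 1.0
    for p in peptide_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        prob_all_wrong *= 1.0 - p
    return 1.0 - prob_all_wrong


# Two peptides at 95% each leave only a 0.05 * 0.05 chance that both are wrong.
two_peptides = protein_probability([0.95, 0.95])
one_peptide = protein_probability([0.95])
print(f"two peptides: {two_peptides:.4f}, one peptide: {one_peptide:.4f}")
```

Under this model a second peptide raises confidence sharply, which is why filters often require at least two unique peptides.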

Statistical Analysis Using Scaffold
Advantages of using Scaffold
• Allows you to choose a statistical error rate by setting probability thresholds
• Allows you to compare and combine results from different experiments and different search engines
• Allows sharing of raw data and search results
• Accepted as a suitable statistical method to validate large datasets
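The link between a probability threshold and an error rate can be made concrete with a small sketch. It relies on one assumption: that the reported probabilities are accurate, so each accepted identification with probability p contributes an expected (1 - p) incorrect identifications. The function name and the example probabilities are hypothetical.

```python
def expected_error_rate(probabilities, threshold):
    """Return (expected fraction of incorrect IDs, number accepted) at a threshold."""
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0, 0
    expected_wrong = sum(1.0 - p for p in accepted)
    return expected_wrong / len(accepted), len(accepted)


# Made-up peptide probabilities for illustration.
probabilities = [0.99, 0.95, 0.90, 0.60, 0.20]
for threshold in (0.9, 0.5):
    rate, count = expected_error_rate(probabilities, threshold)
    print(f"threshold {threshold}: {count} accepted, expected error rate {rate:.3f}")
```

Lowering the threshold accepts more identifications at the cost of a higher expected error rate, which is the trade-off the threshold setting controls.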

This is the Samples view

List of all the proteins found in your samples. Homologous proteins (proteins matched to the same peptides) are shown. You can link out directly to database entries.

How does Scaffold deal with peptides that can be assigned to more than one protein?
General rule → Explain the spectral data with the smallest set of proteins
Protein A and Protein B share all the same peptides, so they will be grouped together
[Figure: peptides matched to Protein A and Protein B]

How does Scaffold deal with peptides that can be assigned to more than one protein?
General rule → Explain the spectral data with the smallest set of proteins
Protein A and Protein B each have one unique peptide → they will be listed separately only if the peptide probability is > 50%
[Figure: peptides matched to Protein A and Protein B]

How does Scaffold deal with peptides that can be assigned to more than one protein?
General rule → Explain the spectral data with the smallest set of proteins
Protein B has two unique peptides → it will be listed separately
[Figure: peptides matched to Protein A and Protein B]
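The general rule on these slides, explaining the data with the smallest set of proteins, can be sketched as a grouping step followed by a greedy selection. This is an illustration under stated assumptions, not Scaffold's algorithm: proteins with identical peptide sets are merged into one group, and groups are then picked greedily, which approximates (but does not guarantee) the smallest set. The 50% peptide-probability condition from the slides is not modeled. The function name and the example data are hypothetical.

```python
def minimal_protein_list(protein_to_peptides):
    """Group indistinguishable proteins, then greedily explain all peptides."""
    # Proteins that match exactly the same peptides cannot be told apart.
    groups = {}
    for protein, peptides in protein_to_peptides.items():
        groups.setdefault(frozenset(peptides), []).append(protein)

    uncovered = set().union(*groups)  # every observed peptide
    reported = []
    while uncovered:
        # Pick the group that explains the most still-unexplained peptides.
        best = max(groups, key=lambda peptides: len(peptides & uncovered))
        reported.append((sorted(groups[best]), sorted(best)))
        uncovered -= best
    return reported


# Hypothetical example: A and B share all peptides; C has two unique peptides.
observed = {
    "A": ["p1", "p2"],
    "B": ["p1", "p2"],
    "C": ["p2", "p3", "p4"],
}
for proteins, peptides in minimal_protein_list(observed):
    print(proteins, peptides)
```

Here A and B are reported as one group because nothing distinguishes them, while C is listed on its own because only C explains p3 and p4.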

Scaffold will extract GO terms from NCBI annotations

Gene Ontology “GO” terms
• Controlled vocabulary containing consistent descriptions of gene products in different databases
• Describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner
Gene Ontology Project: http://www.geneontology.org/GO.doc.shtml

List of samples

Probability thresholds for peptide and protein identifications and the required number of unique peptides can be defined. Proteins are color coded to represent the probability that the protein identification is correct.
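The three filters described here can be expressed as a single predicate. The sketch below is an illustration only: the `ProteinHit` structure, the function name, and the default cutoffs (95% peptide probability, 99% protein probability, 2 unique peptides) are assumptions chosen for the example, not Scaffold's data model or defaults.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinHit:
    accession: str
    protein_probability: float
    # Unique peptide sequence -> best peptide probability (hypothetical layout).
    peptide_probabilities: dict = field(default_factory=dict)


def passes_filters(hit, min_protein_prob=0.99, min_peptide_prob=0.95, min_peptides=2):
    """True if the protein meets all three thresholds."""
    qualifying = [seq for seq, p in hit.peptide_probabilities.items()
                  if p >= min_peptide_prob]
    return (hit.protein_probability >= min_protein_prob
            and len(qualifying) >= min_peptides)


# Made-up hits for illustration.
hits = [
    ProteinHit("PROT_1", 0.999, {"PEPTIDEA": 0.98, "PEPTIDEB": 0.97}),
    ProteinHit("PROT_2", 0.999, {"PEPTIDEC": 0.98, "PEPTIDED": 0.60}),
    ProteinHit("PROT_3", 0.80, {"PEPTIDEE": 0.99, "PEPTIDEF": 0.99}),
]
accepted = [hit.accession for hit in hits if passes_filters(hit)]
print(accepted)
```

Only peptides that individually pass the peptide threshold count toward the minimum, so PROT_2 fails despite its high protein probability.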

This is the Proteins view

Spectrum of each peptide labeled with y and b ions which can be used for manual validation
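For manual validation it helps to know where the y and b ions should appear. The sketch below computes singly charged b- and y-ion m/z values for an unmodified peptide from standard monoisotopic residue masses: a b ion is the sum of its N-terminal residues plus a proton, and a y ion is the sum of its C-terminal residues plus water and a proton. The function name is hypothetical, and modifications and higher charge states are deliberately left out. The sequence used is the one shown on the Good Spectrum slide.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(sequence):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1) and y_ions[i] is y(i+1); unmodified residues only.
    """
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    for i in range(1, len(sequence)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("IAELAGFSVPENTK")
for number, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{number:<2} {b:10.4f}    y{number:<2} {y:10.4f}")
```

Neutral losses such as water (-18 Da) or ammonia (-17 Da) would show up as peaks offset from these values by the corresponding mass.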

Manual Spectrum Evaluation
• Search engine scores → Is the peptide found by more than one search engine?
• Mascot ion score > 40
• SEQUEST XCorr > 2 (+2 ion), 2.5 (+3 ion)
• DeltaCn > 0.2
• Good signal-to-noise
• Long stretches of y and/or b ions
• All dominant peaks are assigned as y or b ions
• Fragmentation chemistry
N-terminal cleavage at P → dominant y-ion
C-terminal cleavage at D and E → dominant b-ion
Peptides containing W → abundant y-ions
S and T → tend to lose water (-18 Da)
R, N, and Q → tend to lose ammonia (-17 Da)

Good Spectrum
Peptide sequence: IAELAGFSVPENTK (+2 charge on parent peptide)
• Good coverage of y and b ion series
• Dominant y-ion at N-terminal cleavage of P
• Good signal-to-noise
Mascot: Ion Score = 60.1, Identity Score = 37.3
SEQUEST: XCorr = 2.61, DeltaCn = 0.4

Bad Spectrum
Peptide sequence: YPLADYALTPDMAIVDANLVMDMPK (+3 charge on parent peptide)
• Poor signal-to-noise
• Multiple unassigned peaks
• Poor coverage of y and b ion series
Mascot: Ion Score = 9.93, Identity Score = 37.3
SEQUEST: XCorr = 2.26, DeltaCn = 0.2
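The score cutoffs from the Manual Spectrum Evaluation slide can be applied mechanically to the two example spectra. The sketch below covers only the numeric rules of thumb (Mascot ion score, XCorr by charge state, DeltaCn); signal-to-noise and ion-series coverage still need a human eye. The function name is hypothetical, and the cutoffs are read as strict inequalities, as written on the slide.

```python
# XCorr cutoffs by parent ion charge, taken from the rules of thumb above.
XCORR_CUTOFF = {2: 2.0, 3: 2.5}


def check_scores(charge, mascot_ion_score, xcorr, delta_cn):
    """Return which score-based rules of thumb a peptide match satisfies."""
    if charge not in XCORR_CUTOFF:
        raise ValueError(f"no XCorr cutoff listed for charge +{charge}")
    return {
        "mascot_ion_score": mascot_ion_score > 40,
        "sequest_xcorr": xcorr > XCORR_CUTOFF[charge],
        "sequest_delta_cn": delta_cn > 0.2,
    }


# Scores from the Good Spectrum and Bad Spectrum slides.
good = check_scores(charge=2, mascot_ion_score=60.1, xcorr=2.61, delta_cn=0.4)
bad = check_scores(charge=3, mascot_ion_score=9.93, xcorr=2.26, delta_cn=0.2)
print("good spectrum:", good)
print("bad spectrum: ", bad)
```

Note that the bad spectrum's XCorr of 2.26 would pass the +2 cutoff but fails the +3 cutoff, and its DeltaCn of exactly 0.2 does not exceed the threshold.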

This is the Statistics view

Scaffold Statistics View
Score Histogram
• Blue indicates “incorrect” proteins
• Red indicates “correct” proteins
Important! There must be enough data to fit two distributions for the statistics to be valid.
A protein is “correct” if it passes the peptide probability, protein probability, and minimum number of peptides filters.

Scaffold Statistics View
• With at least 2 unique peptides (95% peptide probability), the maximum protein probability is ~100%.
• With only 1 unique peptide (95% peptide probability), the maximum protein probability is <90%.

Scaffold Statistics View
[Figure: SEQUEST only, with missed IDs marked]

Scaffold Statistics View
[Figure: Mascot only, with missed IDs marked]

Scaffold Statistics View
Using both Mascot and SEQUEST results in more “correct” protein identifications
[Figure: comparison of Both, Mascot only, and SEQUEST only]

This is the Publish View

Publication Guidelines for Proteomic Data
Journal of Molecular and Cellular Proteomics
http://www.mcponline.org/misc/ParisReport_Final.shtml

Publication Guidelines for Proteomic Data
Data Analysis
• Name and version of software used to extract the peak list
• Name and version of database searching software (Mascot, SEQUEST, Spectrum Mill, or X!Tandem)
• Values of all search parameters used (enzyme, modifications, mass tolerance, etc.)
• Name and size of the database searched (Swiss-Prot or NCBI and the number of sequence entries)
• Name and version of any additional software used for statistical analysis and an explanation of the analysis (Scaffold, minimum number of peptides required, probability settings)

Publication Guidelines for Proteomic Data
Each Peptide Identified
• Peptide sequence noting any modifications or missed cleavages
• Parent peptide ion mass and charge
• All search engine scores
Each Protein Identified
• Accession number
• Sequence coverage and total number of unique peptides
