Analysis of Complex Proteomic Datasets Using Scaffold
Free Scaffold Viewer can be downloaded at: www.proteomesoftware.com
Scaffold: Why do we need it?
Shotgun proteomics: analysis of complex mixtures
Whole cell extract: 10,000+ proteins, 600,000 peptides, 1.2 million spectra!
• Beyond the realm of manual interpretation
• How do we determine what is a valid protein identification?
Statistical Analysis Using Scaffold
• Search engines all use different scoring algorithms, so their results cannot be compared directly
• Many search engines describe a result with more than one value
  Examples: Mascot Ion Score and Identity Score; SEQUEST XCorr and DeltaCn
Statistical Analysis Using Scaffold
Peptide Prophet*
• Creates a universal score (the discriminant score) for each search engine result (e.g. XCorr and DeltaCn are compressed into one score for SEQUEST results, Ion Score and Identity Score for Mascot results)
• Plots a histogram of the discriminant scores and fits a bimodal distribution, based on standard statistics, to differentiate between correct and incorrect hits
• Computes the probability that a match is correct at a given discriminant score
*Nesvizhskii, A. I. et al., Anal. Chem. 2003, 75, 4646-4658
Statistical Analysis Using Scaffold
[Figure: histogram of discriminant scores. x-axis: discriminant score (D), from -3.9 to 7.3; y-axis: number of spectra in each bin, from 0 to 200]
Statistical Analysis Using Scaffold
[Figure: the same histogram, modeled as a mixture of standard statistical distributions, one for “incorrect” matches and one for “correct” matches]
Statistical Analysis Using Scaffold
[Figure: the same histogram with a peptide probability threshold drawn between the “incorrect” and “correct” distributions]
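The approach on these slides, fitting two distributions to the discriminant scores and reading off the probability that a match is correct, can be sketched in a few lines. This is a minimal illustration, not Peptide Prophet or Scaffold code: it assumes both components are Gaussian (a simplification), fits them with a plain EM loop, and runs on synthetic scores. The names `fit_two_gaussians` and `prob_correct` are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(scores, iterations=200):
    """EM fit of a two-component Gaussian mixture (assumed model).

    Returns (weight_correct, (mu_incorrect, sd_incorrect), (mu_correct, sd_correct)).
    """
    ordered = sorted(scores)
    n = len(ordered)
    # Crude start: lower quartile for "incorrect", upper quartile for "correct".
    mu0, mu1 = ordered[n // 4], ordered[(3 * n) // 4]
    mean = sum(scores) / n
    sd0 = sd1 = max(1e-3, math.sqrt(sum((s - mean) ** 2 for s in scores) / n))
    w1 = 0.5
    for _ in range(iterations):
        # E-step: responsibility of the "correct" component for each score.
        resp = []
        for s in scores:
            a = (1.0 - w1) * normal_pdf(s, mu0, sd0)
            b = w1 * normal_pdf(s, mu1, sd1)
            resp.append(b / (a + b) if a + b > 0 else 0.5)
        r1 = sum(resp)
        r0 = n - r1
        if r0 < 1e-9 or r1 < 1e-9:
            break  # one component has collapsed; not enough data for two
        # M-step: re-estimate means, standard deviations, and mixing weight.
        mu0 = sum((1.0 - r) * s for r, s in zip(resp, scores)) / r0
        mu1 = sum(r * s for r, s in zip(resp, scores)) / r1
        sd0 = max(1e-3, math.sqrt(
            sum((1.0 - r) * (s - mu0) ** 2 for r, s in zip(resp, scores)) / r0))
        sd1 = max(1e-3, math.sqrt(
            sum(r * (s - mu1) ** 2 for r, s in zip(resp, scores)) / r1))
        w1 = r1 / n
    return w1, (mu0, sd0), (mu1, sd1)

def prob_correct(score, w1, incorrect, correct):
    """Posterior probability that a match with this score is correct."""
    a = (1.0 - w1) * normal_pdf(score, *incorrect)
    b = w1 * normal_pdf(score, *correct)
    return b / (a + b) if a + b > 0 else 0.0

# Synthetic discriminant scores (made up for illustration only).
random.seed(0)
scores = ([random.gauss(-1.0, 1.0) for _ in range(700)]
          + [random.gauss(4.0, 1.2) for _ in range(300)])
w1, incorrect, correct = fit_two_gaussians(scores)
```

Calling `prob_correct(d, w1, incorrect, correct)` for a discriminant score `d` then gives the peptide probability that a threshold is applied to. The fit is only meaningful when the data contain enough matches of both kinds to support two distributions.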
Statistical Analysis Using Scaffold
One search engine may not be enough
[Figure: three-way Venn diagram of identifications made by SEQUEST, X!Tandem, and Mascot; region labels 9%, 22%, 4%, 34%, 19%, 7%, 5%. Source: www.proteomesoftware.com]
Statistical Analysis Using Scaffold
• Peptide Prophet statistics are applied separately to each search engine result (i.e. Mascot, SEQUEST, and X!Tandem)
• Scaffold Merger combines the peptide probabilities from each search engine to generate a protein probability
The probability of identifying a spectrum + the probability of agreement between search engines → protein probability
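The slide does not spell out the merging model, so the following is only a rough illustration of how per-peptide probabilities from several engines might roll up into one protein probability. It makes two assumptions that are not from the slides: each peptide keeps the best probability any engine reported for it, and peptides count as independent evidence, so the protein probability is 1 minus the product of (1 - p) over its peptides. It does not model the agreement term, and it will not reproduce Scaffold's numbers. The function names, the second peptide, and all probabilities are made up.

```python
from math import prod

def best_peptide_probability(engine_probs):
    """engine_probs maps search engine name -> probability for one peptide.

    Assumption: keep the highest probability reported by any engine.
    """
    return max(engine_probs.values())

def protein_probability(peptides):
    """peptides maps peptide sequence -> {engine: probability}.

    Assumption: peptides are independent evidence, so the protein is wrong
    only if every peptide identification is wrong.
    """
    return 1.0 - prod(1.0 - best_peptide_probability(p) for p in peptides.values())

# Illustrative input; the probabilities and the second peptide are invented.
example = {
    "IAELAGFSVPENTK": {"Mascot": 0.95, "SEQUEST": 0.90},
    "PEPTIDEK": {"SEQUEST": 0.60},
}
p_protein = protein_probability(example)
```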
Statistical Analysis Using Scaffold
Advantages of using Scaffold
• Allows you to choose a statistical error rate by setting probability thresholds
• Allows you to compare and combine results from different experiments and different search engines
• Allows sharing of raw data and search results
• Accepted as a suitable statistical method to validate large datasets
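To make the link between a probability threshold and an error rate concrete: if the probabilities are well calibrated, the expected fraction of wrong identifications among those accepted is the average of (1 - p) over the accepted set. The sketch below assumes exactly that; the function name and the probability values are illustrative, not Scaffold output.

```python
def expected_error_rate(probabilities, threshold):
    """Return (expected fraction of false identifications, number accepted)
    for identifications with probability >= threshold.

    Assumes the probabilities are well calibrated.
    """
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0, 0
    expected_false = sum(1.0 - p for p in accepted)
    return expected_false / len(accepted), len(accepted)

# Made-up identification probabilities.
probs = [0.99, 0.98, 0.95, 0.90, 0.80, 0.60, 0.30, 0.10]
rate_95, n_95 = expected_error_rate(probs, 0.95)  # strict threshold
rate_50, n_50 = expected_error_rate(probs, 0.50)  # permissive threshold
```

Lowering the threshold accepts more identifications but raises the expected error rate, which is the trade-off the threshold setting controls.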
List of all the proteins found in your samples
Homologous proteins (proteins matched to the same peptides) are shown.
You can link out directly to database entries.
How does Scaffold deal with peptides that can be assigned to more than one protein?
General rule: explain the spectral data with the smallest set of proteins
[Diagram: proteins A and B] Protein A and Protein B share all the same peptides, so they will be grouped together.
How does Scaffold deal with peptides that can be assigned to more than one protein?
General rule: explain the spectral data with the smallest set of proteins
[Diagram: proteins A and B] Protein A and Protein B each have one unique peptide; they will be listed separately only if the peptide probability is > 50%.
How does Scaffold deal with peptides that can be assigned to more than one protein?
General rule: explain the spectral data with the smallest set of proteins
[Diagram: proteins A and B] Protein B has two unique peptides, so it will be listed separately.
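The grouping rule can be sketched as a small set operation. This is a simplified reading of the slides, not Scaffold's algorithm: proteins with identical peptide sets are merged into one group, and a group is dropped when its peptides are a strict subset of another group's. It ignores peptide probabilities, so it does not implement the > 50% rule for single unique peptides. The function name and the peptide labels `p1`…`p5` are hypothetical.

```python
from collections import defaultdict

def group_and_reduce(protein_peptides):
    """protein_peptides maps protein name -> set of matched peptides.

    Returns a list of (protein_names, peptides) groups, sorted by name.
    """
    # Step 1: proteins that share exactly the same peptides form one group.
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)
    # Step 2: drop any group fully explained by a larger group.
    kept = []
    for peptides, proteins in groups.items():
        subsumed = any(peptides < other for other in groups)
        if not subsumed:
            kept.append((tuple(sorted(proteins)), peptides))
    return sorted(kept, key=lambda group: group[0])

# A and B share all the same peptides: one group.
shared = group_and_reduce({"A": {"p1", "p2"}, "B": {"p1", "p2"}})
# A and B each have peptides the other lacks: listed separately.
separate = group_and_reduce({"A": {"p1", "p2", "p5"}, "B": {"p1", "p2", "p3", "p4"}})
# A has no peptide of its own: B alone explains the data.
subset = group_and_reduce({"A": {"p1", "p2"}, "B": {"p1", "p2", "p3", "p4"}})
```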
Gene Ontology “GO” terms
• Controlled vocabulary containing consistent descriptions of gene products in different databases
• Describes gene products in terms of their associated biological processes, cellular components, and molecular functions in a species-independent manner
Gene Ontology Project: http://www.geneontology.org/GO.doc.shtml
Probability thresholds for peptide and protein identifications, and the required number of unique peptides, can be defined.
Color coding represents the probability that a protein identification is correct.
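As a rough picture of how these three settings act together, the sketch below accepts a protein only when its own probability clears the protein threshold and enough of its unique peptides clear the peptide threshold. The function name, the default values, and the sample numbers are illustrative assumptions, not Scaffold defaults.

```python
def passes_filters(protein_prob, peptide_probs,
                   min_protein_prob=0.95, min_peptide_prob=0.95, min_peptides=2):
    """peptide_probs maps unique peptide sequence -> peptide probability.

    All three thresholds are user-chosen settings; the defaults here are
    placeholders for illustration.
    """
    qualifying = [p for p in peptide_probs.values() if p >= min_peptide_prob]
    return protein_prob >= min_protein_prob and len(qualifying) >= min_peptides

# Made-up examples.
two_good_peptides = passes_filters(0.99, {"PEPTIDEA": 0.97, "PEPTIDEB": 0.96})
one_good_peptide = passes_filters(0.99, {"PEPTIDEA": 0.97, "PEPTIDEB": 0.60})
relaxed = passes_filters(0.99, {"PEPTIDEA": 0.97, "PEPTIDEB": 0.60}, min_peptides=1)
```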
The spectrum of each peptide is labeled with y and b ions, which can be used for manual validation.
Manual Spectrum Evaluation
• Search engine scores
  • Is the peptide found by more than one search engine?
  • Mascot ion score > 40
  • SEQUEST XCorr > 2 (+2 ion), > 2.5 (+3 ion)
  • deltaCn > 0.2
• Good signal-to-noise
• Long stretches of y and/or b ions
• All dominant peaks are assigned as y or b ions
• Fragmentation chemistry
  • N-terminal cleavage at P: dominant y-ion
  • C-terminal cleavage at D and E: dominant b-ion
  • Peptides containing W: abundant y-ions
  • S and T tend to lose water (-18 Da)
  • R, N, and Q tend to lose ammonia (-17 Da)
Good Spectrum
Peptide sequence: IAELAGFSVPENTK (+2 charge on parent peptide)
• Good coverage of y and b ion series
• Dominant y-ion at N-terminal cleavage of P
• Good signal-to-noise
Mascot: Ion Score = 60.1, Identity Score = 37.3
SEQUEST: XCorr = 2.61, deltaCn = 0.4
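For readers who want to check labeled peaks by hand, the singly charged b and y ion m/z values for this peptide can be computed from standard monoisotopic residue masses. The sketch below assumes an unmodified peptide and +1 fragment ions; `fragment_ions` is a hypothetical helper, not part of any search engine.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # mass of H2O

def fragment_ions(peptide):
    """Return (b, y): lists of singly charged b1..b(n-1) and y1..y(n-1) m/z."""
    masses = [MONO[aa] for aa in peptide]
    b, y = [], []
    running = 0.0
    for m in masses[:-1]:            # b ions grow from the N-terminus
        running += m
        b.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y ions grow from the C-terminus
        running += m
        y.append(running + WATER + PROTON)
    return b, y

b_ions, y_ions = fragment_ions("IAELAGFSVPENTK")
# y5 is the fragment PENTK, produced by cleavage on the N-terminal side of P.
y5 = y_ions[4]
```

Here `y5` is the ion the slide calls out as dominant. Doubly charged fragments, neutral losses of water or ammonia, and modifications would need extra terms.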
Bad Spectrum
Peptide sequence: YPLADYALTPDMAIVDANLVMDMPK (+3 charge on parent peptide)
• Poor signal-to-noise
• Multiple unassigned peaks
• Poor coverage of y and b ion series
Mascot: Ion Score = 9.93, Identity Score = 37.3
SEQUEST: XCorr = 2.26, deltaCn = 0.2
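The score guidelines from the Manual Spectrum Evaluation slide can be applied to the two example spectra as a quick check. The sketch below encodes only the thresholds stated in the deck (Mascot ion score > 40; SEQUEST XCorr > 2 for +2 and > 2.5 for +3; deltaCn > 0.2). The function name is hypothetical, and charges other than +2 and +3 are left unjudged because the deck gives no XCorr cutoff for them.

```python
XCORR_MIN = {2: 2.0, 3: 2.5}  # XCorr cutoffs by parent charge, from the slide

def score_checks(charge, mascot_ion, xcorr, delta_cn):
    """Return a dict of pass/fail results for each score guideline.

    The XCorr entry is None when the deck gives no cutoff for that charge.
    """
    cutoff = XCORR_MIN.get(charge)
    return {
        "mascot_ion": mascot_ion > 40,
        "xcorr": None if cutoff is None else xcorr > cutoff,
        "delta_cn": delta_cn > 0.2,
    }

# Values taken from the Good Spectrum and Bad Spectrum slides.
good = score_checks(charge=2, mascot_ion=60.1, xcorr=2.61, delta_cn=0.4)
bad = score_checks(charge=3, mascot_ion=9.93, xcorr=2.26, delta_cn=0.2)
```

The good spectrum passes every check and the bad spectrum fails every check, including deltaCn, which sits exactly at the cutoff rather than above it. Scores alone are not the whole evaluation; signal-to-noise and ion series coverage still need to be judged from the spectrum.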
Scaffold Statistics View
Score Histogram
• Blue indicates “incorrect” proteins
• Red indicates “correct” proteins
Important! There must be enough data to fit two distributions for the statistics to be valid.
A protein is “correct” if it passes the peptide probability, protein probability, and minimum number of peptides filters.
Scaffold Statistics View
• With at least 2 unique peptides (95% peptide probability), the maximum protein probability is ~100%.
• With only 1 unique peptide (95% peptide probability), the maximum protein probability is < 90%.
Scaffold Statistics View
[Figure: SEQUEST only, with missed IDs marked]
Scaffold Statistics View
[Figure: Mascot only, with missed IDs marked]
Scaffold Statistics View
[Figure: identifications found by Mascot only, by SEQUEST only, and by both]
Using both Mascot and SEQUEST results in more “correct” protein identifications.
Publication Guidelines for Proteomic Data
Journal of Molecular and Cellular Proteomics
http://www.mcponline.org/misc/ParisReport_Final.shtml
Publication Guidelines for Proteomic Data
Data Analysis
• Name and version of the software used to extract the peak list
• Name and version of the database searching software (Mascot, SEQUEST, Spectrum Mill, or X!Tandem)
• Values of all search parameters used (enzyme, modifications, mass tolerance, etc.)
• Name and size of the database searched (Swiss-Prot or NCBI, and the number of sequence entries)
• Name and version of any additional software used for statistical analysis, with an explanation of the analysis (Scaffold, number-of-peptides requirements, probability settings)
Publication Guidelines for Proteomic Data
Each Peptide Identified
• Peptide sequence, noting any modifications or missed cleavages
• Parent peptide ion mass and charge
• All search engine scores
Each Protein Identified
• Accession number
• Sequence coverage and total number of unique peptides
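Sequence coverage and the unique peptide count are simple to compute once the identified peptides are known. The sketch below assumes coverage means the fraction of residues in the protein sequence that fall inside at least one identified peptide, located by exact string match. The function name and the protein sequence are made up for illustration; only the peptide IAELAGFSVPENTK comes from the deck.

```python
def sequence_coverage(protein, peptides):
    """Return (coverage fraction, number of unique peptides).

    Coverage counts each residue once, however many peptides overlap it.
    """
    unique = {p for p in peptides if p}
    covered = [False] * len(protein)
    for pep in unique:
        start = protein.find(pep)
        while start != -1:               # mark every occurrence of the peptide
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    fraction = sum(covered) / len(protein) if protein else 0.0
    return fraction, len(unique)

# Hypothetical 23-residue protein sequence containing the example peptide.
protein = "MKTAYIAELAGFSVPENTKQLLR"
identified = ["IAELAGFSVPENTK", "QLLR", "IAELAGFSVPENTK"]  # one repeat
coverage, n_unique = sequence_coverage(protein, identified)
```

Repeated identifications of the same peptide count once, so the example reports two unique peptides covering 18 of 23 residues.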