Paulo Costa Carvalho Laboratory for Proteomics and Protein Engineering Fiocruz - PR

Analyzing shotgun proteomic data Paulo Costa Carvalho Laboratory for Proteomics and Protein Engineering Fiocruz - PR pcarvalho.com

Outline
• Shotgun proteomics
  • Motivation for studying proteomics
  • What is shotgun proteomics
• Data analysis
  • Protein identification
  • Label-free quantitation
• PatternLab for proteomics
• Final Considerations

Motivations J. Proteome Res., 2011, 10 (1), pp 153–160 DOI: 10.1021/pr100677g

Computational Proteomics Editorial “There has been an unprecedented improvement in the quality and quantity of commercial proteomics data generation technologies, making data generation more accessible to many researchers. However, more and more discoveries will be led by researchers in command of the skills necessary to mine and extensively interpret the volumes of data. Already the ability to generate data vastly outpaces our ability to interpret it, and the lack of expertise in interpreting data is the current gating factor in the advancement of proteomics sciences. Proteomics scientists with training solely in data generation techniques will be shut out of more and more research opportunities.” Nuno Bandeira, July 2011

Too many roads not taken. Edwards AM et al., Nature, Feb 2011

Proteomics has revolutionized biochemical research


LC/MS shotgun proteomic data [Figure: LC/MS map; axes Time and Mass/Charge]

[Figure, built up over four slides: a doubly charged precursor peptide, drawn from its NH2 end to its COOH end, is fragmented into B ions (N-terminal fragments) and Y ions (C-terminal fragments), and each fragment is placed along an m/z axis.]

Strategies for protein identification by mass spectrometry
• Peptide sequence match
  • Advantage: most sensitive (when the protein is in the DB)
  • Disadvantage: sequence must be in the DB; needs to specify PTMs a priori
• De novo sequencing
  • Advantage: does not require a database
  • Disadvantage: most error prone
• Sequence Tag Search
  • Advantages: no need to specify PTMs a priori; tolerant to small changes in the sequence
  • Disadvantages: not as sensitive as PSM when the protein is in the DB

De novo sequencing • Advantage: does not require a database • Disadvantage: most error prone [Figure: MS/MS spectrum, Intensity vs. m/z, with amino acid letters read off the mass differences between consecutive peaks]

Sequence Tag Search • Advantages: no need to specify PTM a priori; tolerant to small sequence changes • Disadvantages: not as sensitive as PSM when the protein is in the DB Na S et al., MCP, 2008

Peptide sequence match • Advantage: most sensitive (when the protein is in the DB) • Disadvantage: sequence must be in the DB; needs to specify PTMs a priori

Protein identification using a database: ProLuCID, X!Tandem, OMSSA, Andromeda, SEQUEST, Mascot, …

Interpreting MS/MS Proteomics Results Brian C. Searle Proteome Software Inc. Portland, Oregon USA Brian.Searle@ProteomeSoftware.com NPC Progress Meeting (February 2nd, 2006) Illustrated by Toni Boudreault

All these peaks are seen together simultaneously, and we don’t even know… [Figure: spectrum with B-type, A-type, and Y-type ions and an H2O loss labeled; axes Intensity vs. m/z]

…what type of ion they are, making the mass-differences approach even more difficult. Finally, as with all analytical techniques, …

…there’s noise, producing a final spectrum that looks like…

…this, on a good day. And so it’s actually fairly difficult to…

XCalibur :: Show experimental data

Known Ion Types: B-type ions, A-type ions, Y-type ions. We know a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…

Known Ion Types. …we also know the likelihood of seeing each type of ion, where generally B and Y ions are most prominent.
• B-type ions: 100%
• A-type ions: 20%
• Y-type ions: 100%
• B- or Y-type +2H ions: 50%
• B- or Y-type -NH3 ions: 20%
• B- or Y-type -H2O ions: 20%

If we know the amino acid sequence of a peptide, we can guess what the spectrum should look like! So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.

Model Spectrum: ELVISLIVESK. So as an example, consider the peptide ELVIS LIVES K that was synthesized by Rich Johnson in Seattle. *Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/

Model Spectrum We can create a hypothetical spectrum based on our rules

Where B and Y ions are estimated at 100%, plus-2 ions are estimated at 50%, and other stragglers are at 20%.
• B/Y type ions (100%)
• B/Y +2H type ions (50%)
• A type ions, B/Y -NH3/-H2O (20%)
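The rules above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine builds its model spectrum: the function name `model_spectrum` is hypothetical, the residue masses are the usual monoisotopic values, and "+2H" is assumed to mean the doubly charged version of each B or Y fragment.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "E": 129.04259, "L": 113.08406, "V": 99.06841,
    "I": 113.08406, "S": 87.03203, "K": 128.09496,
}
PROTON = 1.007276
WATER = 18.010565
AMMONIA = 17.026549
CO = 27.994915


def model_spectrum(peptide):
    """Return sorted (m/z, relative intensity %, label) tuples for a peptide.

    Intensities follow the slide: B/Y 100, doubly charged B/Y 50,
    A ions and B/Y neutral losses 20. (Hypothetical helper.)
    """
    peaks = []
    n = len(peptide)
    for i in range(1, n):  # one cleavage site between each pair of residues
        b_sum = sum(RESIDUE_MASS[aa] for aa in peptide[:i])
        y_sum = sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER
        for series, mass, index in (("b", b_sum, i), ("y", y_sum, n - i)):
            singly = mass + PROTON
            peaks.append((singly, 100, f"{series}{index}"))
            peaks.append(((mass + 2 * PROTON) / 2, 50, f"{series}{index}++"))
            peaks.append((singly - AMMONIA, 20, f"{series}{index}-NH3"))
            peaks.append((singly - WATER, 20, f"{series}{index}-H2O"))
        # A ions are B ions that have lost CO.
        peaks.append((b_sum + PROTON - CO, 20, f"a{i}"))
    return sorted(peaks)


if __name__ == "__main__":
    for mz, intensity, label in model_spectrum("ELVISLIVESK"):
        if intensity == 100:
            print(f"{label:>4}  {mz:9.4f}")
```

Running it prints only the B and Y ladder, which is the part of the hypothetical spectrum that carries most of the weight in the comparison that follows.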

Model Spectrum So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…

Model Spectrum We can find where the overlap is between the hypothetical and the actual spectra…

Model Spectrum And say conclusively based on the evidence that the spectrum does belong to the ELVIS LIVES K peptide.
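One simple way to picture "finding the overlap" is to count which hypothetical peaks have an observed peak close by and add up their model intensities. The sketch below assumes a fixed m/z tolerance of 0.5; the function name `matched_intensity` and all peak values are made up for illustration and are not taken from the ELVISLIVESK spectrum.

```python
def matched_intensity(model_peaks, observed_mz, tolerance=0.5):
    """Sum the model intensities of peaks that have an observed peak nearby.

    model_peaks: list of (m/z, relative intensity) from a hypothetical spectrum.
    observed_mz: list of m/z values picked from the actual spectrum.
    tolerance:   maximum |m/z difference| for two peaks to count as overlapping
                 (an assumed value, not one given in the talk).
    """
    matched = [
        (mz, intensity)
        for mz, intensity in model_peaks
        if any(abs(mz - obs) <= tolerance for obs in observed_mz)
    ]
    score = sum(intensity for _, intensity in matched)
    return score, matched


# Toy numbers, invented for this example.
model = [(147.1, 100), (234.1, 100), (363.2, 100), (73.6, 50)]
observed = [147.2, 200.0, 363.0, 500.4]

score, matched = matched_intensity(model, observed)
print(score, matched)
```

Here two of the four model peaks find a partner in the observed list, so the overlap score is the sum of their two model intensities.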

Sequencing Explosion
• 1977: Shotgun sequencing invented, bacteriophage φX174 sequenced
• 1989: Yeast Genome project announced
• 1990: Human Genome project announced
• 1992: First chromosome (Yeast) sequenced
• 1995: H. influenzae sequenced
• 1996: Yeast Genome sequenced
• 2000: Human Genome draft
Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
… In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was …

SEQUEST: …instead of searching all possible peptide sequences, search only those in genome databases. Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete human genome in a reasonable amount of time.
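The "search only those in the database" idea can be sketched as two steps: digest each database protein in silico, then keep the peptides whose mass agrees with the measured precursor. The sketch assumes trypsin (cleave after K or R, but not before P), monoisotopic masses, and an absolute mass tolerance; the protein names, sequences, and helper names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def tryptic_peptides(protein, min_length=5):
    """Cleave after K or R unless the next residue is P (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_length]


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(database, precursor_mass, tolerance=0.05):
    """Return (protein, peptide) pairs whose mass is within tolerance of the precursor."""
    hits = []
    for name, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            if abs(peptide_mass(peptide) - precursor_mass) <= tolerance:
                hits.append((name, peptide))
    return hits


# Toy database; both entries are invented for this example.
database = {
    "protA": "MTDAGRELVISLIVESKAFYLK",
    "protB": "MKWVTFISLLLLFSSAYSR",
}
print(candidates(database, precursor_mass=1228.73))
```

Only the candidates returned here would then be turned into model spectra and scored against the measured MS/MS spectrum, which is what keeps the search space manageable.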

SEQUEST Model Spectrum. For a scoring function they decided to use Cross-Correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra. Like so.

SEQUEST Model Spectrum. And then they shifted the spectra back and…

SEQUEST Model Spectrum. …forth so that the peaks shouldn’t align. They used this number, also called the Auto-Correlation, as their background.

SEQUEST XCorr. This is another representation of the Cross Correlation and the Auto Correlation. [Figure: Correlation Score vs. Offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background).] Gentzel M. et al., Proteomics 3 (2003) 1597-1610

SEQUEST XCorr. The XCorr score is the Cross Correlation divided by the average of the auto correlation over a 150 AMU range. The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification. [Figure: Correlation Score vs. Offset (AMU), showing the Cross Correlation (direct comparison) and the Auto Correlation (background).] $\mathrm{XCorr} = \dfrac{\text{Cross Correlation}}{\text{average Auto Correlation over a 150 AMU range}}$ Gentzel M. et al., Proteomics 3 (2003) 1597-1610
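A rough sketch of the direct comparison and the shifted background, under several assumptions: spectra are binned at 1 AMU, the background is the mean correlation over offsets from -75 to +75 excluding zero (150 offsets), and all function names and peak values are invented for illustration. The talk describes the score as a ratio; the 1994 SEQUEST paper is usually summarized as subtracting the background instead, so the sketch returns both forms rather than claiming either is the exact SEQUEST calculation.

```python
def bin_spectrum(peaks, n_bins):
    """Place (m/z, intensity) peaks into 1-AMU bins, keeping the tallest peak per bin."""
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(round(mz))
        if 0 <= index < n_bins:
            bins[index] = max(bins[index], intensity)
    return bins


def correlation(model, observed, offset):
    """Sum of products of the binned spectra with the model shifted by `offset` bins."""
    total = 0.0
    for i, value in enumerate(model):
        j = i + offset
        if value and 0 <= j < len(observed):
            total += value * observed[j]
    return total


def xcorr_scores(model, observed, max_offset=75):
    """Return (direct, background, ratio, difference) for two binned spectra."""
    direct = correlation(model, observed, 0)
    shifted = [
        correlation(model, observed, k)
        for k in range(-max_offset, max_offset + 1)
        if k != 0  # assumption: the unshifted comparison is left out of the background
    ]
    background = sum(shifted) / len(shifted)
    ratio = direct / background if background > 0 else float("inf")
    return direct, background, ratio, direct - background


# Toy spectra: three model peaks, all present in the observed data plus one noise peak.
model_bins = bin_spectrum([(100.0, 1.0), (200.0, 1.0), (300.0, 1.0)], n_bins=400)
observed_bins = bin_spectrum(
    [(100.0, 1.0), (200.0, 1.0), (250.0, 1.0), (300.0, 1.0)], n_bins=400
)

direct, background, ratio, difference = xcorr_scores(model_bins, observed_bins)
print(direct, background, ratio, difference)
```

In this toy case the only shifted alignments that overlap are the ones that slide a model peak onto the noise peak, so the background stays small and the direct comparison stands well above it.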

SEQUEST DeltaCn And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far, there really haven’t been any significant improvements on it. The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.
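The DeltaCn calculation is short enough to write out. The sketch uses the commonly quoted form, the gap between the best and second-best XCorr divided by the best XCorr; the function name `delta_cn` and the example scores are made up.

```python
def delta_cn(xcorrs):
    """Relative gap between the best and the second-best XCorr for one spectrum.

    Returns None when there are fewer than two candidates or the best score
    is not positive, since the ratio is not meaningful in those cases
    (a choice made for this sketch).
    """
    ranked = sorted(xcorrs, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return None
    return (ranked[0] - ranked[1]) / ranked[0]


# Hypothetical XCorr values for the candidate peptides of a single spectrum.
print(delta_cn([2.4, 3.2, 1.1]))
```

A value near 0 means the runner-up scored almost as well as the top hit, while a value near 1 means the top hit stands alone.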

Raw Xtractor/ Pause for search * Show an MS2 file

ProLuCID ProLuCID is a fast and sensitive tandem mass spectra-based protein identification program recently developed in the Yates laboratory at The Scripps Research Institute.

ProLuCID runner Show ProLuCID Runner Carvalho PC et al; unpublished

Protein Identification Workflow [Diagram: MS data and a Database go into a Search Engine (e.g. ProLuCID, SEQUEST, etc.), which produces PSMs]

The Challenge: How to pinpoint trustworthy identifications 1 spectrum = 1 identification!

Filtering data

In the beginning… Spectra were sorted according to some score and then a threshold value was set. Different programs have different scoring schemes, so SEQUEST, Mascot, and X!Tandem use different thresholds. Different thresholds may also be needed for different charge states, sample complexity, and database size.
• SEQUEST: XCorr > 2.5, dCn > 0.1
• Mascot: Score > 45
• X!Tandem: Score < 0.01
[Figure: spectrum scores for protein and peptide matches, sorted by match score]
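The sort-then-threshold approach can be sketched as follows. The cutoffs are the ones shown on the slide; the `PSM` record, the score keys, and the example identifications are hypothetical, and the X!Tandem score is assumed to be one where lower is better, since the slide's cutoff is an upper bound.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """One spectrum matched to one peptide by one search engine (hypothetical record)."""
    spectrum: str
    peptide: str
    engine: str
    scores: dict


def passes_threshold(psm):
    """Apply the per-engine cutoffs listed on the slide."""
    s = psm.scores
    if psm.engine == "SEQUEST":
        return s["xcorr"] > 2.5 and s["dcn"] > 0.1
    if psm.engine == "Mascot":
        return s["score"] > 45
    if psm.engine == "X!Tandem":
        return s["score"] < 0.01  # assumed lower-is-better score
    raise ValueError(f"no threshold defined for engine {psm.engine!r}")


# Invented identifications for illustration only.
psms = [
    PSM("s1", "ELVISLIVESK", "SEQUEST", {"xcorr": 3.1, "dcn": 0.25}),
    PSM("s2", "AFYLK", "SEQUEST", {"xcorr": 2.7, "dcn": 0.05}),
    PSM("s3", "ELVISLIVESK", "Mascot", {"score": 52}),
    PSM("s4", "AFYLK", "Mascot", {"score": 30}),
    PSM("s5", "ELVISLIVESK", "X!Tandem", {"score": 0.001}),
    PSM("s6", "AFYLK", "X!Tandem", {"score": 0.2}),
]

# Sort the SEQUEST results by XCorr, best first, as in the early workflows.
sequest_ranked = sorted(
    (p for p in psms if p.engine == "SEQUEST"),
    key=lambda p: p.scores["xcorr"],
    reverse=True,
)
accepted = [p for p in psms if passes_threshold(p)]
print([p.spectrum for p in accepted])
```

Note that spectrum s2 has a respectable XCorr but fails on dCn, which shows why a single fixed cutoff per engine is a blunt instrument.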
