Interpreting MS/MS Proteomics Results

Brian C. Searle, Proteome Software Inc., Portland, Oregon USA, Brian.Searle@ProteomeSoftware.com
NPC Progress Meeting (February 2nd, 2006)
Illustrated by Toni Boudreault

The first thing I should say is that none of the material presented is original research done at Proteome Software, but we do strive to make the tools presented here available in our software product, Scaffold. With that caveat aside…

Organization
• Identify
• SEQUEST
• X! Tandem/Mascot
• Differ
• Combine
This is foremost an introduction, so we’re first going to talk about how you go about identifying proteins with tandem mass spectrometry in the first place. Then we’re going to talk about the motivations behind the development of the first really useful bioinformatics technique in our field, SEQUEST. This technique has been extended by two other tools called X! Tandem and Mascot. We’re also going to talk about how these programs differ and how we can use that to our advantage by considering them simultaneously using probabilities.

Start with a protein
[Figure: a protein drawn as a chain of amino acids: A I E P A T H K K Q I G L R L K N V I T I D D C G V R T A]
So, this is proteomics, so we’re going to use tandem mass spectrometry to identify proteins, hopefully many of them, and hopefully very quickly.

Cut with an enzyme
[Figure: the protein chain cut into peptides]
And to use this technique you generally have to cleave the protein into peptides about 8 to 20 amino acids in length and…
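
The cutting step can be mimicked on a computer. This is a minimal sketch, assuming the enzyme is trypsin (the later slides talk about tryptic peptides) and using the common rule of thumb that trypsin cuts after K or R unless the next residue is P. The sequence is the string of letters from the first slide read left to right, which may not be the order drawn in the figure, and `tryptic_digest` is a hypothetical helper name.

```python
import re


def tryptic_digest(protein):
    """Split a protein sequence after K or R, except when followed by P."""
    # Zero-width split points: preceded by K/R, not followed by P.
    # Splitting on empty matches needs Python 3.7 or newer.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [piece for piece in pieces if piece]


# Assumption: slide letters read left to right; the real order may differ.
PROTEIN = "AIEPATHKKQIGLRLKNVITIDDCGVRTA"
peptides = tryptic_digest(PROTEIN)
for peptide in peptides:
    print(len(peptide), peptide)
```

On a toy protein this short, most pieces fall outside the 8 to 20 residue range mentioned above; real proteins are much longer and give a wider spread of lengths.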

Select a peptide
[Figure: one peptide picked out of the cut chain]
…look at each peptide individually. We select the peptide by mass using the first half of the tandem mass spectrometer.

Impart energy in collision cell
[Figure: the peptide A E P T I R + H2O]
The mass spectrometer imparts energy into the peptide, causing it to fragment at the peptide bonds between amino acids.

Measure mass of daughter ions
[Figure: spectrum, intensity vs. m/z, with peaks for A (72.0), AE (201.1), AEP (298.1), and AEPT (399.2)]
The masses of these fragment ions are recorded using the second mass spectrometer.

B-type Ions
[Figure: B-ion spectrum of A E P T I R + H2O, intensity vs. m/z, with peak spacings 72.0, 129.0, 97.0, 101.0, 113.1, 174.1]
These ions are commonly called B ions, based on nomenclature you don’t really want to know about… But the mass difference between the peaks corresponds directly to the amino acid sequence.

B-type Ions
[Figure: the same spectrum with the spacings labeled: A-0 = 72.0, AE-A = 129.0, AEP-AE = 97.0, AEPT-AEP = 101.0, AEPTI-AEPT = 113.1, AEPTIR-AEPTI = 174.1]
For example, the A-E peak minus the A peak should produce the mass of E. You can build these mass differences up and derive a sequence for the original peptide. This is pretty neat, and it makes tandem mass spectrometry one of the best tools out there for sequencing novel peptides.
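
The mass-difference idea can be sketched in a few lines. This is only an illustration under clean conditions: the peak list holds the four B-ion values shown on the slides, the residue masses are standard monoisotopic values (a subset, for brevity), and `read_b_ladder` and the 0.2 Da tolerance are hypothetical choices.

```python
# Monoisotopic residue masses in Da (standard values; subset for brevity).
# Leucine and isoleucine have identical masses, so they share one entry.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L/I": 113.08406, "D": 115.02694,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.00728  # a singly charged B ion carries one extra proton


def read_b_ladder(peaks, tolerance=0.2):
    """Turn consecutive B-ion m/z values into a list of residue names."""
    residues = []
    previous = PROTON  # the "A-0" step: first peak minus the bare charge
    for mz in sorted(peaks):
        gap = mz - previous
        # Pick the residue whose mass is closest to the observed gap.
        name, mass = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - gap))
        residues.append(name if abs(mass - gap) <= tolerance else "?")
        previous = mz
    return residues


# B-ion peaks from the slide for A, AE, AEP, and AEPT.
print(read_b_ladder([72.0, 201.1, 298.1, 399.2]))
```

A gap that matches no residue comes back as "?", which is exactly what a missing peak does to this approach.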

So, it seems pretty easy, doesn’t it? But there are a couple of confounding factors. For example…

B-type Ions
[Figure: each B ion of A E P T I R + H2O losing CO; intensity vs. m/z]
…B ions have a tendency to degrade and lose carbon monoxide, producing…

A-type Ions
[Figure: A-ion ladder of A E P T I R + H2O, each B ion minus CO; m/z axis]
…A ions. Furthermore…

Y-type Ions
[Figure: Y-ion ladder read in reverse, R I T P E A, with H2O; intensity vs. m/z]
…the second half of each fragmented peptide is represented as Y ions, which sequence backwards. And, unfortunately, this is the real world, so…
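
The three ion series can be computed directly from a sequence. This sketch assumes singly charged ions and standard monoisotopic masses. `fragment_ions` is a hypothetical helper, and the peptide is the AEPTIR example from the slides.

```python
RESIDUE_MASS = {"A": 71.03711, "E": 129.04259, "P": 97.05276,
                "T": 101.04768, "I": 113.08406, "R": 156.10111}
PROTON, WATER, CO = 1.00728, 18.01056, 27.99491


def fragment_ions(peptide):
    """Return singly charged B, A, and Y ion m/z lists for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # N-terminal piece: residues before the broken bond, plus a proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # C-terminal piece: remaining residues; it keeps the H2O.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    a_ions = [mz - CO for mz in b_ions]  # a B ion that lost carbon monoxide
    y_ions.reverse()  # smallest first, so the ladder reads R, I, T, P, E
    return b_ions, a_ions, y_ions


b_ions, a_ions, y_ions = fragment_ions("AEPTIR")
print("B:", [round(mz, 1) for mz in b_ions])
print("A:", [round(mz, 1) for mz in a_ions])
print("Y:", [round(mz, 1) for mz in y_ions])
```

The first four B values reproduce the 72.0, 201.1, 298.1, and 399.2 peaks from the earlier slide, and the first Y ion is R plus water plus a proton, which is why the Y ladder reads the sequence from the other end.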

Y-type Ions
[Figure: Y-ion spectrum of R I T P E A + H2O with uneven and missing peaks; intensity vs. m/z]
…all the peaks have different measured heights, and many peaks can often be missing.

B-type, A-type, Y-type Ions
[Figure: all three ion series overlaid in one spectrum; intensity vs. m/z]
All these peaks are seen together simultaneously, and we don’t even know…

…what type of ion they are, making the mass differences approach even more difficult. Finally, as with all analytical techniques…

…there’s noise, producing a final spectrum that looks like…

…this, on a good day. And so it’s actually fairly difficult to…

…compute the mass differences to sequence the peptide, certainly in a computer-automated way.

So the community needed a new technique. Now, it wasn’t all without hope…

Known Ion Types
• B-type ions
• A-type ions
• Y-type ions
We knew a couple of things about peptide fragmentation. Not only do we know to expect B, A, and Y ions, but…

Known Ion Types
• B-type ions
• A-type ions
• Y-type ions
• B- or Y-type +2H ions
• B- or Y-type -NH3 ions
• B- or Y-type -H2O ions
…we also know a couple of other variations on those ions that come up. We even know something about the…

Known Ion Types
• B-type ions: 100%
• A-type ions: 20%
• Y-type ions: 100%
• B- or Y-type +2H ions: 50%
• B- or Y-type -NH3 ions: 20%
• B- or Y-type -H2O ions: 20%
…likelihood of seeing each type of ion, where generally B and Y ions are most prominent.

If we know the amino acid sequence of a peptide, we can guess what the spectra should look like!
So it’s actually pretty easy to guess what a spectrum should look like if we know what the peptide sequence is.

Model Spectrum
So as an example, consider the peptide ELVIS LIVES K (ELVISLIVESK) that was synthesized by Rich Johnson in Seattle.
*Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/

Model Spectrum
We can create a hypothetical spectrum based on our rules…

• B/Y type ions (100%)
• B/Y +2H type ions (50%)
• A type ions, B/Y -NH3/-H2O (20%)
…where B and Y ions are estimated at 100%, plus 2 ions are estimated at 50%, and other stragglers are at 20%.
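
Those rules are simple enough to turn into a rough model-spectrum generator. This is a sketch under stated assumptions: "+2H" is read here as the doubly protonated (2+) form of a B or Y ion, masses are standard monoisotopic values, and `model_spectrum` is a hypothetical helper rather than what any particular search engine does.

```python
RESIDUE_MASS = {"E": 129.04259, "L": 113.08406, "V": 99.06841,
                "I": 113.08406, "S": 87.03203, "K": 128.09496}
PROTON, WATER, AMMONIA, CO = 1.00728, 18.01056, 17.02655, 27.99491


def model_spectrum(peptide):
    """Return sorted (m/z, relative intensity, label) tuples for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(peptide)
    peaks = []
    for i in range(1, n):
        b = sum(masses[:i]) + PROTON
        y = sum(masses[i:]) + WATER + PROTON
        for name, mz in (("b%d" % i, b), ("y%d" % (n - i), y)):
            peaks.append((mz, 100, name))                        # B/Y: 100%
            # Assumption: "+2H" means the doubly protonated (2+) ion.
            peaks.append(((mz + PROTON) / 2, 50, name + " 2+"))  # 50%
            peaks.append((mz - AMMONIA, 20, name + " -NH3"))     # 20%
            peaks.append((mz - WATER, 20, name + " -H2O"))       # 20%
        peaks.append((b - CO, 20, "a%d" % i))                    # A type: 20%
    return sorted(peaks)


spectrum = model_spectrum("ELVISLIVESK")
for mz, intensity, label in spectrum[:5]:
    print(f"{mz:8.3f}  {intensity:3d}%  {label}")
```

For an 11-residue peptide this gives 10 cleavage sites and 9 model peaks per site, so even a crude rule set produces a fairly dense hypothetical spectrum.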

Model Spectrum
So if we consider the spectrum that was derived from the ELVIS LIVES K peptide…

Model Spectrum
…we can find where the overlap is between the hypothetical and the actual spectra…

Model Spectrum
…and say conclusively, based on the evidence, that the spectrum does belong to the ELVIS LIVES K peptide.

But who cares? The more important question is “what about situations where we don’t know the sequence?”

We guess!

PepSeq
AAAAAAAAAA, AAAAAAAAAC, AAAAAAAACC, AAAAAAACCC, …, ELVISLIVESK, …, WYYYYYYYYY, YYYYYYYYYY
And so this was an approach followed by a program called PepSeq, which would guess every combination of amino acids possible, build a hypothetical spectrum, and find the best matching hypothetical.
J. Rozenski et al., Org. Mass Spectrom., 29 (1994) 654-658.

PepSeq
• Impossibly hard after 7 or 8 amino acids!
• High false positive rate because you consider so many options
This was a start, but it’s clearly impossibly hard with larger peptides, and there’s a lot of room to overfit the data.
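
A quick way to feel the problem is to enumerate candidates lazily and count how many there are at each length. This sketch assumes the 20 standard amino acids and says nothing about how PepSeq itself was implemented; `candidates` is a hypothetical generator.

```python
from itertools import islice, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def candidates(length):
    """Lazily yield every possible peptide of the given length."""
    for letters in product(AMINO_ACIDS, repeat=length):
        yield "".join(letters)


# Peek at the first few guesses without building the whole list.
print(list(islice(candidates(10), 4)))

# The number of guesses grows as 20 to the power of the length.
for n in (4, 7, 8, 11):
    print(n, len(AMINO_ACIDS) ** n)
```

At 11 residues the count is 20^11, about 2*10^14, and every one of those guesses would need its own hypothetical spectrum.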

PepSeq
So obviously this isn’t going to work in the long run. Another strategy is needed!

Sequencing Explosion
• 1977 Shotgun sequencing invented, bacteriophage φX174 sequenced.
• 1989 Yeast Genome project announced
• 1990 Human Genome project announced
• 1992 First chromosome (Yeast) sequenced
• 1995 H. influenzae sequenced
• 1996 Yeast Genome sequenced
• 2000 Human Genome draft
We needed a new invention to come around, and that was shotgun Sanger sequencing. In ’89 and ’90 the Yeast and Human Genome projects were announced, followed by the first chromosome in ’92, et cetera, et cetera…

Sequencing Explosion
Eng, J. K.; McCormack, A. L.; Yates, J. R. III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
…In 1994 Jimmy Eng and John Yates published a technique to exploit genome sequencing for use in tandem mass spectrometry. And the idea was…

SEQUEST
…instead of searching all possible peptide sequences, search only those in genome databases. Now, in the post-genomic world this seems like a pretty trivial idea, but back then there was a lot of assumption placed on the idea that we’d actually have a complete Human genome in a reasonable amount of time.

SEQUEST
• 2*10^14: All possible 11mers (ELVISLIVESK)
• 2*10^10: All possible peptides in NR
• 1*10^8: All tryptic peptides in NR
• 4*10^6: All Human tryptic peptides in NR
So that was huge. In terms of 11 amino acid peptides, we’re talking about a 10 thousand fold difference between searching every possible 11mer and searching those in the current non-redundant protein database (NR) from the NCBI, and roughly a 50 million fold difference for searching human tryptic peptides. It made hypothetical spectrum matching feasible.
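
The fold differences follow directly from the four counts on the slide. The snippet below just does that division; the dictionary keys are shorthand labels, not database names.

```python
# Search-space sizes as quoted on the slide.
SEARCH_SPACE = {
    "all possible 11mers": 2e14,
    "all peptides in NR": 2e10,
    "all tryptic peptides in NR": 1e8,
    "all human tryptic peptides in NR": 4e6,
}

full = SEARCH_SPACE["all possible 11mers"]
reduction = {name: full / size for name, size in SEARCH_SPACE.items()}
for name, fold in reduction.items():
    print(f"{name}: {fold:,.0f}-fold smaller than brute force")
```

From the slide’s own figures, restricting the search to NR peptides gives 10^4, and restricting it to human tryptic peptides gives 5*10^7.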

SEQUEST Model Spectrum
SEQUEST made a couple of other interesting improvements as well. Jimmy and John noted that there was a discontinuity between the intensities of the hypothetical spectrum and the actual spectrum. Instead of trying to make a better model, they decided just to make the actual spectrum look like the model with normalization…

SEQUEST Model Spectrum
…like so. For a scoring function they decided to use cross-correlation, which basically sums the peaks that overlap between the hypothetical and the actual spectra.

SEQUEST Model Spectrum
And then they shifted the spectra back and…

SEQUEST Model Spectrum
…forth so that the peaks shouldn’t align. They used this number, also called the auto-correlation, as their background.

SEQUEST XCorr
[Figure: correlation score vs. offset (AMU), marking the cross-correlation (direct comparison) and the auto-correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610]
This is another representation of the cross-correlation and the auto-correlation.

SEQUEST XCorr
[Figure: correlation score vs. offset (AMU), marking the cross-correlation (direct comparison) and the auto-correlation (background). Gentzel M. et al., Proteomics 3 (2003) 1597-1610]
XCorr = cross-correlation (direct comparison) / average auto-correlation (background)
The XCorr score is the cross-correlation divided by the average of the auto-correlation over a 150 AMU range. The XCorr is high if the direct comparison is significantly greater than the background, which is obviously good for peptide identification.
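
Here is a small sketch of that calculation on toy data. It assumes 1-AMU bins and reads the 150 AMU range as offsets from -75 to +75 with zero excluded; the peaks are made up, and the normalization step is left out. The talk describes the score as a ratio, while the published SEQUEST description is usually summarized as the zero-offset correlation minus the mean background, so the hypothetical `xcorr` helper returns both forms rather than picking one.

```python
def binned(peaks, size=400):
    """Place (m/z, intensity) peaks into 1-AMU bins."""
    spectrum = [0.0] * size
    for mz, intensity in peaks:
        spectrum[int(round(mz))] += intensity
    return spectrum


def correlation(observed, model, offset):
    """Sum of overlapping peak products with the model shifted by offset bins."""
    total = 0.0
    for i, intensity in enumerate(observed):
        j = i + offset
        if 0 <= j < len(model):
            total += intensity * model[j]
    return total


def xcorr(observed, model, window=75):
    """Return (ratio, difference) of direct match versus shifted background."""
    direct = correlation(observed, model, 0)
    offsets = [t for t in range(-window, window + 1) if t != 0]
    background = sum(correlation(observed, model, t) for t in offsets) / len(offsets)
    ratio = direct / background if background else float("inf")
    return ratio, direct - background


# Made-up spectra: three real peaks plus two noise peaks in the observation.
observed = binned([(72, 1.0), (100, 0.2), (201, 1.0), (250, 0.3), (298, 1.0)])
right_model = binned([(72, 1.0), (201, 1.0), (298, 1.0)])
wrong_model = binned([(80, 1.0), (190, 1.0), (260, 1.0)])

print("right:", xcorr(observed, right_model))
print("wrong:", xcorr(observed, wrong_model))
```

Either way of combining the two numbers ranks the right model well above the wrong one, because only the right model lines up at zero offset.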

SEQUEST DeltaCn
And this XCorr is actually a pretty robust method for estimating how accurate the match is, and so far there really haven’t been any significant improvements on it. The DeltaCn is another score that scientists often use. It measures how good the XCorr is relative to the next best match. As you can see, this is actually a pretty crude calculation.
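
The slide’s formula did not survive in this transcript, so the sketch below uses the definition of DeltaCn as it is commonly given: the gap between the best and second-best XCorr, divided by the best XCorr. The XCorr values are hypothetical.

```python
def delta_cn(xcorrs):
    """Normalized gap between the best and second-best XCorr."""
    ranked = sorted(xcorrs, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return (best - runner_up) / best


# Hypothetical XCorr values for four candidate peptides.
print(delta_cn([3.2, 2.4, 1.1, 0.9]))
```

Only two of the candidates ever enter the calculation, which is what makes it crude: everything below the runner-up is ignored.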

Here’s another representation of that sentiment. The XCorr is a strong measure of accuracy, whereas the DeltaCn is a weak measure of relative goodness.
                    Accuracy Score    Relative Score
SEQUEST             Strong (XCorr)    Weak (DeltaCn)

Obviously, there could be an alternative method that focuses more on the success of the relative score. Mascot and X! Tandem fit that bill.
                    Accuracy Score    Relative Score
SEQUEST             Strong (XCorr)    Weak (DeltaCn)
Alternate Method    Weak              Strong

X! Tandem Scoring
by-Score = sum of intensities of peaks matching B-type or Y-type ions
HyperScore = by-Score multiplied by factorial terms
Fenyo, D.; Beavis, R. C. Anal. Chem., 75 (2003) 768-774
Now the X! Tandem accuracy score is rather crude. It only considers B and Y ions and attaches these factorial terms with an admittedly hand-waving argument.
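
As a rough sketch of that score: the by-Score is the summed intensity of matched B and Y peaks, and the hyperscore, as it is usually written, multiplies that sum by the factorial of the number of matched B ions and the factorial of the number of matched Y ions. The observed peaks, the 0.5 Da tolerance, and the `hyperscore` helper below are all hypothetical.

```python
from math import factorial


def hyperscore(observed, b_ions, y_ions, tolerance=0.5):
    """Return (by_score, hyperscore) for observed (m/z, intensity) peaks."""
    def matched(ions):
        hits = []
        for ion in ions:
            near = [inten for mz, inten in observed if abs(mz - ion) <= tolerance]
            if near:
                hits.append(max(near))  # take the tallest peak near this ion
        return hits

    b_hits, y_hits = matched(b_ions), matched(y_ions)
    by_score = sum(b_hits) + sum(y_hits)
    return by_score, by_score * factorial(len(b_hits)) * factorial(len(y_hits))


# Expected B and Y ions for AEPTIR (rounded), and a made-up observation.
b_ions = [72.0, 201.1, 298.1, 399.2]
y_ions = [175.1, 288.2, 389.3]
observed = [(72.0, 10), (201.1, 30), (288.2, 20), (350.0, 5), (399.2, 40)]

print(hyperscore(observed, b_ions, y_ions))
```

Here three B ions and one Y ion match, so the by-Score of 100 becomes a hyperscore of 100 * 3! * 1!; the factorials reward a peptide for matching many ions rather than a few tall ones.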

Distribution of “Incorrect” Hits
[Figure: histogram of # of matches vs. hyperscore, with the second best and best hits marked]
But instead of just comparing the best match to the second best, it looks at the distribution of lower scoring hits, assuming that they are all wrong. This is somewhat based on ideas pioneered with the BLAST algorithm. Here, every bar represents the number of matches at a given score. The X! Tandem creators found that the distribution decays (or slopes down) exponentially…
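
An exponential decay shows up as a straight line once you take the logarithm of the counts, so one simple way to check it is a least-squares line through (score, ln(count)). The histogram below is made up, `fit_log_linear` is a hypothetical helper, and reading the fitted line at a higher score is an illustration of what the trend would predict there, not a description of X! Tandem’s own code.

```python
import math


def fit_log_linear(histogram):
    """Least-squares line through (score, ln(count)) for nonzero bins."""
    points = [(s, math.log(c)) for s, c in histogram.items() if c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x


# Made-up histogram: hyperscore bin -> number of (presumed wrong) matches.
histogram = {20: 400, 25: 200, 30: 100, 35: 50, 40: 25}
slope, intercept = fit_log_linear(histogram)

# Number of wrong matches the fitted trend would predict at a score of 60.
expected_at_60 = math.exp(intercept + slope * 60)
print(slope, expected_at_60)
```

In this toy histogram the counts halve every 5 score units, so the fitted slope is -ln(2)/5 and the trend predicts fewer than two wrong matches at a score of 60.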
