Advanced Top-Down Protein Identification: MS-Align+ Algorithm

Protein Identification Using Top-Down SpectraXiaowen Liu1, Yakov Sirotkin2, Yufeng Shen3, Gordon Anderson3, Yihsuan S. Tsai4, Ying S. Ting4, David R. Goodlett4, Richard D. Smith3, Vineet Bafna1 and Pavel A. Pevzner11Department of Computer Science and Engineering, University of California San Diego, CA, 2Algorithmic Biology Laboratory, St. Petersburg Academic University, 3Biological Science Division, Pacific Northwest National Laboratory, Richland, WA, 4Department of Medicine Chemistry, University of Washington, Seattle, WA Introduction In the last two years, due to advances in mass spectrometry, top-down mass spectrometry moved from analyzing single proteins to analyzing complex samples and identifying hundreds and even thousands of proteins. However, computational tools for database search of top-down spectra against protein databases are still in infancy. We describe MS-Align+, a fast algorithm for top-down protein identification based on spectral alignment that enables searches for unexpected post-translational modifications (PTMs). We also use generation function approach for evaluating statistical significance of top-down protein identifications and further benchmark MS-Align+ along with MASCOT, OMSSA, PIITA and ProSightPTM on two data sets from Saccharomyces cerevisiae (SC) and Salmonella typhimurium (ST). We demonstrate that MS-Align+ significantly increase the number of identified spectra as compared to MASCOT and OMSSA. While MS-Align+, PIITA and ProSightPTM have similar performance on ST data set, MS-Align+ outperforms ProSightPTM on the more complex SC data set. Methods Given a prefix residue mass (PRM) spectrum S={a0, a1, …, an} and a protein P={b0, b1, …, bm}, where bi equals the sum of the masses of the first i residues of the protein, we define the spectral grid of P and S as a rectangle in 2D with (n+1) (m+1) matching points. A global spectral alignment between P and S is a path from the top left to the bottom right corner of the spectral grid consisting of diagonal, vertical and horizontal segments (Fig. 1(a)). Diagonal segments correspond to sub-peptides of P matched to S without PTMs; vertical segments correspond to PTMs with positive offsets, while horizontal segments correspond to PTMs with negative offsets. The lengths of horizontal and vertical segments represent the offsets of PTMs. We often observe top-down spectra mapped to proteins with truncated peptides on N- and/or C-termini. To identify these spectra, we define semi-global spectral alignment between P and S as a path from a matching point on the top border to a matching point on the bottom border in the spectral grid (Fig. 1(b)). (b) (a) Figure 1. Global spectral alignment and semi-global spectral alignment. (a) Global spectral alignment. (b) Semi-global spectral alignment. Given a top-down PRM spectrum, We use a dynamic programming spectral alignment algorithm to find the best semi-global spectral alignment between the spectrum and each protein in the protein database and report the best-scoring protein. To speed up the protein identification process, we apply a spectral convolution approach to quickly filter out the proteins that can not have good alignments, and propose a faster spectral alignment algorithm that uses the results of the spectral convolution. In addition, we estimate the E-value of each identified protein-spectrum-match(PrSM) using a generating function approach. Results We tested MS-Align+ on two top-down data sets from Saccharomycescerevisiae (SC) and Salmonella typhimurium (ST). After data preprocessing, we obtained 11,030 and 4,439 deconvoluted spectra with precursor mass no less than 2,500 Da and at least 10 monoisotopic masses from the SC and ST data sets, respectively. We applied MS-Align+ to search the deconvoluted spectra against the corresponding proteomes. We set 15 PPM as the error tolerance for precursor and fragment ions and allowed two mass shift in the alignments (N- and C-terminal truncated peptides are not counted as mass shifts). Using the target/decoy approach, we identified 4,059 and 2,217 MS/MS spectra with spectral level 1% FDR from the SC and ST data sets, respectively. Comparison with MASCOT, OMSSA, PIITA and ProSightPTM (b) (a) (a) (b) Figure 2. (a) SC data set. MASCOT shares 2,892 spectra (91.9%) with MS-Align+, and OMSSA shares 2,745 spectra (95.5%) with MS-Align+. MASCOT and OMSSA share 2,657 spectra. All the three tools share 2,575 spectra. (b) ST data set. MASCOT shares 560 spectra (98.1%) with MS-Align+, and OMSSA shares 440 spectra (94.8%) with MS-Align+. MASCOT and OMSSA share 245 spectra. All the three tools share 243 spectra. Figure 3. (a) SC data set. ProSightPTM shares 2,298 spectra (99.5%) with MS-Align+. (b) ST data set. PIITA shares 1,641 spectra (75.0%) with MS-Align+, and ProSightPTM shares 1,743 spectra (78.2%) with MS-Align+. PIITA and ProSightPTM shares 1,718 spectra. All the three tools share 1,458 spectra. MS-Align+ identified significantly more spectra than MASCOT and OMSSA, especially for the spectra with precursor mass > 5K Da, MS-Align+ still has good sensitivity (Figure 4(b)). While MS-Align+, PIITA and ProSightPTM have similar performance on the ST data set, MS-Align+ outperforms ProSightPTM on the SC data set since it has many truncated proteins which ProSightPTM may fail to identify (Figure 4 (a)). (b) (a) Figure 4. Distributions of precursor masses of the spectra identified by MS-Align+, MASCOT, OMSSA, PIITA and ProSightPTM. (a) SC data set. (b) ST data set.

Advanced Top-Down Protein Identification: MS-Align+ Algorithm

Advanced Top-Down Protein Identification: MS-Align+ Algorithm

Presentation Transcript

Introduction to introduction to introduction to … Optimization

INTRODUCTION/ INTRODUCTION

Introduction

INTRODUCTION

Introduction

Introduction