MS & MS/MS Search Engines for Proteomics
Pedro J. Navarro, PhD student, Centro de Biología Molecular Severo Ochoa
How are our data entries? Lists of m/z values
• MS: peptide mass values (and sometimes the intensities)
840.695086 1676.96063 1498.8283 1045.564 2171.967066 861.107346 842.51458 1456.727405 863.268365
• MS/MS: parent mass value and parent mass charge, followed by fragment mass values and fragment intensities
1163.7008 2
86.1105 220.1429
86.1738 13.7619
102.0752 4.3810
147.1329 57.3333
185.1851 649.0953
185.3589 5.3810
186.1876 81.4286
213.0791 1.4286
File formats
• Heterogeneity: each manufacturer, even each instrument, has its own file format
• Different information and types of data across formats
• Raw data often not available
• Format often not known
• Libraries for reading formats often not available
Main file formats
• DTA (data)
• Company: Thermo Electron + Sequest
• Precursor mass unit: [M+H]+
• Precursor intensity: no
• Format: single or multiple spectra
Example:
1603.9204 2
101.0909 15.4762
202.1079 21.4762
203.1045 5.3333
244.1280 14.3810
254.1056 3.4286
255.1910 7.2381
270.2388 2.1905
…
962.6160 2
70.0560 2.1224
86.0947 5.1565
115.0842 8.4263
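As a minimal sketch of how the DTA layout above can be read, assuming a single-spectrum file whose first line holds the [M+H]+ value and the charge and whose remaining lines hold m/z and intensity pairs. The function name `parse_dta` is hypothetical and not part of any vendor library.

```python
def parse_dta(text):
    """Parse single-spectrum DTA text into (mh_plus, charge, peaks)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # First row: precursor [M+H]+ mass and charge state.
    mh_plus, charge = float(rows[0][0]), int(rows[0][1])
    # Remaining rows: fragment m/z and intensity.
    peaks = [(float(mz), float(intensity)) for mz, intensity in rows[1:]]
    return mh_plus, charge, peaks


# First lines of the example shown on the slide.
dta_text = """1603.9204 2
101.0909 15.4762
202.1079 21.4762
203.1045 5.3333
"""
mh_plus, charge, peaks = parse_dta(dta_text)
print(mh_plus, charge, len(peaks))
```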
Main file formats
• MGF (Mascot generic format)
• Company: Matrix Science
• Precursor mass unit: m/z
• Precursor intensity: optional
• Format: single or multiple spectra
Example:
BEGIN IONS
TITLE=A1.1013.1013.2
CHARGE=2+
PEPMASS=715.940915
218.251 1.6
259.403 1.7
271.122 1.2
284.268 1.4
287.317 2.3
297.139 1.2
326.877 1.9
…
END IONS
BEGIN IONS
TITLE=A1.1013.1013.2
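The MGF layout can be read with a few lines of code. The sketch below assumes only the elements visible in the example (BEGIN IONS / END IONS blocks, KEY=value parameters, and peak lines with m/z followed by an optional intensity). `parse_mgf` is a hypothetical helper, not a Matrix Science API.

```python
def parse_mgf(text):
    """Parse MGF text into a list of spectra (dicts with 'params' and 'peaks')."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Parameter line such as CHARGE=2+ or PEPMASS=715.940915.
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z, then an optional intensity.
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else None
                current["peaks"].append((float(fields[0]), intensity))
    return spectra


mgf_text = """BEGIN IONS
TITLE=A1.1013.1013.2
CHARGE=2+
PEPMASS=715.940915
218.251 1.6
259.403 1.7
271.122 1.2
END IONS
"""
spectra = parse_mgf(mgf_text)
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```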
Precursor value in DTA and Mascot
• In a DTA file, the precursor peptide mass is an MH+ value independent of the charge state.
• In Mascot generic format, the precursor peptide mass is an observed m/z value, from which Mr or MH_n^(n+) is calculated using the prevailing charge state.
• For example, in Mascot:
PEPMASS=1000
CHARGE=2+
means that the relative molecular mass Mr is 1998. This is equivalent to a DTA file which starts:
1999 2
Source: Matrix Science site
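The arithmetic behind this example can be sketched as follows, assuming a proton mass of 1.007276 Da. The function names are illustrative only.

```python
PROTON = 1.007276  # proton mass in Da (assumed constant for this sketch)


def mz_to_mr(mz, charge):
    """Relative molecular mass Mr from an observed m/z and charge state."""
    return mz * charge - charge * PROTON


def mr_to_mh(mr):
    """Singly protonated mass MH+ (the value stored in a DTA header)."""
    return mr + PROTON


mr = mz_to_mr(1000.0, 2)
mh = mr_to_mh(mr)
print(round(mr), round(mh))  # 1998 1999
```

Rounded to integers, this reproduces the Mr of 1998 and the DTA header value of 1999 quoted above.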
Tentative standard file formats I
• mzData (still evolving)
• Organisation: HUPO PSI (Proteomics Standards Initiative)
• Precursor mass unit: m/z
• Precursor intensity: optional
• Format:
  • Defined by PSI-MS (HUPO Mass Spectrometry Standards Working Group)
  • Currently only peak lists (raw data in the future)
  • Supported by software vendors and manufacturers (Bruker Daltonics, Shimadzu Group)
• In construction: the analysisXML standard, which captures parameters and results of search engines
Tentative standard file formats II
• mzXML and mzML (still evolving)
• Organisation: Institute for Systems Biology (Seattle) and Institute for Molecular Systems Biology (ETH Zurich)
• Precursor mass unit: m/z
• Precursor intensity: no
• Format:
  • Accepts peak lists and raw data
  • Developed by one group; used in several projects
  • Ongoing effort to merge with mzData
• pepXML and protXML formats for results of search engines (peptides and proteins, respectively)
Indexing Databases • A protein sequence… MWVCLQLPVFLASVTLFEVAASDTIAQAASTTTISDAVSKVKIQVNKAFLDSRTRLKTTLSSEAPTTQQLSEYFKHAKGRTRTAIRNGQVWEESLKRLRRDTTLTNVTDPSLDLTALSWEVGCGAPVPLVKCDENSPYRTITGDCNNRRSPALGAANRALARWLPAEYEDGLALPFGWTQRKTRNGFRVPLAREVSNKIVGYLDEEGVLDQNRSLLFMQWGQIVDHDLDFAPETELGSNEHSKTQCEEYCIQGDNCFPIMFPKNDPKLKTQGKCMPFFRAGFVCPTPPYQSLAREQINAVTSFLDASLVYGSEPSLASRLRNLSSPLGLMAVNQEAWDHGLAYLPFNNKKPSPCEFINTTARVPCFLAGDFRASEQILLATAHTLLLREHNRLARELKKLNPHWNGEKLYQEARKILGAFIQIITFRDYLPIVLGSEMQKWIPPYQGYNNSVDPRISNVFTFAFRFGHMEVPSTVSRLDENYQPWGPEAELPLHTLFFNTWRIIKDGGIDPLVRGLLAKKSKLMNQDKMVTSELRNKLFQPTHKIHGFDLAAINLQRCRDHGMPGYNSWRGFCGLSQPKTLKGLQTVLKNKILAKKLMDLYKTPDNIDIWIGGNAEPMVERGRVGPLLACLLGRQFQQIRDGDRFWWENPGVFTEKQRDSLQKVSFSRLICDN
Indexing Databases • …is digested “in silico” by a protease (e.g. trypsin): MWVCLQLPVFLASVTLFEVAASDTIAQAASTTTISDAVSK VK IQVNK AFLDSR TR LK TTLSSEAPTTQQLSEYFK HAK GR TR TAIR NGQVWEESLK R LR R DTTLTNVTDPSLDLTALSWEVGCGAPVPLVK CDENSPYR TITGDCNNR R SPALGAANR ALAR WLPAEYEDGLALPFGWTQR K TR NGFR VPLAR EVSNK IVGYLDEEGVLDQNR SLLFMQWGQIVDHDLDFAPETELGSNEHSK TQCEEYCIQGDNCFPIMFPK NDPK LK TQGK CMPFFR AGFVCPTPPYQSLAR EQINAVTSFLDASLVYGSEPSLASR LR NLSSPLGLMAVNQEAWDHGLAYLPFNNK K PSPCEFINTTAR VPCFLAGDFR ASEQILLATAHTLLLR EHNR LAR ELK K LNPHWNGEK LYQEAR K ILGAFIQIITFR DYLPIVLGSEMQK WIPPYQGYNNSVDPR ISNVFTFAFR FGHMEVPSTVSR LDENYQPWGPEAELPLHTLFFNTWR IIK DGGIDPLVR GLLAK K SK LMNQDK MVTSELR NK LFQPTHK IHGFDLAAINLQR CR DHGMPGYNSWR GFCGLSQPK TLK GLQTVLK NK ILAK K LMDLYK TPDNIDIWIGGNAEPMVER GR VGPLLACLLGR QFQQIR DGDR FWWENPGVFTEK QR DSLQK VSFSR LICDN
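The digestion shown above cleaves after every K or R, including before proline (for example K | PSPCEFINTTAR). A minimal sketch of that simplified rule is given below. The function name `digest` and its `cleave_after` parameter are illustrative, and real search engines typically add rules for proline and missed cleavages that are not modelled here.

```python
def digest(sequence, cleave_after="KR"):
    """Cleave a protein sequence after every residue listed in cleave_after."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        if residue in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep the C-terminal peptide, which does not end in K or R.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


# Two short stretches taken from the protein sequence on the slide.
print(digest("IQVNKAFLDSRTRLK"))    # ['IQVNK', 'AFLDSR', 'TR', 'LK']
print(digest("QRDSLQKVSFSRLICDN"))  # ['QR', 'DSLQK', 'VSFSR', 'LICDN']
```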
Indexing Databases • The resulting peptides are sorted: K K K K K K R R R CR GR GR LR LR TR TR TR LK LK QR SK VK NK NK LAR ELK HAK IIK TLK DGDR EHNR ALAR NDPK NGFR ILAK TAIR TQGK DSLQK EVSNK GLLAK VPLAR VSFSR IQVNK LICDN AFLDSR CMPFFR LMDLYK LMNQDK QFQQIR LYQEAR MVTSELR GLQTVLK LFQPTHK CDENSPYR GFCGLSQPK DGGIDPLVR SPALGAANR TITGDCNNR LNPHWNGEK NGQVWEESLK VPCFLAGDFR ISNVFTFAFR DHGMPGYNSWR VGPLLACLLGR PSPCEFINTTAR ILGAFIQIITFR FGHMEVPSTVSR FWWENPGVFTEK IHGFDLAAINLQR DYLPIVLGSEMQK IVGYLDEEGVLDQNR AGFVCPTPPYQSLAR WIPPYQGYNNSVDPR ASEQILLATAHTLLLR TTLSSEAPTTQQLSEYFK TPDNIDIWIGGNAEPMVER WLPAEYEDGLALPFGWTQR TQCEEYCIQGDNCFPIMFPK LDENYQPWGPEAELPLHTLFFNTWR EQINAVTSFLDASLVYGSEPSLASR NLSSPLGLMAVNQEAWDHGLAYLPFNNK SLLFMQWGQIVDHDLDFAPETELGSNEHSK DTTLTNVTDPSLDLTALSWEVGCGAPVPLVK MWVCLQLPVFLASVTLFEVAASDTIAQAASTTTISDAVSK
Indexing Databases • The peptides not measurable by the spectrometer are removed: DSLQK EVSNK GLLAK VPLAR VSFSR IQVNK LICDN AFLDSR CMPFFR LMDLYK LMNQDK QFQQIR LYQEAR MVTSELR GLQTVLK LFQPTHK CDENSPYR GFCGLSQPK DGGIDPLVR SPALGAANR TITGDCNNR LNPHWNGEK NGQVWEESLK VPCFLAGDFR ISNVFTFAFR DHGMPGYNSWR VGPLLACLLGR PSPCEFINTTAR ILGAFIQIITFR FGHMEVPSTVSR FWWENPGVFTEK IHGFDLAAINLQR DYLPIVLGSEMQK IVGYLDEEGVLDQNR AGFVCPTPPYQSLAR WIPPYQGYNNSVDPR ASEQILLATAHTLLLR TTLSSEAPTTQQLSEYFK TPDNIDIWIGGNAEPMVER WLPAEYEDGLALPFGWTQR TQCEEYCIQGDNCFPIMFPK LDENYQPWGPEAELPLHTLFFNTWR EQINAVTSFLDASLVYGSEPSLASR NLSSPLGLMAVNQEAWDHGLAYLPFNNK SLLFMQWGQIVDHDLDFAPETELGSNEHSK DTTLTNVTDPSLDLTALSWEVGCGAPVPLVK MWVCLQLPVFLASVTLFEVAASDTIAQAASTTTISDAVSK
Indexing Databases • The remaining peptides are then stored as peptide masses: 500.32 554.34 575.28 576.24 589.29 594.3 600.34 707.34 747.34 757.45 778.38 781.39 799.34 818.42 834.41 855.44 869.46 935.44 940.48 982.37 992.42 1093.52 1110.64 1123.53 1188.56 1200.61 1318.54 1334.61 1345.63 1390.82 1466.78 1491.75 1538.7 1605.78 1718.83 1749 1804.84 2029.97 2125.99 2248.08 2364.97 2653.31 3072.46
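A peptide mass can be computed by summing residue masses and adding one water. The sketch below assumes standard monoisotopic residue masses and an uncharged peptide; the slide does not state which mass table was used, so small differences are expected (GLLAK evaluates to about 500.33 here, against 500.32 at the head of the list). `peptide_mass` is an illustrative name.

```python
# Monoisotopic residue masses in Da (standard values, assumed for this sketch).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one H2O is added for the peptide termini


def peptide_mass(peptide):
    """Uncharged monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


for peptide in ("GLLAK", "VPLAR", "EVSNK"):
    print(peptide, round(peptide_mass(peptide), 2))
```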
Indexing Databases
• If we look for a parent mass of 575.4 Da:
500.32 554.34 575.28 576.24 589.29 594.3 600.34 707.34 747.34 757.45 778.38 781.39 799.34 818.42 834.41 855.44 869.46 935.44 940.48 982.37 992.42 1093.52 1110.64 1123.53 1188.56 1200.61 1318.54 1334.61 1345.63 1390.82 1466.78 1491.75 1538.7 1605.78 1718.83 1749 1804.84 2029.97 2125.99 2248.08 2364.97 2653.31 3072.46
We obtain only two candidate peptides!
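Because the index is sorted by mass, candidates for a parent mass can be found with a binary search. The sketch below uses the first ten masses of the list above and assumes a tolerance of ±1.0 Da, a value chosen for illustration because the slide does not state the tolerance.

```python
from bisect import bisect_left, bisect_right

# First ten entries of the sorted mass index shown on the slide.
MASS_INDEX = [500.32, 554.34, 575.28, 576.24, 589.29,
              594.3, 600.34, 707.34, 747.34, 757.45]


def candidates(index, parent_mass, tolerance):
    """Return all indexed masses within parent_mass +/- tolerance."""
    low = bisect_left(index, parent_mass - tolerance)
    high = bisect_right(index, parent_mass + tolerance)
    return index[low:high]


print(candidates(MASS_INDEX, 575.4, 1.0))  # [575.28, 576.24]
```

With this tolerance the search returns two masses, in line with the two candidate peptides mentioned on the slide.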
Peptide identification in databases from MS/MS spectra
The theoretical spectra are correlated with the experimental spectrum, producing the Xcorr.
SCORES
[Figure: classification of search-engine scores. 1. Parametric scores (empirical, theoretical, mixed); engines shown: SEQUEST (Xcorr, Sp), hypergeometrical, SCOPE, OMSSA, MASCOT (???), OLAV (Phenyx), SONAR. 2. Distribution-based scores: single-spectrum (BLAST, FASTA-like) p- and e-values; average score distributions (p-experiment FDR); machine learning algorithms (discriminant, FDR).]
Correlation scoring: How does SEQUEST work?
[Figure: observed spectrum and theoretical spectrum, each plotted as % relative intensity versus m/z.]
SEQUEST measures the degree of correlation between the two spectra.
How does SEQUEST work?
• Initial (fast) scoring step of all sequence candidates using the following score (Sp):
S_p = \left( \sum_m i_m \right) \, n_m \, \frac{(1 + \beta)(1 + \rho)}{n_\tau}
where \sum_m i_m is the sum of intensities of matching fragments, n_m the number of matching fragments, n_\tau the total number of predicted sequence ions, \beta the bonus for consecutive fragment ions and \rho the bonus for presence of immonium ions.
Eng et al., 1994; Yates et al., 1995a and b
How does SEQUEST work?
• 2nd step: discrete cross-correlation analysis of the 500 candidates with the best Sp scores (this kind of analysis has been used to identify UV absorption spectra, atomic emission spectra and γ-ray spectra in libraries).
• Discrete cross-correlation function between signals x and y:
R_\tau = \sum_{i=0}^{n-1} x_i \, y_{i+\tau}
where the two signals are the observed spectrum and the theoretical spectrum, and \tau is the displacement in the m/z scale.
• Score Cn: normalized (R_\tau at \tau = 0 minus the mean of R_\tau over -75 < \tau < 75)
Eng et al., 1994; Yates et al., 1995a and b
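The score described here, R at zero displacement minus the mean of R over -75 < τ < 75, can be sketched directly. The code assumes both spectra are already binned into equal-length intensity lists, treats bins outside the range as zero, and leaves out the normalization step, whose details are not given on the slide. Function names are illustrative.

```python
def cross_correlation(x, y, tau):
    """R(tau) = sum_i x[i] * y[i + tau]; bins outside y count as zero."""
    return sum(x[i] * y[i + tau] for i in range(len(x)) if 0 <= i + tau < len(y))


def xcorr_score(x, y, window=75):
    """R(0) minus the mean of R(tau) over -window < tau < window."""
    taus = range(-window + 1, window)
    background = sum(cross_correlation(x, y, tau) for tau in taus) / len(taus)
    return cross_correlation(x, y, 0) - background


def binned(peaks, size=100):
    """Toy spectrum: unit intensity at the given integer bins."""
    spectrum = [0.0] * size
    for position in peaks:
        spectrum[position] = 1.0
    return spectrum


observed = binned([10, 20, 30])
good_match = xcorr_score(observed, binned([10, 20, 30]))
poor_match = xcorr_score(observed, binned([15, 25, 35]))
print(good_match > poor_match)  # True
```

Subtracting the mean over displaced positions penalizes spectra that correlate equally well at any offset, which is the behaviour expected from random matches.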
How does SEQUEST work?
• Computational shortcut: cross-correlations are computed using Fast Fourier Transforms (FFT):
  x → FFT → X
  y → FFT → Y → complex conjugation → Y*
  X · Y* → inverse FFT → R
Powell & Hieftje, 1978
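The shortcut can be checked numerically. The sketch below uses a small recursive radix-2 FFT written for illustration (input length must be a power of two) and compares the inverse FFT of X · Y* with a direct circular sum. With this product the result is R[τ] = Σ x[i + τ] · y[i]; conjugating X instead would mirror τ without changing R[0].

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlate_fft(x, y):
    """Circular R[tau] = sum_i x[(i + tau) % n] * y[i], via IFFT(X * conj(Y))."""
    n = len(x)
    product = [a * b.conjugate() for a, b in zip(fft(x), fft(y))]
    return [value.real / n for value in fft(product, inverse=True)]


def cross_correlate_direct(x, y):
    """Same quantity computed by the O(n^2) definition."""
    n = len(x)
    return [sum(x[(i + tau) % n] * y[i] for i in range(n)) for tau in range(n)]


x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
fast = cross_correlate_fft(x, y)
slow = cross_correlate_direct(x, y)
print(all(abs(a - b) < 1e-9 for a, b in zip(fast, slow)))  # True
```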
SEQUEST: best score (Xcorr) and delta score (ΔCn)
[Figure: plot with the labels Xcorr, Spectrum # and Scores.]
Information provided by the best score (Xcorr) and the delta score (ΔCn)
• Best score: evaluates how good the best match is
• Delta score: evaluates how much the best score deviates from random behaviour
• The second best score also evaluates the deviation of the best score
[Figure: score versus ranking of peptide sequences (1 to 9), with the random matching behaviour marked.]
X!Tandem www.thegpm.org
X!Tandem
• Calculates statistical confidence (E-values) for all of the individual spectrum-to-sequence assignments
• Reassembles all of the peptide assignments in a data set onto the known protein sequences and assigns the statistical confidence that this assembly and alignment is non-random
• The score distribution is log-transformed so that a linear fit can be used to obtain the E-values
X!Tandem
• Matches experimental versus theoretical spectra
• Preliminary score = dot product of experimental versus theoretical spectra (because only similar peaks are considered, this is the sum of the intensities of the matched y and b ions)
• Hyperscore = preliminary score multiplied by N_b! · N_y!, where N_b and N_y are the numbers of b and y ions assigned
• A histogram is made of the hyperscores of all the peptides in the database that might match this particular spectrum
• The histogram values are log-transformed and a straight line is fitted through them
• A match is significant if its hyperscore is greater than the point at which the straight line through the log data intersects the log(# results) = 0 line
Source: Brian Searle (Proteome Software) – XTandem to be explained (to be asked permission)
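The hyperscore described above can be sketched as follows, reading "N factorial for the number of b and y ions" as the product of the two factorials N_b! · N_y!. The input lists hold the intensities of experimental peaks that matched theoretical b and y ions; the intensity values below are made up for illustration.

```python
from math import factorial


def hyperscore(matched_b, matched_y):
    """Preliminary score (sum of matched intensities) times N_b! * N_y!."""
    preliminary = sum(matched_b) + sum(matched_y)
    return preliminary * factorial(len(matched_b)) * factorial(len(matched_y))


# Two matched b ions and three matched y ions (illustrative intensities).
score = hyperscore([10.0, 20.0], [5.0, 15.0, 30.0])
print(score)  # 960.0
```

The factorial terms reward matches that explain many ions, so a peptide with the same summed intensity but more assigned fragments scores much higher.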
Log transformation and E-value
• X!Tandem calculates the E-value by extrapolating the red line of the log histogram.
• For the example shown, the number of results expected by chance with a hyperscore of 83 is read where the red line crosses 83. The log of this value is -8.2, so the E-value is e^-8.2.
[Figure: log(# results) versus hyperscore, with the fitted red line extrapolated to hyperscore 83; E-value = e^-8.2.]
Source: Brian Searle (Proteome Software) – XTandem to be explained (to be asked permission)
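The extrapolation can be sketched with an ordinary least-squares line through the log-transformed histogram. The histogram below is synthetic, built so that the line passes through -8.2 at a hyperscore of 83; it is not data from the slide. The natural logarithm is an assumption made to match the e^-8.2 notation, and which part of the histogram to fit is also left as an assumption.

```python
import math


def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_tail(histogram):
    """Fit log(count) against hyperscore for a {hyperscore: count} histogram."""
    points = [(score, math.log(count)) for score, count in histogram.items() if count > 0]
    return fit_line(points)


# Synthetic, exactly log-linear tail of a hyperscore histogram.
tail = {score: math.exp(8.4 - 0.2 * score) for score in range(20, 41)}
slope, intercept = fit_tail(tail)

log_expected = slope * 83 + intercept  # extrapolated log(# results) at 83
threshold = -intercept / slope         # hyperscore where the line crosses 0
print(round(log_expected, 2), round(threshold, 2))  # -8.2 42.0
```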
Phenyx - principles
• Basic probability-based score (log-likelihood ratio) extended from Dancik et al. 1999
• Likelihood ratio: P(correct match) / P(random match)
• Takes into consideration the probability p_q(z) of detection of each ion type q (a, b, y, etc.) and its charge state z
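Phenyx - score
• Let s = a_1, …, a_n be a peptide sequence and a_i its amino acids.
• The probability of a correct match between s and an experimental spectrum is the product of p_q(z) for each matched fragment and 1 - p_q(z) for each unmatched fragment
• Same model for random match probabilities r_q(z)
• Score L1 is the log of the ratio of these two probabilities, with the products running over the fragments (n AA) and the ion types:
L_1 = \log \frac{\prod_{\text{matched}} p_q(z) \, \prod_{\text{unmatched}} \bigl(1 - p_q(z)\bigr)}{\prod_{\text{matched}} r_q(z) \, \prod_{\text{unmatched}} \bigl(1 - r_q(z)\bigr)}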
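A minimal sketch of such a log-likelihood ratio is given below, assuming each predicted fragment ion is recorded as matched or unmatched and that detection probabilities are keyed by (ion type, charge). The probability values are invented for illustration and are not Phenyx parameters.

```python
import math


def log_likelihood_ratio(observations, p, r):
    """Sum of log(p/r) over matched ions and log((1-p)/(1-r)) over unmatched ones.

    observations: list of ((ion_type, charge), matched) pairs.
    p, r: detection probabilities under the correct and random models.
    """
    score = 0.0
    for key, matched in observations:
        if matched:
            score += math.log(p[key] / r[key])
        else:
            score += math.log((1 - p[key]) / (1 - r[key]))
    return score


# Illustrative probabilities for singly charged b and y ions.
p = {("b", 1): 0.6, ("y", 1): 0.8}
r = {("b", 1): 0.1, ("y", 1): 0.1}
observations = [(("y", 1), True), (("y", 1), True),
                (("b", 1), True), (("b", 1), False)]
print(round(log_likelihood_ratio(observations, p, r), 3))
```

A positive score means the observed pattern of matched and unmatched ions is more likely under the correct-match model than under the random one.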
From a score to a z-score
• For each peak list, Phenyx computes a score distribution for a search in a given database.
• Then it computes a score distribution on a randomly sampled set of peptides to provide a random distribution.
• Finally, it normalizes the original scores to the random distribution.
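The normalization step can be sketched as a standard z-score, assuming the random distribution is summarized by its mean and sample standard deviation. The slide does not specify the exact estimator Phenyx uses, and the numbers below are illustrative.

```python
from statistics import mean, stdev


def z_scores(scores, random_scores):
    """Express each score in standard deviations above the random mean."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)
    return [(score - mu) / sigma for score in scores]


random_scores = [1.0, 2.0, 3.0, 4.0, 5.0]  # scores against random peptides
normalized = z_scores([3.0, 8.0], random_scores)
print([round(z, 2) for z in normalized])
```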
The scoring system in Phenyx
[Figure: score distributions from the search in a query database and from the search in a randomized set of peptides.]
• The score is the sum of up to 12 basic scores such as: presence of a, b, y, y++, b-H2O…; co-occurrence of ion series (using HMMs), peak intensities, residue modifications (PTM or chemical), …
• True probabilistic approach for each peptide match: log[(likelihood of being correct) / (likelihood of being random)]
• Function of instruments and molecular types: Esquire 3000+, LCQ; iTRAQ vs. unmodified peptides
• Scores are normalised into z-scores
http://www.matrixscience.com/ Mascot Input
Mascot
• Choice of several databases
• Considers multiple chemical modifications
• 0 to 9 missed cleavages
• Score based on a combination of probabilistic and statistical approaches (based on the MOWSE score)
• Considers Swiss-Prot annotations for splice variants (using a script program)
Mascot - principles
• Probability-based scoring
• Computes the probability P that a match is random
• Significance threshold p < 0.05 (accepting that the probability of the observed event occurring by chance is less than 5%)
• The significance of that result depends on the size of the database being searched
• Mascot shades the region of insignificant hits in green
• Score: -10·log10(P)
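The score formula and its dependence on database size can be illustrated as follows. The threshold function assumes the 0.05 significance level is divided by the number of candidate peptides considered, which is one common way to express the database-size effect; it is a sketch, not Mascot's implementation.

```python
import math


def mascot_score(p_random):
    """Score = -10 * log10(P), where P is the probability of a random match."""
    return -10 * math.log10(p_random)


def score_threshold(n_candidates, alpha=0.05):
    """Score needed so that a random hit among n_candidates has probability < alpha.

    Assumption: alpha is divided by the number of candidates tested.
    """
    return mascot_score(alpha / n_candidates)


print(round(mascot_score(1e-5), 1))           # 50.0
print(round(score_threshold(1_500_000), 1))   # 74.8
```

Under this assumption, a larger database raises the score a match must reach to be called significant.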
Decoy Output Hints about the significance of the score
Output Sequence coverage Peptides matched Error function
X!Tandem www.thegpm.org
X!Tandem - output
The two-round search: Mascot, Phenyx and X!Tandem
The identification process may be launched in two rounds
• Each round is defined with a set of search criteria
• The first round searches the selected database(s) with stringent parameters
• The second round searches only the proteins that have passed the first round, with relaxed parameters:
  • Accelerates the job when looking for many variable modifications or unspecific cleavages
  • Appropriate when the first round defines stringent criteria to capture a protein ID, and the second round looks for looser peptide identifications
Example: 2nd round
• 1st round, only 3 fixed mods: 131 valid, 75% cov.
• 2nd round, add variable mods: 205 valid, 84% cov.
• 2nd round, with all mods and half cleaved: 348 valid, 90% cov.