Protein Sequencing and Identification by Mass Spectrometry. Outline. Tandem Mass Spectrometry De Novo Peptide Sequencing Spectrum Graph Protein Identification via Database Search Identifying Post Translationally Modified Peptides Spectral Convolution Spectral Alignment.
Outline • Tandem Mass Spectrometry • De Novo Peptide Sequencing • Spectrum Graph • Protein Identification via Database Search • Identifying Post Translationally Modified Peptides • Spectral Convolution • Spectral Alignment
Different Amino Acids Have Different Masses [Figure: peptide backbone H...-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH, drawn from the N-terminus to the C-terminus, with amino acid residues i-1, i, and i+1 carrying side chains Ri-1, Ri, and Ri+1]
Peptide Fragmentation • Collision Induced Dissociation [Figure: protonated (H+) peptide backbone H...-HN-CH(Ri-1)-CO . . . NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH broken into a Prefix Fragment and a Suffix Fragment] • Peptides tend to fragment along the backbone. • A mass spectrometer is a sophisticated (and rather expensive!) scale that measures the masses of these fragments.
Breaking Protein into Peptides and Peptides into Fragment Ions • Most mass spectrometers can only measure masses of short peptides (e.g., 20 amino acids) rather than masses of entire proteins (usually hundreds of amino acids). That’s why: • Proteases, e.g. trypsin, break the protein into short peptides. • A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece. • The mass spectrometer accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones. • The mass spectrometer measures the mass/charge ratio of an ion.
N- and C-terminal Peptides [Figure: a five-residue peptide (residues G, P, F, N, A) cut at each backbone bond into an N-terminal peptide and the complementary C-terminal peptide]
Terminal peptides and ion types [Figure: peptide with residues G, P, F, N, shown with H2O] Mass (D): 57 + 97 + 147 + 114 = 415
Masses of fragment ions • Peptide (residues G, P, F, N), Mass (D): 57 + 97 + 147 + 114 = 415 • Peptide without H2O, Mass (D): 57 + 97 + 147 + 114 – 18 = 397
N- and C-terminal Peptides [Figure: the whole peptide (mass 486) and its terminal peptides, shown as complementary pairs of N-terminal and C-terminal peptides whose masses sum to 486: 71 + 415, 301 + 185, 332 + 154, 429 + 57]
Theoretical Spectrum • Reconstruct peptide from the set of masses of fragment ions (mass-spectrum): 57 71 154 185 301 332 415 429 486 [Figure: the terminal peptide masses 486, 71, 415, 301, 185, 332, 154, 429, 57 drawn as peaks]
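The theoretical spectrum above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: it assumes integer residue masses, assumes the peptide is read as GPFNA (inferred from the masses on the slides; the reversed peptide gives the same set of masses), and uses a hypothetical helper name `theoretical_spectrum`.

```python
# Integer residue masses (Da) for the residues in the running example.
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}


def theoretical_spectrum(peptide):
    """Return the sorted masses of all N-terminal and C-terminal peptides."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    spectrum = set()
    running = 0
    for m in masses:
        running += m
        spectrum.add(running)          # mass of the N-terminal (prefix) peptide
        spectrum.add(total - running)  # mass of the complementary C-terminal peptide
    spectrum.discard(0)                # the empty suffix is not a fragment
    return sorted(spectrum)


# Assumed peptide for illustration only.
print(theoretical_spectrum("GPFNA"))
```

The printed list is the same set of nine masses shown on the slide; the reconstruction problem is the inverse of this function.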
Reconstructing Peptides Reconstruct peptide from the set of masses of fragment ions (mass-spectrum) 57 71 154 185 301 332 415 429 486
Reconstructing Peptides • Reconstruct peptide from the set of masses of fragment ions (mass-spectrum), now with noise peaks: 57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486
Reconstructing Peptides • Reconstruct peptide from the set of masses of fragment ions (mass-spectrum), now with noise peaks and with peaks 154 and 332 missing: 57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 370 387 409 415 423 429 460 472 486
Peptide Fragmentation [Figure: backbone of a four-residue peptide (side chains R1 to R4) with fragmentation sites labeled: N-terminal ions a2, b2, a3, b3 and the neutral-loss variants b2-H2O and b3-NH3; C-terminal ions y1, y2, y3 and the neutral-loss variants y2-NH3 and y3-H2O]
Mass Spectra • The peaks in the mass spectrum: • Prefix and Suffix Fragments. • Fragments with neutral losses (-H2O, -NH3) • Noise and missing peaks. [Figure: spectrum of the peptide G V D L K starting at mass 0, read in both directions; peak spacings of 57 Da = ‘G’ and 99 Da = ‘V’; an H2O offset is marked]
Protein Identification with MS/MS • Peptide Identification [Figure: MS/MS spectrum (intensity vs. mass, starting at mass 0) annotated with the peptide G V D L K]
Tandem Mass Spectrum • Tandem Mass Spectrometry mainly generates N- and C-terminal fragment ions • Chemical noise often complicates the spectrum. • Represented in 2-D: mass/charge axis vs. intensity axis
Breaking Proteins into Peptides [Figure: protein MPSERGTDIMRPAKID...... is broken into peptides MPSER, GTDIMR, PAKID, ……, which are separated by HPLC and sent to MS/MS]
Mass Spectrometry Matrix-Assisted Laser Desorption/Ionization (MALDI) From lectures by Vineet Bafna (UCSD)
Tandem Mass Spectrometry [Figure: LC, ion source, MS-1, collision cell, MS-2; an MS scan (Scan 1707) followed by an MS/MS scan (Scan 1708)]
Protein Identification by Tandem Mass Spectrometry (MS/MS) [Figure: MS/MS instrument producing a spectrum of the peptide SEQUENCE] • database search • Sequest, Mascot, etc. • de novo interpretation • Lutefisk, Peaks, etc.
De Novo vs. Database Search [Figure: the same spectrum interpreted by database search and by de novo sequencing; candidates are compared by mass and score, and both routes arrive at AVGELTK] • Database Search: database of known peptides: MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. • De Novo: database of all peptides = 20^n: AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE, AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAAI, ..., AVGELTI, AVGELTK, AVGELTL, AVGELTM, ..., YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY
De Novo vs. Database Search: A Paradox • The database of all peptides is huge: ≈ 20^n peptides of length n • The database of all known peptides is much smaller: ≈ 10^8 peptides • However, de novo algorithms can be much faster, even though their search space is much larger! • A database search scans all peptides in the database of all known peptides to find the best one. • De novo eliminates the need to scan the database of all peptides by modeling the problem as a graph search.
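To make the size comparison concrete, the following sketch finds the shortest peptide length at which the space of all peptides already exceeds the database of known peptides. The function name `shortest_length_exceeding` is hypothetical, and the figures 20 and 10^8 are the approximate values quoted on the slide.

```python
def shortest_length_exceeding(limit, alphabet_size=20):
    """Smallest n such that alphabet_size**n is larger than limit."""
    n = 1
    while alphabet_size ** n <= limit:
        n += 1
    return n


KNOWN_PEPTIDES = 10 ** 8  # approximate size quoted on the slide

n = shortest_length_exceeding(KNOWN_PEPTIDES)
print(n, 20 ** n)  # 7 1280000000
```

Under these assumptions, the set of all peptides of length 7 alone is already larger than the whole database of known peptides.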
Three Algorithmic Problems • Searching for a million words in a text. Suppose it takes 1 sec to find a word in a text. How much time would it take to find 1 million words in the text? 1 million seconds? • Searching for a word without even looking at 99.999% of the text. Suppose you search for a word in a text. Would it be possible to ignore 99.999% of the text, scan only the remaining part and guarantee that the word you are looking for will be found? • Finding spelling errors in a book written in an unknown language. Given a book (in an unknown language) and a misspelled word (with insertions, deletions, and substitutions of letters) correct spelling errors in the word.
Genomics: Problems Solved. • Searching for a million words in a text. The Aho-Corasick algorithm takes roughly the same time with a million words as it takes with a single word. • Searching for a word without even looking at 99.999% of the text. Filtration algorithms (like FASTA or BLAST) ignore 99.999% of the text. • Finding spelling errors. Sequence alignment algorithms (like Smith-Waterman) do it in quadratic time.
Proteomics: Three Problems • Comparing a million spectra against a database. Suppose it takes 1 sec to interpret a spectrum. How much time would it take to interpret 1 million spectra? • Mass-spectrometry database search without even looking at 99.999% of the database. Suppose you compare a spectrum against a database. Would it be possible to ignore 99.999% of the database, scan only the remaining part and guarantee that you still can identify a peptide of interest? • Blind PTM search and discovery of new PTM types. Given a spectrum of a peptide with unknown PTM types, find this peptide in the database. Discover new PTM types by data mining of large MS/MS datasets.
Three Solutions • Comparing a million spectra against a database. InsPecT (Tanner et al., Anal. Chem, 2005) • MS/MS database search without even looking at 99.999% of the database. PepNovoTag+InsPecT (Tanner et al., Anal. Chem, 2005) • Blind PTM search and discovery of new PTM types. Given a spectrum of a peptide with unknown PTM types, find this peptide in the database. Discover new PTM types by data mining of large MS/MS datasets. MS-Alignment (Tsur et al., Nature Biotech., 2005)
Filtration: Combining De Novo Sequencing and Database Search in Mass-Spectrometry • So far de novo and database search were presented as two separate techniques • Database search is rather slow: many labs generate more than 100,000 spectra per day. SEQUEST takes approximately 1 minute to compare a single spectrum against SWISS-PROT (54Mb) on a desktop. • It will take SEQUEST more than 2 months to analyze the MS/MS data produced in a single day. • Can slow database search be combined with fast de novo analysis?
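The "more than 2 months" estimate follows directly from the two figures on the slide. A quick check, using a hypothetical helper `days_to_search` and the slide's numbers (100,000 spectra per day, about 1 minute per spectrum):

```python
def days_to_search(spectra, minutes_per_spectrum):
    """Wall-clock days needed to search the given number of spectra one by one."""
    minutes_per_day = 60 * 24
    return spectra * minutes_per_spectrum / minutes_per_day


days = days_to_search(100_000, 1)
print(round(days, 1))  # 69.4
```

Roughly 69 days of computation for one day of data, so the backlog grows about 69 times faster than a single desktop can clear it.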
De novo Peptide Sequencing [Figure: MS/MS spectrum interpreted as the peptide SEQUENCE]
Building Spectrum Graph • How to create vertices (from masses) • How to create edges (from mass differences) • How to score vertices • How to score paths • How to find the best path
b-ions (prefix or N-terminal ions) [Figure: b-ion peaks of the peptide SEQUENCE; axis: Mass/Charge (M/Z)]
Shifting Peaks: a-ions = b-ions - CO = b-ions - 28 [Figure: a-ion peaks of the peptide SEQUENCE, each shifted by 28 from the matching b-ion peak; axis: Mass/Charge (M/Z)]
y-ions (suffix or C-terminal ions) [Figure: y-ion peaks of the peptide SEQUENCE, read from the C-terminus (E C N E U Q E S); axis: Mass/Charge (M/Z)]
MS/MS Spectrum [Figure: fragment-ion peaks together with noise peaks; axes: Intensity vs. Mass/Charge (M/Z)]
Some Mass Differences between Peaks Correspond to Amino Acids [Figure: pairs of peaks in the MS/MS spectrum connected by arcs labeled with the letters of SEQUENCE (s, e, q, u, e, n, c, e)]
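The idea that a mass difference between two peaks can spell an amino acid can be sketched as follows. This is an illustration under simplifying assumptions, not the slides' own algorithm: it uses integer masses, a five-residue mass table instead of all 20 amino acids, exact matching with no mass tolerance, and a hypothetical function name `amino_acid_edges`. The example peaks are the prefix masses from the earlier five-residue example.

```python
# Small illustrative mass table (Da); a real table covers all 20 amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}


def amino_acid_edges(peaks, residue_mass=RESIDUE_MASS):
    """Return (low, high, amino_acid) for every peak pair whose difference is a residue mass."""
    by_mass = {mass: aa for aa, mass in residue_mass.items()}
    ordered = sorted(peaks)
    edges = []
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            aa = by_mass.get(high - low)  # exact match only; no tolerance
            if aa is not None:
                edges.append((low, high, aa))
    return edges


print(amino_acid_edges([57, 154, 301, 415, 486]))
# [(57, 154, 'P'), (154, 301, 'F'), (301, 415, 'N'), (415, 486, 'A')]
```

Reading the edge labels along the path gives P, F, N, A, the residues that follow the first peak at mass 57.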
Ion Types • Some masses correspond to fragment ions, others are just random noise • Knowing ion types Δ={δ1, δ2,…, δk} lets us distinguish fragment ions from noise • We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.
Example of Ion Type • Δ={δ1, δ2,…, δk} • Ion types {b, b-NH3, b-H2O, b-CO} correspond to Δ={0, 17, 18, 28} *Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
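As a small illustration of these offsets, the sketch below lists the ion masses that one N-terminal peptide of mass m could produce, reading each δ as a mass subtracted from m. The dictionary `ION_TYPES` and the function `ions_of_prefix` are hypothetical names, and the simplified δ values (with the b-ion offset "hidden" as 0) are the ones from the slide.

```python
# Simplified offsets from the slide: each ion type loses delta from the prefix mass.
ION_TYPES = {"b": 0, "b-NH3": 17, "b-H2O": 18, "b-CO": 28}


def ions_of_prefix(prefix_mass, ion_types=ION_TYPES):
    """Map each ion type to the mass it would show for the given prefix mass."""
    return {name: prefix_mass - delta for name, delta in ion_types.items()}


print(ions_of_prefix(301))
# {'b': 301, 'b-NH3': 284, 'b-H2O': 283, 'b-CO': 273}
```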
Match between Spectra and the Shared Peak Count • The match between two spectra is the number of masses (peaks) they share (Shared Peak Count or SPC) • In practice mass-spectrometrists use the weighted SPC that reflects intensities of the peaks • Match between experimental and theoretical spectra is defined similarly
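A minimal sketch of the unweighted Shared Peak Count, assuming integer masses and exact matching (real spectra need a mass tolerance, and the weighted variant mentioned above would also use intensities). The function name `shared_peak_count` is hypothetical; the two spectra are the theoretical spectrum and the noisy spectrum with missing peaks from the earlier slides.

```python
def shared_peak_count(spectrum_a, spectrum_b):
    """Number of masses present in both spectra (exact integer match)."""
    return len(set(spectrum_a) & set(spectrum_b))


theoretical = [57, 71, 154, 185, 301, 332, 415, 429, 486]
experimental = [57, 71, 81, 100, 112, 131, 160, 172, 177, 185, 201, 221, 235,
                301, 312, 325, 370, 387, 409, 415, 423, 429, 460, 472, 486]

print(shared_peak_count(theoretical, experimental))  # 7
```

Seven of the nine theoretical masses are found; the two that are not (154 and 332) are exactly the peaks missing from the experimental spectrum.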
Peptide Sequencing Problem Goal: Find a peptide with maximal match between an experimental and theoretical spectrum. Input: • S: experimental spectrum • Δ: set of possible ion types Output: • A peptide whose theoretical spectrum matches the experimental spectrum the best
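Shifting Peaks: a-ions = b-ions - CO = b-ions - 28 [Figure: a-ion peaks of the peptide SEQUENCE, each shifted by 28 from the matching b-ion peak; axis: Mass/Charge (M/Z)]
Reverse Shifts [Figure: spectrum peaks shifted back by the mass of H2O and by the mass of H2O+NH3]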
Vertices of the Spectrum Graph • Vertices are masses of potential N-terminal peptides • Vertices are generated by reverse shifts corresponding to ion types Δ={δ1, δ2,…, δk} • Every N-terminal peptide of mass m can generate up to k ions m-δ1, m-δ2, …, m-δk • Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides • Vertices of the spectrum graph: {initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
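The vertex construction can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: masses are integers, the initial vertex is mass 0, the terminal vertex is the parent (whole-peptide) mass, candidate vertices outside that range are dropped (a design choice of this sketch), and `spectrum_graph_vertices` is a hypothetical name.

```python
DELTAS = (0, 17, 18, 28)  # simplified offsets for b, b-NH3, b-H2O, b-CO


def spectrum_graph_vertices(spectrum, parent_mass, deltas=DELTAS):
    """Union of the initial vertex, V(s) for every peak s, and the terminal vertex."""
    vertices = {0, parent_mass}          # initial and terminal vertices
    for s in spectrum:
        for delta in deltas:
            candidate = s + delta        # reverse shift: undo the loss of delta
            if 0 < candidate < parent_mass:
                vertices.add(candidate)
    return sorted(vertices)


# Two peaks that could be the b-CO (273) and b-H2O (283) ions of a prefix of mass 301.
print(spectrum_graph_vertices([273, 283], parent_mass=486))
# [0, 273, 283, 290, 291, 300, 301, 311, 486]
```

Both peaks vote for vertex 301 through different reverse shifts, which is the kind of agreement that vertex scoring can reward.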