1.6k likes | 2.21k Views
Protein Sequencing and Identification by Mass Spectrometry. Outline. Introduction to Protein Structure Peptide Mass Spectra De Novo Peptide Sequencing Spectrum Graphs Protein Identification via Database Search Spectral Convolution Spectral Alignment Problem.
E N D
Outline • Introduction to Protein Structure • Peptide Mass Spectra • De Novo Peptide Sequencing • Spectrum Graphs • Protein Identification via Database Search • Spectral Convolution • Spectral Alignment Problem TK modified for WMU CS 6030
Why are proteins and their sequences important? • Necessary in order to understand how cells and their biochemical pathways function. • A unique set of proteins is involved in each function • Each organism has a unique set of expressed proteins • Each type of cell / tissue has a unique set of expressed proteins • A key to understanding how the brain or liver works is to determine what proteins they produce TK Added for WMU CS 6030
Basic Summary of Protein Structure • A polypeptide is a polymer of amino acid residues, a linear chain based on amino acids • A polypeptide can be completely specified by indicating the sequence of amino acids • A protein is a collection of one more peptides bonded together
Formation of polypeptides • Two amino acids can combine to form a dipeptide • Structure in blue is a peptide link • If many amino acids are joined, we call it a polypeptide
Ends of the polypeptide chain • Note that a polypeptide is a chain of amino acid residues (water molecule is lost). • End of chain with –NH2 is the N-terminal • End of chain with –COOHis the C-terminal
De Novo Sequencing Strategy • Break proteins into peptide chain fragments • Use enzymes to cut into pieces • Further break apart peptide chains using mass spectrometry • Peptide chains will break up in a manner predictable by molecular physics • The masses of the molecular fragments will collectively deliver a “mass spectrum” from which the sequence can be derived • First, we’ll explore the ways that mass spectrometry can break peptides apart TK added for WMU CS 6030
Protein Backbone H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 C-terminus N-terminus AA residuei-1 AA residuei+1 AA residuei
Peptide Fragmentation Collision Induced Dissociation H+ H...-HN-CH-CO . . .NH-CH-CO-NH-CH-CO-…OH Ri-1 Ri Ri+1 Prefix Fragment Suffix Fragment • Peptides tend to fragment along the backbone. • Fragments can also loose neutral chemical groups like NH3 and H2O.
Breaking Protein into Peptides and Peptides into Fragment Ions • Proteases, e.g. trypsin, break protein into peptides. • A Tandem Mass Spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece. • Mass Spectrometer accelerates the fragmented ions; heavier ions accelerate slower than lighter ones. • Mass Spectrometer measure mass/chargeratio of an ion.
N- and C-terminal Peptides P A G N F A P G N F A N P G F C-terminal peptides N-terminal peptides A N F P G P A N F G
Terminal peptides and ion types P G N F Peptide H2O Mass (D) 57 + 97 + 147 + 114 = 415 P G N F Peptide without H2O Mass (D) 57 + 97 + 147 + 114 – 18 = 397
N- and C-terminal Peptides 486 P A G N F A 71 P G N F 415 301 A N P G F 185 C-terminal peptides N-terminal peptides A N F P G 332 154 P A N F G 429 57
N- and C-terminal Peptides 486 71 415 301 185 C-terminal peptides N-terminal peptides 332 154 429 57
Peptide Fragmentation b2-H2O b3- NH3 a2 b2 a3 b3 HO NH3+ | | R1 O R2 O R3 O R4 | || | || | || | H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH | | | | | | | H H H H H H H y3 y2 y1 y2 - NH3 y3 -H2O
G V D L K L 57 Da = ‘G’ K D V G 99 Da = ‘V’ H2O D Mass Spectra • The peaks in the mass spectrum: • Prefix • Fragments with neutral losses (-H2O, -NH3) • Noise and missing peaks. mass 0 and Suffix Fragments.
G V D L K • Peptide Identification: Intensity MS/MS mass 0 mass 0 Protein Identification with MS/MS
Breaking Proteins into Peptides HPLC GTDIMR To MS/MS PAKID MPSERGTDIMRPAKID...... MPSER …… …… protein peptides
collision cell MS-2 MS-1 Ion Source Tandem Mass Spectrometry MS LC Scan 1707 MS/MS Scan 1708
Tandem Mass Spectrum • Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal peptides • Spectrum consists of different ion types because peptides can be broken in several places. • Chemical noise often complicates the spectrum. • Represented in 2-D: mass/charge axis vs. intensity axis
W R V A L Database ofknown peptidesMDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. Database ofknown peptidesMDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN.. T G E P L K C W D T Database of all peptides = 20nAAAAAAAA,AAAAAAAC,AAAAAAAD,AAAAAAAE,AAAAAAAG,AAAAAAAF,AAAAAAAH,AAAAAAI, AVGELTI, AVGELTK , AVGELTL, AVGELTM, YYYYYYYS,YYYYYYYT,YYYYYYYV,YYYYYYYY W R V A L T G E P L K C W D T De Novo vs. Database Search Database Search De Novo Mass, Score AVGELTK
De Novo vs. Database Search: A Paradox • The database of all peptides is huge ≈ O(20n) . • The database of all known peptides is much smaller ≈ O(108). • However, de novo algorithms can be much faster, even though their search space is much larger! • A database search scans all peptides in the database of all known peptides search space to find best one. • De novo eliminates the need to scan database of all peptides by modeling the problem as a graph search.
De novo Peptide Sequencing Sequence
Two approaches to determining sequence from the mass spectra • Exhaustive Search • Involves analysis of 20L sequences of length L • Branch and bound advisable, but success has been limited • Analysis of Spectrum Graph • Based on experimental spectrum rather than all possible matches to possible spectra TK added slide for WMU CS 6030
Building Spectrum Graph • How to create vertices (from masses) • How to create edges (from mass differences) • How to score paths • How to find best path
Ion Types • Some masses correspond to fragment ions, others are just random noise • Knowing ion typesΔ={δ1, δ2,…, δk} lets us distinguish fragment ions from noise • We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.
Example of Ion Type • Δ={δ1, δ2,…, δk} • Ion types {b, b-NH3, b-H2O} correspond to Δ={0, 17, 18} • This is a set of masses of chemical groups frequently cleaved from peptide fragments during bombardment of molecules with electrons in the mass spectrometer collision cell *Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity TK modified slide for WMU CS 6030
Theoretical Spectra • The theoretical spectra of peptide P is designated as T(P) • T(P) can be calculated by subtracting all possible ion types Δ={δ1, δ2,…, δk} from all possible partial peptides of P • Every partial peptide will have k masses in T(P) TK added slide for WMU CS 6030
Match between Spectra and the Shared Peak Count • The match between two spectra is the number of masses (peaks) they share (Shared Peak Count or SPC) • In practice mass-spectrometrists use the weighted SPC that reflects intensities of the peaks • Match between experimental (S) and theoretical spectra (T) is defined similarly
Peptide Sequencing Problem Goal: Find a peptide with maximal match between an experimental and theoretical spectrum. Input: • S: experimental spectrum • Δ: set of possible ion types • m: parent mass Output: • P: peptide with mass m, whose theoretical spectrum T matches the experimental S spectrum the best
Vertices of Spectrum Graph • Masses of potential N-terminal peptides • Vertices are generated by reverse shifts corresponding to ion types Δ={δ1, δ2,…, δk} • Every N-terminal peptide can generate up to k ions m-δ1, m-δ2, …, m-δk • Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides • Vertices of the spectrum graph: {initial vertex}V(s1) V(s2) ... V(sm) {terminal vertex}
Edges of Spectrum Graph • Two vertices with mass difference corresponding to an amino acid A: • Connect with an edge labeled by A
Peptide Sequences • A possible peptide sequence is identified by finding a path from mass 0 to parent mass m in the resulting DAG • Many possible sequences (paths) will typically be identified in the DAG • The “best path” is identified by scoring using probabilities that are based on experimental data TK added slide for WMU CS 6030
Path Score • p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq} • p(P, s) = the probability that peptide P generates a peak (mass) s • Scoring = computing probabilities • p(P,S) = πsєSp(P, s)
Peak Score • For a position t that represents ion type dj : qj, if peak is generated at t p(P,st) = 1-qj , otherwise
Peak Score (cont’d) • For a position t that is not associated with an ion type: qR , if peak is generated at t pR(P,st) = 1-qR , otherwise • qR = the probability of a noisy peak that does not correspond to any ion type
Finding Optimal Paths in the Spectrum Graph • For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P: • Peptides = paths in the spectrum graph • P’ = the optimal path in the spectrum graph
Ions and Probabilities • Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk}and their probabilities {q1,...,qk} • δi-ions of a partial peptide are produced independentlywith probabilities qi
Ions and Probabilities • A peptide has all k peaks with probability • and no peaks with probability • A peptide also produces a ``random noise'' with uniform probability qR in any position.
Ratio Test Scoring for Partial Peptides • Incorporates premiums for observed ions and penalties for missing ions. • Example: for k=4, assume that for a partial peptide P’ we only see ions δ1,δ2,δ4. The score is calculated as: