De Novo Sequencing and Homology Searching with De Novo Sequence Tags
[Figure: Possible Ways to Interpret MS/MS Data. MS/MS spectra are turned into peptides either by database search against a protein DB or by de novo sequencing; de novo peptides can then be matched to homologous peptides by homology search against an inexact protein DB.]
Why Bother? • De novo sequencing derives the sequence without looking into a database. • De novo sequencing is useful for: unsequenced genomes (no protein database); novel peptides (unmatched spectra after database search); single amino acid polymorphisms; unexpected PTMs; database errors; validating a database match.
Outline • Basics • Manual De Novo Sequencing • De novo Sequencing Algorithm (PEAKS) • Homology Search with De Novo Tags
Sequence-specific fragment ions • Cleavage of a protonated peptide bond in the (M+2H)+2 precursor gives a b-ion, [N-term]-NH-CHR-CO+, and a y-ion, NH2-[C-term] H+. • Loss of CO from a b-ion gives an a-ion, [N-term]-NH=CHR+. • a-, b-, and y-ions may each also appear with a loss of NH3 or H2O.
Why does everyone analyze positively-charged tryptic peptides? • Positively-charged peptide ions usually give better sensitivity. • "Mobile protons" protonate peptide bonds and promote b/y fragmentation. • Arg sequesters protons in the gas phase. • Tryptic peptides typically have 0 or 1 Arg. • Tryptic peptide ions typically have two protons. • Therefore, tryptic peptides usually have at least one mobile proton and produce b/y ions. • Placing Arg at the C-terminus makes it more likely that a complete series of y-ions will be observed.
[Figure: MS/MS spectrum of a doubly-charged tryptic peptide, YLYEIAR (one Arg and two protons). Labeled peaks: y1 through y6, b2, a2, and Y.]
[Figure: MS/MS spectrum of a doubly-charged non-tryptic peptide, YSRRHPE (two Arg's and two protons); relative abundance (%) vs. m/z. Labeled peaks include (M+2H)+2, y1, y2, y3, y4, y4-17, y6-17, b4, b5-17, a5-17, (b5+18)+2, (b6+18)+2, Y, H, and P.]
[Figure: CID in traps vs. quadrupoles. Ion trap and Qtof MS/MS spectra of IPIGFAGAQGGFDTR, relative abundance vs. m/z, with the b- and y-ions labeled in each spectrum.]
Annoying things to remember when sequencing peptides by MS/MS • Leucine and isoleucine have the same mass. • Glutamine and lysine differ by 0.036 u. • Phenylalanine and oxidized methionine differ by 0.033 u. • Cleavages do not occur at every bond (more often than not, there is no cleavage between the first and second residues). • Certain amino acids have the same (=) or nearly the same (~) mass as pairs of other amino acids: G + G = N, A + G = Q, G + V ~ R, A + D ~ W, S + V ~ W. • However, good mass accuracy resolves many of these ambiguities.
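These coincidences are easy to check numerically. The sketch below is only an illustration: the table holds standard monoisotopic residue masses typed in to five decimals, and the names `MONO`, `residue_mass`, and `mass_gap` are made up for this example rather than taken from the slides.

```python
# Monoisotopic residue masses in Da (rounded to five decimals).
# Table and helper names are illustrative assumptions.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
OXYGEN = 15.99491               # mass added when methionine is oxidized
MET_OX = MONO["M"] + OXYGEN     # residue mass of oxidized methionine


def residue_mass(residues: str) -> float:
    """Sum of the residue masses of a string of one-letter codes."""
    return sum(MONO[r] for r in residues)


def mass_gap(left: str, right: str) -> float:
    """Mass of `left` minus mass of `right`, in Da."""
    return residue_mass(left) - residue_mass(right)


CONFUSABLE = [("L", "I"), ("K", "Q"), ("GG", "N"), ("AG", "Q"),
              ("GV", "R"), ("AD", "W"), ("SV", "W")]

for left, right in CONFUSABLE:
    print(f"{left:>2} vs {right}: {mass_gap(left, right):+.4f} Da")
print(f" F vs M(ox): {MONO['F'] - MET_OX:+.4f} Da")
```

Pairs whose gap is essentially zero (L/I, GG/N, AG/Q) cannot be separated by mass alone, while the others need a mass accuracy of a few hundredths of a dalton.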
Outline • Basics • Manual De Novo Sequencing • De novo Sequencing Algorithm (PEAKS) • Homology Search with De Novo Tags
Two approaches to manually sequencing peptides from MS/MS spectra • Finding a series of ions in the middle of the peptide, and working out towards the termini (illustrated using ion trap data) • Finding the C-terminus and working towards the N-terminus (illustrated using qtof data)
Sequencing from the middle: look for ion series in the region above the precursor ion (m/z 615)
An obvious series is the one that involves the more abundant fragment ions (m/z 575, 688, 775, 888, and 987); the mass differences between consecutive ions read L, S, L, V.
Another ion series contains pairs of ions separated by 18 Da (water losses); its mass differences read L, V, E, S.
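Reading a partial sequence off an ion series is just a lookup of the gaps between consecutive peaks. A minimal sketch, assuming nominal (integer) residue masses and a hypothetical helper named `read_ladder`; 113 is reported as L although it could equally be I, and 128 is reported as K/Q:

```python
# Nominal residue masses mapped to one-letter codes (illustrative table).
NOMINAL_RESIDUES = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L", 114: "N", 115: "D", 128: "K/Q", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def read_ladder(peaks, tolerance=0.4):
    """Translate gaps between consecutive peaks into residue labels.

    A gap that matches no residue is reported as its mass in brackets.
    """
    tag = []
    for low, high in zip(peaks, peaks[1:]):
        gap = high - low
        matches = [name for mass, name in NOMINAL_RESIDUES.items()
                   if abs(gap - mass) <= tolerance]
        tag.append("/".join(matches) if matches else f"[{gap:g}]")
    return tag


print(read_ladder([575, 688, 775, 888, 987]))
```

The gaps here are 113, 87, 113, and 99 Da. The direction in which the tag should be read depends on whether the peaks are b- or y-ions, which the lookup itself cannot tell.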
Two ion series have been identified in the region above the precursor ion • Problem: Two ion series defining partial sequences LSLV and LVES have been identified, but it is not known if these are y- or b-ions (i.e., the sequence direction is unknown). • Solution: Since ion trap data often exhibit high mass b-ions, check to see if the highest mass ion in either series corresponds to a loss of either Arg or Lys (the usual tryptic C-terminus). If not, check to see if the mass difference corresponds to a dipeptide containing Lys or Arg (it is possible that the b-ion defining the C-terminus is absent). • Calculation: Peptide MW - 17 - fragment ion = C-terminal residue mass.
For the first ion series (LSLV): 1228 - 17 - 987 = 224 Da, which is not the residue mass of Lys (128) or Arg (156). As a dipeptide check, 224 - 128 = 96 and 224 - 156 = 68, and neither remainder is an amino acid residue mass. Therefore this does not look like a b-ion series.
For the second ion series (LVES): 1228 - 17 - 1083 = 128 (the residue mass of Lys); this looks like a b-ion series, giving …LVESK, and maybe the other one is a y-ion series.
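The same check can be written as a small calculation. This is a sketch with nominal masses; the function names are hypothetical, and the peptide MW of 1228 is taken as 2 x 615 - 2 for the doubly-charged precursor at m/z 615.

```python
# Nominal residue masses of the usual tryptic C-terminal residues.
C_TERMINAL = {"K": 128, "R": 156}


def c_terminal_gap(peptide_mw, highest_ion):
    """Peptide MW - 17 - highest-mass ion of the series.

    If the series is a b-series, this is the mass of whatever lies
    between the last observed b-ion and the C-terminus.
    """
    return peptide_mw - 17 - highest_ion


def explain_gap(gap):
    """Return 'K' or 'R' if the gap equals that residue mass, else None."""
    for name, mass in C_TERMINAL.items():
        if gap == mass:
            return name
    return None


peptide_mw = 2 * 615 - 2  # (M+2H)+2 precursor at m/z 615
for top_ion in (987, 1083):
    gap = c_terminal_gap(peptide_mw, top_ion)
    print(top_ion, gap, explain_gap(gap))
```

Only the series ending at 1083 leaves a gap that equals Lys, which is why it is taken to be the b-series.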
The high mass b-series predicts the presence of some low mass y-ions; are they there? • b-series: …LVESK (y-ion m/z = sum of residue masses plus 19 Da) • y1: 147, No • y2: 234, Yes • y3: 363, Yes • y4: 462, Yes • y5: 575, Yes!!
The high mass y-series predicts the presence of some low mass b-ions; are they there? • y-series: [242]VLSL… • b2: 243, Yes • b3: 342, Yes • b4: 455, Yes • b5: 542, Yes • b6: 655, Yes!!
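Both predictions follow from running sums of residue masses. A minimal sketch with nominal masses (the helper names are illustrative): singly-charged y-ions are the summed C-terminal residues plus 19 Da, and singly-charged b-ions are the summed N-terminal residues plus 1 Da.

```python
# Nominal residue masses (illustrative table; L also stands for I).
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def _mass(item):
    """Accept either a one-letter code or a numeric mass block such as 242."""
    return NOMINAL[item] if isinstance(item, str) else item


def y_ions(c_terminal_tag):
    """Singly-charged y-ion m/z values, starting with y1 at the C-terminus."""
    total, ions = 19, []
    for item in reversed(list(c_terminal_tag)):
        total += _mass(item)
        ions.append(total)
    return ions


def b_ions(n_terminal_items):
    """Singly-charged b-ion m/z values, one per item, from the N-terminus."""
    total, ions = 1, []
    for item in n_terminal_items:
        total += _mass(item)
        ions.append(total)
    return ions


print(y_ions("LVESK"))                       # y1 .. y5
print(b_ions([242, "V", "L", "S", "L"]))     # b2 .. b6 (242 covers two residues)
```

Note that y5 of …LVESK lands on m/z 575, the lowest ion of the other series, so the two series join into one sequence.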
Can I account for most of the remaining ions as neutral losses or internal fragments? • Sequence: [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E. [Figure: the annotated spectrum, with b-ions, y-ions, and neutral losses marked.]
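The candidate compositions for an unresolved mass block can be listed exhaustively. A small sketch, assuming nominal masses and residue pairs only (the function name is hypothetical, and isoleucine is left out because it is indistinguishable from leucine):

```python
from itertools import combinations_with_replacement

# Nominal residue masses; I is omitted because it equals L.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def pairs_with_mass(target):
    """Unordered residue pairs whose nominal masses sum to `target`."""
    return sorted(
        f"{a}+{b}"
        for a, b in combinations_with_replacement(sorted(NOMINAL), 2)
        if NOMINAL[a] + NOMINAL[b] == target
    )


print(pairs_with_mass(242))
```

For 242 Da this returns the same three compositions as above; the order of the two residues within each pair remains unknown.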
Outline • Basics • Manual De Novo Sequencing • De novo Sequencing Algorithm (PEAKS) • Homology Search with De Novo Tags
Algorithm Design • The first thing for algorithm design is to define the property of the solution. • For the de novo sequencing problem, one wants to compute a peptide that “best matches” the given spectrum. • This “best match” is practically defined by a scoring function.
Peptide-Spectrum Match Score • A fragment score can be computed for every pair of adjacent amino acids, i.e., for every site that splits the peptide into a prefix and a suffix. This score depends on the presence of the corresponding b and y ions. • The peptide score is the sum of the fragment scores.
The Fragment Score for a Mass • The fragment score calculation only requires the prefix mass, not the prefix sequence. • Note: suffix mass = total residue mass - prefix mass. • Thus it is possible to define a score $f(m)$ for each prefix mass value $m$.
How to Define $f(m)$ Statistically • Learn two probabilities from large training data: • $p$: Prob(a peak is observed at a y-ion m/z). • $q$: Prob(a peak is observed at a random m/z). • Usually $p > q$. • If an expected y-ion is observed, $\log(p/q)$ is added to $f(m)$. • $\log(p/q)$ is called the log-likelihood ratio. • If an expected y-ion is missing, $\log\frac{1-p}{1-q}$ is added to $f(m)$. • Thus, a matching ion is rewarded and a missing ion is penalized. • Other fragment ion types can be considered similarly.
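The reward and penalty can be sketched in a few lines. Here `p` stands for the probability of seeing a peak at a true y-ion m/z and `q` for the probability of seeing one at a random m/z; the numeric values below are hypothetical placeholders, not trained estimates, and `ion_score` is an illustrative name.

```python
import math


def ion_score(observed: bool, p: float, q: float) -> float:
    """Log-likelihood-ratio contribution of one expected ion.

    observed=True  -> log(p / q), positive when p > q (reward)
    observed=False -> log((1 - p) / (1 - q)), negative when p > q (penalty)
    """
    if observed:
        return math.log(p / q)
    return math.log((1 - p) / (1 - q))


# Hypothetical probabilities, chosen only to show the signs.
P_Y_ION, Q_RANDOM = 0.7, 0.05

print(ion_score(True, P_Y_ION, Q_RANDOM))   # reward for a matched y-ion
print(ion_score(False, P_Y_ION, Q_RANDOM))  # penalty for a missing y-ion
```

A fragment score for one prefix mass would then be the sum of such terms over the ion types being considered.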
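De Novo Sequencing • For a sequence $P = a_1 a_2 \cdots a_n$ with prefix masses $m_1, m_2, \ldots, m_n$, the peptide score is defined as $\mathrm{score}(P) = \sum_{i=1}^{n} f(m_i)$. • De novo sequencing: given a scoring function $f$ and a mass $M$, compute a sequence $P$ with total residue mass $M$ that maximizes $\mathrm{score}(P)$.
Algorithm Idea • $\mathrm{BestScore}(m)$: the maximum score that can be achieved by a prefix with mass $m$. • If $Pa$ (a prefix $P$ followed by a residue $a$) is the best sequence for mass $m$, then $P$ must be the best sequence for mass $m - m(a)$, where $m(a)$ is the residue mass of $a$. • Thus, $\mathrm{BestScore}(m) = f(m) + \max_{a} \mathrm{BestScore}(m - m(a))$. • To compute $\mathrm{BestScore}(m)$, try the 20 residues and use the one that maximizes the above formula.
Dynamic Programming • The algorithm initializes $\mathrm{BestScore}(0) = 0$ and all other cells to $-\infty$. • It then computes $\mathrm{BestScore}(m)$ for $m$ from 1 to $M$ by $\mathrm{BestScore}(m) = f(m) + \max_{a} \mathrm{BestScore}(m - m(a))$. • The best sequence can be retrieved by backtracking: starting from $m = M$, repeatedly compute the last amino acid $a^{*} = \arg\max_{a} \mathrm{BestScore}(m - m(a))$ and set $m \leftarrow m - m(a^{*})$ until $m = 0$. [Figure: the BestScore array, indexed by mass 0, 1, 2, 3, …]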
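The dynamic program can be sketched on integer masses. This is a toy illustration under stated assumptions, not the PEAKS implementation: masses are nominal, K stands for K/Q and L for L/I, the fragment score is passed in as a function, and the example score simply rewards two chosen prefix masses.

```python
import math

# Nominal residue masses (K also stands for Q, L also stands for I).
RESIDUES = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def de_novo(total_mass, fragment_score):
    """Return (sequence, score) maximizing the summed fragment scores,
    or None if no residue combination adds up to total_mass."""
    best = [-math.inf] * (total_mass + 1)   # best[m]: best score for prefix mass m
    last = [None] * (total_mass + 1)        # last[m]: final residue of that prefix
    best[0] = 0.0
    for m in range(1, total_mass + 1):
        for residue, mass in RESIDUES.items():
            if mass <= m and best[m - mass] > -math.inf:
                candidate = best[m - mass] + fragment_score(m)
                if candidate > best[m]:
                    best[m], last[m] = candidate, residue
    if best[total_mass] == -math.inf:
        return None
    # Backtrack from the full mass, peeling off the last residue each time.
    sequence, m = [], total_mass
    while m > 0:
        sequence.append(last[m])
        m -= RESIDUES[last[m]]
    return "".join(reversed(sequence)), best[total_mass]


def toy_score(m):
    """Hypothetical fragment score: reward prefix masses 57 and 128."""
    return 1.0 if m in (57, 128) else -0.5


print(de_novo(215, toy_score))
```

The table has one cell per unit of mass and each cell tries every residue, so the running time is proportional to the total mass times the alphabet size. Allowing a variable PTM in this sketch would amount to adding one more entry to `RESIDUES`.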
A Note on PTM • Variable PTMs do not cause a major slowdown for de novo sequencing algorithms. • Instead of trying only the 20 regular amino acids in the maximization, the algorithm simply tries all modified amino acids too. • The time complexity is increased by a constant factor (compare with the exponential growth in the database search approach). • However, since the solution space is larger when many variable PTMs are allowed, the accuracy of the algorithm is reduced.
Accounting for Other Ion Types • When internal cleavage ions are considered in the scoring function, it becomes difficult to design an efficient algorithm to find the optimal sequence. • A compromise between efficiency and accuracy is to employ a two-stage approach. • First, compute many (e.g., 10,000) candidate sequences using an efficient scoring function that uses only a few of the most important ions. • Then, evaluate these candidates using a more sophisticated scoring function that considers additional ions. • This two-round approach is a tradeoff between algorithm speed and accuracy.
Mass Segment Error • Most errors are due to incomplete ion ladders in the spectrum. • Thus, a segment of amino acids cannot be determined. • However, the total mass of the segment is fixed. • E.g., [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E. • The first two or three residues often have low confidence, because of a lack of fragment ions. • Most de novo sequencing software uses the precursor mass as a constraint (thus the mass of the derived sequence is usually correct).
Outline • Basics • Manual De Novo Sequencing • De novo sequencing Algorithm • Homology Search with De Novo Tags
Why Homology Search with De Novo Sequence • Advantages: • The database may not contain the exact peptide sequence, but a homologous one is there. • De novo sequencing plus homology search makes it possible to use the database of one organism to study a similar organism. • Disadvantages: • De novo sequencing can only provide partially correct sequence tags. • Conventional homology search may fail due to de novo sequencing errors.
Traditional Sequence Alignment • Two peptide sequences are aligned by inserting spaces at appropriate positions. E.g.

    FVEVTKL-TDLTK
    | || || |||||
    FAEV-KLVTDLTK

• Each column (a pair of residues, or a residue and a gap '-') has a similarity score that can be looked up in a pre-defined amino acid substitution matrix, such as BLOSUM or PAM. • The alignment score is equal to the sum of the column-wise scores. • There are algorithms to compute the optimal alignment that maximizes the alignment score.
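One such algorithm is the classic global alignment dynamic program. The sketch below is a simplification: instead of a BLOSUM or PAM matrix it assumes unit scores (+1 for a match, -1 for a mismatch, -1 per gap), and the function name is illustrative.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of x and y under simple unit scores.

    Keeps only the previous row of the DP table, so memory is O(len(y)).
    """
    previous = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        current = [i * gap]
        for j in range(1, len(y) + 1):
            column = match if x[i - 1] == y[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + column,   # align x[i-1] with y[j-1]
                previous[j] + gap,          # x[i-1] against a gap
                current[j - 1] + gap,       # y[j-1] against a gap
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("FVEVTKLTDLTK", "FAEVKLVTDLTK"))
```

Under these unit scores the alignment shown above (ten matches, one mismatch, two gaps) scores 10 - 1 - 2 = 7. With a real substitution matrix, only the per-column score lookup would change.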
Limitations of Conventional Homology Search • Conventional search ignores the possible errors in de novo sequencing. • Suppose the true sequence is SLCAFK, the de novo sequence is LSCFAK, and the homolog is SLAAFK.

    (denovo)  X: LSCFAK
                      |
    (homolog) Z: SLAAFK

    (denovo)  X: [LS]C[FA]K
    (real)    Y: [SL]C[AF]K
                  ||   || |
    (homolog) Z: [SL]A[AF]K

• Conventional search, which uses only evolutionary similarities to explain the mismatches, results in a poor match. • If de novo sequencing errors are considered, the match becomes more significant.
A Simple Approach • We can enumerate all possible combinations of a mass segment, and search all of them together. • MS BLAST will do this. • Difficulties: we do not know which portion of the sequence is in error, and the number of possibilities grows exponentially. • Example: [LS]C[FA]K expands to LSCFAK, SLCFAK, TVCFAK, VTCFAK, LSCAFK, SLCAFK, TVCAFK, and VTCAFK.
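The enumeration can be sketched as follows. This is an illustration under stated assumptions, not how MS BLAST is implemented: each uncertain segment is a dipeptide, replacements are dipeptides within 0.01 Da of it, isoleucine is omitted because it equals leucine, and the names `same_mass_dipeptides` and `expand` are hypothetical.

```python
from itertools import product

# Monoisotopic residue masses in Da (rounded); I is omitted (same as L).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def same_mass_dipeptides(segment, tolerance=0.01):
    """All ordered residue pairs whose mass matches the given dipeptide."""
    target = sum(MONO[r] for r in segment)
    return [a + b for a, b in product(MONO, repeat=2)
            if abs(MONO[a] + MONO[b] - target) <= tolerance]


def expand(parts):
    """parts: list of (text, uncertain) tuples; returns all candidate sequences."""
    options = [same_mass_dipeptides(text) if uncertain else [text]
               for text, uncertain in parts]
    return ["".join(choice) for choice in product(*options)]


candidates = expand([("LS", True), ("C", False), ("FA", True), ("K", False)])
print(len(candidates), sorted(candidates))
```

Every additional uncertain segment multiplies the number of candidates, which is the exponential growth noted above.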
SPIDER Model

    (denovo)  X: [LS]C[FA]K
    (real)    Y: [SL]C[AF]K
                  ||   || |
    (homolog) Z: [SL]A[AF]K

• Given a de novo sequence X and a database sequence Z, try to reconstruct the real sequence Y. • The difference between X and Y is explained by de novo sequencing errors. • The difference between Y and Z is explained by homology mutations. • The real Y should minimize the de novo errors and the homology mutations needed in the above explanation.
Two exercises

    (denovo)  X: LSCFAV
    (real)    Y: SLCFAV
    (homolog) Z: SLCF-V

• The swap of L and S is more likely a de novo error than a mutation. • The deletion of A is unlikely to be a de novo error (a de novo error does not change the peptide mass).

    (denovo)  X: LSCFV
    (real)    Y: EACFV
    (homolog) Z: DACFV

• Here a mutation and a de novo error overlap: m(LS) = m(EA) = 200.1 Da, so LS can be a de novo error for EA, while the difference between E and D is a mutation scored with a substitution matrix such as BLOSUM62. • This is hard to interpret manually; an algorithm is needed.
Conclusion • When the target peptides are not in a database, use de novo sequencing. • When homologous peptides are in a database, homology search with the de novo tags can find them. • Some de novo errors can be corrected by combining the homolog information.