Efficient and accurate algorithms for peptide mass spectrometry. Dissertation presentation Stephen Tanner May 30, 2007 Lab page: http://peptide.ucsd.edu. Overview. Introduction: What is mass spectrometry? How does it fit into the broader context of biology and bioinformatics? (Chapter 1)
Efficient and accurate algorithms for peptide mass spectrometry Dissertation presentation Stephen Tanner May 30, 2007 Lab page: http://peptide.ucsd.edu
Overview • Introduction: What is mass spectrometry? How does it fit into the broader context of biology and bioinformatics? (Chapter 1) • Spectrum annotation (Chapters 2, 3, 4) • Discovering post-translational modifications (Chapters 5, 6) • Genome annotation (Chapter 7) • Gene set analysis of microarrays (Chapter 8)
From genomics to proteomics DNA → (transcription) → mRNA → (translation) → protein
Key technologies • Genomics: Capillary sequencers are a central technology for studying DNA. • Transcriptomics: Microarrays are a central technology for studying RNA. • Proteomics: Mass spectrometry is a central technology for studying the proteome.
2002 Chemistry Nobel Prize • Given for MS and NMR applied to proteins • The citation highlights several current and potential applications “…Some five years ago, mass spectrometry definitively crossed the border to biochemistry. The general ways that it provides structural determination, identification and trace level analysis have many applications in the biochemical field. It has become an attractive alternative to Edman sequencing, earlier dominant, and has an unsurpassed ability to identify posttranscriptional modifications and non-covalent interactions in for example antigen-antibody binding studies for identifying ligands to orphan receptors….”
Peptide Mass Spectrometry A protein sample is digested (typically with trypsin) to generate peptides. The peptides are then separated by liquid chromatography.
Mass spectrometry The mass spectrometer separates the eluting peptides by mass-to-charge ratio (m/z), and records a mass spectrum. (Figure: mass spectrum, intensity versus m/z.)
Above: Diagram of a mass spectrometer (courtesy of ChemGuide.com). Molecules are accelerated by a series of charged plates, their time of flight determined by their mass-to-charge ratio.
Left: An LTQ mass spectrometer (image from University of Vermont) Right: A high-end Fourier Transform mass spectrometer (image from Pacific Northwest National Labs)
Tandem MS (Figure labels: ionized parent peptide, secondary fragmentation.)
Peptide fragmentation • Peptides are fragmented, typically through collision with inert atoms. • Peptides break at peptide bonds, generating an N-terminal b ion and a C-terminal y ion. • Spectrum: One peak for each fragment type. (Diagram: the backbone H...-HN-CH(Rn-1)-CO-NH-CH(Rn+1)-CO-…OH breaks at the peptide bond into a b ion, H...-HN-CH(Rn-1)-CO, which includes the N-terminus, and a y ion, H3N-CH(Rn+1)-CO-…-OH, which includes the C-terminus.)
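The b and y fragment masses described on this slide can be computed directly from residue masses. The sketch below is only an illustration, not InsPecT code: it assumes singly charged fragments and standard monoisotopic residue masses, and the function name `fragment_ions` is hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values, subset of amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.00728   # mass of the charge-carrying proton
WATER = 18.01056   # H2O carried by the C-terminal (y) fragment


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values, one pair per peptide bond."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        # b ion: the first i residues (N-terminal side of the break).
        b_ions.append(sum(masses[:i]) + PROTON)
        # y ion: the remaining residues plus water (C-terminal side).
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions
```

For a peptide of n residues this yields n-1 b/y pairs, and each pair sums to the peptide mass plus water plus two protons, which is why a peak can be read as either a b or a y fragment once the parent mass is known.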
Above: A sample peptide tandem mass spectrum, identified and labeled by the InsPecT software toolkit.
The Need for Bioinformatics • High-throughput technologies like MS generate huge volumes of data much faster than the data can be analyzed and integrated by legacy methods. • Analysis becomes the bottleneck, and algorithms address this bottleneck • Bioinformatics also helps improve accuracy - and provide accurate measurements of accuracy.
Known problem • Suppose it takes 1 second to locate one word in a large text. How long would it take to locate 1 million words? • (The naive answer: One million seconds!) • The Aho-Corasick algorithm takes roughly the same time to find one million words as for one word. Bioinformatics application • Suppose it takes 1 second to interpret one spectrum using a database. How long would it take to search 1 million spectra? • Early tools, like Sequest, have runtimes that grow linearly with the number of scans. • InsPecT uses the Aho-Corasick algorithm to search efficiently (up to 100 times faster than Sequest).
Key algorithms • Genomics: Genome assembly and gene finding are two important problems in genomics. • Transcriptomics: Finding up- and down-regulated genes and gene sets is a key problem in transcriptomics. • Proteomics: Peptide identification (InsPecT) and modification site identification (MS-Alignment) are two important problems in proteomics.
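To make the word-search analogy concrete, here is a compact textbook version of the Aho-Corasick automaton. It is a sketch for illustration and not the InsPecT implementation; the function names `build_automaton` and `find_all` are assumptions.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links, and output lists for `words`."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(word)
    # Breadth-first pass: depth-1 nodes keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit words that end at the failure target (proper suffixes).
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    """Scan `text` once and return (start_index, word) for every occurrence."""
    goto, fail, out = automaton
    node, hits = 0, []
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            hits.append((pos - len(word) + 1, word))
    return hits
```

The scan visits each character of the text once regardless of how many words were loaded, which is the property the slide relies on: the cost is dominated by the size of the database, not the number of query strings.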
Peptide identification • Given a peptide tandem spectrum, we wish to identify the peptide which produced it. • Identifying peptides with modified residues (or point mutations) is important as well. • Many interesting applications of mass spectrometry (e.g. quantitation) rely upon accurate peptide annotations.
InsPecT: Fast and Accurate Spectrum Annotation Tanner, S., Shu, H., Frank, A., Wang, L., Zandi, E., Mumby, M., Pevzner, P., and Bafna, V., 2005. Inspect: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra. Anal. Chem., 77(14):4626–4639. Frank, A., Tanner, S., Bafna, V., and Pevzner, P., 2005. Peptide sequence tags for fast database search in mass-spectrometry. J. of Proteome Research, 4(4):1287–1295.
Database search • One way to identify peptides (first implemented by tools like Sequest and Mascot) is to enumerate and score all possibilities from the sequence database. • Theoretical spectra are compared against the “mass fingerprint” of the spectrum. (Figure: a theoretical spectrum, #1 of 10,000, is compared with the input spectrum to give a match score.)
Drawbacks of database search • Enumerating all candidates is too slow, particularly when modifications and non-tryptic peptides must be considered. • A modern instrument produces a million spectra per day! • Early tools used an over-simplified match scoring model
De novo interpretation • What if we have no sequence database? • A de novo algorithm such as PEAKS or PepNovo attempts to recover the entire peptide sequence from the spectrum. • However, due to incomplete fragmentation and noise peaks, we can only generate partial peptide reconstructions in most cases. (Figure: a partial reconstruction with ambiguous segments, e.g. "NG? GN? AT?", followed by G, V, P and an unresolved "??".)
Filtering via tags • If we identify a part of the sequence (tag) from the spectrum itself, we can efficiently filter for regions containing that string. • Recall: Exact match for strings is very fast. • Search time does not grow with number of query strings. • Computational problem: identify a collection of tags from a spectrum, such that at least one matches the true peptide. • We identify tags via a graph theoretic formulation
Peptide mass graphs • We obtain candidate prefix residue masses by treating spectrum peaks as b or y fragments. • Masses which differ by the mass of an amino acid are linked by an edge. (Figure: peptide mass graph with edges labeled by amino acids.)
Tag-based search • InsPecT generates short peptide sequence tags from the spectrum, and uses these tags to filter the database. • Tag-based search is a hybrid of de novo and traditional database search. • Tags make database search much faster, analogous to the way that BLAST’s filter speeds up sequence search. • Example tags (tag, prefix mass): AVG, 0.0; WTD, 120.2; PET, 211.4. (Figure: peptide mass graph with edges labeled by amino acids.)
Tag-based filtering Tools like Sequest must score every peptide from the database with approximately correct mass (left). Using InsPecT, the expensive scoring step need only be run on those candidates matching a sequence tag (right). (Figure: both panels list the same database peptides: MDHPEDESHSEK, QDDEEALARLEEIK, SIEAKLTLR, QNNLNPERPDSAYLR, LKQINEEQREGLR, FVSEAVTAICEAK, SSDIQAAVQICSLLHQR, EFSASLTQGLLK, SAEDLEADK.)
Tags from all spectra are loaded into a trie. The trie lets us scan the protein database for any number of strings in linear time. When a tripeptide tag is matched and the flanking masses are matched, we obtain a candidate peptide. (Figure: trie with a root and child nodes A, D, F, ..., H, V, ..., I/L, M. Leaves record the flanking masses and source spectrum of each tag: prefix 250.1 Da and suffix 1000.5 Da for spectrum #1; prefix 762.8 Da and suffix 626.0 Da for spectrum #23; prefix 334.5 Da and suffix 220.5 Da for spectrum #3.)
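The graph formulation can be sketched in a few lines. This is a simplified illustration under stated assumptions, not InsPecT's tag generator: the input is assumed to be a list of candidate prefix masses already derived from the peaks, nothing is scored, and the names `build_mass_graph` and `extract_tags` are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values, subset of amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}


def build_mass_graph(prefix_masses, tolerance=0.02):
    """Link every pair of masses whose gap equals one residue mass."""
    nodes = sorted(prefix_masses)
    edges = {i: [] for i in range(len(nodes))}
    for i, low in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            gap = nodes[j] - low
            for aa, mass in RESIDUE_MASS.items():
                if abs(gap - mass) <= tolerance:
                    edges[i].append((j, aa))
    return nodes, edges


def extract_tags(nodes, edges, length=3):
    """Return (tag, prefix_mass) for every path of `length` edges in the graph."""
    tags = []

    def extend(node, letters, start):
        if len(letters) == length:
            tags.append(("".join(letters), nodes[start]))
            return
        for nxt, aa in edges[node]:
            extend(nxt, letters + [aa], start)

    for start in range(len(nodes)):
        extend(start, [], start)
    return tags
```

Each tag is reported with the mass of the node where its path begins, which plays the role of the prefix mass used later when filtering the database.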
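The candidate-generation step, a tag match followed by a check of the flanking masses, might look like the sketch below. It is an assumption-laden simplification: it handles one tag at a time with `str.find` instead of scanning with a trie, it considers only unmodified residues, and `extend` and `candidate_peptides` are hypothetical names.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extend(protein, start, step, target, tolerance):
    """Sum residue masses from `start` in direction `step` until `target` is hit.

    Returns the index just past the last residue consumed, or None if the
    running total overshoots or the protein ends first.
    """
    total, pos = 0.0, start
    while abs(total - target) > tolerance:
        if total > target or pos < 0 or pos >= len(protein):
            return None
        total += RESIDUE_MASS[protein[pos]]
        pos += step
    return pos


def candidate_peptides(protein, tag, prefix_mass, suffix_mass, tolerance=0.5):
    """Return peptides that contain `tag` and match both flanking masses."""
    found = []
    pos = protein.find(tag)
    while pos != -1:
        left = extend(protein, pos - 1, -1, prefix_mass, tolerance)
        right = extend(protein, pos + len(tag), +1, suffix_mass, tolerance)
        if left is not None and right is not None:
            found.append(protein[left + 1:right])
        pos = protein.find(tag, pos + 1)
    return found
```

Only the peptides returned here would go on to the expensive scoring step; a tag hit whose flanking masses cannot be matched is discarded immediately.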
Scoring tag masses Figure 3.2: Bayesian network for scoring masses. In nodes corresponding to peaks, the odds that a peak is present (in a charge-2 or a charge-3 spectrum) are indicated.
Scoring tag masses • We use a Bayesian network to score each mass, using binned intensity levels • Masses receive high scores if they have peak patterns typical of valid break points Left: Simplified portion of the conditional probability table for one node of the Bayesian network. In ion trap spectra, most break points produce a relatively strong y fragment, and a weak (but present) b fragment.
Scoring tags • Each tag is scored using the Bayesian network (for masses), including flanking amino acid effects. • Edge skew is penalized. • The top 25 tags are retained for searching. • InsPecT can easily be extended to new instruments. For instance, it can be retrained to handle c and z ion series (from ETD instruments) without recompiling the code.
Scoring candidate peptides • Filtering results in a list of candidate peptides which must be scored to obtain the best match. • A match scoring function assigns a match quality score (MQScore), given a spectrum and a peptide. • The MQScore is computed using a support vector machine (SVM) on a total of seven features measuring match quality. • The MQScore distinguishes the correct candidate from incorrect candidates.
Identifying correct annotations • In a typical experiment, only 10-30% of spectra are successfully interpreted. • We wish to focus on those spectra whose top-ranking candidate is correct. • To help do this, we consider the gap between the top candidate’s MQScore and the nearest runner-up (delta-score).
False discovery rates • In any high-throughput experiment, quantifying false discovery rates is crucial • We include decoy (shuffled) proteins in the database as a negative control. • We quantify the empirical false discovery rate by counting the number of matches to these bogus records.
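A decoy-based estimate can be sketched as follows. The estimator shown (decoy hits divided by target hits above a score cutoff) is one common convention and assumes the decoy database is the same size as the real one; the slide does not state the exact formula used, and the names `make_decoy` and `empirical_fdr` are hypothetical.

```python
import random


def make_decoy(protein, seed=0):
    """Shuffle a protein's residues to create a decoy with the same composition."""
    residues = list(protein)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def empirical_fdr(matches, score_cutoff):
    """Estimate the false discovery rate among matches scoring >= score_cutoff.

    `matches` is a list of (score, is_decoy) pairs. Assumption: each decoy hit
    stands for roughly one false hit among the target matches.
    """
    decoys = sum(1 for score, is_decoy in matches if score >= score_cutoff and is_decoy)
    targets = sum(1 for score, is_decoy in matches if score >= score_cutoff and not is_decoy)
    return decoys / targets if targets else 0.0
```

Sweeping `score_cutoff` over the observed scores gives the relationship between score and false discovery rate, so a cutoff can be chosen for a desired error level.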
Above: Histogram showing false discovery rate (y axis) versus weighted score (x axis) for results of a large search.
The sequence of human crystallin beta B1 is shown above, annotated with post-translational modifications discovered by InsPecT in a study of cataractous lens. Some modifications are produced by chemical damage, others are “deliberate” modifications carried out in a carefully-regulated manner. Comparisons of modification rates suggest that deamidation (net mass shift +1) plays a role in cataract formation.
MS-Alignment and PTMFinder: Unrestrictive Modification Search Tsur, D., Tanner, S., Zandi, E., Bafna, V., and Pevzner, P., 2005. Identification of post-translational modifications via blind search of mass-spectra. Nature Biotechnology, 23:1562–1567. Tanner, S., Pevzner, P., and Bafna, V., 2006. Unrestrictive identification of post-translational modifications through peptide mass spectrometry. Nat Protocols, 1(1):67–72. Wilmarth, P. A., Tanner, S., Dasari, S., Nagalla, S. R., Riviere, M. A., Bafna, V., Pevzner, P. A., and David, L. L., 2006. Age-related changes in human crystallins determined from comparative analysis of post-translational modifications in young and aged lens: Does deamidation contribute to crystallin insolubility? Journal of Proteome Research, 2006. Tanner, S., Payne, S. H., Dasari, S., Shen, Z., Wilmarth, P., David, L., Loomis, W. F., Briggs, S. P., and Bafna, V., 2007. Accurate annotation of peptide modifications through unrestrictive database search. In preparation.
Post-translational modifications • After assembly, proteins are often modified to control their structure, to regulate enzyme activity, or by chemical damage. • Hundreds of different modification types are known. Databases such as UniMod, RESID, and ABRF catalog them.
Restrictive vs. unrestrictive search • InsPecT can handle several modification types at once, but the user must still “guess” a list of allowed modification types • In unrestrictive search, the virtual database of modified peptides is thousands of times larger than the sequence database itself. • Identifying all peptide candidates becomes unfeasible. However, an alignment procedure can find the best modified peptide
Simplified diagram of MS-Alignment algorithm. We construct dots for each database position (horizontal axis) and for each spectrum peak (vertical axis). Paths are diagonal lines, with one or two modifications (horizontal / vertical segments) permitted. An annotation is a path from top to the bottom of the graph. The highest-scoring paths are retained and re-scored.
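The core idea of an unrestrictive search, that the mass shift is not chosen from a list but inferred from the data, can be illustrated with a brute-force stand-in. This is not the MS-Alignment dynamic program: it assumes a single modification whose size is already known from the parent mass difference, scores only singly charged b ions by peak count, and uses the hypothetical names `b_ions` and `place_modification`.

```python
# Monoisotopic residue masses in daltons (standard values, subset of amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.00728


def b_ions(peptide, shift_index, shift):
    """Singly charged b ions when residue `shift_index` carries a mass `shift`."""
    ions, total = [], 0.0
    for i, aa in enumerate(peptide[:-1]):
        total += RESIDUE_MASS[aa]
        if i == shift_index:
            total += shift  # every later b ion includes the modified residue
        ions.append(total + PROTON)
    return ions


def place_modification(peptide, peaks, shift, tolerance=0.5):
    """Try the shift on every residue; return (best_index, matched_peak_count)."""
    best_index, best_score = None, -1
    for index in range(len(peptide)):
        score = sum(
            1 for ion in b_ions(peptide, index, shift)
            if any(abs(ion - peak) <= tolerance for peak in peaks)
        )
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score
```

Trying every position for every database peptide is what makes the naive approach expensive; the alignment formulation in the diagram finds the best placement along a path instead of enumerating each modified candidate separately.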
Analysis of unrestrictive results We obtained interesting results in the Nature Biotechnology paper, but did not report a false discovery rate for sites. As peptide datasets grow, there will be less emphasis on individual spectral correctness. Instead we use the high redundancy of large datasets to focus on identification of modified peptides, and modified sites.
PTMFinder • The PTMFinder procedure attaches a false discovery rate to modification sites (analogous to PeptideProphet and unmodified search) • A site may be supported by several peptides, and by hundreds of spectra. • High spectrum-level accuracy is not sufficient (or necessary) to give high site-level accuracy • Combining features across spectra produces a very accurate model.
Handling δ-correct annotations • In unrestrictive search, each peptide has dozens of “neighbors” with similar fragmentation • Examples: Q-17GEAMLAPK QG-17EAMLAPK Q-16GEAMLAPK G+111EAMLAPK • PTMFinder merges and reconciles redundant peptides, and attempts to annotate peptides using known chemical modifications (Unrestrictive, but not blind)
Figure 6.3: ROC curve for categorization of modified lens peptides using the PTMFinder support vector machine (SVM). The accuracy of the PTMFinder model is significantly higher than a simple spectrum-level score cutoff. In addition, PTMFinder is more effective than selecting those sites which correspond to the most common modification types (amino acid and mass) by spectrum count.
PTMFinder analysis • Studied a small, heavily-modified data set from human lens, and a large data set from HEK293 cell extract • Also studied ~1.4 million spectra from the protist Dictyostelium discoideum
Ten different peptide species witness histidine methylation of actin. Combining evidence from multiple peptide species gives a site p-value of 6.6 x 10^-12. Fully tryptic peptides are most common, but missed cleavages and post-digest decay produce several other peptide species. We found this modification site to be conserved between Homo sapiens and the protist Dictyostelium discoideum.
Figure 6.5: Venn diagram summarizing sites of N-terminal acetylation (left) and phosphorylation (right) in human proteins. Known sites from two databases (Uniprot and HPRD) are shown, along with sites identified from a corpus of ~20 million spectra derived from the HEK293 cell line analyzed by MS-Alignment and PTMFinder.
Genome Annotation Improving gene annotation with mass spectrometry. Tanner, Stephen and Shen, Zhouxin and Ng, Julio and Florea, Liliana and Guigo, Roderic and Briggs, Steven P and Bafna, Vineet, 2007. Genome Research 17(2), 231-239. Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Nitin Gupta, Stephen Tanner, Navdeep Jaitly, Joshua Adkins, Mary Lipton, Robert Edwards, Margaret Romine, Andrei Osterman, Vineet Bafna, Richard D. Smith, Pavel Pevzner, 2007. In preparation.
Genome annotation Genome annotation is generally seen as something done before transcriptomics and proteomics. The direction of information flow mirrors the central dogma. (Figure labels: Genomics, Transcriptomics, Proteomics.)
Genome annotation Mass spectrometry is an attractive method for discovering genes and improving gene annotations. Roughly 25% of tryptic peptides span a splice junction, so intron boundaries (and alternative splicing) can be confirmed at the translational level. MS/MS has different sources of error than ESTs, providing a novel line of evidence for gene finding. (Figure labels: Genomics, Transcriptomics, Proteomics, ESTs, Peptide IDs.)
Genomic Search • Proteins have many isoforms and sequence variants. Storing and searching every feasible sequence is inefficient! • Storing the proteome as an exon graph is more efficient, and results are trivially mapped back onto the genome • (Valuable even in cases where the genome is perfectly annotated!)
Figure 7.3: A portion of the exon graph for heterogeneous nuclear ribonucleoprotein K. The labeled edge represents a codon split across a splice junction. The dotted edge is an “adjacent edge” corresponding to a longer form of an exon. Searching the exon graph reveals peptides spanning both outgoing edges from the central node, confirming alternative splicing at the level of translation.
Exon Graph • We used gene predictions (GeneID) and EST mappings (dbEST, ESTMapper) to build a graph of putative exons and introns in the human genome • The graph incorporates coding SNPs from dbSNP • A modified version of InsPecT was then used to search the graph
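A toy exon graph shows why this representation is compact. The sketch below is an illustration only, not the modified InsPecT search: it assumes exons are already translated, the graph is acyclic, and each edge optionally carries the single residue encoded by a codon split across the junction. The name `spliced_sequences` and the example data are hypothetical.

```python
def spliced_sequences(exons, edges, start):
    """Yield the protein sequence of every path from `start` to a terminal exon.

    `exons` maps exon id -> translated sequence.
    `edges` maps exon id -> list of (next exon id, junction residue), where the
    junction residue is "" or the amino acid of a codon split by the splice.
    """
    def walk(node, prefix):
        sequence = prefix + exons[node]
        successors = edges.get(node, [])
        if not successors:
            yield sequence
        for nxt, junction_residue in successors:
            yield from walk(nxt, sequence + junction_residue)

    yield from walk(start, "")


# Hypothetical example: exon "a" splices to either "b" (split codon E) or "c".
exons = {"a": "MKSI", "b": "AKLT", "c": "GG", "d": "LR"}
edges = {"a": [("b", "E"), ("c", "")], "b": [("d", "")], "c": [("d", "")]}
isoforms = list(spliced_sequences(exons, edges, "a"))
```

Four exons and four edges encode both isoforms here; with many alternative exons the number of paths grows multiplicatively while the graph grows only additively, so a real search would walk the graph directly instead of enumerating every path as this sketch does.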