10 likes | 181 Views
Introduction. C 3 Database Compression. Faster Search. Searching Human ESTs. C 3 sequence database [1] Complete: All 30-mers are represented Correct: No new 30-mers are represented Compact: 30-mers occur exactly once.
E N D
Introduction C3 Database Compression Faster Search Searching Human ESTs C3 sequence database [1] Complete: All 30-mers are represented Correct: No new 30-mers are represented Compact: 30-mers occur exactly once. Figure 1 shows the size of the C3 sequence database compression for various databases. Human ESTs represent a largely unexplored source of peptide sequence data. EST sequence databases are highly redundant, with a sequencing error rate of about 1%. We translate the Human ESTs in all 6 frames and C3 compress the result. In the process, we eliminate open reading frames of less than 50 amino acids, and amino-acid 30-mers that are observed only once. We achieve 40 fold compression in sequence database size and therefore running time. Current sequence databases contain considerable peptide sequence redundancy. Amino-acid sequence databases often contain considerably fewer distinct 30-mers than its size suggests. The distinct 30-mers can be searched faster than the original sequence database, while the statistical significance of the peptide identifications is improved. Peptide Sequences are Short • Peptides identified in MS/MS workflows are rarely longer than 30 amino-acids. • Trypsin cuts at K or R, unless followed by P • Precursor ion is usually < 3000 Da • Charge state is usually +1, +2, or +3 • Peptides of more than 20 amino-acids typically don’t fragment well. • 30 is a conservative upper-bound on the length of peptides identified by MS/MS workflows. (Projected) (Projected) Figure 2: Absolute (seconds) and relative Mascot search time for original and C3 sequence databases. Blue represents Mascot search time, other colors represent peptide mapping time. Protein Sequence Databases More Sensitive Search We searched the Open Proteomics Database MS/MS dataset “SiHa human cell line used to model cervical cancer” against C3 Human dbEST with Mascot on a PC (512Mb RAM) Total spectra: 47788 Search Time: 4.5 hours Novel Peptides: 19 Figure 1: Sequence databases: size, C3 size, distinct 30-mers. IPI-HUMAN, from EBI; IPI, concatenation of IPI-HUMAN, IPI-MOUSE, and IPI-RAT from EBI; Swiss-Prot, from ExPASy; Swiss-Prot-VS, Swiss-Prot plus varsplic.pl variant enumeration; UniProt, concatenation of Swiss-Prot and TrEMBL; UniProt-VS, UniProt plus varsplic.pl variant enumeration; MSDB, from Imperial College; NRP, from NCI Frederick; NCBI-nr, from NCBI; and UnionNR, non-redundant union of all. Figure 1 shows size and distinct 30-mers. Experiment Parameters • ISB 17 Protein Mix (2043 MS/MS Spectra) • Mascot 2.0 • Precursor tolerance: 2 Da • Fragment tolerance: 0.15 Da • Up to 2 missed trypsin cleavages • IPI-HUMAN, Swiss-Prot(-VS), UniProt(-VS) • Dell PC with 512 Mb of RAM [1] N. Edwards and R. Lippert. Sequence database compression for peptide identification from tandem mass spectra. WABI 2004. Figure 3: Significant E-values for Mascot search against Swiss-Prot-VS (x) vs C3 Swiss-Prot-VS (y). Faster, more sensitive peptide identification fromtandem mass spectra by sequence database compression Nathan J. EdwardsCenter for Bioinformatics & Computational Biology, University of Maryland, College Park P-1014 Figure 4: Size (Blue, bottom axis, in Mb) and Mascot search time (Red, top axis, in hours) for C3 and brute force Human EST database. References