On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering


  1. On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering • Jiří Novák et al. • SiRet Research Group, Department of Software Engineering, Charles University in Prague, Czech Republic • http://www.siret.cz

  2. Outline • Introduction • Tandem Mass Spectrometry • Identification of Protein Sequences • Preprocessing of Mass Spectra • Non-metric Similarity Search • Metric Access Methods • Non-Metric Access Methods • Peptide Sequences Identification (Original Idea) • Protein Sequences Identification (Improvements) • Experiments • Conclusions and Future Work

  3. Tandem Mass Spectrometry • identification of protein/peptide sequences from an "in vitro" sample • proteins are enzymatically digested to peptides • peptides are charged and separated by their mass-to-charge ratio • peptide ions are "smashed" into fragment ions and a mass spectrum is captured • mass spectrum • a list of peaks • the difference of the mass-to-charge ratios between 2 neighboring peaks in a series corresponds to the mass of an amino acid • noise peaks: up to 80%
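The relation between peak gaps and amino acid masses can be illustrated with a minimal sketch. It assumes singly charged peaks from a single ion series and uses standard monoisotopic residue masses; the function name `label_gaps`, the tolerance and the example ladder are hypothetical and not taken from the slides.

```python
# Monoisotopic residue masses in daltons for a few amino acids (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}

def label_gaps(peaks, tol=0.02):
    """Name the residue matching each gap between neighboring peaks.

    Assumes singly charged peaks of one ion series; a gap that matches
    no residue within `tol` is reported as None.
    """
    peaks = sorted(peaks)
    labels = []
    for low, high in zip(peaks, peaks[1:]):
        gap = high - low
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(gap - m) <= tol]
        labels.append(matches[0] if matches else None)
    return labels

# Hypothetical noise-free y-ion ladder of the peptide SGAK (read from the C-terminus).
ladder = [147.11280, 218.14991, 275.17137, 362.20340]
print(label_gaps(ladder))
```

With real spectra the ladder is incomplete and mixed with noise peaks, which is why the later slides rely on similarity search instead of reading the gaps directly.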

  4. Tandem Mass Spectrometry • a mass spectrum corresponds to a peptide ion • several peptide ions may correspond to one peptide sequence • a protein sequence contains many peptide sequences • an "in vitro" sample is often analyzed in several spectrometer runs • a spectrometer generates hundreds to thousands of spectra in a single run • about 90% of all spectra are noise spectra

  5. Identification of Protein Sequences • similarity search in databases of known sequences • a query spectrum is compared with the hypothetical spectra generated from a database of protein sequences • databases of protein sequences grow exponentially • MSDB (> 3.2 million protein sequences) • SwissProt (> 530 thousand) • requirements • similarity function • database index

  6. Identification of Protein Sequences • similarity functions • cosine similarity (angle distance) • parameterized Hausdorff distance • sigmoid similarity • other similarities • shared peak count (SPC), spectral alignment, SEQUEST-like scoring, Mascot’s MOWSE, etc. • a metric distance is needed (see later)
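As an illustration of the first similarity function, the sketch below computes the angle distance (the arc cosine of the cosine similarity) between two spectra. The binning into unit-width m/z bins and the binary peak weights are assumptions made for the example; the slides do not state how spectra are turned into vectors.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Turn a list of m/z values into a sparse binary vector {bin index: 1.0}."""
    return {int(mz / bin_width): 1.0 for mz in peaks}

def angle_distance(a, b):
    """Angle (in radians) between two sparse vectors; pi/2 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return math.pi / 2
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)

x = bin_spectrum([147.1, 218.1, 275.2])
y = bin_spectrum([147.1, 218.1, 305.2])
print(angle_distance(x, y))  # two of three bins shared
```

Unlike the raw cosine similarity, the angle satisfies the triangle inequality, which matters for the metric access methods discussed on the later slides.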

  7. Identification of Protein Sequences • database indexes • existing approaches • the precursor mass is often employed (the mass of a peptide ion before splitting) • approaches based on similarity search in metric spaces (MVP-tree, locality sensitive hashing) • inverted lists + cosine similarity • the precursor mass distribution is not uniform • many peptides with low mass worsen the efficiency • the capability of managing spectra with posttranslational modifications (PTMs) is often limited/neglected • the precursor mass of peptides with PTMs may differ from that of peptides without PTMs by tens to hundreds of daltons • non-metric access methods • a precursor-free method • expected shifts of peaks must be generated in the spectra

  8. Identification of Protein Sequences • "de novo" interpretation • direct interpretation of spectra using graph algorithms • completeness of the ion series is crucial; otherwise many peptide sequences can be assigned to a spectrum • "tag" based methods • a short sequence "tag" is determined "de novo", then the database is searched

  9. Preprocessing of Mass Spectra • spectrum quality filtering • many parameters of spectra are analyzed • number of peaks and their relative intensity, precursor mass, number of complementary y and b ions, etc. • a score is assigned to a spectrum • only spectra with a specific score are further processed • different manufacturers and different physical principles of spectrometers mean that the significance of parameters may differ from machine to machine • clustering • independent of the properties of different machines • similar spectra (corresponding to a peptide) form a cluster • noise spectra form single-object clusters (singletons), which can be left out of further processing • a spectrum similarity function is required

  10. Preprocessing of Mass Spectra • clustering algorithms • K-means • one of the best-known algorithms 1. select K centroids (K is the number of clusters) 2. move each point/spectrum to the cluster with the nearest centroid 3. new centroids of clusters are selected 4. if some points can be moved then go to step 2, else return • not suitable for mass spectra, since we cannot predict the number K

  11. Preprocessing of Mass Spectra • clustering algorithms • hierarchical-like clustering • a more suitable algorithm for mass spectra 1. the algorithm is initialized with one spectrum (a vector of mass-to-charge ratios) per cluster 2. two clusters are merged if the distance between their centroids is minimal and less than or equal to a specified tolerance (e.g., dHP <= 0.65) 3. new centroids are selected 4. the spectra are rearranged among the clusters; a spectrum is moved to another cluster if the distance between the spectrum and every spectrum in the target cluster is less than or equal to a specified tolerance; if several clusters qualify, the cluster whose centroid is nearest to the moved object is picked 5. new centroids are selected 6. repeat steps 2 - 5 in w cycles • the centroids of clusters with >= 2 spectra are further processed • many good quality spectra are missed because they have no "twin" in the dataset captured by the spectrometer • spectra captured in several spectrometer runs (>= 2) must be merged, so that good quality spectra are kept and noise spectra are successfully eliminated • quadratic time complexity
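The merge part of this procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: it covers only steps 1 to 3 (repeated until no pair of centroids is within the tolerance) and omits the rearrangement of steps 4 and 5. The centroid is taken to be the medoid, and `toy_distance` is a hypothetical stand-in for the parameterized Hausdorff distance dHP, which the slides do not define.

```python
def toy_distance(a, b):
    """Hypothetical stand-in for d_HP: 1 - Jaccard overlap of rounded peaks."""
    sa, sb = {round(x) for x in a}, {round(x) for x in b}
    return 1.0 - len(sa & sb) / len(sa | sb)

def medoid(cluster, dist):
    """Member with the smallest total distance to the rest of the cluster."""
    return min(cluster, key=lambda s: sum(dist(s, t) for t in cluster))

def cluster_spectra(spectra, dist, tol):
    """Merge-only sketch: start with singletons, merge the closest centroids."""
    clusters = [[s] for s in spectra]                 # step 1
    while True:
        centroids = [medoid(c, dist) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroids[i], centroids[j])
                if d <= tol and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:                              # nothing left to merge
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))           # step 2; centroids are
                                                      # recomputed next pass (step 3)

s1 = (100.0, 200.0, 300.0)
s2 = (100.1, 200.1, 300.1)   # near-duplicate of s1, e.g. from a second run
s3 = (500.0, 600.0)          # isolated spectrum, treated as noise
clusters = cluster_spectra([s1, s2, s3], toy_distance, tol=0.65)
kept = [medoid(c, toy_distance) for c in clusters if len(c) >= 2]
print(clusters, kept)
```

Only the centroids of clusters with at least two members are kept, which mirrors the slide's rule for discarding singletons.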

  12. Metric Access Methods (MAMs) • metric distance: reflexivity, symmetry, non-negativity and triangle inequality • the triangle inequality is crucial for organizing objects into metric regions and for pruning those regions while searching • M-tree • dynamic and balanced tree • organizes objects (vectors of mass-to-charge ratios) into n-dimensional ball regions • supports k nearest neighbors (kNN) and range queries
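How the triangle inequality allows pruning can be shown with a flat, one-level sketch of ball regions. This is not an M-tree: there is no tree structure or balancing, and the region layout, the Euclidean distance and the function name `range_query` are assumptions for the example. A region is skipped when the distance from the query to its center, minus its covering radius, already exceeds the query radius.

```python
import math

def range_query(query, radius, regions, dist=math.dist):
    """Range query over ball regions given as (center, covering_radius, objects).

    Returns the matching objects and the number of distance computations.
    """
    hits, computed = [], 0
    for center, covering_radius, objects in regions:
        computed += 1
        # Triangle inequality: every object o in the ball satisfies
        # dist(query, o) >= dist(query, center) - covering_radius.
        if dist(query, center) - covering_radius > radius:
            continue  # the whole region is pruned
        for obj in objects:
            computed += 1
            if dist(query, obj) <= radius:
                hits.append(obj)
    return hits, computed

regions = [
    ((0.0, 0.0), 1.0, [(0.0, 0.0), (0.5, 0.5)]),
    ((10.0, 10.0), 1.0, [(10.0, 10.0), (10.5, 10.5)]),
]
print(range_query((0.2, 0.2), 0.5, regions))
```

If the distance violates the triangle inequality, the lower bound in the comment no longer holds and pruned regions may contain true answers, which is the trade-off the non-metric slides address.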

  13. Non-metric Access Methods • semi-metric • the triangle inequality is violated • fast but approximate search • T-error (triangle error): the ratio of triplets of objects which do not satisfy the triangle inequality to all triplets of objects in the database • angle distance and parameterized Hausdorff distance • T-error ≈ 0 • bad indexability by MAMs • ball regions in the M-tree overlap, thus no objects are pruned when the M-tree is traversed • the search deteriorates to a sequential scan of the whole database
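A minimal sketch of measuring the T-error: it enumerates all triplets of a small object set and counts the share that violates the triangle inequality. Exhaustive enumeration is an assumption made for clarity; on a real database the triplets would have to be sampled. The function name `t_error` and the toy distances are hypothetical.

```python
from itertools import combinations

def t_error(objects, dist):
    """Share of triplets whose pairwise distances violate the triangle inequality."""
    total = violating = 0
    for a, b, c in combinations(objects, 3):
        # With the sides sorted, only the two shortest against the longest can fail.
        x, y, z = sorted((dist(a, b), dist(b, c), dist(a, c)))
        total += 1
        if x + y < z:
            violating += 1
    return violating / total if total else 0.0

points = [0, 1, 2, 3]
print(t_error(points, lambda p, q: abs(p - q)))    # a metric
print(t_error(points, lambda p, q: (p - q) ** 2))  # squared distance, a semi-metric
```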

  14. Non-metric Access Methods • modifier function • e.g., fractional-power modifier • is applied to the original distance v • the weight w controls the violation of the triangle inequality • w can be determined automatically for the user-specified T-error tolerance Θ by the TriGen algorithm • [Figure panels: cosine similarity (angle distance); parameterized Hausdorff distance]
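The slide's formula for the modifier did not survive in the transcript. The sketch below uses one form of the fractional-power modifier found in the TriGen literature, v^(1/(1+w)) for w >= 0 on distances normalized to [0, 1]; treat this exact form, and the helper `satisfies_triangle`, as assumptions rather than the slide's definition. A larger w makes the function more concave, which repairs more violating triplets while preserving the ordering of distances.

```python
def fp_modifier(v, w):
    """Fractional-power modifier (assumed form) for v in [0, 1] and w >= 0."""
    return v ** (1.0 / (1.0 + w))

def satisfies_triangle(a, b, c):
    """True if three pairwise distances can form a (possibly flat) triangle."""
    x, y, z = sorted((a, b, c))
    return x + y >= z

triplet = (0.09, 0.16, 0.36)   # 0.09 + 0.16 < 0.36: violates the inequality
print(satisfies_triangle(*triplet))
print(satisfies_triangle(*(fp_modifier(v, 1.0) for v in triplet)))
```

With w = 0 the distance is unchanged; TriGen's job, per the slide, is to find the smallest w that brings the T-error under the tolerance Θ.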

  15. Non-metric Access Methods • NM-tree • naturally combines the M-tree with the TriGen algorithm • the T-error tolerance Θ can be changed at query time • no need to re-index the database when the T-error tolerance is changed

  16. Peptide Sequences Identification (Original Idea) • indexing • peptide sequences are generated from a database of protein sequences • hypothetical mass spectra (mass-to-charge ratios) are generated from peptide sequences (e.g. y-ions and b-ions) • hypothetical spectra are indexed by a MAM • querying/identification of peptide sequences • a kNN (k nearest neighbor) query is processed by the MAM for each query spectrum (captured by the spectrometer) • a hypothetical spectrum in the kNN query result corresponds to the correct peptide sequence • metric distances are coarse functions, so an additional re-ranking may be used to determine the correct sequence from the kNN query result
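The indexing and querying steps can be sketched end to end on a toy scale. The sketch generates singly charged b- and y-ion m/z values from peptide strings using standard monoisotopic masses and answers a kNN query by sequential scan, which stands in for the MAM. The distance `shared_peak_distance`, its tolerance and the peptide list are hypothetical choices for the example.

```python
import heapq

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "K": 128.09496, "E": 129.04259,
}
PROTON, WATER = 1.00728, 18.01056

def theoretical_spectrum(peptide):
    """Sorted m/z values of the singly charged b- and y-ions of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions, running = [], 0.0
    for m in masses[:-1]:                 # b-ions: N-terminal prefixes
        running += m
        ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):        # y-ions: C-terminal suffixes
        running += m
        ions.append(running + WATER + PROTON)
    return sorted(ions)

def shared_peak_distance(a, b, tol=0.5):
    """Hypothetical distance: 1 - share of peaks in a that have a match in b."""
    shared = sum(1 for x in a if any(abs(x - y) <= tol for y in b))
    return 1.0 - shared / max(len(a), len(b))

def knn(query, index, k):
    """Sequential-scan kNN over (peptide, spectrum) pairs."""
    return heapq.nsmallest(k, index, key=lambda e: shared_peak_distance(query, e[1]))

index = [(p, theoretical_spectrum(p)) for p in ["GASK", "LEVK", "SGAK"]]
query = theoretical_spectrum("SGAK")      # stands in for a captured spectrum
print([peptide for peptide, _ in knn(query, index, k=2)])
```

The k results would then be re-ranked with a finer scoring function, as the last bullet of the slide suggests.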

  17. Protein Sequences Identification (Improvements) • algorithm for identification of protein/peptide sequences • preprocessing (speeds up the search) • clustering; the noise spectra are eliminated • centroids of clusters with >= 2 spectra are used in the query phase • query phase (speeds up the search) • original idea; kNN queries are processed by the NM-tree • postprocessing (increases the number of identified peptides) • peptide sequence candidates are assigned to the protein sequences of their origin, which form protein sequence candidates • all spectra captured by the spectrometer are compared with the hypothetical spectra generated from protein sequence candidates
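The postprocessing step can be sketched as a lookup from peptide candidates back to their proteins, followed by regenerating all peptides of those proteins for the final sequential comparison. The digestion rule (cut after every K or R, a simplified trypsin rule), the protein identifiers and the function names are assumptions made for the example; the slides do not name the enzyme.

```python
import re

def tryptic_peptides(sequence):
    """Simplified digestion (assumption): cut after every K or R."""
    return re.findall(r"[^KR]*[KR]|[^KR]+", sequence)

def protein_candidates(peptide_hits, proteins):
    """Ids of proteins whose digest contains at least one identified peptide."""
    hits = set(peptide_hits)
    return sorted(pid for pid, seq in proteins.items()
                  if hits & set(tryptic_peptides(seq)))

def rescan_set(peptide_hits, proteins):
    """All peptides of the candidate proteins, to be compared sequentially."""
    peptides = set()
    for pid in protein_candidates(peptide_hits, proteins):
        peptides.update(tryptic_peptides(proteins[pid]))
    return sorted(peptides)

proteins = {"P1": "SGAKLEVR", "P2": "GASKVVK"}   # hypothetical database
hits = ["SGAK"]                                   # found by the approximate kNN
print(protein_candidates(hits, proteins))
print(rescan_set(hits, proteins))
```

Here the peptide LEVR, which the approximate query phase did not return, becomes available for comparison because its protein P1 was selected as a candidate.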

  18. Experiments • preprocessing • clustering is suitable when spectra from at least 2 spectrometer runs are merged together (unification of query sets)

  19. Experiments • postprocessing • kNN queries on the NM-tree are approximate: the faster the search, the lower the number of identified peptide sequences • since protein sequences contain many peptide sequences, kNN queries are followed by a sequential scan of the protein sequence candidates, which increases the number of identified sequences

  20. Experiments • preprocessing + query phase + postprocessing • ratio of identified spectra to annotated spectra • time of identification per spectrum • [Charts not reproduced; surviving value labels: -0.94%, -3.71%, -4.44%; 4.4x, 4.0x, 100.6x, 25.1x]

  21. Conclusions • clustering of spectra improves the efficiency of our method • speed-up of more than 100x (with respect to a sequential scan of the entire database without clustering) • ratio of identified peptide sequences over 90% • query sets must be merged from >= 2 spectrometer runs • a sequential scan of protein sequence candidates must be used as a postprocessing step

  22. Future work • clustering algorithm with lower time complexity • e.g. DENCLUE, O(N log N) • dealing with PTMs when spectra are clustered • embedding of clustering into our demo application SimTandem