On Optimizing the Non-metric Similarity Search in Tandem Mass Spectra by Clustering

Jiří Novák et al., SiRet Research Group, Department of Software Engineering, Charles University in Prague, Czech Republic, http://www.siret.cz

Outline • Introduction • Tandem Mass Spectrometry • Identification of Protein Sequences • Preprocessing of Mass Spectra • Non-metric Similarity Search • Metric Access Methods • Non-Metric Access Methods • Peptide Sequences Identification (Original Idea) • Protein Sequences Identification (Improvements) • Experiments • Conclusions and Future Work

Tandem Mass Spectrometry • identification of protein/peptide sequences from an “in vitro” sample • proteins are enzymatically digested to peptides • peptides are charged and separated by their mass-to-charge ratio • peptide ions are “smashed” into fragment ions and a mass spectrum is captured • mass spectrum • a list of peaks • the difference of mass-to-charge ratios between 2 neighboring peaks in a series corresponds to the mass of an amino acid • noise peaks – up to 80%
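The relation between neighboring peaks and amino acid masses can be illustrated with a minimal sketch. It assumes singly charged ions, a small table of standard monoisotopic residue masses, and a hypothetical helper `read_ladder` with an arbitrary tolerance; none of these names come from the slides.

```python
# Monoisotopic residue masses in daltons for a few amino acids (standard values).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "L": 113.08406}

def read_ladder(peaks, tolerance=0.01):
    """Map each gap between consecutive peaks to a residue letter, or '?' if none fits.

    Assumes the peaks are singly charged ions of one series (illustrative only).
    """
    peaks = sorted(peaks)
    letters = []
    for low, high in zip(peaks, peaks[1:]):
        gap = high - low
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(gap - mass) <= tolerance]
        letters.append(matches[0] if matches else "?")
    return "".join(letters)

# b-ion ladder of the toy peptide G-A-V-S (first peak = G residue + proton).
ladder = [58.02874, 129.06585, 228.13426, 315.16629]
print(read_ladder(ladder))
```

A gap that matches no residue mass (for example one caused by a noise peak or a missing ion) is reported as `?`, which is why incomplete ion series make direct interpretation ambiguous.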

Tandem Mass Spectrometry • a mass spectrum corresponds to a peptide ion • several peptide ions may correspond to one peptide sequence • a protein sequence contains many peptide sequences • an “in vitro” sample is often analyzed in several spectrometer runs • a spectrometer generates hundreds to thousands of spectra in a single run • about 90% of all spectra are noise spectra

Identification of Protein Sequences • similarity search in databases of known sequences • a query spectrum is compared with the hypothetical spectra generated from a database of protein sequences • databases of protein sequences grow exponentially • MSDB (> 3.2 million protein sequences) • SwissProt (> 530 thousand) • requirements • similarity function • database index

Identification of Protein Sequences • similarity functions • cosine similarity (angle distance) • parameterized Hausdorff distance • sigmoid similarity • other similarities • shared peak count (SPC), spectral alignment, SEQUEST-like scoring, Mascot’s MOWSE, etc. • a metric distance is needed (see later)
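Two of the listed functions, the shared peak count and the angle distance derived from cosine similarity, can be sketched as follows. The sketch assumes spectra are plain lists of mass-to-charge ratios, ignores intensities, and discretizes peaks into bins of a hypothetical width of 1.0; the function names are illustrative, not taken from the slides.

```python
import math

def to_bins(peaks, bin_width=1.0):
    """Discretize m/z values into a set of integer bin indices (assumed binning)."""
    return {int(mz / bin_width) for mz in peaks}

def shared_peak_count(a, b, bin_width=1.0):
    """Number of bins occupied in both spectra."""
    return len(to_bins(a, bin_width) & to_bins(b, bin_width))

def angle_distance(a, b, bin_width=1.0):
    """Angle (in radians) between the binary bin vectors of two spectra."""
    bins_a, bins_b = to_bins(a, bin_width), to_bins(b, bin_width)
    if not bins_a or not bins_b:
        return math.pi / 2  # treat an empty spectrum as orthogonal to everything
    cosine = len(bins_a & bins_b) / math.sqrt(len(bins_a) * len(bins_b))
    return math.acos(min(1.0, cosine))

query = [100.2, 200.5, 300.7]
candidate = [100.4, 200.9, 450.1]
print(shared_peak_count(query, candidate))
print(angle_distance(query, candidate))
```

For binary vectors the dot product equals the shared peak count, so the two measures differ only in the normalization by the number of peaks.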

Identification of Protein Sequences • database indexes • existing approaches • precursor mass is often employed (the mass of a peptide ion before splitting) • approaches based on similarity search in metric spaces (MVP-tree, locality sensitive hashing) • inverted lists + cosine similarity • the precursor mass distribution is not uniform • many peptides with low mass worsen the efficiency • the capability of managing spectra with posttranslational modifications (PTMs) is often limited/neglected • the precursor mass of peptides with PTMs may differ from that of peptides without PTMs by tens to hundreds of daltons • non-metric access methods • a precursor-free method • expected shifts of peaks must be generated in the spectra

Identification of Protein Sequences • "de novo" interpretation • direct interpretation of spectra using graph algorithms • completeness of ion series is crucial; otherwise many peptide sequences can be assigned to a spectrum • "tag"-based methods • a short sequence “tag” is determined “de novo”, then the database is searched

Preprocessing of Mass Spectra • spectrum quality filtering • many parameters of spectra are analyzed • number of peaks and their relative intensity, precursor mass, number of complementary y and b ions, etc. • a score is assigned to a spectrum • only spectra with a sufficient score are processed further • different manufacturers and different physical principles of spectrometers, so the significance of parameters may differ from machine to machine • clustering • independent of the properties of different machines • similar spectra (corresponding to a peptide) form a cluster • noise spectra form clusters with a single object (singletons) and can be left out of further processing • a spectrum similarity function is required

Preprocessing of Mass Spectra • clustering algorithms • K-means • one of the best-known algorithms 1. select K centroids (K is the number of clusters) 2. move each point/spectrum to the cluster with the nearest centroid 3. new centroids of clusters are selected 4. if some points can be moved then go to step 2, else return • not suitable for mass spectra, since we cannot predict the number K

Preprocessing of Mass Spectra • clustering algorithms • hierarchical-like clustering • a more suitable algorithm for mass spectra 1. the algorithm is initialized with one spectrum (vector of mass-to-charge ratios) per cluster 2. two clusters are merged if the distance between their centroids is minimal and less than or equal to a specified tolerance (e.g., dHP <= 0.65) 3. new centroids are selected 4. the spectra are rearranged among the clusters; a spectrum is moved to another cluster if the distance between the spectrum and every spectrum in the target cluster is less than or equal to a specified tolerance; if several clusters qualify, the cluster whose centroid is nearest to the moved spectrum is picked 5. new centroids are selected 6. repeat steps 2 - 5 in w cycles • the centroids of clusters with >= 2 spectra are further processed • many good quality spectra are missed because they have no "twin" in the dataset captured by the spectrometer • spectra captured in several spectrometer runs (>= 2) must be merged, so that good quality spectra are kept and noise spectra are successfully eliminated • quadratic time complexity
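The merging part of the hierarchical-like clustering (steps 1 to 3, repeated for a fixed number of cycles) can be sketched as below. This is a simplified illustration, not the presented implementation: the rearrangement of spectra among clusters (step 4) is omitted, the centroid is assumed to be the medoid of a cluster, and the names `medoid` and `merge_clusters` are hypothetical. The distance function is passed in, so the toy usage clusters plain numbers instead of real spectra.

```python
def medoid(cluster, dist):
    """Pick the member with the smallest summed distance to the others (assumed centroid)."""
    return min(cluster, key=lambda s: sum(dist(s, t) for t in cluster))

def merge_clusters(spectra, dist, tolerance, cycles=10):
    """Repeatedly merge the two clusters whose centroids are closest and within tolerance."""
    clusters = [[s] for s in spectra]          # step 1: one spectrum per cluster
    for _ in range(cycles):
        centroids = [medoid(c, dist) for c in clusters]
        best = None                            # (distance, i, j) of the closest valid pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroids[i], centroids[j])
                if d <= tolerance and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:                       # nothing left to merge
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # step 2: merge; centroids are
        del clusters[j]                          # recomputed in the next cycle (step 3)
    return clusters

# Toy data: numbers stand in for spectra, absolute difference for the distance.
toy = [1.0, 1.1, 5.0, 5.2, 9.0]
result = merge_clusters(toy, lambda a, b: abs(a - b), tolerance=0.5)
kept = [c for c in result if len(c) >= 2]      # singletons are treated as noise
print(kept)
```

The nested loop over cluster pairs in every cycle is what makes this kind of procedure quadratic in the number of spectra.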

Metric Access Methods (MAMs) • metric distance: reflexivity, symmetry, non-negativity and triangle inequality • the triangle inequality is crucial for organizing objects into metric regions and for pruning those regions while searching • M-tree • dynamic and balanced tree • organizes objects (vectors of mass-to-charge ratios) into n-dimensional ball regions • supports k nearest neighbors (kNN) and range queries
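How the triangle inequality allows pruning can be shown with a minimal pivot-based range query. This is not the M-tree itself, only an illustration of the same lower bound |d(q, p) - d(p, o)| <= d(q, o); the function name `range_query` and the use of 2-D Euclidean points are assumptions made for the example.

```python
import math

def range_query(query, radius, indexed, pivot, dist):
    """Return (hits, number of distance computations) for a range query.

    `indexed` is a list of (object, precomputed distance from pivot to object).
    An object is skipped without computing d(query, object) when the
    triangle-inequality lower bound already exceeds the radius.
    """
    d_qp = dist(query, pivot)
    hits, computed = [], 0
    for obj, d_po in indexed:
        if abs(d_qp - d_po) > radius:   # lower bound on d(query, obj)
            continue                    # pruned, no distance computation needed
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed

pivot = (0.0, 0.0)
points = [(1.0, 0.0), (9.0, 0.0), (10.0, 1.0), (0.0, 10.0), (20.0, 0.0)]
indexed = [(p, math.dist(pivot, p)) for p in points]
hits, computed = range_query((10.0, 0.0), 1.5, indexed, pivot, math.dist)
print(hits, computed)
```

The pruning is only safe when the distance satisfies the triangle inequality; with a non-metric distance the same test may discard objects that belong to the result.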

Non-metric Access Methods • semi-metric • the triangle inequality is violated • fast but approximate search • T-error (triangle error) – the ratio of triplets of objects that do not satisfy the triangle inequality to all triplets of objects in the database • angle distance and parameterized Hausdorff distance • T-error ≈ 0 • bad indexability by MAMs • ball regions in the M-tree overlap, thus no objects are pruned when the M-tree is traversed • the search deteriorates to a sequential scan of the whole database
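A T-error estimate can be computed by brute force over object triplets, as in the following sketch. It assumes the ratio is taken over all triplets of the given objects (in practice a sample would be used), and the name `t_error` is hypothetical.

```python
from itertools import combinations

def t_error(objects, dist):
    """Fraction of object triplets whose distances violate the triangle inequality."""
    total = violating = 0
    for a, b, c in combinations(objects, 3):
        shortest, middle, longest = sorted((dist(a, b), dist(b, c), dist(a, c)))
        total += 1
        if shortest + middle < longest:   # the two shorter sides do not cover the longest
            violating += 1
    return violating / total if total else 0.0

points = [0, 1, 2, 3]
print(t_error(points, lambda a, b: abs(a - b)))     # absolute difference is a metric
print(t_error(points, lambda a, b: (a - b) ** 2))   # squared difference is a semi-metric
```

The squared difference is used here only as a simple example of a semi-metric; it is not one of the spectrum distances discussed in the slides.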

Non-metric Access Methods • modifier function • e.g., fractional-power modifier • is applied on the original distance v • the weight w controls the violation of the triangle inequality • w can be determined automatically for the user-specified T-error tolerance Θ by the TriGen algorithm • [plots of the modifier for the cosine similarity (angle distance) and the parameterized Hausdorff distance]
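The effect of a modifier on a violating triplet can be sketched as follows. The exact formula shown on the slide is not recoverable from the text, so the sketch assumes one common concave form of a fractional-power modifier, f(v) = v^(1/(1+w)) for w >= 0 with distances normalized to [0, 1]; the function names are hypothetical.

```python
def fp_modifier(v, w):
    """Assumed fractional-power modifier: concave for w > 0, identity for w = 0."""
    return v ** (1.0 / (1.0 + w))

def violates_triangle(d_ab, d_bc, d_ac, eps=1e-12):
    """True if the two shorter distances do not cover the longest one."""
    shortest, middle, longest = sorted((d_ab, d_bc, d_ac))
    return shortest + middle < longest - eps

# Squared differences of the points 0.0, 0.1 and 0.2 (a semi-metric triplet).
triplet = (0.01, 0.01, 0.04)
print(violates_triangle(*triplet))
print(violates_triangle(*(fp_modifier(v, 1.0) for v in triplet)))
```

Because the modifier is increasing, it preserves the ordering of distances, so query results ranked by the modified distance are the same as those ranked by the original one; a larger w repairs more triplets at the cost of worse indexability.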

Non-metric Access Methods • NM-tree • naturally combines the M-tree with the TriGen algorithm • the T-error tolerance Θ can be changed at query time • no need to re-index the database when the T-error tolerance is changed

Peptide Sequences Identification (Original Idea) • indexing • peptide sequences are generated from a database of protein sequences • hypothetical mass spectra (mass-to-charge ratios) are generated from peptide sequences (e.g. y-ions and b-ions) • hypothetical spectra are indexed by a MAM • querying/identification of peptide sequences • a kNN (k nearest neighbor) query is processed by the MAM for each query spectrum (captured by the spectrometer) • a hypothetical spectrum in the kNN query result should correspond to the correct peptide sequence • metric distances are coarse functions, so an additional re-ranking may be used to determine the correct sequence from the kNN query result
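Generating a hypothetical spectrum from a peptide sequence can be sketched as below. The sketch assumes singly charged b- and y-ions only, standard monoisotopic residue masses, and no modifications; the function name `hypothetical_spectrum` is illustrative.

```python
# Monoisotopic residue masses in daltons (standard values for the 20 amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of a proton (single positive charge)

def hypothetical_spectrum(peptide):
    """Return sorted m/z values of singly charged b- and y-ions of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        b_ion = sum(masses[:i]) + PROTON           # N-terminal fragment
        y_ion = sum(masses[i:]) + WATER + PROTON   # C-terminal fragment
        peaks.extend((b_ion, y_ion))
    return sorted(peaks)

print(hypothetical_spectrum("GAS"))
```

A peptide of n residues yields n - 1 b-ions and n - 1 y-ions under these assumptions; such vectors of mass-to-charge ratios are what would be inserted into the index.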

Protein Sequences Identification (Improvements) • algorithm for identification of protein/peptide sequences • preprocessing (speeds up the search) • clustering; the noise spectra are eliminated • centroids of clusters with >= 2 spectra are used in the query phase • query phase (speeds up the search) • original idea; kNN queries are processed by the NM-tree • postprocessing (increases the number of identified peptides) • peptide sequence candidates are assigned to the protein sequences of their origin, which form protein sequence candidates • all spectra captured by the spectrometer are compared with the hypothetical spectra generated from protein sequence candidates
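The first postprocessing step, assigning peptide candidates to their proteins of origin, can be sketched as a simple lookup. The sketch assumes the protein database is a dictionary from a protein identifier to its sequence and that origin is decided by substring containment; both the data layout and the name `protein_candidates` are hypothetical.

```python
def protein_candidates(peptide_hits, proteins):
    """Return sorted ids of proteins that contain at least one identified peptide."""
    return sorted(
        protein_id
        for protein_id, sequence in proteins.items()
        if any(peptide in sequence for peptide in peptide_hits)
    )

# Toy database with made-up sequences.
proteins = {"P1": "MKTAYIAKQR", "P2": "GAVLSGAVK", "P3": "WWWW"}
hits = ["TAYIAK", "GAVK"]   # peptide candidates returned by the kNN queries
print(protein_candidates(hits, proteins))
```

Only the proteins returned here would then be scanned sequentially against all captured spectra, which keeps the scan far smaller than the whole database.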

Experiments • preprocessing • clustering is suitable when spectra from at least 2 spectrometer runs are merged together (unification of query sets)

Experiments • postprocessing • kNN queries on the NM-tree are approximate, so the faster the search, the lower the number of identified peptide sequences • since protein sequences contain many peptide sequences, kNN queries are followed by a sequential scan of the protein sequence candidates, which increases the number of identified sequences

Experiments • preprocessing + query phase + postprocessing • ratio of identified spectra to annotated spectra • time of identification per spectrum • [charts; extracted data labels: -0.94%, -3.71%, -4.44%; 4.4x, 4.0x, 100.6x, 25.1x]

Conclusions • clustering of spectra improves the efficiency of our method • speed-up of more than 100x (with respect to a sequential scan of the entire database without clustering) • ratio of identified peptide sequences over 90% • query sets must be merged from >= 2 spectrometer runs • a sequential scan of protein sequence candidates must be used as a postprocessing step

Future work • clustering algorithm with lower time complexity • e.g. DENCLUE, O(N log N) • dealing with PTMs when spectra are clustered • embedding of clustering into our demo application SimTandem
