SVM-based techniques for biomarker discovery in proteomic pattern data

Elena Marchiori, Department of Computer Science, Vrije Universiteit Amsterdam

Overview • Variable selection • SVM-based techniques • Application to proteomic pattern data • Results • Conclusion

Variable Selection • Select a small subset of input variables (for example, genes in gene expression data or m/z values in proteomic pattern data) that are used to build a classifier • Advantages: • it is cheaper to measure fewer variables • the resulting classifier is simpler and potentially faster • prediction accuracy may improve by discarding irrelevant variables • identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

Support Vector Machines • Advantages: • maximize the margin between two classes in the feature space characterized by a kernel function • are robust with respect to high input dimension • Disadvantages: • make it difficult to incorporate background knowledge • are sensitive to outliers

Linear Separators • Binary classification: $f(x) = \mathrm{sign}(w^T x + b)$ • Separating hyperplane: $w^T x + b = 0$ • The two classes lie in the half-spaces $w^T x + b > 0$ and $w^T x + b < 0$
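The decision rule on this slide can be written in a few lines. The following is a minimal sketch, not code from the talk: the function names, the example weights, and the convention of assigning points on the hyperplane to the positive class are all assumptions made for illustration.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Return w^T x + b, the signed score of x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1. Points exactly on the hyperplane go to +1 (assumed convention)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical 2-D separator: w = (1, -1), b = 0
w, b = [1.0, -1.0], 0.0
labels = [classify(w, b, x) for x in ([2.0, 1.0], [1.0, 3.0])]
```

The first point has score 2 - 1 = 1 and falls on the positive side; the second has score 1 - 3 = -2 and falls on the negative side.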

SVM: separable classes • Support vectors uniquely characterize the optimal hyperplane • [Figure: optimal hyperplane, support vectors, and margin ρ]
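For reference, the optimal hyperplane for separable classes is usually defined by the standard hard-margin problem below. The slide does not state it, so the notation is an assumption: training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, $n$ training samples, and ρ taken to be the full width of the margin.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w^T x_i + b) \ge 1, \qquad i = 1,\dots,n,
\qquad \text{with margin} \qquad
\rho = \frac{2}{\lVert w \rVert}.
```

The support vectors are the training points for which the constraint holds with equality.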

SVM and outliers • [Figure: data set containing an outlier]
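One common way to see where outliers enter is the standard soft-margin formulation, sketched below. It is not taken from the slide: the slack variables $\xi_i$ and the penalty constant $C$ are assumed notation, with $(x_i, y_i)$, $y_i \in \{-1, +1\}$, denoting the $n$ training samples.

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n.
```

A point far on the wrong side of the hyperplane needs a large $\xi_i$, so a single outlier can contribute heavily to the objective and shift the solution, depending on the choice of $C$.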

SVM-RFE • Based on the decision function of a linear binary classifier • Recursive Feature Elimination (SVM-RFE): • at each iteration: • eliminate the threshold% of variables with the lowest scores • recompute the scores of the remaining variables • SVM-RFE based algorithms: • run SVM-RFE with different thresholds • JOIN: select variables occurring more than cutoff times • ENSEMBLE: take the majority vote of the resulting classifiers

SVM-RFE • I. Guyon et al., Machine Learning 46, 389-422, 2002

SVM-RFE variant • Input: train set, threshold T, number N of variables to be selected • Output: subset of variables of size N • RFE: • Train: run a linear SVM on the train set • Score: generate a sequence of variables ordered with respect to the absolute value of their weight • Eliminate: remove the T% of variables with the smallest absolute weights from the ordered sequence • Repeat (train, score, eliminate) on the train set restricted to the remaining variables until only N variables are left
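The train, score, eliminate loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in the talk: `train_linear_svm` is a hypothetical stand-in (fixed-step subgradient descent on the L2-regularized hinge loss, without a bias term), the threshold is given as a fraction rather than a percentage, at least one variable is dropped per round, and the last round is capped so that exactly N variables remain. The toy data are invented.

```python
from typing import Callable, List, Sequence


def train_linear_svm(X: Sequence[Sequence[float]], y: Sequence[int],
                     lam: float = 0.01, eta: float = 0.01,
                     epochs: int = 100) -> List[float]:
    """Stand-in linear SVM: deterministic subgradient descent on the hinge loss.

    Simplification (assumption): no bias term, fixed step size eta.
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            hinge = yi if margin < 1 else 0.0  # hinge term is active only inside the margin
            w = [wj - eta * (lam * wj - hinge * xj) for wj, xj in zip(w, xi)]
    return w


def svm_rfe(X: Sequence[Sequence[float]], y: Sequence[int],
            threshold: float, n_select: int,
            train: Callable = train_linear_svm) -> List[int]:
    """Return the indices of the n_select variables that survive elimination."""
    remaining = list(range(len(X[0])))  # indices into the original variables
    while len(remaining) > n_select:
        # Train on the data restricted to the remaining variables
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train(X_sub, y)
        # Score: order positions by absolute weight, smallest first
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]))
        # Eliminate: drop the given fraction, at least one, never going below n_select
        n_drop = max(1, int(threshold * len(remaining)))
        n_drop = min(n_drop, len(remaining) - n_select)
        dropped = set(order[:n_drop])
        remaining = [j for k, j in enumerate(remaining) if k not in dropped]
    return remaining


# Invented toy data: variable 0 is strongly informative, variable 1 weakly,
# variable 2 is constant, variable 3 is low-amplitude noise.
y = [1, -1, 1, -1, 1, -1]
X = [
    [2.0, 1.0, 0.0, 0.1],
    [-2.0, -1.0, 0.0, 0.1],
    [2.0, 1.0, 0.0, -0.1],
    [-2.0, -1.0, 0.0, -0.1],
    [2.0, 1.0, 0.0, 0.1],
    [-2.0, -1.0, 0.0, -0.1],
]
selected = svm_rfe(X, y, threshold=0.5, n_select=1)
```

Passing the trainer as a parameter keeps the elimination loop independent of the SVM solver, so a library implementation could be substituted without changing `svm_rfe`.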

JOIN and ENSEMBLE SVM-RFE
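The two combination schemes can be sketched as below. This is an illustrative reading of the slide text, not the authors' code: the function names are hypothetical, JOIN keeps the variables that occur in more than `cutoff` of the SVM-RFE runs, ENSEMBLE takes a majority vote over +1/-1 predictions, and breaking ties in favour of +1 is an assumption. The variable indices in the example are invented.

```python
from collections import Counter
from typing import List, Sequence


def join_select(selections: Sequence[Sequence[int]], cutoff: int) -> List[int]:
    """JOIN: keep variables selected in more than `cutoff` of the SVM-RFE runs."""
    counts = Counter(v for sel in selections for v in set(sel))
    return sorted(v for v, c in counts.items() if c > cutoff)


def ensemble_predict(predictions: Sequence[int]) -> int:
    """ENSEMBLE: majority vote over +1/-1 predictions (ties go to +1, an assumption)."""
    return 1 if sum(predictions) >= 0 else -1


# Invented variable subsets, one per SVM-RFE run with a different threshold
selections = [[3, 7, 9], [3, 9, 12], [3, 8, 9]]
joined = join_select(selections, cutoff=2)
vote = ensemble_predict([1, -1, 1])
```

Here variables 3 and 9 appear in all three runs, so they are the only ones that pass a cutoff of 2.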

Case Study: proteomic pattern data • Petricoin et al. papers • Commercial analysis software (Proteome Quest): http://www.correlogic.com/ • Data sets available at: http://ncifdaproteomics.com/ppatterns.php

Data generation: SELDI-TOF MS (surface-enhanced laser desorption/ionization time-of-flight mass spectrometry) • Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins. • The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight” because small proteins fly faster than heavy ones. • Steps: 1. Serum on protein-binding plate 2. Insert plate in vacuum chamber 3. Irradiate plate with laser 4. This “launches” the proteins/peptides 5. Measure the “time of flight” (TOF) of the ions, which is related to the molecular weight of the proteins

Example of a proteomic pattern profile from one blood sample • [Figure: spectrum with abundance plotted against time of flight] • Heavier peptides move more slowly, so time of flight corresponds to weight • Weight corresponds to peptides • Measurement of the relative abundance of detected peptides in serum
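The link between time of flight and weight follows from a short energy argument for an idealized linear time-of-flight instrument. The symbols are assumptions introduced here, not taken from the slides: an ion of mass $m$ and charge number $z$ is accelerated through a voltage $U$ and then drifts over a field-free length $L$ at speed $v$; $e$ is the elementary charge.

```latex
z e U = \tfrac{1}{2} m v^2
\quad \Longrightarrow \quad
t = \frac{L}{v} = L \sqrt{\frac{m}{2\, z e U}}
\quad \Longrightarrow \quad
t \propto \sqrt{m/z}.
```

In this idealization the flight time depends only on the ratio m/z, which is why the spectra are indexed by m/z values.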

How to use such data? • Diagnostic tool: • design a classifier for discriminating healthy from diseased samples • Biomarker identification: • Variable subset selection (VSS): select a subset of input variables (m/z values) that best discriminates between the two classes (potential biomarkers)

Commercial Tools • Proteome Quest (Correlogic): GA+clustering, no pre-selection (Petricoin et al., The Lancet 2002) • Propeak (3Z Informatics): separability analysis + bootstrap • Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?

Non-commercial Techniques • Pre-processing + ranking + kNN (Zhu et al., PNAS 2003) • Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002) • Filter FS + classifier (Liu et al., Genome Informatics 2002) • GA + SVM, SVM-RFE ensemble (Jong et al., EvoBIO 2004; Jong et al., CIBCB 2004) • Many others: any ML method for classification/FS (see, e.g., the special issue on FS, JMLR 2003)

Goal and Methods • Goal: analyze the performance of SVM-based techniques for classification and variable selection with proteomic pattern data • SVM • SVM-RFE • Ensemble SVM-RFE: • Majority vote of the classifiers obtained from SVM-RFE with different threshold values • Join SVM-RFE: • SVM trained on the N variables that have been selected most often by SVM-RFE with different threshold values

DataSets • Two proteomic pattern datasets, for prostate and ovarian cancer, from the NCI/CCR and FDA/CBER Clinical Proteomics Program Databank:
Dataset            Total   Cancer   Healthy           m/z values
Prostate           322     69       253               15154
Ovarian 4/03/02    215     100      115 (15 benign)   15154
• Data sets available at: http://ncifdaproteomics.com/ppatterns.php

Experimental Setup • 10 random partitions of the dataset: T (50%), H (25%), V (25%) • Algorithms: • SVM trained on the union of T and H • SVM-RFE(threshold) with thresholds = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 • Choose the threshold giving the best classifier sensitivity on H • JOIN(cutoff, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with cutoffs = 1, 2, 3, 4, 5 • Choose the cutoff giving the best classifier sensitivity on H • Performance: average over the 10 V sets
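The partitioning and the selection criterion can be sketched as follows. This is a minimal illustration under assumptions, not the original experimental code: the function names are hypothetical, the split is a plain random shuffle with no class stratification (the slides do not say whether the partitions were stratified), rounding leftovers go to V, and the positive class is assumed to be labelled +1. Sensitivity is computed as TP / (TP + FN).

```python
import random
from typing import List, Sequence, Tuple


def split_t_h_v(n_samples: int, seed: int) -> Tuple[List[int], List[int], List[int]]:
    """Randomly partition sample indices into T (50%), H (25%), and V (the rest)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_t = n_samples // 2
    n_h = n_samples // 4
    return idx[:n_t], idx[n_t:n_t + n_h], idx[n_t + n_h:]


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Fraction of positive samples that are predicted positive: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Ten random partitions of a dataset with 322 samples (the prostate set size)
partitions = [split_t_h_v(322, seed) for seed in range(10)]
```

With 322 samples this gives 161, 80, and 81 indices for T, H, and V in every partition; only the assignment of samples changes with the seed.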

Results Prostate Dataset

Results Ovarian Dataset

Controversy • Noise, bias, and the reliability and reproducibility of results in serum proteomics: • Sorace, Zhan, BMC Bioinformatics, 2004 • Petricoin, BMC Bioinformatics, 2004 • Baggerly, Journal of the National Cancer Institute, vol. 97, no. 4, 2005 • Liotta, Journal of the National Cancer Institute, vol. 97, no. 4, 2005 • Ransohoff, Journal of the National Cancer Institute, vol. 97, no. 4, 2005

Conclusion • Many machine learning techniques can be used for potential biomarker detection with proteomic pattern data. • SVM-based techniques are a potentially effective choice because of the high input dimension of such data. • Computational analysis of proteomic pattern data has to use a correct methodology that accounts for the biases induced by the selection and classification algorithms and by the data splitting. • Problems related to the reliability and reproducibility of the data are inherent to the laboratory technology and are currently being addressed by researchers and practitioners.

Acknowledgments • Connie Jimenez (Biology, VUMC) • Aad van der Vaart (Statistics, VUA)
