
SVM-based techniques for biomarker discovery in proteomic pattern data

Elena Marchiori, Department of Computer Science, Vrije Universiteit Amsterdam. Overview: variable selection; SVM-based techniques; application to proteomic pattern data; results; conclusion.


Presentation Transcript


  1. SVM-based techniques for biomarker discovery in proteomic pattern data • Elena Marchiori, Department of Computer Science, Vrije Universiteit Amsterdam

  2. Overview • Variable selection • SVM-based techniques • Application to proteomic pattern data • Results • Conclusion

  3. Variable Selection • Select a small subset of input variables (for example, genes in gene expression data or m/z values in proteomic pattern data) to be used for building a classifier • Advantages: it is cheaper to measure fewer variables; the resulting classifier is simpler and potentially faster; prediction accuracy may improve when irrelevant variables are discarded; identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

  4. Support Vector Machines • Advantages: they maximize the margin between two classes in the feature space characterized by a kernel function; they are robust with respect to high input dimension • Disadvantages: it is difficult to incorporate background knowledge; they are sensitive to outliers

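  5. Binary classification • Decision function: $f(x) = \mathrm{sign}(w^T x + b)$ • Separating hyper-plane: $w^T x + b = 0$, with $w^T x + b > 0$ on one side and $w^T x + b < 0$ on the other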
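The decision rule on this slide can be sketched in a few lines of Python. This is an illustration only: the function name `predict`, the toy weights, and the choice to assign a score of exactly zero to the class +1 are assumptions, not part of the slides.

```python
def predict(w, b, x):
    """Return +1 or -1 according to f(x) = sign(w^T x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Assumption: points lying exactly on the hyper-plane (score == 0) go to +1.
    return 1 if score >= 0 else -1


# Toy hyper-plane x1 - x2 + 0.5 = 0 (hypothetical values).
w, b = [1.0, -1.0], 0.5
print(predict(w, b, [2.0, 1.0]))  # score = 1.5, positive side
print(predict(w, b, [0.0, 3.0]))  # score = -2.5, negative side
```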

  6. Linear Separators

  7. SVM: separable classes • Support vectors uniquely characterize the optimal hyper-plane [Figure labels: margin ρ, optimal hyper-plane, support vector]
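For reference, the standard hard-margin formulation behind this picture can be written as follows. The training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ and the normalization of the constraints are the usual textbook conventions, assumed here rather than taken from the slide.

```latex
\min_{w,\,b} \; \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w^{T} x_i + b) \ge 1, \qquad i = 1, \dots, n,
\qquad \text{so that} \qquad
\rho = \frac{2}{\lVert w \rVert}.
```

Under this normalization the support vectors are the training points for which the constraint holds with equality.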

  8. SVM and outliers [Figure label: outlier]

  9. SVM-RFE • Uses the decision function of a linear binary classifier • Recursive Feature Elimination (SVM-RFE), at each iteration: eliminate the threshold % of variables with the lowest scores; recompute the scores of the remaining variables • SVM-RFE based algorithms: run SVM-RFE with different thresholds, then either JOIN (select the variables occurring more than cutoff times) or ENSEMBLE (take the majority vote of the resulting classifiers)

  10. SVM-RFE • I. Guyon et al., Machine Learning, 46, 389-422, 2002

  11. SVM-RFE variant • Input: train set, threshold T, number N of variables to be selected • Output: subset of variables of size N • RFE: Train (run a linear SVM on the train set); Score (generate a sequence of variables ordered by the absolute value of their weight); Eliminate (remove T % of the variables from the ordered sequence, starting from the lowest-ranked); Repeat (train, score, eliminate) on the train set restricted to the remaining variables until only N variables are left
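The train/score/eliminate loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the linear SVM is a simple Pegasos-style stochastic subgradient trainer without a bias term (a stand-in for whatever solver was actually used), the threshold is given as a fraction (0.5 meaning 50%), and the names `train_linear_svm` and `svm_rfe` are hypothetical.

```python
import random


def train_linear_svm(X, y, epochs=50, lam=0.01, seed=0):
    """Pegasos-style subgradient descent for a linear SVM (no bias term)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one shuffled pass over the data
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def svm_rfe(X, y, threshold, n_select, train=train_linear_svm):
    """Return the indices of n_select variables chosen by recursive elimination."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        # Train on the data restricted to the remaining variables.
        X_r = [[row[j] for j in remaining] for row in X]
        w = train(X_r, y)
        # Score: order variables by decreasing absolute weight.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]), reverse=True)
        # Eliminate a fraction `threshold` of the variables (at least one),
        # but never go below n_select (an assumption about the stopping rule).
        n_drop = max(1, int(threshold * len(remaining)))
        n_keep = max(n_select, len(remaining) - n_drop)
        remaining = sorted(remaining[k] for k in order[:n_keep])
    return remaining


# Toy data: variables 0 and 1 carry the label, variables 2-5 are small noise.
rng = random.Random(1)
y = [1 if i % 2 == 0 else -1 for i in range(20)]
X = [[2.0 * yi, 1.5 * yi] + [rng.uniform(-0.1, 0.1) for _ in range(4)] for yi in y]
print(svm_rfe(X, y, threshold=0.5, n_select=2))
```

On this toy data the two informative variables should survive the elimination. With 15154 m/z values a pure-Python trainer would be slow, so a real experiment would pass an optimized solver as the `train` argument.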

  12. JOIN and ENSEMBLE SVM-RFE
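The two ways of combining several SVM-RFE runs can be sketched as below. The function names, the toy subsets, and the toy classifiers are hypothetical; "more than cutoff times" is read literally as a strict inequality, and breaking voting ties in favour of +1 is an assumption.

```python
from collections import Counter


def join_select(subsets, cutoff):
    """JOIN: variables that occur in more than `cutoff` of the given subsets."""
    counts = Counter(v for s in subsets for v in set(s))
    return sorted(v for v, c in counts.items() if c > cutoff)


def ensemble_predict(classifiers, x):
    """ENSEMBLE: majority vote of classifiers that each return +1 or -1."""
    total = sum(clf(x) for clf in classifiers)
    # Assumption: ties are resolved in favour of +1.
    return 1 if total >= 0 else -1


# Hypothetical variable subsets from three SVM-RFE runs with different thresholds.
subsets = [[3, 7, 9], [3, 9, 12], [3, 7, 20]]
print(join_select(subsets, cutoff=1))  # variables selected at least twice

# Three toy classifiers standing in for the SVMs trained on each subset.
classifiers = [lambda x: 1, lambda x: -1, lambda x: 1 if x[0] > 0 else -1]
print(ensemble_predict(classifiers, [0.5]))
```

In the JOIN variant a single SVM is then trained on the joined variables, whereas ENSEMBLE keeps one classifier per run and combines only their predictions.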

  13. Case Study: proteomic pattern data • Papers by Petricoin et al. • Commercial analysis software (Proteome Quest): http://www.correlogic.com/ • Data sets available at: http://ncifdaproteomics.com/ppatterns.php

  14. Data generation: SELDI-TOF MS (surface-enhanced laser desorption/ionization time-of-flight mass spectrometry) • Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins • The readout is a spectrum of peaks; the position of a protein in the spectrum corresponds to its “time of flight”, because small proteins fly faster than heavy ones • Steps: (1) place serum on a protein-binding plate; (2) insert the plate in a vacuum chamber; (3) irradiate the plate with a laser; (4) this “launches” the proteins/peptides; (5) measure the “time of flight” (TOF) of the ions, which is related to the molecular weight of the proteins

  15. Example of proteomic pattern profile from one blood sample [Figure: abundance versus time of flight] • Heavier peptides move more slowly, so time of flight corresponds to weight • Weight corresponds to peptides • The profile measures the relative abundance of the peptides detected in serum
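The link between flight time and mass can be made explicit with the idealized relation for a linear time-of-flight analyzer. The symbols below (drift length $L$, accelerating voltage $U$, elementary charge $e$, charge number $z$, ion mass $m$, ion speed $v$) are standard physics notation introduced here for illustration. The slides do not give this formula, and real instruments rely on calibration rather than on the ideal relation.

```latex
z e U = \tfrac{1}{2}\, m v^{2}
\quad \Longrightarrow \quad
t = \frac{L}{v} = L \sqrt{\frac{m}{2\, z\, e\, U}}
\quad \Longrightarrow \quad
\frac{m}{z} \propto t^{2}.
```

This is why the time axis of a spectrum can be converted into the m/z values that serve as input variables.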

  16. How to use such data? • Diagnostic tool: design a classifier for discriminating healthy from disease samples • Biomarker identification: variable subset selection (VSS), that is, select a subset of input variables (m/z values) that best discriminates the two classes (potential biomarkers)

  17. Commercial Tools • Proteome Quest (Correlogic): GA+clustering, no pre-selection (Petricoin et al., The Lancet 2002) • Propeak (3Z Informatics): separability analysis + bootstrap • Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?

  18. Non-commercial Techniques • Pre-processing + ranking + kNN (Zhu et al., PNAS 2003) • Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002) • Filter FS + classifier (Liu et al., Genome Informatics 2002) • GA + SVM, SVM-RFE ensemble (Jong et al., EvoBIO 2004, Jong et al. CIBCB 2004) • Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)

  19. Goal and Methods • Goal: analyze the performance of SVM-based techniques for classification and variable selection with proteomic pattern data • Methods: SVM; SVM-RFE; Ensemble SVM-RFE (majority vote of the classifiers obtained from SVM-RFE with different threshold values); Join SVM-RFE (SVM trained on the N variables selected most often by SVM-RFE with different threshold values)

  20. DataSets • Two proteomic pattern datasets, for prostate and ovarian cancer, from the NCI/CCR and FDA/CBER Clinical Proteomics Program Databank:

      Dataset            Total   Cancer   Healthy           m/z values
      Prostate           322     69       253               15154
      Ovarian 4/03/02    215     100      115 (15 benign)   15154

      Data sets available at: http://ncifdaproteomics.com/ppatterns.php

  21. Experimental Setup • 10 random partitions of the dataset: T (50%), H (25%), V (25%) • Algorithms: SVM trained on the union of T and H; SVM-RFE(threshold) with thresholds = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, choosing the threshold that gives the best classifier sensitivity on H; JOIN(cutoff, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with cutoffs = 1, 2, 3, 4, 5, choosing the cutoff that gives the best classifier sensitivity on H • Performance: average over the 10 V sets
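The partitioning scheme and the selection criterion can be sketched as follows. The names `split_thv` and `sensitivity` are hypothetical, assigning the rounding remainder to V is an assumption, and the slides do not say whether the partitions were stratified by class, so this sketch does not stratify.

```python
import random


def split_thv(n_samples, seed):
    """Randomly partition sample indices into T (50%), H (25%), V (25%)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_t = n_samples // 2
    n_h = n_samples // 4
    # Assumption: any remainder from integer division goes to V.
    return idx[:n_t], idx[n_t:n_t + n_h], idx[n_t + n_h:]


def sensitivity(y_true, y_pred, positive=1):
    """Fraction of positive (disease) samples that are predicted positive."""
    preds = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds:
        return 0.0
    return sum(p == positive for p in preds) / len(preds)


# Ten random partitions of a dataset with 322 samples (the prostate set size).
partitions = [split_thv(322, seed) for seed in range(10)]
T, H, V = partitions[0]
print(len(T), len(H), len(V))
```

For each partition, the threshold (or cutoff) with the highest `sensitivity` on H would be kept, and the reported figure is the average of the resulting performance over the ten V sets.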

  22. Results Prostate Dataset

  23. Results Ovarian Dataset

  24. Controversy • Noise, bias, reliability, and reproducibility of results in serum proteomics: Sorace, Zhan, BMC Bioinformatics, 2004; Petricoin, BMC Bioinformatics, 2004; Baggerly, Journal of the National Cancer Institute, vol. 97, no. 4, 2005; Liotta, Journal of the National Cancer Institute, vol. 97, no. 4, 2005; Ransohoff, Journal of the National Cancer Institute, vol. 97, no. 4, 2005

  25. Conclusion • Many machine learning techniques can be used for potential biomarker detection with proteomic pattern data. • SVM-based techniques are a potentially effective choice because of the high input dimension of such data. • Computational analysis of proteomic pattern data has to follow a sound methodology that accounts for the biases induced by the selection and classification algorithms and by the data splitting. • Problems with the reliability and reproducibility of the data are inherent to the laboratory technology and are currently being addressed by researchers and practitioners.

  26. Acknowledgments • Connie Jimenez (Biology, VUMC) • Aad van der Vaart (Statistics, VUA)
