Machine Learning techniques for biomarker discovery in proteomic pattern data

Machine Learning techniques for biomarker discovery in proteomic pattern data. Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam. Overview. Proteomic pattern data How to use the data Approaches Methodology Case study Conclusion.

Presentation Transcript


  1. Machine Learning techniques for biomarker discovery in proteomic pattern data Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam

  2. Overview • Proteomic pattern data • How to use the data • Approaches • Methodology • Case study • Conclusion

  3. SELDI-TOF MS: Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry • Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins. • The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight”, because small proteins fly faster than heavy ones. • Steps: (1) put serum on a protein-binding plate; (2) insert the plate in a vacuum chamber; (3) irradiate the plate with a laser; (4) this “launches” the proteins / peptides; (5) measure the “time of flight” (TOF) of the ions, which corresponds to the molecular weights of the proteins.

  4. Example • [Figure: example spectrum, abundance vs. time of flight] • Heavier peptides move slower, so time of flight corresponds to weight • Weight corresponds to peptides • The spectrum measures the relative abundance of detected proteins in serum
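
The statement that time of flight corresponds to weight can be made concrete with the textbook relation for an idealised linear TOF analyser. This is an assumption about the instrument, not something stated on the slides; the symbols are introduced here for illustration only: an ion of mass m and charge ze is accelerated through a potential U and then drifts over a field-free length L with velocity v.

```latex
z e U = \tfrac{1}{2} m v^{2}
\quad\Longrightarrow\quad
t = \frac{L}{v} = L \sqrt{\frac{m}{2\, z e U}}
\;\propto\; \sqrt{\frac{m}{z}}
```

Under these assumptions the flight time grows with the square root of the mass-to-charge ratio, which is why heavier peptides arrive later in the spectrum.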

  5. How to use the data? • Diagnostic tool: • design a classifier for discriminating healthy from diseased samples • Biomarker identification: • Feature selection (FS): select features (peptides / proteins) that best discriminate the two classes (potential biomarkers)

  6. Classification / FS • diagnostic tool => classifier • train a classifier that separates the two classes of diseased and healthy examples • biomarkers => feature subset selection • for a given type of classifier (e.g. KNN, SVM) find a small set of features that optimizes the performance of the classifier when restricted to the selected features • for a given clustering algorithm find a small set of features that maximizes the coherence of class labels of examples in the clusters (Petricoin et al, The Lancet 2002)

  7. Approaches: Commercial • Proteome Quest (Correlogic): GA+clustering, no pre-selection (Petricoin et al., The Lancet 2002) • Propeak (3Z Informatics): separability analysis + bootstrap • Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?

  8. Approaches: Non-commercial • Pre-processing + ranking + kNN (Zhu et al., PNAS 2003) • Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002) • Filter FS + classifier (Liu et al., Genome Informatics 2002) • GA + SVM (Jong et al., EvoBIO 2004) • Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)

  9. SVM-based methods • Linear Support Vector Machine

  10. GA_SVM • Training set T = T_1 ∪ T_2. • A genetic algorithm evolves a number of populations. Each population consists of sets of features of a given size. The fitness of an individual of the population is based on the performance of an SVM: the SVM is trained on T_1 using only the features of the individual, and the fitness is the SVM error over T_2. • At each generation new individuals are created and inserted into the population by selecting fit parents, which are mutated and recombined. • Individuals may migrate to neighbor populations.
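
A minimal sketch of the evolutionary loop described on this slide, under several simplifying assumptions: a single population (no migration between neighbor populations), truncation selection, and a pluggable `fitness` callable. In the setting of the slide, `fitness` would train an SVM on T_1 restricted to the individual's features and return its error over T_2; here a toy stand-in error (`toy_error`, hypothetical) is used so that the block stays self-contained. All function and parameter names are illustrative and not taken from the original work.

```python
import random

def evolve_feature_sets(n_features, subset_size, fitness, pop_size=20,
                        generations=30, seed=0):
    """Minimise `fitness` over feature subsets of a fixed size."""
    rng = random.Random(seed)

    def random_subset():
        return tuple(sorted(rng.sample(range(n_features), subset_size)))

    def mutate(ind):
        # Swap one selected feature for one that is not in the remaining set.
        out = set(ind)
        out.discard(rng.choice(ind))
        out.add(rng.choice([f for f in range(n_features) if f not in out]))
        return tuple(sorted(out))

    def recombine(a, b):
        # The child draws its features from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, subset_size)))

    population = [random_subset() for _ in range(pop_size)]
    history = []  # best error seen at the start of each generation
    for _ in range(generations):
        scored = sorted(population, key=fitness)
        history.append(fitness(scored[0]))
        parents = scored[: pop_size // 2]            # keep the fitter half
        children = [mutate(recombine(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children              # parents survive (elitism)
    return min(population, key=fitness), history

# Toy stand-in for "SVM error over T_2": fraction of informative features missed.
informative = {3, 7, 11}

def toy_error(ind):
    return len(informative - set(ind)) / len(informative)

best, history = evolve_feature_sets(n_features=30, subset_size=3,
                                    fitness=toy_error)
```

Because the fitter half of each generation is carried over unchanged, the best error in `history` can never get worse from one generation to the next; the variability between runs mentioned later in the talk would show up here as different `best` sets for different seeds.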

  11. Ensemble SVM-RFE • SVM-RFE(cutoff, training set T = T_1 ∪ T_2): • Train a linear soft-margin SVM (C, class label penalties) on T_1 • Order features using the weights of the resulting classifier • Eliminate features with weight smaller than the cutoff • Repeat the process with T_1 restricted to the remaining features • This algorithm generates a chain of feature sets F_1 ⊃ F_2 ⊃ … ⊃ F_k. SVM-RFE selects from {F_1, …, F_k} the set F* that minimizes the error over T_2 of the classifier restricted to the feature set, plus a term penalizing large feature sets. • We proposed a variant of this FS algorithm that uses ensembles of results of SVM-RFE over different cutoff values.
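
The single SVM-RFE run described above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation, and it rests on assumptions: the linear SVM is trained with plain batch subgradient descent on the hinge loss, the cutoff is interpreted relative to the largest absolute weight, the weakest feature is dropped when nothing falls below the cutoff, and the size penalty is a linear term `size_penalty * |F|`. The ensemble over several cutoff values is not shown, and the tiny data set is synthetic.

```python
def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Batch subgradient descent on 0.5*||w||^2 + C * sum of hinge losses."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0                 # gradient of the regulariser
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                    # example violates the margin
                for j in range(d):
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def error(w, b, X, y):
    wrong = sum(1 for xi, yi in zip(X, y)
                if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0)
    return wrong / len(y)

def restrict(X, feats):
    return [[row[j] for j in feats] for row in X]

def svm_rfe(X1, y1, X2, y2, cutoff=0.5, size_penalty=0.01):
    """Return (selected feature set F*, chain F_1, ..., F_k)."""
    feats = list(range(len(X1[0])))
    chain = []
    while True:
        w, b = train_linear_svm(restrict(X1, feats), y1)      # train on T_1
        score = error(w, b, restrict(X2, feats), y2) + size_penalty * len(feats)
        chain.append((score, list(feats)))                    # error on T_2
        if len(feats) == 1:
            break
        top = max(abs(wj) for wj in w)
        keep = [f for f, wj in zip(feats, w) if abs(wj) >= cutoff * top]
        if len(keep) == len(feats):           # nothing below cutoff: drop weakest
            weakest = min(range(len(w)), key=lambda j: abs(w[j]))
            keep = [f for j, f in enumerate(feats) if j != weakest]
        feats = keep
    best = min(chain, key=lambda t: t[0])
    return best[1], [f for _, f in chain]

# Synthetic data: feature 0 is strongly informative, feature 3 weakly,
# features 1 and 2 are uncorrelated with the label (labels are +1 / -1).
y1 = [1, 1, 1, 1, -1, -1, -1, -1]
noise_a = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
noise_b = [0.2, 0.2, -0.2, -0.2, 0.2, 0.2, -0.2, -0.2]
X1 = [[2.0 * y, a, b, 0.3 * y] for y, a, b in zip(y1, noise_a, noise_b)]
y2 = [1, -1, 1, -1]
X2 = [[2.0 * y, 0.1, -0.2, 0.3 * y] for y in y2]

best, chain = svm_rfe(X1, y1, X2, y2, cutoff=0.1)
```

On this toy data the chain shrinks from all four features to the strong and weak informative features and finally to the strong one alone; since every set classifies T_2 perfectly, the size penalty decides in favour of the smallest set.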

  12. Methodology • Cross Validation • split the data randomly into a training set and a test set • apply the classification/FS method to the training set only • use the test set only to assess the performance of the method • repeat the process a number of times to analyze the bias induced by the data splitting
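
The protocol above can be sketched as a repeated hold-out loop in which feature selection is redone inside every split, so the test rows never influence which features are chosen. The concrete choices below are stand-ins made for brevity, not the methods used in the talk: a simple mean-difference score for ranking, a nearest-centroid classifier instead of an SVM, a class-stratified split, and synthetic data. All names are hypothetical.

```python
import random
from statistics import mean, pstdev

def rank_features(X, y, k):
    """Top-k features by |difference of class means| / spread (given rows only)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-9
        scores.append((abs(mean(pos) - mean(neg)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def centroid_classifier(X, y, feats):
    """Stand-in classifier: assign a sample to the nearest class centroid."""
    cents = {}
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        cents[lab] = [mean(r[j] for r in rows) for j in feats]

    def predict(row):
        dist = {lab: sum((row[j] - c) ** 2 for j, c in zip(feats, cent))
                for lab, cent in cents.items()}
        return min(dist, key=dist.get)
    return predict

def stratified_split(y, test_fraction, rng):
    """Random split that keeps both classes in the training and test parts."""
    test = []
    for lab in sorted(set(y)):
        members = [i for i, l in enumerate(y) if l == lab]
        rng.shuffle(members)
        test += members[: max(1, int(test_fraction * len(members)))]
    held = set(test)
    return [i for i in range(len(y)) if i not in held], test

def repeated_holdout(X, y, k=2, repeats=20, test_fraction=0.3, seed=0):
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        train, test = stratified_split(y, test_fraction, rng)
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        feats = rank_features(Xtr, ytr, k)        # FS sees training rows only
        predict = centroid_classifier(Xtr, ytr, feats)
        errors.append(sum(predict(X[i]) != y[i] for i in test) / len(test))
    return mean(errors), pstdev(errors)           # average error and its spread

# Synthetic data: 40 samples, 10 features, only feature 0 carries signal.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[6.0 * lab + data_rng.gauss(0, 1)] + [data_rng.gauss(0, 1) for _ in range(9)]
     for lab in y]
mean_err, sd_err = repeated_holdout(X, y)
```

Reporting the spread over the repeated splits next to the mean error is what makes the variability due to data splitting visible.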

  13. About Methodology • Examples of recent papers that do NOT use a correct methodology: • Qu et al. (Clin. Chem. 2002): perform feature pre-selection before application of CV • Villanueva et al. (Anal. Chem. 2004): use the entire dataset for feature ranking • Petricoin et al. (The Lancet 2002): consider only one data split into train/test set • Papers addressing methodology pitfalls: • Simon et al., J. Natl. Cancer Inst. 2003 • Ambroise and McLachlan, PNAS 2002

  14. Case Study: Data • Used in Petricoin et al. papers • Commercial analysis software (Proteome Quest): http://www.correlogic.com/ • Data sets: http://ncifdaproteomics.com/ppatterns.php • Ovarian data set: • 162 Positive (Cancer), 92 Negative (Healthy) • 15154 Variables (Peptides / Proteins) • Prostate data set: • 69 Positive, 253 Negative • 15154 Variables • number of variables >> number of examples

  15. Preliminary analysis • Few visible differences in means between healthy/cancer groups • But many very low p-values (in particular for the ovarian data, which makes it the easier problem) • [Figure panels: prostate data and ovarian data; difference in means and histogram of p-values]

  16. The Methods • Diagnostic tool: • Support Vector Machine with linear and polynomial kernel • Biomarkers Detection and Diagnostic: • Feature subset selection, using Genetic Algorithms and Support Vector Machine

  17. Diagnostics: Results • Support Vector Machine (SVM) on all features • Linear and quadratic kernel • Evaluation measures: • Error: (fp + fn) / total • Sensitivity: tp / (tp + fn) • Specificity: tn / (fp + tn) • Positive Predictive Value: tp / (tp + fp) • Results seem consistent with the preliminary analysis: ovarian is easier than prostate
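
The four evaluation measures listed on this slide can be computed directly from the confusion counts. The sketch below is illustrative: the function name, the convention that label 1 marks the positive (cancer) class, and the example labels are assumptions, not data from the study.

```python
def diagnostic_measures(y_true, y_pred, positive=1):
    """Return error, sensitivity, specificity and PPV for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # A ratio with an empty denominator is undefined and reported as None.
        return num / den if den else None

    return {
        "error": ratio(fp + fn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
        "ppv": ratio(tp, tp + fp),
    }

# Made-up example: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
m = diagnostic_measures(y_true, y_pred)
# tp = 3, fn = 1, tn = 4, fp = 2
# error = 3/10, sensitivity = 3/4, specificity = 4/6, ppv = 3/5
```

With unbalanced classes such as the prostate data (69 positive, 253 negative), the error alone can look good while sensitivity is poor, which is why the slide reports all four measures.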

  18. Biomarker Detection: Results • [Result panels: linear SVM, prostate data set; quadratic SVM, prostate data set] • Bigger error than SVM on all features (+/- 0.06)

  19. Results of Experiments • Results of experiments with GA-SVM indicate that there is variability due both to the data splitting and to the algorithm. • Different sets of features are obtained at each run; however, there is a group of about 50 features that occur more often over all the runs.

  20. Results of Experiments • Ensemble-RFE-SVM achieves perfect classification on the ovarian dataset, while on the prostate dataset it achieves sensitivity 0.97 (0.04) and specificity 0.89 (0.06). • Ensemble-RFE-SVM outperforms both GA-SVM and the commercial software of Petricoin et al. However, it finds feature sets of larger sizes. • The features provided on the Petricoin et al. website yield poor performance when an SVM is used, showing that performance depends on the type of classifier used.

  21. Diagnostic tool Design • Effective FS algorithms, like ensemble SVM-RFE, have to be enhanced with a user-friendly interface and visualization features in order to become operative in research laboratories and hospitals. • The resulting tools can be used by biologists and pathologists for analyzing their data without need of direct support from CS people.

  22. Conclusion • Many machine learning techniques can be used for the analysis of proteomic pattern data. SVM-based approaches are effective. • Computational analysis of proteomic pattern data has to use a correct methodology that considers the biases induced by the selection and classification algorithms and by the data splitting. • Collaboration: Connie Jimenez, Gus Smit, Kees Jong, Aad van der Vaart
