
Correlation Aware Feature Selection



Presentation Transcript


  1. Correlation Aware Feature Selection Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli http://mpa.itc.it Berlin – 8/10/2005

  2. Overview • On Feature Selection • Correlation Aware Ranking • Synthetic Example

  3. Feature Selection Step-wise variable selection: one feature vs. N features; n* < N effective variables modeling the classification function. [Diagram: the N features are reduced step by step, Step 1 … Step N, over N steps.]

  4. Feature Selection Step-wise selection of the features. [Diagram: at each step, the features are split into ranked features and discarded features.]

  5. Ranking • Classifier-independent filters (ignoring labelling): prefiltering is risky, since you might discard features that turn out to be important. • Ranking induced by a classifier.

  6. Support Vector Machines Classification function: f(x) = sign(w · x + b), the optimal separating hyperplane.

  7. The classification/ranking machine • The RFE idea: given N features (genes) • Train an SVM • Compute a cost function J from the weight coefficients of the SVM • Rank features in terms of their contribution to J • Discard the feature contributing least to J • Reapply the procedure on the remaining N-1 features. This is called Recursive Feature Elimination (RFE). Features are ranked according to their contribution to the classification, given the training data. Time- and data-consuming, and at risk of selection bias. Guyon et al. 2002
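A minimal sketch of the RFE loop described in this slide, assuming a linear-kernel SVM (here scikit-learn's SVC) and the squared weight coefficients as the per-feature contribution to J; the function and variable names are illustrative, not taken from the original software.

```python
# Minimal sketch of SVM-RFE (Guyon et al. 2002), assuming scikit-learn.
# Names (X, y, n_keep) are illustrative, not from the original slides.
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=1):
    """Return surviving feature indices and the elimination order."""
    remaining = list(range(X.shape[1]))
    eliminated = []                        # features in order of elimination
    while len(remaining) > n_keep:
        svm = SVC(kernel="linear").fit(X[:, remaining], y)
        w = svm.coef_.ravel()              # weights of the separating hyperplane
        J = w ** 2                         # per-feature contribution to the cost
        worst = int(np.argmin(J))          # feature contributing least to J
        eliminated.append(remaining.pop(worst))
        # reapply the procedure on the remaining N-1 features
    return remaining, eliminated
```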

  8. RFE-based Methods Eliminating chunks of features at a time: • Parametric • Sqrt(N)-RFE • Bisection-RFE • Non-parametric • E-RFE (adapting to the weight distribution): thresholding weights to a value w*
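A sketch of one E-RFE elimination step, where every feature whose weight falls below a threshold w* is dropped in a single chunk; how w* adapts to the weight distribution is not specified here, so the mean-based rule below is only an illustrative assumption.

```python
# Sketch of one E-RFE elimination step: all features below w* are dropped together.
# Setting w* as a fraction of the mean absolute weight is an illustrative assumption.
import numpy as np

def erfe_step(weights, remaining, frac=0.5):
    """Split the surviving features into those kept and those dropped at this step."""
    w_abs = np.abs(np.asarray(weights))
    w_star = frac * w_abs.mean()                     # adaptive threshold w*
    keep = [f for f, w in zip(remaining, w_abs) if w >= w_star]
    drop = [f for f, w in zip(remaining, w_abs) if w < w_star]
    return keep, drop
```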

  9. Variable Elimination Correlated genes. Given a family F = {x1, x2, …, xH} of pairwise correlated features (correlation above a given threshold T). Each single weight is negligible, w(x1) ≈ w(x2) ≈ … ≈ ε < w*, BUT w(x1) + w(x2) + … >> w*.
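A sketch of the kind of correction this observation suggests: before discarding a low-weight feature, check whether it belongs to a family F of correlated features whose summed weight exceeds w*. The correlation threshold T and the rescue policy below are assumptions for illustration, not the exact algorithm of the slides.

```python
# Sketch of a correlation-aware check at one elimination step. The threshold T
# and the "rescue the whole family" policy are illustrative assumptions.
import numpy as np

def correlation_correction(X, weights, below_threshold, w_star, T=0.9):
    """Return indices in `below_threshold` rescued because their family weight exceeds w*."""
    corr = np.abs(np.corrcoef(X, rowvar=False))      # feature-feature correlation matrix
    rescued = []
    for i in below_threshold:
        family = np.where(corr[i] > T)[0]            # F = {x1, ..., xH}, pairwise corr > T
        if weights[family].sum() > w_star:           # each weight ~ eps, but the sum is large
            rescued.append(i)
    return rescued
```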

  10. Correlated Genes (1)

  11. Correlated Genes (2)

  12. Synthetic Data Binary problem: 100 (50 + 50) samples of 1000 genes:
  • genes 1–50: randomly extracted from N(1,1) and N(-1,1) for the two classes, respectively
  • genes 51–100: randomly extracted from N(1,1) and N(-1,1) respectively (1 feature repeated 50 times)
  • genes 101–1000: extracted from Unif(-4,4)
  Class 1: 50 samples, Class 2: 50 samples; 51 significant features.
  [Diagram: gene layout 1–1000 showing the N(1,1)/N(-1,1) block, the single repeated feature, and the Unif(-4,4) block.]
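A sketch reproducing the synthetic dataset as described above, assuming the two classes differ only in the mean of the informative genes and that genes 51–100 are copies of a single informative draw; the random seed and variable names are arbitrary.

```python
# Sketch of the synthetic dataset: 100 samples x 1000 genes, two balanced classes.
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n_genes = 50, 1000
y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]            # class labels +1 / -1

X = rng.uniform(-4, 4, size=(2 * n_per_class, n_genes))           # genes 101-1000: Unif(-4,4) noise
X[:, :50] = rng.normal(loc=y[:, None], scale=1, size=(100, 50))   # genes 1-50: N(+1,1) / N(-1,1) by class
repeated = rng.normal(loc=y, scale=1)                              # one informative feature...
X[:, 50:100] = np.tile(repeated[:, None], 50)                      # ...repeated 50 times as genes 51-100
```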

  13. Our algorithm step j

  14. Methodology • Implemented within the BioDCV system (50 replicates) • Realized through R - C code interaction

  15. Synthetic Data [Plot: feature ranking over 50 steps, genes 1–1000.] Gene 100 is consistently ranked 2nd.

  16. Work in Progress • Preservation of highly correlated genes with low initial weights on microarray datasets • Robust correlation measures • Different techniques to detect the F_l families (clustering, gene functions)

  17. Synthetic Data

  18. Synthetic Data Features discarded at step 9 by the E-RFE procedure: 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 227 559 864 470 363 735. Correlation Correction: saves feature 100.

  19. Challenges for predictive profiling • INFRASTRUCTURE • MPACluster -> available for batch jobs • Connecting with IFOM -> 2005 • Running at IFOM -> 2005/2006 • Production on GRID resources (spring 2005) • ALGORITHMS II • Gene list fusion: suite of algebraic/statistical methods • Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis • New SVM kernels for prediction on spectrometry data within complete validation

  20. A few issues in feature selection, with a particular interest in classification of genomic data.
  WHY?
  • To enhance information: highlight (and rank) the most important features and improve the knowledge of the underlying process.
  • To ease the computational burden: discard the (apparently) less significant features and train in a simplified space, alleviating the curse of dimensionality.
  HOW?
  • As a pre-processing step: employ a statistical filter (t-test, S2N) — see the filter sketch below.
  • As a learning step: link the feature ranking to the classification task (wrapper methods, …).
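As referenced above, a sketch of a classifier-independent pre-processing filter based on the signal-to-noise score; keeping a fixed top_k is an illustrative assumption, not a recommendation from the slides.

```python
# Sketch of a simple pre-processing filter (signal-to-noise score per feature).
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise ratio per feature for a binary labelling y in {+1, -1}."""
    pos, neg = X[y == 1], X[y == -1]
    return (pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0))

def prefilter(X, y, top_k=100):
    """Keep the top_k features by absolute S2N score (classifier-independent, ignores the model)."""
    scores = np.abs(s2n_scores(X, y))
    return np.argsort(scores)[::-1][:top_k]
```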

  21. Prefiltering is risky: you might discard features that turn out to be important. Nevertheless, wrapper methods are quite costly. Moreover, in gene expression data you also have to deal with particular situations such as clones or highly correlated features, which may represent a pitfall for several selection methods. A classic alternative is to map into linear combinations of features, and then select: • Principal Component Analysis • Metagenes (a simplified model for pathways, but biological suggestions require caution) • eigen-craters for unexploded-bomb risk maps. But then we are no longer working with the original features.
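A sketch of the "map into linear combinations, then select" alternative, using PCA and ranking components by absolute correlation with the labels; both the number of components and the ranking criterion are illustrative assumptions.

```python
# Sketch: project onto principal components and rank the components instead of
# the original features (so the selected variables are no longer original genes).
import numpy as np
from sklearn.decomposition import PCA

def pca_then_rank(X, y, n_components=20):
    """Rank principal components by absolute correlation with the binary labels."""
    Z = PCA(n_components=n_components).fit_transform(X)        # linear combinations of features
    corr = [abs(np.corrcoef(Z[:, j], y)[0, 1]) for j in range(Z.shape[1])]
    return np.argsort(corr)[::-1]                               # most label-correlated components first
```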

  22. Feature Selection within Complete Validation Experimental Setups Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation: otherwise, selection bias effects … Accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA).
