A Variance Reduction Framework for Stable Feature Selection

Yue Han and Lei Yu
yhan1@binghamton.edu
Binghamton University

Outline
• Introduction, Motivation and Related Work
• Theoretical Framework
• Empirical Framework: Margin Based Instance Weighting
• Empirical Study
  • Synthetic Data
  • Real-world Data
• Conclusion and Future Work

Introduction and Motivation: Feature Selection Applications
Documents × Terms matrix with class column C:
Documents   T1   T2   …   TN   C
D1          12   0    …   6    Sports
D2          3    10   …   28   Travel
…           …    …    …   …    …
DM          0    11   …   16   Jobs
[Figure: image data, pixels vs. features]
[Figure: biological data, samples × features (genes or proteins)]

Motivation: Stability of Feature Selection
[Figure: stability issue of feature selection. Training sets D1, D2, ..., Dn drawn from the data space are fed to a feature selection method, yielding feature subsets R1, R2, ..., Rn. Consistent or not?]
• Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set.
• Stability of feature selection was relatively neglected in the past and has only recently attracted interest from researchers in data mining.
[Figure: stability of a learning algorithm. Training data → learning algorithm → learning models.]

Motivation: Why is Stable Feature Selection Needed?
• Largely different feature subsets can give similarly good learning performance.
• Domain experts (biomedicine and biology) are also interested in biomarkers that are stable and insensitive to data variations.
• An unstable feature selection method dampens confidence in validation and increases the cost of experiments.
[Figure: data variations → stable feature selection method → stable feature subset → learning methods → learning results]
• Expected benefits of a stable feature subset: closer to the characteristic features (biomarkers); better learning performance.

Theoretical Framework: Variance, Bias and Error of Feature Selection
[Figure: training sets D1, D2, ..., Dn drawn from the data space each yield a feature weight vector, which is compared with the true feature weight vector.]
• Variance: fluctuation of the n weight values around their central tendency.
• Bias: loss of the central tendency (average) with respect to the true weight value.
• Error: average loss of the n weight values with respect to the true weight value.
• Challenge: increasing the training sample size could be very costly or impractical.
• How can the underlying data distribution be represented without increasing the sample size?
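
The three quantities above can be estimated from a collection of feature weight vectors. The following is a minimal sketch, assuming squared loss and assuming the true weight vector is known (as it is for synthetic data). The function name `fs_error_decomposition` and the toy data are illustrative and not taken from the slides.

```python
import numpy as np

def fs_error_decomposition(weights, true_weights):
    """Estimate variance, bias and error of feature weighting.

    weights:      (n, d) array, one feature weight vector per training set.
    true_weights: (d,) array, the true feature weight vector.
    Assumes squared loss; each quantity is averaged over the d features.
    """
    weights = np.asarray(weights, dtype=float)
    true_weights = np.asarray(true_weights, dtype=float)
    mean_w = weights.mean(axis=0)                          # central tendency per feature
    variance = ((weights - mean_w) ** 2).mean(axis=0)      # spread around the average
    bias = (mean_w - true_weights) ** 2                    # squared loss of the average
    error = ((weights - true_weights) ** 2).mean(axis=0)   # average squared loss
    return variance.mean(), bias.mean(), error.mean()

# Toy data (hypothetical): 200 noisy, shifted estimates of a 20-feature weight vector.
rng = np.random.default_rng(0)
true_w = rng.normal(size=20)
est_w = true_w + 0.3 + rng.normal(scale=0.5, size=(200, 20))
var, bias, err = fs_error_decomposition(est_w, true_w)
print(f"variance={var:.4f} bias={bias:.4f} error={err:.4f}")
```

Under these assumptions the error equals bias plus variance up to floating-point rounding, because the variance is computed with a divisor of n rather than n - 1.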

Theoretical Framework: Bias-Variance Decomposition of Feature Selection Error
• Notation: training data D drawn from the data space; feature selection result r(D); true feature selection result r*.
• For each individual feature, a weight value is used instead of a 0/1 selection.
• Error, bias and variance are defined per feature; the bias-variance decomposition of feature selection error is averaged over all features.
• The decomposition reveals the relationship between accuracy (the opposite of error) and stability (the opposite of variance).
• It suggests seeking a better trade-off between the bias and variance of feature selection.
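
One way to write the decomposition for a single feature, assuming squared loss, is sketched below. It is consistent with the verbal definitions of variance, bias and error given earlier and with the later observation that error equals the sum of bias and variance. The expectation notation over training sets D is an assumption.

```latex
\begin{align}
\mathrm{Err}(r)  &= \mathbb{E}_{D}\!\left[\big(r(D) - r^{*}\big)^{2}\right] \\
\mathrm{Bias}(r) &= \big(\mathbb{E}_{D}[r(D)] - r^{*}\big)^{2} \\
\mathrm{Var}(r)  &= \mathbb{E}_{D}\!\left[\big(r(D) - \mathbb{E}_{D}[r(D)]\big)^{2}\right] \\
\mathrm{Err}(r)  &= \mathrm{Bias}(r) + \mathrm{Var}(r)
\end{align}
```

The last line follows by expanding the square in the error around the mean of r(D); the cross term vanishes because the deviation of r(D) from its own mean has zero expectation.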

Theoretical Framework: Variance Reduction via Importance Sampling
• Feature selection (weighting) can be viewed as a Monte Carlo estimator.
• Reducing the variance of a Monte Carlo estimator: increasing the sample size is impractical and costly, which motivates importance sampling.
• Intuition behind importance sampling:
  • more instances drawn from important regions;
  • fewer instances drawn from other regions.
• Intuition behind instance weighting:
  • increase weights for instances from important regions;
  • decrease weights for instances from other regions.
• Open questions: How should the instances be weighted? How important is each instance?
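
The variance-reduction effect of importance sampling can be seen on a toy problem unrelated to feature selection. The sketch below estimates the tail probability P(X > 3) for a standard normal X, once by plain Monte Carlo and once by sampling from a proposal centered on the important region and reweighting. The target, the proposal N(3, 1) and all names are assumptions chosen for illustration.

```python
import numpy as np

THRESHOLD = 3.0

def plain_mc(rng, n):
    """Plain Monte Carlo: sample from N(0, 1) and count exceedances."""
    x = rng.normal(0.0, 1.0, size=n)
    return np.mean(x > THRESHOLD)

def importance_sampling(rng, n):
    """Sample from the proposal N(3, 1) and reweight by the density ratio."""
    x = rng.normal(THRESHOLD, 1.0, size=n)
    # Ratio of N(0, 1) to N(3, 1) densities: exp(-x^2/2 + (x-3)^2/2) = exp(4.5 - 3x).
    w = np.exp(4.5 - 3.0 * x)
    return np.mean((x > THRESHOLD) * w)

rng = np.random.default_rng(0)
mc_estimates = np.array([plain_mc(rng, 1000) for _ in range(200)])
is_estimates = np.array([importance_sampling(rng, 1000) for _ in range(200)])

print("plain MC   : mean=%.5f  std=%.5f" % (mc_estimates.mean(), mc_estimates.std()))
print("importance : mean=%.5f  std=%.5f" % (is_estimates.mean(), is_estimates.std()))
```

Both estimators target the same quantity with the same sample size; the importance sampling estimates fluctuate far less across repetitions because most draws land in the region that matters.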

Empirical Framework: Overall Framework
• Challenges:
  • how to produce weights for instances from the point of view of feature selection stability;
  • how to present weighted instances to conventional feature selection algorithms.
• Proposed approach: Margin Based Instance Weighting for Stable Feature Selection

Empirical Framework: Margin Vector Feature Space
[Figure: an instance in the original feature space, together with its nearest hit and nearest miss, is mapped to a point in the margin vector feature space.]
• For each instance, the hypothesis margin (along each dimension) captures the local profile of feature relevance for all features at that instance, based on its nearest hit and nearest miss.
• Instances exhibit different profiles of feature relevance.
• Instances influence feature selection results differently.
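
A minimal sketch of the mapping into the margin vector feature space follows. It assumes a Relief-style hypothesis margin with a single nearest hit and a single nearest miss under L1 distance, so that dimension j of the margin vector is |x_j - miss_j| - |x_j - hit_j|. The exact formula did not survive on the slide, so this form and the function name `margin_vectors` are assumptions.

```python
import numpy as np

def margin_vectors(X, y):
    """Map each instance to its hypothesis-margin vector (assumed form).

    X: (n, d) feature matrix; y: (n,) class labels.
    Requires at least two classes and at least two instances per class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise L1 distances; an instance is never its own neighbor.
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    same = y[:, None] == y[None, :]
    hit_idx = np.where(same, dist, np.inf).argmin(axis=1)    # nearest same-class instance
    miss_idx = np.where(~same, dist, np.inf).argmin(axis=1)  # nearest other-class instance
    # Per-dimension margin: distance to the miss minus distance to the hit.
    return np.abs(X - X[miss_idx]) - np.abs(X - X[hit_idx])

# Toy data (hypothetical): the second feature separates the classes, the first does not.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [1.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])
M_toy = margin_vectors(X_toy, y_toy)
print(M_toy)
```

In this toy case every instance gets a positive margin on the second feature and a negative one on the first, which matches the intuition that the margin vector records a local profile of feature relevance.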

Empirical Framework: An Illustrative Example
[Figure: hypothesis-margin based feature space transformation. (a) Original feature space. (b) Margin vector feature space.]

Empirical Framework: Margin Based Instance Weighting Algorithm
• Review: variance reduction via importance sampling
  • more instances drawn from important regions;
  • fewer instances drawn from other regions.
• Each instance exhibits a different profile of feature relevance and influences feature selection results differently.
• Instance weighting by outlying degree:
  • higher outlying degree, lower weight;
  • lower outlying degree, higher weight.
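
The weighting rule can be sketched as follows. The formulas for the outlying degree and the weight are not recoverable from the slide, so this sketch assumes that the outlying degree of an instance is its average L1 distance to all other margin vectors and that the weight is the normalized inverse of that degree. Both choices, and the name `instance_weights`, are assumptions that only preserve the stated relationship (higher outlying degree, lower weight).

```python
import numpy as np

def instance_weights(M, eps=1e-12):
    """Turn margin vectors into instance weights (assumed formulas).

    M: (n, d) array of margin vectors, one per instance.
    Returns (outlying_degree, weights); weights are positive and sum to 1.
    """
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    dist = np.abs(M[:, None, :] - M[None, :, :]).sum(axis=2)
    outlying = dist.sum(axis=1) / (n - 1)   # average distance to the other instances
    inv = 1.0 / (outlying + eps)            # eps guards against division by zero
    return outlying, inv / inv.sum()

# Toy margin vectors (hypothetical): the last instance is far from the rest.
M_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
outlying_demo, weights_demo = instance_weights(M_demo)
print(outlying_demo)
print(weights_demo)
```

The outlying instance receives the smallest weight, so it contributes least when the weighted data are passed on to a feature selection algorithm.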

Empirical Framework: Algorithm Illustration
• Time complexity analysis:
  • dominated by instance weighting;
  • efficient for high-dimensional data with small sample size (n << d).

Empirical Study: Subset Stability Measures
[Figure: training sets D1, D2, ..., Dn drawn from the data space are fed to a feature selection method, yielding feature subsets R1, R2, ..., Rn. Consistent or not?]
• Average pair-wise similarity of the feature subsets
• Kuncheva index
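
The two measures can be combined as sketched below: the Kuncheva index scores a pair of equal-sized subsets, and stability is the average over all pairs. The index used here is the standard one, (r·m - k²) / (k·(m - k)), where k is the subset size, r the size of the intersection and m the total number of features. The function names are illustrative.

```python
from itertools import combinations

def kuncheva_index(a, b, m):
    """Kuncheva index between two feature subsets of equal size k, with 0 < k < m."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("subsets must have the same size")
    if not 0 < k < m:
        raise ValueError("subset size must satisfy 0 < k < m")
    r = len(a & b)
    # The k*k/m term corrects for the overlap expected by chance.
    return (r * m - k * k) / (k * (m - k))

def average_pairwise_stability(subsets, m):
    """Average Kuncheva index over all pairs of feature subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(kuncheva_index(a, b, m) for a, b in pairs) / len(pairs)

# Toy subsets (hypothetical) selected from m = 10 features.
subsets_demo = [{1, 2, 3}, {1, 2, 3}, {4, 5, 6}]
stability_demo = average_pairwise_stability(subsets_demo, m=10)
print(stability_demo)
```

Identical subsets score 1, and subsets that overlap less than expected by chance score below 0.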

Empirical Study: Experiments on Synthetic Data
• Training data: 500 training sets, each with 100 instances (50 from each of the two distributions).
• Leave-one-out test data: 5000 instances.
• Synthetic data generation:
  • Feature values: two multivariate normal distributions; 100 groups of 10 features each; the covariance matrix is a 10 × 10 square matrix with 1 along the diagonal and 0.8 off the diagonal.
  • Class label: a weighted sum of all feature values with the optimal feature weight vector.
• Method in comparison: SVM-RFE, which recursively eliminates 10% of the features remaining from the previous iteration until 10 features remain.
• Measures: variance, bias and error; subset stability (Kuncheva index); accuracy (SVM).
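
The grouped correlation structure of the feature values can be sketched as follows. Only the covariance structure is taken from the slide (100 groups of 10 features, 1 on the diagonal and 0.8 off the diagonal within a group). Treating the groups as mutually independent, the zero mean, and the function names are assumptions; the class means and the label rule are not reproduced here.

```python
import numpy as np

def group_covariance(group_size=10, rho=0.8):
    """Covariance of one feature group: 1 on the diagonal, rho elsewhere."""
    return np.full((group_size, group_size), rho) + (1.0 - rho) * np.eye(group_size)

def sample_features(rng, n, mean=0.0, n_groups=100, group_size=10, rho=0.8):
    """Draw n instances with n_groups * group_size features.

    Features are correlated within a group and, by assumption, independent
    across groups. `mean` may be a scalar or a vector of length n_groups * group_size.
    """
    cov = group_covariance(group_size, rho)
    # Shape (n, n_groups, group_size): one independent draw per instance and group.
    z = rng.multivariate_normal(np.zeros(group_size), cov, size=(n, n_groups))
    return z.reshape(n, n_groups * group_size) + mean

rng = np.random.default_rng(0)
X_syn = sample_features(rng, n=5000)
corr = np.corrcoef(X_syn[:, :20], rowvar=False)
print("within-group correlation :", round(corr[0, 1], 3))
print("between-group correlation:", round(corr[0, 10], 3))
```

Features 0 to 9 form the first group and features 10 to 19 the second, so the two printed values check the within-group and between-group structure respectively.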

Empirical Study: Experiments on Synthetic Data
• Observations:
  • Error is equal to the sum of bias and variance for both versions of SVM-RFE.
  • Error is dominated by bias during early iterations and by variance during later iterations.
  • IW SVM-RFE exhibits significantly lower bias, variance and error than SVM-RFE when the number of remaining features approaches 50.

Empirical Study: Experiments on Synthetic Data
• Conclusion: variance reduction via margin based instance weighting leads to
  • a better bias-variance trade-off;
  • increased subset stability;
  • improved classification accuracy.

Empirical Study: Experiments on Real-world Data
• Data: microarray data.
• Measures: variance; subset stability; accuracies (KNN, SVM).
• Methods in comparison: SVM-RFE; ensemble SVM-RFE; instance weighting SVM-RFE.
[Figure: 20-ensemble SVM-RFE. Each of 20 bootstrapped training sets yields a feature subset, and the subsets are combined into an aggregated feature subset.]

Empirical Study: Experiments on Real-world Data
• Observations:
  • Non-discriminative among the methods during early iterations.
  • SVM-RFE increases sharply as the number of features approaches 10.
  • IW SVM-RFE shows a significantly slower rate of increase.
• Note: 40 iterations, starting from about 1000 features until 10 features remain.

Empirical Study: Experiments on Real-world Data
• Observations:
  • Both the ensemble and the instance weighting approaches improve stability consistently.
  • The improvement from the ensemble approach is not as significant as that from instance weighting.
  • As the number of features increases, the stability score decreases because of the larger correction factor.

Empirical Study: Experiments on Real-world Data
• Prediction accuracy (via both KNN and SVM): non-discriminative among the three approaches for all data sets.
• Conclusions: instance weighting
  • improves the stability of feature selection without sacrificing prediction accuracy;
  • performs much better than the ensemble approach and is more efficient;
  • leads to significantly increased stability at a slight extra cost in time.

Conclusion and Future Work
• Accomplishments:
  • Established a bias-variance decomposition framework for feature selection.
  • Proposed an empirical framework for stable feature selection.
  • Developed an efficient margin-based instance weighting algorithm.
  • Conducted a comprehensive study on synthetic and real-world data.
• Future work:
  • Extend the current framework to other state-of-the-art feature selection algorithms.
  • Explore the relationship between stable feature selection and classification performance.

Related Work: Stable Feature Selection
• Comparison of feature selection algorithms w.r.t. stability (Davis et al. Bioinformatics, vol. 22, 2006; Kalousis et al. KAIS, vol. 12, 2007)
  • Quantify stability in terms of consistency of the selected subsets or weights.
  • Algorithms vary in stability but perform equally well for classification.
  • Choose the algorithm that is best in both stability and accuracy.
• Bagging-based ensemble feature selection (Saeys et al. ECML07)
  • Draw different bootstrapped samples of the same training set.
  • Apply a conventional feature selection algorithm to each sample.
  • Aggregate the feature selection results.
• Group-based stable feature selection (Yu et al. KDD08; Loscalzo et al. KDD09)
  • Explore the intrinsic feature correlations.
  • Identify groups of correlated features.
  • Select relevant feature groups.

Thank you and Questions?
