Feature Subset Selection with Local Search Gokcen Cilingir
Feature Subset Selection • Definition: • Eliminating redundant and irrelevant features from a given set of features • Process of choosing a minimum subset of M features from the original set of N features (M ≤ N), so that the feature space is optimally reduced according to a certain criterion (Ex: accuracy of the induced classifier is maximal)
Feature Subset Selection • Motivation: • Improving prediction and generalization performance of the learning model by mitigating the curse of dimensionality • Increasing data interpretability • Gaining space and time efficiency in learning
Feature Subset Selection Approaches • Filters • Independence assumption among features • According to a scoring function, features are ranked, and the k highest ranked features are selected. • Independent from the classifier; pre-processing step • Ex: Chi-squared test
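A minimal sketch of the filter idea, assuming binary (0/1) features and binary class labels so that each feature yields a 2x2 contingency table. The function names `chi2_score` and `select_top_k` are illustrative and do not come from the slides; the slides do not say how continuous features were discretized before the chi-squared test.

```python
from typing import List, Sequence


def chi2_score(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Chi-squared statistic of the 2x2 table (binary feature vs. binary label)."""
    n = len(labels)
    # observed[f][c] counts instances with feature value f and class c.
    observed = [[0, 0], [0, 0]]
    for f, c in zip(feature, labels):
        observed[f][c] += 1
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            row_total = observed[f][0] + observed[f][1]
            col_total = observed[0][c] + observed[1][c]
            expected = row_total * col_total / n
            if expected > 0:  # skip empty rows/columns to avoid dividing by zero
                score += (observed[f][c] - expected) ** 2 / expected
    return score


def select_top_k(X: List[List[int]], y: List[int], k: int) -> List[int]:
    """Score every feature independently and return the indices of the k best."""
    n_features = len(X[0])
    scores = [chi2_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: feature 0 copies the label, feature 1 is constant, feature 2 is noise.
X = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 0, 1, 1]
top = select_top_k(X, y, k=1)
```

Because each feature is scored on its own, two highly correlated features can both rank near the top, which is the independence assumption listed on the slide.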
Feature Subset Selection Approaches • Wrappers • Search the space of feature subsets using the prediction performance of a given learning machine as the scoring function • Search strategies: Greedy approach, simulated annealing, genetic algorithms Image source: Ron Kohavi and George H. John, Wrappers for feature subset selection, Artificial Intelligence, 97:273-324, 1997
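The wrapper loop can be sketched with the greedy strategy named above. This is an illustration under stated assumptions, not the setup used in the talk: the learning machine here is a leave-one-out 1-nearest-neighbour classifier chosen only to keep the example self-contained, and the names `loo_1nn_accuracy` and `greedy_forward_selection` are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Score = Callable[[List[List[float]], List[int], Set[int]], float]


def loo_1nn_accuracy(X: List[List[float]], y: List[int], subset: Set[int]) -> float:
    """Leave-one-out accuracy of 1-NN using only the features in `subset`.

    Assumes at least two instances. An empty subset scores 0.0.
    """
    if not subset:
        return 0.0
    cols = sorted(subset)
    correct = 0
    for i in range(len(X)):
        best_j, best_d = -1, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def greedy_forward_selection(
    X: List[List[float]], y: List[int], score: Score
) -> Tuple[List[int], float]:
    """Add the single best feature each round; stop when no addition improves the score."""
    selected: Set[int] = set()
    best = score(X, y, selected)
    n_features = len(X[0])
    while True:
        candidates = [
            (score(X, y, selected | {f}), f)
            for f in range(n_features)
            if f not in selected
        ]
        if not candidates:
            break
        # Ties are broken in favour of the lower feature index.
        cand_score, cand_f = max(candidates, key=lambda t: (t[0], -t[1]))
        if cand_score <= best:
            break
        selected.add(cand_f)
        best = cand_score
    return sorted(selected), best


# Toy data: feature 0 separates the classes, feature 1 is misleading.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 1.1]]
y = [0, 0, 1, 1]
chosen, accuracy = greedy_forward_selection(X, y, loo_1nn_accuracy)
```

Every candidate subset costs one full evaluation of the classifier, which is why wrappers are more expensive than filters and why the choice of search strategy matters.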
Feature Subset Selection Approaches • Embedded methods • Incorporate feature selection with learning, performing selection in the process of training • Implicit feature selection, not a preprocessing step • Ex: Decision trees, boosting
Focus: Local search for feature subset selection following wrapper approach • State definition • A feature subset can be represented as an n-dimensional vector with 0/1 values for absent/present features • Transition model or neighborhood definition • Addition or deletion of a feature (or a number of features) from a subset defines its neighbors • Objective function • Goal test/stopping criteria
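The state and neighborhood definitions above map directly onto a bit vector and single-bit flips. The sketch below pairs them with plain steepest-ascent hill climbing purely for illustration. The slide leaves the objective function and stopping criteria unspecified, so `toy_objective` and the "stop when no neighbor improves" rule are assumptions made here, as are all function names.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # state[i] == 1 means feature i is present


def neighbors(state: State) -> List[State]:
    """All subsets reachable by adding or deleting exactly one feature."""
    return [state[:i] + (1 - state[i],) + state[i + 1:] for i in range(len(state))]


def hill_climb(
    start: State, objective: Callable[[State], float], max_steps: int = 1000
) -> Tuple[State, float]:
    """Move to the best neighbor until no neighbor improves the objective."""
    current, current_value = start, objective(start)
    for _ in range(max_steps):
        candidates = neighbors(current)
        if not candidates:
            break
        best = max(candidates, key=objective)
        best_value = objective(best)
        if best_value <= current_value:
            break  # local optimum reached
        current, current_value = best, best_value
    return current, current_value


def toy_objective(state: State) -> float:
    """Hypothetical stand-in for classifier accuracy.

    Features 0 and 2 are treated as relevant (+2 each); every selected
    feature costs 1, so irrelevant features lower the score.
    """
    return 2 * (state[0] + state[2]) - sum(state)


best_state, best_value = hill_climb((0, 0, 0, 0), toy_objective)
```

In a wrapper setting the objective would be the estimated accuracy of the induced classifier on the selected features, so each call to `objective` means training and evaluating a model.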
Data set • A data set called “Arcene”, published for the Feature Selection Challenge held at the NIPS 2003 Workshop on Feature Extraction • Original data set: mass-spectrometry analysis results from two classes: patients with cancer (ovarian or prostate cancer) and healthy patients • 100 training and 100 validation instances with 10,000 features
Details on GA use • Random initialization • Overlapping populations • Parallel populations with migration allowed • Uniform crossover and point mutation operators are defined • Terminates when a set number of generations pass without finding a fitter individual • Population diversity is monitored, and preventive action against local-optima traps is taken automatically by gradually increasing the mutation rate up to a limit and by changing the crossover selection criteria
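The two genetic operators and the diversity-driven mutation adjustment can be sketched on 0/1 genomes as follows. This is not GAlib code: the function names, the diversity measure (mean normalized pairwise Hamming distance), and the default thresholds are all assumptions chosen for illustration, and the change of crossover selection criteria mentioned on the slide is not modeled.

```python
import random
from typing import List, Tuple

Genome = List[int]  # one 0/1 gene per feature


def uniform_crossover(a: Genome, b: Genome, rng: random.Random) -> Tuple[Genome, Genome]:
    """Swap each gene between the parents independently with probability 0.5."""
    child1, child2 = [], []
    for gene_a, gene_b in zip(a, b):
        if rng.random() < 0.5:
            gene_a, gene_b = gene_b, gene_a
        child1.append(gene_a)
        child2.append(gene_b)
    return child1, child2


def point_mutation(genome: Genome, rate: float, rng: random.Random) -> Genome:
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in genome]


def population_diversity(population: List[Genome]) -> float:
    """Mean pairwise Hamming distance, normalized to [0, 1]. Needs >= 2 genomes."""
    n, length = len(population), len(population[0])
    total = sum(
        sum(x != y for x, y in zip(population[i], population[j]))
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / ((n * (n - 1) // 2) * length)


def adapt_mutation_rate(
    rate: float,
    diversity: float,
    min_diversity: float = 0.1,  # assumed threshold
    step: float = 0.01,          # assumed increment
    limit: float = 0.2,          # assumed upper bound
) -> float:
    """Raise the mutation rate by `step` while diversity is low, capped at `limit`."""
    if diversity < min_diversity:
        return min(rate + step, limit)
    return rate


rng = random.Random(0)
parent_a, parent_b = [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]
child_a, child_b = uniform_crossover(parent_a, parent_b, rng)
```

Uniform crossover suits this encoding because neighboring positions in the feature vector carry no special relationship, so there is no reason to keep contiguous gene blocks together.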
Implementation • A C++ library for Genetic Algorithm Components (GAlib) by Matthew Wall • C++, Visual Studio 2008 • Weka’s J48 algorithm and chi-squared test
Future study • More re-runs of GA search • Extensions for the current GA use • Better initialization using prior knowledge or other feature selection techniques • More genetic operator choices • Classifier choice can be done more systematically
Thanks for listening! • Any questions?