Neural Network Feature Selection - Sumit Sarkar Y7027453
Feature Selection for Classification • Given: a set of features F and a target variable T • Find: the minimum subset of F that achieves maximum classification performance on T
Why Feature Selection • Improve performance of classification algorithm by using useful features • Classification algorithm may not scale up to the size of the full feature set either in space or time • Remove redundant or useless features • Better understand the domain
Why Feature Selection • By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by: - Alleviating the curse of dimensionality. - Enhancing generalization capability. - Speeding up the learning process. - Improving model interpretability.
Feature Selection • Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines. [Figure: data with m samples and n features, reduced to n' selected features]
Face Recognition • Male/female classification • 1450 images (1000 train, 450 test) • [Figure: features selected by Relief and by Simba; panels labelled 100, 500, 1000] • Relief does not filter out redundancy in the feature set: it selects features on both sides of the face, even though they may be redundant because the face is symmetric. • Simba, in contrast, takes features from only one half of the face and does not repeat them on the other side.
Feature Selection • Feature selection algorithms typically fall into two categories: - Feature Ranking - Subset Selection
Feature Selection • Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. • Subset selection searches the set of possible features for the most desirable subset.
Feature Selection • Two kinds of methods: • Wrapper methods • Filter methods
A Wrapper Method • Given a classifier C and a set of features F, a wrapper method searches the space of subsets of F, using cross-validation to compare the performance of the trained classifier C on each tested subset.
Cross-validation • It is a technique for estimating how the results of a statistical analysis will generalize to an independent data set. • It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. • One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (referred to as the training set), and validating the analysis on the other subset (referred to as the testing set). • To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
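The rounds described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `k_fold_indices`, `cross_validate`, `fit_majority`, and `predict_majority` are hypothetical, the folds are contiguous (real data would normally be shuffled first), and the majority-label classifier is only a stand-in for a real classifier C.

```python
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(X, y, fit, predict, k=5):
    """Average accuracy over k rounds; each fold is the testing set once."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return mean(scores)

# Hypothetical stand-in classifier: always predict the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

# Toy data (assumed for illustration only).
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
cv_accuracy = cross_validate(X, y, fit_majority, predict_majority, k=5)
```

A wrapper method would call something like `cross_validate` once per candidate feature subset, passing only the columns in that subset.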
Feature subset assessment • Data: M samples, N variables/features. • Split the data into 3 sets: training, validation, and test set (of sizes m1, m2, and m3). 1) For each feature subset, train the predictor on the training data. 2) Select the feature subset that performs best on the validation data. 3) Test on the test data.
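The three steps can be sketched as follows. This is an assumption-laden toy: the nearest-centroid predictor, the function names, and the data are all hypothetical and chosen only to keep the example self-contained; any predictor could take their place.

```python
def centroid_fit(X, y, subset):
    """Nearest-centroid model restricted to the feature indices in `subset`."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    return centroids

def centroid_predict(centroids, x, subset):
    def dist(center):
        return sum((x[j] - c) ** 2 for j, c in zip(subset, center))
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))

def accuracy(centroids, X, y, subset):
    hits = sum(centroid_predict(centroids, x, subset) == t for x, t in zip(X, y))
    return hits / len(X)

def assess_subsets(train, valid, test, candidate_subsets):
    best = None
    for subset in candidate_subsets:
        model = centroid_fit(*train, subset)          # 1) train on training data
        val_acc = accuracy(model, *valid, subset)     # 2) compare on validation data
        if best is None or val_acc > best[1]:
            best = (subset, val_acc, model)
    subset, val_acc, model = best
    return subset, val_acc, accuracy(model, *test, subset)  # 3) test once

# Toy data (assumed): feature 0 separates the classes, feature 1 is noise.
train = ([[0.0, 5.0], [0.2, 1.0], [1.0, 4.0], [0.8, 0.0]], [0, 0, 1, 1])
valid = ([[0.1, 0.5], [0.9, 4.5]], [0, 1])
test = ([[0.3, 2.0], [0.7, 3.0]], [0, 1])
chosen, val_acc, test_acc = assess_subsets(train, valid, test, [(0,), (1,), (0, 1)])
```

The test set is touched only once, after the subset has been chosen, so the reported test accuracy is not biased by the selection step.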
A Filter Method • A filter method does not make use of C, but rather attempts to find predictive subsets of the features by making use of simple statistics computed from the empirical distribution. • Example: rank features by the mutual information between each feature and the class label.
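A mutual-information filter of this kind might look like the sketch below. It assumes discrete features and estimates I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x)p(y))] directly from empirical counts; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(feature_values, labels):
    """I(X;Y) in bits for a discrete feature X and class label Y."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    px = Counter(feature_values)
    py = Counter(labels)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

def rank_features(X, y):
    """Return feature indices sorted by decreasing mutual information, plus the scores."""
    n_features = len(X[0])
    scores = [mutual_information([row[j] for row in X], y) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores

# Toy data (assumed): feature 0 equals the label, feature 1 is independent of it.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
order, scores = rank_features(X, y)
```

No classifier is trained at any point, which is what distinguishes this from a wrapper.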
Filters vs. Wrappers • Main goal: rank subsets of useful features.
Filters vs. Wrappers • Main goal: rank subsets of useful features. • Filter: All features → Filter → Feature subset → Predictor • Wrapper: All features → Wrapper (multiple feature subsets, each evaluated with the predictor) → Predictor • Danger of over-fitting with intensive search!
Overfitting • Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. • Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. • A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
Methods • Univariate method: considers one variable (feature) at a time. • Multivariate method: considers subsets of variables (features) together.
Methods • Multivariate selection is more complicated: it is computationally expensive and also statistically harder to do. • Then why use a multivariate feature selection method?
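One commonly cited answer is that features can be useless individually yet informative together. The sketch below illustrates this with an XOR label, using an empirical mutual-information estimate as the (assumed) scoring statistic; the data and function name are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits for discrete (hashable) values."""
    n = len(ys)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Toy data (assumed): the label is the XOR of two binary features.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

mi_x1 = mutual_information(x1, y)                    # univariate view of feature 1
mi_x2 = mutual_information(x2, y)                    # univariate view of feature 2
mi_joint = mutual_information(list(zip(x1, x2)), y)  # both features considered together
```

A univariate ranking would discard both features here, while a multivariate method that scores the pair keeps them.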
Search Strategies • Forward selection (SFS): start with an empty set of features and progressively add features. • Backward elimination (SBS): start with the full set and progressively eliminate features. • GSFS (Generalized Sequential Forward Selection): when (n-k) features are left, try all subsets of g features. This means more trainings at each step, but fewer steps than in the simple sequential process. • PTA(l, r) method: plus l, take away r. At each step, run SFS l times, then SBS r times. • Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far. At any time, if a better subset of the same size was already found, switch abruptly.
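The simplest of these strategies, forward selection, can be sketched as a greedy loop. The `score` callback is an assumption: in a wrapper it would be a cross-validated accuracy, but here a hypothetical `toy_score` stands in so the example runs on its own.

```python
def forward_selection(n_features, score, max_size=None):
    """Greedy SFS: add the single best feature until the score stops improving."""
    selected, best_score = [], float("-inf")
    remaining = list(range(n_features))
    limit = n_features if max_size is None else max_size
    while remaining and len(selected) < limit:
        # Evaluate every one-feature extension of the current subset.
        candidate = max(remaining, key=lambda j: score(selected + [j]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no extension improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical score: features 0 and 2 are useful, every feature has a small cost.
def toy_score(subset):
    return len(set(subset) & {0, 2}) - 0.1 * len(subset)

selected, best = forward_selection(4, toy_score)
```

Backward elimination mirrors this loop, starting from all features and removing one at a time. Either way the search evaluates on the order of n² subsets rather than all of them.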
Multivariate FS is complex • N features, 2^N possible feature subsets!
Multivariate FS is complex • Multivariate FS implies a search in the space of all possible combinations of features. • For n features, there are 2^n possible subsets of features. • This leads to both high computational and high statistical complexity. • Wrappers use the performance of a learning machine to evaluate each subset. • For large n, training 2^n learning machines is not feasible, so most wrapper a… • Filters function analogously to wrappers. • This is highly complex and can be replaced by embedded methods or nested subset methods.
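To make the 2^n growth concrete, the sketch below enumerates every subset and runs an exhaustive search with a hypothetical `toy_score`. It is only practical for very small n, which is the point of the slide.

```python
from itertools import combinations

def all_subsets(n_features):
    """Yield every subset of feature indices, including the empty one."""
    for size in range(n_features + 1):
        yield from combinations(range(n_features), size)

def exhaustive_search(n_features, score):
    """Score every non-empty subset and return the best one."""
    return max((s for s in all_subsets(n_features) if s), key=score)

# Hypothetical score: features 0 and 2 are useful, every feature has a small cost.
def toy_score(subset):
    return len(set(subset) & {0, 2}) - 0.1 * len(subset)

n_subsets = sum(1 for _ in all_subsets(4))  # 2**4 subsets for 4 features
best_subset = exhaustive_search(4, toy_score)
```

With 4 features the search scores 15 non-empty subsets; with 30 features it would need more than a billion evaluations, each of which trains a model in the wrapper setting.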
Embedded methods • SVM (support vector machine) • Loop: All features → Train SVM → Eliminate useless feature(s) → Performance degradation? • No: continue eliminating… • Yes: stop!
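A loop of this shape can be sketched as below, assuming it stops once performance degrades. No SVM is trained here: `train_and_score` is a hypothetical callback that stands in for training a model and returning its score together with per-feature weights, and `tolerance` is an assumed parameter for deciding what counts as degradation.

```python
def eliminate_until_degradation(features, train_and_score, tolerance=0.0):
    """Greedy backward elimination that stops at the first performance drop.

    train_and_score(features) -> (score, weights), where weights[f] is the
    trained model's weight for feature f (hypothetical interface).
    """
    current = list(features)
    best_score, weights = train_and_score(current)
    while len(current) > 1:
        # Treat the feature with the smallest absolute weight as least useful.
        weakest = min(current, key=lambda f: abs(weights[f]))
        trial = [f for f in current if f != weakest]
        trial_score, trial_weights = train_and_score(trial)
        if trial_score < best_score - tolerance:
            break  # performance degraded: keep the current feature set
        current, best_score, weights = trial, trial_score, trial_weights
    return current, best_score

# Hypothetical stand-in for model training: fixed weights, and the score
# drops as soon as either of the two useful features ("a", "c") is removed.
TRUE_WEIGHTS = {"a": 2.0, "b": 0.01, "c": 1.5, "d": 0.0}

def toy_train_and_score(features):
    weights = {f: TRUE_WEIGHTS[f] for f in features}
    score = 1.0 if {"a", "c"} <= set(features) else 0.5
    return score, weights

kept, score = eliminate_until_degradation(["a", "b", "c", "d"], toy_train_and_score)
```

In practice the score would come from held-out data, and `tolerance` would be set with the noise of that estimate in mind.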
In practice… • No method is universally better.
Questions • How do we determine whether or not there is performance degradation? • Are non-linear classifiers always better?