Feature Selection
Alexandros Potamianos, School of ECE, Natl. Tech. Univ. of Athens, Fall 2014-2015
"A blasphemous sect suggested .. that all men should juggle letters and symbols until they constructed by an improbable gift of chance these canonical books ... The sect almost disappeared but I have seen old men who, for long periods of time, would hide in the latrines with some metal disks in a forbidden dice cup, feebly trying to mimic the divine disorder." from the Library of Babel, by Jorge Luis Borges
Feature Selection (Th&K ch5) • Preprocessing • Outlier Removal • variance based, |x – mean| > 2*std or 3*std • Data Normalization • z-normalization: (x-mean)/std • Followed optionally by sigmoid compression: 1/[1+exp(-y)] • Missing Data • Imputation (pseudo-EM) • Multiple imputation (Bayesian) • EM and variants
Feature Selection (Th&K ch5) • How to measure a good feature? • Classification error estimates • Divergence • Expected value of ln(p(x|w1)/p(x|w2)) • Kullback-Leibler distance between pdfs • Equal covariance => Mahalanobis distance • Bounds of classification performance: • Chernoff Bound and Bhattacharyya distance • Scatter Matrices and Fisher distriminant • Classification error proper!
Outline
• Feature Selection
  • Variable Ranking
  • Variable Subset Selection
  • Feature Construction and Dimensionality Reduction
• Methods
  • Filtering (open loop)
  • Wrapper (closed loop)
  • Embedding (minimum classification error)
Variable Ranking
• Look at one feature at a time!
• Criteria (ranking sketch below)
  • Correlation between each feature and the corresponding class labels
    • Squared Pearson correlation coefficient
    • Non-linear generalizations
  • Information-theoretic criteria
    • Mutual information between a feature and the class labels (aka saliency)
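A scalar-ranking sketch for these two criteria. The equal-width binning used to estimate mutual information and the synthetic data are my own choices, not something prescribed in the slides.

```python
import numpy as np

def pearson_sq(x, y):
    """Squared Pearson correlation between one feature and the labels."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

def mutual_information(x, y, bins=10):
    """I(X; Y) with X discretized into equal-width bins, Y assumed discrete."""
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    classes = {c: j for j, c in enumerate(np.unique(y))}
    joint = np.zeros((bins, len(classes)))
    for xb, yc in zip(x_binned, y):
        joint[xb, classes[yc]] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def rank_features(X, y, score=pearson_sq):
    """Return feature indices sorted from most to least relevant, plus scores."""
    scores = np.array([score(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

# Toy example: one informative feature, one pure-noise feature.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + 0.3 * rng.normal(size=200),
                     rng.normal(size=200)])
print(rank_features(X, y))
print(rank_features(X, y, score=mutual_information))
```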
Some Observations
• Features that are i.i.d. are not redundant!
• Perfect correlation => no new information
• Highly correlated features can still add information
• A feature that is useless by itself can improve performance when combined with others
• Multiple features that are useless by themselves can improve performance when combined (XOR-style example below)
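A toy XOR-style construction (my own, not from the slides) makes the last two points concrete: each feature alone is essentially uncorrelated with the label, yet the pair separates the classes perfectly.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.choice([-1.0, 1.0], size=1000) + 0.1 * rng.normal(size=1000)
x2 = rng.choice([-1.0, 1.0], size=1000) + 0.1 * rng.normal(size=1000)
y = (np.sign(x1) * np.sign(x2) > 0).astype(int)   # XOR-like labels

# Individually: correlation with y is ~0 for either feature.
print(np.corrcoef(x1, y)[0, 1], np.corrcoef(x2, y)[0, 1])

# Jointly: the product of the two features predicts y perfectly.
y_hat = (x1 * x2 > 0).astype(int)
print((y_hat == y).mean())   # 1.0 on this synthetic data
```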
Variable Subset Selection
• Need to select features jointly!
• NP-hard problem
• Forward selection (sketch below)
• Backward selection
• Greedy searches avoid over-fitting
• Embedded methods
  • Finite difference
  • Quadratic approximation of the cost function
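A wrapper-style sequential forward selection sketch. The leave-one-out 1-NN accuracy used as the subset criterion J is only an assumed stand-in for whatever classifier and evaluation the wrapper would actually use.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out 1-NN accuracy on the selected feature columns."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return float((y[D.argmin(axis=1)] == y).mean())

def sequential_forward_selection(X, y, k, criterion=loo_1nn_accuracy):
    """Greedily add one feature at a time, keeping the one that maximizes J."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_j, best_f = -np.inf, None
        for f in remaining:
            j = criterion(X[:, selected + [f]], y)
            if j > best_j:
                best_j, best_f = j, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy example: one informative feature among four noise features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
X = np.column_stack([y + rng.normal(scale=0.5, size=100),
                     rng.normal(size=(100, 4))])
print(sequential_forward_selection(X, y, k=2))
```

Sequential backward selection is the mirror image: start from all m features and greedily drop the one whose removal degrades the criterion least.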
Feature Selection: Computational Complexity
• Select a subset of k out of m features
• Scalar feature selection: O(km)
• Feature vector selection (wrapper)
  • Filter (full search): m!/[k!(m-k)!] subsets
  • Sequential backward: O(m^2 + k^2)
  • Sequential forward: O(k(m+k))
  • Floating search (sketch below)
    • A combination of forward and backward search
    • Features can be added back after being rejected
    • Alternates between inclusion (forward) and exclusion (backward) steps
  • For monotonic criteria, C(X) ≤ C(X, x_{k+1}), dynamic programming solutions exist
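A floating-search (SFFS-style) sketch, simplified by me: after each forward inclusion, conditional backward exclusions are tried, but a feature is dropped only if the reduced subset beats the best subset of that size found so far (the best_at_size bookkeeping is my own way of preventing the inclusion/exclusion steps from cycling). The criterion is again an assumed leave-one-out 1-NN stand-in.

```python
import numpy as np

def loo_1nn(Xs, y):
    """Leave-one-out 1-NN accuracy, reused as an assumed wrapper criterion."""
    D = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return float((y[D.argmin(axis=1)] == y).mean())

def floating_forward_selection(X, y, k, criterion=loo_1nn):
    selected, best_at_size = [], {}
    while len(selected) < k:
        # Inclusion step: add the single best remaining feature.
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        best_f = max(remaining, key=lambda f: criterion(X[:, selected + [f]], y))
        selected.append(best_f)
        val = criterion(X[:, selected], y)
        best_at_size[len(selected)] = max(best_at_size.get(len(selected), -np.inf), val)

        # Conditional exclusion steps: drop a selected feature only if the
        # reduced subset improves on the best subset of that size seen so far;
        # dropped features remain candidates for later inclusion steps.
        while len(selected) > 2:
            scores = [(criterion(X[:, [g for g in selected if g != f]], y), f)
                      for f in selected]
            best_drop, f_drop = max(scores)
            if best_drop > best_at_size.get(len(selected) - 1, -np.inf):
                selected.remove(f_drop)
                best_at_size[len(selected)] = best_drop
            else:
                break
    return selected

# Toy example (assumed data, for illustration only).
rng = np.random.default_rng(3)
y = rng.integers(0, 2, 120)
X = np.column_stack([y + rng.normal(scale=0.6, size=120),
                     rng.normal(size=(120, 5))])
print(floating_forward_selection(X, y, k=3))
```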
Variable Subset Selection (cont.)
• Direct objective optimization (e.g., minimum description length; sketch below)
  • Goodness of fit (maximize)
  • Number of variables (minimize)
• Combinations of wrapper/embedded methods + filters
• Markov blankets
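A sketch of such a direct penalized objective in the MDL/BIC spirit: reward goodness of fit and penalize the number of retained variables. The particular 0.5*log(N) penalty per feature is an assumption for illustration, not something fixed by the slides.

```python
import numpy as np

def penalized_score(log_likelihood, n_selected, n_samples):
    """Higher is better: fit term minus a complexity penalty per retained feature."""
    return log_likelihood - 0.5 * n_selected * np.log(n_samples)

# Example: a smaller subset wins unless the extra features buy enough fit.
print(penalized_score(-120.0, 3, 500) > penalized_score(-118.0, 8, 500))  # True
```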
Dimensionality Reduction
• PCA/SVD, LDA, etc. (PCA sketch below)
• Clustering (unsupervised, supervised)
• Fisher linear discriminant
• Information bottleneck
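A minimal PCA-via-SVD sketch (the standard recipe; names and toy data are mine): center the data and project onto the top-d right singular vectors.

```python
import numpy as np

def pca_svd(X, d):
    """Return the d-dimensional projection and the principal directions."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:d]              # top-d principal directions (rows)
    return Xc @ components.T, components

# Toy data with correlated features (assumed, for illustration only).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, W = pca_svd(X, d=2)
print(Z.shape, W.shape)   # (200, 2), (2, 5)
```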