This talk explores feature selection, why it matters, and how it affects classification accuracy. By analyzing a simple binary classification task, it shows how the optimal number of features depends on the training set size.
What is The Optimal Number of Features? A Learning-Theoretic Perspective
Amir Navot
Joint work with Ran Gilad-Bachrach, Yiftah Navot and Naftali Tishby
What is Feature Selection?
• Feature selection: selecting a "good" small subset of features out of the given set
• A "good" subset is one that makes it possible to build good classifiers
• Feature selection is a special form of dimensionality reduction
Reasons to do Feature Selection
• Reduces computational complexity
• Saves the cost of measuring extra features
• The selected features can provide insight into the nature of the problem
• Improves accuracy
The Questions
• Under which conditions can feature selection improve classification accuracy?
• What is the optimal number of features?
• How does this number depend on the training set size?
We discuss these questions by analyzing one simple setting.
Two Gaussians – Problem Setting
• Binary classification task: the label y is +1 or -1 with equal probability, and given y the instance x is drawn from a Gaussian with mean yμ and identity covariance
• The coordinates of x are the features
• Optimal classifier: h(x) = sign(⟨μ, x⟩)
• Given a feature subset F: h_F(x) = sign(Σ_{i∈F} μ_i x_i)
• If μ is known, using all the features is optimal
Problem Setting – Cont.
• Assume that μ is not known, but is estimated from a training sample of size m
• Given an estimator μ̂ and a feature subset F, we consider the classifier h_{μ̂,F}(x) = sign(Σ_{i∈F} μ̂_i x_i)
• We want to find a subset of features that minimizes the average (over training samples) generalization error
• We assume μ_1 ≥ μ_2 ≥ … ≥ μ_N ≥ 0, so we only need to find the optimal number of features n (the best subset of size n is the first n features)
• We consider the optimal estimator, the empirical mean μ̂_i = (1/m) Σ_{j=1}^m y^(j) x_i^(j)
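To make the setting concrete, here is a minimal simulation sketch. It assumes the two-Gaussian model as reconstructed above (y = ±1 with equal probability, x | y ~ N(yμ, I)), estimates μ by the empirical mean of y·x, and measures the test error of the classifier restricted to the first n features. The relevance profile μ_i = 1/i, the dimension, and the sample sizes are illustrative choices, not values from the talk.

```python
# Minimal sketch of the two-Gaussian setting (assumes x | y ~ N(y*mu, I)).
# The relevance profile mu_i = 1/i is a hypothetical choice for illustration.
import numpy as np

rng = np.random.default_rng(0)

N = 100                           # total number of features
mu = 1.0 / np.arange(1, N + 1)    # hypothetical decaying relevance values
m = 50                            # training set size

def sample(n_points):
    """Draw (x, y) pairs: y is +/-1, x ~ N(y*mu, I)."""
    y = rng.choice([-1.0, 1.0], size=n_points)
    x = y[:, None] * mu[None, :] + rng.standard_normal((n_points, N))
    return x, y

# Estimate mu by the empirical mean of y*x over the training sample.
x_train, y_train = sample(m)
mu_hat = (y_train[:, None] * x_train).mean(axis=0)

# Classifier restricted to the first n features: sign(sum_{i<=n} mu_hat_i * x_i).
def test_error(n, n_test=100_000):
    x_test, y_test = sample(n_test)
    scores = x_test[:, :n] @ mu_hat[:n]
    return np.mean(np.sign(scores) != y_test)

for n in (5, 20, 100):
    print(f"n = {n:3d}  test error ~ {test_error(n):.3f}")
```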
Illustration (figure slide; graphic not included in the transcript)
Result
• The number of features that minimizes the average error for a training set of size m is the n that maximizes
  G(n) = Σ_{i≤n} μ_i² / sqrt(Σ_{i≤n} μ_i² + n/m)
• Observations:
  • If the relevance values μ_i decay quickly enough, there is a non-trivial optimal n
  • When m → ∞, using all the features is the optimal choice
  • The decision whether to add a feature depends on the other features
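The following sketch evaluates the criterion G(n) as reconstructed above and reports the maximizing n for several training set sizes. The relevance profile μ_i = 1/i is a hypothetical choice; the point is only to illustrate that the optimal n is non-trivial for small m and grows with m.

```python
# Sketch: the approximate error criterion and the optimal number of features
# as a function of the training set size m. Uses the reconstructed criterion
# G(n) = sum_{i<=n} mu_i^2 / sqrt(sum_{i<=n} mu_i^2 + n/m) and a hypothetical
# relevance profile mu_i = 1/i.
import numpy as np

N = 1000
mu = 1.0 / np.arange(1, N + 1)
cum_mu2 = np.cumsum(mu ** 2)        # sum_{i<=n} mu_i^2 for n = 1..N
ns = np.arange(1, N + 1)

def optimal_n(m):
    g = cum_mu2 / np.sqrt(cum_mu2 + ns / m)   # larger G(n) ~ smaller error
    return int(ns[np.argmax(g)])

for m in (10, 100, 1000, 10000):
    print(f"m = {m:6d}  optimal number of features ~ {optimal_n(m)}")
```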
Solving for a Specific μ – Cont. (figure omitted; x-axis: number of features)
Proof
• For a given estimator μ̂ of μ and a feature subset F, the generalization error of the classifier h_{μ̂,F} is:
  Φ( − Σ_{i∈F} μ_i μ̂_i / sqrt(Σ_{i∈F} μ̂_i²) )
  (Φ is the CDF of a standard Gaussian)
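As a sanity check of this closed-form expression (under the assumed model x | y ~ N(yμ, I)), the sketch below compares it with a Monte Carlo estimate for one arbitrary, fixed estimate μ̂; the particular values of μ and μ̂ are illustrative.

```python
# Sketch: check the closed-form generalization error against Monte Carlo,
# for one fixed estimate mu_hat and the first n features. Assumes
# x | y ~ N(y*mu, I); mu and mu_hat below are arbitrary illustrative values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

mu = np.array([1.0, 0.7, 0.4, 0.2, 0.1])
mu_hat = mu + 0.3 * rng.standard_normal(mu.size)   # a perturbed estimate
n = 3                                              # use the first n features

# Closed form: Phi( - <mu_F, mu_hat_F> / ||mu_hat_F|| )
closed = norm.cdf(-mu[:n] @ mu_hat[:n] / np.linalg.norm(mu_hat[:n]))

# Monte Carlo estimate of P( sign(<mu_hat_F, x_F>) != y )
n_test = 500_000
y = rng.choice([-1.0, 1.0], size=n_test)
x = y[:, None] * mu[None, :] + rng.standard_normal((n_test, mu.size))
mc = np.mean(np.sign(x[:, :n] @ mu_hat[:n]) != y)

print(f"closed form: {closed:.4f}   monte carlo: {mc:.4f}")
```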
Proof – Cont.
• We want to find the number of features n that minimizes the average error:
  E_μ̂ [ Φ( − Σ_{i≤n} μ_i μ̂_i / sqrt(Σ_{i≤n} μ̂_i²) ) ]
• Lemma: Φ( − Σ_{i≤n} E[μ_i μ̂_i] / sqrt(Σ_{i≤n} E[μ̂_i²]) ) is a good approximation of this average error when m is large enough.
Proof – Cont.
• Therefore, minimizing the average error is (approximately) equivalent to maximizing
  Σ_{i≤n} E[μ_i μ̂_i] / sqrt(Σ_{i≤n} E[μ̂_i²])
• For the optimal estimator, E[μ̂_i] = μ_i and E[μ̂_i²] = μ_i² + 1/m. Therefore, the quantity to maximize is
  Σ_{i≤n} μ_i² / sqrt(Σ_{i≤n} μ_i² + n/m)
"Empirical Proof" of the Lemma (figure omitted; x-axis: number of features)
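The plot from this slide is not preserved in the transcript. The sketch below reproduces a comparable comparison under the assumptions above: the average error E_μ̂[Φ(·)], estimated by sampling many estimators μ̂ ~ N(μ, I/m), versus the approximation Φ(−S_n / sqrt(S_n + n/m)) with S_n = Σ_{i≤n} μ_i². The choices μ_i = 1/i and m = 100 are hypothetical.

```python
# Sketch: compare the true average error E[Phi(...)] (averaged over sampled
# estimates mu_hat) with the approximation Phi(-S_n / sqrt(S_n + n/m)),
# S_n = sum_{i<=n} mu_i^2, as a function of n. Assumes the reconstructed
# model above; mu_i = 1/i and m = 100 are hypothetical choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

N, m, repeats = 200, 100, 2000
mu = 1.0 / np.arange(1, N + 1)

# Sample many estimators mu_hat ~ N(mu, I/m) (the empirical mean of y*x).
mu_hats = mu + rng.standard_normal((repeats, N)) / np.sqrt(m)

for n in (1, 5, 20, 50, 100, 200):
    num = mu_hats[:, :n] @ mu[:n]                    # <mu_hat_F, mu_F>
    den = np.linalg.norm(mu_hats[:, :n], axis=1)     # ||mu_hat_F||
    avg_err = norm.cdf(-num / den).mean()
    s_n = np.sum(mu[:n] ** 2)
    approx = norm.cdf(-s_n / np.sqrt(s_n + n / m))
    print(f"n = {n:3d}  average error {avg_err:.4f}  approximation {approx:.4f}")
```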
Linear SVM Error (averaged over 200 repeats, C = 0.01, using Gavin Cawley's toolbox) (figure omitted; x-axis: number of features)
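The experiment on this slide used a MATLAB toolbox; the sketch below runs a comparable experiment with scikit-learn's LinearSVC instead, keeping C = 0.01 from the slide. The data model, the relevance profile μ_i = 1/i, the training set size, and the (reduced) number of repeats are assumptions made here for illustration.

```python
# Sketch of a comparable experiment with a linear SVM, using scikit-learn's
# LinearSVC instead of the MATLAB toolbox mentioned on the slide. C = 0.01
# follows the slide; the data model, mu_i = 1/i, m, and the repeat count
# are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)

N, m, n_test, repeats = 100, 50, 5000, 20
mu = 1.0 / np.arange(1, N + 1)

def sample(k):
    y = rng.choice([-1.0, 1.0], size=k)
    x = y[:, None] * mu[None, :] + rng.standard_normal((k, N))
    return x, y

for n in (5, 20, 50, 100):
    errs = []
    for _ in range(repeats):
        x_tr, y_tr = sample(m)
        x_te, y_te = sample(n_test)
        clf = LinearSVC(C=0.01).fit(x_tr[:, :n], y_tr)
        errs.append(np.mean(clf.predict(x_te[:, :n]) != y_te))
    print(f"n = {n:3d}  mean test error {np.mean(errs):.3f}")
```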
Conclusions
• Even when all the features carry information and are independent:
  • using all the features may be suboptimal
  • the decision whether to add a feature depends on the other features
• The optimal number of features depends critically on the sample size (i.e., on the quality of the model estimation)