This paper discusses the fast correlation-based filter (FCBF) algorithm for feature selection in high-dimensional data. It compares the FCBF algorithm with other approaches and provides experimental results. The advantages and disadvantages of FCBF are also discussed.
Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution Presented by Jingting Zeng 11/26/2007
Outline • Introduction to Feature Selection • Feature Selection Models • Fast Correlation-Based Filter (FCBF) Algorithm • Experiment • Discussion • Reference
Introduction to Feature Selection • Definition • A process that chooses an optimal subset of features according to an objective function • Objectives • To reduce dimensionality and remove noise • To improve mining performance • Speed of learning • Predictive accuracy • Simplicity and comprehensibility of mined results
An Example for Optimal Subset • Data set (whole set) • Five Boolean features • C = F1 ∨ F2 • F3 = ¬F2, F5 = ¬F4 • Optimal subset: {F1, F2} or {F1, F3}
Models of Feature Selection • Filter model • Separating feature selection from classifier learning • Relying on general characteristics of data (information, distance, dependence, consistency) • No bias toward any learning algorithm, fast • Wrapper model • Relying on a predetermined classification algorithm • Using predictive accuracy as goodness measure • High accuracy, computationally expensive
Two Aspects for Feature Selection • How to decide whether a feature is relevant to the class or not • How to decide whether such a relevant feature is redundant or not compared to other features
Linear Correlation Coefficient • For a pair of variables (x, y): r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √(Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)²) • However, it cannot capture non-linear correlations
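The limitation above is easy to demonstrate. A minimal sketch (the function name `pearson` is mine, not from the paper) computes the linear coefficient from its definition and applies it to a perfect but non-linear dependence y = x², where it reports zero correlation:

```python
import math

def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect non-linear dependence y = x**2 on a symmetric range:
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
print(pearson(xs, ys))  # 0.0 -- the linear measure misses the dependence
```

This is why the paper turns to information-theoretic measures next: entropy-based correlation detects any statistical dependence, linear or not.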
Information Measures • Entropy of variable X: H(X) = −Σᵢ P(xᵢ) log₂ P(xᵢ) • Entropy of X after observing Y: H(X|Y) = −Σⱼ P(yⱼ) Σᵢ P(xᵢ|yⱼ) log₂ P(xᵢ|yⱼ) • Information Gain: IG(X|Y) = H(X) − H(X|Y) • Symmetrical Uncertainty: SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1]
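These measures can be estimated directly from sample frequencies of discrete variables. A minimal sketch (function names are mine) follows the formulas above; identical variables give SU = 1 and independent ones give SU = 0:

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum_x P(x) log2 P(x), estimated from sample frequencies."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y): entropy of X within each group of Y, weighted by P(y)."""
    n = len(ys)
    return sum(
        (cy / n) * entropy([x for x, y in zip(xs, ys) if y == yv])
        for yv, cy in Counter(ys).items()
    )

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    ig = entropy(xs) - conditional_entropy(xs, ys)  # information gain
    denom = entropy(xs) + entropy(ys)
    return 2 * ig / denom if denom else 0.0

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, x))             # 1.0: identical variables
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # 0.0: independent here
```

Because information gain is symmetric (IG(X|Y) = IG(Y|X)), SU is symmetric in its arguments, which is what lets FCBF use one measure for both feature–class and feature–feature correlation.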
Fast Correlation-Based Filter (FCBF) Algorithm • How to decide whether a feature is relevant to the class C: find a subset S such that every Fᵢ ∈ S satisfies SU(Fᵢ, C) ≥ δ, a predefined threshold • How to decide whether such a relevant feature is redundant: use the correlation SU(Fᵢ, C) between each feature and the class as a reference
Definitions • Predominant Correlation • The correlation between a feature Fᵢ and the class C is predominant iff SU(Fᵢ, C) ≥ δ and there is no Fⱼ (j ≠ i) with SU(Fⱼ, Fᵢ) ≥ SU(Fᵢ, C) • Redundant peer (RP) • If SU(Fⱼ, Fᵢ) ≥ SU(Fᵢ, C), Fⱼ is a RP of Fᵢ • Use SPᵢ to denote the set of RPs for Fᵢ, divided into SPᵢ⁺ (RPs with SU(Fⱼ, C) > SU(Fᵢ, C)) and SPᵢ⁻ (the rest)
Three Heuristics • If SPᵢ⁺ = ∅, treat Fᵢ as a predominant feature, remove all features in SPᵢ⁻, and skip identifying redundant peers for them • If SPᵢ⁺ ≠ ∅, process all the features in SPᵢ⁺ first; if none of them becomes predominant, follow the first heuristic • The feature with the largest SU(Fᵢ, C) value is always a predominant feature and can be a starting point to remove other features
FCBF Algorithm • Part 1, relevance analysis: time complexity O(N)
FCBF Algorithm (cont.) • Part 2, redundancy analysis: average time complexity O(N log N)
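The two parts above can be sketched in a few lines: rank features by SU with the class (relevance), then walk the ranking from the top, letting each kept feature remove its redundant peers further down. This is a minimal illustration, not the paper's pseudocode; the names `su`, `fcbf`, and the dict-of-lists data layout are my own. Running it on the Boolean example from the earlier slide (C = F1 ∨ F2, F3 = ¬F2, F5 = ¬F4) recovers an optimal subset:

```python
import math
from collections import Counter

def _entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetrical uncertainty SU(X, Y) for discrete value lists."""
    n = len(ys)
    h_cond = sum(
        (cy / n) * _entropy([x for x, y in zip(xs, ys) if y == yv])
        for yv, cy in Counter(ys).items()
    )
    denom = _entropy(xs) + _entropy(ys)
    return 2 * (_entropy(xs) - h_cond) / denom if denom else 0.0

def fcbf(features, cls, delta=0.0):
    """Sketch of FCBF: keep features with SU(F_i, C) >= delta, then remove
    redundant peers in decreasing order of correlation with the class."""
    su_c = {name: su(vals, cls) for name, vals in features.items()}
    selected = sorted(
        (n for n in features if su_c[n] >= delta), key=lambda n: -su_c[n]
    )
    i = 0
    while i < len(selected):
        fp = selected[i]
        # Drop every later F_q that is a redundant peer of F_p,
        # i.e. SU(F_p, F_q) >= SU(F_q, C).
        selected = selected[:i + 1] + [
            fq for fq in selected[i + 1:]
            if su(features[fp], features[fq]) < su_c[fq]
        ]
        i += 1
    return selected

# Boolean example: C = F1 or F2, F3 = not F2, F5 = not F4.
F1 = [0, 0, 0, 0, 1, 1, 1, 1]
F2 = [0, 0, 1, 1, 0, 0, 1, 1]
F4 = [0, 1, 0, 1, 0, 1, 0, 1]
data = {"F1": F1, "F2": F2, "F3": [1 - v for v in F2],
        "F4": F4, "F5": [1 - v for v in F4]}
C = [a | b for a, b in zip(F1, F2)]
print(fcbf(data, C))  # ['F1', 'F2'] -- an optimal subset from the example
```

F3 is removed as a redundant peer of F2 (SU(F2, F3) = 1), and the irrelevant F4, F5 never beat their class correlation. The pairwise checks only run between a kept feature and those ranked below it, which is what keeps the redundancy step near O(N log N) on average rather than the O(N²) of an all-pairs comparison.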
Experiments • FCBF is compared with ReliefF, CorrSF, and ConsSF • Summary of the 10 data sets
Pros and Cons • Advantages • Very fast • Selects fewer features with higher accuracy • Disadvantage • May miss some relevant features: on a synthetic data set of 4 features generated by 4 Gaussian functions plus 4 additional redundant features, FCBF selected only 3 features
Discussion • FCBF compares only individual features with each other • A possible extension: first use PCA to capture groups of correlated features, then apply FCBF to the result
Reference • L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003. • J. Biesiada and W. Duch. Feature selection for high-dimensional data: A Kolmogorov-Smirnov correlation-based filter solution. In Proc. CORES'05, Advances in Soft Computing, Springer, pages 95–104, 2005. • www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf • www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt
Thank you! Q and A