
Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution

Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. Presented by Jingting Zeng 11/26/2007. Outline. Introduction to Feature Selection Feature Selection Models Fast Correlation-Based Filter ( FCBF) Algorithm Experiment Discussion Reference.


Presentation Transcript


  1. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution Presented by Jingting Zeng 11/26/2007

  2. Outline • Introduction to Feature Selection • Feature Selection Models • Fast Correlation-Based Filter (FCBF) Algorithm • Experiment • Discussion • Reference

  3. Introduction to Feature Selection • Definition • A process that chooses an optimal subset of features according to an objective function • Objectives • To reduce dimensionality and remove noise • To improve mining performance • Speed of learning • Predictive accuracy • Simplicity and comprehensibility of mined results

  4. An Example of an Optimal Subset • Data set (whole set) • Five Boolean features • C = F1 ∨ F2 • F3 = ¬F2, F5 = ¬F4 • Optimal subset: • {F1, F2} or {F1, F3}
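The example on slide 4 can be checked by brute force. The sketch below is an illustration only: it assumes the data set contains every combination of the independent features F1, F2 and F4 (the slide does not list the rows), and the helper name `determines_class` is hypothetical.

```python
from itertools import product

# Enumerate every assignment of the independent features F1, F2, F4.
# F3 and F5 are derived as negations, and the class is C = F1 or F2.
rows = []
for f1, f2, f4 in product([False, True], repeat=3):
    features = {"F1": f1, "F2": f2, "F3": not f2, "F4": f4, "F5": not f4}
    rows.append((features, f1 or f2))


def determines_class(subset):
    """Return True if rows that agree on `subset` always share the same class."""
    seen = {}
    for features, label in rows:
        key = tuple(features[name] for name in subset)
        # setdefault stores the first label seen for this key; a later
        # mismatch means the subset cannot determine the class.
        if seen.setdefault(key, label) != label:
            return False
    return True


print(determines_class(["F1", "F2"]))  # True
print(determines_class(["F1", "F3"]))  # True
print(determines_class(["F1"]))        # False
print(determines_class(["F4", "F5"]))  # False
```

Both {F1, F2} and {F1, F3} determine C because F3 carries the same information as F2, while F4 and F5 say nothing about the class.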

  5. Models of Feature Selection • Filter model • Separating feature selection from classifier learning • Relying on general characteristics of data (information, distance, dependence, consistency) • No bias toward any learning algorithm, fast • Wrapper model • Relying on a predetermined classification algorithm • Using predictive accuracy as goodness measure • High accuracy, computationally expensive

  6. Filter Model

  7. Wrapper Model

  8. Two Aspects of Feature Selection • How to decide whether a feature is relevant to the class • How to decide whether such a relevant feature is redundant with respect to other features

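  9. Linear Correlation Coefficient • For a pair of variables (x, y): $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$ • However, it may fail to capture non-linear correlations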
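The limitation noted on slide 9 is easy to see numerically. The following is a minimal sketch using the standard Pearson formula; the function name `pearson_r` and the toy data are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y is fully determined by x, but not linearly

print(pearson_r(xs, linear))     # close to 1
print(pearson_r(xs, quadratic))  # 0.0
```

In the second case y is a deterministic function of x, yet r is zero, which is the motivation for the entropy-based measures on the next slide.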

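  10. Information Measures • Entropy of variable X: $H(X) = -\sum_i P(x_i) \log_2 P(x_i)$ • Entropy of X after observing Y: $H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)$ • Information Gain: $IG(X|Y) = H(X) - H(X|Y)$ • Symmetrical Uncertainty: $SU(X,Y) = 2\left[\frac{IG(X|Y)}{H(X) + H(Y)}\right]$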
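The four measures on slide 10 can be sketched for discrete variables as follows, using the standard definitions H(X) = -Σ P(x) log2 P(x), IG(X|Y) = H(X) - H(X|Y) and SU(X, Y) = 2·IG(X|Y) / (H(X) + H(Y)). The function names are hypothetical, and returning 0 when both variables are constant is an assumed convention, not something the slides specify.

```python
import math
from collections import Counter


def entropy(xs):
    """H(X), estimated from the empirical frequencies of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X|Y): entropy of X within each group of equal Y, weighted by P(y)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), a value in [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    if denom == 0:
        return 0.0  # assumed convention when both variables are constant
    return 2 * information_gain(xs, ys) / denom


x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, [1, 1, 0, 0]))  # 1.0: each determines the other
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # 0.0: independent
```

Unlike raw information gain, SU is normalized by the two entropies, so it is symmetric and comparable across features with different numbers of values.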

  11. Fast Correlation-Based Filter (FCBF) Algorithm • How to decide whether a feature is relevant to the class C or not • Find a subset $S'$, such that $\forall F_i \in S',\ 1 \le i \le N,\ SU_{i,c} \ge \delta$ • How to decide whether such a relevant feature is redundant • Use the correlation between features and the class as a reference

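  12. Definitions • Predominant Correlation • The correlation between a feature $F_i$ and the class C is predominant iff $SU_{i,c} \ge \delta$ and there is no $F_j \in S'$ ($j \ne i$) such that $SU_{j,i} \ge SU_{i,c}$ • Redundant peer (RP) • If there is an $F_j$ such that $SU_{j,i} \ge SU_{i,c}$, $F_j$ is a RP of $F_i$ • Use $S_{P_i}$ to denote the set of RP for $F_i$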

  13. [Figure; surviving labels: i, C]

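  14. Three Heuristics • If $S_{P_i}^{+} = \emptyset$, treat $F_i$ as a predominant feature, remove all features in $S_{P_i}^{-}$, and skip identifying redundant peers for them (here $S_{P_i}^{+}$ and $S_{P_i}^{-}$ are the redundant peers $F_j$ of $F_i$ with $SU_{j,c} > SU_{i,c}$ and $SU_{j,c} \le SU_{i,c}$, respectively) • If $S_{P_i}^{+} \ne \emptyset$, process all the features in $S_{P_i}^{+}$ first. If none of them becomes predominant, follow the first heuristic • The feature with the largest $SU_{i,c}$ value is always a predominant feature and can be a starting point to remove other features.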

  15. [Figure; surviving labels: i, C]

  16. FCBF Algorithm Time Complexity: O(N)

  17. FCBF Algorithm (cont.) Time complexity: O(NlogN)
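The two parts of the algorithm (rank the relevant features by their correlation with the class, then sweep the ranking and drop redundant peers) can be sketched as below. This is a simplified illustration for discrete features, not the authors' implementation: the names `su` and `fcbf`, the dict-of-columns input, and the toy data reused from slide 4 are all assumptions, and the sketch recomputes correlations on the fly rather than aiming for the stated time complexity.

```python
import math
from collections import Counter
from itertools import product


def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def su(xs, ys):
    """Symmetrical uncertainty, using IG(X|Y) = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # assumed convention when both variables are constant
    joint = entropy(list(zip(xs, ys)))
    return 2 * (hx + hy - joint) / (hx + hy)


def fcbf(features, labels, delta):
    """features: dict mapping a name to a list of discrete values."""
    # Part 1: keep features with SU(F, C) >= delta, ranked in descending order.
    relevance = {name: su(col, labels) for name, col in features.items()}
    ranked = sorted(
        (name for name in relevance if relevance[name] >= delta),
        key=lambda name: relevance[name],
        reverse=True,
    )
    # Part 2: the top-ranked remaining feature is predominant; remove every
    # lower-ranked feature F_q whose correlation with it, SU(F_p, F_q),
    # is at least SU(F_q, C).
    selected = []
    while ranked:
        best = ranked.pop(0)
        selected.append(best)
        ranked = [
            name for name in ranked
            if su(features[best], features[name]) < relevance[name]
        ]
    return selected


# Toy data: C = F1 or F2, F3 = not F2, F5 = not F4 (all combinations assumed).
combos = list(product([0, 1], repeat=3))
data = {
    "F1": [a for a, b, d in combos],
    "F2": [b for a, b, d in combos],
    "F3": [1 - b for a, b, d in combos],
    "F4": [d for a, b, d in combos],
    "F5": [1 - d for a, b, d in combos],
}
labels = [a | b for a, b, d in combos]

print(fcbf(data, labels, delta=0.1))  # ['F1', 'F2']
```

On this toy data F4 and F5 fall below the threshold, and F3 is removed as a redundant peer of F2, leaving one of the optimal subsets from slide 4.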

  18. Experiments • FCBF is compared with ReliefF, CorrSF and ConsSF • Summary of the 10 data sets

  19. Results

  20. Results (cont.)

  21. Pros and Cons • Advantages • Very fast • Selects fewer features with higher accuracy • Disadvantage • May fail to detect some features • Example: with 4 features generated by 4 Gaussian functions plus 4 additional redundant features, FCBF selected only 3 features

  22. Discussion • FCBF only compares individual features with each other • Possible extension: use PCA to capture groups of features, then apply FCBF to the result.

  23. References • L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proc. 20th Int. Conf. on Machine Learning (ICML-03), pages 856–863, 2003. • J. Biesiada and W. Duch. Feature selection for high-dimensional data: A Kolmogorov-Smirnov correlation-based filter solution. In CORES'05, Advances in Soft Computing, Springer Verlag, pages 95–104, 2005. • www.cse.msu.edu/~ptan/SDM07/Yu-Ye-Liu.pdf • www1.cs.columbia.edu/~jebara/6772/proj/Keith.ppt

  24. Thank you! Q and A
