Feature Selection
AMCS/CS 340: Data Mining
Xiangliang Zhang, King Abdullah University of Science and Technology
Outline
• Introduction
• Unsupervised Feature Selection
  • Clustering
  • Matrix Factorization
• Supervised Feature Selection
  • Individual Feature Ranking (Single Variable Classifier)
  • Feature Subset Selection
    • Filters
    • Wrappers
• Summary
Problems due to poor variable selection
• The input dimension is too large, and the curse of dimensionality may arise.
• A poor model may be built if unrelated inputs are included or relevant inputs are missing.
• Complex models that contain too many inputs are more difficult to understand.
Applications
• OCR (optical character recognition)
• HWR (handwriting recognition)
Benefits of feature selection
• Facilitating data visualization and data understanding
• Reducing the measurement and storage requirements
• Reducing training and utilization times
• Defying the curse of dimensionality to improve prediction performance
Feature Selection/Extraction
• Thousands to millions of low-level features: select/extract the most relevant ones to build better, faster, and easier-to-understand learning machines.
• From the data X (N samples, m features {Fj}) and possibly the label Y, keep or construct d features {fi} with d << m.
• Using the label Y: supervised. Without the label Y: unsupervised.
Feature Selection vs. Extraction
• Selection: choose the best subset of size d from the m features; {fi} is a subset of {Fj}, i = 1, ..., d and j = 1, ..., m.
• Extraction: extract d new features by a linear or non-linear combination of all m features, {fi} = f({Fj}). The new features may not have a physical interpretation/meaning.
Feature Selection by Clustering
• Group features into clusters.
• Replace (many) similar variables in one cluster by a (single) cluster centroid.
• E.g., k-means, hierarchical clustering.
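As an illustration, here is a minimal scikit-learn sketch of this idea (not part of the original slides): features are clustered with k-means and each cluster is represented by the member feature closest to its centroid. The synthetic data and the choice k = 10 are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(X, k):
    """Group the columns (features) of X into k clusters and
    return one representative feature index per cluster."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X.T)          # cluster features, not samples
    selected = []
    for c in range(k):
        members = np.where(labels == c)[0]
        # keep the member feature closest to the cluster centroid
        dists = np.linalg.norm(X.T[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.array(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 samples, 50 features (synthetic)
X_reduced = X[:, cluster_features(X, k=10)]
```

Alternatively, following the slide's "replace by a cluster centroid", one could use `km.cluster_centers_.T` directly as the reduced data instead of picking representative original features.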
Example of a student project
Abdullah Khamis, AMCS/CS 340, Fall 2010, "Statistical Learning Based System for Text Classification"
Other unsupervised FS methods
• Matrix Factorization
  • PCA (Principal Component Analysis): use the PCs with the largest eigenvalues as "features"
  • SVD (Singular Value Decomposition): use the singular vectors with the largest singular values as "features"
  • NMF (Non-negative Matrix Factorization)
• Nonlinear Dimensionality Reduction
  • Isomap
  • LLE (Locally Linear Embedding)
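A short sketch of matrix-factorization-based reduction with scikit-learn; the synthetic data and n_components = 10 are arbitrary assumptions, not values from the lecture.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 95))        # e.g., 95-dim character-frequency vectors

# Keep the d principal components with the largest variance as new "features"
pca = PCA(n_components=10)            # d = 10 is an arbitrary choice
X_new = pca.fit_transform(X)
print(pca.explained_variance_ratio_[:5])
```

SVD and NMF follow the same transform pattern (e.g., `sklearn.decomposition.TruncatedSVD` or `NMF`, the latter requiring a non-negative X).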
Feature Ranking
• Build better, faster, and easier-to-understand learning machines.
• Discover the most relevant features w.r.t. the target label, e.g., find genes that discriminate between healthy and diseased patients.
• Rank the useful features, eliminate useless features (distracters), and eliminate redundant features.
Example: detecting attacks in real HTTP logs
• Example requests: a common request, a JavaScript XSS attack, a remote file inclusion attack, a DoS attack.
• Represent each HTTP request by a vector in 95 dimensions, corresponding to the 95 ASCII codes between 33 and 127; the character distribution is computed as the frequency of each ASCII code in the path of the HTTP request.
• Classify HTTP vectors in the 95-dim space vs. in a reduced-dimension space? Which dimensions to choose? Which is better?
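A minimal sketch (not from the lecture materials) of the 95-dimensional character-distribution representation described above, assuming the request path is available as a Python string; the example path is made up.

```python
import numpy as np

def char_distribution(path):
    """95-dim vector of relative frequencies of ASCII codes 33..127
    in the path of an HTTP request (a sketch of the representation
    described above; the exact preprocessing in the lecture may differ)."""
    counts = np.zeros(95)
    for ch in path:
        code = ord(ch)
        if 33 <= code <= 127:
            counts[code - 33] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

v = char_distribution("/index.php?page=../../etc/passwd")   # hypothetical path
print(v.shape)   # (95,)
```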
Individual Feature Ranking (1): by AUC
1. Rank the features by the AUC obtained when a single feature xi is used as the predictor, i.e., the area under the ROC curve (true positive rate vs. false positive rate).
• AUC = 1: most related; AUC = 0.5: most unrelated.
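One possible implementation of per-feature AUC ranking with scikit-learn on synthetic data; folding the AUC around 0.5 so that anti-correlated features also rank high is my assumption, not something stated on the slide.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# AUC of each single feature used as a "classifier score"
aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
scores = np.abs(aucs - 0.5) + 0.5          # treat AUC near 0 as informative too
ranking = np.argsort(-scores)
print(ranking[:5])                         # 5 highest-ranked features
```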
Individual Feature Ranking (2): by Mutual Information
2. Rank the features by the mutual information I(i) between each variable and the target:
I(i) = Σ_xi Σ_y P(X = xi, Y = y) log [ P(X = xi, Y = y) / (P(X = xi) P(Y = y)) ]
where P(Y = y) is the frequency count of class y, P(X = xi) the frequency count of attribute value xi, and P(X = xi, Y = y) the frequency count of attribute value xi together with class y.
• The higher I(i), the more related attribute xi is to class y.
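A sketch of mutual-information ranking; scikit-learn's mutual_info_classif estimates I(i) for continuous features (the frequency counts above apply to discrete attributes), and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # estimated I(i) per feature
ranking = np.argsort(-mi)
print(ranking[:5])                               # 5 highest-MI features
```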
Individual Feature Ranking (3): with a continuous target
3. Rank the features by the Pearson correlation coefficient R(i) = cov(Xi, Y) / (std(Xi) std(Y)):
• Detects linear dependencies between a variable and the target.
• Rank features by R(i) or R²(i) (as in linear regression); 1 means related, 0 means unrelated.
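A minimal sketch of Pearson-correlation ranking for a continuous target, using SciPy; the synthetic regression data is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(size=300)   # continuous target

R = np.array([pearsonr(X[:, j], y)[0] for j in range(X.shape[1])])
ranking = np.argsort(-R**2)          # rank by R^2(i)
print(ranking[:5])
```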
Individual Feature Ranking (4): by T-test
4. For each feature xi, compare the class-conditional means m+ and m- (with standard deviations s+ and s-):
• Null hypothesis H0: m+ = m- (xi and Y are independent).
• Relevance index = test statistic. The standard two-sample T statistic is t = (m+ - m-) / (s sqrt(1/n+ + 1/n-)), where s is the pooled within-class standard deviation; if H0 is true, t follows a Student's t distribution.
• Rank by p-value (the false positive rate): the lower the p-value, the more related xi is to class y.
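A sketch of T-test ranking using SciPy's two-sample t-test on synthetic two-class data; the equal-variance assumption follows ttest_ind's default and may differ from the exact statistic used in the lecture.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

pvals = np.array([ttest_ind(X[y == 1, j], X[y == 0, j])[1]
                  for j in range(X.shape[1])])
ranking = np.argsort(pvals)          # lower p-value = more related to y
print(ranking[:5])
```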
Individual Feature Ranking (5): by Fisher Score
5. Fisher discrimination. Two-class case: F = between-class variance / pooled within-class variance; a common form is F(i) = (m+ - m-)² / (s+² + s-²).
• Rank by F value: the higher F, the more related xi is to class y.
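A short sketch computing the two-class Fisher score in the common form given above; the synthetic data is assumed.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

Xp, Xn = X[y == 1], X[y == 0]
# F(i) = (m+ - m-)^2 / (s+^2 + s-^2)
F = (Xp.mean(0) - Xn.mean(0)) ** 2 / (Xp.var(0) + Xn.var(0))
ranking = np.argsort(-F)
print(ranking[:5])
```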
Rank features in HTTP logs (demo): http://www.lri.fr/~xlzhang/KAUST/CS340_slides/FS_rank_demo.zip
Issues of individual feature ranking
• Relevance vs. usefulness: relevance does not imply usefulness, and usefulness does not imply relevance.
• Ranking features individually leads to the selection of a redundant subset: the k best features != the best k features.
• A variable that is useless by itself can be useful together with others.
Useless features become useful
• Separation is gained by using two variables instead of one, or by adding variables.
• Ranking variables individually and independently of each other cannot determine which combination of variables would give the best performance.
Multivariate Feature Selection is complex (Kohavi & John, 1997)
• With M features, there are 2^M possible feature subsets!
Objectives of feature selection
Questions before subset feature selection
• How to search the space of all possible variable subsets?
• Do we use the prediction performance to guide the search? No: filter methods. Yes: wrapper methods.
• How to assess the prediction performance of a learning machine to guide the search and halt it?
• Which predictor to use? Popular predictors include decision trees, Naive Bayes, least-squares linear predictors, and SVMs.
Filter: Feature subset selection
[All features → Filter → Feature subset → Predictor]
• The feature subset is chosen by an evaluation criterion that measures the relation of each subset of input variables to the class, e.g., the correlation-based feature selector (CFS): prefer subsets whose features are highly correlated with the class and uncorrelated with each other.
• Mean feature-class correlation: how predictive of the class a set of features is.
• Average feature-feature intercorrelation: how much redundancy there is among the feature subset.
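A hedged sketch of a CFS-style merit built from the two quantities above, using absolute Pearson correlations; the original CFS uses other correlation measures for discrete data (e.g., symmetrical uncertainty), so this is an illustration, not the exact criterion.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Merit of a feature subset: k * r_cf / sqrt(k + k*(k-1) * r_ff),
    where r_cf is the mean feature-class correlation and r_ff the mean
    feature-feature intercorrelation."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [0, 5]))
```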
Filter: Feature subset selection (2)
• How to search all possible feature subsets (k = 1, ..., M)? Strategies include exhaustive enumeration, forward selection, backward elimination, best-first search, and forward/backward search with a stopping criterion.
• The filter method is a pre-processing step, independent of the learning algorithm.
Forward Selection
• Sequential forward selection (SFS): starting from an empty candidate set, features are sequentially added (n candidates at the first step, then n-1, n-2, ..., 1) until the addition of further features does not decrease the criterion.
Backward Elimination
• Sequential backward selection (SBS): starting from the full candidate set, features are sequentially removed (n candidates for removal at the first step, then n-1, n-2, ..., 1) until the removal of further features increases the criterion. Both directions are sketched below.
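Forward selection and backward elimination can be run, for example, with scikit-learn's SequentialFeatureSelector (the course demo uses Matlab's sequentialfs instead); the classifier, cv=5, and the target of 5 features are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Greedy forward selection (SFS) and backward elimination (SBS)
sfs = SequentialFeatureSelector(clf, n_features_to_select=5,
                                direction='forward', cv=5).fit(X, y)
sbs = SequentialFeatureSelector(clf, n_features_to_select=5,
                                direction='backward', cv=5).fit(X, y)
print(sfs.get_support(indices=True))
print(sbs.get_support(indices=True))
```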
Wrapper: Feature selection methods
[All features → Wrapper (multiple feature subsets evaluated with the predictor) → Feature subset → Predictor]
• The learning model is used as part of the evaluation function and also to induce the final learning model.
• Subsets of features are scored according to their predictive power.
• The parameters of the model are optimized by measuring some cost function.
• Danger of over-fitting with intensive search!
RFE-SVM: Recursive Feature Elimination (RFE) with SVM (Guyon & Weston, 2000; US patent 7,117,188)
• Flowchart: starting from all features, repeatedly train an SVM and eliminate the least useful feature(s); if performance degrades, stop; otherwise continue.
• Algorithm:
  1: repeat
  2:   Find w and b by training a linear SVM.
  3:   Remove the feature with the smallest value |wi|.
  4: until a desired number of features remain.
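A minimal RFE sketch with scikit-learn and a linear SVM, in the spirit of the algorithm above; step=1 removes one feature per iteration, and the synthetic data and the target of 5 features are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Recursively train a linear SVM and drop the feature with the smallest |w_i|
rfe = RFE(estimator=LinearSVC(dual=False), n_features_to_select=5,
          step=1).fit(X, y)
print(rfe.support_)     # boolean mask of selected features
print(rfe.ranking_)     # 1 = selected; larger values = eliminated earlier
```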
Selecting feature subsets in HTTP logs
Comparison of Filter and Wrapper
• Main goal: rank subsets of useful features.
• Search strategies: explore the space of all possible feature combinations.
• Two criteria: predictive power (maximize) and subset size (minimize).
• Predictive power assessment:
  – Filter methods: criteria not involving any learning machine, e.g., a relevance index based on correlation coefficients or test statistics.
  – Wrapper methods: the performance of a learning machine trained using a given feature subset.
• Wrappers are potentially very time consuming since they typically need to evaluate a cross-validation scheme at every iteration. Filter methods are much faster but do not incorporate learning.
Feature subset selection by Random Forest: Forward Selection with Trees
• Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993), start from all the data and, at each step, choose the feature that "reduces entropy" most (e.g., choose f1, then choose f2), working towards "node purity".
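Random forests expose an impurity-based ranking that can be used for subset selection; a short scikit-learn sketch on synthetic data (the number of trees and the subset size of 5 are arbitrary choices).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_      # impurity-based importance
ranking = np.argsort(-importances)
X_reduced = X[:, ranking[:5]]              # keep the 5 most important features
print(ranking[:5])
```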
Conclusion
• Feature selection focuses on uncovering subsets of variables X1, X2, ... predictive of the target Y.
• Univariate feature selection: how to rank the features?
• Multivariate (subset) feature selection (filter, wrapper, embedded): how to search the subsets of features? How to evaluate the subsets of features?
• Feature extraction: how to construct new features in linear/non-linear ways?
In practice
• No method is universally better: there is a wide variety of types of variables, data distributions, learning machines, and objectives.
• Match the method complexity to the ratio M/N: univariate feature selection may work better than multivariate feature selection; non-linear classifiers are not always better.
• Feature selection is not always necessary to achieve good performance.
• See the NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
Feature selection toolboxes
• Matlab: sequentialfs (sequential feature selection, shown in the demo); forward: good; backward: be careful about the definition of the criteria.
• Feature Selection Toolbox 3: freely available, open-source software in C++.
• Weka
References
• Isabelle Guyon and André Elisseeff, "An Introduction to Variable and Feature Selection," JMLR, 2003.
• Isabelle Guyon et al. (Eds.), Feature Extraction: Foundations and Applications, Springer, 2006. http://clopinet.com/fextract-book
• Pabitra Mitra, C. A. Murthy, and Sankar K. Pal, "Unsupervised Feature Selection Using Feature Similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 2002.
• Prof. Marc Van Hulle, Katholieke Universiteit Leuven, http://134.58.34.50/~marc/DM_course/slides_selection.pdf