Feature selection Using slides by Gideon Dror, Alon Kaufman and Roy
Learning to Classify
Learning of binary classification
• Given: a set of m examples $(x_i, y_i)$, $i = 1, 2, \ldots, m$, sampled from some distribution D, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$
• Find: a function $f: \mathbb{R}^n \to \{-1, +1\}$ which classifies ‘well’ examples $x_j$ sampled from D.
Examples:
• Microarray data: separate malignant from healthy tissues
• Text categorization: spam detection
• Face detection: discriminating human faces from non-faces
Learning algorithms: decision trees, nearest neighbors, Bayesian networks, neural networks, Support Vector Machines, …
Advantages of dimensionality reduction
• May improve the performance of the classification algorithm by removing irrelevant features
• Defying the curse of dimensionality: improved generalization
• The classification algorithm may not scale up to the size of the full feature set, either in space or in time
• Allows us to better understand the domain
• Cheaper to collect and store data based on a reduced feature set
Two approaches for dimensionality reduction • Feature construction • Feature selection (This talk)
Methods of Feature construction
• Linear methods
  • Principal component analysis (PCA)
  • Independent component analysis (ICA)
  • Fisher linear discriminant
  • …
• Non-linear methods
  • Non-linear component analysis (NLCA)
  • Kernel PCA
  • Local linear embedding (LLE)
  • …
Feature selection
• Given examples $(x_i, y_i)$ where $x_i \in \mathbb{R}^n$, select a minimal subset of features which maximizes the performance (accuracy, …).
• Exhaustive search is computationally prohibitive, except for a small number of dimensions: there are $2^n - 1$ possible non-empty feature subsets.
• In essence it is an optimization problem, where the classification error is the function to be minimized.
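To make the subset count concrete, here is a minimal Python sketch of exhaustive search. The `score` argument is a hypothetical callable (for instance a cross-validated accuracy) and is not part of the slides; the enumeration is only practical for very small n.

```python
from itertools import combinations

def all_feature_subsets(n):
    """Yield every non-empty subset of the feature indices 0..n-1."""
    for size in range(1, n + 1):
        yield from combinations(range(n), size)

def exhaustive_search(n, score):
    """Return the best-scoring subset; ties go to the subset enumerated first,
    which is the smaller one because subsets are generated by increasing size."""
    return max(all_feature_subsets(n), key=score)
```

The number of subsets produced doubles with every added feature, which is why the methods on the following slides rely on ranking or greedy search instead.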
Feature selection methods
• Filter methods
• Wrapper methods
• Embedded methods
[Figure: block diagrams showing how the feature selection step and the classifier are combined in each family]
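Filtering
• Order all features according to strength of association with the target $y_i$
• Various measures of association may be used (here $\mu^{\pm}_{X_i}$ and $\sigma^{\pm}_{X_i}$ are the mean and standard deviation of $X_i$ over the positive and negative class):
  • Pearson correlation: $R(X_i) = \mathrm{cov}(X_i, Y) / (\sigma_{X_i} \sigma_Y)$
  • $\chi^2$ (discrete variables $X_i$)
  • Fisher criterion scoring: $F(X_i) = |\mu^+_{X_i} - \mu^-_{X_i}| / ((\sigma^+_{X_i})^2 + (\sigma^-_{X_i})^2)$
  • Golub criterion: $F(X_i) = |\mu^+_{X_i} - \mu^-_{X_i}| / |\sigma^+_{X_i} + \sigma^-_{X_i}|$
  • Mutual information: $I(X_i, Y) = \sum p(X_i, Y) \log \frac{p(X_i, Y)}{p(X_i)\, p(Y)}$
  • …
• Choose the first k features and feed them to the classifier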
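A filter of this kind can be sketched in a few lines of NumPy. The function names below are illustrative, the labels are assumed to be coded as -1/+1, and every feature is assumed to have non-zero variance (otherwise the scores are undefined).

```python
import numpy as np

def pearson_scores(X, y):
    """Absolute Pearson correlation between each column of X and the labels y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc / len(y)
    return np.abs(cov / (X.std(axis=0) * y.std()))

def golub_scores(X, y):
    """Golub criterion |mu+ - mu-| / |sigma+ + sigma-| for each feature."""
    pos, neg = X[y == 1], X[y == -1]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    return num / np.abs(pos.std(axis=0) + neg.std(axis=0))

def top_k(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return np.argsort(-scores)[:k]
```

The selected columns, for example `X[:, top_k(golub_scores(X, y), k)]`, are then passed to whichever classifier is used; the filter itself never consults the classifier.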
Wrappers
• Use the classifier as a black box to search the space of feature subsets for the subset that maximizes classification accuracy.
• The search is exponentially hard, so heuristic search is used.
• A common heuristic is hill climbing: keep adding features one at a time until no further improvement can be achieved (“forward selection”).
• Alternatively, start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (“backward selection”).
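Forward selection can be sketched as follows. The black-box classifier here is a nearest-centroid rule scored on a held-out split; both are assumptions chosen to keep the example short, and any classifier and validation scheme could take their place.

```python
import numpy as np

def holdout_accuracy(X_tr, y_tr, X_te, y_te):
    """Black box: nearest class centroid, scored on held-out data (labels -1/+1)."""
    c_pos = X_tr[y_tr == 1].mean(axis=0)
    c_neg = X_tr[y_tr == -1].mean(axis=0)
    d_pos = np.linalg.norm(X_te - c_pos, axis=1)
    d_neg = np.linalg.norm(X_te - c_neg, axis=1)
    pred = np.where(d_pos < d_neg, 1, -1)
    return float(np.mean(pred == y_te))

def forward_selection(X_tr, y_tr, X_te, y_te):
    """Greedily add the feature that most improves accuracy; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(X_tr.shape[1]))
    while remaining:
        scored = [(holdout_accuracy(X_tr[:, selected + [j]], y_tr,
                                    X_te[:, selected + [j]], y_te), j)
                  for j in remaining]
        acc, j = max(scored)
        if acc <= best:          # no further improvement: stop climbing
            break
        selected.append(j)
        remaining.remove(j)
        best = acc
    return selected, best
```

Backward selection follows the same pattern, starting from all features and removing the one whose removal improves the score most.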
Embedded methods: Recursive Feature Elimination (RFE)
0. Set V = n (total number of features)
1. Build a linear Support Vector Machine classifier using the V features
2. Compute the weight vector $w = \sum_i \alpha_i y_i x_i$ of the optimal hyperplane. Omit the V/2 features with the lowest $|w_i|$.
3. Repeat steps 1 and 2 until one feature is left
4. Choose the feature subset that gives the best performance (using cross-validation)
(Has strong theoretical justification)
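The elimination loop of steps 0 to 3 can be sketched as below. To stay self-contained, the sketch substitutes a least-squares linear fit for the linear SVM, so the weights are not those of a maximum-margin hyperplane; this substitution and the function names are assumptions. Step 4 (picking the best subset by cross-validation) is left out.

```python
import numpy as np

def linear_weights(X, y):
    """Stand-in for the SVM step: least-squares weights of a linear classifier."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rfe_subsets(X, y):
    """Halve the active feature set by |w_i| until one feature is left.

    Returns the nested list of surviving index arrays, largest subset first."""
    active = np.arange(X.shape[1])
    subsets = [active]
    while len(active) > 1:
        w = linear_weights(X[:, active], y)
        keep = len(active) - len(active) // 2   # drop floor(V/2) features
        order = np.argsort(-np.abs(w))          # largest |w_i| first
        active = np.sort(active[order[:keep]])
        subsets.append(active)
    return subsets
```

Each subset in the returned list would then be evaluated by cross-validation and the best-performing one kept.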
Margin Based Feature Selection: Theory and Algorithms
Ran Gilad-Bachrach, Amir Navot and Naftali Tishby
• Feature selection based on the quality of the margin the features induce
• Idea: use the large margin principle for feature selection
• Supervised classification problem
• Case-study predictor: 1-NN
Margins
• Margins measure the classifier's confidence
• Sample margin: the distance between the instance and the decision boundary (SVM)
• Hypothesis margin: given an instance, the distance between the hypothesis and the closest hypothesis that assigns an alternative label
• In the 1-NN case (Crammer et al. 2002): $\theta = \frac{1}{2}\left(\|x - \mathrm{nearmiss}(x)\| - \|x - \mathrm{nearhit}(x)\|\right)$
• Previous results: the hypothesis margin lower-bounds the sample margin
• Motivation: choose the features that induce large margins
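Margins
• Given a weight vector $w$ over the features, the margin of an instance $x$ with respect to a set of points $P$ is measured in the weighted norm: $\theta^w_P(x) = \frac{1}{2}\left(\|x - \mathrm{nearmiss}(x)\|_w - \|x - \mathrm{nearhit}(x)\|_w\right)$, where $\|z\|_w = \sqrt{\sum_i w_i^2 z_i^2}$
• The evaluation function is defined for any weight vector $w$ over the features: $e(w) = \sum_{x \in S} \theta^w_{S \setminus x}(x)$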
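Margins for 1-NN
• Hypothesis margin (Crammer et al. 2002, Bachrach et al. 2004): $\theta = \frac{1}{2}\left(\|x - \mathrm{nearmiss}(x)\| - \|x - \mathrm{nearhit}(x)\|\right)$
[Figure: an instance x with nearhit(x), nearmiss(x) and the margin θ]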
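As a small illustration, the 1-NN hypothesis margin of one training instance can be computed directly from its definition. This sketch assumes Euclidean distance and that each class has at least two samples; the function name is illustrative.

```python
import numpy as np

def hypothesis_margin(X, y, idx):
    """0.5 * (||x - nearmiss(x)|| - ||x - nearhit(x)||) for the sample X[idx]."""
    dist = np.linalg.norm(X - X[idx], axis=1)
    dist[idx] = np.inf                      # exclude x itself
    nearhit = dist[y == y[idx]].min()       # closest point with the same label
    nearmiss = dist[y != y[idx]].min()      # closest point with a different label
    return 0.5 * (nearmiss - nearhit)
```

A positive value means the nearest same-label neighbour is closer than the nearest other-label neighbour, so 1-NN would classify the instance correctly if it were held out.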
Iterative Search Based Algorithm (Simba)
• For a set S with m samples and N features:
• w = (1, 1, …, 1)
• For t = 1:T (number of iterations)
  • Pick a random instance x from S
  • Calculate nearmiss(x) and nearhit(x) considering w
  • For i = 1:N, update $w_i = w_i + (x_i - \mathrm{nearmiss}(x)_i)^2 - (x_i - \mathrm{nearhit}(x)_i)^2$ (in vector form, $w = w + \Delta$)
• Complexity: $\Theta(TNm)$ / $\Theta(Nm^2)$
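The loop can be sketched in NumPy as follows. It uses the per-feature update exactly as written on the slide, which may be simpler than the update in the published algorithm. Two further assumptions are made here: "considering w" is taken to mean the weighted distance $\sqrt{\sum_i w_i^2 (x_i - z_i)^2}$, and every class is assumed to contain at least two samples.

```python
import numpy as np

def simba_slide(X, y, T, seed=0):
    """Iteratively reweight features using the nearmiss / nearhit update."""
    rng = np.random.default_rng(seed)
    m, N = X.shape
    w = np.ones(N)
    for _ in range(T):
        i = rng.integers(m)                      # pick a random instance x
        diff2 = (X - X[i]) ** 2                  # squared per-feature differences
        dist = np.sqrt(diff2 @ (w ** 2))         # weighted distance to x
        dist[i] = np.inf                         # x is not its own neighbour
        same = (y == y[i])
        hit = np.where(same, dist, np.inf).argmin()
        miss = np.where(~same, dist, np.inf).argmin()
        w += diff2[miss] - diff2[hit]            # per-feature update
    return w
```

Features on which the nearest miss is far and the nearest hit is close accumulate large weights, so ranking features by the final `w` gives the selection order.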
Application: Face Images
• AR face database
• 1456 images of females and males
• 5100 features
• Train: 1000 faces; test: 456 faces
Unsupervised feature selection
R. Varshavsky, A. Gottlieb, M. Linial, D. Horn. ISMB 2006
• Background: Motivation and Methods
• Our Solution
  • SVD-Entropy and the CE criterion
  • Three Feature Selection Methods
• Results
Background: Motivation
• Gene expression, sequence similarities
• ‘Curse of dimensionality’, dimension reduction, compression
• Thousands to tens of thousands of genes in an array
• Number of proteins in databases: more than a million
• Noise
The Data: An Example
• Gene Expression Experiments
[Figure: expression matrix with genes/features along one axis and samples along the other]
Background: Methods
• Extraction vs. selection
• Most methods are supervised (i.e., have an objective function)
• Unsupervised:
  • Variance
  • Projection on the first PC (e.g., ‘gene-shaving’)
  • Statistically significant overabundance (Ben-Dor et al., 2001)
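Our Solution: SVD-Entropy
• The normalized relative values (Wall et al., 2003)*: $V_j = s_j^2 / \sum_k s_k^2$
• SVD-Entropy (Alter et al., 2000): $E = -\frac{1}{\log N} \sum_{j=1}^{N} V_j \log V_j$, where N is the number of singular values
* $s_j^2$ are the eigenvalues of the $[n \times n]$ matrix $XX'$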
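A minimal NumPy sketch of the entropy computation follows. It assumes the usual definitions, $V_j = s_j^2 / \sum_k s_k^2$ and $E = -\frac{1}{\log N}\sum_j V_j \log V_j$ with N the number of singular values; the convention of returning 0 for a single singular value is an assumption of this sketch.

```python
import numpy as np

def svd_entropy(X):
    """Normalized entropy of the squared singular values of X, in [0, 1]."""
    s = np.linalg.svd(X, compute_uv=False)
    if len(s) < 2:
        return 0.0                          # a single value has no spread
    v = s ** 2 / np.sum(s ** 2)             # normalized relative values V_j
    v = v[v > 0]                            # treat 0 * log 0 as 0
    return float(-np.sum(v * np.log(v)) / np.log(len(s)))
```

A matrix whose singular values are all equal gives entropy 1, while a rank-one matrix, dominated by a single singular value, gives entropy close to 0.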
SVD-Entropy (Example) A comparison of two eigenvalue distributions; the left has high entropy (0.87) and the right one has low entropy (0.14)
CE: Contribution to the Entropy
• The contribution of the i-th feature to the overall entropy is determined according to a leave-one-out measurement: $CE_i = E(X_{[n \times m]}) - E(X_{[n \times (m-1)]})$, where the second matrix is X with feature i removed
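The leave-one-out measurement can be sketched as below, with features stored as the m columns of X. The entropy helper repeats the assumed SVD-entropy definition; note that its normalization uses the number of singular values, which changes when a feature is removed from a matrix that has fewer features than samples.

```python
import numpy as np

def svd_entropy(X):
    """Normalized entropy of the squared singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    if len(s) < 2:
        return 0.0
    v = s ** 2 / np.sum(s ** 2)
    v = v[v > 0]
    return float(-np.sum(v * np.log(v)) / np.log(len(s)))

def contribution_to_entropy(X):
    """CE_i = E(X) - E(X without feature column i), for every feature i."""
    full = svd_entropy(X)
    return np.array([full - svd_entropy(np.delete(X, i, axis=1))
                     for i in range(X.shape[1])])
```

Each call to `contribution_to_entropy` performs one SVD per feature, so the cost grows linearly with the number of features on top of the cost of a single decomposition.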
CEs suggest 3 groups of features
• $CE_i > c$: high contribution, meaningful (?)
• $CE_i = c$: average contribution, neutral
• $CE_i < c$: low contribution, uniformity
Three Feature Selection Methods
• Simple Ranking (SR)
• Forward Selection (FS)
  • Aggregate the highest CE at a time
  • Select and remove the highest CE at a time
• Backward Elimination (BE)
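Two of these strategies can be sketched on top of the CE score: simple ranking, and the forward-selection variant that selects and removes the highest-CE feature at each step. The number of features to keep, `k`, and all function names are assumptions of this sketch, as is the reading of simple ranking as "compute CE once and keep the top k".

```python
import numpy as np

def svd_entropy(X):
    """Normalized entropy of the squared singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    if len(s) < 2:
        return 0.0
    v = s ** 2 / np.sum(s ** 2)
    v = v[v > 0]
    return float(-np.sum(v * np.log(v)) / np.log(len(s)))

def ce_scores(X):
    """Leave-one-out contribution of each feature (column) to the entropy."""
    full = svd_entropy(X)
    return np.array([full - svd_entropy(np.delete(X, i, axis=1))
                     for i in range(X.shape[1])])

def simple_ranking(X, k):
    """SR: compute CE once and keep the k highest-scoring features."""
    return [int(i) for i in np.argsort(-ce_scores(X))[:k]]

def forward_select_remove(X, k):
    """FS variant: pick the highest-CE feature, remove it, recompute, repeat."""
    remaining = list(range(X.shape[1]))
    chosen = []
    for _ in range(k):
        ce = ce_scores(X[:, remaining])
        chosen.append(remaining.pop(int(np.argmax(ce))))
    return chosen
```

Simple ranking needs one round of CE computations, whereas the select-and-remove variant recomputes all scores after every pick and is correspondingly slower.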
Fauquet virus problem 61 viruses. 18 features (amino-acid compositions of coat proteins of the viruses). Four known classes.
Results - Example (Golub et al. 1999)
• Leukemia
• 72 patients (samples)
• 7129 genes
• 4 groups
  • Two major types: ALL & AML
  • T & B cells in ALL
  • With/without treatment in AML
[Figure: expression matrix with genes/features along one axis and samples along the other]
Clustering Assessment
• n11: number of pairs that are classified together, both in the ‘real’ classification and by the algorithm
• n10: number of pairs that are classified together in the ‘real’ classification, but not by the algorithm
• n01: number of pairs that are classified together by the algorithm, but not in the ‘real’ classification
[Figure: Venn diagram of the pairs grouped together in the ‘real’ classification and by the algorithm, with regions n10, n11 and n01]
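The three pair counts can be computed by looping over all pairs of samples. In this sketch `real` and `algo` are hypothetical label sequences of equal length, one entry per sample.

```python
from itertools import combinations

def pair_counts(real, algo):
    """Return (n11, n10, n01) over all unordered pairs of samples."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(real)), 2):
        same_real = real[i] == real[j]
        same_algo = algo[i] == algo[j]
        if same_real and same_algo:
            n11 += 1        # together in both
        elif same_real:
            n10 += 1        # together only in the real classification
        elif same_algo:
            n01 += 1        # together only according to the algorithm
    return n11, n10, n01
```

For example, with `real = [0, 0, 1, 1]` and `algo = [0, 0, 0, 1]` the six pairs split into one n11 pair, one n10 pair, two n01 pairs and two pairs that are separated in both.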