A New Paradigm for Feature Selection (with some surprising results)
Amnon Shashua, School of Computer Science & Engineering, The Hebrew University
Joint work with Lior Wolf. Wolf & Shashua, ICCV'03.
Problem Definition
Given a sample of feature measurements (each data point is a feature vector), find a subset of features that are most "relevant" with respect to an inference (learning) task.
Comments:
• Data points can represent images (pixels), feature attributes, wavelet coefficients, etc.
• The task is to select a subset of coordinates of the data points such that the accuracy, confidence, and training sample size of a learning algorithm would, ideally, be optimal.
• Need to define a "relevance" score.
• Need to overcome the exponential nature of subset selection.
• If a "soft" selection approach is used, need to make sure the solution is "sparse".
Examples
• Text classification: features typically represent word frequency counts, and only a small fraction is expected to be relevant. Typical applications include automatic sorting of URLs into a web directory and detection of spam email.
• Visual recognition: images are represented by many low-level measurements (pixels, patches, wavelet coefficients), only a few of which are relevant for separating the classes of interest.
• Genomics: gene expressions measured over tissue samples. Goal: recognize the relevant genes that separate between normal and tumor cells, between different subclasses of cancer, and so on.
Why Select Features?
• Most learning algorithms do not scale well with the growth of irrelevant features.
  - ex1: the number of training examples required by some supervised learning methods grows exponentially.
  - ex2: for classifiers that can optimize their "capacity" (e.g., large-margin hyperplanes), the effective VC dimension grows fast with irrelevant coordinates, faster than the capacity increase.
• Computational efficiency considerations when the number of coordinates is very high.
• The structure of the data gets obscured by large amounts of irrelevant coordinates.
• Run-time of the (already trained) inference engine on new test examples.
Existing Approaches
• Filter methods: pre-process the data independently of the inference engine. Examples: mutual information measures, correlation coefficients, clustering.
• Embedded / wrapper methods: select features that are useful for building a good predictor or inference engine. Example: run SVM training on every candidate subset of features. Computationally expensive in general.
Feature Subset Relevance - Key Idea
Working assumption: the relevant subset of rows (features) induces columns (data points) that are coherently clustered.
Note: we are assuming an unsupervised setting (data points are not labeled). The framework can easily apply to supervised settings as well.
How to measure cluster coherency?
• We wish to avoid explicitly clustering the columns for each candidate subset of rows.
• We wish for a measure that is amenable to continuous functional analysis.
• Key idea: use spectral information from the affinity matrix induced by the selected rows.
• How do we represent the subset of features?
Definition of Relevancy - The Standard Spectrum
General idea: select a subset of rows from the sample matrix M such that the resulting affinity matrix has high values associated with its first k eigenvalues.
Let the subset of features be encoded by weights $\alpha_i \in \{0,1\}$ and let the resulting affinity matrix be $A_\alpha = \sum_i \alpha_i m_i m_i^\top$, where $m_i$ is the i-th row (feature) of M. The relevance of the subset is the spectral energy captured by the first k eigenvectors, $\mathrm{tr}(Q^\top A_\alpha^\top A_\alpha Q)$, where the columns of Q consist of the first k eigenvectors of $A_\alpha$.
(Unsupervised) Optimization Problem
Let $A_\alpha = \sum_i \alpha_i m_i m_i^\top$ with $\alpha_i \in \{0,1\}$:
$$\max_{\alpha, Q} \; \mathrm{tr}\!\left(Q^\top A_\alpha^\top A_\alpha Q\right) \quad \text{subject to} \quad \alpha_i \in \{0,1\}, \;\; Q^\top Q = I_k .$$
This optimization is too difficult to be considered in practice (a mixed integer and continuous variables program).
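To make the combinatorial difficulty concrete, here is a minimal numpy sketch (mine, not from the talk) that scores every 0/1 subset of features by the spectral energy captured by the top k eigenvectors of the induced affinity matrix. The toy sizes, the random data matrix M, and the use of squared eigenvalues (matching the trace objective above) are illustrative assumptions.

```python
import itertools
import numpy as np

def affinity(M, alpha):
    """A_alpha = sum_i alpha_i * m_i m_i^T, where m_i is the i-th feature row of M."""
    return sum(a * np.outer(m, m) for a, m in zip(alpha, M))

def relevance(M, alpha, k):
    """Spectral energy captured by the top-k eigenvectors of A_alpha (sum of squared eigenvalues)."""
    eigvals = np.linalg.eigvalsh(affinity(M, alpha))  # ascending order
    return float(np.sum(eigvals[-k:] ** 2))

rng = np.random.default_rng(0)
n, q, k = 10, 30, 2                  # n features, q data points (toy sizes)
M = rng.normal(size=(n, q))          # rows are features, columns are data points

# Exhaustive search over all 2^n - 1 non-empty subsets: feasible only for tiny n.
best = max(((relevance(M, np.array(bits), k), bits)
            for bits in itertools.product([0, 1], repeat=n) if any(bits)),
           key=lambda t: t[0])
print("best subset:", best[1], "score:", best[0])
```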
(Unsupervised) Optimization Problem - Soft Version I
Let the weights $\alpha_i$ take real values, and maximize the same spectral objective subject to constraints on $\alpha$, augmented with a non-linear function that penalizes a uniform $\alpha$. The result is a non-linear programming problem, which could be quite difficult to solve.
(Unsupervised) Optimization Problem - Soft Version II
Let $A_\alpha = \sum_i \alpha_i m_i m_i^\top$ for some unknown real scalars $\alpha_i$:
$$\max_{\alpha, Q} \; \mathrm{tr}\!\left(Q^\top A_\alpha^\top A_\alpha Q\right) \quad \text{subject to} \quad \sum_i \alpha_i^2 = 1, \;\; Q^\top Q = I_k .$$
Motivation: from spectral clustering it is known that the eigenvectors tend to be discontinuous, and that may lead to an effortless sparsity property.
Note: this optimization definition ignores the requirement that the weight vector $\alpha$ should be sparse.
The Algorithm
If $\alpha$ were known, then $A_\alpha$ is known and Q is simply given by the first k eigenvectors of $A_\alpha$.
If Q were known, then the problem becomes
$$\max_{\alpha} \; \alpha^\top G \alpha \quad \text{subject to} \quad \alpha^\top \alpha = 1, \qquad \text{where } G_{ij} = (m_i^\top m_j)\,(m_i^\top Q Q^\top m_j),$$
and the solution $\alpha$ is the largest eigenvector of G.
The Algorithm (Power-Embedded)
1. Let $Q^{(0)}$ be defined (an arbitrary orthonormal $q \times k$ matrix) and set $r = 1$.
2. Let $G^{(r)}_{ij} = (m_i^\top m_j)\,(m_i^\top Q^{(r-1)} Q^{(r-1)\top} m_j)$ and let $\alpha^{(r)}$ be the largest eigenvector of $G^{(r)}$.
3. Let $A^{(r)} = \sum_i \alpha_i^{(r)} m_i m_i^\top$.
4. Let $Z^{(r)} = A^{(r)} Q^{(r-1)}$ (orthogonal iteration).
5. Factorize $Z^{(r)} = Q^{(r)} R^{(r)}$ (the "QR" factorization step).
6. Increment $r$ and return to step 2.
Convergence proof sketch: take $k = 1$ for example. Steps 4 and 5 reduce to a power-iteration step $q^{(r)} = A^{(r)} q^{(r-1)} / \|A^{(r)} q^{(r-1)}\|$. One needs to show that the objective value does not decrease from one iteration to the next, which follows from convexity: for all symmetric matrices $A$ and unit vectors $q$, $\;q^\top A^2 q \ge (q^\top A q)^2$.
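Below is a minimal numpy sketch of the alternating scheme as I read it from the slide: compute G from the current Q, take its largest eigenvector as α, rebuild A_α, and apply one orthogonal-iteration (QR) step to update Q. The initialization, the sign fix, the fixed iteration count, and the toy data at the end are my assumptions, not part of the original specification.

```python
import numpy as np

def spectral_feature_weights(M, k, iters=50, seed=0):
    """Alternating alpha/Q scheme sketched above.

    M : (n, q) matrix whose rows m_i are the n feature vectors measured over q data points.
    Returns a unit-norm weight vector alpha over the features and the q x k matrix Q.
    """
    rng = np.random.default_rng(seed)
    n, q = M.shape
    Q, _ = np.linalg.qr(rng.normal(size=(q, k)))     # arbitrary orthonormal start (assumption)
    MMt = M @ M.T                                    # entries m_i^T m_j
    for _ in range(iters):
        MQ = M @ Q
        G = MMt * (MQ @ MQ.T)                        # G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j)
        _, V = np.linalg.eigh(G)
        alpha = V[:, -1]                             # largest eigenvector of G
        if alpha.sum() < 0:                          # resolve the eigenvector's arbitrary sign
            alpha = -alpha
        A = M.T @ (alpha[:, None] * M)               # A_alpha = sum_i alpha_i m_i m_i^T
        Q, _ = np.linalg.qr(A @ Q)                   # orthogonal iteration + QR step
    return alpha, Q

# Toy usage: 5 feature rows carrying a 2-cluster column structure plus 45 pure-noise rows.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=40)
relevant = 2.0 * labels[None, :] + 0.1 * rng.normal(size=(5, 40))
M = np.vstack([relevant, rng.normal(size=(45, 40))])
alpha, _ = spectral_feature_weights(M, k=2)
print("mean |alpha| on relevant rows:", np.abs(alpha[:5]).mean(),
      "on noise rows:", np.abs(alpha[5:]).mean())
```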
Positivity and Sparsity of $\alpha$ - Hand-Waving Argument
Write $A_\alpha = \sum_i \alpha_i m_i m_i^\top$ as a sum of rank-1 matrices. The residual spectral energy $\mathrm{tr}(A_\alpha^\top A_\alpha) - \mathrm{tr}(Q^\top A_\alpha^\top A_\alpha Q)$ is minimized (it vanishes) when $\mathrm{rank}(A_\alpha) = k$, and adding redundant rank-1 terms only increases the rank. If we would like the rank of $A_\alpha$ to stay small, we should not add too many rank-1 terms; therefore $\alpha$ should be sparse.
Note: this argument does not say anything with regard to why $\alpha$ should be positive.
Positivity and Sparsity of $\alpha$
The key to the emergence of a sparse and positive $\alpha$ lies in the way the entries of G are defined. Consider $k = 1$ for example; then each entry is of the form
$$G_{ij} = (m_i^\top m_j)\,(m_i^\top q)\,(q^\top m_j).$$
Clearly $G_{ij} < 0$ only if the sign of $m_i^\top m_j$ disagrees with the sign of $(m_i^\top q)(m_j^\top q)$, and such sign disagreements are rare. As a result, the expected values of the entries of G are biased towards a positive number.
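As a small sanity check of the sign-bias claim (my own illustration, not from the talk), the snippet below draws random feature vectors and a random unit vector q standing in for the leading eigenvector, and estimates the sign and mean of a single $k = 1$ entry $G_{ij} = (m_i^\top m_j)(m_i^\top q)(q^\top m_j)$; for independent standard-normal feature vectors its expectation is about +1.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, trials = 30, 20000
values = np.empty(trials)
for t in range(trials):
    m_i, m_j = rng.normal(size=dim), rng.normal(size=dim)
    q = rng.normal(size=dim)
    q /= np.linalg.norm(q)                 # random unit vector standing in for the eigenvector
    # one off-diagonal entry of G for k = 1
    values[t] = (m_i @ m_j) * (m_i @ q) * (m_j @ q)
print("fraction of negative entries:", (values < 0).mean())   # below one half
print("empirical mean of the entries:", values.mean())        # positive (about +1 here)
```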
Positivity and Sparsity of $\alpha$ - Questions
1. What is the minimal value of the entries of G when the data vectors vary over the n-dimensional unit hypersphere?
2. Given a uniform sampling over the n-dimensional unit hypersphere, what are the mean and variance of the entries of G?
3. Given the answers to the above, what is the probability that the first eigenvector of G is strictly non-negative (all entries of the same sign)?
Proposition 3: With an infinitesimal perturbation, let $\alpha$ be the largest eigenvector of G. Then the probability that $\alpha$ is strictly non-negative is expressed in terms of $\Phi$, the cumulative distribution function of the entries of G, and matches the empirical behavior.
Proposition 4 (with O. Zeitouni and M. Ben-Or): The probability that the largest eigenvector of G is strictly non-negative tends to 1 as n grows, for any value of k.
Sparsity Gap - Definition
Let p be the fraction of relevant features and q = 1 - p, where n is the total number of features. Let $\alpha$ be the largest eigenvector of G, where $\alpha_1$ holds the first np entries (the relevant features) and $\alpha_2$ holds the remaining nq entries. The sparsity gap corresponding to G is the ratio $\mu_1 / \mu_2$, where $\mu_1$ is the mean of $\alpha_1$ and $\mu_2$ is the mean of $\alpha_2$.
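A minimal sketch of the sparsity-gap computation, assuming (as in the definition) that the relevant features occupy the first block of G; the block-structured toy G is my own illustration, not data from the talk.

```python
import numpy as np

def sparsity_gap(G, n_relevant):
    """Ratio of the mean weight over the (assumed) relevant block to the mean over the rest."""
    _, V = np.linalg.eigh(G)
    alpha = V[:, -1]
    if alpha.sum() < 0:              # resolve the arbitrary sign of the eigenvector
        alpha = -alpha
    mu1 = alpha[:n_relevant].mean()
    mu2 = alpha[n_relevant:].mean()
    return mu1 / mu2

# Toy G: a strongly coupled relevant block embedded in weak positive background couplings,
# so by Perron-Frobenius the leading eigenvector is entrywise positive and mu2 > 0.
rng = np.random.default_rng(0)
n, n_rel = 100, 20
G = 0.05 * np.abs(rng.normal(size=(n, n)))
G = (G + G.T) / 2
G[:n_rel, :n_rel] += 1.0             # relevant features reinforce each other
print("sparsity gap:", sparsity_gap(G, n_rel))
```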
Sparsity Gap
Proposition 4: Let $\alpha$ be the largest eigenvector of G, partitioned as above. The sparsity gap corresponding to G admits an explicit expression in terms of the quantities defined above, accompanied by a worked numerical example.
The Feature Selection for SVM Benchmark
• Two synthetic data sets were proposed by Weston, Mukherjee, Chapelle, Pontil, Poggio and Vapnik (NIPS 2001).
• The data sets were designed to have a few features that are separable by SVM, combined with many irrelevant features: 6 relevant features and 196 irrelevant ones. The data sets were designed for the labeled case.
• The linear data set is almost separable linearly once the correct features are recovered: with probability 0.7 the data is almost separable by the first 3 relevant features and not separable by the remaining 3; with probability 0.3 the second group of 3 relevant features is the separable one. The remaining 196 features were drawn from N(0, 20).
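The sketch below generates a dataset following the linear construction just described. The slide fixes the 6 + 196 split, the 0.7 / 0.3 switch between the two groups of relevant features, and the N(0, 20) noise; the specific class-conditional means and unit variances used for the relevant features are my assumptions, loosely following Weston et al.

```python
import numpy as np

def linear_benchmark(n_samples, seed=0):
    """Sketch of the 6-relevant / 196-irrelevant linear dataset described above.

    With probability 0.7 the label is carried by features 0-2, otherwise by features 3-5;
    the remaining 196 features are pure noise drawn from N(0, 20).
    The relevant-feature distributions below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_samples)
    X = np.empty((n_samples, 202))
    for s in range(n_samples):
        informative = rng.random() < 0.7          # which group of 3 carries the label
        g1 = slice(0, 3) if informative else slice(3, 6)
        g2 = slice(3, 6) if informative else slice(0, 3)
        X[s, g1] = y[s] * rng.normal(loc=[1.0, 2.0, 3.0], scale=1.0)   # assumed means
        X[s, g2] = rng.normal(size=3)
        X[s, 6:] = rng.normal(scale=20.0, size=196)
    return X, y

X, y = linear_benchmark(100)
print(X.shape, y.shape)   # (100, 202) (100,)
```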
Results - non-linear data set: the unsupervised algorithm becomes effective only from 80 data points upward and is not shown here.
There are two species of frogs in this figure:
Green frog (Rana clamitans); American toad.
Automatic Separation
• We use small image patches as basic features.
• To compare patches we use the L1 norm on their color histograms: $\mathrm{similarity}(p_1, p_2) = e^{-\|h(p_1) - h(p_2)\|_1}$, where $h(\cdot)$ denotes a patch's color histogram.
The matrix A: many candidate patch features over ~40 images.
The similarity between an image I and a patch B is the maximum over all similarities between the patches p of the image and the patch B:
$$\mathrm{similarity}(I, B) = \max_{p \in I} \; \mathrm{similarity}(p, B).$$
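A minimal sketch of these patch-based features: L1 distance between color histograms for patch-to-patch similarity, a max over image patches for image-to-patch similarity, and the resulting feature matrix A (rows: candidate patches, columns: images). The patch size, stride, and histogram binning are my assumptions.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Joint RGB histogram of a small image patch (patch: H x W x 3 uint8), L1-normalized."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / max(hist.sum(), 1)

def image_patch_similarity(image, patch, size=16, stride=8):
    """similarity(image, patch) = max over all patches p of the image of exp(-||h(p) - h(patch)||_1)."""
    target = color_histogram(patch)
    H, W, _ = image.shape
    best = 0.0
    for y in range(0, H - size + 1, stride):
        for x in range(0, W - size + 1, stride):
            h = color_histogram(image[y:y + size, x:x + size])
            best = max(best, np.exp(-np.abs(h - target).sum()))
    return best

def build_feature_matrix(images, patches):
    """Matrix A: one row per candidate patch feature, one column per image."""
    return np.array([[image_patch_similarity(img, p) for img in images] for p in patches])
```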
Selected features: Green frog (Rana clamitans) vs. American toad.
Using these features the clustering was correct on 80% of the samples, compared to 55% correct clustering using conventional spectral clustering.
Another example: elephant vs. sea elephant.
Using these features the clustering was correct on 90% of the samples, compared to 65% correct clustering using conventional spectral clustering.
Genomics
Microarray technology provides many measurements of gene expressions for different tissue samples: few samples (~50) and large dimension (~10,000 genes).
Goal: recognize the relevant genes that separate between cells with different biological characteristics (normal vs. tumor, different subclasses of tumor cells).
• Classification of tissue samples (type of cancer, normal vs. tumor).
• Find novel subclasses (unsupervised).
• Find genes responsible for the classification (new insights for drug design).
The Synthetic Dataset of Ben-Dor, Friedman, and Yakhini
• The model consists of 6 parameters.
• A relevant feature is sampled from N(m_A, s) or N(m_B, s), where the class means m_A, m_B are sampled uniformly from [-1.5d, 1.5d].
• An irrelevant feature is sampled from N(0, s).
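A hedged generator following the sampling rules above; the slide does not spell out all six model parameters, so the feature counts, d, s, and the class balance below are placeholders for illustration.

```python
import numpy as np

def ben_dor_like_dataset(n_samples=50, n_relevant=20, n_irrelevant=980,
                         d=1.0, s=1.0, seed=0):
    """Toy generator following the slide's sampling rules (parameter values are placeholders).

    Relevant feature i: class A samples ~ N(mA_i, s), class B samples ~ N(mB_i, s),
    with mA_i, mB_i drawn uniformly from [-1.5 d, 1.5 d].  Irrelevant feature: ~ N(0, s).
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)           # 0 = class A, 1 = class B
    mA = rng.uniform(-1.5 * d, 1.5 * d, size=n_relevant)
    mB = rng.uniform(-1.5 * d, 1.5 * d, size=n_relevant)
    means = np.where(labels[:, None] == 0, mA, mB)         # (n_samples, n_relevant)
    relevant = rng.normal(loc=means, scale=s)
    irrelevant = rng.normal(scale=s, size=(n_samples, n_irrelevant))
    return np.hstack([relevant, irrelevant]), labels

X, y = ben_dor_like_dataset()
print(X.shape, y.shape)    # (50, 1000) (50,)
```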
The Synthetic Dataset of Ben-Dor, Friedman, and Yakhini
• Results of simulations obtained by varying one parameter at a time out of the six.
• MSA: the max-surprise algorithm of Ben-Dor, Friedman, and Yakhini, used for comparison.
Follow-Up Work: Shashua & Wolf, ECCV'04
Feature selection with "side" information: given the "main" data and the "side" data, find weights $\alpha$ such that the weighted affinity of the main data has k coherent clusters, while the weighted affinity of the side data has low cluster coherence (a single cluster).
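One plausible way to encode the two requirements (an illustrative choice on my part, not necessarily the formulation of the ECCV'04 paper) is to reward the spectral coherence of the weighted main data while penalizing that of the weighted side data, alternating updates exactly as in the basic algorithm; the trade-off weight lam is an assumed parameter.

```python
import numpy as np

def feature_weights_with_side(M, N, k, lam=1.0, iters=50, seed=0):
    """Weight features so the main data M clusters well while the side data N does not.

    M : (n, q1) main data, N : (n, q2) side data; rows are the same n features.
    Illustrative objective: alpha^T (G_M - lam * G_N) alpha with ||alpha|| = 1, where
    G_M, G_N are the quadratic forms of the k-cluster spectral coherence on each dataset.
    """
    rng = np.random.default_rng(seed)
    QM, _ = np.linalg.qr(rng.normal(size=(M.shape[1], k)))
    QN, _ = np.linalg.qr(rng.normal(size=(N.shape[1], k)))
    for _ in range(iters):
        GM = (M @ M.T) * ((M @ QM) @ (M @ QM).T)   # coherence quadratic form, main data
        GN = (N @ N.T) * ((N @ QN) @ (N @ QN).T)   # coherence quadratic form, side data
        _, V = np.linalg.eigh(GM - lam * GN)       # reward main coherence, penalize side coherence
        alpha = V[:, -1]
        if alpha.sum() < 0:
            alpha = -alpha
        AM = M.T @ (alpha[:, None] * M)
        QM, _ = np.linalg.qr(AM @ QM)              # orthogonal-iteration step on the main data
        AN = N.T @ (alpha[:, None] * N)
        QN, _ = np.linalg.qr(AN @ QN)              # and on the side data
    return alpha
```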
Follow-Up Work: Shashua & Wolf, ECCV'04
"Kernelizing" the scheme: the data undergo a high-dimensional mapping, and rather than having inner products we have outer products.
END