
A New Paradigm for Feature Selection With some surprising results

This presentation describes a new approach for feature selection in machine learning, aiming to find the most relevant subset of features for an inference task. The method uses spectral information from the affinity matrix together with a soft (weighted) selection scheme whose solutions turn out to be sparse. The feature weights are optimized iteratively, with the aim of improving the accuracy, confidence, and required training-sample size of downstream learning algorithms.

Presentation Transcript


  1. A New Paradigm for Feature Selection (with some surprising results). Amnon Shashua, School of Computer Science & Engineering, The Hebrew University. Joint work with Lior Wolf (Wolf & Shashua, ICCV’03).

  2. Problem Definition. Given a sample of feature measurements (each data point is a feature vector), find the subset of features which is most “relevant” with respect to an inference (learning) task. Comments: • Data points can represent images (pixels), feature attributes, wavelet coefficients, etc. • The task is to select a subset of coordinates of the data points such that the accuracy, confidence, and training sample size of a learning algorithm would be, ideally, optimal. • Need to define a “relevance” score. • Need to overcome the exponential nature of subset selection. • If a “soft” selection approach is used, need to make sure the solution is “sparse”.
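
A minimal sketch of the setup, under assumed shapes (the names M, n, q below are illustrative, not from the slides): the sample is arranged as a matrix whose rows are features and whose columns are data points, and feature selection amounts to choosing a subset of rows.

```python
# Illustrative setup: rows of M are features, columns are data points;
# selecting features = selecting rows.
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 30                    # n features, q data points
M = rng.normal(size=(n, q))      # full sample matrix

selected = [0, 3, 7, 12]         # a candidate feature subset (row indices)
M_sub = M[selected, :]           # the reduced sample seen by the learner
print(M_sub.shape)               # (4, 30): 4 features, 30 data points
```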

  3. Examples: • Text Classification: features typically represent word-frequency counts, and only a small fraction is expected to be relevant. Typical examples include automatic sorting of URLs into a web directory and detection of spam email. • Visual Recognition: image measurements compared with an exponentiated L1 similarity, similarity(x, y) = e^(-|x - y|_1). • Genomics: gene expressions over tissue samples. Goal: recognizing the relevant genes which separate between normal and tumor cells, between different subclasses of cancer, and so on.

  4. Why Select Features? • Most learning algorithms do not scale well with the growth of irrelevant features. • Ex. 1: the number of training examples required by some supervised learning methods grows exponentially. • Ex. 2: for classifiers which can optimize their “capacity” (e.g., large-margin hyperplanes), the effective VC dimension d grows fast with irrelevant coordinates, faster than the capacity increase. • Computational efficiency considerations when the number of coordinates is very high. • The structure of the data gets obscured by large amounts of irrelevant coordinates. • Run-time of the (already trained) inference engine on new test examples.

  5. Existing Approaches. • Filter methods: pre-processing of the data, independent of the inference engine. Examples: mutual information measures, correlation coefficients, clustering. • Embedded / wrapper methods: select features useful for building a good predictor / inference engine. Example: run SVM training on every candidate subset of features. Computationally expensive in general.

  6. Feature Subset Relevance - Key Idea. Working assumption: the relevant subset of rows induces columns that are coherently clustered. Note: we are assuming an unsupervised setting (data points are not labeled); the framework applies just as easily to supervised settings.

  7. How to measure cluster coherency? • We wish to avoid explicit clustering for each subset of rows. • We wish for a measure which is amenable to continuous functional analysis. • Key idea: use spectral information from the affinity matrix induced by the selected features. • How to represent the subset of features?

  8. Definition of Relevancy - The Standard Spectrum. General idea: select a subset of rows from the sample matrix M such that the resulting affinity matrix (between columns) has high values associated with its first k eigenvalues; the matrix Q consists of the first k eigenvectors of that affinity matrix.
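
A rough illustration of scoring a feature subset by the spectrum of the affinity matrix it induces. The Gaussian affinity and the sum-of-top-k-eigenvalues score below are assumed, illustrative choices; the exact score in the paper may differ.

```python
# Score a candidate feature subset by how dominant the top-k eigenvalues of
# the induced affinity matrix are (illustrative choice of affinity and score).
import numpy as np

def affinity(M_sub, sigma=1.0):
    """Affinity between data points (columns), using only the selected rows."""
    D = M_sub.T                                     # q x (#selected features)
    sq = ((D[:, None, :] - D[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def relevance(M_sub, k=2, sigma=1.0):
    """Sum of the k largest eigenvalues: larger means the columns form a
    more coherent k-cluster structure under these features."""
    w = np.linalg.eigvalsh(affinity(M_sub, sigma))  # ascending eigenvalues
    return w[-k:].sum()

rng = np.random.default_rng(1)
q = 40
labels = rng.integers(0, 2, size=q)                 # two hidden clusters
relevant = np.vstack([labels * 3.0 + rng.normal(0, 0.3, q) for _ in range(3)])
irrelevant = rng.normal(0, 3.0, size=(3, q))
print(relevance(relevant), relevance(irrelevant))   # relevant rows score much higher
```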

  9. (Unsupervised) Optimization Problem. The hard-selection version, maximizing the spectral relevance over 0/1 feature-selection variables subject to the orthonormality constraints, is too difficult to be considered in practice (mixed integer and continuous-variable programming).

  10. (Unsupervised) Optimization Problem - Soft Version I. Replace the 0/1 selection variables with continuous weights, subject to a normalization constraint; a non-linear penalty function penalizes uniform weight vectors. The result is a non-linear programming problem and could be quite difficult to solve.

  11. (Unsupervised) Optimization Problem. Let the affinity matrix be a weighted sum over the feature rows, A_α = Σ_i α_i m_i m_iᵀ, for some unknown real scalars α_i, and maximize the spectral relevance subject to a unit-norm constraint on the weight vector. Motivation: from spectral clustering it is known that the eigenvectors tend to be discontinuous, and that may lead to an effortless sparsity property. Note: the optimization definition ignores requirements such as: • The weight vector should be sparse.

  12. The Algorithm. If the weight vector α were known, then A_α is known and Q is simply the matrix of its first k eigenvectors. If Q were known, then the problem becomes a quadratic form in α subject to the unit-norm constraint, and the solution α is the largest eigenvector of a matrix G (a sketch of the resulting alternating scheme is given below).
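
The transcript drops the slide's formulas, so the sketch below makes explicit assumptions about the two half-steps: the weighted affinity A_alpha = sum_i alpha_i m_i m_i^T and a pairing matrix with entries G_rs = (m_r . m_s)(m_r^T Q Q^T m_s). It is meant as one plausible realization of the alternation described above, not as the authors' exact algorithm.

```python
# Hedged sketch of the alternating (Q, alpha) scheme: Q from the spectrum of
# the weighted affinity, alpha from the largest eigenvector of G.
import numpy as np

def q_alpha_iteration(M, k, n_iter=20):
    """M: n x q matrix whose rows m_i are features and columns are data points."""
    n, q = M.shape
    alpha = np.full(n, 1.0 / np.sqrt(n))        # start from uniform unit-norm weights
    K = M @ M.T                                 # all inner products m_r . m_s
    for _ in range(n_iter):
        # Half-step 1: with alpha fixed, Q = first k eigenvectors of A_alpha.
        A = (M.T * alpha) @ M                   # A_alpha = sum_i alpha_i m_i m_i^T  (q x q)
        _, V = np.linalg.eigh(A)
        Q = V[:, -k:]                           # eigenvectors of the k largest eigenvalues
        # Half-step 2: with Q fixed, alpha = largest eigenvector of G.
        P = M @ Q                               # row r holds m_r^T Q
        G = K * (P @ P.T)                       # G_rs = (m_r . m_s)(m_r^T Q Q^T m_s)
        _, V = np.linalg.eigh(G)
        alpha = V[:, -1]
        if alpha.sum() < 0:                     # resolve the eigenvector's sign ambiguity
            alpha = -alpha
    return alpha, Q
```

On data where only a few rows induce coherent clusters among the columns, the large entries of alpha are expected to concentrate on those rows.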

  13. The Algorithm - power-embedded version. 1. Define the matrix G from the current Q. 2. Let α be the largest eigenvector of G. 3. Form the weighted affinity matrix A_α. 4. Multiply the current Q by A_α (orthogonal iteration). 5. “QR” factorization step to re-orthonormalize Q. 6. Increment the iteration counter and repeat. Convergence proof: take k = 1 for example; steps 4-5 become a power iteration, and the required inequality, which holds for all symmetric matrices A and unit vectors q, follows from convexity.
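
A hedged sketch of this power-embedded variant, using the same assumed forms of A_alpha and G as in the previous sketch: the full eigendecomposition of A_alpha is replaced by a single orthogonal-iteration step (multiply by A_alpha, then QR-factorize).

```python
# Power-embedded variant (hedged sketch): one orthogonal-iteration step per
# pass instead of a full eigendecomposition of A_alpha.
import numpy as np

def q_alpha_power_embedded(M, k, n_iter=50, seed=0):
    n, q = M.shape
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(q, k)))   # random orthonormal start
    K = M @ M.T
    alpha = np.full(n, 1.0 / np.sqrt(n))
    for _ in range(n_iter):
        P = M @ Q
        G = K * (P @ P.T)                          # step 1: build G from the current Q
        _, V = np.linalg.eigh(G)
        alpha = V[:, -1]                           # step 2: largest eigenvector of G
        if alpha.sum() < 0:
            alpha = -alpha
        A = (M.T * alpha) @ M                      # step 3: weighted affinity A_alpha
        Z = A @ Q                                  # step 4: orthogonal-iteration step
        Q, _ = np.linalg.qr(Z)                     # step 5: "QR" factorization step
    return alpha, Q                                # step 6: increment and repeat
```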

  14. Positivity and Sparsity of the weight vector - Hand-Waving Argument. The part of the spectrum beyond the first k eigenvalues is minimized if rank(A) = k; since A is a sum of rank-1 matrices (one per weighted feature), adding redundant terms raises the rank. If we would like rank(A) to be small, we shouldn’t add too many rank-1 terms, therefore the weight vector should be sparse. Note: this argument does not say anything with regard to why the weights should be positive.

  15. Positivity and Sparsity of the weight vector. The key to the emergence of a sparse and positive weight vector has to do with the way the entries of G are defined. Consider k = 1 for example: each entry is a product of inner products, and the sign configuration that would make it negative cannot happen. The expected values of the entries of G are biased towards a positive number.
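
A small Monte Carlo illustration of the positive bias. It assumes that, for k = 1, an entry of G has the form (m_r . m_s)(m_r . q)(m_s . q) for unit vectors m_r, m_s and the leading eigenvector q; that form is an assumption, since the slide's definition is not reproduced in the transcript.

```python
# Empirically check that entries of the assumed form have a positive mean
# (and are positive more often than not) for random unit vectors.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 20, 50_000

def unit(v):
    return v / np.linalg.norm(v)

vals = np.empty(trials)
for t in range(trials):
    m_r, m_s, q = (unit(rng.normal(size=n)) for _ in range(3))
    vals[t] = (m_r @ m_s) * (m_r @ q) * (m_s @ q)
print(vals.mean() > 0, (vals > 0).mean())   # positive mean; >50% positive entries
```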

  16. Positivity and Sparsity of the weight vector - questions to analyze: 1. What is the minimal value of such an entry when the vectors vary over the n-dimensional unit hypersphere? 2. Given a uniform sampling over the n-dimensional unit hypersphere, what are the mean and variance of the entries? 3. Given those statistics, what is the probability that the first eigenvector of G is strictly non-negative (all entries of the same sign)?

  17. Proposition 3: with an infinitesimal …, let … be the largest eigenvector. Then …, where … is the cumulative distribution function of … (empirical fit shown on the slide).

  18. Proposition 4 (with O. Zeitouni and M. Ben-Or): … when …, for any value of ….

  19. Sparsity Gap - definition. Let p be the fraction of relevant features and q = 1 - p the fraction of irrelevant ones. Let the largest eigenvector of G be split into two parts, where the first part holds the first np entries and the second part holds the remaining nq entries. The sparsity gap corresponding to G is the ratio between the mean of the first part and the mean of the second part.
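
A hedged sketch of measuring the sparsity gap on synthetic data in which the first n·p rows are relevant (they induce coherent clusters among the columns) and the rest are noise. It reuses the assumed q_alpha_iteration() from the earlier sketch, and uses absolute values of the weights to sidestep the sign ambiguity.

```python
# Sparsity gap = mean weight over relevant rows / mean weight over the rest.
import numpy as np

rng = np.random.default_rng(2)
n, q, p = 40, 60, 0.15                         # 6 relevant features out of 40
labels = rng.choice([-1.0, 1.0], size=q)       # two hidden clusters of columns
n_rel = int(n * p)
relevant = np.vstack([3.0 * labels + rng.normal(0, 0.3, q) for _ in range(n_rel)])
irrelevant = rng.normal(0, 1.0, size=(n - n_rel, q))
M = np.vstack([relevant, irrelevant])

alpha, _ = q_alpha_iteration(M, k=2)           # assumed sketch from slide 12
gap = np.abs(alpha[:n_rel]).mean() / np.abs(alpha[n_rel:]).mean()
print(gap)                                     # large gap: weights concentrate on relevant rows
```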

  20. Sparsity Gap - Proposition 4: let …, and let … be the largest eigenvector of G. The sparsity gap corresponding to G is: …. Example: ….

  21. The feature-selection-for-SVM benchmark. • Two synthetic data sets were proposed in a paper by Weston, Mukherjee, Chapelle, Pontil, Poggio and Vapnik from NIPS 2001. • The data sets were designed to have a few features which are separable by SVM, combined with many non-relevant features: 6 relevant features and 196 irrelevant ones. • The data sets were designed for the labeled case. The linear data set: • The linear data set is almost linearly separable once the correct features are recovered. • With probability 0.7 the data is almost separable by the first 3 relevant features and non-separable by the remaining 3 relevant features; with probability 0.3 the second group of relevant features is the separable one. • The remaining 196 features were drawn from N(0, 20).
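
A hedged generator in the spirit of the linear data set described above. The transcript does not give the exact distributions of the 6 relevant features, so the class-dependent means below are placeholder choices, and N(0, 20) is read as a normal with standard deviation 20.

```python
# Sketch of a linear benchmark generator: 6 relevant features (one group of 3
# is the separating one with probability 0.7, the other with probability 0.3)
# plus 196 irrelevant features.
import numpy as np

def linear_benchmark(n_samples, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_samples)
    X = np.empty((n_samples, 202))
    shift = np.array([1.0, 2.0, 3.0])            # placeholder class-dependent means
    for i in range(n_samples):
        rel = rng.normal(0.0, 1.0, size=6)
        if rng.random() < 0.7:
            rel[:3] += y[i] * shift              # first group separates
        else:
            rel[3:] += y[i] * shift              # second group separates
        X[i, :6] = rel
        X[i, 6:] = rng.normal(0.0, 20.0, size=196)   # irrelevant features
    return X, y

X, y = linear_benchmark(200)
print(X.shape)    # (200, 202)
```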

  22. Results – linear data set

  23. Results – non-linear data set. The unsupervised algorithm became effective only from 80 data points upward and is not shown here.

  24. There are two species of frogs in this figure:

  25. Green frog (Rana clamitans); American toad.

  26. Automatic separation. • We use small patches as basic features. • In order to compare patches we use the L1 norm on their color histograms: similarity(p1, p2) = e^(-|h1 - h2|_1), where h1, h2 are the color histograms of the two patches.
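
A hedged sketch of this patch similarity. The histogram binning (8 bins per RGB channel) is an illustrative choice, not taken from the slides.

```python
# Exponentiated L1 distance between normalized color histograms of two patches.
import numpy as np

def color_histogram(patch, bins=8):
    """patch: H x W x 3 uint8 array; returns a normalized per-channel histogram."""
    hists = [np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def patch_similarity(p1, p2):
    h1, h2 = color_histogram(p1), color_histogram(p2)
    return np.exp(-np.abs(h1 - h2).sum())        # similarity = e^(-|h1 - h2|_1)

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
b = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
print(patch_similarity(a, a), patch_similarity(a, b))   # 1.0 for identical patches
```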

  27. The matrix A: many features over ~40 images. The similarity between an image A and a patch B is the maximum over all similarities between the patches p in image A and the patch B: similarity(A, B) = max over p in A of similarity(p, B).
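
A hedged sketch of this image-to-patch similarity, reusing the assumed patch_similarity() from the previous sketch; the patch size and stride are illustrative.

```python
# Slide a window over the image and keep the best similarity to the query patch.
import numpy as np

def image_patch_similarity(image, query, patch=16, stride=8):
    """image: H x W x 3 array; query: patch x patch x 3 array."""
    H, W, _ = image.shape
    best = 0.0
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            s = patch_similarity(image[y:y + patch, x:x + patch], query)
            best = max(best, s)                  # max over all patches p in the image
    return best

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
q = img[8:24, 8:24].copy()                       # a patch cut from the image itself
print(image_patch_similarity(img, q))            # 1.0, since the patch occurs in the image
```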

  28. Selected features (Green frog, Rana clamitans, vs. American toad). Using these features the clustering was correct on 80% of the samples, compared to 55% correct clustering using conventional spectral clustering.

  29. Another example: elephant vs. sea-elephant. Using these features the clustering was correct on 90% of the samples, compared to 65% correct clustering using conventional spectral clustering.

  30. Genomics. Microarray technology provides many measurements of gene expressions for different tissue samples. Goal: recognizing the relevant genes that separate between cells with different biological characteristics (normal vs. tumor, different subclasses of tumor cells). • Classification of tissue samples (type of cancer, normal vs. tumor). • Finding novel subclasses (unsupervised). • Finding the genes responsible for classification (new insights for drug design). Few samples (~50) and large dimension (~10,000 genes).

  31. The synthetic dataset of Ben-Dor, Friedman, and Yakhini. • The model consists of 6 parameters. • A relevant feature is sampled from N(mA, s) or N(mB, s) according to the class, where the class means mA, mB are sampled uniformly from [-1.5d, 1.5d]. • An irrelevant feature is sampled from N(0, s).
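
A hedged generator following the description above: class-dependent means for relevant features, N(0, s) for irrelevant ones. The remaining model parameters (numbers of samples and features, and the values of s and d) are placeholders, since the slide's parameter list is not in the transcript.

```python
# Sketch of a Ben-Dor-style synthetic dataset: relevant features drawn from
# N(mA, s) or N(mB, s) by class, irrelevant features from N(0, s).
import numpy as np

def ben_dor_like(n_samples=50, n_relevant=20, n_irrelevant=980,
                 d=1.0, s=1.0, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.choice([0, 1], size=n_samples)
    m_a = rng.uniform(-1.5 * d, 1.5 * d, size=n_relevant)    # class-A means
    m_b = rng.uniform(-1.5 * d, 1.5 * d, size=n_relevant)    # class-B means
    means = np.where(y[:, None] == 0, m_a, m_b)              # mean picked by class
    relevant = rng.normal(means, s)                          # N(mA, s) or N(mB, s)
    irrelevant = rng.normal(0.0, s, size=(n_samples, n_irrelevant))
    return np.hstack([relevant, irrelevant]), y

X, y = ben_dor_like()
print(X.shape)    # (50, 1000): few samples, many features
```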

  32. The synthetic dataset of Ben-Dor, Friedman, and Yakhini. • Results of simulations done by varying one parameter out of the six. • MSA – the max-surprise algorithm of Ben-Dor, Friedman, and Yakhini.

  33. Follow-Up Work (Shashua & Wolf, ECCV’04). Feature selection with “side” information: given the “main” data and the “side” data, find weights such that the weighted main data has k coherent clusters while the weighted side data has low cluster coherence (a single cluster).

  34. Follow-Up Work (Shashua & Wolf, ECCV’04). “Kernelizing”: applying a high-dimensional mapping; rather than having inner products we have outer products.

  35. END
