Feature Selection Focused within Error Clusters
Sui-Yu Wang and Henry Baird
Presented by Sui-Yu Wang
Feature Selection
• Given a set of n features, find a subset of k < n features that still performs well
• The best k features chosen separately are usually not the best k when chosen together (Elashoff et al., 1967)
• To select the optimal subset, one has to exhaustively search through all k-element subsets (Cover and Van Campenhout, 1977); a toy illustration of this exhaustive search follows
• Given a limited number of training samples and features, finding the minimum subset of features that misclassifies no training sample is NP-complete (Van Horn and Martinez, 1994)
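As a toy illustration of the combinatorial cost (not part of the original talk), the sketch below scores every k-element feature subset with a wrapper-style classifier; the synthetic dataset, the logistic-regression classifier, and the cross-validation scoring are all assumptions made purely for the example.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data: 10 candidate features, pick the best k = 3.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
k = 3

best_score, best_subset = -1.0, None
for subset in combinations(range(X.shape[1]), k):      # all C(n, k) subsets
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, list(subset)], y, cv=3).mean()
    if score > best_score:
        best_score, best_subset = score, subset

print(best_subset, best_score)                          # winning 3-feature subset
```

With n = 10 and k = 3 this is only C(10, 3) = 120 classifier fits; for the hundreds of pixel features in the motivating example below, the number of subsets becomes astronomical.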
Feature Selection
• Methods can be divided into three categories: wrappers, filters, and embedded methods (Guyon and Elisseeff, 2003)
• Filters: rank features according to various metrics
• Wrappers: evaluate subsets of features according to a given classifier
• Embedded methods: similar to wrappers, but use non-exhaustive search methods
A Motivating Example
Task: classify each pixel as handwriting or blank.
We have to search within a diameter of 25 pixels to find any useful features: D ≈ 450+ pixel values.
So the possible features are extremely numerous: any combination of 450 pixel values.
Popular Method: Principal Components Analysis (PCA)
• PCA finds a small number of linear combinations of the original features
• PCA finds the dimensions that represent the data best in a least-squares sense, but does not guarantee good separation of the classes (Pearson, 1901)
• Most algorithms apply PCA first, then run their respective feature selection algorithm on the reduced set (see the sketch below)
• This can throw away potentially interesting discriminating information
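For concreteness, a minimal, generic PCA sketch (the data shapes and scikit-learn usage are assumptions for illustration, not the talk's pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((1000, 625))        # placeholder: 1000 samples, D = 625 raw pixels
pca = PCA(n_components=20)         # keep 20 linear combinations of the pixels
X_reduced = pca.fit_transform(X)   # shape (1000, 20)

# Explained variance is a least-squares quantity; it says nothing about how
# well the two classes separate in the reduced space.
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```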
Our Research Strategy
• We want to find methods for guiding the search for a few strongly discriminating features
• We adopt a greedy heuristic: constructing one feature at a time
• We focus our search on cases where the current features fail
Formalities
• We assume a two-class problem
• The original sample space is ℝ^D, where D is huge
• We are given d << D hand-crafted features; all samples are projected into the feature space ℝ^d by a feature extractor f : ℝ^D → ℝ^d. We may lose information in the process
• If there is any discriminating information in the sample space ℝ^D but not in the feature space ℝ^d, it must be in the null space of f
Finding the Null Space
• If f is linear, its null space can be computed by standard linear algebra
• Given the d × D matrix F representing f, a singular value decomposition (SVD) can be used to find a set of vectors spanning the null space of F
• F can be factorized as F = U Σ Vᵀ, where U and V are orthogonal matrices and Σ is diagonal
• The columns of V corresponding to zero singular values span the null space of F (see the sketch below)
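A minimal sketch of the SVD route, assuming the current feature extractor is a d × D matrix F (the random F and the sizes below are placeholders): the rows of Vᵀ beyond the rank of F give an orthonormal basis of its null space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 10, 625
F = rng.random((d, D))             # stand-in for a linear feature extractor

U, s, Vt = np.linalg.svd(F, full_matrices=True)
tol = max(F.shape) * np.finfo(float).eps * s.max()
rank = int((s > tol).sum())
N = Vt[rank:].T                    # columns span the null space of F

print(N.shape)                     # (625, 615): D - rank null-space directions
print(np.allclose(F @ N, 0.0))     # True: F maps these directions to zero
```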
Finding the Next Feature
• Samples that fall at the same point in the feature space ℝ^d are not discriminated by the current feature set
• Samples that lie in tight clusters in ℝ^d are only weakly discriminated by the current feature set
• A tight cluster containing errors of both classes indicates cases where the current feature set fails completely
• Therefore, we use these tight clusters to guide the forward search for new features
• Once we have projected the samples of a tight error cluster into the null space, we find a hyperplane that best separates them, and calculate a given sample x's distance to this hyperplane, (wᵀx + b) / ‖w‖ with w the normal vector and b the offset, as the new feature (see the sketch below)
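The following is an illustrative sketch, not the authors' code: project the error-cluster samples into the null space, fit a linear separator there (a linear SVM is one reasonable stand-in for the hyperplane-finding step), and return each sample's signed distance to that hyperplane as the candidate feature.

```python
import numpy as np
from sklearn.svm import LinearSVC

def candidate_feature(X_cluster, y_cluster, N, X_all):
    """X_cluster, X_all: samples in the original D-dimensional pixel space.
    y_cluster: class labels of the error-cluster samples.
    N: (D, m) orthonormal basis of the current extractor's null space."""
    P = X_cluster @ N                              # cluster, seen in the null space
    clf = LinearSVC(C=1.0, max_iter=10_000).fit(P, y_cluster)
    w, b = clf.coef_.ravel(), clf.intercept_[0]    # hyperplane w·x + b = 0
    # Signed distance of every sample's null-space projection to the hyperplane.
    return (X_all @ N @ w + b) / np.linalg.norm(w)
```

Here N would be the basis computed in the previous sketch; the candidate feature is appended to the current feature set and kept only if it lowers the error rate sufficiently.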
Operate on Points in the Null Space
• There are many ways to project points from the sample space into the null space of f
• The orthogonal projection onto a particular subspace is unique
• Let N = [n₁ … n_m], where the columns nᵢ form an orthonormal basis for the null space. Then the orthogonal projection of a sample x is N Nᵀ x (see the sketch below)
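A small numeric check of the projection formula (the dimensions and the QR-generated basis are placeholders for the demo): with orthonormal columns N, the orthogonal projection onto the subspace is N Nᵀ, and applying it twice changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
D, m = 625, 615
N, _ = np.linalg.qr(rng.random((D, m)))   # orthonormal columns, just for the demo

x = rng.random(D)
x_proj = N @ (N.T @ x)                    # orthogonal projection N Nᵀ x

# Idempotence: projecting an already-projected point changes nothing.
print(np.allclose(N @ (N.T @ x_proj), x_proj))
```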
Outline of the Algorithm
Repeat
    Draw enough samples to train a classifier
    Draw enough samples to build a test set
    Find clusters of errors in the feature space ℝ^d
    Repeat
        Choose a tight cluster with both types of errors
        Draw enough samples to populate this cluster (if necessary)
        Project the cluster into the null space
        Find a separating hyperplane in the null space, with normal vector w, that best separates the samples in this cluster
        Construct a new feature and examine its performance
    Until the new feature lowers the error rate sufficiently
Until the error rate is satisfactory to the user
Experiments
• Experiments were conducted on a document image content extraction problem
• Each image pixel is treated as a sample
• The task is to classify each sample as handwriting or machine print
• Possible features are extracted from the 25 × 25 pixel square around the pixel, so D = 625 (see the sketch below)
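A minimal sketch of this per-pixel sample extraction (the random image and the edge padding are assumptions, not the authors' exact preprocessing): each pixel contributes the 25 × 25 window centred on it, flattened to a D = 625 vector.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((200, 300))                     # placeholder grayscale page
win = 25

padded = np.pad(image, win // 2, mode="edge")      # full window for border pixels
windows = sliding_window_view(padded, (win, win))  # shape (200, 300, 25, 25)
samples = windows.reshape(-1, win * win)           # one D = 625 vector per pixel

print(samples.shape)                               # (60000, 625)
```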
Experimental Results
[Figure: example classification results; class labels PH, MP, BL, HW]
Experiments
• We divide the data into three sets: a training set, a feature discovery set, and a test set
• The training set consists of 4,469,740 machine-print (MP) samples and 943,178 handwriting (HW) samples
• The feature discovery set consists of 4,980,418 MP and 1,496,949 HW samples
• The test set consists of 816,673 MP samples and 649,113 HW samples
Which Cluster is Best?
• Experiments suggest that tight, balanced clusters (containing errors of both classes) are best
Future Work
• Apply the method to other problems
• Continue the experiments to see how low the error rate can drop
• Analyze cluster statistics to establish rules for selecting better cluster candidates
• Try other hyperplane-finding methods
• Establish a theoretical framework for when this approach is guaranteed to work and when it fails