Genetic-Algorithm-Based Instance and Feature Selection

Instance Selection and Construction for Data Mining, Ch. 6
H. Ishibuchi, T. Nakashima, and M. Nii

Abstract
• A GA-based approach for selecting a small number of instances from a given data set in a pattern classification problem.
• The goal is to improve the classification ability of the nearest neighbor classifier by searching for an appropriate reference set.

Genetic Algorithm
• Coding
  • Binary string of length (n + m), where n is the number of features and m is the number of instances
  • ai: inclusion or exclusion of the i-th feature
  • sp: inclusion or exclusion of the p-th instance
• Fitness function
  • Minimize |F|, minimize |P|, and maximize g(S)
  • |F|: number of selected features
  • |P|: number of selected instances
  • g(S): classification performance
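The coding above can be illustrated with a minimal sketch. It assumes the feature bits a1..an come first in the string, followed by the instance bits s1..sm; the slide only gives the total length (n + m), so this ordering and the helper name `decode` are illustrative assumptions.

```python
from typing import List, Tuple


def decode(chromosome: List[int], n_features: int) -> Tuple[List[int], List[int]]:
    """Split a binary string into selected feature indices (F) and instance indices (P).

    Assumption: the first n_features bits are a_1..a_n, the rest are s_1..s_m.
    """
    feature_bits = chromosome[:n_features]
    instance_bits = chromosome[n_features:]
    selected_features = [i for i, a in enumerate(feature_bits) if a == 1]
    selected_instances = [p for p, s in enumerate(instance_bits) if s == 1]
    return selected_features, selected_instances


# n = 3 features and m = 4 instances give a string of length 7.
chromosome = [1, 0, 1, 0, 1, 1, 0]
F, P = decode(chromosome, n_features=3)
print(F, P)  # [0, 2] [1, 2]
```

Here |F| = 2 and |P| = 2, which are the two quantities the fitness function tries to keep small.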

Genetic Algorithm
• Performance measure (first one): gA(S)
  • The number of correctly classified instances
  • Minimize |P| subject to gA(S) = m
• Performance measure (second one): gB(S)
  • When an instance xq is included in the reference set, xq is not selected as its own nearest neighbor.
• Fitness: combines g(S), |F|, and |P| using the weight values Wg, WF, and WP
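The two performance measures can be sketched with a plain 1-nearest-neighbor rule. This is a minimal sketch under stated assumptions: squared Euclidean distance over the selected features, ties broken by the first instance found, and a fitness of the weighted-sum form Wg·g(S) − WF·|F| − WP·|P|. The weighted-sum form is consistent with the stated objectives and weight names, but the slide does not spell the formula out, so treat it as an assumption. All function names are hypothetical.

```python
from typing import List, Optional, Sequence


def squared_distance(x: Sequence[float], y: Sequence[float], features: List[int]) -> float:
    # Distance is measured only over the selected features.
    return sum((x[i] - y[i]) ** 2 for i in features)


def nearest_label(q: int, X, y, features: List[int], instances: List[int],
                  exclude_self: bool) -> Optional[int]:
    """Label of the nearest reference instance to X[q], or None if none is available."""
    best_label, best_dist = None, float("inf")
    for p in instances:
        if exclude_self and p == q:
            continue  # gB: an instance may not be its own nearest neighbor
        d = squared_distance(X[q], X[p], features)
        if d < best_dist:
            best_dist, best_label = d, y[p]
    return best_label


def performance(X, y, features, instances, exclude_self: bool) -> int:
    """gA(S) when exclude_self is False, gB(S) when it is True."""
    return sum(
        1 for q in range(len(X))
        if nearest_label(q, X, y, features, instances, exclude_self) == y[q]
    )


def fitness(g: int, n_sel_features: int, n_sel_instances: int,
            w_g: float = 5.0, w_f: float = 1.0, w_p: float = 1.0) -> float:
    # Assumed weighted-sum form: reward accuracy, penalize |F| and |P|.
    return w_g * g - w_f * n_sel_features - w_p * n_sel_instances


X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
y = [0, 0, 1, 1]
features, instances = [0, 1], [0, 2]  # S = (F, P)

g_a = performance(X, y, features, instances, exclude_self=False)
g_b = performance(X, y, features, instances, exclude_self=True)
print(g_a, g_b)  # 4 2
```

In this toy case gA(S) counts the two reference instances as correct because each is its own nearest neighbor, while gB(S) does not, which is why gB(S) is the stricter measure.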

Genetic Algorithm
• Initialization
• Genetic operation: iterate the following procedure Npop/2 times to generate Npop strings
  • Randomly select a pair of strings
  • Apply uniform crossover
  • Apply a mutation operator
• Generation update: select the Npop best strings from the 2Npop strings (current population plus newly generated strings)
• Termination test
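The loop above can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes an even Npop, a random initial population, a fixed generation count as the termination test, and a plain (unbiased) bit-flip mutation. The `fitness` argument is any function that scores a bit string; the built-in `sum` is used below purely as a stand-in.

```python
import random
from typing import Callable, List, Tuple

Chromosome = List[int]


def uniform_crossover(a: Chromosome, b: Chromosome,
                      rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Swap each bit position between the two parents with probability 0.5."""
    c1, c2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        c1.append(x)
        c2.append(y)
    return c1, c2


def mutate(c: Chromosome, pm: float, rng: random.Random) -> Chromosome:
    """Flip each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in c]


def run_ga(fitness: Callable[[Chromosome], float], length: int, n_pop: int,
           generations: int, pm: float, seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    # Initialization: n_pop random binary strings.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n_pop)]
    for _ in range(generations):  # termination test: fixed number of generations
        offspring = []
        for _ in range(n_pop // 2):  # each pass yields two new strings
            p1, p2 = rng.sample(population, 2)  # randomly select a pair
            c1, c2 = uniform_crossover(p1, p2, rng)
            offspring.append(mutate(c1, pm, rng))
            offspring.append(mutate(c2, pm, rng))
        # Generation update: keep the n_pop best of the 2 * n_pop strings.
        merged = sorted(population + offspring, key=fitness, reverse=True)
        population = merged[:n_pop]
    return max(population, key=fitness)


best = run_ga(fitness=sum, length=12, n_pop=10, generations=30, pm=0.05)
print(best, sum(best))
```

Because the update keeps the best strings from parents and offspring together, the best fitness in the population never decreases from one generation to the next.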

Numerical Example

Biased Mutation
• An effective way to decrease the number of selected instances is to bias the mutation probability.
• In the biased mutation, a much larger probability is assigned to the mutation from sp = 1 to sp = 0 than to the mutation from sp = 0 to sp = 1.
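A biased mutation operator along these lines might look like the sketch below. It assumes the feature bits occupy the first `n_features` positions of the string and are mutated with an ordinary symmetric probability; the function name and argument names are hypothetical.

```python
import random
from typing import List


def biased_mutation(chromosome: List[int], n_features: int, pm_feature: float,
                    pm_1_to_0: float, pm_0_to_1: float,
                    rng: random.Random) -> List[int]:
    """Flip feature bits symmetrically and instance bits asymmetrically."""
    mutated = []
    for idx, bit in enumerate(chromosome):
        if idx < n_features:
            p = pm_feature      # feature bit a_i: same probability both ways
        elif bit == 1:
            p = pm_1_to_0       # instance bit s_p = 1 -> 0 (drop an instance)
        else:
            p = pm_0_to_1       # instance bit s_p = 0 -> 1 (add an instance)
        mutated.append(1 - bit if rng.random() < p else bit)
    return mutated


rng = random.Random(0)
parent = [1, 1] + [1] * 20  # 2 feature bits followed by 20 selected instances
child = biased_mutation(parent, n_features=2, pm_feature=0.01,
                        pm_1_to_0=0.1, pm_0_to_1=0.01, rng=rng)
print(sum(child[2:]), "instances still selected")
```

With pm_1_to_0 ten times larger than pm_0_to_1, instances are removed from the reference set far more often than they are added, which pushes |P| downward over the generations.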

Data sets
• 2 artificial + 4 real
  • Normal distribution with small overlap
  • Normal distribution with large overlap
  • Iris data
  • Appendicitis data
  • Cancer data
  • Wine data

Parameter Specifications
• Population size: 50
• Crossover probability: 1.0
• Mutation probability
  • Pm = 0.01 for feature selection
  • Pm(1 → 0) = 0.1 for instance selection
  • Pm(0 → 1) = 0.01 for instance selection
• Stopping condition: 500 generations
• Weight values: Wg = 5; WF = 1; WP = 1
• Performance measure: gA(S) or gB(S)
• 30 trials for each data set
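For reference, the settings above can be collected in one place. The sketch below uses a frozen dataclass; the class name `GAConfig` and the field names are hypothetical, and only the values come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GAConfig:
    """Parameter values as listed on the slide; field names are illustrative."""
    pop_size: int = 50
    crossover_prob: float = 1.0
    pm_feature: float = 0.01         # Pm for feature bits
    pm_instance_1_to_0: float = 0.1  # Pm(1 -> 0) for instance bits
    pm_instance_0_to_1: float = 0.01 # Pm(0 -> 1) for instance bits
    generations: int = 500           # stopping condition
    w_g: float = 5.0                 # weight on g(S)
    w_f: float = 1.0                 # weight on |F|
    w_p: float = 1.0                 # weight on |P|
    measure: str = "gA"              # "gA" or "gB"
    trials: int = 30                 # independent runs per data set


config = GAConfig()
print(config)
```

With Wg = 5 and WF = WP = 1, one additional correctly classified instance outweighs removing up to four features or instances, so accuracy takes priority over compactness.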

Performance on Training Data

Performance on Test Data
• Leave-one-out procedure (iris & appendicitis)
• 10-fold cross-validation (cancer & wine)
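Both evaluation procedures are instances of k-fold splitting: 10-fold cross-validation uses k = 10, and leave-one-out uses k equal to the number of samples. The sketch below generates the index splits only. It assumes contiguous, unshuffled folds for simplicity (the slides do not say how folds were formed), and `k_fold_splits` is a hypothetical helper name.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 10-fold cross-validation on 25 samples.
ten_fold = list(k_fold_splits(25, 10))
print([len(test) for _, test in ten_fold])  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]

# Leave-one-out: k equals the number of samples, so each test fold has one index.
loo = list(k_fold_splits(6, 6))
print([test for _, test in loo])  # [[0], [1], [2], [3], [4], [5]]
```

In each split, the GA would select features and instances using only the training indices, and the resulting reference set would then classify the held-out test indices.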

Effect of Feature Selection

Effect on NN

Some Variants
