140 likes | 363 Views
Genetic-Algorithm-Based Instance and Feature Selection. Instance Selection and Construction for Data Mining Ch. 6 H. Ishibuchi, T. Nakashima, and M. Nii. Abstract. GA based approach for selecting a small number of instances from a given data set in a pattern classification problem.
E N D
Genetic-Algorithm-Based Instance and Feature Selection Instance Selection and Construction for Data Mining Ch. 6 H. Ishibuchi, T. Nakashima, and M. Nii
Abstract • GA based approach for selecting a small number of instances from a given data set in a pattern classification problem. • To improve the classification ability of our nearest neighbor classifier by searching for an appropriate reference set.
Genetic Algorithm • Coding • Binary string of the length (n+m) • ai: inclusion or exclusion of the i-th feature • sp : the inclusion or exclusion of the p-th instance • Fitness function • Minimize |F|, minimize |P|, and maximize g(S) • |F| : number of selected feature • |P| : number of selected instance • g(S) : classification performance
Genetic Algorithm • Performance measure (first one) : gA(S) • The number of correctly classified instances • Minimize |P| subject to gA(S) = m • Performance measure (second one) : gB(S) • When an instance xq was included in the reference set, xq was not selected as its own nearest neighbor. • fitness
Genetic Algorithm • Initialization • Genetic Operation: Iterate the following procedure Npop/2 times to generate Npop string • Randomly select a pair of strings • Apply a uniform crossover • Apply a mutation operator • Generation Update: Select the Npop best string from 2Npop • Termination test
Biased Mutation • For effectively decreasing the number of selected instances is to bias the mutation probability • In the biased mutation, a much larger probability is assigned to the mutation from sp = 1 to sp = 0.
Data sets • 2 artificial + 4 real • Normal distribution with small overlap • Normal distribution with large overlap • Iris data • Appendicitis Data • Cancer Data • Wine Data
Parameter Specifications • Pop Size : 50 • Crossover Prob. : 1.0 • Mutation Prob. • Pm = 0.01 for feature selection • Pm(1 0) = 0.1 for instance selection • Pm(0 1) = 0.01 for instance selection • Stopping condition : 500 gen. • Weight values : Wg = 5; WF = 1; WP = 1 • Performance measure : gA(S) or gB(S) • 30 trials for each data
Performance on Test Data • Leaving-one-out procedure (iris & appendicitis) • 10-fold cross-validation (cancer & wine)