The University of British Columbia Department of Electrical & Computer Engineering Feature Selection and Weighting using Genetic Algorithm for Off-line Character Recognition Systems Presented by Faten Hussein
Outline • Introduction & Problem Definition • Motivation & Objectives • System Overview • Results • Conclusions
Introduction Off-line Character Recognition System • Typical pipeline: text document → Scanning → Pre-Processing → Feature Extraction → Classification → Post-Processing → classified text • Applications: address readers • bank cheque readers • reading data entered in forms (e.g. tax forms) • detecting forged signatures
Introduction For a typical handwritten recognition task: • Many variants of character (symbol) shape and size • Different writers have different writing styles • The same person may write in different styles at different times • Thus, an unlimited number of variations for a single character exists
Introduction [Figure: variations in handwritten digits extracted from zip codes, annotated with L = number of loops and E = number of end points, e.g. L=0, E=3; L=1, E=1; L=2, E=0] • To overcome this diversity, a large number of features must be added • Examples of the features we used: moment invariants, number of loops, number of end points, centroid, area, circularity, and so on
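As a minimal sketch of what such shape features look like in practice, the snippet below computes a few of them from a binarized digit image using OpenCV and NumPy. This is an illustration only, not the thesis implementation; loops and end points would additionally require skeletonization and topology analysis, which are omitted here.

```python
# Illustrative only: compute some of the listed shape features for a
# binarized digit image (foreground = 255, background = 0).
import cv2
import numpy as np

def extract_shape_features(binary_img):
    m = cv2.moments(binary_img, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()               # 7 Hu moment invariants
    area = m["m00"]                               # foreground pixel count
    cx, cy = m["m10"] / area, m["m01"] / area     # centroid
    contours, _ = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    perimeter = cv2.arcLength(max(contours, key=cv2.contourArea), True)
    circularity = 4.0 * np.pi * area / perimeter ** 2   # 1.0 for a perfect circle
    # Number of loops and end points would need skeleton/topology analysis,
    # omitted from this sketch.
    return np.concatenate([hu, [area, cx, cy, circularity]])
```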
Problem Dilemma Adding more features to a character recognition system: • Why add them: to accommodate variations in symbols, in the hope of increasing classification accuracy • The cost: larger problem size, more run time and memory for classification • Feature design is an ad-hoc process that depends on experience and trial and error • Might add redundant/irrelevant features, which decrease accuracy
Feature Selection Solution: Feature Selection Definition: select a relevant subset of features from a larger set of features while maintaining or enhancing accuracy Advantages • Removes irrelevant and redundant features (e.g. a total of 40 features reduced to 16; of the 7 Hu moments only the first three kept; area removed as redundant with circularity) • Maintains/enhances classification accuracy (e.g. 70% recognition rate using 40 features -> 75% after FS using only 16 features) • Faster classification and lower memory requirements
Feature Selection/Weighting • The process of assigning weights (binary or real-valued) to features needs a search algorithm to find the set of weights that results in the best classification accuracy (an optimization problem) • A genetic algorithm is a good search method for such optimization problems
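As a concrete illustration, a candidate weight vector can be folded directly into the nearest-neighbour distance and scored by classification accuracy. The sketch below assumes a weighted Euclidean distance and leave-one-out accuracy as the quality measure; this matches the wrapper idea but is not necessarily the exact formulation used in the thesis.

```python
import numpy as np

def weighted_1nn_accuracy(X, y, weights):
    """Leave-one-out accuracy of a 1-NN classifier with per-feature weights.
    weights: array of length n_features; 0 drops a feature (selection),
    intermediate values rescale its influence (weighting)."""
    Xw = X * weights                                 # apply the candidate weights
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(Xw - Xw[i], axis=1)       # distances in weighted space
        d[i] = np.inf                                # exclude the sample itself
        correct += int(y[np.argmin(d)] == y[i])      # label of nearest neighbour
    return correct / len(X)
```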
Genetic Feature Selection/Weighting Why use a GA for FS/FW • Has been proven to be a powerful search method for the FS problem • Does not require derivative information or any extra knowledge; only the objective function (the classifier’s error rate) is needed to evaluate the quality of a feature subset • Searches a population of solutions in parallel, so it can provide a number of potential solutions, not only one • GA is resistant to becoming trapped in local minima
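A minimal genetic search over binary feature masks in this wrapper setting might look like the sketch below. The population size, rates, and operators are illustrative choices rather than the thesis parameters, and the fitness function passed in can be the `weighted_1nn_accuracy` sketch above, so the classifier's accuracy directly drives selection.

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_feature_selection(X, y, fitness_fn, pop_size=30, generations=50,
                              crossover_rate=0.8, mutation_rate=0.02):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))     # binary chromosomes (0/1 weights)
    for _ in range(generations):
        fitness = np.array([fitness_fn(X, y, ind) for ind in pop])
        # Tournament selection: keep the fitter of two randomly drawn individuals.
        parents = np.array([pop[max(rng.integers(0, pop_size, 2),
                                    key=lambda i: fitness[i])]
                            for _ in range(pop_size)])
        children = parents.copy()
        for j in range(0, pop_size - 1, 2):          # single-point crossover
            if rng.random() < crossover_rate:
                cut = rng.integers(1, n)
                children[j, cut:], children[j + 1, cut:] = \
                    parents[j + 1, cut:], parents[j, cut:]
        flip = rng.random(children.shape) < mutation_rate   # bit-flip mutation
        children[flip] = 1 - children[flip]
        pop = children
    fitness = np.array([fitness_fn(X, y, ind) for ind in pop])
    return pop[np.argmax(fitness)], fitness.max()    # best mask and its accuracy
```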
Objectives & Motivations Build a genetic feature selection/weighting system to be applied to the character recognition problem and investigate the following issues: • Study the effect of varying the number of weight values on the number of selected features (FS often eliminates more features than FW, but by how much?) • Compare the performance of genetic feature selection/weighting in the presence of irrelevant & redundant features (not studied before) • Compare the performance of genetic feature selection/weighting for regular cases (test the hypothesis that FW should give better, or at least the same, results as FS) • Evaluate the performance of the better method (GFS or GFW) in terms of optimality and time complexity (study the feasibility of genetic search with respect to optimality & time)
Methodology • The recognition problem is to classify isolated handwritten digits • Used the k-nearest-neighbor classifier (k = 1) • Used a genetic algorithm as the search method • Applied genetic feature selection and weighting in the wrapper approach (i.e. the fitness function is the classifier’s error rate) • Used two phases during the program run: a training/testing phase and a validation phase
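The two-phase protocol can be sketched with scikit-learn's 1-NN classifier. The split proportions and helper name below are illustrative assumptions; the point is only that the GA's fitness is computed on the training/testing data, while the best chromosome's final accuracy is reported on held-out validation samples it never saw.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_protocol(X, y, best_mask, seed=0):
    # Hold out a validation set the GA never sees, then split the rest into
    # the training and testing parts used during the GA run.
    X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=seed, stratify=y)
    X_tr, X_te, y_tr, y_te = train_test_split(X_dev, y_dev, test_size=0.25,
                                              random_state=seed, stratify=y_dev)
    sel = best_mask.astype(bool)                     # keep only selected features
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[:, sel], y_tr)
    test_acc = knn.score(X_te[:, sel], y_te)         # accuracy used as GA fitness
    val_acc = knn.score(X_val[:, sel], y_val)        # reported generalization accuracy
    return test_acc, val_acc
```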
System Overview • Input (isolated handwritten digit images) → Pre-Processing Module → clean images → Feature Extraction Module → all N extracted features • Feature Selection/Weighting Module (GA) ⇄ Evaluation Module (KNN classifier): the GA proposes a feature subset, the classifier returns an assessment of that subset • Output: best feature subset (M < N) • Evaluation is done in two stages: Training/Testing and Validation
Results (Comparison 1) Effect of varying weight values on the number of selected features • As the number of weight values increases, the probability of a feature having weight value = 0 (POZ) decreases, so the number of eliminated features decreases • GFS eliminates more features (thus selects fewer features) than GFW because of its smaller number of weight values (0/1), and it does so without compromising classification accuracy
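To see why, note that with W allowed weight values the chance that a given feature is assigned the value 0 (and hence eliminated) is roughly 1/W if weights are chosen uniformly. The quick computation below is a back-of-the-envelope illustration under that uniformity assumption, not a measurement from the thesis experiments.

```python
# Expected number of eliminated features out of 40, assuming each feature's
# weight is drawn uniformly from W allowed values (one of which is 0).
n_features = 40
for W in (2, 4, 8, 16):          # W = 2 is plain selection (weights 0/1)
    poz = 1.0 / W                # probability of a zero weight (POZ)
    print(f"W={W:2d}  POZ={poz:.3f}  expected eliminated ~ {poz * n_features:.1f}")
```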
Results (Comparison 2) Performance of genetic feature selection/weighting in the presence of irrelevant features • The performance of the 1-NN classifier degrades rapidly as the number of irrelevant features increases • As the number of irrelevant features increases, FS outperforms all FW settings in both classification accuracy and elimination of features
Results (Comparison 3) Performance of genetic feature selection/weighting in the presence of redundant features • The classification accuracy of 1-NN does not suffer much from the addition of redundant features, but they do increase the problem size • As the number of redundant features increases, FS has slightly better classification accuracy than all FW settings, and it significantly outperforms FW in the elimination of features
Results (Comparison 4) Performance of genetic feature selection/weighting for regular cases (not necessarily containing irrelevant/redundant features) • FW has better training accuracies than FS, but FS generalizes better (it has better accuracies on unseen validation samples) • FW over-fits the training samples
Results (Evaluation 1) Convergence of GFS to an Optimal or Near-Optimal Set of Features • GFS was able to return optimal or near-optimal values (as reached by exhaustive search) • The worst average value obtained by GFS was less than 1% away from the optimal value
Results (Evaluation 2) Convergence of GFS to an Optimal or Near-Optimal Set of Features within an Acceptable Number of Generations

Number of Features | Best Exh. (opt. & near-opt.) | Exhaustive Run Time | Best GA | Average GA (5 runs) | Number of Generations | GA Run Time (single run)
 8 | 74, 73.8   | 2 minutes  | 74   | 73.68 |  5 | 2 minutes
10 | 75.2, 75   | 13 minutes | 75.2 | 74.96 |  5 | 3 minutes
12 | 77.2, 77   | 47 minutes | 77   | 76.92 | 10 | 5 minutes
14 | 79, 78.8   | 3 hours    | 79   | 78.2  | 10 | 5.5 minutes
16 | 79.2, 79   | 6 hours    | 79.2 | 78.48 | 15 | 8 minutes
18 | 79.4, 79.2 | 1.5 days   | 79.4 | 78.92 | 20 | 11 minutes

• The time needed for GFS is bounded by a (lower) linear-fit curve and an (upper) exponential-fit curve • The use of GFS for highly dimensional problems needs parallel processing
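The gap in run time follows from a simple count of classifier evaluations: exhaustive search must score all 2^N feature subsets, while the GA needs roughly pop_size × generations evaluations. The snippet below illustrates this using the generation counts from the table and an assumed population size of 20 (chosen purely for illustration).

```python
# Rough count of fitness (classifier) evaluations: exhaustive search vs GA.
pop_size = 20                      # assumed population size, for illustration
for n_features, generations in [(8, 5), (10, 5), (12, 10),
                                (14, 10), (16, 15), (18, 20)]:
    exhaustive_evals = 2 ** n_features     # every possible feature subset
    ga_evals = pop_size * generations      # one fitness call per individual
    print(f"N={n_features:2d}: exhaustive={exhaustive_evals:6d} evals, "
          f"GA ~ {ga_evals:4d} evals")
```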
Conclusions • GFS is superior to GFW in feature reduction, without compromising classification accuracy • In the presence of irrelevant features, GFS is better than GFW in both feature reduction and classification accuracy • In the presence of redundant features, GFS is also preferred over GFW due to its greater ability to reduce features • For regular databases, it is advisable to use at most 2 or 3 weight values to avoid over-fitting • GFS is a reliable method for finding optimal or near-optimal solutions, but it needs parallel processing for large problem sizes