Character Recognition Using Machine Learning Techniques

Matthew Peterson, Muktesh Khole, Aseem Gogte, Jeremy Kindseth

Problem Statement and Assumptions
• Identify alphanumeric symbols in documents or images (A-Z, a-z, 0-9)
• “Box” letters for analysis
[Figure: table pairing each boxed image with its character, e.g. A, B, C]

Problem Difficulty
[Figure: sample character images placed on a scale from harder to easier, with these labels]
• Not easily separable; surrounding artifacts; no consistent font
• Skewed but separable
• Easy to box
• Simple to decompose
• Complex fonts, boxed
• Simple fonts, boxed

Hypotheses
• Accuracy: K-Means Clustering + EM Algorithm can be used for accurate classification of alphanumeric symbols
• K-Means Clustering is as accurate as pixel-by-pixel detection using MSE
• Efficiency: K-Means Clustering is faster than pixel-by-pixel detection using MSE
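
The pixel-by-pixel baseline named in the hypotheses can be sketched as a nearest-template classifier under mean squared error (MSE). This is a minimal illustration, not the presenters' implementation: the binary list-of-lists image format, the names `mse` and `classify_by_mse`, and the 3 x 3 templates are all assumptions made for the example.

```python
from typing import Dict, List

Image = List[List[int]]  # assumed format: 1 = ink, 0 = background


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two equally sized images."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def classify_by_mse(image: Image, templates: Dict[str, Image]) -> str:
    """Return the label of the template with the lowest MSE against image."""
    return min(templates, key=lambda label: mse(image, templates[label]))


# Tiny 3x3 stand-ins for boxed characters (hypothetical data).
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # an "L" with one pixel missing
print(classify_by_mse(noisy_l, TEMPLATES))
```

Every pixel of every template is visited for each test image, so the cost grows with both image size and training set size; that cost is what the efficiency hypothesis compares against.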

Features
• Analysis of geometric locations of centroids
[Figure: centroid locations plotted against the X axis and Y axis]
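
One way to obtain centroid locations as features is to run K-Means over the (x, y) coordinates of a character's ink pixels. The sketch below is an assumption about how that could look, not the presenters' code: the function names, the binary image format, and the evenly spaced seeding are illustrative choices. Deterministic seeding is used because the conclusions report that random starting locations did not work; the slides do not say which seeding replaced them.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def foreground_points(image: List[List[int]]) -> List[Point]:
    """Collect (x, y) coordinates of ink pixels (value 1)."""
    return [(float(x), float(y))
            for y, row in enumerate(image)
            for x, value in enumerate(row) if value]


def nearest_index(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    px, py = point
    return min(range(len(centroids)),
               key=lambda i: (px - centroids[i][0]) ** 2 + (py - centroids[i][1]) ** 2)


def kmeans(points: List[Point], k: int, iterations: int = 50) -> List[Point]:
    """Lloyd's algorithm with deterministic, evenly spaced seeds."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    ordered = sorted(points)
    # Assumed seeding scheme: evenly spaced picks from the sorted points.
    centroids = [ordered[(i * len(ordered)) // k] for i in range(k)]
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_index(point, centroids)].append(point)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        updated = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return sorted(centroids)


# Two separated 2x2 blobs (hypothetical data); k = 2 keeps the example readable.
blobs = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(kmeans(foreground_points(blobs), k=2))
```

The sorted list of centroids is the feature vector for the character. The slides report K = 11 for the timing runs; the same function would be called with `k=11` on a full-size boxed character.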

Training and Testing Criteria

Accuracy Results
[Charts: accuracy for each training/testing combination]
• Training: OCRA – Testing: OCRA
• Training: OCRA – Testing: Arial
• Training: OCRA – Testing: Handwritten
• Training: Frankenset – Testing: OCRA
• Training: Frankenset – Testing: Arial
• Training: Frankenset – Testing: Handwritten
• Training: SuperSet – Testing: OCRA
• Training: SuperSet – Testing: Arial
• Training: SuperSet – Testing: Handwritten

Time Performance
• SuperSet training vs. pixel-by-pixel
• Testing against fonts and handwritten samples
• K = 11 used

Hard Problems
• Trouble pairs
  • 2; 7
  • 3; 6; 8; 9
  • O; Q; 0
  • V; U
  • B; D; R
  • I; l (uppercase “i”, lowercase “L”)
• Scaling problems (X vs. x; C vs. c)

Conclusions
Method
• Random starting locations did not work
• Scaling is important, especially for upper/lowercase differentiation (scaled to 128 x 128)
• “Adaptable K” K-Means Clustering may be interesting
• SuperSet method could be used to handle transformed data
Results
• Pixel-by-pixel is more accurate when SuperSet is used for training
• For K-Means Clustering, SuperSet worked better for training
• SuperSet increases memory usage and processor time – O(n!)
• Frankenset is not a training model
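
The slides state that characters were scaled to 128 x 128 but not how. A minimal sketch of one possible approach, assuming a binary list-of-lists image, a tight crop to the ink bounding box, and nearest-neighbour sampling (the function names are hypothetical):

```python
from typing import List

Image = List[List[int]]  # assumed format: 1 = ink, 0 = background


def crop_to_ink(image: Image) -> Image:
    """Crop a binary image to the bounding box of its ink pixels."""
    rows = [y for y, row in enumerate(image) if any(row)]
    if not rows:
        raise ValueError("image contains no ink pixels")
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def scale_nearest(image: Image, size: int = 128) -> Image:
    """Resize to size x size with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [[image[(y * height) // size][(x * width) // size] for x in range(size)]
            for y in range(size)]


# A small corner shape surrounded by background (hypothetical data).
small = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
scaled = scale_nearest(crop_to_ink(small))
print(len(scaled), len(scaled[0]))
```

Stretching a tight crop to a fixed box discards the original glyph size, so under this particular scheme an “X” and an “x” would come out nearly identical. That is consistent with the scaling problems listed under Hard Problems, and it suggests that any scaling step intended to separate upper and lower case needs to preserve some size cue.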

Conclusions Regarding Hypotheses
• Accuracy: K-Means Clustering + EM Algorithm can be used for accurate classification of alphanumeric symbols
• K-Means Clustering is as accurate as pixel-by-pixel detection using MSE
• Efficiency: K-Means Clustering is faster than pixel-by-pixel detection using MSE
All hypotheses affirmed.
K-Means Clustering effectively serves as a method for compressing alphanumeric data.

References
Sheshadri, Karthik, Pavan Ambekar, Deeksha Prasad, Ramakanth Kumar. “An OCR System for Printed Kannada Using K-means Clustering.” Industrial Technology (ICIT) 2010 IEEE International Conference on (March 2010): 14-17.
