Recognition of Fragmented Characters Using Multiple Feature-Subset Classifiers
C.H. Chou, C.Y. Guo, and F. Chang
Institute of Information Science, Academia Sinica, Taiwan
International Conference on Document Analysis and Recognition (ICDAR) 2007
Introduction • Recognizing fragmented (broken) characters in printed documents of poor printing quality. • A complement to ordinary mending techniques. • Uses only intact characters as training samples. • Multiple features are applied to enhance recognition accuracy. • The resultant classifiers classify both intact and fragmented characters with a high degree of accuracy.
Example • Samples from Chinese newspapers published between 1951 and 1961. • (a) most severe fragmentation • (b) less severe • (c) least severe
Feature Extraction • Binary image: each pixel is represented by 1 (black) or 0 (white). • LD (Linear Normalization + Density Feature): invariant to character fragmentation. Pipeline: LN → Reduction. The feature vector consists of 256 components with values in [0, 16] (see the sketch after this list). • ND (Nonlinear Shape Normalization + Direction Feature): invariant to shape deformation. Pipeline: NSN → Contour → 4-Direction Map → Blurring → Reduction. The feature vector consists of 256 components with values in [0, 255].
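A minimal sketch of the LD density feature, assuming the linear normalization step produces a 64×64 binary image partitioned into a 16×16 grid of 4×4 cells; this grid size is an assumption, chosen only because it is consistent with the slide's 256 components with values in [0, 16]. The function name is illustrative, not from the paper.

```python
import numpy as np

def density_feature(img):
    """Density (LD) feature: count black pixels in each 4x4 cell.

    Assumes `img` is a 64x64 binary array (1 = black, 0 = white)
    produced by linear normalization (assumed size, not stated in
    the slides). Yields 16x16 = 256 components, each in [0, 16].
    """
    assert img.shape == (64, 64)
    # Split the image into a 16x16 grid of 4x4 cells, then sum each cell.
    cells = img.reshape(16, 4, 16, 4)
    return cells.sum(axis=(1, 3)).ravel()  # 256-dim vector, values 0..16
```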
Random Subspace Method • The Random Subspace Method (RSM) randomly selects a certain number of subspaces from the original feature space and trains a classifier on each subspace. • Each set of training samples is derived from the feature vectors projected into a subspace. • Subspace projection: randomly select a small number of dimensions from an ordinary feature vector. • The subspace dimension w is set to 32, 64, or 128 (a training sketch follows below).
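A minimal sketch of the subspace training step as described above: for each classifier, draw w of the 256 feature dimensions at random and fit a base learner on the projected samples. scikit-learn's DecisionTreeClassifier is used here only as a CART-style stand-in; the paper's actual base learners are CART and GCNN.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # CART-style stand-in base learner

def train_subspace_classifiers(X, y, n_classifiers, w, seed=0):
    """Random Subspace Method: one classifier per random feature subset.

    X: (n_samples, 256) feature matrix (e.g., LD or ND features).
    w: subspace dimension (the slides use 32, 64, or 128).
    Returns a list of (feature_indices, fitted_classifier) pairs.
    """
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        # Randomly select w of the feature dimensions (without replacement).
        idx = rng.choice(X.shape[1], size=w, replace=False)
        clf = DecisionTreeClassifier().fit(X[:, idx], y)
        ensemble.append((idx, clf))
    return ensemble
```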
Random Subspace Method: Voting • The outputs of the individual subspace classifiers are combined by voting to produce the final label (a sketch follows below).
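A minimal majority-vote sketch to pair with the trainer above; the slides name voting as the combination rule but do not specify tie-breaking, so ties are broken arbitrarily here.

```python
from collections import Counter

def predict_by_voting(ensemble, x):
    """Classify one 256-dim feature vector by majority vote.

    Each classifier sees only its own feature subset; the label
    receiving the most votes wins (ties broken arbitrarily).
    """
    votes = [clf.predict(x[idx].reshape(1, -1))[0] for idx, clf in ensemble]
    return Counter(votes).most_common(1)[0][0]
```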
The accuracy of different classification methods • Multiple classifiers outperform single classifiers. • The hybrid feature always outperforms both the LD and ND features alone. • GCNNs achieve higher accuracy than CARTs.
The accuracy for three types of test documents • LD outperforms ND on the most severe and less severe data. • ND outperforms LD on the least severe data. • The hybrid feature achieves better accuracy than either LD or ND alone.
CARTs vs. GCNNs • Accuracy rates of CARTs and GCNNs with an increasing number of classifiers and different subspace dimensions w. • More classifiers yield higher accuracy. • GCNNs require fewer classifiers than CARTs to reach saturation accuracy.
Conclusion • Proposed a learning approach that handles both intact and fragmented characters in archived newspapers. • The multiple-classifier predictors achieve much higher accuracy rates than single classifiers. • Hybrid predictors, which use both types of feature, perform better than those using only a single feature. • Classifiers generated by the GCNN rule achieve higher accuracy, and require fewer of them, than those generated by the CART algorithm.