Sample-Separation-Margin Based Minimum Classification Error Training of Pattern Classifiers with Quadratic Discriminant Functions
Yongqiang Wang (1,2), Qiang Huo (1)
1: Microsoft Research Asia, Beijing, China
2: The University of Hong Kong, Hong Kong, China (qianghuo@microsoft.com)
ICASSP-2010, Dallas, Texas, U.S.A., March 14-19, 2010
Outline • Background • What’s new in our approach • How it works • Conclusions
Background of Minimum Classification Error (MCE) Formulation for Pattern Classification • Pioneered by Amari and Tsypkin in the late 1960s • S. Amari, “A theory of adaptive pattern classifiers,” IEEE Trans. on Electronic Computers, Vol. EC-16, No. 3, pp. 299-307, 1967. • Y. Z. Tsypkin, Adaptation and Learning in Automatic Systems, 1971. • Y. Z. Tsypkin, Foundations of the Theory of Learning Systems, 1973. • Proposed originally for supervised online adaptation of a pattern classifier • to minimize the expected risk (cost) • via a sequential probabilistic descent (PD) algorithm • Extended by Juang and Katagiri in the early 1990s • B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Trans. on Signal Processing, Vol. 40, No. 12, pp. 3043-3054, 1992.
MCE Formulation by Juang and Katagiri (1) • Define a proper discriminant function of an observation for each pattern class • To enable a maximum discriminant decision rule for pattern classification • Largely an art and application dependent
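As a concrete sketch of this step (notation assumed here rather than copied from the slides): each class $i$ is given a discriminant function $g_i(\mathbf{x};\Lambda)$ with trainable parameters $\Lambda$, and the maximum discriminant decision rule classifies an observation $\mathbf{x}$ as

\[
C(\mathbf{x}) = \arg\max_{i}\, g_i(\mathbf{x};\Lambda).
\]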
MCE Formulation by Juang and Katagiri (2) • Define a misclassification measure for each observation • to embed the decision process in the overall MCE formulation • to characterize the degree of confidence (or margin) in making a decision for this observation • a differentiable function of the classifier parameters • A popular choice: $d_i(\mathbf{x};\Lambda) = -g_i(\mathbf{x};\Lambda) + g_r(\mathbf{x};\Lambda)$, where $g_r(\mathbf{x};\Lambda) = \max_{j\neq i} g_j(\mathbf{x};\Lambda)$ is the discriminant score of the most competing class • Many possible ways => which one is better? => an open problem!
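For reference, a widely cited smoothed variant from Juang and Katagiri (1992) replaces the hard max over competitors with a soft-max; this is shown only as a sketch of one of the “many possible ways,” not necessarily the exact form on the original slide ($M$ is the number of classes, $\eta > 0$ a smoothing constant):

\[
d_i(\mathbf{x};\Lambda) = -\,g_i(\mathbf{x};\Lambda) + \frac{1}{\eta}\log\!\Big[\frac{1}{M-1}\sum_{j\neq i}\exp\big(\eta\, g_j(\mathbf{x};\Lambda)\big)\Big],
\]

which approaches the most-competing-class form above as $\eta \to \infty$.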
MCE Formulation by Juang and Katagiri (3) • Define a loss (cost) function for each observation • a differentiable and monotonically increasing function of the misclassification measure • many possibilities => the sigmoid function is the most popular choice, smoothing the 0-1 error count so that minimizing the loss approximates minimizing the classification error • MCE training via minimizing • empirical average loss (cost) by an appropriate optimization procedure, e.g., gradient descent (GD), Quickprop, Rprop, etc., or • expected loss (cost) by a sequential probabilistic descent (PD) algorithm (a.k.a. GPD)
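A minimal sketch of these ingredients with the standard choices ($\alpha$, $\beta$ are assumed smoothing constants, $\epsilon$ a step size): the sigmoid loss

\[
\ell_i(\mathbf{x};\Lambda) = \frac{1}{1+\exp\!\big(-\alpha\, d_i(\mathbf{x};\Lambda)+\beta\big)},\qquad \alpha>0,
\]

and the empirical average loss over $N$ training samples $\mathbf{x}_n$ with true class labels $i_n$,

\[
L(\Lambda) = \frac{1}{N}\sum_{n=1}^{N}\ell_{i_n}(\mathbf{x}_n;\Lambda),
\]

which can be minimized, e.g., by gradient descent, $\Lambda \leftarrow \Lambda - \epsilon\,\nabla_\Lambda L(\Lambda)$.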
Some Remarks • Combinations of different choices for each of the previous three steps and optimization methods lead to various MCE training algorithms. • The power of MCE training has been demonstrated by many research groups for different pattern classifiers in different applications. • How to improve the generalization capability of an MCE-trained classifier?
One Possible Solution: SSM-based MCE Training • Sample Separation Margin (SSM) • Defined as the smallest distance of an observation to the classification boundary formed by the true class and the most competing class • There is a closed-form solution for piecewise linear classifiers • Define the misclassification measure as the negative SSM • The other parts of the formulation are the same as in “traditional” MCE • A happy result • Minimized empirical error rate, and • Improved generalization • Correctly recognized training samples have a large margin from the decision boundaries! • For more info: • T. He and Q. Huo, “A study of a new misclassification measure for minimum classification error training of prototype-based pattern classifiers,” in Proc. ICPR-2008
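To make the closed form concrete for the piecewise linear case (a sketch with assumed notation, not quoted from He & Huo): if the relevant boundary is locally determined by linear discriminants $g_i(\mathbf{x}) = \mathbf{w}_i^\top\mathbf{x}+b_i$ for the true class $i$ and $g_r(\mathbf{x}) = \mathbf{w}_r^\top\mathbf{x}+b_r$ for the most competing class $r$, then the SSM-based misclassification measure is the negative signed distance to that boundary:

\[
d_i(\mathbf{x};\Lambda) = -\,\frac{(\mathbf{w}_i-\mathbf{w}_r)^\top\mathbf{x}+(b_i-b_r)}{\|\mathbf{w}_i-\mathbf{w}_r\|},
\]

so $|d_i(\mathbf{x};\Lambda)|$ equals the Euclidean distance of $\mathbf{x}$ to the decision boundary (the SSM), and the sign indicates whether $\mathbf{x}$ is correctly classified.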
What’s New in This Study? • Extend SSM-based MCE training to pattern classifiers with quadratic discriminant functions (QDFs) • No closed-form solution exists for calculating the SSM in this case • Demonstrate its effectiveness on a large-scale Chinese handwriting recognition task • The modified QDF (MQDF) is widely used in state-of-the-art Chinese handwriting recognition systems
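For context, a sketch of the discriminant functions involved (standard textbook forms; the exact parameterization used in the paper may differ). The QDF for class $i$ with mean $\boldsymbol{\mu}_i$, covariance $\boldsymbol{\Sigma}_i$ and prior $P(\omega_i)$ is

\[
g_i(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^\top\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) - \tfrac{1}{2}\log|\boldsymbol{\Sigma}_i| + \log P(\omega_i),
\]

and the MQDF keeps only the $K$ leading eigenpairs $(\lambda_{ik},\mathbf{e}_{ik})$ of $\boldsymbol{\Sigma}_i$, replacing the remaining eigenvalues by a constant $\delta$:

\[
g_i(\mathbf{x}) = -\frac{1}{2}\Bigg[\sum_{k=1}^{K}\frac{\big(\mathbf{e}_{ik}^\top(\mathbf{x}-\boldsymbol{\mu}_i)\big)^2}{\lambda_{ik}} + \frac{1}{\delta}\Big(\|\mathbf{x}-\boldsymbol{\mu}_i\|^2-\sum_{k=1}^{K}\big(\mathbf{e}_{ik}^\top(\mathbf{x}-\boldsymbol{\mu}_i)\big)^2\Big)\Bigg] - \frac{1}{2}\Big[\sum_{k=1}^{K}\log\lambda_{ik} + (D-K)\log\delta\Big],
\]

where $D$ is the feature dimension.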
Two Technical Issues • How to calculate the SSM efficiently? • Formulated as a nonlinear programming problem • Can be solved efficiently because it is a quadratically constrained quadratic programming (QCQP) problem with a very special structure: • A convex objective function with one quadratic equality constraint • How to calculate the derivative of the SSM? • Using a technique known as sensitivity analysis in nonlinear programming • Calculated using the solution to the SSM problem in Eq. (1) (a reconstruction of this formulation is sketched below) • Please refer to our paper for details
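A sketch of the underlying optimization (our reconstruction of the referenced Eq. (1), with assumed notation): for a sample $\mathbf{x}$ with true class $i$ and most competing class $r$, the SSM is

\[
\mathrm{SSM}(\mathbf{x}) = \min_{\mathbf{z}}\ \|\mathbf{z}-\mathbf{x}\| \quad \text{s.t.}\quad g_i(\mathbf{z};\Lambda) - g_r(\mathbf{z};\Lambda) = 0. \tag{1}
\]

Since both discriminants are quadratic in $\mathbf{z}$, the constraint is a single quadratic equality and the objective is convex, which is the special QCQP structure mentioned above; the derivative of the SSM with respect to $\Lambda$ then follows from sensitivity analysis at the optimal point and its Lagrange multiplier. Below is a minimal Python sketch that computes the SSM with a generic constrained solver rather than the structure-exploiting method of the paper; the toy QDFs (identity covariances, equal priors) and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def ssm_quadratic(x, g_true, g_comp):
    """Euclidean distance from x to the boundary {z : g_true(z) = g_comp(z)}
    between the true class and its most competing class. Generic SLSQP solve
    for illustration only; the paper exploits the special QCQP structure."""
    obj = lambda z: np.sum((z - x) ** 2)                       # squared distance to x
    cons = [{"type": "eq", "fun": lambda z: g_true(z) - g_comp(z)}]
    res = minimize(obj, x0=np.asarray(x, dtype=float), constraints=cons, method="SLSQP")
    return np.sqrt(res.fun)

# Toy usage: two Gaussian-like QDFs with identity covariances and equal priors.
mu_i, mu_r = np.array([0.0, 0.0]), np.array([3.0, 0.0])
g_i = lambda z: -0.5 * np.sum((z - mu_i) ** 2)
g_r = lambda z: -0.5 * np.sum((z - mu_r) ** 2)
print(ssm_quadratic(np.array([1.0, 0.0]), g_i, g_r))           # ~0.5 (boundary at z1 = 1.5)
```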
Experimental Setup • Vocabulary: • 6763 simplified Chinese characters • Dataset: • Training: 9,447,328 character samples • # of samples per class: 952 – 5,600 • Testing: 614,369 character samples • Feature extraction: • 512 “8-directional” features • Use LDA to reduce the dimension to 128 • Use an MQDF for each character class • # of retained eigenvectors: 5 and 10 • SSM-based MCE training • Use a maximum likelihood (ML) trained model as the seed model • Update mean vectors only in MCE training • Optimize the MCE objective function by batch-mode Quickprop (20 epochs) • [Figure: distribution of writing styles in the testing data]
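As a hedged illustration of the dimension-reduction step only (scikit-learn based, with dummy data of illustrative sizes; this is not the authors' code or exact LDA implementation):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Dummy stand-ins for the real data: rows are 512-d "8-directional" feature
# vectors, labels are character class indices (sizes chosen only for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512))
y = rng.integers(0, 200, size=2000)

# Project the 512-d features onto a 128-d LDA subspace, as in the setup above.
lda = LinearDiscriminantAnalysis(n_components=128)
X_128 = lda.fit_transform(X, y)
print(X_128.shape)   # (2000, 128)
```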
Experimental Results (1) • MQDF, K=5
Experimental Results (2) • MQDF, K=10
Experimental Results (3) • Histogram of SSMs on the training set • SSM-based MCE-trained classifier vs. conventional MCE-trained classifier • Training samples are pushed away from the decision boundaries • The bigger the SSM, the better the generalization
Conclusion and Discussions • SSM-based MCE training offers an implicit way of minimizing the empirical error rate and maximizing the sample separation margin simultaneously • Verified for quadratic classifiers in this study • Verified for piecewise linear classifiers previously (He & Huo, ICPR-2008) • Ongoing and future work • SSM-based MCE training for discriminative feature extraction • SSM-based MCE training for more flexible classifiers based on GMMs and HMMs • Searching for other (hopefully better) ways to combine MCE training and maximum margin training