Analysis of Classification-based Error Functions
Mike Rimer, Dr. Tony Martinez
BYU Computer Science Dept., 18 March 2006
Overview • Machine learning • Teaching artificial neural networks with an error function • Problems with conventional error functions • CB algorithms • Experimental results • Conclusion and future work
Machine Learning • Goal: automating the learning of problem domains • Given a training sample from a problem domain, induce a correct solution hypothesis over the entire problem population • The learning model is often used as a black box [diagram: input → f(x) → output]
Teaching ANNs with an Error Function • An error function guides the gradient descent procedure that trains a multi-layer perceptron (MLP) toward an optimal state • Conventional error metrics are sum-squared error (SSE) and cross-entropy (CE), sketched below • SSE is suited to function approximation • CE is aimed at classification problems • CB error functions [Rimer & Martinez 06] work better for classification
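To make the two conventional metrics concrete, here is a minimal NumPy sketch of per-pattern SSE and CE against 0-1 targets; the function names and the example pattern are illustrative, not from the slides.

```python
import numpy as np

def sse_error(outputs, targets):
    # Sum-squared error for one pattern's output vector.
    return np.sum((targets - outputs) ** 2)

def ce_error(outputs, targets, eps=1e-12):
    # Cross-entropy error; outputs assumed to lie in (0, 1),
    # e.g. sigmoid activations, one output node per class.
    outputs = np.clip(outputs, eps, 1 - eps)
    return -np.sum(targets * np.log(outputs)
                   + (1 - targets) * np.log(1 - outputs))

# Pattern labeled as class 2 of 2: hard targets are (0, 1).
outputs = np.array([0.3, 0.8])
targets = np.array([0.0, 1.0])
print(sse_error(outputs, targets))  # (0-0.3)^2 + (1-0.8)^2 = 0.13
print(ce_error(outputs, targets))
```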
SSE, CE • Attempt to approximate 0-1 targets in order to represent making a decision [figure: outputs O1 and O2 for a pattern labeled as class 2, with ERROR 1 and ERROR 2 measured against targets 0 and 1]
Issues with approximating hard targets • Optimality requires the weights to become large • This leads to premature weight saturation • Weight decay, etc., can mitigate this (see the sketch below) • Areas of the problem space are learned unevenly and at different times during training • This makes global learning problematic
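Weight decay's effect is easy to see in a single update rule; a one-line sketch of standard L2 decay (my formulation, not the slides'):

```python
def sgd_step(w, grad, lr=0.1, decay=1e-4):
    # SGD with L2 weight decay: the decay term shrinks the weights each
    # step, counteracting the drift toward large, saturating weights.
    return w - lr * (grad + decay * w)
```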
Classification-based Error Functions • Designed to more closely match the goal of learning a classification task (i.e. correct classifications, not low error on 0-1 targets), avoiding premature weight saturation and discouraging overfitting • CB1 [Rimer & Martinez 02, 06] • CB2 [Rimer & Martinez 04] • CB3 (submitted to ICML '06)
CB1 • Only backpropagates error on misclassified training patterns (see the sketch below) [figure: error vs. output on [0, 1] for non-target (~T) and target (T) nodes, shown for misclassified and correct patterns]
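A NumPy sketch of the CB1 rule. The error magnitudes below follow Rimer & Martinez only loosely; treat them as illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def cb1_error(outputs, target_idx):
    # CB1: backpropagate error only when the pattern is misclassified,
    # i.e. the target output does not exceed every competing output.
    rivals = np.delete(outputs, target_idx)
    top = rivals.max()
    errors = np.zeros_like(outputs)
    if outputs[target_idx] <= top:  # misclassified
        # Push the target output up to the best rival...
        errors[target_idx] = top - outputs[target_idx]
        # ...and push each rival at or above the target output down.
        for j in range(len(outputs)):
            if j != target_idx and outputs[j] >= outputs[target_idx]:
                errors[j] = outputs[target_idx] - outputs[j]
    return errors  # all zeros for a correctly classified pattern
```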
CB2 • Adds a confidence margin, μ, that is increased globally as training progresses (sketch below) [figure: error vs. output for non-target (~T) and target (T) nodes in three cases: misclassified; correct but not satisfying the margin μ; correct and satisfying the margin]
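CB2 adds the margin μ to that test; a sketch, with the margin schedule being my assumption:

```python
import numpy as np

def cb2_error(outputs, target_idx, mu):
    # CB2: error is backpropagated unless the target output beats the
    # best rival output by at least the margin mu.
    top = np.delete(outputs, target_idx).max()
    errors = np.zeros_like(outputs)
    if outputs[target_idx] - top < mu:
        errors[target_idx] = (top + mu) - outputs[target_idx]
    return errors

# mu typically starts at 0 and grows as training progresses, e.g.:
# mu = mu_max * epoch / n_epochs   (an assumed schedule)
```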
CB3 • Learns a confidence Ci for each training pattern i as training progresses • Patterns that are often misclassified have low confidence • Patterns consistently classified correctly gain confidence (sketch below) [figure: error vs. output for non-target (~T) and target (T) nodes in three cases: misclassified; correct with learned low confidence Ci; correct with learned high confidence Ci]
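A minimal sketch of the CB3 bookkeeping: one confidence per training pattern, nudged by classification outcomes. The update rule and step size are assumptions for illustration; the paper's formulation differs.

```python
import numpy as np

class PatternConfidence:
    # Per-pattern confidence C[i], as in the CB3 idea: raised when
    # pattern i is classified correctly, lowered when it is not.
    def __init__(self, n_patterns, step=0.05):
        self.C = np.zeros(n_patterns)
        self.step = step

    def update(self, i, correct):
        delta = self.step if correct else -self.step
        self.C[i] = np.clip(self.C[i] + delta, 0.0, 1.0)
```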
Neural Network Training • Influenced by: • Initial parameter (weight) settings • Pattern presentation order (stochastic training) • Learning rate • Number of hidden nodes • Goal of training: • High generalization • Low bias and variance
Experiments • Empirical comparison of six error functions: SSE, CE, CE with weight decay, and CB1-3 • Used eleven benchmark problems from the UC Irvine Machine Learning Repository: ann, balance, bcw, derm, ecoli, iono, iris, musk2, pima, sonar, wine • Testing performed using stratified 10-fold cross-validation (protocol sketched below) • Model selection by hold-out set • Results were averaged over ten tests • Learning rate 0.1, momentum 0.7
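A sketch of that evaluation protocol using scikit-learn's splitters; the authors' own tooling is not specified, and `make_model` is a hypothetical factory returning an object with fit/score methods.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def evaluate(make_model, X, y, seed=0):
    # Stratified 10-fold cross-validation with a hold-out set carved
    # from each training fold for model selection.
    accs = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                                  random_state=seed).split(X, y):
        X_tr, X_ho, y_tr, y_ho = train_test_split(
            X[tr], y[tr], test_size=0.1, stratify=y[tr],
            random_state=seed)
        model = make_model()
        model.fit(X_tr, y_tr)   # in practice, keep the epoch/model that
                                # scores best on the hold-out (X_ho, y_ho)
        accs.append(model.score(X[te], y[te]))
    return sum(accs) / len(accs)
```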
Classifier Output Difference (COD) • Evaluates the behavioral difference between two hypotheses (e.g. classifiers) f and g:

$$\mathrm{COD}(f, g) = \frac{1}{|T|} \sum_{x \in T} I\big(f(x) \neq g(x)\big)$$

where T is the test set and I is the characteristic (indicator) function
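Given that definition, COD reduces to a disagreement rate over the test set; a minimal sketch, where `predict` is assumed to return class labels:

```python
import numpy as np

def cod(f, g, X_test):
    # Fraction of test patterns on which hypotheses f and g disagree;
    # I(.) from the slide is the elementwise inequality test here.
    return np.mean(f.predict(X_test) != g.predict(X_test))
```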
Robustness to initial network weights • Averaged over 30 random runs across all datasets
Robustness to initial network weights • Averaged over all tests
Robustness to pattern presentation order • Averaged over 30 random runs across all datasets
Robustness to pattern presentation order • Averaged over all tests
Robustness to learning rate • Results averaged while varying the learning rate from 0.01 to 0.3
Robustness to number of hidden nodes • Results averaged while varying the number of nodes in the hidden layer from 1 to 30
Conclusion • CB1-3 are generally more robust than SSE, CE, and CE w/ WD with respect to: • Initial weight settings • Pattern presentation order • Pattern variance • Learning rate • Number of hidden nodes • CB3 is the strongest overall: the most robust, with the most consistent results