Analysis of Classification-based Error Functions
Mike Rimer, Dr. Tony Martinez
BYU Computer Science Dept., 18 March 2006
Overview • Machine learning • Teaching artificial neural networks with an error function • Problems with conventional error functions • CB algorithms • Experimental results • Conclusion and future work
Machine Learning • Goal: automating the learning of problem domains • Given a training sample from a problem domain, induce a correct solution hypothesis over the entire problem population • The learning model is often used as a black box [diagram: input → f(x) → output]
Teaching ANNs with an Error Function • An error function guides the gradient descent procedure that trains a multi-layer perceptron (MLP) toward an optimal state • Conventional error metrics are sum-squared error (SSE) and cross-entropy (CE), sketched below • SSE is suited to function approximation • CE is aimed at classification problems • CB error functions [Rimer & Martinez 06] work better for classification
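To make the two conventional metrics concrete, here is a minimal NumPy sketch of per-pattern SSE and CE against 0-1 targets; the function names and the example pattern are illustrative, not from the slides.

```python
import numpy as np

def sse_error(outputs, targets):
    # Sum-squared error for one pattern's output vector.
    return np.sum((targets - outputs) ** 2)

def ce_error(outputs, targets, eps=1e-12):
    # Cross-entropy error; outputs assumed to lie in (0, 1),
    # e.g. sigmoid activations, one output node per class.
    outputs = np.clip(outputs, eps, 1 - eps)
    return -np.sum(targets * np.log(outputs)
                   + (1 - targets) * np.log(1 - outputs))

# Pattern labeled as class 2 of 2: hard targets are (0, 1).
outputs = np.array([0.3, 0.8])
targets = np.array([0.0, 1.0])
print(sse_error(outputs, targets))  # (0-0.3)^2 + (1-0.8)^2 = 0.13
print(ce_error(outputs, targets))
```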
SSE, CE • Attempt to approximate 0-1 targets in order to represent making a decision [figure: outputs O1 and O2 for a pattern labeled as class 2, with ERROR 1 and ERROR 2 measured against targets 0 and 1]
Issues with approximating hard targets • Optimality requires the weights to become large • This leads to premature weight saturation • Weight decay, etc., can mitigate this (see the sketch below) • Areas of the problem space are learned unevenly and at different times during training • This makes global learning problematic
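Weight decay's effect is easy to see in a single update rule; a one-line sketch of standard L2 decay (my formulation, not the slides'):

```python
def sgd_step(w, grad, lr=0.1, decay=1e-4):
    # SGD with L2 weight decay: the decay term shrinks the weights each
    # step, counteracting the drift toward large, saturating weights.
    return w - lr * (grad + decay * w)
```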
Classification-based Error Functions • Designed to more closely match the goal of learning a classification task (i.e. correct classifications, not low error on 0-1 targets), avoiding premature weight saturation and discouraging overfitting • CB1 [Rimer & Martinez 02, 06] • CB2 [Rimer & Martinez 04] • CB3 (submitted to ICML '06)
CB1 • Only backpropagates error on misclassified training patterns (see the sketch below) [figure: error vs. output on [0, 1] for non-target (~T) and target (T) nodes, shown for misclassified and correct patterns]
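A NumPy sketch of the CB1 rule. The error magnitudes below follow Rimer & Martinez only loosely; treat them as illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def cb1_error(outputs, target_idx):
    # CB1: backpropagate error only when the pattern is misclassified,
    # i.e. the target output does not exceed every competing output.
    rivals = np.delete(outputs, target_idx)
    top = rivals.max()
    errors = np.zeros_like(outputs)
    if outputs[target_idx] <= top:  # misclassified
        # Push the target output up to the best rival...
        errors[target_idx] = top - outputs[target_idx]
        # ...and push each rival at or above the target output down.
        for j in range(len(outputs)):
            if j != target_idx and outputs[j] >= outputs[target_idx]:
                errors[j] = outputs[target_idx] - outputs[j]
    return errors  # all zeros for a correctly classified pattern
```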
CB2 • Adds a confidence margin, μ, that is increased globally as training progresses (sketch below) [figure: error vs. output for non-target (~T) and target (T) nodes in three cases: misclassified; correct but not satisfying the margin μ; correct and satisfying the margin]
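CB2 adds the margin μ to that test; a sketch, with the margin schedule being my assumption:

```python
import numpy as np

def cb2_error(outputs, target_idx, mu):
    # CB2: error is backpropagated unless the target output beats the
    # best rival output by at least the margin mu.
    top = np.delete(outputs, target_idx).max()
    errors = np.zeros_like(outputs)
    if outputs[target_idx] - top < mu:
        errors[target_idx] = (top + mu) - outputs[target_idx]
    return errors

# mu typically starts at 0 and grows as training progresses, e.g.:
# mu = mu_max * epoch / n_epochs   (an assumed schedule)
```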
CB3 • Learns a confidence Ci for each training pattern i as training progresses • Patterns that are often misclassified have low confidence • Patterns consistently classified correctly gain confidence (sketch below) [figure: error vs. output for non-target (~T) and target (T) nodes in three cases: misclassified; correct with learned low confidence Ci; correct with learned high confidence Ci]
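A minimal sketch of the CB3 bookkeeping: one confidence per training pattern, nudged by classification outcomes. The update rule and step size are assumptions for illustration; the paper's formulation differs.

```python
import numpy as np

class PatternConfidence:
    # Per-pattern confidence C[i], as in the CB3 idea: raised when
    # pattern i is classified correctly, lowered when it is not.
    def __init__(self, n_patterns, step=0.05):
        self.C = np.zeros(n_patterns)
        self.step = step

    def update(self, i, correct):
        delta = self.step if correct else -self.step
        self.C[i] = np.clip(self.C[i] + delta, 0.0, 1.0)
```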
Neural Network Training • Influenced by: • Initial parameter (weight) settings • Pattern presentation order (stochastic training) • Learning rate • Number of hidden nodes • Goal of training: • High generalization • Low bias and variance
Experiments • Empirical comparison of six error functions: SSE, CE, CE with weight decay, and CB1-3 • Used eleven benchmark problems from the UC Irvine Machine Learning Repository: ann, balance, bcw, derm, ecoli, iono, iris, musk2, pima, sonar, wine • Testing performed using stratified 10-fold cross-validation (protocol sketched below) • Model selection by hold-out set • Results were averaged over ten tests • Learning rate 0.1, momentum 0.7
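A sketch of that evaluation protocol using scikit-learn's splitters; the authors' own tooling is not specified, and `make_model` is a hypothetical factory returning an object with fit/score methods.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

def evaluate(make_model, X, y, seed=0):
    # Stratified 10-fold cross-validation with a hold-out set carved
    # from each training fold for model selection.
    accs = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True,
                                  random_state=seed).split(X, y):
        X_tr, X_ho, y_tr, y_ho = train_test_split(
            X[tr], y[tr], test_size=0.1, stratify=y[tr],
            random_state=seed)
        model = make_model()
        model.fit(X_tr, y_tr)   # in practice, keep the epoch/model that
                                # scores best on the hold-out (X_ho, y_ho)
        accs.append(model.score(X[te], y[te]))
    return sum(accs) / len(accs)
```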
Classifier Output Difference (COD) • Evaluates the behavioral difference between two hypotheses (e.g. classifiers) f and g:

$$\mathrm{COD}(f, g) = \frac{1}{|T|} \sum_{x \in T} I\big(f(x) \neq g(x)\big)$$

where T is the test set and I is the characteristic (indicator) function
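Given that definition, COD reduces to a disagreement rate over the test set; a minimal sketch, where `predict` is assumed to return class labels:

```python
import numpy as np

def cod(f, g, X_test):
    # Fraction of test patterns on which hypotheses f and g disagree;
    # I(.) from the slide is the elementwise inequality test here.
    return np.mean(f.predict(X_test) != g.predict(X_test))
```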
Robustness to initial network weights • Averaged over 30 random runs across all datasets
Robustness to initial network weights • Averaged over all tests
Robustness to pattern presentation order • Averaged over 30 random runs across all datasets
Robustness to pattern presentation order • Averaged over all tests
Robustness to learning rate • Results averaged while varying the learning rate from 0.01 to 0.3
Robustness to number of hidden nodes • Results averaged while varying the number of nodes in the hidden layer from 1 to 30
Conclusion • CB1-3 are generally more robust than SSE, CE, and CE w/ WD with respect to: • Initial weight settings • Pattern presentation order • Pattern variance • Learning rate • Number of hidden nodes • CB3 is the strongest overall: the most robust, with the most consistent results