
Analysis of Classification-based Error Functions


Presentation Transcript


  1. Analysis of Classification-based Error Functions Mike Rimer Dr. Tony Martinez BYU Computer Science Dept. 18 March 2006

  2. Overview • Machine learning • Teaching artificial neural networks with an error function • Problems with conventional error functions • CB algorithms • Experimental results • Conclusion and future work

  3. Machine Learning [Figure: the learning model as a black box mapping input x to output f(x)] • Goal: Automating learning of problem domains • Given a training sample from a problem domain, induce a correct solution-hypothesis over the entire problem population • The learning model is often used as a black box

  4. Teaching ANNs with an Error Function • An error function is used to train a multi-layer perceptron (MLP), guiding the gradient descent learning procedure toward an optimal state • Conventional error metrics are sum-squared error (SSE) and cross entropy (CE) • SSE is suited to function approximation • CE is aimed at classification problems • CB error functions [Rimer & Martinez 06] work better for classification
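
For concreteness, here is a minimal NumPy sketch of the two conventional metrics for a single training pattern (the function names, the 1/2 convention in SSE, and the example values are illustrative choices, not taken from the slides):

```python
import numpy as np

def sse(outputs, targets):
    # Sum-squared error for one pattern: 1/2 * sum_k (t_k - o_k)^2
    return 0.5 * np.sum((targets - outputs) ** 2)

def cross_entropy(outputs, targets, eps=1e-12):
    # Cross entropy against 0-1 targets; eps guards against log(0)
    o = np.clip(outputs, eps, 1 - eps)
    return -np.sum(targets * np.log(o) + (1 - targets) * np.log(1 - o))

# A two-output pattern labeled as class 2, i.e. hard targets (0, 1)
outputs = np.array([0.3, 0.8])
targets = np.array([0.0, 1.0])
print(sse(outputs, targets))            # 0.065
print(cross_entropy(outputs, targets))  # ~0.580
```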

  5. SSE, CE [Figure: two outputs O1 and O2 with ERROR 1 and ERROR 2 measured against the 0-1 targets, for a pattern labeled as class 2] • Attempts to approximate 0-1 targets in order to represent making a decision

  6. Issues with approximating hard targets • Requires weights to be large to achieve optimality • Leads to premature weight saturation • Weight decay, etc., can improve the situation • Learns areas of the problem space unevenly and at different times during training • Makes global learning problematic
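
The saturation claim is easy to check numerically: with sigmoid units, the gradient factor is σ'(net) = σ(net)(1 − σ(net)), so the large net inputs needed to approximate a hard target of 1 leave almost nothing to backpropagate. A small illustrative sketch:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# Approximating a hard target of 1 requires a large net input --
# and the sigmoid's gradient there is nearly zero.
for net in [1.0, 3.0, 7.0]:
    o = sigmoid(net)
    print(f"net={net:4.1f}  output={o:.4f}  gradient={o * (1 - o):.5f}")
# net= 1.0  output=0.7311  gradient=0.19661
# net= 3.0  output=0.9526  gradient=0.04518
# net= 7.0  output=0.9991  gradient=0.00091
```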

  7. Classification-based Error Functions • Designed to more closely match the goal of learning a classification task (i.e. correct classifications, not low error on 0-1 targets), avoiding premature weight saturation and discouraging overfitting • CB1 [Rimer & Martinez 02, 06] • CB2 [Rimer & Martinez 04] • CB3 (submitted to ICML ‘06)

  8. CB1 [Figure: error on the target output T and the non-target outputs ~T over the 0-1 output range, shown for a misclassified pattern vs. a correctly classified one] • Only backpropagates error on misclassified training patterns
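
The following sketch illustrates the CB1 rule as stated on this slide; the exact error formulation is given in Rimer & Martinez [02, 06], and the specific push-to-the-boundary targets used here are a paraphrase for illustration:

```python
import numpy as np

def cb1_errors(outputs, target_idx):
    """Per-output error signals in the spirit of CB1 (illustrative).

    If the target class already has the highest output, the pattern
    is classified correctly and no error is backpropagated.  Otherwise
    the target output is raised toward the highest competitor, and any
    non-target output at or above the target is lowered toward it.
    """
    errors = np.zeros_like(outputs)
    highest_other = np.delete(outputs, target_idx).max()
    if outputs[target_idx] > highest_other:
        return errors  # correct classification: no error signal
    for k, o in enumerate(outputs):
        if k == target_idx:
            errors[k] = highest_other - o        # raise the target output
        elif o >= outputs[target_idx]:
            errors[k] = outputs[target_idx] - o  # lower offending outputs
    return errors

print(cb1_errors(np.array([0.6, 0.4]), target_idx=1))  # [-0.2  0.2]
print(cb1_errors(np.array([0.2, 0.7]), target_idx=1))  # [0. 0.] -- correct, left alone
```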

  9. CB2 [Figure: as for CB1, with a confidence margin μ added; three cases: misclassified; correct, but doesn’t satisfy margin; correct, and satisfies margin] • Adds a confidence margin, μ, that is increased globally as training progresses
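
The same sketch extends naturally to CB2's global margin (again illustrative; how μ is scheduled upward over training is an assumption here):

```python
import numpy as np

def cb2_errors(outputs, target_idx, mu):
    """CB2-style error signals (illustrative, not the paper's exact form).

    As in CB1, but a correct classification must also beat the highest
    competing output by the margin mu, which grows as training proceeds.
    """
    errors = np.zeros_like(outputs)
    highest_other = np.delete(outputs, target_idx).max()
    if outputs[target_idx] > highest_other + mu:
        return errors  # correct AND satisfies the margin: no error
    for k, o in enumerate(outputs):
        if k == target_idx:
            errors[k] = (highest_other + mu) - o        # raise target past margin
        elif o + mu >= outputs[target_idx]:
            errors[k] = outputs[target_idx] - (o + mu)  # push competitor down
    return errors

# Correct but inside the margin: still generates an error signal,
# matching the middle case in the slide's figure.
print(cb2_errors(np.array([0.45, 0.50]), target_idx=1, mu=0.2))  # [-0.15  0.15]
```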

  10. CB3 [Figure: error on T and ~T with a per-pattern confidence Ci; three cases: misclassified; correct with learned low confidence; correct with learned high confidence] • Learns a confidence Ci for each training pattern i as training progresses • Patterns often misclassified have low confidence • Patterns consistently classified correctly gain confidence
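
CB3 replaces the single global margin with a learned per-pattern confidence. A toy sketch of the bookkeeping follows; the fixed-step update and the [0, 1] clipping range are assumptions for illustration, not the submitted paper's formulation:

```python
import numpy as np

class CB3Confidence:
    """Tracks a learned confidence C_i per training pattern (illustrative)."""

    def __init__(self, n_patterns, delta=0.05):
        self.conf = np.zeros(n_patterns)  # C_i starts at zero
        self.delta = delta                # assumed fixed update step

    def update(self, i, correct):
        # Confidence rises when pattern i is classified correctly,
        # falls when it is misclassified.
        step = self.delta if correct else -self.delta
        self.conf[i] = np.clip(self.conf[i] + step, 0.0, 1.0)

    def margin(self, i):
        # C_i then plays the role of mu in CB2, but per pattern.
        return self.conf[i]

conf = CB3Confidence(n_patterns=2)
for _ in range(4):
    conf.update(0, correct=True)   # consistently correct: confidence grows
    conf.update(1, correct=False)  # often misclassified: stays low
print(conf.conf)  # [0.2 0. ]
```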

  11. Neural Network Training • Influenced by: • Initial parameter (weight) settings • Pattern presentation order (stochastic training) • Learning rate • # of hidden nodes • Goal of training: • High generalization • Low bias and variance

  12. Experiments • Empirical comparison of six error functions: SSE, CE, CE with weight decay (WD), and CB1-3 • Used eleven benchmark problems from the UC Irvine Machine Learning Repository: ann, balance, bcw, derm, ecoli, iono, iris, musk2, pima, sonar, wine • Testing performed using stratified 10-fold cross-validation • Model selection by hold-out set • Results were averaged over ten tests • Learning rate (LR) = 0.1, momentum (M) = 0.7
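
The evaluation protocol can be reproduced in outline with scikit-learn. This is a sketch only: the original experiments used the authors' own MLP training code, and the hidden-layer size, hold-out fraction, and seeds below are placeholder choices; LR = 0.1 and M = 0.7 are taken from the slide.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)  # 'iris' is one of the eleven UCI sets
scores = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Carve a hold-out set from each training fold for model selection.
    X_tr, X_ho, y_tr, y_ho = train_test_split(
        X[train_idx], y[train_idx], test_size=0.2,
        stratify=y[train_idx], random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(20,), solver="sgd",
                        learning_rate_init=0.1, momentum=0.7,
                        max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    # (Model selection against X_ho / y_ho would go here.)
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"stratified 10-fold accuracy: {np.mean(scores):.3f}")
```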

  13. Classifier output difference (COD) • Evaluates the behavioral difference of two hypotheses f and g (e.g. classifiers) as the fraction of test patterns on which they disagree: COD(f, g) = (1 / |T|) Σx∈T I(f(x) ≠ g(x)), where T is the test set and I is the identity or characteristic function
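
In code, the metric as reconstructed above is just the disagreement rate of the two classifiers' predictions over T (a minimal sketch):

```python
import numpy as np

def cod(preds_f, preds_g):
    # Fraction of test patterns on which hypotheses f and g disagree:
    # COD(f, g) = (1 / |T|) * sum over x in T of I(f(x) != g(x))
    preds_f, preds_g = np.asarray(preds_f), np.asarray(preds_g)
    return np.mean(preds_f != preds_g)

# Two classifiers' predictions over a five-pattern test set T:
print(cod([0, 1, 1, 2, 0], [0, 1, 2, 2, 1]))  # 0.4 -- disagree on 2 of 5
```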

  14. Robustness to initial network weights • Averaged 30 random runs over all datasets

  15. Robustness to initial network weights • Averaged over all tests

  16. Robustness to pattern presentation order • Averaged 30 random runs over all datasets

  17. Robustness to pattern presentation order • Averaged over all tests

  18. Robustness to learning rate • Averaged over learning rates varied from 0.01 to 0.3

  19. Robustness to learning rate

  20. Robustness to number of hidden nodes • Averaged over the number of nodes in the hidden layer, varied from 1 to 30

  21. Robustness to number of hidden nodes

  22. Conclusion • CB1-3 are generally more robust than SSE, CE, and CE w/ WD with respect to: • Initial weight settings • Pattern presentation order • Pattern variance • Learning rate • # of hidden nodes • CB3 is the most robust overall, giving the most consistent results

  23. Questions?
