
Experimental Evaluation of Learning Algorithms



  1. Experimental Evaluation of Learning Algorithms Part 1

  2. Motivation
  • Evaluating the performance of learning systems is important because:
  • Learning systems are usually designed to predict the class of “future” unlabeled data points.
  • In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).

  3. Recommended Steps for Proper Evaluation
  • Identify the “interesting” properties of the classifier.
  • Choose an evaluation metric accordingly.
  • Choose the learning algorithms to involve in the study, along with the domain(s) on which the various systems will be compared.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
  • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
  • Interpret the results with respect to the domain(s).

  4. The Classifier Evaluation Procedure
  [Flowchart: Choose Learning Algorithm(s) to Evaluate → Select Datasets for Comparison → Select Performance Measure of Interest → Select Statistical Test → Select Error-Estimation/Sampling Method → Perform Evaluation. The diagram uses two kinds of links between steps: one meaning that knowledge of step 1 is necessary for step 2, the other meaning that feedback from step 1 should be used to adjust step 2.]

  5. Typical (but not necessarily optimal) Choices
  • Identify the “interesting” properties of the classifier.
  • Choose an evaluation metric accordingly.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
  • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
  • Interpret the results with respect to the domain.

  6. Typical (but not necessarily optimal) Choices I:
  • Identify the “interesting” properties of the classifier.
  • Choose an evaluation metric accordingly.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
  • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
  • Interpret the results with respect to the domain.
  (Note: in practice, almost only the highlighted steps are actually performed; the other steps are typically considered, but only very lightly.)

  7. Typical (but not necessarily optimal) Choices II:
  • Typical choices for performance evaluation:
    • Accuracy
    • Precision/Recall
  • Typical choices for sampling methods:
    • Train/Test Sets (Why is this necessary?)
    • K-Fold Cross-Validation
  • Typical choices for significance estimation:
    • t-test (often a very bad choice, in fact!)

  8. Confusion Matrix / Common Performance Evaluation Metrics
  A confusion matrix tabulates predicted against actual classes:

                          Predicted Positive    Predicted Negative
      Actual Positive (P)         TP                    FN
      Actual Negative (N)         FP                    TN

  • Accuracy = (TP+TN)/(P+N)
  • Precision = TP/(TP+FP)
  • Recall / TP rate = TP/P
  • FP rate = FP/N
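For concreteness, here is a minimal Python sketch of these four metrics, written directly from the formulas above; the counts in the example call are made-up numbers, not data from the slides.

```python
# A minimal sketch of the four metrics above, written directly from the formulas;
# the counts passed in come from whatever confusion matrix is at hand.
def confusion_metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn                  # actual positives and negatives
    return {"accuracy":  (tp + tn) / (p + n),
            "precision": tp / (tp + fp),
            "recall":    tp / p,             # also called the TP rate
            "fp_rate":   fp / n}

# Hypothetical example: 200 hits, 300 misses, 100 false alarms, 400 correct rejections
print(confusion_metrics(tp=200, fn=300, fp=100, tn=400))
```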

  9. Sampling and Significance Estimation: Questions Considered
  • Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
  • Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?
  • When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?

  10. k-Fold Cross-Validation
  1. Partition the available data D0 into k disjoint subsets T1, T2, …, Tk of equal size, where this size is at least 30.
  2. For i from 1 to k, do:
     • Use Ti as the test set, and the remaining data as the training set Si ← D0 − Ti
     • hA ← LA(Si)
     • hB ← LB(Si)
     • δi ← errorTi(hA) − errorTi(hB)
  3. Return avg(δ), where avg(δ) = (1/k) Σi=1..k δi
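Below is a rough Python sketch of this paired k-fold procedure, assuming scikit-learn is available; the dataset and the two learners LA and LB are placeholders (a synthetic dataset, a decision tree, and naive Bayes), not the ones used in the course.

```python
# A rough sketch of the paired k-fold comparison above, assuming scikit-learn.
# The dataset and the two learners L_A and L_B are placeholders only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # stands in for D0
learner_A, learner_B = DecisionTreeClassifier(random_state=0), GaussianNB()

deltas = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # S_i = D0 - T_i is the training set; T_i is the held-out test fold
    err_A = 1.0 - learner_A.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    err_B = 1.0 - learner_B.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    deltas.append(err_A - err_B)             # delta_i = error_Ti(h_A) - error_Ti(h_B)

print("avg(delta) =", np.mean(deltas))       # estimated difference in error between L_A and L_B
```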

  11. Confidence of the k-fold Estimate
  • The most commonly used approach to confidence estimation in machine learning is:
  • To run the algorithm using 10-fold cross-validation and to record the accuracy at each fold.
  • To compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula:
        δ ± tN,9 · sδ
    where
  • δ is the average difference between the reported accuracy and the given gold standard,
  • tN,9 is a constant chosen according to the degree of confidence desired,
  • sδ = sqrt( (1/90) Σi=1..10 (δi − δ)² ), where δi represents the difference between the reported accuracy and the given gold standard at fold i.
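As an illustration, here is a small sketch of this interval computation for k = 10 folds, assuming SciPy; the per-fold differences are invented values standing in for the recorded δi.

```python
# A sketch of the confidence interval above for k = 10 folds, assuming SciPy.
# The per-fold differences are invented values standing in for the recorded delta_i.
import numpy as np
from scipy import stats

deltas = np.array([0.02, 0.01, 0.03, 0.00, 0.02, 0.04, 0.01, 0.02, 0.03, 0.02])

k = len(deltas)                                                        # 10 folds -> 9 degrees of freedom
delta_bar = deltas.mean()                                              # average difference
s_delta = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))   # the 1/90 factor when k = 10
t_crit = stats.t.ppf(0.975, df=k - 1)                                  # two-sided 95% confidence

print(f"{delta_bar:.4f} +/- {t_crit * s_delta:.4f}")
```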

  12. What’s wrong with Accuracy?
  • Both classifiers obtain 60% accuracy.
  • They exhibit very different behaviours:
  • On the left: weak positive recognition rate / strong negative recognition rate.
  • On the right: strong positive recognition rate / weak negative recognition rate.
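The two classifiers referred to here appear as confusion matrices in the original slide figure, which is not part of this transcript. The counts below are hypothetical, chosen only to reproduce the situation described: both reach 60% accuracy while their positive and negative recognition rates are reversed.

```python
# Hypothetical confusion-matrix counts (not the slide's actual figures): both
# classifiers reach 60% accuracy, but their recognition rates are reversed.
def rates(tp, fn, fp, tn):
    return {"accuracy": (tp + tn) / (tp + fn + fp + tn),
            "tp_rate": tp / (tp + fn),       # positive recognition rate
            "tn_rate": tn / (fp + tn)}       # negative recognition rate

print("left: ", rates(tp=200, fn=300, fp=100, tn=400))   # weak positive, strong negative
print("right:", rates(tp=400, fn=100, fp=300, tn=200))   # strong positive, weak negative
```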

  13. What’s wrong with Precision/Recall?
  • Both classifiers obtain the same precision and recall values of 66.7% and 40%.
  • They exhibit very different behaviours:
  • Same positive recognition rate.
  • Extremely different negative recognition rate: strong on the left / nil on the right.
  • Note: Accuracy has no problem catching this!
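Again, the slide's confusion matrices are in a figure that is not in the transcript; the counts below are hypothetical and merely reproduce the stated precision and recall of 66.7% and 40% while varying the number of true negatives.

```python
# Hypothetical counts (the slide's matrices are in a figure missing from the
# transcript): identical precision (66.7%) and recall (40%), very different
# negative recognition rate, and accuracy does flag the difference.
for label, tn, n in [("left ", 400, 500), ("right", 0, 100)]:   # (true negatives, total negatives)
    tp, fp, fn = 200, 100, 300
    print(label, {"precision": round(tp / (tp + fp), 3),
                  "recall": tp / (tp + fn),
                  "tn_rate": tn / n,
                  "accuracy": round((tp + tn) / (tp + fn + n), 3)})
```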

  14. What’s wrong with the t-test?
  • Classifiers 1 and 2 yield the same mean and confidence interval.
  • Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
  • Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed.
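One way to make this concrete is to check the normality assumption before trusting the t-test. The sketch below assumes SciPy and uses invented per-fold differences: a stable set, and an unstable set whose value comes almost entirely from a single outlier fold.

```python
# A sketch of checking the t-test's normality assumption, assuming SciPy; both
# sets of per-fold differences are invented and share roughly the same mean,
# but the second is dominated by one outlier and is clearly not normal.
import numpy as np
from scipy import stats

stable   = np.array([0.020, 0.024, 0.026, 0.022, 0.028, 0.025, 0.023, 0.027, 0.021, 0.029])
unstable = np.array([0.000, 0.000, 0.000, 0.000, 0.245, 0.000, 0.000, 0.000, 0.000, 0.000])

for name, diffs in [("Classifier 1", stable), ("Classifier 2", unstable)]:
    w_stat, p_value = stats.shapiro(diffs)   # a small p-value casts doubt on normality
    print(name, "mean =", round(diffs.mean(), 4), " Shapiro-Wilk p =", round(p_value, 4))
```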

  15. So what can be done?
  • Think about evaluation carefully prior to starting your experiments.
  • Use performance measures other than accuracy and precision/recall, e.g., ROC analysis or combinations of measures (see the sketch below). Also, think about the best measure for your problem.
  • Use re-sampling methods other than cross-validation, when necessary: bootstrapping? Randomization?
  • Use statistical tests other than the t-test: non-parametric tests (also sketched below); tests appropriate for many classifiers compared on many domains (the t-test is not appropriate for this case, which is the most common one).
  • We will try to discuss some of these issues next time.
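As a pointer, the sketch below illustrates two of these suggestions under assumed libraries (scikit-learn and SciPy): ROC analysis via the AUC, and the non-parametric Wilcoxon signed-rank test as an alternative to the t-test. The data, model, and per-fold accuracies are placeholders, not results from the course.

```python
# A sketch of two of the suggestions above, assuming scikit-learn and SciPy:
# ROC analysis (via the AUC) instead of accuracy, and the non-parametric
# Wilcoxon signed-rank test instead of the t-test. Data and models are placeholders.
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# ROC analysis: rank-based, so it is not tied to a single decision threshold
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, scores))

# Non-parametric comparison of two classifiers' per-fold accuracies (invented values)
acc_A = [0.81, 0.83, 0.80, 0.82, 0.84, 0.79, 0.82, 0.83, 0.81, 0.80]
acc_B = [0.78, 0.80, 0.77, 0.79, 0.83, 0.76, 0.80, 0.81, 0.78, 0.77]
print("Wilcoxon signed-rank p-value:", wilcoxon(acc_A, acc_B).pvalue)
```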
