
Experimental Evaluation of Learning Algorithms



  1. Experimental Evaluation of Learning Algorithms Part 1

  2. Motivation
  • Evaluating the performance of learning systems is important because:
  • Learning systems are usually designed to predict the class of “future” unlabeled data points.
  • In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).

  3. Recommended Steps for Proper Evaluation
  • Identify the “interesting” properties of the classifier.
  • Choose an evaluation metric accordingly.
  • Choose the learning algorithms to involve in the study, along with the domain(s) on which the various systems will be compared.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
  • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
  • Interpret the results with respect to the domain(s).

  4. The Classifier Evaluation Procedure
  [Flowchart: Choose Learning Algorithm(s) to Evaluate → Select Datasets for Comparison → Select Performance Measure of Interest → Select Statistical Test → Select Error-Estimation/Sampling Method → Perform Evaluation. The diagram uses two kinds of links between steps: one meaning that knowledge of step 1 is necessary for step 2, the other meaning that feedback from step 1 should be used to adjust step 2.]

  5. Typical (but not necessarily optimal) Choices
  • Identify the “interesting” properties of the classifier.
  • Choose an evaluation metric accordingly.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
  • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
  • Interpret the results with respect to the domain.

  6. Typical (but not necessarily optimal) Choices I:
  • Identify the “interesting” properties of the classifier.
  • Choose an evaluation metric accordingly.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the evaluation metric and confidence estimator are verified.
  • Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
  • Interpret the results with respect to the domain.
  (Note: in practice, almost only the highlighted steps are actually performed; the other steps are typically considered, but only very lightly.)

  7. Typical (but not necessarily optimal) Choices II:
  • Typical choices for performance evaluation:
    • Accuracy
    • Precision/Recall
  • Typical choices for sampling methods:
    • Train/Test Sets (Why is this necessary?)
    • K-Fold Cross-Validation
  • Typical choices for significance estimation:
    • t-test (often a very bad choice, in fact!)

  8. Confusion Matrix / Common Performance Evaluation Metrics
  A confusion matrix tabulates predicted against actual classes:

                          Predicted Positive    Predicted Negative
      Actual Positive (P)         TP                    FN
      Actual Negative (N)         FP                    TN

  • Accuracy = (TP+TN)/(P+N)
  • Precision = TP/(TP+FP)
  • Recall / TP rate = TP/P
  • FP rate = FP/N
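For concreteness, here is a minimal Python sketch of these four metrics, written directly from the formulas above; the counts in the example call are made-up numbers, not data from the slides.

```python
# A minimal sketch of the four metrics above, written directly from the formulas;
# the counts passed in come from whatever confusion matrix is at hand.
def confusion_metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn                  # actual positives and negatives
    return {"accuracy":  (tp + tn) / (p + n),
            "precision": tp / (tp + fp),
            "recall":    tp / p,             # also called the TP rate
            "fp_rate":   fp / n}

# Hypothetical example: 200 hits, 300 misses, 100 false alarms, 400 correct rejections
print(confusion_metrics(tp=200, fn=300, fp=100, tn=400))
```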

  9. Sampling and Significance Estimation: Questions Considered
  • Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
  • Given that one hypothesis outperforms another over some sample of data, how probable is it that this hypothesis is more accurate in general?
  • When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?

  10. k-Fold Cross-Validation
  1. Partition the available data D0 into k disjoint subsets T1, T2, …, Tk of equal size, where this size is at least 30.
  2. For i from 1 to k, do:
     • Use Ti as the test set, and the remaining data as the training set Si ← D0 − Ti
     • hA ← LA(Si)
     • hB ← LB(Si)
     • δi ← errorTi(hA) − errorTi(hB)
  3. Return avg(δ), where avg(δ) = (1/k) Σi=1..k δi
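Below is a rough Python sketch of this paired k-fold procedure, assuming scikit-learn is available; the dataset and the two learners LA and LB are placeholders (a synthetic dataset, a decision tree, and naive Bayes), not the ones used in the course.

```python
# A rough sketch of the paired k-fold comparison above, assuming scikit-learn.
# The dataset and the two learners L_A and L_B are placeholders only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # stands in for D0
learner_A, learner_B = DecisionTreeClassifier(random_state=0), GaussianNB()

deltas = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # S_i = D0 - T_i is the training set; T_i is the held-out test fold
    err_A = 1.0 - learner_A.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    err_B = 1.0 - learner_B.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    deltas.append(err_A - err_B)             # delta_i = error_Ti(h_A) - error_Ti(h_B)

print("avg(delta) =", np.mean(deltas))       # estimated difference in error between L_A and L_B
```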

  11. Confidence of the k-fold Estimate
  • The most commonly used approach to confidence estimation in machine learning is:
  • To run the algorithm using 10-fold cross-validation and to record the accuracy at each fold.
  • To compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula:
        δ ± tN,9 · sδ
    where
  • δ is the average difference between the reported accuracy and the given gold standard,
  • tN,9 is a constant chosen according to the degree of confidence desired,
  • sδ = sqrt( (1/90) Σi=1..10 (δi − δ)² ), where δi represents the difference between the reported accuracy and the given gold standard at fold i.
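As an illustration, here is a small sketch of this interval computation for k = 10 folds, assuming SciPy; the per-fold differences are invented values standing in for the recorded δi.

```python
# A sketch of the confidence interval above for k = 10 folds, assuming SciPy.
# The per-fold differences are invented values standing in for the recorded delta_i.
import numpy as np
from scipy import stats

deltas = np.array([0.02, 0.01, 0.03, 0.00, 0.02, 0.04, 0.01, 0.02, 0.03, 0.02])

k = len(deltas)                                                        # 10 folds -> 9 degrees of freedom
delta_bar = deltas.mean()                                              # average difference
s_delta = np.sqrt(np.sum((deltas - delta_bar) ** 2) / (k * (k - 1)))   # the 1/90 factor when k = 10
t_crit = stats.t.ppf(0.975, df=k - 1)                                  # two-sided 95% confidence

print(f"{delta_bar:.4f} +/- {t_crit * s_delta:.4f}")
```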

  12. What’s wrong with Accuracy?
  • Both classifiers obtain 60% accuracy.
  • They exhibit very different behaviours:
  • On the left: weak positive recognition rate / strong negative recognition rate.
  • On the right: strong positive recognition rate / weak negative recognition rate.
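The two classifiers referred to here appear as confusion matrices in the original slide figure, which is not part of this transcript. The counts below are hypothetical, chosen only to reproduce the situation described: both reach 60% accuracy while their positive and negative recognition rates are reversed.

```python
# Hypothetical confusion-matrix counts (not the slide's actual figures): both
# classifiers reach 60% accuracy, but their recognition rates are reversed.
def rates(tp, fn, fp, tn):
    return {"accuracy": (tp + tn) / (tp + fn + fp + tn),
            "tp_rate": tp / (tp + fn),       # positive recognition rate
            "tn_rate": tn / (fp + tn)}       # negative recognition rate

print("left: ", rates(tp=200, fn=300, fp=100, tn=400))   # weak positive, strong negative
print("right:", rates(tp=400, fn=100, fp=300, tn=200))   # strong positive, weak negative
```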

  13. What’s wrong with Precision/Recall?
  • Both classifiers obtain the same precision and recall values of 66.7% and 40%.
  • They exhibit very different behaviours:
  • Same positive recognition rate.
  • Extremely different negative recognition rate: strong on the left / nil on the right.
  • Note: Accuracy has no problem catching this!
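Again, the slide's confusion matrices are in a figure that is not in the transcript; the counts below are hypothetical and merely reproduce the stated precision and recall of 66.7% and 40% while varying the number of true negatives.

```python
# Hypothetical counts (the slide's matrices are in a figure missing from the
# transcript): identical precision (66.7%) and recall (40%), very different
# negative recognition rate, and accuracy does flag the difference.
for label, tn, n in [("left ", 400, 500), ("right", 0, 100)]:   # (true negatives, total negatives)
    tp, fp, fn = 200, 100, 300
    print(label, {"precision": round(tp / (tp + fp), 3),
                  "recall": tp / (tp + fn),
                  "tn_rate": tn / n,
                  "accuracy": round((tp + tn) / (tp + fn + n), 3)})
```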

  14. What’s wrong with the t-test?
  • Classifiers 1 and 2 yield the same mean and confidence interval.
  • Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
  • Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed.
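One way to make this concrete is to check the normality assumption before trusting the t-test. The sketch below assumes SciPy and uses invented per-fold differences: a stable set, and an unstable set whose value comes almost entirely from a single outlier fold.

```python
# A sketch of checking the t-test's normality assumption, assuming SciPy; both
# sets of per-fold differences are invented and share roughly the same mean,
# but the second is dominated by one outlier and is clearly not normal.
import numpy as np
from scipy import stats

stable   = np.array([0.020, 0.024, 0.026, 0.022, 0.028, 0.025, 0.023, 0.027, 0.021, 0.029])
unstable = np.array([0.000, 0.000, 0.000, 0.000, 0.245, 0.000, 0.000, 0.000, 0.000, 0.000])

for name, diffs in [("Classifier 1", stable), ("Classifier 2", unstable)]:
    w_stat, p_value = stats.shapiro(diffs)   # a small p-value casts doubt on normality
    print(name, "mean =", round(diffs.mean(), 4), " Shapiro-Wilk p =", round(p_value, 4))
```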

  15. So what can be done?
  • Think about evaluation carefully prior to starting your experiments.
  • Use performance measures other than accuracy and precision/recall, e.g., ROC analysis or combinations of measures (see the sketch below). Also, think about the best measure for your problem.
  • Use re-sampling methods other than cross-validation, when necessary: bootstrapping? Randomization?
  • Use statistical tests other than the t-test: non-parametric tests (also sketched below); tests appropriate for many classifiers compared on many domains (the t-test is not appropriate for this case, which is the most common one).
  • We will try to discuss some of these issues next time.
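As a pointer, the sketch below illustrates two of these suggestions under assumed libraries (scikit-learn and SciPy): ROC analysis via the AUC, and the non-parametric Wilcoxon signed-rank test as an alternative to the t-test. The data, model, and per-fold accuracies are placeholders, not results from the course.

```python
# A sketch of two of the suggestions above, assuming scikit-learn and SciPy:
# ROC analysis (via the AUC) instead of accuracy, and the non-parametric
# Wilcoxon signed-rank test instead of the t-test. Data and models are placeholders.
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# ROC analysis: rank-based, so it is not tied to a single decision threshold
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:", roc_auc_score(y_te, scores))

# Non-parametric comparison of two classifiers' per-fold accuracies (invented values)
acc_A = [0.81, 0.83, 0.80, 0.82, 0.84, 0.79, 0.82, 0.83, 0.81, 0.80]
acc_B = [0.78, 0.80, 0.77, 0.79, 0.83, 0.76, 0.80, 0.81, 0.78, 0.77]
print("Wilcoxon signed-rank p-value:", wilcoxon(acc_A, acc_B).pvalue)
```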
