Data Mining, CSCI 307, Spring 2019, Lecture 25: Evaluating the Results
Credibility: Evaluating What's Been Learned • Issues: training, testing, tuning • Predicting performance: confidence limits • Holdout, cross-validation, bootstrap
Evaluation: the Key to Success How predictive is the model we learned? • Performance on the training data is not a good indicator of performance on future data • Simple solution that can be used if lots of (labeled) data is available: • Split data into training and test set • However: quality data is often scarce
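To make the split idea concrete, here is a minimal sketch in Python with scikit-learn (an assumed toolkit, not necessarily the one used in this course); the dataset, classifier, and split ratio are illustrative choices only. It also shows why training-set performance is a poor indicator of performance on unseen data.

```python
# Minimal holdout sketch: hold back part of the labeled data for testing.
# Assumes scikit-learn is available; dataset and classifier are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Split the labeled data: train on one part, test on the part the learner never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1, stratify=y)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

print("accuracy on training data:  ", accuracy_score(y_train, model.predict(X_train)))
print("accuracy on unseen test data:", accuracy_score(y_test, model.predict(X_test)))
```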
Issues in Evaluation • We need statistics to estimate differences in performance • What should be measured? There are several possible performance measures: • Number of correct classifications • Accuracy of probability estimates in predicting the class • Error in numeric predictions (versus nominal predictions) • As a practical matter, costs can be assigned to different types of errors • the cost of a misclassification error depends on the type of error, i.e. a positive example erroneously classified as negative or vice versa
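As a rough illustration of cost-sensitive counting, the sketch below weights false negatives more heavily than false positives; the labels, predictions, and cost values are made up purely for the example.

```python
# Sketch: total cost of misclassification when error types have different costs.
# The cost values and toy data are arbitrary, chosen only to illustrate the idea.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = positive class, 0 = negative class
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # classifier's predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

COST_FP = 1.0    # positive predicted for a negative instance
COST_FN = 5.0    # negative predicted for a positive instance (assumed more serious)

total_cost = fp * COST_FP + fn * COST_FN
print(f"false positives={fp}, false negatives={fn}, total cost={total_cost}")
```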
Model Evaluation and Selection Summary Evaluation Metrics: How can we measure accuracy? Other metrics to consider? • Use a test set of class-labeled instances instead of the training set when assessing accuracy Looking Ahead: • Methods for estimating a classifier's accuracy: • Holdout method, Cross-validation, Bootstrap • Comparing classifiers: • Confidence intervals, Cost-benefit analysis and ROC (receiver operating characteristic, a graph showing classifier performance) Curves • Thinking about numeric prediction
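Looking ahead to ROC curves: the points on the curve can be computed from a classifier's probability estimates. A minimal sketch, again assuming scikit-learn and using hand-picked toy scores:

```python
# Sketch: ROC points and area under the curve from probability estimates.
# Assumes scikit-learn; scores here are toy values for illustration.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55]  # estimated P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("false positive rates:", fpr)
print("true positive rates: ", tpr)
print("area under ROC curve:", roc_auc_score(y_true, y_score))
```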
5.1 Training and Testing • Natural performance measure for classification problems: Error Rate • Success: instance’s class is predicted correctly • Error: instance’s class is predicted incorrectly • Error Rate: proportion of errors made over the whole set of instances • Resubstitution error: error rate obtained from using training data to measure performance. • Resubstitution error is (hopelessly) optimistic
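A short sketch of the error-rate calculation, and of why resubstitution error is optimistic; the dataset and 1-nearest-neighbor learner are illustrative assumptions (1-NN makes the optimism extreme, since it can usually reproduce its own training data perfectly).

```python
# Sketch: error rate = proportion of misclassified instances.
# Resubstitution error (measured on the training data) is typically optimistic.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

def error_rate(y_actual, y_predicted):
    errors = sum(a != p for a, p in zip(y_actual, y_predicted))
    return errors / len(y_actual)

print("resubstitution error:", error_rate(y_train, model.predict(X_train)))  # usually 0 for 1-NN
print("test-set error:      ", error_rate(y_test, model.predict(X_test)))
```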
Training and Testing continued • Test set: independent instances that have played no part in formation of classifier • Assumption: both training data and test data are representative samples of the underlying problem • Test and training data may differ in nature • Example: classifiers built using customer data from two different towns A and B • To estimate performance of the classifier from town A in a completely new town, test it on data from town B
Parameter Tuning It is important that the test data is not used in any way to create the classifier • Some learning schemes operate in two stages: • Stage 1: build the basic structure • Stage 2: optimize parameter settings • Test data cannot be used for parameter tuning; three sets are needed: training data (for stage 1), validation data (for stage 2), and test data
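A minimal sketch of the three-way split, assuming scikit-learn; the 60/20/20 proportions, dataset, and parameter grid are illustrative choices. The key point is that the test set is touched only once, after the parameter has been chosen on the validation set.

```python
# Sketch: three-way split so the test data plays no part in parameter tuning.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)

# First carve off the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=3)

# Stage 2: pick the parameter setting that does best on the validation data only.
best_depth, best_acc = None, -1.0
for depth in (1, 2, 3, 5, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=3).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Only now evaluate on the test set, once, with the chosen setting.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=3).fit(X_train, y_train)
print("chosen max_depth: ", best_depth)
print("test-set accuracy:", accuracy_score(y_test, final.predict(X_test)))
```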
Making the Most of the Data • Often, once evaluation is complete, all the data can be used to build the final classifier • Generally, the larger the training set, the better the classifier • The larger the test set, the more accurate the error estimate • Holdout procedure: method of splitting the original data into a training set and a test set • Dilemma: ideally both training set and test set should be large (and representative)
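A brief sketch of that first point, under the same scikit-learn assumption: the holdout split produces the error estimate that gets reported, and the classifier that is actually kept is then rebuilt on all of the data.

```python
# Sketch: estimate performance with a holdout split, then rebuild on ALL the data.
# Dataset and learner are illustrative choices.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)

# Step 1: holdout evaluation gives the accuracy estimate we report.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
estimate = accuracy_score(y_test, GaussianNB().fit(X_train, y_train).predict(X_test))
print("estimated accuracy from the holdout:", estimate)

# Step 2: once evaluation is complete, train the classifier we actually keep on all the data.
final_model = GaussianNB().fit(X, y)
```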