M. Sulaiman Khan (mskhan@liv.ac.uk), Dept. of Computer Science, University of Liverpool, 2009

COMP527: Data Mining
Classification: Evaluation
February 23, 2009

COMP527: Data Mining
• Introduction to the Course
• Introduction to Data Mining
• Introduction to Text Mining
• General Data Mining Issues
• Data Warehousing
• Classification: Challenges, Basics
• Classification: Rules
• Classification: Trees
• Classification: Trees 2
• Classification: Bayes
• Classification: Neural Networks
• Classification: SVM
• Classification: Evaluation
• Classification: Evaluation 2
• Regression, Prediction
• Input Preprocessing
• Attribute Selection
• Association Rule Mining
• ARM: A Priori and Data Structures
• ARM: Improvements
• ARM: Advanced Techniques
• Clustering: Challenges, Basics
• Clustering: Improvements
• Clustering: Advanced Algorithms
• Hybrid Approaches
• Graph Mining, Web Mining
• Text Mining: Challenges, Basics
• Text Mining: Text-as-Data
• Text Mining: Text-as-Language
• Revision for Exam

Today's Topics
• Evaluation
• Samples
• Cross Validation
• Bootstrap
• Confidence of Accuracy

Evaluation
We need some way to quantitatively evaluate the results of data mining.
• Just how accurate is the classification?
• How accurate can we expect a classifier to be?
• If we can't evaluate the classifier, how can it be improved?
• Can different types of classifier be evaluated in the same way?
• What are useful criteria for such a comparison?
• How can we evaluate clusters or association rules?
There are lots of issues to do with evaluation.

Evaluation
Assuming classification, the basic evaluation is how many correct predictions the classifier makes as opposed to incorrect predictions.
We can't test on the data used to train the classifier and get an accurate result. The result is "hopelessly optimistic" (Witten).
E.g. due to over-fitting, a classifier might get 100% accuracy on the data it was trained on and 0% accuracy on other data.
The error rate measured on the training data is called the resubstitution error rate: the error rate when you substitute the data back into the classifier generated from it.
So we need some new, but labeled, data to test on.
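A minimal sketch of why the resubstitution error is optimistic, assuming a hypothetical classifier that simply memorises its training data. The names `MemoriserClassifier` and `error_rate` and the toy data are illustrative assumptions, not part of the course material.

```python
from collections import Counter


class MemoriserClassifier:
    """Hypothetical over-fitted classifier: a lookup table of the training data."""

    def fit(self, instances, labels):
        self.table = dict(zip(instances, labels))
        # Fall back to the most common training label for unseen instances.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, instance):
        return self.table.get(instance, self.default)


def error_rate(classifier, instances, labels):
    """Fraction of instances the classifier labels incorrectly."""
    wrong = sum(classifier.predict(x) != y for x, y in zip(instances, labels))
    return wrong / len(labels)


# Toy data (made up): the true label is "even" or "odd".
train_x = [2, 4, 6, 8, 9]
train_y = ["even", "even", "even", "even", "odd"]
test_x = [1, 3, 5, 10]
test_y = ["odd", "odd", "odd", "even"]

clf = MemoriserClassifier().fit(train_x, train_y)
resubstitution_error = error_rate(clf, train_x, train_y)  # 0.0
held_out_error = error_rate(clf, test_x, test_y)  # 0.75
```

The classifier looks perfect on its own training data, yet gets three of the four unseen instances wrong.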

Validation
Most of the time we do not have enough data to have a lot for training and a lot for testing, though sometimes this is possible (e.g. sales data).
Some systems have two phases of training: an initial learning period and then fine tuning, for example the growing and pruning sets used for building trees. The data held back for the second phase is the validation set.
It's important not to test on the validation set either, since it has also been used to build the classifier.
Note that this reduces the amount of data that you can actually train on by a significant amount.

Numeric Data, Multiple Classes
Further issues to consider:
• Some classifiers produce probabilities for one or more classes. We need some way to handle the probabilities, so that a classifier can be partly correct. Also, for problems where an instance can have two or more classes, we need some 'cost' function to score predicting an accurate subset of the classes.
• Regression/numeric prediction produces a numeric value. We need statistical tests to determine how accurate this is, rather than true/false as for nominal classes.

Hold Out Method
Obvious answer: keep part of the data set aside for testing purposes and use the rest to train the classifier. Then use the test set to evaluate the resulting classifier in terms of accuracy.
Accuracy: number of correctly classified instances / total number of instances to classify.
The ratio is often 2/3 training, 1/3 test.
How should we select the instances for each section?
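The hold-out split and the accuracy ratio could be sketched as follows. This assumes a plain random split; the function names, the fixed seed and the stand-in data are illustrative choices.

```python
import random


def hold_out_split(instances, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Correctly classified instances / total instances classified."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


data = list(range(30))  # stand-in for 30 labelled instances
train, test = hold_out_split(data)  # 20 training instances, 10 test instances
```

Shuffling a copy leaves the caller's data untouched, and the seed makes the split repeatable between runs.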

Samples
Easy: randomly select instances.
But the data could be very unbalanced, e.g. 99% one class, 1% the other class. Random sampling may then draw few or no instances of the 1% class.
Stratified: group the instances by class and then select a proportionate number from each class.
Balanced: randomly select a desired number of minority class instances, and then add the same number from the majority class.
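A rough sketch of stratified and balanced sampling, assuming the data is a list of (instance, class) pairs. The helper names and the 99%/1% toy data are made up for illustration.

```python
import random
from collections import defaultdict


def group_by_class(labelled):
    """Map each class label to the list of (instance, label) pairs carrying it."""
    groups = defaultdict(list)
    for instance, label in labelled:
        groups[label].append((instance, label))
    return groups


def stratified_sample(labelled, fraction, rng):
    """Draw the same fraction from every class, preserving class proportions."""
    sample = []
    for members in group_by_class(labelled).values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


def balanced_sample(labelled, per_class, rng):
    """Draw the same number of instances from every class."""
    sample = []
    for members in group_by_class(labelled).values():
        sample.extend(rng.sample(members, per_class))
    return sample


# Made-up unbalanced data: 99% class "a", 1% class "b".
data = [(i, "a") for i in range(990)] + [(i, "b") for i in range(990, 1000)]
rng = random.Random(0)
stratified = stratified_sample(data, 0.3, rng)  # 297 of "a", 3 of "b"
balanced = balanced_sample(data, 5, rng)  # 5 of "a", 5 of "b"
```

The stratified sample keeps the 99:1 ratio, while the balanced sample discards most of the majority class to reach 1:1.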



Small Data Sets
For small data sets, it is hard to remove some instances as a test set and still have a representative set to train from. Solutions?
Repeat the process multiple times, selecting a different test set each time. Then find the error from each iteration, and average across all of the iterations.
Of course there's no reason to do this only for small data sets!
Different test sets might still overlap, which might give a biased estimate of the accuracy (e.g. if it randomly selects good records multiple times).
Can we prevent this?

Cross Validation
Split the data set into k parts, then use each part in turn as the test set and the others as the training set.
If each part is also stratified, we have stratified cross validation, rather than perhaps ending up with a non-representative sample in one or more parts.
Common values for k are 3 (the same 2/3 to 1/3 ratio as hold out) and 10. Hence: stratified 10-fold cross validation.
Again, the error values are averaged after the k iterations.
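One way stratified k-fold cross validation might be sketched, assuming folds are built by dealing each class out round-robin. The `evaluate` callback, the majority-class stand-in classifier and the 60/40 toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign instance indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validate(labels, k, evaluate):
    """Average the error returned by evaluate(train_indices, test_indices)."""
    folds = stratified_folds(labels, k)
    errors = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        errors.append(evaluate(train, test))
    return sum(errors) / k


labels = ["yes"] * 60 + ["no"] * 40  # made-up class labels


def majority_class_error(train, test):
    """Toy evaluate(): predict the most common training label for every test instance."""
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] != majority for i in test) / len(test)


folds = stratified_folds(labels, 10)  # each fold holds 6 "yes" and 4 "no"
mean_error = cross_validate(labels, 10, majority_class_error)  # approximately 0.4
```

Every instance is used for testing exactly once, and each fold mirrors the 60/40 class ratio of the whole data set.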

Cross Validation
Why 10? Extensive testing shows it to be a good middle ground: not too much processing, not too random.
Cross validation is used extensively throughout the data mining literature. It is the simplest and easiest to understand evaluation technique, while still giving a good estimate of accuracy.
There are other similar evaluation techniques, however ...

Leave One Out
Select one instance and train on all the others. Then see if the instance is correctly classified. Repeat for every instance and find the percentage of accurate results.
I.e. N-fold cross validation, where N is the number of instances in the data set.
Attractive:
• If 10 is good, surely N is better :)
• No random sampling problems
• Trains on the largest possible amount of data

Leave One Out
Disadvantages:
• Computationally expensive: builds N models!
• Guarantees a non-stratified, non-balanced sample, since the test set is a single instance.
Worst case: the class distribution is exactly 50/50 and the data is so complicated that the classifier simply picks the most common class. Removing the test instance makes its class the minority in the training set, so the classifier will always pick the wrong class.
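The worst case can be reproduced with a small sketch, assuming a stand-in classifier that ignores the attributes and always predicts the most common training class. The function names and the 50/50 toy labels are illustrative.

```python
from collections import Counter


def leave_one_out_accuracy(labels, predict):
    """N-fold cross validation: hold out each instance once, train on the rest."""
    correct = 0
    for i, actual in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        correct += predict(training) == actual
    return correct / len(labels)


def majority_class(training_labels):
    """Stand-in classifier: predict the commonest class in the training labels."""
    return Counter(training_labels).most_common(1)[0][0]


labels = ["a"] * 50 + ["b"] * 50  # exactly 50/50
loo_accuracy = leave_one_out_accuracy(labels, majority_class)  # 0.0
```

Holding out an "a" leaves 49 "a" against 50 "b", so the prediction is "b", and vice versa. The estimate is 0% even though this classifier would be right about half the time on new data.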

Bootstrap
Until now, the sampling has been without replacement (i.e. each instance occurs once, either in the training or the test set).
However, we could put an instance back to be drawn again: sampling with replacement. This results in the 0.632 bootstrap evaluation technique.
Draw a training set from the data set with replacement, such that the number of instances in both is the same, then use the instances which are not in the training set as the test set. (I.e. some instances will appear more than once in the training set.)
Statistically, the likelihood of an instance never being picked is about 0.368, so about 0.632 of the instances end up in the training set, hence the name.
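The 0.368 figure can be derived as follows. This is a short sketch in which n, the number of instances in the data set, is the only symbol introduced here.

```latex
P(\text{not picked in one draw}) = 1 - \frac{1}{n}
\qquad
P(\text{never picked in } n \text{ draws}) = \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368
```

So a training set of n draws is expected to contain about 1 - 0.368 = 0.632 of the distinct instances.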

Bootstrap
E.g. we have a data set of 1000 instances. We sample with replacement 1000 times, i.e. we randomly select an instance from all 1000 instances, 1000 times.
This should leave us with approximately 368 instances that have not been selected. We remove these and use them for the test set.
The error rate will be pessimistic: we are only training on about 63% of the distinct instances, with some instances repeated.
We compensate by combining it with the optimistic error rate from resubstitution:
error rate = 0.632 * error-on-test + 0.368 * error-on-training
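A small sketch of the bootstrap split and the combined error estimate. The function names, the seed and the two error rates fed into the formula are assumptions chosen only to show the arithmetic.

```python
import random


def bootstrap_split(n, rng):
    """Sample n indices with replacement; unpicked indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    picked = set(train)
    test = [i for i in range(n) if i not in picked]
    return train, test


def bootstrap_632_error(test_error, training_error):
    """Combine the pessimistic test error with the optimistic resubstitution error."""
    return 0.632 * test_error + 0.368 * training_error


rng = random.Random(0)
train, test = bootstrap_split(1000, rng)
# len(train) is 1000 (with repeats); len(test) should be roughly 368.

# Made-up error rates, purely to show the arithmetic.
estimate = bootstrap_632_error(test_error=0.30, training_error=0.05)
```

With these made-up inputs the estimate is 0.632 * 0.30 + 0.368 * 0.05, which is about 0.208. In practice the whole procedure is usually repeated with different samples and the estimates averaged.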

Confidence of Accuracy
What about the size of the test set? More test instances should make us more confident that the measured accuracy is close to the true accuracy. E.g. 75% on 10,000 samples is more likely to be close to the true accuracy than 75% on 10.
A series of events that succeed or fail is a Bernoulli process, e.g. coin tosses. We can count S successes from N trials, and then compute S/N ... but what does that tell us about the true accuracy rate?
Statistics can then tell us the range within which the true accuracy rate should fall. E.g. for 750/1000, the true rate lies between 73.2% and 76.7% with 80% confidence.
(Witten 147 to 149 has the full maths!)
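One way to compute such a range is the Wilson score interval, sketched below. The choice of interval and the 80% confidence level (z = 1.28) are assumptions, picked because they reproduce the 750/1000 figures; the function name is illustrative.

```python
import math


def accuracy_interval(successes, trials, z):
    """Wilson score interval for the true success rate of a Bernoulli process."""
    f = successes / trials
    centre = f + z * z / (2 * trials)
    spread = z * math.sqrt(f * (1 - f) / trials + z * z / (4 * trials * trials))
    scale = 1 + z * z / trials
    return (centre - spread) / scale, (centre + spread) / scale


# z = 1.28 is the two-sided value for 80% confidence (an assumed level).
low, high = accuracy_interval(750, 1000, z=1.28)  # about 0.732 and 0.767
small_low, small_high = accuracy_interval(75, 100, z=1.28)  # same 75%, wider range
```

The same 75% success rate gives a much wider interval on 100 trials than on 1000, which is the point about test set size.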

Confidence of Accuracy
We might wish to compare two classifiers of different types. We could compare the accuracy of 10-fold cross validation, but there's another method: Student's T-Test.
Method:
• Perform ten-fold cross validation (TCV) 10 times with the first classifier, i.e. 10 × TCV = 100 models
• Perform the same repeated TCV with the second classifier
• This gives us x1..x10 for the first, and y1..y10 for the second.
• Find the mean of the 10 cross-validation runs for each.
• Find the difference between the two means.
We want to know if the difference is statistically significant.

Student's T-Test
We then find 't' by:
\[ t = \frac{\bar{d}}{\sqrt{\sigma_d^2 / k}} \]
where \bar{d} is the difference between the means, k is the number of times the cross validation was performed, and \sigma_d^2 is the variance of the differences between the paired samples (variance = sum of squared differences between each value and the mean, divided by k-1).
Then look up the critical value z in the table for k-1 degrees of freedom. (More tables! But printed in Witten pg 155.)
If t is greater than z, or less than -z, then the difference is statistically significant.
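The t statistic for paired results could be computed as in this sketch. It assumes the sample variance (dividing by k-1); the function name and the two lists of accuracies are made up for illustration.

```python
import math


def paired_t_statistic(xs, ys):
    """t statistic for the mean difference between paired samples xs and ys."""
    k = len(xs)
    differences = [x - y for x, y in zip(xs, ys)]
    mean_difference = sum(differences) / k
    # Sample variance of the differences (divides by k - 1).
    variance = sum((d - mean_difference) ** 2 for d in differences) / (k - 1)
    return mean_difference / math.sqrt(variance / k)


# Made-up accuracies from 10 repeated cross-validation runs of two classifiers.
x = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.82]
y = [0.78, 0.77, 0.80, 0.79, 0.79, 0.77, 0.80, 0.78, 0.79, 0.78]
t = paired_t_statistic(x, y)
# Compare |t| with the table value for k - 1 = 9 degrees of freedom.
```

Because the mean of the paired differences equals the difference between the two means, the numerator matches the definition used with the formula.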

Further Reading
• Introductory statistical text books, again
• Witten, 5.1-5.4
• Han 6.2, 6.12, 6.13
• Berry and Browne, 1.4
• Devijver and Kittler, Chapter 10
