Chapter 5: Credibility

Introduction
• Performance on the training set is not a good indicator of performance on an independent test set.
• We need to predict performance bounds.
• Quality training data is difficult to obtain and is not always abundant.
• Performance prediction based on limited training data is controversial; repeated cross-validation is the most useful technique in these situations.
• The cost of misclassification is also an important criterion.
• Statistical tests are also needed to validate the conclusions.

Training and Testing
• Error rate of a classifier: a correct classification counts as a success; anything else is an error. If 700 of 1000 instances are classified correctly and 300 incorrectly, the error rate is 30%.
• Is the classifier's performance on training data a good indicator of its performance on test data and future data?
• No: the error rate on training data (resubstitution error) is not a good indicator of the error rate on test data, because the classifier may overfit the training data.
• Test data: data not used in the training phase
• Training data: used by one or more learning methods to come up with classifiers
• Validation data: used to optimize parameters or to select a particular classifier
• Test data: used to calculate the error rate of the final, optimized classifier

Predicting Performance
• Success rate (%) = 100% - error rate (%)
• Confidence interval: when the test set is not large, the observed error rate (or success rate) should be reported together with a confidence interval.
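
One common way to attach a confidence interval to an observed success rate is the Wilson score interval for a binomial proportion. The sketch below is an illustration only: the function name, the sample counts (750 successes out of 1000), and the choice z = 1.28 (roughly 80% two-sided confidence) are assumptions made for the example, not values from the slides.

```python
import math

def wilson_interval(successes, n, z):
    """Wilson score interval for a success rate observed on n test instances.

    z is the standard-normal quantile for the chosen confidence level
    (for example, z = 1.28 for roughly 80% two-sided confidence).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    f = successes / n                      # observed success rate
    z2 = z * z
    centre = f + z2 / (2 * n)
    spread = z * math.sqrt(f * (1 - f) / n + z2 / (4 * n * n))
    denom = 1 + z2 / n
    return (centre - spread) / denom, (centre + spread) / denom

# Hypothetical example: 750 successes on 1000 test instances.
low, high = wilson_interval(750, 1000, 1.28)
print(f"success rate lies in [{low:.3f}, {high:.3f}]")
```

With the same observed rate on a smaller test set, the interval widens, which is why small test sets call for interval estimates rather than a single number.
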

Cross Validation
• Holdout: hold back 1/3 of the available data for testing and use the remaining 2/3 for training.
• Stratified holdout: the training data should be representative of the overall data, so each class should appear in the training and test sets in about the same proportion as in the full dataset.
• Repeated holdout: repeat the random selection several times and obtain a different error rate each time.
• Three-fold cross-validation: divide the data into three equal partitions (folds) and make three iterations, each time using one fold as test data and the other two as training data.
• 10-fold cross-validation: divide the data into 10 folds; in each iteration use 9 folds to train and the remaining one to test. The 10 error estimates are averaged to yield an overall error estimate.
• Sometimes the whole 10-fold cross-validation is repeated several times with different random partitions into 10 folds.
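
The fold-based procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `stratified_folds`, `cross_validated_error`, and the `train` callback (which takes training instances and labels and returns a prediction function) are hypothetical names, and the sketch assumes k is no larger than the number of instances.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign instance indices to k folds so that each fold has
    roughly the same class proportions as the whole dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    next_fold = 0
    for members in by_class.values():
        rng.shuffle(members)
        for idx in members:               # deal indices round-robin
            folds[next_fold % k].append(idx)
            next_fold += 1
    return folds

def cross_validated_error(data, labels, k, train, seed=0):
    """Average the k per-fold error rates.
    `train(xs, ys)` is a hypothetical callback returning predict(x)."""
    errors = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        predict = train([data[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        wrong = sum(predict(data[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k

def majority_learner(xs, ys):
    """Trivial baseline: always predict the most frequent training class."""
    most_common = max(set(ys), key=ys.count)
    return lambda x: most_common

labels = ["yes"] * 30 + ["no"] * 60
data = list(range(len(labels)))
print(cross_validated_error(data, labels, 3, majority_learner))
```

Repeating the call with different `seed` values corresponds to repeating the cross-validation with different random partitions.
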

Other Estimates
• Leave-one-out cross-validation: n-fold cross-validation where n is the number of instances in the dataset. Each instance in turn is left out for testing, and the remaining n-1 instances are used for training. The n results are averaged to give the final error estimate.
• Bootstrap error estimation: sampling with replacement. A dataset of n instances is sampled n times, with replacement, to give a training set of n instances (some of them repeated). The instances that were never picked form the test set.
• This is also referred to as the 0.632 bootstrap: each instance has probability (1 - 1/n)^n ≈ e^-1 ≈ 0.368 of never being picked, so about 63.2% of the distinct instances end up in the training set.
• The error estimate obtained over the test set is a pessimistic estimate of the true error rate, because the training set contains only about 63% of the instances, whereas it contains 90% in 10-fold cross-validation.
• The final error rate is computed as: E = 0.632 * (error rate over test instances) + 0.368 * (error rate over training instances)
• Repeat the bootstrap procedure several times and average the error rates.
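
The bootstrap sampling step and the weighted error formula can be sketched as follows. The function names and the dataset size are assumptions chosen for illustration; the sketch only draws the index sets and combines two error rates that would come from a real learner.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; the indices that
    were never drawn form the test set."""
    train_idx = [rng.randrange(n) for _ in range(n)]
    picked = set(train_idx)
    test_idx = [i for i in range(n) if i not in picked]
    return train_idx, test_idx

def bootstrap_632(test_error, training_error):
    """Combine the two error rates with the 0.632 / 0.368 weights."""
    return 0.632 * test_error + 0.368 * training_error

rng = random.Random(42)
n = 10_000
train_idx, test_idx = bootstrap_split(n, rng)
# For large n, the fraction of distinct instances in the training
# set should be close to 0.632.
print(len(set(train_idx)) / n)
```

In practice the split, training, and error computation would be repeated several times and the resulting estimates averaged.
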

Comparing Data Mining Methods
• If a new learning algorithm is proposed, its proponents must show that it improves on the state of the art for the problem at hand and demonstrate that the observed improvement is not just a chance effect of the estimation process.
• A technique cannot be thrown out because it does poorly on one dataset; its average performance over different datasets must be considered.
• Determine whether the mean of one set of samples (the cross-validation estimates for the various datasets sampled from the domain) is significantly greater than, or significantly less than, the mean of another. The t-test (Student's t-test) and the paired t-test are the preferred tools.
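
As a rough illustration of the paired t-test, the sketch below computes the t statistic from two lists of paired estimates (for example, the success rates of two methods on the same folds or datasets). The function name and the sample numbers are hypothetical; the resulting statistic would be compared against a Student's t table with k-1 degrees of freedom.

```python
import math

def paired_t_statistic(xs, ys):
    """t statistic for the paired differences xs[i] - ys[i]."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    k = len(diffs)
    mean = sum(diffs) / k
    variance = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(variance / k)

# Hypothetical success rates of two methods on the same five datasets.
method_a = [0.80, 0.82, 0.78, 0.85, 0.81]
method_b = [0.75, 0.80, 0.74, 0.79, 0.78]
print(paired_t_statistic(method_a, method_b))
```

A large positive value suggests that the first method is better on average; whether it is significant depends on the chosen confidence level and the degrees of freedom.
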

Predicting Probabilities
• 0-1 loss function: a prediction is either right (loss 0) or wrong (loss 1).
• When a classifier outputs a probability for each class, the outcome is no longer a 0-1 situation.
• Quadratic loss function: let $(p_1, p_2, \ldots, p_k)$ be the predicted probability vector over the k classes for an instance, and $(a_1, a_2, \ldots, a_k)$ the actual outcome vector, in which the entry for the instance's true class is 1 and all other entries are 0. The quadratic loss is $\sum_j (p_j - a_j)^2$. If $i$ is the correct class ($a_i = 1$), this can be rewritten as $1 - 2p_i + \sum_j p_j^2$. When the test set contains several instances, the loss is summed over all of them.
• Informational loss function: $-\log_2 p_i$, where $i$ is the correct class.
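
Both loss functions are easy to compute for a single instance. The sketch below uses hypothetical function names and an invented probability vector; `actual` is the index of the true class.

```python
import math

def quadratic_loss(probs, actual):
    """Sum over classes of (p_j - a_j)^2, where a_j is 1 for the
    true class and 0 otherwise."""
    return sum((p - (1.0 if j == actual else 0.0)) ** 2
               for j, p in enumerate(probs))

def informational_loss(probs, actual):
    """-log2 of the probability assigned to the true class.
    Returns infinity if that probability is zero."""
    p = probs[actual]
    return math.inf if p == 0 else -math.log2(p)

probs = [0.5, 0.25, 0.25]      # hypothetical prediction over 3 classes
print(quadratic_loss(probs, 0))
print(informational_loss(probs, 0))
```

The informational loss becomes infinite when the true class is given probability zero, which is why probability estimates used with it are usually kept away from zero.
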

Counting the Cost
• What is the cost of making a wrong decision?
• What is the cost of missing a threat versus the cost of a false positive?
• Confusion matrix: true positives (TP; actual = predicted = yes) and true negatives (TN; actual = predicted = no) are correct classifications; false positives (FP; actual = no, predicted = yes) and false negatives (FN; actual = yes, predicted = no) are errors.
• True positive rate = TP/(TP+FN): of all actual "yes" instances, the fraction correctly predicted as "yes"
• False positive rate = FP/(FP+TN): of all actual "no" instances, the fraction incorrectly predicted as "yes"
• Overall success rate = (TP+TN)/(TP+TN+FP+FN)
• Error rate = 1.0 - success rate
• Multiclass prediction: use a confusion matrix with c rows and c columns. In Table 5.4(a) there are 100 instances of class a, 60 of b, and 40 of c, 200 in total. Of these, 88 + 40 + 12 = 140 were correctly predicted, a success rate of 70%. The predictor predicted 120 instances as class a, 60 as b, and 20 as c.
• The question is whether this is an intelligent prediction or one that could have arisen by chance. A random predictor that assigns classes in the same proportions as the learned classifier's predictions (120:60:20, i.e. 6:3:1) gives the results in Table 5.4(b). It gets 82 instances correct, as opposed to 140 for the learning technique. The Kappa statistic quantifies this improvement.
• Kappa statistic: 140 - 82 = 58 extra successes out of a possible 200 - 82 = 118, or 49.2%. The maximum value of Kappa is 100%. Like the plain success rate, Kappa does not take costs into account.
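
The Kappa calculation above can be reproduced from a confusion matrix. In the sketch below, the diagonal entries (88, 40, 12) and the row and column totals come from the text; the off-diagonal counts are illustrative values chosen only to match those totals, and the function name is hypothetical.

```python
def kappa(matrix):
    """Kappa statistic for a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix)))
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Correct predictions expected from a random predictor that keeps
    # the same number of predictions per class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total
    return (observed - expected) / (total - expected)

# Off-diagonal counts are assumptions consistent with the quoted totals.
confusion = [
    [88, 10, 2],    # actual a (100 instances)
    [14, 40, 6],    # actual b (60 instances)
    [18, 10, 12],   # actual c (40 instances)
]
print(f"{kappa(confusion):.3f}")
```

A perfect predictor gives Kappa = 1 (100%), and a predictor no better than the random baseline gives Kappa = 0.
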

Cost-sensitive classification
• Account for the benefits of TP and TN and the costs of FP and FN.
• Sometimes the cost of the learning technique itself may also be taken into account.
• Suppose a predictor assigns a class a instance to classes a, b, and c with probabilities pa, pb, and pc. With the default costs (0 for a correct prediction, 1 for any error), the expected cost is pb + pc, or 1 - pa.
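
One way to read the expected-cost idea is as a weighted sum over a cost matrix: for a candidate prediction, each possible true class contributes its probability times the cost of that mistake. The sketch below is an assumption-laden illustration; the function names, the probabilities, and both cost matrices are invented for the example.

```python
def expected_cost(probs, costs, predicted):
    """Expected cost of predicting class `predicted`, where
    costs[actual][predicted] is the cost of that outcome and
    probs[actual] is the estimated probability of each true class."""
    return sum(p * costs[actual][predicted]
               for actual, p in enumerate(probs))

def min_cost_class(probs, costs):
    """Class index with the lowest expected cost."""
    return min(range(len(probs)),
               key=lambda c: expected_cost(probs, costs, c))

probs = [0.6, 0.3, 0.1]                      # hypothetical pa, pb, pc
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]] # default costs
print(expected_cost(probs, zero_one, 0))     # pb + pc = 1 - pa

# Hypothetical costs where missing class c is ten times as expensive.
skewed = [[0, 1, 1], [1, 0, 1], [10, 10, 0]]
print(min_cost_class(probs, skewed))
```

With the default costs the most probable class always wins; with skewed costs a less probable class can become the cheaper prediction.
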

Cost-sensitive Learning
• Take cost into consideration at training time.
• Generate training data with a different proportion of yes and no instances.
• For example, if false positives are penalized 10 times as heavily as false negatives, we want to avoid errors on the no instances, so we could make the no instances 10 times as numerous as the yes instances in the training set.
• One way to vary the proportion of training instances is to duplicate instances in the training dataset.
• Another is to assign weights to instances and build cost-sensitive trees.

Lift Charts
• Lift factor: the increase in response rate due to selecting a different group. (If one group yields a response rate of 1% and another 5%, the lift factor for the second group is 5.)
• Table 5.6: 150 instances, of which 50 are actually yes and 100 are actually no, so the overall success rate is 33%. The 150 instances are sorted by the probability of yes predicted by the learning scheme: 0.95 for instance 1, 0.93 for the next, and so on. When the actual class is no and the technique predicts yes, the instance is a false positive.
• If we were to choose only 10 samples, we would take the top 10. Of these, 2 are actually negative, so the success rate is 80%. Compared with the overall success rate of 33%, this is a lift factor of 80/33, or 2.4. In general, lift = (TP_s/N_s)/(TP_t/N_t), where s refers to the selected sample and t to the total test set.
• Lift chart (Figure 5.1): the horizontal axis is the sample size as a percentage of the total test data; the vertical axis is the number of respondents. The diagonal shows the expected number of respondents in a random sample; the upper curve shows a more intelligent choice of samples.
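
The lift calculation for the top of a ranked list can be sketched as below. The function name is hypothetical, and the ranked list is synthetic: it only reproduces the counts quoted above (150 instances, 50 actual positives, 8 positives among the top 10), not the real ordering in Table 5.6.

```python
def lift_factor(ranked_actuals, sample_size):
    """Lift of the top `sample_size` instances.

    ranked_actuals: booleans (True = actual yes), already sorted by
    descending predicted probability.
    """
    sample = ranked_actuals[:sample_size]
    sample_rate = sum(sample) / len(sample)
    overall_rate = sum(ranked_actuals) / len(ranked_actuals)
    return sample_rate / overall_rate

# Synthetic ranking matching the counts in the text (order below the
# top 10 is an assumption and does not affect this calculation).
ranked = [True] * 8 + [False] * 2 + [True] * 42 + [False] * 98
print(round(lift_factor(ranked, 10), 2))
```

Evaluating the same function for increasing sample sizes gives the points needed to draw a lift chart.
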

ROC Curves
• The idea (as in lift charts) is to choose samples with a high proportion of positives.
• ROC curves depict the performance of a classifier without regard to class distribution or error costs.
• ROC stands for receiver operating characteristic: how a signal receiver responds to signal plus noise.
• An ROC curve plots the percentage of positives in the sample, relative to all positives in the test data (true positive rate), against the percentage of negatives in the sample, relative to all negatives in the test data (false positive rate).
• Generating ROC curves from cross-validation: (i) collect the predicted probabilities for the instances in all the test sets (10 sets in a 10-fold cross-validation), along with the true class label of each instance; (ii) generate a single ranked list from this data; (iii) build the ROC curve.
• Figure 5.3: ROC curves for two learning methods. When do we choose A and when do we choose B?
• By combining the two techniques with a weight factor, we can get the best of both: the top of the convex hull. For example, if classifier A predicts an instance to be positive with probability 0.95 and classifier B with probability 0.72, then weights of 0.7 for A and 0.3 for B give a combined probability of 0.7*0.95 + 0.3*0.72 = 0.881.
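
Building the curve from a single ranked list can be sketched as follows. The function name and the scored instances are hypothetical; the sketch assumes both classes are present and does not treat tied probabilities specially.

```python
def roc_points(scored):
    """ROC points (false positive rate, true positive rate).

    scored: list of (predicted_probability, is_positive) pairs,
    for example pooled from all cross-validation test folds.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, is_positive in ranked if is_positive)
    negatives = len(ranked) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1          # step up
        else:
            fp += 1          # step right
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical scored test instances.
scored = [(0.95, True), (0.93, True), (0.90, False),
          (0.80, True), (0.70, False)]
for fpr, tpr in roc_points(scored):
    print(f"{fpr:.2f} {tpr:.2f}")
```

Each positive moves the curve up and each negative moves it right, so a good ranking climbs steeply before moving across.
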

Recall-precision curves
• Example: A1 locates 100 documents, of which 20 are relevant; A2 locates 400 documents, of which 80 are relevant. Which one is better? It depends on the relative cost of false positives and false negatives.
• Recall = number of relevant documents retrieved / total number of relevant documents
• Example: if there are 100 relevant documents in total, A1 has recall 0.2 and A2 has recall 0.8.
• Precision = number of relevant documents retrieved / total number of documents retrieved
• Example: A1 precision = 20/100 = 0.2; A2 precision = 80/400 = 0.2.
• Summary: Table 5.7, page 172
• Ultimate objective: choose a set of instances with a high proportion of yes instances and a high coverage of yes instances, using as few samples as possible.
• 3-point average recall: the average precision obtained at recall values of 20%, 50%, and 80%. In the example (see Excel sheet), 3-point average recall = (12+30+48)/3 = 30%.
• 11-point average recall = (4+6+12+18+24+30+36+42+48+54+60)/11 = 30.36%
• F-measure = 2*recall*precision/(recall+precision) = 2TP/(2TP+FP+FN). In the Excel sheet example, TP = 15 and FP = 5; taking FN = 4, F-measure = 30/(30+5+4) = 77%.
• Success rate = (TP+TN)/(TP+FP+TN+FN)
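
The three measures can be computed directly from the counts. The sketch below uses a hypothetical function name and plugs in the numbers quoted above (TP = 15, FP = 5, FN = 4, and the A1/A2 retrieval example with 100 relevant documents in total).

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Worked example from the text: TP = 15, FP = 5, FN = 4.
p, r, f = precision_recall_f(15, 5, 4)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")

# A1 retrieves 100 documents (20 relevant); A2 retrieves 400 (80 relevant).
print(precision_recall_f(20, 80, 80)[:2])
print(precision_recall_f(80, 320, 20)[:2])
```

The two retrieval systems have the same precision but very different recall, which is why a single number such as the F-measure is often reported alongside both.
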

Cost curves
• ROC curves and related measures are useful for exploring the tradeoffs among different classifiers over a range of costs.
• But they are not ideal for evaluating models in situations with known error costs (the cost of false negatives and the cost of false positives).
• Cost curves are suitable for this purpose: a single classifier corresponds to a straight line that shows how its performance varies as the class distribution changes. They work best with two classes.
• Fig. 5.4a: expected error plotted against the probability of one of the classes (+ or -).
• If p(+) < 0.2, always predicting - is better than method A. If p(+) > 0.65, always predicting + is better than A.
• Fig. 5.4b: taking costs into consideration.
• Probability cost function: pc[+] = (p[+]*C[+|-]) / (p[+]*C[+|-] + p[-]*C[-|+])
• Normalized expected cost = fn*pc[+] + fp*(1 - pc[+]), where fp is the false positive rate and fn is the false negative rate.
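
For the equal-cost case (expected error against p(+)), a classifier's line can be compared with the two trivial strategies. This sketch assumes equal misclassification costs, so the probability cost function reduces to p(+); the function names and the error rates (fn = 0.4, fp = 0.1) are hypothetical and are not those of method A in the figure.

```python
def expected_error(p_pos, fn_rate, fp_rate):
    """Expected error of a classifier when a fraction p_pos of the
    instances is positive (equal costs assumed)."""
    return fn_rate * p_pos + fp_rate * (1 - p_pos)

def best_strategy(p_pos, fn_rate, fp_rate):
    """Compare the classifier with the two trivial predictors."""
    options = {
        "classifier": expected_error(p_pos, fn_rate, fp_rate),
        "always -": p_pos,        # every + instance is misclassified
        "always +": 1 - p_pos,    # every - instance is misclassified
    }
    return min(options, key=options.get)

# Hypothetical classifier with fn = 0.4 and fp = 0.1.
for p_pos in (0.1, 0.5, 0.9):
    print(p_pos, best_strategy(p_pos, 0.4, 0.1))
```

The classifier is only worth using in the range of p(+) where its line lies below both trivial lines, which is the kind of range the figure shows for method A.
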

Evaluating Numeric Prediction
• Applies to numeric prediction (not nominal values)
• Metrics (Table 5.8):
• Mean squared error (MSE)
• Root mean squared error (RMSE)
• Mean absolute error
• Relative squared error: the squared error relative to what it would have been if a simple predictor had been used, such as the average value of the training data
• Relative absolute error
• Correlation coefficient: the statistical correlation between actual values and predicted values
• See Table 5.8 and Table 5.9.
• Choose the classifier that gives the best results in terms of the chosen metric.
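
The metrics listed above can be computed side by side. In this sketch the function names, the dictionary keys, and the sample values are assumptions; `baseline` is the constant prediction of the simple predictor (for example, the mean of the training values), passed in by the caller.

```python
import math

def numeric_metrics(actual, predicted, baseline):
    """Error measures for numeric prediction.
    `baseline` is the value a simple constant predictor would output."""
    n = len(actual)
    sq = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    ab = sum(abs(p - a) for p, a in zip(predicted, actual))
    base_sq = sum((a - baseline) ** 2 for a in actual)
    base_ab = sum(abs(a - baseline) for a in actual)
    return {
        "mse": sq / n,
        "rmse": math.sqrt(sq / n),
        "mae": ab / n,
        "relative_squared": sq / base_sq,
        "relative_absolute": ab / base_ab,
    }

def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

actual = [1.0, 2.0, 3.0, 4.0]        # hypothetical target values
predicted = [2.0, 2.0, 2.0, 4.0]     # hypothetical predictions
print(numeric_metrics(actual, predicted, baseline=2.5))
print(correlation(actual, predicted))
```

Relative errors below 1 mean the model beats the simple constant predictor; values above 1 mean it does worse.
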