
Data Mining CSCI 307, Spring 2019 Lecture 26

Learn how to evaluate results in data mining through confidence intervals, error rates, sensitivity, specificity, and parameter tuning. Understand how to predict performance and calculate confidence intervals for true success rates.



Presentation Transcript


  1. Data Mining CSCI 307, Spring 2019, Lecture 26: Evaluating the Results, Confidence Intervals

  2. 5.1 Training and Testing • Natural performance measure for classification problems: Error Rate • Success: instance’s class is predicted correctly • Error: instance’s class is predicted incorrectly • Error Rate: proportion of errors made over the whole set of instances • Resubstitution error: error rate obtained from using training data to measure performance. • Resubstitution error is (hopelessly) optimistic
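The optimism of resubstitution error can be seen with a small sketch (not from the slides; the 1-NN classifier and toy data are hypothetical): a nearest-neighbour classifier evaluated on its own training data looks perfect, because every instance is its own nearest neighbour.

```python
def nn1_predict(train, x):
    # 1-nearest-neighbour: return the label of the closest training instance.
    return min(train, key=lambda t: abs(t[0] - x))[1]

# Toy one-dimensional training data: (feature, class label)
train = [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]

# Resubstitution: measure performance on the training data itself.
# Every instance matches itself at distance 0, so the error rate is 0,
# regardless of how hard the underlying problem actually is.
resub_errors = sum(nn1_predict(train, x) != y for x, y in train)
resub_error_rate = resub_errors / len(train)
```

This is why an independent test set is needed: the 0% resubstitution error says nothing about performance on unseen instances.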

  3. Training and Testing continued • Test set: independent instances that have played no part in formation of classifier • Assumption: both training data and test data are representative samples of the underlying problem • Test and training data may differ in nature • Example: classifiers built using customer data from two different towns A and B • To estimate performance of classifier from town A in completely new town, test it on data from B

  4. Classifier Evaluation Metrics: Confusion Matrix • Given m classes, an entry CM(i,j) in a confusion matrix indicates the number of instances of class i that were labeled by the classifier as class j • May have extra rows/columns to provide totals
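The CM(i,j) definition above can be sketched directly (a minimal illustration; the function name and toy labels are hypothetical):

```python
def confusion_matrix(y_true, y_pred, classes):
    # cm[i][j] = number of instances of class i that the classifier
    # labeled as class j (rows = actual, columns = predicted).
    idx = {c: k for k, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        cm[idx[t]][idx[p]] += 1
    return cm

cm = confusion_matrix(y_true=["yes", "yes", "no", "no"],
                      y_pred=["yes", "no", "no", "no"],
                      classes=["yes", "no"])
```

Here one "yes" instance is misclassified as "no", so cm[0][1] is 1 while the diagonal holds the correct classifications.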

  5. Example: Confusion Matrix

  6. Accuracy, Error Rate • Classifier Accuracy (aka recognition rate): percentage of test set instances correctly classified (if we use the training set, we risk resubstitution error) • Accuracy = (TP + TN)/All • Error rate (1 − accuracy): Error rate = (FP + FN)/All

  7. Sensitivity and Specificity • Class Imbalance Problem: • Main class of interest is rare • Significant majority of the negative class and minority of the positive class • Sensitivity: True Positive recognition rate, Sensitivity = TP/P • Specificity: True Negative recognition rate, Specificity = TN/N
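The four metrics from the last two slides, computed from the cells of a two-class confusion matrix, show why accuracy alone misleads under class imbalance (a sketch; the cell counts below are hypothetical):

```python
def binary_metrics(tp, fn, fp, tn):
    # P = all actual positives, N = all actual negatives.
    p, n = tp + fn, fp + tn
    return {
        "accuracy": (tp + tn) / (p + n),      # (TP + TN)/All
        "error_rate": (fp + fn) / (p + n),    # (FP + FN)/All
        "sensitivity": tp / p,                # TP/P
        "specificity": tn / n,                # TN/N
    }

# Imbalanced data: only 300 positives among 10,000 instances.
m = binary_metrics(tp=90, fn=210, fp=140, tn=9560)
```

Here accuracy is 96.5% even though the classifier finds only 30% of the rare positive class, which is exactly the situation sensitivity is designed to expose.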

  8. Cancer Example • Sensitivity = ___ • Specificity = ___ • Accuracy = ___

  9. Parameter Tuning • It is important that the test data is not used in any way to create the classifier • Some learning schemes operate in two stages: • Stage 1: build the basic structure • Stage 2: optimize parameter settings • Test data cannot be used for parameter tuning. Must use three sets: training data (for stage 1), validation data (for stage 2), and test data
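The three-set discipline above can be sketched as a simple shuffled split (an illustration only; the fractions and function name are hypothetical, not prescribed by the slides):

```python
import random

def three_way_split(data, train_frac=0.6, valid_frac=0.2, seed=0):
    # Shuffle once, then carve off training data (stage 1: build the
    # structure), validation data (stage 2: tune parameters), and a
    # test set that is touched only for the final evaluation.
    data = list(data)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_frac)
    n_valid = int(len(data) * valid_frac)
    return (data[:n_train],
            data[n_train:n_train + n_valid],
            data[n_train + n_valid:])

train, valid, test = three_way_split(range(10))
```

The key property is that the three sets are disjoint and together cover the data, so no test instance leaks into either model building or tuning.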

  10. Making the Most of the Data • Often, once evaluation is complete, all the data can be used to build the final classifier • Generally, the larger the training data the better the classifier • The larger the test data the more accurate the error estimate • Holdout procedure: method of splitting original data into training and test set • Ideally both training set and test set should be large (and representative)

  11. 5.2 Predicting Performance • Assume the estimated success rate is 75% (the error rate is 25%). How close is this to the true success rate? • Depends on the amount of test data • Prediction is like tossing a (biased) coin • “Head” is a “success”, “Tail” is an “error” • In statistics, a succession of independent events like this is called a Bernoulli process • Statistical theory provides us with confidence intervals for the true underlying proportion

  12. Confidence Intervals • We can say that p (the true success rate) lies within a certain specified interval with a certain specified confidence • Example: S = 750 successes in N = 1000 trials. Estimated success rate (f = S/N): 75%. How close is this to the true success rate p? Answer: with 80% confidence, p is in [73.2%, 76.7%] • Another example: S = 75 and N = 100. Again, the estimated success rate is 75%. How close is this to the true success rate p? Answer: with 80% confidence, p is in [69.1%, 80.1%]

  13. Mean and Variance • Given a single Bernoulli trial with success rate p: the mean is p, the variance is p(1−p) (variance = (Std. Dev.)²) • Given N trials: • Expected success rate f = S/N • The mean for f: p • The variance for f: p(1−p)/N • Statisticians tell us that for large enough N, f follows a Normal distribution
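The claims about the mean and variance of f = S/N can be checked empirically with a short simulation (an illustration, not from the slides; the run counts are arbitrary):

```python
import random

def success_rates(p, n, runs, seed=1):
    # Repeat a Bernoulli process of n trials many times and record
    # the observed success rate f = S/n from each run.
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(runs)]

rates = success_rates(p=0.75, n=100, runs=2000)
mean = sum(rates) / len(rates)
var = sum((r - mean) ** 2 for r in rates) / len(rates)
# Theory predicts: mean of f ~ p = 0.75,
# variance of f ~ p(1-p)/N = 0.75 * 0.25 / 100 = 0.001875
```

As N grows, the variance p(1−p)/N shrinks, which is exactly why larger test sets give tighter confidence intervals later in the lecture.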

  14. Mean and Variance • The probability that a random variable X with 0 mean lies within the confidence range −z ≤ X ≤ z is Pr[−z ≤ X ≤ z] = c, where z is given in terms of standard deviations from the mean. • For normal distributions, the values of c and the corresponding values of z are given in lookup tables. • The distribution is symmetric, so the tables only give the probability for one side; hence we need to multiply by 2. • The tables give the probability that X is outside the range (so we need to subtract from 1): Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]

  15. Sidebar.....

  16. Confidence Limits • Confidence limits for the normal distribution with 0 mean and a variance of 1: • 1 − 2 × Pr[X ≥ 1.65] = Pr[−1.65 ≤ X ≤ 1.65] = 90% • Pr[−3.09 ≤ X ≤ 3.09] = 99.8% • To use this we have to reduce our random variable f to have 0 mean and unit variance
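The tabulated confidence limits can be reproduced with the standard normal CDF, using the identity Φ(z) = (1 + erf(z/√2))/2 (a sketch to check the table values, not part of the slides):

```python
from math import erf, sqrt

def two_sided_confidence(z):
    # Pr[X >= z] for a standard normal is 1 - Phi(z);
    # the two-sided confidence is c = 1 - 2 * Pr[X >= z].
    upper_tail = 0.5 * (1 - erf(z / sqrt(2)))
    return 1 - 2 * upper_tail

c_128 = two_sided_confidence(1.28)   # ~ 80%
c_165 = two_sided_confidence(1.65)   # ~ 90%
c_309 = two_sided_confidence(3.09)   # ~ 99.8%
```

This recovers the z = 1.28 value used for the 80% intervals in the examples that follow.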

  17. Transforming f • Transformed value for f: (f − p) / √(p(1−p)/N) (i.e. subtract the mean and divide by the standard deviation) • Resulting equation: Pr[−z ≤ (f − p)/√(p(1−p)/N) ≤ z] = c • Solving for p: p = ( f + z²/2N ± z·√(f/N − f²/N + z²/4N²) ) / (1 + z²/N)
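The solved-for-p expression on this slide can be written as a small function and checked against the worked numbers in the examples (a sketch; the function name is ours):

```python
from math import sqrt

def confidence_interval(f, n, z):
    # p = (f + z^2/2N +/- z*sqrt(f/N - f^2/N + z^2/4N^2)) / (1 + z^2/N)
    center = f + z * z / (2 * n)
    half = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - half) / denom, (center + half) / denom

lo, hi = confidence_interval(f=0.75, n=1000, z=1.28)
```

With f = 75%, N = 1000 and z = 1.28 this reproduces the interval p in roughly [0.732, 0.767] derived by hand on the following slides.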

  18. Back to the Original Example • Example: S = 750 successes in N = 1000 trials. Estimated success rate: 75%. How close is this to the true success rate p? How do we get that, with 80% confidence, p is in [73.2%, 76.7%]? • f = 75%, N = 1000, c = 80%. We want the z such that 80% of the distribution lies within [−z, z], leaving 10% in each tail, so z = 1.28: 1 − 2 × Pr[X ≥ 1.28] = 1 − 2 × 0.10 = 1 − 0.20 = 0.80 = 80%

  19. Example • f = 75%, N = 1000, c = 80% (so that z = 1.28) • f² = 0.5625, z² = 1.6384 • p = (0.75 + 0.0008192 ± 1.28 × √(0.00075 − 0.0005625 + 0.0000004)) / 1.0016384 • p = (0.7508192 + 0.0175457) / 1.0016384 = 0.7683649 / 1.0016384 = 0.767108, OR p = (0.7508192 − 0.0175457) / 1.0016384 = 0.7332735 / 1.0016384 = 0.732074 • p ∈ [0.732, 0.767] means that we are 80% confident that our true success rate, p, is really between 73.2% and 76.7%

  20. Example • f = 75%, N = 100, c = 80% (so that z = 1.28) • f² = 0.5625, z² = 1.6384 • p = (0.75 + 0.008192 ± 1.28 × √(0.0075 − 0.005625 + 0.0000410)) / 1.016384 • p = (0.758192 + 0.0560267) / 1.016384 = 0.8142187 / 1.016384 = 0.8010935, OR p = (0.758192 − 0.0560267) / 1.016384 = 0.7021653 / 1.016384 = 0.6908464 • p ∈ [0.691, 0.801] means that we are 80% confident that our true success rate, p, is really between 69.1% and 80.1%

  21. Example Summary • f = 75%, N = 1000, c = 80% (so that z = 1.28): p ∈ [0.732, 0.767] • f = 75%, N = 100, c = 80% (so that z = 1.28): p ∈ [0.691, 0.801] • Note that the normal distribution assumption is only valid for large N (i.e. N > 100), so this last one is of questionable validity. • f = 75%, N = 10, c = 80% (so that z = 1.28): p ∈ [0.549, 0.881] • So for the same confidence level, the interval for the true success rate gets wider as the amount of data gets smaller
