Evaluation (practice)
Predicting performance
• Assume the estimated error rate is 25%. How close is this to the true error rate?
• Depends on the amount of test data
• Prediction is just like tossing a (biased!) coin
• “Head” is a “success”, “tail” is an “error”
• In statistics, a succession of independent events like this is called a Bernoulli process
• Statistical theory provides us with confidence intervals for the true underlying proportion
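A quick way to see how the amount of test data matters is to simulate the coin-tossing view. This is a minimal sketch (NumPy assumed; the 25% true error rate and the test-set sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_error = 0.25                       # the (unknown) true error rate

for n_test in (100, 1000, 10000):       # different amounts of test data
    # each test instance is one Bernoulli trial: 1 = error ("tail"), 0 = success ("head")
    errors = rng.random(n_test) < true_error
    print(f"N = {n_test:5d}   estimated error rate = {errors.mean():.3f}")
```

The estimate typically fluctuates noticeably for small test sets and settles close to 25% as N grows.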
Confidence intervals
• We can say: p lies within a certain specified interval with a certain specified confidence
• Example: S=750 successes in N=1000 trials
• Estimated success rate: 75%
• How close is this to the true success rate p?
• Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
• Another example: S=75 and N=100
• Estimated success rate: 75%
• With 80% confidence, p ∈ [69.1%, 80.1%]
• I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8
• The bigger N is, the more confident we are, i.e. the surrounding interval is smaller
• Above, for N=100 we were less confident than for N=1000
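The intervals quoted above are consistent with the Wilson score interval for a proportion. Here is a minimal sketch (the function name is mine; SciPy is assumed for the normal quantile):

```python
from math import sqrt
from scipy.stats import norm

def wilson_interval(successes, trials, confidence=0.80):
    """Wilson score confidence interval for the true success rate p."""
    f = successes / trials                       # observed success rate
    z = norm.ppf(1 - (1 - confidence) / 2)       # ~1.28 for 80% confidence
    centre = f + z**2 / (2 * trials)
    half = z * sqrt(f * (1 - f) / trials + z**2 / (4 * trials**2))
    denom = 1 + z**2 / trials
    return (centre - half) / denom, (centre + half) / denom

print(wilson_interval(750, 1000))   # ~(0.732, 0.767)
print(wilson_interval(75, 100))     # ~(0.691, 0.801)
```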
Mean and Variance
• Let Y be the random variable with possible values 1 for success and 0 for error.
• Let the probability of success be p.
• Then the probability of error is q = 1 − p.
• What’s the mean? 1·p + 0·q = p
• What’s the variance? (1 − p)²·p + (0 − p)²·q = q²·p + p²·q = pq(q + p) = pq
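A quick numerical check of both facts by simulation (the value of p and the sample size are illustrative):

```python
import numpy as np

p = 0.75
rng = np.random.default_rng(1)
y = (rng.random(1_000_000) < p).astype(float)   # samples of Y: 1 = success, 0 = error

print(y.mean())   # close to p = 0.75
print(y.var())    # close to p*q = 0.75 * 0.25 = 0.1875
```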
Estimating p
• Well, we don’t know p. Our goal is to estimate p.
• For this we make N trials, i.e. tests.
• The more trials we do, the more confident we are.
• Let S be the random variable denoting the number of successes, i.e. S is the sum of N sampled values of Y.
• Now, we approximate p with the success rate in the N trials, i.e. S/N.
• By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with
• mean p and
• variance pq/N.
Estimating p
• A c% confidence interval [−z ≤ X ≤ z] for a random variable X with mean 0 is given by: Pr[−z ≤ X ≤ z] = c
• With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
• From the table of confidence limits for the normal distribution with mean 0 and variance 1: Pr[−1.65 ≤ X ≤ 1.65] = 90%
• To use this we have to transform our random variable f = S/N so that it has mean 0 and unit variance
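The value 1.65 comes from the standard normal table; it can also be computed directly (SciPy assumed):

```python
from scipy.stats import norm

c = 0.90                          # desired confidence
z = norm.ppf(1 - (1 - c) / 2)     # two-tailed: Pr[-z <= X <= z] = c
print(z)                          # ~1.645, the 1.65 used above
```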
Estimating p
• Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
• To use this we have to transform our random variable S/N to have mean 0 and unit variance:
Pr[−1.65 ≤ (S/N − p) / σ_{S/N} ≤ 1.65] = 90%
• Now we solve two equations:
(S/N − p) / σ_{S/N} = 1.65
(S/N − p) / σ_{S/N} = −1.65
Estimating p
• Let N = 100 and S = 70
• σ_{S/N} is sqrt(pq/N), and we approximate it by sqrt(p'(1−p')/N), where p' is the estimate of p, i.e. 0.7
• So σ_{S/N} is approximated by sqrt(.7·.3/100) = .046
• The two equations become:
(0.7 − p) / .046 = 1.65, so p = .7 − 1.65·.046 = .624
(0.7 − p) / .046 = −1.65, so p = .7 + 1.65·.046 = .776
• Thus, we say: with 90% confidence, the success rate p of the classifier satisfies 0.624 ≤ p ≤ 0.776
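The same computation as a small sketch of the normal-approximation interval (the helper name is mine):

```python
from math import sqrt

def normal_approx_ci(successes, trials, z):
    """Confidence interval for p using the approximation sigma ~ sqrt(p'(1-p')/N)."""
    p_hat = successes / trials
    half = z * sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - half, p_hat + half

print(normal_approx_ci(70, 100, z=1.65))   # ~(0.624, 0.776), as derived above
```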
Exercise
• Suppose I want to be 95% confident in my estimation.
• Looking at a detailed table we find: Pr[−2 ≤ X ≤ 2] ≈ 95%
• Normalizing S/N, we need to solve:
(S/N − p) / σ_f = 2
(S/N − p) / σ_f = −2
• We approximate σ_f with sqrt(p'(1−p')/N), where p' is the estimate of p through trials, i.e. S/N
• So we need to solve:
(p' − p) / sqrt(p'(1−p')/N) = ±2
• So, with 95% confidence:
p' − 2·sqrt(p'(1−p')/N) ≤ p ≤ p' + 2·sqrt(p'(1−p')/N)
Exercise
• Suppose N = 1000 trials, S = 590 successes
• p' = S/N = 590/1000 = .59
• sqrt(p'(1−p')/N) = sqrt(.59·.41/1000) ≈ .0156
• So, with 95% confidence: .559 ≤ p ≤ .621
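Plugging the exercise’s numbers into the same normal approximation (z rounded to 2, as on the slide):

```python
from math import sqrt

p_hat, N, z = 0.59, 1000, 2.0                            # exercise values; z ~ 2 for 95%
half = z * sqrt(p_hat * (1 - p_hat) / N)
print(round(p_hat - half, 3), round(p_hat + half, 3))    # ~0.559 0.621
```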
Cross-validation
• k-fold cross-validation:
• First step: split the data into k subsets of equal size
• Second step: use each subset in turn for testing, and the remainder for training
• The error estimates are averaged to yield an overall error estimate
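A minimal scikit-learn sketch of the procedure (the dataset and classifier are placeholders; scikit-learn is assumed to be available):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation: each fold is used once for testing,
# the remaining 9 folds for training; the 10 scores are then averaged.
scores = cross_val_score(clf, X, y, cv=10)
print("accuracy:", scores.mean(), " error estimate:", 1 - scores.mean())
```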
Comparing data mining schemes
• Frequent question: which of two learning schemes performs better?
• Obvious way: compare, for example, 10-fold cross-validation estimates
• Problem: variance in the estimates
• We don’t know whether the results are reliable
• We need to use a statistical test for that
Paired t-test
• Student’s t-test tells us whether the means of two samples are significantly different.
• In our case the samples are cross-validation estimates for different datasets from the domain
• Use a paired t-test because the individual samples are paired
• The same cross-validation is applied to both schemes
Distribution of the means
• x1, x2, … xk and y1, y2, … yk are the 2k samples for the k different datasets
• mx and my are the means
• With enough samples, the mean of a set of independent samples is normally distributed
• The estimated variances of the means are sx²/k and sy²/k
• If μx and μy are the true means, then the following are approximately normally distributed with mean 0 and variance 1:
(mx − μx) / sqrt(sx²/k) and (my − μy) / sqrt(sy²/k)
Student’s distribution
• With small samples (k < 30) the mean follows Student’s distribution with k−1 degrees of freedom
• Similar shape, but wider than the normal distribution
• Confidence limits (mean 0 and variance 1) are therefore wider than the normal ones; they are read from the Student’s t table for k−1 degrees of freedom
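The widening can be seen by comparing critical values of the two distributions (SciPy assumed; k = 10 is illustrative):

```python
from scipy.stats import norm, t

k = 10                                       # number of samples
for conf in (0.90, 0.95, 0.99):
    upper = 1 - (1 - conf) / 2
    print(conf, round(norm.ppf(upper), 3), round(t.ppf(upper, df=k - 1), 3))
# e.g. at 90%: 1.645 for the normal vs 1.833 for Student's t with 9 degrees of freedom
```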
Distribution of the differences
• Let md = mx − my
• The difference of the means (md) also has a Student’s distribution with k−1 degrees of freedom
• Let sd² be the estimated variance of the differences
• The standardized version of md is called the t-statistic:
t = md / sqrt(sd²/k)
Performing the test
• Fix a significance level α
• If a difference is significant at the α% level, there is a (100 − α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
• Look up the value of z that corresponds to α/2
• If t ≤ −z or t ≥ z, then the difference is significant
• I.e. the null hypothesis (that the difference is zero) can be rejected
Example
• We have compared two classifiers through cross-validation on 10 different datasets.
• The error rates are:

Dataset   Classifier A   Classifier B   Difference
   1          10.6           10.2           .4
   2           9.8            9.4           .4
   3          12.3           11.8           .5
   4           9.7            9.1           .6
   5           8.8            8.3           .5
   6          10.6           10.2           .4
   7           9.8            9.4           .4
   8          12.3           11.8           .5
   9           9.7            9.1           .6
  10           8.8            8.3           .5
Example
• md = 0.48
• sd = 0.0789
• t = 0.48 / (0.0789 / sqrt(10)) = 19.24
• The critical value of t for a two-tailed test with α = 10% and 9 degrees of freedom is 1.83
• 19.24 is way bigger than 1.83, so the difference is significant: classifier B is much better than A.
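The same test can be reproduced with SciPy’s paired t-test on the table above (a minimal sketch):

```python
from scipy import stats

a = [10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8]   # error rates of classifier A
b = [10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3]   # error rates of classifier B

t_stat, p_value = stats.ttest_rel(a, b)    # paired t-test over the 10 datasets
print(t_stat)    # ~19.24, far beyond the critical value of 1.83
print(p_value)   # far below 0.10, so the difference is significant
```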
Dependent estimates
• We assumed that we have enough data to create several datasets of the desired size
• We need to reuse data if that’s not the case
• E.g. running cross-validations with different randomizations on the same data
• Samples become dependent, and then insignificant differences can become significant
• A heuristic test is the corrected resampled t-test:
• Assume we use the repeated holdout method, with n1 instances for training and n2 for testing
• The new test statistic is:
t = md / sqrt((1/k + n2/n1) · sd²)
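A minimal sketch of the corrected statistic, assuming the formula above (the differences and split sizes below are illustrative):

```python
from math import sqrt
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic for k repeated holdout differences."""
    k = len(diffs)
    m_d = mean(diffs)
    s2_d = variance(diffs)   # sample variance of the differences
    return m_d / sqrt((1 / k + n_test / n_train) * s2_d)

# e.g. 10 repetitions of a 2:1 holdout split, reusing the differences from the example
diffs = [0.4, 0.4, 0.5, 0.6, 0.5, 0.4, 0.4, 0.5, 0.6, 0.5]
print(corrected_resampled_t(diffs, n_train=100, n_test=50))   # ~7.9, much less than 19.24
```

The correction shrinks the statistic relative to the ordinary paired t-test, reflecting the dependence between the repeated runs.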