Error estimation
Data Mining II, Year 2009-10
Lluís Belanche, Alfredo Vellido
Error estimation
• Introduction
• Resampling methods:
  • The Holdout
  • Cross-validation
  • Random subsampling
  • k-fold Cross-validation
  • Leave-one-out
  • The Bootstrap
• Error evaluation
  • Accuracy and all that
Summary (data sample of size n)
• Resubstitution:
  • optimistically biased estimate
  • especially when the ratio of n to the dimension is small
• Holdout (if iterated, we get Random subsampling):
  • pessimistically biased estimate
  • different partitions yield different estimates
• K-fold CV (K « n):
  • higher bias than LOOCV; lower than holdout
  • lower variance than LOOCV
• LOOCV (n-fold CV): unbiased, but large variance
• Bootstrap:
  • lower variance than LOOCV
  • useful for very small n
• Computational burden increases down this list
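As a rough illustration of the estimators just listed, the sketch below uses scikit-learn; the dataset (load_breast_cancer) and the logistic-regression pipeline are placeholders chosen only for the example, not part of the course material, and the bootstrap shown is the plain out-of-bag variant.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                    # placeholder data
clf = make_pipeline(StandardScaler(), LogisticRegression())   # placeholder model
n = len(y)

# Resubstitution: train and test on the same data (optimistically biased)
resub_err = 1 - clf.fit(X, y).score(X, y)

# Holdout: one train/test split (pessimistically biased, split-dependent)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_err = 1 - clf.fit(X_tr, y_tr).score(X_te, y_te)

# K-fold cross-validation, K << n
kfold_err = 1 - cross_val_score(
    clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0)).mean()

# Leave-one-out (n-fold CV): unbiased but high variance and costly
loo_err = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

# Plain out-of-bag bootstrap: train on a resample, test on the left-out points
rng = np.random.default_rng(0)
boot_errs = []
for _ in range(100):
    idx = rng.integers(0, n, size=n)        # sample n points with replacement
    oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag test points
    boot_errs.append(1 - clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))
boot_err = float(np.mean(boot_errs))

print(resub_err, holdout_err, kfold_err, loo_err, boot_err)
```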
Error Evaluation
• Given:
  • A hypothesis h: X → C, in hypothesis space H, mapping features x to one of a number of classes
  • A data sample S of size n
• Questions:
  • What is the error of h on unseen data?
  • If we have two competing hypotheses, which one will be better on unseen data?
  • How do we compare two learning algorithms in the face of limited data?
  • How certain are we about the answers to these questions?
Apparent & True Error
We can define two errors:
1) error(h|S) is the apparent error, measured on the sample S:
   error(h|S) = (1/n) Σ_{x∈S} I[h(x) ≠ f(x)]
2) error(h|P) is the true error on data sampled from the distribution P(x):
   error(h|P) = Σ_x P(x) · I[h(x) ≠ f(x)]   (an integral for continuous x)
where f(x) is the true hypothesis and I[·] is the indicator function.
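A tiny numerical illustration of the two definitions, with a made-up discrete distribution P(x), true labelling f(x) and hypothesis h(x); none of these numbers come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete domain with known P(x) and true labelling f(x)
xs = np.array([0, 1, 2, 3])
P  = np.array([0.4, 0.3, 0.2, 0.1])   # true distribution P(x)
f  = np.array([0, 0, 1, 1])           # true hypothesis f(x)
h  = np.array([0, 1, 1, 1])           # some learned hypothesis h(x)

# True error: expected disagreement between h and f under P(x)
true_error = np.sum(P * (h != f))

# Apparent error: disagreement rate on a finite sample S drawn from P
S = rng.choice(xs, size=50, p=P)
apparent_error = np.mean(h[S] != f[S])

print(f"error(h|P) = {true_error:.3f}, error(h|S) = {apparent_error:.3f}")
```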
A note on True Error
• True Error need not be zero!
  • Not even if we knew the probabilities P(x)
• Causes:
  • Lack of relevant features
  • Intrinsic randomness of the process
• A consequence is that we should not attempt to fit hypotheses with zero apparent error, i.e. error(h|S) = 0! Quite the contrary, we should favour those hypotheses such that error(h|S) ≈ error(h|P)
  • If error(h|S) >> error(h|P), then h is underfitting the sample S
  • If error(h|S) << error(h|P), then h is overfitting the sample S
How to estimate True Error (te)?
• Split the sample S into a training set TR and a test set TE; estimate te by the empirical error t̂e measured on TE
• Note that t̂e is a random variable → we can give a confidence interval (CI)
• Let TE⁻ be the subset of TE wrongly predicted by h
• Let n = |S|, t = |TE|
• |TE⁻| follows a binomial distribution with t trials and success probability te
• The ML estimate of te is t̂e = |TE⁻| / t
• This estimator is unbiased: E[t̂e] = te, with Var[t̂e] = te(1 − te)/t
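A quick simulation of this estimator; the values te = 0.2 and t = 250 are only illustrative (they match Example 1 below).

```python
import numpy as np

rng = np.random.default_rng(1)
te, t, reps = 0.2, 250, 100_000   # true error, test-set size, replications

# |TE-| ~ Binomial(t, te); the estimator is te_hat = |TE-| / t
errors_on_TE = rng.binomial(n=t, p=te, size=reps)
te_hat = errors_on_TE / t

print("mean of te_hat:", te_hat.mean())   # ~ te          (unbiased)
print("var  of te_hat:", te_hat.var())    # ~ te(1-te)/t
print("theory:        ", te * (1 - te) / t)
```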
Confidence Intervals for te
“With N% confidence, te = error(h|P) is contained in the interval:”
  t̂e − s ≤ te ≤ t̂e + s, where s = zN √( t̂e(1 − t̂e)/t )
In words, te is within zN standard errors of the estimate. This is because, for t̂e(1 − t̂e)·t > 5 or t > 30, it is safe to approximate a Binomial by a Gaussian, for which we can compute “z-values”.
[Figure: a Normal(0,1) density with the central 80% area shaded, illustrating a z-value]
Example 1
• n = |S| = 1,000; t = |TE| = 250 (25% of S)
• Suppose |TE⁻| = 50 (our h hits 80% of TE)
• Then t̂e = 0.2. For a CI at the 95% level:
  • z0.95 = 1.967 and te is in [0.15, 0.25]
• Exercise: recompute the CI at the 99% level, using z0.99 = 2.326
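A minimal sketch of the interval computation, reproducing Example 1's numbers (the function name error_ci is just illustrative).

```python
import math

def error_ci(n_wrong, t, z):
    """Normal-approximation CI for the true error from t test cases."""
    te_hat = n_wrong / t                                # ML estimate |TE-| / t
    s = z * math.sqrt(te_hat * (1 - te_hat) / t)        # z_N standard errors
    return te_hat - s, te_hat + s

# Example 1: t = 250, |TE-| = 50, 95% level (z value as given in the slides)
print(error_ci(50, 250, 1.967))   # ~ (0.15, 0.25)
```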
Example 2: comparing two hypotheses
• Assume we need to compare 2 hypotheses h1 and h2 on the same data
• We have t = |TE| = 100, on which h1 makes 10 errors and h2 makes 13
• The CIs at the 95% (α = 0.05) level are:
  • [0.04, 0.16] for h1
  • [0.06, 0.20] for h2
• Since the intervals overlap, we cannot conclude that h1 is better than h2
• Note: the above is often written 10% ± 6% (h1), 13% ± 7% (h2)
Size does matter after all …
• How large would TE need to be (say T) to affirm that h1 is better than h2?
• Assume both h1 and h2 keep the same accuracy
• Force the upper limit (UL) of the CI for h1 to be below the lower limit (LL) of the CI for h2:
  • UL of the CI for h1 is 0.10 + 1.967 √(0.10 · 0.90 / T)
  • LL of the CI for h2 is 0.13 − 1.967 √(0.13 · 0.87 / T)
• It turns out that T > 1,742 (the old size was 100!)
• The probability that this conclusion fails is at most α (each CI can fail on the relevant side with probability α/2)
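The required size T can also be found numerically by scanning T until the two intervals separate; a minimal sketch reusing the z value and the 10% / 13% error rates from these slides.

```python
import math

z, e1, e2 = 1.967, 0.10, 0.13   # z value and error rates from the slides

def intervals_separate(T):
    ul_h1 = e1 + z * math.sqrt(e1 * (1 - e1) / T)   # upper limit for h1
    ll_h2 = e2 - z * math.sqrt(e2 * (1 - e2) / T)   # lower limit for h2
    return ul_h1 < ll_h2

T = 100
while not intervals_separate(T):
    T += 1
print(T)   # ~1,741: essentially the slide's T > 1,742 (the gap is rounding)
```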
Paired t-test
• Chunk the data set S up into subsets s1, ..., sk with |si| > 30
• Design classifiers h1, h2 on every S \ si
• On each subset si compute the errors and define:
  δi = error(h1 | si) − error(h2 | si)
• Now compute:
  δ̄ = (1/k) Σi δi   and   sδ̄ = √( Σi (δi − δ̄)² / (k(k − 1)) )
• With N% confidence the difference in error between h1 and h2 is:
  δ̄ ± tN,k−1 · sδ̄
• “tN,k−1” is the t-statistic of the Student-t distribution with k − 1 degrees of freedom
• Since error(h1 | si) and error(h2 | si) are both approximately Normal, their difference is approximately Normal
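A minimal sketch of this interval, assuming the per-chunk errors are already available as two lists; the function name is illustrative, and scipy.stats.ttest_rel carries out the same paired test.

```python
import numpy as np
from scipy import stats

def paired_t_interval(err1, err2, confidence=0.95):
    """N% confidence interval for the mean error difference of two
    classifiers evaluated on the same k chunks (paired t-test)."""
    d = np.asarray(err1) - np.asarray(err2)            # delta_i per chunk
    k = len(d)
    d_bar = d.mean()
    s_dbar = np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=k - 1)
    return d_bar - t_crit * s_dbar, d_bar + t_crit * s_dbar
```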
Exercise: the real case …
• A team of doctors has its own classifier and a data sample of size 500
• They split it into a TR of size 300 and a TE of size 200
• They get an error of 22% on TE
• They ask us for further advice …
• We design a second classifier
• It has an error of 15% on the same TE
Exercise: the real case …
• Answer the following questions:
  • Will you affirm that yours is better than theirs?
  • How large would TE need to be to (very reasonably) affirm that yours is better than theirs?
  • What do you deduce from the above?
• Suppose we move to 10-fold CV on the entire data set:
  • Give a new estimate of the error of your classifier
  • Perform a statistical test to check whether there is any real difference (a sketch of this check follows below)
• The doctors’ classifier errors: 0.22, 0.22, 0.29, 0.19, 0.23, 0.22, 0.20, 0.25, 0.19, 0.19
• Your classifier’s errors: 0.15, 0.17, 0.21, 0.14, 0.13, 0.15, 0.14, 0.19, 0.11, 0.11
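One way to carry out that last check (a sketch, not an official solution) is to apply the paired t-test to the ten per-fold errors listed above, here computed directly and then via scipy.stats.ttest_rel.

```python
import numpy as np
from scipy import stats

doctors = np.array([0.22, 0.22, 0.29, 0.19, 0.23, 0.22, 0.20, 0.25, 0.19, 0.19])
ours    = np.array([0.15, 0.17, 0.21, 0.14, 0.13, 0.15, 0.14, 0.19, 0.11, 0.11])

d = doctors - ours                        # per-fold differences, k = 10
k = len(d)
d_bar = d.mean()
s_dbar = np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))
t_crit = stats.t.ppf(0.975, df=k - 1)     # 95% two-sided

# An interval that excludes 0 suggests a real difference between the classifiers
print("95% CI for the difference:",
      (d_bar - t_crit * s_dbar, d_bar + t_crit * s_dbar))

# The same comparison via scipy's built-in paired t-test
print(stats.ttest_rel(doctors, ours))
```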
What is Accuracy?
Accuracy = No. of correct predictions / No. of predictions = (TP + TN) / (TP + TN + FP + FN)
Example
[Table comparing the results of four classifiers A, B, C, D, not reproduced here]
• Clearly, B, C, D are all better than A
• Is B better than C, D?
• Is C better than B, D?
• Is D better than B, C?
• Accuracy may not tell the whole story
What is Sensitivity (aka Recall)?
Sensitivity (wrt positives) = No. of correct positive predictions / No. of positives = TP / (TP + FN)
Sometimes sensitivity wrt negatives is termed specificity.
What is Specificity (aka Precision)?
Precision (wrt positives) = No. of correct positive predictions / No. of positive predictions = TP / (TP + FP)
Precision-Recall Trade-off
• A predicts better than B if A has better recall and precision than B
• There is a trade-off between recall and precision
• In some applications, once you reach a satisfactory precision, you optimize for recall
• In some applications, once you reach a satisfactory recall, you optimize for precision
[Figure: a curve of precision vs. recall illustrating the trade-off]
Comparing prediction performance
• Accuracy is the obvious measure
• But it conveys the right intuition only when the positive and negative populations are roughly equal in size
• Recall and precision together form a better measure
• But what do you do when A has better recall than B and B has better precision than A?
F-measure
F = (2 · recall · precision) / (recall + precision)
• The harmonic mean of recall and precision (wrt positives)
• Does not accord with intuition
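All of the measures above follow directly from the four confusion-matrix counts; a small sketch (the function name and counts are purely illustrative).

```python
def classification_metrics(TP, TN, FP, FN):
    """Accuracy, recall, precision, specificity and F-measure from counts."""
    accuracy    = (TP + TN) / (TP + TN + FP + FN)
    recall      = TP / (TP + FN)          # sensitivity wrt positives
    precision   = TP / (TP + FP)
    specificity = TN / (TN + FP)          # sensitivity wrt negatives
    f_measure   = 2 * recall * precision / (recall + precision)
    return accuracy, recall, precision, specificity, f_measure

# Hypothetical counts, only to show the call
print(classification_metrics(TP=40, TN=45, FP=5, FN=10))
```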
Abstract model of a classifier
• Given a test observation x:
  • Compute the prediction h(x)
  • Predict x as negative if h(x) < t
  • Predict x as positive if h(x) > t
• t is the decision threshold of the classifier
• Changing t affects the recall and precision, and hence the accuracy, of the classifier
ROC Curves
• By changing t, we get a range of sensitivities and specificities of a classifier
• This leads to the ROC curve, which plots sensitivity vs. (1 − specificity)
• A predicts better than B if A has better sensitivity than B at most specificities
• Then the larger the area under the ROC curve, the better
[Figure: ROC curves, with sensitivity = P(TP) on the vertical axis and 1 − specificity = P(FP) on the horizontal axis]
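A sketch of how the curve is traced by sweeping the threshold t over continuous scores; it assumes scikit-learn's roc_curve and roc_auc_score and made-up Gaussian scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Made-up scores h(x) for 100 negatives and 100 positives
y_true = np.concatenate([np.zeros(100), np.ones(100)])
scores = np.concatenate([rng.normal(0.0, 1.0, 100),    # negatives
                         rng.normal(1.0, 1.0, 100)])   # positives

# Each threshold t gives one (1 - specificity, sensitivity) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, scores)   # fpr = P(FP), tpr = P(TP)
print("Area under the ROC curve:", roc_auc_score(y_true, scores))
```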