Evaluating Classifiers

Evaluating Classifiers Lecture 2 Instructor: Max Welling

Evaluation of Results • How do you report classification error? • How certain are you about the error you claim? • How do you compare two algorithms? • How certain are you if you state one algorithm performs better than another?

Evaluation • Given: • A hypothesis h(x): X → C in hypothesis space H, mapping attributes x to classes c ∈ {1, 2, 3, ..., C}. • A data sample S(n) of size n. • Questions: • What is the error of h on unseen data? • If we have two competing hypotheses, which one is better on unseen data? • How do we compare two learning algorithms in the face of limited data? • How certain are we about our answers?

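Sample and True Error We can define two errors: 1) Error(h|S) is the error on the sample S: $\mathrm{Error}(h|S) = \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$, where $\delta(\cdot)$ is 1 if its argument is true and 0 otherwise. 2) Error(h|P) is the true error on unseen data sampled from the distribution P(x): $\mathrm{Error}(h|P) = \Pr_{x \sim P}[f(x) \neq h(x)]$, where f(x) is the true hypothesis.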

Binomial Distributions • Assume you toss a coin n times. • And it has probability p of coming heads (which we will call success) • What is the probability distribution governing the number of heads in n trials? • Answer: the Binomial distribution.
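
As a small illustration of the coin-toss question above, the Binomial probability of r heads in n tosses can be computed directly. This is a minimal sketch; the function name `binomial_pmf` and the example values n = 10, p = 0.3 are chosen here for illustration and are not from the lecture.

```python
from math import comb

def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r heads in n independent tosses with P(heads) = p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Illustrative values (assumed): 10 tosses of a coin with p = 0.3.
pmf = [binomial_pmf(r, 10, 0.3) for r in range(11)]

# The mean of the distribution should equal n * p.
mean = sum(r * q for r, q in enumerate(pmf))
```

The probabilities sum to one, and the mean comes out as n*p, which is the moment used later to estimate the error.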

Distribution over Errors • Consider some hypothesis h(x). • Draw a sample set X1 of n samples from P(x). • Do this k times to obtain X1,...,Xk. • Compute e1=n*error(h|X1), e2=n*error(h|X2),...,ek=n*error(h|Xk). • {e1,...,ek} are samples from a Binomial distribution! • Why? Imagine a magic coin, where God secretly determines the probability of heads by the following procedure. First He takes some random hypothesis h. Then He draws x~P(x) and observes whether h(x) predicts the label correctly. If it does, He makes sure the coin lands heads up... • You have a single sample S, for which you observe e(S) errors. What would be a reasonable estimate for Error(h|P), do you think?
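
The thought experiment above can be simulated. The sketch below assumes, as a stand-in for a real hypothesis h and distribution P, that each example is misclassified independently with a fixed probability `p_error`; the names and the values (p_error = 0.2, n = 50, k = 2000) are illustrative assumptions.

```python
import random

def simulate_error_counts(p_error: float, n: int, k: int, seed: int = 0) -> list[int]:
    """Draw k sample sets of size n and return the number of errors on each."""
    rng = random.Random(seed)
    # Each example is an independent "coin toss" that comes up "error" with p_error.
    return [sum(rng.random() < p_error for _ in range(n)) for _ in range(k)]

counts = simulate_error_counts(p_error=0.2, n=50, k=2000)

# For a Binomial(n, p) the mean is n*p = 10 and the variance is n*p*(1-p) = 8.
empirical_mean = sum(counts) / len(counts)
empirical_var = sum((c - empirical_mean) ** 2 for c in counts) / (len(counts) - 1)
```

The empirical mean and variance of the error counts should land close to the Binomial values, which is what justifies treating a single observed error count as one draw from that distribution.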

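Binomial Moments • For the number of errors r ~ Binomial(n, p): mean $E[r] = np$, variance $\mathrm{Var}[r] = np(1-p)$. • If we match the mean, np, with the observed value n*error(h|S) we find: $p \approx \mathrm{error}(h|S)$. • If we match the variance we can obtain an estimate of the width: $\mathrm{Var}[\mathrm{error}(h|S)] = \frac{p(1-p)}{n} \approx \frac{\mathrm{error}(h|S)\,(1-\mathrm{error}(h|S))}{n}$.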
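
A worked example of the moment matching above, with assumed numbers (n = 40 test examples, 12 of them misclassified; these values are illustrative and not from the lecture):

```latex
\hat{p} = \mathrm{error}(h|S) = \frac{12}{40} = 0.30,
\qquad
\hat{\sigma} = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
             = \sqrt{\frac{0.30 \times 0.70}{40}}
             \approx 0.072
```

So the estimated error is 0.30, with a standard deviation of roughly 0.07 around it.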

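Confidence Intervals • We would like to state: • With N% confidence we believe that error(h|P) is contained in the interval: $\mathrm{error}(h|S) \pm z_N \sqrt{\frac{\mathrm{error}(h|S)\,(1-\mathrm{error}(h|S))}{n}}$ • [Figure: Normal(0,1) density with the central 80% region marked.] • In principle this interval is hard to compute exactly for a Binomial, but for np(1-p)>5 or n>30 it is safe to approximate the Binomial by a Gaussian, for which we can easily compute “z-values”.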
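
A minimal sketch of the Gaussian-approximation interval described above, using only the Python standard library. The function name and the example counts (12 errors out of 40) are assumptions for illustration; the approximation is only reasonable under the slide's condition np(1-p)>5 or n>30.

```python
from math import sqrt
from statistics import NormalDist

def error_confidence_interval(n_errors: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Two-sided interval for the true error, using the Normal approximation."""
    e = n_errors / n                                  # sample error, error(h|S)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # z-value for the central N% region
    half_width = z * sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width

# Illustrative values (assumed): 12 errors on a test sample of 40 examples.
low, high = error_confidence_interval(12, 40, confidence=0.95)
```

For these numbers the 95% interval is roughly 0.16 to 0.44. Note that this sketch does not clip the interval to [0, 1], so for very small or very large sample errors the bounds can fall outside the valid range.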

Bias-Variance • The estimator is unbiased if $E[\mathrm{error}(h|X)] = \mathrm{error}(h|P) = p$. • Imagine again you have infinitely many sample sets X1,X2,... of size n. • Use these to compute estimates E1,E2,... of p, where Ei=error(h|Xi). • If the average of E1,E2,... converges to p, then error(h|X) is an unbiased estimator. • Two unbiased estimators can still differ in their variance (efficiency). Which one do you prefer? • [Figure: the true value p and the average estimate Eav.]
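
To make the last point concrete, the sketch below compares two unbiased estimators of p: the error on a full sample of size n, and the error on only the first half of that sample. The setup (independent mistakes with fixed probability `p_error`, and the specific values used) is an assumption for illustration.

```python
import random

def _mean(values):
    return sum(values) / len(values)

def _variance(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def compare_estimators(p_error: float, n: int, trials: int = 3000, seed: int = 1):
    """Return (mean, variance) of the full-sample and half-sample error estimates."""
    rng = random.Random(seed)
    full, half = [], []
    for _ in range(trials):
        mistakes = [rng.random() < p_error for _ in range(n)]
        full.append(sum(mistakes) / n)                     # uses all n examples
        half.append(sum(mistakes[: n // 2]) / (n // 2))    # uses only the first n/2
    return (_mean(full), _variance(full)), (_mean(half), _variance(half))

(full_mean, full_var), (half_mean, half_var) = compare_estimators(p_error=0.2, n=100)
```

Both averages should sit near p = 0.2, but the half-sample estimator has about twice the variance, so the full-sample estimator is the more efficient of the two.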

Flow of Thought • Determine the property you want to know about the future data (e.g. error(h|P)). • Find an unbiased estimator E for this quantity based on observing data X (e.g. error(h|X)). • Determine the distribution P(E) of E under the assumption that you have infinitely many sample sets X1,X2,... of some size n (e.g. P(E)=Binomial(p,n), p=error(h|P)). • Estimate the parameters of P(E) from an actual data sample S (e.g. p=error(h|S)). • Compute the mean and variance of P(E) and pray that P(E) is close to a Normal distribution (sums of random variables converge to Normal distributions: central limit theorem). • State your confidence interval as: with confidence N%, error(h|P) is contained in the interval $\mathrm{error}(h|S) \pm z_N \sqrt{\mathrm{Var}[\mathrm{error}(h|S)]}$.

Assumptions • We only consider discrete-valued hypotheses (i.e. classes). • Training data and test data are drawn IID (independently and identically distributed) from the same distribution P(x). • The hypothesis must be chosen independently of the data sample S! • When you obtain a hypothesis from a learning algorithm, split the data into a training set and a testing set. Find the hypothesis using the training set and estimate the error on the testing set.

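Comparing Hypotheses • Assume we would like to compare two hypotheses h1 and h2, which we have tested on two independent samples S1 and S2 of size n1 and n2. • I.e. we are interested in the quantity: $d = \mathrm{error}(h_1|P) - \mathrm{error}(h_2|P)$. • Define an estimator for d: $\hat{d} = \mathrm{error}(h_1|X_1) - \mathrm{error}(h_2|X_2)$, with X1,X2 sample sets of size n1,n2. • Since error(h1|S1) and error(h2|S2) are both approximately Normal, their difference is approximately Normal with mean $d$ and variance $\sigma_{\hat{d}}^2 \approx \frac{\mathrm{error}(h_1|S_1)\,(1-\mathrm{error}(h_1|S_1))}{n_1} + \frac{\mathrm{error}(h_2|S_2)\,(1-\mathrm{error}(h_2|S_2))}{n_2}$. • Hence, with N% confidence we believe that d is contained in the interval: $\hat{d} \pm z_N \sigma_{\hat{d}}$.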
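
A minimal sketch of the interval for the difference d between two hypotheses tested on independent samples, under the Normal approximation described above. The function name and the example numbers (errors of 0.30 and 0.20, each on 100 examples) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def difference_interval(err1: float, n1: int, err2: float, n2: int,
                        confidence: float = 0.95) -> tuple[float, float]:
    """Interval for error(h1|P) - error(h2|P) from two independent test samples."""
    d_hat = err1 - err2
    # Variances of independent estimates add.
    sigma = sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d_hat - z * sigma, d_hat + z * sigma

# Illustrative values (assumed): h1 errs on 30% of 100 examples, h2 on 20% of 100.
low, high = difference_interval(0.30, 100, 0.20, 100)
```

With these numbers the 95% interval runs from about -0.02 to 0.22. It contains zero, so even a ten-point gap in sample error does not let us claim at this confidence level that h2 is better.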

Paired Tests • Consider the following data: • error(h1|s1)=0.1, error(h2|s1)=0.11 • error(h1|s2)=0.2, error(h2|s2)=0.21 • error(h1|s3)=0.66, error(h2|s3)=0.67 • error(h1|s4)=0.45, error(h2|s4)=0.46 • and so on. • We have var(error(h1)) = large and var(error(h2)) = large. • If we treat the two errors as independent, the total variance of error(h1)-error(h2) is their sum. • However, h1 is consistently better than h2. • We ignored the fact that we compare on the same data. • We want a different estimator that compares the data one by one. • You can use a “paired t-test” (e.g. in MATLAB) to see if the two errors are significantly different, or if one error is significantly larger than the other.

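Paired t-test • Chunk the data up in subsets T1,...,Tk with |Ti|>30. • On each subset compute the error and compute: $\delta_i = \mathrm{error}(h_1|T_i) - \mathrm{error}(h_2|T_i)$. • Now compute: $\bar{\delta} = \frac{1}{k}\sum_{i=1}^{k}\delta_i$ and $s = \sqrt{\frac{1}{k(k-1)}\sum_{i=1}^{k}(\delta_i-\bar{\delta})^2}$. • State: With N% confidence the difference in error between h1 and h2 is: $\bar{\delta} \pm t_{N,k-1}\, s$. • “t” is the t-statistic, which is related to the Student-t distribution (table 5.6).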
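
The paired procedure can be sketched as follows: take the per-subset difference in error between h1 and h2, then build an interval around the mean difference using its estimated standard deviation. The t-value is passed in by the caller (looked up in a Student-t table for k-1 degrees of freedom), since the Python standard library has no inverse t distribution. The function name and the per-subset errors below are hypothetical.

```python
from math import sqrt

def paired_t_interval(errors_h1, errors_h2, t_value: float) -> tuple[float, float]:
    """Interval for the mean per-subset difference error(h1|Ti) - error(h2|Ti)."""
    deltas = [a - b for a, b in zip(errors_h1, errors_h2)]
    k = len(deltas)
    mean_delta = sum(deltas) / k
    # Estimated standard deviation of the mean difference.
    s = sqrt(sum((d - mean_delta) ** 2 for d in deltas) / (k * (k - 1)))
    return mean_delta - t_value * s, mean_delta + t_value * s

# Hypothetical errors of h1 and h2 on the same k = 5 subsets.
errors_h1 = [0.10, 0.20, 0.66, 0.45, 0.30]
errors_h2 = [0.11, 0.22, 0.67, 0.47, 0.33]

# 2.776 is the two-sided 95% t-value for k - 1 = 4 degrees of freedom.
low, high = paired_t_interval(errors_h1, errors_h2, t_value=2.776)
```

Although the errors themselves range from 0.10 to 0.67, the differences are small and consistent, so the interval (roughly -0.028 to -0.008) lies entirely below zero: on this made-up data h1 has the lower error.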

Comparing Learning Algorithms • In general it is a really bad idea to estimate error rates on the same data on which a learning algorithm is trained. WHY? • So just as in cross-validation, we split the data into k subsets: S → {T1,T2,...,Tk}. • Train both learning algorithm 1 (L1) and learning algorithm 2 (L2) on the complement of each subset, {S-T1,S-T2,...,S-Tk}, to produce hypotheses {L1(S-Ti), L2(S-Ti)} for all i. • Compute for all i: $\delta_i = \mathrm{error}(L_1(S-T_i)|T_i) - \mathrm{error}(L_2(S-T_i)|T_i)$. • Note: we train on S-Ti, but test on Ti. • As in the last slide, perform a paired t-test on these differences to compute an estimate and a confidence interval for the relative error of the hypotheses produced by L1 and L2.
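
The procedure above can be sketched in plain Python. The two learners and the synthetic one-dimensional data are hypothetical stand-ins chosen for illustration: a majority-class learner plays L1 and a simple threshold learner plays L2. The threshold learner assumes positives tend to have larger x than negatives, which holds for the data generated here.

```python
import random

def majority_learner(train):
    """Hypothetical L1: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

def threshold_learner(train):
    """Hypothetical L2: predict 1 when x exceeds the midpoint of the class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:
        return majority_learner(train)
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0

def error(h, test):
    """Fraction of test examples that h misclassifies."""
    return sum(h(x) != y for x, y in test) / len(test)

def fold_differences(data, k, learner1, learner2):
    """Train on S - Ti, test on Ti, and return the k error differences."""
    folds = [data[i::k] for i in range(k)]
    deltas = []
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        deltas.append(error(learner1(train), test) - error(learner2(train), test))
    return deltas

# Synthetic data (assumed): class 0 centred at x = 0, class 1 centred at x = 2.
rng = random.Random(0)
data = []
for _ in range(300):
    y = rng.randint(0, 1)
    data.append((rng.gauss(2.0 * y, 1.0), y))

deltas = fold_differences(data, 10, majority_learner, threshold_learner)
mean_delta = sum(deltas) / len(deltas)
```

The list `deltas` holds one difference per fold and is what a paired t-test would be run on. Each fold has 30 examples here, which sits at the edge of the subset-size guideline from the previous slide.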

Evaluation: ROC curves • [Figure: score distributions of class 1 (positives) and class 0 (negatives) with a moving threshold.] • TP = true positive rate = # positives classified as positive divided by # positives • FP = false positive rate = # negatives classified as positive divided by # negatives • TN = true negative rate = # negatives classified as negative divided by # negatives • FN = false negative rate = # positives classified as negative divided by # positives • Identify a threshold in your classifier that you can shift. • Plot the ROC curve (TP against FP) while you shift that parameter.
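
A minimal sketch of tracing out ROC points by shifting a threshold, assuming the classifier outputs a real-valued score and predicts class 1 whenever the score is at or above the threshold. The function name and the scores and labels are made up for illustration, and both classes must be present.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs, one per threshold, from strict to lenient."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing classified positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
```

The resulting points run from (0, 0) to (1, 1). Plotting the second coordinate against the first gives the ROC curve, and a curve that rises quickly toward the top-left corner indicates a threshold range where positives are separated well from negatives.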

Conclusion Never (ever) draw error-curves without confidence intervals (The second most important sentence of this course)
