Statistical Comparison of Two Learning Algorithms
Presented by: Payam Refaeilzadeh
Overview
• How can we tell if one algorithm can learn better than another?
• Design an experiment to measure the accuracy of the two algorithms.
• Run multiple trials.
• Compare the samples, not just their means: do a statistically sound test of the two samples.
• Is any observed difference significant? Is it due to a true difference between the algorithms, or to natural variation in the measurements?
Statistical Hypothesis Testing
• Statistical Hypothesis: a statement about the parameters of one or more populations
• Hypothesis Testing: a procedure for deciding whether to accept or reject the hypothesis
  • Identify the parameter of interest
  • State a null hypothesis, H0
  • Specify an alternate hypothesis, H1
  • Choose a significance level α
  • State an appropriate test statistic
Statistical Hypothesis Testing (cont.)
• Null Hypothesis (H0): a statement presumed to be true until statistical evidence shows otherwise
  • Usually specifies an exact value for a parameter
  • Example: H0: µ = 30 kg
• Alternate Hypothesis (H1): accepted if the null hypothesis is rejected
• Test Statistic: a particular statistic calculated from measurements of a random sample / experiment
  • A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.)
  • That distribution is used to judge the significance of the calculated test statistic
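As a minimal illustration of these definitions, here is a sketch in Python (assuming SciPy is available; the weight measurements are made up for illustration) that tests the slide's example hypothesis H0: µ = 30 kg at α = 0.05:

```python
# One-sample t-test for the slide's example H0: mu = 30 kg.
# The sample values below are hypothetical, for illustration only.
from scipy import stats

weights = [29.1, 31.4, 30.2, 28.7, 32.0, 29.8, 30.5, 31.1]  # hypothetical data
alpha = 0.05  # significance level

t0, p_value = stats.ttest_1samp(weights, popmean=30.0)
print(f"t0 = {t0:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: mu = 30 kg")
else:
    print("Fail to reject H0")
```

Here the test statistic t0 is assumed to follow a t-distribution under H0, exactly as described above.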
Error in Hypothesis Testing
• Type I error occurs when H0 is rejected but it is in fact true
  • P(Type I error) = α, the significance level
• Type II error occurs when we fail to reject H0 but it is in fact false
  • P(Type II error) = β
• Power = 1 − β = probability of correctly rejecting H0
  • Power is the ability to distinguish between the two populations
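To make these error rates concrete, here is a small Monte Carlo sketch (assuming Python with NumPy and SciPy; the sample size, effect size, and trial count are arbitrary choices, not from the slides) that estimates the Type I error rate and the power of a t-test by simulation:

```python
# Estimate Type I error and power of a one-sample t-test by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

# Type I error: H0 (mu = 0) is true, so every rejection is a false rejection.
rejections = sum(
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
    for _ in range(trials)
)
print(f"Estimated Type I error: {rejections / trials:.3f}")  # should be near alpha

# Power: H0 is false (true mu = 0.5), so every rejection is a correct rejection.
rejections = sum(
    stats.ttest_1samp(rng.normal(0.5, 1.0, n), 0.0).pvalue < alpha
    for _ in range(trials)
)
print(f"Estimated power: {rejections / trials:.3f}")  # = 1 - beta
```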
Paired t-Test
• Collect data in pairs
  • Example: given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then test their accuracies on DTest
• Suppose n paired measurements have been made
• Assume
  • The measurements are independent
  • The measurements for each algorithm follow a normal distribution
• The test statistic t0 will then follow a t-distribution with n − 1 degrees of freedom
Paired t-Test (cont.)
• Assume: X1 follows N(µ1, σ1), X2 follows N(µ2, σ2)
• Let: µD = µ1 − µ2 and Di = X1i − X2i, i = 1, 2, ..., n
• Null hypothesis: H0: µD = Δ0
• Test statistic: t0 = (D̄ − Δ0) / (SD / √n), where D̄ and SD are the sample mean and sample standard deviation of the Di
• Rejection criteria:
  • H1: µD ≠ Δ0 → reject H0 if |t0| > tα/2,n−1
  • H1: µD > Δ0 → reject H0 if t0 > tα,n−1
  • H1: µD < Δ0 → reject H0 if t0 < −tα,n−1
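A minimal sketch of this computation in Python (the per-trial accuracy values are hypothetical, and Δ0 = 0, i.e. we test for any difference):

```python
# Paired t-test on hypothetical per-trial accuracies of two algorithms.
import numpy as np
from scipy import stats

acc1 = np.array([0.81, 0.79, 0.84, 0.80, 0.83])  # algorithm 1, hypothetical
acc2 = np.array([0.78, 0.77, 0.80, 0.79, 0.80])  # algorithm 2, hypothetical

d = acc1 - acc2                                 # Di = X1i - X2i
n = len(d)
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(n))    # t0 = (D-bar - 0) / (S_D / sqrt(n))

# Two-sided test: reject H0 if |t0| > t_{alpha/2, n-1}
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(f"t0 = {t0:.3f}, critical value = {t_crit:.3f}")
# Equivalent one-liner: stats.ttest_rel(acc1, acc2)
```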
Cross Validated t-test
• Paired t-test on the 10 paired accuracies obtained from 10-fold cross validation
• Advantages
  • Large training set size
  • Most powerful (Dietterich, 1998)
• Disadvantages
  • Accuracy results are not independent (the training sets overlap)
  • Somewhat elevated probability of Type I error (Dietterich, 1998)
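A sketch of the full procedure with scikit-learn (the dataset and the two classifiers are placeholders; any pair of estimators would do, the point is that both see exactly the same folds):

```python
# Paired t-test on the 10 per-fold accuracies from 10-fold cross validation.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both

acc_a = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
acc_b = cross_val_score(GaussianNB(), X, y, cv=cv)

t0, p = stats.ttest_rel(acc_a, acc_b)  # paired t-test, df = 9
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```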
5x2 Cross Validated t-test
• Run 2-fold cross validation 5 times
• Use the difference from the first fold of the first replication to estimate the mean difference
• Use the results from all folds to estimate the variance
• Advantage
  • Lowest Type I error (Dietterich, 1998)
• Disadvantage
  • Not as powerful as the 10-fold cross validated t-test (Dietterich, 1998)
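A sketch of Dietterich's 5x2cv statistic (dataset and classifiers are again placeholders). The resulting t0 is compared against a t-distribution with 5 degrees of freedom:

```python
# 5x2cv paired t-test (Dietterich, 1998): 5 replications of 2-fold CV.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = DecisionTreeClassifier(random_state=0), GaussianNB()

diffs = np.zeros((5, 2))  # p_i^(j): accuracy difference, replication i, fold j
for i in range(5):
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
    for j, (train, test) in enumerate(cv.split(X, y)):
        a = clf_a.fit(X[train], y[train]).score(X[test], y[test])
        b = clf_b.fit(X[train], y[train]).score(X[test], y[test])
        diffs[i, j] = a - b

# Per-replication variance: s_i^2 = (p_i1 - p_bar_i)^2 + (p_i2 - p_bar_i)^2
p_bar = diffs.mean(axis=1)
s2 = (diffs[:, 0] - p_bar) ** 2 + (diffs[:, 1] - p_bar) ** 2

t0 = diffs[0, 0] / np.sqrt(s2.mean())   # numerator: first fold, first replication
p_value = 2 * stats.t.sf(abs(t0), df=5)
print(f"t0 = {t0:.3f}, p = {p_value:.3f}")
```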
Re-sampled t-test
• Randomly divide the data into train / test sets (usually 2/3 – 1/3)
• Run multiple trials (usually 30)
• Perform a paired t-test on the trial accuracies
• This test has a very high probability of Type I error and should never be used
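For completeness, a sketch of the procedure being warned against (dataset and classifiers are placeholders; this is shown only to make the warning concrete, not as a recommended test):

```python
# Resampled t-test: repeated random 2/3 - 1/3 splits, then a paired t-test.
# Shown only to illustrate the flawed procedure: the 30 trial accuracies
# come from heavily overlapping training sets, so they are not independent
# and the test's Type I error is badly inflated.
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
acc_a, acc_b = [], []
for trial in range(30):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1/3, random_state=trial)
    acc_a.append(DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte))
    acc_b.append(GaussianNB().fit(Xtr, ytr).score(Xte, yte))

t0, p = stats.ttest_rel(acc_a, acc_b)
print(f"t0 = {t0:.3f}, p = {p:.3f}  (p is overconfident here)")
```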
Calibrated Tests
• Bouckaert (ICML 2003):
  • It is very difficult to estimate the true degrees of freedom because the independence assumptions are being violated
  • Instead of correcting the mean difference, calibrate the degrees of freedom
  • Recommendation: use 10-times repeated 10-fold cross validation with 10 degrees of freedom
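A hedged sketch of this recommendation, read literally: the usual t statistic is formed over the 100 per-fold differences, but compared against a t-distribution with the calibrated 10 degrees of freedom rather than the nominal n − 1 = 99 (dataset and classifiers are placeholders; consult Bouckaert's paper for the exact calibrated procedure):

```python
# 10x10 cross validation with degrees of freedom calibrated to 10
# (one reading of Bouckaert's recommendation; see the paper for details).
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
d = []
for r in range(10):                                        # 10 repetitions ...
    cv = KFold(n_splits=10, shuffle=True, random_state=r)  # ... of 10-fold CV
    a = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
    b = cross_val_score(GaussianNB(), X, y, cv=cv)
    d.extend(a - b)
d = np.asarray(d)                                          # 100 paired differences

t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p_value = 2 * stats.t.sf(abs(t0), df=10)  # calibrated df = 10, not n - 1 = 99
print(f"t0 = {t0:.3f}, p = {p_value:.3f}")
```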
References
• R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. ICML 2003, pp. 51–58.
• T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.
• D. C. Montgomery et al. Engineering Statistics. 2nd edition. Wiley, 2001.