Evaluating Hypotheses: How good is my classifier?
How good is my classifier? We have already seen the accuracy metric: the classifier's performance on a test set.
First and Foremost… If we are to trust a classifier's results, we must keep the classifier blindfolded: make sure the classifier never sees the test data. Be wary when things seem too good to be true…
Confusion Matrix. Accuracy is a single number; we could collect more information by tabulating how the predictions break down against the actual classes: true positives, false positives, true negatives, and false negatives.
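As a minimal sketch (not from the slides), R's built-in table() function tabulates predicted vs. actual labels into such a matrix; the two label vectors below are hypothetical:

    actual    <- factor(c("pos", "pos", "neg", "neg", "pos", "neg"))
    predicted <- factor(c("pos", "neg", "neg", "pos", "pos", "neg"))
    table(Predicted = predicted, Actual = actual)   # rows: predictions, columns: true classes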
Sensitivity vs. Specificity. Sensitivity: out of the things that actually are positive, how many did we correctly detect? Specificity: out of the things that actually are negative, how many did we correctly identify as negative? The classifier becomes less sensitive if it begins missing what it is trying to detect; if it identifies more and more things as the target class, it begins to get less specific.
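A small sketch of both measures in R, using made-up counts rather than anything from the slides:

    TP <- 40; FN <- 10               # hypothetical: actual positives split into detected / missed
    TN <- 45; FP <- 5                # hypothetical: actual negatives split into rejected / false alarms
    sensitivity <- TP / (TP + FN)    # fraction of actual positives detected: 0.8
    specificity <- TN / (TN + FP)    # fraction of actual negatives correctly identified: 0.9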
Once we're sure no cheating is going on… can we quantify our uncertainty? Will the accuracy hold up on brand new, never-before-seen data?
Binomial Distribution. The discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments. Successes or failures: just what we're looking for!
Binomial Distribution. The probability that the random variable R takes on a specific value r is $P(R = r) = \binom{n}{r}\,p^{r}(1-p)^{n-r}$. Here p might be the probability of an error or of a positive; since we have been working with accuracy, let's go with positives (the book works with errors).
Binomial Distribution. These are very simple calculations (a short sketch follows below).
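A short R sketch with hypothetical numbers; dbinom and pbinom are R's built-in binomial density and distribution functions:

    n <- 100; p <- 0.9               # hypothetical: 100 test examples, assumed true accuracy 0.9
    dbinom(85, size = n, prob = p)   # probability of exactly 85 correct
    pbinom(85, size = n, prob = p)   # probability of 85 or fewer correct
    n * p                            # expected number correct: 90
    sqrt(n * p * (1 - p))            # standard deviation of that count: 3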
What Does This Mean? We can use $\hat{p} = r/n$, the observed proportion of successes, as an estimator of p. Now we have $\hat{p}$ and the distribution given p, so we have the tools to figure out how confident we should be in our estimator.
The question: how confident should I be in the accuracy measure? If we can live with statements like "95% of the accuracy measures will fall in the range of 94% to 97%", life is good. That range is a confidence interval.
How do we calculate it? We want the quantiles where the area outside is 5%. We can estimate p, and there are tools available in most programming languages.
Example in R:

    lb <- qbinom(.025, n, p)   # 2.5% quantile of the number of successes
    ub <- qbinom(.975, n, p)   # 97.5% quantile of the number of successes

The lower and upper bounds constitute the confidence interval.
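A worked sketch with hypothetical numbers, dividing the count quantiles by n so the interval is expressed as accuracies; r, n, and p_hat are illustrative values, not results from the slides:

    n     <- 100                       # hypothetical test-set size
    r     <- 92                        # hypothetical number classified correctly
    p_hat <- r / n                     # estimate of the true accuracy p
    lb <- qbinom(.025, n, p_hat) / n   # lower bound, expressed as an accuracy
    ub <- qbinom(.975, n, p_hat) / n   # upper bound, expressed as an accuracy
    c(lb, ub)                          # the 95% confidence interval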
Still, Are We Really This Confident? What if none of the small cluster of Blues were in the training set? All of them would be in the test set; how well would the classifier do? This is the difference between sample error and true error. A single good result might have been an accident, a pathological case.
Cross-Validation. What if we could test the classifier several times with different test sets? If it performed well each time, wouldn't we be more confident in the results? Reproducibility. Consistency.
K-fold Cross-Validation. Usually we have one big chunk of training data. If we split it into K randomly drawn chunks (folds), we can train on the remainder and test with the held-out chunk.
K-fold Cross-Validation. With 10 chunks we train 10 times, and we now have performance data on ten completely different test data sets (a sketch follows below).
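A minimal sketch of 10-fold cross-validation in R under stated assumptions: d is a hypothetical data frame of labeled examples with a label column y, and the classifier is a trivial majority-class stand-in for whatever model is actually being evaluated:

    k <- 10
    n <- nrow(d)                              # d: hypothetical data frame with a label column y
    folds <- sample(rep(1:k, length.out = n)) # randomly assign every row to one of the k folds
    accuracies <- numeric(k)
    for (i in 1:k) {
      test  <- d[folds == i, ]                # the held-out chunk
      train <- d[folds != i, ]                # the remainder, used for training
      # stand-in classifier for the sketch: always predict the training fold's majority class
      majority <- names(which.max(table(train$y)))
      accuracies[i] <- mean(test$y == majority)
    }
    mean(accuracies)                          # report the average accuracy across the K runs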
Remember, No Cheating. The classifier must stay blindfolded while training, and it must discard all lessons learned after each fold.
10-fold Appears to be the Most Common Default. Weka and DataMiner both default to 10-fold. It could just as easily be 20-fold or 25-fold; with 20-fold it would be a 95-5 split. Performance is reported as the average accuracy across the K runs.
K-Fold: What is the best K? This is related to the question of how large the training set should be. It should be large enough to support a test set of size n satisfying the rule of thumb: at least 30 examples, with p not too close to 0 or 1. For ten-fold, if each tenth must be 30 examples, the full training set must contain 300. If 10-fold satisfies this, we should be in good shape.
We can even use chunks of size 1: this is called leave-one-out cross-validation. Disadvantage: it is slow. It gives the largest possible training set and the smallest possible test set. It has been promoted as an unbiased estimator of error, but recent studies indicate that there is no unbiased estimator.
Confidence Interval Recap. We can calculate a confidence interval with a single test set. More runs (K-fold) give us more confidence that we didn't just get lucky in test-set selection. Do these runs help narrow the confidence interval?
When we average the performance… the Central Limit Theorem applies: as the number of runs grows, the distribution of the average approaches normal. With a reasonably large number of runs we can derive a more trustworthy confidence interval. With 30 test runs (30-fold) we can use traditional approaches to calculating the mean and standard deviation, and therefore confidence intervals.
Central Limit Theorem. Consider a set of independent, identically distributed random variables $Y_1 \ldots Y_n$ governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$. Then as $n \to \infty$, the distribution governing $\frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}$ approaches a Normal distribution with zero mean and standard deviation equal to 1. Book: "This is a quite surprising fact, because it states that we know the form of the distribution that governs the sample mean even when we do not know the form of the underlying distribution that governs the individual $Y_i$."
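A quick illustrative simulation in R (a sketch of my own, not from the slides) showing the theorem in action: sample means of a decidedly non-Normal distribution already look roughly Normal for moderate n:

    set.seed(1)
    # 5000 sample means, each over n = 50 draws from an exponential distribution with mean 1
    means <- replicate(5000, mean(rexp(50, rate = 1)))
    hist(means, breaks = 50)   # roughly bell-shaped, centered near the true mean of 1
    sd(means)                  # close to the theoretical sigma / sqrt(n) = 1 / sqrt(50)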
Checking accuracy in R:

    meanAcc <- mean(accuracies)   # mean accuracy across the runs
    sdAcc   <- sd(accuracies)     # standard deviation of the accuracies
    qnorm(.975, meanAcc, sdAcc)   # upper bound: 0.9980772
    qnorm(.025, meanAcc, sdAcc)   # lower bound: 0.8169336
My Classifier's Better than Yours. Can we say that one classifier is significantly better than another? Use a t-test; the null hypothesis is that the two sets of results come from the same distribution.
T-test in R:

    t.test(distOne, distTwo, paired = TRUE)

            Paired t-test
    data:  distOne and distTwo
    t = -55.8756, df = 29, p-value < 2.2e-16
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.2052696 -0.1907732
    sample estimates:
    mean of the differences
                 -0.1980214
T-test in Perl:

    use Statistics::TTest;

    my $ttest = new Statistics::TTest;
    $ttest->load_data(\@r1, \@r2);
    $ttest->set_significance(95);
    $ttest->print_t_test();
    print "\n\nt statistic is " . $ttest->t_statistic . "\n";
    print "p val " . $ttest->{t_prob} . "\n";

Output (excerpt):

    t_prob: 0
    significance: 95
    …
    df1: 29
    alpha: 0.025
    t_statistic: 12.8137016607408
    null_hypothesis: rejected
    t statistic is 12.8137016607408
    p val 0
Example: would you trust this classifier?
1. The classifier performed exceptionally well, achieving 99.9% classifier accuracy on the 1,000-member training set.
2. The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000.
3. The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 10-fold cross-validation on a training set of size 1,000. The variance in the ten accuracy measures indicates a 95% confidence interval of 97%-98%.
4. The classifier performed exceptionally well, achieving an average classifier accuracy of 97.5% utilizing 30-fold cross-validation on a training set of size 1,000. The variance in the thirty accuracy measures indicates a 95% confidence interval of 97%-98%.
A Useful Technique: randomly permute an array. From the Perl Cookbook (http://docstore.mik.ua/orelly/perl/cookbook/ch04_18.htm):

    sub fisher_yates_shuffle {
        my $array = shift;                    # reference to the array, shuffled in place
        my $i;
        for ($i = @$array; --$i; ) {
            my $j = int rand($i + 1);         # pick a random index from 0..$i
            next if $i == $j;
            @$array[$i, $j] = @$array[$j, $i]; # swap elements $i and $j
        }
    }
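For comparison with the R snippets above, the same shuffle is a one-liner in R; x here stands for any hypothetical vector (for example, row indices to be dealt into folds):

    shuffled <- sample(x)   # with no size argument, sample() returns a random permutation of x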
What about chi-squared?