Approximate Randomization tests • February 5th, 2013
Why AR testing? • Classic tests often assume a given distribution (Student's t, normal, …) of the variable • This is ≈ok for recall, but not for precision or F-score • The set of hypotheses that can be tested with non-parametric tests is limited
Illustration • 30,000 runs, 1000 instances, 500 of class A • True positives (TP): 400 (stdev: 80) • False positives (FP): 60 (stdev: 15) • Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are restricted by 0 and the number of instances.
Definitions • Recall = correctly predicted A / A in reference = correctly predicted A / constant. If the number of correctly predicted A is normal, recall is normal. • Precision = correctly predicted A / A in system = TP / (TP + FP). This is a non-linear combination of TP and FP. Precision is not normal. • F-score: non-linear combination of recall and precision. Not normal.
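The illustration and definitions above can be checked with a small simulation. This is a minimal sketch, not the code behind the slides: the function names `skewness` and `simulate` are hypothetical, and TP and FP are drawn from unclipped normal distributions, following the stated assumption. A skewness close to 0 is consistent with a normal distribution; a clearly non-zero value is not.

```python
import random
import statistics


def skewness(values):
    """Population skewness: mean of the cubed standardized values."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / stdev) ** 3 for v in values])


def simulate(runs=30_000, positives=500, seed=0):
    """Draw TP and FP per run and return the skewness of each metric."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)  # true positives (values from the slide)
        fp = rng.gauss(60, 15)   # false positives (values from the slide)
        r = tp / positives       # denominator is a constant
        p = tp / (tp + fp)       # denominator varies with TP and FP
        recall.append(r)
        precision.append(p)
        f_score.append(2 * p * r / (p + r))
    return {
        "recall": skewness(recall),
        "precision": skewness(precision),
        "f_score": skewness(f_score),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: skewness = {value:.3f}")
```

Under these assumptions, recall should show a skewness near zero, while precision and F-score should come out clearly left-skewed.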
Approximate randomization test • No assumption on distribution • Can handle complicated statistics • Only assumption: independence between shuffled elements • References: • Noreen, 1989. Computer Intensive Methods for Testing Hypotheses. • Yeh, 2000. More accurate tests for the statistical significance of result differences.
Basic idea • Exact randomization test
Exact probability • H0: expert is independent of contents • P(ncorrect ≥ 2) = 7/24 = 0.29 • Thus, do not reject H0, because the probability is larger than alpha = 0.05.
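The figure 7/24 can be reproduced by enumerating every permutation. The sketch below assumes four glasses with four distinct contents, which is what 24 = 4! suggests; the labels "a" to "d" and the function name `exact_p` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def exact_p(contents, expert):
    """Share of permutations of the expert's answers that score at least
    as many correct guesses as the expert actually did."""
    actual = sum(c == e for c, e in zip(contents, expert))
    hits = total = 0
    for shuffled in permutations(expert):
        total += 1
        n_correct = sum(c == e for c, e in zip(contents, shuffled))
        if n_correct >= actual:
            hits += 1
    return Fraction(hits, total)


contents = ["a", "b", "c", "d"]  # true contents of the four glasses
expert = ["a", "b", "d", "c"]    # expert's guesses: 2 of 4 correct

print(exact_p(contents, expert))  # 7/24
```

Of the 24 permutations, 6 have exactly two correct guesses, none has exactly three, and 1 has all four, which gives 7.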
Approximate probability • The number of permutations is n!, so it increases quickly with n • If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1) • nge: number of times the pseudostatistic ≥ the actual statistic • NS: number of shuffles • +1: correction for validity
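The approximation can be sketched on the same kind of glass-tasting data by drawing random shuffles instead of enumerating all permutations. The function name `approximate_p`, the default number of shuffles, and the fixed seed are assumptions made for illustration.

```python
import random


def approximate_p(contents, expert, shuffles=9_999, seed=0):
    """Approximate randomization: P = (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)
    actual = sum(c == e for c, e in zip(contents, expert))
    shuffled = list(expert)
    nge = 0  # times the pseudostatistic >= the actual statistic
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        pseudo = sum(c == e for c, e in zip(contents, shuffled))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (shuffles + 1)


if __name__ == "__main__":
    print(approximate_p(list("abcd"), list("abdc")))
```

With four glasses and two correct guesses, the estimate should land close to the exact value of 7/24 ≈ 0.29.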
Translation to instances • Each glass is an instance • Contents and expert are two labeling systems • Contents has an accuracy of 100%, expert has an accuracy of 50% • The statistic is precision, F-score, recall, … instead of accuracy
Stratified shuffling • For labeled instances, it makes no sense to move the class label of one instance to another instance • Only shuffle the labels of the two systems within each instance
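A minimal sketch of stratified shuffling for two labeling systems, using precision on one label as the statistic. It is not the script discussed in these slides: the names `precision` and `ar_test` are hypothetical, precision is defined as 0 when a system never predicts the label, and the test is two-sided on the absolute difference.

```python
import random


def precision(gold, predicted, label):
    """Precision for one label; 0.0 if the label is never predicted."""
    n_predicted = sum(p == label for p in predicted)
    if n_predicted == 0:
        return 0.0
    n_correct = sum(g == label and p == label for g, p in zip(gold, predicted))
    return n_correct / n_predicted


def ar_test(gold, sys1, sys2, label, shuffles=9_999, seed=0):
    """Two-sided approximate randomization test on the precision difference.

    For each instance, the two systems' predictions are swapped with
    probability 0.5; labels never move between instances.
    """
    rng = random.Random(seed)
    actual = abs(precision(gold, sys1, label) - precision(gold, sys2, label))
    nge = 0
    for _ in range(shuffles):
        new1, new2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:
                a, b = b, a
            new1.append(a)
            new2.append(b)
        pseudo = abs(precision(gold, new1, label) - precision(gold, new2, label))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (shuffles + 1)
```

Any other statistic (recall, F-score, accuracy) could be passed in place of `precision`, since the test itself makes no assumption about its distribution.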
MBT • Assumption of independence between instances • Tokens within a sentence are not independent, so shuffle per sentence rather than per token
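Shuffling per sentence could look like the sketch below, which assumes each system's output is a list of sentences and each sentence is a list of tags. The function name `shuffle_sentences` and this data layout are assumptions, not taken from the slides.

```python
import random


def shuffle_sentences(tags1, tags2, rng):
    """Swap whole sentences between two tagger outputs with probability 0.5.

    Tags inside a sentence always stay together, so the dependence
    between neighbouring tokens is left intact.
    """
    new1, new2 = [], []
    for sentence1, sentence2 in zip(tags1, tags2):
        if rng.random() < 0.5:
            sentence1, sentence2 = sentence2, sentence1
        new1.append(sentence1)
        new2.append(sentence2)
    return new1, new2


if __name__ == "__main__":
    out1 = [["N", "V"], ["D", "N", "V"]]
    out2 = [["N", "N"], ["D", "V", "V"]]
    print(shuffle_sentences(out1, out2, random.Random(0)))
```

The statistic is then computed on the shuffled outputs, for example after flattening the sentences back into one token sequence.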
Term extraction • Shuffling extracted terms between the output of two term extraction systems
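One way to read this, offered only as an assumption about how such a shuffle might work: treat every extracted term as an instance that is "in" or "out" of each system's output, and swap that membership between the systems with probability 0.5. Terms found by both systems are unaffected by a swap, so only terms found by exactly one system move. The function name `shuffle_terms` is hypothetical.

```python
import random


def shuffle_terms(terms1, terms2, rng):
    """Randomly reassign terms that only one system extracted.

    Terms extracted by both systems stay in both outputs, because
    swapping their membership would change nothing.
    """
    shared = terms1 & terms2
    new1, new2 = set(shared), set(shared)
    # sorted() keeps the result reproducible for a given seed
    for term in sorted(terms1 ^ terms2):
        if rng.random() < 0.5:
            new1.add(term)
        else:
            new2.add(term)
    return new1, new2


if __name__ == "__main__":
    system1 = {"neural network", "corpus", "token"}
    system2 = {"corpus", "parser"}
    print(shuffle_terms(system1, system2, random.Random(0)))
```

The statistic (for example precision against a reference term list) is then recomputed on the two shuffled term sets.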
Script • http://www.clips.ua.ac.be/~vincent/software.html#art • http://www.clips.ua.ac.be/scripts/art • Options: • Exact and approximate randomization tests • Instance based, also for MBT • Term extraction based • Stratified shuffling • Two-sided / one-sided (check code!)
Remarks on usage • It makes no sense to shuffle if the exact randomization can be computed • The value of p depends on NS. The larger NS, the lower p can be: the smallest attainable value is 1 / (NS + 1) • Validity check • Sign test • Re-test: to alleviate bad randomization
Sign test • Can be compared with P for accuracy • H0: correctness is independent of system, i.e. P(green) = 0.5 • Binomial test
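The binomial test behind the sign test can be sketched as follows. The sketch assumes that only instances on which exactly one of the two systems is correct are counted, and that the test is two-sided; the function name `sign_test_p` and its parameter names are hypothetical.

```python
from math import comb


def sign_test_p(only_first_correct, only_second_correct):
    """Two-sided exact binomial test with success probability 0.5.

    Arguments are the numbers of instances on which only the first,
    respectively only the second, system is correct.
    """
    n = only_first_correct + only_second_correct
    if n == 0:
        return 1.0  # the systems never disagree in correctness
    k = min(only_first_correct, only_second_correct)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(sign_test_p(5, 5))   # 1.0
    print(sign_test_p(10, 0))  # 0.001953125
```

Because it only looks at per-instance correctness, this value is comparable to the randomization P computed with accuracy as the statistic, not with precision or F-score.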
Interpretation (1) • How much do these two systems differ based on precision for the A label? • Maximally • Intermediate • Minimally
Conclusion • Approximate randomization testing can be used for many applications. • The basic idea is to determine how (im)probable the actual difference between two systems is when all possible permutations of their outputs are evaluated. • The difference can be computed in many ways, as long as the shuffled elements are independent.