Approximate Randomization tests

Approximate Randomization tests, February 5th, 2013

Classic t-test

Why AR testing?
• Classic tests often assume a given distribution (Student's t, normal, …) of the variable
• This is ≈ OK for recall, but not for precision or F-score
• The set of hypotheses that can be tested with non-parametric tests is limited

Illustration
• 30,000 runs, 1000 instances, 500 of class A
• True positives (TP): 400 (stdev: 80)
• False positives (FP): 60 (stdev: 15)
• Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded by 0 and the number of instances.

Definitions
• Recall = truly predicted A / A in reference = truly predicted A / Cte (a constant). If the number of truly predicted A is normal, recall is normal.
• Precision = truly predicted A / A in system. A in system (TP + FP) is not constant, so precision is a non-linear combination of TP and FP. Precision is not normal.
• F-score: non-linear combination of recall and precision. Not normal.
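The illustration can be reproduced with a small simulation. This is a minimal sketch using the numbers from the slides (30,000 runs, 500 instances of class A, TP with mean 400 and stdev 80, FP with mean 60 and stdev 15). The function names `simulate` and `skewness`, the use of skewness as a rough indicator of non-normality, and the choice to leave TP unclipped at the number of instances are assumptions made for this example, not part of the original material.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def simulate(runs=30_000, n_class_a=500, seed=0):
    """Draw TP and FP from normal distributions and derive the three scores."""
    rng = random.Random(seed)
    recall, precision, f_score = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)   # true positives for class A
        fp = rng.gauss(60, 15)    # false positives for class A
        if tp <= 0 or fp < 0:     # assumption: skip the very rare impossible draws
            continue
        r = tp / n_class_a        # denominator is a constant
        p = tp / (tp + fp)        # denominator varies from run to run
        recall.append(r)
        precision.append(p)
        f_score.append(2 * p * r / (p + r))
    return recall, precision, f_score

recall, precision, f_score = simulate()
for name, values in (("recall", recall), ("precision", precision), ("F-score", f_score)):
    print(f"{name:9s} skewness = {skewness(values):+.2f}")
```

Because recall is TP divided by a constant, its distribution keeps the symmetric shape of TP, whereas precision comes out visibly skewed.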

Approximate randomization test
• No assumption on distribution
• Can handle complicated statistics
• Only assumption: independence between shuffled elements
• References:
  • Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
  • More accurate tests for the statistical significance of result differences, Yeh, 2000.

Basic idea
• Exact randomization test

Exact probability
H0: expert is independent of contents
P(n_correct ≥ 2) = 7/24 = 0.29
Thus, do not reject H0, because the probability is larger than alpha = 0.05.
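The value 7/24 can be checked by enumerating every ordering of the expert's answers. The sketch below assumes a setup with four glasses of distinct contents (inferred from 24 = 4!) and an expert who gets two of them right; the concrete labels and the function name `exact_p_value` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def exact_p_value(contents, expert):
    """P(number of correct guesses >= observed) over all orderings of the expert's labels."""
    observed = sum(c == e for c, e in zip(contents, expert))
    total = at_least_as_good = 0
    for shuffled in permutations(expert):
        total += 1
        if sum(c == s for c, s in zip(contents, shuffled)) >= observed:
            at_least_as_good += 1
    return Fraction(at_least_as_good, total)

contents = ["a", "b", "c", "d"]   # hypothetical true contents of four glasses
expert = ["a", "b", "d", "c"]     # hypothetical guesses: two of four correct
print(exact_p_value(contents, expert))   # 7/24
```

Of the 24 orderings, 6 have exactly two correct guesses, none has exactly three, and 1 has all four correct, which gives 7.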

Approximate probability
• The number of permutations is n!, so it increases quickly with n
• If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
• nge: number of times the pseudostatistic ≥ the actual statistic
• NS: number of shuffles
• +1: correction for validity
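The approximation can be sketched by replacing the full enumeration with random shuffles. The example assumes a hypothetical tasting setup with four glasses and an expert who gets two right; the function name `approximate_p_value`, the labels, and the default of 9999 shuffles are choices made for illustration only.

```python
import random

def approximate_p_value(contents, expert, n_shuffles=9999, seed=0):
    """Estimate P(correct >= observed) as (nge + 1) / (NS + 1) from random shuffles."""
    rng = random.Random(seed)
    observed = sum(c == e for c, e in zip(contents, expert))
    shuffled = list(expert)
    nge = 0   # number of times the pseudostatistic >= the actual statistic
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if sum(c == s for c, s in zip(contents, shuffled)) >= observed:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)

contents = ["a", "b", "c", "d"]   # hypothetical true contents
expert = ["a", "b", "d", "c"]     # hypothetical guesses: two of four correct
print(approximate_p_value(contents, expert))
```

With this many shuffles the estimate should land close to the exact value 7/24 ≈ 0.29. For such a small case the exact test is of course preferable.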

Different setups

Translation to instances
• Each glass is an instance
• Contents and expert are two labeling systems
• Contents has an accuracy of 100%, expert has an accuracy of 50%
• Statistic is precision, F-score, recall, … instead of accuracy

Stratified shuffling
• For labeled instances, it makes no sense to move the class label of one instance to another instance
• Only shuffle labels per instance (between the two systems' outputs for the same instance)
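A minimal sketch of stratified shuffling for two systems evaluated against the same gold labels: for each instance, the two predictions are swapped with probability 0.5, and the statistic is recomputed on the pseudo-outputs. The function names `precision` and `stratified_art`, the two-sided comparison on the absolute difference, and the convention that precision is 0 when a system never predicts the label are assumptions of this example, not a description of the script from the slides.

```python
import random

def precision(gold, predicted, label="A"):
    """Fraction of instances predicted as `label` that really are `label`."""
    hits = sum(1 for g, p in zip(gold, predicted) if p == label and g == label)
    n_predicted = sum(1 for p in predicted if p == label)
    return hits / n_predicted if n_predicted else 0.0

def stratified_art(gold, sys1, sys2, statistic=precision, n_shuffles=999, seed=0):
    """Two-sided approximate randomization test on |statistic(sys1) - statistic(sys2)|."""
    rng = random.Random(seed)
    actual = abs(statistic(gold, sys1) - statistic(gold, sys2))
    nge = 0
    for _ in range(n_shuffles):
        out1, out2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:   # swap the two labels of this instance only
                a, b = b, a
            out1.append(a)
            out2.append(b)
        pseudo = abs(statistic(gold, out1) - statistic(gold, out2))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)

gold = ["A"] * 50 + ["B"] * 50
flipped = ["B"] * 50 + ["A"] * 50          # hypothetical system that is always wrong
print(stratified_art(gold, gold, gold))    # identical systems: 1.0
print(stratified_art(gold, gold, flipped))
```

Any other statistic with the same `(gold, predicted)` signature, such as recall or F-score, can be passed in place of `precision`.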

MBT
• Assumption of independence between instances
• Shuffle per sentence rather than per token
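Shuffling per sentence can be sketched as follows, on the assumption that tags within one sentence depend on each other, so the sentence is the smallest unit that can be treated as independent. The function names `shuffle_per_sentence` and `flatten` and the example tags are hypothetical.

```python
import random

def shuffle_per_sentence(sents1, sents2, rng):
    """Swap the two systems' tag sequences for whole sentences, each with probability 0.5."""
    out1, out2 = [], []
    for s1, s2 in zip(sents1, sents2):
        if rng.random() < 0.5:
            s1, s2 = s2, s1
        out1.append(s1)
        out2.append(s2)
    return out1, out2

def flatten(sentences):
    """Turn a list of tag sequences into one flat list of tags for scoring."""
    return [tag for sentence in sentences for tag in sentence]

# Hypothetical output of two taggers on the same two sentences.
tagger1 = [["DT", "NN", "VB"], ["NN", "VB"]]
tagger2 = [["DT", "VB", "VB"], ["NN", "NN"]]
pseudo1, pseudo2 = shuffle_per_sentence(tagger1, tagger2, random.Random(0))
print(flatten(pseudo1), flatten(pseudo2))
```

The flattened pseudo-outputs can then be scored per token, while the randomization itself never splits a sentence between the two systems.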

Term extraction
• Shuffling extracted terms between the outputs of two term extraction systems

Script
• http://www.clips.ua.ac.be/~vincent/software.html#art
• http://www.clips.ua.ac.be/scripts/art
• Options:
  • Exact and approximate randomization tests
  • Instance-based, also for MBT
  • Term-extraction-based
  • Stratified shuffling
  • Two-sided / one-sided (check code!)

Remarks on usage
• It makes no sense to shuffle if the exact randomization can be computed
• The value of p depends on NS. The larger NS, the lower p can be: the smallest attainable value is 1 / (NS + 1)
• Validity check
• Sign test
• Re-test: to alleviate bad randomization

Sign test
• Can be compared with P for accuracy
• H0: correctness is independent of system, i.e. P(green) = 0.5
• Binomial test
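A minimal sketch of the sign test as a two-sided binomial test with success probability 0.5. It assumes the usual formulation in which only instances where exactly one of the two systems is correct are counted, and ties are dropped. The slides do not spell this out, and the function name `sign_test_p` is hypothetical.

```python
from math import comb

def sign_test_p(only_sys1_correct, only_sys2_correct):
    """Two-sided binomial p-value under H0: each system wins a disagreement with P = 0.5."""
    n = only_sys1_correct + only_sys2_correct
    if n == 0:
        return 1.0   # no disagreements, nothing to test
    k = max(only_sys1_correct, only_sys2_correct)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

print(sign_test_p(10, 0))   # 0.001953125
print(sign_test_p(5, 5))    # 1.0
```

The result can be set next to the randomization-test P computed with accuracy as the statistic, as a check that the two tests roughly agree.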

Interpretation (1)
• How much do these two systems differ based on precision for the A label?
• Maximally
• Intermediate
• Minimally

Interpretation (2)

Conclusion
• Approximate randomization testing can be used for many applications.
• The basic idea is to determine how (im)probable the actual difference between two systems is among all possible permutations of their outputs.
• The difference can be computed in many ways, as long as the shuffled elements are independent.
