
Approximate Randomization tests

Approximate Randomization tests. February 5th, 2013. Classic t-test. Why AR testing? Classic tests often assume a given distribution (Student's t, normal, …) of the variable. This is ≈ok for recall, but not for precision or F-score.

ashby

Presentation Transcript


  1. Approximate Randomization tests • February 5th, 2013

  2. Classic t-test

  3. Why AR testing? • Classic tests often assume a given distribution (Student's t, normal, …) of the variable • This is ≈ok for recall, but not for precision or F-score • The set of hypotheses that can be tested with non-parametric tests is limited

  4. Illustration • 30,000 runs, 1000 instances, 500 of class A • True positives (TP): 400 (stdev: 80) • False positives (FP): 60 (stdev: 15) • Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are restricted by 0 and the number of instances.

  5. Definitions • Recall = truly predicted A / A in reference = truly predicted A / constant. If the number of truly predicted A is normal, recall is normal. • Precision = truly predicted A / A in system. A in system = TP + FP, so precision is a non-linear combination of TP and FP. Precision is not normal. • F-score: non-linear combination of recall and precision. Not normal.
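The illustration and definitions above can be replayed with a small simulation. This is only a sketch: the function names, the clipping of TP and FP to their valid ranges, and the use of skewness as a rough normality indicator are assumptions, not part of the slides.

```python
import random
import statistics


def simulate(runs=30000, n_instances=1000, n_class_a=500, seed=0):
    """Draw TP and FP from normal distributions and derive the three scores."""
    rng = random.Random(seed)
    recalls, precisions, fscores = [], [], []
    for _ in range(runs):
        # Assumption: clip the draws so that counts stay in their valid ranges.
        tp = min(max(rng.gauss(400, 80), 0.0), n_class_a)
        fp = min(max(rng.gauss(60, 15), 0.0), n_instances - n_class_a)
        recall = tp / n_class_a  # linear in TP
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0  # non-linear
        if precision + recall > 0:
            fscore = 2 * precision * recall / (precision + recall)
        else:
            fscore = 0.0
        recalls.append(recall)
        precisions.append(precision)
        fscores.append(fscore)
    return recalls, precisions, fscores


def skewness(values):
    """Sample skewness; close to 0 for a symmetric (e.g. normal) distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return sum(((v - mean) / stdev) ** 3 for v in values) / len(values)


if __name__ == "__main__":
    for name, values in zip(("recall", "precision", "F-score"), simulate()):
        print(name, round(statistics.fmean(values), 3), round(skewness(values), 3))
```

Comparing the skewness of the three scores gives a rough impression of how far each one departs from a symmetric distribution; a histogram or a formal normality test would be the more careful check.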

  6. Approximate randomization test • No assumption on distribution • Can handle complicated statistics • Only assumption: independence between shuffled elements • References: • Computer-Intensive Methods for Testing Hypotheses, Noreen, 1989. • More accurate tests for the statistical significance of result differences, Yeh, 2000.

  7. Basic idea • Exact randomization test

  8. Exact probability H0: expert is independent of contents. P(ncorrect ≥ 2) = 7/24 = 0.29. Thus, do not reject H0, because the probability is larger than alpha = 0.05.
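The figure 7/24 is consistent with four glasses with distinct contents, of which the expert identifies two correctly: 4! = 24 permutations, 7 of which have at least two matches. A minimal sketch of that exact test follows; the labels "a" to "d" and the function names are hypothetical placeholders.

```python
from fractions import Fraction
from itertools import permutations


def n_correct(contents, guesses):
    """Number of glasses for which the guess matches the contents."""
    return sum(c == g for c, g in zip(contents, guesses))


def exact_p_value(contents, guesses):
    """P(n_correct >= observed) over all permutations of the guesses."""
    actual = n_correct(contents, guesses)
    n_ge = 0
    total = 0
    for shuffled in permutations(guesses):
        total += 1
        if n_correct(contents, shuffled) >= actual:
            n_ge += 1
    return Fraction(n_ge, total)


contents = ["a", "b", "c", "d"]  # hypothetical true contents
expert = ["a", "b", "d", "c"]    # hypothetical guesses, two of them correct
print(exact_p_value(contents, expert))  # 7/24
```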

  9. Approximate probability • The number of permutations is n!, so it increases quickly with n • If there are too many permutations to compute, use the approximation P = (nge + 1) / (NS + 1) • nge: number of times pseudostatistic ≥ actual statistic • NS: number of shuffles • +1: correction for validity
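The approximation can be sketched by replacing the full enumeration with NS random shuffles. The function names, the seed, and the example labels are assumptions made for illustration; the statistic here is the number of correct guesses from the glass example.

```python
import random


def n_correct(reference, labels):
    """Number of positions where the two labelings agree."""
    return sum(r == l for r, l in zip(reference, labels))


def approximate_p_value(reference, labels, n_shuffles=9999, seed=0):
    """Estimate the p-value as (nge + 1) / (NS + 1) over random shuffles."""
    rng = random.Random(seed)
    actual = n_correct(reference, labels)
    shuffled = list(labels)
    nge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        # Count shuffles whose pseudostatistic reaches the actual statistic.
        if n_correct(reference, shuffled) >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    print(approximate_p_value(list("abcd"), list("abdc")))
```

With only four glasses the exact test is cheap, so this is purely for comparison: the estimate should land near the exact value, and it can never drop below 1 / (NS + 1).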

  10. Different setups

  11. Translation to instances • Each glass is an instance • Contents and expert are two labeling systems • Contents has an accuracy of 100%, expert has an accuracy of 50% • Statistic is precision, F-score, recall, … instead of accuracy

  12. Stratified shuffling • For labeled instances, it makes no sense to shuffle the class label of one instance to another • Only shuffle labels per instance
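One way to read "shuffle labels per instance" is that, for every instance, the predictions of the two systems are swapped with probability 0.5, so a label never moves to a different instance. The sketch below follows that reading; the function names, the two-sided absolute difference, and the toy data are assumptions, not taken from the script mentioned in the slides.

```python
import random


def precision(gold, predicted, label):
    """Precision for one class label; 0.0 if the label is never predicted."""
    hits = [g == label for g, p in zip(gold, predicted) if p == label]
    return sum(hits) / len(hits) if hits else 0.0


def art_stratified(gold, sys1, sys2, statistic, n_shuffles=999, seed=0):
    """Approximate randomization test with per-instance label swapping."""
    rng = random.Random(seed)
    # Assumption: two-sided test on the absolute difference of the statistic.
    actual = abs(statistic(gold, sys1) - statistic(gold, sys2))
    nge = 0
    for _ in range(n_shuffles):
        shuf1, shuf2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:  # swap the two predictions for this instance
                a, b = b, a
            shuf1.append(a)
            shuf2.append(b)
        pseudo = abs(statistic(gold, shuf1) - statistic(gold, shuf2))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    gold = list("AABBABAB")  # hypothetical reference labels
    sys1 = list("AABBABBB")  # hypothetical output of system 1
    sys2 = list("ABABBBAA")  # hypothetical output of system 2
    print(art_stratified(gold, sys1, sys2, lambda g, p: precision(g, p, "A")))
```

Passing the statistic as a function keeps the test itself unchanged when precision is replaced by recall or F-score.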

  13. MBT • Assumption of independence between instances • Shuffle per sentence rather than per token
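For tagger output, tokens within a sentence are not independent, so the unit that gets swapped is the whole sentence. A minimal sketch, assuming each system's output is a list of sentences, each a list of tags, and using token accuracy as the statistic (all names and the two-sided difference are hypothetical choices):

```python
import random


def token_accuracy(gold_sents, pred_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total


def art_per_sentence(gold_sents, sys1, sys2, n_shuffles=999, seed=0):
    """Approximate randomization test that swaps whole sentences."""
    rng = random.Random(seed)
    actual = abs(token_accuracy(gold_sents, sys1) - token_accuracy(gold_sents, sys2))
    nge = 0
    for _ in range(n_shuffles):
        shuf1, shuf2 = [], []
        for s1, s2 in zip(sys1, sys2):
            # All tags of a sentence stay together; only the system changes.
            if rng.random() < 0.5:
                s1, s2 = s2, s1
            shuf1.append(s1)
            shuf2.append(s2)
        pseudo = abs(token_accuracy(gold_sents, shuf1) - token_accuracy(gold_sents, shuf2))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)
```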

  14. Term extraction • Shuffling extracted terms between the outputs of two term extraction systems
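The slide does not spell out how terms are shuffled. One plausible reading, sketched here purely as an assumption, treats each output as a set of terms: a term found by both systems stays in both, and a term found by only one system moves to the other with probability 0.5. The names and the choice of precision against a gold term set are hypothetical.

```python
import random


def precision(gold_terms, extracted):
    """Fraction of extracted terms that are in the gold term set."""
    return len(gold_terms & extracted) / len(extracted) if extracted else 0.0


def art_terms(gold_terms, out1, out2, n_shuffles=999, seed=0):
    """Approximate randomization test over two sets of extracted terms."""
    rng = random.Random(seed)
    actual = abs(precision(gold_terms, out1) - precision(gold_terms, out2))
    shared = out1 & out2
    unique = sorted((out1 | out2) - shared)  # sorted for reproducibility
    nge = 0
    for _ in range(n_shuffles):
        shuf1, shuf2 = set(shared), set(shared)
        for term in unique:
            in_first = term in out1
            if rng.random() < 0.5:  # move the term to the other system
                in_first = not in_first
            (shuf1 if in_first else shuf2).add(term)
        pseudo = abs(precision(gold_terms, shuf1) - precision(gold_terms, shuf2))
        if pseudo >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)
```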

  15. Script • http://www.clips.ua.ac.be/~vincent/software.html#art • http://www.clips.ua.ac.be/scripts/art • Options: • Exact and approximate randomization tests • Instance based, also for MBT • Term extraction based • Stratified shuffling • Two-sided / one-sided (check code!)

  16. Remarks on usage • It makes no sense to shuffle if exact randomization can be computed • The value of p depends on NS. The larger NS, the lower p can be • Validity check • Sign test • Re-test: to alleviate bad randomization

  17. Sign test • Can be compared with P for accuracy • H0: correctness is independent of system, i.e. P(green) = 0.5 • Binomial test
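The binomial test behind the sign test can be sketched as follows. The inputs are assumed to be the number of instances that only system 1 gets right and the number that only system 2 gets right, with ties dropped; that input convention, the two-sided form, and the function name are assumptions rather than details from the slides.

```python
from math import comb


def sign_test(n_only_sys1, n_only_sys2):
    """Two-sided binomial test of H0: P(system 1 is the correct one) = 0.5."""
    n = n_only_sys1 + n_only_sys2
    k = min(n_only_sys1, n_only_sys2)
    # Probability of a split at least as uneven as the observed one in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test(0, 10))  # 0.001953125
```

Because it only looks at which system is correct per instance, this p-value is comparable to a randomization test on accuracy, but not to one on precision or F-score.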

  18. Interpretation (1) • How much do these two systems differ based on precision for the A label? • Maximally • Intermediate • Minimally

  19. Interpretation (2)

  20. Conclusion • Approximate randomization testing can be used for many applications. • The basic idea is to check how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated. • The difference can be computed in many ways, as long as the shuffled elements are independent.
