
The Significance of Result Differences


libitha


Presentation Transcript


  1. The Significance of Result Differences

  2. Why Significance Tests? • everybody knows we have to test the significance of our results • but do we really? • evaluation results are valid for • data from a specific corpus • extracted with specific methods • for a particular type of collocation • according to the intuitions of one particular annotator (or two)

  3. Why Significance Tests? • significance tests are about generalisations • basic question: "If we repeated the evaluation experiment (on similar data), would we get the same results?" • influence of source corpus, domain, collocation type and definition, annotation guidelines, ...

  4. Evaluation of Association Measures

  5. Evaluation of Association Measures

  6. A Different Perspective • pair types are described by tables (O11, O12, O21, O22) = coordinates in 4-D space • O22 is redundant because O11 + O12 + O21 + O22 = N • can also describe pair type by joint and marginal frequencies (f, f1, f2) = "coordinates" in 3-D space

  7. A Different Perspective • data set = cloud of points in three-dimensional space • visualisation is "challenging" • many association measures depend on O11 and E11 only (MI, gmean, t-score, binomial) • projection to (O11, E11) coordinates in 2-D space (ignoring the ratio f1 / f2)
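
As a rough illustration of slides 6 and 7, the sketch below maps a pair type given as (f, f1, f2, N) to its contingency table and to the (O11, E11) coordinates. It assumes the usual definitions E11 = f1 · f2 / N, MI = log2(O11 / E11) and t-score = (O11 − E11) / √O11, which the slides name but do not spell out. The function name and the example frequencies are hypothetical.

```python
import math

def pair_type_coordinates(f, f1, f2, n):
    """Map (f, f1, f2, N) to (O11, E11, MI, t-score).

    Assumes f > 0 and the standard definitions of E11, MI and t-score;
    the slides list these measures without giving formulas.
    """
    o11 = f
    o12 = f1 - f
    o21 = f2 - f
    o22 = n - f1 - f2 + f  # redundant: fixed by O11 + O12 + O21 + O22 = N
    if min(o11, o12, o21, o22) < 0:
        raise ValueError("frequencies are inconsistent")
    e11 = f1 * f2 / n                   # expected joint frequency under independence
    mi = math.log2(o11 / e11)           # depends on O11 and E11 only
    t_score = (o11 - e11) / math.sqrt(o11)
    return o11, e11, mi, t_score

# Hypothetical pair type: f = 30, f1 = 100, f2 = 200 in a sample of N = 100,000 pairs.
o11, e11, mi, t_score = pair_type_coordinates(30, 100, 200, 100_000)
```

Because both scores are computed from `o11` and `e11` alone, every pair type with the same (O11, E11) projection receives the same MI and t-score, whatever its ratio f1 / f2.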

  8. The Parameter Space of Collocation Candidates

  9. The Parameter Space of Collocation Candidates

  10. The Parameter Space of Collocation Candidates

  11. The Parameter Space of Collocation Candidates

  12. The Parameter Space of Collocation Candidates

  13. N-best Lists in Parameter Space • N-best list for an AM includes all pair types where score ≥ c (threshold c obtained from data) • {score ≥ c} describes a subset of the parameter space • for a sound association measure, the isoline {score = c} is the lower boundary (because scores should increase with O11 for a fixed value of E11)
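
The relation between an N-best list and its threshold c can be sketched as follows. This is a minimal example that assumes scores are already available as a mapping from pair type to score; the function name and the toy scores are hypothetical.

```python
def n_best_with_threshold(scores, n):
    """Return the n highest-scoring pair types and the threshold c.

    c is the score of the n-th ranked candidate, i.e. it is obtained
    from the data rather than fixed in advance.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:n]
    c = top[-1][1]
    return [pair for pair, _ in top], c

# Toy scores for five hypothetical pair types.
toy_scores = {"a b": 7.1, "c d": 3.4, "e f": 9.8, "g h": 0.5, "i j": 5.0}
best, c = n_best_with_threshold(toy_scores, 3)

# Without ties, the N-best list is the set of pair types with score >= c.
above = {pair for pair, score in toy_scores.items() if score >= c}
```

With tied scores at the threshold, the set of pair types scoring at least c can be larger than N, so some tie-breaking rule is needed in practice.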

  14. N-Best Isolines in the Parameter Space MI

  15. N-Best Isolines in the Parameter Space MI

  16. N-Best Isolines in the Parameter Space t-score

  17. N-Best Isolines in the Parameter Space t-score

  18. 95% Confidence Interval

  19. 99% Confidence Interval

  20. 95% Confidence Interval

  21. Comparing Precision Values • number of TPs and FPs for 1000-best lists

  22. McNemar's Test + = in 1000-best list – = not in 1000-best list • ideally: all TPs in 1000-best list (possible!) • H0: differences between AMs are random

  23. McNemar's Test + = in 1000-best list – = not in 1000-best list > mcnemar.test(tbl) • p-value < 0.001 → highly significant
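
For readers without R at hand, the calculation behind a McNemar test can be sketched in plain Python. This is the continuity-corrected chi-squared form with one degree of freedom, not a reimplementation of R's `mcnemar.test`; the function name and the counts are hypothetical, since the slide's table did not survive in the transcript.

```python
import math

def mcnemar_chi2(only_a, only_b):
    """Continuity-corrected McNemar statistic and p-value.

    only_a: TPs in the 1000-best list of AM A but not of AM B (+/- cell)
    only_b: TPs in the 1000-best list of AM B but not of AM A (-/+ cell)
    TPs found by both or by neither AM do not enter the statistic.
    """
    discordant = only_a + only_b
    if discordant == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    diff = max(abs(only_a - only_b) - 1, 0)
    statistic = diff ** 2 / discordant
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: 10 TPs found only by AM A, 30 found only by AM B.
statistic, p_value = mcnemar_chi2(10, 30)
```

Under H0 the discordant TPs should split roughly evenly between the two AMs, so a large imbalance gives a small p-value.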

  24. Significant Differences

  25. Significant Differences

  26. Significant Differences (legend: markers for significant differences and for relevant differences (2%))

  27. Lowest-Frequency Data: Samples • Too much data for full manual evaluation → random samples • AdjN data • 965 pairs with f = 1 (15% sample) • manually identified 31 TPs (3.2%) • PNV data • 983 pairs with f < 3 (0.35% sample) • manually identified 6 TPs (0.6%)

  28. Lowest-Frequency Data: Samples • Estimate proportion p of TPs among all lowest-frequency data • Confidence set from binomial test • AdjN: 31 TPs among 965 items • p ≤ 5% with 99% confidence • at most ≈ 320 TPs • PNV: 6 TPs among 983 items • p ≤ 1.5% with 99% confidence • there might still be ≈ 4200 TPs !!
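
The upper bounds on this slide can be approximated with an exact binomial confidence bound. The sketch below assumes a one-sided bound at the 99% level (the slide does not say whether the confidence set is one- or two-sided) and finds it by bisection on the binomial distribution function. The function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def upper_bound(k, n, confidence=0.99):
    """One-sided exact upper confidence bound for the proportion p.

    Returns the p at which observing k or fewer TPs among n items has
    probability 1 - confidence. Assumes k < n.
    """
    alpha = 1 - confidence
    lo, hi = k / n, 1.0
    for _ in range(100):              # bisection: binom_cdf decreases as p grows
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

adjn_bound = upper_bound(31, 965)     # AdjN: 31 TPs among 965 items
pnv_bound = upper_bound(6, 983)       # PNV: 6 TPs among 983 items

# Scale up by the sampling fractions given on the previous slide (15% and 0.35%).
adjn_max_tps = adjn_bound * 965 / 0.15
pnv_max_tps = pnv_bound * 983 / 0.0035
```

Under this one-sided assumption both bounds should fall just below the rounded 5% and 1.5% figures on the slide, and the scaled-up counts land in the same range as the ≈ 320 and ≈ 4200 TPs quoted there.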

  29. N-best Lists for Lowest-Frequency Data • evaluate 10,000-best lists • to reduce manual annotation work, take 10% sample from each list (i.e. 1,000 candidates for each AM) • precision graphs for N-best lists • up to N = 10,000 for the PNV data • 95% confidence estimates for precision of best-performing AM (from binomial test)
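
The sampling step can be sketched as below. The example assumes simple random sampling without replacement and a hypothetical `is_tp` callback that stands in for the manual annotation; neither detail is specified on the slide.

```python
import random

def sample_precision(n_best_list, is_tp, fraction=0.1, seed=0):
    """Estimate the precision of an N-best list from a random sample.

    Only the sampled candidates need manual annotation; is_tp is a
    placeholder for the annotator's judgement.
    """
    rng = random.Random(seed)         # fixed seed so the sample is reproducible
    k = round(len(n_best_list) * fraction)
    sample = rng.sample(n_best_list, k)
    true_positives = sum(1 for candidate in sample if is_tp(candidate))
    return true_positives / k, k

# Toy 10,000-best list in which every tenth candidate counts as a TP.
toy_list = list(range(10_000))
precision, sample_size = sample_precision(toy_list, lambda cand: cand % 10 == 0)
```

The resulting value is only an estimate of the precision of the full list, which is why the slide attaches confidence estimates to it.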

  30. Random Sample Evaluation

  31. Random Sample Evaluation

  32. Random Sample Evaluation
