Sampling Uncertainty in Verification Measures for Binary Deterministic Forecasts • Ian Jolliffe and David Stephenson • Sampling uncertainty and sampling schemes for (2x2) tables • Hit rate • Extensions – other measures and serial correlation EMS September 2013
Binary deterministic forecasts • Such forecasts are fairly common – forecast whether or not an event will occur • Their format leads to a (2x2) contingency table EMS September 2013
(2 x 2) table and some verification measures • Cell counts: a = event forecast and observed (hit), b = forecast but not observed (false alarm), c = observed but not forecast (miss), d = neither, with n = a+b+c+d • a/(a+c) Hit rate (H) = probability of detection • b/(b+d) False alarm rate (F) = probability of false detection • H-F Peirce’s (1884) skill score (PSS) • (a+d)/n Proportion correct (PC) • (a+b)/(a+c) Frequency bias • a/(a+b+c) Critical success index (CSI) = threat score • … many more – 18 in Chapter 3 (by Hogan & Mason) in Jolliffe and Stephenson (2012) Forecast Verification: A Practitioner’s Guide in Atmospheric Science, 2nd edition, Wiley. EMS September 2013
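As a quick illustration of these definitions, here is a minimal Python sketch (not part of the talk) computing the measures from the four cell counts; the counts used are made up for the example, not the Finley data.

```python
# Minimal sketch: verification measures from a (2x2) contingency table.
# The counts are illustrative only, not the Finley tornado data.
a, b, c, d = 8, 10, 2, 80          # hits, false alarms, misses, correct rejections
n = a + b + c + d

H = a / (a + c)                    # hit rate / probability of detection
F = b / (b + d)                    # false alarm rate / probability of false detection
PSS = H - F                        # Peirce's skill score
PC = (a + d) / n                   # proportion correct
bias = (a + b) / (a + c)           # frequency bias
CSI = a / (a + b + c)              # critical success index / threat score

print(f"H={H:.3f} F={F:.3f} PSS={PSS:.3f} PC={PC:.3f} bias={bias:.2f} CSI={CSI:.3f}")
```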
Uncertainty/inference for verification measures • Given the value of some verification measure, some idea of its uncertainty is needed to make inferences, e.g. to construct confidence intervals • The example used here is a subset of the well-known Finley tornado data for May 1884; the resampling figures that follow are based on these data. EMS September 2013
Sampling schemes • Could have: • a, b, c, d all independent Poisson • n fixed; a, b, c, d multinomial • Row totals fixed or column totals fixed – independent binomials • Row totals and column totals fixed – hypergeometric Which is most plausible? Does it make much difference? EMS September 2013
[Figure: histograms of resampled hit rates under multinomial sampling and under binomial sampling] • Binomial sampling has fixed a+c=10 and so the hit rate is always a multiple of 1/10 • Multinomial sampling shows additional variation in hit rates between these multiples of 1/10 EMS September 2013
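A rough sketch of the kind of resampling behind such histograms is given below. The cell probabilities are illustrative assumptions (chosen so that the expected number of observed events is 10 when n = 540), not the actual Finley counts.

```python
# Sketch: resampled hit rates under multinomial sampling (n fixed) versus
# binomial sampling (a+c fixed at 10).  Cell probabilities are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 540
p = np.array([0.01, 0.05, 0.00852, 0.93148])  # assumed P(a), P(b), P(c), P(d)
theta_H = p[0] / (p[0] + p[2])                # population hit rate

# Binomial scheme: a+c fixed at 10, so each resampled H is a multiple of 1/10
H_binom = rng.binomial(10, theta_H, size=10000) / 10

# Multinomial scheme: only n fixed, so a+c (the denominator of H) also varies
counts = rng.multinomial(n, p, size=10000)
a, c = counts[:, 0], counts[:, 2]
keep = (a + c) > 0                            # drop tables with no observed events
H_multi = a[keep] / (a + c)[keep]

print("binomial:    var(H) =", H_binom.var())
print("multinomial: var(H) =", H_multi.var())
```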
Sampling schemes • Could have: • a, b, c, d all independent (Poisson) • n fixed; a, b, c, d multinomial • Row totals fixed or column totals fixed – independent binomials • Row totals and column totals fixed – hypergeometric • The second of these is the most plausible for much climate data – but you may disagree!! • Hogan & Mason (Chapter 3 of Jolliffe & Stephenson) give (approximate) variances for 16 measures, but they assume column totals fixed. EMS September 2013
Variance of hit rate • Hit rate or probability of detection is H = a/(a+c) • Suppose that (a+c) is fixed (binomial sampling) and that θH is the probability that the event has been forecast, given that it occurred • Then var(H) = θH(1 − θH)/(a+c), which is estimated by ac/(a+c)³ • The multinomial sampling scheme can be obtained by first sampling (a+c) from a binomial with n trials and probability of success equal to the probability of the event occurring (base rate); then, given the sampled value of (a+c), sampling a from the binomial with (a+c) trials and probability of success θH EMS September 2013
Variances of hit rate II • It turns out that with multinomial sampling, var(H) = θH(1 − θH)/(a+c) is replaced by var(H) = θH(1 − θH)E[1/(a+c)], with slight abuse of notation • Using a variance expression based on fixed (a+c) ignores the variability in (a+c) that occurs under multinomial sampling • There is a complication that (a+c) can equal zero, leading to an infinite value of E[1/(a+c)], but data with (a+c) = 0 can be ignored as they provide no information on the performance of the forecasts EMS September 2013
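Since the θH(1 − θH) factor cancels, the ratio of the multinomial to the binomial variance reduces to (a+c) × E[1/(a+c) | a+c > 0], with (a+c) treated as binomial with success probability equal to the observed base rate. The sketch below is an assumed reconstruction of that calculation, not the authors' own code; the printed values can be checked against the roughly 30% maximum inflation near (a+c) = 4 noted on a later slide.

```python
# Sketch: ratio of multinomial to binomial variance of the hit rate.
#   ratio = (a+c)_obs * E[ 1/(a+c) | a+c > 0 ],
# with a+c ~ Binomial(n, (a+c)_obs / n).  Assumed reconstruction only.
from math import comb

def variance_ratio(n: int, m: int) -> float:
    """m is the observed a+c; returns the multinomial/binomial variance ratio."""
    s = m / n                                   # estimated base rate
    pmf = [comb(n, k) * s**k * (1 - s)**(n - k) for k in range(n + 1)]
    p_pos = 1.0 - pmf[0]                        # condition on a+c > 0
    e_inv = sum(pmf[k] / k for k in range(1, n + 1)) / p_pos
    return m * e_inv

for m in (2, 4, 10, 20, 50):
    print(f"n=100, a+c={m:2d}: ratio = {variance_ratio(100, m):.3f}")
```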
Multinomial vs. binomial comparison for hit rate • The table gives, for n=100, some values of the ratio of multinomial vs. binomial variances for various values of (a+c) • The diagram shows this ratio for more values of (a+c) and three values of n EMS September 2013
Multinomial vs. binomial comparison • Inflation of variance for most values of (a+c) • Exception for very small values of (a+c) – due to frequently discarded zero values? • Maximum inflation of around 30% occurs around (a+c) = 4 • Inflation decreases towards 0 as (a+c) increases • A remarkable similarity of curves for different n • For the tornado data, multinomial variance is 12.7% larger than for binomial EMS September 2013
Extensions • Only one measure (hit rate) has been examined here • Exactly the same reasoning can be used for other measures with a similar ratio formula • Modifications are needed for other measures • Serial correlation is another complication – the results given assume independence, which is not necessarily true, and it can have a bigger effect than the choice of sampling scheme. EMS September 2013
Conclusions • When reporting values of verification measures it is important to quantify the uncertainty associated with those values • For the seemingly simple case of data in a (2x2) contingency table this is a surprisingly subtle task, because • Different sampling schemes lead to different variances • Serial correlation (or other forms of dependence) also changes variances • Some fairly general results can be found, but for many measures and situations tailor-made calculations may be needed • Notwithstanding the difficulties, the calculations should be done EMS September 2013
Questions? Comments? i.t.jolliffe@exeter.ac.uk EMS September 2013
Other verification measures • Exactly the same reasoning can be used to obtain multinomial-based variances for measures which are proportions, with the denominator equal to a sum of cell counts and the numerator a sum of a subset of the denominator counts, for example • F = False alarm rate b/(b+d) • J = Threat score a/(a+b+c) • The variance comparison table for H can be used • For F, replacing (a+c) by (b+d) • For J, replacing (a+c) by (a+b+c). The comparison here is with an unrealistic sampling scheme, which nonetheless corresponds to a variance estimate given in the literature. EMS September 2013
Other verification measures II • For proportion correct, there are exact analytic expressions for variance under both binomial and multinomial sampling, which can be compared • For the tornado data, the percentage increases in variance for multinomial sampling compared to the alternative scheme assumed by the table are 12.7 (H), 3.4 (J) and 17.5 (PC) • Asymptotic expressions are available for some other measures, but different considerations are needed for exact values, possibly including simulation EMS September 2013
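One way such exact expressions can be written down is sketched below, assuming the comparator is the column-totals-fixed (binomial) scheme of Hogan & Mason; that reading, the counts used, and the code itself are illustrative assumptions, not material from the talk.

```python
# Sketch: exact variance of proportion correct PC = (a+d)/n under two schemes.
# Assumes the fixed-margin comparator is the column-totals-fixed scheme.
def pc_variances(a: int, b: int, c: int, d: int):
    n = a + b + c + d
    theta_pc = (a + d) / n          # plug-in estimate of P(correct)
    theta_h = a / (a + c)           # P(forecast yes | observed yes)
    theta_f = b / (b + d)           # P(forecast yes | observed no)

    # Multinomial (only n fixed): a+d ~ Binomial(n, theta_pc)
    var_multi = theta_pc * (1 - theta_pc) / n

    # Column totals fixed: a ~ Bin(a+c, theta_h), d ~ Bin(b+d, 1-theta_f), independent
    var_fixed = ((a + c) * theta_h * (1 - theta_h)
                 + (b + d) * theta_f * (1 - theta_f)) / n**2
    return var_multi, var_fixed

v_m, v_f = pc_variances(8, 10, 2, 80)   # illustrative counts only
print(f"multinomial: {v_m:.5f}  fixed column totals: {v_f:.5f}  ratio: {v_m / v_f:.2f}")
```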
Serial correlation – another complication • All that has been said has assumed independence of the n observations being forecast • This is not necessarily true – there may be serial correlation. Rain today may be more likely if there was rain yesterday than if there was not • Serial correlation can have a bigger effect on variance than assuming the wrong sampling scheme EMS September 2013
Serial correlation – an example • Gabriel & Neumann (1962), QJRMS, 88, 90-95, give data on wet/dry days in Tel Aviv for 27 years of daily data, November-April • There is serial correlation – for example, for November the probability of a wet day following a wet (dry) day is 0.60 (0.13); for a two-state Markov chain the lag-1 correlation is the difference of these, 0.60 − 0.13 = 0.47 • To assess how much such serial correlation affects variances of verification measures, use Markov chain simulation EMS September 2013
Markov chain simulation • Wilks (2010), QJRMS, 136, 2109-2118 considers probability forecasts and builds in serial dependence between forecasts directly • We consider binary deterministic forecasts with dependence built directly into the observations and hence indirectly into the forecasts • We simulate from a two-state Markov chain for various values of n (sample size), s (base rate) and ρ, the serial correlation EMS September 2013
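A rough sketch of such a simulation is given below. The forecast mechanism (forecast the event with probability θH when it occurs and θF when it does not) and the parameter values are assumptions made for the illustration, not details taken from the talk.

```python
# Sketch of a Markov-chain simulation of serially correlated observations.
# Assumptions not in the talk: forecasts are generated conditionally on each
# observation (hit probability theta_h, false-alarm probability theta_f), and
# transition probabilities follow from base rate s and lag-1 correlation rho:
#   P(event | event) = s + rho*(1 - s),   P(event | no event) = s*(1 - rho).
import numpy as np

def var_hit_rate(n, s, rho, theta_h=0.5, theta_f=0.02, reps=4000, seed=0):
    rng = np.random.default_rng(seed)
    p11, p01 = s + rho * (1 - s), s * (1 - rho)
    hit_rates = []
    for _ in range(reps):
        x = np.empty(n, dtype=bool)
        x[0] = rng.random() < s
        for t in range(1, n):
            x[t] = rng.random() < (p11 if x[t - 1] else p01)
        if x.sum() == 0:                     # no observed events: no hit rate
            continue
        f = rng.random(n) < np.where(x, theta_h, theta_f)   # forecasts
        hit_rates.append((f & x).sum() / x.sum())
    return np.var(hit_rates)

# Roughly the May tornado setting discussed later: n = 540, base rate 0.02
v0 = var_hit_rate(n=540, s=0.02, rho=0.0)
v1 = var_hit_rate(n=540, s=0.02, rho=0.5)
print(f"var(H): rho=0 {v0:.4f}, rho=0.5 {v1:.4f}, ratio {v1 / v0:.2f}")
```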
With vs. without serial correlation: comparison for hit rate • The table gives, for n=100, some values of the ratio of variances with/without serial correlation for various values of (a+c) and ρ • The diagram shows this ratio for more values of (a+c) and three values of n EMS September 2013
Serial correlation – simulation results • The ratio gets bigger with increasing ρ • The largest values are bigger than when comparing sampling schemes • For given n, the inflation gets worse as (a+c) decreases • It is also worse for lower base rates EMS September 2013
Serial correlation - examples • The Gabriel/Neumann data have large n and moderate s and ρ, so the effect of serial correlation is small • For example, for November, ρ=0.47, s=0.24 and n=810, leading to only a 1% increase in variance • For the May tornado data, n is again large (540) but s is much smaller (0.02). We don’t know ρ, but if it were 0.5, then variance would be increased by about 30% by serial correlation. • In reality non-independence is likely to exist in the tornado data, but will be more complex, with both space and time involved EMS September 2013