Forecast verification - did we get it right? Ian Jolliffe Universities of Reading, Southampton, Kent, Aberdeen i.t.jolliffe@reading.ac.uk
Outline of talk • Introduction – what, why, how? • Binary forecasts • Performance measures, ROC curves • Desirable properties • Of forecasts • Of performance measures • Other forecasts • Multi-category, continuous, (probability) • Value
Forecasts • Forecasts are made in many disciplines • Weather and climate • Economics • Sales • Medical diagnosis
Why verify/validate/assess forecasts? • Decisions are based on past data but also on forecasts of data not yet observed • A look back at the accuracy of forecasts is necessary to determine whether current forecasting methods should be continued, abandoned or modified
Two (very different) recent references • I.T. Jolliffe and D.B. Stephenson (eds.) (2003) Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley. • M.S. Pepe (2003) The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
Horses for courses • Different types of forecast need different methods of verification, for example in the context of weather hazards (TSUNAMI project): • binary data - damaging frost: yes/no • categorical - storm damage: slight/moderate/severe • discrete - how many land-falling hurricanes in a season • continuous - height of a high tide • probabilities – of tornado • Some forecasts (wordy/ descriptive) are very difficult to verify at all
Binary forecasts • Such forecasts might be • Whether temperature will fall below a threshold, damaging crops or forming ice on roads • Whether maximum river flow will exceed a threshold, causing floods • Whether mortality due to extreme heat will exceed some threshold (PHEWE project) • Whether a tornado will occur in a specified area • The classic Finley Tornado example (next 2 slides) illustrates that assessing such forecasts is more subtle than it looks • There are many possible verification measures – most have some poor properties
Tornado forecasts • Correct decisions: 2708/2803 = 96.6% • Correct decisions by a procedure which always forecasts ‘No Tornado’: 2752/2803 = 98.2% • It’s easy to forecast ‘No Tornado’ and get it right, but much harder to forecast when tornadoes will occur • Correct decisions when a tornado is forecast: 28/100 = 28.0% • Correct forecasts of observed tornadoes: 28/51 = 54.9%
Some verification measures for (2 x 2) tables • Here a = hits, b = false alarms, c = misses, d = correct rejections and n = a + b + c + d • a/(a+c) Hit rate = true positive fraction = sensitivity • b/(b+d) False alarm rate = 1 – specificity • b/(a+b) False alarm ratio = 1 – positive predictive value • d/(c+d) Negative predictive value • (a+d)/n Proportion correct (PC) • (a+b)/(a+c) Bias
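As an illustration, the sketch below computes these measures for the Finley tornado counts implied by the percentages quoted above (a = 28, b = 72, c = 23, d = 2680), reconstructed here purely for illustration.

```python
# Basic verification measures for a (2 x 2) forecast/observation table.
# Convention: a = hits, b = false alarms, c = misses, d = correct rejections.
# Counts are the Finley tornado figures implied by the percentages above.

a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

hit_rate = a / (a + c)                  # true positive fraction / sensitivity
false_alarm_rate = b / (b + d)          # 1 - specificity
false_alarm_ratio = b / (a + b)         # 1 - positive predictive value
neg_pred_value = d / (c + d)            # negative predictive value
proportion_correct = (a + d) / n        # PC
bias = (a + b) / (a + c)                # forecasts of event / observations of event

print(f"hit rate           = {hit_rate:.3f}")
print(f"false alarm rate   = {false_alarm_rate:.3f}")
print(f"false alarm ratio  = {false_alarm_ratio:.3f}")
print(f"neg. pred. value   = {neg_pred_value:.3f}")
print(f"proportion correct = {proportion_correct:.3f}")
print(f"bias               = {bias:.3f}")
```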
Skill scores • A skill score is a verification measure adjusted to show improvement over some unskilful baseline, typically a forecast of ‘climatology’, a random forecast or a forecast of persistence. Usually adjustment gives zero value for the baseline and unity for a perfect forecast. • For (2x2) tables we know how to calculate ‘expected’ values in the cells of the table under a null hypothesis of no association (no skill) for a χ2 test.
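To make the (2 x 2) baseline explicit (a sketch using the a, b, c, d convention above and the standard chi-squared expected counts, not wording from the talk):

```latex
% Expected counts under no association (no skill), and the resulting
% no-skill baseline for the proportion correct:
E(a) = \frac{(a+b)(a+c)}{n}, \qquad E(d) = \frac{(c+d)(b+d)}{n},
\qquad E = \frac{(a+b)(a+c) + (c+d)(b+d)}{n^{2}},
\qquad \text{skill score} = \frac{PC - E}{1 - E}.
```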
More (2 x 2) verification measures • (PC – E)/(1 – E), where E is the expected value of PC assuming no skill – the Heidke (1926) skill score = Cohen’s kappa (1960), also Doolittle (1885) • a/(a+b+c) Critical success index (CSI) = threat score • Gilbert’s (1884) skill score – a skill score version of CSI • (ad – bc)/(ad + bc) Yule’s Q (1900), a skill score version of the odds ratio ad/bc • a(b+d)/[b(a+c)]; c(b+d)/[d(a+c)] Diagnostic likelihood ratios • Note that neither the list of measures nor the list of names is exhaustive – see, for example, J.A. Swets (1986), Psychological Bulletin, 99, 100–117
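Continuing the sketch above with the same reconstructed Finley counts (these are the standard textbook forms of the measures, not taken verbatim from the talk):

```python
# Skill-score-type measures for the same (2 x 2) table (a, b, c, d as above).
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

pc = (a + d) / n
e_pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
heidke = (pc - e_pc) / (1 - e_pc)            # Heidke skill score / Cohen's kappa

csi = a / (a + b + c)                        # critical success index / threat score
a_random = (a + b) * (a + c) / n             # hits expected by chance
gilbert = (a - a_random) / (a + b + c - a_random)   # Gilbert skill score (equitable threat score)

odds_ratio = (a * d) / (b * c)
yules_q = (a * d - b * c) / (a * d + b * c)  # skill-score version of the odds ratio

dlr_pos = (a * (b + d)) / (b * (a + c))      # diagnostic likelihood ratio (positive)
dlr_neg = (c * (b + d)) / (d * (a + c))      # diagnostic likelihood ratio (negative)

for name, val in [("Heidke", heidke), ("CSI", csi), ("Gilbert", gilbert),
                  ("odds ratio", odds_ratio), ("Yule's Q", yules_q),
                  ("DLR+", dlr_pos), ("DLR-", dlr_neg)]:
    print(f"{name:12s} {val:8.3f}")
```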
The ROC (Relative Operating Characteristic) curve • Plots hit rate (proportion of occurrences of the event that were correctly forecast) against false alarm rate (proportion of non-occurrences that were incorrectly forecast) for a range of thresholds • Especially relevant if a number of different thresholds are of interest • There are a number of verification measures based on ROC curves; the most widely used is probably the area under the curve
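A minimal sketch of how a ROC curve and its area might be computed from continuous forecast scores and binary outcomes (the data here are synthetic, for illustration only):

```python
# ROC curve from continuous forecast scores and binary outcomes (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
occurred = rng.random(n) < 0.3                 # observed event (synthetic)
score = occurred * 0.3 + rng.random(n)         # forecast score, higher = more likely

hit_rate, false_alarm_rate = [], []
for t in np.sort(np.unique(score))[::-1]:      # sweep thresholds from high to low
    forecast_yes = score >= t
    a = np.sum(forecast_yes & occurred)        # hits
    b = np.sum(forecast_yes & ~occurred)       # false alarms
    c = np.sum(~forecast_yes & occurred)       # misses
    d = np.sum(~forecast_yes & ~occurred)      # correct rejections
    hit_rate.append(a / (a + c))
    false_alarm_rate.append(b / (b + d))

# Area under the curve by the trapezium rule (false alarm rate is
# non-decreasing because thresholds were swept from high to low).
hr, far = np.array(hit_rate), np.array(false_alarm_rate)
auc = np.sum(np.diff(far) * (hr[:-1] + hr[1:]) / 2)
print(f"area under ROC curve ≈ {auc:.3f}")
```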
Desirable properties of measures: hedging and proper scores • ‘Hedging’ is when a forecaster gives a forecast different from his/her true belief because he/she believes that the hedged forecasts will improve the (expected) score on a measure used to verify the forecasts. Clearly hedging is undesirable. • For probability forecasts, a (strictly) proper score is one for which the forecaster (uniquely) maximises the expected score by forecasting his/her true beliefs, so that there is no advantage in hedging.
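A standard example (not specific to this talk) is the Brier score (p_forecast − outcome)² for a probability forecast of a binary event; it is negatively oriented (smaller is better) and strictly proper, so hedging cannot help. A quick numerical check:

```python
# The expected Brier score when the event truly occurs with probability p_true
# and the forecaster issues probability q is  p_true*(1-q)**2 + (1-p_true)*q**2.
# It is minimised at q = p_true, so stating one's true belief is optimal.
import numpy as np

p_true = 0.3
q = np.linspace(0, 1, 101)
expected_brier = p_true * (1 - q) ** 2 + (1 - p_true) * q ** 2
best_q = q[np.argmin(expected_brier)]
print(f"expected Brier score is minimised at q = {best_q:.2f} (true probability {p_true})")
```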
Desirable properties of measures: equitability • A score for a probability forecast is equitable if it takes the same expected value (often chosen to be zero) for all unskilful forecasts of the type • Forecast the same probability all the time or • Choose a probability randomly from some distribution on the range [0,1]. • Equitability is desirable – if two sets of forecasts are made randomly, but with different random mechanisms, one should not score better than the other.
Desirable properties of measures III • There are a number of other desirable properties of measures, both for probability forecasts and other types of forecast, but equitability and propriety are most often cited in the meteorological literature. • Equitability and propriety are incompatible (a new result)
Desirable properties (attributes) of forecasts • Reliability. Conditionally unbiased. Expected value of the observation equals the forecast value. • Resolution. The sensitivity of the expected value of the observation to different forecasts values (or more generally the sensitivity of this conditional distribution as a whole). • Discrimination. The sensitivity of the conditional distribution of forecasts, given observations, to the value of the observation. • Sharpness. Measures spread of marginal distribution of forecasts. Equivalent to resolution for reliable (perfectly calibrated) forecasts. • Other lists of desirable attributes exist.
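One common way of formalising the first of these attributes, with F the forecast and X the observation (a sketch of the standard framework, not notation from the talk):

```latex
\text{reliability:}\ \; \mathbb{E}[X \mid F = f] = f \ \text{for all issued } f; \qquad
\text{resolution:}\ \; \mathbb{E}_{F}\!\left[\bigl(\mathbb{E}[X \mid F] - \mathbb{E}[X]\bigr)^{2}\right]; \qquad
\text{sharpness:}\ \; \operatorname{var}(F).
```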
A reliability diagram • For a probability forecast of an event based on 850hPa temperature. Lots of grid points, so lots of forecasts (16380). • Plots observed proportion of event occurrence for each forecast probability vs. forecast probability (solid line). • Forecast probability takes only 17 possible values (0, 1/16, 2/16, … 15/16, 1) because forecast is based on proportion of an ensemble of 16 forecasts that predict the event. • Because of the nature of the forecast event, 0 or 1 are forecast most of the time (see inset sharpness diagram).
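A minimal sketch of how such a diagram might be built from forecast probabilities and binary outcomes (synthetic data standing in for the 850 hPa ensemble forecasts):

```python
# Reliability diagram: observed relative frequency of the event for each
# forecast probability, plotted against the forecast probability.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 16380
# Synthetic stand-in: forecast probabilities on the 17 values 0, 1/16, ..., 1,
# with outcomes generated so the forecasts are slightly overconfident.
p_forecast = rng.integers(0, 17, size=n) / 16
p_actual = 0.1 + 0.8 * p_forecast               # true event probability given the forecast
occurred = rng.random(n) < p_actual

values = np.unique(p_forecast)
observed_freq = np.array([occurred[p_forecast == v].mean() for v in values])

plt.plot(values, observed_freq, "o-", label="observed frequency")
plt.plot([0, 1], [0, 1], "k--", label="perfect reliability")
plt.xlabel("forecast probability")
plt.ylabel("observed relative frequency")
plt.legend()
plt.show()
```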
Weather/climate forecasts vs medical diagnostic tests • Quite different approaches in the two literatures • Weather/climate. Lots of measures used. Literature on properties, but often ignored. Inference (tests, confidence intervals, power) seldom considered • Medical (Pepe). Far fewer measures. Little discussion of properties. More inference: confidence intervals, complex models for ROC curves
Multi-category forecasts • These are forecasts of the form • Temperature or rainfall ‘above’, ‘below’ or ‘near’ average (a common format for seasonal forecasts) • ‘Very High Risk’, ‘High Risk’, ‘Moderate Risk’, ‘Low Risk’ of excess mortality (PHEWE) • Different verification measures are relevant depending on whether the categories are ordered (as here) or unordered
Multi-category forecasts II • As with binary forecasts there are many possible verification measures • With K categories one class of measures assigns scores to each cell in the (K x K) table of forecast/outcome combinations • Then multiply the proportion of observations in each cell by its score, and sum over cells to get an overall score • By insisting on certain desirable properties (equitability, symmetry etc) the number of possible measures is narrowed
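A minimal sketch of the ‘scores times cell proportions’ calculation for a (K x K) forecast/outcome table (the counts here are placeholders; the scoring matrix shown happens to be the Gerrity matrix for three equally likely categories, constructed in the sketch further below):

```python
# Overall score = sum over cells of (proportion of cases in cell) x (cell score).
import numpy as np

# Rows = forecast category, columns = observed category (placeholder counts).
table = np.array([[50, 20,  5],
                  [15, 60, 20],
                  [ 5, 25, 55]])

# Scoring matrix assigning a reward/penalty to each forecast/outcome combination.
scores = np.array([[ 1.25, -0.25, -1.00],
                   [-0.25,  0.50, -0.25],
                   [-1.00, -0.25,  1.25]])

overall = np.sum(table / table.sum() * scores)
print(f"overall score = {overall:.3f}")
```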
Gerrity (and LEPS) scores for 3 ordered category forecasts with equal probabilities • Two possibilities are Gerrity scores or LEPS (Linear Error in Probability Space) scores • In the example, LEPS rewards correct extreme forecasts more, and penalises badly wrong forecasts more, than Gerrity (dividing the Gerrity scores by 24 and the LEPS scores by 36 puts them on the same scale – an expected maximum value of 1)
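For reference, a sketch of the standard Gerrity (1992) construction of the scoring matrix from the climatological category probabilities; the values printed for three equally likely categories may differ by a scaling factor from the tables shown in the talk.

```python
# Gerrity (1992) scoring matrix from climatological category probabilities.
import numpy as np

def gerrity_matrix(p):
    """Scoring matrix for K ordered categories with climatological probabilities p."""
    p = np.asarray(p, dtype=float)
    K = len(p)
    cum = np.cumsum(p)[:-1]            # cumulative probabilities P_1, ..., P_{K-1}
    D = (1 - cum) / cum                # the 'odds' D_r used in the construction
    s = np.zeros((K, K))
    for i in range(1, K + 1):
        for j in range(i, K + 1):
            s[i - 1, j - 1] = (np.sum(1 / D[:i - 1]) - (j - i) + np.sum(D[j - 1:])) / (K - 1)
            s[j - 1, i - 1] = s[i - 1, j - 1]
    return s

print(gerrity_matrix([1/3, 1/3, 1/3]))
# [[ 1.25 -0.25 -1.  ]
#  [-0.25  0.5  -0.25]
#  [-1.   -0.25  1.25]]
```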
Verification of continuous variables • Suppose we make forecasts f1, f2, …, fn; the corresponding observed data are x1, x2, …, xn. • We might assess the forecasts by computing • [ |f1-x1| + |f2-x2| + … + |fn-xn |]/n (mean absolute error) • [ (f1-x1)2 + (f2-x2)2 + … + (fn-xn)2 ]/n (mean square error) – or take its square root • Some form of correlation between the f’s and x’s • Both MSE and correlation can be highly influenced by a few extreme forecasts/observations • No time here to explore other possibilities
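A minimal sketch of these three summaries for a set of forecasts and observations (synthetic data):

```python
# Mean absolute error, (root) mean square error and correlation.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(15, 5, size=100)        # observations (synthetic)
f = x + rng.normal(0, 2, size=100)     # forecasts with some error

mae = np.mean(np.abs(f - x))
mse = np.mean((f - x) ** 2)
rmse = np.sqrt(mse)
corr = np.corrcoef(f, x)[0, 1]

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}, r = {corr:.3f}")
```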
Skill or value? • Our examples have looked at assessing skill • Often we really want to assess value • This needs quantification of the loss/cost of incorrect forecasts in terms of their ‘incorrectness’
Value of Tornado Forecasts • If wrong forecasts of any sort cost $1K each, the cost of the forecasting system is $95K, but the naive ‘always No Tornado’ system costs only $51K • If a false alarm costs $1K but a missed tornado costs $10K, the system costs $302K and naivety costs $510K • If a false alarm costs $1K but a missed tornado costs $1 million, the system costs $23.07 million, with naivety costing $51 million
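The arithmetic behind these figures, using the Finley counts implied earlier (72 false alarms and 23 missed tornadoes for the forecasting system; 51 missed tornadoes for the ‘always No Tornado’ system):

```python
# Cost comparison: Finley's forecasts vs always forecasting 'No Tornado'.
false_alarms, misses = 72, 23          # from the forecasting system
naive_misses = 51                      # 'always no' misses every observed tornado

for fa_cost, miss_cost in [(1_000, 1_000), (1_000, 10_000), (1_000, 1_000_000)]:
    system = false_alarms * fa_cost + misses * miss_cost
    naive = naive_misses * miss_cost
    print(f"false alarm ${fa_cost:,}, miss ${miss_cost:,}: "
          f"system ${system:,}, naive ${naive:,}")
```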
Concluding remarks • Forecasts should be verified • Forecasts are multi-faceted; verification should reflect this • Interpretation of verification results needs careful thought • Much more could be said, for example, on inference, wordy forecasts, continuous forecasts, probability forecasts, ROC curves, value, spatial forecasts etc.
Continuous variables – LEPS scores • Also for MSE a difference between forecast and observed of, say, 2°C is treated the same way, whether it is • a difference between 1°C above and 1°C below the long-term mean or • a difference between 3°C above and 5°C above the long-term mean • It can be argued that the second forecast is better than the first because the forecast and observed are closer with respect to the probability distribution of temperature.
LEPS scores II • LEPS (Linear Error in Probability Space) are scores that measure distances with respect to position in a probability distribution. • They start from the idea of using | Pf – Pv |, where Pf, Pv are positions in the cumulative probability distribution of the measured variable for the forecast and observed values, respectively • This has the effect of down-weighting differences between extreme forecasts and outcomes e.g. a forecast/outcome pair 3 & 4 standard deviations above the mean is deemed ‘closer’ than a pair 1 & 2 SDs above the mean. Hence it gives greater credit to ‘good’ forecasts of extremes.
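A small numerical illustration of this down-weighting, assuming for the purposes of the sketch that the (standardised) variable being forecast is normally distributed:

```python
# |Pf - Pv| for the two pairs mentioned above, assuming a standard normal
# climatological distribution for the standardised variable.
from scipy.stats import norm

def prob_distance(f_sd, v_sd):
    """Distance in probability space between forecast and observed values,
    both expressed in standard deviations above the mean."""
    return abs(norm.cdf(f_sd) - norm.cdf(v_sd))

print(f"3 vs 4 SDs above the mean: |Pf - Pv| = {prob_distance(3, 4):.5f}")
print(f"1 vs 2 SDs above the mean: |Pf - Pv| = {prob_distance(1, 2):.5f}")
# The extreme pair is much 'closer' in probability space.
```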
LEPS scores III • The basic measure is normalized and adjusted to ensure • the score is doubly equitable • no ‘bending back’ • a simple value for unskilful and for perfect forecasts • We end up with • 3(1 − |Pf − Pv| + Pf² − Pf + Pv² − Pv) − 1
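As a sketch, the adjusted score as a function of the two cumulative-probability positions:

```python
# LEPS score for a single forecast/observation pair, with Pf and Pv the
# positions of the forecast and observed values in the climatological
# cumulative distribution (both in [0, 1]).
def leps(pf, pv):
    return 3 * (1 - abs(pf - pv) + pf**2 - pf + pv**2 - pv) - 1

print(leps(0.5, 0.5))    # perfect forecast of the median: 0.5
print(leps(0.99, 0.99))  # perfect forecast of an extreme scores higher: ~1.94
print(leps(0.05, 0.95))  # badly wrong forecast is penalised: ~-0.99
```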
LEPS scores IV • LEPS scores can be used on both continuous and categorical data. • A skill score, taking values between –100 and 100 (or –1 and 1), can be constructed for a set of forecasts based on the LEPS score, but it is not doubly equitable. • Cross-validation (successively leaving out one of the data points and basing the prediction for that point on a rule derived from all the other data points) can be used to reduce the optimistic bias which arises when the same data are used both to construct and to evaluate a rule. It has been used in some applications of LEPS, but is relevant more widely.
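A minimal sketch of leave-one-out cross-validation for a simple prediction rule (the rule here is just the mean of the remaining points, an assumption made for illustration):

```python
# Leave-one-out cross-validation: each point is predicted from a rule fitted
# to all the other points, so the evaluation data never help build the rule.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(15, 5, size=50)         # observations (synthetic)

errors = []
for i in range(len(x)):
    rest = np.delete(x, i)             # leave the i-th point out
    prediction = rest.mean()           # simple 'climatological' rule
    errors.append(abs(x[i] - prediction))

print(f"cross-validated MAE = {np.mean(errors):.2f}")
```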