Forecast verification - did we get it right? Ian Jolliffe, Universities of Reading, Southampton, Kent, Aberdeen. i.t.jolliffe@reading.ac.uk
Outline of talk • Introduction – what, why, how? • Binary forecasts • Performance measures, ROC curves • Desirable properties • Of forecasts • Of performance measures • Other forecasts • Multi-category, continuous, (probability) • Value
Forecasts • Forecasts are made in many disciplines • Weather and climate • Economics • Sales • Medical diagnosis
Why verify/validate/assess forecasts? • Decisions are based on past data but also on forecasts of data not yet observed • A look back at the accuracy of forecasts is necessary to determine whether current forecasting methods should be continued, abandoned or modified
Two (very different) recent references • I T Jolliffe and D B Stephenson (eds.) (2003) Forecast verification: A practitioner’s guide in atmospheric science. Wiley. • M S Pepe (2003) The statistical evaluation of medical tests for classification and prediction. Oxford.
Horses for courses • Different types of forecast need different methods of verification, for example in the context of weather hazards (TSUNAMI project): • binary data - damaging frost: yes/no • categorical - storm damage: slight/moderate/severe • discrete - how many land-falling hurricanes in a season • continuous - height of a high tide • probabilities – of tornado • Some forecasts (wordy/descriptive) are very difficult to verify at all
Binary forecasts • Such forecasts might be • Whether temperature will fall below a threshold, damaging crops or forming ice on roads • Whether maximum river flow will exceed a threshold, causing floods • Whether mortality due to extreme heat will exceed some threshold (PHEWE project) • Whether a tornado will occur in a specified area • The classic Finley Tornado example (next 2 slides) illustrates that assessing such forecasts is more subtle than it looks • There are many possible verification measures – most have some poor properties
Tornado forecasts • Correct decisions 2708/2803 = 96.6% • Correct decisions by procedure which always forecasts ‘No Tornado’ 2752/2803 = 98.2% • It’s easy to forecast ‘No Tornado’ and get it right, but more difficult to forecast when tornadoes will occur • Correct decision when Tornado is forecast is 28/100 = 28.0% • Correct forecast of observed tornadoes 28/51 = 54.9%
Some verification measures for (2 x 2) tables • With a = hits (event forecast and observed), b = false alarms, c = misses, d = correct rejections, and n = a+b+c+d: • a/(a+c) Hit rate = true positive fraction = sensitivity • b/(b+d) False alarm rate = 1 – specificity • b/(a+b) False alarm ratio = 1 – positive predictive value • c/(c+d) = 1 – negative predictive value, where d/(c+d) is the negative predictive value • (a+d)/n Proportion correct (PC) • (a+b)/(a+c) Bias
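A minimal Python sketch of these measures, using counts a = 28, b = 72, c = 23, d = 2680 reconstructed from the percentages quoted for the Finley example (illustrative code, not part of the original talk):

```python
# Binary (2 x 2) verification measures for the Finley tornado counts
# reconstructed from the quoted percentages:
# a = hits, b = false alarms, c = misses, d = correct rejections.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

hit_rate = a / (a + c)              # true positive fraction / sensitivity
false_alarm_rate = b / (b + d)      # 1 - specificity
false_alarm_ratio = b / (a + b)     # 1 - positive predictive value
neg_pred_value = d / (c + d)        # negative predictive value
proportion_correct = (a + d) / n    # PC
bias = (a + b) / (a + c)            # frequency bias

print(f"Hit rate           {hit_rate:.3f}")           # 0.549
print(f"False alarm rate   {false_alarm_rate:.3f}")
print(f"False alarm ratio  {false_alarm_ratio:.3f}")
print(f"Proportion correct {proportion_correct:.3f}")  # 0.966
print(f"Bias               {bias:.3f}")
```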
Skill scores • A skill score is a verification measure adjusted to show improvement over some unskilful baseline, typically a forecast of ‘climatology’, a random forecast or a forecast of persistence. Usually adjustment gives zero value for the baseline and unity for a perfect forecast. • For (2x2) tables we know how to calculate ‘expected’ values in the cells of the table under a null hypothesis of no association (no skill) for a χ² test.
More (2x2) verification measures • (PC – E)/(1 – E), where E is the expected value of PC assuming no skill – the Heidke (1926) skill score = Cohen’s Kappa (1960), also Doolittle (1885) • a/(a+b+c) Critical success index (CSI) = threat score • Gilbert’s (1884) skill score - a skill score version of CSI • (ad – bc)/(ad + bc) Yule’s Q (1900). A skill score version of the odds ratio ad/bc • a(b+d)/[b(a+c)]; c(b+d)/[d(a+c)] Diagnostic likelihood ratios • Note that neither the list of measures nor the list of names is exhaustive – see, for example, J A Swets (1986), Psychological Bulletin, 99, 100-117
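A sketch of these skill-score versions for the same reconstructed Finley counts. The Gilbert skill score is written here in its standard form (hits adjusted for chance), which is not spelled out on the slide:

```python
# Skill-score versions of the (2 x 2) measures, same Finley counts as above.
a, b, c, d = 28, 72, 23, 2680
n = a + b + c + d

pc = (a + d) / n
# Expected proportion correct under no skill (the chi-squared 'expected' cells)
e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
heidke = (pc - e) / (1 - e)                  # Heidke skill score = Cohen's kappa

csi = a / (a + b + c)                        # critical success index / threat score
a_random = (a + b) * (a + c) / n             # hits expected by chance
gilbert = (a - a_random) / (a + b + c - a_random)   # Gilbert skill score

odds_ratio = (a * d) / (b * c)
yules_q = (a * d - b * c) / (a * d + b * c)  # skill-score version of the odds ratio

dlr_pos = (a * (b + d)) / (b * (a + c))      # diagnostic likelihood ratio (+)
dlr_neg = (c * (b + d)) / (d * (a + c))      # diagnostic likelihood ratio (-)

print(f"Heidke {heidke:.3f}  CSI {csi:.3f}  Gilbert {gilbert:.3f}  Q {yules_q:.3f}")
```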
The ROC (Relative Operating Characteristic) curve • Plots hit rate (proportion of occurrences of the event that were correctly forecast) against false alarm rate (proportion of non-occurrences that were incorrectly forecast) for different thresholds • Especially relevant if a number of different thresholds are of interest • There are a number of verification measures based on ROC curves. The most widely used is probably the area under the curve
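A minimal sketch of how an empirical ROC curve and its area can be computed by sweeping a decision threshold over forecast probabilities. The probabilities and outcomes below are made-up illustrative values:

```python
import numpy as np

# Made-up forecast probabilities and binary outcomes for illustration only.
probs = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
obs   = np.array([1,   1,   0,   1,   0,   1,   0,   0,   0,   0])

thresholds = sorted(np.unique(np.concatenate(([0.0], probs, [1.1]))), reverse=True)
hit_rates, false_alarm_rates = [], []
for t in thresholds:                       # high threshold -> few 'yes' forecasts
    forecast_yes = probs >= t
    a = np.sum(forecast_yes & (obs == 1))  # hits
    b = np.sum(forecast_yes & (obs == 0))  # false alarms
    c = np.sum(~forecast_yes & (obs == 1)) # misses
    d = np.sum(~forecast_yes & (obs == 0)) # correct rejections
    hit_rates.append(a / (a + c))
    false_alarm_rates.append(b / (b + d))

# Area under the ROC curve by the trapezoidal rule.
area = 0.0
for i in range(1, len(hit_rates)):
    area += 0.5 * (hit_rates[i] + hit_rates[i - 1]) * \
            (false_alarm_rates[i] - false_alarm_rates[i - 1])
print(f"ROC area = {area:.3f}")
```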
Desirable properties of measures: hedging and proper scores • ‘Hedging’ is when a forecaster gives a forecast different from his/her true belief because he/she believes that the hedged forecasts will improve the (expected) score on a measure used to verify the forecasts. Clearly hedging is undesirable. • For probability forecasts, a (strictly) proper score is one for which the forecaster (uniquely) maximises the expected score by forecasting his/her true beliefs, so that there is no advantage in hedging.
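As an illustration of propriety, here is a small numerical check using the Brier score, a standard strictly proper score that is not named on this slide. The Brier score is negatively oriented (smaller is better), so "best" here means the minimum expected score, which occurs when the issued forecast equals the true belief:

```python
import numpy as np

# If the forecaster's true belief is p, the expected Brier score of issuing q is
#     p*(q-1)**2 + (1-p)*q**2,
# which is minimised at q = p, so hedging (q != p) cannot improve the expected score.
p = 0.3                               # true belief (illustrative value)
q = np.linspace(0, 1, 101)            # candidate issued forecasts
expected_brier = p * (q - 1) ** 2 + (1 - p) * q ** 2
best_q = q[np.argmin(expected_brier)]
print(f"Expected Brier score is minimised at q = {best_q:.2f} (= p)")
```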
Desirable properties of measures: equitability • A score for a probability forecast is equitable if it takes the same expected value (often chosen to be zero) for all unskilful forecasts of the type • Forecast the same probability all the time or • Choose a probability randomly from some distribution on the range [0,1]. • Equitability is desirable – if two sets of forecasts are made randomly, but with different random mechanisms, one should not score better than the other.
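A quick simulation sketch (illustrative, not from the talk) showing why equitability matters: proportion correct is not equitable, because two unskilful strategies earn very different expected values when the event is rare. The event frequency is taken from the Finley example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unskilful strategies: always forecast 'no', or toss a fair coin.
p_event = 51 / 2803                          # tornado frequency in the Finley data
obs = rng.random(100_000) < p_event          # simulated occurrences

always_no = np.zeros(obs.size, dtype=bool)   # constant 'no' forecasts
random_coin = rng.random(obs.size) < 0.5     # random 'yes'/'no' forecasts

pc_always_no = np.mean(always_no == obs)     # about 0.98
pc_random = np.mean(random_coin == obs)      # about 0.50
print(f"PC always-no {pc_always_no:.3f}, PC coin-flip {pc_random:.3f}")
```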
Desirable properties of measures III • There are a number of other desirable properties of measures, both for probability forecasts and other types of forecast, but equitability and propriety are most often cited in the meteorological literature. • Equitability and propriety are incompatible (a new result)
Desirable properties (attributes) of forecasts • Reliability. Conditionally unbiased. Expected value of the observation equals the forecast value. • Resolution. The sensitivity of the expected value of the observation to different forecast values (or more generally the sensitivity of this conditional distribution as a whole). • Discrimination. The sensitivity of the conditional distribution of forecasts, given observations, to the value of the observation. • Sharpness. Measures spread of marginal distribution of forecasts. Equivalent to resolution for reliable (perfectly calibrated) forecasts. • Other lists of desirable attributes exist.
A reliability diagram • For a probability forecast of an event based on 850hPa temperature. Lots of grid points, so lots of forecasts (16380). • Plots observed proportion of event occurrence for each forecast probability vs. forecast probability (solid line). • Forecast probability takes only 17 possible values (0, 1/16, 2/16, … 15/16, 1) because forecast is based on proportion of an ensemble of 16 forecasts that predict the event. • Because of the nature of the forecast event, 0 or 1 are forecast most of the time (see inset sharpness diagram).
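A sketch of how the solid line in such a reliability diagram is computed: bin the forecasts by issued probability (here the 17 possible ensemble fractions) and compare the observed relative frequency in each bin with the forecast probability. The forecasts and outcomes below are simulated stand-ins, not the 850hPa data from the slide:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins for the 16380 forecast/observation pairs on the slide;
# forecast probabilities restricted to the ensemble fractions 0/16, ..., 16/16.
n_fc = 16380
p_forecast = rng.integers(0, 17, size=n_fc) / 16
occurred = rng.random(n_fc) < p_forecast      # crude synthetic 'truth'

for p in np.arange(17) / 16:
    in_bin = p_forecast == p
    if in_bin.any():
        obs_freq = occurred[in_bin].mean()    # observed relative frequency
        print(f"forecast {p:.3f}  observed freq {obs_freq:.3f}  n = {in_bin.sum()}")
# Plotting obs_freq against p (with the diagonal as reference) gives the
# reliability diagram; a histogram of p_forecast gives the sharpness diagram.
```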
Weather/climate forecasts vs medical diagnostic tests • Quite different approaches in the two literatures • Weather/climate. Lots of measures used. Literature on properties, but often ignored. Inference (tests, confidence intervals, power) seldom considered • Medical (Pepe). Far fewer measures. Little discussion of properties. More inference: confidence intervals, complex models for ROC curves
Multi-category forecasts • These are forecasts of the form • Temperature or rainfall ‘above’, ‘below’ or ‘near’ average (a common format for seasonal forecasts) • ‘Very High Risk’, ‘High Risk’, ‘Moderate Risk’, ‘Low Risk’ of excess mortality (PHEWE) • Different verification measures are relevant depending on whether categories are ordered (as here) or unordered
Multi-category forecasts II • As with binary forecasts there are many possible verification measures • With K categories one class of measures assigns scores to each cell in the (K x K) table of forecast/outcome combinations • Then multiply the proportion of observations in each cell by its score, and sum over cells to get an overall score • By insisting on certain desirable properties (equitability, symmetry etc) the number of possible measures is narrowed
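A sketch of the general recipe in the last bullet: multiply the proportion of cases in each cell of the K x K forecast/observation table by that cell's score and sum. The counts and the (symmetric) scoring matrix below are illustrative placeholders, not the Gerrity or LEPS weights of the next slide:

```python
import numpy as np

# Rows: forecast category; columns: observed category (illustrative counts).
counts = np.array([[50, 20,  5],
                   [15, 60, 20],
                   [ 5, 25, 55]])
# Illustrative symmetric scoring matrix rewarding the diagonal,
# penalising badly wrong (extreme-to-extreme) cells most.
scores = np.array([[ 1.0, -0.5, -1.0],
                   [-0.5,  0.5, -0.5],
                   [-1.0, -0.5,  1.0]])

proportions = counts / counts.sum()        # proportion of cases in each cell
overall_score = np.sum(proportions * scores)
print(f"Overall score = {overall_score:.3f}")
```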
Gerrity (and LEPS) scores for 3 ordered category forecasts with equal probabilities • Two possibilities are Gerrity scores or LEPS (Linear Error in Probability Space) • In the example, LEPS rewards correct extreme forecasts more, and penalises badly wrong forecasts more, than Gerrity (divide the Gerrity scores by 24 and the LEPS scores by 36 to put both on the same scale, with an expected maximum value of 1)
Verification of continuous variables • Suppose we make forecasts f1, f2, …, fn; the corresponding observed data are x1, x2, …, xn. • We might assess the forecasts by computing • [ |f1-x1| + |f2-x2| + … + |fn-xn| ]/n (mean absolute error) • [ (f1-x1)² + (f2-x2)² + … + (fn-xn)² ]/n (mean square error) – or take its square root • Some form of correlation between the f’s and x’s • Both MSE and correlation can be highly influenced by a few extreme forecasts/observations • No time here to explore other possibilities
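A minimal sketch of these three measures for illustrative (made-up) forecast and observation vectors:

```python
import numpy as np

f = np.array([14.2, 15.1, 13.8, 16.0, 15.5])   # forecasts (made-up values)
x = np.array([13.9, 15.6, 14.5, 15.2, 16.1])   # corresponding observations

mae  = np.mean(np.abs(f - x))                  # mean absolute error
mse  = np.mean((f - x) ** 2)                   # mean square error
rmse = np.sqrt(mse)                            # root mean square error
corr = np.corrcoef(f, x)[0, 1]                 # (Pearson) correlation

print(f"MAE {mae:.2f}  MSE {mse:.2f}  RMSE {rmse:.2f}  r {corr:.2f}")
```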
Skill or value? • Our examples have looked at assessing skill • Often we really want to assess value • This needs quantification of the loss/cost of incorrect forecasts in terms of their ‘incorrectness’
Value of Tornado Forecasts • If wrong forecasts of any sort cost $1K, then the cost of the forecasting system is $95K, but the naive system costs only $51K • If a false alarm costs $1K, but a missed tornado costs $10K, then the system costs $302K, but naivety costs $510K • If a false alarm costs $1K, but a missed tornado costs $1 million, then the system costs $23.07 million, with naivety costing $51 million
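A sketch reproducing these cost comparisons from the reconstructed Finley counts: the forecasting system makes b = 72 false alarms and c = 23 misses, while the naive 'always No Tornado' system makes 0 false alarms and 51 misses:

```python
def total_cost(false_alarms, misses, cost_fa, cost_miss):
    """Total cost of a set of forecasts given unit costs of each error type."""
    return false_alarms * cost_fa + misses * cost_miss

for cost_fa, cost_miss in [(1_000, 1_000), (1_000, 10_000), (1_000, 1_000_000)]:
    system = total_cost(72, 23, cost_fa, cost_miss)   # Finley forecasting system
    naive = total_cost(0, 51, cost_fa, cost_miss)     # always 'No Tornado'
    print(f"FA ${cost_fa:,} / miss ${cost_miss:,}: "
          f"system ${system:,}, naive ${naive:,}")
```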
Concluding remarks • Forecasts should be verified • Forecasts are multi-faceted; verification should reflect this • Interpretation of verification results needs careful thought • Much more could be said, for example, on inference, wordy forecasts, continuous forecasts, probability forecasts, ROC curves, value, spatial forecasts etc.
Continuous variables – LEPS scores • Also for MSE, a difference between forecast and observed of, say, 2°C is treated the same way, whether it is • a difference between 1°C above and 1°C below the long-term mean or • a difference between 3°C above and 5°C above the long-term mean • It can be argued that the second forecast is better than the first because the forecast and observed are closer with respect to the probability distribution of temperature.
LEPS scores II • LEPS (Linear Error in Probability Space) are scores that measure distances with respect to position in a probability distribution. • They start from the idea of using | Pf – Pv |, where Pf, Pv are positions in the cumulative probability distribution of the measured variable for the forecast and observed values, respectively • This has the effect of down-weighting differences between extreme forecasts and outcomes e.g. a forecast/outcome pair 3 & 4 standard deviations above the mean is deemed ‘closer’ than a pair 1 & 2 SDs above the mean. Hence it gives greater credit to ‘good’ forecasts of extremes.
LEPS scores III • The basic measure is normalized and adjusted to ensure • the score is doubly equitable • no ‘bending back’ • a simple value for unskilful and for perfect forecasts • We end up with • 3(1 – |Pf – Pv| + Pf² – Pf + Pv² – Pv) – 1
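A direct transcription of this formula as a small function, evaluated at a few illustrative cumulative-probability positions:

```python
def leps_score(p_f, p_v):
    """LEPS score for one forecast/observation pair, where p_f and p_v are the
    positions of the forecast and observed values in the climatological
    cumulative distribution (both in [0, 1])."""
    return 3 * (1 - abs(p_f - p_v) + p_f**2 - p_f + p_v**2 - p_v) - 1

print(leps_score(0.5, 0.5))    # 0.5  : perfect forecast at the median
print(leps_score(0.99, 0.99))  # ~1.94: greater credit for a correct extreme
print(leps_score(0.05, 0.95))  # ~-0.99: badly wrong forecast is penalised
```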
LEPS scores IV • LEPS scores can be used on both continuous and categorical data. • A skill score, taking values between –100 and 100 (or –1 and 1) for a set of forecasts can be constructed based on the LEPS score, but it is not doubly equitable. • Cross-validation (successively leaving out one of the data points and basing the prediction for that point on a rule derived from all the other data points) can be used to reduce the optimistic bias which exists when the same data are used to construct and to evaluate a rule. It has been used in some applications of LEPS, but is relevant more widely.
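A generic leave-one-out cross-validation sketch of the idea described above, with a deliberately simple illustrative prediction rule (predict each point from the mean of the remaining points) and made-up data:

```python
import numpy as np

x = np.array([2.1, 3.4, 2.8, 4.0, 3.1, 2.6])   # made-up data
predictions = np.empty_like(x)
for i in range(x.size):
    others = np.delete(x, i)          # leave point i out
    predictions[i] = others.mean()    # rule built from the remaining points only

cv_mse = np.mean((predictions - x) ** 2)   # out-of-sample error estimate
print(f"Cross-validated MSE = {cv_mse:.3f}")
```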