Evaluation of Model Performance 69EG6517 – Impacts & Models of Climate Change Dr Mark Cresswell
Lecture Topics • The need to evaluate performance • Simple spatial correlation • Bias • Reliability • Accuracy • Skill • Kolmogorov-Smirnov
Why Evaluate? We must try to assess the performance of any forecasting system objectively and quantitatively. If a climate model performs badly and is used in critical decision-making processes, then poor decisions will be made. End users want an estimate of how confident they can be in the predictions made. Climate scientists need to know which models perform best, and why.
Why Evaluate? #2 A simple deterministic forecast has only one outcome, so evaluation is easy (either the event occurred or it did not). Probabilistic forecasts (giving a percentage chance of an event) require a method that compares what occurred against the probability weighting that was forecast. Values of model forecast fields may be correlated against reanalysis (observed) fields. Some climate scientists prefer this method, but it is poor practice, because discontinuous variables such as rainfall often have skewed distributions.
Simple Spatial Correlation Model grid-points may be matched with corresponding reanalysis grid-points when historical (hindcast) forecasts/simulations are created. The values of each forecast field (temperature, rainfall, humidity, wind velocity etc.) may then be compared between model and reanalysis (observed). We can quantify the degree of association between forecast and reanalysis fields by performing a simple Pearson correlation test on a grid-point basis, as sketched below.
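As a minimal sketch (not from the lecture), assuming the hindcast and reanalysis fields have already been regridded onto a common grid and stacked as hypothetical [year, lat, lon] arrays, the grid-point correlation could be computed like this:

```python
import numpy as np
from scipy.stats import pearsonr

def gridpoint_correlation(model, reanalysis):
    """Pearson correlation between hindcast and reanalysis at each grid point.

    Both arguments are hypothetical arrays of shape (n_years, n_lat, n_lon)
    holding one forecast field (e.g. seasonal rainfall totals).
    Returns an (n_lat, n_lon) map of correlation coefficients.
    """
    _, n_lat, n_lon = model.shape
    r_map = np.full((n_lat, n_lon), np.nan)
    for i in range(n_lat):
        for j in range(n_lon):
            r, _ = pearsonr(model[:, i, j], reanalysis[:, i, j])
            r_map[i, j] = r
    return r_map
```

The resulting map can then be plotted to show where the model and reanalysis fields agree well and where they do not.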
Simple Spatial Correlation [Figure: spatial correlation of HadAM3 ensemble-mean values against reanalysis]
Bias Each climate model has its own specific "climatology". Sometimes, due to the nature of the physics schemes used, a model may often be wetter or drier than reality when estimating precipitation. When a model generates anomalously high precipitation it is said to have a wet bias; conversely, when it underestimates moisture it has a dry bias. Bias relates to the similarity between the average forecast (µf) and the average observation (µx), and this measure of model quality is generally expressed as the difference between µf and µx (Katz and Murphy, 1997).
Bias If the bias is systematic, it is always present and the magnitude of the difference between µf and µx is always the same. A systematic bias can easily be removed and corrected for. Sometimes the bias is not systematic; in this case the difference between µf and µx is not always of the same magnitude, and this type of bias is hard to eliminate. The HadAM3 model has a non-systematic wet bias in some regions of the tropics.
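A minimal sketch of the idea, assuming hypothetical hindcast and observation arrays on a common grid: the bias is the difference of the means, and a purely systematic bias can be removed simply by subtracting it from every forecast value.

```python
import numpy as np

def mean_bias(forecasts, observations):
    """Bias = mean forecast (mu_f) minus mean observation (mu_x)."""
    return np.mean(forecasts) - np.mean(observations)

def remove_systematic_bias(forecasts, observations):
    """Subtract a constant (systematic) bias from every forecast value.

    This simple correction is only appropriate when the bias really is
    systematic; a non-systematic bias varies in magnitude and cannot be
    removed this way.
    """
    return np.asarray(forecasts) - mean_bias(forecasts, observations)
```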
Bias [Figure: differential mean bias correction scheme]
Reliability Reliability is a measure of association between forecast probability and observed frequency. Suppose a forecast system always gave a probability of 80% for above-normal rainfall and 20% for below-normal. If, after 100 years, 80 of those years actually experienced above-normal rainfall and 20 experienced below-normal, then the model would be 100% reliable. Notice that reliability is NOT the same as either accuracy or skill.
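As an illustrative sketch (not from the lecture), reliability can be checked by grouping hindcasts by their issued probability and comparing each group's observed event frequency with that probability; the function name and bin count below are assumptions.

```python
import numpy as np

def reliability_table(forecast_probs, event_occurred, bins=10):
    """Observed event frequency within each forecast-probability bin.

    forecast_probs : issued probabilities (0 to 1), one per forecast
    event_occurred : 1/0 outcomes for the same forecasts
    A perfectly reliable system has observed frequency equal to the issued
    probability in every bin (e.g. the 80% forecasts verify 80% of the time).
    """
    probs = np.asarray(forecast_probs, dtype=float)
    outcomes = np.asarray(event_occurred, dtype=float)
    # Assign each forecast to a bin; probabilities of exactly 1.0 go in the top bin.
    idx = np.minimum((probs * bins).astype(int), bins - 1)
    rows = []
    for b in range(bins):
        mask = idx == b
        if mask.any():
            rows.append(((b + 0.5) / bins, outcomes[mask].mean(), int(mask.sum())))
    return rows  # (bin centre, observed frequency, number of forecasts)
```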
Accuracy Accuracy is a measure of association between the probability weighting of an event and whether the event occurred. The test statistic is known as the Brier Score: BS = (1/n) Σ (pi − vi)², where n = number of ensemble forecasts in the sample, pi = probability of E occurring in the ith forecast, and vi = 1 (occurred) or 0 (did not occur).
Accuracy The Brier Score has a range of 0 to 1. The lower the score, the better the accuracy of the forecasting system, with a value of 0 corresponding to a perfect deterministic forecast. [Table: W. Africa example, stage 1]
Accuracy [Table: W. Africa example, stage 2, Brier Score for the "normal" category]
Accuracy Next, we calculate 1/n, where n is 18 (years), so 1/n ≈ 0.055. Finally, we sum all of the values of (pi − vi)², which gives 6.762, and multiply by 1/n: BS (normal) = 0.055 × 6.762 ≈ 0.371. Therefore the Brier Score for the forecast of the "normal" event (monsoon rainfall onset is normal) at grid-point region two is 0.371 over the 18 years of data for a single grid square.
Skill Skill is a measure of how much more or less accurate our forecast system is compared with climatology. The test statistic is known as the Brier Skill Score: BSS = (Bc − Bf) / Bc, where Bc = Brier Score achieved with climatology and Bf = Brier Score of the model simulation.
Skill A Brier Skill Score of zero denotes a forecast having the same skill as "climatology". Positive scores are increasingly better than "climatology", i.e. they have skill; negative scores are increasingly worse than "climatology", i.e. they have no skill.
Skill As a worked example (using a West Africa grid-point over 18 years) we can produce a Brier Skill Score: Brier Score (HadAM3) for the "normal" event = 0.371; Brier Score (Xie-Arkin climatology) for the "normal" event = 0.246 (calculations not shown here); Brier Skill Score = (0.246 − 0.371) / 0.246 = −0.508. Therefore the Brier Skill Score for the forecast of the "normal" event (monsoon rainfall onset is normal) is −0.508, so the forecast has no skill.
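Continuing the sketch, the worked example above reduces to a one-line helper (the 0.371 and 0.246 values are those quoted in the example):

```python
def brier_skill_score(bs_model, bs_climatology):
    """BSS = (Bc - Bf) / Bc; positive values beat climatology, negative do not."""
    return (bs_climatology - bs_model) / bs_climatology

print(brier_skill_score(0.371, 0.246))  # about -0.508, i.e. no skill
```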
Kolmogorov-Smirnov The Kolmogorov-Smirnov (KS) test tells us how similar two population distributions are. In climate prediction, we might expect our model forecast field distributions to have similar characteristics to climatology, BUT if there is no difference at all the model is incapable of representing inter-annual variability. Ideally, we want a model that provides similar, but not identical, population distributions to climatology.
Kolmogorov-Smirnov The KS test is assessed against a threshold (critical value), which is determined by the population size and the level of significance desired. If the observed maximum difference between the two cumulative distribution functions exceeds this critical value, then the null hypothesis that the two population distributions are identical is rejected. Example: HadAM3 in W. Africa.
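A minimal sketch using SciPy's two-sample KS test, with hypothetical model and climatological rainfall samples for a single grid point (the values and distribution parameters below are illustrative assumptions, not the HadAM3 results):

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical samples of seasonal rainfall at one grid point
rng = np.random.default_rng(0)
model_rainfall = rng.gamma(shape=2.0, scale=3.0, size=30)  # hindcast values
climatology = rng.gamma(shape=2.0, scale=2.5, size=30)     # observed climatology

# The KS statistic is the maximum difference between the two empirical CDFs.
# If it exceeds the critical value (equivalently, if p_value falls below the
# chosen significance level), reject the null hypothesis that the two
# population distributions are identical.
stat, p_value = ks_2samp(model_rainfall, climatology)
print(stat, p_value)
```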