Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble
Tom Hopson1, Josh Hacker1, Yubao Liu1, Gregory Roux1, Wanli Wu1, Jason Knievel1, Tom Warner1, Scott Swerdlin1, John Pace2, Scott Halvorson2
1National Center for Atmospheric Research (NCAR), 2U.S. Army Test and Evaluation Command
Outline
• Motivation: ensemble forecasting and post-processing
• E-RTFDDA for Dugway Proving Grounds
• Introduce quantile regression (QR; Koenker and Bassett, 1978)
• Post-processing procedure
• Verification results
• Warning: dynamically increasing ensemble dispersion can put ensemble-mean utility at risk
• Conclusions
Goals of an EPS • Predict the observed distribution of events and atmospheric states • Predict uncertainty in the day’s prediction • Predict the extreme events that are possible on a particular day • Provide a range of possible scenarios for a particular forecast
More technically …
• Greater accuracy of the ensemble-mean forecast (half the error variance of a single forecast)
• Likelihood of extremes
• Non-Gaussian forecast PDFs
• Ensemble spread as a representation of forecast uncertainty
=> All rely on the forecasts being calibrated
Further:
-- Calibration is essential for tailoring to a local application: NWP provides spatially and temporally averaged gridded forecast output
-- Applying gridded forecasts to point locations requires location-specific calibration to account for local spatial and temporal scales of variability (=> increasing ensemble dispersion)
Dugway Proving Grounds, Utah: e.g. temperature thresholds
• Exceedance counts include random and systematic differences between members.
• Not an actual chance of exceedance unless calibrated.
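To make the threshold products concrete: the exceedance probability implied by a raw ensemble is simply the fraction of members beyond the threshold. Below is a minimal Python sketch (the function name and synthetic data are illustrative, not from the deck); per the caveat above, its output is only a genuine probability once the ensemble is calibrated.

```python
import numpy as np

def exceedance_probability(members, threshold):
    """Fraction of ensemble members exceeding a threshold.

    Only a meaningful probability if the ensemble is calibrated;
    for a raw ensemble it mixes random and systematic member differences.
    """
    members = np.asarray(members, dtype=float)
    return np.mean(members > threshold)

# Hypothetical example: 30 temperature forecasts (K), 300 K threshold
rng = np.random.default_rng(0)
ens = rng.normal(299.0, 1.5, size=30)
print(exceedance_probability(ens, 300.0))
```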
Challenges in probabilistic mesoscale prediction
• Model formulation
-- Bias (marginal and conditional)
-- Lack of variability caused by truncation and approximation
-- Non-universality of closure and forcing
• Initial conditions
-- Small scales are damped in analysis systems, and the model must develop them
-- Perturbation methods designed for medium-range systems may not be appropriate
• Lateral boundary conditions
-- After short time periods the lateral boundary conditions can dominate
-- Representing uncertainty in lateral boundary conditions is critical
• Lower boundary conditions
-- Dominate the boundary-layer response
-- Difficult to estimate uncertainty in lower boundary conditions
RTFDDA and Ensemble-RTFDDA
Liu et al. 2010, AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA, January 18–23, 2010 (yliu@ucar.edu)
The Ensemble Execution Module
[Flow diagram: N perturbed RTFDDA members, each assimilating observations and producing 36-48 h forecasts; output feeds post-processing, decision-support tools, and archiving/verification.]
Real-time Operational Products for DPG
• Operated at US Army DPG since Sep. 2007
[Product panels across domains D1-D3: surface and cross-section maps of ensemble mean and spread (e.g. 2-m temperature, mean T and wind), exceedance probabilities (e.g. likelihood of wind speed > 10 m/s), spaghetti plots; pin-point surface and profile products including wind roses and histograms.]
Forecast “calibration” or “post-processing”
[Schematic: forecast PDFs of flow rate vs. the observation, before and after calibration, illustrating correction of “bias” and of “spread” or “dispersion”.]
• Post-processing has corrected:
-- the “on average” bias
-- the under-representation of the 2nd moment of the empirical forecast PDF (i.e. corrected its “dispersion” or “spread”)
• Our approach:
-- the under-utilized quantile regression approach
-- a probability distribution function that “means what it says”
-- daily variations in ensemble dispersion relate directly to changes in forecast skill => an informative ensemble skill-spread relationship
Example of Quantile Regression (QR)
Our application: fitting temperature quantiles using QR conditioned on:
1) ranked forecast ensemble
2) ensemble mean
3) ensemble median
4) ensemble stdev
5) persistence
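A minimal sketch of fitting conditional quantiles with statsmodels' QuantReg, using synthetic data and only the ensemble mean and stdev as regressors (the slide's full set also includes the ranked members, the median, and persistence); all variable names and data here are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_days, n_members = 400, 30

# Hypothetical ensemble temperature forecasts (K) with flow-dependent spread
spread = rng.uniform(0.5, 3.0, n_days)
center = 285.0 + rng.normal(0.0, 4.0, n_days)
ens = center[:, None] + spread[:, None] * rng.standard_normal((n_days, n_members))

# Hypothetical verifying observations: biased, with spread-dependent error
obs = center - 1.0 + 1.5 * spread * rng.standard_normal(n_days)

# Regressors: ensemble mean and ensemble standard deviation
X = sm.add_constant(np.column_stack([ens.mean(axis=1), ens.std(axis=1)]))

# One quantile regression per target quantile
for q in (0.1, 0.5, 0.9):
    fit = sm.QuantReg(obs, X).fit(q=q)
    print(f"q={q}: intercept, coef(mean), coef(stdev) = {fit.params.round(2)}")
```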
Post-processing procedure
Step 1: Determine climatological quantiles (the climatological PDF serves as the “prior”).
Step 2: For each quantile, use “forward step-wise cross-validation” to iteratively select the best regressor subset from:
1) reforecast ensemble
2) ensemble mean
3) ensemble stdev
4) persistence
5) logistic-regression quantile (not shown)
Selection requirements: a) QR cost-function minimum; b) satisfy the binomial distribution at 95% confidence. If the requirements are not met, retain the climatological “prior”.
Step 3: Segregate forecasts into differing ranges of ensemble dispersion (I, II, III, …) and refit the models (Step 2) uniquely for each range.
Final result: a “sharper” posterior forecast PDF represented by interpolated quantiles.
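A simplified sketch of the Step 2 selection loop, assuming the pinball (QR) cost function; the cross-validation split and the 95% binomial coverage test are omitted for brevity, so this illustrates the greedy forward search rather than the authors' exact procedure.

```python
import numpy as np
import statsmodels.api as sm

def pinball_loss(y, yhat, q):
    """QR cost function: mean of q*e for e > 0 and (q-1)*e for e < 0."""
    e = y - yhat
    return np.mean(np.maximum(q * e, (q - 1) * e))

def forward_select(y, regressors, names, q):
    """Greedy forward selection for one quantile: starting from the
    climatological quantile (the 'prior'), repeatedly add the regressor
    that most reduces the pinball loss; stop when none helps."""
    chosen = []
    yhat = np.full_like(y, np.quantile(y, q), dtype=float)  # climatological prior
    loss = pinball_loss(y, yhat, q)
    while True:
        best_j, best_loss = None, loss
        for j in range(regressors.shape[1]):
            if j in chosen:
                continue
            X = sm.add_constant(regressors[:, chosen + [j]])
            pred = sm.QuantReg(y, X).fit(q=q).predict(X)
            trial = pinball_loss(y, pred, q)
            if trial < best_loss:
                best_j, best_loss = j, trial
        if best_j is None:
            break  # no candidate improves the cost: keep current subset
        chosen.append(best_j)
        loss = best_loss
    return [names[j] for j in chosen], loss
```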
Utilizing verification measures in near-real-time …
Measures used:
• Rank histogram (converted to a scalar measure)
• Root mean square error (RMSE)
• Brier score
• Ranked Probability Score (RPS)
• Relative Operating Characteristic (ROC) curve
• New measure of ensemble skill-spread utility
=> Used for automated calibration model selection via a weighted sum of the skill scores of each (see the sketch below)
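The automated selection could look something like the toy function below, scoring each candidate calibration model by a weighted sum of positively oriented skill scores; the weights and the candidate scores are assumptions, not values from the deck.

```python
def combined_skill(scores, weights):
    """Weighted sum of positively oriented skill scores
    (e.g. rank-histogram flatness, RMSE-SS, Brier SS, RPSS, ROC area).
    The weights are an assumed tuning choice."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical candidates and weights; pick the highest combined score
candidates = {"raw": [0.20, 0.10, 0.15], "qr": [0.60, 0.50, 0.40]}
weights = [0.5, 0.3, 0.2]
best = max(candidates, key=lambda k: combined_skill(candidates[k], weights))
print(best)  # -> "qr"
```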
Problems with spread-skill correlation …
• ECMWF spread-skill correlation (black) << 1
• Even “perfect model” (blue) correlation << 1, and it varies with forecast lead time
[Panels by lead time:]
1 day: ECMWF r = ?, “Perfect” r = 0.56
4 day: ECMWF r = 0.33, “Perfect” r = 0.68
7 day: ECMWF r = 0.39, “Perfect” r = 0.53
10 day: ECMWF r = 0.36, “Perfect” r = 0.49
3-hr dewpoint time series, Station DPG S01: before calibration vs. after calibration
42-hr dewpoint time series, Station DPG S01: before calibration vs. after calibration
PDFs: raw vs. calibrated
• Blue is the “raw” ensemble; black is the calibrated ensemble; red is the observed value
• Notice the significant change in both the “bias” and the dispersion of the final PDF (also notice the PDF asymmetries)
3-hr dewpoint rank histograms, Station DPG S01
42-hr dewpoint rank histograms, Station DPG S01
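For reference, a rank histogram like those above can be computed as in this sketch (ties between members and the observation are ignored for simplicity): a calibrated ensemble should give a flat histogram.

```python
import numpy as np

def rank_histogram(ens, obs):
    """Count where each observation ranks within its own sorted ensemble.
    ens: (n_forecasts, n_members); obs: (n_forecasts,). Returns counts
    for the m+1 possible ranks; flat counts indicate calibration."""
    n, m = ens.shape
    ranks = np.sum(ens < obs[:, None], axis=1)  # rank 0..m
    return np.bincount(ranks, minlength=m + 1)

# Toy calibrated case: obs drawn from the same distribution as members
rng = np.random.default_rng(0)
ens = rng.normal(size=(1000, 20))
obs = rng.normal(size=1000)
print(rank_histogram(ens, obs))  # roughly uniform counts
```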
Skill Scores
• A single value to summarize performance
• Reference forecast: the best naive guess, e.g. persistence or climatology
• A perfect forecast implies that the object can be perfectly observed
• Positively oriented: positive is good
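For a negatively oriented score S (e.g. RMSE or CRPS, where 0 is perfect), the conventional positively oriented skill score against a reference forecast is

```latex
\mathrm{SS} \;=\; \frac{S_{\text{ref}} - S_{\text{fcst}}}{S_{\text{ref}} - S_{\text{perfect}}}
\;=\; 1 - \frac{S_{\text{fcst}}}{S_{\text{ref}}} \qquad (S_{\text{perfect}} = 0),
```

so SS = 1 for a perfect forecast, 0 for no improvement over the reference, and negative when the forecast is worse than the reference.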
Skill score verification: CRPS skill score and RMSE skill score
Reference forecasts: black = raw ensemble, blue = persistence
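A sketch of the CRPS and its skill score for an ensemble forecast, using the standard kernel (energy) form of the ensemble CRPS; the function names are illustrative.

```python
import numpy as np

def crps_ensemble(ens, obs):
    """CRPS via the kernel form: CRPS = E|X - y| - 0.5 * E|X - X'|,
    where X, X' are independent draws from the ensemble and y is the
    verifying observation."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.mean(np.abs(ens - obs))
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))
    return term1 - term2

def crpss(crps_fcst, crps_ref):
    """CRPS skill score against a reference (e.g. raw ensemble or persistence)."""
    return 1.0 - crps_fcst / crps_ref

# Toy usage with a hypothetical 30-member ensemble and one observation
rng = np.random.default_rng(0)
print(crps_ensemble(rng.normal(0.0, 1.0, 30), 0.3))
```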
Computational resource questions: how best to utilize multi-model simulations (forecasts), especially if under-dispersive?
• Should more dynamical variability be searched for? Or
• Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
3-hr dewpoint rank histograms, Station DPG S01
RMSE of ensemble members, Station DPG S01: 3-hr vs. 42-hr lead time
Significant calibration regressors, Station DPG S01: 3-hr vs. 42-hr lead time
Questions revisited: how best to utilize multi-model simulations (forecasts), especially if under-dispersive?
• Should more dynamical variability be searched for? Or
• Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
Warning: adding more models can lead to decreasing utility of the ensemble mean (even if the ensemble is under-dispersive)
Summary
• Quantile regression provides a powerful framework for improving the whole (potentially non-Gaussian) PDF of an ensemble forecast, with different regressors for different quantiles and lead times
• This framework provides an umbrella for blending multiple statistical correction approaches (logistic regression, etc.; not shown) as well as multiple regressors
• “Step-wise cross-validation”-based calibration also provides a method to ensure forecast skill no worse than climatology and persistence for a variety of cost functions
• As shown here, significant improvements were made to the forecast's ability to represent its own potential error (while improving sharpness):
-- uniform rank histogram
-- significant spread-skill relationship (new skill-spread measure)
• Care should be used before “throwing more models” at an “under-dispersive” forecast problem
• Further questions: hopson@ucar.edu or yliu@ucar.edu
Other options …
1) Assign dispersion bins, then:
2) Average the error values in each bin, then correlate (see the sketch below)
3) Calculate individual rank histograms for each bin and convert each to a scalar measure
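A sketch of option 2, assuming quantile-based bin edges (an implementation choice the slide does not specify):

```python
import numpy as np

def binned_spread_skill(spread, abs_error, n_bins=10):
    """Bin forecasts by ensemble spread, average |error| within each bin,
    then correlate bin-mean spread with bin-mean error. Reduces the noise
    that depresses the raw per-forecast spread-skill correlation."""
    edges = np.quantile(spread, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(spread, edges[1:-1]), 0, n_bins - 1)
    mean_spread = np.array([spread[idx == b].mean() for b in range(n_bins)])
    mean_error = np.array([abs_error[idx == b].mean() for b in range(n_bins)])
    return np.corrcoef(mean_spread, mean_error)[0, 1]

# Toy case where spread genuinely predicts the error magnitude
rng = np.random.default_rng(0)
spread = rng.uniform(0.5, 3.0, 2000)
err = np.abs(spread * rng.standard_normal(2000))
print(binned_spread_skill(spread, err))  # close to 1 after binning
```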
Before calibration => under-dispersive. Example: French Broad River. Black curve shows observations; colors are ensemble members.
Rank histogram comparisons: raw full ensemble vs. after calibration. After quantile regression the rank histogram is more uniform (although now slightly over-dispersive).
What Nash-Sutcliffe (RMSE) implies about utility
Frequency used for quantile fitting (Method I):
• Best model = 76%
• Ensemble stdev = 13%
• Ensemble mean = 0%
• Ranked ensemble = 6%
Note
[Schematic: forecast PDF of discharge vs. the observation.]
Take-home message: for a “calibrated ensemble”, the error variance of the ensemble mean is 1/2 the error variance of any ensemble member (on average), independent of the distribution being sampled (a sketch of why follows below).
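A sketch of why, assuming the observation y and each member x_i are independent draws from the same distribution with variance sigma^2 (which is what calibration buys):

```latex
\mathbb{E}\!\left[(x_i - y)^2\right] = 2\sigma^2, \qquad
\mathbb{E}\!\left[(\bar{x} - y)^2\right]
  = \operatorname{Var}(\bar{x}) + \operatorname{Var}(y)
  = \sigma^2\!\left(1 + \tfrac{1}{N}\right) \xrightarrow{\,N \to \infty\,} \sigma^2 ,
```

i.e. the ensemble-mean error variance approaches half the error variance of any single member, regardless of the shape of the sampled distribution.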
What Nash-Sutcliffe (RMSE) implies about utility (cont.): degradation with increased ensemble size
Sequentially averaged models (ranked by NS score) and their resultant NS score (see the sketch below):
=> Notice the degradation of NS with increasing # of models (with a peak at 2 models)
=> For an equitable multi-model, NS should rise monotonically
=> Maybe a smaller subset of models would have more utility? (A contradiction for an under-dispersive ensemble?)
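A sketch of the experiment behind the slide, on hypothetical data (the NS formula itself is standard):

```python
import numpy as np

def nash_sutcliffe(sim, obs):
    """NS = 1 - sum((sim - obs)^2) / sum((obs - mean(obs))^2);
    an RMSE-based skill score with climatology as the reference."""
    obs = np.asarray(obs, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def ns_vs_ensemble_size(members, obs):
    """Rank members by individual NS, then score the running mean of the
    best 1, 2, ..., N members: the curve the slide shows peaking early."""
    scores = [nash_sutcliffe(m, obs) for m in members]
    ranked = np.asarray(members)[np.argsort(scores)[::-1]]
    return [nash_sutcliffe(ranked[:k].mean(axis=0), obs)
            for k in range(1, len(members) + 1)]

# Toy multi-model set with members of increasing bias
rng = np.random.default_rng(0)
obs = rng.normal(10.0, 3.0, 200)
members = [obs + rng.normal(bias, 2.0, 200) for bias in np.linspace(0, 2, 10)]
print(np.round(ns_vs_ensemble_size(members, obs), 2))
```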
What Nash-Sutcliffe implies about utility (cont.): using only the top 1/3 of models to rank and form the ensemble mean
Earlier results, initial frequency used for quantile fitting:
• Best model = 76%, Ensemble stdev = 13%, Ensemble mean = 0%, Ranked ensemble = 6%
Reduced-set frequency used for quantile fitting:
• Best model = 73%, Ensemble stdev = 3%, Ensemble mean = 32%, Ranked ensemble = 29%
=> There appear to be significant gains in the utility of the ensemble after “filtering” (except for the drop in stdev) … however, “the proof is in the pudding” …
=> Examine verification skill measures …
Skill score comparisons between the full and “filtered” ensemble sets
GREEN: full calibrated multi-model; BLUE: “filtered” calibrated multi-model; reference: uncalibrated set
Points:
• quite similar results for a variety of skill scores
• both approaches give appreciable benefit over the original raw multi-model output
• however, only in the CRPSS is there improvement of the “filtered” ensemble set over the full set
=> post-processing method fairly robust
=> More work (more filtering?)!