Model Performance Metrics, Ambient Data Sets and Evaluation Tools Gail Tonnesen, Chao-Jung Chien, Bo Wang Youjun Qin, Zion Wang, Tiegang Cao USEPA PM Model Evaluation Workshop, RTP, NC February 9-10, 2004
Acknowledgments • Funding from the Western Regional Air Partnership Modeling Forum and VISTAS. • Assistance from EPA and others in gaining access to ambient data. • 12 km plots and analysis from Jim Boylan, State of Georgia.
Outline • UCR Model Evaluation Software • Problems we had to solve • Choice of metrics for clean conditions • Judging performance for high-resolution nested domains
Motivation • Needed to evaluate model performance for WRAP annual regional haze modeling: • Required a very large number of sites and days • For several different ambient monitoring networks • Evaluation would be repeated many times: • Many iterations on the “base case” • Several model sensitivity/diagnostic cases to evaluate • Limited time and resources were available to complete the evaluation.
Solution • Develop model evaluation software to: • Compute 17 statistical metrics for model evaluation • Generate graphical plots in a variety of formats: • Scatter plots: all sites for one month, all sites for the full year, one site for all days, one day for all sites • Time series for each site
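As one hedged illustration of what such software does (a minimal sketch, not the UCR code), the snippet below computes two of the metrics and draws a one-month, all-sites scatter plot from a table of paired model and observed values; the file name and column names are hypothetical.

```python
# Minimal sketch (not the UCR software): compute two evaluation metrics and a
# scatter plot from a table of paired model/observed values.
# The file name and column names (site, date, obs, mod) are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

pairs = pd.read_csv("paired_so4_improve.csv")   # columns: site, date, obs, mod
obs, mod = pairs["obs"], pairs["mod"]

nmb = 100.0 * (mod - obs).sum() / obs.sum()                    # Normalized Mean Bias, %
fe = 100.0 * (2.0 * (mod - obs).abs() / (mod + obs)).mean()    # Fractional Error, %

fig, ax = plt.subplots()
ax.scatter(obs, mod, s=8)
lim = max(obs.max(), mod.max())
ax.plot([0, lim], [0, lim], "k--", linewidth=1)                # 1:1 reference line
ax.set_xlabel("Observed SO4 (ug/m3)")
ax.set_ylabel("Modeled SO4 (ug/m3)")
ax.set_title(f"All sites, one month (NMB={nmb:.1f}%, FE={fe:.1f}%)")
fig.savefig("scatter_so4_month.png", dpi=150)
```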
Ambient Monitoring Networks • IMPROVE (Interagency Monitoring of Protected Visual Environments) • CASTNET (Clean Air Status and Trends Network) • EPA’s AQS (Air Quality System) database • EPA’s STN (Speciation Trends Network) • NADP (National Atmospheric Deposition Program) • SEARCH daily & hourly data • PAMS (Photochemical Assessment Monitoring Stations) • PM Supersites
Overlap Among Monitoring Networks • [Diagram: overlapping coverage of AQS (AIRS), PAMS, EPA PM sites, IMPROVE, CASTNet, and other monitoring stations from state and local agencies, with the species each measures: O3, NOx, VOCs, CO, SO2, Pb, HNO3, NO3, SO4, PM25, PM10, speciated PM25, and visibility]
Species Mapping • Specify how to compare the model with data for each network. • Unique species mapping for each air quality model.
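As an illustration of what such a mapping might look like (a sketch, not the project’s actual mapping tables), each measured quantity is expressed as a combination of model species; the CMAQ and CAMx aerosol species names are typical, while the observation-side keys are hypothetical labels.

```python
# Illustrative species mapping (not the actual UCR mapping tables): express each
# measured quantity as a combination of model species. The CMAQ/CAMx names are
# typical aerosol species; the IMPROVE_* keys are hypothetical labels.
SPECIES_MAP = {
    "CMAQ": {
        "IMPROVE_SO4": ["ASO4I", "ASO4J"],   # Aitken + accumulation mode sulfate
        "IMPROVE_NO3": ["ANO3I", "ANO3J"],
    },
    "CAMx": {
        "IMPROVE_SO4": ["PSO4"],
        "IMPROVE_NO3": ["PNO3"],
    },
}

def model_value(model, obs_species, conc):
    """Sum the mapped model species; conc is a dict of species -> ug/m3."""
    return sum(conc[s] for s in SPECIES_MAP[model][obs_species])
```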
Recommended Performance Metrics? • No EPA guidance available for PM. • Everyone has their personal favorite metric. • Several metrics are non-symmetric about zero, causing over-predictions to be exaggerated compared to under-predictions. • Is the coefficient of determination (R2) a useful metric?
Most Used Metrics • Mean Normalized Bias (MNB): from -100% to +inf • Normalized Mean Bias (NMB): from -100% to +inf • Fractional Bias (FB): from -200% to +200% • Fractional Error (FE): from 0% to +200% • Bias Factor (Knipping ratio): MNB + 1, reported as a ratio, e.g. 4:1 for over-prediction, 1:4 for under-prediction
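For reference, a minimal sketch of these metrics using their standard definitions (consistent with the ranges listed above); mod and obs are paired arrays of modeled and observed concentrations.

```python
import numpy as np

def metrics(mod, obs):
    """Common PM evaluation metrics for paired 1-D arrays mod and obs.
    Ranges: MNB and NMB >= -100%, FB in [-200%, 200%], FE in [0%, 200%]."""
    mod, obs = np.asarray(mod, float), np.asarray(obs, float)
    mnb = 100.0 * np.mean((mod - obs) / obs)                      # Mean Normalized Bias
    nmb = 100.0 * np.sum(mod - obs) / np.sum(obs)                 # Normalized Mean Bias
    fb = 100.0 * np.mean(2.0 * (mod - obs) / (mod + obs))         # Fractional Bias
    fe = 100.0 * np.mean(2.0 * np.abs(mod - obs) / (mod + obs))   # Fractional Error
    bias_factor = 1.0 + mnb / 100.0                               # Knipping ratio (MNB + 1)
    return {"MNB": mnb, "NMB": nmb, "FB": fb, "FE": fe, "BiasFactor": bias_factor}
```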
SAPRC99 vs. CB4 NO3; IMPROVE cross comparisons
SAPRC99 vs. CB4 SO4; IMPROVE cross comparisons
Time series plot for CMAQ vs. CAMx at SEARCH site – JST (Jefferson St.)
Table footnotes: (1) with 60 ppb ambient cutoff; (2) using 3 × elemental sulfur; (3) no data available in WRAP domain; (4) measurements available at 3 sites.
Viewing Spatial Patterns • Problem: model performance metrics and time-series plots do not identify cases where the model is “off by one grid cell.” • Process ambient data into the I/O API format so that the data can be compared to the model using PAVE.
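A minimal sketch of the gridding step this implies, assuming observations already carry coordinates in the model’s map projection (the grid parameters below are hypothetical, and the I/O API write for PAVE is omitted):

```python
import numpy as np

# Sketch: place point observations onto the model grid so spatial patterns can
# be compared cell by cell. Grid parameters are hypothetical; x/y are assumed
# to already be in the model's projected coordinates (meters).
XORIG, YORIG, DX, NCOLS, NROWS = -2736000.0, -2088000.0, 36000.0, 148, 112

def grid_obs(x, y, values):
    """Average all observations falling in each grid cell; NaN where no obs."""
    total = np.zeros((NROWS, NCOLS))
    count = np.zeros((NROWS, NCOLS))
    col = ((np.asarray(x) - XORIG) // DX).astype(int)
    row = ((np.asarray(y) - YORIG) // DX).astype(int)
    ok = (col >= 0) & (col < NCOLS) & (row >= 0) & (row < NROWS)
    np.add.at(total, (row[ok], col[ok]), np.asarray(values)[ok])
    np.add.at(count, (row[ok], col[ok]), 1)
    with np.errstate(invalid="ignore"):
        return np.where(count > 0, total / count, np.nan)
```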
Spatially Weighted Metrics • PAVE plots qualitatively indicate error relative to spatial patterns, but do we also need to quantify this? • A wind error of 30 degrees can cause the model to miss a peak by one or more grid cells. • Interpolate the model using surrounding grid cells? • Use the average of adjacent grid cells? Within what distance?
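One way the questions above could be answered (a sketch of a possible approach, not a method the authors adopted) is to score each observation against the mean of the model cells within a small neighborhood, so a peak displaced by one cell is not counted as a complete miss:

```python
import numpy as np

# Sketch of a spatially weighted comparison: use the mean of the model cells
# within a small neighborhood of the monitor's cell. The neighborhood
# half-width (in cells) is a free parameter.
def neighborhood_model_value(field, row, col, halfwidth=1):
    """Mean of the (2*halfwidth+1)^2 block of model cells around (row, col)."""
    r0, r1 = max(row - halfwidth, 0), min(row + halfwidth + 1, field.shape[0])
    c0, c1 = max(col - halfwidth, 0), min(col + halfwidth + 1, field.shape[1])
    return float(np.nanmean(field[r0:r1, c0:c1]))
```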
Judging Model Performance • Many plots and metrics, but what is the bottom line? • Need to stratify the data for model evaluation: • Evaluate seasonal performance. • Group by related types of sites. • Judge the model for each site or for similar groups of sites. • How best to group or stratify sites? • Want to avoid wasting time analyzing plots and metrics that are not useful.
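A hedged sketch of one such stratification (by season and an assumed site-class label; the column names and grouping are hypothetical, not the grouping the authors settled on):

```python
import pandas as pd

# Sketch of one stratification: NMB and FE by season and site class.
# Column names (site_class, date, obs, mod) are hypothetical.
pairs = pd.read_csv("paired_so4_improve.csv", parse_dates=["date"])
pairs["season"] = pairs["date"].dt.month.map(
    {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
     6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"})

def group_metrics(g):
    diff = g["mod"] - g["obs"]
    return pd.Series({
        "NMB_%": 100.0 * diff.sum() / g["obs"].sum(),
        "FE_%": 100.0 * (2.0 * diff.abs() / (g["mod"] + g["obs"])).mean(),
        "N": len(g)})

print(pairs.groupby(["season", "site_class"]).apply(group_metrics))
```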
Recommended Evaluation for Nests • Comparing performance metrics is not enough: • Performance metrics show a mixed response. • It is possible for the better model to have poorer metrics. • Diagnostic analysis is needed to compare the nested-grid model to the coarse-grid model.
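One simple building block for such a comparison (a sketch assuming the 12 km nest aligns exactly with the 36 km parent cells; it is not the diagnostic analysis described on the next slide) is to aggregate the nest to the parent resolution and difference the two fields:

```python
import numpy as np

# Sketch: aggregate each 3x3 block of 12 km nest cells to the 36 km parent
# resolution, then difference against the parent field over the nest footprint.
def aggregate_nest(fine, ratio=3):
    nr, nc = fine.shape[0] // ratio, fine.shape[1] // ratio
    return fine[:nr * ratio, :nc * ratio].reshape(nr, ratio, nc, ratio).mean(axis=(1, 3))

def nest_minus_parent(fine_12km, coarse_36km_window):
    """Difference field on the 36 km grid over the nest's footprint."""
    return aggregate_nest(fine_12km) - coarse_36km_window
```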
Example Diagnostic Analysis • Some sites had worse metrics for the 12 km grid. • Analysis by Jim Boylan comparing differences in the 12 km and 36 km results showed major effects from: • Regional precipitation • Regional transport (wind speed & direction) • Plume definition
Figure series, 36 km grid vs. 12 km grid: wet sulfate on July 9 at 01:00; sulfate on July 9 at 05:00, 06:00, 07:00, and 08:00; on July 10 at 00:00, 06:00, 09:00, 12:00, 16:00, and 21:00; and on July 11 at 00:00.