Model Performance Metrics, Ambient Data Sets and Evaluation Tools Gail Tonnesen, Chao-Jung Chien, Bo Wang Youjun Qin, Zion Wang, Tiegang Cao USEPA PM Model Evaluation Workshop, RTP, NC February 9-10, 2004
Acknowledgments • Funding from the Western Regional Air Partnership Modeling Forum and VISTAS. • Assistance from EPA and others in gaining access to ambient data. • 12 km plots and analysis from Jim Boylan, State of Georgia.
Outline • UCR Model Evaluation Software • Problems we had to solve • Choice of metrics for clean conditions • Judging performance for high-resolution nested domains
Motivation • Needed to evaluate model performance for WRAP annual regional haze modeling: • Required a very large number of sites and days • For several different ambient monitoring networks • Evaluation would be repeated many times: • Many iterations on the “base case” • Several model sensitivity/diagnostic cases to evaluate • Limited time and resources were available to complete the evaluation.
Solution • Develop model evaluation software to: • Compute 17 statistical metrics for model evaluation • Generate graphical plots in a variety of formats: • Scatter plots: all sites for one month, all sites for the full year, one site for all days, one day for all sites • Time series for each site
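As one hedged illustration of what such software does (a minimal sketch, not the UCR code), the snippet below computes two of the metrics and draws a one-month, all-sites scatter plot from a table of paired model and observed values; the file name and column names are hypothetical.

```python
# Minimal sketch (not the UCR software): compute two evaluation metrics and a
# scatter plot from a table of paired model/observed values.
# The file name and column names (site, date, obs, mod) are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

pairs = pd.read_csv("paired_so4_improve.csv")   # columns: site, date, obs, mod
obs, mod = pairs["obs"], pairs["mod"]

nmb = 100.0 * (mod - obs).sum() / obs.sum()                    # Normalized Mean Bias, %
fe = 100.0 * (2.0 * (mod - obs).abs() / (mod + obs)).mean()    # Fractional Error, %

fig, ax = plt.subplots()
ax.scatter(obs, mod, s=8)
lim = max(obs.max(), mod.max())
ax.plot([0, lim], [0, lim], "k--", linewidth=1)                # 1:1 reference line
ax.set_xlabel("Observed SO4 (ug/m3)")
ax.set_ylabel("Modeled SO4 (ug/m3)")
ax.set_title(f"All sites, one month (NMB={nmb:.1f}%, FE={fe:.1f}%)")
fig.savefig("scatter_so4_month.png", dpi=150)
```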
Ambient Monitoring Networks • IMPROVE (Interagency Monitoring of Protected Visual Environments) • CASTNET (Clean Air Status and Trends Network) • EPA’s AQS (Air Quality System) database • EPA’s STN (Speciation Trends Network) • NADP (National Atmospheric Deposition Program) • SEARCH daily & hourly data • PAMS (Photochemical Assessment Monitoring Stations) • PM Supersites
Overlap Among Monitoring Networks • [Diagram: overlapping coverage of AQS (AIRS), PAMS, EPA PM sites, IMPROVE, CASTNet, and other monitoring stations from state and local agencies, with the species each measures: O3, NOx, VOCs, CO, SO2, Pb, HNO3, NO3, SO4, PM25, PM10, speciated PM25, and visibility]
Species Mapping • Specify how to compare the model with data for each network. • Unique species mapping for each air quality model.
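As an illustration of what such a mapping might look like (a sketch, not the project’s actual mapping tables), each measured quantity is expressed as a combination of model species; the CMAQ and CAMx aerosol species names are typical, while the observation-side keys are hypothetical labels.

```python
# Illustrative species mapping (not the actual UCR mapping tables): express each
# measured quantity as a combination of model species. The CMAQ/CAMx names are
# typical aerosol species; the IMPROVE_* keys are hypothetical labels.
SPECIES_MAP = {
    "CMAQ": {
        "IMPROVE_SO4": ["ASO4I", "ASO4J"],   # Aitken + accumulation mode sulfate
        "IMPROVE_NO3": ["ANO3I", "ANO3J"],
    },
    "CAMx": {
        "IMPROVE_SO4": ["PSO4"],
        "IMPROVE_NO3": ["PNO3"],
    },
}

def model_value(model, obs_species, conc):
    """Sum the mapped model species; conc is a dict of species -> ug/m3."""
    return sum(conc[s] for s in SPECIES_MAP[model][obs_species])
```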
Recommended Performance Metrics? • No EPA guidance available for PM. • Everyone has their personal favorite metric. • Several metrics are non-symmetric about zero, causing over-predictions to be exaggerated compared to under-predictions. • Is the coefficient of determination (R2) a useful metric?
Most Used Metrics • Mean Normalized Bias (MNB): from -100% to +inf • Normalized Mean Bias (NMB): from -100% to +inf • Fractional Bias (FB): from -200% to +200% • Fractional Error (FE): from 0% to +200% • Bias Factor (Knipping ratio): MNB + 1, reported as a ratio, e.g. 4:1 for over-prediction, 1:4 for under-prediction
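For reference, a minimal sketch of these metrics using their standard definitions (consistent with the ranges listed above); mod and obs are paired arrays of modeled and observed concentrations.

```python
import numpy as np

def metrics(mod, obs):
    """Common PM evaluation metrics for paired 1-D arrays mod and obs.
    Ranges: MNB and NMB >= -100%, FB in [-200%, 200%], FE in [0%, 200%]."""
    mod, obs = np.asarray(mod, float), np.asarray(obs, float)
    mnb = 100.0 * np.mean((mod - obs) / obs)                      # Mean Normalized Bias
    nmb = 100.0 * np.sum(mod - obs) / np.sum(obs)                 # Normalized Mean Bias
    fb = 100.0 * np.mean(2.0 * (mod - obs) / (mod + obs))         # Fractional Bias
    fe = 100.0 * np.mean(2.0 * np.abs(mod - obs) / (mod + obs))   # Fractional Error
    bias_factor = 1.0 + mnb / 100.0                               # Knipping ratio (MNB + 1)
    return {"MNB": mnb, "NMB": nmb, "FB": fb, "FE": fe, "BiasFactor": bias_factor}
```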
SAPRC99 vs. CB4 NO3; IMPROVE cross comparisons
SAPRC99 vs. CB4 SO4; IMPROVE cross comparisons
Time series plot for CMAQ vs. CAMx at SEARCH site – JST (Jefferson St.)
Table footnotes: (1) with 60 ppb ambient cutoff; (2) using 3 × elemental sulfur; (3) no data available in WRAP domain; (4) measurements available at 3 sites.
Viewing Spatial Patterns • Problem: model performance metrics and time-series plots do not identify cases where the model is “off by one grid cell.” • Process ambient data into the I/O API format so that the data can be compared to the model using PAVE.
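A minimal sketch of the gridding step this implies, assuming observations already carry coordinates in the model’s map projection (the grid parameters below are hypothetical, and the I/O API write for PAVE is omitted):

```python
import numpy as np

# Sketch: place point observations onto the model grid so spatial patterns can
# be compared cell by cell. Grid parameters are hypothetical; x/y are assumed
# to already be in the model's projected coordinates (meters).
XORIG, YORIG, DX, NCOLS, NROWS = -2736000.0, -2088000.0, 36000.0, 148, 112

def grid_obs(x, y, values):
    """Average all observations falling in each grid cell; NaN where no obs."""
    total = np.zeros((NROWS, NCOLS))
    count = np.zeros((NROWS, NCOLS))
    col = ((np.asarray(x) - XORIG) // DX).astype(int)
    row = ((np.asarray(y) - YORIG) // DX).astype(int)
    ok = (col >= 0) & (col < NCOLS) & (row >= 0) & (row < NROWS)
    np.add.at(total, (row[ok], col[ok]), np.asarray(values)[ok])
    np.add.at(count, (row[ok], col[ok]), 1)
    with np.errstate(invalid="ignore"):
        return np.where(count > 0, total / count, np.nan)
```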
Spatially Weighted Metrics • PAVE plots qualitatively indicate error relative to spatial patterns, but do we also need to quantify this? • A wind error of 30 degrees can cause the model to miss a peak by one or more grid cells. • Interpolate the model using surrounding grid cells? • Use the average of adjacent grid cells? Within what distance?
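One way the questions above could be answered (a sketch of a possible approach, not a method the authors adopted) is to score each observation against the mean of the model cells within a small neighborhood, so a peak displaced by one cell is not counted as a complete miss:

```python
import numpy as np

# Sketch of a spatially weighted comparison: use the mean of the model cells
# within a small neighborhood of the monitor's cell. The neighborhood
# half-width (in cells) is a free parameter.
def neighborhood_model_value(field, row, col, halfwidth=1):
    """Mean of the (2*halfwidth+1)^2 block of model cells around (row, col)."""
    r0, r1 = max(row - halfwidth, 0), min(row + halfwidth + 1, field.shape[0])
    c0, c1 = max(col - halfwidth, 0), min(col + halfwidth + 1, field.shape[1])
    return float(np.nanmean(field[r0:r1, c0:c1]))
```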
Judging Model Performance • Many plots and metrics, but what is the bottom line? • Need to stratify the data for model evaluation: • Evaluate seasonal performance. • Group by related types of sites. • Judge the model for each site or for similar groups of sites. • How best to group or stratify sites? • Want to avoid wasting time analyzing plots and metrics that are not useful.
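A hedged sketch of one such stratification (by season and an assumed site-class label; the column names and grouping are hypothetical, not the grouping the authors settled on):

```python
import pandas as pd

# Sketch of one stratification: NMB and FE by season and site class.
# Column names (site_class, date, obs, mod) are hypothetical.
pairs = pd.read_csv("paired_so4_improve.csv", parse_dates=["date"])
pairs["season"] = pairs["date"].dt.month.map(
    {12: "DJF", 1: "DJF", 2: "DJF", 3: "MAM", 4: "MAM", 5: "MAM",
     6: "JJA", 7: "JJA", 8: "JJA", 9: "SON", 10: "SON", 11: "SON"})

def group_metrics(g):
    diff = g["mod"] - g["obs"]
    return pd.Series({
        "NMB_%": 100.0 * diff.sum() / g["obs"].sum(),
        "FE_%": 100.0 * (2.0 * diff.abs() / (g["mod"] + g["obs"])).mean(),
        "N": len(g)})

print(pairs.groupby(["season", "site_class"]).apply(group_metrics))
```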
Recommended Evaluation for Nests • Comparing performance metrics is not enough: • Performance metrics show a mixed response. • It is possible for the better model to have poorer metrics. • Diagnostic analysis is needed to compare the nested-grid model to the coarse-grid model.
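One simple building block for such a comparison (a sketch assuming the 12 km nest aligns exactly with the 36 km parent cells; it is not the diagnostic analysis described on the next slide) is to aggregate the nest to the parent resolution and difference the two fields:

```python
import numpy as np

# Sketch: aggregate each 3x3 block of 12 km nest cells to the 36 km parent
# resolution, then difference against the parent field over the nest footprint.
def aggregate_nest(fine, ratio=3):
    nr, nc = fine.shape[0] // ratio, fine.shape[1] // ratio
    return fine[:nr * ratio, :nc * ratio].reshape(nr, ratio, nc, ratio).mean(axis=(1, 3))

def nest_minus_parent(fine_12km, coarse_36km_window):
    """Difference field on the 36 km grid over the nest's footprint."""
    return aggregate_nest(fine_12km) - coarse_36km_window
```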
Example Diagnostic Analysis • Some sites had worse metrics for the 12 km grid. • Analysis by Jim Boylan comparing differences in the 12 km and 36 km results showed major effects from: • Regional precipitation • Regional transport (wind speed & direction) • Plume definition
Figure series, 36 km grid vs. 12 km grid: wet sulfate on July 9 at 01:00; sulfate on July 9 at 05:00, 06:00, 07:00, and 08:00; on July 10 at 00:00, 06:00, 09:00, 12:00, 16:00, and 21:00; and on July 11 at 00:00.