Benchmark dataset processing

Benchmark dataset processing P. Štěpánek, P. Zahradníček Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic, E-mail: petr.stepanek@chmi.cz COST-ESO601 meeting, Tarragona, 9-11 March 2009

Outline • Outliers (daily data) • Detection (monthly data) + correction (daily data) (description of methodology) • Benchmark dataset (monthly data)

Processing before any data analysis Software AnClim, ProClimDB

Data Quality Control Finding Outliers Two main approaches: • Using limits derived from interquartile ranges(time series) • comparing values to values of neighbouring stations(spatial analysis)

Example of outputs for outliers assessment Suspicious values Expected value Neighbour stations values Altitudes and distances of neighbours List of neighbours

Quality control Spatial distribution • Run for period 1961-2007, daily data (measured values in observation hours) • All stations (200 climatological stations, 800 precipitation stations) • All meteorological elements (T, TMA, TMI, TPM, SRA, SCE, SNO, E, RV, H, F) – parameters set individually • Historical records will follow now

Air temperature, number of outliers 1961-2007, from 3.431.000 station-days T – air temperature at obs. hour, TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.

Air temperature, number of outliers 1961-2007, from 3.431.000 station-days Air temperature at obs. hour, AVG – daily average temp.

Air temperature, number of outliers 1961-2007, from 3.431.000 station-days TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.

Air temperature, number of outliers 1961-2007, Number of outliers per one station(all observation hours, AVG)

Spatial distribution of precipitation stations Spatial distribution • period 1961-2007 • 600 stations • mean minimum distance: 7.5 km

Problematic detections (heavy rainfall)

Problematic detections (heavy rainfall), Radar information

Precipitation, number of outliers 1961-2007, from 13.724.000station-days .

Precipitation, number of outliers 1961-2007, Number of outliers per one station

Conclusions • Only combination of several methods for outliers detection leads to satisfying results (“real” outliers detection, supressing fault detection -> Emsemble approach) • Parameters (settings) has to be found individually for each meteorological element, maybe also region (terrain complexity) and part of a year (noticeable annual cycle in number of outliers) • Similar to homogenization of time series, it is important to use measured value (e.g. from observation hours) - outliers are masked in daily average (and even more in monthly or annual ones) • Errors found in all elements and investigated countries(AT, CZ, SK, HU)

Experiences with homogenization in the Czech Republic

Homogenization • Change of measuring conditions inhomogeneities

DetectingInhomogeneities by SNHT (p=0.05, 950 series) Detection A

we depend upon statistical tests results Assessing Homogeneity - Problems • most of metadata incomplete • uncertainty in test results - right inhomogeneity detection is problematic (for smaller amount of change)

„Ensemble“ approach - processing of big amount of test resultsfor each individual series Proposed solution SOLUTION • most of metadata incomplete • uncertainty in test results - the right inhomogeneity detection is problematic • To get as many test results for each candidate series as possible • For each year, group of years and whole series: portion of number of detected inhomogenities in number of all possible (theoretical) detections

How to increase number of test results Diagram Days, Months, seasons, year

Creating Reference Series • for monthly, daily data (each month individually) • weighted/unweighted mean from neighbouring stations • criterions used for stations selection (or combination of it): • best correlated / nearest neighbours (correlations – from the first differenced series) • limit correlation, limit distance • limit difference in altitudes • neighbouring stations series should be standardized to test series AVG and / or STD (temperature - elevation, precipitation - variance) - missing data are not so big problem then

Relative homogeneity testing • Available tests: • Alexandersson SNHT • Bivariate test of Maronna and Yohai • Mann – Whitney – Pettit test • t-test • Easterling and Peterson test • Vincent method • … 20 year parts of the daily series(40 for monthly series with 10 years overlap), in SNHT splitting into subperiods in position of detected significant changepoint (30-40 years per one inhomogeneity)

Homogeneity assessment Homogeneity assessment Output example: Station Čáslav, 3rd segment, 1911-1950, n=40 • Quality control • Homogenization • Data Analysis

Homogeneity assessment, Output II example: Homogeneity assessment Summed numbers of detections for individual years

Homogeneity assessment Homogeneity assessment • combining several outputs (sums of detections in individual years, metadata, graphs of differences/ratios, …)

Adjusting monthly data • using reference series based on correlations • adjustment: from differences/ratios 20 years before and after a change, monhtly • smoothing monthly adjustments (low-pass filter for adjacent values)

Example: Adjusting values - evaluation Example combining series

Iterative homogeneity testing • several iteration of testing and results evaluation • several iterations of homogeneity testing and series adjusting (3 iterations should be sufficient) • question of homogeneity of reference series is thus solved: • possible inhomogeneities should be eliminated by using averages of several neighbouring stations • if this is not true: in next iteration neighbours should be already homogenized

Dependence of tested series on reference series Filling missing values • Before homogenization: influence on right inhomogeneity detection • After homogenization: more precise - data are not influenced by possible shifts in the series

Homogenization of the series in the Czech Republic

Correlations between tested and reference series Air temperature Graphs - selection Boxplots: - Median - Upper and lower quartiles (for 200 testes series)

Correlations between tested and reference series Precipitation, snow depth, new snow Graphs - selection Boxplots: - Median - Upper and lower quartiles (for 800 testes series)

Correlations between tested and reference series Wind speed Graphs - selection Boxplots: - Median - Upper and lower quartiles (for 200 testes series)

Homogeneity testing resultsAir temperature Number of significant inhomogeneities (0.05) detected by used tests (A, B tests, c and d reference series, alltogether)

Homogeneity testing resultsAir temperature Amount of adjustments, averages of absolute values, T_AVG

Homogeneity testing resultsPrecipitation • 4 tests, 4 reference series, 12 months + 4 seasons and year • Number of detected inhomogeneities (significant)

Amount of change (ratios – standardized to be >1.0), precipitation (reference series calculation based on correlations) Graphs - adjsust Boxplots: - Median - Upper and lower quartiles (for 589 testes series) Correlation improvement

HomogenizationFinal remarks, recommendations 1/2 • data quality control before homogenization is of very importance (if it is not part of it) • Using series of observation hours (complementarily to daily AVG) is highly recommended(different manifestation of breaks) • be aware of annual cycle of inhomogeneities, adjustments, … • to know behavior of spatial correlations (of element being processed) to be able to create reference series of sufficient quality …

HomogenizationFinal remarks, recommendations 2/2 • Because of Noise in the time seriesit makes sense: • - „Ensemble“ approach to homogenization (combining information from different statistical tests, time frames, overlapping periods, reference series, meteorological elements, …) • - more information for inhomogeneities assessment – higher quality of homogenization in case metadata are incomplete

Software used for data processing • LoadData - application for downloading data from central database (e.g. Oracle) • ProClimDB software for processing whole dataset (finding outliers, combining series, creating reference series, preparing data for homogeneity testing, extreme value analysis, RCM outputs validation, correction, …) • AnClim software for homogeneity testing http://www.climahom.eu

AnClim software AnClim software

ProClimDB software ProcData software

Testing of benchmark dataset • Fully automated (just for one click), but it should not work like this in reality: detection phase can be fully automated, but desicion about breaks to be corrected should be man-made (comparison with metadata, plots of differences – ratios, …)

Fully automatic detection phase in ProClimDB softwareselecting neighbours for reference series

Fully automatic detection phase in ProClimDB softwarereference series calculation and launching AnClim

Automation in AnClim, tests can be selected in advance

Test results processed back in ProClimDB

Non-automated desicion about breaks before correction

Benchmark dataset processing

Benchmark dataset processing

Presentation Transcript

Precipitation Dataset for Statistical Post Processing and Downscaling

Benchmark:

Benchmark

Dataset:

Dataset

NDTMS Dataset

Dataset Usability

DATASET

Dataset Tracht6A

BENCHMARK

Dataset Review

Dataset One

DataSet

Dataset introduction

COST benchmark dataset homogenisation: issues and remarks of the “Slovenian team”

Quality control and homogenization of the COST benchmark dataset

BENCHMARK DATASET HOMOGENIZATION. A first test.

Dataset

System Model for Processing on Multi Format of Dataset