
Benchmark database inhomogeneous data, surrogate data and synthetic data




  1. Benchmark database: inhomogeneous data, surrogate data and synthetic data • Victor Venema

  2. Content • Introduction to benchmark dataset • Some results • Some questions about exercise • Question about future work • Analyse and publish the results

  3. Benchmark dataset • Real (inhomogeneous) climate records • Most realistic case • Investigate if various homogenisation algorithms (HA) find the same breaks • Synthetic data • For example, Gaussian white noise • Insert known inhomogeneities • Test performance • Surrogate data • Empirical distribution and correlations • Insert known inhomogeneities • Compare to synthetic data: test of assumptions

  4. Creation benchmark – Outline talk • Start with homogeneous data • Multiple surrogate and synthetic realisations • Mask surrogate records • Add global trend • Insert inhomogeneities in station time series • Published on the web • Homogenize by COST participants and third parties • Analyse the results and publish

  5. 1) Start with homogeneous data • Monthly mean temperature and precipitation • Later also daily data (WG4), maybe other variables (pressure, wind) • Homogeneous, no missing data • Longer surrogates are based on multiple copies • Generated networks are 100 a (years) long

  6. 2) Multiple surrogate realisations • Multiple surrogate realisations • Temporal correlations • Station cross-correlations • Empirical distribution function • Annual cycle removed before, added at the end • Number of stations: 5, 9 or 15 • Cross-correlation varies as much as possible
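
The step "annual cycle removed before, added at the end" can be sketched as follows. This is a minimal illustration, not the generator used for the benchmark: the function names are hypothetical, and the series is assumed to be monthly, to start in January and to cover at least one full year.

```python
from statistics import mean


def remove_annual_cycle(monthly):
    """Split a monthly series into a 12-value mean annual cycle and anomalies.

    Assumption: the series starts in January and has at least 12 values.
    """
    # Mean of all Januaries, all Februaries, and so on.
    cycle = [mean(monthly[m::12]) for m in range(12)]
    anomalies = [x - cycle[i % 12] for i, x in enumerate(monthly)]
    return cycle, anomalies


def add_annual_cycle(anomalies, cycle):
    """Restore the annual cycle after the surrogate has been generated."""
    return [a + cycle[i % 12] for i, a in enumerate(anomalies)]
```

The surrogate algorithm would operate on the anomalies, so that the temporal correlations it reproduces are not dominated by the seasonal cycle.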

  7. 5) Insert inhomogeneities in stations • Independent breaks • Determined at random for every station and time • 5 breaks per 100 a • Monthly slightly different perturbations • Temperature • Additive • Size: Gaussian distribution, σ = 0.8 °C • Rain • Multiplicative • Size: Gaussian distribution, mean ⟨x⟩ = 1, σ = 10 %
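
A minimal sketch of the independent breaks described on this slide. Several details are assumptions made for illustration only: the rate of 5 breaks per 100 a is turned into a per-month probability, each break draws a new perturbation level for the segment that starts there, the first segment stays unperturbed, and the month-to-month variation of the perturbation is left out. The function name and parameters are hypothetical.

```python
import random


def insert_breaks(series, kind="temperature", breaks_per_century=5.0, rng=None):
    """Insert random break inhomogeneities into one monthly station series.

    Returns the perturbed series and the indices at which breaks start.
    """
    rng = rng or random.Random()
    # Assumption: breaks occur independently each month at a constant rate.
    p_break = breaks_per_century / (100 * 12)
    offset, factor = 0.0, 1.0          # first segment is left unperturbed
    perturbed, positions = [], []
    for i, x in enumerate(series):
        if i > 0 and rng.random() < p_break:
            positions.append(i)
            if kind == "temperature":
                offset = rng.gauss(0.0, 0.8)    # additive, sigma = 0.8 degC
            else:
                factor = rng.gauss(1.0, 0.10)   # multiplicative, mean 1, sigma = 10 %
        perturbed.append(x + offset if kind == "temperature" else x * factor)
    return perturbed, positions
```

Passing a seeded `random.Random` instance makes a realisation reproducible, which is convenient when the inserted truth has to be revealed later.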

  8. Example break perturbations station

  9. Example break perturbations network

  10. 5) Insert inhomogeneities in stations • Correlated break in network • One break in 10 % of networks • In 30 % of the station simultaneously • Position random • At least 10 % of data points on either side
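
A minimal sketch of a correlated break for one network. The slide gives the share of stations (30 %), the random position and the 10 % margin on either side; the break size is not stated, so the sketch assumes one additive shift with σ = 0.8 °C shared by all affected stations. The function name is hypothetical, and deciding which 10 % of networks receive such a break is left to the caller.

```python
import random


def insert_correlated_break(network, rng=None, sigma=0.8):
    """Shift about 30 % of the stations of a network at the same date.

    network: list of station series (lists of equal length).
    Returns the perturbed network, break position, affected stations and size.
    """
    rng = rng or random.Random()
    n_stations, n = len(network), len(network[0])
    margin = (n + 9) // 10                      # at least 10 % of data on either side
    position = rng.randint(margin, n - margin)  # values from this index on are shifted
    n_affected = max(1, (30 * n_stations + 50) // 100)   # 30 %, rounded half up
    affected = sorted(rng.sample(range(n_stations), n_affected))
    size = rng.gauss(0.0, sigma)                # assumption: same size for all stations
    perturbed = [
        [x + size if (s in affected and i >= position) else x
         for i, x in enumerate(series)]
        for s, series in enumerate(network)
    ]
    return perturbed, position, affected, size
```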

  11. Example correlated break

  12. 5) Insert inhomogeneities in stations • Outliers • Size • Temperature: < 1 or > 99 percentile • Rain: < 0.1 or > 99.9 percentile • Frequency • 50 % of networks: 1 % • 50 % of networks: 3 %
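
A minimal sketch of the outlier insertion. The slide only specifies the percentile thresholds and the frequencies; the sketch assumes that an outlier value is drawn from the tails of the station's own empirical distribution and that outlier positions are chosen independently per month. Function names are hypothetical.

```python
import random


def insert_outliers(series, frequency=0.01, tail=0.01, rng=None):
    """Replace a random fraction of values with values from the distribution tails.

    tail=0.01 corresponds to "< 1 or > 99 percentile" (temperature),
    tail=0.001 to "< 0.1 or > 99.9 percentile" (rain).
    """
    rng = rng or random.Random()
    ranked = sorted(series)
    k = max(1, int(len(ranked) * tail))        # number of values in each tail
    tails = ranked[:k] + ranked[-k:]
    perturbed = list(series)
    positions = [i for i in range(len(perturbed)) if rng.random() < frequency]
    for i in positions:
        perturbed[i] = rng.choice(tails)
    return perturbed, positions


def network_outlier_frequency(rng):
    """Half of the networks get 1 % outliers, the other half 3 %."""
    return 0.01 if rng.random() < 0.5 else 0.03
```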

  13. Example outlier perturbations station

  14. Example outliers network

  15. 5) Insert inhomogeneities in stations • Local trends (only temperature) • Linear increase or decrease in one station • Duration: between 30 and 60 a • Maximum size: Gaussian distribution, σ = 0.8 °C • Frequency: once in 10 % of the stations
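
A minimal sketch of a local trend in one monthly temperature series. The slide does not say what happens after the trend ends; the sketch assumes that the station stays at the maximum perturbation. The function name is hypothetical, the series must be at least 60 a long, and selecting the 10 % of stations that receive a trend is left to the caller.

```python
import random


def insert_local_trend(series, rng=None, sigma=0.8):
    """Add a linear ramp lasting 30 to 60 years to a monthly series.

    Returns the perturbed series, the start index, the duration in months
    and the maximum size of the perturbation.
    """
    rng = rng or random.Random()
    n = len(series)
    duration = rng.randint(30 * 12, 60 * 12)   # months
    start = rng.randint(0, n - duration)
    max_size = rng.gauss(0.0, sigma)           # sign decides increase or decrease
    perturbed = []
    for i, x in enumerate(series):
        if i < start:
            ramp = 0.0
        elif i < start + duration:
            ramp = max_size * (i - start + 1) / duration
        else:
            ramp = max_size                    # assumption: offset persists afterwards
        perturbed.append(x + ramp)
    return perturbed, start, duration, max_size
```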

  16. Example local trends

  17. 6) Published on the web • Inhomogeneous data are published on the COST-HOME homepage • Everyone is welcome to download and homogenize the data • http://www.meteo.uni-bonn.de/venema/themes/homogenisation

  18. 7) Homogenize by participants • Return homogenised data • Should be in COST-HOME file format (next slide) • For real data including quality flags • Return break detection file • BREAK • OUTLI • BEGTR • ENDTR • Multiple breaks at one date are possible

  19. Typical errors • The file format needs to be perfect! • Forgetting the station-file that describes which stations belong to the homogenised network • Changing the file names in this station file to homogeneous data files ► • (Forgetting to return the files with the quality flags) • The sizes of the breaks are not in the break file • Please keep the directory structure of the benchmark as it is, also for partial contributions • The only difference is the main directory • All files are tab-delimited ASCII files

  20. COST-HOME file format – network file

  21. Typical errors • The file format needs to be perfect! • Forgetting the station-file that describes which stations belong to the homogenised network • Changing the file names in this station file to homogeneous data files • (Forgetting to return the files with the quality flags) • The sizes of the breaks are not in the break file ► • Please keep the directory structure of the benchmark as it is, also for partial contributions • The only difference is the main directory • All files are tab-delimited ASCII files

  22. Detected breaks file

  23. Typical errors – see discussion • The file format needs to be perfect! • Forgetting the station-file that describes which stations belong to the homogenised network • Changing the file names in this station file to homogeneous data files • (Forgetting to return the files with the quality flags) • The sizes of the breaks are not in the break file • Please keep the directory structure of the benchmark as it is, also for partial contributions • The only difference is the main directory • All files are tab-delimited ASCII files ►

  24. COST-HOME file format – monthly data

  25. Contributions

  26. No. homogenised networks - algorithm

  27. No. homogenised networks – input data

  28. Mean no. outliers per station

  29. Mean no. breaks per station

  30. Homogenising the exercise • Tab-delimited files: also space-delimited? • Mixture of strings and numbers • Data quality files only for real data section • Do we want to use the Diurnal Temperature Range (DTR)? • Not useful for surrogate and synthetic data! • If we do, everyone should do it • End or beginning uncorrected? • Compute statistics independent of absolute level? • Is filling missing values part of the exercise? • Human quality control or raw algorithm output? • Homogenise all networks and periods, or only the homogenisable ones?

  31. Contributions – who is missing?

  32. Analysing the results • What measures define a well homogenised dataset? • Real data vs. data with known truth • Ensemble mean for real data? • Breaks • Position, hit rate • size distribution • detection probability as function of size • Data itself • Root mean square error (RMSE) • RMSE (without outliers) • RMSE (bias corrected) • Uncertainty in the network mean trend • How to study which components are best?
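
The error measures listed on this slide can be sketched for a case with known truth. This is an illustration under assumptions, not an agreed definition: in particular, the rule that a true break counts as detected when a detected break lies within `tolerance` months of it, and the default of 12 months, are hypothetical.

```python
import math


def rmse(homogenised, truth):
    """Root mean square error between homogenised data and the known truth."""
    return math.sqrt(sum((h - t) ** 2 for h, t in zip(homogenised, truth)) / len(truth))


def bias_corrected_rmse(homogenised, truth):
    """RMSE after removing the mean difference, so the absolute level is ignored."""
    bias = sum(h - t for h, t in zip(homogenised, truth)) / len(truth)
    return math.sqrt(
        sum((h - t - bias) ** 2 for h, t in zip(homogenised, truth)) / len(truth)
    )


def hit_rate(detected, true_breaks, tolerance=12):
    """Fraction of true breaks with a detected break within `tolerance` months."""
    if not true_breaks:
        return float("nan")
    hits = sum(any(abs(d - t) <= tolerance for d in detected) for t in true_breaks)
    return hits / len(true_breaks)
```

The bias-corrected variant addresses the question of computing statistics independent of the absolute level: a contribution that is correct up to a constant offset gets a bias-corrected RMSE of zero.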

  33. Deadline(s) • Agreed on 09/2009, September this year • Multiple deadlines • For example: synthetic data, real data, surrogate data • After deadline the truth can be revealed • After deadline the other contributions can be revealed(?) • Start earlier analysing the results • For example: May, July, September • Bologna, 25 – 26 May, EGU, 19 – 24 April

  34. Articles • Overview COST Action & benchmark with very basic analysis results • Performance difference between synthetic (Gaussian, white noise) and surrogate data • How to deal with multiple contributions per algorithm? • Do we have references to all algorithms? • What should the others be about? • Analysing results, which components are best • Who will organise, coordinate it? • Not everyone should do the same analysis • How to subdivide the work? • After deadline: sensitivity analysis
