Environmental Data Archival: Practices and Benefits
Graham Parton, graham.parton@stfc.ac.uk
Royal Meteorological Society SIG Meeting, BAS, 5th October 2011: Transmission, presentation and archiving of meteorological data
Overview
• What is data archival?
• Why do it?
• How do we do it within CEDA?
What do we call “data archival”?
• Placing data into a repository which is:
  • Backed up
  • Robust (able to identify data corruption)
  • Catalogued
  • Recognised
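One common way to make a repository "robust" in the sense above is to record a checksum for each file at ingest and recompute it later. The following is a minimal sketch of that idea, not a description of CEDA's actual tooling; the file name `obs.dat`, the manifest layout, and the helper names are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(manifest: dict[str, str], root: Path) -> list[str]:
    """List files whose current digest differs from the recorded one."""
    return [name for name, recorded in manifest.items()
            if sha256_of(root / name) != recorded]


# Demonstration on a throwaway directory (file name is hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "obs.dat").write_bytes(b"4.31 155.3\n")
    manifest = {"obs.dat": sha256_of(root / "obs.dat")}  # recorded at ingest
    clean_check = find_corrupted(manifest, root)         # nothing changed yet
    (root / "obs.dat").write_bytes(b"4.31 155.8\n")      # simulate corruption
    corrupted_check = find_corrupted(manifest, root)     # now flags the file
```

Storing the manifest alongside the backup copy means a mismatch can be repaired by restoring the affected file rather than the whole archive.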
Why archive data?
• Making data public: openness of results and repeatability are essential for scientific rigour
• Place to share data with project participants
• Re-purposing data
• Additional services (often for free!)
• May be required for legal reasons
• Secure
• Get credit
• And because if you don’t…
Scale of CEDA operations
• >100,000,000 files holding ~1 PB of data
• ~38,000,000 files downloaded since October 2010
• 19,000+ registered users, of which ~3600 are currently ‘active’ users
• 250+ datasets
• 26 staff
• Responsible for + other services and projects (e.g. UKCIP, CMIP5 partner)
… i.e. we are highly reliant on scripted systems and a well-structured archive
[Figure: CEDA system diagram. Components: data suppliers, 3rd-party data providers, arrivals, ingest, archive (with backups), catalogue (metadata), web service, external discovery service, external users (discovery, view, download).]
Data Preparation
[Figure: CEDA system diagram. Components shown: data suppliers, 3rd-party data providers, arrivals, ingest, archive.]
Data Preparation
• Data Management Plans
  • including delivery schedules
• Conditions of Use/Licensing
• Support suppliers in data preparation
• Capture supporting documentation (formats, calibration information, flight logs, etc.)
• File naming and archive structure
• Set up ingest routes
Data Preparation - File structure
• Take the bad data challenge…
File “sw010203”
• What are these data? Guess surface winds, but on what day?
• What are the units? Any convention?
• How do we read the file?
• Is this spatial or temporal data?...
1440 pairs of data in a file:
4.31 155.3 3.92 136.1 5.15 140.2 4.23 137.1 4.75 150.2 4.71 137.9
4.35 146.5 4.52 138.0 4.83 153.7 5.40 145.8 4.63 141.0 4.90 137.3
4.31 143.3 4.58 157.0 4.94 141.7 4.65 143.1 4.63 143.0 4.88 149.5
5.42 148.5 4.92 140.4 4.04 146.7 3.92 151.5 5.02 135.3 5.06 151.6
4.65 152.3 4.31 168.8 3.79 145.3 5.92 152.9 5.02 145.8 4.77 161.6
4.79 144.1 4.60 147.5 5.33 150.1 4.81 141.0 6.02 146.9 4.38 149.0
4.42 142.5 4.58 133.4 4.35 150.5 4.96 149.8 5.56 143.4 5.08 148.5
5.19 141.6 4.40 142.4 4.10 152.6 5.02 134.0 4.94 142.9 5.27 144.4
5.38 141.5 5.88 144.8 6.00 140.1 4.75 158.3 5.08 148.1 5.46 163.5
4.27 150.8 4.69 138.8 5.71 144.0 5.21 138.8 5.00 132.4 5.06 144.4
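The "on what day?" question can be made concrete. The sketch below tries three date conventions on the six digits in the file name; the conventions are assumptions chosen for illustration, since the slide does not say which (if any) the file follows. All three parse cleanly and give different dates.

```python
from datetime import datetime

# Hypothetical readings of the six digits in "sw010203"; the slide does not
# say which convention, if any, the file name follows.
CANDIDATE_FORMATS = {
    "YYMMDD": "%y%m%d",
    "DDMMYY": "%d%m%y",
    "MMDDYY": "%m%d%y",
}


def possible_dates(stamp: str) -> dict[str, str]:
    """Return every candidate convention under which the stamp is a valid date."""
    results = {}
    for label, fmt in CANDIDATE_FORMATS.items():
        try:
            results[label] = datetime.strptime(stamp, fmt).date().isoformat()
        except ValueError:
            continue  # not a valid date under this convention
    return results


readings = possible_dates("010203")

# 1440 is also the number of minutes in a day, which would be consistent with
# one pair per minute, but nothing in the file itself confirms that.
MINUTES_PER_DAY = 24 * 60
```

With three equally valid dates and only a numerical coincidence hinting at the sampling interval, the file cannot be interpreted without documentation held somewhere else.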
Supported Formats
• Highly structured metadata
• Standard Names
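To illustrate what structured metadata with standard names buys, here is a small sketch that checks whether each variable carries a `standard_name` and `units` attribute. The attribute names follow the CF metadata conventions, and `wind_speed` and `wind_from_direction` are CF standard names; the choice of required attributes, the variable names, and the assumption that the earlier example file holds wind speed and direction are illustrative only.

```python
import json

# Illustrative minimum only; not a statement of any archive's actual rules.
REQUIRED_ATTRIBUTES = ("standard_name", "units")


def missing_attributes(variables: dict[str, dict]) -> dict[str, list[str]]:
    """Map each variable to the required attributes it lacks."""
    gaps = {}
    for name, attrs in variables.items():
        absent = [key for key in REQUIRED_ATTRIBUTES if key not in attrs]
        if absent:
            gaps[name] = absent
    return gaps


# Two anonymous columns, as in a bare file of number pairs.
bare = {"column_1": {}, "column_2": {}}

# The same two columns described with CF-style attributes (assumed contents).
described = {
    "wind_speed": {"standard_name": "wind_speed", "units": "m s-1"},
    "wind_dir": {"standard_name": "wind_from_direction", "units": "degree"},
}

bare_gaps = missing_attributes(bare)
described_gaps = missing_attributes(described)

# A machine-readable header that could travel with the data.
header = json.dumps(described, indent=2, sort_keys=True)
```

Self-describing formats carry this kind of header inside the file, so the description cannot be separated from the values.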
Data Discovery
[Figure: CEDA system diagram. Components shown: data suppliers, 3rd-party data providers, arrivals, ingest, archive, catalogue (metadata), web service, external discovery service, external users (discovery).]
CEDA Document Repository • cedadocs.badc.rl.ac.uk
Citations for Data Creators: DOIs
• Citation (and DOI)
• Data citation and DOI… but only if in a recognised repository
Data Services
[Figure: CEDA system diagram. Components shown: data suppliers, 3rd-party data providers, arrivals, ingest, archive, catalogue (metadata), web service, external discovery service, external users (discovery, view, download).]
Processing Services
CEDA WPS: ceda-wps2.badc.rl.ac.uk/ui/home
• Chain services together
• Jobs either run straight away or are sent to run on a backend service
• Download the result
OPeNDAP Service
• With security layer
• Navigable and scriptable interface to the archive
• CEDA has applied a security shell using “OpenID” technology
• Gives a powerful sub-setting service for large datasets
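The sub-setting mentioned above works by appending a constraint expression to the dataset URL, so that only the requested slice is transferred. Below is a minimal sketch that builds such a URL using the DAP2 `[start:stride:stop]` hyperslab syntax; the host, path, and variable name are hypothetical, and the security layer is not handled here.

```python
def dap_subset_url(dataset_url: str, variable: str,
                   ranges: list[tuple[int, int, int]],
                   response: str = "ascii") -> str:
    """Build a DAP2 URL selecting variable[start:stride:stop] per dimension."""
    hyperslab = "".join(f"[{start}:{stride}:{stop}]"
                        for start, stride, stop in ranges)
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


# Hypothetical dataset and variable: request the first 60 values only.
url = dap_subset_url("https://example.org/opendap/winds.nc",
                     "wind_speed", [(0, 1, 59)])
```

Some HTTP clients require the square brackets to be percent-encoded before the request is sent.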
What’s on the horizon?
• Continue to develop visualisation and data processing services
• Data volumes are increasing and becoming too large to move around
• Hosting services: provide virtual environments for people to work on the data without downloading
• From Petascale to Exascale
• But all this NEEDS data that uses standards-driven metadata and formats
Take Home Messages
• Plan for data management
• Tap into standards when preparing data
• Get data catalogued for data discovery
• Data in supported repositories leads to recognition for efforts preparing data
• A suite of additional services adds value to existing data
Team Digital Preservation Video