This publication discusses the challenges and potential solutions for data publication and citation in atmospheric science, with a focus on the British Atmospheric Data Centre (BADC). Topics covered include the role of BADC, data sets available, the CLADDIER project, and the need for improved data citation practices. The conclusion highlights the importance of formalizing data packaging and implementing an external data review process.
Data Publication at the British Atmospheric Data Centre
CLADDIER
S J Pepler, B Lawrence, P Simpson, J Hey, C Jones
Overview
• Work Context
• BADC
• CLADDIER
• Citing Data sets
• The encapsulation problem.
• The publisher problem.
• Conclusions
DCC 2006
What is the BADC?
• NERC’s designated data centre for atmospheric science.
• “The role of the British Atmospheric Data Centre (BADC) is to assist UK atmospheric researchers to locate, access and interpret atmospheric data and to ensure the long-term integrity of atmospheric data produced by Natural Environment Research Council (NERC) projects.”
• Curation and Facilitation.
• http://badc.nerc.ac.uk/
• Part of NCAS
Data Sets
“A collection of files with a common theme and administration”
• Ground-based observation networks: Met Office surface stations
• Model output: NWP, ECMWF reanalyses & climate models
• Satellite data: TOMS, Envisat & MSG
• NERC programmes data: UTLS, CWVC & URGENT
MST radar data
• One dataset
• 444 GB, 322,000 files
• Lots of docs
• Multiple formats
• Multiple versions
• Multiple products
• More data every hour
CLADDIER
• Citation, Location, And Deposition In Discipline & Institutional Repositories
• Aims: to provide discovery and citation of data and documents between repositories.
• Provide inter-repository communication of citation information.
How do scientists want to cite data?
• We asked a range of scientists and data providers: what do you want to cite?
• Unambiguous and persistent. Identifiers good.
• Readable. Just identifiers bad.
• Should look like a paper reference. Author, publication date, etc.
• Broad scale to avoid reference bloat, but…
• … refer to subsets of data by product type, version, or other specific semantics.
• Probably put specifics in the text of the article.
• Dataset should be defined by instruments, activities of observation platforms.
Citation of BADC data set
Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar Facility [Thomas, L.; Vaughan, G.] Mesosphere-Stratosphere-Troposphere Radar Facility at Aberystwyth, [Internet]. Version 2, Cartesian products. British Atmospheric Data Centre (BADC), 1990- [cited 2006 Apr 25]. Available from http://badc.nerc.ac.uk/data/mst.
• Edition could also be “Mesosphere, Spectra Widths”
• Publisher?
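The elements of the example citation on this slide can be assembled by a small formatter. This is an illustrative sketch only: the `DataCitation` class and its field names are hypothetical, chosen to mirror the parts visible in the citation, and are not a BADC or CLADDIER schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DataCitation:
    """Hypothetical container for the parts of the example citation."""

    funder: str        # body responsible for the facility
    facility: str
    authors: list[str]
    title: str
    edition: str       # e.g. "Version 2, Cartesian products"
    publisher: str
    start_year: str    # "1990-" marks a dataset that is still growing
    cited: str         # date on which the data were accessed
    url: str

    def format(self) -> str:
        """Join the fields in the order used on the slide."""
        names = "; ".join(self.authors)
        return (
            f"{self.funder}, {self.facility} [{names}] {self.title}, "
            f"[Internet]. {self.edition}. {self.publisher}, "
            f"{self.start_year} [cited {self.cited}]. "
            f"Available from {self.url}."
        )


mst = DataCitation(
    funder="Natural Environment Research Council",
    facility="Mesosphere-Stratosphere-Troposphere Radar Facility",
    authors=["Thomas, L.", "Vaughan, G."],
    title="Mesosphere-Stratosphere-Troposphere Radar Facility at Aberystwyth",
    edition="Version 2, Cartesian products",
    publisher="British Atmospheric Data Centre (BADC)",
    start_year="1990-",
    cited="2006 Apr 25",
    url="http://badc.nerc.ac.uk/data/mst",
)
print(mst.format())
```

Keeping the edition and publisher as separate fields makes the two open questions on the slide explicit: either value can be swapped without touching the rest of the reference.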
Problem 1: Edition
• Is it OK for the author to make up the edition?
• No, otherwise the data referred to is not clearly defined.
• What we need is a way of referencing the semantics.
• The NERC DataGrid is already developing the Climate Science Modelling Language (CSML) to do this. Their aim is data manipulation.
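One simple way to stop authors inventing editions is to check a cited edition against a list that the data centre itself maintains. The sketch below assumes such a list exists; `DEFINED_EDITIONS` and `check_edition` are hypothetical names, and the entries are just the two edition strings mentioned on the citation slide, not a real BADC catalogue. A flat list of strings is much simpler than the semantic description CSML aims at.

```python
# Hypothetical registry of editions the data centre has formally defined.
# The entries are illustrative only, copied from the example citation slide.
DEFINED_EDITIONS = {
    "mst": {"Version 2, Cartesian products", "Mesosphere, Spectra Widths"},
}


def check_edition(dataset_id: str, edition: str) -> bool:
    """Return True only if the edition is defined for the given dataset."""
    return edition in DEFINED_EDITIONS.get(dataset_id, set())


print(check_edition("mst", "Version 2, Cartesian products"))
print(check_edition("mst", "My favourite subset"))
```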
CSML Feature types
• Defined on the basis of geometric and topological structure
Climate Science Modelling Language
• CSML feature types
• Examples: ProfileSeriesFeature, ProfileFeature, GridFeature
• Collections of features are allowed.
Problem 2: Publisher
• The publisher makes the items available and performs quality control measures, most notably peer review.
• Can we do peer review of our data sets?
• Option 1: The BADC becomes a proper publishing organisation and organises peer review.
• This is a lot of work; who is going to pay?
• Option 2: An existing publisher does the peer review for us using existing processes.
• This is a lot of work; who is going to pay?
Using the RMetS for Peer-review
• Using the RMetS would accelerate acceptance of datasets as peers of papers.
• Needs to fit in with current practices.
• Needs to fit in with current software tools for managing the peer review process.
• Needs a sustainable business model.
• Overlay journal?
Citation in Data Journal
Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar Facility [Thomas, L.; Vaughan, G.]. Mesosphere-Stratosphere-Troposphere Radar Facility at Aberystwyth, [Internet]. Version 2, Cartesian products. RMetS Data Publications, 1990- [cited date]. Available from http://badc.nerc.ac.uk/data/mst/v2cart200602.xml. [doi:10233/23498234]
Conclusions
• A formalised packaging of data needs to be put in place to clarify the boundaries of these multi-object datasets. This would help not only authors to reference the data, but also data creators to track its use, and archive managers as data is stored, reviewed and collected.
• An external data review process needs to be put in place to elevate the status of data sets. Using an existing publisher to coordinate the reviews may accelerate the acceptance of data publication by authors.
Climate Science Modelling Language
• Provides semantic abstraction layer
• Provides ‘wrapper’ architecture for legacy data files
• Composite design pattern for aggregation
[Diagram labels: (SAX) demarshalling <CSML>; instantiateNetCDF(DatasetID, FeatureID)]
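The composite design pattern named on this slide can be sketched in a few lines. This is a generic illustration of the pattern, not the CSML implementation: the class names `Feature`, `FileBackedFeature` and `FeatureCollection`, the feature identifiers and the file names are all hypothetical, and the `instantiateNetCDF` call from the diagram is not modelled.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Feature(ABC):
    """Common interface shared by single features and collections."""

    def __init__(self, feature_id: str) -> None:
        self.feature_id = feature_id

    @abstractmethod
    def files(self) -> list[str]:
        """Return the legacy data files backing this feature."""


class FileBackedFeature(Feature):
    """Leaf: wraps one legacy data file, e.g. a profile or a grid."""

    def __init__(self, feature_id: str, feature_type: str, path: str) -> None:
        super().__init__(feature_id)
        self.feature_type = feature_type
        self.path = path

    def files(self) -> list[str]:
        return [self.path]


class FeatureCollection(Feature):
    """Composite: aggregates features, including other collections."""

    def __init__(self, feature_id: str) -> None:
        super().__init__(feature_id)
        self.members: list[Feature] = []

    def add(self, member: Feature) -> None:
        self.members.append(member)

    def files(self) -> list[str]:
        # Recurse through members so nested collections are flattened.
        found: list[str] = []
        for member in self.members:
            found.extend(member.files())
        return found


cartesian = FeatureCollection("cartesian-v2")
cartesian.add(FileBackedFeature("p1", "ProfileFeature", "profile_0001.nc"))
cartesian.add(FileBackedFeature("p2", "ProfileFeature", "profile_0002.nc"))

dataset = FeatureCollection("mst")
dataset.add(cartesian)
dataset.add(FileBackedFeature("g1", "GridFeature", "grid_0001.nc"))

print(dataset.files())
```

Because a collection exposes the same interface as a single feature, a citation could in principle point at either a whole dataset or one product within it using the same kind of identifier.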