
Who we are


Presentation Transcript


  1. Expanding the scientific record: Data citation and publication by NERC’s environmental data centres. Sarah Callaghan and the NERC Data Citation and Publication Project Team [sarah.callaghan@stfc.ac.uk]. The Role of Data Archives and Data Centers in Identifying and Preserving Data, DataCite Summer meeting, August 24-25th 2011.

  2. Who we are The UK’s Natural Environment Research Council (NERC) funds six data centres which between them have responsibility for the long-term management of NERC's environmental data holdings. As part of the NERC Science Information Strategy (SIS), several projects have been created to provide the framework for NERC to work more closely and effectively with its scientific communities in delivering data and information management services. One of these is the Data Citation and Publication Project.

  3. What sort of data do we deal with? A variety of environmental measurements, along with the results of model simulations

  4. Data preservation and curation in history [Image: Phaistos Disk, 1700 BC] The writers of these documents did a brilliant job of preserving for thousands of years the bits-and-bytes of their time! But they’ve both been translated many times, and it’s a shame the meanings are different. => Data preservation is not enough; we need “Active Curation” to preserve information.

  5. The role of the data centres NERC fund research projects, which produce data. It is essential that these data are properly managed to ensure their long-term availability. NERC’s network of data centres provide support and guidance in data management to those funded by NERC, are responsible for the long-term curation of data and provide access to NERC's data holdings. The NERC Data Policy details their commitment to support the long-term management of data and also outlines the roles and responsibilities of all those involved in the collection and management of data. We are also involved in externally funded projects in informatics, e-Science and domain specific areas.

  6. Choosing what data to archive • NERC has a new data policy (http://www.nerc.ac.uk/research/sites/data/policy.asp) which came into force in January 2011. • The policy requirements for identifying data of long‐term value and for developing data management plans have yet to be formally implemented.   This is to allow time for NERC to develop the necessary process and mechanisms for reviewing and managing data management plans, and to work with the NERC community to develop criteria to help identify data of long‐term value • BADC have a model data policy (http://badc.nerc.ac.uk/data/BADC_Model_Data_Policy.pdf) which gives criteria for selecting simulated/model data for management

  7. Why do we want to cite and publish data? • Pressure from the UK government to make all data from publicly funded research available to the public for free. • Scientists still want to receive attribution and credit for their work • General public want to know what the scientists are doing (c.f. Climategate) • Research funders want reassurance that they’re getting value for money from their funding • Relies on peer-review of science publications (well established) and data (not done yet!) • Allows the wider research community to find and use datasets outside their immediate domain, confident that the data is of reasonable quality • From a strict data-centric point of view, citation and publication provides an extra incentive for scientists to submit their data to us in appropriate formats and with full metadata!

  8. Publishing data for the scholarly record • Scientific journal publication mainly focuses on the analysis, interpretation and conclusions drawn from a given dataset. • Examining the raw data that forms the dataset is more difficult, as datasets are usually stored in digital media, in a variety of (proprietary or non-standard) formats. • Peer-review is generally only applied to the methodology and final conclusions of a piece of work, and not the underlying data itself. But if the conclusions are to stand, the data must be of good quality. • A process of data publication, involving peer-review of datasets would be of benefit to many sectors of the academic community.

  9. (Scientific) Communication through the ages • Science, as a process, requires the exchange of information and ideas. • We can make this exchange face-to-face (conferences, meetings, seminars) or through another medium (text, video, images), or both. • No matter what method we use, we wind up telling each other stories about what we’ve discovered. • Technology has given us new tools, but it’s also provided new challenges http://www.intoon.com/#68559

  10. Journals – a 17th century technology • The first scientific journal, Journal des sçavans (later renamed Journal des savants), was first published on Monday, 5 January 1665. • It also carried a proportion of material that would not now be considered scientific, such as obituaries of famous men, church history, and legal reports. • It still exists, but is more of a literary journal • The first edition of the Philosophical Transactions of the Royal Society of London was on 6 March 1665. That still exists, and continues to publish scientific information to this day.

  11. Journals work, but... • ... they’re not enough now to communicate everything we need to know about a scientific event, whether that’s an observation, simulation, development of a theory, or any combination of these. • Data always has been the foundation of scientific progress – without it, we can’t test any of our assertions. • Previously data was hard to capture, but could be (relatively) easily published in image or table format. • But now... [Image: Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665]

  12. The Data Deluge “the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data” The Digital Universe Decade – Are You Ready? IDCC White Paper, May 2010

  13. SI Prefixes Stuart Feldman, Google

  14. Serving, Citing and Publishing Data • Citation forms an important part of the scientific record. • We draw a clear distinction between: • publishing/serving = making available for consumption (e.g. on the web), and • Publishing = publishing after some formal process which adds value for the consumer (e.g. PLoS ONE type review, or EGU journal type public review, or more traditional peer review) AND provides commitment to persistence. [Figure: three stacked levels, with the example label Doi:10232/123 at the citation level and Doi:10232/123ro at the publication level] • 0. Serving of data sets: This is what data centres do as our day job: take in data supplied by scientists and make it available to other interested parties. • 1. Data set citation: This is our first step for this project: formulate and formalise a way of citing data sets. It will provide benefits to our users, and a carrot to get them to provide data to us! • 2. Publication of data sets: This involves the peer-review of data sets, and gives the “stamp of approval” associated with traditional journal publications. It can’t be done without effective linking/citing of the data sets.

  15. Publication versus sharing of data There’s a lot of discussion about Open Data. Some scientists are quite wary of it, for a number of reasons, mainly revolving around not getting credit for the work involved in creating the dataset. Scientists are used to journal publications, so piggybacking on that for data could bridge the gap between open and closed data. [Figure: quadrant diagram with axes Open vs. Closed/Restricted and Published vs. Not published; labelled entries: “What we’re aiming for”, “Reader pays publications”, “Webpage”, “CD in a drawer”]

  16. Data Citation and Publication Project Aims • To implement publication and citation of datasets held within the NERC data centres. • To increase NERC’s influence on work to provide and cite data outputs from scientific work in similar ways to scientific papers. • To demonstrate to the NERC community that data citation and publication is both personally and scientifically advantageous. • To form partnerships with other organisations with the same goal of data publication to exploit common activities and achieve a wider community buy-in. To this end, project team members are involved with the SCOR/IODE/MBL WHOI Library Data Publication Working Group, the CODATA-ICSTI Task Group on Data Citation Standards and Practices, and the DataCite Working Group on Criteria for Datacentres.

  17. What data centres can do and what we can’t [Figure: three stacked levels, with the example label Doi:10232/123 at the citation level and Doi:10232/123ro at the publication level] • 0. Serving of data sets: The day job: take in data and metadata supplied by scientists (often on an ongoing basis). Make sure that there is adequate metadata and that the data files are in an appropriate format. Make it available to other interested parties. • 1. Data set citation (technical quality): When we cite (i.e. assign a DOI to) a dataset, we’re confirming that, in our opinion, the dataset meets a level of technical quality (metadata and format) and that we will make it available and keep it frozen for the foreseeable future. • 2. Publication of data sets (scientific quality): The scientific quality of a dataset has to be evaluated by peer-review by scientists with domain knowledge. This peer-review process has already been set up by academic publishers, so it makes sense to collaborate with them for peer-review publishing of data.

  18. How we’re going to cite (and publish) data • We decided to use digital object identifiers (DOIs) because: • They are actionable, interoperable, persistent links for (digital) objects • Scientists are already used to citing papers using DOIs • Pangaea assign DOIs, and ESSD use DOIs to link to the datasets they publish • The British Library and DataCite gave us an allocation of 500 DOIs to assign to datasets as we saw fit.

  19. What sort of data can we/will we cite? • The dataset has to be: • Stable (i.e. not going to be modified) • Complete (i.e. not going to be updated) • Permanent – by assigning a DOI we’re committing to make the dataset available for posterity • Good quality – by assigning a DOI we’re giving it our data centre stamp of approval, saying that it’s complete and all the metadata is available • When a dataset is cited, that means: • There will be bitwise fixity • There will be no additions or deletions of files • There will be no changes to the directory structure in the dataset “bundle” • A DOI should point to an HTML representation of some record which describes a data object. • Upgrades to versions of data formats will result in new editions of datasets.
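The fixity conditions above (no changed bytes, no added or deleted files, no change to the directory structure) can be checked mechanically. The following is a minimal sketch, not the data centres' actual tooling: it assumes a dataset bundle is a directory on disk and uses SHA-256 checksums. The function names `bundle_manifest` and `fixity_report` are hypothetical.

```python
import hashlib
from pathlib import Path


def bundle_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def fixity_report(recorded: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare the manifest recorded at DOI assignment with a fresh one."""
    shared = set(recorded) & set(current)
    return {
        # Files present now but not when the DOI was assigned.
        "added": sorted(set(current) - set(recorded)),
        # Files present at DOI assignment but missing now.
        "deleted": sorted(set(recorded) - set(current)),
        # Same relative path, different bytes.
        "modified": sorted(name for name in shared if recorded[name] != current[name]),
    }
```

A file moved to another directory shows up as one deletion plus one addition, so a change to the directory structure is caught as well. Empty directories are not tracked in this sketch.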

  20. A short digression: Citation vs. referencing • Citation – data centre commitment regarding fixity, stability, permanence etc. of a dataset. Demonstrated by assignment of a DOI • e.g. Darwin, Charles Robert. The Origin of Species. Vol. XI. The Harvard Classics. New York: P.F. Collier & Son, 1909–14; • Referencing – no data centre commitment regarding fixity, stability, permanence etc. of a dataset. The dataset can still be referenced by URL, but the link might be broken • e.g. Paragraph 3, page 42, Darwin, Charles Robert. The Origin of Species, 1859 • We want to be able to reference an individual part of the dataset (word/line/paragraph) without having to commit to assigning a DOI to anything other than the dataset as a whole (the book) • If the dataset is properly frozen, then the reference to a part of it should work fine. • People citing the dataset might want to reference a part of it, and we should make this possible. But we don’t want to commit to DOI-ing every single bit of a dataset! • And just because someone can (and will) reference something in a dataset that’s not DOI-ready, this act should not trigger a DOI-citation

  21. Data publication • We have the ability now (thanks to the British Library and DataCite) to mint our own DOIs • We can therefore cite our datasets, giving academic credit to those scientists who get cited – making them more likely to give us good quality data to archive. • Publication – and peer-review – is the next step • We are working with recognized academic journals to do this. The timescales for this are quite tight, as we want to tie in with the timescales for the next Intergovernmental Panel on Climate Change (IPCC) report • Data journals already exist: • Earth System Science Data (http://earth-system-science-data.net/) • Geochemistry, Geophysics, Geosystems (G3 http://www.agu.org/journals/gc/ )

  22. Conclusions • The NERC data citation and publication project has been running for 1 year. • We’re now entering phase 2 of the project (which will take 2 years) • At the end of this phase, all the NERC data centres will have: • At least 1 dataset with associated DOI • Guidelines for the data centre on what is an appropriate dataset to cite • Guidelines for data providers about data citation and the sort of datasets we will cite • Our users are already expressing an interest in data citation - this is an idea whose time has come! http://www.keepcalm-o-matic.co.uk/default.aspx#createposter

  23. Landing page(s) • We don’t want to have the DOI resolve right at the archive level where you can get the files in the dataset because: • this dumps the user with the only information being a list of filenames • if we change the archive structure, that requires re-mapping all the DOIs • So, we need a landing page. Users are used to this, as that’s what on-line journals do. • For example, clicking 10.1049/iet-map:20060126 will bring you to an HTML page with a link to a PDF of the referenced paper.
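As a small illustration of how a DOI becomes a clickable link to a landing page, the sketch below turns a DOI name into a resolver URL. It assumes the public resolver at https://doi.org/ (not named in the slides). The helper name `doi_to_url` is hypothetical, and the validation is deliberately loose.

```python
from urllib.parse import quote

# Assumption: the public DOI resolver; the slides do not name one.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI name such as '10.1049/iet-map:20060126' into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, sep, suffix = doi.partition("/")
    # Loose sanity check only: DOI prefixes start with '10.' and need a suffix.
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode anything unsafe in a URL, leaving '/' and ':' untouched.
    return DOI_RESOLVER + quote(doi, safe="/:")
```

The resolver redirects to whatever landing page URL is currently registered for the DOI, so the archive behind the landing page can be reorganised without touching the citation.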

  24. Landing page rules • We can change the landing page any time we like, but you had better be able to get to your digital object from there! • If there is a new version of the dataset, a new DOI is needed. The original landing page can indicate that a newer version of the dataset exists, but it should still point to the older version of the dataset. • Landing pages can have query-based links to other things (e.g. papers which cite this dataset). • The (DOI-mandatory) metadata describing the dataset shouldn't change: it describes the digital object and represents it faithfully.
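These rules can be read as a small data model. The sketch below is one possible reading, not the project's implementation. The class and function names (`DatasetMetadata`, `LandingPage`, `relocate_data`, `publish_new_version`) are hypothetical, and the only metadata fields shown are a DOI and a title.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DatasetMetadata:
    """DOI-mandatory metadata: frozen, because it must not change."""
    doi: str
    title: str


@dataclass
class LandingPage:
    metadata: DatasetMetadata
    data_url: str                            # may change if the archive moves
    newer_version_doi: Optional[str] = None  # hint only; data_url stays put


def relocate_data(page: LandingPage, new_url: str) -> None:
    """The archive was reorganised: same DOI, same metadata, new location."""
    page.data_url = new_url


def publish_new_version(pages: Dict[str, LandingPage], old_doi: str,
                        new_doi: str, new_data_url: str) -> LandingPage:
    """A new version gets a new DOI; the old page only gains a pointer to it."""
    if new_doi in pages:
        raise ValueError(f"DOI already assigned: {new_doi}")
    old = pages[old_doi]
    new = LandingPage(DatasetMetadata(new_doi, old.metadata.title), new_data_url)
    pages[new_doi] = new
    old.newer_version_doi = new_doi  # the old page still serves the old data
    return new
```

Keeping the metadata in a frozen dataclass means an accidental attempt to edit it raises an error, while the landing page itself stays editable.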

  25. DOI assignment • The British Library (acting on behalf of DataCite) set us up with an account. Even though NERC as a whole is acting as the publisher for this data, each data centre has its own assignment account. These may well be administered centrally by NERC. • We decided to use GUIDs (Globally Unique Identifiers) as the unique string. • The value of a GUID is represented as a 32-character hexadecimal string, such as {21EC2020-3AEA-1069-A2DD-08002B30309D}, and is usually stored as a 128-bit integer. The total number of unique keys is 2^128 or 3.4×10^38, roughly 300 million per cubic millimeter of the entire volume of the Earth. This number is so large that the probability of the same number being generated twice is extremely small. • The disadvantage is that they don’t look pretty, and there’s no branding in the string. • The advantage is that the opaqueness makes them easily transferable between data centres (if needed), and people won’t be tempted to type them in (risking typos) but will copy and paste them. • Our DOIs will look something like this: 10.5285/e8f43a51-0198-4323-a926-fe69225d57dd
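A GUID-based DOI string of the kind shown above can be sketched with Python's standard `uuid` module. This only builds and checks the string. Registering it with DataCite is a separate step not shown here. The prefix `10.5285` is taken from the example DOI on the slide; the function names `mint_doi_string` and `is_guid_doi` are hypothetical.

```python
import uuid

# Prefix copied from the example DOI on the slide.
NERC_PREFIX = "10.5285"

# Size of the 128-bit identifier space, about 3.4e38.
GUID_SPACE = 2 ** 128


def mint_doi_string(prefix: str = NERC_PREFIX) -> str:
    """Build '<prefix>/<random GUID>'. This does not register anything."""
    # uuid4() is random; 6 of its 128 bits are fixed (version and variant).
    return f"{prefix}/{uuid.uuid4()}"


def is_guid_doi(doi: str, prefix: str = NERC_PREFIX) -> bool:
    """Check that a DOI has the expected prefix and a hyphenated GUID suffix."""
    head, sep, tail = doi.partition("/")
    if head != prefix or not sep:
        return False
    try:
        parsed = uuid.UUID(tail)
    except ValueError:
        return False
    # Accept only the canonical 8-4-4-4-12 form, in either letter case.
    return str(parsed) == tail.lower()
```

For example, `is_guid_doi("10.5285/e8f43a51-0198-4323-a926-fe69225d57dd")` returns `True`, and every call to `mint_doi_string()` produces a string that passes the same check.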

  26. Human-readable citation string • We worked on this in the JISC and NERC-funded CLADDIER project (http://claddier.badc.ac.uk/trac) and will follow those rules • For example, a non-DOI-ed dataset can be referenced as: • Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Wrench, C.L.]. Chilbolton Facility for Atmospheric and Radio Research (CFARR) data, [Internet]. British Atmospheric Data Centre, 2003-, Date of citation. Available from http://badc.nerc.ac.uk/data/chilbolton/. • A DOI-ed dataset will be cited as: • Science and Technology Facilities Council (STFC), Chilbolton Facility for Atmospheric and Radio Research, [Callaghan, S. A., J. Waight, C. J. Walden, J. Agnew and S. Ventouras]. GBS 20.7GHz slant path radio propagation measurements, Sparsholt site, [Internet]. British Atmospheric Data Centre, 2003-2005, doi:10.5285/E8F43A51-0198-4323-A926-FE69225D57DD
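The two example strings above share a pattern that can be sketched as a small formatter. The pattern is inferred from these two examples only, not from the full CLADDIER rules. The function `format_citation` and its parameter names are hypothetical.

```python
from typing import Optional


def format_citation(owner: str, authors: str, title: str, repository: str,
                    coverage: str, doi: Optional[str] = None,
                    url: Optional[str] = None,
                    cited_on: Optional[str] = None) -> str:
    """Build a citation string in the pattern of the two slide examples."""
    head = f"{owner}, [{authors}]. {title}, [Internet]. {repository}, {coverage}"
    if doi is not None:
        # DOI-ed dataset: the DOI is the stable pointer, so no access date.
        return f"{head}, doi:{doi}"
    if url is None or cited_on is None:
        raise ValueError("a non-DOI-ed reference needs both url and cited_on")
    # Non-DOI-ed dataset: record when it was accessed and where.
    return f"{head}, {cited_on}. Available from {url}."


doi_citation = format_citation(
    owner=("Science and Technology Facilities Council (STFC), "
           "Chilbolton Facility for Atmospheric and Radio Research"),
    authors="Callaghan, S. A., J. Waight, C. J. Walden, J. Agnew and S. Ventouras",
    title="GBS 20.7GHz slant path radio propagation measurements, Sparsholt site",
    repository="British Atmospheric Data Centre",
    coverage="2003-2005",
    doi="10.5285/E8F43A51-0198-4323-A926-FE69225D57DD",
)
```

Here `doi_citation` holds the DOI-ed example string from the slide. Passing `url` and `cited_on` instead of `doi` gives the non-DOI-ed form.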
