CLADDIER Citation, Location, and* Deposit in Discipline and Institutional Repositories

Bryan Lawrence (obviously et al.)
* Annotation

Outline
“Full and open access to scientific data must also be ensured. Archiving of, and open access to, data will be a major challenge.”
[Statement to the Second Earth Observation Summit, Tokyo, 25 April 2004, by Prof. Thomas Rosswall, Executive Director, on behalf of the International Council for Science (ICSU)]
• Data Publication, Why, Why Now?
• The CLADDIER use case
• The story
• The consequences
• The Future

Data Publication – Why 1?
• Data provides evidence that supports, vindicates or disproves scientific theory.
• Data underpins everything.
• We teach school children to record all experimental results, but in most scientific disciplines we discard those records after “the result” is published.
• Even then “the result” is actually “the interpretation” and “the raw result” is often left to lie fallow and to be forgotten.
• The (raw and processed) data should be as much a part of the scientific record as the conclusions.

Data Publication – Why 2?
• In most sciences, data production is expensive, and interpretation is cheap!
• It is a rare scientist who squeezes all the scientific fruit from their data, and it is a rare science that doesn’t benefit from data aggregation.
• “One person’s noise is another person’s signal” (anyone who can give me a reliable source of the original quote can have a free beer).
BUT: It’s one thing to make data available; it’s another thing to make it available with quality control, provenance, and sufficient detail for it to be used without reference to the original author …

Data Publication – Why now?
Because we can! The technology (in particular the software) is up to it.
• We have the machinery to describe data adequately.
• AI may not have delivered clever robots yet, but it has delivered much of what we need for data publication!
• We have the machinery to find it.
• We have the machinery to display it.
Because we should! The chain from data production to “traditional” publication is now so long that many good scientists never get to publish “traditional” papers.
• We need to recognise the excellence of “data scientists” within academia using metrics understood by their employers (publication and citation).
• Like complex mathematics, complex data interpretation needs to be repeatable, which means the sources need to be available.

The CLADDIER use case, part 1
Joanna, at the University of Southampton, has done some work on the biology of seawater at a location off the coast of Cornwall. As part of her analysis she needs to acquire (from a number of locations):
• Publications and data describing prior or similar work.
• Oceanic profiles of salinity and temperature from the closest cruise in time and space.
• Meteorological data to accompany both her own sampling and the oceanic data.
• Remotely sensed ocean colour imagery (to add additional information on the biota).
When her analysis is complete, she will publish a paper that cites the above datasets and lodge the paper in her own institutional repository. She will also deposit her datasets in one or more appropriate data repositories (probably, in her case, both the SOC data archive and the British Oceanographic Data Centre, BODC).
Ideally, in the process of doing this, the archives holding the datasets and publications she cites would be notified that a paper citing them had been submitted, and the metadata associated with those records would be updated to reflect the citations. The metadata in the publication repository should also link to the data in the data archives and vice versa.
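The notification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CLADDIER implementation: the `Record` and `Repository` classes, the `deposit` function, and every identifier (such as `bodc:ctd-profile`) are hypothetical names invented for the example. It shows a deposit notifying each repository that holds a cited record, so that the cited record gains a back-link to the citing paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    """A paper or dataset held in a repository (hypothetical model)."""
    identifier: str
    title: str
    cites: list[str] = field(default_factory=list)     # outgoing citations
    cited_by: list[str] = field(default_factory=list)  # back-links added on notification


class Repository:
    """Hypothetical repository that stores records and accepts citation notices."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: dict[str, Record] = {}

    def add(self, record: Record) -> None:
        self.records[record.identifier] = record

    def notify_cited(self, target_id: str, citing_id: str) -> bool:
        """Note that citing_id cites target_id; return False if target is not held here."""
        record = self.records.get(target_id)
        if record is None:
            return False
        if citing_id not in record.cited_by:
            record.cited_by.append(citing_id)
        return True


def deposit(paper: Record, home: Repository, others: list[Repository]) -> list[str]:
    """Lodge a paper in its home repository and notify holders of everything it cites.

    Returns the cited identifiers that no repository recognised.
    """
    home.add(paper)
    unresolved = []
    for target in paper.cites:
        # Notify every repository, since a dataset may be held in more than one archive.
        results = [repo.notify_cited(target, paper.identifier) for repo in [home, *others]]
        if not any(results):
            unresolved.append(target)
    return unresolved


# Example with made-up identifiers.
bodc = Repository("BODC")
bodc.add(Record("bodc:ctd-profile", "Salinity and temperature profiles"))
institutional = Repository("Institutional repository")
paper = Record(
    "inst:paper-1",
    "Seawater biology off Cornwall",
    cites=["bodc:ctd-profile", "unknown:met-data"],
)
unresolved = deposit(paper, institutional, [bodc])
```

After the deposit, the dataset record in `bodc` lists the paper in `cited_by`, and `unresolved` holds the one citation that no repository could match. A real system would also need persistent identifiers and a transport between repositories, which this sketch leaves out.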

The CLADDIER use case, part 2
It turns out that the work Joanna has done is of significant interest in calibrating a global earth system model where one might need to compare simulations of oceanic carbon dioxide production with the scenarios used in the model.
Fred, at Reading University, needs to be able to find Joanna’s paper and data either via citations or directly from publication repositories. Having found the paper, the data should be obtainable via the citation and the data archive. Before he uses Joanna’s data, he is likely to check back through the other datasets used and cited as inputs to it, because he suspects Joanna’s work could be recalibrated by using later, better-quality meteorological re-analyses.
Meanwhile, Joanna and all the dataset authors will be pleased that the citation of not only the publication, but also the datasets, will be reflected in the 2012 RAE.
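Fred’s “check back through the other datasets” step amounts to following citation links transitively. The sketch below assumes the links are already available as a simple mapping from an identifier to the identifiers it cites; the function name `upstream_sources` and all identifiers are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from collections import deque


def upstream_sources(start: str, cites: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk of 'cites' links.

    Returns every record reachable from start, in order of discovery,
    visiting each record once even if the links contain cycles.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in cites.get(current, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


# Made-up citation links: the paper cites Joanna's data, which cites its inputs.
links = {
    "paper:joanna": ["data:joanna-samples"],
    "data:joanna-samples": [
        "data:ocean-profiles",
        "data:met-reanalysis",
        "data:ocean-colour",
    ],
}
sources = upstream_sources("paper:joanna", links)
```

Here `sources` lists Joanna’s own dataset first, followed by the three input datasets, which is the set Fred would inspect before deciding whether a newer meteorological re-analysis should be substituted.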

Requirements
• Location and acquisition of both papers and data. Implies we need a “discovery engine” (more than Google!).
• Creation of personal metadata (out of scope).
• Citation mechanism. How do we cite data? (What does a citation look like, and what exists at the citation target?)
• What does publishing data mean? What would a referee do to referee data?
• How do we deal with persistence of citations? Our expectation is that a citation should exist in perpetuity.
• Linking mechanisms between data and publication repositories.
• Support for annotation.
• Support for metrics.

The Future, Part 1
There are a number of “data publication” initiatives under way:
• Some are represented here, some are not.
• Two key absentees are:
  • The Earth System Atlas http://www.lehigh.edu/~inesa/ (initial funding from NSF, still immature, but concentrating thus far on refereeing procedures).
  • “Publication and Citation of Scientific Primary Data” http://www.std-doi.de (initially funded by the German Research Forum, relatively mature, delivering persistence via reliable repositories and DOIs, but issues of citation and refereeing not fully resolved).

The Future, Part 2
• OJMS: Overlay Journal for Meteorological Science (or something similar).
• New JISC-funded project, NCAS with Royal Met Soc, to deliver a new journal prototype. Success will depend on:
  • Availability and quality of data, i.e. on the technology and on the sociology of the review process.
  • Interaction between the “traditional” journal world and the data publication world.
• Multiple projects a good thing!
• Data Publication is an idea whose time has come!
• Crucial to get critical mass (across projects) on:
  • Acceptable methods of citing data
