Defining the Data Citation Problem in the DataNet Context
December 2009
John Kunze, University of California Curation Center, Oakland, CA
Robert Cook, Environmental Sciences Division, Oak Ridge National Laboratory, TN
Patricia Cruse, University of California Curation Center, Oakland, CA
Carol Tenopir, School of Information Sciences, University of Tennessee, Knoxville, TN
Todd Vision, Department of Biology, University of North Carolina, Chapel Hill, NC
William Michener, University Libraries, University of New Mexico, Albuquerque, NM
Data’s shameful neglect
“Research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly.”
Nature editorial, 10 September 2009
The scientific record is at risk
• Incompatible formats, models, semantics
• Poor preservation practice
• Dispersed sources
• Science needs this record to verify findings and test new hypotheses
• Record at risk → planet at risk
Data preservation is hard; start small with data publication
The risk is complex, with social and technical dimensions. Can we start small?
• Insight: the data that drive much of the scientific journal literature are produced in islands of practice, resulting in unshared, incompatible datasets
• Hypothesis: establishing a system of data publishing will promote data sharing and re-use by providing standards and producer incentives
Publishing → Sharing → Use → Preservation
Data publishing challenges
• Datasets encompass everything
• Data plus documents, images, audio, video, etc.
• Tension between standardization and innovation
• Data is similar to software, but even more specialized
• OK to maintain in-house, but tedious to prepare for release
• Technical dependence complicates long-term maintenance
• Internal consistency requirements, plus provenance
• Some built-in instability: long-term value of some data can depend on change, such as annotation
Data publication is hard; start small with data citation
Published data, from the outset, will call for citations
• Need links from journal articles to data used
Hypothesis: establishing simple, easy conventions for data citation will encourage its practice, hence data publishing, hence data preservation
data citation → data publishing → data preservation
Data citation leads to data set
Luyssaert, S., I. Inglima and M. Jung. 2009. Global Forest Ecosystem Structure and Function Data for Carbon Balance Research. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/949
http://dx.doi.org/10.3334/ORNLDAAC/949
http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=949
• Often leads to one or more surrogates
• If data set is archived, leads to data files
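The step from the cited identifier to a clickable link can be sketched in a few lines. This is a minimal illustration, assuming the `http://dx.doi.org/` resolver prefix shown on the slide; the function name `doi_to_url` and its validation rule are hypothetical and not part of any DOI tooling.

```python
from urllib.parse import quote

# Resolver prefix as it appears on the slide (assumption: still the desired target).
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a citation-style DOI such as 'doi:10.3334/ORNLDAAC/949' into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Very loose sanity check (hypothetical rule): DOI names start with '10.' and contain a '/'.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Keep '/' readable; percent-encode any other reserved characters.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("doi:10.3334/ORNLDAAC/949"))
# → http://dx.doi.org/10.3334/ORNLDAAC/949
```

The resolver then redirects to whatever landing page (surrogate) the archive has registered, which is what makes the citation actionable.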
[Figure: a citation target shown with its small, smaller, and smallest surrogates]
Data citation examples
World Data Center for Paleoclimatology Data (NOAA)
Anderson, D.W., W.L. Prell, and N.J. Barratt. 1989. Estimates of sea surface temperature in the Coral Sea at the last glacial maximum. Paleoceanography 4(6):615-627. Data archived at the World Data Center for Paleoclimatology, Boulder, Colorado, USA.
(no identifier)
Publishing Network for Geoscientific & Environmental Data in Germany
Nishioka, J et al. (2008): Profiles of iron concentration from GoFlow bottles during the CARUSO-EISENEX experiment, doi:10.1594/PANGAEA.701305, Supplement to: Nishioka, Jun; Takeda, Shigenobu; de Baar, Hein JW; Croot, Peter L; Boyé, Marie; Laan, Patrick; Timmermans, Klaas R (2005): Changes in the concentration of iron in different size fractions during an iron enrichment experiment in the open Southern Ocean, Marine Chemistry, 95(1-2), 51-63, doi:10.1016/j.marchem.2004.06.040
(2 identifiers: 1 for publication, 1 for data)
More data citation examples
ICPSR
Kessler, Ronald C. National Comorbidity Survey: Baseline (NCS-1), 1990-1992 (Restricted Version) [Computer file]. ICPSR25381-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2009-05-11. doi:10.3886/ICPSR25381
Economic Modeling
Figure 3. Change of relative agricultural producer prices since 1998. Middle-income CIS show average for Russia, Kazakhstan, and Ukraine. …. Source: OECD, 2004 and CIS Statistics, 2003.
(archival data center?)
(2 organizations listed, but which of their 100s of datasets were used?)
Contrasting citation styles
Some commonalities (who, when, where), but
• Prose is interspersed with metadata elements
• A standard citation format/recipe would be easier to read
• Not every citation had an actionable identifier
• Name of dataset and data subset used (what) unclear
• Archival commitment unclear
• Date of publication vs. date of collection unclear
• One citation contained another citation (for publication)
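One way to picture a "standard citation recipe" is a fixed set of fields rendered in a fixed order. The sketch below is purely illustrative: the class `DataCitation`, its field names, and the output order are assumptions made for this example, not a format proposed in the slides or by any standards body. The sample values come from the ORNL DAAC citation shown earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical minimal recipe: who, when, what, where, plus an identifier."""
    creators: str                      # who
    year: str                          # when (publication, not collection)
    title: str                         # what
    archive: str                       # where
    identifier: Optional[str] = None   # e.g. a DOI; None if the citation has none

    def is_actionable(self) -> bool:
        """True if the identifier looks resolvable (DOI or URL); a rough heuristic."""
        ident = (self.identifier or "").lower()
        return ident.startswith(("doi:", "http://", "https://"))

    def format(self) -> str:
        """Render the fields in one fixed order, with no interspersed prose."""
        parts = [f"{self.creators} ({self.year}).", f"{self.title}.", f"{self.archive}."]
        if self.identifier:
            parts.append(self.identifier)
        return " ".join(parts)


citation = DataCitation(
    creators="Luyssaert, S., I. Inglima and M. Jung",
    year="2009",
    title="Global Forest Ecosystem Structure and Function Data for Carbon Balance Research",
    archive="Oak Ridge National Laboratory Distributed Active Archive Center",
    identifier="doi:10.3334/ORNLDAAC/949",
)
print(citation.format())
print(citation.is_actionable())  # → True
```

Keeping the metadata in named fields rather than free prose is what lets a reader, or a program, tell at a glance which element is missing, such as the identifier in the Paleoclimatology example.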
What we want from data citation
• Precise identification of dataset
• At level of version, file, table, etc., or groups thereof
• So that readers can find and understand the data
• Credit to data producers and data publishers
• Vital incentive for data sharing and archiving
• A link from the traditional literature to the data
• Gives intellectual legitimacy to creation of data sets
• Research metrics for datasets
• Sponsors want publication and retention numbers
Starter data citation wish list
• Any dataset, database, data file
• All levels of granularity (table, row, cell)
• For any snapshot (version, e.g., in time)
• Any formatted view: XML, HTML, CSV, etc.
• With and without annotations
• Links to older, newer, and latest versions
• Actionability (“Click-through”)
• Persistence (validity into the future)
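The wish list can be read as a description of what a citation target has to record. The following is a rough sketch of that reading, not a proposed schema: `CitationTarget`, its fields, `version_links`, and the version chain `["v1", "v2", "v3"]` are all hypothetical. Only the ICPSR DOI and the `v1` label come from the examples above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CitationTarget:
    """Hypothetical record of exactly what a citation points at."""
    dataset_id: str                 # any dataset, database, or data file
    version: str                    # snapshot being cited
    granule: Optional[str] = None   # e.g. "table", "row", "cell"; None = whole dataset
    view: Optional[str] = None      # e.g. "XML", "HTML", "CSV"
    annotated: bool = False         # with or without annotations


def version_links(chain: List[str], current: str) -> Tuple[Optional[str], Optional[str], str]:
    """Return (older, newer, latest) for `current` within an ordered version chain."""
    i = chain.index(current)  # raises ValueError if the version is unknown
    older = chain[i - 1] if i > 0 else None
    newer = chain[i + 1] if i + 1 < len(chain) else None
    return older, newer, chain[-1]


target = CitationTarget("doi:10.3886/ICPSR25381", version="v1", granule="table", view="CSV")
# Assumed chain for illustration only; the slides do not say later versions exist.
print(version_links(["v1", "v2", "v3"], target.version))
# → (None, 'v2', 'v3')
```

Actionability and persistence are not modeled here; they depend on the identifier scheme and the archive rather than on the record itself.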
Data citation wish list possibilities
We want it all, but might settle for initial partial solutions
All datasets? Well, maybe just archived datasets*
All levels of granularity? For any snapshot? All views? Publisher-defined granules, versions, and views*
Plus older/newer version, and latest version? Surrogate-based pointer to extant version chain*
With and without annotations? Annotation as publication*
What about actionability and persistence? Yes and yes*
(* Standards and archives needed for all)
Initiatives and outfits to watch
• DataCite initiative: to encourage data publishing via global data citation support: standards, persistent reference to datasets in regional archives
• Supplemental materials publishing standards for data, surrogates, and extended descriptions and methods, e.g., technical data application appendices
• Publishers: increased volume of submission
• Community standards (so many to choose from!): ORNL DAAC, Pangaea, GCMD, ESIP, GBIF, TDWG, OECD, NISO/NFAIS, IPYDIS, Dataverse, etc.
Data citation summary
Data citation helps publication and sharing, which helps preservation and re-use, which saves the planet
• Gives credit to data producers and data publishers
• Vital incentive for data sharing and archiving
• Provides a link from traditional literature to data
• Gives intellectual legitimacy to creation of data
• Research metrics for datasets
• Sponsors want publication and retention numbers
• Need recipes and stuff, i.e., standards and archives