
Defining the Data Citation Problem in the DataNet Context

December 2009 John Kunze, University of California Curation Center, Oakland, CA Robert Cook, Environmental Sciences Division, Oak Ridge National Laboratory, TN Patricia Cruse, University of California Curation Center, Oakland, CA



Presentation Transcript


  1. December 2009 John Kunze, University of California Curation Center, Oakland, CA Robert Cook, Environmental Sciences Division, Oak Ridge National Laboratory, TN Patricia Cruse, University of California Curation Center, Oakland, CA Carol Tenopir, School of Information Sciences, University of Tennessee, Knoxville, TN Todd Vision, Department of Biology, University of North Carolina, Chapel Hill, NC William Michener, University Libraries, University of New Mexico, Albuquerque, NM Defining the Data Citation Problem in the DataNet Context

  2. Data’s shameful neglect “Research cannot flourish if data are not preserved and made accessible. All concerned must act accordingly.” 10 September 2009

  3. The scientific record is at risk • Incompatible formats, models, semantics • Poor preservation practice • Dispersed sources • Science needs this record to verify findings and test new hypotheses • Record at risk → planet at risk

  4. Collage: J. Callaway, USF

  5. Data preservation is hard; start small with data publication The risk is complex, with social and technical dimensions – can we start small? • Insight: the data that drive much of the scientific journal literature are produced in islands of practice, resulting in unshared, incompatible datasets • Hypothesis: establishing a system of data publishing will promote data sharing and re-use by providing standards and producer incentives Publishing → Sharing → Use → Preservation

  6. Data publishing challenges • Datasets encompass everything • Data plus documents, images, audio, video, etc. • Tension between standardization and innovation • Data is similar to software, but even more specialized • OK to maintain in-house, but tedious to prepare for release • Technical dependence complicates long-term maintenance • Internal consistency requirements, plus provenance • Some built-in instability: long-term value of some data can depend on change, such as annotation

  7. Data publication is hard; start small with data citation Published data, from outset, will call for citations • Need links from journal articles to data used Hypothesis: establishing simple, easy conventions for data citation will encourage its practice, hence data publishing, hence data preservation data citation → data publishing → data preservation

  8. Data citation leads to data set Luyssaert, S., I. Inglima and M. Jung. 2009. Global Forest Ecosystem Structure and Function Data for Carbon Balance Research. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/949 → http://dx.doi.org/10.3334/ORNLDAAC/949 → http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=949 → Leads often to one or more surrogates → If data set is archived, leads to data files
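The first step in the chain on slide 8, from a cited DOI to a clickable address, can be sketched in a few lines. This is an illustrative sketch only: the function name `doi_to_url` is hypothetical, and the resolver prefix is simply the one shown on the slide.

```python
def doi_to_url(doi: str, resolver: str = "http://dx.doi.org/") -> str:
    """Turn a cited DOI such as 'doi:10.3334/ORNLDAAC/949' into a resolver URL.

    The default resolver prefix is taken from the slide; the function name
    and signature are assumptions made for this sketch.
    """
    doi = doi.strip()
    # Drop the leading "doi:" label if the citation includes one.
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return resolver + doi


# The DOI cited on slide 8.
ornl_url = doi_to_url("doi:10.3334/ORNLDAAC/949")
print(ornl_url)
```

The resolver then redirects to whatever landing page (surrogate) the archive has registered, which is what keeps the citation actionable even if the archive's own URL changes.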

  9. Small surrogate Citation target Smaller surrogate Smallest surrogate

  10. Data citation examples World Data Center for Paleoclimatology Data (NOAA) Anderson, D.W., W.L. Prell, and N.J. Barratt. 1989. Estimates of sea surface temperature in the Coral Sea at the last glacial maximum. Paleoceanography 4(6):615-627. Data archived at the World Data Center for Paleoclimatology, Boulder, Colorado, USA. no identifier Publishing Network for Geoscientific & Environmental Data in Germany Nishioka, J et al. (2008): Profiles of iron concentration from GoFlow bottles during the CARUSO-EISENEX experiment, doi:10.1594/PANGAEA.701305, Supplement to: Nishioka, Jun; Takeda, Shigenobu; de Baar, Hein JW; Croot, Peter L; Boyé, Marie; Laan, Patrick; Timmermans, Klaas R (2005): Changes in the concentration of iron in different size fractions during an iron enrichment experiment in the open Southern Ocean, Marine Chemistry, 95(1-2), 51-63, doi:10.1016/j.marchem.2004.06.040 2 identifiers: 1 for publication, 1 for data

  11. More data citation examples ICPSR Kessler, Ronald C. National Comorbidity Survey: Baseline (NCS-1), 1990-1992 (Restricted Version) [Computer file]. ICPSR25381-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2009-05-11. doi:10.3886/ICPSR25381 Economic Modeling Figure 3. Change of relative agricultural producer prices since 1998. Middle-income CIS show average for Russia, Kazakhstan, and Ukraine. …. Source: OECD, 2004 and CIS Statistics, 2003. archival data center? 2 organizations listed, but which of their 100s of datasets were used?

  12. Contrasting citation styles Some commonalities (who, when, where), but • Prose is interspersed with metadata elements • Standard citation format/recipe would be easy to read • Not every citation had an actionable identifier • Name of dataset and data subset used (what) unclear • Archival commitment unclear • Date of publication vs date of collection unclear • One citation contained another citation (for publication)
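One way to picture a "standard citation format/recipe" is as a small record holding the who/when/what/where elements separately from the prose, plus an optional identifier. The sketch below is an assumption for illustration, not a proposed standard: the class name `DataCitation`, its field names, and its output layout are all hypothetical. The two sample records reuse details from the examples on slide 10.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical record of the elements the slides call who/when/what/where."""
    who: str                          # data producers
    when: str                         # publication year or date
    what: str                         # name of the dataset (or subset) used
    where: str                        # archive or publisher holding the data
    identifier: Optional[str] = None  # e.g. a DOI; None when the citation has none

    def is_actionable(self) -> bool:
        # A citation is "click-through" only if it carries an identifier.
        return bool(self.identifier)

    def format(self) -> str:
        # Fixed element order, so metadata is not interspersed with prose.
        text = f"{self.who} ({self.when}): {self.what}. {self.where}."
        if self.identifier:
            text += f" {self.identifier}"
        return text


# Sample records built from the slide 10 examples.
noaa = DataCitation(
    who="Anderson, D.W., W.L. Prell, and N.J. Barratt",
    when="1989",
    what="Estimates of sea surface temperature in the Coral Sea at the last glacial maximum",
    where="World Data Center for Paleoclimatology, Boulder, Colorado, USA",
)
pangaea = DataCitation(
    who="Nishioka, J et al.",
    when="2008",
    what="Profiles of iron concentration from GoFlow bottles during the CARUSO-EISENEX experiment",
    where="Publishing Network for Geoscientific & Environmental Data in Germany",
    identifier="doi:10.1594/PANGAEA.701305",
)

for citation in (noaa, pangaea):
    print(citation.is_actionable(), "|", citation.format())
```

Keeping the elements in separate fields makes gaps such as a missing identifier easy to detect mechanically, which is harder when the same information is embedded in free prose.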

  13. What we want from data citation • Precise identification of dataset • At level of version, file, table, etc., or groups thereof • So that readers can find and understand the data • Credit to data producers and data publishers • Vital incentive for data sharing and archiving • A link from the traditional literature to the data • Gives intellectual legitimacy to creation of data sets • Research metrics for datasets • Sponsors want publication and retention numbers

  14. Starter data citation wish list • Any dataset, database, data file • All levels of granularity (table, row, cell) • For any snapshot (version, e.g., in time) • Any formatted view: XML, HTML, CSV, etc. • With and without annotations • Links to older, newer, and latest versions • Actionability (“Click-through”) • Persistence (validity into the future)
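The wish-list item "links to older, newer, and latest versions" can be pictured as a doubly linked chain of dataset snapshots. The sketch below is illustrative only: the class `DatasetVersion`, the helper functions, and the `example:` identifiers are hypothetical placeholders, not identifiers from any real archive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # compare versions by identity, not by following links
class DatasetVersion:
    identifier: str                            # placeholder identifier for this snapshot
    older: Optional["DatasetVersion"] = None   # link to the previous version
    newer: Optional["DatasetVersion"] = None   # link to the next version


def append_version(current: DatasetVersion, new: DatasetVersion) -> DatasetVersion:
    """Link `new` as the successor of `current` and return it."""
    current.newer = new
    new.older = current
    return new


def latest_version(version: DatasetVersion) -> DatasetVersion:
    """Follow the `newer` links from any snapshot to the end of the chain."""
    while version.newer is not None:
        version = version.newer
    return version


# Hypothetical three-version chain.
v1 = DatasetVersion("example:dataset/v1")
v2 = append_version(v1, DatasetVersion("example:dataset/v2"))
v3 = append_version(v2, DatasetVersion("example:dataset/v3"))

print(latest_version(v1).identifier)
```

A citation can then point at one specific snapshot while still letting a reader discover that newer versions exist.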

  15. Datasets and documents have much in common

  16. Data citation wish list possibilities We want it all, but might settle for initial partial solutions All datasets? Well, maybe just archived datasets* All levels of granularity? For any snapshot? All views? Publisher-defined granules, versions, and views* Plus older/newer version, and latest version? Surrogate-based pointer to extant version chain* With and without annotations? Annotation as publication* What about actionability and persistence? Yes and yes* (* Standards and archives needed for all)

  17. Initiatives and outfits to watch • DataCite initiative: to encourage data publishing via global data citation support: standards, persistent reference to datasets in regional archives • Supplemental materials publishing standards for data, surrogates, and extended descriptions and methods, e.g., technical data application appendices • Publishers: increased volume of submission • Community standards (so many to choose from!): ORNL DAAC, Pangaea, GCMD, ESIP, GBIF, TDWG, OECD, NISO/NFAIS, IPYDIS, Dataverse, etc.

  18. Data citation summary Data citation helps publication and sharing, which helps preservation and re-use, which saves the planet • Gives credit to data producers and data publishers • Vital incentive for data sharing and archiving • Provides a link from traditional literature to data • Gives intellectual legitimacy to creation of data • Research metrics for datasets • Sponsors want publication and retention numbers • Need recipes and stuff, i.e., standards and archives
