
The importance of having data-sets

Datasets as the crown jewels of an institute's scientific infrastructure. 2006 IATUL Conference, Porto. Ronald Dekker.


Presentation Transcript


  1. The importance of having data-sets (freely adapted from Oscar Wilde). Datasets as the crown jewels of an institute's scientific infrastructure. 2006 IATUL Conference, Porto. Ronald Dekker.

  2. Data-set importance
     • Verification of publications (results = analysis + data)
     • Longitudinal research (long periods, meta-research)
     • Interdisciplinary use of data (reuse/innovation)
     • Valorisation (get new projects based on data set ownership)

  3. A scientific workflow (diagram labels: Data, Model, Parameters, Result, Article)

  4. Data publication today (diagram, modified after Helly et al. (2003); labels: Research, Manuscript, Data, Metadata, Publication, Library, Private Files)

  5. “In archival terms the last quarter of the 20th century has some similarities to the dark ages. Only fragments or written descriptions of the digital maps produced exist. The originals have disappeared or can no longer be accessed.” Taylor

  6. The hydrological research. DARELUX: Data Archiving River Environment LUXemburg. The relation between rainfall and discharge (diagram labels: Rain, "And everything in between", Discharge, Long term, Direct, Modeling, discharge prediction)

  7. Data Archiving River Environment LUXemburg
     • Huelerbach: 1.6 km², Paris basin (Sandstone, Lime)
     • Maisbich: 1.2 km², Ardennes Massif (Slate)

  8. Measurements (diagram labels: Interception, transpiration; Rainfall; Surface; Sub-surface soil; Deep soil; River)
     • Rainfall: ASTA, Administration des services techniques de l'Agriculture
     • Interception: TUDelft, Gabriel Lippmann Inst.

  9. Measurements (diagram labels, in slide order: TUD: Road run-off; University Utrecht: Soil moisture; Interception, transpiration; Rainfall; Surface; Sub-surface; Gabriel Lippmann, TUD; Piezometers; University Luxemburg; Gabriel Lippmann, TUD; V Notch; Deep soil; University Luxemburg; Gabriel Lippmann, TUD; Tracers; River)

  10. Research pilot DARELUX (Data Archiving River Environment LUXemburg). Why is archiving important: the researchers' view
     • Direct:
       - organized storage and metadata assignment
       - data exchange, closed user groups
       - elaborating raw data

  11. Research pilot DARELUX (Data Archiving River Environment LUXemburg). Why is archiving important: the researchers' view
     • Long term:
       - (long) time series
       - reuse of data in education and research
       - verification
       - enhanced continuity

  12. Research pilot DARELUX (Data Archiving River Environment LUXemburg). Why is archiving important: the researchers' view. It can be done otherwise... Records of floods in Koblenz

  13. The DARELUX approach Capture, Publish and Preserve

  14. Capture and use (diagram labels: sensors; Data acquisition; Data correction and enrichment; Data ingest (OAIS); Archive; Data storage (OAIS); Retrieval (OAIS); Model; users)
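The capture chain named on slide 14 can be pictured as a small pipeline. The sketch below is only an illustration: reading the diagram as a linear flow is an assumption, and the names `Reading`, `correct_and_enrich`, and `Archive` are hypothetical and do not come from the DARELUX project.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Reading:
    """One sensor observation (hypothetical structure, not DARELUX's)."""
    sensor_id: str
    timestamp: str                 # ISO 8601 text
    value: Optional[float]         # None marks a missing raw value
    metadata: Dict[str, str] = field(default_factory=dict)


def correct_and_enrich(raw: List[Reading], unit: str) -> List[Reading]:
    """Correction: drop missing values. Enrichment: attach the unit as metadata."""
    cleaned = []
    for r in raw:
        if r.value is None:
            continue
        cleaned.append(
            Reading(r.sensor_id, r.timestamp, r.value, {**r.metadata, "unit": unit})
        )
    return cleaned


class Archive:
    """In-memory stand-in for the ingest, storage, and retrieval stages."""

    def __init__(self) -> None:
        self._store: Dict[str, List[Reading]] = {}

    def ingest(self, readings: List[Reading]) -> int:
        """Store readings per sensor and return how many were accepted."""
        for r in readings:
            self._store.setdefault(r.sensor_id, []).append(r)
        return len(readings)

    def retrieve(self, sensor_id: str) -> List[Reading]:
        """Return a copy of everything stored for one sensor."""
        return list(self._store.get(sensor_id, []))
```

A user or model would then call `Archive.retrieve` with a sensor identifier; a real archive would of course persist the data rather than hold it in memory.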

  15. A DARELUX community

  16. Use and re-use of the DARELUX archive data
     • Primary users
       - Working with the archive (Delft and Utrecht)
     • Secondary users
       - Store data in the archive, use data from the archive (ASTA and Gabriel Lippmann)
     • Tertiary users
       - Use data from the archive (the world)

  17. Publish (Klump et al.)
     • Data publication today suffers from several flaws in the publication process:
       - Data are not published in journals due to economic constraints
       - There is little merit in data publication for the author because data are not citeable
       - Data are not citeable due to their often transient web locations (URLs)

  18. Publish: what needs to be done (Klump et al.)
     • Data publications must be citeable to be “valuable”
     • Reputation is the “currency” of science
     • Authors will only take this effort if it is easy enough and worthwhile doing so
     • Preparing data for publication takes a lot of effort
     • Data must be accessible
     • Use of persistent identifiers and long-term storage

  19. How to make data citeable (Klump et al.)
     • To become persistent, data sets need persistent identifiers (e.g. DOI, URN).
     • Piotrowska et al. (2005): Extraction and AMS radiocarbon dating of pollen from Lake Baikal Sediments. Scientific Drilling Database. doi:10.1594/GFZ.SDDB.1014
     • This dataset relates to:
       - Piotrowska, N., Bluszcz, A., Demske, D., Granoszewski, W. & Heumann, G. (2004): Extraction and AMS radiocarbon dating of pollen from Lake Baikal sediments. Radiocarbon, 46 (1), 181-187.

  20. Publish: how to make data citeable (Klump et al.)
     • Data publications need more than persistent identifiers; they also need to be stored in trusted long-term archives.
     • Several initiatives are working on criteria to certify trusted long-term archives.
     • Centralised archives stand a better chance of existing for a long time, but this does not rule out small specialised repositories.
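As an illustration of how a persistent identifier makes a data set citeable, the sketch below turns the DOI from slide 19 into a resolver link and assembles a citation string in the same pattern as the slide's example. The function names are hypothetical; the only external fact relied on is that DOIs resolve through the public `https://doi.org/` proxy.

```python
def doi_to_url(doi: str) -> str:
    """Turn 'doi:10.1594/...' or '10.1594/...' into a resolvable URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return "https://doi.org/" + doi


def format_data_citation(authors: str, year: int, title: str,
                         publisher: str, doi: str) -> str:
    """Build a citation in the pattern used on the slide:
    Authors (Year): Title. Publisher. doi:..."""
    return f"{authors} ({year}): {title}. {publisher}. doi:{doi}"


citation = format_data_citation(
    "Piotrowska et al.",
    2005,
    "Extraction and AMS radiocarbon dating of pollen from Lake Baikal Sediments",
    "Scientific Drilling Database",
    "10.1594/GFZ.SDDB.1014",
)
link = doi_to_url("doi:10.1594/GFZ.SDDB.1014")
```

Because the identifier, not the web location, is what gets cited, the archive can move the files without breaking the reference.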

  21. And preserve... Take care of the preservation of the data-sets, providing eternal access to scientific heritage

  22. Quid aeternis minorem consiliis animum fatigas? Why burden your humble mind with plans for eternity? (Horace)

  23. The risks in a nutshell
     • Physical decay of storage media
     • Loss of descriptive (meta)data: inability to retrieve data and context
     • Loss of “rendering” functionality caused by the inability to run old software on new computers and operating systems
     • Who pays the ferryman?

  24. Our approach: the e-Archive project (2000-2002)
     Ronald Dekker, Kees van der Meer, Eugène Dürr: Digital preservation: The findings of the e-Archive project. Project's Final Report, September 2003. http://durr.dhs.org/EArchive/publications/e-Archivefindingsfinal13.pdf
     • Preventing physical degradation
       - replicate on new medium every 10(?) years
       - store items as simple bit streams
     • Preventing irretrievability
       - store metadata and information inseparably together; special attention to context and provenance metadata
     • Provide perpetual rendering
       - use emulation or conversion strategies
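Replicating a bit stream onto a new medium is only useful if the copy can be shown to be identical. The sketch below copies a file and compares SHA-256 digests of source and copy. Using checksums for this check is an assumption added for illustration; the slide does not say how the e-Archive project verified its copies, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, target_dir) -> Path:
    """Copy `source` into `target_dir` (standing in for the new medium)
    and fail loudly if the copy is not bit-identical."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source).name
    shutil.copyfile(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"replica of {source} does not match the original")
    return target
```

Storing the digest alongside the item would also allow later checks for physical decay between migrations.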

  25. Requirements for objects to be stored
     • Atomic (indivisible unit) = one file
     • Self-descriptive via metadata; no references needed for basic content
     • Implemented as an XML container
     • Independent of any context
     • Data archaeology argument: readable ASCII
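A minimal sketch of what such an object could look like: one XML file that carries both the metadata and the content, serialised as pure ASCII. The element names (`container`, `metadata`, `field`, `content`) and the choice of base64 for the payload are assumptions for illustration only and are not the e-Archive project's actual schema.

```python
import base64
import xml.etree.ElementTree as ET


def build_container(filename: str, content: bytes, metadata: dict) -> str:
    """Wrap content and its metadata in one self-describing XML document."""
    root = ET.Element("container")
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, "field", name=key).text = value
    item = ET.SubElement(root, "content", filename=filename, encoding="base64")
    item.text = base64.b64encode(content).decode("ascii")
    # Serialising as us-ascii turns any non-ASCII character into a
    # numeric character reference, so the file stays readable ASCII.
    return ET.tostring(root, encoding="us-ascii").decode("ascii")


def read_container(xml_text: str):
    """Recover (metadata, content) from a container built above."""
    root = ET.fromstring(xml_text.encode("ascii"))
    metadata = {f.get("name"): f.text for f in root.find("metadata")}
    content = base64.b64decode(root.find("content").text)
    return metadata, content
```

Keeping metadata and payload in the same file means neither can be lost without the other, which is the point of the "atomic" and "self-descriptive" requirements.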

  26. Open Archival Information System: Six Functional Entities (diagram)
     • Functional entities: Ingest, Archival Storage, Data Management, Administration, Preservation Planning, Access
     • Surrounding roles: Producer, Consumer, Management
     • Flow labels: SIP, AIP, DIP, Descriptive Info., queries, result sets, orders
     • SIP = Submission Information Package
     • AIP = Archival Information Package
     • DIP = Dissemination Information Package

  27. The e-Archive building blocks: XML containers as a logical storage structure (AIP)
     • Use pure character streams (ASCII/UTF)
     • Keep metadata together with content
     • Store the original and one or more other representations
     • Use a set of files in a lightweight hierarchy
     • Archive items in containers with XML
     • Also archive viewers in the archive: programs which give meaning to the content representation in the containers

  28. DARELUX Architecture

  29. Who pays the ferryman? A business plan for DARELUX
     • Costs
       - Breakdown of cost
       - €6000 per study / 20 datasets / 20 years
     • Revenues
       - Future use is not perceived as a source of income
     • Funding
       - Funding by research project (ingest, use, publication): primary users
       - Institutional or governmental funding needed for long-term preservation
       - Added-value services might create additional income
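The cost figure on slide 29 can be broken down with simple arithmetic. The sketch below assumes one particular reading of the slide, namely that €6000 covers one study of 20 datasets kept for 20 years; the slide itself does not spell this out, so the derived numbers are illustrative only.

```python
# Assumed reading of the slide: EUR 6000 for one study,
# consisting of 20 datasets, preserved for 20 years.
total_cost_eur = 6000
datasets_per_study = 20
preservation_years = 20

# Cost attributed to a single dataset over the whole period.
cost_per_dataset = total_cost_eur / datasets_per_study

# Cost attributed to a single dataset for a single year.
cost_per_dataset_year = cost_per_dataset / preservation_years

print(f"Per dataset over {preservation_years} years: EUR {cost_per_dataset:.2f}")
print(f"Per dataset per year: EUR {cost_per_dataset_year:.2f}")
```

Under this reading, the per-dataset, per-year amount is small compared with the cost of collecting the data, which fits the slide's argument that funding, not revenue, is the open question.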

  30. The project's status: things to do
     • The DARELUX project is at midterm: another year to go (midterm review resulted in a “green light” for the project)
     • Goals for the second year:
       - Involve secondary and tertiary users in the project
       - Seek collaboration with an OA periodical, e.g. HESS
       - Store data from secondary users in the archive (or make data available via the archive)
       - Further work on the business plan
       - Upscale DARELUX to a (trusted repository) service
       - Embedded in (3) TU (NL) and EU frameworks

  31. And we are very interested in co-operation Ronald Dekker r.dekker@tudelft.nl

  32. Data publication tomorrow? (diagram, modified after Helly et al. (2003); labels: Research, Manuscript, Data, Metadata, Publication, Data publication, Library, Data Center, Scientific Data Network)

  33. Quid aeternis minorem consiliis animum fatigas? Why burden your humble mind with plans for eternity? That’s why... (Horace)
