
Figure 1: DataONE Idealized Data Lifecycle



Presentation Transcript


The Data Lifecycle Flow: For Me, This Time
Erica Johns, Bob Dattore, & Sam Levis

Steps for Curating This Data (figure 2)

RESEARCH
  • Find Free Level 4 Ameriflux Data + Variables Scientist Wants
  • http://public.ornl.gov/ameriflux/index.html

DATA APPRAISAL
  • Pervasive throughout Lifecycle
  • Are the contents the scientist asked for within the data?

3. DATA ACQUISITION
  • Sign User Agreement with ORNL
  • Download CSV files for Ameriflux Sites
  • Should have asked ORNL Permission for Ingestion here

4. DATA ACCUMULATION
  • Compile CSV files into multi-year lists of Hourly, Daily, Weekly, Monthly

5. DATA APPRAISAL
  • Decide against Weekly data: week not defined as 7 days
  • Decide to gap fill missing years with -9999
  • Decide on ascending year order

6. DATA REFORMAT
  • CSV to Excel
  • Fill in missing data with -9999
  • Excel to CSV
  • CSV to NetCDF
  • Computer Program written in C++ to convert
  • Standardized CF conventions added in production of NetCDF to increase usability

7. DATA APPRAISAL
  • Does the NetCDF work with our scientist's script?
  • This is when I asked Permission for Ingestion from ORNL
  • Review for Metadata Creation

8. METADATA CREATION
  • First step towards Data Ingestion into Archive
  • Helps User find Data
  • Some documentation visible with dataset, some used only for faceted browsing in RDA

9. DATA INGESTION
  • Through the dsarch program, data files are ingested into the archive and information about the files is recorded in a database called RDA DB, within RDAMS
  • Requires scripting in the C-shell language to ingest

10. ARCHIVE
  • RDA = CISL Research Data Archive
  • http://rda.ucar.edu/datasets/ds387.0/

11. DATA APPRAISAL: User Feedback
  • Affects EVERY stage of Lifecycle
  • Data Scientists depend on User Feedback to understand Data Integrity
  • Users
      • Find Mistakes
      • Report Errors in Data
  • Effects
      • Data manager checks the error: is it in the original or the ingested data?
      • Data Reappraisal & Reacquisition
      • Update Data Accumulated, Reformat, Amend Metadata
      • Reingest & Rearchive

Future Lifecycle Components (figure 2)

12. PRESERVATION
  • Format Conversion
  • Historical data: still in usable format, but a less updated version
  • Active Migration System: every 2-3 years to keep data usable
  • Often conversion or translation of content to keep file readable

13. ACCESS, USE, AND REUSE
  • Free Access through RDA
  • Must Register: acts as a form of tracking data use

Figure 1: DataONE Idealized Data Lifecycle

Considerations within Data Lifecycle

DATA INTEGRITY
  • From a curator's perspective, based largely on user feedback
  • However, familiarity with the type of data and variables can allow the curator to notice if values are off
  • Depends on the reliability of the data provider

ANCILLARY DATA
  • Know audience, but keep accessible for interdisciplinary science
  • Unless data from NCAR, links provided to ancillary data
  • Located within Documentation tab if ReadMe file provided
  • Problems with dataset reported in Documentation section
  • Goal: anticipate user needs, not lead users with metadata

REPRODUCIBILITY
  • Must be able to track original data requests
  • Easiest to reproduce if derived product is computer based
  • Disaster program ensures data exists in 2 geographic locations

Figure 2: Actual Lifecycle for Level 4 Ameriflux Dataset
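
The appraisal and reformat decisions in steps 5 and 6 (gap fill missing years with -9999, keep ascending year order) can be illustrated with a minimal Python sketch. This is not the authors' workflow, which went through Excel and a C++ converter. The one-row-per-year layout, the `year` column name, the function name `gap_fill`, and the sample variables `NEE` and `GPP` are all assumptions made for illustration; the real files hold hourly, daily, and monthly records.

```python
import csv
import io

FILL_VALUE = -9999  # missing-data marker named in the slides


def gap_fill(csv_text, year_column="year"):
    """Return one row per year in ascending order, with gaps filled.

    Assumes (hypothetically) one CSV row per year and numeric variables.
    Years absent from the input, and empty cells, become FILL_VALUE.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = [c for c in (reader.fieldnames or []) if c != year_column]

    # Index the input rows by year so they can be emitted in sorted order.
    by_year = {int(row[year_column]): row for row in reader}
    if not by_year:
        return []

    filled = []
    for year in range(min(by_year), max(by_year) + 1):
        source = by_year.get(year, {})  # empty dict for a missing year
        out = {year_column: year}
        for name in columns:
            cell = (source.get(name) or "").strip()
            out[name] = float(cell) if cell else FILL_VALUE
        filled.append(out)
    return filled


if __name__ == "__main__":
    # Hypothetical sample: 2002 is missing entirely, GPP is blank in 2001.
    sample = "year,NEE,GPP\n2003,1.5,2.0\n2001,0.7,\n"
    for row in gap_fill(sample):
        print(row)
```

Filling with a sentinel outside the physical range of the variables keeps every year present in the output, so a downstream conversion to NetCDF can treat -9999 as the missing value rather than guessing at gaps.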
