Best Practices for Preparing Data Sets
Adapted from: Best Practices for Preparing Environmental Data Sets to Share and Archive, by L.A. Hook, T.W. Beaty, S. Santhana Vannan, L. Baskaran, and R.B. Cook. June 2007. http://daac.ornl.gov/PI/bestprac.html
Non-CO2 Synthesis Workshop, Boulder, Colorado, 22-23 October 2008
Compiled by: A. Dayalu, Harvard University
Seven Best Practices: Summary
• Assign Descriptive File Names
• Use Consistent and Stable File Formats
• Define the Contents of Your Data Files
• Use Consistent Data Organization
• Perform Basic Quality Assurance
• Assign Descriptive Data Set Titles
• Provide Documentation
I. Assign Descriptive File Names
File names should reflect the contents of the file and include enough information to uniquely identify the data file.
• File names can contain identifiers such as project acronym, study title, location, investigator, year(s), version, and file type/extension.
• File names should contain only numbers, letters, dashes, and underscores, with no spaces or special characters.
• When compressing or bundling files, acceptable formats are *.zip, *.gz, and *.tar.
Examples:
• Good: cobra_2003_flasks.csv (COBRA 2003 aircraft mission flask data)
• Poor: data 2003.dat (contains a space and does not identify the project or the contents)
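The character rule above can be checked mechanically before files are shared. The following is a minimal Python sketch, not part of the original guidance; the function name is_valid_file_name and the exact pattern (a base name followed by one or more dot-separated extensions) are assumptions made for illustration.

```python
import re

# Assumed pattern: a base name made of letters, digits, dashes, and
# underscores, followed by one or more dot-separated extensions
# (e.g. ".csv" or ".tar.gz"). Spaces and other characters are rejected.
VALID_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)+")


def is_valid_file_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return VALID_NAME.fullmatch(name) is not None


print(is_valid_file_name("cobra_2003_flasks.csv"))  # True
print(is_valid_file_name("data 2003.dat"))          # False (contains a space)
```

A check like this only covers the allowed characters; whether a name is descriptive enough still needs human review.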
II. Use Consistent and Stable File Formats for Tabular Data
In choosing a file format, data collectors should select a consistent format that can be read well into the future and is independent of changes in applications.
• Use ASCII file formats delimited by commas, tabs, or semicolons (in order of preference).
• Use the same format throughout the file.
• Use a consistent format across all data files for a study.
• Report figures and analyses in companion documents; do not place figures or summary statistics in the data file.
• Include header rows at the top of the data file:
  • First row: descriptors linking the file to the data set (file name, data set title, author, today's date, date of last data modification, companion file names)
  • Remaining rows: descriptions of the content of each column, including one row for parameter names and units
• Column headings should contain only numbers, letters, and underscores, with no spaces or special characters.
• In the data set documentation, include:
  • Descriptions of data file names (expand acronyms, site abbreviations, etc.)
  • Expanded parameter descriptions
  • Missing value codes
  • Example data file records
  • Other data file documentation useful to a secondary user (see Section VII)
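The header layout described above can be sketched with Python's standard csv module. This is an illustrative sketch only: the function write_data_file, the file name, the data set title, and the sample record are all hypothetical, and placing parameter names and units on two separate rows is one possible reading of the guidance.

```python
import csv
import io
import re


def write_data_file(out, descriptors, names, units, records):
    """Write a descriptor row, a names row, a units row, then the data rows."""
    for name in names:
        # Column headings: letters, digits, and underscores only.
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            raise ValueError(f"invalid column heading: {name!r}")
    writer = csv.writer(out, lineterminator="\n")  # comma-delimited text
    writer.writerow(descriptors)
    writer.writerow(names)
    writer.writerow(units)
    writer.writerows(records)


buf = io.StringIO()  # stands in for an open file
write_data_file(
    buf,
    descriptors=["example_2003_flasks.csv", "Example flask data set"],  # hypothetical
    names=["sample_id", "date", "ch4_ppb"],
    units=["none", "YYYYMMDD", "ppb"],
    records=[["F001", "20030601", 1789.2]],  # made-up record
)
print(buf.getvalue())
```

Rejecting bad column headings at write time keeps the problem from reaching the archived file.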
III. Define the Contents of Your Data Files
For others to use your data, they must fully understand the contents of the data set, including the parameter names, units of measure, formats, and definitions of coded values.
• Parameter names should describe the contents; accompanying documentation should completely describe each parameter. Use consistent capitalization and only letters, numerals, and underscores.
• Units need to be explicitly stated in the data file and in the documentation.
• Formats for each parameter should be consistent across data sets, particularly for dates, times, and spatial coordinates.
  • Dates: use YYYYMMDD format.
  • Time: report in UTC, using 24-hour notation.
  • Spatial coordinates: record in decimal degrees to at least 4 decimal places. Be consistent with, and document, the coordinate type, datum, and spheroid. Mixing coordinate systems (e.g., NAD83 and NAD27) will cause errors in subsequent geographical analysis.
  • Elevation: provide elevation in meters, with information on the vertical datum used (e.g., NAVD 1988).
• Coded fields such as data quality flags or data qualifiers should be consistent across parameters and files. Explain them in detail in the accompanying documentation.
• Missing values should be specified with the code NA or with an extreme value that cannot be confused with a measured value (e.g., -9999.99). Apart from NA, do not use character codes in an otherwise numeric field.
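The date, time, coordinate, and missing-value conventions above can be applied in one place when records are written. The sketch below is an assumption-laden illustration: the function format_record, the HH:MM time layout, the two-decimal value format, and the sample timestamp and coordinates are not from the original guidance.

```python
from datetime import datetime, timedelta, timezone

MISSING = -9999.99  # missing-value code suggested in the text


def format_record(when, lat, lon, value):
    """Format one observation; `when` must be a timezone-aware datetime."""
    utc = when.astimezone(timezone.utc)  # report times in UTC
    return [
        utc.strftime("%Y%m%d"),          # date as YYYYMMDD
        utc.strftime("%H:%M"),           # 24-hour notation (assumed HH:MM)
        f"{lat:.4f}",                    # decimal degrees, 4 decimal places
        f"{lon:.4f}",
        f"{MISSING:.2f}" if value is None else f"{value:.2f}",
    ]


# Made-up observation taken at 18:30 local time in a UTC-6 zone.
local = datetime(2008, 10, 22, 18, 30, tzinfo=timezone(timedelta(hours=-6)))
print(format_record(local, 40.015, -105.2705, None))
# ['20081023', '00:30', '40.0150', '-105.2705', '-9999.99']
```

Note that converting to UTC can change the calendar date, as in this example, which is one reason to document the time convention explicitly.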
IV. Use Consistent Data Organization
Each observation should be placed on a separate line (row). Most often, each row in a file represents a complete record and the columns represent all the parameters that make up the record. This leads to an arrangement similar to a spreadsheet or matrix.
• Keep similar information together. Do not break up your data set into many small files (e.g., by month); instead, make the month a parameter and keep all the data in one large file. This reduces the number of files researchers must process.
• Size limitations. Some applications have size restrictions. For example, an Excel 2003 worksheet is limited to 65,536 rows and 256 columns, and an Excel 2007 worksheet to 1,048,576 rows and 16,384 columns. Large files may have to be broken down into logical smaller files to accommodate such limits.
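Merging per-month files into one table with the month as a parameter might look like the following Python sketch. The function combine_monthly and the sample columns (site, co_ppb) are hypothetical, and the sketch assumes every monthly file has the same single header row.

```python
import csv
import io


def combine_monthly(files):
    """Merge monthly CSV files into one table with a leading 'month' column.

    `files` maps a month number to an open text file; all files are
    assumed to share an identical single header row.
    """
    combined = []
    header = None
    for month, handle in sorted(files.items()):
        reader = csv.reader(handle)
        file_header = next(reader)
        if header is None:
            header = ["month"] + file_header
            combined.append(header)
        elif ["month"] + file_header != header:
            raise ValueError(f"header mismatch in month {month}")
        for row in reader:
            combined.append([month] + row)  # month becomes a parameter
    return combined


# Made-up monthly files, held in memory for the example.
rows = combine_monthly({
    1: io.StringIO("site,co_ppb\nA,120.5\n"),
    2: io.StringIO("site,co_ppb\nA,118.0\n"),
})
for row in rows:
    print(row)
```

The header comparison guards against silently merging files whose columns differ.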
V. Perform Basic Quality Assurance (QA)
In addition to scientific QA, we suggest that you perform basic data QA on the data files.
• Check the file format. Make sure data are delimited correctly and line up in the proper columns.
• Check file organization and descriptors to ensure there are no missing values for key parameters (e.g., location, time, sample ID).
• Review documentation to ensure accurate content and parameter descriptions.
• Check the content of measured or derived values to detect impossible or anomalous values (e.g., negative mixing ratios). Generate basic plots to aid in QA.
• Perform statistical summaries and review the results.
• Map locations to see if there are coordinate errors.
• Verify data transfers from notebooks, instruments, etc. For data transfers done by hand, consider double data entry and compare the two data sets. Where possible, compare summary statistics before and after data transformation.
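Several of the checks above (missing key parameters, impossible negative values, a statistical summary) lend themselves to a small script. The Python sketch below is illustrative only: the function basic_qa, the choice of key parameters, the missing-value code, and the sample records are assumptions, and it does not cover plotting or mapping.

```python
import statistics

KEY_PARAMETERS = ("sample_id", "date", "latitude", "longitude")  # assumed keys
MISSING = -9999.99  # assumed missing-value code


def basic_qa(records, value_field):
    """Return (problems, summary) for a list of record dictionaries."""
    problems, values = [], []
    for i, rec in enumerate(records, start=1):
        for key in KEY_PARAMETERS:
            if rec.get(key) in (None, "", MISSING):
                problems.append(f"record {i}: missing {key}")
        value = rec.get(value_field)
        if value is None or value == MISSING:
            continue  # missing measurements are excluded from the summary
        if value < 0:
            problems.append(f"record {i}: negative {value_field} ({value})")
        values.append(value)
    summary = {}
    if values:
        summary = {"n": len(values), "min": min(values), "max": max(values),
                   "mean": statistics.mean(values)}
    return problems, summary


# Made-up records: the second has no sample ID and a negative mixing ratio.
records = [
    {"sample_id": "F001", "date": "20030601", "latitude": 40.0150,
     "longitude": -105.2705, "ch4_ppb": 1789.2},
    {"sample_id": "", "date": "20030602", "latitude": 40.0150,
     "longitude": -105.2705, "ch4_ppb": -5.0},
]
problems, summary = basic_qa(records, "ch4_ppb")
print(problems)
print(summary)
```

Flagged values are reported rather than removed, so the decision about what to do with them stays with the investigator.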
VI. Assign Descriptive Data Set Titles
We recommend that data set titles be as descriptive as possible. When giving titles to your data sets and associated documentation, be aware that these data sets may be accessed many years in the future by people unaware of the project details.
• Data set titles should include the following:
  • Type of data
  • Date range
  • Location
  • Instruments
  • Parent project
• Limit title length to 80 characters, including spaces. Names should contain only numbers, letters, dashes, underscores, and spaces. The data set title should be similar to the name(s) of the data file(s). A data set might contain one to thousands of data files.
Examples:
• Good: SAFARI 2000 Upper Air Meteorological Profiles, Skukuza, Dry Seasons 1999-2000
• Poor: The Aerostar 100 Data Set
VII. Provide Data Set Documentation
The documentation accompanying your data set should be written for a user 20 years into the future, so consider what that investigator needs to know to use your data. Write the documentation for a user who is unfamiliar with your project, sites, methods, or observations. Documentation can never be too complete!
The following information should be considered essential for data documentation:
• Name of the data set and names of the files in the set
• Why the data were collected and what data were collected
• Instruments used, including model and serial number
• Who collected the data and whom to contact regarding the data
• How to cite the data
• Where and with what spatial resolution the data were collected
• Definitions of any codes used in the documentation
• Frequency of data collection
• How each parameter was measured (methods), and its units
• Environmental conditions at the time of sampling (temperature, cloud cover, etc.)
• Data processing and screening methods
• Standards or calibrations used
• Details on the QA/QC that was applied
• Known issues that limit the data's use
• Software (including version number) used to prepare and read the file
• Date of last modification
• Pertinent notes
• Summary statistics generated from the final file, to verify future file transformations and transfers
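The last item, summary statistics recorded so that later copies can be verified, can be sketched in Python as follows. The function column_summary, the rounding of the mean to six decimal places, and the sample data are assumptions for illustration; a real data set would need a summary per parameter and handling of missing-value codes.

```python
import csv
import io


def column_summary(text, column, delimiter=","):
    """Return (count, minimum, maximum, mean) for one numeric column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    values = [float(row[column]) for row in reader]
    mean = round(sum(values) / len(values), 6)  # rounded for stable comparison
    return (len(values), min(values), max(values), mean)


# Made-up data: the same records before and after conversion to tab-delimited.
original = "sample_id,ch4_ppb\nF001,1789.2\nF002,1801.5\n"
transferred = "sample_id\tch4_ppb\nF001\t1789.2\nF002\t1801.5\n"

documented = column_summary(original, "ch4_ppb")
rechecked = column_summary(transferred, "ch4_ppb", delimiter="\t")
print(documented == rechecked)  # True when the transfer preserved the values
```

Recording the documented summary alongside the data lets a future user repeat the comparison without access to the original file.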