Data quality control, Data formats and preservation, Versioning and authenticity, Data storage
Managing research data well workshop
London, 30 June 2009; Manchester, 1 July 2009
Good data management
• good research
• high quality data
• needs to be planned
• specific for purpose
• data can be understood and used now and in future
• data can then be shared and re-used
Quality control
Data quality control at various stages:
• data collection
  • e.g. instrument calibration; expert opinion; multiple measurements; computer assisted interviews
• data entry, digitisation, transcription and coding - standardised and consistent procedures
  • e.g. set up validation rules for data entry; use input masks; detailed variable labelling; missing value coding; use controlled vocabularies or choice lists; best structure to organise data and data files
• data checking and verifying - automated and/or manual
  • e.g. double entry; check for out-of-range values; apply random sample validation; statistical analyses (descriptives, frequencies, means, range, clustering) to detect errors or find anomalous values; verify data completeness
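The checking steps named on this slide (out-of-range values, double entry, frequencies) can be sketched in a few lines of Python. This is only an illustration: the missing-value code `-9`, the age limits and the function names are assumptions made for the example, not part of the workshop material.

```python
from collections import Counter

MISSING = -9  # assumed missing-value code, chosen only for this example


def out_of_range(values, low, high, missing=MISSING):
    """Return (index, value) pairs outside [low, high], skipping the missing code."""
    return [(i, v) for i, v in enumerate(values)
            if v != missing and not (low <= v <= high)]


def double_entry_mismatches(first, second):
    """Compare two independent keyings of the same records; return differing indices."""
    if len(first) != len(second):
        raise ValueError("the two entry passes have different lengths")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


def frequencies(values):
    """Frequency table for spotting anomalous or unexpected codes."""
    return Counter(values)


# Hypothetical 'age' variable keyed twice by different people.
ages_pass1 = [34, 27, -9, 210, 45]
ages_pass2 = [34, 72, -9, 210, 45]

print(out_of_range(ages_pass1, 0, 120))                 # [(3, 210)]
print(double_entry_mismatches(ages_pass1, ages_pass2))  # [1]
print(frequencies(ages_pass1)[MISSING])                 # 1
```

Flagged records still need a manual look: the code only points to where the two passes disagree or where a value is implausible, it does not say which value is right.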
Data formats
• choice of software format for digital data:
  • planned data analyses
  • software availability
  • hardware used
  • discipline specific standards and customs
• digital data software dependent
• digital data endangered by obsolescence of software/hardware
• best formats for long-term preservation - standard formats, interchangeable formats, open formats
  • e.g. tab-delimited; comma-delimited (CSV); ASCII; OpenDocument format; SPSS portable; XML
Data format conversions
• convert data for preservation or back-up, e.g. export, save as
• beware of conversion errors:
  • loss of internal metadata
    • e.g. convert MS Access to tab-delimited tables
  • loss of editing, formatting, formulae
    • e.g. convert MS Word to RTF
  • truncation or loss of data
    • e.g. string variables lost in SPSS to STATA conversion
• check for errors and changes after conversion
Example 1: MS Excel to tab-delimited
Example 2: Word to XML
Example 3: Proprietary audio file (DVF) to WAV
[Figure: Example 1, the same data shown in MS Excel format and in tab-delimited text format]
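One way to "check for errors and changes after conversion" is to read the converted file back and compare it cell by cell with the source table. The sketch below uses Python's standard `csv` module on an in-memory tab-delimited export; the sample rows and the helper names are hypothetical, and a real check would read the actual exported file instead.

```python
import csv
import io


def write_tab_delimited(rows, handle):
    """Write rows (lists of strings) as tab-delimited text."""
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def read_tab_delimited(handle):
    """Read tab-delimited text back into a list of rows."""
    return [row for row in csv.reader(handle, delimiter="\t")]


def conversion_differences(original, converted):
    """Describe every difference between the source table and the converted one."""
    problems = []
    if len(original) != len(converted):
        problems.append(f"row count changed: {len(original)} -> {len(converted)}")
    for r, (before, after) in enumerate(zip(original, converted)):
        if before != after:
            problems.append(f"row {r} changed: {before} -> {after}")
    return problems


# Hypothetical table; values are kept as text so leading zeros survive.
original = [["id", "postcode", "comment"],
            ["001", "CO4 3SQ", 'said "yes", then left']]

buffer = io.StringIO()
write_tab_delimited(original, buffer)
buffer.seek(0)
converted = read_tab_delimited(buffer)
print(conversion_differences(original, converted))  # []

# Simulate a conversion that truncates strings to 8 characters.
truncated = [[cell[:8] for cell in row] for row in original]
print(len(conversion_differences(original, truncated)))  # 1
```

A comparison like this catches truncation and lost rows, but not loss of formatting, formulae or internal metadata, which have no counterpart in a plain-text table and need to be documented separately.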
Version control
• keep track of different copies or versions of data files
• which method to use depends on:
  • single site vs. across locations
  • single vs. multiple users
  • different versions to be stored vs. files to be synchronised
• single user of data files:
  • file naming: unique file names with date or version number (avoid spaces!)
    e.g. FoodInterview_1_draft; FoodInterview_1_final; HealthTests_06-04-2008; BGHSurveyProcedures_00_04
  • version control table or file history within or alongside data file
  • version control facility within software, e.g. MS Windows software
• multiple users of data files:
  • same as above
  • control rights to file editing: read/write permissions, e.g. Windows Explorer
  • versioning/file sharing software: check files out/in, e.g. SVN, VSS, Google Docs, Amazon S3
  • manual merging of multiple entries/edits
  • synchronise files, e.g. MS SyncToy software
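The file-naming advice above (unique names, a date or version number, no spaces) can be automated with a small helper. The pattern used here, `Base_vNN_YYYY-MM-DD.ext`, is an assumption made for this sketch and differs from the slide's own examples; the ISO date order is chosen only because it sorts chronologically in a file listing.

```python
import re
from datetime import date


def versioned_name(base, version, when, extension):
    """Build a file name with version number and date, with all spaces removed."""
    safe_base = re.sub(r"\s+", "", base)  # 'avoid spaces!'
    return f"{safe_base}_v{version:02d}_{when.isoformat()}.{extension}"


print(versioned_name("Food Interview", 3, date(2009, 6, 30), "txt"))
# FoodInterview_v03_2009-06-30.txt
```

Whatever pattern is chosen, the point is to apply it consistently so that two versions of a file can never share a name.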
Authenticity of data
• master files
• assign responsibility for master files
• record changes to master files
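Recording changes to master files can be as simple as a change log kept alongside the data. The sketch below keeps the log as a list of records and writes it out as tab-delimited text; the column names, the file name and the person named are all hypothetical.

```python
import csv
import io
from datetime import date

FIELDS = ["date", "file", "changed_by", "change"]  # assumed log columns


def record_change(log, when, file_name, changed_by, change):
    """Append one change record for a master file to the in-memory log."""
    log.append({"date": when.isoformat(), "file": file_name,
                "changed_by": changed_by, "change": change})


def log_as_text(log):
    """Serialise the log as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(log)
    return buffer.getvalue()


log = []
record_change(log, date(2009, 6, 30), "FoodInterview_1_final",
              "A. Researcher", "recoded missing values")
print(log_as_text(log))
```

Naming one person in `changed_by` for every entry also supports the second point on the slide, since it makes responsibility for each master file visible.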
Data storage
• digital storage media unreliable
• file formats and physical storage media ultimately become obsolete
• optical (CD, DVD) and magnetic media (hard drive, tapes) vulnerable and subject to physical degradation
Best practice:
• use data formats with long-term readability
• storage strategy with at least two different forms of storage
• copy/migrate data files to new media between two and five years after first created
• check data integrity of stored data files at regular intervals (checksum)
• know your back-up strategy: institutional/personal; network server/PC/laptop
• maintain original copy, external local copy and external remote copy
• test file recovery
• Data Protection Act and data back-up: may require minimal data copies for personal data; secure storage
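The checksum check mentioned under best practice can be sketched with Python's standard `hashlib` module: record a checksum for each stored file once, then recompute and compare at regular intervals. SHA-256 and the manifest structure (a dictionary from path to recorded checksum) are assumptions made for this example; the slide does not prescribe an algorithm.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest):
    """Return paths that are missing or whose checksum no longer matches the record."""
    return [path for path, recorded in manifest.items()
            if not os.path.exists(path) or file_checksum(path) != recorded]


# Demonstration on a temporary file standing in for a stored data file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "master.txt")
    with open(path, "wb") as handle:
        handle.write(b"id\tage\n001\t34\n")
    manifest = {path: file_checksum(path)}   # record at deposit time
    print(changed_files(manifest))           # []
    with open(path, "ab") as handle:         # simulate corruption or an edit
        handle.write(b"x")
    print(changed_files(manifest) == [path])  # True
```

A mismatch only shows that a copy has changed; recovering the data still depends on having the other copies listed above, which is why the slide pairs integrity checks with multiple copies and tested recovery.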
Example: data storage and preservation at UKDA
• preservation copy (UKDA)
• shadow copy (UKDA)
• dissemination copy to reduce load on main system
• near-site online copy (on campus)
• off-site online copy
• tape-based offline copy (UKDA)
Multi-copy, multi-storage media and multi-version resilience
[Diagram labels: scheduled; nightly; robotic; 3-monthly]
Good data management practice
• plan data management early
• assign roles and responsibilities
• design data management according to needs and purpose of research
• data management throughout research
Resources
• ESDS (2008). Guide to good practice: micro data handling and security. http://www.esds.ac.uk/news/publications/microDataHandlingandSecurity.pdf
• Finch, L. & Webster, J. (2008). Caring for CDs and DVDs. NPO Preservation Guidance, Preservation in Practice Series. London: National Preservation Office. http://www.bl.uk/npo/pdf/cd.pdf
• UK Data Archive (2009). Manage and Share Data. http://www.data-archive.ac.uk/sharing/ See: http://www.data-archive.ac.uk/sharing/furtherstorage.asp