Data Processing: A simple model and current UKDA practice
Alasdair Crockett, Data Standards Manager, UKDA
An idealised throughflow of routine work into and out of a ‘data processing’ section

Acquisitions Stage/Section
1. Orderly ingest and review of datasets

Data Processing Stage/Section
2. Orderly ‘release’ of datasets to the data processing section
3. Clear data processing policies and effective tools for processing data and creating metadata (esp. the DDI)
4. Recording when processing tasks are completed and how long they took
5. Generation of ‘processing metadata’
6. Post-processing and reprocessing

Preservation and User Support Stages/Sections
1. Orderly ingest and review of datasets
This stage lies within the ‘acquisitions’ section rather than the ‘data processing’ section, though smaller archives may not make such a distinction.
• Review incoming materials prior to processing: the UKDA’s ‘Acquisition Review Committee’ is useful for a high-volume archive.
• Find out about materials as early as possible; don’t wait for them to arrive at the Archive: hence the UKDA’s data submission form.
• Get the data creator to create the catalogue record and provide metadata: the UKDA is about to start using an electronic ‘deposition programme’ rather than the traditional ‘deposit form’ (a sketch of such a submission record follows).
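As a minimal sketch of what an electronic submission record might look like and how acquisitions staff could check it for completeness before review: the field names and the completeness rule here are hypothetical, not the UKDA’s actual deposition programme.

from dataclasses import dataclass, field

@dataclass
class Submission:
    """Hypothetical electronic data submission record."""
    depositor: str
    study_title: str
    promised_files: list = field(default_factory=list)
    data_format: str = ""
    confidentiality_statement: str = ""

    def missing_fields(self):
        """Return the names of any empty fields, so acquisitions
        staff can chase the depositor before review."""
        return [name for name, value in vars(self).items() if not value]

sub = Submission(depositor="Dr A. Researcher",
                 study_title="Household Survey 2003",
                 promised_files=["survey.sav", "userguide.doc"])
print(sub.missing_fields())  # ['data_format', 'confidentiality_statement']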
2. Orderly ‘release’ of datasets to the data processing section
• Study reviewed by a senior staff member before release (particularly useful if the study will be processed by a junior or temporary staff member).
• Study assigned to a primary individual.
• Study assigned a level/standard of processing, if the Archive has differential standards: the UKDA has 4 levels, ranging from A* (most rigorous/value added) through A and B to C (least rigorous/value added).
• Clear prioritization of tasks when many datasets are released to processing at the same time: the UKDA has ‘service level definition’ performance targets, which include processing times, imposed by its funders (a sketch follows).
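A sketch of how differential levels and service-level targets could drive prioritization: each level maps to a target turnaround, and queued studies are processed in order of how close they are to breaching it. The day counts are invented for illustration; real figures would come from the service level definition.

# Hypothetical target turnaround (in days) per processing level.
TARGET_DAYS = {"A*": 40, "A": 30, "B": 20, "C": 10}

def days_remaining(study):
    """Days left before the study breaches its target turnaround."""
    return TARGET_DAYS[study["level"]] - study["days_in_queue"]

queue = [
    {"id": 5001, "level": "C", "days_in_queue": 8},
    {"id": 5002, "level": "A*", "days_in_queue": 35},
]
# Most urgent first: smallest (or most negative) time remaining.
for study in sorted(queue, key=days_remaining):
    print(study["id"], days_remaining(study))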
3. Clear data processing policies and effective data processing tools
At the UKDA, data processing falls into three main sets of activities, all of which need clear documentation (so that staff know what activities are expected) and effective tools to automate procedures as far as possible:
i) Processing of data: validation, checking values and labels, confidentiality, congruence of data and documentation, data format translation (for preservation and dissemination). The UKDA uses Sax Basic scripts to manipulate SPSS objects.
ii) Creation of metadata: catalogue record and/or DDI (at study level and, in many instances, variable level). The UKDA has a new user-friendly catalogue input program (see the sketch below).
iii) ‘Publication’ of data and metadata in a specialist online browsing environment: improvements to Nesstar Publisher.
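To make activity (ii) concrete, a sketch of generating a minimal study-level DDI record with Python’s standard library. Only a small fragment of the DDI 2.x Codebook schema is shown (citation plus variable labels), and the study details are placeholders, not UKDA output.

import xml.etree.ElementTree as ET

def ddi_skeleton(study_id, title, variables):
    """Build a minimal DDI 2.x codebook: study-level citation plus
    one <var> element (with label) per variable."""
    codebook = ET.Element("codeBook")
    citation = ET.SubElement(ET.SubElement(codebook, "stdyDscr"), "citation")
    titl_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(titl_stmt, "titl").text = title
    ET.SubElement(titl_stmt, "IDNo").text = study_id
    data_dscr = ET.SubElement(codebook, "dataDscr")
    for name, label in variables.items():
        var = ET.SubElement(data_dscr, "var", name=name)
        ET.SubElement(var, "labl").text = label
    return ET.tostring(codebook, encoding="unicode")

print(ddi_skeleton("5001", "Household Survey 2003",
                   {"sex": "Sex of respondent", "age": "Age last birthday"}))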
4. Recording when processing tasks are completed and how long they took
• Now essential for the UKDA, since we provide quarterly reporting of performance indicators (including processing times) to our funders (ESRC and JISC).
• Advisable even if funders don’t require the information: archive management is often distanced from the coalface of data processing and may have little idea which processing activities take the most time.
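A minimal sketch of such recording, using a context manager that appends study, task, completion time and duration to a CSV log; the log file name and task label are illustrative.

import csv, time
from contextlib import contextmanager
from datetime import datetime

@contextmanager
def logged_task(study_id, task, logfile="processing_log.csv"):
    """Time a processing task and append who/what/when/how long (in
    seconds) to a log that management can report from."""
    start = time.time()
    yield
    with open(logfile, "a", newline="") as f:
        csv.writer(f).writerow(
            [study_id, task, datetime.now().isoformat(timespec="seconds"),
             round(time.time() - start, 1)])

with logged_task(5001, "data validation"):
    pass  # ... run the validation checks here ...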
5. Generation of ‘processing metadata’
i) An internal record of what was done to the materials provided to create the dataset ready for dissemination: this can prove very useful if problems arise with the study years afterwards.
ii) An external record of the same, to tell users what the Archive has done in getting the dataset ready for their use (useful, as they tend to assume the Archive simply copied it from a CD onto a server!).
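One way to keep a single record and serve both purposes: log every action once, flag the ones that are internal-only, and derive the user-facing note by filtering. The flag name and example actions are illustrative.

# Each processing action is logged once; actions flagged internal-only
# are filtered out when producing the note shipped to users.
actions = [
    {"note": "Recoded open-ended occupation strings", "internal": False},
    {"note": "Chased depositor for missing wave 2 file", "internal": True},
    {"note": "Anonymised postcode to district level", "internal": False},
]

def external_note(actions):
    """Render the user-facing account of what the Archive did."""
    return "\n".join("- " + a["note"] for a in actions if not a["internal"])

print(external_note(actions))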
6. Post-processing and reprocessing
Release of the study to the outside world may not be the end of processing:
i) All ‘legacy’ issues (i.e. old datasets that need reprocessing) are now logged centrally. Staff time doesn’t always permit immediate action, so we need to make sure we don’t forget what needs doing and can tackle it piecemeal when we do have time.
ii) All dissemination copies of data are now linked dynamically to the preservation server: download file bundles on the UKDA’s download service server are recreated if any constituent files are changed on the preservation server (a sketch follows). We still have a problem with Nesstar, though: data have to be manually republished if new data files are supplied to the Archive.
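A sketch of that dynamic link: compare checksums of the constituent files against those recorded when the download bundle was last built, and rebuild the zip only if something changed. The paths and the JSON manifest format are hypothetical, not the UKDA’s actual mechanism.

import hashlib, json, zipfile
from pathlib import Path

def sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def rebuild_if_changed(preservation_files, bundle, manifest):
    """Recreate the download bundle when any constituent file on the
    preservation server no longer matches its recorded checksum."""
    current = {str(p): sha256(p) for p in preservation_files}
    recorded = (json.loads(Path(manifest).read_text())
                if Path(manifest).exists() else {})
    if current != recorded:
        with zipfile.ZipFile(bundle, "w") as z:
            for p in preservation_files:
                z.write(p, arcname=Path(p).name)
        Path(manifest).write_text(json.dumps(current))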
How does one achieve these 6 data processing steps efficiently?
• A ‘tracking and logging’ database to log major processing events and to store and generate processing metadata (helps achieve points 1, 2, 4, 5 and 6).
And, predominantly to achieve point 3:
• Clear documentation of data processing procedures and standards.
• Effective and easy-to-use data processing tools to automate any tasks which can be automated.
Facilitating stages 1, 2, 4, 5 and 6: a ‘tracking and logging’ database
The UKDA’s new Acquisitions and Processing database will:
• Alert acquisitions staff when follow-up action is required (e.g. promised data has not been sent in within 3 months).
• Log all major processing activities: when completed, by whom and how long they took.
• Record and generate ‘processing metadata’.
• Generate performance indicator figures for funders (this currently takes about 2 hours; it should reduce to 2 minutes).
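A minimal sqlite sketch of such a database: one table of acquisitions, one of processing events, a follow-up query for promised data more than 3 months overdue, and an aggregate that could feed performance indicator reports. The schema and thresholds are illustrative, not the UKDA’s actual design.

import sqlite3

db = sqlite3.connect("tracking.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS acquisitions (
    study_id INTEGER PRIMARY KEY,
    promised_on TEXT,         -- date the data were promised
    received_on TEXT          -- NULL until the data arrive
);
CREATE TABLE IF NOT EXISTS events (
    study_id INTEGER,
    task TEXT,                -- e.g. 'validation', 'DDI creation'
    staff TEXT,
    completed_on TEXT,
    minutes_taken REAL
);
""")

# Alert: promised data not sent in within 3 months.
overdue = db.execute("""
    SELECT study_id FROM acquisitions
    WHERE received_on IS NULL
      AND promised_on < date('now', '-3 months')
""").fetchall()

# Performance indicator: average time per task over the last quarter.
indicators = db.execute("""
    SELECT task, AVG(minutes_taken) FROM events
    WHERE completed_on >= date('now', '-3 months')
    GROUP BY task
""").fetchall()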
Facilitating stage 3: clear documentation and data processing tools
i) Documentation of procedures and standards: the UKDA produces what it terms Process Guides to define procedures and standards. These are available via the SIDOS website: http://www.sidos.ch/DP/html/20.htm
ii) Effective tools: the UKDA uses Sax Basic to manipulate the SPSS command processor and SPSS ‘objects’. Sax Basic gives a Visual Basic for Applications-style programming interface to SPSS, which is very useful if, like the UKDA, you use SPSS as your core processing package. For more details see: http://pages.infinit.net/rlevesqu/SampleScripts.htm
Example of a Sax Basic processing script
i) Makes a tab-delimited and a Stata version of each SPSS file.
ii) Makes a UKDA data dictionary in rich text (RTF) format, marking up differences between the Stata and SPSS files.
iii) Generates a report in RTF format that documents all unavoidable loss of data/labels upon conversion from SPSS to Stata.
iv) Imposes the UKDA directory structure on all files.
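The UKDA’s actual script is written in Sax Basic against the SPSS command processor; purely as an illustrative equivalent, here is a Python sketch using the third-party pyreadstat library that covers step (i) and part of (iii): write tab-delimited and Stata versions of an SPSS file and report variable labels that Stata would truncate. The 80-character label limit and the file names are assumptions for the example.

import pyreadstat  # third-party: pip install pyreadstat

def convert_spss(sav_path):
    """Write tab-delimited and Stata versions of an SPSS file and
    report any variable labels that Stata would truncate."""
    df, meta = pyreadstat.read_sav(sav_path)
    df.to_csv(sav_path.replace(".sav", ".tab"), sep="\t", index=False)
    pyreadstat.write_dta(df, sav_path.replace(".sav", ".dta"))
    # Stata variable labels are limited to 80 characters.
    for name, label in zip(meta.column_names, meta.column_labels):
        if label and len(label) > 80:
            print(f"label for {name} will be truncated in Stata: {label[:40]}...")

convert_spss("survey.sav")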