Documentation and Workflows. Paul Lambert, 24-25 August 2009. Talk to the ‘Data Management for Social Survey Research’ training workshop, part of the Data Management through e-Social Science research Node of the National Centre for e-Social Science. www.dames.org.uk / www.ncess.ac.uk
Manipulating data • Operations performed on datasets by researchers and/or data distributors • At any stage of the research lifecycle • Of considerable consequence to analytical results • DAMES Node: • ‘Data Management’ = manipulation of data, and documenting/assisting the processes of manipulation • E-Social Science approach to facilitating data manipulation (metadata resources; data access facilities; ‘workflow models’)
‘Documentation for replication’ ..as a reasonable expectation for scientific research that is cumulative and based upon empirical observation… Steuer, M. (2003). The Scientific Study of Society. Boston: Kluwer Academic. Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158. Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-171.
What needs replication? • Your own analysis (in response to comments, revisions, requests for access) • Others’ analysis • To build upon – cumulative science • To critique / cross-examine • In secondary survey research • Complex data is often updated (new related records; revised and re-released; re-weighted or re-standardised; new levels of access/linkage) • New analysis feasible - variable operationalisations; new statistical methods
J. Scott Long (2009) • Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press. 1-5: Programming in Stata 6: Cleaning your data 7: Analysing data and presenting results 8: Protecting your work
Treiman (2009) • Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass. Good professional practice = • Suitable choice of analytical methods to test ideas • Documentation of choices and data operations
How to approach Documentation for Replication in social survey research? • Made easy by secondary access to datasets and standardised software • Careful syntactical documentation • Metadata documents • Metadata standards
Keep clear records of your DM activities! Reproducible (for self) Replicable (for all) Paper trail for whole lifecycle Cf. Dale 2006; Freese 2007 • In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata) Syntax Examples: www.longitudinal.stir.ac.uk
1) Syntax documentation Long (2009) is highly prescriptive {may not be wholly attainable} Key issues: • Organisation of syntax files Master files and subfiles (and macros) • Setting consistent paths to source data • Reasonable level of manual annotation of files • Use a text editor!!
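The master file / subfile organisation and the consistent paths mentioned above can be sketched as a short Stata do-file. This is a minimal illustration, not Long's (2009) own template: the folder locations, the macro names (path_raw, path_work, path_do) and the subfile names are all hypothetical placeholders.

```stata
* master.do: hypothetical master file that runs each stage of the work in order
version 10
clear
set more off

* Set every path once, here, so that no subfile hard-codes a location
* (all three locations are placeholders to be edited for your own machine)
global path_raw  "C:/project/data/raw"
global path_work "C:/project/data/work"
global path_do   "C:/project/syntax"

* Each subfile handles one stage of the work (file names are illustrative)
do "$path_do/01_extract.do"
do "$path_do/02_recode.do"
do "$path_do/03_analysis.do"
```

With this layout, moving the project to another machine should only require editing the three global lines, provided the subfiles refer to data through $path_raw and $path_work rather than through literal paths.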
The idea of workflows • Workflow modelling is an exciting future prospect.. • Workflow documentation • MyExperiment [http://www.myexperiment.org/] • Social survey analysis [Dale, 2006; Freese, 2007; Long, 2009] • At present… • Waiting for tool development • Depositing workflows might impose constraints/burdens
[Figure: Model 1, a workflow model in which BHPS wave A, wave B and wave C individual records feed an analytical file. Variables shown: spouse CAMSIS, spouse SOC, current job RGSC, gender, age (yrs), age bands. Outputs: graphics and a text interface, invoked manually or in response to manipulating graphs.]
..good levels of documentation are new in the social sciences... • “…Little or nothing is systematically archived from these electronic sources. How many of us routinely keep copies of our old word-processing files once they are no longer of current relevance for research or teaching activities. We have been reminded…of the insecurity and non-survival of departmental and professional files stored in broom cupboards, but how many electronic files even get into that cupboard in the first place?” Scott (2005: 142)
2) Metadata documents.. …for documentation for replication • Metadata documents can/should be stored / distributed / disseminated • Main relevant types of metadata documents: • Annotated syntax files • Handwritten workbooks • Codebooks and data file metadata
a) Annotated syntax files • Storage: • Supply authorship details, conditions of access, origins and context of data, software version • ‘Robustify’ your programme (generic locations; ‘capture drop’) • Dissemination: • Available from authors archive • Repec – http://ideas.repec.org/ (Economics) • GEODE/DAMES – www.dames.org.uk (Occupations, Education) • UKDA/ESDS and related data providers (monitored) • Personal webpages – e.g. www.camsis.stir.ac.uk/downloads/data/other/casoc_isco.do
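A hedged sketch of what a 'robustified' and annotated syntax file might look like in Stata. The header details, the path, the dataset name and the variables (age, age_band) are hypothetical and only illustrate the points on the slide: authorship and context recorded at the top, one generic location, and 'capture drop' before a variable is created.

```stata
* Author:   A. Researcher (placeholder)
* Data:     BHPS wave A individual file; access conditions apply (placeholder)
* Software: written for Stata 10
version 10

* Generic location: edit this one line to run the file elsewhere
global path_work "C:/project/data/work"
use "$path_work/analysis_file.dta", clear

* 'capture drop' removes age_band if it already exists and does nothing
* otherwise, so the file can be re-run without an 'already defined' error
capture drop age_band
recode age (16/29 = 1) (30/44 = 2) (45/59 = 3) (60/max = 4), generate(age_band)
label variable age_band "Age bands (illustrative cut points)"
```

The 'capture' prefix suppresses the error that 'drop' would otherwise raise on a first run, when age_band does not yet exist.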
b) Handwritten workbooks • Key here is that they must be published.. • Technical papers • Websites • …. • An emerging payoff - citation indexing! • Croxford, L. (2004). Construction of Social Class Variables. Edinburgh: Working Paper 4 of the ESRC research project on Education and Youth Transitions in England, Wales and Scotland, 1984-2002, Centre for Educational Sociology, University of Edinburgh, and http://www.ces.ed.ac.uk/eyt/EYT_papers/WP04.pdf.
“Because claims in published papers that additional materials are “available from author” usually prove false, at least after a few months, the California Center for Population Research at UCLA recently implemented a mechanism by which additional materials, for example, -do- and -log- files, can be attached to papers posted in its Population Working Paper archive. Other research centers are to be encouraged to do the same” (Treiman, 2009: 404)
E-Science and workflow documentation tools.. • …seek to capture the full record of the work process and all files relevant for documentation (e.g. http://www.myexperiment.org/)
c) Codebooks and data file metadata
• Codebook
    log using data_file_name_codebook.log, replace text
    disp "DateTime: $S_DATE $S_TIME"
    notes
    datasignature
    codebook, compress
    codebook
    describe
    labelbook, detail
    log close
• See UKDA: data_dictionary.rtf
3) Metadata standards • Formal standards for recording data exist • The most widely used is the ‘DDI’ (Data Documentation Initiative, http://www.icpsr.umich.edu/DDI/) • XML format, typewritten or software derived; can be read by software / browsers • Includes options for variable labels, recodes, text descriptions • See UKDA, study_information.htm • NESSTAR
Summary: Documentation and workflows • Achieving good documentation is facilitated by effective workflows • File locations / stamps / transferability • Variable metadata • Structured logs of all operations – syntax programs • …Documentation - Is it worth it..?