Documenting Research Project Process for Reproducibility
Larry Hoyle
Institute for Policy & Social Research, University of Kansas
Dagstuhl Presentation 2012 - Larry Hoyle
The challenges
• Large (or complex) multi-disciplinary projects
• Multiple sites, data streams, standards, and practices
• Complex data preparation procedures
• Point and click software used
• Documenting as overhead
Example Project
• Farmers' land-use decisions related to climate change (e.g. biofuel-related crops)
• One component of a larger NSF grant
• Multiple teams, multiple universities
• The two main sites are 135 km apart
• Multi-disciplinary: economists, geographers, agronomists, biologists, engineers, climate scientists, anthropologist, sociologist, political scientists, urban planner, GIS experts, photographer
Example Project Data
• Develop substantial geodatabase (ArcSDE)
• Ground cover, soils, crop statistics, facility locations (e.g. purchaser, processing plant), weather, climate, watershed and aquifer models
• Geographic level below the farmer's field (sub-field)
• Climate models at different scales
• Focus groups and multi-wave survey (geocoded)
• Interviews coded in NVivo (geocoded)
• Photographs
• Large proprietary dataset with time-limited use
Challenge: put it all together and document how it was done and how everything relates.
Other example: IASSIST posting
Spatial Aspects
• Reconciling different spatial schemes at multiple scales across time
• Raster images
• Model grids at different scales
• Weather point sources, other point locations (e.g. biorefineries)
• Political entity polygons (state, county)
• Farm field and sub-field polygons
• Attribute data at all these levels, imputed and aggregated data
• Harmonizing data from different geographic schemes
• Producing new spatial objects
• E.g. corners as separate from the circle with center-pivot irrigation
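To make "harmonizing data from different geographic schemes" concrete, here is a minimal sketch of simple areal weighting, one common way to move a count from one set of zones (e.g. model grid cells) to another (e.g. counties). The slides do not say which method the project used; the zone names, values, and overlap areas below are hypothetical, and the sketch assumes the listed overlaps cover each source zone completely.

```python
from collections import defaultdict


def areal_weighting(source_values, overlaps):
    """Reallocate totals from source zones to target zones by overlap area.

    source_values: {source_id: total for that source zone}
    overlaps: {(source_id, target_id): area of the intersection}
    Assumption: the overlaps listed for a source zone cover it completely,
    and the quantity is spread evenly within each source zone.
    """
    # Total area of each source zone, summed from its overlap pieces.
    source_area = defaultdict(float)
    for (src, _tgt), area in overlaps.items():
        source_area[src] += area

    # Each target zone receives the share of the source total that
    # corresponds to its share of the source zone's area.
    target_values = defaultdict(float)
    for (src, tgt), area in overlaps.items():
        target_values[tgt] += source_values[src] * area / source_area[src]
    return dict(target_values)


# Hypothetical example: two grid cells reallocated to two counties.
grid_totals = {"grid_1": 100.0, "grid_2": 60.0}
overlap_areas = {
    ("grid_1", "county_A"): 3.0,
    ("grid_1", "county_B"): 1.0,
    ("grid_2", "county_B"): 2.0,
}
county_totals = areal_weighting(grid_totals, overlap_areas)
print(county_totals)
```

The overall total is preserved by construction, which is a useful check to record alongside the harmonized data.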
New Polygons
Polygons to be extracted from remote sensing imagery
Subfield areas sometimes grow different crops (corners are 21% of the square)
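The 21% figure follows from geometry: a circle inscribed in a square covers π/4 of the square, so the four corners together make up 1 − π/4. A quick check, assuming the irrigated circle exactly touches the field edges (an idealization not stated on the slide):

```python
import math


def corner_fraction() -> float:
    """Share of a square field that lies outside its inscribed circle."""
    # A square of side s has area s^2; the inscribed circle has radius s/2,
    # so circle area / square area = pi * (s/2)^2 / s^2 = pi / 4.
    return 1.0 - math.pi / 4.0


print(f"{corner_fraction():.1%}")  # prints 21.5%
```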
Need to Capture Process
Example 1
• Project member with expertise volunteered to process data to produce a spatial dataset (soils data).
• Users of the dataset discover anomalies
• Expert no longer available, can’t remember quite what he did and has no documentation (used point and click tools)
• Ouch
Process Example 2
• Qualitative analysis
• Transcription
• Multiple coders, common coding scheme
• Coding scheme evolves (capture this?)
• Training
• Paired coders code each interview
• Testing of coder reliability
• Integrate this after the fact with geodatabase
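The slide does not say which reliability statistic was used for the paired coders. As an illustration only, here is a minimal sketch of Cohen's kappa, a common agreement measure for two coders that corrects for chance agreement; the code labels below are hypothetical.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labeled the same segments."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must label the same non-empty set of segments")
    n = len(codes_a)

    # Observed agreement: share of segments given the same code by both coders.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n

    # Chance agreement: probability both coders pick the same code
    # if each coded independently at their own marginal rates.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes assigned to six interview segments by a coder pair.
coder_a = ["water", "water", "price", "price", "policy", "water"]
coder_b = ["water", "price", "price", "price", "policy", "water"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Recording the statistic together with the version of the coding scheme it was computed against would help when the scheme evolves.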
Point and Click
• Some tools are only point and click and don’t create a log.
• E.g. some procedures in ArcGIS
• How do you document process?
• Screen capture pasted into Word?
• Action recording software
• Discoverable? Machine actionable?
An ArcGIS process (different project)
NSFCHEMAnnualDataProcedure.docx
AnnualLinksByTime4.avi
Need Tools
• There is a need for tools built on top of standards that make it easy to capture and annotate process
Need Tools to Capture Process
One example: SAS Enterprise Guide
• Can modify nodes during development.
• Can run the process from any point
• But the overall process may involve multiple tools (in this case also R and ArcGIS). In other cases, multiple people in different settings.
• Scott Long, The Workflow of Data Analysis Using Stata
• http://www.indiana.edu/~jslsoc/web_workflow/wf_home.htm
Datasets: permanent and temporary
Capturing Process as it is Being Developed
• False starts and blind alleys
• Does the whole process matter or only a process that reproduces the final result? (learn from my mistakes?)
• Description of process gets edited as it evolves
• Adding minimal overhead
• If the tool requires a lot of attention it won’t get used.
• Combining sub-processes
• Filling in pieces of overall planned project
• Parallel parts
• Time as ordinal or interval (or ratio?)
Tools: The Fantasy
• Annotated screen capture that works on top of any software
• Text (or audio/video?) annotation
• Dealing with IP in captured images
• Flow diagram with popups?
• Editable
• Time stamped
Sub process edited separately
Persistent identifiers allow (re-)linking
Planned overall process
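A minimal sketch of what the core of such a tool might record: time-stamped, editable, annotated steps that link to one another through persistent identifiers. This is an illustration under stated assumptions, not an existing tool; the `ProcessStep` and `ProcessLog` names, the use of UUID-based identifiers, and the example steps are all hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProcessStep:
    tool: str                                   # software used for the step
    note: str                                   # free-text annotation, editable later
    inputs: list = field(default_factory=list)  # identifiers of earlier steps
    # Assumption: a random UUID serves as the persistent identifier.
    step_id: str = field(default_factory=lambda: f"urn:uuid:{uuid.uuid4()}")
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class ProcessLog:
    def __init__(self):
        self.steps = {}

    def record(self, tool, note, inputs=()):
        """Add a time-stamped step and return its persistent identifier."""
        missing = [i for i in inputs if i not in self.steps]
        if missing:
            raise KeyError(f"Unknown input steps: {missing}")
        step = ProcessStep(tool=tool, note=note, inputs=list(inputs))
        self.steps[step.step_id] = step
        return step.step_id

    def annotate(self, step_id, note):
        """Edit the description of a step as the process evolves."""
        self.steps[step_id].note = note

    def lineage(self, step_id):
        """Return the step and everything it depends on, dependencies first."""
        ordered, seen = [], set()

        def visit(sid):
            if sid in seen:
                return
            seen.add(sid)
            for parent in self.steps[sid].inputs:
                visit(parent)
            ordered.append(sid)

        visit(step_id)
        return ordered

    def to_json(self):
        return json.dumps([asdict(s) for s in self.steps.values()], indent=2)


# Hypothetical sub-processes recorded separately, then linked by identifier.
log = ProcessLog()
soils = log.record("ArcGIS", "Clipped soils layer to study area (point and click)")
survey = log.record("R", "Geocoded survey responses")
merged = log.record("SAS Enterprise Guide", "Joined soils to survey points",
                    inputs=[soils, survey])
print(log.to_json())
```

Because steps refer to each other only by identifier, a sub-process can be edited or re-recorded separately and re-linked afterwards, and the JSON output is machine actionable.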
Final thoughts
• Metadata for the audience
• Documentation for reproducibility
• Documentation in cases of disputed results
• Sometimes the researcher is the audience
• One researcher commented that having documentation at this level would be very helpful in writing methods sections of papers.
• Teaching tool: critique students' process
• Assists refining methods
• Also useful in future similar projects