Replicating Results: Procedures and Pitfalls
June 1, 2005
The JMCB Data Storage and Evaluation Project
• Project summary
  • Part 1: in July 1982, the JMCB started requesting programs/data from authors
  • Part 2: attempted replication of published results based on the submissions
• Review of results from Part 2 in “Replication in Empirical Economics: The Journal of Money, Credit and Banking Project,” The American Economic Review, Sept. 1986, by Dewald, Thursby, and Anderson
The JMCB Data Storage and Evaluation Project / Dewald et al.
• The paper focuses on Part 2:
  • How people responded to the request
  • The quality of the data that was submitted
  • The actual success (or lack thereof) of replication efforts
The JMCB Data Storage and Evaluation Project / Dewald et al.
• Three groups:
  • Group 1: papers submitted and published prior to 1982. These authors did not know upon submission that they would subsequently be asked for programs/data.
  • Group 2: authors whose papers were accepted for publication beginning July 1982
  • Group 3: authors whose papers were under review beginning July 1982
[Table: Summary of Responses/Datasets Submitted, Dewald et al., p. 591]
“Our findings suggest that inadvertent errors in published empirical articles are a commonplace rather than a rare occurrence.” – Dewald et al., pp. 587-588

“We found that the very process of authors compiling their programs and data for submission reveals to them ambiguities, errors, and oversights which otherwise would be undetected.” – Dewald et al., p. 589
Raw data to finished product
Raw data → Analysis data → Runs/results → Finished product
Raw Data -> Analysis Data
• Always keep two distinct data files: the raw data and the analysis data
• A program should completely re-create the analysis data from the raw data (see the sketch below)
• NO interactive changes!! Final changes must go in a program!!
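A minimal sketch of such a cleaning do-file in Stata; the file names, variable names, and corrections are hypothetical, shown only to illustrate the pattern:

  * cleandata.do -- re-creates the analysis data from the raw data on every run
  * (file and variable names are illustrative)
  clear
  use raw/survey_raw                   // the raw file itself is never edited
  replace income = . if income < 0     // documented fix: negative incomes are coding errors
  duplicates drop                      // remove exact duplicate records
  save data/analysis, replace          // analysis data is fully reproducible from raw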
Raw Data -> Analysis Data
• Document all of the following:
  • Outliers?
  • Errors?
  • Missing data?
  • Changes to the data?
• Remember to check (see the sketch below):
  • Consistency across variables
  • Duplicates
  • Individual records, not just summary stats
  • “Smell tests”
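A few of these checks in Stata, with hypothetical variable names; each assert halts the program if a check fails:

  * check_analysis.do -- sanity checks on the analysis data (illustrative names)
  use data/analysis, clear
  duplicates report id                              // any duplicated records?
  assert age >= 0 & age < 120                       // smell test: plausible ages
  assert inlist(married, 0, 1) | missing(married)   // consistency of coding
  list id age income if income > 1e6 & !missing(income)   // inspect outlier records
  summarize                                         // summary stats last, not instead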
Analysis Data -> Results
• All results should be produced by a program
• Programs should use the analysis data (not the raw data)
• Keep a “translation” of raw variable names -> analysis variable names -> publication variable names (sketched below)
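One way to record that translation in code rather than in memory; the names here are hypothetical:

  * Name translation lives in the program (illustrative names)
  rename v107 educ_years                                            // raw -> analysis
  label variable educ_years "Years of schooling (EDUC in Table 2)"  // analysis -> publication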
Analysis Data -> Results
• Document (see the sketch below):
  • How were variances estimated? Why?
  • What algorithms were used, and why? Were the results robust?
  • What starting values were used? Was convergence sensitive to them?
  • Did you perform diagnostics? Include them in your programs/documentation.
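A sketch of recording such choices where they are made, with hypothetical variables; the comments carry the documentation:

  * analyze.do -- estimation choices documented in place (illustrative names)
  regress lwage educ exper, vce(robust)   // robust SEs: heteroskedasticity suspected
  estimates store robust
  regress lwage educ exper                // classical SEs kept as a robustness check
  estimates store classical
  estimates table robust classical, se    // report both so readers can compare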
Thinking ahead
• Delete or archive old files as you go
• Use a meaningful directory structure (/raw, /data, /programs, /logfiles, /graphs, etc.)
• Use relative pathnames
• Use meaningful variable names
• Use a script to sequentially run programs
Example script to sequentially run programs

#!/bin/csh
# File location: /u/machine/username/project/scripts/myproj.csh
# Author: your name
# Date: 9/21/04
# This script runs a do-file in Stata which produces and saves a dta-file
# in the data directory. Stat/Transfer then converts the .dta file to
# .sas7bdat and saves it in the data folder. Finally, the program
# analyze.sas is run on the new SAS data file.
cd /u/machine/username/project/
stata -b do programs/cleandata.do
st data/H00x_B.dta data/H00x_B.sas7bdat
sas programs/analyze.sas
Log files
• Your log file should tell a story to the reader
• As you print results to the log file, include words explaining the results
• Don't output everything to the log file: use quietly and noisily in a meaningful way (see the sketch below)
• Include not only what your code is doing, but your reasoning and thought process
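A short Stata sketch of this narrated style; the variable names and the expectations stated in the strings are hypothetical:

  * Narrated log output (illustrative names and claims)
  log using logfiles/analyze, text replace
  display "Step 1: mean income, as a sanity check against the published Table 1"
  quietly {
      summarize income
      noisily display "Mean income: " r(mean)   // surface only what the reader needs
      regress income educ exper                 // long default output stays suppressed
      noisily display "R-squared: " e(r2)
  }
  log close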
Project Clean-up
• Create a zip file that contains everything necessary for complete replication
• Delete/archive unused or old files
• Include any referenced files in the zip
• When you have a final zip archive containing everything:
  • Open it in its own directory and run the script
  • Check that all the results match
When there are data restrictions…
• Consider releasing:
  • the subset of the raw data used
  • your analysis data, as opposed to the raw data
  • (at a minimum) notes on the process from raw to analysis data, PLUS everything pertaining to the data analysis
• Consider “internal” and “external” versions of your log file:
  • Do this via a variable at the top of your log-producing programs:
    local internal = 1
    …
    list if `internal' == 1
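A slightly fuller sketch of that flag, with hypothetical variable names; confidential record-level detail prints only when the internal flag is set:

  * Internal vs. external log content (illustrative names)
  local internal = 1                  // set to 0 before producing the public log
  quietly summarize income
  display "Mean income: " r(mean)     // aggregate result: safe for both versions
  if `internal' == 1 {
      list id income in 1/10          // record-level detail: internal log only
  }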
Ethical Issues
• All authors are responsible for proper clean-up of the project
• This is extremely important whether or not you plan on releasing data and programs
• Motivation:
  • self-interest
  • honest research
  • the scientific method
  • allowing others to be critical of your methods/results
  • furthering your field
Ethical Issues – for discussion
• What if third-party redistribution of the data is not allowed?
• Solutions for releasing data while protecting your time investment in data collection
• Is it unfair to ask people to release data after a huge time investment in the collection?