CCPR Computing Services
More Efficient Programming
July 13, 2006

Outline
• Thinking through a programming task
• Ways of efficiently documenting and organizing your project
    • Naming variables, programs, files
    • Commenting code
    • Including file header
    • Implementing directory structure
• Programming constructs
• Raw data -> finished product: are your results replicable?

Before you start coding…
• Think
• Clearly define the problem in writing
• Write down the solution/algorithm in English
• Modularity: split the solution into sections
• Create a test (if reasonable)
• Translate one section to code
• Test the section thoroughly
• Translate/test the next section, etc.

Documentation - File Header
• Each do-file/program/file you create should include:
    • Your name
    • Project name
    • Project location
    • Date
    • Software version
    • Purpose of program
    • Inputs, outputs
    • Special notes

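As one possible layout, a header for a Stata do-file might look like the sketch below. Every value in angle brackets is a placeholder, and the field names and their order are assumptions for illustration, not a required standard.

```stata
*********************************************************************
* Author:    <your name>
* Project:   <project name>
* Location:  <path to project folder>
* Date:      <date created or last modified>
* Software:  Stata <version>
* Purpose:   <one or two sentences on what this do-file does>
* Inputs:    <data files read>
* Outputs:   <data files, logs, tables written>
* Notes:     <anything unusual a reader should know>
*********************************************************************
```

Because every line starts with an asterisk, Stata treats the whole block as a comment. If it is placed after the log is opened, it is echoed into the log file.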
Naming Files, Variables, and Functions
• Use language standard (if it exists)
• Be aware of language-specific rules
    • Max length, underscore, case, reserved words
• Differentiating log files:
    • Programs: MergeHH.sas, MergeHH.do
    • Log files: MergeHHsas.log, MergeHHsta.log
• Meaningful variable names:
    • LogWt vs. var1
    • AgeLt30 vs. x
• Procedure that cleans missing values of Age:
    • fixMissingAge
• Matrix multiplication, X transpose times X:
    • matXX

Commenting Code
• Good code is self-commenting
    • Naming conventions, structure/formatting, and the header should explain 95%
• Comments should explain
    • Purpose of code, not every detail
    • Tricks used
    • Reasons for unusual coding
• Comments do not
    • fix sloppy code
    • translate syntax
• If it takes longer to read the comment than to read the code, don’t add a comment!

Commenting Code - Stata example
Compare formatting, comments, variable names, and function names.

SAMPLE 1

    program def function1
    foreach v of varlist _all {
    local x = lower("`v'")
    if `"`v'"' != `"`x'"' {
    rename `v' `=lower("`v'")'
    }
    }
    end

SAMPLE 2

    *Convert names in dataset to lowercase.
    program def lowerVarNames
        foreach v of varlist _all {
            local LowName = lower("`v'")
            if `"`v'"' != `"`LowName'"' {
                rename `v' `=lower("`v'")'
            }
        }
    end

Directory Structure
• A project consists of many different types of files
• Use folders to separate files in a logical way
• Be consistent across projects if possible
• ATTIC folder for older versions

Stata example: using directory structure

    ** Paths:
    global parentpath "C:\Documents and Settings\piersol\Summer06\prog\progtips"
    global pgmsloc "$parentpath\pgms"
    global logsloc "$parentpath\logs"
    global cleandataloc "$parentpath\data\clean"
    global rawdataloc "$parentpath\data\raw"

    capture log close
    log using "$logsloc\test200607", text replace

    *********************************************************************
    *INSERT FILE HEADER HERE...then it’s included in log file.
    *********************************************************************

    macro list
    webuse union, clear
    save "$rawdataloc\union.dta", replace
    *keep idcode year age grade
    save "$cleandataloc\unionLJP.dta", replace
    log close

Programming Constructs
• Tools to simplify and clarify your coding
• Available in virtually all languages
• Constructs
    • Loops - for, foreach, do, while
    • If/elseif/else - if, then, else, case
    • continue
    • exit

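The loop examples on the following slides use foreach and forvalues. As a small sketch of the while form in Stata, the loop below counts from 1 to 3; the local macro i is a hypothetical counter introduced only for this illustration.

```stata
* Count from 1 to 3 with a while loop.
* The local macro i is a hypothetical counter used only for illustration.
local i = 1
while `i' <= 3 {
    display "Iteration `i'"
    local i = `i' + 1
}
```

A while loop needs its counter updated by hand, which is why forvalues is usually the simpler choice when the range is known in advance.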
Loop Example 1
• Problem: Given 4 indicator variables (south, union, black, not_smsa) and 2 discrete variables (age, grade), generate 8 new indicator variables:
    • south_age21 = south and age > 21
    • south_gr12 = south and grade > 12
    • Similarly for union, black, not_smsa
• Solution without loop
    • 8 lines of code similar to:

    generate south_age21 = (south==1 & age>21 & age<.)
    generate south_gr12 = (south==1 & grade>12 & grade<.)

• Solution with loop

    foreach j in south union black not_smsa {
        gen `j'_age21 = (age>21 & age<. & `j'==1)
        gen `j'_gr12 = (grade>12 & grade<. & `j'==1)
    }

Loop Example 1, cont.

    *CHECK GENERATED VARIABLES AGAINST ORIGINAL VARIABLES
    foreach j in south union black not_smsa {
        qui count if `j'==1 & age>21 & age<.
        local origCount = r(N)
        qui count if `j'_age21==1
        if `origCount' ~= `r(N)' {
            display "Counts do not match for `j'_age21!"
        }
        else display "Counts match for `j'_age21."

        qui count if `j'==1 & grade>12 & grade<.
        local origCount = r(N)
        qui count if `j'_gr12==1
        if `origCount' ~= `r(N)' {
            display "Counts do not match for `j'_gr12!"
        }
        else display "Counts match for `j'_gr12."
    }

Loop Example 2
• Given indicator variables white, black, other, and continuous variable educyrs, create interaction variables
• Solution using loop:

    local allraces "white black other"
    foreach race of varlist `allraces' {
        generate `race'_educ = `race'*educyrs
    }

Loop Example 3
• Problem:
    • Dataset contains variables over multiple years (1970-1990)
    • Need to perform a number of commands separately for 1970, 1975, 1980, 1985
• Solution without loop

    bysort year: command1 if year==70 | year==75 | year==80 | year==85
    bysort year: command2 if year==70 | year==75 | year==80 | year==85

• Solution with loop

    foreach year in 70 75 80 85 {
        di as result "***Regression for year = `year':"
        regress ln_wage grade tenure ttl_exp if year==`year'
        di as result "***Summarize for year = `year':"
        summarize ln_wage if year==`year'
    }

Loop Example 4 – pulling from 2 lists
• From Stata FAQ website
• Code:

    local agrp "cat dog cow pig"
    local bgrp "meow woof moo oinkoink"
    local n : word count `agrp'
    forvalues i = 1/`n' {
        local a : word `i' of `agrp'
        local b : word `i' of `bgrp'
        di "`a' says `b'"
    }

• Resulting output:

    cat says meow
    dog says woof
    cow says moo
    pig says oinkoink

Constructs - If/then/else
• Execute a section of code if a condition is true:

    if condition then
        {execute this code if condition true}
    end

• Execute one of two sections of code:

    if condition then
        {execute this code if condition true}
    else
        {execute this code if condition false}
    end

If/Else Example
• Problem: need to execute commands on an operating system, but only if the OS is Unix…the commands will fail if the OS is anything else
• Solution:

    if "`c(os)'" ~= "Unix" {
        di as err "Sorry; this section requires Unix OS."
    }
    else {
        ** continue with unix commands…
    }

Constructs - Elseif/case
• Elseif - execute one of many sections of code:

    if condition1 then
        {execute this code if condition1 true}
    elseif condition2 then
        {execute this code if condition2 true}
    else
        {execute this code if condition1 and condition2 are both false}
    end

• Case - same idea, different name:

    case condition1 then
        {execute this code if condition1 true}
    case condition2 then
        {execute this code if condition2 true}
    etc.

Elseif Example
• Problem: Continue example from if…else, but execute a different section of code for Unix, Windows, and Mac
• Solution:

    if "`c(os)'" == "Unix" {
        di "This is a Unix environment."
    }
    else if "`c(os)'" == "Windows" {
        di "This is a Windows environment."
    }
    else if "`c(os)'" == "MacOSX" {
        di "This is a MacOS environment."
    }
    else {
        di as err "`c(os)' not recognized."
    }

Stata - if command vs. if qualifier
• The if command (ifcmd in Stata's help) was designed to be used with a single expression
• Example:
    • Given variable x with 5 observations: 1, 1, 2, 1, 3
    • Compare the following three pieces of Stata code:

    if x==2 {
        replace x=99
    }

    if x==1 {
        replace x=99
    }

    replace x=99 if x==2

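To see why the three pieces differ, it can help to run them in sequence on the five observations from the slide. The sketch below does that; the only additions are the input block that builds x and the count checks. In the if command, Stata evaluates x==2 as x[1]==2, that is, for the first observation only, whereas the if qualifier is evaluated for every observation.

```stata
clear
input x
1
1
2
1
3
end

* if command: tests x[1]==2, which is false, so nothing is replaced
if x==2 {
    replace x = 99
}
count if x==99

* if qualifier: tested row by row, so only the third observation changes
replace x = 99 if x==2
count if x==99

* if command: tests x[1]==1, which is true, so replace runs on ALL rows
if x==1 {
    replace x = 99
}
count if x==99
```

The three counts are 0, 1, and 5: the if command either skips the block or runs it for the whole dataset, so subsetting observations calls for the if qualifier.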
Constructs -- Continue
Example from Stata online help
• Continue is used to exit the current iteration of a loop and continue with the next iteration
• The following two loops produce the same result:

    forvalues x = 1/10 {
        if mod(`x',2)==1 {
            display "`x' is odd"
            continue
        }
        display "`x' is even"
    }

    forvalues x = 1/10 {
        if mod(`x',2)==1 {
            display "`x' is odd"
        }
        else {
            display "`x' is even"
        }
    }

Constructs – Exit
• Stop execution of program
• Examples:
    • Do-file contains a number of data checks followed by analysis commands. If data checks reveal something unacceptable, you can exit out of the do-file before running analysis.
    • Program requires user input. If user enters “bad” information, need to quit program.
    • Debugging. If a particular error occurs, then break.
    • Check denominator prior to dividing. If it equals zero, exit.

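A minimal sketch of the first case, a data check that stops a do-file before the analysis, is shown below. The auto dataset shipped with Stata and the choice of return code 459 are assumptions made for illustration; the check itself would be whatever your project requires.

```stata
* Illustrative data check before analysis (dataset and variables are
* stand-ins chosen for this sketch, not part of the original example).
sysuse auto, clear

quietly count if missing(price)
if r(N) > 0 {
    display as error "price has " r(N) " missing values; stopping before analysis."
    exit 459
}

* Reached only if the check above passed
regress price mpg weight
```

Exiting with a nonzero return code makes the failure visible to any do-file that called this one, whereas a bare exit ends the do-file without signalling an error.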
Raw data to finished product
Raw data -> Analysis data -> Runs/results -> Finished product

Raw Data -> Analysis Data
• Always have two distinct data files: the raw data and the analysis data
• A program should completely re-create the analysis data from the raw data
• NO interactive changes!! Final changes must go in a program!!

Raw Data -> Analysis Data
• Document all of the following:
    • Outliers?
    • Errors?
    • Missing data?
    • Changes to the data?
• Remember to check:
    • Consistency across variables
    • Duplicates
    • Individual records, not just summary stats
    • “Smell tests”

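Several of these checks can be written directly into the cleaning program so they rerun every time the analysis data are rebuilt. The sketch below uses the auto dataset shipped with Stata as a stand-in; the specific rules (make identifies a record, price must be positive, the 12000 cutoff) are assumptions for illustration only.

```stata
sysuse auto, clear

* Duplicates: make should uniquely identify each record
isid make
duplicates report make

* Missing data: count missing values so they can be documented
count if missing(rep78)

* Errors: stop the program if a value is impossible
assert price > 0 & !missing(price)
assert inlist(foreign, 0, 1)

* Outliers: look at the tails, then at the individual records behind them
summarize price, detail
list make price mpg if price > 12000 & !missing(price)
```

Both isid and assert halt the do-file when a rule is violated, so a problem in the raw data cannot pass silently into the analysis file.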
Analysis Data -> Results
• All results should be produced by a program
• Program should use analysis data (not raw)
• Have a “translation” of raw variable names -> analysis variable names -> publication variable names

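One way to keep that translation inside the program is to do the renaming in code and carry the publication name as a variable label. This is only a sketch: the auto dataset, the analysis name milesPerGal, and the label text are all hypothetical choices made for illustration.

```stata
sysuse auto, clear

* Raw name -> analysis name (milesPerGal is a hypothetical analysis name)
rename mpg milesPerGal

* Analysis name -> publication name, stored as the variable label
label variable milesPerGal "Fuel economy (miles per gallon)"

* The describe output now shows the analysis name and publication label together
describe milesPerGal
```

Keeping the mapping in the do-file means the log documents it each time the analysis data are rebuilt.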
Analysis Data -> Results
• Document:
    • How were variances estimated? Why?
    • What algorithms were used and why? Were results robust?
    • What starting values were used? Was convergence sensitive?
    • Did you perform diagnostics? Include in programs/documentation.

Log files
• Your log file should tell a story to the reader.
• As you print results to the log file, include words explaining the results
• Include not only what your code is doing, but your reasoning and thought process
• Don’t output everything to the log file: use quietly and noisily in a meaningful way.

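As a rough sketch of that idea, the code below suppresses routine output with quietly and prints only short explained results. The auto dataset and the particular statistics are assumptions chosen for illustration; with a log open, the displayed lines are what the reader of the log would see.

```stata
sysuse auto, clear

display "Comparing mean price of domestic and foreign cars."
quietly summarize price if foreign == 0
local domMean = r(mean)
quietly summarize price if foreign == 1
local forMean = r(mean)
display "Domestic mean price: " %9.2f `domMean'
display "Foreign mean price:  " %9.2f `forMean'

* Run the regression silently, but report the one number of interest
quietly {
    regress price mpg weight
    noisily display "R-squared, price on mpg and weight: " %5.3f e(r2)
}
```

Inside a quietly block, the noisily prefix turns output back on for a single command, which keeps the log short without hiding the result that matters.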
Project Clean-up
• Create a zip file that contains everything necessary for complete replication
• Use a readme.txt file to describe the zip contents
• Delete/archive unused or old files
• Include any referenced files in the zip
• When you have a final zip archive containing everything:
    • Open it in its own directory and run the script
    • Check that all the results match