450 likes | 707 Views
Being Productive with Stata and VA Data. Give me six hours to chop down a tree and I will spend the first four sharpening the axe. --Abraham Lincoln Todd Wagner August 2008. Outline. Database manipulation in Stata Data Analysis in Stata. Working Interactively and .do files.
E N D
Being Productive with Stataand VA Data Give me six hours to chop down a tree and I will spend the first four sharpening the axe. --Abraham Lincoln Todd Wagner August 2008
Outline • Database manipulation in Stata • Data Analysis in Stata
Working Interactively and .do files • You can issue commands directly into the command line. • Unless you save your commands into a batch file (a .do file), you’ll lose your code once you close Stata. • I often work interactively and then save the “right” commands in a do file.
Editing a .do file in Stata • Any ASCII text editor will work • Stata has a built in text editor, but it is limited. • I recommend exploring your options http://fmwww.bc.edu/repec/bocode/t/textEditors.html
Handling Data • SAS processes one record at a time • Stata processes all the records at the same time • Loops are commonly used in SAS • Loops are very rarely used in Stata
Loading Data into Memory • Stata reads the data into memory • set mem 100m (before you load the data) • You must have enough memory for your dataset • With large datasets: • drop unnecessary variables • Use the compress command (but don’t compress SCRSSN)
Stata Abbreviations • Stata commands can be abbreviated with the first three letters • regression income education female could be written • reg income education female
Stata Help • Stata’s built in help is great • Help <command> • Stata manuals are great because they review theory
Stata and the Web • Stata is “web aware” • Check for updates periodically • update all • You can search for user-written programs • findit output • findit outreg (click to install)
Stata in Windows • Page up scrolls through the previous commands • There is a graphical user interface (menus) if you forget a command • In Unix, you can all Stata’s functionality if you use x-windows (e.g., Cygwin).
Sysdir, ls and cd • Stata recognizes some unix commands, such as ls and cd • Sysdir provides a listing of Stata’s working directories sysdir STATA: C:\Program Files\Stata9\ UPDATES: C:\ProgramFiles\Stata9\ado\updates\ BASE: C:\Program Files\Stata9\ado\base\ SITE: C:\Program Files\Stata9\ado\site\ PLUS: c:\ado\stbplus\ PERSONAL: c:\ado\personal\ OLDPLACE: c:\ado\ • Store your data on a VA server– not on your PC or laptop!
Delimiters • SAS recognizes “;” as a delimiter • Stata recognizes the carriage return • Always add a carriage return after your last command • You can change delimiters to ; #delimit ;
Missing Data • Stata and SAS both use “.” as missing • Stata implicitly values a missing as a very large number • SAS implicitly values a missing as a very small number
Generating and Recoding Variables • In SAS you type quality=0; If VA=1 then quality=1; • In Stata you type gen quality=0 recode quality 0=1 if VA==1 or replace quality=1 if VA==1
Boolean Logic • Stata is picky about Boolean logic gen y=x if a==b (must use two ==) gen y=x if a>b & b>10 (must use &) gen y=x if a<=b (< or > must be before =)
Creating Dummy Variables • Goal: create dummy variable for gender gen male=sex==“M” tab sex, gen(sex_) • This second command automatically creates 2 dummy variables • Be careful about missing data– missing data are assigned to 0, unless you use “if” or “recode”
Drop • Drop <varnames> (drops variables) • Drop if X==1 (drop cases where value is 1)
egen Commands • You want to generate total costs for a medical center • In SAS this is done by proc summary • In Stata, you can type collapse (sum) costs, by (stan3)or sort sta3n by sta3n: egen sumcost=total(cost)
ICD-9 Codes • Stata has capabilities to handle ICD-9 diagnosis and procedure codes • You can • check to see if codes are valid • generate identifiers based on codes or ranges of codes
Dates • Same date functions as SAS
Combining Data • Merge • this automatically creates a variable called _merge • merge==1 obs. from master data • merge==2 obs. from only one using dataset • merge==3 obs. from at least two datasets, master or using merge scrssn admitday disday using data_y • Append (stacking data)
Explicit Subscripting • Identify the most recent encounter in an encounter database gsort id -date by id : gen n=_n by id : gen N=_N gen select=n==1 Ascending sort by ID and reverse by date Record counter from 1 to N per person Total number of records per person
Set, Clear and More • Set: sets system parameters • Need to set memory size to open a database set mem 100m • Clear erases data from memory • When output is >1 page, you are asked to continue (set more off)
Summarizing Data • Sum < >, d provides more details on each variable • Tabstat provides summary info, including totals . sum gender age educ Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- gender | 4085 1.496206 .5000468 1 2 age | 4085 64.5601 9.451724 50 94 educ | 4085 4.398286 1.662883 1 9
Tabulating Data . tab gender gender | Freq. Percent Cum. ------------+----------------------------------- 1 | 2,058 50.38 50.38 2 | 2,027 49.62 100.00 ------------+----------------------------------- Total | 4,085 100.00 . table gender ---------------------- gender | Freq. ----------+----------- 1 | 2,058 2 | 2,027 ----------------------
Tabulating Data tab gender age too many values r(134); tab age gender | gender age | 1 2 | Total -----------+----------------------+---------- 50 | 49 69 | 118 51 | 72 71 | 143 … 94 | 1 0 | 1 -----------+----------------------+---------- Total | 2,058 2,027 | 4,085
. tabstat age, by (gender) gender | mean ---------+---------- 1 | 64.77454 2 | 64.34238 ---------+---------- Total | 64.5601 -------------------- . table gender, c(mean age) ----------------------- gender | mean(age) ----------+------------ 1 | 64.77454 2 | 64.34238 ----------------------- Tabstat
Graphing • Diagnostic graphics • Presenting results
Basic Analytical Functions • OLS (reg) • Logistic, probit, count data (e.g., CLAD) • Multinomials • GLM/HLM • Duration models • Semi and non-parametric models
Creating Publishable Tables • Outreg command • Outputs data to a delimited file • Delimited file can be read into Excel • Very flexible • Creates publishable tables easily
Becaplermin • June 2006, FDA issued a Boxed Warning for becaplerim (a treatment for lower extremity diabetic ulcers) • Warning raised potential risk of cancer related mortality
Analytical Goal • Case-control study for becaplermin • Sample is all patients with a diabetic ulcer of the lower extremity • Exposure is quantity of becaplermin prescriptions • Multivariate analysis, stratifying for patients with prior history of cancer
Pulling VA Data • VA utilization data extracts reside in SAS. I extract my sample using SAS and then moved the data into Stata. • VA Data: • Sample: All encounters with a diabetic ulcer principal diagnosis in NPCD and PTF (FY02-07) • Exposure: All prescriptions from DSS pharmacy FY02-07 for Becaplermin feeder code • Outcome: All encounters with a neoplasm principal diagnosis (FY97-07)
Transferring Data • Stattransfer or DBMS copy work • Stattransfer often seeks to optimize the Stata dataset by default • If transferring data with SCRSSN, FORCE Stattransfer to transfer SCRSSN as double precision • http://www.stata.com/support/faqs/data/prec.html
CLICK ON DOUBLE Stattransfer
Diabetic Ulcer Sample • Goal: turn encounter level data into person level data cd R:\twagner\customer\becap use ulcer, clear sort scrssn by scrssn: gen n=_n tab n keep if n==1 keep scrssn sort scrssn gen ulcer=1 save finder, replace
Alternative Code sort scrssn by scrssn: gen n=_n by scrssn: gen num_ulcervisits=_N sort scrssn by scrssn: gen newepisode=vizday[_n]-vizday[_n-1]>60 recode newepisode .=1 if n==1 by scrssn: egen episodes=sum(newepisode)
Step 2: Merge Ulcer Sample and Cancer Cases use neo, clear gen cancer=1 sort scrssn merge scrssn using finder drop if _m==1 Merge command creates a new variable: _m=1 data only in master data _m=2 data only in using data _m=3 data merged in both
Step 2: continued sort scrssn admitday disday sta3n adtime by scrssn: egen firstcancer=min(admitday) if cancer==1 gen diedihcan=disto==-2 & cancer==1 gen dod_can=disday if diedihcan==1 gen cancerstays=1 if cancer==1 recode cancerstays .=0 collapse (min) firstcancer (sum) cancerstays (max) diedihcan dod_can cancer, by (scrssn) sort scrssn drop _m save diabcancer, replace
Merge in Exposure data use becap, clear gen numrx=1 sort scrssn svc_dte by scrssn: egen firstbecap=min(svc_dte) by scrssn: egen lastbecap=max(svc_dte) collapse (min) firstbecap (max) lastbecap (sum) day_supply numrx , by (scrssn) gen becap=1 sort scrssn save becapsum, replace use diabcancer merge scrssn using becapsum