Stata: Getting Started and Being Productive with VA Data
Give me six hours to chop down a tree and I will spend the first four sharpening the axe. --Abraham Lincoln
Todd Wagner
June 2007
Outline
• Getting data into Stata
• Editing in Stata
• How does Stata handle data
• Stata notation and help
• Using Stata and basic Stata commands
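Transferring Data
• Stat/Transfer or DBMS/Copy will both convert data to Stata format
• By default, Stat/Transfer often tries to optimize the Stata dataset (for example, the storage type of each variable)
• If transferring data with SCRSSN, FORCE Stat/Transfer to transfer SCRSSN as double precision; a float holds only about 7 significant digits, so long identifiers would be rounded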
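The precision issue behind the SCRSSN advice can be checked directly in Stata. This is a minimal sketch: the 9-digit value is made up for illustration, and `mydata` is a hypothetical name for the transferred file.

```stata
* A float cannot hold every 9-digit integer exactly
display %12.0f float(123456789)     // shows 123456792, not 123456789

* After transferring, confirm the storage type of the identifier
use mydata, clear                   // hypothetical transferred dataset
describe scrssn                     // storage type should read "double"
```

Recasting to double after the transfer does not recover digits that were already rounded, which is why the type has to be forced during the transfer itself.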
[Screenshot: Stat/Transfer dialog. Click on DOUBLE.]
Editing in Stata
• Any ASCII text editor will work
• Stata has a built-in text editor, but it is limited
• I recommend using another text editor: http://fmwww.bc.edu/repec/bocode/t/textEditors.html
Handling Data
• SAS processes one record at a time
• Stata commands operate on all the records at once
• Loops are commonly used in SAS
• Explicit loops over records are very rarely needed in Stata
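As a rough illustration of the point above, the sketch below uses the `auto` example dataset that ships with Stata; the new variable names are made up for the example.

```stata
sysuse auto, clear            // example dataset shipped with Stata

* One command transforms every record; no explicit loop over observations
generate price_k = price / 1000

* Conditions are applied record by record through an if qualifier
generate byte expensive = price > 6000 if !missing(price)

* When a loop is useful, it usually runs over variables, not records
foreach v of varlist price mpg weight {
    summarize `v'
}
```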
Loading Data into Memory
• Stata reads the data into memory
    set mem 100m (before you load the data)
• You must have enough memory for your dataset
• With large datasets:
  • drop unnecessary variables
  • use the compress command (but don’t compress SCRSSN)
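One possible loading sequence is sketched below. The file name `mydata` and the variable list are assumptions for illustration, not names from the slides' actual data.

```stata
set mem 100m                 // needed in Stata 11 and earlier; later releases ignore it

* Load only the variables you need instead of dropping them afterwards
use scrssn admitday disday drg cost using mydata, clear

describe, short              // check the number of observations and variables
compress drg cost            // compress selected variables, leaving scrssn untouched
```

Listing variables before `using` keeps the unwanted columns from ever being read into memory, which helps when the full file would not fit.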
Stata Abbreviations
• Many Stata commands can be abbreviated, often to their first three letters
    regress income education female
  could be written
    reg income education female
• Variable names can also be abbreviated if the abbreviation is unique
    reg inc educ fem
Stata Help
• Stata’s built-in help is great
    help <command>
• Stata manuals are great because they review theory
Stata and the Web
• Stata is “web aware”
• Check for updates periodically
    update all
• You can search for user-written programs
    findit output
    findit outreg (click to install)
Stata in Windows
• Page Up scrolls through the previous commands
• There is a graphical user interface (menus) if you forget a command
• We have Stata on rocky and tasha: no graphical capabilities, no menus, and loss of some shortcuts
Using Stata
• Create batch files called “.do” files
• I work interactively
  • Run Stata and create the do file as I go
  • I can then use the do file as needed
• Debugging code and exploratory data analysis are very fast in Stata
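A do file built up this way might look like the skeleton below. The file names `analysis.do`, `analysis.log` and `mydata` are hypothetical, and the layout is only one common convention.

```stata
* analysis.do (hypothetical file name)
capture log close                    // close any log left open by a previous run
log using analysis.log, replace text
set more off

use mydata, clear                    // hypothetical dataset
summarize
* commands copied from the interactive session go here

log close
```

Typing `do analysis` in the Command window reruns the whole file.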
Sysdir, ls and cd
• Stata recognizes some Unix commands, such as ls and cd
• sysdir lists Stata’s system directories
    sysdir
       STATA:  C:\Program Files\Stata9\
     UPDATES:  C:\Program Files\Stata9\ado\updates\
        BASE:  C:\Program Files\Stata9\ado\base\
        SITE:  C:\Program Files\Stata9\ado\site\
        PLUS:  c:\ado\stbplus\
    PERSONAL:  c:\ado\personal\
    OLDPLACE:  c:\ado\
Delimiters
• SAS recognizes “;” as a delimiter
• Stata recognizes the carriage return
• Always add a carriage return after your last command
• You can change the delimiter to ;
    #delimit ;
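A short sketch of switching delimiters inside a do file follows (`#delimit` has no effect when typed interactively). The variable names are borrowed from the regression output later in the deck; the command itself is illustrative and is not claimed to be the model that produced that output.

```stata
#delimit ;
* With ; as the delimiter, one command can span several lines ;
regress wtp ethn1 ethn2 ethn3 ethn4
    english lifeus age1999 income incmis, robust ;
#delimit cr
* Back to one command per line
summarize wtp
```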
Missing Data
• Stata and SAS both use “.” as missing
• Stata implicitly treats a missing value as a very large number, so x>100 is true when x is missing
• SAS implicitly treats a missing value as a very small number, so x<100 is true when x is missing
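The practical consequence in Stata can be seen with the `auto` example dataset, in which `rep78` has some missing values. This is a minimal sketch of guarding a comparison against missing values.

```stata
sysuse auto, clear

count if rep78 > 3                     // also counts records where rep78 is missing
count if rep78 > 3 & !missing(rep78)   // counts only nonmissing values above 3
count if rep78 > 3 & rep78 < .         // equivalent way to exclude missing values
```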
Generating and Recoding Variables
• In SAS you type
    quality=0;
    if VA=1 then quality=1;
• In Stata you type
    gen quality=0
    recode quality 0=1 if VA==1
  or
    replace quality=1 if VA==1
Boolean Logic
• Stata is picky about Boolean logic
    gen y=x if a==b          (equality tests must use ==, not =)
    gen y=x if a>b & b>10    (must use & for "and")
    gen y=x if a<=b          (< or > must come before =)
Creating Dummy Variables
• Goal: create a dummy variable for each DRG
    gen drgnum1=drg==1
  or
    tab drg, gen(drgnum)
• The second command automatically creates the dummy variables
Drop
• drop <varnames> (drops variables)
• drop if X==1 (drops cases where X equals 1)
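egen Commands
• You want to generate total costs for a medical center
• In SAS this is done with proc summary
• In Stata, you can type
    collapse (sum) cost, by(sta3n)
  which replaces the data in memory with one record per station, or
    sort sta3n
    by sta3n: egen sumcost=total(cost)
  which keeps every record and adds the station total as a new variable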
ICD-9 Codes
• Stata has capabilities to handle ICD-9 diagnosis and procedure codes
• You can
  • check to see if codes are valid
  • generate identifiers based on codes or ranges of codes
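These capabilities come from Stata's `icd9` command for diagnosis codes, with `icd9p` as the counterpart for procedure codes. The sketch below assumes a hypothetical string variable `dx1` holding diagnosis codes and uses the 428 code family purely as an example range; check `help icd9` for the options available in your release.

```stata
* dx1 is a hypothetical string variable holding ICD-9 diagnosis codes
icd9 check dx1                          // report codes that are invalid or undefined
icd9 clean dx1, dots                    // standardize formatting, adding the decimal point
icd9 generate chf = dx1, range(428*)    // 1 if the code falls in the 428 family, else 0
```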
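Dates
• Like SAS, Stata stores a date as the number of days since January 1, 1960
• Date functions are similar to those in SAS (e.g., mdy(), year(), month(), day())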
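A few typical date operations are sketched below. It assumes `admitday` and `disday` are already numeric daily dates (as they would be after a transfer from SAS); the new variable names and the fiscal-year window are illustrative.

```stata
format admitday disday %td              // display as, e.g., 15jun2007
generate los = disday - admitday        // length of stay in days
generate admit_yr = year(admitday)      // calendar year of admission

* Flag admissions between 1 October 2006 and 30 September 2007
generate byte fy2007 = admitday >= mdy(10, 1, 2006) & admitday <= mdy(9, 30, 2007)
```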
Combining Data
• Merge
  • this automatically creates a variable called _merge
  • _merge==1 obs. from master data only
  • _merge==2 obs. from only one using dataset
  • _merge==3 obs. from at least two datasets, master or using
    merge scrssn admitday disday using data_y
• Append (stacking data)
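A fuller version of the merge on the slide might look like the sketch below. `data_x` is a hypothetical name for the master file, and keeping only matched records is just one possible choice.

```stata
* Old-style match merge (the syntax on the slide): both files must be sorted by the keys
use data_y, clear
sort scrssn admitday disday
save data_y, replace

use data_x, clear                      // data_x is a hypothetical master dataset
sort scrssn admitday disday
merge scrssn admitday disday using data_y

tab _merge                             // inspect match results before subsetting
keep if _merge == 3                    // keep records found in both files
drop _merge

* In Stata 11 and later, if the keys uniquely identify records in both files,
* the same merge can be written without presorting:
* merge 1:1 scrssn admitday disday using data_y
```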
Explicit Subscripting
• Identify the most recent encounter in an encounter database
    gsort id -date        (ascending sort by ID, descending by date)
    by id: gen n=_n       (record counter from 1 to N per person)
    by id: gen N=_N       (total number of records per person)
    gen select=n==1       (flags the most recent encounter)
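The same flag can be built in a single step with `bysort`, shown here as an alternative sketch rather than the approach on the slide. The variable name `latest` is made up, and when several records share the latest date only one of them is flagged.

```stata
* Sort by date within id, then mark the last record of each person
bysort id (date): generate byte latest = _n == _N
```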
Set, Clear and More
• set: sets system parameters
  • Need to set memory size to open a database
    set mem 100m
• clear erases data from memory
• When output is >1 page, you are asked to continue; set more off turns this off
Summarizing Data
• sum <varlist>, d provides more details on each variable
• tabstat provides summary info, including totals
    . sum gender age educ

        Variable |       Obs        Mean   Std. Dev.        Min        Max
    -------------+--------------------------------------------------------
          gender |      4085    1.496206    .5000468          1          2
             age |      4085     64.5601    9.451724         50         94
            educ |      4085    4.398286    1.662883          1          9
Tabulating Data
    . tab gender

         gender |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |      2,058       50.38       50.38
              2 |      2,027       49.62      100.00
    ------------+-----------------------------------
          Total |      4,085      100.00

    . table gender

    ----------------------
       gender |      Freq.
    ----------+-----------
            1 |      2,058
            2 |      2,027
    ----------------------
Tabulating Data
• A two-way tab fails when the column variable has too many values; list that variable first so it goes in the rows
    . tab gender age
    too many values
    r(134);

    . tab age gender

               |        gender
           age |         1          2 |     Total
    -----------+----------------------+----------
            50 |        49         69 |       118
            51 |        72         71 |       143
             …
            94 |         1          0 |         1
    -----------+----------------------+----------
         Total |     2,058      2,027 |     4,085
Tabstat
    . tabstat age, by(gender)

      gender |      mean
    ---------+----------
           1 |  64.77454
           2 |  64.34238
    ---------+----------
       Total |   64.5601
    --------------------

    . table gender, c(mean age)

    -----------------------
       gender |   mean(age)
    ----------+------------
            1 |    64.77454
            2 |    64.34238
    -----------------------
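tabstat can also report several statistics for several variables in one table. A small sketch using the variable names from the slides:

```stata
* Count, mean, standard deviation, minimum and maximum of age and educ by gender
tabstat age educ, by(gender) statistics(n mean sd min max)
```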
Graphing
• Diagnostic graphics
• Presenting results
Basic Analytical Functions
• OLS (reg)
• Logistic, probit, count data (e.g., CLAD)
• Multinomials
• GLM/HLM
• Duration models
• Semi- and non-parametric models
Output
    Linear regression                                      Number of obs =    1306
                                                           F( 21,  1284) =   10.88
                                                           Prob > F      =  0.0000
                                                           R-squared     =  0.1398
                                                           Root MSE      =  90.367

    ------------------------------------------------------------------------------
                 |               Robust
             wtp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           ethn1 |   1.990048   8.742036     0.23   0.820    -15.16019    19.14029
           ethn2 |  -25.74654   11.69993    -2.20   0.028    -48.69961   -2.793467
           ethn3 |  -35.59552   11.98309    -2.97   0.003     -59.1041   -12.08694
           ethn4 |  -3.244168   11.16836    -0.29   0.771    -25.15441    18.66607
         english |  -11.44402   9.699576    -1.18   0.238    -30.47277    7.584741
          lifeus |   37.34419   13.86037     2.69   0.007     10.15274    64.53564
         age1999 |  -.6272524   .3097408    -2.03   0.043    -1.234906   -.0195987
          income |   .8068256   .1714309     4.71   0.000     .4705102    1.143141
          incmis |   14.07434   9.404149     1.50   0.135    -4.374848    32.52352
           _cons |   111.3607   24.13083     4.61   0.000     64.02051    158.7009
    ------------------------------------------------------------------------------
Outreg
• Writes regression results to a delimited text file
• The delimited file can be read into Excel
• Very flexible
• Creates publishable tables