Research Methods Lecture 3 More STATA

Research Methods, Lecture 3: More STATA
Ian Walker, Room S2.109, i.walker@warwick.ac.uk, 02475 23054
Slides available at: http://www2.warwick.ac.uk/fac/soc/economics/pg/modules/rm/notes/iw_lectures/

Housekeeping announcement

Stat-Transfer
• Use Stat-Transfer to convert data between formats.
• Stat-Transfer is “point and click”: click on its icon to start it.
• Just tell it the name and format of the input file and the format you want it in.
• Click “Transfer”.

Stat Transfer options
• Useful options for creating a manageable dataset from a large one:
• Keep or drop variables
• Change variable format, e.g. float to integer
• Select observations, e.g. “where (income + benefits)/famsize < 4500”
• Can be used for reading a large STATA dataset and writing a smaller one
• Avoids doing this in STATA itself
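For comparison, the same kind of subsetting can also be done inside STATA while the file is being read, using the `use varlist if exp using filename` form of the use command. The sketch below is only an illustration: the file name bigfile and all values are invented, and only the selection rule comes from the slide.

```stata
* Invented toy data standing in for a large dataset
clear
input id income benefits famsize other
1  8000 1000 2 5
2 20000    0 2 6
3  3000  500 1 7
end
save bigfile, replace

* Read only four variables and only the rows that satisfy the condition
use id income benefits famsize if (income + benefits)/famsize < 4500 using bigfile, clear
list
```

Only the observation with id 3 satisfies the condition here, and the variable other is never loaded.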

Customising STATA
• profile.do runs automatically when STATA starts
• Edit it to include commands you want to invoke every time
    . set mem 200m
    . log using justincase.log, replace
• Define preferences for STATA’s look and feel
• Click on Prefs in menu
• Colours, graph scheme, etc.
• Save window positioning

Merging data - 1
• file1 has id x1 x2 x3, file2 has id x4 x5 x6.
• You can merge using a “key” variable that is in BOTH files (id)
• But you need to sort both files on the key first.
    . use file1
    . sort id                  /* sorts file1 according to id variable */
    . save, replace
    . use file2
    . sort id                  /* sorts file2 according to id variable */
    . merge id using file1     /* matches observations on the key id */
    . drop if _merge~=3        /* drops obs that are not in both files */
    . save file3

Merging data - 2
• For each row (id), all vars in file1 are added to the corresponding row of file2 (if there is one).
• merge creates a new variable, _merge,
• which =1 for obs only in the data in memory (file2 here), =2 for those only in the using file (file1), and =3 for those in both.
• So the syntax above drops those obs that don’t have data in both files
• and saves the result, containing x1-x6, in file3.
• Use append to add more obs on the same vars.
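The merge steps above can be tried on toy data. This is a minimal sketch: the file names file1, file2 and file3 come from the slides, but all values are invented for illustration, and it assumes a STATA version that accepts the newer `merge 1:1` syntax, which does not require the files to be sorted first.

```stata
* Toy data invented for illustration only
clear
input id x1 x2 x3
1 10 11 12
2 20 21 22
3 30 31 32
end
save file1, replace

clear
input id x4 x5 x6
2 24 25 26
3 34 35 36
4 44 45 46
end
save file2, replace

* file2 is the master (in memory); file1 is the using file
use file2, clear
merge 1:1 id using file1

* _merge: 1 = master only (id 4), 2 = using only (id 1), 3 = in both (ids 2 and 3)
tabulate _merge
keep if _merge == 3
drop _merge
save file3, replace
```

file3 ends up with ids 2 and 3 and the variables x1-x6. Tabulating _merge before dropping anything is a cheap check that the match worked as expected.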

Collapsing data (use with care)
• collapse converts the data in memory into a dataset of means (or sums, medians, etc.)
• This is useful when you want to provide summary info at a higher level of aggregation
• For example, suppose a dataset contains data on individuals, say their region and whether they are unemployed (u/e)
• To find the average u/e rate by region:
    . collapse unemp, by(region)     /* leaves 1 obs for each region, holding the mean u/e rate */
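Because collapse replaces the data in memory, one cautious pattern is to wrap it in preserve and restore. The sketch below uses invented data; the variable names uerate and n are hypothetical names chosen for the example.

```stata
* Invented individual-level data: region code and an unemployment dummy
clear
input region unemp
1 0
1 1
1 1
2 0
2 0
2 1
2 0
end

* Keep a copy of the individual data before collapsing
preserve
collapse (mean) uerate = unemp (count) n = unemp, by(region)
list
* Bring back the original 7 individual records
restore
```

The collapsed dataset has one row per region, with the mean of unemp (2/3 for region 1, 0.25 for region 2) and the number of individuals behind each mean.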

Reshaping files
• Data may be “long” but thin
• E.g. each record is a household member
• But there are few vars, say wage and hours
• Data may be “wide” but short
• each record is a household and has lots of vars
• (e.g. w1 w2 w3 hours1 hours2 hours3)
    . reshape long inc ue, i(id) j(year)     /* wide to long */
    . reshape wide inc ue, i(id) j(year)     /* back to wide */
• Handy for merging data together and for panel data
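A small round trip shows what the two reshape commands do. The data below are invented; the stub names inc and ue follow the slide, and the year suffixes 2001 and 2002 are assumptions made for the example.

```stata
* Wide data: one row per id, one column per variable-year
clear
input id inc2001 inc2002 ue2001 ue2002
1 100 110 0 1
2 200 190 1 0
end

* Wide to long: one row per id-year, with columns id year inc ue
reshape long inc ue, i(id) j(year)
list

* Back to wide: one row per id again
reshape wide inc ue, i(id) j(year)
list
```

The long form has four rows (two ids times two years), and the second reshape returns the original two rows.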

Syntax to remember
• >= means “greater than or equal to”, & means “and”, | means “or”
• = means “set equal to”
• == means “is it equal to?”
• ~= means “not equal to” (or use !=)
• . means missing value
• For example
    . keep if x1>=1 & x1<=3 | x1==7 & x1~=.
    . gen x = log(y)
    . reg y x if z==1 & y!=.
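Two details of these operators are easy to miss: & is evaluated before |, and STATA treats the missing value . as larger than any number, so a condition such as x1 >= 1 is also true for missing observations. The sketch below illustrates both with invented data.

```stata
clear
input x1
0
2
7
.
end

* & binds tighter than |, so the parentheses only make the grouping explicit
count if (x1 >= 1 & x1 <= 3) | (x1 == 7 & x1 != .)

* Missing counts as larger than any number, so the missing row is included here
count if x1 >= 1

* Excluding missing values explicitly
count if x1 >= 1 & !missing(x1)
```

The three counts are 2, 3 and 2: the middle one picks up the missing row.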

Using STATA as a calculator
• display command (can be abbreviated to dis, disp or di)
    . dis 22/7
    . disp log(250)
    . di exp(3.6)
    . di chiprob(2,6.45)     /* i.e. 2 df, deviance 6.45; called chi2tail in newer versions */
• returns 0.0398 (i.e. it’s significant at the 5% level)
    . display _N
• Returns the sample size
• (_N is the number of the last obs)
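The p-value above can be checked by hand, because with 2 degrees of freedom the upper tail of the chi-squared distribution is exp(-x/2). The sketch assumes a version of STATA that has chi2tail() and the auto.dta example dataset installed.

```stata
* Upper-tail chi-squared probability, 2 df, statistic 6.45
display chi2tail(2, 6.45)

* Closed form for 2 df: both lines print roughly 0.0398
display exp(-6.45/2)

* _N needs a dataset in memory; auto.dta has 74 observations
sysuse auto, clear
display _N
```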

Using the data editor
• Open a datafile (e.g. auto.dta)
• Click on the Data Editor icon
• Or type
    . edit
• You can edit datapoints!
• Or just browse the datafile

STATA’s editor
• STATA has an editor that allows you to create do files
• Enter commands, one per line
• Save the commands in a “do” file
• Highlight commands and click the button showing a page (or page with text) and a down arrow to “run” (or “do”) them.

Saving output
• Output scrolls off the Results window, so it is best to open a “log file” (and close it when you have finished).
• Click on File, Log, Begin.
• Or type
    . log using myoutput
  then type some commands, and then
    . log close
• The log command allows the replace and append options
• Default is a .smcl file extension (use “view” to read it)
• It doesn’t save graphs
• Copy graphs (use cut and paste or use menus)

Saving commands
• You might prefer your own extension, say .log
• then you get an ASCII file that any editor can open
• you can translate files to and from smcl format
• click on File, Log, Translate and fill in the dialog box
• Logging your output is a good way of developing a .do file
• since it saves the commands as well as the output
• Or you can just log the commands
• type
    . cmdlog using xxx
• You can turn logging off and back on
    . log off
  then
    . log on
  when ready to resume
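Put together, a logged session might look like the sketch below. The file name myoutput comes from the slides; the auto.dta commands in between are placeholders chosen only so that there is something to log.

```stata
* Close any log left open by an earlier run; capture hides the error if none is open
capture log close

* text asks for a plain ASCII log, replace overwrites an existing file
log using myoutput.log, text replace

sysuse auto, clear
summarize price mpg

* Output between log off and log on is not written to the file
log off
describe
log on

regress price mpg
log close
```

Starting with `capture log close` is a common precaution, since `log using` fails if a log is already open.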

Useful tips for .do files
    . #delimit ;                        /* makes ; the end-of-line character */
    . set more off ;
    . set mem 200m ;                    /* must come before any data are loaded */
    . set matsize 200 ;
    . log using xxxx.log, replace ;
    . use mydata, clear ;
    .
    .
    . log close ;
    . exit, clear ;
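The main payoff of #delimit ; is that one command can run over several lines. A minimal sketch, using the auto.dta example dataset as a stand-in for your own data; note that #delimit only works inside a do-file, not when typed at the command prompt.

```stata
#delimit ;

sysuse auto, clear ;

* One command spread over three lines, ended by the semicolon ;
regress price
    mpg weight
    length ;

#delimit cr

* From here on each line is one command again
display "back to one command per line"
```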

Handling string variables: encode
• Use encode when the original var is a character var (e.g. gender is "m" or "f")
• encode does not produce dummy variables; it just assigns a number to each group defined by the character variable.
• In this example, gender is the original character var and sex is the new numeric var:
    . encode gender, gen(sex)
• decode does the opposite
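A short sketch of what encode actually produces, using invented data. The variable names gender and sex follow the slide; female and gender2 are hypothetical names added for the example.

```stata
clear
input str1 gender
"m"
"f"
"f"
"m"
end

* Codes are assigned in alphabetical order: f = 1, m = 2
encode gender, gen(sex)
label list sex
tabulate sex, nolabel

* encode does not make a dummy; if you want a 0/1 variable, build it directly
gen female = (gender == "f")

* decode goes back from the labelled numeric variable to a string
decode sex, gen(gender2)
```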

Extended generate: egen
• Useful when you need a new variable that is the mean, median, etc. of another variable
• for all observations or for groups of observations.
• Also useful when you simply need to number groups of observations based on some classification variables.
• Great when you have panel data

egen examples
    . egen sumvar1 = sum(var1)                 /* creates sumvar1 as the sum of all values of var1 */
    . egen meanvar1 = mean(var1), by(var3)     /* creates meanvar1 as the mean of var1 within each var3 group */
    . egen counter = count(id), by(company)    /* creates counter as the number of nonmissing ids within each company */
    . egen groupid = group(month year)         /* assigns a number to each month/year group */
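The examples above can be checked on a few invented rows. This sketch assumes a STATA version where the egen sum function is called total(), and it groups by company throughout, so the by-variable differs from the slide's var3.

```stata
* Invented data; the third row has a missing id
clear
input id company var1
1 1 10
2 1 20
. 1 30
4 2 40
5 2 50
end

* Same value (150) on every row
egen sumvar1 = total(var1)

* Group means: 20 for company 1, 45 for company 2
bysort company: egen meanvar1 = mean(var1)

* Nonmissing ids per company: 2 and 2 (the missing id is not counted)
bysort company: egen counter = count(id)

* Consecutive group numbers 1, 2, ...
egen groupid = group(company)
list
```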

Saving typing - 1: macros
• For defining lists of vars (globally or locally).
    . local macvar x1 x2 x3 x4 x5 x6
    . reg y1 `macvar'
    . reg y2 `macvar'
    . reg y3 `macvar'
    . reg y4 `macvar'
• macvar becomes the string "x1 x2 x3 x4 x5 x6"
• Pay careful attention to the two different quotation marks: a left quote (`) opens the macro name and an apostrophe (') closes it
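A runnable version of the same idea, using the auto.dta example dataset in place of the slide's y and x variables. The macro names rhs and controls are hypothetical. Run the block as a whole: a local macro disappears once the do-file (or the highlighted selection) that defined it has finished.

```stata
sysuse auto, clear

* Local macro: referenced with a left quote and an apostrophe
local rhs mpg weight length
regress price `rhs'
regress gear_ratio `rhs'

* Global macro: referenced with a dollar sign, and it persists for the session
global controls "mpg weight"
regress price $controls

* Show what the local expands to
display "`rhs'"
```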

Saving typing - 2: for
• Performs the same command on several vars.
• It can use several types of variable lists
    . for var1-var25: replace @=. if @==99          /* replaces 99 by missing */
    . for var*: replace @=. if @==99                /* replaces 99 by missing */
    . for 1-3, ltype(numeric): gen q@=0             /* creates q1=0, q2=0, q3=0 */
    . for a b c, ltype(any): gen str2 @="x"         /* creates a="x", b="x", c="x" */
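The for syntax shown above dates from early STATA releases. In later versions the same loops are usually written with foreach and forvalues. The sketch below assumes such a version and uses three invented variables rather than var1-var25.

```stata
clear
input var1 var2 var3
1 99 3
99 2 99
end

* Replace 99 by missing in every variable of the list
foreach v of varlist var1-var3 {
    replace `v' = . if `v' == 99
}

* Create q1, q2, q3, all equal to 0
forvalues i = 1/3 {
    gen q`i' = 0
}

* Create string variables a, b, c, all equal to "x"
foreach s in a b c {
    gen str2 `s' = "x"
}
list
```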

Regression models - I
• Linear regression and related models when the outcome variable is continuous
• OLS, 2SLS, 3SLS, IV, quantile reg, Box-Cox …
• Binary outcome data
• the outcome variable is 0 or 1 (or y/n)
• probit, logit, nested logit ...
• Multiple outcome data
• the outcome variable is 1, 2, ...
• conditional logit, ordered probit

Regression models - II
• Count data
• the outcome variable is 0, 1, 2, ... occurrences
• Poisson regression, negative binomial
• Choice models
• multinomial choice: A, B or C
• Multinomial logit, random utility model, unordered probit, nested logit, etc.
• Selection models
• Truncated, censored
• Tobit, Heckman selection models
• linear regression or probit with selection

Regression models - III
• STATA supports several special data types.
• Once the type is defined, special commands work
• Time series
• Estimate ARIMA and ARCH models
• Estimators for autocorrelation and heteroscedasticity
• Estimate MA and other smoothers
• Tests for autocorrelation, heteroscedasticity and unit roots: h, d, LM, Q, ADF, P-P …
• TS graphs

Special data types: survey
• Non-randomness induces OLS to be inefficient
• STATA can handle non-random survey data
• see the “svy***” commands
• Example (stratified sample of medical cases):
    . webuse nhanes2f, clear
    . svyset psuid [pweight=finalwgt], strata(stratid)
    . svy: reg zinc age age2 weight female black orace rural     /* uses the survey design */
    . reg zinc age age2 weight female black orace rural          /* plain OLS, for comparison */

Special data types: duration
• Survival time data
• See the “st***” commands
    . stset failtime     /* sets the var that defines duration */
• Estimates a wide variety of models to explain duration

ST regression
• Supports Weibull, Cox PH and other options
    . streg load bearings, distribution(weibull)
• After streg you can plot the estimated cumulative hazard with
    . stcurve, cumhaz
• STATA allows functions to be plotted by specifying the function
• E.g. Weibull “hazard” model – Weibull example ….
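The commands above can be run end to end if a suitable dataset is available. The sketch below assumes that the slide's variables (failtime, load, bearings) come from the kva example dataset distributed with STATA's documentation and that `webuse` can reach it over the internet; both are assumptions, so substitute your own data if they do not hold.

```stata
* Assumed source of the slide's variables: the kva documentation dataset
webuse kva, clear

* Declare failtime as the duration; every unit is treated as having failed
stset failtime

* Weibull model in the proportional-hazards form h(t) = p*exp(xb)*t^(p-1)
streg load bearings, distribution(weibull)

* Cumulative hazard, then the hazard itself, at the covariate means
stcurve, cumhaz
stcurve, hazard
```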

Special data types: Panel data
• STATA can handle “panel” data easily
• see the “xt***” commands
• Common commands are
    . xtdes      /* describe pattern of xt data */
    . xtsum      /* summarize xt data */
    . xttab      /* tabulate xt data */
    . xtline     /* line plots with xt data */
    . xtreg      /* fixed and random effects */

Panel data
• An xt dataset looks like this:

     pid   yr_visit    fev   age   sex   height   smokes
    -----------------------------------------------------
    1071       1991   1.21    25     1       69        0
    1071       1992   1.52    26     1       69        0
    1071       1993   1.32    28     1       68        0
    1072       1991   1.33    18     1       71        1
    1072       1992   1.18    20     1       71        1
    1072       1993   1.19    21     1       71        0

• xt*** commands need variables that identify the person and the “wave”:
    . iis pid
    . tis yr_visit
• Or use the tsset command
    . tsset pid yr_visit, yearly

Panel regression
• Once STATA has been told how to read the data it can perform regressions quite quickly:
    . xtreg y x, fe     /* fixed effects */
    . xtreg y x, re     /* random effects */
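The six rows from the example panel are enough to try the xt commands. The sketch below types them in with input and declares the panel with xtset, which later STATA versions provide in place of iis and tis (an assumption about your version). With only two people the regression is purely illustrative.

```stata
clear
input pid yr_visit fev age sex height smokes
1071 1991 1.21 25 1 69 0
1071 1992 1.52 26 1 69 0
1071 1993 1.32 28 1 68 0
1072 1991 1.33 18 1 71 1
1072 1992 1.18 20 1 71 1
1072 1993 1.19 21 1 71 0
end

* Declare the person identifier and the wave variable
xtset pid yr_visit, yearly

* Describe the panel pattern and split variation into between and within
xtdescribe
xtsum fev age

* Fixed-effects (within) regression of fev on age
xtreg fev age, fe
```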

Further advice
• See Stephen Jenkins’ excellent course on duration modelling in STATA
• and Steve Pudney’s excellent panel course
• Beware: his example dataset is 30MB+
• To get up and running
• Just have a go - you won’t break it!
• Try some of the commands in this lecture
• To start to get proficient
• Sign up for netcourses
