STATA Tutorial- MFE

1. STATA Tutorial- MFE
Elena Capatina, elena.statahelp@gmail.com
Office hours: Mondays 2-4pm, GE313; Wednesdays 2-4pm, GE213 (starting Feb 25th)

2. Stata 10: How to get it Buy from Stata website, pick up at Robarts: http://www.stata.com/order/new/edu/gradplans/cgpcampus-order.html or http://www.utoronto.ca/ic/software/detail/stata.html

3. Finding Data
http://datacentre.chass.utoronto.ca/
Some data can only be accessed from campus, and some require a visit in person to the data centre.
Off-campus access: use UTORvpn (download from the library website).
Data library: 5th floor, Robarts

4. Data on CHASS
Canadian data: CANSIM, Census Analyser
International trade and finance data: Trade Analyser (by partner and commodity, for imports (M) and exports (X)), IMF trade data

5. Data on CHASS
Financial markets data:
CRSP Database: NYSE/AMEX/Nasdaq daily and monthly security prices and other historical data for over 20,000 companies
Canadian Financial Markets Research Centre: Toronto Stock Exchange trading information about specific securities
Fundata Mutual Fund Database

6. Data on CHASS
Company financial data:
Financial Post Corporate Database
COMPUSTAT Database: Income Statement, Balance Sheet, Flow of Funds, and supplemental data items on more than 10,000 active and 9,400 inactive companies
National income statistics:
OECD National Accounts Database
World Bank databases
Penn World Tables

7. STATA windows
The command window
The viewer/results window
The review of commands window
The variable window

8. Working with STATA
From the command window
Using a "do" file

9. The "do" file
A text file that can be edited using any text editor (the STATA do-file editor, Notepad, Word, etc.), but you need to save it as "filename.do" for STATA to read it.
From the STATA do-file editor, click "do" for STATA to execute all commands.
You can also highlight some command lines and click "do" to execute only the highlighted lines.

10. Data editor/data browser Shows you your data Check this frequently, especially after commands you are unsure about

11. Type of commands 1. Administrative commands that tell STATA where to save results, how to manage computer memory, and so forth 2. Commands that tell STATA to read and manage datasets 3. Commands that tell STATA to modify existing variables or to create new variables 4. Commands that tell STATA to carry out the statistical analysis

12. Example: "stata1.do"
    clear
    log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace
    use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta"
    describe
    generate income = avginc*1000
    summarize income
    log close
    exit

13. The "log using" command
The log file is an "output file": STATA creates and saves a log with all the actions it performs and all the results.
How to view it later? In Stata, go to "File", then "Log", then "View", and search for your filename, keeping in mind it has the extension ".log".

14. Loading your data
If your data is in STATA format, e.g. "filename.dta", then enter: use "filename.dta"
If your data is a comma-delimited file: insheet using "filename.txt"
For other formats, you can use "StatTransfer" to convert to STATA format.

15. Using a dictionary file
A dictionary file tells STATA how to read raw data files with the extension ".raw" (or ".dat").
"infile" command, e.g. infile using "dictionaryname.dct"
"infix" command: use it if you prefer to copy and paste your dictionary into a do file.

16. Useful commands: "describe"
STATA lists all the variables, their labels and types, and tells you the number of observations.
Two types of variables:
Numerical
String (usually shown in red in the data browser)
You can convert a string variable to numerical using the "destring" command, e.g. "destring var1, replace" or "destring var1, force replace" (with "force", any value that is not a number becomes missing).
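
To see what "force" does in practice, here is a minimal sketch on made-up data (the variable var1 and its values are invented for illustration, and var1_num is a hypothetical name). It is meant to be run from a do file, since it uses // comments.

```stata
* Toy data: a numeric-looking column that was read in as a string
clear
input str6 var1
"12"
"7.5"
"n/a"
"40"
end

describe var1                  // storage type is str6

* Without force, destring leaves var1 unconverted because "n/a" is not a number
capture noisily destring var1, generate(var1_num)

* With force, convertible values become numbers and "n/a" becomes missing (.)
destring var1, generate(var1_num) force
list var1 var1_num
count if missing(var1_num)
```

Using generate() instead of replace keeps the original string next to the converted values, which makes it easier to check what was turned into missing.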

17. More commands: "generate" or "gen"
Creates a new variable.
    e.g. generate income = avginc*1000
    e.g. generate log_inc = log(income)
    e.g. gen inc_sq = (income)^2

18. More commands: "summarize"
Tells STATA to compute summary statistics (mean, standard deviation, and so forth). With no variable list it covers all variables.
Useful to identify outliers and get an idea of your data.
    e.g. summarize
    e.g. summ income inc_sq

19. Ending the do file log close closes the file stata1.log that contains the output. The command exit tells STATA that the program has ended.

20. Example: stata2.do
    #delimit ;
    * Administrative Commands;
    set more off;
    clear;
    log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace;
    * Read in the Dataset;
    use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta";
    describe;
    * Transform data and Create New Variables;
    **** Construct average district income in $'s;
    generate income = avginc*1000;
    * Carry Out Statistical Analysis;
    ***** Summary Statistics for Income;
    summarize income;
    * End the Program ;
    log close;
    exit;

21. Comments in your do file
Asterisk: STATA ignores (does not execute) a line that starts with *. Use these lines to describe what the commands are doing or to write notes to yourself.
    e.g. * Administrative Commands

22. Useful commands
#delimit ;
Tells STATA that each STATA command ends with a semicolon. Useful for long commands that span several lines.
Do not forget the ";" at the end of every command, including the comment lines that start with *.
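
A minimal sketch of the semicolon delimiter in use, assuming the example dataset auto that ships with Stata (it is not part of this tutorial's data). "#delimit cr" switches back to one command per line; both forms only work inside a do file.

```stata
sysuse auto, clear

#delimit ;
* With the semicolon delimiter a command can span several lines;
regress price mpg weight
    length foreign ;
#delimit cr

* Back to the default: each line is one command
summarize price
```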

23. Useful commands
set more off
Lets STATA run all commands without pausing. Otherwise, when the output fills the results window, STATA displays --more-- at the bottom and waits for a key press before continuing, so a long do file does not run through unattended.

24. Increasing memory
"set memory 600m"
You may also need to increase the number of variables allowed by Stata ("set maxvar"); this cannot be done with IC Stata.

25. My typical admin commands
    clear
    #delimit ;
    set more off;
    set mem 1200000;
    cap log close;
    cd "C:\NLSY_1\mainfiles\";
    log using "Logs\Calibration_statistics.log", replace;

26. Other commands
tabulate
    e.g. tabulate county
Shows the frequency and percent of each value of "county" in the dataset.

27. The "if" qualifier
    e.g. generate teachers_new = teachers if teachers<=10
         replace teachers_new = 0 if teachers>10
    e.g. summarize teachers if county=="Nevada"
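
One thing to watch with "if": STATA treats a missing numeric value as larger than any number, so a condition such as teachers>10 is also true for observations where teachers is missing. The sketch below illustrates this on the auto dataset that ships with Stata (an assumption, since the tutorial's own data are not reproduced here); rep78 has some missing values, and good is a hypothetical variable name.

```stata
sysuse auto, clear

count if rep78 > 3                      // also counts cars whose rep78 is missing
count if rep78 > 3 & !missing(rep78)    // counts only cars with a recorded value

* Safer recode: missing rep78 stays missing in the new variable
generate byte good = 0 if rep78 <= 3
replace good = 1 if rep78 > 3 & !missing(rep78)
tabulate good, missing
```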

28. Operators
    <    less than
    >    greater than
    <=   less than or equal to
    >=   greater than or equal to
    ==   equal to
    ~=   not equal to (!= also works)

29. Sorting the data
sort
    e.g. sort income
    e.g. sort county income

30. The "by" prefix
Repeats a command separately for each group.
    e.g. by county, sort: summarize income

31. Deleting variables and observations
drop
    e.g. drop avginc
    This drops the variable avginc.
    e.g. drop if teachers<=5
    This deletes only the observations for which teachers is less than or equal to 5.

32. Deleting variables and observations
keep
    e.g. keep if teachers>=7
    This keeps only the observations for which teachers is at least 7 and deletes the rest.

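33. Combining datasets
"merge" command (both datasets must be sorted by the merge variables):
    use "My Statistics\_respondent.dta", clear;
    sort ID;
    merge ID using "My Statistics\_annualfile.dta";
    * drop _merge so that the next merge can create it again;
    drop _merge;
    sort ID year;
    merge ID year using "temp1.dta";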
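
A self-contained sketch of a match merge using the same old-style syntax as this slide. The two small datasets, their variable names and values are invented for illustration. Stata 11 and later also offer the form "merge m:1 ID using ...", which does not require sorting.

```stata
* Hypothetical person-level file: one row per ID
clear
input ID female
1 0
2 1
3 1
end
sort ID
tempfile respondent
save `respondent'

* Hypothetical person-year file: several rows per ID
clear
input ID year income
1 2001 30
1 2002 32
2 2001 41
4 2001 25
end
sort ID

* Old-style match merge: master and using data are both sorted by ID
merge ID using `respondent'
tabulate _merge          // 1 = master only, 2 = using only, 3 = matched
keep if _merge == 3
drop _merge              // needed before another merge in the same do file
```

Tabulating _merge before dropping anything is a quick check on how many observations failed to match and from which file they came.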

34. Statistical relationships
Correlations: correlate
    e.g. correlate income teachers
    e.g. correlate income teachers computers
Regressions: reg
    e.g. reg income teachers
    e.g. reg income teachers computers

35. Graphs
Scatter plots
    e.g. twoway (scatter income computer)

36. Loops
"forvalues"
Generate 100 uniform random variables named x1, x2, ..., x100:
    forvalues i = 1(1)100 {
        generate x`i' = uniform()
    }
Divide a dataset into two datasets, one for each education level (written with "#delimit ;" in effect):
    forvalues e=1/2{;
        use "My Statistics\Maleyearly.dta", clear;
        keep if education==`e';
        save "My Statistics\Males_`e'.dta", replace;
    };
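
A small runnable version of the first loop, shortened to three variables and five observations so the result is easy to inspect. The loop counter is referenced as `i', with a backtick on the left and an apostrophe on the right. The range 1/3 means the same as 1(1)3.

```stata
clear
set obs 5                      // random numbers need observations to fill

forvalues i = 1/3 {
    generate x`i' = uniform()  // creates x1, x2, x3 in turn
}

describe x1-x3
list
```

The opening brace has to sit on the same line as forvalues, and the closing brace on a line of its own.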

37. Collapse command
Creates a new dataset that summarizes the current data by the specified variables.
    e.g. collapse (mean) no_kids, by(education age status);

38. Saving your data
Saving in Stata format: e.g. save "file name.dta"
You can export your data in another format from "File", then "Export", then choose the type of file you want.

39. More on data cleaning
"reshape": from long to wide or from wide to long
Example of wide data:

40. Reshape Example of long data:

41. Reshape
From wide to long: e.g. reshape long lrn, i(id) j(year)
From long to wide: e.g. reshape wide lrn, i(id) j(year)
Source: http://www.ats.ucla.edu/stat/stata/notes/reshape.htm
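
Since the example tables did not survive in this transcript, here is a minimal sketch with invented numbers. The variable stub lrn and the identifiers id and year follow the commands above; the values themselves are made up.

```stata
* Wide data: one row per id, one lrn column per year
clear
input id lrn1997 lrn1998 lrn1999
1 3.1 3.4 3.6
2 2.8 2.9 3.3
end
list

* Wide to long: one row per id-year, with the year taken from the column suffix
reshape long lrn, i(id) j(year)
list

* Long back to wide
reshape wide lrn, i(id) j(year)
list
```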

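42. "tabstat" command
    e.g. by ay: tabstat N if INC==2 & education1==1, s(n mean max min p50 p25 p75);
Reports the requested statistics for N separately for each value of ay (the data must be sorted by ay).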

43. "egen" command
Extended "generate" command. More powerful than "generate".
Examples:
    egen age_mean = mean(age), by(year)
    egen totalsum = total(x)
    egen stdage = std(age)

44. Lagged variables
"[_n-1]" tells STATA this is the previous observation.
"[_n-2]" is 2 observations before.
Examples (assuming the data is sorted):
    gen GDP_lagged = GDP[_n-1]
    gen GDP_2 = GDP[_n-2]
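
When the data contain several units (countries, firms, people), a plain [_n-1] takes the last row of one unit as the "previous" value for the first row of the next. The sketch below shows two common ways around this, on invented data; GDP_lag_ts and cid are hypothetical names.

```stata
clear
input str3 country year GDP
"CAN" 2000 100
"CAN" 2001 104
"CAN" 2002 109
"USA" 2000 900
"USA" 2001 915
"USA" 2002 940
end

* Option 1: lag within each country, so the first year of each country is missing
sort country year
by country: generate GDP_lagged = GDP[_n-1]

* Option 2: declare the panel structure and use the lag operator L.
encode country, generate(cid)
tsset cid year
generate GDP_lag_ts = L.GDP

list country year GDP GDP_lagged GDP_lag_ts
```

The L. operator also handles gaps in the time variable correctly, which [_n-1] does not.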

45. Other uses for [_n-1]
Filling in missing data:
    e.g. by ID: replace education=1 if education[_n-1]==1 & education[_n+1]==1 & ID[_n-1]==ID[_n+1];

46. Shortcut: local
Example using "local":
    local t = 80;
    while `t'<=98{;
        use "Tagsets\status_`t'.dta", clear;
        do "rename_status.do";
        save "weeklyfile_`t'.dta";
        local t = `t'+2;
    };

47. Collapse command Initial data:

48. New data after collapse
    collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
Creates one record per family (famid) with the average age (called avgage) and average weight (called avgwt) within each family, and the number of kids (numkids) per family.
Source: http://www.ats.ucla.edu/stat/Stata/modules/collapse.htm
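
The before and after tables are missing from this transcript, so here is a minimal sketch with invented data that uses the same variable names as the command above.

```stata
* Hypothetical data: one row per child
clear
input famid age wt birth
1 9 60 1
1 6 40 2
1 3 20 3
2 8 80 1
2 6 50 2
end

collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
list
* famid 1: avgage 6, avgwt 40, numkids 3
* famid 2: avgage 7, avgwt 65, numkids 2
```

Note that collapse replaces the data in memory, so the child-level rows are gone afterwards unless they were saved or preserved first.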

49. "preserve" and "restore"
"preserve" tells STATA to keep a copy of your data, so if your next commands modify it, "restore" brings back your original data.
Example:
    use "data1.dta"
    preserve
    collapse (mean) age, by(family)
    save "data2.dta"
    restore
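
A runnable sketch of the same pattern, assuming the auto dataset that ships with Stata instead of "data1.dta".

```stata
sysuse auto, clear
describe, short            // 74 observations

preserve
collapse (mean) price, by(foreign)
list                       // two rows: mean price for domestic and foreign cars
restore

describe, short            // the original 74 observations are back
```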

50. Tests of significance
    e.g. ttest sysbp = 122.3, level(95)
Computes the sample mean and standard deviation of the variable sysbp, runs a t-test of the hypothesis that the population mean is equal to 122.3, and computes a 95% confidence interval for the population mean.
Source: mhtml:http://www.biostat-edu.com/files/Stata_Program_Notes_Chapter_8-posted.mht

51. STATA output
. ttest sysbp = 122.3;

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   sysbp |     199    125.8241    1.288642    18.17853    123.2829    128.3653
------------------------------------------------------------------------------
    mean = mean(sysbp)                                            t =   2.7348
Ho: mean = 122.3                                 degrees of freedom =      198

    Ha: mean < 122.3           Ha: mean != 122.3           Ha: mean > 122.3
 Pr(T < t) = 0.9966      Pr(|T| > |t|) = 0.0068       Pr(T > t) = 0.0034
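
As a check on how the numbers in this output fit together, the standard error and the t statistic can be recomputed by hand from the reported mean, standard deviation and sample size:

```latex
\[
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{18.17853}{\sqrt{199}} \approx 1.2886,
\qquad
t = \frac{\bar{x} - \mu_0}{\mathrm{SE}(\bar{x})}
  = \frac{125.8241 - 122.3}{1.288642} \approx 2.735
\]
```

The two-sided p-value of 0.0068 is below 0.05, which matches the fact that the 95% confidence interval (123.28 to 128.37) does not contain 122.3.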

52. Testing if means are equal
    ttest testscr_lo=testscr_hi, unequal unpaired
Tests the hypothesis that testscr_lo and testscr_hi come from populations with the same mean, by computing the t-statistic for the null hypothesis that the two means are equal.
"unequal" tells STATA that the variances in the two populations may not be the same.
"unpaired" tells STATA that the observations are for different districts.

53. ttest testscr_lo=testscr_hi, unequal unpaired;

Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000

54. Simple regression
. regress science math female socst read

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   46.69
       Model |  9543.72074     4  2385.93019           Prob > F      =  0.0000
    Residual |  9963.77926   195  51.0963039           R-squared     =  0.4892
-------------+------------------------------           Adj R-squared =  0.4788
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1482

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
      female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
       socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
        read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
       _cons |   12.32529   3.193557     3.86   0.000     6.026943    18.62364
------------------------------------------------------------------------------
Source: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm

55. Reading the output table
SSTotal: the total variability around the mean, $\sum (Y - \bar{Y})^2$.
SSResidual: the sum of squared errors, $\sum (Y - Y_{predicted})^2$.
SSModel: SSTotal - SSResidual.
Note that SSModel / SSTotal is equal to .4892, the value of R-squared (the proportion of the variance explained by the independent variables).

56. Reading the output table
df: the degrees of freedom associated with the sources of variance.
MS: the Mean Squares, the Sums of Squares divided by their respective df.

57. Reading the output table
Coefficients: sciencePredicted = 12.32529 + .3893102*math - 2.009765*female + .0498443*socst + .3352998*read
t and P>|t|: the t-value and 2-tailed p-value used in testing the null hypothesis that the coefficient (parameter) is 0.
[95% Conf. Interval]: a 95% confidence interval for the coefficient (the coefficient is not statistically significant at the 5% level if the confidence interval includes 0).
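
The relationships between the columns can be verified with STATA's "display" command, used here as a calculator. This is only a sketch: all numbers are copied from the regression table above, and the lines are meant to be run from a do file because of the // comments.

```stata
* ANOVA block
display 9543.72074 / 19507.5                    // R-squared = SSModel / SSTotal
display 9543.72074 / 4                          // MS Model = SS / df
display (9543.72074/4) / (9963.77926/195)       // F = MS Model / MS Residual

* Coefficient on math
display 0.3893102 / 0.0741243                   // t = Coef. / Std. Err.
display 2 * ttail(195, abs(0.3893102/0.0741243))        // two-tailed p-value
display 0.3893102 - invttail(195, 0.025) * 0.0741243    // lower 95% bound
display 0.3893102 + invttail(195, 0.025) * 0.0741243    // upper 95% bound
```

The 195 in ttail() and invttail() is the residual degrees of freedom from the df column.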

58. Predicted values
After the regression, type "predict yhat".
Creates a new variable "yhat" with the predicted values for the dependent variable.

59. Saving the residuals
    predict r, residuals
Checking homoskedasticity of residuals:
    e.g. "rvfplot, yline(0)" plots the residuals against the predicted values
    e.g. "estat imtest" (White test)
    e.g. "estat hettest" (Breusch-Pagan test)
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
http://www.nd.edu/~rwilliam/stats2/l25.pdf

60. Linear regression with heteroskedastic errors
Need robust standard errors (Huber-White): use the "robust" option with "regress".
. reg teachers meal_pct expn_stu, robust

Linear regression                                      Number of obs =     420
                                                       F(  2,   417) =    9.58
                                                       Prob > F      =  0.0001
                                                       R-squared     =  0.0232
                                                       Root MSE      =  186.17

------------------------------------------------------------------------------
             |               Robust
    teachers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    meal_pct |   .8239426    .336556     2.45   0.015     .1623848      1.4855
    expn_stu |   -.026066   .0098221    -2.65   0.008     -.045373   -.0067591
       _cons |   230.7061   59.44205     3.88   0.000     113.8627    347.5496
------------------------------------------------------------------------------
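
A sketch of the usual workflow (fit, test for heteroskedasticity, then refit with robust standard errors), assuming the auto dataset that ships with Stata rather than the school data used on this slide.

```stata
sysuse auto, clear

* Ordinary standard errors
regress price mpg weight

* Breusch-Pagan test; a small p-value points to heteroskedasticity
estat hettest

* Same coefficients, Huber-White standard errors
regress price mpg weight, robust
```

Comparing the two regression tables shows that the "robust" option leaves the coefficient column unchanged and alters only the standard errors, t statistics, p-values and confidence intervals.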

61. Note: rreg, robust regression (outliers)
    e.g. rreg inc school exp
This command is for robust regression. It concerns point estimates more than standard errors, and it implements a data-dependent method for downweighting outliers.
Do not use it to deal with heteroskedastic errors: it is not the same as the "robust" option.

62. "cluster" option
For example, you might think that in a panel of countries, errors are correlated across time but independent across countries. Then you should cluster standard errors on countries.
    e.g. regress y k, cluster(country)

63. Linear regression with panel data
Declaring the data to be a panel. Example, where the data consists of many firms, each observed over 5 years:
    iis Firm;
    tis Year;
"xt" is the prefix for the commands in this class.
"xtreg" should be used for regressions with panel data.
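
A minimal sketch of declaring a panel, on an invented firm-year dataset (the values are made up; Firm, Year, lnc and lny follow the names used on these slides). It uses "xtset", which recent Stata versions provide as a single-command alternative to "iis" and "tis"; treat that as an assumption about your version.

```stata
* Hypothetical panel: 3 firms observed over 3 years
clear
input Firm Year lnc lny
1 2001 2.0 3.0
1 2002 2.2 3.3
1 2003 2.3 3.4
2 2001 1.5 2.4
2 2002 1.8 2.9
2 2003 1.9 3.0
3 2001 2.6 3.8
3 2002 2.7 3.9
3 2003 3.0 4.3
end

xtset Firm Year            // panel variable first, then time variable
xtreg lnc lny, fe          // fixed-effects regression on the declared panel
```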

64. Fixed effects: $y_{it} = a + x_{it} b + v_i + e_{it}$
    e.g. xtreg lnc lny, fe
Equivalent to including a dummy variable for each case (e.g. each firm).

65. Random effects (RE)
If you think some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, then you can include both types by using random effects.
Stata's RE estimator is a weighted average of fixed and between effects.
    e.g. xtreg lnc lny, re

66. Choosing Between Fixed and Random Effects
Run a Hausman test: estimate the FE model, save the coefficients, estimate the RE model, and then do the comparison.
Example:
    xtreg dependentvar var1 var2 var3 ... , fe
    estimates store fixed
    xtreg dependentvar var1 var2 var3 ... , re
    estimates store random
    hausman fixed random
If the p-value is significant, use FE.
Source: http://dss.princeton.edu/online_help/analysis/panel.htm

67. Time series data
tsset: declare data to be time-series data
Examples:
    "tsset time, yearly" (for an annual time series, time takes on values such as 1990, 1991, ...)
    "tsset company year, yearly" (for yearly panel data, with company as the panel ID variable and year as a four-digit calendar year)

68. Serial correlation in residuals
Testing for first order serial correlation (Durbin-Watson statistic):
    reg col25 col2 col3 col7 if country=="Mexico"
    estat dwatson
Testing for higher order serial correlation (Breusch-Godfrey statistic):
    estat bgodfrey
Both tests require the data to be declared as time series with tsset first.

69. Useful links
http://www.iies.su.se/~masa/stata.htm (contains links to other STATA websites by topic)
http://www.princeton.edu/~erp/stata/main.html
http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm

70. GETTING MORE INFORMATION ABOUT STATA
The Help menu in STATA: STATA has detailed help files available for all STATA commands.
STATA commands are described in detail in the STATA User's Guide and Reference Manual. See www.stata.com.
Finally, you can find several good STATA tutorials on the Web. An easy way to find a list is to do a Google search for "Stata tutorial".
(This tutorial was prepared using information from "STATA Tutorial to accompany Stock/Watson Introduction to Econometrics", Pearson 2003.)
