1. STATA Tutorial- MFE Elena Capatina
elena.statahelp@gmail.com
Office hours: Mondays 2-4pm, GE313
Wed 2-4pm, GE213 (starting Feb 25th)
2. Stata 10: How to get it Buy from Stata website, pick up at Robarts:
http://www.stata.com/order/new/edu/gradplans/cgpcampus-order.html
or
http://www.utoronto.ca/ic/software/detail/stata.html
3. Finding Data http://datacentre.chass.utoronto.ca/
For some data, need to access from campus, sometimes need to go in person to the datacentre
Access off-campus using UTORvpn (download from library website)
Data library – 5th floor, Robarts
4. Data on CHASS Canadian data:
CANSIM
Census Analyser
International Trade and Finance data:
Trade analyser: by partner, commodity, on imports (M) and exports (X)
IMF trade data
5. Data on CHASS Financial markets data:
CRSP Database - access NYSE/AMEX/Nasdaq daily and monthly security prices and other historical data related to over 20,000 companies
Canadian Financial Markets Research Centre – Toronto stock exchange trading info about specific securities
Fundata Mutual Fund Database
6. Data on CHASS Companies financial data:
Financial Post Corporate Database
COMPUSTAT Database - Income Statement, Balance Sheet, Flow of Funds, and supplemental data items on more than 10,000 active and 9,400 inactive companies
National income statistics:
OECD National Accounts Database
World Bank databases
Penn World Tables
7. STATA windows The command window
The viewer/results window
The review of commands window
The variable window
8. Working with STATA From the command window
Using a “do” file
9. The “do” file A text file that can be edited using any text editor (the STATA do-file editor, Notepad, Word, etc.), but you need to save it as “filename.do” for STATA to read it
From the STATA do-file editor, click “do” for STATA to execute all commands
Can highlight and click “do” to execute only the highlighted command lines
10. Data editor/data browser Shows you your data
Check this frequently, especially after commands you are unsure about
11. Type of commands 1. Administrative commands that tell STATA where to save results, how to manage computer memory, and so forth
2. Commands that tell STATA to read and manage datasets
3. Commands that tell STATA to modify existing variables or to create new variables
4. Commands that tell STATA to carry out the statistical analysis
12. Example: “stata1.do” clear
log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace
use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta"
describe
generate income = avginc*1000
summarize income
log close
exit
13. The “log using” command The log file is an “output file”
Creates and saves a log with all the actions performed by STATA and all the results
How to view it later?
In Stata, go to “File”, then “log”, “view”, and search for your filename, keeping in mind it has extension “.log”
14. Loading your data If your data is in STATA format, i.e. “filename.dta”, then enter:
use "filename.dta"
If your data is a comma-delimited file:
insheet using "filename.txt"
For other formats, you can use “StatTransfer” to convert to STATA format
15. Using a dictionary file A dictionary file tells STATA how to read raw data files with extension “.raw” (“.dat” too)
“infile” command
e.g. infile using "dictionaryname.dct"
“infix” command
Use this if you prefer to copy and paste your dictionary into a do file
16. Useful Commands: “describe”:
STATA will list all the variables, their labels, types, and tell you the # of observations
Two types of variables:
Numerical
String (usually appear in red in the data browser)
You can convert a string variable to numerical using the “destring” command, e.g. “destring var1, replace” or “destring var1, force replace” (“force” turns any non-numeric entries into missing values)
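A minimal sketch of “destring” with the “force” option. The three-row dataset and the variable names price_txt and price are made up for illustration; they are not part of caschool.dta.

```stata
* Hypothetical data: price_txt is a string variable with one non-numeric entry
clear
input str5 price_txt
"100"
"250"
"n/a"
end
describe price_txt
* force turns the non-numeric entry into a missing value instead of refusing to convert
destring price_txt, generate(price) force
list
```

Using generate() instead of replace keeps the original string variable, so you can check which entries became missing.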
17. More commands: “generate” or “gen”
Creates a new variable
i.e. generate income = avginc*1000
i.e. generate log_inc = log(income)
i.e. gen inc_sq = (income)^2
18. More commands: “summarize”
tells STATA to compute summary statistics (mean, standard deviations, and so forth) for all variables
Useful to identify outliers and get an idea of your data
i.e. summarize
i.e. summ income inc_sq
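A short sketch of how “summarize” can help spot outliers. It assumes the auto dataset that ships with STATA (loaded with “sysuse”) rather than caschool.dta, and the 3-standard-deviation cutoff is only an illustrative rule of thumb.

```stata
sysuse auto, clear
* detail adds percentiles, skewness and the smallest/largest values
summarize price, detail
* summarize leaves its results in r(); they stay available until the next r-class command
display "mean = " r(mean) "  median = " r(p50)
* Flag observations more than 3 standard deviations away from the mean
generate byte outlier = abs(price - r(mean)) > 3*r(sd)
tabulate outlier
```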
19. Ending the do file log close closes the file stata1.log that contains the output.
The command exit tells STATA that the program has ended.
20. Example: stata2.do # delimit ;
* Administrative Commands;
set more off;
clear;
log using "C:\Users\Elena\Documents\Various\STATA-TA\stata1.log", replace;
* Read in the Dataset;
use "C:\Users\Elena\Documents\Various\STATA-TA\caschool.dta";
describe;
* Transform data and Create New Variables;
**** Construct average district income in $'s;
generate income = avginc*1000;
* Carry Out Statistical Analysis;
***** Summary Statistics for Income;
summarize
income;
* End the Program ;
log close;
exit;
21. Comments in your do file: Asterisk:
STATA ignores any line that begins with * (the line is not executed)
Use these lines to describe what the commands are doing or to write notes for yourself.
e.g. * Administrative Commands
22. Useful commands # delimit ;
tells STATA that each STATA command ends with a semicolon.
Useful for long commands
Do not forget the “;” at the end of every command, including the comment lines that start with *.
23. Useful Commands set more off
Tells STATA not to pause when the results window fills up. Otherwise, if the output is long, STATA displays --more-- at the bottom and waits for a key press before executing the remaining commands
24. Increasing memory “set memory 600m”
You may also need to increase the number of variables allowed by Stata (“set maxvar”): this cannot be done with IC Stata
25. My typical admin commands clear
#delimit ;
set more off;
set mem 1200000;
cap log close;
cd "C:\NLSY_1\mainfiles\";
log using "Logs\Calibration_statistics.log", replace;
26. Other commands tabulate
i.e. tabulate county
Shows the frequency and percent of each value of “county” in the dataset
27. The “if” qualifier e.g. generate teachers_new = teachers if teachers<=10
replace teachers_new = 0 if teachers>10
e.g. summarize teachers if county=="Nevada"
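One thing to watch with “if”: STATA treats a missing value as larger than any number, so a condition like teachers>10 is also true when teachers is missing. A minimal sketch on a made-up four-row dataset (the values are hypothetical):

```stata
clear
input teachers
5
10
25
.
end
* cond(test, a, b) does the generate/replace pair in one step
generate teachers_new = cond(teachers <= 10, teachers, 0)
* The missing row fails "teachers <= 10", so it was set to 0;
* put it back to missing explicitly
replace teachers_new = . if missing(teachers)
list
```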
28. Operators < less than
> greater than
<= less than or equal to
>= greater than or equal to
== equal to
~= not equal to (can also be written !=)
29. Sorting the data sort
i.e. sort income
i.e. sort county income
30. The “by” command i.e. by county, sort: summarize income
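A small sketch of “by” with the bundled auto dataset (an assumption; the slides use caschool.dta). “bysort” is shorthand for “by ..., sort:”, and inside a by-group _n is the row number and _N the group size.

```stata
sysuse auto, clear
* Same as: by foreign, sort: summarize price
bysort foreign: summarize price
* Sort by price within each group, then number the rows 1, 2, ...
bysort foreign (price): generate rank_in_group = _n
* Store the number of observations in each group
bysort foreign: generate group_size = _N
list make foreign price rank_in_group group_size in 1/5
```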
31. Deleting variables and observations drop
i.e. drop avginc
- this drops the variable avginc
e.g. drop if teachers<=5
- this deletes only the observations for which teachers is less than or equal to 5.
32. Deleting variables and observations keep
e.g. keep if teachers>=7
- this keeps only the observations for which teachers is at least 7 and deletes all the others.
33. Combining datasets “merge” command
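* Both files must be sorted by the merge variable(s);
use "My Statistics\_respondent.dta", clear;
sort ID;
merge ID using "My Statistics\_annualfile.dta";
* Drop _merge before merging again, otherwise STATA stops because _merge already exists;
drop _merge;
sort ID year;
merge ID year using "temp1.dta";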
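After a merge it is worth checking the _merge variable that STATA creates. A minimal sketch on two made-up files, using the same Stata 10 syntax as the slide (newer STATA versions prefer a “merge 1:1” form but, as far as I know, still accept this one). All names and values are hypothetical.

```stata
* First file: one row per person, saved sorted by id
clear
input id str5 name
1 "Ann"
2 "Bob"
3 "Cy"
end
sort id
tempfile people
save "`people'"

* Second file: wages, also sorted by id
clear
input id wage
1 20
2 25
4 30
end
sort id
merge id using "`people'"
* _merge==1: only in the data in memory, 2: only in the using file, 3: in both
tabulate _merge
list
```

Here id 4 has no name and id 3 has no wage, so only ids 1 and 2 get _merge==3.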
34. Statistical relationships Correlations:
correlate
e.g. correlate income teachers
e.g. correlate income teachers computer
Regressions:
reg
e.g. reg income teachers
e.g. reg income teachers computer
35. Graphs Scatter Plots
i.e. twoway (scatter income computer)
36. Loops “forvalues”
Generate 100 uniform random variables named x1, x2, ..., x100:
forvalues i = 1(1)100 {
generate x`i' = uniform()
}
Divide a dataset into two datasets, one for each education level:
forvalues e=1/2{;
use "My Statistics\Maleyearly.dta", clear;
keep if education==`e';
save "My Statistics\Males_`e'.dta", replace;
};
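“forvalues” loops over numbers; to loop over a list of variables, “foreach” is often handier. A sketch with the bundled auto dataset (an assumption, not the tutorial's data), written without “#delimit ;”:

```stata
sysuse auto, clear
* `v' takes the value price, then weight, then length
foreach v of varlist price weight length {
    generate log_`v' = log(`v')
    label variable log_`v' "log of `v'"
}
summarize log_*
```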
37. Collapse command Creates a new dataset with the specified variables summarizing current data
i.e. collapse (mean) no_kids, by (education age status);
38. Saving your data Saving in Stata format:
e.g. save "file name.dta"
You can export your data in another format from “File”, then “Export”, then choose the type of file you want.
39. More on data cleaning “reshape”
From long to wide or from wide to long
Example:
Wide data:
40. Reshape Example of long data:
41. Reshape From wide to long:
i.e. reshape long lrn, i(id) j(year)
From long to wide:
i.e. reshape wide lrn, i(id) j(year)
Source: http://www.ats.ucla.edu/stat/stata/notes/reshape.htm
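A tiny worked example of both directions. The variable name lrn follows the slide; the two ids, the years 2000 and 2001, and the values are made up.

```stata
* Wide data: one row per id, one lrn column per year
clear
input id lrn2000 lrn2001
1 5 6
2 7 8
end
list
* Wide to long: one row per id-year, with the year taken from the column suffix
reshape long lrn, i(id) j(year)
list
* Long back to wide
reshape wide lrn, i(id) j(year)
list
```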
42. “tabstat” command i.e. by ay: tabstat N if INC==2 & education1==1, s(n mean max min p50 p25 p75);
43. “egen” command Extended “generate” command.
More powerful than “generate”
Examples:
egen age_mean = mean(age), by(year)
egen totalsum = total(x)
egen stdage = std(age)
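A sketch of how “egen” differs from “generate” on a made-up three-row dataset: generate's sum() is a running total, while egen's total() and mean() put one summary number on every row (or on every row of a group with by()).

```stata
clear
input year age
1 20
1 30
2 40
end
* Group mean, repeated on each row of the group
egen age_mean = mean(age), by(year)
* Running total down the rows
generate running = sum(age)
* Grand total, identical on every row
egen totalsum = total(age)
list
```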
44. Lagged variables “[_n-1]” tells STATA this is the previous observation
“[_n-2]” is 2 observations before
Examples: (assuming data is sorted)
gen GDP_lagged= GDP[_n-1]
gen GDP_2= GDP[_n-2]
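With several countries stacked in one file, a plain [_n-1] would take the first year of one country's lag from the previous country. A sketch of two ways around this, on made-up data (id, year and gdp are hypothetical names):

```stata
clear
input id year gdp
1 2000 10
1 2001 12
2 2000 50
2 2001 55
end
* Lag within each id, after sorting by year inside the group
bysort id (year): generate gdp_lag = gdp[_n-1]
* Alternative: declare the panel and use the L. time-series operator
tsset id year
generate gdp_lag2 = L.gdp
list
```

The L. operator also respects gaps in the years, which [_n-1] does not.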
45. Other uses for [_n-1] Filling in missing data
i.e. by ID: replace education=1 if education[_n-1]==1 & education[_n+1]==1 & ID[_n-1]==ID[_n+1];
46. Shortcut: local Example using “local”:
local t = 80;
while `t'<=98{;
use "Tagsets\status_`t'.dta", clear;
do "rename_status.do";
save "weeklyfile_`t'.dta";
local t = `t'+2;
};
47. Collapse command Initial data:
48. New data after collapse collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
Creates one record per family (famid) with the average of age (called avgage) and average weight (called avgwt) within each family, and the number of kids (numkids) per family.
Source: http://www.ats.ucla.edu/stat/Stata/modules/collapse.htm
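The collapse above can be tried on a small made-up dataset. The variable names follow the command on the slide; the five rows of values are hypothetical, not the data from the source page.

```stata
* One row per child: family id, birth order, age, weight
clear
input famid birth age wt
1 1 9 60
1 2 6 40
2 1 8 50
2 2 6 42
2 3 2 20
end
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
* Now one row per family
list
```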
49. “preserve” and “restore” “preserve” tells STATA to save a temporary copy of the data in memory, so if your next commands modify it, “restore” brings you back to your original data
Example:
use "data1.dta"
preserve
collapse (mean) age, by(family)
save "data2.dta"
restore
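A runnable version of the same pattern, as a sketch: it assumes the bundled auto dataset and writes the collapsed data to a temporary file instead of “data2.dta”.

```stata
sysuse auto, clear
preserve
* The data in memory shrinks to one row per group
collapse (mean) price, by(foreign)
tempfile means
save "`means'"
restore
* Back to the original 74 observations
describe, short
```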
50. Tests of significance i.e. ttest sysbp = 122.3 , level(95)
computes the sample mean and standard deviation of the variable sysbp, computes a t-test that the population mean is equal to 122.3, and computes a 95% confidence interval for the population mean
Source: mhtml:http://www.biostat-edu.com/files/Stata_Program_Notes_Chapter_8-posted.mht
51. STATA output
. ttest sysbp = 122.3;
One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   sysbp |     199    125.8241    1.288642    18.17853    123.2829    128.3653
------------------------------------------------------------------------------
    mean = mean(sysbp)                                            t =   2.7348
Ho: mean = 122.3                                 degrees of freedom =      198
    Ha: mean < 122.3           Ha: mean != 122.3           Ha: mean > 122.3
 Pr(T < t) = 0.9966         Pr(|T| > |t|) = 0.0068          Pr(T > t) = 0.0034
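The t statistic in this output can be checked without the data: “ttesti” is the immediate form of ttest and takes the number of observations, mean, standard deviation and hypothesized value directly. A sketch using the rounded numbers from the output:

```stata
* ttesti #obs #mean #sd #hypothesized_value
ttesti 199 125.8241 18.17853 122.3
* By hand: (sample mean - hypothesized mean) / standard error of the mean
display (125.8241 - 122.3) / (18.17853 / sqrt(199))
* ttesti leaves the statistic in r(t) and the two-sided p-value in r(p)
display r(t)
display r(p)
```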
52. Testing if means are equal ttest testscr_lo=testscr_hi, unequal unpaired
Tests the hypothesis that testscr_lo and testscr_hi come from populations with the same mean.
Computes the t-statistic for the null hypothesis that the mean of testscr_lo is the same as the mean of testscr_hi.
unequal tells STATA that the variances in the two populations may not be the same.
unpaired tells STATA that the two variables hold observations for different districts, not matched pairs.
53. ttest testscr_lo=testscr_hi, unequal unpaired;
Two-sample t test with unequal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
testsc~o |     238    657.3513    1.254794    19.35801    654.8793    659.8232
testsc~i |     182    649.9788    1.323379    17.85336    647.3676    652.5901
---------+--------------------------------------------------------------------
combined |     420    654.1565    .9297082    19.05335    652.3291     655.984
---------+--------------------------------------------------------------------
    diff |             7.37241    1.823689                3.787296    10.95752
------------------------------------------------------------------------------
    diff = mean(testscr_lo) - mean(testscr_hi)                    t =   4.0426
Ho: diff = 0                     Satterthwaite's degrees of freedom =  403.607
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000
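The two-sample result can be reproduced from the summary numbers in the table. A sketch with “ttesti”, plus the same statistic by hand using the unequal-variance standard error sqrt(sd1^2/n1 + sd2^2/n2):

```stata
* ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2, unequal
ttesti 238 657.3513 19.35801 182 649.9788 17.85336, unequal
* Standard error of the difference when the variances are not assumed equal
scalar se_diff = sqrt(19.35801^2/238 + 17.85336^2/182)
display "se(diff) = " se_diff
display "t        = " (657.3513 - 649.9788) / se_diff
display "r(t)     = " r(t)
```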
54. Simple regression regress science math female socst read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   46.69
       Model |  9543.72074     4  2385.93019           Prob > F      =  0.0000
    Residual |  9963.77926   195  51.0963039           R-squared     =  0.4892
-------------+------------------------------           Adj R-squared =  0.4788
       Total |     19507.5   199  98.0276382           Root MSE      =  7.1482
------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
      female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
       socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
        read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
       _cons |   12.32529   3.193557     3.86   0.000     6.026943    18.62364
------------------------------------------------------------------------------
Source: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm
55. Reading the output table SSTotal -- The total variability around the mean: $\sum (Y - \bar{Y})^2$.
SSResidual -- The sum of squared errors: $\sum (Y - \hat{Y})^2$, where $\hat{Y}$ is the predicted value.
SSModel -- SSTotal - SSResidual.
Note that SSModel / SSTotal = 9543.72 / 19507.5 = .4892, the value of R-squared (= the proportion of the variance explained by the independent variables)
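The same identity can be checked on any regression from the results STATA stores in e(). A sketch with the bundled auto dataset (an assumption; the regression on the slide uses a different file):

```stata
sysuse auto, clear
regress price weight mpg
* e(mss) is the model sum of squares, e(rss) the residual sum of squares
display "SSModel / SSTotal = " e(mss) / (e(mss) + e(rss))
display "R-squared         = " e(r2)
```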
56. Reading the output table df - These are the degrees of freedom associated with the sources of variance.
MS - These are the Mean Squares, the Sum of Squares divided by their respective DF.
57. Reading the output table Coefficients:
sciencePredicted = 12.32529 + .3893102*math - 2.009765*female + .0498443*socst + .3352998*read
t and P>|t| - These columns provide the t-value and 2-tailed p-value used in testing the null hypothesis that the coefficient (parameter) is 0.
[95% Conf. Interval] - This shows a 95% confidence interval for the coefficient. (the coefficient is not statistically significant at the 5% level if the confidence interval includes 0)
58. Predicted values After the regression, type “predict yhat”
Creates a new variable “yhat” with the predicted values for the dependent variable
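A sketch showing that “predict” simply applies the estimated equation: the fitted values can be rebuilt by hand from the stored coefficients _b[]. It assumes the bundled auto dataset; yhat_manual and r are illustrative names.

```stata
sysuse auto, clear
regress price weight mpg
predict yhat
* The same fitted values, built from the stored coefficients
generate yhat_manual = _b[_cons] + _b[weight]*weight + _b[mpg]*mpg
* Residuals = actual minus predicted
predict r, residuals
summarize yhat yhat_manual r
```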
59. Saving the residuals predict r, residuals
Checking homoskedasticity of residuals
i.e. “rvfplot, yline(0)”
Plots the residuals against the predicted values
e.g. “estat imtest, white” (White test)
e.g. “estat hettest” (Breusch-Pagan test)
http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm
http://www.nd.edu/~rwilliam/stats2/l25.pdf
60. Linear regression with heteroskedastic errors Need robust standard errors (Huber-White):
- use the “robust” option with “regress”
i.e. reg teachers meal_pct expn_stu, robust
Linear regression                                      Number of obs =     420
                                                       F(  2,   417) =    9.58
                                                       Prob > F      =  0.0001
                                                       R-squared     =  0.0232
                                                       Root MSE      =  186.17
------------------------------------------------------------------------------
             |               Robust
    teachers |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    meal_pct |   .8239426    .336556     2.45   0.015     .1623848      1.4855
    expn_stu |   -.026066   .0098221    -2.65   0.008     -.045373   -.0067591
       _cons |   230.7061   59.44205     3.88   0.000     113.8627    347.5496
------------------------------------------------------------------------------
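The “robust” option changes only the standard errors (and so the t statistics and confidence intervals), not the coefficients. A sketch that compares the two, assuming the bundled auto dataset rather than caschool.dta:

```stata
sysuse auto, clear
regress price weight mpg
* Keep the conventional estimates for comparison
scalar b_ols  = _b[weight]
scalar se_ols = _se[weight]
regress price weight mpg, robust
display "coefficient: " b_ols  " vs " _b[weight]
display "std. error:  " se_ols " vs " _se[weight]
```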
61. Note: rreg – robust regression (outliers) e.g. rreg inc school exp
This command is for robust regression.
It concerns point estimates more than standard errors, and it implements a data-dependent method for downweighting outliers.
Do not use it to deal with heteroskedastic errors: it is not the same as the “robust” option of “regress”.
62. “cluster” option For example, you might think that in a panel of countries, errors are correlated over time within a country but independent across countries. Then, you should cluster standard errors on countries.
i.e. regress y k, cluster(country)
63. Linear regression with panel data Declaring the data to be a panel:
Example, where data consists of many firms, each observed over 5 years
iis Firm;
tis Year;
xt is the prefix for the commands in this class
xtreg should be used for regressions with panel data
64. Fixed effects: $y_{it} = \alpha + x_{it}\beta + v_i + \epsilon_{it}$
i.e. xtreg lnc lny, fe
Equivalent to including a dummy variable for each case (i.e. firm).
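A sketch of the dummy-variable equivalence on simulated data. Everything here is made up: 10 firms observed for 5 years, a true slope of 2, and a firm effect of firm/5. It declares the panel with “xtset” (assuming your STATA version has it; “iis”/“tis” serve the same purpose) and compares xtreg, fe with “areg”, which absorbs one dummy per firm.

```stata
clear
set seed 12345
set obs 50
* 10 hypothetical firms, 5 years each
generate firm = ceil(_n/5)
bysort firm: generate year = 2000 + _n
generate x = invnorm(uniform())
generate y = 1 + 2*x + firm/5 + invnorm(uniform())
xtset firm year
xtreg y x, fe
scalar b_fe = _b[x]
* Same slope from a regression with a dummy for every firm
areg y x, absorb(firm)
display "xtreg, fe: " b_fe "   areg: " _b[x]
```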
65. Random effects (RE) If you think some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, then you can include both types by using random effects.
Stata's RE estimator is a weighted average of fixed and between effects
i.e. xtreg lnc lny, re
66. Choosing Between Fixed and Random Effects running a Hausman test:
estimate the FE model, save the coefficients, estimate the RE model, and then do the comparison.
Example:
xtreg dependentvar var1 var2 var3 ... , fe
estimates store fixed
xtreg dependentvar var1 var2 var3 ... , re
estimates store random
hausman fixed random
If the p-value is significant (e.g. below 0.05), reject the hypothesis that the RE estimates are consistent and use FE
Source: http://dss.princeton.edu/online_help/analysis/panel.htm
67. Time series data tsset – declare data to be time-series data
Examples:
“tsset time, yearly” (For an annual time series, time takes on values such as 1990, 1991, ...)
“tsset company year, yearly” (For yearly panel data, variable company being the panel ID variable and year being a four-digit calendar year)
68. Serial correlation in residuals Testing for first order serial correlation (Durbin-Watson statistic)
reg col25 col2 col3 col7 if country=="Mexico"
estat dwatson
Testing for higher order serial correlation (Breusch-Godfrey statistic)
estat bgodfrey
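Both tests need the data to be declared as time series with “tsset” first. A sketch on simulated data with serially correlated errors (all names and numbers are made up; the AR(1) coefficient 0.7 is only for illustration):

```stata
clear
set seed 2024
set obs 100
generate time = 1990 + _n
tsset time, yearly
generate x = invnorm(uniform())
* Build AR(1) errors: each error carries over 0.7 of the previous one
generate e = invnorm(uniform())
replace e = 0.7*e[_n-1] + e if _n > 1
generate y = 1 + 0.5*x + e
regress y x
* Durbin-Watson: values well below 2 point to positive first-order correlation
estat dwatson
* Breusch-Godfrey for serial correlation up to order 1 and up to order 2
estat bgodfrey, lags(1 2)
```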
69. Useful link http://www.iies.su.se/~masa/stata.htm
This contains links to other STATA websites by topic
http://www.princeton.edu/~erp/stata/main.html
http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm
70. GETTING MORE INFORMATION ABOUT STATA The Help menu in STATA
STATA has detailed help files available for all STATA commands.
STATA commands are described in detail in the STATA User’s Guide and Reference Manual.
www.stata.com.
Finally, you can find several good STATA tutorials on the Web. An easy way to find a list is to do a Google search for Stata tutorial.
(This tutorial was prepared using information from “STATA Tutorial to accompany Stock/Watson Introduction to Econometrics” Pearson 2003. )