A Gentle Introduction to STATA Jose Ramon G. Albert Research Division Chief Statistical Research & Training Center (SRTC) email: srtcres@srtc.gov.ph SIAP-SRTC Training Course on Sampling Acceed Center, AIM, Makati Philippines 4 April 2002
OUTLINE • Statistical Computing Resources • Data Management with Stata • Table Generation • Tab and Table Commands • Survey Commands
Computing Resources • The Age of ICT has brought about a synergy of computing and communications • Implications: • More DATA collected • More DATA stored • More DATA accessible and distributed
Computing Resources • A host of statistical software packages provide pre-programmed analytical and data management capabilities. These packages may be classified according to use and cost.
Computing Resources Types of Stat Software by usage • General Purpose -- SAS, SPSS, R, S-Plus, Statistica, Stata • Special Purpose -- econometric modeling (EViews), seasonal adjustment (X12), Bayesian modeling (WinBUGS), survey data tabulation & variance estimation (IMPS, CENVAR)
Computing Resources Types of Stat Software by cost • Commercial Software - SAS, SPSS, Stata, S-Plus • Freeware - R, IMPS, X12
Computing Resources FOR SURVEY DATA • Bascula from Statistics Netherlands. • CENVAR (& IMPS) from U.S. Bureau of the Census. • CLUSTERS from University of Essex. • Epi Info from Centers for Disease Control. • Generalized Estimation System (GES) from Statistics Canada. • IVEWare (beta version) from University of Michigan.
Computing Resources FOR SURVEY DATA • PCCARP from Iowa State University. • SAS/STAT from SAS Institute. • Stata from Stata Corporation. • SUDAAN from Research Triangle Institute. • VPLX from U.S. Bureau of the Census. • WesVar from Westat, Inc.
Computing Resources • Lists of Statistical Software http://members.aol.com/johnp71/javasta2.html http://www.stir.ac.uk/Departments/HumanSciences/SocInfo/Statistical.htm http://www.fas.harvard.edu/~stats/survey-soft/ http://www.feweb.vu.nl/econometriclinks/software.html
Computing Resources This afternoon, we will provide a demonstration on how to use STATA for accomplishing some of the most common tasks of data management, statistical computing and analysis of survey data.
Computing Resources Stata Estimation of means, totals, ratios, and proportions; linear regression, logistic regression, and probit. Point estimates, associated standard errors, confidence intervals, and design effects for the full population or subpopulations are displayed.
Computing Resources Stata Auxiliary commands display various information for linear combinations (e.g., differences) of estimators, and conduct hypothesis tests. New in Stata: contingency tables with Rao-Scott corrections of chi-squared tests; new survey-corrected regression commands including tobit, interval, censored, instrumental variables, multinomial logit, ordered logit and probit, and Poisson.
Computing Resources Stata • stratified designs; • cluster sampling; • FPCs can be calculated for simple random sampling w/o replacement of sampling units within strata; • variance estimation for multistage sample data carried out through the customary between-PSU-squared-differences calculation.
Computing Resources Stata Variance estimation is done thru Taylor-series linearization in the survey analysis commands. There are also commands for jackknife and bootstrap variance estimation, but these are not specifically oriented toward survey data.
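As a rough sketch of how a stratified cluster design is declared before estimation, the lines below use a made-up dataset; the variables stratum, psu, hhwt and income are hypothetical and not from the course files. The syntax shown is that of later Stata releases (version 9 and above); the survey commands in version 6, which this course demonstrates, are spelled differently.

```stata
* Hypothetical toy data: 2 strata, 2 PSUs per stratum, 2 households per PSU
clear
input stratum psu hhwt income
1 1 100 50
1 1 100 60
1 2 120 55
1 2 120 70
2 3 80 90
2 3 80 85
2 4 90 100
2 4 90 95
end

* Declare the design: PSU identifier, sampling weight, and strata
svyset psu [pweight=hhwt], strata(stratum)

* Weighted mean of income with a linearized (Taylor-series) standard error
svy: mean income
```

Each stratum needs at least two PSUs for the between-PSU variance calculation to work, which is why the toy data are laid out this way.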
Computing Resources Note: We will demonstrate the use of STATA version 6. The current version is version 7, which also comes in a Special Edition (SE) that can handle up to 32,766 variables, strings of up to 244 characters, and matrices of up to 11,000 x 11,000.
Data Management STARTING UP • Go to Start, Programs, Stata, Intercooled Stata • Alternatively, from Windows Explorer, go to folder c:\stata and double-click wstata.exe
Data Management CREATING A NEW DATASET • Open the STATA spreadsheet editor
Data Management CREATING A NEW DATASET • Enter data into the editor; when done, close the editor.
Data Management CREATING A NEW DATASET • In the STATA COMMAND window enter the command save newfile
Data Management NOTE • A STATA dataset has the file extension .dta. That is, newfile is actually newfile.dta • Public use files of some surveys, e.g. VLSS (Vietnam Living Standards Survey), are in Stata format.
Data Management INSPECTING DATA BASE • In the STATA COMMAND window enter the following commands, one at a time:
    describe
    list
    summarize
Data Management NOTE: • Stata is case sensitive; commands are typed in lowercase. • Stata commands may be abbreviated, e.g. d for describe, sum for summarize, etc. • We may use Page Up/Down keys or mouse for re-selecting commands in the Review window.
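For those without a dataset at hand, the inspection commands can be tried on a few made-up records typed in with the input command. The values below are hypothetical; only the variable names domain, hcn and age follow the course examples.

```stata
* Hypothetical toy data entered from the command line
clear
input domain hcn age
1 1 45
1 2 62
2 1 38
2 2 51
end

* Full command names
describe
list
summarize

* The same three commands, abbreviated (lowercase only)
d
l
sum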
Data Management NOTE: • Commands and output are shown in the Results window. Windows may be re-sized. • Commands and output may be logged into a log file by pressing the Open Log button.
Data Management RENAMING VARIABLES • ONE WAY: (From Data Editor) Double-click anywhere in the variable's column to bring up a dialogue box
Data Management RENAMING VARIABLES • SECOND WAY: (In the STATA COMMAND window) enter
    rename var1 domain
    rename var2 hcn
    rename var3 age
    label variable age "HH head age"
    d
Data Management SAVING EDITED DATABASE • In the STATA COMMAND window enter the following command
    save newfile, replace
Note: typing only save newfile will result in an error message, because newfile.dta already exists.
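The rename, label and save steps can be rehearsed end to end on a small made-up dataset. This is only a sketch: the data values and the file name demo_newfile are hypothetical, chosen so as not to overwrite newfile.dta.

```stata
* Hypothetical toy data with the default variable names var1-var3
clear
input var1 var2 var3
1 1 45
1 0 62
2 1 38
end

* Give the variables meaningful names and attach a label
rename var1 domain
rename var2 hcn
rename var3 age
label variable age "HH head age"
describe

* First save creates demo_newfile.dta; replace lets it be overwritten later
save demo_newfile, replace

* Saving again without replace would stop with an error,
* because demo_newfile.dta now exists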
Data Management READING PRE-EXISTING STATA DATASET • If the dataset is in folder c:\fies2000 and the filename is fies00small.dta, enter
    clear
    set mem 64m
    cd c:\fies2000
    use fies00small
NOTE: clear and set mem are important for MEMORY MANAGEMENT: clear drops the data currently in memory, and set mem 64m allots 64 megabytes to Stata.
Data Management IMPORTING DATA • Suppose we have a dataset try.txt in the c:\fies2000 folder. NOTE: Missing data are coded as "."
Data Management IMPORTING DATA • Use the infile command with syntax
    infile variable-list using filename.raw
• In particular, enter
    cd c:\fies2000
    infile domain hcn age using try.txt, automatic
Data Management TRIVIA ON STRING VARIABLES • When using the infile command for character (string) variables, we need to identify these variables. For instance:
    infile domain hcn str30 prov using tr.txt
• For more details regarding infile, enter
    help infile1
Data Management IMPORTING DATA • Suppose we have a dataset try2.txt in the c:\fies2000 folder with the data in specific fields. NOTE: This assumes the last line is a blank line.
Data Management IMPORTING DATA • Use the infix command
    infix domain 1 hcn 2 age 3-4 using try2.txt, clear
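To see how the column positions in infix map onto a fixed-format file, the sketch below first writes a tiny text file and then reads it back. The file name demo_fixed.txt and its three records are hypothetical, and the file command used to write it is available only in later Stata releases (version 8 and above), not in version 6.

```stata
* Write a hypothetical fixed-format file: column 1 = domain,
* column 2 = hcn, columns 3-4 = age
file open fh using demo_fixed.txt, write text replace
file write fh "1145" _n
file write fh "1238" _n
file write fh "2152" _n
file close fh

* Read it back using the same column layout as on the slide
infix domain 1 hcn 2 age 3-4 using demo_fixed.txt, clear
list
```

The first record, 1145, is read as domain 1, hcn 1 and age 45.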
Data Management Thus, Stata can read text files with • infile (if the data in the text file are separated by spaces and there are no strings, or if strings are just one word, or if all strings are enclosed in quotes) • infix (fixed format text) • insheet (if the text file was created by a spreadsheet or database program)
Data Management NOTE: • The commands infile, infix, insheet read data from ASCII files. The outfile command is a way to save the data in ASCII. • There are third party programs, esp. Stat/Transfer and DBMS/COPY, that perform translations from one data format (e.g., dBASE, Excel, SAS, SPSS, Stata) to another.
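A simple way to try outfile and infile together is a round trip: write a small dataset to an ASCII file, clear memory, and read the file back. The data, the string variable prov and the file name demo_try.txt below are hypothetical.

```stata
* Hypothetical toy data with one string variable and one missing age
clear
input domain hcn age str10 prov
1 1 45 "Cavite"
1 2 .  "Laguna"
2 1 38 "Cebu"
end

* Save the data in ASCII; strings are written in quotes, missing as "."
outfile domain hcn age prov using demo_try.txt, replace

* Read the ASCII file back, declaring prov as a string of up to 10 characters
clear
infile domain hcn age str10 prov using demo_try.txt
list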
Data Management OTHER USEFUL COMMANDS • To sort the dataset by age
    sort age
• To get a listing of the dataset
    list
• To get a listing of the 2nd through 4th observations
    list in 2/4
Data Management OTHER USEFUL COMMANDS • To summarize the restricted dataset of HHs whose head’s age is less than or equal to 50
    summarize if age <=50
• To restrict to HH heads aged strictly between 35 and 50
    summarize if age <50 & age >35
Data Management Comparison operators: > >= == < <= != (also written ~=) Logical operators: & (and) | (or) ! (not) ~ (not, same as !)
Data Management OTHER USEFUL COMMANDS • To tabulate domain
    tab domain
• To generate contingency tables
    tab domain hcn if age>35
• To get the correlation matrix of variables x, y and z
    correlate x y z
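The commands and operators from the last few slides can be tried together on a handful of made-up records. The values are hypothetical; only the variable names follow the course examples.

```stata
* Hypothetical toy data
clear
input domain hcn age
1 1 45
1 2 62
2 1 38
2 2 51
2 3 29
end

* Sort by age, then list observations 2 through 4 of the sorted data
sort age
list in 2/4

* Restrict a summary with if and comparison operators
summarize age if age <= 50
summarize age if age > 35 & age < 50

* ! and ~ both mean "not": these two commands count the same observations
count if !(age > 50)
count if ~(age > 50)

* One-way and two-way tabulations, and a correlation matrix
tab domain
tab domain hcn if age > 35
correlate hcn age
```

One caution when using > with real data: Stata treats a missing value as larger than any number, so a condition such as age > 35 also picks up observations with missing age.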
Data Management GENERATING & REPLACING VARIABLES • Suppose we want to obtain per capita income (pci) of FIES 2000 households
    clear
    cd c:\fies2000
    use fies00small
    gen pci=toinc/hsize
Data Management GENERATING & REPLACING VARIABLES • Now tag the household as poor (1) if pci is below some threshold, say 13823, and determine the percent of HHs that are poor.
    gen poor=1 if pci < 13823
    replace poor=0 if poor==.
    sum poor [aw=rfact]
    save fies00small, replace
Data Management NOTE • Only a small portion of the FIES 2000 dataset was used. The Family Income and Expenditure Survey (FIES) is conducted by the National Statistics Office (NSO) every 3 years. Data may be purchased through the NSO website: www.census.gov.ph
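One thing to watch in the gen/replace pair above: a household with missing pci ends up with poor=0, i.e. it is counted as non-poor. A possible variant, sketched here on made-up data (the four records are hypothetical; toinc, hsize and rfact are the variable names used above), leaves poor missing when pci is missing.

```stata
* Hypothetical toy data: total income, household size, raising factor
clear
input toinc hsize rfact
60000 5 250
150000 4 300
. 3 200
40000 2 150
end

gen pci = toinc/hsize

* poor is 1 if pci < 13823, 0 otherwise, and stays missing when pci is missing
* (pci < . is true only for nonmissing pci)
gen poor = pci < 13823 if pci < .

* The weighted mean of the 0/1 variable is the estimated proportion of poor HHs
sum poor [aw=rfact]
```

With these four records only the first household is poor, so the weighted mean is 250/(250+300+150), about 0.357; the third household is left out because its pci is missing.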
Introduction to STATA (cont’d) Jose Ramon G. Albert Research Division Chief Statistical Research & Training Center (SRTC) email: srtcres@srtc.gov.ph SIAP-SRTC Training Course on Sampling Acceed Center, AIM, Makati Philippines 5 April 2002
Data Management RECALL • With our FIES 2000 dataset, we entered
    set mem 64m
    cd c:\fies2000
    use fies00small
    sum poor [aw=rfact]
• Note that the poverty line we provided is a weighted average of the variable poverty lines in the Philippines (for urban-rural areas across the different regions)
Digression … Official Poverty Measurement & Latest Poverty Statistics
Estimating Food Poverty Line • The food poverty line is estimated from low-cost one-day menus (breakfast, lunch, supper, snack) constructed for each urban-rural area of a region by the Food and Nutrition Research Institute (FNRI). The menus meet 100% sufficiency in energy and protein requirements and 80% sufficiency in other nutrients and vitamins. • RDA for energy: 2000 kcal per person • RDA for protein: 50 grams per person • 29 such menus were constructed on the basis of the 1988 Food Consumption Survey