Discrete and Categorical Data William N. Evans Department of Economics University of Maryland
Part I Introduction
Introduction • Workhorse statistical model in social sciences is the multivariate regression model • Ordinary least squares (OLS) • $y_i = \beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{ki}\beta_k + \varepsilon_i$ • $y_i = x_i\beta + \varepsilon_i$
Linear model $y_i = \alpha + x_i\beta + \varepsilon_i$ • $\alpha$ and $\beta$ are “population” values – they represent the true relationship between x and y • Unfortunately – these values are unknown • The job of the researcher is to estimate these values • Notice that if we differentiate y with respect to x, we obtain • $dy/dx = \beta$
$\beta$ represents how much y will change for a fixed change in x • Increase in income for more education • Change in crime or bankruptcy when slots are legalized • Increase in test score if you study more
Making the problem concrete • State of Maryland budget problems • Drop in revenues • Expensive k-12 school spending initiatives • Short-term solution – raise tax on cigarettes by 34 cents/pack • Problem – a tax hike will reduce consumption of the taxable product • Question for state – as taxes are raised, how much will cigarette consumption fall?
Simple model: $y_i = \alpha + x_i\beta + \varepsilon_i$ • Suppose y is a state’s per capita consumption of cigarettes • x represents taxes on cigarettes • Question – how much will y fall if x is increased by 34 cents/pack? • Problem – people smoke for many reasons, and cost is only one of them
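The question on this slide can be written as a one-line calculation. A sketch in the slide's notation, assuming x is measured in cents per pack (as in the data described next) and writing $\Delta$ for a change in a variable (notation introduced here, not in the slides):

```latex
\begin{align*}
\Delta y &= \beta \, \Delta x \\
\Delta x = 34 \text{ cents/pack} \quad &\Rightarrow \quad \Delta y = 34\,\beta
\end{align*}
```

Since $\beta$ is unknown, the answer in practice uses an estimate of it, so the predicted change in per capita consumption is 34 times the estimated slope.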
Data • (Y) State per capita cigarette consumption for the years 1980-1997 • (X) tax (State + Federal) in real cents per pack • “Scatter plot” of the data • Negative covariance between variables • When $x > \bar{x}$, more likely that $y < \bar{y}$ • When $x < \bar{x}$, more likely that $y > \bar{y}$ • Goal: pick values of $\alpha$ and $\beta$ that “best fit” the data • Define best fit in a moment
Notation • True model • $y_i = \alpha + x_i\beta + \varepsilon_i$ • We observe data points $(y_i, x_i)$ • The parameters $\alpha$ and $\beta$ are unknown • The actual error ($\varepsilon_i$) is unknown • Estimated model • (a, b) are estimates for the parameters ($\alpha$, $\beta$) • $e_i$ is an estimate of $\varepsilon_i$, where • $e_i = y_i - a - b x_i$ • How do you estimate a and b?
Objective: Minimize sum of squared errors • $\min_{a,b} \sum_i e_i^2 = \sum_i (y_i - a - b x_i)^2$ • Minimize the sum of squared errors (SSE) • Treat positive and negative errors equally • Over- or under-predicting by “5” is the same magnitude of error • “Quadratic form” • The optimal values for a and b are those that make the first derivatives equal zero • Functions reach min or max values when derivatives are zero
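The first-derivative conditions can be written out explicitly. A short derivation in the slide's notation, where $\bar{x}$ and $\bar{y}$ denote sample means (notation assumed here):

```latex
\begin{align*}
\text{SSE}(a,b) &= \sum_i (y_i - a - b x_i)^2 \\
\frac{\partial\,\text{SSE}}{\partial a} &= -2 \sum_i (y_i - a - b x_i) = 0 \\
\frac{\partial\,\text{SSE}}{\partial b} &= -2 \sum_i x_i\,(y_i - a - b x_i) = 0 \\
\Rightarrow \quad b &= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad a = \bar{y} - b\,\bar{x}
\end{align*}
```

The numerator of b is the sample covariance of x and y up to a scale factor, so a negative covariance, as in the cigarette data, gives a negative estimated slope.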
The model has a lot of nice features • Statistical properties easy to establish • Optimal estimates easy to obtain • Parameter estimates are easy to interpret • Model maximizes prediction • If you minimize SSE you maximize $R^2$ • The model does well as a first order approximation to lots of problems
Discrete and Qualitative Data • The OLS model works well when y is a continuous variable • Income, wages, test scores, weight, GDP • Does not have as many nice properties when y is not continuous • Example: doctor visits • Integer values • Low counts for most people • Mass of observations at zero
Downside of forcing non-standard outcomes into OLS world? • Can predict outside the allowable range • e.g., negative MD visits • Does not describe the data generating process well • e.g., mass of observations at zero • Violates many properties of OLS • e.g. heteroskedasticity
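A minimal simulated sketch of the first and third points. Everything in it is an assumption made for illustration: the variable names `x` and `visits`, the seed, and the Poisson parameters are hypothetical and are not the talk's data. It uses the default end-of-line delimiter and assumes a Stata version that provides `rnormal()` and `rpoisson()`.

```stata
* simulated illustration only: names and parameters are hypothetical
clear
set seed 12345
set obs 1000
gen x = rnormal()
* count outcome with many zeros: Poisson with mean exp(-0.5 - x)
gen visits = rpoisson(exp(-0.5 - x))
* force the count outcome into OLS
regress visits x
predict visits_hat
* fitted values below zero fall outside the allowable range
count if visits_hat < 0
* Breusch-Pagan test for heteroskedasticity after regress
estat hettest
```

Because a Poisson variable's variance equals its mean, the error variance changes with x by construction, which is the kind of violation the slide refers to.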
This talk • Look at situations when the data generating process does not lend itself well to OLS models • Mathematically describe the data generating process • Show how we use a different optimization procedure to obtain estimates • Describe the statistical properties
Show how to interpret parameters • Illustrate how to estimate the models with popular program STATA
Types of data generating processes we will consider • Dichotomous events (yes or no) • 1=yes, 0=no • Graduate high school? Work? Obese? Smoke? • Ordinal data • Self-reported health (poor, fair, good, excellent) • Strongly disagree, disagree, agree, strongly agree
Count data • Doctor visits, lost workdays, fatality counts • Duration data • Time to failure, time to death, time to re-employment
Econometric Resources • Recommended textbook • Jeffrey Wooldridge, undergraduate and grad • Lots of insight and mathematical/statistical detail • Very good examples • Helpful web sites • My graduate class • Jeff Smith’s class
Part II A quick introduction to STATA
STATA • Very fast, convenient, well-documented, cheap and flexible statistical package • Excellent for cross-section/panel data projects, not as great for time series but getting better • Not as easy to manipulate large data sets from flat files as SAS • I usually clean data in SAS, estimate models in STATA
Key characteristic of STATA • All data must be loaded into RAM • Computations are very fast • But, size of the project is limited by available memory • Results can be generated two different ways • Command line • Write a program (*.do), then submit it from the command line
Sample program to get you started • cps87_or.do • Program gets you to the point where you can • Load data into memory • Construct new variables • Get simple statistics • Run a basic regression • Store the results on a disk
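Submitting the program from the command line amounts to one command. A minimal sketch, assuming `cps87_or.do` sits in Stata's current working directory (the location is an assumption; the slides do not say where the do-file is stored):

```stata
* show the current working directory
pwd
* run every command in the do-file, in order
do cps87_or.do
```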
Data (cps87_or.dta) • Random sample of data from 1987 Current Population Survey outgoing rotation group • Sample selection • Males • Ages 21-64 • Working 30+ hours/week • 19,906 observations
Major caveat • Hardest thing to learn/do: get data from some other source into a STATA data set • We skip over that part • All the data sets are loaded into a STATA data file that can be called by saying: use data file name
Housekeeping at the top of the program
* this line defines the semicolon as the
* end of line delimiter;
#delimit ;
* set memory for 10 meg;
set memory 10m;
* write results to a log file;
* the replace option writes over old;
* log files;
log using cps87_or.log, replace;
* open stata data set;
use c:\bill\stata\cps87_or;
* list variables and labels in data set;
desc;
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
age             float  %9.0g                  age in years
race            float  %9.0g                  1=white, non-hisp, 2=place, n.h, 3=hisp
educ            float  %9.0g                  years of education
unionm          float  %9.0g                  1=union member, 2=otherwise
smsa            float  %9.0g                  1=live in 19 largest smsa, 2=other smsa, 3=non smsa
region          float  %9.0g                  1=east, 2=midwest, 3=south, 4=west
earnwke         float  %9.0g                  usual weekly earnings
------------------------------------------------------------------------------
Constructing new variables • Use the ‘gen’ command to generate new variables • Syntax • gen new variable name=math statement • Easily construct new variables via • Algebraic operations • Math/trig functions (ln, exp, etc.) • Logical operators (when true, =1, when false, =0)
From program
* generate new variables;
* lines 1-2 illustrate basic math functions;
* lines 3-4 illustrate logical operators;
* line 5 illustrates the OR statement;
* line 6 illustrates the AND statement;
* after you construct new variables, compress the data again;
gen age2=age*age;
gen earnwkl=ln(earnwke);
gen union=unionm==1;
gen topcode=earnwke==999;
gen nonwhite=((race==2)|(race==3));
gen big_ne=((region==1)&(smsa==1));
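Two details of the program are worth a sketch: a logical expression evaluates to 0 when the source variable is missing, and the comment about compressing the data has no matching command in the excerpt. The lines below assume cps87_or is already in memory; the variable name `union_chk` is hypothetical and chosen so it does not collide with the program's `union`.

```stata
#delimit ;
* a logical expression returns 0 when unionm is missing;
* the if qualifier leaves the new variable missing in that case;
gen union_chk = unionm==1 if !missing(unionm);
* confirm the recode against the original variable;
tab unionm union_chk, missing;
* store each variable in the smallest type that holds its values;
compress;
```

If `unionm` has no missing values in this sample, `union_chk` and `union` are identical and the extra qualifier is harmless.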
Getting basic statistics • desc -- describes variables in the data set • sum – gets summary statistics • tab – produces frequencies (tables) of discrete variables
* get descriptive statistics;
sum;
* get detailed statistics for continuous variables;
sum earnwke, detail;
* get frequencies of discrete variables;
tabulate unionm;
tabulate race;
* get two-way table of frequencies;
tabulate region smsa, row column cell;
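The sample program is described as also running a basic regression and storing the results on disk, but those lines are not shown. A minimal sketch of what such a final step could look like, assuming the variables generated earlier exist and the log opened at the top is still open; the choice of covariates is an assumption, not the author's specification.

```stata
#delimit ;
* regress log weekly earnings on a hypothetical set of covariates;
reg earnwkl age age2 educ union nonwhite;
* close the log so that cps87_or.log is written to disk;
log close;
```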
STATA Resources - Specific • “Regression Models for Categorical Dependent Variables Using STATA” • J. Scott Long and Jeremy Freese • Available for sale from STATA website for $52 (www.stata.com) • Post-estimation subroutines that translate results • Do not need to buy the book to use the subroutines
In STATA command line type • net search spost • Will give you a list of available programs to download • One is Spostado from http://www.indiana.edu/~jslsoc/stata • Click on the link and install the files
Continuous Distributions • Random variables with infinite number of possible values • Examples -- units of measure (time, weight, distance) • Many discrete outcomes can be treated as continuous, e.g., SAT scores
How to describe a continuous random variable • The Probability Density Function (PDF) • The PDF for a random variable x is defined as f(x), where $f(x) \ge 0$ and $\int f(x)\,dx = 1$ • Calculus review: The integral of a function gives the “area under the curve”
Cumulative Distribution Function (CDF) • Suppose x is a “measure” like distance or time • $0 \le x \le \infty$ • We may be interested in $\Pr(x \le a)$
CDF What if we consider all values?
Properties of CDF • Note that $\Pr(x \le b) + \Pr(x > b) = 1$ • $\Pr(x > b) = 1 - \Pr(x \le b)$ • Many times, it is easier to work with complements
General notation for continuous distributions • The PDF is described by lower case such as f(x) • The CDF is defined as upper case such as F(a)
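The link between the two functions can be stated compactly. A sketch using the notation above, with the lower limit written as $-\infty$ for the general case (for a nonnegative measure such as time, the integral effectively starts at 0):

```latex
\begin{align*}
F(a) &= \Pr(x \le a) = \int_{-\infty}^{a} f(x)\,dx \\
\Pr(x > b) &= 1 - F(b) \\
\Pr(a < x \le b) &= F(b) - F(a) \\
f(x) &= \frac{dF(x)}{dx}
\end{align*}
```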
Standard Normal Distribution • Most frequently used continuous distribution • Symmetric “bell-shaped” distribution • As we will show, the normal has useful properties • Many variables we observe in the real world look normally distributed. • Can translate normal into ‘standard normal’
Examples of variables that look normally distributed • IQ scores • SAT scores • Heights of females • Log income • Average gestation (weeks of pregnancy) • As we will show in a few weeks – sample means are normally distributed!!!
Standard Normal Distribution • PDF: $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ • For $-\infty \le z \le \infty$
Notation • $\phi(z)$ is the standard normal PDF evaluated at z • $\Phi[a] = \Pr(z \le a)$
Standard Normal • Notice that: • Normal is symmetric: $\phi(a) = \phi(-a)$ • Normal is “unimodal” • Median=mean • Area under curve=1 • Almost all area is between (-3,3) • Evaluations of the CDF are done with • Statistical functions (Excel, SAS, etc.) • Tables
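These properties can be checked numerically in STATA. A small sketch using the built-in functions `normalden()` (standard normal PDF) and `normal()` (standard normal CDF); the evaluation point 1.5 is arbitrary.

```stata
* symmetry of the PDF: the two densities are equal, so the difference is 0
display normalden(1.5) - normalden(-1.5)
* median = mean = 0, so half of the area lies below 0
display normal(0)
* share of the area between -3 and 3 (about 0.997)
display normal(3) - normal(-3)
```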
Standard Normal CDF • $\Pr(z \le -0.98) = \Phi[-0.98] = 0.1635$
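The same lookup can be done with a statistical function instead of a table. A minimal sketch with STATA's built-in `normal()` and `invnormal()` functions:

```stata
* standard normal CDF evaluated at -0.98, the value on the slide
display normal(-0.98)
* the same probability via the complement and symmetry: 1 - Phi[0.98]
display 1 - normal(0.98)
* inverse lookup: the z with 16.35% of the area to its left
display invnormal(0.1635)
```

The first two lines should agree with the tabled value of 0.1635 to four decimal places, and the third should return a value close to -0.98.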