Discrete and Categorical Data

Discrete and Categorical Data William N. Evans Department of Economics/MPRC University of Maryland

Part I Introduction

Introduction • Workhorse statistical model in social sciences is the multivariate regression model • Ordinary least squares (OLS) • yi = β0 + x1iβ1+ x2iβ2+… xkiβk+ εi • yi = xi β + εi

Linear model yi =  + xi + i •  and  are “population” values – represent the true relationship between x and y • Unfortunately – these values are unknown • The job of the researcher is to estimate these values • Notice that if we differentiate y with respect to x, we obtain • dy/dx = 

 represents how much y will change for a fixed change in x • Increase in income for more education • Change in crime or bankruptcy when slots are legalized • Increase in test score if you study more

Put some concretenesson the problem • State of Maryland budget problems • Drop in revenues • Expensive k-12 school spending initiatives • Short-term solution – raise tax on cigarettes by 34 cents/pack • Problem – a tax hike will reduce consumption of taxable product • Question for state – as taxes are raised, how much will cigarette consumption fall?

Simple model: yi =  + xi + i • Suppose y is a state’s per capita consumption of cigarettes • x represents taxes on cigarettes • Question – how much will y fall if x is increased by 34 cents/pack? • Problem – many reasons why people smoke – cost is but one of them –

Data • (Y) State per capita cigarette consumption for the years 1980-1997 • (X) tax (State + Federal) in real cents per pack • “Scatter plot” of the data • Negative covariance between variables • When x>, more likely that y< • When x<, more likely that y> • Goal: pick values of  and  that “best fit” the data • Define best fit in a moment

Notation • True model • yi =  + xi + i • We observe data points (yi,xi) • The parameters  and  are unknown • The actual error (i)is unknown • Estimated model • (a,b) are estimates for the parameters (,) • ei is an estimate of i where • ei=yi-a-bxi • How do you estimate a and b?

Objective: Minimize sum of squared errors • Min iei2 = i(yi – a – bxi)2 • Minimize sum of squared errors (SSE) • Treat (+) and (-) errors equally • Over or under predict by “5” is the same magnitude of error • “Quadratic form” • The optimal value for a and b are those that make the 1st derivative equal zero • Functions reach min or max values when derivatives are zero

The model has a lot of nice features • Statistical properties easy to establish • Optimal estimates easy to obtain • Parameter estimates are easy to interpret • Model maximizes prediction • If you minimize SSE you maximize R2 • The model does well as a first order approximation to lots of problems

Discrete and Qualitative Data • The OLS model work well when y is a continuous variable • Income, wages, test scores, weight, GDP • Does not has as many nice properties when y is not continuous • Example: doctor visits • Integer values • Low counts for most people • Mass of observations at zero

Downside of forcing non-standard outcomes into OLS world? • Can predict outside the allowable range • e.g., negative MD visits • Does not describe the data generating process well • e.g., mass of observations at zero • Violates many properties of OLS • e.g. heteroskedasticity

This talk • Look at situations when the data generating process does not lend itself well to OLS models • Mathematically describe the data generating process • Show how we use different optimization procedure to obtain estimates • Describe the statistical properties

Show how to interpret parameters • Illustrate how to estimate the models with popular program STATA

Types of data generating processes we will consider • Dichotomous events (yes or no) • 1=yes, 0=no • Graduate high school? work? Are obese? Smoke? • Ordinal data • Self reported health (fair, poor, good, excel) • Strongly disagree, disagree, agree, strongly agree

Count data • Doctor visits, lost workdays, fatality counts • Duration data • Time to failure, time to death, time to re-employment

Recommended Textbooks • Jeffrey Wooldridge, “Econometric analysis of cross sectional and panel data” • Lots of insight and mathematical/statistical detail • Very good examples • William Greene, “Econometric Analysis” • more topics • Somewhat dated examples

Course web page • www.bsos.umd.edu/econ/evans/jpsm.html • Contains • These notes • All STATA programs and data sets • A couple of “Introduction to STATA” handouts • Links to some useful web sites

STATA Resources Discrete Outcomes • “Regression Models for Categorical Dependent Variables Using STATA” • J. Scott Long and Jeremy Freese • Available for sale from STATA website for $52 (www.stata.com) • Post-estimation subroutines that translate results • Do not need to buy the book to use the subroutines

In STATA command line type • net search spost • Will give you a list of available programs to download • One is Spostado from http://www.indiana.edu/~jslsoc/stata • Click on the link and install the files

Part II A brief introduction to STATA

STATA • Very fast, convenient, well-documented, cheap and flexible statistical package • Excellent for cross-section/panel data projects, not as great for time series but getting better • Not as easy to manipulate large data sets from flat files as SAS • I usually clean data in SAS, estimate models in STATA

Key characteristic of STATA • All data must be loaded into RAM • Computations are very fast • But, size of the project is limited by available memory • Results can be generated two different ways • Command line • Write a program, (*.do) then submit from the command line

Sample program to get you started • cps87_or.do • Program gets you to the point where can • Load data into memory • Construct new variables • Get simple statistics • Run a basic regression • Store the results on a disk

Data (cps87_do.dta) • Random sample of data from 1987 Current Population Survey outgoing rotation group • Sample selection • Males • 21-64 • Working 30+hours/week • 19,906 observations

Major caveat • Hardest thing to learn/do: get data from some other source and get it into STATA data set • We skip over that part • All the data sets are loaded into a STATA data file that can be called by saying: use data file name

Housekeeping at the top of the program • * this line defines the semicolon as the ; • * end of line delimiter; • # delimit ; • * set memork for 10 meg; • set memory 10m; • * write results to a log file; • * the replace options writes over old; • * log files; • log using cps87_or.log,replace; • * open stata data set; • use c:\bill\stata\cps87_or; • * list variables and labels in data set; • desc;

------------------------------------------------------------------------------------------------------------------------------------------------------------ • > - • storage display value • variable name type format label variable label • ------------------------------------------------------------------------------ • > - • age float %9.0g age in years • race float %9.0g 1=white, non-hisp, 2=place, • n.h, 3=hisp • educ float %9.0g years of education • unionm float %9.0g 1=union member, 2=otherwise • smsa float %9.0g 1=live in 19 largest smsa, • 2=other smsa, 3=non smsa • region float %9.0g 1=east, 2=midwest, 3=south, • 4=west • earnwke float %9.0g usual weekly earnings • ------------------------------------------------------------------------------

Constructing new variables • Use ‘gen’ command for generate new variables • Syntax • gen new variable name=math statement • Easily construct new variables via • Algebraic operations • Math/trig functions (ln, exp, etc.) • Logical operators (when true, =1, when false, =0)

From program • * generate new variables; • * lines 1-2 illustrate basic math functoins; • * lines 3-4 line illustrate logical operators; • * line 5 illustrate the OR statement; • * line 6 illustrates the AND statement; • * after you construct new variables, compress the data again; • gen age2=age*age; • gen earnwkl=ln(earnwke); • gen union=unionm==1; • gen topcode=earnwke==999; • gen nonwhite=((race==2)|(race==3)); • gen big_ne=((region==1)&(smsa==1));

Getting basic statistics • desc -- describes variables in the data set • sum – gets summary statistics • tab – produces frequencies (tables) of discrete variables

From program • * get descriptive statistics; • sum; • * get detailed descriptics for continuous variables; • sum earnwke, detail; • * get frequencies of discrete variables; • tabulate unionm; • tabulate race; • * get two-way table of frequencies; • tabulate region smsa, row column cell;

Results from sum • Variable | Obs Mean Std. Dev. Min Max • -------------+-------------------------------------------------------- • age | 19906 37.96619 11.15348 21 64 • race | 19906 1.199136 .525493 1 3 • educ | 19906 13.16126 2.795234 0 18 • unionm | 19906 1.769065 .4214418 1 2 • smsa | 19906 1.908369 .7955814 1 3 • -------------+--------------------------------------------------------

Detailed summary • usual weekly earnings • ------------------------------------------------------------- • Percentiles Smallest • 1% 128 60 • 5% 178 60 • 10% 210 60 Obs 19906 • 25% 300 63 Sum of Wgt. 19906 • 50% 449 Mean 488.264 • Largest Std. Dev. 236.4713 • 75% 615 999 • 90% 865 999 Variance 55918.7 • 95% 999 999 Skewness .668646 • 99% 999 999 Kurtosis 2.632356

2x2 Table • 1=east, | • 2=midwest, | 1=live in 19 largest smsa, • 3=south, | 2=other smsa, 3=non smsa • 4=west | 1 2 3 | Total • -----------+---------------------------------+---------- • 1 | 2,806 1,349 842 | 4,997 • | 56.15 27.00 16.85 | 100.00 • | 38.46 18.89 15.39 | 25.10 • | 14.10 6.78 4.23 | 25.10 • -----------+---------------------------------+---------- • 2 | 1,501 1,742 1,592 | 4,835 • | 31.04 36.03 32.93 | 100.00 • | 20.58 24.40 29.10 | 24.29 • | 7.54 8.75 8.00 | 24.29 • -----------+---------------------------------+---------- • 3 | 1,501 2,542 1,904 | 5,947 • | 25.24 42.74 32.02 | 100.00 • | 20.58 35.60 34.80 | 29.88 • | 7.54 12.77 9.56 | 29.88 • -----------+---------------------------------+---------- • 4 | 1,487 1,507 1,133 | 4,127 • | 36.03 36.52 27.45 | 100.00 • | 20.38 21.11 20.71 | 20.73 • | 7.47 7.57 5.69 | 20.73 • -----------+---------------------------------+---------- • Total | 7,295 7,140 5,471 | 19,906 • | 36.65 35.87 27.48 | 100.00 • | 100.00 100.00 100.00 | 100.00 • | 36.65 35.87 27.48 | 100.00

Running a regression • Syntax reg dependent-variable independent-variables • Example from program *run simple regression; reg earnwkl age age2 educ nonwhite union;

Source | SS df MS Number of obs = 19906 • -------------+------------------------------ F( 5, 19900) = 1775.70 • Model | 1616.39963 5 323.279927 Prob > F = 0.0000 • Residual | 3622.93905 19900 .182057239 R-squared = 0.3085 • -------------+------------------------------ Adj R-squared = 0.3083 • Total | 5239.33869 19905 .263217216 Root MSE = .42668 • ------------------------------------------------------------------------------ • earnwkl | Coef. Std. Err. t P>|t| [95% Conf. Interval] • -------------+---------------------------------------------------------------- • age | .0679808 .0020033 33.93 0.000 .0640542 .0719075 • age2 | -.0006778 .0000245 -27.69 0.000 -.0007258 -.0006299 • educ | .069219 .0011256 61.50 0.000 .0670127 .0714252 • nonwhite | -.1716133 .0089118 -19.26 0.000 -.1890812 -.1541453 • union | .1301547 .0072923 17.85 0.000 .1158613 .1444481 • _cons | 3.630805 .0394126 92.12 0.000 3.553553 3.708057 • ------------------------------------------------------------------------------

Analysis of variance • R2 = .3085 • Variables explain 31% of the variation in log weekly earnings • F(5,19900) • Tests the hypothesis that all covariates (except constant) are jointly zero

Interpret results • Y = β0 + β1Xi + εi • dY/dX = β1 • But in this case Y=ln(W) where W weekly wages • dln(W)/dX = (dW/W)/dX = β1 • Percentage change in wages given a change in x

For each additional year of education, wages increase by 6.9% • Non whites earn 17.2% less than whites • Union members earn 13% more than nonunion members

Part III Some notes about probability distributions

Continuous Distributions • Random variables with infinite number of possible values • Examples -- units of measure (time, weight, distance) • Many discrete outcomes can be treated as continuous, e.g., SAT scores

How to describe a continuous random variable • The Probability Density Function (PDF) • The PDF for a random variable x is defined as f(x), where f(x) $ 0 If(x)dx = 1 • Calculus review: The integral of a function gives the “area under the curve”

Cumulative Distribution Function (CDF) • Suppose x is a “measure” like distance or time • 0 # x # 4 • We may be interested in the Pr(x#a) ?

CDF What if we consider all values?

Discrete and Categorical Data

Discrete and Categorical Data

Presentation Transcript

Categorical and discrete data. Non-parametric tests

Categorical Data Analysis

Categorical Data

Categorical Data

Categorical Data

Analyzing categorical data

Displaying Categorical data

Categorical Data

Categorical Data

Interpreting Categorical Data

Categorical Data

Categorical Data

Categorical Data

Categorical Data Analysis

Categorical Data

Classifying Categorical Data

Relations and Categorical Data

Discrete and Categorical Data

Analyzing Categorical Data

Categorical Data Analysis

Categorical data

Categorical and Quantiative Data