Statistics Workshop: Specialized Models
Spring 2009
Bert Kritzer
Inferring Causation

Regression suggests but does not prove causation!
• Must check for spurious relationships
• Must have correct time ordering
• Must eliminate alternatives
• Must recognize the possibility of multiple causal processes functioning independently
  • "multi-conjunctional causation"
• Must confront possible mutual causation
  • simultaneous equations
  • identification
Spurious Correlation (path diagrams): the observed correlation r between Quality of Law School Attended and Career Success may be spurious, with IQ and Ambition as common causes of both.
Regression and One-Way Causation
• Crucial element is eliminating alternative explanations
• Need to include predictors representing alternatives in the regression equation
  • hope that they are not statistically significant
  • hope that the variable of interest remains significant after alternative explanations are included
• Must still deal with
  • proper time ordering
  • form of the relationship (linear vs. nonlinear, interactions/conditional relationships)
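A minimal sketch of this logic in Python, using simulated data and hypothetical variable names that echo the spurious-correlation diagram above: once the alternative explanations are included as predictors, the coefficient on the variable of interest can be checked directly.

```python
# Minimal sketch, not from the slides: simulated data with hypothetical names.
# IQ and ambition drive both law school quality and career success; including
# them as controls lets us check whether school quality has an independent effect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
iq = rng.normal(100, 15, n)                          # alternative explanation 1
ambition = rng.normal(0, 1, n)                       # alternative explanation 2
school_quality = 0.03 * iq + 0.5 * ambition + rng.normal(0, 1, n)
career_success = 0.05 * iq + 0.8 * ambition + rng.normal(0, 1, n)  # no direct school effect

X = sm.add_constant(np.column_stack([school_quality, iq, ambition]))
fit = sm.OLS(career_success, X).fit()
print(fit.params)  # coefficient on school_quality should be near zero once IQ and ambition are controlled
```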
Simultaneous Equations: The Classic Supply/Demand Problem (plot of supply and demand curves; axes: price and quantity)
Simultaneous Equations: The Supply/Demand "Blob" Problem (plot: the observed price/quantity points trace out an "observed line" that is neither the supply curve nor the demand curve)
Simultaneous Equations: The "Identification" Problem
Supply Equation Is Identified:
Both Equations Are Identified:
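A sketch of the standard identification logic these labels refer to, in conventional notation; an illustration using the advertising shifter A from the next slide and a hypothetical cost shifter W, not necessarily the slide's own equations.

```latex
% Illustration only: standard supply/demand identification; A (advertising)
% and W (a cost shifter) are assumed exogenous.
\begin{align*}
\text{Supply identified:}\quad
  & Q = \beta_0 + \beta_1 P + u            && \text{(supply)}\\
  & Q = \alpha_0 + \alpha_1 P + \alpha_2 A + v && \text{(demand; A shifts demand only)}\\[1ex]
\text{Both identified:}\quad
  & Q = \beta_0 + \beta_1 P + \beta_2 W + u    && \text{(supply; W shifts supply only)}\\
  & Q = \alpha_0 + \alpha_1 P + \alpha_2 A + v && \text{(demand)}
\end{align*}
```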
Simultaneous Equations: Identify Supply Equation by Adding Advertising to Demand Equation (plot: a single supply curve with demand curves at advertising levels A = 1, 2, 3; shifting demand traces out the supply curve)
Simultaneous Equations, Nonlegal Example: Political Socialization (path diagram linking FF, F, MF, M, and C, with estimated path coefficients)
What Regression Can’t Do Regarding Causation • Sorting out “necessary” vs. “sufficient” conditions • Sorting out multiple causal processes leading to same result
Time Series Data
• Change over time tends to be incremental
• Observation at time t is usually not independent of the observation at time t-1
  • The revenue a company receives from a product at time t is correlated with the revenue at time t-1
  • Although there can be "interruptions" to the basic pattern due to market changes (e.g., entrance of a competitor)
• The issue presented is labeled "serial correlation" or "autocorrelation"
Time Series
• Observations are not independent over time
  • Biased estimates of standard errors
    • Tend to be too low
• Diagnostics
  • Plots
  • Residuals
  • Specialized statistics (Durbin-Watson)
• Solutions
  • Remove the correlation by transforming the data (usually involves focusing on differences; see the sketch below)
  • Incorporate time dependence into the statistical model (MLE)
  • Specialized methods ("Box-Jenkins")
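A minimal sketch of the differencing idea, using simulated data rather than the Court series shown on the following slides: first-differencing a random-walk-like series removes most of the serial dependence in the levels.

```python
# Minimal sketch with simulated data: first differencing removes the strong
# serial dependence of a random-walk-like series.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(0, 1, 200))   # levels behave like a random walk
dy = np.diff(y)                        # first differences: y_t - y_{t-1}

print("lag-1 autocorrelation, levels:     ", round(acf(y, nlags=1)[1], 2))   # near 1
print("lag-1 autocorrelation, differences:", round(acf(dy, nlags=1)[1], 2))  # near 0
```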
Time Series: A Random Walk? (plot: number of cases decided by the Supreme Court with signed opinions, over time)
Time Series: First Differences (plot: first differences of the number of cases decided by the Supreme Court with signed opinions)
Time Series: A Random Walk, 2? (plot: number of IFP petitions for certiorari filed in the Supreme Court, over time)
Time Series: First Differences, 2 (plot: first differences of the number of IFP petitions for certiorari filed in the Supreme Court)
The Interrupted Time Series, 1962–1976
Serially Correlated Errors: The Usual Conceptualization

(1)  $y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$
(2)  $\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$  ("AR1"), where $u_t$ is independent ("white noise") error
(3)  $\varepsilon_t = \rho_1\,\varepsilon_{t-1} + \rho_2\,\varepsilon_{t-2} + u_t$  ("AR2")
Diagnosing Serial Correlation Issues
• A wide variety of statistics are used to diagnose the presence and structure of serial correlation
  • Estimate ρ as shown on the previous slide
  • Statistic called "Durbin-Watson"
  • Lagrange Multiplier test
  • Dickey-Fuller test
• Fitting models that take this correlation into account
  • "filter" the data to remove the correlation
  • fit models that explicitly include the lack of independence
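A minimal sketch, on simulated data, of how three of the diagnostics listed above can be run with statsmodels; the AR(1) error with ρ = 0.7 is an assumption for illustration.

```python
# Minimal sketch (simulated data): Durbin-Watson, a Breusch-Godfrey Lagrange
# Multiplier test, and an augmented Dickey-Fuller test on OLS residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
n = 200
x = np.arange(n, dtype=float)
e = np.zeros(n)
for t in range(1, n):                  # AR(1) errors with rho = 0.7
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 0.05 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print("Durbin-Watson:", durbin_watson(fit.resid))              # well below 2 => positive autocorrelation
print("Breusch-Godfrey LM p-value:", acorr_breusch_godfrey(fit, nlags=1)[1])
print("Dickey-Fuller p-value:", adfuller(fit.resid)[1])
```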
Specialized Time Series Models
• "ARIMA" (Box-Jenkins) models (sketched below)
  • deal jointly with AR and MA effects
• Models that do "seasonal adjustment"
• Panel models
• Cross-section time-series models
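A minimal sketch of fitting an ARIMA (Box-Jenkins) model with statsmodels; the simulated series and the (1, 0, 0) order are assumptions for illustration only.

```python
# Minimal sketch: fit an ARIMA model to a simulated AR(1) series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
e = rng.normal(size=300)
y = np.zeros(300)
for t in range(1, 300):                 # simulate an AR(1) process
    y[t] = 0.6 * y[t - 1] + e[t]

model = ARIMA(y, order=(1, 0, 0))       # 1 AR term, no differencing, no MA term
result = model.fit()
print(result.summary())
```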
Time Series Example: Price and Anti-Trust
Jonathan B. Baker and Daniel L. Rubinfeld, Empirical Methods in Antitrust: Review and Critique, 1 Am. L. & Econ. Rev. 386, 393 (1999).
The Selection Problem
• grades/LSAT example
• "in-out" variable is itself a random variable
• "selection" models
  • "censored" ("truncation"): Y = Y* if Y* > some constant C; otherwise Y = C (C usually is 0)
    • "tobit"
  • "self-selection" ("observed" vs. "not observed"): observe Y only if some other unobserved variable (the selection variable) exceeds some threshold (you only observe a 1 or 0, as in probit or logit)
    • "Heckman" models
• "switching" models
  • compare to interaction models where the key variable is not stochastic, such as gender or race
  • switching model involves a variable which itself is stochastic and is determined by a causal mechanism
  • something like Heckman
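In conventional notation, the two setups look roughly as follows; this is an illustrative sketch, not the slides' own formulas.

```latex
% Illustration of the censored (tobit) and selection (Heckman) setups.
\begin{align*}
\textbf{Censored (tobit):}\quad
  & Y_i^{*} = X_i\beta + \varepsilon_i, \qquad
    Y_i = \begin{cases} Y_i^{*} & \text{if } Y_i^{*} > C \\ C & \text{otherwise} \end{cases} \\[1ex]
\textbf{Selection (Heckman):}\quad
  & S_i^{*} = Z_i\gamma + u_i, \qquad S_i = \mathbf{1}(S_i^{*} > 0) \\
  & Y_i = X_i\beta + \varepsilon_i, \qquad Y_i \text{ observed only when } S_i = 1
\end{align*}
```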
Heckman Selection Model Applied to Sentencing
• How to measure sentence?
• How is probation or a suspended sentence treated?
• Do the same factors affect the in-out decision as affect length of incarceration?
  • Could race affect one, but not the other?
  • Could race have an opposite effect on length compared to in-out?
Race Differences
Darrell Steffensmeier & Stephen Demuth, Ethnicity and Judges' Sentencing Decisions: Hispanic-Black-White Comparisons, 39 Criminology 145-178 (2001).
Multilevel Models
• Data at several levels
• Census
  • individual
  • household
  • residential building
• Education
  • district
  • school
  • classroom/teacher
  • student
Hierarchical Linear Model (HLM)
• HLM is a method that is specifically designed to model data measured at multiple levels
• It produces estimates of the effects of variables measured at each level in a way that takes into account how many actual measurements you have at each level
• Running standard regression models fails to account for the different frequency of measurement in multi-level data
  • Might have 10,000 individuals but only 20 distinct measures of school characteristics
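A sketch of a standard two-level model in conventional HLM notation (student i in school j); this is an illustration of the general setup, not the model from the study cited on the next slide.

```latex
% Illustration: generic two-level hierarchical linear model.
\begin{align*}
\text{Level 1 (student):}\quad & Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij} \\
\text{Level 2 (school):}\quad  & \beta_{0j} = \gamma_{00} + \gamma_{01} W_{j} + u_{0j} \\
                               & \beta_{1j} = \gamma_{10} + u_{1j}
\end{align*}
```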
HLM Model for School Achievement
SOURCE: Sarah Theule Lubienski and Christopher Lubienski, School Sector and Academic Achievement: A Multilevel Analysis of NAEP Mathematics Data, 43 Am. Educ. Res. J. 651, 672-73 (2006).
Conclusions I
• There is usually no one correct way to do statistics
  • There are wrong ways
• With large data sets, specific choices as to how to test hypotheses will not affect conclusions except at the margins
• The "model" used can have substantial effects on the estimates of the values one obtains
Conclusions II
• Descriptive statistics can be used with samples and with populations
• Inferential statistics can be used with samples and populations
  • With samples, one typically wants to draw inferences to the full population
  • With populations, one is asking whether the pattern observed could be generated by a purely random process
Conclusions III: Description
• Univariate
  • central tendency, dispersion, shape
• Bivariate/Multivariate
  • nature of relationship (e.g., slope in regression)
  • strength of relationship ("correlation")
    • "proportional reduction in error" (PRE)
  • model of relationship
    • form (linear, nonlinear)
Conclusions IV: Testing Hypotheses
• Power
  • How wrong is the null hypothesis (i.e., how strong is the effect we are looking for)?
• Possibility of error
  • Type I error: incorrectly rejecting the null (seeing something that really isn't there)
  • Type II error: incorrectly failing to reject the null (failing to see something that is there)
• Must avoid the "fallacy of affirming the consequent" (failing to reject the null does not mean the null is correct)
Conclusions V: Issues in Estimation
• Estimation can be in the form of either
  • a single value (a point estimate), or
  • a range of values (interval estimate, confidence interval, margin of error)
• In thinking about estimation, we need to consider
  • statistical bias
  • "efficiency" or variation
Conclusions VI: Methods of Estimation
• Point estimation
  • population equivalent ("method of moments")
  • minimize error (e.g., least squares)
  • maximum likelihood
• Interval estimation
  • "pivot method": use probability theory to pivot around a point estimate
  • "bootstrap": simulate repeated samples (sketched below)
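A minimal sketch of the bootstrap idea for interval estimation, using simulated data: resample with replacement and use the spread of the resampled estimates as the interval.

```python
# Minimal sketch: percentile bootstrap confidence interval for a mean.
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(loc=5.0, scale=2.0, size=100)    # the observed sample

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()  # resample with replacement
    for _ in range(2000)
])
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"point estimate: {sample.mean():.2f}, 95% bootstrap CI: ({low:.2f}, {high:.2f})")
```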
Conclusions VII: The Bottom Line
Statistics should be understood as random variables.