BASIC DATA ANALYSIS AND STATISTICS • R. SHAPIRO • American University in Cairo • June 3-6, 2012 • Motivation, Intuition, and Numerology (AUCShapiroPresent1.ppt) • Exploring Theories: Bivariate Analysis • Multivariate Models (Regression Approaches) • Limited Dependent Variables (dichotomous variables) and Interactions • Survey Research: Issues and Sources of Error • Identifying Causal Mechanisms and Time Series Analysis • Using “Instruments” to Identify Causal Effects
Exploring Theories: Bivariate Analysis. “Correlation is not causation!” But you have to start somewhere.... First Steps • Centrality of causal theorizing. Dependent and independent variable(s)? Unit of analysis? Generalizing to what universe/population? Assumption of unidirectional causation (revisited later)? X --------> Y e.g., Democracy -----------> Income (of countries); Education -----------> Income (of individuals) • Plausibility of theory? Causal mechanism/story?
Next Steps in Quantitative Research • Measurement of variables (ideally at the designated unit of analysis). “Validity” and “reliability” of measures? • Hypothesis specification (for measures); expected covariation/correlation? • Statistical evidence of covariation/correlation? • Rejecting null hypothesis? Substantive versus “statistical significance?” • Next steps? Statistical controls, multivariate analysis, to be continued…. Strengthening causal inferences.
Questions at the Statistical Analysis Stage? • “Level of measurement” for the measures of the dependent and independent variables: Categorical or Continuous? (Further distinction of “nominal,” “ordinal,” “interval” or “ratio” level variables.) • The preferred statistical method depends on the level of measurement of the variables! • Motivation to put everything into a regression analysis framework – for later multivariate analysis. • The Bivariate Regression approach. Case of Income -------> Test score of individuals
Bivariate Ordinary Least Squares Regression (OLS) • Case of: Income -------> Test scores of individuals • The regression line takes the form Predicted Y = intercept + slope (X), or Predicted Y = a + bX, where “a” and “b” take on the unique numeric values that minimize the sum of squared vertical distances between all the points and the regression line. To the extent Y and X are linearly related in this way, the regression line falls much closer to all the points than does the horizontal line through the mean of Y. • Minimize, over all cases, the sum of (Y - Predicted Y)²
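The least-squares idea can be sketched numerically (a Python illustration with made-up data; the slides themselves use Stata). The closed-form estimates of “a” and “b” minimize the sum of squared vertical distances, and beat the horizontal line through the mean of Y:

```python
import numpy as np

# Made-up toy data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def ssr(a, b):
    """Sum of squared vertical distances from the line a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# Closed-form OLS estimates of slope and intercept
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# The OLS line beats the flat line through the mean of y
print(ssr(a, b) < ssr(y.mean(), 0.0))  # True
```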
Regression Line Versus the Mean: Idea of “Explained Variance”
Linear regression lets us estimate the slope of the population regression line • Ultimately our aim is to estimate the causal effect on Y of a unit change in X – but for now, just think of the problem of fitting a straight line to data on two variables, Y and X. • The slope of the population regression line is the expected effect on Y of a unit change in X.
The Population Linear Regression ModelYi = β0 + β1Xi + ui, i = 1,…, n We have n observations, (Xi, Yi), i = 1,.., n. • X is the independent variable or regressor • Y is the dependent variable • β0 = intercept • β1 = slope • ui = the regression error • The regression error consists of omitted factors. In general, these omitted factors are other factors that influence Y, other than the variable X. The regression error also includes error in the measurement of Y.
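A quick simulation (a Python sketch; the coefficients β0 = 2, β1 = -0.5 and the error distribution are made-up values) shows OLS recovering the population coefficients when the model holds:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 2.0, -0.5        # made-up population coefficients
n = 10_000
x = rng.normal(10.0, 2.0, n)
u = rng.normal(0.0, 1.0, n)     # regression error: omitted factors, measurement error in Y
y = beta0 + beta1 * x + u

b1_hat, b0_hat = np.polyfit(x, y, 1)  # OLS slope and intercept
print(b0_hat, b1_hat)  # close to 2.0 and -0.5
```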
The population regression model in a picture: Observations on Y and X (n = 7); the population regression line; and the regression error (the “error term”):
The OLS estimator solves: min over b0, b1 of the sum over i = 1,…, n of [Yi – (b0 + b1Xi)]² • The OLS estimator minimizes the average squared difference between the actual values of Yi and the predictions (“predicted values”) based on the estimated line. That is, it minimizes the squared vertical distances. • This minimization problem can be solved using calculus. • The result is the OLS estimators of β0 and β1.
Application to the California Test Score – Class Size data • Estimated slope = –2.28 • Estimated intercept = 698.9 • Estimated regression line: Testscore = 698.9 – 2.28×STR
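As a worked example of using the estimated line (the class size STR = 20 below is a hypothetical value, not from the slides):

```python
# Using the estimated line from the slides: predicted Testscore = 698.9 - 2.28*STR
a, b = 698.9, -2.28
STR = 20                 # hypothetical class size
pred = a + b * STR
print(round(pred, 1))    # 653.3
```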
OLS regression: STATA output

regress testscr str, robust

Regression with robust standard errors            Number of obs =     420
                                                  F(  1,   418) =   19.26
                                                  Prob > F      =  0.0000
                                                  R-squared     =  0.0512
                                                  Root MSE      =  18.581

-------------------------------------------------------------------------
         |               Robust
 testscr |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
---------+---------------------------------------------------------------
     str |  -2.279808   .5194892    -4.39   0.000    -3.300945   -1.258671
   _cons |    698.933   10.36436    67.44   0.000     678.5602   719.3057
-------------------------------------------------------------------------
Example of the R2 and the SER Testscore = 698.9 – 2.28×STR, R2 = .05, SER = 18.6 STR explains only a small fraction of the variation in test scores. Does this make sense? Does this mean the STR is unimportant in a policy sense?
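The R² and SER can be computed directly from a fitted line (a Python sketch with made-up toy data; the slides compute these for the California data in Stata):

```python
import numpy as np

# Made-up toy data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.0, 6.2, 5.8, 8.1, 8.9])

b, a = np.polyfit(x, y, 1)            # estimated slope and intercept
yhat = a + b * x
ssr = np.sum((y - yhat) ** 2)         # sum of squared residuals
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - ssr / tss                    # fraction of variation in y explained
ser = np.sqrt(ssr / (len(y) - 2))     # standard error of the regression
print(r2, ser)
```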
A real-data example from labor economics: average hourly earnings vs. years of education (data source: Current Population Survey):
Slope of the Regression Line, Variability Around It, and the Correlation Coefficient • Predicted Y = a + bX, where b is the slope. • Correlation Coefficient, Pearson’s “r”, ranges from -1 to 0 to +1, and is larger in magnitude to the extent that the observed data fall close to the regression line. The r² indicates proportionately how much closer (vertically) the regression line falls to the observed values of the dependent variable than does the horizontal line through the mean of the dependent variable. Why are both useful?
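The tie between Pearson’s r and the regression fit can be checked numerically (a Python sketch with made-up data): r² equals the proportional reduction in squared vertical distance relative to the horizontal line through the mean of Y.

```python
import numpy as np

# Made-up toy data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

r = np.corrcoef(x, y)[0, 1]           # Pearson's r, between -1 and +1

b, a = np.polyfit(x, y, 1)
ssr = np.sum((y - (a + b * x)) ** 2)  # squared distance to regression line
tss = np.sum((y - y.mean()) ** 2)     # squared distance to mean-of-y line
print(np.isclose(r ** 2, 1 - ssr / tss))  # True
```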
Same Slope (b) but Correlation = 0.75. Implications? More variability? Why?
OLS can be sensitive to an outlier (also look for non-linearity? discussed later?): • Is the lone point an outlier in X or Y? • In practice, outliers are often data glitches (coding or recording problems). Sometimes (or more often?) they are observations that really shouldn’t be in your data set. Plot your data!
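The sensitivity is easy to demonstrate (a Python sketch with made-up data): ten points on an exact line, plus one glitched observation, noticeably move the estimated slope.

```python
import numpy as np

x = np.arange(10.0)
y = 2.0 * x                       # points fall exactly on a line with slope 2
b_clean = np.polyfit(x, y, 1)[0]

x_out = np.append(x, 10.0)
y_out = np.append(y, 80.0)        # one glitched point far above the line
b_out = np.polyfit(x_out, y_out, 1)[0]

print(b_clean, b_out)  # the single glitch pulls the slope from 2.0 up to about 4.7
```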
The larger the variance of X, the smaller the variance of the slope b The number of black and blue dots is the same. Using which would you get a more accurate regression line?
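A small simulation illustrates the point (a Python sketch; the coefficients and sample sizes are made-up values): with the same number of observations, a more spread-out X yields less variable slope estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

def slope_sd(x_sd, reps=1000, n=30):
    """Std. dev. of OLS slope estimates across simulated samples."""
    slopes = []
    for _ in range(reps):
        x = rng.normal(0.0, x_sd, n)                 # vary only the spread of X
        y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)  # made-up true line, slope 2
        slopes.append(np.polyfit(x, y, 1)[0])
    return np.std(slopes)

sd_narrow = slope_sd(0.5)   # X with small variance
sd_wide = slope_sd(2.0)     # X with large variance, same n
print(sd_narrow > sd_wide)  # True: spread-out X pins down the slope better
```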
Analyzing Categorical Measures • For categorical independent and dependent variables: Cross Tabulation • For a categorical independent variable and a continuous dependent variable or a categorical dependent variable that can be treated as continuous: Compare Means on the dependent variable. • For a dichotomous dependent variable coded 0-1, the mean is the proportion of cases in the 1 category, so means on it can be compared!
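These comparisons are straightforward to compute (a Python sketch with made-up data; the slides do this in Stata):

```python
import numpy as np

# Made-up data: region is a categorical IV, income a continuous DV
region = np.array([0, 0, 0, 1, 1, 1])
income = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
for g in (0, 1):
    print(g, income[region == g].mean())  # compare means across categories

# Dichotomous DV coded 0-1: its mean is the proportion in the "1" category
voted = np.array([0, 0, 1, 1, 1])
print(voted.mean())  # 0.6
```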
Go to Stata example of standard bivariate analysis, non-regression • Crucial: Preparing Data -- Recoding; Dealing with “Missing Values,” if any; etc. • Go to PDF file, W4910x11 Bivariate Crosstabs and Means Analysis. Examples from U.S. Survey Data. • On to a Regression Analysis framework next…
Moving to a regression framework for categorical variables: • Treating categorical variables as continuous, if categories are “ordered” (“ordinal” vs. “nominal” level variables). • Special case of dichotomous variables. (The mean of a 0-1 variable is the proportion of cases in the “1” category; Ave. of 0,0,1,1,1 = .6.) • Crucial bridge: “dummy variable regression.” (And now for some comic relief, normally done at a blackboard with chalk.)
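The bridge can be verified with made-up numbers (a Python sketch): regressing Y on a 0-1 dummy D returns the mean of the D = 0 group as the intercept and the difference in group means as the slope.

```python
import numpy as np

# Made-up data: D is a 0-1 dummy, Y a continuous outcome
d = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

b, a = np.polyfit(d, y, 1)   # slope, intercept
print(round(a, 6))  # 3.0 = mean of Y in the D=0 group
print(round(b, 6))  # 3.0 = difference in group means (6.0 - 3.0)
```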
Example Using U.S. Survey Data and Stata Software • Assumptions in treating ordinal variables as continuous variables. • Statistical versus Substantive Significance? Variability. “Sampling error”/confidence intervals. The “standard error.” • PDF file W4910x11 Bivariate Regression, Dummy Variables.
Statistical Control: Understanding Multivariate Models (Multiple Regression Analysis) • Predicted Y = a + b1X1 + b2X2, where the b’s are the coefficients for which the differences between the observed Y’s and predicted Y’s are minimized. In this case we have more b’s to estimate to minimize the sum of (Y - Predicted Y)² • It now also has the interpretations shown below, beginning with comparisons of different possible scenarios for “conditional” regressions that hold one variable constant.
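A minimal multivariate sketch (Python, with made-up coefficients; the slides use Stata): both b’s are chosen jointly to minimize the sum of squared residuals, here via linear least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Made-up truth: a = 1, b1 = 2, b2 = -3
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# Minimize sum((y - (a + b1*x1 + b2*x2))**2) over a, b1, b2
X = np.column_stack([np.ones(n), x1, x2])
a_hat, b1_hat, b2_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(a_hat, b1_hat, b2_hat)  # close to 1, 2, -3
```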
“Effect” of Region and Democracy on Economic Growth (made-up data) • Predicted EG = a + b1(Democracy) + b2(Region), where we think both democracy and region have possible causal effects. • Case of only two regions (1 and 2; Region is coded 0-1), to illustrate a simple case of Statistical Control/holding one variable constant. • Linear equation assumes no “interaction”; that is, the “effect” of Democracy is the same in Regions 1 and 2 (and the same for Region; but is it?). There are different possibilities: (and comic relief)
(b) Interactions between continuous and binary variables Yi = β0 + β1Di + β2Xi + ui • Di is binary, a dummy variable coded 0-1; X is continuous • As specified above, the effect on Y of X (holding constant D) = β2, which does not depend on D; that is, it is the same for D=0 and for D=1. But what if that is not the case??? • To allow the effect of X to depend on D, include the “interaction term” Di×Xi as a regressor: Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui
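The interaction specification can be sketched as follows (Python, with made-up coefficients): the estimated effect of X is β2 for the D = 0 group and β2 + β3 for the D = 1 group.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
d = rng.integers(0, 2, n).astype(float)   # binary dummy D
x = rng.normal(size=n)                    # continuous X
# Made-up truth: the slope on X is 1 when D=0 and 1 + 2 = 3 when D=1
y = 0.5 + 1.0 * d + 1.0 * x + 2.0 * d * x + rng.normal(0.0, 0.5, n)

X = np.column_stack([np.ones(n), d, x, d * x])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]
print(b2, b2 + b3)  # slope for the D=0 group, slope for the D=1 group
```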
Binary-continuous interactions: the two regression lines Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui • Observations with Di = 0 (the “D = 0” group): Yi = β0 + β2Xi + ui (the D=0 regression line) • Observations with Di = 1 (the “D = 1” group): Yi = β0 + β1 + β2Xi + β3Xi + ui = (β0+β1) + (β2+β3)Xi + ui (the D=1 regression line)
(c) Interactions between two continuous variables Yi = β0 + β1X1i + β2X2i + ui • X1, X2 are continuous • As specified, the effect of X1 doesn’t depend on X2 • As specified, the effect of X2 doesn’t depend on X1 • To allow the effect of X1 to depend on X2, include the “interaction term” X1i×X2i as a regressor: Yi = β0 + β1X1i + β2X2i + β3(X1i×X2i) + ui
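With two continuous variables, the effect of a unit change in X1 is β1 + β3X2, so it varies with the level of X2 (a Python sketch with made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Made-up truth: beta1 = 2, beta2 = 0.5, beta3 = 1.5
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(0.0, 0.5, n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]
# The effect of a unit change in X1 depends on the level of X2:
print(b1 + b3 * 0.0)  # at X2 = 0, close to 2.0
print(b1 + b3 * 2.0)  # at X2 = 2, close to 5.0
```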