Lecture 12: Issues in Modeling February 24, 2014
Question Using the scatter plot matrix below, what is the range of LTV? • 20-120 • 8-20 • 0.5-1.0 • I have no idea
Administrative • Homework 5 due • Quiz Wednesday • Hypothesis testing workshop: Monday after spring break • Exam 1 results • A little low, but not bad.
Exam 1 • Average = 70.37; sd = 16; min = 27; max = 94; median = 73
Multiple Regression Example MRM checklist: • Look at the data (scatterplot / correlation matrix) • Transform any variables? • Use Theory! • Fit the regression model • Examine residuals and fitted values from the regression • Look at residual plots by the various explanatory variables and by fitted values • Look at calibration plot • Are the residuals normally distributed? • Examine the F-statistic • Test / interpret the partial slopes.
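For those who prefer to work outside Excel/StatTools, here is a minimal sketch of the same checklist in Python with pandas and statsmodels; the file name and column names (y, x1, x2) are placeholders, not course data:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")                       # placeholder file name

# 1. Look at the data: scatterplot matrix and correlation matrix
pd.plotting.scatter_matrix(df[["y", "x1", "x2"]])
print(df[["y", "x1", "x2"]].corr())

# 2-3. Fit the regression model (transform variables first if theory says so)
X = sm.add_constant(df[["x1", "x2"]])
model = sm.OLS(df["y"], X).fit()

# 4-5. Residuals vs fitted values, calibration plot, normality of residuals
plt.scatter(model.fittedvalues, model.resid)       # residual plot
plt.figure()
plt.scatter(model.fittedvalues, df["y"])           # calibration plot
sm.qqplot(model.resid, line="45")                  # normal quantile plot
plt.show()

# 6-7. Overall F-statistic, then the partial slopes and their t-statistics
print(model.summary())
```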
Regression Modeling • How do you decide which variables to include in your model? • It depends on the question you're trying to answer! • Why not include all of them? • You risk over-fitting the data • Sidebar: what is the data? • A sample: generally (hopefully?) a representation of the population, but not perfect • So which ones do you include? • Hopefully you have substantive knowledge of which ones are important and how they relate to one another. Think path diagrams. • Often we're interested in building a regression model to answer a question. • So… include the variables needed to answer that question, not necessarily to increase R2!
Regression Example 2 • Data: CAPM.csv • CAPM: Open the data and begin to fit a model predicting Sony %Change by Market %Change. • CAPM provides a theoretical starting point for the simple regression of Sony %Change on Market %Change • Can we improve our prediction of Sony %Change? • Fit a multiple regression model predicting Sony %Change by Market %Change, Dow %Change, Small-Big and Hi-Low.
Regression Example 2 Fit a multiple regression model predicting Sony %Change by Market %Change, Dow %Change, Small-Big and Hi-Low. • Look at the data (scatterplot / correlation matrix) • Any problems?
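A quick way to do step 1 for this model, assuming CAPM.csv has the column names shown below (the actual headers in the file may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt

capm = pd.read_csv("CAPM.csv")
# Assumed column names; adjust to match the actual file
cols = ["Sony %Change", "Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]

# Scatterplot matrix and correlation matrix of the response and explanatory variables
pd.plotting.scatter_matrix(capm[cols], figsize=(8, 8))
plt.show()
print(capm[cols].corr().round(2))
# A very high correlation between Market %Change and Dow %Change is the warning sign here
```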
Collinearity • Collinearity = correlation between your explanatory variables. • Not necessarily a problem, but a reason for concern. • Produces imprecise estimates of the partial slopes and/or difficulty in interpreting the partial slopes. • What is the estimate of the slope in the simple regression model (Beta)? • 1.278 (se = 0.132) • What is the estimate of the slope of Market %Change in the MRM? • 0.97 (se = 0.398) • This difference is because of collinearity between Market %Change and Dow %Change.
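A sketch of that comparison, again under the assumed CAPM.csv column names; the two printed lines show how the Market %Change slope and its standard error change between the simple fit and the multiple regression:

```python
import pandas as pd
import statsmodels.api as sm

capm = pd.read_csv("CAPM.csv")
y = capm["Sony %Change"]                 # assumed column name

# Simple regression: Sony %Change on Market %Change
simple = sm.OLS(y, sm.add_constant(capm[["Market %Change"]])).fit()

# Multiple regression: add Dow %Change, Small-Big and Hi-Low
full = sm.OLS(y, sm.add_constant(
    capm[["Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]])).fit()

# The partial slope shifts and its standard error grows because Market and Dow overlap heavily
print(simple.params["Market %Change"], simple.bse["Market %Change"])
print(full.params["Market %Change"], full.bse["Market %Change"])
```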
Collinearity • Many (most?) regression models have some collinearity. • Not technically a violation of any of the MRM assumptions • But can still cause problems interpreting the results of your model. • MRM estimates each partial slope using only variation that is unique to each explanatory variable. When we have high correlation, there isn’t much unique information there to use. • Variance Inflation Factor (VIF) • Measure of the effect of collinearity on the precision of the partial slope.
Variance Inflation Factor • Variance Inflation Factor (VIF) • Measure of the effect of collinearity on the precision of the partial slope. • VIFj = 1 / (1 − R2j), where R2j is the R2 when regressing Xj on all of the other explanatory variables (X−j) • If there is no correlation between the explanatory variables, VIF = 1 • If the explanatory variables are perfectly correlated, VIF = ∞
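For anyone working outside Excel/StatTools, a minimal sketch of this definition in Python; the column names shown in the usage comment are assumptions about CAPM.csv, not part of the course files:

```python
import pandas as pd
import statsmodels.api as sm

def vif(X: pd.DataFrame, col: str) -> float:
    """VIFj = 1 / (1 - R2j), where R2j comes from regressing Xj on the other explanatory variables."""
    others = sm.add_constant(X.drop(columns=[col]))
    r2_j = sm.OLS(X[col], others).fit().rsquared
    return 1.0 / (1.0 - r2_j)

# Usage (assumed column names):
# X = capm[["Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]]
# vif(X, "Market %Change")
```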
Variance Inflation Factor (VIF) • From the model predicting Sony %Change by Market %Change, Dow %Change, Small-Big and Hi-Low, what is the VIF of Market %Change? • FYI: I don't know of a built-in VIF function in Excel/StatTools, but you can still calculate it. • 9.76 • 8.92 • 1.59 • I have no idea
Variance Inflation Factor • Variance Inflation Factor (VIF) • You can also obtain the VIFs of all explanatory variables from the diagonal of the inverse of the correlation matrix of the explanatory variables • Construct a correlation matrix of just the explanatory variables (do not include the response) • Invert that matrix. In Excel, use MINVERSE() • The diagonal of the result gives the VIFs
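The same computation via the correlation-matrix route described above, mirroring the Excel MINVERSE() approach; again the column names are assumptions about CAPM.csv:

```python
import numpy as np
import pandas as pd

capm = pd.read_csv("CAPM.csv")
# Assumed column names; adjust to match the actual file
X = capm[["Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]]

# Correlation matrix of the explanatory variables only (response excluded)
R = X.corr().to_numpy()

# The diagonal of the inverse of the correlation matrix gives the VIFs
vifs = np.diag(np.linalg.inv(R))
print(dict(zip(X.columns, vifs.round(2))))
```

statsmodels also provides variance_inflation_factor in statsmodels.stats.outliers_influence if you prefer a ready-made function.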
Collinearity Signs of Collinearity: • R2 increases less than we would expect • Slopes of correlated explanatory variables in the model change dramatically • The F-statistic is more impressive than the individual t-statistics. • Standard errors for partial slopes are larger than those for marginal slopes • VIFs increase • No hard and fast rules for VIF thresholds. Some people say 5, some say 10.
Collinearity • Perfect collinearity: is it possible? • Any examples? • Definitely possible; you need to make sure you don't include a perfectly collinear relationship by accident: • E.g.: imagine SAT total score = SAT Math + SAT Writing + SAT CR • Including all 4 variables as explanatory variables creates a perfectly collinear relationship (any 3 of them determine the 4th) • Why is this a problem? • We can't estimate all 4 coefficients at once; the model isn't identified. Multiple sets of coefficients produce the same fitted values. • This mistake is easier to make (and more common) with categorical explanatory variables.
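A tiny made-up illustration (not course data) of why a perfectly collinear set of predictors leaves the model unidentified:

```python
import numpy as np

rng = np.random.default_rng(0)
math_score = rng.normal(600, 80, 100)
writing = rng.normal(580, 90, 100)
cr = rng.normal(590, 85, 100)
total = math_score + writing + cr        # exact linear combination of the other three

# Design matrix with an intercept and all four SAT variables
X = np.column_stack([np.ones(100), math_score, writing, cr, total])

# Rank 4 instead of 5: X'X is singular, so no unique set of five coefficients exists
print(np.linalg.matrix_rank(X))
```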
Collinearity Remedies for collinearity: • Remove redundant explanatory variables • Re-express explanatory variables • E.g.: use the average of (Market %Change + Dow %Change) as an alternative explanatory variable (see the sketch after this list) • Do nothing • Not a joke, but only if the resulting estimates are still sensible. Realize that some collinearity will always exist.
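A sketch of the re-expression remedy under the same assumed CAPM.csv column names: replace the two highly correlated indexes with their average and refit.

```python
import pandas as pd
import statsmodels.api as sm

capm = pd.read_csv("CAPM.csv")
# Assumed column names; adjust to match the actual file
capm["Index %Change"] = (capm["Market %Change"] + capm["Dow %Change"]) / 2

X = sm.add_constant(capm[["Index %Change", "Small-Big", "Hi-Low"]])
model = sm.OLS(capm["Sony %Change"], X).fit()
print(model.summary())   # the combined index stands in for the two overlapping predictors
```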
Removing Explanatory Vars • After adding several explanatory variables to a model, some of those added and some of those originally present may not be statistically significant. • Remove those variables for which both statistics and substance indicate removal (e.g., remove Dow % Change rather than Market % Change).
Multiple Regression: Choosing Independent Vars Several kinds of specification error (error in specifying the model): • Not including a relevant variable: omitted variable bias • Could lead to the entire regression equation being suspect; might positively or negatively bias estimates, depending on the correlations with the omitted variable. • Including a redundant variable: • Less precise estimates, increased collinearity, lower adjusted R2 • Incorrect functional form (non-linear) • Already dealt with this to some degree. • Simultaneity / endogeneity bias • More on this when we come to causality. Theory, not statistical fit, should be the most important criterion for the inclusion of a variable in a regression equation.
Choosing Independent Vars Choice of explanatory variables: • Causality language helps (jumping ahead slightly) • Imagine we're really interested in one coefficient in particular, the "treatment" variable, but want to build a model estimating a dependent variable:
Choosing Independent Vars Relationships between variables: Type A: Affects the dependent variable but uncorrelated with the treatment
Choosing Independent Vars Relationships between variables: Type B: Affects the dependent variable but correlated with the treatment due to a common cause.
Choosing Independent Vars Relationships between variables: Type C: Affects the dependent variable but correlated with the treatment by chance.
Choosing Independent Vars Relationships between variables: Type D: Affects the dependent variable directly but also indirectly via the treatment variable.
Choosing Independent Vars Relationships between variables: Type E: Affects the dependent variable directly but is influenced by the treatment variable. Problematic variable!! Don't include it in the model
Choosing Independent Vars When deciding whether to include a variable: theory is key. If you have a good understanding of the theoretical relationships between the variables (often we do): • Include types A-D • Avoid including type E. • Also known as a "post-treatment" variable. • Even if including it increases your R2 and/or lowers the standard error of the regression, avoid it. Including it will bias the estimate on the treatment variable.