Lecture 12: Issues in Modeling February 24, 2014
Question Using the scatter plot matrix below, what is the range of LTV? • 20-120 • 8-20 • 0.5-1.0 • I have no idea
Administrative • Homework 5 due • Quiz Wednesday • Hypothesis testing workshop: Monday after spring break • Exam 1 results • A little low, but not bad.
Exam 1 • Average = 70.37; sd = 16; min = 27; max = 94; median = 73
Multiple Regression Example MRM checklist: • Look at the data (scatterplot / correlation matrix) • Transform any variables? • Use Theory! • Fit the regression model • Examine residuals and fitted values from the regression • Look at residual plots by the various explanatory variables and by fitted values • Look at calibration plot • Are the residuals normally distributed? • Examine the F-statistic • Test / interpret the partial slopes.
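For those who prefer to work outside Excel/StatTools, here is a minimal sketch of the same checklist in Python with pandas and statsmodels; the file name and column names (y, x1, x2) are placeholders, not course data:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")                       # placeholder file name

# 1. Look at the data: scatterplot matrix and correlation matrix
pd.plotting.scatter_matrix(df[["y", "x1", "x2"]])
print(df[["y", "x1", "x2"]].corr())

# 2-3. Fit the regression model (transform variables first if theory says so)
X = sm.add_constant(df[["x1", "x2"]])
model = sm.OLS(df["y"], X).fit()

# 4-5. Residuals vs fitted values, calibration plot, normality of residuals
plt.scatter(model.fittedvalues, model.resid)       # residual plot
plt.figure()
plt.scatter(model.fittedvalues, df["y"])           # calibration plot
sm.qqplot(model.resid, line="45")                  # normal quantile plot
plt.show()

# 6-7. Overall F-statistic, then the partial slopes and their t-statistics
print(model.summary())
```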
Regression Modeling • How do you decide which variables to include in your model? • It depends on the question you're trying to answer! • Why not include all of them? • You risk over-fitting the data • Sidebar: what is the data? • A sample: generally (hopefully?) a representation of the population, but not perfect • So which ones do you include? • Hopefully you have substantive knowledge of which ones are important and how they relate to one another. Think path diagrams. • Often we're interested in building a regression model to answer a question. • So… include the variables needed to answer that question, not necessarily to increase R2!
Regression Example 2 • Data: CAPM.csv • CAPM: Open the data and begin to fit a model predicting Sony %Change by Market %Change. • CAPM provides a theoretical starting point for the simple regression of Sony %Change on Market %Change • Can we improve our prediction of Sony %Change? • Fit a multiple regression model predicting Sony %Change by Market %Change, Dow %Change, Small-Big and Hi-Low.
Regression Example 2 Fit a multiple regression model predicting Sony %Change by Market %Change, Dow %Change, Small-Big and Hi-Low. • Look at the data (scatterplot / correlation matrix) • Any problems?
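A quick way to do step 1 for this model, assuming CAPM.csv has the column names shown below (the actual headers in the file may differ):

```python
import pandas as pd
import matplotlib.pyplot as plt

capm = pd.read_csv("CAPM.csv")
# Assumed column names; adjust to match the actual file
cols = ["Sony %Change", "Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]

# Scatterplot matrix and correlation matrix of the response and explanatory variables
pd.plotting.scatter_matrix(capm[cols], figsize=(8, 8))
plt.show()
print(capm[cols].corr().round(2))
# A very high correlation between Market %Change and Dow %Change is the warning sign here
```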
Collinearity • Collinearity = correlation between your explanatory variables. • Not necessarily a problem, but a reason for concern. • Produces imprecise estimates of the partial slopes and/or difficulty in interpreting the partial slopes. • What is the estimate of the slope in the simple regression model (Beta)? • 1.278 (se = 0.132) • What is the estimate of the slope of Market %Change in the MRM? • 0.97 (se = 0.398) • This difference is because of collinearity between Market %Change and Dow %Change.
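A sketch of that comparison, again under the assumed CAPM.csv column names; the two printed lines show how the Market %Change slope and its standard error change between the simple fit and the multiple regression:

```python
import pandas as pd
import statsmodels.api as sm

capm = pd.read_csv("CAPM.csv")
y = capm["Sony %Change"]                 # assumed column name

# Simple regression: Sony %Change on Market %Change
simple = sm.OLS(y, sm.add_constant(capm[["Market %Change"]])).fit()

# Multiple regression: add Dow %Change, Small-Big and Hi-Low
full = sm.OLS(y, sm.add_constant(
    capm[["Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]])).fit()

# The partial slope shifts and its standard error grows because Market and Dow overlap heavily
print(simple.params["Market %Change"], simple.bse["Market %Change"])
print(full.params["Market %Change"], full.bse["Market %Change"])
```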
Collinearity • Many (most?) regression models have some collinearity. • Not technically a violation of any of the MRM assumptions • But can still cause problems interpreting the results of your model. • MRM estimates each partial slope using only variation that is unique to each explanatory variable. When we have high correlation, there isn’t much unique information there to use. • Variance Inflation Factor (VIF) • Measure of the effect of collinearity on the precision of the partial slope.
Variance Inflation Factor • Variance Inflation Factor (VIF) • Measure of the effect of collinearity on the precision of the partial slope. • VIFj = 1 / (1 − R2j), where R2j is the R2 when regressing Xj on all of the other explanatory variables (X−j) • If there is no correlation between the explanatory variables, VIF = 1 • If the explanatory variables are perfectly correlated, VIF = ∞
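For anyone working outside Excel/StatTools, a minimal sketch of this definition in Python; the column names shown in the usage comment are assumptions about CAPM.csv, not part of the course files:

```python
import pandas as pd
import statsmodels.api as sm

def vif(X: pd.DataFrame, col: str) -> float:
    """VIFj = 1 / (1 - R2j), where R2j comes from regressing Xj on the other explanatory variables."""
    others = sm.add_constant(X.drop(columns=[col]))
    r2_j = sm.OLS(X[col], others).fit().rsquared
    return 1.0 / (1.0 - r2_j)

# Usage (assumed column names):
# X = capm[["Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]]
# vif(X, "Market %Change")
```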
Variance Inflation Factor (VIF) • From the model predicting Sony %Change by Market %Change, Dow %Change, Small-Big and Hi-Low, what is the VIF of Market %Change? • FYI: I don't know of a built-in VIF function in Excel/StatTools, but you can still calculate it. • 9.76 • 8.92 • 1.59 • I have no idea
Variance Inflation Factor • Variance Inflation Factor (VIF) • You can also obtain the VIFs of all explanatory variables from the diagonal of the inverse of the correlation matrix of the explanatory variables • Construct a correlation matrix of just the explanatory variables (do not include the response) • Invert that matrix. In Excel, use MINVERSE() • The diagonal of the result gives the VIFs
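The same computation via the correlation-matrix route described above, mirroring the Excel MINVERSE() approach; again the column names are assumptions about CAPM.csv:

```python
import numpy as np
import pandas as pd

capm = pd.read_csv("CAPM.csv")
# Assumed column names; adjust to match the actual file
X = capm[["Market %Change", "Dow %Change", "Small-Big", "Hi-Low"]]

# Correlation matrix of the explanatory variables only (response excluded)
R = X.corr().to_numpy()

# The diagonal of the inverse of the correlation matrix gives the VIFs
vifs = np.diag(np.linalg.inv(R))
print(dict(zip(X.columns, vifs.round(2))))
```

statsmodels also provides variance_inflation_factor in statsmodels.stats.outliers_influence if you prefer a ready-made function.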
Collinearity Signs of Collinearity: • R2 increases less than we would expect • Slopes of correlated explanatory variables in the model change dramatically • The F-statistic is more impressive than the individual t-statistics. • Standard errors for partial slopes are larger than those for marginal slopes • VIFs increase • No hard and fast rules for VIF thresholds. Some people say 5, some say 10.
Collinearity • Perfect collinearity: is it possible? • Any examples? • Definitely possible; you need to make sure you don't include a perfectly collinear relationship by accident: • E.g.: imagine SAT total score = SAT Math + SAT Writing + SAT CR • Including all 4 variables as explanatory variables creates a perfectly collinear relationship (any 3 of them determine the 4th) • Why is this a problem? • We can't estimate all 4 coefficients at once; the model isn't identified. Multiple sets of coefficients produce the same fitted values. • This mistake is easier to make (and more common) with categorical explanatory variables.
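A tiny made-up illustration (not course data) of why a perfectly collinear set of predictors leaves the model unidentified:

```python
import numpy as np

rng = np.random.default_rng(0)
math_score = rng.normal(600, 80, 100)
writing = rng.normal(580, 90, 100)
cr = rng.normal(590, 85, 100)
total = math_score + writing + cr        # exact linear combination of the other three

# Design matrix with an intercept and all four SAT variables
X = np.column_stack([np.ones(100), math_score, writing, cr, total])

# Rank 4 instead of 5: X'X is singular, so no unique set of five coefficients exists
print(np.linalg.matrix_rank(X))
```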
Collinearity Remedies for collinearity: • Remove redundant explanatory variables • Re-express explanatory variables • E.g.: use the average of (Market %Change + Dow %Change) as an alternative explanatory variable (see the sketch after this list) • Do nothing • Not a joke, but only if the resulting estimates are still sensible. Realize that some collinearity will always exist.
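A sketch of the re-expression remedy under the same assumed CAPM.csv column names: replace the two highly correlated indexes with their average and refit.

```python
import pandas as pd
import statsmodels.api as sm

capm = pd.read_csv("CAPM.csv")
# Assumed column names; adjust to match the actual file
capm["Index %Change"] = (capm["Market %Change"] + capm["Dow %Change"]) / 2

X = sm.add_constant(capm[["Index %Change", "Small-Big", "Hi-Low"]])
model = sm.OLS(capm["Sony %Change"], X).fit()
print(model.summary())   # the combined index stands in for the two overlapping predictors
```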
Removing Explanatory Vars • After adding several explanatory variables to a model, some of those added and some of those originally present may not be statistically significant. • Remove those variables for which both statistics and substance indicate removal (e.g., remove Dow % Change rather than Market % Change).
Multiple Regression: Choosing Independent Vars Several kinds of specification error (error in specifying the model): • Not including a relevant variable: omitted variable bias • Could lead to the entire regression equation being suspect; might positively or negatively bias estimates, depending on the correlations with the omitted variable. • Including a redundant variable: • Less precise estimates, increased collinearity, lower adjusted R2 • Incorrect functional form (non-linear) • Already dealt with this to some degree. • Simultaneity / endogeneity bias • More on this when we come to causality. Theory, not statistical fit, should be the most important criterion for the inclusion of a variable in a regression equation.
Choosing Independent Vars Choice of explanatory variables: • Causality language helps (jumping ahead slightly) • Imagine we're really interested in one coefficient in particular, the "treatment" variable, but want to build a model estimating a dependent variable:
Choosing Independent Vars Relationships between variables: Type A: Affects the dependent variable but uncorrelated with the treatment
Choosing Independent Vars Relationships between variables: Type B: Affects the dependent variable but correlated with the treatment due to a common cause.
Choosing Independent Vars Relationships between variables: Type C: Affects the dependent variable but correlated with the treatment by chance.
Choosing Independent Vars Relationships between variables: Type D: Affects the dependent variable directly but also indirectly via the treatment variable.
Choosing Independent Vars Relationships between variables: Type E: Affects the dependent variable directly but is influenced by the treatment variable. Problematic variable!! Don't include it in the model
Choosing Independent Vars When deciding whether to include a variable: theory is key. If you have a good understanding of the theoretical relationships between the variables (often we do): • Include types A-D • Avoid including type E. • Also known as a "post-treatment" variable. • Even if including it increases your R2 and/or lowers the standard error of the regression, avoid it. Including it will bias the estimate on the treatment variable.