Multiple Linear Regression
An introduction, some assumptions, and then model reduction
First, what is multiple linear regression?
• First, some terminology… these three equations all say the same thing:
  Y' = a + bX
  Y' = mx + b
  Y' = β₀ + β₁X
• β₀, β₁ and so on are called beta coefficients
• a = b = β₀ = INTERCEPT; b = m = β₁ = SLOPE
First, what is multiple linear regression?
• Simple linear regression uses just one predictor or independent variable
• Multiple linear regression just adds more IVs (or predictors)
• Each IV or predictor brings another beta coefficient with it:
  Y' = β₀ + β₁X₁ + β₂X₂
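To make the two-predictor equation concrete, here is a minimal sketch (not from the original slides) of fitting Y' = β₀ + β₁X₁ + β₂X₂ by ordinary least squares in Python; all data values are invented purely for illustration.

```python
# Minimal sketch: ordinary least squares for Y' = b0 + b1*X1 + b2*X2.
# The data values are invented purely for illustration.
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # first predictor (IV)
X2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0])   # second predictor (IV)
Y  = np.array([1.1, 2.9, 3.2, 5.1, 5.0])   # dependent variable (DV)

# Design matrix: a column of 1s for the intercept (b0), then X1 and X2
X = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares solves for the betas that minimize squared prediction error
b0, b1, b2 = np.linalg.lstsq(X, Y, rcond=None)[0]
print(f"Y' = {b0:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```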
Now, an example…
• So, now we can add the sex variable to our prediction equation from last week
• Here is the one with just height in the model…
[Figure: scatterplot with the height-only regression line; R² = .65]
• Note R² = .65 for the simple model
Now, an example…
• But if we add sex…
• The slope of each line is the same, but it now fits both values of sex by adjusting the height of the line
[Figure: the same scatterplot with two parallel fitted lines, one per sex; R² = .99]
• Nice improvement in R²!
Now, an example…
• In terms of the equation, this is achieved by…
  When sex = 1 (female): Y' = β₀ + β₁X₁ + (β₂ × 1)
  When sex = 2 (male): Y' = β₀ + β₁X₁ + (β₂ × 2)
Now, an example…
• This is called "dummy coding" when the second variable is dichotomous (as sex is); a sketch of this coding appears below
• The principle is similar when the second variable is continuous
• Adding more variables simply captures more variance on the dependent variable (potentially, of course)
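Here is a hypothetical Python sketch of the dummy coding above, with sex coded 1 (female) / 2 (male) as on the previous slide; the heights and weights are made up. The point is that β₂ shifts the line's height per unit of the sex code while the slope for height stays the same for both groups.

```python
# Hypothetical dummy-coding sketch; sex coded 1 (female) / 2 (male).
# All data values are made up for illustration.
import numpy as np

height = np.array([60., 62., 64., 66., 68., 70., 72., 74.])
sex    = np.array([ 1.,  1.,  1.,  1.,  2.,  2.,  2.,  2.])
weight = np.array([110., 118., 126., 134., 155., 163., 171., 179.])

X = np.column_stack([np.ones_like(height), height, sex])
b0, b1, b2 = np.linalg.lstsq(X, weight, rcond=None)[0]

# One common slope (b1); sex shifts the line's height by b2 per code unit
print(f"female (sex=1): Y' = {b0 + b2*1:.1f} + {b1:.2f}*height")
print(f"male   (sex=2): Y' = {b0 + b2*2:.1f} + {b1:.2f}*height")
```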
Note on graphs/charts for MLR
• I showed you the example in 2D, but with multiple regression an accurate chart is only possible in the number of dimensions equal to the total number of variables in the model (dependent plus independent)
• So, three dimensions would be needed here:
  Y' = β₀ + β₁X₁ + β₂X₂
[Figure: 3D plot of the regression surface for Y' = β₀ + β₁X₁ + β₂X₂, with axes Y', X₁, and X₂; example points shown: Y' = .5 at (x₁ = 0, x₂ = 0), Y' = 1 at (x₁ = 1, x₂ = 0), Y' = 1.5 at (x₁ = 0, x₂ = 1), Y' = 2 at (x₁ = 1, x₂ = 1)]
Assumptions of MLR
• Four assumptions of MLR (known by the acronym "LINE"):
• Linearity: the residuals (differences between the obtained and predicted DV scores) should have a straight-line relationship with predicted DV scores
• Independence: the observations on the DV are uncorrelated with each other
• Normality: the observations on the DV are normally distributed for each combination of values of the IVs
• Equality of variance: the variance of the residuals about predicted DV scores should be the same for all predicted scores (homoscedasticity… remember the cone-shaped pattern?)
• We will not test MLR assumptions in this class (enough that you do them for SLR)
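Although we will not test these assumptions in this class, for the curious: the usual quick check for Linearity and Equality of variance is a plot of residuals against predicted DV scores. A minimal sketch with simulated data (everything here is invented, not the course example):

```python
# Sketch: residuals vs. predicted values, the usual eyeball check for
# Linearity (no curvature) and Equality of variance (no cone shape).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X1 = rng.uniform(0, 10, 100)
X2 = rng.uniform(0, 10, 100)
Y  = 1.0 + 0.5 * X1 + 0.3 * X2 + rng.normal(0, 1, 100)

X = np.column_stack([np.ones_like(X1), X1, X2])
betas = np.linalg.lstsq(X, Y, rcond=None)[0]
predicted = X @ betas
residuals = Y - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted DV score")
plt.ylabel("Residual")
plt.show()
```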
Items to consider - 1
• Sample size & # predictors:
• A crucial aspect of the worth of a prediction equation is whether it will generalize to other samples
• With multiple regression (based on multiple correlation), minimizing the prediction errors of the regression line is like maximizing the correlation for that sample
• So one would expect that on another sample, the correlation (and thus R²) would shrink
Items to consider - 1
• Sample size & # predictors:
• Our problem is reducing the risk of shrinkage
• Two most important factors:
  • Sample size (n)
  • Number of predictors (independent variables) (k)
• Expect big shrinkage with n:k ratios less than 5:1 (see the helper sketch below)
• Guttman (1941): 136 subjects, 84 predictors, obtained multiple r = .73; on a new independent sample, r = .04!
• Stevens (1986): n:k should be 15:1 or greater in social science research
• Tabachnick & Fidell (1996): n > 50 + 8k
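These rules of thumb are plain arithmetic, so a small helper function can apply all three at once; the function name and output wording below are my own, not from the slides.

```python
# Hypothetical helper encoding the three rules of thumb from this slide.
def sample_size_checks(n, k):
    """n = sample size, k = number of predictors (IVs)."""
    print(f"n:k ratio = {n / k:.1f}:1 (expect big shrinkage below 5:1)")
    print(f"Stevens (1986), 15:1 or greater:  {'OK' if n >= 15 * k else 'violated'}")
    print(f"Tabachnick & Fidell, n > 50 + 8k: {'OK' if n > 50 + 8 * k else 'violated'}")

# Guttman's example: 136 subjects, 84 predictors -- every rule is violated
sample_size_checks(n=136, k=84)
```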
Items to consider - 1
• Sample size & # predictors:
• What to do if you violate these rules:
• Report Adjusted R² in addition to R² when your sample size is too small, or close to it… small samples (and/or too many predictors) tend to result in overestimating r (and consequently R²)
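As a worked illustration (numbers invented), Adjusted R² applies the standard correction 1 − (1 − R²)(n − 1)/(n − k − 1), which pulls R² down more as n shrinks or k grows:

```python
# Standard adjusted R-squared correction for sample size and predictor count.
def adjusted_r2(r2, n, k):
    """r2 = sample R^2, n = sample size, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Invented example: the same R^2 = .65 with 2 predictors shrinks more at small n
print(adjusted_r2(r2=0.65, n=150, k=2))  # ~0.645, barely penalized
print(adjusted_r2(r2=0.65, n=15,  k=2))  # ~0.592, a noticeable drop
```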
Items to consider - 2
• List data – check for errors, outliers, influential data points
• MLR is very sensitive to outliers, but outliers and influential points are not necessarily the same thing
• Need to sort out whether an outlier is influential
• Check for outliers in the initial data screening process (scatterplot), or via Cook's distance (see regression options in SPSS)
• Cook's distance:
  • A measure of the change in the regression coefficients that would occur if the case were omitted… reveals the cases most influential for the regression equation
  • A Cook's distance of 1 is generally thought to be large
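For anyone working in Python rather than SPSS, here is a hedged sketch of obtaining Cook's distance via statsmodels; the data are simulated with one planted badly-fitting case, and nothing here comes from the course dataset.

```python
# Sketch: Cook's distance via statsmodels (simulated data, one planted outlier).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(0, 1, 30)
y[0] += 15.0                      # plant one badly-fitting case

X = sm.add_constant(x)            # adds the intercept column
fit = sm.OLS(y, X).fit()
cooks_d = fit.get_influence().cooks_distance[0]

# A value near or above 1 is generally thought to be large
worst = cooks_d.argmax()
print(f"most influential case: {worst}, Cook's D = {cooks_d[worst]:.2f}")
```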
Items to consider - 2
• Outliers
• What to do with outliers, when found, is a highly controversial topic:
  • Leave
  • Delete
  • Transform