
Multiple Regression Analysis



  1. Multiple Regression Analysis

  2. Let us start with Simple Linear Regression

  3. Simple Linear Regression • Regression refers to the statistical technique of modeling the relationship between variables. • In simple linear regression, we model the relationship between two variables. • One of the variables, denoted by Y, is called the dependent variable and the other, denoted by X, is called the independent variable. • The model we will use to depict the relationship between X and Y will be a straight-line relationship. • A graphical sketch of the pairs (X, Y) is called a scatter plot.

  4. Examples of Scatterplots [Figure: six scatter plots of (X, Y) pairs illustrating different possible relationships between the variables.]

  5. Simple Linear Regression Model • The equation that describes how y is related to x and an error term is called the regression model. • The simple linear regression model is: y = a + bx + e where: • a and b are called parameters of the model; a is the intercept and b is the slope. • e is a random variable called the error term.
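
To make the model's structure concrete, here is a minimal Python sketch that simulates data from y = a + bx + e. The parameter values and noise level are illustrative assumptions, not taken from the slides.

```python
# Simulate data from the simple linear regression model y = a + b*x + e.
# The values of a, b and the noise scale are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
a, b = 2.0, 0.5                         # intercept and slope (illustrative)
x = np.linspace(0, 10, 50)              # independent variable
e = rng.normal(0.0, 1.0, size=x.size)   # error term with mean zero
y = a + b * x + e                       # dependent variable
```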

  6. SIMPLE REGRESSION Estimating Using the Regression Line First, let's look at the equation of a straight line: Ŷ = a + bX, where Ŷ is the dependent variable, X the independent variable, b the slope of the line, and a the Y-intercept.

  7. SIMPLE REGRESSION The Method of Least Squares To estimate the straight line we use the least squares method. This method minimizes the sum of squared errors between the estimated points on the line and the actual observed points. The sign of r will be the same as the sign of the coefficient b in the regression equation Y = a + bX.

  8. SIMPLE REGRESSION AND CORRELATION The estimating line: Ŷ = a + bX. Slope of the best-fitting regression line: b = (ΣXY − n·X̄·Ȳ) / (ΣX² − n·X̄²). Y-intercept of the best-fitting regression line: a = Ȳ − b·X̄.

  9. SIMPLE REGRESSION – EXAMPLE (Appliance store) Suppose an appliance store conducts a five-month experiment to determine the effect of advertising on sales revenue. The results are shown below. (File: PPT_Regr_example)

  Advertising Exp. ($100s)   Sales Rev. ($1000s)
  1                          1
  2                          1
  3                          2
  4                          2
  5                          4

  10. SIMPLE REGRESSION - EXAMPLE

  X    Y    X²    XY
  1    1     1     1
  2    1     4     2
  3    2     9     6
  4    2    16     8
  5    4    25    20
  Totals: ΣX = 15, ΣY = 10, ΣX² = 55, ΣXY = 37

  11. SIMPLE REGRESSION - EXAMPLE b = (ΣXY − n·X̄·Ȳ) / (ΣX² − n·X̄²) = (37 − 5·3·2) / (55 − 5·3²) = 7/10 = 0.7, and a = Ȳ − b·X̄ = 2 − 0.7·3 = −0.1, so the estimating line is Ŷ = −0.1 + 0.7X.
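
The same computation can be checked with a short Python sketch using plain numpy; the data are the five (X, Y) pairs from the slide.

```python
# Least-squares slope and intercept for the appliance-store data.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)  # advertising expenditure ($100s)
y = np.array([1, 1, 2, 2, 4], dtype=float)  # sales revenue ($1000s)

n = len(x)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
a = y.mean() - b * x.mean()
print(f"b = {b:.1f}, a = {a:.1f}")          # b = 0.7, a = -0.1
```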

  12. Sample Coefficient of Determination Interpretation: the percentage of total variation explained by the regression. We can conclude that 81.67% of the variation in sales revenue is explained by the variation in advertising expenditure.
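
As a check on the 81.67% figure, here is a minimal sketch of the R² computation using the fitted line Ŷ = −0.1 + 0.7X from the previous slide.

```python
# Coefficient of determination: R^2 = 1 - SS(Residual) / SS(Total).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 2, 2, 4], dtype=float)
y_hat = -0.1 + 0.7 * x                     # fitted values from the estimating line

ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares = 1.1
ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares = 6.0
print(f"R^2 = {1 - ss_res / ss_tot:.4f}")  # 0.8167
```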

  13. SIMPLE REGRESSION AND CORRELATION The correlation coefficient r is a square root of the coefficient of determination: r is the positive square root if the slope of the estimating line is positive (the relationship between the two variables is direct), and r is the negative square root if the slope of the estimating line is negative (the relationship is inverse).

  14. Steps in Hypothesis Testing using SPSS • State the null and alternative hypotheses • Define the level of significance (α) • Calculate the actual significance: the p-value • Make the decision: reject the null hypothesis if p ≤ α (two-tail test) • State the conclusion

  15. Scatter Plot Consider the example of the appliance store. The data and scatter plot of the data:

  Advertising Exp. ($100s)   Sales Rev. ($1000s)
  1                          1
  2                          1
  3                          2
  4                          2
  5                          4

  [Figure: scatter plot of the five (X, Y) pairs.]

  16. Summary of SPSS Regression Analysis Output Alternatively, R2 = 1 − [SS(Residual) / SS(Total)] = 1 − (1.1/6.0) = 0.817. When adjusted for degrees of freedom, Adjusted R2 = 1 − [SS(Residual)/(n−k−1)] / [SS(Total)/(n−1)] = 1 − (1.1/3)/(6/4) = 0.756.
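
The adjusted R² arithmetic from the output can be reproduced directly (n = 5 observations, k = 1 predictor, with the SS values from the SPSS summary above):

```python
# Adjusted R^2 penalizes R^2 for the number of predictors.
n, k = 5, 1                    # observations and predictors
ss_res, ss_tot = 1.1, 6.0      # from the SPSS output above
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")  # 0.817, 0.756
```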

  17. SPSS Correlation Output

  18. Hypothesis Tests for the Correlation Coefficient H0: ρ = 0 (no significant linear relationship) H1: ρ ≠ 0 (linear relationship is significant) Use the p-value for decision making. The p-value is 0.035 < 0.05; therefore we reject the null hypothesis and conclude that the correlation is significant at the 5% level.
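
The same test can be run outside SPSS; for example, scipy's pearsonr returns both r and the two-tailed p-value. A sketch using the appliance-store data:

```python
# Two-tailed test of H0: rho = 0 via the sample correlation coefficient.
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p:.3f}")  # r = 0.904, p = 0.035
```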

  19. Analysis-of-Variance Table and an F Test of the Regression Model H0: The regression model is not significant H1: The regression model is significant The p-value is 0.035. Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. Thus, the independent variable is linearly related to y, and this linear regression model is valid.
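
A sketch of the F computation: SS(Total) = 6.0 and SS(Residual) = 1.1 come from the output summarized earlier, so SS(Regression) = 6.0 − 1.1 = 4.9, with 1 and n − 2 = 3 degrees of freedom.

```python
# Overall F test for the simple regression model.
from scipy.stats import f as f_dist

ss_reg, ss_res = 4.9, 1.1      # SS(Total) - SS(Residual), and SS(Residual)
df_reg, df_res = 1, 3          # k, and n - k - 1
F = (ss_reg / df_reg) / (ss_res / df_res)
p = f_dist.sf(F, df_reg, df_res)
print(f"F = {F:.2f}, p-value = {p:.3f}")  # F = 13.36, p = 0.035
```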

  20. Testing for the existence of a linear relationship • We test the hypothesis: H0: b = 0 (the independent variable is not a significant predictor of the dependent variable) H1: b ≠ 0 (the independent variable is a significant predictor of the dependent variable). • If the null hypothesis is rejected, we can conclude that the independent variable contributes significantly to predicting the dependent variable. Conclusion: The actual significance is 0.035; therefore we reject the null hypothesis. Advertising expense is a significant explanatory variable.
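
For the slope test, the t statistic is b divided by its standard error. A minimal sketch, with MSE = SS(Residual)/(n − 2) taken from the values used above:

```python
# t test for H0: b = 0 in the simple regression example.
import numpy as np
from scipy.stats import t as t_dist

x = np.array([1, 2, 3, 4, 5], dtype=float)
n, b = len(x), 0.7
mse = 1.1 / (n - 2)                                # SS(Residual) / (n - 2)
se_b = np.sqrt(mse / np.sum((x - x.mean()) ** 2))  # standard error of the slope
t_stat = b / se_b
p = 2 * t_dist.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.3f}, p-value = {p:.3f}")      # t = 3.656, p = 0.035
```

Note that t² = 13.36 equals the F statistic from the previous slide, as it must in simple regression.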

  21. Multiple Regression Analysis Multiple regression uses several independent variables to predict or explain the variation in a dependent variable. The basic multiple regression model is a first-order model, containing each predictor but no nonlinear terms such as squared values. In this model, each slope should be interpreted as a partial slope: the predicted effect of a one-unit change in that variable, holding all other variables constant.

  22. Assumptions of the Multiple Linear Regression Model • The size of the sample has a direct impact on the statistical power of significance testing in multiple regression. Power in multiple regression refers to the probability of detecting as statistically significant a specific level of R-square, or a regression coefficient, at a specified significance level and sample size. As a rule of thumb, there should be at least 20 times more cases than independent variables. • The variables can be measured as either continuous (metric) or dichotomous (non-metric). When the dependent variable is dichotomous (coded 0-1), discriminant analysis is appropriate. • Linearity: as regression analysis is based on the concept of correlation, the linearity of the relationship between the dependent and independent variables is important.

  23. Assumptions of the Multiple Linear Regression Model • Homoscedasticity: the assumption that the variance of the errors is constant across the values of the independent variables. • Independence of error terms and normality: the errors of prediction (differences between the obtained and predicted dependent variable scores) are assumed to be independent and normally distributed. • Multicollinearity: the situation where the independent/predictor variables are highly correlated. When independent variables are multicollinear, there is "overlap" or sharing of predictive power, so that one variable can be largely explained or predicted by the other variable(s); that predictor then adds little to the explanatory power of the entire set. • Checking for multicollinearity: in multiple regression using SPSS, it is possible to request Tolerance and VIF values for each predictor as a check for multicollinearity. A tolerance value indicates the proportion of variance in a predictor that cannot be accounted for by the other predictors, so very small values indicate "overlap" or sharing of predictive power (i.e., the predictor is redundant). Tolerance values less than 0.10 may merit further investigation. The VIF (variance inflation factor) is computed as 1/tolerance, and predictor variables whose VIF values are greater than 10 may merit further investigation. A computational sketch of this check follows below.
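
To make the tolerance/VIF definition concrete outside SPSS, here is a sketch that regresses each predictor on the others and reports tolerance = 1 − R²ⱼ and VIF = 1/tolerance. The predictors are synthetic, for illustration only.

```python
# Tolerance and VIF: regress each predictor on the remaining predictors.
import numpy as np

def vif_table(X):
    """X: (n, k) array of predictors, no intercept column."""
    n, k = X.shape
    results = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        tol = 1 - r2_j                  # variance NOT explained by the others
        results.append((tol, 1 / tol))  # VIF = 1 / tolerance
    return results

# Synthetic predictors: x2 is deliberately correlated with x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=50)
x3 = rng.normal(size=50)
for name, (tol, vif) in zip(("x1", "x2", "x3"),
                            vif_table(np.column_stack([x1, x2, x3]))):
    print(f"{name}: tolerance = {tol:.3f}, VIF = {vif:.3f}")
```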

  24. Multiple Regression Analysis Estimating equation describing the relationship among three variables: Ŷ = a + b1X1 + b2X2. Estimating equation describing the relationship among k independent variables: Ŷ = a + b1X1 + b2X2 + … + bkXk.
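
The ten tax-department cases analyzed next are not reproduced in the slides, so the sketch below fits a first-order model to hypothetical placeholder data; only the mechanics (an intercept column plus ordinary least squares) mirror what SPSS computes.

```python
# Ordinary least squares for y = a + b1*x1 + b2*x2 + b3*x3.
# The data below are hypothetical placeholders, not the slides' tax data.
import numpy as np

rng = np.random.default_rng(0)
n = 10
X = rng.uniform(0, 10, size=(n, 3))                    # x1, x2, x3
y = (2.0 + 0.5 * X[:, 0] + 1.2 * X[:, 1] + 0.4 * X[:, 2]
     + rng.normal(0, 0.5, size=n))                     # known "true" model

X_design = np.column_stack([np.ones(n), X])            # prepend intercept column
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("a, b1, b2, b3 =", np.round(coefs, 3))
```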

  25. Multiple Regression Analysis – Example File: PPT_MultRegr The department wants to know whether the number of field audits and the computer hours spent on tracking have yielded any results. Further, the department has introduced a reward system for informants who help track down the culprits. The data on actual unpaid taxes for ten cases are considered for analysis. Initially, the regression of Actual Unpaid Taxes (Y) on Field Audits (X1) and Computer Hours (X2) was carried out; as a next step, Reward to Informants (X3) was also included as a variable and analyzed. The analysis yielded the following SPSS outputs.

  26. Steps in SPSS Analysis: Analyze – Regression – Linear; Independents – X1, X2, X3; Dependent – Y; Statistics – Regression coefficients: Estimates, Model Fit, Descriptives, Collinearity diagnostics; OK

  27. Multiple Regression and Correlation Analysis – SPSS Output

  28. Multiple Regression and Correlation Analysis – SPSS Output for 3 Independent Variables

  29. Multiple Regression and Correlation Analysis Using two independent variables: Ŷ = a + b1X1 + b2X2, with the coefficients taken from the first SPSS run. Using three independent variables: Ŷ = −45.796 + 0.597X1 + 1.177X2 + 0.405X3 (coefficients from the output, interpreted on slides 37-38).

  30. Coefficient of Determination • From the output, R2 = 0.983 • 98.3% of the variation in actual unpaid taxes is explained by the three independent variables; 1.7% remains unexplained. • For this example, multicollinearity is not a problem, since the tolerance values are all more than 0.10.

  31. Multiple Regression and Correlation Analysis Making inferences about Population Parameters 1. Inferences about an individual slope or whether a variable is significant 2. Regression as a whole

  32. Testing the Validity of the Model • We pose the question: Is there at least one independent variable linearly related to the dependent variable? • To answer the question we test the hypothesis H0: B1 = B2 = … = Bk = 0 H1: At least one Bi is not equal to zero. • If at least one Bi is not equal to zero, the model has some validity.

  33. Multiple Regression and Correlation Analysis Inferences about the Regression as a Whole

  34. Multiple Regression and Correlation Analysis The p-value (Significance F) = 0.0000 < α (0.05); hence, reject the null hypothesis. Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the Bi is not equal to zero; thus, at least one independent variable is linearly related to y, and this linear regression model is valid.

  35. Multiple Regression and Correlation Analysis Test of whether a variable is significant. For example, test whether reward to informants is a significant explanatory variable.

  36. Multiple Regression and Correlation Analysis Conclusion: The p-value (Significance t) = 0.0000 < α (0.05); hence, reject the null hypothesis. The reward to informants is a significant explanatory variable; in other words, it contributes significantly to predicting the actual unpaid taxes. Similarly, field audits and computer hours are significant explanatory variables.

  37. Interpreting the Coefficients • b0 = −45.796. This is the intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept. • b1 = 0.597. In this model, for each additional field audit, the actual unpaid taxes increase on average by 0.597% (assuming the other variables are held constant).

  38. Interpreting the Coefficients • b2 = 1.177. In this model, for each additional computer hour, the actual unpaid taxes increase on average by 1.177% (assuming the other variables are held constant). • b3 = 0.405. In this model, for each additional reward to informants, the actual unpaid taxes increase on average by 0.405% (assuming the other variables are held constant).

  39. Steps in SPSS Analysis (Stepwise): Analyze – Regression – Linear; Method – Stepwise; Independents – X1, X2, X3; Dependent – Y; Statistics – Regression coefficients: Estimates, Model Fit, Descriptives, Collinearity diagnostics; OK
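
SPSS's stepwise method adds and removes predictors according to entry/removal criteria. As a rough illustration of the idea (a simplified forward-selection sketch, not SPSS's exact algorithm), the function below adds, at each step, the candidate predictor with the smallest partial-F p-value and stops when no candidate is significant at α.

```python
# Forward selection: a simplified stepwise-regression sketch.
import numpy as np
from scipy.stats import f as f_dist

def rss(cols, y):
    """Residual sum of squares of OLS on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + cols) if cols else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

def forward_select(X, y, alpha=0.05):
    n, k = X.shape
    selected, remaining = [], list(range(k))
    while remaining:
        rss_cur = rss([X[:, j] for j in selected], y)
        best_j, best_p = None, 1.0
        for j in remaining:
            rss_new = rss([X[:, i] for i in selected + [j]], y)
            df_res = n - len(selected) - 2        # residual df after adding j
            F = (rss_cur - rss_new) / (rss_new / df_res)
            p = f_dist.sf(F, 1, df_res)
            if p < best_p:
                best_j, best_p = j, p
        if best_p >= alpha:
            break                                 # no candidate is significant
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Run on the three tax-example predictors, a procedure like this would build up the Model 1 → Model 2 → Model 3 progression shown in the stepwise output that follows.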

  40. Stepwise Multiple Regression Analysis – SPSS Output

  41. Stepwise Multiple Regression Analysis – SPSS Output (Contd.)

  42. Stepwise Multiple Regression Analysis – SPSS Output (Contd.)

  43. Coefficient of Determination • From the output: • R2 = 0.595 for Model 1. That means 59.5% of the variation in actual unpaid taxes is explained by the single most significant independent variable, computer hours. • R2 = 0.834 for Model 2. That means 83.4% of the variation in actual unpaid taxes is explained by the two independent variables computer hours and reward to informants. The incremental explanation attributed to reward to informants is 83.4% − 59.5% = 23.9%. • R2 = 0.983 for Model 3. That means 98.3% of the variation in actual unpaid taxes is explained by the three independent variables; 1.7% remains unexplained. • For this example, multicollinearity is not a problem, since the tolerance values are all more than 0.10.
