Multiple Regression. Here we add more independent variables to the regression. In this section I focus on sections 13.1, 13.2 and 13.4.
Let’s begin with an example of simple linear regression. A trucking company wants to understand what drives the time its drivers spend on the road. Travel time (how long it takes to make the day’s deliveries) is the dependent variable. It seems that travel time would be influenced by miles traveled, the independent variable. I have the simple regression results from such a study on the next slide.
Note the estimated equation is y (time in hours) = 1.27 + .068x (miles). The p-value on the slope coefficient (the miles line) is .0041, and since it is less than .05 (say that is the alpha we use) we reject the null of a zero slope and conclude there is a relationship between miles driven and travel time. R square = .66, and thus 66% of the variation in y is explained by x.
So, we had a significant relationship between x and y and the r square was .66. This r square is not bad, but with only 66% of the variation in travel time explained by miles driven, the company may think other variables could explain some of the remaining variability. Another variable that could explain travel time is the number of deliveries made. In a multiple regression we add another variable to the initial x variable we had included. In Excel you just include two (or more) columns of variables in the definition of X. Note you may want to put the Y variable in the last column on the right or the first column on the left, because the X’s need to sit together in contiguous columns. I have a multiple regression output on the next slide.
Math form. The multiple regression form of the model is Yi = B0 + B1x1 + B2x2 + … + e, where B0 is the Y intercept, each Bj (j = 1, 2, …) is the slope on xj, and e is an error term that captures all those influences on Y not picked up by the x’s. The error term reflects the fact that the points are not all directly on the line. So, we think there is a regression line out there that expresses the relationship between the x’s and Y. We have to go find it. In fact, we take a sample and get an estimate of the regression line.
When we have a sample of data from a population, we say in general the regression line is estimated to be Ŷi = b0 + b1x1 + b2x2 + …, where the ‘hat’ refers to the estimated, or predicted, value of Y. Once we have this estimated line we are right back to algebra. Ŷ values are exactly on the line. Now, for each set of x values we have the data values, called Y’s, and the one value on the line, called Ŷ. This part of multiple regression is very similar to simple regression. But our interpretation will change a little.
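To make "go find the line" concrete, here is a minimal sketch of how the estimates b0, b1, b2 can be computed by least squares. It uses plain Python rather than Excel, and the data are made up (built so that y = 1 + 2x1 + 3x2 exactly); they are not the trucking data. The function names `solve` and `fit` are my own choices for illustration.

```python
# Minimal least-squares sketch in pure Python (no libraries).
# The data below are made up for illustration; they are NOT the trucking data.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so the caller's lists are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit(xs, y):
    """Return [b0, b1, ..., bk] that minimize the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in xs]  # the leading 1 is for the intercept
    p = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data built so that y = 1 + 2*x1 + 3*x2 exactly (no error term).
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
b0, b1, b2 = fit(xs, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the made-up y values sit exactly on a plane, the estimates should come back very close to 1, 2 and 3. With real data the error term keeps the points off the line, and the estimates are only estimates.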
From the coefficients section of the multiple regression output, the estimated regression line is Ŷ = -.8687 + .0611x1 + .9234x2. From the simple regression we had Ŷ = 1.2739 + .0678x1. You will note the coefficient on x1 does not have the same value in each case. In the simple regression case the .0678 is the increase in the mean value of Y for each unit increase in x1, but we did not control for the other factors at work in influencing Y. In the multiple case the .0611 is the increase in the mean value of Y when x1 increases by 1, but now we have controlled for the influence that x2 has on Y by including x2 in the equation. Each slope, or net regression coefficient, measures the mean change in Y per unit change in the particular x, holding constant the effect of the other x variable.
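A quick way to see what "holding constant" means is to plug numbers into the estimated line. The sketch below uses the coefficients quoted above; I am assuming x1 is miles and x2 is number of deliveries, as the earlier slides suggest, and the inputs of 100 miles and 4 deliveries are made up.

```python
# Coefficients copied from the estimated multiple regression line in the notes.
# Assumption: x1 = miles traveled, x2 = number of deliveries.
def predict_hours(miles, deliveries):
    return -0.8687 + 0.0611 * miles + 0.9234 * deliveries

# Hypothetical route: 100 miles and 4 deliveries.
base = predict_hours(100, 4)

# One more mile, same number of deliveries: the prediction moves by the miles slope.
one_more_mile = predict_hours(101, 4) - base

# One more delivery, same miles: the prediction moves by the deliveries slope.
one_more_delivery = predict_hours(100, 5) - base

print(round(base, 4), round(one_more_mile, 4), round(one_more_delivery, 4))
```

The two differences are just the slopes .0611 and .9234, which is the "net" interpretation: change one x by a unit, leave the other alone.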
Problem 3 page 471. a) The formal model is Y = B0 + B1(foreimp) + B2(midsole) + e and the estimated equation is Ŷ = -0.027 + 0.791X1 + 0.605X2. b) Since the p-value on each slope is less than alpha = .05 (something we will see more about later), we can interpret each slope separately in the following way. For each unit increase in foreimp, the mean value of Y increases by 0.791, holding constant the value of midsole. For each unit increase in midsole, the mean value of Y increases by 0.605, holding constant the value of foreimp.
Correlation, Causation. Think about a light switch and the light that is on the electrical circuit. If you and I collect data about someone flipping the switch and the lights going on and off, we would be able to say that there is correlation from a statistical point of view. In fact, you and I know we can say something even stronger. We can say in this case there is causation. In the world of business (and other areas) we want to find relationships between variables. We would hope to find correlation, and if we have a compelling theory maybe we could say we have causation.
Example. Say we are interested in crop yield on a farm. What variables are correlated with crop yield? You and I know the amount of water has been shown to have an impact on yield, as have fertilizer and soil type, among other things. In a multiple regression setting, if Y = yield, x1 = water amount, and x2 = amount of fertilizer, the multiple regression would be of the form Y = B0 + B1x1 + B2x2 + e and our estimated regression would be of the form Ŷ = b0 + b1x1 + b2x2.
r square. The r square on the regression printout is a measure designed to indicate the strength of the relationship between the x’s and y. The number can be between 0 and 1, with values closer to 1 meaning a stronger relationship. r square is actually the proportion of the variation in y that is accounted for by the x variables (multiply by 100 for the percentage). This is also an important idea because, although we may have a significant relationship, we may not be explaining much. From the yield example, the more variation we can explain the more we can control yield and thus feed the world, perhaps. Or maybe in a business setting the more variation we can explain the more profit we can make.
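If you want to see where the number on the printout comes from, here is a minimal sketch of the usual calculation: r square = 1 - SSE/SST, where SST is the total variation in y around its mean and SSE is the variation left over around the fitted line. The y values and fitted values below are made up for illustration, and the function name `r_square` is my own.

```python
def r_square(y, y_hat):
    """Share of the variation in y accounted for by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
    return 1 - sse / sst

# Hypothetical observed values and hypothetical fitted values from some regression.
y = [2, 4, 6, 8]
y_hat = [3, 3, 7, 7]

r2 = r_square(y, y_hat)
print(round(r2, 4))
```

Here SST = 20 and SSE = 4, so r square works out to 0.8: the fitted values account for 80% of the variation in these made-up y values.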
F Test. In a multiple regression, a case of more than one x variable, we conduct a statistical test about the overall model. The basic question is: do all the x variables, as a package, have a relationship with the Y variable? The null hypothesis is that there is no relationship, and we write this in shorthand notation as Ho: B1 = B2 = … = 0. If this null hypothesis is true, the equation for the line would mean the x’s do not have an influence on Y. The alternative hypothesis is that at least one of the B’s is not zero, written H1: not all Bi’s = 0. Rejecting the null means that the x’s as a group are related to Y. The test is performed with what is called the F test. From the sample of data we can calculate a number called the F statistic and use this value to perform the test. In our class we will have F calculated for us because it is a tedious calculation.
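In class the printout gives us F, but for the curious, one standard way to write the F statistic uses r square: F = (r square / k) / ((1 - r square) / (n - k - 1)), where k is the number of x’s and n is the sample size. The sketch below is only an illustration; the inputs (r square of .90, n = 10, k = 2) are made up and do not come from any printout in these notes.

```python
def f_statistic(r2, n, k):
    """Overall F statistic from r square, sample size n and number of x variables k."""
    explained_per_x = r2 / k
    unexplained_per_df = (1 - r2) / (n - k - 1)
    return explained_per_x / unexplained_per_df

# Hypothetical inputs for illustration only.
f_stat = f_statistic(r2=0.90, n=10, k=2)
print(round(f_stat, 2))
```

The fraction form is worth noticing: a big r square pushes the top up and the bottom down, so F gets large when the x’s as a package explain a lot.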
[Figure: F distribution.] Under the null hypothesis the F statistic we calculate from a sample has a right-skewed distribution like the one shown. The F test here is a one-tailed test. The farther to the right the statistic we get in the sample is, the more we are inclined to reject the null, because extreme values are not very likely to occur under the null hypothesis. In practice we pick a level of significance and use a critical F to mark the dividing line between not rejecting the null and rejecting the null.
Critical F. [Figure: F distribution with the right-tail area beyond the critical F equal to alpha.] To pick the critical F we have two types of degrees of freedom to worry about: the numerator and the denominator degrees of freedom. They are called this because the F stat is a fraction. Numerator degrees of freedom = number of x’s, in general called k. Denominator degrees of freedom = n – k – 1, where n is the sample size. As an example, if n = 10 and k = 2 we would say the degrees of freedom are 2 and 7, where we start with the numerator value. You would see from a book that the critical F is 4.74 when alpha is .05. Many times the book also has information for alpha = .025 and .01.
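The degrees of freedom and the table lookup can be sketched in a few lines. One caution: the formula used below for the critical F is a shortcut that only works when the numerator degrees of freedom equal 2 (two x variables), which happens to be our case. For other cases you would use the F table or a statistics library (for example `scipy.stats.f.ppf`). The function names are my own.

```python
def f_degrees_of_freedom(n, k):
    """Return (numerator df, denominator df) for the overall F test."""
    return k, n - k - 1

def critical_f_two_x(den_df, alpha):
    """Critical F when the numerator df is exactly 2 (closed-form special case).

    With 2 numerator df the right-tail area beyond F is (1 + 2F/den_df) ** (-den_df/2);
    setting that equal to alpha and solving for F gives the line below.
    """
    return (den_df / 2) * (alpha ** (-2 / den_df) - 1)

num_df, den_df = f_degrees_of_freedom(n=10, k=2)
crit = critical_f_two_x(den_df, alpha=0.05)
print(num_df, den_df, round(crit, 2))
```

With n = 10 and k = 2 this gives degrees of freedom of 2 and 7, and a critical F that rounds to the 4.74 in the book’s table.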
[Figure: F distribution with right-tail area alpha = .05 beyond the critical F of 4.74.] In our example here the critical F is 4.74. If from the sample we get an F statistic that is greater than 4.74, we would reject the null and conclude the x’s as a package have a relationship with the variable Y. On the previous slide is an example where the F stat is 32.8784, and so the null hypothesis would be rejected in that case.
P-value. [Figure: the same F distribution, with 32.8784 marked far to the right of the critical F of 4.74.] The computer printout has a number on it that means we do not even have to look at the F table if we do not want to. But the idea is based on the table. Here you see 32.8784 is in the rejection region. I have colored in the tail area for this number. Since 4.74 has a tail area = alpha = .05 here, we know the tail area for 32.8784 must be less than .05. This tail area is the p-value for the test stat calculated from the sample, and on the computer printout it is labeled Significance F. In the example the value is .0003.
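The tail area can be checked by hand in our special case. As a sketch, and again only valid when the numerator degrees of freedom equal 2, the right-tail area beyond an F value is (1 + 2F/d)^(-d/2), where d is the denominator degrees of freedom. For other degrees of freedom, a statistics library (for example `scipy.stats.f.sf`) or the printout is the way to go. The function name is my own.

```python
def significance_f_two_x(f_stat, den_df):
    """Right-tail area (p-value) beyond f_stat when the numerator df is exactly 2."""
    return (1 + 2 * f_stat / den_df) ** (-den_df / 2)

# F stat from the notes, with 2 and 7 degrees of freedom (n = 10, k = 2).
p_value = significance_f_two_x(32.8784, den_df=7)
alpha = 0.05

print(round(p_value, 4))
print(p_value < alpha)  # True means reject the null
```

Rounded to four places the tail area is .0003, in line with the Significance F quoted from the printout, and it is far below alpha = .05.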
SOOOOOOO, reject the null if the F stat > the critical F in the table or, equivalently, if the Significance F < alpha. If you can NOT reject the null, then at this stage of the game we have no evidence of a relation between the x’s and the Y, and our work here would be done. So from here on out I assume we have rejected the null.
t tests. After the F test we do a t test on each of the slopes, similar to what we did in the simple linear regression case, to check that each variable on its own has a relationship with y. There, we reject the null of a zero slope when the p-value on the slope is less than alpha. The t test for each regression coefficient is equivalent to testing for the contribution of each independent variable.
Multicollinearity. Can you say multicollinearity? Sure you can. Let’s all say it together on the count of 3. 1, 2, 3, multicollinearity! Very good class, now listen up! Multicollinearity is an idea that volumes have been written about. We want to have a basic feel for the problem here. You and I want x variables that help explain Y, so that we can predict and explain movement in Y. As an example, if we can predict and explain crop yield, maybe we can make yield higher so that we can feed the world! So, we want x’s that are correlated with Y. This is a good thing. But sometimes the x’s will be correlated with each other. This is called multicollinearity. The problem here is that sometimes we cannot see the separate influence an x has on Y because the other x’s have picked up that influence through their correlation with it.
From a practical point of view multicollinearity could have the following effect on your research. You reject the null hypothesis of no relationship between all the x variables and Y with the F test, but you cannot reject the null in some or all of the separate t tests for the separate slopes. Don’t freak out (yet!). Let’s think about crop yield. Some farmers have water systems. The more it rains in a summer the less water the farmers directly apply. (Okay, maybe I am ignorant here and farmers can use all the water they can apply; it’s an example.) If you included both inches of rain and water applied, there is a correlation between the two. This may make it difficult to see the separate impact of either the rain or the water from the system. As a rule of thumb, if the x’s (the independent variables) have correlations with each other above .7 or below -.7, then multicollinearity could be a problem.
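Checking the rule of thumb is just a matter of computing the correlation between each pair of x’s. Here is a minimal sketch with made-up rain and irrigation numbers (they are not from any data set in these notes); the variable and function names are my own.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / math.sqrt(ss_a * ss_b)

# Made-up numbers: more rain goes with less water applied by the farmer.
rain_inches = [1, 2, 3, 4, 5]
water_applied = [10, 8, 7, 4, 2]

r = correlation(rain_inches, water_applied)
possible_problem = abs(r) > 0.7  # the .7 / -.7 rule of thumb from the notes

print(round(r, 2), possible_problem)
```

In this made-up case the correlation is strongly negative, well past -.7, so the rule of thumb would flag these two x’s as a possible multicollinearity problem.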
Problem 4 page 471, problem 14 page 476 and problem 26 page 481. On the previous slide I have an Excel printout.
a) The model for the problem is Cost = B0 + B1(sales) + B2(# of orders) + e and the estimated line is Ŷ = -2.728 + 0.0471X1 + 0.0119X2.
b) For each unit increase in sales, the mean value of Y increases by 0.0471, holding constant the # of orders. For each unit increase in # of orders, the mean value of Y increases by 0.0119, holding constant the value of sales.
c) While the value for b0 = -2.728, we really do not look to it for much meaning because in the data there are no sales values = 0 and no # of orders = 0. This is like the extrapolation we saw before: it is risky to interpret outside the range of the values of the x’s.
d) To predict we use Ŷ = -2.728 + 0.0471X1 + 0.0119X2, and note that for sales we use 400 because the data are in thousands. So, we have Ŷ = -2.728 + .0471(400) + .0119(4500) = 69.662. (Not doing e and f.)
Prob 14 a) The critical value of F with 2 and 21 degrees of freedom at alpha = .05 is 3.47. Our F stat from the printout is 75.13, so we can reject Ho: B1 = B2 = 0 and conclude that there is a significant relationship between the x’s and the Y variable.
b) The p-value here is the value under the heading Significance F and has value 3.0429E-10. The E-10 part means move the decimal 10 places to the left. Thus the p-value is 0.00000000030429, which is way less than alpha, so we can reject the null and conclude the same thing we did in part a.
c) r2, or R square, = .8759, which means that 87.59% of the variation in costs is explained by the variation in sales and the variation in the number of orders.
d) Not doing.
Prob 26 a) Not doing.
b) Note the p-values for sales and # of orders are both less than alpha = .05, so each variable makes a significant contribution to the model and both should be included in the model.
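The prediction arithmetic in Problem 4 d) and the E-notation in Prob 14 b) are easy to check with a few lines. This is only a sketch: the coefficients and inputs are copied from the notes, and alpha = .05 is the level of significance I am assuming.

```python
# Coefficients copied from the estimated line in the notes.
b0, b1, b2 = -2.728, 0.0471, 0.0119

# Inputs used in part d): sales entered as 400 (data in thousands) and 4500 orders.
sales, orders = 400, 4500
y_hat = b0 + b1 * sales + b2 * orders
print(round(y_hat, 3))

# Excel's "3.0429E-10" is scientific notation; Python reads it directly.
significance_f = float("3.0429E-10")
print(f"{significance_f:.14f}")  # written out with the decimal moved 10 places left

alpha = 0.05  # assumed level of significance
print(significance_f < alpha)    # True means reject the null
```

The first line reproduces the 69.662 from part d), and the second shows the Significance F written out the long way.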