Simple Linear Regression. Part II.

Presentation Transcript


  1. Simple Linear Regression Part II

  2. Let’s do a brief review of some of our earlier work. For a single quantitative variable in a population, the population mean is represented by the Greek letter mu (µ). If we do not conduct a census (talk to everyone in the population), then xbar, the sample mean, can be used as the basis for inference about mu. A new idea we explore in this chapter arises when we have two quantitative variables in a population and some conjecture that the variables are related in a linear fashion.

  3. The Model If we investigated each subject in the population and noted the response variable y and the explanatory variable x, we could create the model y = βo + β1x + ε, where βo is the y intercept of the model relationship between x and y (note the book uses a Greek letter here that looks like a capital B), β1 is the slope of the model, and ε is the random error in y. The error term is not always zero because not every point in the scatter plot falls on a straight line. Note the y intercept βo is the mean value of y when x is 0, and the slope β1 is the expected change in the mean value of y when x increases by 1 unit.
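
To make the model concrete, here is a minimal simulation sketch in Python. The intercept, slope, and error spread are hypothetical values chosen to echo the example developed later in these slides; nothing here comes from real data.

```python
import numpy as np

# A sketch of the population model y = b0 + b1*x + eps.
# All numbers below are illustrative assumptions, not data from the slides.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 14.8, 5.7, 0.5     # hypothetical intercept, slope, error sd
x = rng.uniform(2.0, 4.0, size=200)      # hypothetical gpa values
eps = rng.normal(0.0, sigma, size=200)   # random error: y is not exactly on the line
y = beta0 + beta1 * x + eps              # response, e.g. starting salary in thousands
```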

  4. Using a Sample to Estimate the Model On the next slide I show some data and a scatter plot for an example we will develop. A sample has been taken from 7 graduates of a school, and the gpa and starting salary of each graduate form the rows of the table. Each point in the scatter plot is a (gpa, starting salary) pair for a graduate. With a sample of data we can estimate the regression model (meaning we find point estimates of βo and β1). In general notation the predicted line is written ŷ = bo + b1x, where bo is the point estimate of βo, b1 is the point estimate of β1, and ŷ is the value of y that falls on the line above an x value; ŷ is called the predicted value of y. So in the one-variable case xbar was used to think about mu, and in the two-variable case bo + b1x is used for βo + β1x.

  5. Least Squares For now we will assume Microsoft Excel or some other program can show us the estimated regression line using least squares; we just want to use what we get. On the Excel output, cell E17 contains the word Coefficients. Cells A18:A19 contain the words Intercept and GPA, and the numbers 14.8156153 and 5.706568981 appear in cells B18:B19. This means ŷ = bo + b1x has been estimated as starting salary = 14.816 + 5.7066(gpa). Note the data had starting salary measured in thousands, so, for example, a data value of 29.8 means the real value is 29,800.
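
If you want to see what Excel is doing under the hood, here is a sketch of the least squares formulas in Python. The seven (gpa, salary) pairs below are hypothetical stand-ins, since the slides report only the fitted coefficients, the x range, and one salary value, not the raw data.

```python
import numpy as np

# Hypothetical stand-in data: the slides only tell us x runs from 2.21 to 3.82
# and that one salary value was 29.8 (thousands).
gpa = np.array([2.21, 2.55, 2.80, 3.05, 3.30, 3.55, 3.82])
salary = np.array([27.5, 29.8, 30.8, 32.3, 33.6, 35.2, 36.6])   # thousands

# Least squares slope and intercept.
b1 = ((gpa - gpa.mean()) * (salary - salary.mean())).sum() / ((gpa - gpa.mean()) ** 2).sum()
b0 = salary.mean() - b1 * gpa.mean()
print(b0, b1)   # the slides' real data gave 14.816 and 5.7066
```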

  6. Prediction with least squares Remember our estimated line is starting salary = 14.816 + 5.7066(gpa). Say we want to predict salary when the gpa is 2.7: starting salary = 14.816 + 5.7066(2.7) = 30.22382. Since salary is in thousands, this starting salary is $30,223.82.
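
As a one-line check in code, using the coefficients from the Excel output:

```python
b0, b1 = 14.816, 5.7066   # estimates from the Excel output
gpa = 2.7
print(b0 + b1 * gpa)      # 30.22382 -> about $30,223.82
```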

  7. Interpolation and Extrapolation You will notice in our example data set that the smallest value of x was 2.21 and the largest was 3.82. When we predict a value of y, ŷ, for a given x, if that x is within the range of the data values for x (2.21 to 3.82 in our example) we are interpolating; if it is outside that range we are extrapolating. Extrapolation should be used with a great deal of caution: the relationship between x and y may be different outside the range of our data, and if so, predictions from the estimated line may be way off. The intercept has to be interpreted with similar caution, because unless the range of our x data includes zero, the relationship between x and y could be very different in the x = 0 neighborhood than the one suggested by least squares.
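
A small sketch of how you might build that caution into a prediction helper; the coefficients and x range come from the example, while the warning behavior is an added illustration.

```python
def predict_salary(gpa, b0=14.816, b1=5.7066, x_min=2.21, x_max=3.82):
    """Predict starting salary (thousands), flagging extrapolation."""
    if not (x_min <= gpa <= x_max):
        print(f"warning: gpa={gpa} lies outside [{x_min}, {x_max}]; extrapolating")
    return b0 + b1 * gpa

print(predict_salary(2.7))   # interpolation: within the data range
print(predict_salary(0.0))   # extrapolation: the x = 0 intercept caution applies
```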

  8. Let’s explore the starting salary, gpa example. Both variables are quantitative. The response variable is starting salary. In the population of all graduates, µy is the mean or average starting salary. In a single-variable setting we would use ybar as an estimate of the population mean. Plus, since not everyone has the same starting salary, we could use the value Σ(y – ybar)² as the basis for estimating the standard error of the mean. But when we think about starting salary being related to gpa, our perspective changes: the mean starting salary changes as gpa changes, and thus our notion of the standard error will change as well.

  9. Standard error of the estimate Recall earlier in the term we talked about the standard deviation, a measure of variability for a variable, where we looked at xi minus xbar. For the dependent variable we have both the actual data values and the predicted values. The standard error of the estimate is a measure of variability of the actual values of y around the predicted values; here we look at y – ŷ. The formula is in the book, and programs like Excel print the value: in cell E8 of our example we see 0.563620693.
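
A sketch of the computation, where y and y_hat stand for the actual and predicted values; the n − 2 divisor reflects the two estimated coefficients. With the real example data this would return the 0.5636 shown in cell E8.

```python
import numpy as np

# Standard error of the estimate: s = sqrt( sum((y - y_hat)^2) / (n - 2) ).
def std_error_of_estimate(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)       # variability of actual y around the line
    return np.sqrt(sse / (len(y) - 2))   # n - 2: two coefficients were estimated
```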

  10. [Figure: a single data point y shown against both the mean of y (ybar) and the predicted value (yhat) on the regression line, at a given x.]

  11. On the previous slide I show a single data point (a y value) in reference to both the mean of y (ybar) and the predicted value of y (yhat) for a given x. In a real-world data set: 1) when x changes, y changes as well (if there is a relationship between x and y); 2) at each x there can be more than one y value (not everyone with the same x has the same y; there may be other things going on with y besides x). Let’s remember that in real-world data sets 1) not everyone has the same value on a variable, so there is variability in the variable, and 2) when we introduce the regression line, not every point is on the line (usually), so there is not a perfect relationship between x and y. The way to connect what we did with one variable to what we do with two variables is to break up (y – ybar), the single-variable idea, into (yhat – ybar) + (y – yhat). Let’s explore this more.

  12. Variation Remember, to calculate the standard deviation of a variable we take each value, subtract off the mean, and square the result. (We also divided by something, but that is not important in this discussion.) In a regression setting, on the dependent variable Y we define the total sum of squares as SST = Σ(Yi – Ybar)². SST can be rewritten as SST = Σ(Yi – Ŷi + Ŷi – Ybar)² = Σ(Ŷi – Ybar)² + Σ(Yi – Ŷi)² = SSR + SSE. Note: you may recall from algebra that (a + b)² = a² + 2ab + b². In our story here the cross term, 2Σ(Ŷi – Ybar)(Yi – Ŷi), equals zero; this is not true in general in algebra, but it holds for least squares because the residuals are uncorrelated with the fitted values. If this note makes no sense to you, do not worry; just use SST = SSR + SSE.
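
A quick numeric sketch of the decomposition, again with y actual and y_hat predicted:

```python
import numpy as np

def sums_of_squares(y, y_hat):
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)       # total variation in y
    ssr = np.sum((y_hat - y_bar) ** 2)   # variation explained by the line
    sse = np.sum((y - y_hat) ** 2)       # variation left around the line
    return sst, ssr, sse

# For any least-squares fit, sst == ssr + sse up to floating-point rounding,
# because the cross term in the expansion vanishes.
```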

  13. Variation So we have SST = Σ(Yi – Ybar)², SSR = Σ(Ŷi – Ybar)², and SSE = Σ(Yi – Ŷi)². On the next slide I have a graph of the data with the regression line drawn in, along with a line showing the mean of Y. For each point we could look at how far the point is from the mean line; this is what SST is looking at. SSR indicates that, of the full difference between the point and the mean, the regression line is able to account for some of that variation; the rest of the difference is SSE. SST is the total variation in y. SSR is the variation along the regression line, looking at how yhat and ybar differ. SSE is the variation about the line, looking at how the data points do not sit on the regression line.

  14. [Figure: Variation. Scatter plot of Y against X showing the least squares regression line Ŷi and a horizontal line at Ybar, with two examples of deviations going into SSE (points to the line) and two going into SSR (line to the mean).]

  15. The Coefficient of Determination The coefficient of determination, denoted r², measures the proportion of the variation in Y that is explained by the independent variable X in the regression model: r² = SSR/SST. In our example, take cell F13/F15 to get 0.976568924. This means that 97.66 percent of the variation in starting salary is explained by the variability in gpa; only 2.34% of the variability in starting salary is due to other factors.
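
In code, continuing the same sketch notation as above:

```python
import numpy as np

# Coefficient of determination: r^2 = SSR / SST, equivalently 1 - SSE/SST
# for a least-squares fit.
def r_squared(y, y_hat):
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    return 1.0 - sse / sst   # the example's Excel output gives 0.9766
```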

  16. Coefficient of Determination Say we didn’t have an X variable to help us predict the Y variable. Then a reasonable way to predict Y would be to just use its average or mean value. But with a regression, by using an X variable it is thought we can do better than just using the mean of Y as a predictor. In a simple linear regression, r² is an indicator of the strength of the relationship between the two variables, because using the regression model reduces the variability in predicting y (here, starting salary), relative to just using the mean of y, by the percentage obtained. In different areas of study (like marketing, management, and so on) the idea of what a good r² is varies, but you can be sure that if r² is .8 or above you have a strong relationship.

  17. F test Remember we have seen two ways to test the idea that x and y are related in the population: we looked at the population model slope and the population correlation coefficient. A third way to test is the F test. Essentially, the F test looks at the ratio of SSR to SSE, each divided by its degrees of freedom. The thinking is that if the two variables are not related, SSR will be relatively low and SSE relatively high, so the ratio will be small. On the Excel printout the F statistic in our example is 208.39, and the p-value, listed as “Significance F,” is .000029. This is less than .05, so we conclude the variables are related.
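
A sketch of the computation behind the printout: with one predictor, SSR has 1 degree of freedom and SSE has n − 2, and “Significance F” is the upper-tail probability. scipy is assumed here for the F distribution.

```python
import numpy as np
from scipy import stats

def f_test(y, y_hat):
    n = len(y)
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    f = (ssr / 1.0) / (sse / (n - 2))   # MSR / MSE; large when x explains much of y
    p = stats.f.sf(f, 1, n - 2)         # "Significance F": upper-tail p-value
    return f, p

# With the example data this would return F = 208.39, p = .000029.
```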

  18. You will notice that in this simple linear regression example the t statistic on the slope is the square root of the F statistic, and the p-value on the slope and the Significance F are the same! Remember how I mentioned that the t statistic with a large number of degrees of freedom is like the standard normal z statistic; now we see the F statistic and the t statistic are related as well.
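
You can verify the relationship directly from the example’s printout:

```python
import math

f_stat = 208.39               # F statistic from the Excel output
t_stat = math.sqrt(f_stat)    # slope t statistic in simple regression
print(t_stat, t_stat ** 2)    # t is about 14.44; squaring it recovers F
```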
