Simple Linear Regression

Example data: Beers and BAC.
Start by exploring the data
• Construct a scatterplot
• Does a linear relationship between variables exist?
• Is the relationship strong?
• How much variation can be explained by a linear relationship with the independent or explanatory variable?
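As an illustration (not part of the original slides), here is a minimal Python/matplotlib sketch of this first step. The beers-vs-BAC numbers below are made up for demonstration, not the textbook's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical illustrative data (not the textbook's Beers/BAC values)
beers = np.array([1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9], dtype=float)
bac   = np.array([0.02, 0.03, 0.05, 0.07, 0.06, 0.08, 0.09, 0.11, 0.10, 0.13, 0.17])

# Scatterplot: does the relationship look linear? How tight is the scatter?
plt.scatter(beers, bac)
plt.xlabel("Beers consumed (explanatory variable x)")
plt.ylabel("Blood alcohol content (response y)")
plt.title("Exploring the data before fitting a line")
plt.show()
```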
Variance “Candy Bar”
[Diagram: total variance drawn as a bar split into an Explained portion and an Unexplained portion]
• The R-sq value estimates the percentage of variation explained by a linear relationship with the independent or explanatory variable. Unless this estimate is 100% (or very near), it is not sufficient on its own.
• The amounts of explained and unexplained information due to the model are measured by Sums of Squares.
Decomposition of information into explained and unexplained parts
Residuals
• A residual is the difference between an observed value of the dependent variable and the value predicted by the regression line.
• Residual = (observed y) − (predicted y) = y − ŷ
• They help us assess the fit of a regression line.
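A short sketch of the same idea in Python, assuming a least-squares fit via numpy.polyfit and the same made-up illustrative data:

```python
import numpy as np

# Hypothetical illustrative data (not the textbook's values)
x = np.array([1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0.02, 0.03, 0.05, 0.07, 0.06, 0.08, 0.09, 0.11, 0.10, 0.13, 0.17])

# Least-squares fit: y_hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Residual = observed y - predicted y
residuals = y - y_hat
print(residuals)
```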
Variance “Candy Bar”: Sums of Squares
[Diagram: SS Total drawn as a bar split into “SS explained by model” (Explained) and “SS Error” (Unexplained)]
Systematic SS + Random SS = Total SS
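In the usual notation (ŷᵢ for the fitted values, ȳ for the mean of y; these symbols are standard, not shown on the slide), the decomposition can be written as:

```latex
\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{\text{Total SS}}
\;=\;
\underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{\text{Systematic SS (explained by model)}}
\;+\;
\underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{Random SS (SS Error)}}
```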
Model Assumptions about the residuals (ε)
• The distribution is NORMAL
• The mean is ZERO
• The variance is CONSTANT for all values of x (σ²)
• Errors associated with any two observations are independent
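One informal way to check these assumptions, sketched here with scipy and matplotlib on the same illustrative data (the slides use MINITAB instead), is to plot the residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical illustrative data (not the textbook's values)
x = np.array([1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0.02, 0.03, 0.05, 0.07, 0.06, 0.08, 0.09, 0.11, 0.10, 0.13, 0.17])

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
residuals = y - fitted

# Normality: a normal probability (Q-Q) plot of the residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Zero mean and constant variance: residuals vs. fitted values should be
# centered on zero with no visible pattern or funnel shape
plt.scatter(fitted, residuals)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```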
Assessing the utility of the model: model variance
• Variance is the variability of the random error (σ²)
• The higher the variability of the random error, the greater the error of prediction
• σ² is estimated with s² (often called the mean square for error, MSE)
• Variance: s² = SSE / (n − 2), where n − 2 is the degrees of freedom
• Standard error: s = √(SSE / (n − 2)) = √MSE
• This is like standard deviation; with the standard error, we are looking at deviation from the regression line
• Approximately 95% of observed y values will lie within 2s of their respective predicted values
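A minimal sketch of these quantities in Python, again on made-up data:

```python
import numpy as np

# Hypothetical illustrative data (not the textbook's values)
x = np.array([1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0.02, 0.03, 0.05, 0.07, 0.06, 0.08, 0.09, 0.11, 0.10, 0.13, 0.17])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

sse = np.sum(residuals**2)   # SS Error
mse = sse / (n - 2)          # s^2, the mean square for error
s = np.sqrt(mse)             # standard error of the regression

# Roughly 95% of observed y values should lie within 2s of their fitted values
print(f"s^2 (MSE) = {mse:.5f}, s = {s:.5f}, 2s = {2*s:.5f}")
```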
Assessing the utility of the model: Slope
• Does y change as x changes? Does x contribute information for the prediction of y?
• Test this with the t-statistic or its p-value (reject the hypothesis of zero slope when p < .05); these values are included in software output
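For example, scipy.stats.linregress reports the slope, its standard error, and the p-value for the test that the true slope is zero (a sketch on illustrative data, not the slides' MINITAB output):

```python
import numpy as np
from scipy import stats

# Hypothetical illustrative data (not the textbook's values)
x = np.array([1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0.02, 0.03, 0.05, 0.07, 0.06, 0.08, 0.09, 0.11, 0.10, 0.13, 0.17])

result = stats.linregress(x, y)

# Two-sided test of H0: slope = 0 (x contributes no information about y)
t_stat = result.slope / result.stderr
print(f"slope = {result.slope:.5f}, t = {t_stat:.2f}, p-value = {result.pvalue:.4f}")
# Reject H0 at the 5% level when the p-value is below .05
```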
Assessing the utility of the model: Correlation Coefficient r
• A measure of the strength and direction of the linear relationship between x and y
• Always between −1 and +1
• High correlation does not imply causality
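For reference, the sample correlation coefficient can be written as (standard formula, not shown on the slide):

```latex
r \;=\; \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```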
Assessing the utility of the model: Coefficient of Determination (r²)
• The r-squared value is the percentage of the variation in y explained by the model.
• For linear regression, the higher the value, the better the model.
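Equivalently, in terms of the sums of squares from the earlier slide:

```latex
r^2 \;=\; \frac{\text{SS explained by model}}{\text{Total SS}}
    \;=\; 1 - \frac{\text{SS Error}}{\text{Total SS}}
```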
Using the model for estimation and prediction: Confidence interval for mean response
• For any specific value of x, the mean response is estimated by the fitted value ŷ from the regression line.
• A confidence interval for the mean response adds to this estimate a margin of error based on its standard error.
• Confidence intervals widen as the value of x moves farther from its mean.
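Written out with the usual symbols (x* is the chosen value of x, t* the critical value with n − 2 degrees of freedom, s the standard error of the regression; this notation is an assumption, not the slides'):

```latex
\hat{\mu}_y = b_0 + b_1 x^{\ast}, \qquad
\hat{\mu}_y \;\pm\; t^{\ast}_{\,n-2}\,\mathrm{SE}_{\hat{\mu}},
\qquad
\mathrm{SE}_{\hat{\mu}} \;=\; s\sqrt{\frac{1}{n} + \frac{(x^{\ast} - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}
```

The (x* − x̄)² term in the standard error is why these intervals widen as x* moves away from the mean of x.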
Prediction interval for a future observation
• Similar to confidence interval for mean response
• Standard error used in prediction interval includes:
  • Variability due to the fact that the least-squares line is not exactly equal to the true regression line
  • Variability of the future response variable y around the subpopulation mean
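A hand-computed sketch of both intervals in Python (software such as MINITAB produces these directly; the data and the choice x* = 5 are purely illustrative):

```python
import numpy as np
from scipy import stats

# Hypothetical illustrative data (not the textbook's values)
x = np.array([1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9], dtype=float)
y = np.array([0.02, 0.03, 0.05, 0.07, 0.06, 0.08, 0.09, 0.11, 0.10, 0.13, 0.17])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))   # standard error of regression
sxx = np.sum((x - x.mean())**2)
t_crit = stats.t.ppf(0.975, df=n - 2)                   # 95% two-sided critical value

x_star = 5.0                                            # illustrative value of x
y_hat = b0 + b1 * x_star

# CI for the mean response vs. PI for a single future observation:
# the PI's standard error has an extra "1 +" for the variability of y itself
se_mean = s * np.sqrt(1/n + (x_star - x.mean())**2 / sxx)
se_pred = s * np.sqrt(1 + 1/n + (x_star - x.mean())**2 / sxx)

print(f"95% CI for mean response: {y_hat - t_crit*se_mean:.4f} to {y_hat + t_crit*se_mean:.4f}")
print(f"95% prediction interval:  {y_hat - t_crit*se_pred:.4f} to {y_hat + t_crit*se_pred:.4f}")
```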
In the MINITAB regression window, you might want to…
• Set confidence levels in Options
• Enter a value for prediction in Options
• Store Residuals and Fits in Storage
• Display the full table of fits and residuals in Results (select the last bullet)
Beware of Extrapolation
• Extrapolation is the use of a regression line for prediction far outside the range of values of the independent variable x that you used to obtain the line. Such predictions are often not accurate.
Example from book: p. 138
• How can we tell if it is reasonable to fit a linear regression model?
• Let's run the analysis and interpret the results