Introduction to Probability and Statistics, Thirteenth Edition Chapter 12 Linear Regression and Correlation
Correlation & Regression • Univariate & Bivariate Statistics • U: frequency distribution, mean, mode, range, standard deviation • B: correlation – two variables • Correlation • linear pattern of relationship between one variable (x) and another variable (y) – an association between two variables • graphical representation of the relationship between two variables • Warning: • No proof of causality • Cannot assume x causes y
1. Correlation Analysis • The correlation coefficient measures the strength of the linear relationship between x and y • Sample Pearson's correlation coefficient: r = Sxy / √(Sxx Syy), where Sxy = Σ(x - x̄)(y - ȳ), Sxx = Σ(x - x̄)², and Syy = Σ(y - ȳ)²
Pearson’s Correlation Coefficient • “r” indicates… • strength of relationship (strong, weak, or none) • direction of relationship • positive (direct) – variables move in same direction • negative (inverse) – variables move in opposite directions • r ranges in value from -1.0 (strong negative) through 0.0 (no relationship) to +1.0 (strong positive)
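The bullets above can be sketched numerically. The helper below computes the sample Pearson coefficient from its definition, r = Sxy / √(Sxx Syy); the height/weight numbers are made up for illustration (the slide's actual data table is not reproduced in the text):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (in) and weights (lb) for five players
heights = [73, 74, 70, 75, 72]
weights = [185, 210, 170, 220, 190]
r = pearson_r(heights, weights)   # positive: taller players tend to weigh more
```

A perfectly linear relationship gives r = ±1 exactly, which makes a convenient sanity check.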
Limitations of Correlation • linearity: • can’t describe non-linear relationships • e.g., relation between anxiety & performance • no proof of causation • Cannot assume x causes y
Some Correlation Patterns [Scatterplots: linear relationships vs. curvilinear relationships]
Some Correlation Patterns [Scatterplots: strong relationships vs. weak relationships]
Example The table shows the heights and weights of n = 10 randomly selected college football players.
Example – scatter plot r = .8261 Strong positive correlation As the player’s height increases, so does his weight.
Inference Using r • The population coefficient of correlation is called ρ (“rho”). We can test for a significant correlation between x and y using a t test: t = r√(n - 2) / √(1 - r²), with df = n - 2
Example Is there a significant positive correlation between weight and height in the population of all college football players? With r = .8261 and n = 10, t = .8261√8 / √(1 - .8261²) ≈ 4.15. Use the t-table with n - 2 = 8 df to bound the p-value as p-value < .005. There is a significant positive correlation between weight and height in the population of all college football players.
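The test statistic for the football-player example can be reproduced directly from r = .8261 and n = 10 given on the slide; a minimal sketch:

```python
import math

def corr_t_stat(r, n):
    """t statistic for H0: rho = 0, with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = corr_t_stat(0.8261, 10)   # roughly 4.15, well past t(.005) = 3.355 at 8 df
```

Since 4.15 exceeds the upper .005 critical value of the t distribution with 8 df, the p-value is bounded below .005, as the slide states.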
2. Linear Regression • Regression: Correlation + Prediction • Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables). • Dependent variable: denoted y • Independent variables: denoted x1, x2, …, xk
Example • Let y be the monthly sales revenuefor a company. This might be a function of several variables: • x1 = advertising expenditure • x2 = time of year • x3 = state of economy • x4 = size of inventory • We want to predict y using knowledge of x1, x2, x3 and x4.
Some Questions • Which of the independent variables are useful and which are not? • How could we create a prediction equation to allow us to predict y using knowledge of x1, x2, x3 etc? • How good is this prediction? We start with the simplest case, in which the response y is a function of a single independent variable, x.
Model Building • A statistical model separates the systematic component of a relationship from the random component: Data = Systematic component + Random errors • In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
A Simple Linear Regression Model • Explanatory and response variables are numeric • The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line) • Model: y = b0 + b1x + e, so E(y) = b0 + b1x • b1 > 0 Positive Association • b1 < 0 Negative Association • b1 = 0 No Association
Picturing the Simple Linear Regression Model [Regression plot: line with y-intercept a, slope b, and the vertical error ε from one data point to the line]
Simple Linear Regression Analysis • Variables: • x = independent variable • y = dependent variable • Parameters: • a = y-intercept • β = slope • ε ~ normal distribution with mean 0 and variance σ² • y = actual value of a score • ŷ = predicted value
Simple Linear Regression Model… [Plot: b = slope = Δy/Δx; a = y-intercept]
The Method of Least Squares The equation of the best-fitting line is calculated using a set of n pairs (xi, yi). • We choose our estimates a and b of the parameters so that the vertical distances of the points from the line, yi - (a + bxi), are minimized: a and b minimize SSE = Σ[yi - (a + bxi)]² • The solutions are b = Sxy / Sxx and a = ȳ - b x̄
Example The table shows the IQ scores for a random sample of n = 10 college freshmen, along with their final calculus grades. Use your calculator to find the sums and sums of squares.
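The least squares solutions b = Sxy/Sxx and a = ȳ - b x̄ are easy to compute by hand or in code. The IQ/grade values below are made up for illustration, since the slide's data table is not reproduced in the text:

```python
def least_squares(x, y):
    """Least squares fit: b = Sxy / Sxx, a = y_bar - b * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical IQ scores and calculus grades (not the slide's actual data)
iq = [100, 105, 110, 115, 120]
grade = [75, 78, 84, 86, 92]
a, b = least_squares(iq, grade)   # b > 0: grades rise with IQ in this sample
```

On data that lie exactly on a line, the fit recovers the slope and intercept exactly, which is a useful check on the implementation.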
The Analysis of Variance • The total variation in the experiment is measured by the total sum of squares: Total SS = Syy = Σ(yi - ȳ)² • The Total SS is divided into two parts: • SSR (sum of squares for regression): measures the variation explained by using x in the model. • SSE (sum of squares for error): measures the leftover variation not explained by x.
The Analysis of Variance We calculate SSR = (Sxy)² / Sxx and SSE = Total SS - SSR
The ANOVA Table
Source      df             Mean Squares
Regression  1              MSR = SSR/1
Error       n - 1 - 1 = n - 2   MSE = SSE/(n - 2)
Total       n - 1
Testing the Usefulness of the Model (The F Test) • You can test the overall usefulness of the model using an F test: F = MSR / MSE, with df1 = 1 and df2 = n - 2. If the model is useful, MSR will be large compared to the unexplained variation, MSE. For simple linear regression, this test is exactly equivalent to the t test for the slope, with t² = F.
Minitab Output — least squares regression line, with regression coefficients a and b
Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x

Predictor   Coef     SE Coef   T      P
Constant    40.784   8.507     4.79   0.001
x           0.7656   0.1750    4.38   0.002

S = 8.70363   R-Sq = 70.5%   R-Sq(adj) = 66.8%

Analysis of Variance
Source           DF   SS       MS       F       P
Regression       1    1450.0   1450.0   19.14   0.002
Residual Error   8    606.0    75.8
Total            9    2056.0
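The ANOVA quantities in the Minitab output can be reproduced from just the sums of squares it reports (Total SS = 2056.0, SSR = 1450.0, n = 10):

```python
# Rebuild the ANOVA table for the calculus example from its sums of squares.
n = 10
total_ss = 2056.0
ssr = 1450.0

sse = total_ss - ssr     # sum of squares for error: 606.0
msr = ssr / 1            # regression df = 1
mse = sse / (n - 2)      # error df = n - 2 = 8, giving 75.75 (printed as 75.8)
f_stat = msr / mse       # F statistic, about 19.14
```

The small discrepancy between 75.75 here and 75.8 in the output is just Minitab's rounding of the printed value.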
Testing the Usefulness of the Model • The first question to ask is whether the independent variable x is of any use in predicting y. • If it is not, then the mean value of y does not change, regardless of the value of x. This implies that the slope of the line, β, is zero.
Testing the Usefulness of the Model The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic: t = b / √(MSE / Sxx), with df = n - 2
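The standard error √(MSE / Sxx) is exactly the "SE Coef" column in the Minitab output, so the slope t statistic can be recovered from the printed coefficients:

```python
b = 0.7656      # slope estimate (Coef for x in the Minitab output)
se_b = 0.1750   # its standard error (SE Coef)
t = b / se_b    # t statistic for H0: slope = 0 (Minitab prints 4.38)
```

Squaring this t reproduces the F statistic from the ANOVA table (t² = F for simple linear regression), up to rounding of the printed values.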
The Calculus Problem • Is there a significant relationship between the calculus grades and the IQ scores at the 5% level of significance? Reject H0 when |t| > 2.306. Since t = 4.38 falls in the rejection region, H0 is rejected. There is a significant linear relationship between the calculus grades and the IQ scores for the population of college freshmen.
Measuring the Strength of the Relationship • If the independent variable x is useful in predicting y, you will want to know how well the model fits. • The strength of the relationship between x and y can be measured using the coefficient of determination: r² = SSR / Total SS
Measuring the Strength of the Relationship • Since Total SS = SSR + SSE, r² measures • the proportion of the total variation in the responses that can be explained by using the independent variable x in the model. • the percent reduction in the total variation obtained by using the regression equation rather than just using the sample mean ȳ to estimate y. For the calculus problem, r² = .705 or 70.5%, meaning that 70.5% of the variability in calculus scores can be explained by the model.
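The value r² = .705 quoted above follows directly from the sums of squares in the Minitab output:

```python
# Coefficient of determination for the calculus example
ssr = 1450.0        # sum of squares for regression
total_ss = 2056.0   # total sum of squares
r_squared = ssr / total_ss   # about 0.705, i.e. 70.5% (Minitab's R-Sq)
```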
Estimation and Prediction • Confidence interval for the average value of y when x = x0: ŷ ± tα/2 √(MSE (1/n + (x0 - x̄)² / Sxx)) • Prediction interval for a particular value of y when x = x0: ŷ ± tα/2 √(MSE (1 + 1/n + (x0 - x̄)² / Sxx))
The Calculus Problem • Estimate the average calculus grade for students whose IQ score is 50 with a 95% confidence interval.
The Calculus Problem • Predict the calculus grade for a particular student whose IQ score is 50 with a 95% prediction interval. Notice how much wider this interval is!
Minitab Output — confidence and prediction intervals when x = 50
Predicted Values for New Observations
New Obs   Fit     SE Fit   95.0% CI          95.0% PI
1         79.06   2.84     (72.51, 85.61)    (57.95, 100.17)

Values of Predictors for New Observations
New Obs   x
1         50.0

• Prediction bands are always wider than confidence bands. • Both intervals are narrowest when x0 = x̄.
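Both intervals above can be reconstructed from the quantities Minitab prints (Fit, SE Fit, MSE) plus the t critical value 2.306 for 8 df; the prediction interval simply adds MSE under the square root:

```python
import math

fit = 79.06      # y-hat at x0 = 50, from the Minitab output
se_fit = 2.84    # SE Fit: standard error of the estimated mean
mse = 75.8       # MS for Residual Error from the ANOVA table
t_crit = 2.306   # t(.025) with 8 df

# 95% confidence interval for the average y at x0 = 50
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)

# 95% prediction interval: the extra MSE term widens the interval
se_pred = math.sqrt(mse + se_fit ** 2)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
```

The results match the printed (72.51, 85.61) and (57.95, 100.17) up to rounding of the output values.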
Estimation and Prediction • Once you have • determined that the regression line is useful • used the diagnostic plots to check for violations of the regression assumptions • you are ready to use the regression line to • estimate the average value of y for a given value of x • predict a particular value of y for a given value of x
Estimation and Prediction • The best estimate of either E(y) or y for a given value x = x0 is ŷ = a + bx0 • Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.
Regression Assumptions • Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied. Assumptions: • The relationship between x and y is linear, given by y = a + bx + ε. • The random error terms ε are independent and, for any value of x, have a normal distribution with mean 0 and constant variance σ².
Diagnostic Tools • Normal probability plot or histogram of the residuals • Plot of residuals versus fits, or residuals versus variables • Plot of residuals versus order
Residuals • The residual error is the “leftover” variation in each data point after the variation explained by the regression model has been removed. • If all assumptions have been met, these residuals should be normal, with mean 0 and variance σ².
Normal Probability Plot • If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right. • If not, you will often see the pattern fail in the tails of the graph.
Residuals versus Fits • If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line. • If not, you will see a pattern in the residuals.
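One numerical property backs up these plots: whenever the model includes an intercept, least squares residuals always average exactly to zero, so a nonzero mean signals a computational error, while patterns in the plots signal assumption violations. A minimal sketch with made-up data:

```python
def residuals(x, y):
    """Fit by least squares, then return the leftover errors y_i - (a + b*x_i)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
res = residuals(x, y)
mean_res = sum(res) / len(res)   # zero up to floating-point error
```

These residuals are what would be fed to the normal probability plot, residuals-versus-fits plot, and residuals-versus-order plot described above.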