Simple Linear Regression

Simple Linear Regression • Conditions • Confidence intervals • Prediction intervals • Sections 9.1, 9.2, 9.3 • Professor Kari Lock Morgan, Duke University

To Do • Homework 8 (due Monday, 4/9) • Project 2 Proposal (due Wednesday, 4/11)

Hypothesis Test • Two-sided p-value for t = 16.21 with 5 degrees of freedom: > 2*pt(16.21,5,lower.tail=FALSE) [1] 1.628701e-05 • There is strong evidence that the slope differs from 0, and so that there is an association between cricket chirp rate and temperature.

Correlation Test • Test for a correlation between temperature and cricket chirps (r = 0.9906). > 2*pt(16.21,5,lower.tail=FALSE) [1] 1.628701e-05

Two Quantitative Variables • The t-statistic (and p-value) for a test for a non-zero slope and a test for a non-zero correlation are identical! • They are equivalent ways of testing for an association between two quantitative variables.
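The equivalence can be checked numerically. The sketch below uses a small made-up data set (the `chirps` and `temp` values are hypothetical, not the cricket data from the slides) and compares the slope t-statistic from `lm`, the correlation t-statistic from `cor.test`, and the formula t = r·sqrt(n − 2)/sqrt(1 − r²). The last lines plug in the rounded r = 0.9906 from the slide with n = 7.

```r
# Hypothetical data: 7 cases, so df = n - 2 = 5 (same df as on the slides)
chirps <- c(80, 100, 120, 140, 160, 180, 200)
temp   <- 40 + 0.2 * chirps + c(0.5, -0.8, 0.3, 1.0, -1.2, 0.4, -0.2)

# t-statistic for the test of a non-zero slope
fit <- lm(temp ~ chirps)
t_slope <- summary(fit)$coefficients["chirps", "t value"]

# t-statistic for the test of a non-zero correlation
t_cor <- unname(cor.test(chirps, temp)$statistic)

# The same statistic computed directly from r and n
r <- cor(chirps, temp)
n <- length(chirps)
t_formula <- r * sqrt(n - 2) / sqrt(1 - r^2)

print(c(slope = t_slope, correlation = t_cor, formula = t_formula))

# With the rounded r from the slide; differs slightly from 16.21
# only because r was rounded to four decimals
t_slide <- 0.9906 * sqrt(7 - 2) / sqrt(1 - 0.9906^2)
print(t_slide)
```

All three values printed for the made-up data agree, which is why the two tests always give the same p-value.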

Simple Linear Regression • Simple linear regression estimates the population model $y = \beta_0 + \beta_1 x + \varepsilon$ • with the sample model: $\hat{y} = b_0 + b_1 x$

Conditions • Inference based on the simple linear model is only valid if the following conditions hold: • Linearity • Constant Variability of Residuals • Normality of Residuals • Independence

Linearity • The relationship between x and y is linear (it makes sense to draw a line through the scatterplot)

Dog Years • Happy Birthday Charlie! • From www.dogyears.com: “The old rule-of-thumb that one dog year equals seven years of a human life is not accurate. The ratio is higher with youth and decreases a bit as the dog ages.” • Rule of thumb: 1 dog year = 7 human years • Linear: human age = 7×dog age • [Figure: ACTUAL vs. LINEAR human age plotted against dog age] • A linear model can still be useful, even if it doesn’t perfectly fit the data.

“All models are wrong, but some are useful” • -George Box

Simple Linear Model

Residual Plot • A residual plot is a scatterplot of the residuals against the predicted responses • Should have: • No obvious pattern or trend • Constant variability

Residual Plots • [Figure: two residual plots, one with an obvious pattern and one with non-constant variability]

Residuals (errors) • Conditions for residuals: • The errors are normally distributed (check with a histogram) • The standard deviation of the errors is constant for all cases (constant vertical spread in the residual plot) • The average of the errors is 0 (always true for least squares regression)
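These checks can be sketched in R. The data below are simulated purely for illustration (the variables `x` and `y` and their true line are assumptions, not from the slides). The residual mean comes out as 0 up to rounding error, and the two plots are the ones used to judge constant variability and normality.

```r
# Simulated data for illustration only
set.seed(42)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 1.5)

fit  <- lm(y ~ x)      # least squares fit
res  <- resid(fit)     # residuals: observed - predicted
pred <- fitted(fit)    # predicted responses

# The average residual is 0 for any least squares line with an intercept
mean_res <- mean(res)
print(mean_res)

# Plots are drawn only in an interactive session
if (interactive()) {
  # Residual plot: look for no pattern and constant vertical spread
  plot(pred, res, xlab = "Predicted response", ylab = "Residual")
  abline(h = 0, lty = 2)
  # Histogram of residuals: look for an approximately normal shape
  hist(res, main = "Histogram of residuals", xlab = "Residual")
}
```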

Independence • The independence assumption is that the residuals are independent from case to case • The residual of one case is not associated with the residual of another case • This condition is checked by thinking about the data, not necessarily by looking at a plot • Can you think of data for which the independence assumption doesn’t hold?

Data over Time • Data collected over time usually violate the independence condition, because the residual of one case may be similar to the residuals of the cases directly before or after • Dow Jones index over the past month:

Conditions not Met? • If the association isn’t linear: • Try to make it linear (next week) • If it can’t be made linear, then simple linear regression isn’t a good fit for the data • If variability is not constant, residuals are not normal, or cases are not independent: • The model itself is still valid, but inference may not be accurate

Simple Linear Regression • Plot your data! (linear association?) • Fit the model (least squares) • Use the model (interpret coefficients and/or make predictions) • Look at residual plot (no obvious pattern? constant variability?) • Look at histogram of residuals (normal?) • Cases independent? • Inference (extend to population)

President Approval and Re-Election • Can we use Obama’s approval rating to predict his margin of victory (or defeat) when he runs for re-election? • Data on all* incumbent U.S. presidential candidates since 1940 (11 cases) • Response: margin of victory (or defeat) • Explanatory: approval rating *Except 1944 because Gallup went on a wartime hiatus Source: Silver, Nate, “Approval Ratings and Re-Election Odds", fivethirtyeight.com, posted on 1/28/11.

President Approval and Re-Election 1. Plot the data: Is the trend linear? (a) Yes (b) No

President Approval and Re-Election 2. Fit the Model:

President Approval and Re-Election 3. Use the model: • Which of the following is a correct interpretation? • For every percentage point increase in margin of victory, approval increases by 0.84 percentage points • For every percentage point increase in approval, predicted margin of victory increases by 0.84 percentage points • For every 0.84 increase in approval, predicted margin of victory increases by 1

President Approval and Re-Election 3. Use the model: • Obama’s current (based on polls 3/26/12 – 4/1/12) approval rating is 46%. Based only on this information, do you think Obama will • Win • Lose Predicted margin of victory = -36.5 + 0.84*46 = 2.14, which is positive • Source: http://www.gallup.com/poll/116479/barack-obama-presidential-job-approval.aspx
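The arithmetic can be reproduced in R. The coefficients −36.5 and 0.84 are the ones quoted on the slide; the helper name `predict_margin` is made up for this sketch.

```r
# Fitted coefficients as quoted on the slide
b0 <- -36.5   # intercept
b1 <- 0.84    # slope: change in predicted margin per approval point

# Hypothetical helper: predicted margin of victory for a given approval rating
predict_margin <- function(approval) b0 + b1 * approval

pred_46 <- predict_margin(46)
print(pred_46)

# Approval rating at which the predicted margin is exactly 0
break_even <- -b0 / b1
print(break_even)
```

The break-even approval rating is about 43.5%, so an approval rating of 46% lands only slightly on the winning side of the fitted line.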

President Approval and Re-Election 4. Look at residual plot: • Is there no obvious trend? • (a) Yes (b) No • Is the variability constant? • (a) Yes (b) No

President Approval and Re-Election 5. Look at histogram of residuals: • Are the residuals approximately normally distributed? • (a) Yes (b) No

President Approval and Re-Election 6. Cases independent? • Are the cases independent? • Yes • No

President Approval and Re-Election 7. Inference • Should we do inference? • (a) Yes • (b) No • Due to the non-constant variability, the non-normal residuals, and the small sample size, our inferences may not be entirely accurate. • However, they are still better than nothing!

President Approval and Re-Election 7. Inference • Give a 95% confidence interval for the slope coefficient. • Is it significantly different from 0? • (a) Yes • (b) No
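A minimal sketch of the slope interval in R, computed as estimate ± t* × SE with n − 2 degrees of freedom. The election data are not reproduced in these slides, so the 11 cases below are simulated around the fitted line (an assumption for illustration only), and the resulting interval is not the answer to the question above.

```r
# Simulated stand-in for the 11 cases; NOT the actual election data
set.seed(2012)
approval <- runif(11, 35, 70)
margin   <- -36.5 + 0.84 * approval + rnorm(11, sd = 6)

fit <- lm(margin ~ approval)

# Slope estimate and its standard error from the regression output
est <- summary(fit)$coefficients["approval", "Estimate"]
se  <- summary(fit)$coefficients["approval", "Std. Error"]

# t* for 95% confidence with n - 2 degrees of freedom
t_star <- qt(0.975, df = fit$df.residual)
ci_manual <- c(est - t_star * se, est + t_star * se)

# The same interval from the built-in function
ci_builtin <- confint(fit, "approval", level = 0.95)

print(ci_manual)
print(ci_builtin)

# The slope is significantly different from 0 at the 5% level
# exactly when the 95% interval excludes 0
excludes_zero <- ci_manual[1] > 0 || ci_manual[2] < 0
print(excludes_zero)
```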

President Approval and Re-Election 7. Inference: • We don’t really care about the slope coefficient, we care about Obama’s margin of victory in the upcoming election! • How do we create a 95% interval predicting Obama’s margin of victory???

Prediction • We would like to use the regression equation to predict y for a certain value of x • This includes not only a point estimate, but interval estimates also • We will predict the value of y at x = x*

Point Estimate • The point estimate for the average y value at x=x* is simply the predicted value: • Alternatively, you can think of it as the value on the line above the x value • The uncertainty in this point estimate comes from the uncertainty in the coefficients

Confidence Intervals • We can calculate a confidence interval for the average y value for a certain x value • “We are 95% confident that the average y value for x=x* lies in this interval” • Equivalently, the confidence interval is for the point estimate, or the predicted value • This is the amount the line is free to “wiggle,” and the width of the interval decreases as the sample size increases

Bootstrapping • We need a way to assess the uncertainty in predicted y values for a certain x value… any ideas? • Take repeated samples, with replacement, from the original sample data (bootstrap) • Each sample gives a slightly different fitted line • If we do this repeatedly, take the middle P% of predicted y values at x* for a confidence interval of the predicted y value at x*
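The bootstrap steps above can be sketched in R as follows. The data are simulated stand-ins for the 11 cases (an assumption, since the election data are not listed in the slides), and the number of bootstrap samples `B` is an arbitrary choice.

```r
# Simulated stand-in for the 11 cases; NOT the actual election data
set.seed(2012)
n <- 11
approval <- runif(n, 35, 70)
margin   <- -36.5 + 0.84 * approval + rnorm(n, sd = 6)
dat <- data.frame(approval, margin)

x_star <- data.frame(approval = 46)  # x value at which to predict
B <- 2000                            # number of bootstrap samples

boot_pred <- replicate(B, {
  idx   <- sample(n, replace = TRUE)                 # resample cases with replacement
  fit_b <- lm(margin ~ approval, data = dat[idx, ])  # refit the line
  unname(predict(fit_b, newdata = x_star))           # predicted y at x*
})

# Middle 95% of the bootstrap predictions
boot_ci <- quantile(boot_pred, c(0.025, 0.975))
print(boot_ci)
```

Each pass through the loop produces a slightly different fitted line, so the spread of `boot_pred` reflects how much the line "wiggles" at x*.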

Bootstrap CI Middle 95% of predicted values gives the confidence interval for average (predicted) margin of victory for an incumbent president with an approval rating of 46%

Confidence Interval

Confidence Interval • For x = 46%: (-2.89, 6.80) • For an approval rating of 46%, we are 95% confident that the average margin of victory for incumbent U.S. presidents is between -2.89 and 6.80 percentage points • But wait, this still doesn’t tell us about Obama! We don’t care about the average, we care about an interval for one incumbent president with an approval rating of 46%!

Prediction Intervals • We can also calculate a prediction interval for y values for a certain x value • “We are 95% confident that the y value for x = x* lies in this interval” • This takes into account the variability in the line (in the predicted value) AND the uncertainty around the line (the random errors)

Intervals

Intervals • A confidence interval has a given chance of capturing the mean y value at a specified x value • A prediction interval has a given chance of capturing the y value for a particular case at a specified x value • For a given x value, which will be wider? • Confidence interval • Prediction interval
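Both intervals are available from `predict` on an `lm` fit through its `interval` argument. The sketch below uses simulated stand-in data (an assumption, not the election data), so the printed numbers will not match the intervals reported on the slides.

```r
# Simulated stand-in for the 11 cases; NOT the actual election data
set.seed(2012)
approval <- runif(11, 35, 70)
margin   <- -36.5 + 0.84 * approval + rnorm(11, sd = 6)
dat <- data.frame(approval, margin)

fit    <- lm(margin ~ approval, data = dat)
x_star <- data.frame(approval = 46)

# Interval for the MEAN margin at approval = 46
conf_int <- predict(fit, newdata = x_star, interval = "confidence", level = 0.95)

# Interval for ONE case's margin at approval = 46
pred_int <- predict(fit, newdata = x_star, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)

conf_width <- unname(conf_int[, "upr"] - conf_int[, "lwr"])
pred_width <- unname(pred_int[, "upr"] - pred_int[, "lwr"])
print(c(confidence = conf_width, prediction = pred_width))
```

Both intervals are centered at the same predicted value. The prediction interval adds the variability of individual cases around the line on top of the uncertainty in the line itself.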

Intervals • [Figure: fitted line with prediction and confidence bands]

Intervals • As the sample size increases: • the standard errors of the coefficients decrease • we are more sure of the equation of the line • the widths of the confidence intervals decrease • for a huge n, the width of the CI will be almost 0 • The prediction interval may be wide, even for large n, and depends more on the correlation between x and y (how well y can be linearly predicted by x)
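A small simulation illustrates this. The true line, the error standard deviation of 6, and the sample sizes are all assumptions chosen for the sketch, and `interval_widths` is a hypothetical helper.

```r
set.seed(1)

# Hypothetical helper: simulate n cases, fit the line, and return the
# widths of the 95% confidence and prediction intervals at x = 46
interval_widths <- function(n) {
  x <- runif(n, 35, 70)
  y <- -36.5 + 0.84 * x + rnorm(n, sd = 6)
  fit <- lm(y ~ x)
  new <- data.frame(x = 46)
  ci <- predict(fit, newdata = new, interval = "confidence")
  pr <- predict(fit, newdata = new, interval = "prediction")
  c(n = n,
    ci_width = unname(ci[, "upr"] - ci[, "lwr"]),
    pi_width = unname(pr[, "upr"] - pr[, "lwr"]))
}

# One row per sample size
widths <- t(sapply(c(11, 100, 10000), interval_widths))
print(widths)
```

With 10,000 cases the confidence interval is very narrow, while the prediction interval stays close to 2 × 1.96 × 6, roughly 23.5, because the spread of individual cases around the line does not shrink with n.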

Prediction Interval • Based on the data and the simple linear model (and using Obama’s current approval rating): • The predicted margin of victory for Obama is 2.14 percentage points • We are 95% confident that Obama’s margin of victory (or defeat) will be between -12.36 and 16.26 percentage points

Simple Linear Regression