Regression

Motivating Example • Let X represent height (in inches) and let Y represent weight (in pounds). • Using SPSS, I found the mean height $\bar{X}$ (in inches) and the mean weight $\bar{Y}$ (in pounds). • Question 1: One person is chosen at random. Given no other information, what is your best guess at the person’s weight? • Answer: the mean weight, $\bar{Y} = 127$ pounds.

Motivating Example • Question 2: One person is chosen at random. You are told the person’s height is 73 inches. What is your best guess at the person’s weight? • The person is tall, and is likely to be heavier than average. Your best guess is the mean weight for people of similar height.

Motivating Example • The mean weight of people in the study whose height was between 72.5 and 73.5 inches is 142.18 pounds. • This estimate took some effort and required knowing all the data. • It would be nice to have a simple equation which gives a good estimate.

What about the SD line? • The SD line is the line about which the data appear to cluster when they are strongly associated. • The equation of the SD line is (assuming positive association) $Y - \bar{Y} = \frac{SD_Y}{SD_X}(X - \bar{X})$. • Using SPSS, I (carefully!) found $SD_X$ and $SD_Y$ for the height-weight data.

What about the SD line? • Substituting the means and SDs into this equation gives the SD line for the height-weight data. • We might try to use the SD line as a prediction tool for the weight of a person 73 inches tall. • For X = 73, the equation gives 157.68 pounds.

What about the SD line? • But 157.68 isn’t very close to 142.18. • While the SD line gives a good visual cue for determining strength of correlation, it’s not good for making predictions about data.

The regression line • Using SPSS, I computed the correlation coefficient $r$ of the height-weight data. • Consider the equation $Y - \bar{Y} = r \cdot \frac{SD_Y}{SD_X}(X - \bar{X})$. • This differs from the equation of the SD line in that the slope has been multiplied by $r$.

The regression line • Using this line to predict the weight of a person whose height is 73 inches, I get 142.34 pounds. • This is much better: it is very close to the strip average of 142.18 pounds.
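The numbers quoted in this deck pin down the correlation coefficient even without the data. Both lines pass through the point of averages, and the regression slope is r times the SD-line slope, so the ratio of the two predicted gains above the mean weight equals r. A minimal sketch, assuming the three figures quoted in the slides (mean weight 127, SD-line prediction 157.68, regression prediction 142.34, all in pounds); the variable names are illustrative only:

```python
# Figures quoted in the slides (pounds) for a person 73 inches tall.
mean_weight = 127.0
sd_line_prediction = 157.68
regression_prediction = 142.34

# Both lines pass through the point of averages, so the gains above the
# mean differ only by the factor r applied to the slope.
implied_r = (regression_prediction - mean_weight) / (sd_line_prediction - mean_weight)

print(round(implied_r, 3))
```

Under these assumptions the implied correlation is about 0.5, which is why the regression prediction sits halfway between the mean weight and the SD-line prediction.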

The regression line • The equation of the regression line is $Y - \bar{Y} = r \cdot \frac{SD_Y}{SD_X}(X - \bar{X})$. • The regression line for Y on X estimates the mean value of Y corresponding to each value of X. • Each increase of one SD in X yields an average increase of $r$ SDs in Y, where $r$ is the correlation coefficient.
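To make the two lines concrete, here is a small sketch that builds both from summary statistics. It assumes the ten-point data set listed on the residual plot slide later in this deck (not the height-weight data), and it uses the population SD (divide by n); the function names are illustrative.

```python
from math import sqrt

# Ten-point data set from the residual plot slide (not the height-weight data).
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [2, 1, 2, 3, 5, 5, 4, 6, 8, 5]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    # Population SD: root-mean-square deviation from the mean (divide by n).
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd(xs) * sd(ys))


r = correlation(xs, ys)
sd_slope = sd(ys) / sd(xs)   # slope of the SD line
reg_slope = r * sd_slope     # slope of the regression line


def sd_line(x):
    return mean(ys) + sd_slope * (x - mean(xs))


def regression_line(x):
    return mean(ys) + reg_slope * (x - mean(xs))


print(r, sd_line(5), regression_line(5))
```

Both lines pass through the point of averages; because r is below 1 here, the regression line is flatter, so at X = 5 it predicts a smaller Y than the SD line does.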

The regression line • To illustrate the concept of the regression line, we plot the graph of averages. This is a plot of average Y-values vs. (regular) X-values.
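A graph of averages can be computed by grouping the points by X-value and averaging the Y-values in each group. The sketch below assumes the ten-point data set from the residual plot slide later in this deck; `graph_of_averages` is an illustrative name. With continuous data such as heights, one would first bin X into narrow strips (for example, 72.5 to 73.5 inches).

```python
from collections import defaultdict

# Ten-point data set from the residual plot slide (not the height-weight data).
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [2, 1, 2, 3, 5, 5, 4, 6, 8, 5]


def graph_of_averages(xs, ys):
    """Return {x: average of the Y-values observed at that x}."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: sum(group) / len(group) for x, group in sorted(groups.items())}


averages = graph_of_averages(xs, ys)
for x, avg in averages.items():
    print(x, avg)
```

For these data the regression line is Y = 0.35 + 1.25 X, so the averages scatter around the line rather than lying exactly on it; the regression line acts as a smoothed version of the graph of averages.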

Regression Line vs. SD Line

The “Regression Effect” Graph of averages

The “Regression Effect” • If the data are positively associated, the SD line overestimates the average increase in Y. • On average, the weight of a tall person is less than the value predicted by the SD line. • On average, the weight of a short person is greater than the value predicted by the SD line. • Another Example: Children were tested upon entering and leaving a preschool program to boost IQ. On average, the bottom group improved on the second test and the top group fell back a bit. • This is to be expected due to spread around the SD line!

The “Regression Effect” • An instructor standardizes her midterm and final so average is 50 and SD is 10 on both tests. Correlation is 0.5. • One semester she gave extra tutoring to all students who scored below 30 on the midterm. They all scored above 50 on the final. • Can this be explained by the regression effect?
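One way to approach this question is to ask what the regression line alone would predict for these students. The sketch below uses the figures stated on the slide (average 50, SD 10, correlation 0.5); `predicted_final` is an illustrative helper, not part of the original deck.

```python
def predicted_final(midterm, mean=50.0, sd=10.0, r=0.5):
    """Regression estimate of the final score given a midterm score.

    Both tests are standardized to the same mean and SD, so a midterm
    score z SDs from the mean predicts a final score r * z SDs from it.
    """
    z = (midterm - mean) / sd
    return mean + r * z * sd


print(predicted_final(30))
```

A midterm score of 30 is two SDs below average, so the regression estimate for the final is one SD below average, or 40; students who scored below 30 would be expected to average below 40. The regression effect predicts some improvement, but not that the whole group would land above 50.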

RMS Error for Regression

The regression line as a “mean” • There is a close analogy between the mean (for single variable data) and the regression line (for two variable data).

Facts about the mean • If $\bar{X}$ is the mean of a variable X, then the sum of the deviations $X - \bar{X}$ is 0. • The mean is the unique value that minimizes the root-mean-square of the deviations. • If you computed the SD with another value in place of the mean, you’d get a larger answer.
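These facts are easy to check numerically. The sketch below assumes the ten Y-values from the residual plot slide later in this deck as sample data; `rms_deviation` is an illustrative helper.

```python
from math import sqrt

# Y-values from the residual plot slide, used here as single-variable data.
values = [2, 1, 2, 3, 5, 5, 4, 6, 8, 5]
m = sum(values) / len(values)


def rms_deviation(values, center):
    """Root-mean-square of the deviations from an arbitrary center."""
    return sqrt(sum((v - center) ** 2 for v in values) / len(values))


deviation_sum = sum(v - m for v in values)   # zero, up to rounding error
sd = rms_deviation(values, m)                # the SD (population form)

print(deviation_sum, sd)
print(rms_deviation(values, m + 0.5), rms_deviation(values, m - 0.5))
```

Replacing the mean with a center half a unit higher or lower gives a larger root-mean-square deviation than the SD.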

Residuals • When using the regression line to predict Y given X, the residual of a data point (X, Y) is the difference between the actual value Y and the value predicted by the regression line. • Residuals are “deviations” from the regression line.

Residuals

The regression line as a “mean” • The sum of all residuals is 0. • The regression line is the unique line that minimizes the root-mean-square residuals.
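The same kind of numerical check works for the regression line. The sketch below assumes the ten-point data set from the residual plot slide later in this deck, fits the regression line, and compares its RMS residual with that of a few slightly nudged lines. The nudges are arbitrary choices for illustration.

```python
from math import sqrt

# Ten-point data set from the residual plot slide (not the height-weight data).
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [2, 1, 2, 3, 5, 5, 4, 6, 8, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Regression slope = covariance / variance of X, which equals r * SD_Y / SD_X.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x


def residuals(a, b):
    """Residuals of the data from the line y = a + b * x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def rms(values):
    return sqrt(sum(v ** 2 for v in values) / len(values))


residual_sum = sum(residuals(intercept, slope))
best = rms(residuals(intercept, slope))

# RMS residual for lines with a slightly different intercept or slope.
nudges = [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.1), (0.0, -0.1)]
nudged = [rms(residuals(intercept + da, slope + db)) for da, db in nudges]

print(residual_sum, best, nudged)
```

A handful of nudged lines is a spot check rather than a proof, but each one gives a larger RMS residual than the regression line.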

RMS Error • For single variable data, the root-mean-square of the deviations $Y - \bar{Y}$ is the SD. • For two variable data, the root-mean-square of the residuals is called the RMS error.

RMS Error • You can compute RMS error using the formula $\text{RMS error} = \sqrt{1 - r^2} \cdot SD_Y$ (here Y is the predicted variable). • The RMS error is less than $SD_Y$ (whenever $r \neq 0$) because the regression line is a better predictor of Y-values than the horizontal line at $\bar{Y}$.
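The shortcut formula, RMS error = sqrt(1 - r^2) times the SD of Y, can be checked against the direct definition. The sketch below assumes the ten-point data set from the residual plot slide later in this deck and the population SD (divide by n).

```python
from math import sqrt

# Ten-point data set from the residual plot slide (not the height-weight data).
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [2, 1, 2, 3, 5, 5, 4, 6, 8, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

# Shortcut: RMS error from r and SD_Y alone.
rms_from_formula = sqrt(1 - r ** 2) * sd_y

# Direct: root-mean-square of the residuals from the regression line.
slope = r * sd_y / sd_x
residuals = [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]
rms_direct = sqrt(sum(e ** 2 for e in residuals) / n)

print(rms_from_formula, rms_direct, sd_y)
```

The two computations agree (about 0.98 for these data), and both are smaller than the SD of Y (about 2.02).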

RMS Error • About 68% of points on a scatter diagram fall within one RMS error of the regression line. • About 95% of points fall within two RMS errors of the regression line.

How is RMS error useful? • If the scatter diagram is football-shaped, thin vertical strips show similar amounts of spread. (Statisticians call such plots homoscedastic.)

How is RMS error useful? • In this case, you can use a normal approximation within a (thin) vertical strip: mean = value of the regression line at the horizontal center of the strip, SD = RMS error. • Example: Of people 73” tall, what percent weigh between 135 and 145 pounds?
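The strip calculation can be wrapped in a small helper. This is a sketch under the slide's assumptions (football-shaped scatter, normal approximation in the strip); `percent_between` is an illustrative name, and the RMS error passed in below is a placeholder value, not the one computed for the height-weight data.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def percent_between(low, high, center, rms_error):
    """Percent of points in a thin strip with Y between low and high.

    center    -- value of the regression line at the middle of the strip
    rms_error -- RMS error of the regression line, used as the strip's SD
    """
    z_low = (low - center) / rms_error
    z_high = (high - center) / rms_error
    return 100.0 * (normal_cdf(z_high) - normal_cdf(z_low))


center = 142.34   # regression estimate for 73 inches, from the slides
spread = 10.0     # placeholder RMS error (hypothetical, for illustration only)

within_one = percent_between(center - spread, center + spread, center, spread)
within_two = percent_between(center - 2 * spread, center + 2 * spread, center, spread)
print(round(within_one, 1), round(within_two, 1))
```

Whatever RMS error is used, the helper reproduces the rule of thumb: about 68% of the strip lies within one RMS error of the line and about 95% within two.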

Example • For the height-weight data, recall that we found $SD_Y$ and $r$. • The regression line was $Y - \bar{Y} = r \cdot \frac{SD_Y}{SD_X}(X - \bar{X})$. • The regression line predicts the mean weight of a person 73” tall to be 142.34 pounds. • The RMS error is $\sqrt{1 - r^2} \cdot SD_Y$.

Example • To find the percentage of people between 135 and 145 lbs, we find the z-value for each weight: $z = (\text{weight} - 142.34) / \text{RMS error}$. • Using the normal table, we find the percentage is about 1%

Exercise • Suppose we are given the following statistics about SAT scores and GPAs of college students: Assume S and G are normally distributed. • Of the students whose SAT score is 1300, estimate the percentage whose GPA is between 3.0 and 3.5.

Exercise Answer • We use a normal approximation in a thin vertical strip above S=1300. The mean is the value predicted by the regression line: $\bar{G} + r \cdot \frac{SD_G}{SD_S}(1300 - \bar{S})$. • The SD is the RMS error for the regression line: $\sqrt{1 - r^2} \cdot SD_G$.

Exercise Answer • The z-values for G=3.0 and G=3.5 are approximately 0.17 and 1.83, respectively. • Using the normal table, the area under the normal curve between 0.17 and 1.83 is approximately 40%.
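The table lookup can be reproduced numerically. The sketch below takes the two z-values stated on the slide (0.17 and 1.83) and computes the area between them from the standard normal CDF, written here with `math.erf`.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


z_low, z_high = 0.17, 1.83   # z-values for G = 3.0 and G = 3.5 from the slide
area = normal_cdf(z_high) - normal_cdf(z_low)

print(round(100 * area, 1))
```

The area comes out to roughly 40%, so about four in ten students with an SAT score of 1300 would be estimated to have a GPA between 3.0 and 3.5.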

Bigger Picture Questions • What conclusions can be reasonably drawn from the regression line? • When is linear regression an appropriate tool?

Cautions about regression • If you run an observational study, the regression line only describes the data you see. It cannot predict the results of experiments or values of data beyond the range of the study. • If the scatter plot is not homoscedastic, do not use RMS error as SD in vertical strips.

A heteroscedastic scatter plot

When is regression inappropriate? • If the data follow a non-linear pattern • If the graph of averages follows a non-linear pattern • If the residual plot shows a non-linear pattern

Residual Plot • An example residual plot with no trend or pattern. The regression line is an appropriate tool for this data.

Residual Plot • Horizontal axis: same as scatter plot • Vertical axis: residuals • For each data point (X, Y), plot the point with x-coordinate = X y-coordinate = residual of (X, Y)

Residual plot • Data:
X: 1 1 2 2 3 3 4 4 5 5
Y: 2 1 2 3 5 5 4 6 8 5

Residual Plot • There should be no trend or pattern in the residuals. If there is, linear regression is not appropriate.
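The points of the residual plot for the ten-point data set above can be computed as follows. This is a minimal sketch that produces the coordinates only; drawing them is left to whatever plotting tool is at hand, and the variable names are illustrative.

```python
# Ten-point data set from the residual plot slide.
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [2, 1, 2, 3, 5, 5, 4, 6, 8, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Regression line of Y on X: slope = covariance / variance of X.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Each data point (X, Y) becomes the plot point (X, residual of (X, Y)).
residual_points = [(x, y - (intercept + slope * x)) for x, y in zip(xs, ys)]

for x, residual in residual_points:
    print(x, round(residual, 2))
```

For these data the regression line is Y = 0.35 + 1.25 X. The residuals change sign within most X-values rather than curving steadily up or down, which is the kind of pattern-free plot the slides describe.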
