Regression • What is regression to the mean? • Suppose the mean temperature in November is 5 degrees • What’s your best guess for tomorrow’s temperature? • exactly 5? • warmer than 5? • colder than 5?
Regression • What is regression to the mean? • Suppose the mean temperature in November is 5 degrees and today the temperature is 15 • What’s your best guess for tomorrow’s temperature? • exactly 15 again? • exactly 5? • warmer than 15? • something between 5 and 15?
Regression • What is regression to the mean? • Regression to the mean is the fact that scores tend to be closer to the mean than the values they are paired with • e.g. Daughters tend to be shorter than their mothers if the mothers are taller than the mean, and taller than their mothers if the mothers are shorter than the mean • e.g. Parents with high IQs tend to have kids with lower IQs than their own; parents with low IQs tend to have kids with higher IQs than their own
Regression • What is regression to the mean? • The strength of the correlation between two variables tells you the degree to which regression to the mean affects scores • strong correlation means little regression to the mean • weak correlation means strong regression to the mean • no correlation means that one variable tells you nothing about values of the other - the mean is always your best guess
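The link between correlation strength and regression to the mean can be illustrated with a small simulation. This is only a sketch under stated assumptions: the scores are standard-normal, the paired score is built as `r * x` plus independent noise, and the function names, seed, and cutoff of 1.0 are all made up for the illustration.

```python
import random

def simulate_pairs(r, n, rng):
    """Draw n (x, y) pairs of standard-normal scores with population correlation r."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Mixing x with independent noise gives y unit variance and correlation r with x.
        y = r * x + (1.0 - r * r) ** 0.5 * noise
        pairs.append((x, y))
    return pairs

def mean_of_high_group(pairs, cutoff=1.0):
    """Average x and average paired y among the pairs whose x is above the cutoff."""
    high = [(x, y) for x, y in pairs if x > cutoff]
    mean_x = sum(x for x, _ in high) / len(high)
    mean_y = sum(y for _, y in high) / len(high)
    return mean_x, mean_y

rng = random.Random(0)  # fixed seed so the run is repeatable
for r in (0.9, 0.3, 0.0):
    mx, my = mean_of_high_group(simulate_pairs(r, 20000, rng))
    print(f"r = {r}: mean x of high group = {mx:.2f}, mean paired y = {my:.2f}")
```

With a strong correlation the paired scores should stay close to the extreme x values; as the correlation weakens they should slide back toward the overall mean of 0.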
Regression • Suppose you measured workload (homework hours) and credit hours for 8 students • Could you predict the number of homework hours from credit hours?
Regression • Suppose you measured workload and credit hours for 8 students • Your first guess might be to pick the mean number of homework hours, which is 12.9
Regression • Sum of Squares • Adding up the squared deviation scores gives you a measure of the total error of your estimate
Regression • Sum of Squares • Ideally you would pick an equation that minimizes the sum of the squared deviations • You would need a line that is as close as possible to each point
Regression • The regression line • That line is called the regression line • The sum of squared deviations from it is called the sum of squared error or SSE
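To make the sum-of-squares idea concrete, the sketch below compares the total squared error of guessing the mean for everyone with the SSE of a least-squares line. The eight data points are invented for illustration (the deck's own table is not reproduced here), and the function names are hypothetical.

```python
# Invented data for 8 students; NOT the table from the slides.
credit_hours = [12, 15, 9, 18, 12, 15, 6, 12]
homework_hours = [14, 12, 10, 20, 9, 17, 8, 13]

def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared deviations of the actual y values from the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Slope and intercept that minimize the sum of squared error."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

mean_y = sum(homework_hours) / len(homework_hours)
# Guessing the mean for every student is a flat line: slope 0, intercept mean_y.
ss_mean_guess = sum_squared_error(credit_hours, homework_hours, 0.0, mean_y)

slope, intercept = least_squares_line(credit_hours, homework_hours)
sse = sum_squared_error(credit_hours, homework_hours, slope, intercept)

print(f"SS around the mean guess:       {ss_mean_guess:.2f}")
print(f"SSE around the regression line: {sse:.2f}")
```

Nudging the slope or intercept away from the least-squares values should only make the SSE larger, which is what "as close as possible to each point" means here.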
Regression • The regression line • That line is called the regression line • its equation is: $y' = r_{xy}\frac{S_y}{S_x}x + \left(\bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}\right)$
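Regression • remember: y = ax + b • here the predicted y is $y'$, the slope is $a = r_{xy}\frac{S_y}{S_x}$, and the y intercept is $b = \bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}$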
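As a rough sketch of the y = ax + b form, the code below builds the slope from the correlation and the two standard deviations, and the intercept from the two means. The data are invented, the helper names are hypothetical, and the standard deviation divides by N (an assumption; dividing by N - 1 throughout would give the same slope).

```python
# Invented data for 8 students; NOT the table from the slides.
credit_hours = [12, 15, 9, 18, 12, 15, 6, 12]
homework_hours = [14, 12, 10, 20, 9, 17, 8, 13]

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Standard deviation, dividing by N (an assumption)."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def correlation(xs, ys):
    """Pearson correlation r_xy."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))

def regression_line(xs, ys):
    """Return (a, b) for y' = a*x + b, with a = r*Sy/Sx and b = mean_y - a*mean_x."""
    a = correlation(xs, ys) * std(ys) / std(xs)
    b = mean(ys) - a * mean(xs)
    return a, b

a, b = regression_line(credit_hours, homework_hours)
predicted = a * 14 + b  # predicted homework hours for a student taking 14 credit hours
print(f"y' = {a:.3f} * x + {b:.3f}; prediction at x = 14: {predicted:.2f}")
```

A useful sanity check is that the line passes through the point (mean of x, mean of y).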
Regression • What happens if you had transformed all the scores to z scores and were trying to predict a z score?
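Regression • What happens if you had transformed all the scores to z scores and were trying to predict a z score? • With z scores both means are 0 and Sy = Sx = 1, so the slope reduces to $r_{xy}$ and the y intercept to 0: $Z_y' = r_{xy} Z_x$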
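The z-score question can be checked numerically: fit a least-squares line to the z scores and compare its slope with the correlation. This is a sketch with invented data and hypothetical helper names; z scores are computed with an N-denominator standard deviation (an assumption).

```python
# Invented data for 8 students; NOT the table from the slides.
credit_hours = [12, 15, 9, 18, 12, 15, 6, 12]
homework_hours = [14, 12, 10, 20, 9, 17, 8, 13]

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Standard deviation, dividing by N (an assumption)."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def z_scores(values):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(values), std(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Pearson correlation r_xy."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))

zx = z_scores(credit_hours)
zy = z_scores(homework_hours)
r = correlation(credit_hours, homework_hours)

# Least-squares fit on the z scores. Both means are 0, so the usual
# slope formula reduces to sum(zx*zy) / sum(zx*zx).
slope_z = sum(p * q for p, q in zip(zx, zy)) / sum(p * p for p in zx)
intercept_z = mean(zy) - slope_z * mean(zx)

print(f"r = {r:.4f}, slope on z scores = {slope_z:.4f}, intercept = {intercept_z:.4f}")
```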
The Regression Line • The regression line is a linear function that generates a y for a given x • What should its slope and y-intercept be to be the best predictor? • What does best predictor mean? It means least distance between the predicted y and an actual y for a given x • in other words, how much variability is residual after using the correlation to explain the y scores
Mean Square Residual • Recall that the variance of the z scores is $S_{Z_y}^2 = \frac{\sum (Z_y - \bar{Z}_y)^2}{N}$
Mean Square Residual • The variance of Zy is the average squared distance of each point from the x axis (note that the mean of Zy = 0)
Mean Square Residual • Some of the variance in the Zy scores is due to the correlation with x • Some of the variance in the Zy scores is due to other (probably random) factors
Mean Square Residual • the variance due to other factors is called “residual” because it is “leftover” after fitting a regression line • The best predictor should minimize this residual variance
Mean Square Residual • MSres is the average squared deviation of the actual scores from the regression line: $MS_{res} = \frac{\sum (Z_y - Z_y')^2}{N}$
Minimizing MSres • the regression line (the best predictor of y) is the line with a slope and y intercept such that MSres is minimized
Minimizing MSres • What will be its y intercept? • if there was no correlation at all, your best guess for y at any x would be the mean of y • if there was a strong correlation between x and y, your best guess for the y that matches the mean x would be the mean y • the mean of Zx is zero so the best guess for the Zy that goes with it will be zero (the mean of the Zy’s)
Minimizing MSres • In other words, the regression line will predict zero when Zx is zero so the y intercept of the regression line will be zero (only so for Z scores !)
Minimizing MSres • y intercept is zero
Minimizing MSres • what is the slope?
Minimizing MSres • what is the slope? consider the extremes: • Zy is random with respect to Zx: Zy' = mean Zy = 0, slope = 0 • Zy = Zx: Zy' = Zx, slope = 1 • Zy = -Zx: Zy' = -Zx, slope = -1 • Do the slopes look familiar?
Minimizing MSres • a line (regression of Zy on Zx) that has a slope of rxy and a y intercept of zero minimizes MSres
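One way to see that a slope of rxy minimizes MSres is to try many candidate slopes (with the y intercept fixed at zero) and keep the one with the smallest MSres. The sketch below does this by brute force; the data, the helper names, and the grid of slopes from -1 to 1 in steps of 0.01 are all assumptions made for the illustration.

```python
# Invented data for 8 students; NOT the table from the slides.
credit_hours = [12, 15, 9, 18, 12, 15, 6, 12]
homework_hours = [14, 12, 10, 20, 9, 17, 8, 13]

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Standard deviation, dividing by N (an assumption)."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def z_scores(values):
    m, s = mean(values), std(values)
    return [(v - m) / s for v in values]

def ms_res(zx, zy, slope):
    """Average squared deviation of the actual Zy from the line Zy' = slope * Zx."""
    return sum((y - slope * x) ** 2 for x, y in zip(zx, zy)) / len(zx)

zx = z_scores(credit_hours)
zy = z_scores(homework_hours)
# For z scores (N denominator), r is the average product of paired z scores.
r = sum(p * q for p, q in zip(zx, zy)) / len(zx)

# Try slopes -1.00, -0.99, ..., 1.00 and keep the one with the smallest MSres.
candidates = [i / 100 for i in range(-100, 101)]
best_slope = min(candidates, key=lambda s: ms_res(zx, zy, s))

print(f"r = {r:.4f}, best slope on the grid = {best_slope:.2f}")
print(f"MSres at slope r = {ms_res(zx, zy, r):.4f}")
```

The grid can only land on the candidate nearest to r, so agreement is expected to within about half a grid step.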
Predicting raw scores • we have a regression line in z scores: $Z_y' = r_{xy} Z_x$ • can we predict a raw-score y from a raw-score x?
Predicting raw scores • recall that: $Z_x = \frac{x - \bar{x}}{S_x}$ and $Z_y' = \frac{y' - \bar{y}}{S_y}$
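Predicting raw scores • by substituting we get: $\frac{y' - \bar{y}}{S_y} = r_{xy}\frac{x - \bar{x}}{S_x}$, so $y' = r_{xy}\frac{S_y}{S_x}(x - \bar{x}) + \bar{y}$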
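Predicting raw scores • by substituting we get: $y' = r_{xy}\frac{S_y}{S_x}x + \left(\bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}\right)$ • note that this is still of the form y = ax + b, with $a = r_{xy}\frac{S_y}{S_x}$ and $b = \bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}$ • note that the slope still depends on r and the intercept still depends on the mean of y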
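The substitution can be checked by predicting the same raw score two ways: once by going through z scores, and once with the y = ax + b form. This is a sketch only; the data are invented, the function names are hypothetical, and standard deviations divide by N (an assumption).

```python
# Invented data for 8 students; NOT the table from the slides.
credit_hours = [12, 15, 9, 18, 12, 15, 6, 12]
homework_hours = [14, 12, 10, 20, 9, 17, 8, 13]

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Standard deviation, dividing by N (an assumption)."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def correlation(xs, ys):
    """Pearson correlation r_xy."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))

def predict_via_z(x, xs, ys):
    """Convert x to a z score, multiply by r, then convert back to raw y units."""
    z_x = (x - mean(xs)) / std(xs)
    z_y_pred = correlation(xs, ys) * z_x
    return z_y_pred * std(ys) + mean(ys)

def predict_via_line(x, xs, ys):
    """Use y' = a*x + b with a = r*Sy/Sx and b = mean_y - a*mean_x."""
    a = correlation(xs, ys) * std(ys) / std(xs)
    b = mean(ys) - a * mean(xs)
    return a * x + b

for x in (6, 12, 18):
    via_z = predict_via_z(x, credit_hours, homework_hours)
    via_line = predict_via_line(x, credit_hours, homework_hours)
    print(f"x = {x}: via z scores {via_z:.3f}, via y = ax + b {via_line:.3f}")
```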
Interpreting rxy in terms of variance • Recall that rxy is the slope of the regression line that minimizes MSres
Interpreting rxy in terms of variance • MSres can be simplified to: $MS_{res} = S_y^2\left(1 - r_{xy}^2\right)$
Interpreting rxy in terms of variance • Thus: $r_{xy}^2 = \frac{S_y^2 - MS_{res}}{S_y^2}$ • So $r_{xy}^2$ can be thought of as the proportion of original variance accounted for by the regression line
Interpreting rxy in terms of variance • [Figure: one observed y plotted with the regression line, the predicted y, and the mean of y. Annotations: "What % of this distance is this distance?" and "Subtract this distance".]
Interpreting rxy in terms of variance • it follows that $1 - r_{xy}^2$ is the proportion of variance not accounted for by the regression line - this is the residual variance
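The variance interpretation can be verified on a small example: compute MSres around the regression line, divide by the variance of y, and compare with 1 minus the squared correlation. The sketch assumes invented data, hypothetical helper names, and N denominators for both the variance and MSres.

```python
# Invented data for 8 students; NOT the table from the slides.
credit_hours = [12, 15, 9, 18, 12, 15, 6, 12]
homework_hours = [14, 12, 10, 20, 9, 17, 8, 13]

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Standard deviation, dividing by N (an assumption)."""
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def correlation(xs, ys):
    """Pearson correlation r_xy."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))

r = correlation(credit_hours, homework_hours)
a = r * std(homework_hours) / std(credit_hours)   # slope
b = mean(homework_hours) - a * mean(credit_hours)  # y intercept

# Average squared deviation of the actual scores from the regression line.
ms_res = sum((y - (a * x + b)) ** 2
             for x, y in zip(credit_hours, homework_hours)) / len(homework_hours)
var_y = std(homework_hours) ** 2

proportion_residual = ms_res / var_y
proportion_explained = 1 - proportion_residual

print(f"r squared            = {r * r:.4f}")
print(f"proportion explained = {proportion_explained:.4f}")
print(f"proportion residual  = {proportion_residual:.4f}")
```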
Interpreting rxy in terms of variance • this can be thought of as a partitioning of variance into the variance accounted for by the regression and the variance unaccounted for
Interpreting rxy in terms of variance • often written in terms of sums of squares: $\sum (y - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y - y')^2$ • or simply SStotal = SSregression + SSresidual
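The sums-of-squares partition can be confirmed numerically. The sketch below fits a least-squares line to invented data and checks that the three sums of squares add up; the variable names are hypothetical.

```python
# Invented data for 8 students; NOT the table from the slides.
credit_hours = [12, 15, 9, 18, 12, 15, 6, 12]
homework_hours = [14, 12, 10, 20, 9, 17, 8, 13]

n = len(credit_hours)
mean_x = sum(credit_hours) / n
mean_y = sum(homework_hours) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(credit_hours, homework_hours))
sxx = sum((x - mean_x) ** 2 for x in credit_hours)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

predicted = [slope * x + intercept for x in credit_hours]

# Observed y around the mean, predicted y around the mean, observed y around the line.
ss_total = sum((y - mean_y) ** 2 for y in homework_hours)
ss_regression = sum((p - mean_y) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(homework_hours, predicted))

print(f"SStotal                   = {ss_total:.3f}")
print(f"SSregression + SSresidual = {ss_regression + ss_residual:.3f}")
print(f"SSregression / SStotal    = {ss_regression / ss_total:.4f}")
```

The ratio SSregression / SStotal should match the squared correlation, which ties the partition back to the proportion of variance accounted for.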