Chapter 7 Correlation, Bivariate Regression, and Multiple Regression
Pearson’s Product Moment Correlation • Correlation measures the association between two variables. • Correlation quantifies the extent to which the mean, variation & direction of one variable are related to another variable. • r ranges from -1 to +1. • Correlation can be used for prediction. • Correlation does not indicate the cause of a relationship.
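As a sketch of the idea, Pearson’s r can be computed with NumPy; the paired numbers below are hypothetical, invented for illustration only.

```python
import numpy as np

# Hypothetical paired data: X and Y values (made up for illustration)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 55, 61, 60, 68, 70, 75, 78], dtype=float)

# Pearson's r: covariance of X and Y divided by the product of their SDs;
# np.corrcoef returns the full correlation matrix, so take the off-diagonal
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```

Because the made-up data are nearly linear, r comes out close to +1.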
Scatter Plot • A scatter plot gives a visual description of the relationship between two variables. • The line of best fit is defined as the line that minimizes the squared vertical deviations from each data point up to or down to the line.
Line of Best Fit Minimizes Squared Deviations from a Data Point to the Line
Always do a Scatter Plot to Check the Shape of the Relationship
Will a Linear Fit Work? y = 0.5246x − 2.2473, R² = 0.4259
2nd-Order Fit? y = 0.0844x² + 0.1057x − 1.9492, R² = 0.4666
6th-Order Fit? y = 0.0341x⁶ − 0.6358x⁵ + 4.3835x⁴ − 13.609x³ + 18.224x² − 7.3526x − 2.0039, R² = 0.9337
Linear Fit: y = 0.0012x − 1.0767, R² = 0.0035
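The pattern in the fit examples above, where R² climbs as the polynomial order rises even though the underlying relationship is linear, can be reproduced with a short simulation (the data and coefficients below are synthetic assumptions, not the slides' data):

```python
import numpy as np

# Synthetic noisy linear data (coefficients chosen for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 0.5 * x - 2 + rng.normal(0, 2, size=x.size)

def r_squared(order):
    # Fit a polynomial of the given order, then compute R^2 from residuals
    coeffs = np.polyfit(x, y, order)
    y_hat = np.polyval(coeffs, x)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Higher-order fits never lower R^2 on the same data -- they chase noise
for order in (1, 2, 6):
    print(order, round(r_squared(order), 3))
```

R² never decreases as terms are added, which is exactly why a high R² from a high-order fit is not evidence of a better model; always check the scatter plot.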
Evaluating the Strength of a Correlation • For prediction, an absolute value of r < .7 may produce unacceptably large errors, especially if the SDs of either or both X and Y are large. • As a general rule • |r| ≥ .9 is good • |r| = .7–.9 is moderate • |r| = .5–.7 is low • Values of r below .5 give R² < .25 (under 25% of variance explained); they are poor, and thus not useful for prediction.
Significant Correlation? If N is large (N = 90), then a correlation of .205 is statistically significant. ALWAYS THINK ABOUT R² How much variance in Y is X accounting for? r = .205, R² = .042, thus X accounts for only 4.2% of the variance in Y. This will lead to poor predictions. A 95% confidence interval will also show how poor the prediction is.
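The standard significance test for a Pearson correlation converts r to a t statistic with N − 2 degrees of freedom; the sketch below uses the slide's values (r = .205, N = 90) to show how little variance is explained even when r is near the significance threshold:

```python
from math import sqrt

# Values from the slide: r = .205 with N = 90
r, n = 0.205, 90

r_squared = r ** 2                        # proportion of variance in Y explained by X
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)    # t statistic with n - 2 = 88 df

print(round(r_squared, 3), round(t, 2))
```

The t value lands right around the two-tailed .05 cutoff for 88 df, yet R² is only about .042, which is the slide's point: statistical significance alone does not make a correlation useful for prediction.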
Venn diagram shows the amount of variance in Y that is explained by X (R²). R² = .64 (64%): variance in Y that is explained by X. Unexplained variance in Y: (1 − R²) = .36 (36%).
The vertical distance (up or down) from a data point to the line of best fit is a RESIDUAL. r = .845, R² = .714 (71.4%). Y = mX + b: Y = .72X + 13
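Using the slide's fitted line Y = .72X + 13, each residual is simply the observed Y minus the predicted Y; the five data points below are hypothetical, chosen only to illustrate the arithmetic:

```python
import numpy as np

# Slide's fitted line: Y' = .72*X + 13
# Hypothetical (X, Y) observations, made up for illustration
x = np.array([10, 20, 30, 40, 50], dtype=float)
y = np.array([22, 25, 36, 40, 51], dtype=float)

y_hat = 0.72 * x + 13        # predicted Y on the line
residuals = y - y_hat        # vertical distance from each point to the line

print(residuals)             # positive = point above the line, negative = below
```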
Calculation of Regression Coefficients (b, C) If r < .7, prediction will be poor. Large SDs adversely affect the accuracy of the prediction.
Standard Error of Estimate (SEE): SD of Y Prediction Errors. The SEE is the SD of the prediction errors (residuals) when predicting Y from X. The SEE is used to make a confidence interval for the prediction equation.
The SEE is used to compute confidence intervals for the prediction equation.
Example of a 95% confidence interval. Both r and SDY are critical to the accuracy of prediction. If SDY is small and r is large, prediction errors will be small. If SDY is large and r is small, prediction errors will be large. We are 95% sure the mean falls between 45.1 and 67.3.
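A rough sketch of the SEE and a ±2·SEE band around one prediction, on made-up data (the n − 2 denominator is the usual bivariate-regression convention, since two parameters are estimated):

```python
import numpy as np

# Hypothetical paired data (invented for illustration)
x = np.array([2, 4, 5, 7, 8, 10, 12, 13], dtype=float)
y = np.array([10, 14, 13, 19, 20, 24, 28, 27], dtype=float)

# Least-squares line: Y' = m*X + b
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# SEE: SD of the residuals, with n - 2 in the denominator
n = x.size
see = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Approximate 95% band around the prediction at X = 9 (the +/- 2*SEE shortcut;
# an exact interval would use the t distribution and widen away from the mean)
y_hat = m * 9 + b
print(round(y_hat - 2 * see, 1), round(y_hat + 2 * see, 1))
```

Because r is high on this made-up data, the SEE is much smaller than the SD of Y, so the band is narrow, matching the slide's point about r and SDY.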
Multiple Regression • Multiple regression is used to predict one Y (dependent) variable from two or more X (independent) variables. • The advantages of multiple regression over bivariate regression are that it • provides a lower standard error of estimate • determines which variables contribute to the prediction and which do not.
Multiple Regression • b1, b2, b3, … bn are coefficients that weight the independent variables according to their relative contribution to the prediction of Y. • X1, X2, X3, … Xn are the predictors (independent variables). • C is a constant, similar to the Y intercept. • Example: Body Fat = b1(Abdominal) + b2(Tricep) + b3(Thigh) + C
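A minimal multiple-regression sketch using NumPy's least-squares solver; the skinfold and body-fat numbers below are invented for illustration, not real norms:

```python
import numpy as np

# Hypothetical predictors: abdominal, tricep, thigh skinfolds (columns of X)
X = np.array([[30, 12, 20],
              [25, 10, 18],
              [40, 15, 25],
              [35, 14, 22],
              [28, 11, 19],
              [45, 18, 27]], dtype=float)
y = np.array([18.0, 15.0, 25.0, 22.0, 16.5, 28.0])  # % body fat (made up)

# Append a column of ones so lstsq also estimates the constant C
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, b3, c = coef

y_hat = A @ coef   # predicted body fat from the weighted predictors plus C
print([round(v, 2) for v in (b1, b2, b3, c)])
```

The fitted coefficients play the role of b1, b2, b3 and C in the slide's equation: each one weights its predictor by its contribution to predicting Y.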
List the variables in the order they enter the equation • X2 has the biggest area (C), so it enters first. • X1 enters next: area (A) is bigger than area (E), and both A and E are unique, not common to C. • X3 enters next; it uniquely adds area (E). • X4 is not related to Y, so it is NOT in the equation.
Ideal Relationship Between Predictors and Y Each variable accounts for unique variance in Y Very little overlap of the predictors Order to enter? X1, X3, X4, X2, X5
Regression Methods • Enter: forces all predictors (independent variables) into the equation in one step. • Forward: each step adds a new predictor; predictors enter based upon the unique variance in Y they explain. • Backward: starts with the full equation (all predictors) and removes them one at a time, on each step dropping the predictor that adds the least. • Stepwise: each step adds a new predictor; on any step a predictor can be added and another removed if it has a high partial correlation with the newly added predictor.
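The Forward method can be sketched as a greedy loop that, on each step, adds whichever remaining predictor raises R² the most. The `min_gain` stopping rule below is a simplifying assumption; SPSS actually uses an F-to-enter probability criterion. The demo data are simulated.

```python
import numpy as np

def r2(X, y):
    # R^2 of an ordinary least-squares fit of y on the columns of X (plus intercept)
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, min_gain=0.01):
    # Greedy forward method: each step adds the predictor that raises R^2 most;
    # stop when no remaining predictor adds at least min_gain
    chosen, best = [], 0.0
    while True:
        gains = {j: r2(X[:, chosen + [j]], y)
                 for j in range(X.shape[1]) if j not in chosen}
        if not gains:
            break
        j = max(gains, key=gains.get)
        if gains[j] - best < min_gain:
            break
        chosen.append(j)
        best = gains[j]
    return chosen, best

# Simulated demo: y depends strongly on column 0, weakly on column 1,
# and not at all on column 2 (pure noise)
rng = np.random.default_rng(1)
x1, x2, noise_col = (rng.normal(size=50) for _ in range(3))
y = 2 * x1 + x2 + rng.normal(0, 0.5, size=50)
X = np.column_stack([x1, x2, noise_col])

order, final_r2 = forward_select(X, y)
print(order, round(final_r2, 3))
```

The strongest predictor enters first and the noise column is left out, mirroring the slide's Venn-diagram ordering of X2, X1, X3 with X4 excluded.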
Regression Methods in SPSS Choose desired Regression Method.
Regression Assumptions • Homoscedasticity: equal variance of Y at any value of X. • The residuals are normally distributed around the line of best fit. • X and Y are linearly related.
Tests for Normality • Use SPSS • Descriptives • Explore
Tests for Normality The significance value is not less than 0.05, so the data are normal.
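In code, the same kind of normality check can be run with SciPy's Shapiro-Wilk test (the sample below is simulated; it is not the Cntry15.Sav data):

```python
import numpy as np
from scipy import stats

# Simulated sample drawn from a normal distribution (made up for illustration)
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=40)

# Shapiro-Wilk test: a p-value not less than .05 means we fail to reject
# the hypothesis that the data came from a normal distribution
w, p = stats.shapiro(sample)
print(round(w, 3), p >= 0.05)
```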
Tests for Normality: Normal Probability Plot or Q-Q Plot If the data are normal the points cluster around a straight line
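The Q-Q idea can be sketched without plotting: compare the sorted data against theoretical normal quantiles and check how tightly the points track a straight line. The data are simulated, and the (i − 0.5)/n plotting positions are one common convention among several.

```python
import numpy as np
from statistics import NormalDist

# Simulated normal sample (made up for illustration)
rng = np.random.default_rng(7)
data = np.sort(rng.normal(size=100))

# Theoretical normal quantiles at plotting positions (i - 0.5) / n
n = data.size
probs = (np.arange(1, n + 1) - 0.5) / n
theo = np.array([NormalDist().inv_cdf(p) for p in probs])

# If the data are normal, the Q-Q points hug a straight line, so the
# correlation between theoretical and sample quantiles is close to 1
r = np.corrcoef(theo, data)[0, 1]
print(round(r, 3))
```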
Tests for Normality: Boxplots The bar is the median; the box extends from the 25th to the 75th percentile; whiskers extend to the largest and smallest values within 1.5 box lengths. Outliers are labeled with O; extreme values are labeled with a star.
Cntry15.Sav Example of Regression Assumptions
Standardized Residual Stem-and-Leaf Plot
Frequency   Stem & Leaf
 3.00       -1 . 019
 4.00       -0 . 0148
 7.00        0 . 0466669
 1.00        1 . 7
Stem width: 1.00000
Each leaf: 1 case(s)
Cntry15.Sav Example of Regression Assumptions The distribution is normal; two scores fall somewhat outside the rest.
Cntry15.Sav Example of Regression Assumptions No Outliers [labeled O] No Extreme scores [labeled with a star]
Cntry15.Sav Example of Regression Assumptions The points should fall randomly in a band around 0 if the distribution is normal. In this distribution there is one extreme score.
Cntry15.Sav Example of Regression Assumptions The data are normal.