Section 7 Correlation & Regression William Christensen, Ph.D.
IMPORTANT Note • Before you can use the Excel functions for regression (this section) or ANOVA (section 8), you must be sure you've set up Excel to access these functions. To do that, while in an Excel spreadsheet, click "Tools" then "Add-Ins". Then make sure that "Analysis ToolPak" and "Analysis ToolPak - VBA" are checked.
Correlation exists between two variables when they are related to each other in some way • Example: when one increases the other also increases, OR when one increases the other decreases
Assumptions 1. The sample of paired data (x,y) is a random sample. 2. The pairs of (x,y) data have a bivariate normal distribution (for each x value the y values are normally distributed, and vice versa).
Definition • Scatterplot (or scatter diagram) is a graph in which the paired (x,y) sample data are plotted with a horizontal x axis and a vertical y axis. Each individual (x,y) pair is plotted as a single point.
Scatter Diagram of Paired Data Restaurant Bill (x) and Tip (y)
Positive Linear Correlation • Sample scatter plots showing various degrees of "positive" correlation: (a) Positive, (b) Strong positive, (c) Perfect positive. That is, when x increases y also increases.
Negative Linear Correlation • Sample scatter plots showing various degrees of "negative" correlation: (d) Negative, (e) Strong negative, (f) Perfect negative. When x increases y decreases.
No Linear Correlation • Sample scatter plots showing NO linear correlation: (g) No correlation, (h) Nonlinear correlation. The first plot (g) has no correlation of any kind. The other plot (h) shows a clear correlation between x and y, but the correlation is NOT LINEAR. Nonlinear correlation can be studied, but it is beyond this class. We will only study LINEAR CORRELATION.
Definition • Linear correlation can be measured. The strength of the relationship between an "x" variable and a "y" variable is measured by the "linear correlation coefficient" • The letter r represents the linear correlation coefficient for paired x,y sample data • The Greek letter ρ (rho) represents the linear correlation coefficient for paired x,y population data
The Linear Correlation Coefficient r • The values of the linear correlation coefficient are ALWAYS between -1 and +1 • If r = +1 there is a perfect positive linear correlation between x and y • If r = -1 there is a perfect negative linear correlation between x and y • If r = 0 there is absolutely no linear correlation between x and y
Properties of the Linear Correlation Coefficient r • -1 ≤ r ≤ 1 (r is ALWAYS between -1 and 1) • The value of r does not change if all values of either variable are converted to a different scale. E.g., you can change from meters to feet and it will not change the value of r. • You can interchange x and y and the value of r will not change. I.e., it does not matter which comes first. • Only linear relationships are measured by r.
The Linear Correlation Coefficient r • We will use Excel to calculate the linear correlation coefficient, but just in case you want to see the formula, here it is:
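For reference, the standard computational formula for the sample linear correlation coefficient (the same quantity Excel's =CORREL returns) is:

r = [nΣxy − (Σx)(Σy)] / √{ [nΣx² − (Σx)²] · [nΣy² − (Σy)²] }

where n is the number of (x,y) pairs and each Σ is a sum taken over all of the pairs. You will not need to compute this by hand in this class.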
The Linear Correlation Coefficient r • The Excel function for finding the linear correlation coefficient is =CORREL(array1,array2) • Where array1 is the range of x values and array2 is the range of y values (or vice versa) • Hint: always arrange the x and y values in side-by-side columns. Since the data is "paired" there is a specific y that goes with each x. Make sure these pairs are side-by-side. As long as the pairs are side-by-side, the sequencing of pairs makes no difference. See the next slide for an example.
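As a quick sketch (the cell ranges here are hypothetical), if the x values were in cells B2:B11 and their paired y values were right beside them in C2:C11, the correlation coefficient could be computed with:

=CORREL(B2:B11, C2:C11)

Swapping the two ranges gives the same answer, since r does not depend on which variable is listed first.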
Interpreting the Linear Correlation Coefficient – 2 ways • If the absolute value of r exceeds the value in Table A – 6 (see next slide), conclude that there is a significant linear correlation. • Otherwise, there is not sufficient evidence to support the conclusion of significant linear correlation. • Or, use the hypothesis testing method described in this section to confirm whether the correlation / relationship between x and y is significant
TABLE A-6  Critical Values of the Pearson Correlation Coefficient r
n      α = .05    α = .01
4      .950       .999
5      .878       .959
6      .811       .917
7      .754       .875
8      .707       .834
9      .666       .798
10     .632       .765
11     .602       .735
12     .576       .708
13     .553       .684
14     .532       .661
15     .514       .641
16     .497       .623
17     .482       .606
18     .468       .590
19     .456       .575
20     .444       .561
25     .396       .505
30     .361       .463
35     .335       .430
40     .312       .402
45     .294       .378
50     .279       .361
60     .254       .330
70     .236       .305
80     .220       .286
90     .207       .269
100    .196       .256
Using Table A-6 • Note on rounding r: Round to three decimal places • Example: For our Manatee problem, we calculated r = 0.922 (as shown below) • Using an alpha=0.05, and with n=10 we see from Table A-6 that an r with an absolute value greater than 0.632 is significant. • Since our calculated r = 0.922, we can say that the relationship between the number of boats and the number of Manatee deaths is statistically significant.
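As an optional cross-check (not required for this class), the table value agrees with the t-based hypothesis test described later in this section. Solving the test statistic t = r / √((1 − r²)/(n − 2)) for r gives r = t / √(t² + n − 2), so for n = 10 and the two-tailed critical value t = 2.306:

r = 2.306 / √(2.306² + 10 − 2) = 2.306 / √13.32 ≈ 0.632

which matches the Table A-6 entry used above.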
Common Errors Involving Correlation • Causation: It is wrong to conclude that correlation implies causality • Just because x and y are "related" does NOT MEAN that x "causes" y or vice versa. Determining causality is a separate issue and beyond this class (this is a very common mistake) • Averages: Averages suppress individual variation and may inflate the correlation coefficient • Make sure you are using data for individual events, NOT averaged or grouped data • Linearity: There may be some relationship between x and y even when there is no significant linear correlation • Nonlinear relationships can be studied, but NOT in this class
Formal Hypothesis Test • To determine whether there is a significant linear correlation between two variables • Let • H0: ρ = 0 (no significant linear correlation) and • H1: ρ ≠ 0 (significant linear correlation) • Remember: ρ is the symbol for correlation between the populations of x's and y's and r is the symbol for correlation between a sample of x's and y's
Test Statistic is t (follows the format of earlier sections) • Test statistic: see the formula below • Critical values: • Use =NORMSINV for large samples • For small samples (n<30) use =TINV(prob,df) with probability = α (Excel knows it is a two-tailed test, so input the entire alpha, NOT α/2), and (df) degrees of freedom = n – 2 (YES, that's df = n – 2 in this case, NOT df = n – 1 as in previous sections)
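The test statistic referred to above is the standard one for testing ρ = 0:

t = r / √((1 − r²) / (n − 2))

with n − 2 degrees of freedom, where r is the sample correlation coefficient (from =CORREL) and n is the number of (x,y) pairs.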
Manatee Example (Hypothesis Test) • Given α = 0.05 and n=10 • Test Statistic: t = 6.735 (calculation shown below) • Critical Values: -2.306 and +2.306 (from =TINV(0.05,8))
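Plugging the sample correlation found earlier (r = 0.922, n = 10) into the test statistic formula gives the value shown on the next slide:

t = 0.922 / √((1 − 0.922²) / (10 − 2)) = 0.922 / √(0.1499 / 8) ≈ 0.922 / 0.1369 ≈ 6.735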
Manatee Example • The test statistic t = 6.735 falls far beyond the critical values of -2.306 and +2.306 (i.e., in the rejection region of the t distribution centered at t = 0) • Conclusion: Reject H0 and accept the alternate hypothesis (H1) that there is a significant relationship between x and y (number of boats and manatee deaths)
Regression Definition • The “Regression” process seeks to find a line (and the equation that describes that line) that “fits” through the paired data so that the sum of the vertical distances (squared) between the actual data points and the line is minimized. • The Regression Equation takes the form:
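In symbols, the regression equation has the form:

y-hat = b0 + b1x

where y-hat is the predicted value of y, b0 is the y-intercept, and b1 is the slope (both are defined on the following slides).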
Regression Line Plotted on Scatter Plot The Regression Line is the line of "best fit" through the data points. "Best fit" means the sum of the squared vertical distances between each data point and the regression line is minimized.
The Regression Equation • x is the independent variable (predictor variable) • y-hat is the dependent variable (response variable) • y-hat is the "dependent" or "response" variable because it depends on, or responds to, the value of x • b0 is the y-intercept, or the value at which the regression line crosses the vertical axis • b1 is the slope of the regression line, or the amount of change in y for every 1 unit change in x • x is the "independent" or "predictor" variable because it acts independently to predict the value of y-hat
Assumptions • We are investigating only linear relationships • For each x value, y is a random variable having a normal (bell-shaped) distribution. All of these y distributions have the same variance • Results are not seriously affected if departures from normal distributions and equal variances are not too extreme
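Although this class uses Excel's full Regression tool (shown later), Excel can also return these two coefficients directly. As a sketch with hypothetical cell ranges, if the y values were in C2:C9 and the paired x values in B2:B9:

=SLOPE(C2:C9, B2:B9)       returns b1 (the slope)
=INTERCEPT(C2:C9, B2:B9)   returns b0 (the y-intercept)

Note that both functions take the range of y values first and the range of x values second.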
Rounding Rule for Regression • Round the y-intercept (b0) and the slope (b1) to three significant digits • If calculating manually, do not round intermediate values
Steps in Doing Regression • Use one of the methods we already learned to check for a significant linear correlation (r) between x and y • If there is NOT a significant linear correlation between x and y, then we CANNOT USE REGRESSION to predict y and our best estimate of y is simply y-bar (the mean of y). In this case forget about using regression because it fails us • However, IF THERE IS a significant linear correlation between x and y, the best predicted y-value is found by putting the x-value into the regression equation and calculating y
Regression using Excel • Click on: • Tools • Data Analysis • Regression • See the following examples of how to do Regression using Excel
Regression using Excel (the Regression dialog box) • Input the range of cells that contains the y variable into the "Input Y Range" box, and the range of cells that contains the x variable into the "Input X Range" box • Check the "Labels" box ONLY if you include in the y and x ranges the "headings" or cells that label y and x. I prefer to do that. • Leave the "Constant is Zero" box unchecked • It doesn't matter if you check the "Confidence Level" box, but I prefer to leave it unchecked • The "Output Range" lets you specify where you want the results of the Regression analysis to appear • The remaining options are beyond this class (don't worry about these). See me if you have questions.
Regression Example: What is the best predicted size of a household that discards 0.50 lb of plastic?

Data was collected by some poor schmucks going around to different houses and going through the garbage. The data is listed in columns (always list data in columns, not in rows, when using Excel). x represents the lbs. of plastic thrown out and y represents the number of people in that household. Notice that the y variable (number in household) is what we are trying to predict and the x variable (lbs. of plastic tossed) is what we use to try to predict y.

Step 1: Determine whether or not there is a significant relationship between x and y. If there is NOT a significant relationship then we CANNOT use Regression. If there IS a significant relationship between x and y then we can proceed to use Excel to do a Regression Analysis and find the Regression Equation (note: assume an alpha of 0.05 unless specifically told otherwise).
Regression Example (Hypothesis Test) • Given α = 0.05 and n=8 (df=n-2=8-2=6) • Test Statistic: (a rough cross-check is sketched below) • Critical Values: -2.447 and +2.447 • Conclusion: The Test Statistic is greater than the Critical Value, so we conclude that there IS a significant linear relationship between the variables and we can now proceed with our Regression Analysis
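As a rough cross-check (the exact value is not listed here), the R Square of about 0.71 reported in the regression output on a later slide corresponds to r ≈ √0.71 ≈ 0.84, so the test statistic would be approximately:

t ≈ 0.84 / √((1 − 0.71) / (8 − 2)) = 0.84 / √0.0483 ≈ 3.8

which is well beyond the critical value of 2.447, consistent with the conclusion above.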
Regression Example:What is the best predicted size of a household that discards 0.50 lb of plastic? By specifying an Output Range of $A$11 we are telling Excel to use cell A11 as the upper-left corner of the Regression Analysis output (see next slide)
Regression Example: Answer this question using Regression. What is the best predicted size of a household that discards 0.50 lb of plastic? Before you get too excited about this output, let's cross off the info that we are not going to discuss or learn about in class. I'm only trying to give you an elementary exposure to Regression. The following slides will show you the things you need to understand, and each of those items will be explained.
Regression Example: Answer this question using Regression. What is the best predicted size of a household that discards 0.50 lb of plastic? Info that you MUST know for test / Info to help you understand Regression • Multiple R has no importance by itself; it is the square of this value, "R Square", that is important • "R Square" is also known as the "Coefficient of Determination" and represents the amount (percent) of the variance in y that is explained by x. You can sort-of say that this value shows how accurate or effective our regression equation is. In this case, our regression equation is 71% accurate or effective in predicting y, given some value of x. • These coefficients form the Regression Equation. In this case y-hat = 0.549 + 1.480(x). Thus, the answer to our question is y-hat = 0.549 + 1.480(0.50) = 1.289 people (or 1, rounded to the nearest whole person, in a household that tosses 0.50 lbs. of plastic)
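The same prediction can also be obtained directly in Excel without typing the coefficients. As a sketch with hypothetical cell ranges (assuming the x values are in A2:A9 and the paired y values in B2:B9):

=FORECAST(0.5, B2:B9, A2:A9)

FORECAST takes the new x value first, then the range of known y values, then the range of known x values, and returns the predicted y from the least-squares regression line, the same value the regression equation above produces.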
Regression Example: Answer this question using Regression. What is the best predicted size of a household that discards 0.50 lb of plastic? Info that you MUST know for test / Info to help you understand Regression • "Adjusted R Square" is the same as "R Square" EXCEPT that it is adjusted for sample size. When the sample size (n) is small, this value will be quite a bit below the unadjusted R Square value. As the sample size increases, the difference between R Square and Adjusted R Square becomes negligible. Adjusted R Square is a more accurate representation of the accuracy or effectiveness of our regression equation. • Standard Error is an estimate of how far off our y predictions will be (on average) using the regression equation • Observations is simply n • Here we simply see that Excel determines the appropriate degrees of freedom – nothing for us to do here
Regression Example: Answer this question using Regression. What is the best predicted size of a household that discards 0.50 lb of plastic? Info that you MUST know for test / Info to help you understand Regression • Significance F shows the alpha level that separates the regression equation between being statistically significant and NOT. Thus, if this value is LESS THAN our alpha value (usually 0.05) then the regression equation is statistically significant (i.e., a low chance of error in predicting y). Conversely, if this value is greater than our alpha then the regression equation is not statistically significant (i.e., a relatively high chance of error in predicting y). In this case, since Significance F is much less than 0.05, we conclude that the regression equation is statistically significant. • SS requires some explanation – see the next couple of slides • You may remember the F test from the section on hypothesis testing. Excel does an F test to see if the regression equation is any better than simply using the average y as the predictor of y. A large F value generally suggests that the regression equation is effective, but we rely on the "Significance F" to tell for sure.
Regression Example: Answer this question using Regression. What is the best predicted size of a household that discards 0.50 lb of plastic? SS (Sum of Squares) – NOT REQUIRED FOR YOU TO KNOW

[The slide shows a scatter plot with the regression line y-hat = 3 + 2x and the mean line y-bar = 9, illustrating at one data point the total deviation (y – y-bar), the explained deviation (y-hat – y-bar), and the unexplained deviation, or residual (y – y-hat).]

• SS Regression (Explained Deviation) – This is calculated by first finding the vertical distance between the regression line and the mean y value (y-bar) at each data point. Next, those individual differences are squared and then all added together (summed) to form the SS (sum of squares) Regression. In practical terms, in order for the regression equation to be valid, we should see a relatively high value here as compared to SS Residual.
• SS Residual (Unexplained Deviation) – This is calculated by first finding the vertical distance between each data point and the regression line. Since we hope our regression line explains the relationship between x and y, any difference between the line and a data point is "unexplained". We hope this value is relatively small compared to SS Regression.
• SS Total – This is simply SS Regression + SS Residual. It can also be calculated by finding the vertical distance between each data point and y-bar, squaring each value, and adding them all up.
Multiple Regression • Multiple Regression is a process for defining a linear relationship between a dependent variable y and two or more independent variables (x1, x2, x3 . . . , xk) • The linear equation for a regression problem in which we have multiple x variables is as follows, where b0 is the y-intercept and all the other b's are coefficients associated with their respective x values:
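The multiple regression equation (the form described above) is:

y-hat = b0 + b1x1 + b2x2 + . . . + bkxk

where each coefficient bi is the value Excel reports for its corresponding xi variable.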
Multiple Regression Guidelines • More x variables is NOT necessarily better • Remember that R Square is a measure of how effective our regression equation is. Therefore, if adding an x variable does not appreciably increase the R Square value, then DON’T add it • Use those x variables (the fewest possible) that give you the biggest R Square (or Adjusted R Square) value. We want efficiency so a few variables that provide a big R Square is best
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. Step 1: Construct a "Correlation Matrix" to see which x variables have the strongest linear relationships with the y variable (weight). Use the Excel menu Tools, Data Analysis, Correlation to construct a correlation matrix. An Excel file containing this Bear data and Correlation Matrix is on the class website (mrbear.xls).
Multiple Regression Example:Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. • Step 1: Construct a “Correlation Matrix” to see which x variables have the strongest linear relationships with the y variable (weight) • Ideally, we want to pick those few x variables that have strong correlations (close to -1 or +1) with the y variable, BUT we also want the x variables to NOT be highly correlated with each other • The addition of an x variable that is strongly correlated with any x variable(s) already in a multiple regression equation WILL NOT do much to increase the R Squared or Adjusted R Square value • On the other hand, adding an x variable that is strongly correlated with the y variable, but NOT with any x variables already in the regression equation WILL increase our R Squared value substantially
Multiple Regression Example:Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. Choosing which x variables to include in a multiple regression problem is often a subjective decision. I suggest you try whichever variables you think will work best. For this example, I am going to choose Neck size as my first x variable to predict y (weight) since it has an awesome correlation of 0.971 with Weight. To pick a second x variable I want one that is highly correlated with Weight, but NOT with Neck. Unfortunately, it looks like all the other x variables ARE highly correlated with Neck. In fact, if this were a real problem we would probably stick with simple regression and just use Neck to predict Weight. But, since we have to do a multiple regression problem I think I’ll pick Age as my second x variable. Age is not that highly correlated with Weight (0.814), but it is also the least highly correlated with Neck (0.906). Why don’t you try the problem with other x variables and see if you can beat my Adjusted R Square value.
Multiple Regression Example:Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. • For Multiple Regression we use the same Tools, Data Analysis, Regression function in Excel that we used for simple Regression. • The “Y Range” is still simply the range of cells that contains the y-variable • The “X Range” is the range of cells that contains ALL the x-variables we want to include in the regression model. MAKE SURE the x-variables are in adjacent columns (you cannot skip columns). Notice how I now have Neck and Age right next to each other. • I like to include the column headings or labels, so I checked the “labels” box. When you do this the output also includes the labels, making it a lot easier to interpret. • Everything else is the same as when we did simple Regression
Multiple Regression Example: Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. Again, before we analyze the output, let's cross off the info that we are not going to discuss or learn about in class. The following slides will explain those things you need to understand.
Multiple Regression Example:Using the following data (measurements taken from Bears that had been anesthetized), construct a multiple regression equation to predict the weight of Bears. Just a note: I ran another regression with Neck size as the only x variable. You can see the Adjusted R Square is 0.9330 as compared to the Adjusted R Square of 0.9536 that we got when we included Age in addition to Neck. I don’t think I would consider the slight increase in Adjusted R Square from including Age as really being worth the trouble of including another variable. There is no set guideline as to how much R Square should increase to justify adding another variable, but going from an already high R Squared to a slightly higher value does not seem worthwhile in my opinion