Crash Course in Correlation and Regression

MEASURING ASSOCIATION Establishing the degree of association between two or more variables is a central objective of the scientific enterprise. Scientists spend most of their time figuring out how one thing relates to another and structuring these relationships into explanatory theories.

Scatterplots A list of 1,000 data points would be impossible to grasp, so we need a method that converts the data into a more comprehensible format. One method is to plot the data for two variables (education and income; father’s height and son’s height; team spending in baseball and % wins) in a graph called a scatter diagram.

[Figure: example scatter diagrams with r = 1.0, r = .85, r = .42, r = .17, r = -.94, r = -.54, and r = -.33]

Formula for the Correlation Coefficient
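The formula itself is not reproduced in this transcript. Assuming the slide showed the usual Pearson product-moment coefficient for n paired observations (x_i, y_i), with x-bar and y-bar as the sample means, it can be written as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables move together; the denominator rescales it so that r always falls between -1 and +1.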

Interpreting correlation coefficients The coefficient ranges from -1 to +1. As a rough guide to its absolute value: 0 = no association; .25 = weak; .5 = moderate; .75 and above = strong. Squaring the correlation coefficient creates “R-squared,” defined as the proportion of the variance of one variable accounted for by another variable, a.k.a. the PRE STATISTIC (Proportionate Reduction of Error). Which brings us to Regression.
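As a minimal sketch of how r and R-squared relate, the following computes the standard Pearson product-moment correlation and then squares it. The function name `pearson_r` and the education and income figures are made up for illustration; they are not data from the presentation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: years of education and income in $1,000s
years_of_education = [10, 12, 12, 14, 16, 18]
income_thousands = [25, 30, 28, 40, 52, 60]

r = pearson_r(years_of_education, income_thousands)
r_squared = r ** 2  # share of variance in one variable accounted for by the other
print(f"r = {r:.2f}, r-squared = {r_squared:.2f}")
```

Because R-squared is the square of r, a correlation of -0.5 and one of +0.5 account for the same share of variance; the sign only gives the direction of the relationship.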

MLB spending and performance example (Hoover & Donovan 2001): Y [team finish] = α + βX [spending] Expressing the model in words: values of the Y variable (team finish: 1st place, 2nd place, etc.) are a function of some constant (α), plus some amount of the X variable (spending). How much change in the Y variable (team finish) is associated with a change in the X variable (spending)? The answer lies in β (beta), a.k.a. the regression coefficient. In the baseball example, it would be the amount of improvement in team finish associated with an additional $1 million in spending on players’ salaries.

Hoover and Donovan using 1999 MLB season data and a bivariate regression found: Team finish = 4.4 – 0.03 x spending (in $millions) Interpretation: The beta (a.k.a. the slope) of –0.03 is the estimated change in team finish per $1 million of spending. That is, each additional million dollars that a team spends is associated with an improvement of only 0.03 of a place in division position (a lower finish number is better). These results predict that a team spending $70 million on players will finish close to second place (4.4 – 0.03 x 70 = 2.3). We can also show that any given team would have to spend almost $34 million more to improve its team finish by one position (-0.03 x $34 million = -1.02). The correlation was -0.39, which means that spending explains only about 15 percent of the variation in the team’s finish (r-squared = .15 = -0.39 x -0.39).
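The arithmetic behind this interpretation can be checked with a short sketch. The intercept, slope, and correlation are the values reported above; the function names are hypothetical helpers introduced only for illustration.

```python
INTERCEPT = 4.4      # predicted finish at $0 spending, from the reported equation
SLOPE = -0.03        # change in finish (places) per additional $1 million
CORRELATION = -0.39  # reported correlation between spending and finish


def predicted_finish(spending_millions):
    """Predicted division finish for a given payroll in $ millions."""
    return INTERCEPT + SLOPE * spending_millions


def spending_change_for(position_change):
    """Extra spending ($ millions) associated with a given change in finish."""
    return position_change / SLOPE


finish_at_70 = predicted_finish(70)         # 4.4 - 2.1, about 2.3
extra_for_one_place = spending_change_for(-1)  # about 33.3 ($ millions)
r_squared = CORRELATION ** 2                # about 0.15

print(f"Predicted finish at $70M: {finish_at_70:.1f}")
print(f"Spending to move up one place: ${extra_for_one_place:.1f}M")
print(f"r-squared: {r_squared:.2f}")
```

Note that moving up one place is a change of -1 in team finish, which is why the sign of the slope matters when working backward from a desired improvement to the spending it would take.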

Another Baseball Example • Testing Causality Between Team Performance and Payroll: The Cases of Major League Baseball and English Soccer • By Stephen Hall, Stefan Szymanski and Andrew S. Zimbalist • Journal of Sports Economics, 2002

Multiple Regression Multiple regression contains a single dependent variable and two or more independent variables. It is particularly appropriate when the causes (independent variables) are inter-correlated, which is usually the case.

Multivariate Regression is a powerful tool to examine how multiple factors (independent variables) influence a dependent variable. It differs from bivariate regression in that it can identify the independent effect a variable has on a dependent variable by holding all other variables constant. What other variables would we include in the baseball model to predict winning %?

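[Figure 1: Venn diagram of Y, X1, and X2; X1 and X2 each overlap Y but not each other]

[Figure 2: Venn diagram of Y, X1, and X2; X1 and X2 overlap each other, and the area they share with Y is labeled c]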

In Figure 1, the fact that X1 and X2 do not overlap means that they are not correlated with each other, but each is correlated with Y. This is great and means we don’t need sophisticated analysis, just two separate bivariate regressions. In Figure 2, X1 and X2 are correlated. The area c is created by the correlation between X1 and X2; it represents the proportion of the variance in Y that is shared jointly with X1 and X2. How do we deal with c? We can’t count it twice, or we could end up claiming to explain more than 100% of the variance. This is the problem multivariate regression handles.
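The double-counting problem can be illustrated with the standard formula for R-squared in a regression with exactly two predictors, written in terms of the three pairwise correlations. This is a sketch under stated assumptions: the function name is hypothetical and the correlation values are invented for illustration, not taken from the baseball data.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R-squared for Y regressed on X1 and X2, from pairwise correlations.

    r_y1, r_y2: correlations of Y with X1 and with X2
    r_12: correlation between X1 and X2
    """
    if abs(r_12) >= 1:
        raise ValueError("X1 and X2 must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical correlations of each predictor with Y
r_y1, r_y2 = 0.6, 0.5

# Adding the two bivariate r-squared values counts any shared area twice
naive_sum = r_y1 ** 2 + r_y2 ** 2

# Figure 1 case: predictors uncorrelated, so nothing is double-counted
uncorrelated = r_squared_two_predictors(r_y1, r_y2, 0.0)

# Figure 2 case: predictors correlated at 0.4, so area c is counted only once
correlated = r_squared_two_predictors(r_y1, r_y2, 0.4)

print(f"Sum of bivariate r-squared values: {naive_sum:.2f}")
print(f"R-squared, uncorrelated predictors: {uncorrelated:.2f}")
print(f"R-squared, correlated predictors:   {correlated:.2f}")
```

In this made-up case the sum of the separate bivariate values overstates what the two predictors explain together once they are correlated, which is the situation Figure 2 depicts. When the predictors are uncorrelated, the formula reduces to the simple sum, matching Figure 1.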
