Correlation and Regression. Sporiš Goran, PhD. http://kif.hr/predmet/mki http://www.science4performance.com/
Correlation and Regression
• Correlation: measure of the strength of an association (relationship) between continuous variables
• Regression: predicting the value of a continuous dependent variable (y) based on the value of a continuous independent variable (x)
Correlation statistic - r
• Values of r range from -1 to +1
• -1 is a perfect negative association (correlation), meaning that as the scores of one variable increase, the scores of the other variable decrease in exact linear proportion
• +1 is a perfect positive association, meaning that both variables go up or down together, in lock-step
• Intermediate values of r indicate a partial relationship; values close to zero indicate a weak or no linear relationship
• An r of exactly zero (almost never seen in real data) means no linear relationship: the variables do not change or “vary” together except by chance
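The range of r described above can be checked with a small calculation. The following is a minimal sketch in plain Python; the function name `pearson_r` and all of the data are made up for illustration and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])      # y rises in lock-step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])    # y falls in lock-step with x
# A curved (non-linear) pattern can still give r = 0
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_up, r_down, r_curve)
```

The last case is a reminder that r only measures straight-line association: y is completely determined by x there, yet r comes out as zero.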
Two “scattergrams”, each with a “cloud” of dots
[Figure: two scattergrams, one with r = +1 and one with r = -1]
• NOTE: Dependent variable (Y) is always placed on the vertical axis
• NOTE: Independent variable (X) is always placed on the horizontal axis
• Can changes in one variable be predicted by changes in the other?
Can changes in one variable be predicted by changes in the other?
[Figure: scattergram with r = 0; Y on the vertical axis, X on the horizontal axis]
“Line of best fit”
• To arrive at a value of “r” a straight line is placed through the cloud of dots (the actual “observed” data)
• A linear relationship between the variables is assumed
• This line is placed so that the overall distance between itself and the dots is minimized (in least-squares fitting, the sum of the squared vertical distances)
[Figure: scattergram with a line of best fit drawn through the dots]
“Line of best fit”
• To place this line in the cloud of dots it is necessary to compute a and b from all of the observed (known) pairs of x and y values
• a = where the line crosses the y axis (the intercept)
• b = “slope”, or number of units that the value of y changes when x changes one unit
• When x is the “independent variable”:
$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
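The formulas for a and b can be sketched in a few lines of Python. This is an illustrative example only: the helper name `fit_line` and the data points are assumptions, not part of the original slides.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-products divided by sum of squared x deviations
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the line passes through the point (mean_x, mean_y)
    a = mean_y - b * mean_x
    return a, b

# Made-up data that lies exactly on y = 1 + 2x
a, b = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(a, b)
```

Note that a and b are computed once from the whole data set, and the fitted line always passes through the point of means.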
y = a + bx
• a = where the line crosses the y axis
• b = “slope”, or number of units that y changes when x changes one unit
[Figure: line of best fit on X and Y axes, with intercept a and slope b marked]
How closely will a straight line fit the “observed” (actual) data?
[Figure: two scattergrams, one with r = +1.0 and one with r = -1.0]
• A perfect fit yields an r of +1 or -1
An intermediate fit yields an intermediate value of r
[Figure: scattergram with r = +.65]
A poor fit yields a low value of r
[Figure: scattergram with r = -.19]
“Line of best fit”
• The line of best fit predicts a value for one variable given the value of the other variable
• There will be a difference between these estimated values and the actual, known (“observed”) values. This difference is called a “residual” or an “error of the estimate.”
• As the error between the known and predicted values decreases (as the dots cluster more tightly around the line), the absolute value of r (whether + or -) increases
[Figure: scattergram with line of best fit; annotated predictions: if y = 5, x = 3.4; if x = .5, y = 2.3]
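Predicted values and residuals can be illustrated as follows. This is a rough sketch with made-up data; the variable names are assumptions chosen for the example.

```python
# Made-up observed data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.5, 4.9, 6.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

# Value predicted by the line for each observed x
predicted = [a + b * x for x in xs]
# Residual = observed value minus predicted value
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"x={x}  observed={y:.2f}  predicted={y_hat:.2f}  residual={e:+.2f}")
```

For a least-squares line the positive and negative residuals cancel, so their sum is zero apart from rounding.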
R-squared, the coefficient of determination
• Proportion of the variation in the dependent variable (also known as the “effect” variable) that is accounted for by variation in the independent variable (also known as the “predictor” variable)
• Obtained by squaring the correlation coefficient (r)
• “Big” R squared (R²) combines the effects of multiple independent/predictor variables
• “Little” r squared (r²) is the contribution of a single independent/predictor variable
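For a single predictor, squaring r gives the same number as "one minus the unexplained share of the variation". The sketch below checks this on made-up data; all names and values are assumptions for illustration.

```python
# Made-up observed data (not from the slides)
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.5, 4.9, 6.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y

# Route 1: square the correlation coefficient
r_squared_from_r = sxy ** 2 / (sxx * syy)

# Route 2: share of the variation in y NOT left over as residual error
b = sxy / sxx
a = mean_y - b * mean_x
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared_from_fit = 1 - ss_residual / syy

print(r_squared_from_r, r_squared_from_fit)
```

Both routes agree, which is why r² can be read as the proportion of variation accounted for by the predictor.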
Class exercise
Hypothesis 1: Height → Weight
Hypothesis 2: Age → Weight
• Use the data to build two scattergrams
• Be sure to place the independent and dependent variables on the correct axes
• Estimate a possible value for the r statistic in each case
r = .72, r² = .52
r = .35, r² = .12
Changing the level of measurement from continuous to categorical
[Figure: scattergram of WEIGHT (vertical axis, 100 to 240) against HEIGHT (horizontal axis, 58 to 76), divided into four quadrants with the number of dots in each]

         SHORT   TALL
HEAVY      3       7
LIGHT     12       4
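Collapsing two continuous variables into categories amounts to choosing a cutoff for each one and counting the cases in every cell. The following sketch shows the idea; the data points and both cutoff values are hypothetical and are not the values used on the slide.

```python
from collections import Counter

# Hypothetical (height in inches, weight in pounds) pairs, made up for illustration
people = [(60, 110), (62, 150), (64, 130), (66, 170),
          (68, 160), (70, 190), (72, 210), (74, 180)]

HEIGHT_CUTOFF = 67    # assumed dividing line between SHORT and TALL
WEIGHT_CUTOFF = 165   # assumed dividing line between LIGHT and HEAVY

def categorize(height, weight):
    """Turn one continuous (height, weight) pair into a pair of categories."""
    height_cat = "TALL" if height >= HEIGHT_CUTOFF else "SHORT"
    weight_cat = "HEAVY" if weight >= WEIGHT_CUTOFF else "LIGHT"
    return weight_cat, height_cat

# Count how many people fall in each cell of the 2 x 2 table
table = Counter(categorize(h, w) for h, w in people)

print("       SHORT  TALL")
for weight_cat in ("HEAVY", "LIGHT"):
    print(f"{weight_cat:<6} {table[(weight_cat, 'SHORT')]:>5} {table[(weight_cat, 'TALL')]:>5}")
```

The table is easier to read than the scattergram, but it discards information: two people on either side of a cutoff are treated as different even if they are nearly identical.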
Some other correlation techniques
• “Partial correlation” (see next slide)
  • Using a control variable to assess its potential influence on a bivariate (two-variable) relationship when all variables are continuous
  • Analogous to using tables for categorical variables
• “Spearman’s rho” (rank correlation)
  • Assesses correlation between two ordinal (ranked) variables
• Logistic (“logit”) regression
  • Used when a dependent variable is dichotomous. It’s coded as a binary 0/1 (e.g., 0 means “no”; 1 means “yes”)
  • Can use continuous and categorical independent variables
  • Results are usually reported as an odds ratio (the exponentiated coefficient; the coefficient itself is a log-odds ratio). The odds ratio is the factor by which the odds that the dependent (“effect”) variable equals 1 change for a one-unit increase in an independent (“predictor”) variable. An odds ratio of 1 means there is no relationship; values greater than 1 mean higher odds and values less than 1 mean lower odds.
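How an odds ratio and its logarithm relate can be seen with a 2 x 2 table, since for a single yes/no predictor the odds ratio from a logistic regression equals the cross-product ratio of the table. This is a sketch under stated assumptions: the function name `odds_ratio` is made up, and the counts are borrowed from the height/weight cross-tabulation slide with the cell assignment (HEAVY: 3 short, 7 tall; LIGHT: 12 short, 4 tall) taken as an assumption.

```python
import math

def odds_ratio(group1_yes, group1_no, group2_yes, group2_no):
    """Odds of 'yes' in group 1 divided by odds of 'yes' in group 2."""
    return (group1_yes / group1_no) / (group2_yes / group2_no)

# Outcome: HEAVY (yes) vs LIGHT (no); groups: TALL vs SHORT (assumed cell layout)
or_tall_vs_short = odds_ratio(7, 4, 3, 12)
log_odds_ratio = math.log(or_tall_vs_short)   # the scale of a logit coefficient

# When both groups have the same odds, the odds ratio is 1 and its log is 0
or_no_effect = odds_ratio(5, 5, 5, 5)

print(or_tall_vs_short, log_odds_ratio, or_no_effect)
```

On the odds-ratio scale "no relationship" is 1, while on the log-odds scale it is 0, so the two quantities should not be treated as interchangeable.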
Partial correlation
• Instead of height → weight, is it possible that a variable related to height (age) is the real cause of changes in weight? Why or why not?

Zero-order correlations
          HEIGHT  WEIGHT   AGE
HEIGHT     1.00     .72    .04
WEIGHT      .72    1.00    .34
AGE         .04     .34   1.00

First-order partial correlations, controlling for AGE
          HEIGHT  WEIGHT
HEIGHT     1.00     .75
WEIGHT      .75    1.00
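The first-order partial correlation can be reproduced from the three zero-order correlations using the standard formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)). The sketch below applies it to the values on this slide; the function name `partial_r` is an assumption made for the example.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Zero-order correlations from the slide
r_height_weight = 0.72
r_height_age = 0.04
r_weight_age = 0.34

r_hw_given_age = partial_r(r_height_weight, r_height_age, r_weight_age)
print(round(r_hw_given_age, 2))
```

Because age is almost uncorrelated with height (.04), controlling for it barely changes the height and weight correlation, which suggests age does not explain that relationship.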