Correlation and Regression

• Introduction to linear correlation and regression
• Numerical illustrations
• SAS and linear correlation/regression
  • CORR
  • REG
  • GLM
• Assumptions of linear correlation/regression
• Model II regression

Introduction
• Correlation
  • Bivariate correlation
  • Multiple correlation
  • Partial correlation
  • Canonical correlation
• Regression
  • Simple regression
  • Multiple regression
  • Nonlinear regression

[Portraits: Karl Pearson (1857-1936), Francis Galton (1822-1911), R. A. Fisher (1890-1962)]

Regression Coefficient
X Y
Exercise: change Y to 3, 4, 5, 6, 7 and have students recompute a and b.

Least-squares method
Least-squares estimate of the sample mean
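As a short worked sketch of what this slide refers to (the symbols Q and m are notation assumed here, not taken from the slide), the least-squares estimate of a single location parameter m is the value that minimises the sum of squared deviations of the observations from m:

```latex
\[
Q(m) = \sum_{i=1}^{n} (y_i - m)^2, \qquad
\frac{dQ}{dm} = -2 \sum_{i=1}^{n} (y_i - m) = 0
\;\Longrightarrow\;
\sum_{i=1}^{n} y_i - n m = 0
\;\Longrightarrow\;
\hat{m} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}.
\]
```

Since the second derivative is 2n > 0, this stationary point is a minimum, so the sample mean is the least-squares estimate.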

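Least-Squares Estimation of the Regression Coefficient

A trick to simplify the estimation: express x as a deviation from its mean.

\[ y_i = \mu + \beta x_i + \varepsilon_i, \qquad \hat{y}_i = a + b\,(x_i - \bar{x}) \]

\[ Q = \sum (y_i - \hat{y}_i)^2 = \sum [\,y_i - a - b\,(x_i - \bar{x})\,]^2 \]

\[ \frac{\partial Q}{\partial a} = -2 \sum [\,y_i - a - b\,(x_i - \bar{x})\,] = 0 \]

\[ \frac{\partial Q}{\partial b} = -2 \sum [\,y_i - a - b\,(x_i - \bar{x})\,](x_i - \bar{x}) = 0 \]

so that

\[ \sum [\,y_i - a - b\,(x_i - \bar{x})\,] = 0, \qquad \sum [\,y_i - a - b\,(x_i - \bar{x})\,](x_i - \bar{x}) = 0 \]

From the first equation:

\[ \sum y_i - \sum a - b \sum (x_i - \bar{x}) = 0 \]

\[ \sum y_i - n a = 0; \qquad a = \frac{\sum y_i}{n} = \bar{y} \]

Substituting a into the second equation:

\[ \sum [\,y_i - \bar{y} - b\,(x_i - \bar{x})\,](x_i - \bar{x}) = 0 \]

\[ \sum (y_i - \bar{y})(x_i - \bar{x}) - b \sum (x_i - \bar{x})^2 = 0 \]

\[ b = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \]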

Maximum Likelihood Method
Estimation of the proportion of males (p) of a fish species in a pond: two samples are taken, one of 10 fish with 5 males and the other of 12 fish with only 3 males.
[Portrait: R. A. Fisher]
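A worked sketch of the fish example follows. It assumes, as an illustration rather than something stated on the slide, that the fish are sampled independently and that the number of males in each sample is binomial with the same p:

```latex
\[
L(p) = \binom{10}{5} p^{5} (1-p)^{5} \times \binom{12}{3} p^{3} (1-p)^{9}
\;\propto\; p^{8} (1-p)^{14}
\]
\[
\ln L(p) = \text{const} + 8 \ln p + 14 \ln (1-p)
\]
\[
\frac{d \ln L}{dp} = \frac{8}{p} - \frac{14}{1-p} = 0
\;\Longrightarrow\; 8(1-p) = 14p
\;\Longrightarrow\; \hat{p} = \frac{8}{22} = \frac{4}{11} \approx 0.364
\]
```

Under these assumptions the maximum likelihood estimate is the pooled proportion (8 males out of 22 fish), not the unweighted mean of the two sample proportions, (0.5 + 0.25)/2 = 0.375.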

Correlation & Regression Coefficients X Y

Regression Coefficient X Y

The Beetle Experiment

Regression Coefficient

Partition of variance
[Figure: scatter plot of Y against X with the fitted line, showing the total deviation of a point split into the explained deviation and the unexplained deviation]

ANOVA test in regression
Partition of SS in regression
Perform an ANOVA significance test.
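The partition of the sum of squares and the resulting F test can be sketched as follows. This is the standard formulation for simple linear regression; the subscripts are notation chosen here, not taken from the slide:

```latex
\[
\underbrace{\sum (y_i - \bar{y})^2}_{SS_{\text{total}},\; df = n-1}
=
\underbrace{\sum (\hat{y}_i - \bar{y})^2}_{SS_{\text{model}},\; df = 1}
+
\underbrace{\sum (y_i - \hat{y}_i)^2}_{SS_{\text{error}},\; df = n-2}
\]
\[
MS_{\text{model}} = \frac{SS_{\text{model}}}{1}, \qquad
MS_{\text{error}} = \frac{SS_{\text{error}}}{n-2}, \qquad
F = \frac{MS_{\text{model}}}{MS_{\text{error}}} \sim F_{1,\,n-2} \text{ under } H_0\!: \beta = 0
\]
\[
R^2 = \frac{SS_{\text{model}}}{SS_{\text{total}}}
\]
```

A large F (small p-value) indicates that the regression explains significantly more variation in Y than expected by chance.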

SAS Program Listing

/* Weight loss (in mg) of 9 batches of 25 Tribolium beetles
   after six days of starvation at nine different humidities */
data beetle;
  input Humidity WtLoss @@;   /* @@ reads several observations per data line */
  cards;
0 8.98 12 8.14 29.5 6.67 43 6.08 53 5.9
62.5 5.83 75.5 4.68 85 4.2 93 3.72
;
proc reg;
  title 'Simple linear regression of WtLoss on Humidity';
  /* R: residual analysis; CLM / CLI: confidence limits for the mean /
     for an individual prediction; alpha = 0.01 gives 99% limits */
  model WtLoss=Humidity / R CLM alpha = 0.01 CLI;
  plot WtLoss*Humidity / conf;
  plot WtLoss*Humidity / pred;
  plot residual.*Humidity;
run;
proc glm;
  model WtLoss=Humidity;
  title 'Simple linear regression of WtLoss on Humidity';
run;

SAS Output

Dependent Variable: WTLOSS

                     Sum of        Mean
Source      DF      Squares      Square    F Value    Prob>F
Model        1     23.51449    23.51449    267.183    0.0001
Error        7      0.61606     0.08801
C Total      8     24.13056

Root MSE    0.29666     R-square    0.9745
Dep Mean    6.02222     Adj R-sq    0.9708
C.V.        4.92614     (= 100 * Root MSE / Mean)

Parameter Estimates

                   Parameter      Standard     T for H0:
Variable    DF      Estimate         Error    Parameter=0    Prob > |T|
INTERCEP     1      8.704027    0.19156450        45.437        0.0001
HUMIDITY     1     -0.053222    0.00325603       -16.346        0.0001
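As a check on how the quantities in this output relate to one another, the arithmetic below recomputes several of them from the printed sums of squares and estimates (n = 9 batches). The formulas are the standard ones for simple linear regression and the rounding is approximate:

```latex
\[
\begin{aligned}
MS_{\text{error}} &= \frac{0.61606}{7} \approx 0.08801, &
\text{Root MSE} &= \sqrt{0.08801} \approx 0.2967 \\
F &= \frac{23.51449}{0.08801} \approx 267.2, &
R^2 &= \frac{23.51449}{24.13056} \approx 0.9745 \\
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n-1}{n-2} \approx 1 - 0.02553 \times \frac{8}{7} \approx 0.9708, &
\text{C.V.} &= 100 \times \frac{0.29666}{6.02222} \approx 4.926 \\
t_{\text{HUMIDITY}} &= \frac{-0.053222}{0.00325603} \approx -16.346, &
t^2 &\approx 267.2 \approx F
\end{aligned}
\]
```

The last line illustrates that, with a single predictor, the t test of the slope and the ANOVA F test are equivalent (F equals t squared).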

Confidence Limits for β
[Figure: plot of Y against X; the accompanying formula on the slide involves MSE and SSX]

Confidence Limits for Y
[Figure: plot of Y against X with confidence limits; the accompanying formula on the slide involves n, Xi - Mean X, MSE and SSX]
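The formulas on these two slides did not survive extraction. The sketch below gives the standard textbook expressions built from the same terms (MSE, SSX, n, Xi - Mean X); it is a reconstruction under that assumption, not a copy of the slides:

```latex
\[
s_b = \sqrt{\frac{MSE}{SSX}}, \qquad
b \pm t_{\alpha/2,\,n-2}\; s_b
\]
\[
s_{\hat{Y}_i} = \sqrt{MSE \left[ \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX} \right]}, \qquad
\hat{Y}_i \pm t_{\alpha/2,\,n-2}\; s_{\hat{Y}_i}
\]
\[
s_{\text{pred},i} = \sqrt{MSE \left[ 1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX} \right]}
\quad \text{(single new observation at } X_i \text{)}
\]
```

For the beetle data, SSX ≈ 8301.4 and Mean X ≈ 50.39, so at X = 0 the second formula gives sqrt(0.08801 × (1/9 + 50.39²/8301.4)) ≈ 0.1916, which agrees with the standard error of the intercept in the SAS output. The limits are narrowest at Mean X and widen as Xi moves away from it.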

WtLoss = 8.704 - 0.0532 Humidity

/* 99% CL of predicted means, equivalent to Predicted ± t(α, df) × SE (see Eq.) */
plot WtLoss*Humidity / conf;

[Figure: WtLoss plotted against Humidity with the fitted line and the 99% confidence limits of the predicted means]

/* 99% CL of prediction intervals, equivalent to Predicted ± t(α, df) × STD (with n = 1 in Eq.) */
plot WtLoss*Humidity / pred;

[Figure: WtLoss plotted against Humidity with the fitted line and the 99% prediction limits]

Regression summary

Assumptions
• The regression model: Yi = α + β Xi + εi
• Assumptions
  • The error term has a mean of 0, is independent and normally distributed at each value of X, and has the same variance at each value of X (homoscedasticity).
  • Y is linearly related to X.
  • There is negligible error (e.g., measurement error) in X. (If not, see Model II regression.)

More plot functions

data WtLoss;
  input Humidity WtLoss;
  cards;
0.00 8.98
12.00 8.14
29.50 6.67
43.00 6.08
53.00 5.90
62.50 5.83
75.50 4.68
85.00 4.20
93.00 3.72
;
proc reg;
  model WtLoss=Humidity / alpha=0.01;
  plot WtLoss*Humidity / pred;                 /* data with prediction limits */
  plot residual.*predicted. / symbol='.';      /* residuals against fitted values */
  title 'Simple linear regression of WtLoss on Humidity';
run;

3D Scatter plot

data My3D;
  input X Y Z;
  datalines;
25.71428 35 490.25
26.47058 34 1117.0667
27.27272 33 2564.3333
27.77777 36 122.5
28.57142 35 1579.9
29.41176 34 2258.2424
30.30303 33 3814.5185
31.25 32 12411.4167
31.42857 35 57.5833
32.35294 34 4679
33.33333 33 2690.8125
34.28571 35 22243.1667
34.375 32 2103.2255
35.29411 34 7455.1
35.48387 31 2639.0833
36.36363 33 905.9688
37.5 32 7211.1458
38.23529 34 11885.5
38.70967 31 2685.4815
39.39393 33 457.75
40 30 885
40.625 32 10263.5313
41.93548 31 4492.141
42.42424 33 1594
43.33333 30 10838.6333
;
proc g3d;
  scatter X*Y=Z;
run;

Spurious Correlation

Liquor Cons    N. Church    City Size
 10041.7887         1          10000
 20096.1752         3          20000
 10041.7887         2          10000
 30083.8478         3          30000
 20096.1752         1          20000
 40014.8096         5          40000
 50060.0323         4          50000
 60043.2171         6          60000
 20096.1752         3          20000
 50060.0323         4          50000
 10041.7887         2          10000
 10041.7887         1          10000
 70096.1250         8          70000
 50060.0323         2          50000
 80064.3763         9          80000
 90094.3248         9          90000
100034.3940        10         100000
110066.0155        10         110000

Spurious Correlation

data Liquor;
  input Liquor Church PopSize @@;
  datalines;
10041.7887 1 10000   20096.1752 3 20000   10041.7887 2 10000
30083.8478 3 30000   20096.1752 1 20000   40014.8096 5 40000
50060.0323 4 50000   60043.2171 6 60000   20096.1752 3 20000
50060.0323 4 50000   10041.7887 2 10000   10041.7887 1 10000
70096.1250 8 70000   50060.0323 2 50000   80064.3763 9 80000
90094.3248 9 90000   100034.3940 10 100000   110066.0155 10 110000
;
proc reg;
  model Liquor = PopSize;           /* model with intercept */
run;
proc reg;
  model Liquor = PopSize / NoInt;   /* regression forced through the origin */
run;

Forcing the regression through the origin (NoInt) changes how SSm and SSt are computed: they become uncorrected sums of squares (sumsq) instead of sums of squared deviations from the mean (devsq), i.e., SSt = Σ Yi² rather than Σ (Yi - Mean Y)². One can use the adjusted R² to choose the model.
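The slides list partial correlation and PROC CORR in the outline but show no example. One way to probe the spurious association between liquor consumption and number of churches is sketched below, as an illustration rather than the author's own analysis. It re-enters the same 18 cities and asks PROC CORR for the correlation between Liquor and Church after holding PopSize constant via the PARTIAL statement.

```sas
/* Same 18 cities as in the listing above */
data Liquor;
  input Liquor Church PopSize @@;
  datalines;
10041.7887 1 10000   20096.1752 3 20000   10041.7887 2 10000
30083.8478 3 30000   20096.1752 1 20000   40014.8096 5 40000
50060.0323 4 50000   60043.2171 6 60000   20096.1752 3 20000
50060.0323 4 50000   10041.7887 2 10000   10041.7887 1 10000
70096.1250 8 70000   50060.0323 2 50000   80064.3763 9 80000
90094.3248 9 90000   100034.3940 10 100000   110066.0155 10 110000
;

/* Ordinary (zero-order) Pearson correlations among all three variables */
proc corr data=Liquor;
  var Liquor Church PopSize;
run;

/* Partial correlation of Liquor and Church, holding PopSize constant */
proc corr data=Liquor;
  var Liquor Church;
  partial PopSize;
run;
```

If the zero-order correlation between Liquor and Church is high but the partial correlation is close to zero, that is consistent with city size driving both variables.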
