460 likes | 588 Views
Topic 9: Remedies. Outline. Review diagnostics for residuals Discuss remedies Nonlinear relationship Nonconstant variance Non-Normal distribution Outliers. Diagnostics for residuals. Look at residuals to find serious violations of the model assumptions nonlinear relationship
E N D
Outline • Review diagnostics for residuals • Discuss remedies • Nonlinear relationship • Nonconstant variance • Non-Normal distribution • Outliers
Diagnostics for residuals • Look at residuals to find serious violations of the model assumptions • nonlinear relationship • nonconstant variance • non-Normal errors • presence of outliers • a strongly skewed distribution
Recommendations for checking assumptions • Plot Y vs X (is it a linear relationship?) • Look at distribution of residuals • Plot residuals vs X, time, or any other potential explanatory variable • Use the i=sm## in symbol statement to get smoothed curves
Plots of Residuals • Plot residuals vs • Time (order) • X or predicted value (b0+b1X) • Look for • nonrandom patterns • outliers (unusual observations)
Residuals vs Order • Pattern in plot suggests dependent errors / lack of indep • Pattern usually a linear or quadratic trend and/or cyclical • If you are interested read KNNL pgs 108-110
Residuals vs X • Can look for • nonconstant variance • nonlinear relationship • outliers • somewhat address Normality of residuals
Tests for Normality • H0: data are an i.i.d. sample from a Normal population • Ha: data are notan i.i.d. sample from a Normal population • KNNL (p 115) suggest a correlation test that requires a table look-up
Tests for Normality • We have several choices for a significance testing procedure • Proc univariate with the normal option provides four proc univariate normal; • Shapiro-Wilk is a common choice
Other tests for model assumptions • Durbin-Watson test for serially correlated errors (KNNL p 114) • Modified Levene test for homogeneity of variance (KNNL p 116-118) • Breusch-Pagan test for homogeneity of variance (KNNL p 118) • For SAS commands see topic9.sas
Plots vs significance test • Plots are more likely to suggest a remedy • Significance tests results are very dependent on the sample size; with sufficiently large samples we can reject most null hypotheses
Default graphics with SAS 9.3 proc reg data=toluca; model hours=lotsize; id lotsize; run;
Will discuss these diagnostics more in multiple regression Provides rule of thumb limits Questionable observation (30,273)
Additional summaries • Rstudent: Studentized residual…almost all should be between ± 2 • Leverage: “Distance” of X from center…helps determine outlying X values in multivariable setting…outlying X values may be influential • Cooks’D: Influence of ith case on all predicted values
Lack of fit • When we have repeat observations at different values of X, we can do a significance test for nonlinearity • Browse through KNNL Section 3.7 • Details of approach discussed when we get to KNNL 17.9, p 762 • Basic idea is to compare two models • Gplot with a smooth is a better (i.e., simpler) approach
SAS code and output proc reg data=toluca; model hours=lotsize / lackfit; run;
Nonlinear relationships • We can model many nonlinear relationships with linear models, some have several explanatory variables (i.e., multiple linear regression) • Y = β0 + β1X+ β2X2+ e(quadratic) • Y = β0 + β1log(X)+ e
Nonlinear Relationships • Sometimes can transform a nonlinear equation into a linear equation • Consider Y = β0exp(β1X)+ e • Can form linear model using log log(Y) = log(β0) + β1X + log(e) • Note that we have changed our assumption about the error
Nonlinear Relationship • We can perform a nonlinear regression analysis • KNNL Chapter 13 • SAS PROC NLIN
Nonconstant variance • Sometimes we model the way in which the error variance changes • may be linearly related to X • We can then use a weighted analysis • KNNL 11.1 • Use a weight statement in PROC REG
Non-Normal errors • Transformations often help • Use a procedure that allows different distributions for the error term • SAS PROC GENMOD
Generalized Linear Model • Possible distributions of Y: • Binomial (Y/N or percentage data) • Poisson (Count data) • Gamma (exponential) • Inverse gaussian • Negative binomial • Multinomial • Specify a link function for E(Y)
Ladder of Reexpression(transformations) 1.5 p Transformation is xp 1.0 0.5 0.0 -0.5 -1.0
Circle of Transformations X up, Y up X down, Y up Y X X up, Y down X down, Y down
Box-Cox Transformations • Also called power transformations • These transformations adjust for non-Normality and nonconstant variance • Y´ = Y or Y´ = (Y - 1)/ • In the second form, the limit as approaches zero is the (natural) log
Important Special Cases • = 1, Y´ = Y1, no transformation • = .5, Y´ = Y1/2, square root • = -.5, Y´ = Y-1/2, one over square root • = -1, Y´ = Y-1 = 1/Y, inverse • = 0, Y´ = (natural) log of Y
Box-Cox Details • We can estimate by including it as a parameter in a non-linear model • Y = β0 + β1X+ e and using the method of maximum likelihood • Details are in KNNL p 134-137 • SAS code is in boxcox.sas
Box-Cox Solution • Standardized transformed Y is • K1(Y - 1) if ≠ 0 • K2log(Y) if = 0 where K2 = ( Yi)1/n (the geometric mean) and K1 = 1/ ( K2 -1) • Run regressions with X as explanatory variable • estimated minimizes SSE
Example data a1; input age plasma @@; cards; 0 13.44 0 12.84 0 11.91 0 20.09 0 15.60 1 10.11 1 11.38 1 10.28 1 8.96 1 8.59 2 9.83 2 9.00 2 8.65 2 7.85 2 8.88 3 7.94 3 6.01 3 5.14 3 6.90 3 6.77 4 4.86 4 5.10 4 5.67 4 5.75 4 6.23 ;
Box Cox Procedure *Procedure that will automatically find the Box-Cox transformation; proctransreg data=a1; model boxcox(plasma)=identity(age); run;
Transformation Information for BoxCox(plasma) Lambda R-Square Log Like -2.50 0.76 -17.0444 -2.00 0.80 -12.3665 -1.50 0.83 -8.1127 -1.00 0.86 -4.8523 * -0.50 0.87 -3.5523 < 0.00 + 0.85 -5.0754 * 0.50 0.82 -9.2925 1.00 0.75 -15.2625 1.50 0.67 -22.1378 2.00 0.59 -29.4720 2.50 0.50 -37.0844 < - Best Lambda * - Confidence Interval + - Convenient Lambda
*The first part of the program gets the geometric mean; data a2; set a1; lplasma=log(plasma); proc univariate data=a2 noprint; var lplasma; output out=a3 mean=meanl;
data a4; set a2; if _n_ eq 1 then set a3; keep age yl l; k2=exp(meanl); do l = -1.0 to 1.0 by .1; k1=1/(l*k2**(l-1)); yl=k1*(plasma**l -1); if abs(l) < 1E-8 then yl=k2*log(plasma); output; end;
proc sort data=a4 out=a4; by l; proc reg data=a4 noprint outest=a5; model yl=age; by l; data a5; set a5; n=25; p=2; sse=(n-p)*(_rmse_)**2; proc print data=a5; var l sse;
Obs l sse 1 -1.0 33.9089 2 -0.9 32.7044 3 -0.8 31.7645 4 -0.7 31.0907 5 -0.6 30.6868 6 -0.5 30.5596 7 -0.4 30.7186 8 -0.3 31.1763 9 -0.2 31.9487 10 -0.1 33.0552
symbol1 v=none i=join; proc gplot data=a5; plot sse*l; run;
data a1; set a1; tplasma = plasma**(-.5); tage = (age+.5)**(-.5); symbol1 v=circle i=sm50; proc gplot; plot tplasma*age; proc sort; by tage; proc gplot;plot tplasma*tage; run;
Background Reading • Sections 3.4 - 3.7 describe significance tests for assumptions (read it if you are interested). • Box-Cox transformation is in nknw132.sas • Read sections 4.1, 4.2, 4.4, 4.5, and 4.6