Topic 19: Remedies

Outline
• Review regression diagnostics
• Remedial measures
• Weighted regression
• Ridge regression
• Robust regression
• Bootstrapping

Regression Diagnostics Summary
• Check normality of the residuals with a normal quantile plot
• Plot the residuals versus predicted values, versus each of the X's and (when appropriate) versus time
• Examine the partial regression plots
• Use the graphics smoother to see if there appears to be a curvilinear pattern

Regression Diagnostics Summary
• Examine
    • The studentized deleted residuals (RSTUDENT in the output)
    • The hat matrix diagonals
    • DFFITS, Cook's D, and the DFBETAS
• Check observations that are extreme on these measures relative to the other observations

Regression Diagnostics Summary
• Examine the tolerance for each X
• If there are variables with low tolerance, you need to do some model building
    • Recode variables
    • Variable selection

Remedial measures
• Weighted least squares
• Ridge regression
• Robust regression
• Nonparametric regression
• Bootstrapping

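Weighted regression
• Maximization of the likelihood L with respect to the β's is equivalent to minimization of $\sum_{i=1}^{n} w_i (Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1})^2$
• Weight of each observation: $w_i = 1/\sigma_i^2$
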
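The equivalence stated on this slide can be sketched as follows, assuming independent normal errors with known variances $\sigma_i^2$ (the symbol $\mu_i$ for the mean of $Y_i$ is introduced here for brevity and is not in the slides):

```latex
\begin{align*}
\mu_i &= \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} \\
L(\beta) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}}
            \exp\!\left( -\frac{(Y_i - \mu_i)^2}{2\sigma_i^2} \right) \\
\log L(\beta) &= -\frac{1}{2}\sum_{i=1}^{n} \log(2\pi\sigma_i^2)
            \;-\; \frac{1}{2}\sum_{i=1}^{n} w_i\,(Y_i - \mu_i)^2,
            \qquad w_i = \frac{1}{\sigma_i^2}
\end{align*}
```

The first sum does not involve the β's, so maximizing L is the same as minimizing the weighted sum of squares in the second term.
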
Weighted least squares
• Least squares problem is to minimize the sum of $w_i$ times the squared residual for case i
• Computations are easy: use the weight statement in proc reg
• $b_w = (X'WX)^{-1}(X'WY)$, where W is a diagonal matrix of the weights
• The problem now becomes determining the weights

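To see that the matrix formula is just a few lines of arithmetic, here is a minimal PROC IML sketch. The five data values and the weights are made up for illustration only; they are not the KNNL data, and in practice the weights would come from a variance model.

```sas
proc iml;
  /* Toy values, made up for illustration only */
  age   = {20, 30, 40, 50, 60};
  diast = {65, 70, 78, 80, 95};
  wt    = {4, 2, 1, 0.5, 0.25};        /* assumed weights, w_i = 1/sigma_i^2 */

  X  = j(nrow(age), 1, 1) || age;      /* design matrix: intercept column, then age */
  W  = diag(wt);                       /* diagonal matrix of the weights */
  bw = inv(X` * W * X) * (X` * W * diast);   /* weighted least squares estimate */
  print bw;
quit;
```

Running proc reg on the same values with a weight statement should give the same two coefficients.
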
Determination of weights
• Find a relationship between the absolute residual and another variable and use this as a model for the standard deviation
• Similarly for the squared residual and another variable (a model for the variance)
• Use grouped data or approximately grouped data to estimate the variance

Determination of weights
• With a model for the standard deviation or the variance, we can approximate the optimal weights
• Optimal weights are proportional to the inverse of the variance

KNNL Example
• KNNL p 427
• Y is diastolic blood pressure
• X is age
• n = 54 healthy adult women aged 20 to 60 years old

Get the data and check it
    data a1;
      infile '../data/ch11ta01.txt';
      input age diast;
    run;
    proc print data=a1;
    run;

Plot the relationship
    symbol1 v=circle i=sm70;
    proc gplot data=a1;
      plot diast*age / frame;
    run;

Diastolic bp vs age
• Strong linear relationship but non-constant variance

Run the regression
    proc reg data=a1;
      model diast=age;
      output out=a2 r=resid;
    run;

Regression output
• Estimators are still unbiased but no longer have minimum variance
• Prediction interval coverage can be lower or higher than the nominal 95%

Use the output data set to get the absolute and squared residuals
    data a2;
      set a2;
      absr=abs(resid);
      sqrr=resid*resid;
    run;

Do the plots with a smooth
    proc gplot data=a2;
      plot (resid absr sqrr)*age;
    run;

Model the std dev vs age (absolute value of the residual)
    proc reg data=a2;
      model absr=age;
      output out=a3 p=shat;
    run;
• Note that a3 has the predicted standard deviations (shat)

Compute the weights
    data a3;
      set a3;
      wt=1/(shat*shat);
    run;

Regression with weights
    proc reg data=a3;
      model diast=age / clb;
      weight wt;
    run;

Output
• Reduction in the standard error of the age coefficient

Ridge regression
• Similar to a very old idea in numerical analysis
• If (X'X) is difficult to invert (near singular), then approximate by inverting (X'X + kI)
• Estimators of coefficients are biased but more stable
• For some value of k, the ridge regression estimator has a smaller mean square error than the ordinary least squares estimator
• Can be used to reduce the number of predictors
• ridge=k is an option for the model statement
• Cross-validation is used to determine k

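A minimal sketch of fitting ridge estimates over a grid of k values in proc reg. The data set name mydata and the variables y, x1, x2, x3 are hypothetical placeholders, and the grid 0 to 0.1 is an arbitrary choice; this only produces the estimates for each k and does not itself carry out the cross-validation.

```sas
/* mydata, y, x1-x3 are hypothetical names; the k grid is arbitrary */
proc reg data=mydata outest=ridge_est outvif ridge=0 to 0.1 by 0.01;
  model y = x1 x2 x3 / noprint;
run;
quit;

/* One row of coefficients per k; the k value is stored in _RIDGE_ */
proc print data=ridge_est;
run;
```

Looking at how the coefficients and variance inflation factors change as k grows is one common way to judge when the estimates have stabilized.
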
Robust regression
• Basic idea is to have a procedure that is not sensitive to outliers
• Alternatives to least squares, minimize
    • Sum of absolute values of residuals
    • Median of the squares of residuals
• Do weighted regression with weights based on residuals, and iterate

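As a rough illustration of two of these ideas on the blood pressure data set a1, the sketch below uses procedures that are not part of the course programs, so treat the choice of procedures and options as an assumption. proc robustreg with method=m fits an M estimator by iteratively reweighted least squares, and proc quantreg at the median minimizes the sum of absolute residuals.

```sas
/* M estimation: weights based on residuals, refit, and iterate */
proc robustreg data=a1 method=m;
  model diast = age;
run;

/* Median regression: minimizes the sum of absolute residuals */
proc quantreg data=a1;
  model diast = age / quantile=0.5;
run;
```

Comparing the two age slopes with the ordinary least squares slope gives a quick check on how much the outlying cases drive the fit.
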
Nonparametric regression
• Several versions
• We have used i=sm70
• Interesting theory
• All versions have some smoothing or penalty parameter similar to the 70 in i=sm70

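For comparison with the i=sm70 smoother, one other version is local regression. The sketch below assumes proc loess is available and uses the data set a1 from the example; the smooth=0.5 value is an arbitrary choice of smoothing parameter, not a recommendation.

```sas
/* Local regression fit; smooth= plays the role of the 70 in i=sm70 */
proc loess data=a1;
  model diast = age / smooth=0.5;
run;
```

Smaller values of smooth= follow the data more closely, and larger values give a smoother curve.
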
Bootstrap
• Very important theoretical development that has had a major impact on applied statistics
• Based on simulation
• Sample with replacement from the data or residuals and repeatedly refit the model to get the distribution of the quantity of interest

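A minimal sketch of the case-resampling version for the age slope, using the data set a1 from the example. The number of replicates (1000), the seed, and the use of a percentile interval are all assumptions made for illustration; resampling residuals instead of cases would need a different setup.

```sas
/* Draw 1000 bootstrap samples of the same size as a1, with replacement */
proc surveyselect data=a1 out=boot seed=12345
                  method=urs samprate=1 outhits reps=1000;
run;

/* Refit the regression in each bootstrap sample */
proc reg data=boot outest=est noprint;
  by replicate;
  model diast = age;
run;
quit;

/* Percentiles of the 1000 slopes give an approximate 95% interval */
proc univariate data=est noprint;
  var age;
  output out=ci pctlpts=2.5 97.5 pctlpre=p_;
run;

proc print data=ci;
run;
```

The data set est holds one slope per replicate in the variable age, so a histogram of that variable shows the bootstrap distribution directly.
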
Background Reading
• We used the program topic19.sas
• This completes Chapter 11
• This completes the material for the midterm