Clinical Research Training Program 2021. REGRESSION DIAGNOSTICS I. Fall 2004. www.edc.gsph.pitt.edu/faculty/dodge/clres2021.html
OUTLINE
• Purpose of Regression Diagnostics
• Residuals
  • Ordinary residuals, standardized residuals, studentized residuals, jackknife residuals
• Leverage points
  • Diagonal elements of the hat matrix
• Influential observations
  • Cook’s distance
• Collinearity
• Alternate Strategies of Analysis
Purpose of Regression Diagnostics
The techniques of regression diagnostics are employed to check the assumptions and to assess the accuracy of computations for a regression analysis.
MODEL ASSUMPTIONS
• Independence: the errors associated with one observation are not correlated with the errors of any other observation
• Linearity: the relationship between the predictors X and the outcome Y should be linear
• Homoscedasticity: the error variance should be constant
• Normality: the errors should be normally distributed
• Model Specification: the model should be properly specified (including all relevant variables, and excluding irrelevant variables)
UNUSUAL OBSERVATIONS
• Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent variable value is unusual given its values on the predictor variables X. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.
UNUSUAL OBSERVATIONS
• Leverage: An observation with an extreme value on a predictor variable X is called a point with high leverage. Leverage is a measure of how far an observation's value on an independent variable deviates from that variable's mean. High-leverage points can have an unusually large effect on the estimates of the regression coefficients.
UNUSUAL OBSERVATIONS
• Influence: An observation is said to be influential if removing it substantially changes the estimates of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.
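Cook's distance, listed in the outline, is one way to make the "leverage times outlierness" idea concrete. The expression below is its usual textbook form, written in the notation these slides use for the studentized residual ($r_i$), the leverage ($h_{ii}$) and the number of predictors ($p$). It is offered as a reminder of the standard definition, not as a formula taken from this lecture.

```latex
D_i \;=\; \underbrace{\frac{r_i^{2}}{p + 1}}_{\text{outlierness}}
      \;\times\;
      \underbrace{\frac{h_{ii}}{1 - h_{ii}}}_{\text{leverage}}
```

An observation needs both a sizeable residual and a sizeable leverage to produce a large $D_i$; if either factor is close to zero, so is the product.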
SIMPLE APPROACHES
Detect errors in the data and pinpoint potential violations of the assumptions:
• Check the type of subject
• Check the data collection procedure
• Check the unit of measurement for each variable
• Check the plausible range of values and a typical value for each variable
• Descriptive statistics
SIMPLE APPROACHES
• Analysis of residuals and other regression diagnostic procedures provide the most refined and accurate evaluation of model assumptions.
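RESIDUAL ANALYSIS
• Model: $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$, $i = 1, \ldots, n$, where $\varepsilon_i$ is the (unobserved) error term for the $i$th response
• Fitted model: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}$
• Ordinary residuals: $e_i = y_i - \hat{y}_i$, the difference between the observed and the expected (fitted) outcomes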
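The fitted model and the ordinary residuals can be illustrated with a few lines of Python. This is a minimal sketch for a single predictor; the data values and the helper name `fit_line` are made up for illustration and do not come from the lecture.

```python
# Hypothetical data for illustration only (not from the lecture):
# age in years and systolic blood pressure (SBP) in mmHg.
age = [25, 32, 41, 47, 53, 60, 68]
sbp = [118, 121, 130, 128, 141, 146, 155]


def fit_line(x, y):
    """Return the least-squares intercept and slope of y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(age, sbp)
fitted = [b0 + b1 * xi for xi in age]                 # fitted values
residuals = [yi - fi for yi, fi in zip(sbp, fitted)]  # e_i = observed - fitted

for xi, yi, fi, ei in zip(age, sbp, fitted, residuals):
    print(f"age={xi:3d}  sbp={yi:4d}  fitted={fi:7.2f}  residual={ei:6.2f}")
```

Because the model contains an intercept, the residuals produced this way sum to zero up to floating-point rounding.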
LEAST-SQUARES METHOD
[Figure: birthweight (g/100) plotted against estriol level (mg/24 hr) of pregnant women]
ORDINARY RESIDUALS
• The ordinary residuals $\{e_i\}$ reflect the amount of discrepancy between observed and predicted values that remains after the data have been fitted by the least-squares model.
• Underlying assumption for unobserved errors: $\varepsilon_i \sim N(0, \sigma^2)$, independently, for $i = 1, \ldots, n$.
• Each residual $e_i$ represents an estimate of the corresponding unobserved error $\varepsilon_i$.
ORDINARY RESIDUALS
• The mean of $\{e_i\}$ is 0.
• The estimate of the population variance computed from the $n$ residuals is $s^2 = \frac{1}{n-p-1} \sum_{i=1}^{n} e_i^2$.
• $s^2$ is an unbiased estimator of $\sigma^2$ if the model is correct.
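As a quick arithmetic check, the variance estimate can be recomputed from the numbers reported in the Stata output for the regression of SBP on age that appears later in these slides (residual sum of squares 13150.4391, 70 observations, one predictor). The variable names below are only for illustration.

```python
import math

# Values read from the Stata ANOVA table for "regress SBP Age" in these slides.
residual_ss = 13150.4391  # sum of squared residuals
n = 70                    # number of observations
p = 1                     # number of predictors (Age)

s2 = residual_ss / (n - p - 1)  # residual mean square, the estimate of sigma^2
root_mse = math.sqrt(s2)        # what Stata reports as "Root MSE"

print(f"s^2 = {s2:.5f}")
print(f"s   = {root_mse:.3f}")
```

The two printed values should agree with the "MS" entry in the Residual row and with "Root MSE" in that output.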
ORDINARY RESIDUALS
• Ordinary residuals are correlated (they sum to 0) and have unequal variances, even though the underlying errors are independent and have equal variance: $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$, where $h_{ii}$ is the $i$th diagonal element of the $(n \times n)$ hat matrix $H = X(X'X)^{-1}X'$. Note that $\hat{y} = Hy$ (expected y = H · observed y).
• The $\{e_i\}$ are not independent random variables (they sum to zero). However, if $n \gg p$, this dependency can be ignored in any analysis of the residuals.
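The variance statement follows in a few steps from the hat matrix. The derivation below is a sketch that assumes the usual model $\mathrm{Var}(y) = \sigma^2 I$ and uses the fact that $H$, and therefore $I - H$, is symmetric and idempotent.

```latex
\begin{aligned}
e &= y - \hat{y} = (I - H)\,y, \\
\mathrm{Var}(e) &= (I - H)\,\sigma^{2} I\,(I - H)' = \sigma^{2}\,(I - H), \\
\mathrm{Var}(e_i) &= \sigma^{2}\,(1 - h_{ii}), \qquad
\mathrm{Cov}(e_i, e_j) = -\sigma^{2}\,h_{ij} \quad (i \neq j).
\end{aligned}
```

The off-diagonal terms are what make the residuals correlated, and the diagonal terms show that residuals at high-leverage points have the smallest variance.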
STANDARDIZED RESIDUALS
• The standardized residual is defined as $z_i = e_i / s$.
• The standardized residuals sum to 0 and hence are not independent.
• The standardized residuals have unit variance, in the sense that $\frac{1}{n-p-1} \sum_{i=1}^{n} z_i^2 = 1$.
STUDENTIZED RESIDUALS
• The studentized residual is defined as $r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}$, where $h_{ii}$ is the $i$th diagonal element of the $(n \times n)$ hat matrix $H = X(X'X)^{-1}X'$. Note that $\hat{y} = Hy$ (expected y = H · observed y).
• The value $h_{ii}$, a measure of leverage, ranges from 0 to 1. A high value means more leverage, i.e., the observation's X values are further away from the X-centroid (the X-variable means).
STUDENTIZED RESIDUALS
• If a data point has a higher leverage value (larger $h_{ii}$), its studentized residual is larger in magnitude for the same ordinary residual $e_i$.
• Therefore, residuals of observations at the edge of the X-space will have larger studentized residual values (making them easier to single out).
• If the data follow the usual assumptions for linear regression, the studentized residual approximately follows a $t_{n-p-1}$ distribution.
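The relationship between ordinary, standardized and studentized residuals can be sketched for a single predictor, where the leverage has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The data below are hypothetical and were chosen so that the last observation sits far from the other x values.

```python
import math

# Hypothetical data (not from the lecture); the last x value is far from the rest.
x = [20, 22, 25, 27, 30, 31, 33, 35, 38, 80]
y = [110, 114, 113, 120, 119, 124, 122, 128, 127, 150]
n, p = len(x), 1  # p = number of predictors

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]       # ordinary residuals
s = math.sqrt(sum(ei ** 2 for ei in e) / (n - p - 1))   # root mean square error

# Leverage for a single predictor: h_ii = 1/n + (x_i - x_bar)^2 / Sxx.
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

z = [ei / s for ei in e]                                    # standardized
r = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]  # studentized

for xi, hi, zi, ri in zip(x, h, z, r):
    print(f"x={xi:3d}  h={hi:.3f}  standardized={zi:6.2f}  studentized={ri:6.2f}")
```

Since $\sqrt{1 - h_{ii}} \le 1$, each studentized residual is at least as large in magnitude as the corresponding standardized residual, and the gap widens as the leverage grows.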
LEVERAGE MEASURES
• The quantity $h_{ii}$, the leverage, measures the distance of the $i$th observation from the set of x-variable means, namely from $(\bar{x}_1, \ldots, \bar{x}_p)$.
• $h_{ii}$ indicates, for a fixed $x_i$, how much $\hat{y}_i$ moves when $y_i$ moves a little.
• If $\hat{y}_i$ moves a lot, then $y_i$ has the potential to drive the regression, so the point is a leverage point. However, if $\hat{y}_i$ hardly moves at all, then $y_i$ has no chance of driving the regression.
LEVERAGE MEASURES
• Under the model, we have $\sum_{i=1}^{n} h_{ii} = p + 1$. Consequently, the average leverage value is $\bar{h} = (p+1)/n$.
• Hoaglin and Welsch (1978) recommended scrutinizing any observation for which $h_{ii} > 2(p+1)/n$.
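The leverage sum and the $2(p+1)/n$ screening rule can be checked numerically. The sketch below handles the single-predictor case only, using the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$; the function names and the x values are hypothetical.

```python
def leverages(x):
    """Leverage values h_ii for a regression on one predictor plus an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


def leverage_cutoff(n, p):
    """Screening threshold 2(p + 1)/n, i.e. twice the average leverage."""
    return 2 * (p + 1) / n


# Hypothetical predictor values; the last one is far from the others.
x = [20, 22, 25, 27, 30, 31, 33, 35, 38, 80]
h = leverages(x)
cutoff = leverage_cutoff(len(x), p=1)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

print(f"sum of leverages = {sum(h):.6f}  (p + 1 = 2)")
print(f"cutoff = {cutoff:.3f}, flagged positions = {flagged}")
```

With 32 observations and one predictor, `leverage_cutoff(32, 1)` evaluates to 4/32 = 0.125.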
JACKKNIFE RESIDUALS
• The jackknife residual is defined as $r_{(-i)} = \frac{e_i}{s_{(-i)} \sqrt{1 - h_{ii}}}$, where $s^2_{(-i)}$ is the MSE computed with the $i$th observation deleted.
• If the $i$th observation lies far from the rest of the data, $s_{(-i)}$ will tend to be much smaller than $s$, which in turn will make $r_{(-i)}$ larger in comparison to $r_i$.
• If the $i$th observation has a larger leverage value $h_{ii}$, then, for the same ordinary residual, $r_{(-i)}$ will become larger in comparison to $r_i$.
JACKKNIFE RESIDUALS
• Therefore, residuals of observations at the edge of the X-space (high leverage) or far away from the rest of the data (small $s_{(-i)}$ value) will have larger jackknife residual values (making them easier to single out).
• If the usual assumptions are met, each jackknife residual exactly follows a t distribution with $(n-p-2)$ degrees of freedom, one fewer than the $n-p-1$ of the full fit because $s_{(-i)}$ is computed from $n-1$ observations.
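Computing $s_{(-i)}$ does not require refitting the model $n$ times: a standard algebraic identity gives $r_{(-i)} = r_i \sqrt{(n-p-2)/(n-p-1-r_i^2)}$. The sketch below, on hypothetical single-predictor data, computes the jackknife residuals both from that identity and from the definition with an explicit leave-one-out refit, so the two routes can be compared.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope of y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def sse(x, y):
    """Sum of squared residuals of the least-squares line through (x, y)."""
    b0, b1 = fit_line(x, y)
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data (not from the lecture); the last point departs from the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 19.0]
n, p = len(x), 1

b0, b1 = fit_line(x, y)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
s = math.sqrt(sse(x, y) / (n - p - 1))
r = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]  # studentized

# Route 1: closed-form identity, no refitting.
jack_formula = [ri * math.sqrt((n - p - 2) / (n - p - 1 - ri ** 2)) for ri in r]

# Route 2: the definition, with s_(-i) from a fit that omits observation i.
jack_refit = []
for i in range(n):
    x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    s_del = math.sqrt(sse(x_del, y_del) / ((n - 1) - p - 1))
    jack_refit.append(e[i] / (s_del * math.sqrt(1 - h[i])))

for ri, jf, jr in zip(r, jack_formula, jack_refit):
    print(f"studentized={ri:6.2f}  jackknife={jf:7.2f}  (refit: {jr:7.2f})")
```

The identity also shows when the jackknife residual exceeds the studentized residual in magnitude: exactly when $r_i^2 > 1$.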
Graphical Analysis of Residuals
• Stem-and-leaf diagram, histogram, boxplot, and normal probability plot of residuals can be employed to assess the normality assumption.
• Studentized residual (or jackknife residual) vs. $\hat{y}$ can be employed to detect outliers.
• Studentized residual (or jackknife residual) vs. $\hat{y}$ can be employed to detect nonlinearity.
• Studentized residual (or jackknife residual) vs. $\hat{y}$ (or x) can be employed to detect heteroscedasticity.
Graphical Analysis of Residuals
• Stem-and-leaf diagram, histogram, boxplot, and normal probability plot of residuals can be employed to assess the normality assumption.
• For the normal probability plot of residuals, if the data points are distributed away from the ideal 45-degree line, the normality assumption is questionable.
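The coordinates behind a normal probability plot are simple to produce: each sorted residual is paired with the standard normal quantile expected at its rank. The sketch below uses plotting positions $i/(n+1)$, which is one common convention and an assumption here; the function name and the residual values are hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions i / (n + 1): one common convention, assumed here.
    theoretical = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical studentized residuals, for illustration only.
rstudent = [-1.8, -0.9, -0.4, -0.1, 0.2, 0.5, 0.8, 1.1, 2.9]
for q, r in normal_plot_points(rstudent):
    print(f"theoretical={q:6.2f}  observed={r:5.2f}")
```

If the residuals are roughly standard normal, the two columns track each other closely; a value such as the last one, which sits well above its theoretical quantile, is what shows up as a point far from the reference line.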
. regress sbp age
. predict rstudent, rstudent
. qnorm rstudent, grid
Graphical Analysis of Residuals
• Studentized residual (or jackknife residual) vs. $\hat{y}$ can be employed to detect outliers.
• For a sufficiently large sample, about 95% of jackknife residuals should lie within ±2.
• For a sufficiently large sample, about 99% of jackknife residuals should lie within ±2.5.
• Any observation for which the absolute value of the jackknife residual is 3 or more is likely to be an outlier.
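The ±2 and ±2.5 bands are rules of thumb that come from the normal approximation to the t distribution in large samples. The sketch below computes the normal coverage of each band and applies the three thresholds to a residual; the function names and labels are hypothetical.

```python
from statistics import NormalDist


def normal_coverage(c):
    """Probability that a standard normal value falls within +/- c."""
    z = NormalDist()
    return z.cdf(c) - z.cdf(-c)


def classify(residual):
    """Label a jackknife residual using the rule-of-thumb thresholds."""
    size = abs(residual)
    if size >= 3:
        return "likely outlier"
    if size > 2.5:
        return "outside the +/-2.5 band"
    if size > 2:
        return "outside the +/-2 band"
    return "unremarkable"


print(f"coverage of +/-2:   {normal_coverage(2):.4f}")
print(f"coverage of +/-2.5: {normal_coverage(2.5):.4f}")
for value in (0.4, -2.2, 2.7, -3.2):
    print(f"{value:5.1f} -> {classify(value)}")
```

Because the bands are approximate, a few residuals outside ±2 are expected in any sizeable sample; the thresholds are a prompt to look at an observation, not a test in themselves.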
. regress sbp age
. predict yhat, xb
. predict rstudent, rstudent
. graph rstudent yhat, yline(-3, -2, 0, 2, 3)
Graphical Analysis of Residuals
• Studentized residual (or jackknife residual) vs. $\hat{y}$ can be employed to detect nonlinearity.
• Studentized residual (or jackknife residual) vs. $\hat{y}$ (or x) can be employed to detect heteroscedasticity.
• In STATA®, the command “hettest” performs the Cook-Weisberg test for heteroscedasticity after “regress.”
. regress SBP Age

  Source |       SS       df       MS          Number of obs =      70
---------+------------------------------      F(  1,    68) =   77.92
   Model |  15068.9324     1  15068.9324      Prob > F      =  0.0000
Residual |  13150.4391    68   193.38881      R-squared     =  0.5340
---------+------------------------------      Adj R-squared =  0.5271
   Total |  28219.3714    69  408.976398      Root MSE      =  13.906

-----------------------------------------------------------------
   SBP |     Coef.   Std. Err.      t    P>|t|        [95% CI]
-------+---------------------------------------------------------
   Age |  .9871668   .1118317     8.83   0.000     .764     1.210
 _cons |  104.1781   5.422842    19.21   0.000   93.357   115.000
-----------------------------------------------------------------

. hettest

Cook-Weisberg test for heteroskedasticity using fitted SBP
     Ho: Constant variance
         chi2(1)      =     0.07
         Prob > chi2  =   0.7975
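For readers curious about what sits behind a test of this kind, the following is a minimal sketch of the score test as it is usually described: the squared residuals, scaled by the maximum-likelihood variance estimate, are regressed on the fitted values, and half of the explained sum of squares is referred to a chi-square distribution with 1 degree of freedom. It is an illustration under those assumptions, not a reimplementation of Stata's `hettest`, and the data and function names are hypothetical.

```python
import math


def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


def fit_line(x, y):
    """Least-squares intercept and slope of y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def score_test_constant_variance(x, y):
    """Score statistic and p-value for variance changing with the fitted values."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    e = [yi - fi for yi, fi in zip(y, fitted)]
    sigma2_ml = sum(ei ** 2 for ei in e) / n   # divides by n, not n - p - 1
    g = [ei ** 2 / sigma2_ml for ei in e]      # scaled squared residuals
    f_bar, g_bar = sum(fitted) / n, sum(g) / n
    sff = sum((fi - f_bar) ** 2 for fi in fitted)
    sfg = sum((fi - f_bar) * (gi - g_bar) for fi, gi in zip(fitted, g))
    explained_ss = sfg ** 2 / sff              # from regressing g on fitted
    stat = explained_ss / 2
    return stat, chi2_1_pvalue(stat)


# Hypothetical data (not the lecture's SBP data set).
age = [25, 32, 41, 47, 53, 60, 68, 72]
sbp = [118, 121, 130, 128, 141, 146, 155, 150]
stat, pvalue = score_test_constant_variance(age, sbp)
print(f"chi2(1) = {stat:.2f}, p = {pvalue:.4f}")
```

A large p-value, as in the Stata output above, gives no evidence against constant error variance.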
Analysis of Leverage Points
• Observations that correspond to large diagonal elements of the hat matrix (i.e., $h_{ii} > 2(p+1)/n$) are considered leverage points.
. regress sbp age

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------           F(  1,    30) =   45.18
       Model |  3861.63037     1  3861.63037           Prob > F      =  0.0000
    Residual |  2564.33838    30  85.4779458           R-squared     =  0.6009
-------------+------------------------------           Adj R-squared =  0.5876
       Total |  6425.96875    31  207.289315           Root MSE      =  9.2454

------------------------------------------------------------------------------
         sbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     1.6045   .2387159     6.72   0.000     1.116977    2.092023
       _cons |   59.09162   12.81626     4.61   0.000     32.91733    85.26592
------------------------------------------------------------------------------

. predict hat, hat
. predict rstudent, rstudent
. list sbp age yhat rstudent hat if hat>(4/32)

        sbp   age       yhat    rstudent        hat
  2.    122    41   124.8761   -.3287683   .1312917
Should we remove an outlier?
• Is there an error in the data management step? (coding error, data entry error, etc.)
• Are the data points suspect? (subjects could not perform the task, subjects did not take the experiment seriously, etc.)
• Are the data points from a different population than the rest of the data?
• If the data points are removed, how much do the results change?
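The last question can be answered directly by fitting the model with and without the suspect observation and comparing the estimates. The sketch below does this for a single predictor; the data are made up so that the first five points lie exactly on a line and the sixth does not, and the names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope of y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: the last point has an extreme x and does not follow the trend.
x = [1, 2, 3, 4, 5, 10]
y = [2, 4, 6, 8, 10, 5]
suspect = 5  # position of the observation under scrutiny

b0_all, b1_all = fit_line(x, y)

x_kept = x[:suspect] + x[suspect + 1:]
y_kept = y[:suspect] + y[suspect + 1:]
b0_without, b1_without = fit_line(x_kept, y_kept)

print(f"all points:      intercept={b0_all:6.3f}  slope={b1_all:6.3f}")
print(f"suspect removed: intercept={b0_without:6.3f}  slope={b1_without:6.3f}")
print(f"change in slope: {b1_without - b1_all:6.3f}")
```

A large change in the estimates marks the observation as influential, but it does not by itself justify deletion; the other questions on this slide still have to be answered.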