1. Regression Analysis with SAS
Robert A. Yaffee, Ph.D.
Statistics, Mapping, and Social Science Group
Academic Computing Services
ITS, NYU
251 Mercer Street, New York, NY 10012
Office: 75 Third Avenue, SB
Phone: 212.998.3402
Email: yaffee@nyu.edu
3. Regression Analysis
Have a clear notion of what you can and cannot do with regression analysis.
Conceptualization: a path model of a regression analysis. [figure]
4. Hypothesis Testing
For example, hypothesis 1: X is statistically significantly related to Y.
The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
The magnitude of the relationship is small, medium, or large.
If the magnitude is small, then a unit change in X is associated with a small change in Y.
5. Plotting the Data: Testing for Functional Form
The relationship between Y and each X needs to be examined.
One needs to ascertain whether or not it is linear.
If it is not linear, can it be transformed into linearity? If so, transformation may be the easiest recourse.
If not, then linear regression may not be appropriate.
6. Explore the Relationship
7. Graphical Exploration
8. Graphical Plot
The output from this plot appears in figure 1:
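A minimal sketch of the kind of scatterplot command used at this step, assuming a dataset regdat with variables y and x1:

proc gplot data=regdat;
  plot y * x1;                          /* scatterplot of the dependent variable against a predictor */
  title 'Scatterplot of Y against X1';
run;
quit;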
9. Testing Functional Form with Curve Fitting
1b. Curve Fitting
Here the data are subjected to a number of tests of functional form. This process presumes either a linear relationship or a relationship that can be transformed into a linear one.
From a regression analysis of the relationship, an R² is generated. This is the square of the multiple correlation coefficient between the dependent variable and the independent variables in the model. The R² is the proportion of variance of the dependent variable explained by the functional form. The higher the R², the closer the functional form is to the actual relationship inherent in the data.
The program to set up the curve fitting may be found in figure 3 below.
10. We may fit functional forms:

data curves;
  a = 1; b = 1.05;              /* illustrative constants */
  do x = 1 to 100;
    liny     = a + x;           /* linear */
    quady    = a + x**2;        /* quadratic */
    cubicy   = a + x**3;        /* cubic */
    lnx      = log(x);          /* logarithmic */
    invx     = 1/x;             /* inverse */
    expx     = exp(x);          /* exponential */
    compound = a*b**x;          /* compound */
    power    = a*x**b;          /* power */
    sshapey  = exp(a + 1/x);    /* S-shaped */
    growth   = exp(a + b*x);    /* growth */
    output;
  end;
run;

proc print data=curves;
run;

Then run different models against your own data. For example, after creating quadx = x**2 in a data step:

proc reg data=mydata;           /* hypothetical dataset containing y and quadx */
  model y = quadx;
run;

Use the transformation that yields the highest R².
12. Curve Fitting Output & Interpretation
15. Fixes (continued)
16. The General Linear Model
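In standard notation, the general linear model is:

$$ y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \varepsilon_i, \qquad \text{or in matrix form} \qquad \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} $$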
18. Decomposition of the Sum of Squares
Total SS = Model SS + Error SS.
Dividing each component by its degrees of freedom yields the variance decomposition:
total variance = model variance + error variance.
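In symbols, the decomposition is:

$$ \underbrace{\sum_i (y_i - \bar{y})^2}_{SST} \;=\; \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{SSM} \;+\; \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{SSE} $$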
20. F Test for Significance and R² for Magnitude of Effect
R² = model variance / total variance.
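With p predictors and n observations, the overall F test is the standard one:

$$ F = \frac{SSM/p}{SSE/(n-p-1)} = \frac{R^2/p}{(1-R^2)/(n-p-1)} $$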
22. Derivation of the Intercept
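For the bivariate case, the least-squares result is the standard one:

$$ a = \bar{y} - b\,\bar{x} $$

(In the two-predictor case, a = ȳ - b1·x̄1 - b2·x̄2.)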
23. Derivation of the Regression Coefficient
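Again for the bivariate case, the standard least-squares slope is:

$$ b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} $$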
24. The Multiple Regression Equation
The equation is y = a + b1·x1 + b2·x2 + e. We proceed to the derivation of its components:
the intercept, a;
the regression parameters, b1 and b2.
25. The Multiple Regression Formula
Recall that the formula for the correlation coefficient can be expressed as follows:
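In the bivariate case:

$$ r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} $$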
28. Significance Tests for the Regression Coefficients
We find the significance of the parameter estimates by using the F or t test.
The R² is the proportion of variance explained.
Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1)
29. F and t Tests for Significance of the Overall Model
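In standard form, the overall F test is given above, and each coefficient is tested with:

$$ t_j = \frac{\hat{b}_j}{SE(\hat{b}_j)}, \qquad df = n - p - 1 $$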
30. Significance Tests
If we are using a Type II sum of squares, we are dealing with the ballantine (a Venn diagram of shared variance). DV variance explained = a + b.
31. Significance Tests
t tests for statistical significance.
32. Significance Tests
Standard error of the intercept.
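For the bivariate case, the standard formula is:

$$ SE(a) = s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}, \qquad s = \sqrt{SSE/(n-2)} $$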
33. SAS Regression Command Syntax
34. SAS Regression Syntax

proc reg simple data=regdat;
  model y = x1 x2 x3 / spec dw corrb collin r dffits influence;
  output out=resdat p=pred r=resid rstudent=rstud;
run;

proc univariate data=resdat normal plot;
  var resid;
  title 'Check of Normality of Residuals';
run;

proc freq data=resdat;
  tables rstud;
  title 'Check for Outliers';
run;

proc arima data=resdat;
  identify var=resid;
  title 'Check of Autocorrelation of the Errors';
run;
35. SAS Regression Output & Interpretation
36. More Simple Statistics
37. Omnibus ANOVA Statistics
38. Parameter Estimates
39. Assumptions of the Linear Regression Model
Linear functional form
Fixed independent variables
Independent observations
Representative sample and proper specification of the model (no omitted variables)
Normality of the residuals or errors
Equality of variance of the errors (homogeneity of residual variance)
No multicollinearity
No autocorrelation of the errors
No outlier distortion
40. Explanation of the Assumptions
1. Linear functional form: linear regression does not detect curvilinear relationships.
2. Independent observations and representative samples: autocorrelation inflates the t, R², and F statistics and warps the significance tests.
3. Normality of the residuals: permits proper significance testing.
4. Equality of variance: heteroskedasticity precludes generalization and external validity, and it also warps the significance tests.
5. No multicollinearity: multicollinearity prevents proper parameter estimation; if serious enough, it may preclude computation of the parameter estimates altogether.
6. No outlier distortion: if outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.
41. Diagnostic Tests for the Regression Assumptions
Linearity: regression curve fitting
No level shifts: one regime
Independence of observations: runs test
Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
Homogeneity of variance of the residuals: White's general specification test
No autocorrelation of residuals: Durbin-Watson test, or the ACF or PACF of the residuals
No multicollinearity: correlation matrix of the independent variables; condition index or condition number
No serious outlier influence: tests of additive outliers (pulse dummies);
plot the residuals and look for points with high leverage;
list the standardized residuals;
list the studentized residuals;
examine Cook's distance or leverage statistics
42. Explanation of Diagnostics
Plots show the linearity or nonlinearity of the relationship.
The correlation matrix shows whether the independent variables are collinear and correlated.
A representative sample is obtained with probability sampling.
43. Explanation of Diagnostics
Tests for normality of the residuals: the residuals are saved and then subjected to either of the following.
Kolmogorov-Smirnov test: compares your residual distribution against the theoretical cumulative normal distribution.
Shapiro-Wilk test

proc reg data=regdat;
  model y = x1 x2;
  output out=resdat r=resid p=pred;
run;

proc univariate data=resdat normal plot;
  var resid;
  title 'Test of Normality of Residuals';
run;
44. Test for Homogeneity of Variance
White's general test:

proc reg data=regdat;
  model y = x1 x2 / spec;
run;

The Chow test is another option.
45. Weighted Least Squares
A fix for heteroskedasticity is weighted least squares (WLS).
There are two ways to do this; one is to form the weight directly from the assumed variance function, since the weight should be inversely proportional to the error variance. For example, if the error variance is proportional to x², form the weight as wt = 1/x**2:

data wtd;
  set regdat;
  wt = 1/x**2;     /* inverse of the assumed variance function */
run;

proc reg data=wtd;
  weight wt;
  model y = x;
run;
46. Collinearity Diagnostics
47. More Collinearity Diagnostics
Watch for eigenvalues much greater than 1.
Condition number = maximum eigenvalue / minimum eigenvalue.
If the condition number is between 100 and 1000, there is moderate to strong collinearity.
48. Collinearity Diagnostics
49. Outlier Diagnostics
Residuals: the actual value minus the predicted value, otherwise known as the error.
Studentized residuals: the residuals divided by their standard errors computed without the ith observation.
Leverage, called the hat diagonal: the measure of the influence of each observation.
Cook's distance: the change in the statistics that results from deleting the observation. Watch this if it is much greater than 1.0.
50. Outlier Detection
Outlier detection involves determining whether the residual (error = actual - predicted) is an extreme negative or positive value.
After running the regression, we may plot the residuals against the fitted values to see which errors are large; in PROC REG, plot r.*p.; produces this residual-versus-fitted plot.
51. Create Standardized Residuals
A standardized residual is a residual divided by its standard deviation.
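A minimal sketch of how to save them, assuming the regdat dataset from earlier; in PROC REG the STUDENT= keyword on the OUTPUT statement yields the internally studentized (standardized) residuals:

proc reg data=regdat;
  model y = x1 x2;
  output out=resdat student=stdres;   /* standardized residuals */
run;

proc print data=resdat;
  var stdres;
run;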
52. Limits of Standardized Residuals
If the standardized residuals have values in excess of +3.5 or below -3.5, they are outliers.
If the absolute values are less than 3.5, as these are, then there are no outliers.
While outliers by themselves distort mean prediction only when the sample size is small enough, it is important to gauge their influence.
53. Outlier Influence
Suppose we had a different data set with two outliers.
We tabulate the standardized residuals and obtain the following output:
54. Outlier a does not distort the estimates; outlier b does.
55. Studentized Residuals
Alternatively, we could form studentized residuals. These are distributed as a t distribution with df = n - p - 1, though they are not quite independent. Therefore, we can approximately determine whether they are statistically significant.
Belsley et al. (1980) recommended the use of studentized residuals.
56. Studentized Residual
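The standard formula, with h_i the leverage of observation i and s_(i) the residual standard deviation computed with observation i deleted, is:

$$ t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_i}} $$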
57. Influence of Outliers
Leverage is measured by the diagonal components of the hat matrix.
The hat matrix comes from the formula for the regression of Y.
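In matrix notation:

$$ \hat{\mathbf{y}} = \mathbf{H}\mathbf{y}, \qquad \mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' $$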
58. Leverage and the Hat Matrix
The hat matrix transforms Y into the predicted scores.
The diagonals of the hat matrix indicate which values will be outliers.
The diagonals are therefore measures of leverage.
Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
The trace of the hat matrix equals the number of parameters in the model (p).
When the leverage > 2p/n, there is high leverage, according to Belsley et al. (1980), cited in Long, Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.
59. Cook's D
Another, popular measure of influence. The formula for it is:
60. Using Cook's D in SAS
Cook's D is requested with the /R option on the MODEL statement, or saved with COOKD= on the OUTPUT statement, as in the syntax below.
To find the influential outliers, list the observations with Cook's D > 4/n.
Belsley suggests 4/(n-k-1) as a cutoff.
61. Graphical Exploration of Outlier Influence
62. DFbeta
One can use the DFBETAS to ascertain the magnitude of influence that an observation has on a particular parameter estimate when that observation is deleted. In PROC REG, the INFLUENCE option on the MODEL statement prints them.
63. Alternatives to Violations of Assumptions
1. Nonlinearity: transform to linearity, or run a nonlinear regression.
2. Nonnormality: run a least absolute deviations regression or a median regression.
3. Heteroskedasticity: weighted least squares regression or the White estimator. One can use PROC ROBUSTREG to down-weight the effect of outliers in the estimation.
4. Autocorrelation: Newey-West estimators or an autoregression model.
5. Multicollinearity: principal components regression, ridge regression, or proxy variables.
64. The Interaction Model
The product vector x1*x2 is an interaction term.
This is the joint effect over and above the main effects (x1 and x2).
The main effects must be in the model for the interaction to be properly specified, regardless of whether the main effects remain statistically significant.
65. Path Diagram of an Interaction Model
66. SAS Regression Interaction Model Syntax

data one;
  input y x1 x2 x3 x4;
  lincrossprodx1x2 = x1*x2;   /* interaction term */
  x1sq = x1*x1;               /* quadratic term for the polynomial model below */
  time + 1;                   /* observation counter */
datalines;
112 113 114 39 10
322 230 310 43 23
323 340 250 33 33
112 122 125 144 45
99 100 89 55 34
14 13 10 249 40
40 34 98 39 30
30 32 34 40 40
90 80 93 50 50
89 90 91 60 44
120 130 43 100 34
444 432 430 20 44
;
run;

proc print data=one;
run;

proc reg data=one;
  model y = x1 x2 lincrossprodx1x2 / r collin stb spec influence;
  output out=resdat r=resid p=pred stdr=rstd student=rstud cookd=cooksd;
run;
67. SAS Regression Diagnostic Syntax

data outck;
  set resdat;
  degfree = 7;
  if cooksd > 4/7;   /* keep observations with Cook's D > 4/(n-p-1),
                        where p = # of parameters and n = sample size */
run;

proc freq;
  tables rstud;
  title 'Outlier Indicator';
run;

axis1 label=(a=90 'Cooks D Influence Stat');
proc gplot data=resdat;
  plot cooksd * rstud / vaxis=axis1;
  title 'Leverage and the Outlier';
run;
68. Regression Model Test of Linear Interaction
69. SAS Polynomial Regression Syntax

proc reg data=one;
  model y = x1 x1sq;
  title 'The Polynomial Regression';
run;
70. Model Building Strategies
Specific to general: Cohen and Cohen
General to specific: Hendry and Richard
Extreme bounds analysis: E. Leamer
F. Harrell, Jr.'s approach, Regression Modeling Strategies (Springer, 2001): loess and splines for model fitting, data reduction, missing-data imputation, validation, and simplification.
71. Goodness-of-Fit Indices
R² = 1 - SSE/SST (explained variance divided by total variance)
AIC = -2LL + 2p
SC = -2LL + p·log(n)
Nested models can be compared on these criteria to determine which model is best.
72. Robust Regression
PROC ROBUSTREG permits four types of robust regression: M, least trimmed squares (LTS), S, and MM estimation.
These methods down-weight the outliers.
M estimation (with median absolute deviation scaling), as used by Huber, is performed with:

proc robustreg data=stack;
  model y = x1 x2 x3 / diagnostics leverage;
  id x1;
  test x3;
run;

Estimation is done by iteratively reweighted least squares (IRLS).