Regression Analysis with SAS

    1. 1 Regression Analysis with SAS. Robert A. Yaffee, Ph.D. Statistics, Mapping, and Social Science Group, Academic Computing Services, ITS, NYU. 251 Mercer Street, New York, NY 10012. Office: 75 Third Avenue, SB. Phone: 212.998.3402. Email: yaffee@nyu.edu

    3. 3 Regression Analysis
- Have a clear notion of what you can and cannot do with regression analysis
- Conceptualization
- A path model of a regression analysis

    4. 4 Hypothesis Testing
For example, hypothesis 1: X is statistically significantly related to Y.
- The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
- The magnitude of the relationship is small, medium, or large. If the magnitude is small, then a unit change in X is associated with a small change in Y.

    5. 5 Plotting the Data: Testing for Functional Form
The relationship between Y and each X needs to be examined, to ascertain whether or not it is linear.
- If it is not linear, can it be transformed into linearity? If so, that may be the easiest recourse.
- If not, then perhaps linear regression is not appropriate.

    6. 6 Explore the relationship

    7. 7 Graphical Exploration

    8. 8 Graphical Plot. The output from this plot appears in figure 1.
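A minimal sketch of the kind of plotting code the slide refers to, assuming a data set regdat with variables y and x (both names are illustrative, not from the slides):

proc gplot data=regdat;
   plot y * x;   /* scatterplot of the dependent variable against a predictor */
   title 'Exploring the Relationship of Y and X';
run;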

    9. 9 Testing Functional Form with Curve Fitting
1b. Curve Fitting. Here the data are subjected to a number of tests of functional form. This process presumes either a linear relationship or a relationship that can be transformed into a linear one. A regression analysis of the relationship generates an R square: the square of the multiple correlation coefficient between the dependent variable and the independent variables in the model. This R square is the proportion of variance of the dependent variable explained by the functional form. The higher the R square, the closer the functional form is to the actual relationship inherent in the data. The program to set up the curve fitting may be found in figure 3 below.

    10. 10 We may fit functional forms:

data curves;
   a = 1; b = 1.05;                  /* illustrative constants; the slide did not define a and b */
   do x = 1 to 100;
      liny     = a + x;              /* linear */
      quady    = a + x**2;           /* quadratic */
      cubicy   = a + x**3;           /* cubic */
      lnx      = log(x);             /* logarithmic */
      invx     = 1/x;                /* inverse */
      expx     = exp(x);             /* exponential */
      compound = a*b**x;             /* compound */
      power    = a*x**b;             /* power */
      sshapex  = exp(a + x**(-1));   /* S-shaped */
      growth   = exp(a + b*x);       /* growth */
      output;
   end;
run;

proc print; run;

Then, with the observed y merged in, run the different models. For example:

proc reg;
   model y = quady;   /* y is the observed dependent variable */
run;

Use the transformation that yields the highest R2.

    12. 12 Curve Fitting Output & Interpretation

    15. 15 Fixes (continued)

    16. 16 The General Linear Model
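The equation slide that followed was an image that did not survive extraction; in standard notation, the general linear model is:

\[
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2)
\]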

    18. 18 Decomposition of the sum of squares

    19. 19 Decomposition of the Sum of Squares
Total SS = model SS + error SS. If we divide by the degrees of freedom, this yields the variance decomposition: total variance = model variance + error variance.
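In symbols (a standard statement of the decomposition, supplied here because the slide's equations were images):

\[
\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{total SS}}
= \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{model SS}}
+ \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{error SS}}
\]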

    20. 20 F test for significance and R2 for magnitude of effect. R2 = model variance / total variance.
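In standard notation (assumed, since the slide's formulas were images), with p predictors and n observations:

\[
R^2 = \frac{\text{model SS}}{\text{total SS}}, \qquad
F = \frac{\text{model SS}/p}{\text{error SS}/(n-p-1)}
\]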

    22. 22 Derivation of the Intercept
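The derivation on this slide was an image; for simple regression, the standard least-squares result is:

\[
a = \bar{y} - b\,\bar{x}
\]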

    23. 23 Derivation of the Regression Coefficient
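Again the derivation itself was an image; the standard least-squares result for simple regression is:

\[
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
= r\,\frac{s_y}{s_x}
\]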

    24. 24 The Multiple Regression Equation
We proceed to the derivation of its components: the intercept a, and the regression parameters b1 and b2.
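In standard notation (assumed; the slide's equation was an image), the two-predictor equation is:

\[
\hat{y} = a + b_1 x_1 + b_2 x_2
\]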

    25. 25 The Multiple Regression Formula
Recall that the formula for the correlation coefficient can be expressed as follows:
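\[
r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}
\]

From such correlations, the standardized coefficients of a two-predictor model can be written (a standard result, supplied here because the two formula slides that followed were images):

\[
\beta_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^2}, \qquad
\beta_2 = \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^2}
\]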

    28. 28 Significance Tests for the Regression Coefficients
We find the significance of the parameter estimates by using the F or t test. The R2 is the proportion of variance explained. Adjusted R2 = 1 - (1 - R2)(n-1)/(n-p-1).

    29. 29 F and t tests for significance of the overall model
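The formulas were images; the standard overall-model F test, in terms of R2, is:

\[
F_{p,\,n-p-1} = \frac{R^2/p}{(1 - R^2)/(n - p - 1)}
\]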

    30. 30 Significance tests
If we are using a Type II sum of squares, we are dealing with the Ballantine (Cohen's Venn-diagram representation of shared variance). DV variance explained = a + b.

    31. 31 Significance tests: t tests for statistical significance
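The formula itself was an image; the standard t test for a coefficient is:

\[
t_{n-p-1} = \frac{b_j}{SE(b_j)}
\]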

    32. 32 Significance tests: standard error of the intercept
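For simple regression, the standard formula (assumed; the slide's formula was an image) is:

\[
SE(a) = s_e \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}
\]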

    33. 33 SAS Regression Command Syntax

    34. 34 SAS Regression syntax

proc reg simple data=regdat;
   model y = x1 x2 x3 / spec dw corrb collin r dffits influence;
   output out=resdat p=pred r=resid rstudent=rstud;
run;

data check;
   set resdat;
run;

proc univariate data=check normal plot;
   var resid;
   title 'Check of Normality of Residuals';
run;

proc freq data=resdat;
   tables rstud;
   title 'Check for Outliers';
run;

proc arima data=resdat;
   identify var=resid;
   title 'Check of Autocorrelation of the Errors';
run;

    35. 35 SAS Regression Output & Interpretation

    36. 36 More Simple Statistics

    37. 37 Omnibus ANOVA Statistics

    38. 38 Parameter Estimates

    39. 39 Assumptions of the Linear Regression Model
- Linear functional form
- Fixed independent variables
- Independent observations
- Representative sample and proper specification of the model (no omitted variables)
- Normality of the residuals (errors)
- Equality of variance of the errors (homogeneity of residual variance)
- No multicollinearity
- No autocorrelation of the errors
- No outlier distortion

    40. 40 Explanation of the Assumptions
1. Linear functional form: linear regression does not detect curvilinear relationships.
2. Independent observations: autocorrelation inflates the t, r, and F statistics and distorts the significance tests.
3. Representative sample: needed for generalization of the results.
4. Normality of the residuals: permits proper significance testing.
5. Equality of variance: heteroskedasticity precludes generalization and external validity, and also distorts the significance tests.
6. No multicollinearity: multicollinearity prevents proper parameter estimation. If serious enough, it may preclude computation of the parameter estimates completely.
7. No outlier distortion: outlier distortion may bias the results. If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.

    41. 41 Diagnostic Tests for the Regression Assumptions
- Linearity tests: regression curve fitting
- No level shifts: one regime
- Independence of observations: runs test
- Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test
- Homogeneity of variance of the residuals: White's general specification test
- No autocorrelation of residuals: Durbin-Watson test, or the ACF or PACF of the residuals
- Multicollinearity: correlation matrix of the independent variables; condition index or condition number
- No serious outlier influence: tests of additive outliers (pulse dummies); plot residuals and look for high leverage; lists of standardized residuals; lists of studentized residuals; Cook's distance or leverage statistics

    42. 42 Explanation of Diagnostics
Plots show linearity or nonlinearity of the relationship. The correlation matrix shows whether the independent variables are collinear (highly correlated). A representative sample is obtained with probability sampling.

    43. 43 Explanation of Diagnostics
Tests for normality of the residuals. The residuals are saved and then subjected to either of:
- Kolmogorov-Smirnov test: tests your residual distribution against the theoretical cumulative normal distribution.
- Shapiro-Wilk test.

proc reg;
   model y = x1 x2;
   output out=resdat r=resid p=pred;
run;

data check;
   set resdat;
run;

proc univariate normal plot;
   var resid;
   title 'Test of Normality of Residuals';
run;

    44. 44 Test for Homogeneity of Variance
White's general test:

proc reg;
   model y = x1 x2 / spec;
run;

Another option is the Chow test.

    45. 45 Weighted Least Squares
A fix for heteroskedasticity is weighted least squares, or WLS. If the variance of the errors grows with the level of y, weight each observation by the inverse of that variance, for example wt = 1/y**2:

data wls;
   set regdat;      /* your analysis data set; the name is illustrative */
   wt = 1/y**2;     /* weight inversely proportional to the error variance */
run;

proc reg data=wls;
   weight wt;
   model y = x;
run;

    46. 46 Collinearity Diagnostics
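A sketch of how these diagnostics are requested in PROC REG (the data set and variable names are illustrative; VIF, TOL, and COLLIN are standard model-statement options):

proc reg data=regdat;
   model y = x1 x2 x3 / vif tol collin;   /* variance inflation factors, tolerances, and the eigenvalue/condition-index table */
run;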

    47. 47 More Collinearity Diagnostics
Watch for eigenvalues much greater than 1. The condition number = maximum eigenvalue / minimum eigenvalue. If condition numbers are between 100 and 1000, there is moderate to strong collinearity.

    48. 48 Collinearity Diagnostics

    49. 49 Outlier Diagnostics
- Residuals: the actual value minus the predicted value, otherwise known as the error.
- Studentized residuals: the residuals divided by their standard errors, estimated without the ith observation.
- Leverage (the Hat Diag): the measure of influence of each observation.
- Cook's distance: the change in the estimates that results from deleting the observation. Watch this if it is much greater than 1.0.

    50. 50 Outlier Detection
Outlier detection involves determining whether the residual (error = actual minus predicted) is an extreme negative or positive value. We may plot the residuals versus the fitted values to determine which errors are large, after running the regression, as sketched below.
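A minimal SAS sketch of the residual-versus-fitted plot, using the PLOT statement of PROC REG (data set and variable names are illustrative; VREF= draws a reference line at zero):

proc reg data=regdat;
   model y = x1 x2;
   plot r.*p. / vref=0;   /* residuals on the vertical axis, predicted values on the horizontal */
run;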

    51. 51 Create Standardized Residuals
A standardized residual is a residual divided by its standard deviation.
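A sketch of the computation, assuming an earlier OUTPUT statement saved the raw residual (r=resid) and its standard error (stdr=rstd) in resdat, as in the syntax on slide 66:

data stdres;
   set resdat;
   stdresid = resid/rstd;   /* residual divided by its standard error */
run;

proc print data=stdres;
   var resid rstd stdresid;
run;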

    52. 52 Limits of Standardized Residuals
If the standardized residuals have values in excess of 3.5 or below -3.5, they are outliers. If the absolute values are less than 3.5, as these are, then there are no outliers. While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge their influence.

    53. 53 Outlier Influence Suppose we had a different data set with two outliers. We tabulate the standardized residuals and obtain the following output:

    54. 54 Outlier a does not distort and outlier b does.

    55. 55 Studentized Residuals
Alternatively, we could form studentized residuals. These follow a t distribution with df = n-p-1, though they are not quite independent. Therefore, we can approximately determine whether they are statistically significant. Belsley et al. (1980) recommended the use of studentized residuals.

    56. 56 Studentized Residual
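The formula was an image; the externally studentized residual is standardly defined as:

\[
t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_i}}
\]

where e_i is the residual, h_i its leverage, and s_{(i)} the residual standard deviation estimated with observation i deleted.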

    57. 57 Influence of Outliers Leverage is measured by the diagonal components of the hat matrix. The hat matrix comes from the formula for the regression of Y.
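In symbols (standard notation, supplied because the slide's formula was an image):

\[
\hat{y} = X(X'X)^{-1}X'y = Hy
\]

so that H, the hat matrix, maps the observed Y into the predicted scores.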

    58. 58 Leverage and the Hat Matrix
The hat matrix transforms Y into the predicted scores. The diagonals of the hat matrix indicate which values will be outliers; the diagonals are therefore measures of leverage. Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has. The trace of the hat matrix equals the number of parameters in the model. When the leverage > 2p/n there is high leverage, according to Belsley et al. (1980), cited in Fox, J. and Long, J.S. (eds.), Modern Methods of Data Analysis (p. 262). For smaller samples, Velleman and Welsch (1981) suggested 3p/n as the criterion.

    59. 59 Cook's D
Another measure of influence, and a popular one. The formula for it is:
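(The slide's formula was an image; a standard form is:)

\[
D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}
\]

where p is the number of parameters, s^2 the mean squared error, e_i the residual, and h_i the leverage of observation i.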

    60. 60 Using Cook's D in SAS
Cook's D is requested with the /R option on the MODEL statement, and may be saved with COOKD= on the OUTPUT statement (see the diagnostic syntax on slide 67). To find the influential outliers, list the observations with Cook's D > 4/n; Belsley suggests 4/(n-k-1) as a cutoff.

    61. 61 Graphical Exploration of Outlier Influence

    62. 62 DFbeta One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.
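In PROC REG the DFBETAS are printed by the INFLUENCE model option; a sketch (data set and variable names are illustrative):

proc reg data=regdat;
   model y = x1 x2 / influence;   /* prints DFBETAS for each parameter, plus DFFITS, hat diagonals, etc. */
run;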

    63. 63 Alternatives to Violations of Assumptions
1. Nonlinearity: transform to linearity, or run a nonlinear regression.
2. Nonnormality: run a least absolute deviations regression or a median regression.
3. Heteroskedasticity: weighted least squares regression or the White (heteroskedasticity-consistent) estimator. One can use Proc Robustreg to down-weight the effect of outliers in the estimation.
4. Autocorrelation: Newey-West estimators or an autoregression model.
5. Multicollinearity: principal components regression, ridge regression, or proxy variables.

    64. 64 The Interaction model The product vector x1*x2 is an interaction term. This is the joint effect over and above the main effects (x1 and x2). The main effects must be in the model for the interaction to be properly specified, regardless of whether the main effects remain statistically significant.

    65. 65 Path Diagram of an Interaction model

    66. 66 SAS Regression Interaction Model Syntax

data one;
   input y x1 x2 x3 x4;
   lincrossprodx1x2 = x1*x2;   /* interaction (cross-product) term */
   x1sq = x1*x1;               /* quadratic term, used later for the polynomial model */
   time + 1;
   datalines;
112 113 114 39 10
322 230 310 43 23
323 340 250 33 33
112 122 125 144 45
99 100 89 55 34
14 13 10 249 40
40 34 98 39 30
30 32 34 40 40
90 80 93 50 50
89 90 91 60 44
120 130 43 100 34
444 432 430 20 44
;
run;

proc print;
run;

proc reg;
   model y = x1 x2 lincrossprodx1x2 / r collin stb spec influence;
   output out=resdat r=resid p=pred stdr=rstd student=rstud cookd=cooksd;
run;

    67. 67 SAS Regression Diagnostic syntax

data outck;
   set resdat;
   degfree = 7;
   if cooksd > 4/7;   /* keep observations with Cook's D > 4/(n-p-1), where p = # parameters, n = sample size */
run;

proc freq;
   tables rstud;
   title 'Outlier Indicator';
run;

axis1 label=(a=90 'Cooks D Influence Stat');

proc gplot data=resdat;
   plot cooksd * rstud / vaxis=axis1;
   title 'Leverage and the Outlier';
run;

    68. 68 Regression Model test of linear interaction

    69. 69 SAS Polynomial Regression Syntax

proc reg data=one;
   model y = x1 x1sq;
   title 'The Polynomial Regression';
run;

    70. 70 Model Building Strategies
- Specific to general: Cohen and Cohen
- General to specific: Hendry and Richard
- Extreme bounds analysis: E. Leamer
- F. Harrell, Jr.'s approach, Regression Modeling Strategies (Springer, 2001): loess and splines for model fitting, data reduction, missing data imputation, validation, and simplification

    71. 71 Goodness of Fit Indices
R2 = 1 - SSE/SST: explained variance divided by total variance.
AIC = -2LL + 2p
SC = -2LL + p*log(n)
Nested models can be compared on the basis of these criteria to determine which model is best.

    72. 72 Robust Regression
This procedure permits four types of robust regression: M, least trimmed squares (LTS), S, and MM estimation. These methods down-weight the outliers. M estimation, with the median absolute deviation as the scale estimate, as used by Huber, is performed with:

proc robustreg data=stack;
   model y = x1 x2 x3 / diagnostics leverage;
   id x1;
   test x3;
run;

Estimation is done by iteratively reweighted least squares (IRLS).
