2. Outline.
Conceptualization
Schematic Diagrams of Linear Regression processes
Using SPSS, we plot and test relationships for linearity
Nonlinear relationships are transformed to linear ones
General Linear Model
Derivation of Sums of Squares and ANOVA
Derivation of intercept and regression coefficients
The Prediction Interval and its derivation
Model Assumptions
Explanation
Testing
Assessment
Alternatives when assumptions are unfulfilled
1. 1 Regression Analysis with SPSS
Robert A. Yaffee, Ph.D.
Statistics, Mapping and Social Science Group
Academic Computing Services
Information Technology Services
New York University
Office: 75 Third Ave Level C3
Tel: 212.998.3402
E-mail: yaffee@nyu.edu
February 04
2. 2
3. 3 Conceptualization of Regression Analysis Hypothesis testing
Path Analytical Decomposition of effects
4. 4 Hypothesis Testing For example, hypothesis 1: X is statistically significantly related to Y.
The relationship is positive (as X increases, Y increases) or negative (as X increases, Y decreases).
The magnitude of the relationship is small, medium, or large.
If the magnitude is small, then a unit change in X is associated with a small change in Y.
5. 5 Regression Analysis: Have a clear notion of what you can and cannot do with regression analysis
Conceptualization
A Path Model of a Regression Analysis
6. 6
7. 7
8. 8 A Precursor to Modeling with Regression Data Exploration: Run a scatterplot matrix and search for linear relationships with the dependent variable.
9. 9 Click on graphs and then on scatter
10. 10 When the scatterplot dialog box appears, select Matrix
11. 11 A Matrix of Scatterplots will appear
12. 12
13. 13
14. 14 Decomposition of the Sums of Squares
15. 15 Graphical Decomposition of Effects
16. 16 Decomposition of the sum of squares
17. 17 Decomposition of the sum of squares Total SS = model SS + error SS
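and if we divide each sum of squares by its degrees of freedom, we obtain the corresponding variance estimates (mean squares).
This yields the Variance Decomposition: the total variation is partitioned into a model component and an error component.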
18. 18 F test for significance and R2 for magnitude of effect R2 = Model SS / Total SS, the proportion of the total variation explained by the model
19. 19 ANOVA tests the significance of the Regression Model
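The decomposition, R2, and the omnibus F test described on the preceding slides can be sketched in a few lines of Python. This is a minimal illustration for one predictor, not SPSS output; the data values and all variable names are hypothetical.

```python
# Hypothetical illustrative data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a single predictor.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# Total SS = model SS + error SS.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_model = sum((fi - y_bar) ** 2 for fi in y_hat)
ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# Degrees of freedom: 1 for the single predictor, n - 2 for error.
df_model = 1
df_error = n - 2

r_squared = ss_model / ss_total
f_stat = (ss_model / df_model) / (ss_error / df_error)

print(ss_total, ss_model, ss_error)
print(r_squared, f_stat)
```

The F statistic is the ratio of the model mean square to the error mean square, which is what the ANOVA table reports.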
20. 20 The Multiple Regression Equation We proceed to the derivation of its components:
The intercept: a
The regression parameters, b1 and b2
21. 21 Derivation of the Intercept
22. 22 Derivation of the Regression Coefficient
23. 23
If we recall that the formula for the correlation coefficient can be expressed as follows:
24. 24
25. 25
26. 26
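The formulas on these derivation slides did not survive as text. As a hedged sketch, the standard least-squares results for the one-predictor case are b = Sxy / Sxx and a = mean(Y) - b * mean(X), and the slope can equivalently be written as b = r * (sy / sx), which is presumably where the correlation coefficient enters. The data and names below are hypothetical.

```python
import math

# Hypothetical illustrative data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Slope computed directly from sums of cross-products.
b_direct = s_xy / s_xx

# Slope computed through the correlation coefficient: b = r * (s_y / s_x).
r = s_xy / math.sqrt(s_xx * s_yy)
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
b_from_r = r * s_y / s_x

# Intercept: the fitted line passes through the point of means.
a = y_bar - b_direct * x_bar

print(b_direct, b_from_r, a)
```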
27. 27 Significance Tests for the Regression Coefficients We find the significance of the parameter estimates by using the F or t test.
The R2 is the proportion of variance explained.
28. 28 F and t tests for significance for overall model
29. 29 Significance tests If we are using a type II sum of squares, we are dealing with the ballantine (a Venn diagram of overlapping variances). DV variance explained = a + b
30. 30 Significance tests t tests for statistical significance
31. 31 Significance tests Standard Error of intercept
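A minimal sketch of the t tests for a one-predictor model, assuming the usual textbook standard errors: SE(b) = sqrt(s2 / Sxx) and SE(a) = sqrt(s2 * (1/n + mean(X)^2 / Sxx)), where s2 is the error mean square. The data and names are hypothetical, and the p values (which need the t distribution) are left out.

```python
import math

# Hypothetical illustrative data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

# Error mean square with n - 2 degrees of freedom.
ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s2 = ss_error / (n - 2)

# Standard errors of the slope and the intercept.
se_b = math.sqrt(s2 / s_xx)
se_a = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / s_xx))

# t statistics: estimate divided by its standard error.
t_b = b / se_b
t_a = a / se_a

print(se_b, t_b)
print(se_a, t_a)
```

With a single predictor, the square of the slope's t statistic equals the omnibus F statistic.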
32. 32 Programming Protocol
33. 33 Select a Data Set (we choose employee.sav) and click on open
34. 34 We open the data set
35. 35 To inspect the variable formats, click on variable view on the lower left
36. 36 Because gender is a string variable, we need to recode gender into a numeric format
37. 37 We autorecode gender by clicking on transform and then autorecode
38. 38 We select gender and move it into the variable box on the right
39. 39 Give the variable a new name and click on add new name
40. 40 Click on ok and the numeric variable sex is created
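For readers who want to see what the recoding step does, here is a rough Python analogue. It assumes the default behavior of assigning consecutive integers, starting at 1, to the string values in ascending order; the function name and the sample values are hypothetical.

```python
def autorecode(values):
    """Map each distinct string to an integer code, in ascending order of the strings."""
    labels = sorted(set(values))
    codes = {label: code for code, label in enumerate(labels, start=1)}
    recoded = [codes[v] for v in values]
    return recoded, codes

# Hypothetical string variable.
gender = ["m", "f", "f", "m", "f"]
sex, value_labels = autorecode(gender)

print(sex)
print(value_labels)
```

The returned dictionary plays the role of the value labels that are kept alongside the new numeric variable.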
41. 41 To invoke regression analysis, click on Analyze
42. 42 Click on Regression and then linear
43. 43 Select the dependent variable: Current Salary
44. 44 Enter it in the dependent variable box
45. 45 Entering independent variables These variables are entered in blocks. First, the potentially confounding covariates have to be entered.
We enter time on job, beginning salary, and previous experience.
46. 46 After entering the covariates, we click on next
47. 47 We now enter the hypotheses we wish to test We are testing for minority or sex differences in salary after controlling for the time on job, previous experience, and beginning salary.
We enter minority and numeric gender (sex)
48. 48 After entering these variables, click on statistics
49. 49 We select the following statistics from the dialog box and click on continue
50. 50 Click on plots to obtain the plots dialog box
51. 51 We click on OK to run the regression analysis
52. 52 Navigation window (left) and output window (right)
53. 53 Variables Entered and Model Summary
54. 54 Omnibus ANOVA
55. 55 Full Model Coefficients
56. 56 We omit insignificant variables and rerun the analysis to obtain trimmed model coefficients
57. 57 Beta weights These are standardized regression coefficients used to compare the relative contributions of the independent variables to the explanation of the variance of the dependent variable within the model.
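A beta weight can be obtained from an unstandardized coefficient by rescaling: beta = b * (sx / sy). The sketch below shows this for a single hypothetical predictor, in which case the beta weight equals the correlation between X and Y. In a multiple regression the same rescaling is applied to each coefficient with its own predictor's standard deviation.

```python
import math

# Hypothetical illustrative data (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Unstandardized slope.
b = s_xy / s_xx

# Sample standard deviations of predictor and outcome.
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))

# Standardized coefficient (beta weight).
beta = b * s_x / s_y

# With one predictor, beta equals Pearson's r.
r = s_xy / math.sqrt(s_xx * s_yy)

print(b, beta, r)
```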
58. 58 t tests and significance These are the tests of significance for each parameter estimate.
The significance level (p value) has to be less than .05 for the parameter to be statistically significant at the conventional 5% level.
59. 59 Assumptions of the Linear Regression Model Linear Functional form
Fixed independent variables
Independent observations
Representative sample and proper specification of the model (no omitted variables)
Normality of the residuals or errors
Equality of variance of the errors (homogeneity of residual variance)
No multicollinearity
No autocorrelation of the errors
No outlier distortion
60. 60 Explanation of the Assumptions Linear Functional form
A linear model does not detect curvilinear relationships
Independent observations
Representative samples
Autocorrelation inflates the t, R2, and F statistics and distorts the significance tests
Normality of the residuals
Permits proper significance testing
Equality of variance
Heteroskedasticity undermines generalization and external validity
This also distorts the significance tests
Multicollinearity prevents proper parameter estimation. If it is serious enough, it may also preclude computation of the parameter estimates completely.
Outlier distortion may bias the results: If outliers have high influence and the sample is not large enough, then they may seriously bias the parameter estimates
61. 61 Diagnostic Tests for the Regression Assumptions Linearity tests: Regression curve fitting
No level shifts: One regime
Independence of observations: Runs test
Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov Test
Homogeneity of variance of the residuals: White’s General Specification test
No autocorrelation of residuals: Durbin-Watson test, or the ACF or PACF of the residuals
Multicollinearity: Correlation matrix of the independent variables; condition index or condition number
No serious outlier influence: tests of additive outliers: Pulse dummies.
Plot residuals and look for high leverage of residuals
Lists of Standardized residuals
Lists of Studentized residuals
Cook’s distance or leverage statistics
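Of the diagnostics listed above, the Durbin-Watson statistic is simple enough to sketch directly. It is the sum of squared successive differences of the residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation. The residual series below are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips every period
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]      # neighbors resemble each other

dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)

print(dw_alternating, dw_trending)
```

A formal decision still requires the tabulated lower and upper bounds for the given sample size and number of predictors.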
62. 62 Explanation of Diagnostics Plots show linearity or nonlinearity of relationship
Correlation matrix shows whether the independent variables are collinear (highly correlated with one another).
A representative sample is obtained through probability sampling.
63. 63 Explanation of Diagnostics Tests for Normality of the residuals. The residuals are saved and then subjected to a test such as:
Kolmogorov-Smirnov Test: Compares the empirical cumulative distribution of your residuals with the theoretical cumulative normal distribution; the test statistic is the largest absolute difference between the two.
In SPSS, it is found under Nonparametric Tests,
1 sample K-S test
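The K-S statistic itself can be sketched as follows. This is an illustration of the idea rather than a replica of the SPSS procedure: it standardizes the residuals with their own mean and standard deviation (an assumption of this sketch) and does not compute a p value. The function names and the residual values are hypothetical.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic(residuals):
    """Largest gap between the empirical CDF and the normal CDF."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in residuals) / (n - 1))
    z_sorted = sorted((e - mean) / sd for e in residuals)
    d = 0.0
    for i, z in enumerate(z_sorted, start=1):
        f = normal_cdf(z)
        # The empirical CDF jumps from (i - 1)/n to i/n at each sorted point.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical saved residuals.
residuals = [-1.0, 0.0, 1.0]
d_stat = ks_statistic(residuals)
print(d_stat)
```

Because the mean and standard deviation are estimated from the same residuals, the classical K-S critical values tend to be conservative in this setting.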
64. 64 Collinearity Diagnostics
65. 65 More Collinearity Diagnostics Condition number = maximum eigenvalue / minimum eigenvalue.
If condition numbers are between 100 and 1000, there is moderate to strong collinearity
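As a small worked case, assume the eigenvalues are taken from the correlation matrix of the independent variables (the slide does not say which matrix is used). With exactly two predictors whose correlation is r, that matrix has eigenvalues 1 + |r| and 1 - |r|, so the condition number is (1 + |r|) / (1 - |r|). The data and names below are hypothetical.

```python
import math

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def condition_number_two_predictors(x1, x2):
    """Max/min eigenvalue ratio of the 2x2 correlation matrix [[1, r], [r, 1]]."""
    r = abs(correlation(x1, x2))
    if r >= 1.0:
        return math.inf  # perfectly collinear predictors
    return (1.0 + r) / (1.0 - r)

# Hypothetical predictors with correlation 0.8.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]

cond = condition_number_two_predictors(x1, x2)
print(cond)
```

For more than two predictors a general eigenvalue routine is needed; the closed form above only covers the two-predictor case.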
66. 66 Outlier Diagnostics Residuals.
The actual value minus the predicted value. This is otherwise known as the error.
Studentized Residuals
Each residual divided by its standard error, estimated without the ith observation
Leverage, called the Hat diag
This measures the potential influence of each observation on the fit
Cook’s Distance:
The change in the parameter estimates (and hence the fitted values) that results from deleting the observation. Watch this if it is much greater than 1.0.
67. 67 Outlier detection Outlier detection involves determining whether the residual (error = actual – predicted) is an extreme negative or positive value.
After running the regression, we may plot the residuals against the fitted values to determine which errors are large.
68. 68 Create Standardized Residuals A standardized residual is a residual divided by its standard deviation.
69. 69 Limits of Standardized Residuals If the standardized residuals have values greater than 3.5
or less than -3.5, they are outliers.
If the absolute values are all less than 3.5, as these are, then there are no outliers.
While outliers by themselves only distort mean prediction when the sample size is small enough, it is important to gauge the influence of outliers.
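The 3.5 rule can be sketched as a simple screen. The sketch assumes each residual is divided by the square root of the error mean square (one common definition of a standardized residual); the residual values, the parameter count p, and the function names are hypothetical.

```python
import math

def standardized_residuals(residuals, p):
    """Divide each residual by the root mean square error (n - p error df)."""
    n = len(residuals)
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - p))
    return [e / s for e in residuals]

def flag_outliers(residuals, p, limit=3.5):
    """Indices of observations whose standardized residual exceeds the limit."""
    z = standardized_residuals(residuals, p)
    return [i for i, zi in enumerate(z) if abs(zi) > limit]

# Hypothetical residuals: forty small ones and two extreme ones at the end.
residuals = [1.0, -1.0] * 20 + [8.0, -8.0]

outliers = flag_outliers(residuals, p=2)
print(outliers)
```

Note that in small samples a large residual inflates the error mean square itself, so a standardized residual may never reach 3.5 even for a clearly aberrant point.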
70. 70 Outlier Influence Suppose we had a different data set with two outliers.
We tabulate the standardized residuals and obtain the following output:
71. 71 Outlier a does not distort and outlier b does.
72. 72 Studentized Residuals Alternatively, we could form studentized residuals. These are distributed as a t distribution with df=n-p-1, though they are not quite independent. Therefore, we can approximately determine if they are statistically significant or not.
Belsley et al. (1980) recommended the use of studentized residuals.
73. 73 Studentized Residual
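The formula on this slide is not reproduced in the text. A commonly used form of the (externally) studentized residual is t_i = e_i / (s_(i) * sqrt(1 - h_i)), where h_i is the leverage and s_(i) is the residual standard deviation estimated without observation i. The sketch below applies that form to a one-predictor model; the data and names are hypothetical.

```python
import math

# Hypothetical data: roughly y = 2x, with one aberrant value at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [2.1, 3.9, 6.2, 7.8, 20.0, 12.1, 13.9, 16.2, 17.8]

n, p = len(x), 2  # p counts the intercept and the slope
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
lev = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
sse = sum(e ** 2 for e in resid)

studentized = []
for e, h in zip(resid, lev):
    # Error variance re-estimated with this observation deleted.
    s2_deleted = (sse - e ** 2 / (1.0 - h)) / (n - p - 1)
    studentized.append(e / math.sqrt(s2_deleted * (1.0 - h)))

most_extreme = max(range(n), key=lambda i: abs(studentized[i]))
print(most_extreme, studentized[most_extreme])
```

Each value can be compared with a t distribution on n - p - 1 degrees of freedom, as described on the previous slide.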
74. 74 Influence of Outliers Leverage is measured by the diagonal components of the hat matrix.
The hat matrix comes from the formula for the regression of Y.
75. 75 Leverage and the Hat matrix The hat matrix transforms Y into the predicted scores.
The diagonals of the hat matrix indicate which observations have the potential to be influential.
The diagonals are therefore measures of leverage.
Leverage is bounded by two limits: 1/n and 1. The closer the leverage is to unity, the more leverage the value has.
The trace of the hat matrix = the number of parameters in the model, p (including the intercept).
When the leverage > 2p/n, there is high leverage according to Belsley et al. (1980), cited in Long, J.F. Modern Methods of Data Analysis (p.262). For smaller samples, Velleman and Welsch (1981) suggested that 3p/n is the criterion.
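In matrix terms the hat matrix is H = X(X'X)^-1 X', so that the predicted scores are H times Y. For a model with an intercept and one predictor, its diagonal reduces to h_i = 1/n + (x_i - mean(X))^2 / Sxx, which makes the 2p/n screen easy to sketch. The data and names are hypothetical. Some packages report a centered leverage (h_i - 1/n) instead, so it is worth checking which version a given output column contains.

```python
# Hypothetical predictor values; the last one lies far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]

n = len(x)
p = 2  # parameters: intercept and slope
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Diagonal of the hat matrix for simple regression.
leverage = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Screen suggested on the slide: leverage above 2p/n.
cutoff = 2.0 * p / n
high_leverage = [i for i, h in enumerate(leverage) if h > cutoff]

print(sum(leverage))      # trace of the hat matrix, equal to p
print(high_leverage)
```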
76. 76 Cook’s D Another measure of influence.
This is a popular one. The formula for it is:
77. 77 Using Cook’s D in SPSS Cook is the option /R
Finding the influential outliers
List cook, if cook > 4/n
Belsley suggests 4/(n-k-1) as a cutoff
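The Cook's D formula on the previous slide is missing from the text. One standard way to write it is D_i = (e_i^2 / (p * s2)) * h_i / (1 - h_i)^2, where e_i is the residual, h_i the leverage, p the number of parameters, and s2 the error mean square. The sketch below uses that form with the 4/n screen for a one-predictor model; the data and names are hypothetical.

```python
# Hypothetical data: y is close to 2x except for the last, high-leverage point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 20.0]

n, p = len(x), 2
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
lev = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
s2 = sum(e ** 2 for e in resid) / (n - p)

# Cook's distance for each observation.
cooks_d = [
    (e ** 2 / (p * s2)) * h / (1.0 - h) ** 2
    for e, h in zip(resid, lev)
]

# List the cases with cook > 4/n.
flagged = [i for i, d in enumerate(cooks_d) if d > 4.0 / n]
print(flagged, max(cooks_d))
```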
78. 78 DFbeta One can use the DFbetas to ascertain the magnitude of influence that an observation has on a particular parameter estimate if that observation is deleted.
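The idea behind DFbeta can be shown by brute force: refit the model with each observation left out and record how much the slope changes. The sketch below computes the raw (unstandardized) change for a one-predictor model; the scaled version that divides by a standard error is not shown. The data and function names are hypothetical.

```python
def fit_slope(xs, ys):
    """Least-squares slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    return s_xy / s_xx

# Hypothetical data: y is close to 2x except for the last point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 20.0]

b_full = fit_slope(x, y)

# Change in the slope when observation i is deleted.
dfbeta = []
for i in range(len(x)):
    b_deleted = fit_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    dfbeta.append(b_full - b_deleted)

most_influential = max(range(len(x)), key=lambda i: abs(dfbeta[i]))
print(most_influential, dfbeta[most_influential])
```

Here the last observation pulls the slope well below the value the other nine points would give, so its DFbeta is large and negative.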
79. 79 Programming Diagnostic Tests Testing homoskedasticity: Select histogram, normal probability plot, and insert *zresid in Y and *zpred in X
80. 80 Click on Save to obtain the Save dialog box
81. 81 We select the following
82. 82 Check for linear Functional Form Run a matrix plot of the dependent variable against each independent variable to be sure that the relationship is linear.
83. 83 Move the variables to be graphed into the box on the upper right, and click on OK
84. 84 Residual Autocorrelation check
85. 85
86. 86
87. 87
88. 88
89. 89
90. 90
91. 91
92. 92
93. 93
94. 94 Alternatives to Violations of Assumptions 1. Nonlinearity: Transform to linearity if there is nonlinearity or run a nonlinear regression
2. Nonnormality: Run a least absolute deviations regression or a median regression (available in other packages) or generalized linear models (SPLUS glm, STATA glm, or SAS PROC MODEL or PROC GENMOD).
3. Heteroskedasticity: weighted least squares regression (SPSS) or White estimator (SAS, Stata, SPLUS). One can use a robust regression procedure (SAS, STATA, or SPLUS) to downweight the effect of outliers in the estimation.
4. Autocorrelation: Run AREG in SPSS Trends module or either Prais or Newey-West procedure in STATA.
5. Multicollinearity: components regression or ridge regression or proxy variables. 2sls in SPSS or ivreg in Stata or SAS proc model or proc syslin.
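As an illustration of the first alternative, transforming to linearity, suppose the relationship is exponential, y = c * exp(k * x). Taking the natural log of y gives ln(y) = ln(c) + k * x, which is linear in x and can be fitted by ordinary least squares. The constants and names below are hypothetical, and the data are generated without noise so the recovery is exact.

```python
import math

# Hypothetical exponential relationship: y = 3 * exp(0.5 * x).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0 * math.exp(0.5 * xi) for xi in x]

# Transform the dependent variable to obtain a linear relationship.
log_y = [math.log(yi) for yi in y]

n = len(x)
x_bar = sum(x) / n
ly_bar = sum(log_y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Ordinary least squares on the transformed scale.
k_hat = sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y)) / s_xx
ln_c_hat = ly_bar - k_hat * x_bar

# Back-transform the intercept to the original scale.
c_hat = math.exp(ln_c_hat)
print(k_hat, c_hat)
```

Other transformations (square root, reciprocal, power) follow the same pattern; the choice depends on the shape seen in the scatterplot.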
95. 95 Model Building Strategies Specific to General: Cohen and Cohen
General to Specific: Hendry and Richard
Extreme Bounds analysis: E. Leamer.
96. 96 Nonparametric Alternatives If there is nonlinearity, transform to linearity first.
If there is heteroskedasticity, use robust standard errors with STATA or SAS or SPLUS.
If there is non-normality, use quantile regression with bootstrapped standard errors in STATA or SPLUS.
If there is autocorrelation of residuals, use Newey-West autoregression or First order autocorrelation correction with Areg. If there is higher order autocorrelation, use Box Jenkins ARIMA modeling.