Regression Analysis Project
Michael Heidner
Table of Contents
I. Introduction
II. Scatter Plots – Relationship Analysis
III. Correlation Coefficients Analysis
IV. List of all Sub-Models
V. Analysis of Sub-Models
VI. Analysis to determine Best Model
VII. Summary

Introduction
• Determine the relationship between the income of the head of household and the five variables listed below:
  • Educational Level
  • Length of Current Employment
  • Age of the Head of Household
  • Household Size (family members)
  • Size of Residence
• Dependent Variable – Income of the head of household
• Independent Variables – All other variables
Income vs. Years of Education
-Strong Relationship
-Higher income relates to more years of education

Income vs. Years Employed
-No Relationship

Income vs. Age
-Two lines: one upward sloping, one flat
-No Relationship

Income vs. Family Size
-Data difficult to analyze
-No Relationship

Income vs. Residence Size
-Strong Relationship
-Higher income relates to larger residence size

Independent Variable Scatter Plots
-One plot showed a strong relationship: Years of Education vs. Residence Size
-This agrees with what was found in the dependent variable scatter plots
Correlation Coefficients Analysis
Analysis of Scatter Plots
• Analysis of the correlation coefficients determines how well the regression equation represents the set of data
• The measure used is the coefficient of determination (r²) for each set of data
• An r² value greater than 0.8 indicates a strong relationship
• An r² value less than 0.5 indicates a weak relationship
• An r² value of 1 indicates a perfect relationship
• Plotting the residuals (the differences between the actual data points and the values predicted by the regression equation) allows for a visual interpretation of r²
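As a rough illustration of the two quantities named on this slide, the sketch below fits a least-squares line and computes the residuals and r² by hand. The data values and the function names (fit_line, residuals, r_squared) are made up for illustration; they are not the project's data or the PhStat2 output.

```python
# Simple linear regression by least squares, with residuals and r^2.
# The data below are hypothetical illustration values, not the project's data.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys, intercept, slope):
    """Actual value minus the value predicted by the regression line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def r_squared(xs, ys):
    """Share of the total variation in ys explained by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    sse = sum(e ** 2 for e in residuals(xs, ys, intercept, slope))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

education_years = [10, 12, 12, 14, 16, 16, 18, 20]                  # hypothetical
income = [28000, 35000, 41000, 52000, 60000, 71000, 83000, 98000]   # hypothetical

b0, b1 = fit_line(education_years, income)
print("r^2 =", round(r_squared(education_years, income), 4))
print("residuals =", [round(e) for e in residuals(education_years, income, b0, b1)])
```

A residual list that scatters around zero with no visible pattern is what the later residual-plot slide treats as evidence that a linear model is appropriate.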
Coefficient of Determination
-To the right is an example of the relationship between income and years of education
-The r² value indicates that 83% of the total variation in income can be explained by the regression line
-This is considered a strong relationship

Coefficient of Determination
-To the right is an example of the scatter plot with an overlay of the linear regression line
-The small number of outliers and the close proximity of the points to the line allow for a visual interpretation of the r² value
-Note: the slope indicates that income increases by over $11,356 per additional year of education

Coefficient of Determination
-To the right is an example of the residual plot
-As the data points are somewhat evenly dispersed, the linear regression model is appropriate to use

Coefficient of Determination
-To the right is a summary of the r² values
-The first three (enclosed in the box) all have strong relationships, which were also visible in the scatter plots on earlier slides
Correlation Coefficients Analysis
Multiple Regression Analysis
• First, the t Test for the Slope will be used to determine whether there is a significant linear relationship between the X and Y variables
• Each independent variable will be evaluated to see if it has a relationship with the income of the head of household
• H₀: β₁ = 0. If the null hypothesis is rejected, there is evidence of a linear relationship
• Second, the confidence interval will provide an estimate of the value of the population slope for each independent variable
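The t Test for the Slope described above can be sketched for the single-predictor case as follows. This is a minimal illustration, assuming the usual textbook form t = b₁ / S_b₁ with n − 2 degrees of freedom; the function name and the four data points are hypothetical, and the multiple-regression tables in this deck come from PhStat2, not from this code.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (b1, s_b1, t) for H0: beta1 = 0 in a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ssx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / ssx                      # sample slope
    b0 = mean_y - b1 * mean_x           # sample intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))     # standard error of the estimate
    s_b1 = s_yx / math.sqrt(ssx)        # standard error of the slope
    return b1, s_b1, b1 / s_b1

# Hypothetical data, for illustration only.
b1, s_b1, t = slope_t_statistic([1, 2, 3, 4], [2, 4, 5, 8])
print(f"slope={b1:.3f}  std.error={s_b1:.4f}  t={t:.3f}")
```

The null hypothesis is rejected when |t| exceeds the critical t value for n − 2 degrees of freedom at the chosen significance level, which is equivalent to the p-value falling below that level.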
t Test for Slope
-To the right is a regression analysis table from PhStat2 that allows the t Test for Slope to be evaluated
-As the p-value is less than the 0.05 significance level for Education Years and Residence Size, the null hypothesis can be rejected for these two variables

Confidence Interval
-An example of the confidence interval is to the right
-It states that one additional year of education is estimated to increase the income of the head of household by between $1,387 and $5,148
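For readers who want the arithmetic behind such an interval, here is a small sketch assuming the standard form b₁ ± t·S_b₁. The function names are illustrative, and the standard error and critical value in the forward example are hypothetical; only the endpoints $1,387 and $5,148 come from the slide.

```python
def slope_confidence_interval(b1, s_b1, t_crit):
    """Interval b1 +/- t_crit * s_b1 for the population slope."""
    margin = t_crit * s_b1
    return b1 - margin, b1 + margin

def estimate_and_margin(lower, upper):
    """Recover the point estimate and margin of error from a symmetric interval."""
    return (lower + upper) / 2, (upper - lower) / 2

# Interval quoted on the slide: its midpoint is the implied slope estimate.
estimate, margin = estimate_and_margin(1387, 5148)
print(estimate, margin)

# Forward direction with hypothetical inputs (not taken from the PhStat2 table).
print(slope_confidence_interval(3000.0, 1000.0, 2.0))
```

Because the interval lies entirely above zero, it agrees with the t test in rejecting a slope of zero for education years.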
List of all Sub-Models
-31 total sub-models
-Each uses up to five variables in the regression equation
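The count of 31 is what one gets from every non-empty subset of five candidate terms (2⁵ − 1). The sketch below enumerates them, assuming the five terms are the ones that appear in the model equations later in the deck (X1, X2, X1², X2², X1·X2); the term labels and the ordering within each size group are illustrative and are not claimed to match the deck's model numbering.

```python
from itertools import combinations

# Assumed candidate terms, inferred from the equations of Models 21-31.
TERMS = ["X1", "X2", "X1^2", "X2^2", "X1*X2"]

def all_sub_models(terms):
    """Every non-empty subset of the candidate terms, smallest subsets first."""
    models = []
    for size in range(1, len(terms) + 1):
        models.extend(combinations(terms, size))
    return models

sub_models = all_sub_models(TERMS)
print(len(sub_models))  # 5 + 10 + 10 + 5 + 1 subsets
for number, terms in enumerate(sub_models, start=1):
    print(number, " + ".join(terms))
```

Grouping by size gives 5 one-term, 10 two-term, 10 three-term, 5 four-term and 1 five-term model, which is consistent with Models 21 to 25 having three terms, Models 26 to 30 four terms and Model 31 all five.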
Analysis of Sub-Models 21, 22, 23, 24
(X₁ = years of education, X₂ = residence size)
• Model 21: Y = -36515.5583 + 3957.2436X₁ + 0.0116X₂² - 0.3690X₁X₂
• Model 22: Y = 9132.9370 - 20.9153X₂ - 140.8232X₁² + 0.0140X₂²
• Model 23: Y = 38739.8128 - 44.7582X₂ - 277.0885X₁² + 5.6623X₁X₂
• Model 24: Y = -25759.8750 - 34.8093X₁² + 0.0112X₂² + 1.7767X₁X₂
-Model 21: r² of 0.9274. The education years × residence size term had a p-value of 0.5349 > 0.05.
-Model 22: r² of 0.9395. The intercept's p-value of 0.1691 is above 0.05, so it is not significant.
-Model 23: r² of 0.9133, with one of the weaker adjusted r² values at 0.9106, meaning the model explains 91.06% of the variability in the data.
-Model 24: r² = 0.9357. The t test shows significance, but the model does not fit best by the Cp statistic.
Analysis of Sub-Models 25, 26, 27, 28
• Model 25: Y = -13282.0474 + 270.4987X₁² + 0.0155X₂² - 1.7827X₁X₂
• Model 26: Y = 69749.5689 - 9034.0557X₁ - 19.6461X₂ + 465.1221X₁² + 0.0131X₂²
• Model 27: Y = 73359.26 - 6084.49X₁ - 38.26X₂ - 17.55X₁² + 5.12X₁X₂
• Model 28: Y = 26390.86 - 57.80X₁ - 35X₂ + 0.011X₂² + 1.80X₁X₂
-Model 25: r² = 0.9372. The model parameters do not fit the scatter plots even though they test as significant.
-Model 26: 2nd highest of all adjusted r² values at 0.9425. All model parameters are significant.
-Model 27: r² dipped to 0.9155. Lacks overall significance, as 2 of the 4 p-values are greater than α = 0.05.
-Model 28: r² of 0.9377. The p-value for the years of education t-statistic is high at 0.9721, above the significance level of 0.05.
Analysis of Sub-Models 29, 30, 31
• Model 29: Y = 64973.70 - 11516.70X₁ + 716.08X₁² + 0.016X₂² - 2.17X₁X₂
• Model 30: Y = 8586.566 - 20.43X₂ + 145.32X₁² + 0.014X₂² - 0.06X₁X₂
• Model 31: Y = 67121.22 - 10829.91X₁ - 6.37X₂ + 650.499X₁² + 0.0155X₂² - 1.61X₁X₂
-Model 29: Highest adjusted r² at 0.9434. The p-values would hold at a 99% confidence level.
-Model 30: r² = 0.9369. Three of the four variables (residence size, education years squared, and education years × residence size) have p-values > 0.05, indicating no significant relationship.
-Model 31: r² = 0.9495. Lower F than the previous models at 328.82. Sharp increase in the p-values for residence size and education years × residence size.
Analysis to Determine Best Model
Model 26 and Model 29
-Overall regression and model parameters point to Model 26 and Model 29 being “best fits”
-Model 29's adjusted r² is 0.9434 compared to Model 26's adjusted r² of 0.9425. Model 29 barely edges ahead here, as it explains 94.34% of the variability in the data.
-For the overall regression, the F value is higher in Model 29 at 413.90 compared to Model 26 at 406.56. Both are far above the critical value of F (df = 4, 95; F critical = 2.467), and Model 29's is the higher of the two, meaning this model is more significant.
-Significance F is 0.000 for both.
-The p-values of both models would hold at a 99% confidence level (α = 0.01). Accordingly, every variable in both models has a t-statistic that falls outside the critical values of ±1.98, that is, in the rejection region.
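The F values and adjusted r² figures on this slide are linked by standard formulas, so one can be checked against the other. The sketch below assumes n = 100 observations and k = 4 predictors, which is what the quoted degrees of freedom (4, 95) imply, and takes 406.56 to be Model 26's F value; the function names are illustrative.

```python
def r2_from_f(f_stat, k, n):
    """Invert F = (r2 / k) / ((1 - r2) / (n - k - 1)) to recover r2."""
    return k * f_stat / (k * f_stat + (n - k - 1))

def adjusted_r2(r2, k, n):
    """Adjusted r2 penalizes r2 for the number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

N, K = 100, 4  # assumption: df = (4, 95) means k = 4 and n - k - 1 = 95

results = {}
for name, f_stat in (("Model 26", 406.56), ("Model 29", 413.90)):
    r2 = r2_from_f(f_stat, K, N)
    results[name] = adjusted_r2(r2, K, N)
    print(name, "r2 =", round(r2, 4), "adjusted r2 =", round(results[name], 4))
```

Under these assumptions the recovered adjusted r² values round to 0.9425 and 0.9434, matching the figures quoted for Models 26 and 29.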
Analysis to Determine Best Model
-The Cp statistic measures the differences between a fitted regression model and a true model with random error
-The best sub-model can be narrowed down further by analyzing the Cp statistic
-Looking for a model whose Cp is equal to or less than k + 1
Model 26 (X2X3X4): Cp = 5.9318
Model 29 (X3X4X5): Cp = 4.3220
Conclusion: MODEL 29 BEST FIT
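The selection rule on this slide can be written out as a short check. This sketch assumes k = 4 for both models, counting the four terms in each model's equation; the Cp values are the ones quoted on the slide, and the function name is illustrative.

```python
def meets_cp_criterion(cp, k):
    """Rule from the slide: keep a model when Cp <= k + 1."""
    return cp <= k + 1

K = 4  # assumption: four terms in each of the two candidate equations

candidates = {"Model 26": 5.9318, "Model 29": 4.3220}
passing = [name for name, cp in candidates.items() if meets_cp_criterion(cp, K)]
print(passing)
```

With k + 1 = 5, only Model 29 falls at or below the threshold, which is the basis for the slide's conclusion.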
Summary
Model 29: Y = 64973.70 - 11516.70X₁ + 716.08X₁² + 0.016X₂² - 2.17X₁X₂
-Model 29 is less volatile in its model parameters, and its variables could hold at a 99.9% confidence level
-It is in line with the Cp statistic for determining the best sub-set
-The model has the most significant regression equation and should therefore produce the most accurate output
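To show how the chosen equation would be used, the sketch below evaluates Model 29 for a given pair of inputs. It assumes X₁ is years of education and X₂ is residence size, as the sub-model analysis suggests; the function name is illustrative, the inputs are arbitrary, and the coefficients are the rounded values printed on the slide, so the outputs are approximate.

```python
def model_29_income(education_years, residence_size):
    """Predicted income of the head of household from the Model 29 equation.

    Coefficients are the rounded values shown in the deck; the units of
    residence_size are whatever the original data set used.
    """
    x1, x2 = education_years, residence_size
    return (64973.70
            - 11516.70 * x1
            + 716.08 * x1 ** 2
            + 0.016 * x2 ** 2
            - 2.17 * x1 * x2)

# Arbitrary illustration inputs, not observations from the project's data.
print(round(model_29_income(10, 1000), 2))
```

Predictions are only meaningful for inputs inside the range of the data the model was fitted on; the intercept of 64973.70, for example, is the value at zero education and zero residence size, which lies outside any realistic household.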