This week's focus is on multiple linear regression analysis, extending beyond bivariate regressions. Learn to interpret partial regression coefficients, standardized beta coefficients, intercept, and coefficient of multiple determination in regression models.
Quantitative Methods – Week 8: Multiple Linear Regression
Roman Studer, Nuffield College
roman.studer@nuffield.ox.ac.uk
Introduction
• After the interlude on inductive statistics, we are back to regression analysis…
• So far, we have looked at bivariate regressions with one dependent and one explanatory variable:
  Y = a + bX
• Now we want to extend the regression analysis to include several explanatory variables; this is called multiple regression:
  Y = a + b1X1 + b2X2 + b3X3 + b4X4
• This enables us to investigate the influence of each explanatory variable in turn while controlling for the influence of the others
• The underlying theoretical principles and the statistical procedures required for estimating and evaluating the coefficients are still the same as in bivariate regressions
• However, the formulae and the calculations get more difficult, so the computations can be left to STATA
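To make the multiple regression equation concrete, here is a minimal sketch in Python with NumPy (not STATA, which the course uses). The data are synthetic and noise-free, and the helper name `fit_ols` is an assumption made for illustration only.

```python
import numpy as np

def fit_ols(y, *xs):
    """Return [a, b1, b2, ...] for Y = a + b1*X1 + b2*X2 + ... by least squares."""
    # A leading column of ones lets the first coefficient act as the intercept a.
    X = np.column_stack([np.ones(len(y)), *xs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic, noise-free data (an assumption for illustration): Y = 2 + 3*X1 - 1*X2
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 2.0 + 3.0 * x1 - 1.0 * x2

a, b1, b2 = fit_ols(y, x1, x2)
print("a =", round(a, 3), " b1 =", round(b1, 3), " b2 =", round(b2, 3))
```

Because the data contain no noise, the estimated a, b1 and b2 should match the values used to build Y up to rounding error.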
Interpretation: Partial Regression Coefficients
• The partial regression coefficients (b1, b2, b3, …) allow us to examine the influence of each of the explanatory variables while controlling for the influence of the others
• The interpretation of the partial regression coefficients is the same as in bivariate regressions: bi is the change in Y associated with a one-unit change in Xi, here with the other explanatory variables held constant
• The partial regression coefficients can change when we include or exclude other relevant explanatory variables
• This again points to the problem of omitted variables!
• The interdependence of explanatory variables is another important issue that comes up in this respect. Use correlation analysis to look at the relation between explanatory variables!
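The omitted-variable point can be illustrated with a small simulation. This is a sketch on synthetic data, assuming two correlated explanatory variables whose true partial effects are 2 and 3; the helper `slopes` and all numbers are hypothetical.

```python
import numpy as np

def slopes(y, *xs):
    """OLS slopes (intercept dropped) from regressing y on the given variables."""
    X = np.column_stack([np.ones(len(y)), *xs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)  # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2                  # true partial effects: 2 and 3

b_full = slopes(y, x1, x2)       # both variables included
b_short = slopes(y, x1)          # x2 omitted
r12 = np.corrcoef(x1, x2)[0, 1]  # correlation between the explanatory variables

print("full model:  b1 =", round(b_full[0], 2), " b2 =", round(b_full[1], 2))
print("x2 omitted:  b1 =", round(b_short[0], 2))
print("corr(x1, x2) =", round(r12, 2))
```

When x2 is left out, the coefficient on x1 absorbs part of the effect of x2, because the two variables move together. Checking the correlation between the explanatory variables is what flags this risk.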
Interpretation: Standardised Beta Coefficients
• The regression coefficients measure the effects in the original units of each variable
• However, the explanatory variable with the largest coefficient is not necessarily the most important one…
• Therefore, if we are interested in the relative importance of each explanatory variable for the dependent variable, Y, we have to adjust for the different units of the explanatory variables
• This is done by converting the partial regression coefficients into standardised beta coefficients. Each variable in the regression is replaced by its z-score: z = (X − mean of X) / (standard deviation of X)
• The standardisation makes the scale of the regressors irrelevant
• Interpretation: “By how many standard deviations does Y move when an explanatory variable Xi changes by 1 standard deviation?”
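As a sketch of the standardisation step, the snippet below computes beta coefficients in two ways that should agree: by rescaling the ordinary coefficients with the ratio of standard deviations, and by rerunning the regression on z-scores. The data are synthetic and the variable scales are chosen, as an assumption, so that the variable with the smaller raw coefficient matters more.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients [a, b1, b2, ...] with an intercept added to X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def zscore(v):
    """Subtract the mean and divide by the sample standard deviation, column-wise."""
    return (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(0, 1, n)      # measured on a small scale
x2 = rng.normal(0, 1000, n)   # measured on a large scale
y = 5 + 2.0 * x1 + 0.004 * x2 + rng.normal(0, 1, n)
X = np.column_stack([x1, x2])

b = ols(y, X)[1:]                                          # raw partial coefficients
beta_rescaled = b * X.std(axis=0, ddof=1) / y.std(ddof=1)  # b_i * s_Xi / s_Y
beta_z = ols(zscore(y), zscore(X))[1:]                     # regression on z-scores

print("raw coefficients:  ", b)
print("beta (rescaled):   ", beta_rescaled)
print("beta (z-score fit):", beta_z)
```

Here x1 has the larger raw coefficient only because of its units; in standard-deviation terms x2 has the larger effect.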
Interpretation: The Intercept
• The intercept, a, shows the value of the dependent variable when all explanatory variables are set equal to zero
• Great care has to be taken when interpreting the intercept in multiple linear regressions
• It has the character of a “residual”: the “impact of all the variables excluded from the model”
• At times, an interpretation of the intercept makes little sense…
Interpretation: The Coefficient of Multiple Determination, R²
• R² is a measure of the proportion of the variation in the dependent variable explained by the several explanatory variables in a multiple regression
• R² = ESS/TSS, where ESS = Explained Sum of Squares and TSS = Total Sum of Squares
• The value of R² always lies between 0 and 1; the higher it is, the more of the variation in Y has been explained
• It is a measure of the “goodness of fit” or the explanatory power of the regression
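The ratio ESS/TSS can be computed directly from the fitted values. The following sketch uses synthetic data and a hypothetical helper `r_squared`; it also reports the residual sum of squares (RSS, a quantity not named on the slide) to show that ESS + RSS = TSS when the model includes an intercept.

```python
import numpy as np

def r_squared(y, X):
    """Return (R², ESS, RSS, TSS) for an OLS fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ coef
    ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    rss = np.sum((y - y_hat) ** 2)         # residual (unexplained) sum of squares
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return ess / tss, ess, rss, tss

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

r2, ess, rss, tss = r_squared(y, X)
print("R² =", round(r2, 3))
print("ESS + RSS equals TSS:", np.isclose(ess + rss, tss))
```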
Interpretation: Adjusted R²
• The explained sum of squares (ESS) increases with the number of explanatory variables, while the total variation in Y (TSS) is unaffected
• Adding explanatory variables will therefore never lower the value of R², and will almost always raise it
• Extreme case: with as many explanatory variables as observations, R² = 1
• R²-adjusted adjusts R² for the number of explanatory variables k: R²-adj. = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations
• R²-adj. imposes a penalty for adding additional independent variables to a model
• Explanatory models with a higher R²-adjusted should be preferred, even if R² is smaller
• If the sample size is large, the correction from R² to R²-adjusted will be small
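The penalty can be seen by adding pure-noise regressors to a model. This sketch assumes the common textbook form R²-adj. = 1 − (1 − R²)(n − 1)/(n − k − 1), with k counting the explanatory variables and excluding the intercept; the data and helper names are hypothetical.

```python
import numpy as np

def r2(y, X):
    """R² of an OLS fit of y on X plus an intercept (1 - RSS/TSS, equal to ESS/TSS)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2_value, n, k):
    """Adjusted R² for n observations and k explanatory variables (intercept excluded)."""
    return 1.0 - (1.0 - r2_value) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
junk = rng.normal(size=(n, 5))        # five regressors unrelated to y

X_small = x1.reshape(-1, 1)
X_big = np.column_stack([x1, junk])
r2_small, r2_big = r2(y, X_small), r2(y, X_big)

print("R²:      ", round(r2_small, 3), "->", round(r2_big, 3))
print("R²-adj.: ", round(adjusted_r2(r2_small, n, 1), 3),
      "->", round(adjusted_r2(r2_big, n, 6), 3))
```

R² cannot fall when the noise variables are added, whereas the adjusted value may fall, since each extra variable has to reduce the unexplained variation by enough to offset the penalty.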
Further Issues: Model Specification
• Specifying a model includes three basic steps:
  • Choose the dependent variable to be explained
  • Determine the explanatory variables to be included
  • Determine the mathematical form of the relationships (linear vs. non-linear)
• Traditional approach: testing models from economic theory, estimating the unknown parameters
• “Modern” approach: data play a key role in the formulation of the estimation model
• Explorative data analysis: a “trial and error” process with ad hoc modifications, adding additional variables, changing the functional form, etc.
• Follow a strategy of “general to specific modelling” to avoid biased regression coefficients due to omitted variables:
  • Estimate a complete model that includes all possibly relevant variables (including those that represent competing explanations)
  • Then exclude variables that are not statistically significant (starting with the lowest t-values / highest p-values)
  • Reduce the regression model until only significant variables are left in the model
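The general-to-specific strategy can be sketched as a loop that repeatedly drops the least significant variable. This is only an illustration on synthetic data: the cut-off |t| ≥ 1.96 (roughly the 5% two-sided level in large samples) and the names `t_values` and `general_to_specific` are assumptions, not the course's prescribed procedure.

```python
import numpy as np

def t_values(y, X):
    """OLS coefficients and t-values; X must already contain the intercept column."""
    n, p = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - p)                        # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
    return coef, coef / se

def general_to_specific(y, columns, t_crit=1.96):
    """columns: dict mapping name -> 1-D array. Returns the names that survive."""
    keep = list(columns)
    while keep:
        X = np.column_stack([np.ones(len(y))] + [columns[k] for k in keep])
        _, t = t_values(y, X)
        abs_t = np.abs(t[1:])            # the intercept is never dropped
        weakest = int(np.argmin(abs_t))
        if abs_t[weakest] >= t_crit:     # every remaining variable is significant
            break
        keep.pop(weakest)                # drop the variable with the lowest |t|
    return keep

# Synthetic data (an assumption): only x1 and x2 actually influence y
rng = np.random.default_rng(7)
n = 200
data = {name: rng.normal(size=n) for name in ["x1", "x2", "x3", "x4"]}
y = 1.0 + 2.0 * data["x1"] - 1.5 * data["x2"] + rng.normal(size=n)

kept = general_to_specific(y, data)
print("variables left in the model:", kept)
```

Dropping one variable at a time and re-estimating matters because the t-values of the remaining variables can change after each exclusion.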
Further Issues: F-Test
• We use t-tests to test for the significance of the single regression coefficients. Null hypothesis H0: bi = 0
• But what about the overall significance of the estimated regression line?
• To test for the joint significance of the regression coefficients, we use F-tests. Null hypothesis H0: b1 = b2 = b3 = 0. In words: the explanatory variables X1, X2 and X3 do not jointly influence Y
• Again, if the calculated F-value exceeds the tabulated value of F, the null hypothesis is rejected, i.e. the variables jointly explain a significant part of the variation in the dependent variable and should therefore not be excluded jointly from the regression model
• Fcrit depends on the number of observations (n), the number of estimated coefficients in the unrestricted model (k), and the number of restrictions (m). The exact value can be found in F-distribution tables (in the appendix of almost any statistics textbook)
• However, STATA makes it even easier for us, as it reports the p-value for the joint significance of all regression coefficients!
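For reference, the calculated F-value can be obtained by comparing the residual sums of squares of the restricted and the unrestricted model. The sketch below assumes the standard form F = ((RSS_r − RSS_u)/m) / (RSS_u/(n − k)), using n, k and m as defined above; the data and function names are hypothetical, and the critical value still has to be looked up in a table or taken from the software's p-value.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares of an OLS fit; X must include the intercept column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

def f_statistic(y, X_unrestricted, X_restricted):
    """F-value for the m restrictions that turn the unrestricted into the restricted model."""
    n, k = X_unrestricted.shape          # k = estimated coefficients incl. intercept
    m = k - X_restricted.shape[1]        # number of restrictions
    rss_u = rss(y, X_unrestricted)
    rss_r = rss(y, X_restricted)
    return ((rss_r - rss_u) / m) / (rss_u / (n - k))

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(11)
n = 100
x = rng.normal(size=(n, 3))
y = 1.0 + 0.5 * x[:, 0] + 0.3 * x[:, 1] + rng.normal(size=n)

ones = np.ones((n, 1))
X_u = np.hstack([ones, x])   # unrestricted: intercept + X1, X2, X3
X_r = ones                   # restricted under H0: b1 = b2 = b3 = 0

F = f_statistic(y, X_u, X_r)

# Same test expressed through R², since the intercept-only RSS equals TSS
r2 = 1.0 - rss(y, X_u) / rss(y, X_r)
F_from_r2 = (r2 / 3) / ((1.0 - r2) / (n - 4))
print("F =", round(F, 2), " F from R² =", round(F_from_r2, 2))
```

The same function can test a subset of coefficients: pass as the restricted model the design matrix with the columns under test removed.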
Review & Further Reading
• The main topics covered in this course were
  • Descriptive statistics
  • Correlation analysis
  • Inductive statistics
  • Regression analysis (simple and multiple regression)
• You now know the fundamentals of quantitative methods… MAKE USE OF THIS!
• However, additional issues are likely to come up once you’re dealing with quantitative research. These may include…
  • Tests (associations between variables, testing for different means, etc.) for nominal and ordinal data: see Feinstein & Thomas (F&T), chapter 7
  • Dummy variables: F&T, chapter 10
  • Lagged variables: F&T, chapter 10
  • Violating the assumptions of the classical linear regression model (multicollinearity, autocorrelation, heteroscedasticity, outliers, specification, etc.): F&T, chapter 11
  • Non-linear models: F&T, chapter 12
• Use your textbook as a first guide when you encounter one of these issues!
Computer Class: Multiple Linear Regression
Exercises
Data set: “Weimar elections” at http://www.nuff.ox.ac.uk/users/studer/teaching.htm
Use the whole sample, including all 52 observations from the last 4 Weimar elections.
• Run a regression of the Nazis’ percentage of votes on the unemployment rate and the shares of workers, Catholics, and farmers. Interpret the results
• Which variables can be excluded from the model?
• Explain the success of the Communist party. Could the Communists benefit from the rising unemployment after controlling for other factors such as the shares of workers, Catholics, and farmers, and voter participation?
• Test whether the Communists benefited less from unemployment than the Nazis
• Exclude voter participation from the regression model. Explain the change in the effect of unemployment. Explore the correlations between the explanatory variables
Homework & Take-Home Exam
• Problem Set:
  • Finish the exercises from today’s computer class if you haven’t done so already. Include all the results and answers in the file you send me.
  • Do exercises 1 and 2 from chapter 8.4 in Feinstein & Thomas, pp. 255-56. Download the data set needed for question 2 from http://www.nuff.ox.ac.uk/users/studer/teaching.htm (“Irish families”). This data set is described in Appendix A.2 in Feinstein & Thomas, pp. 497-501. Present your regression results from question 2 in a nice table (check today’s lecture notes for an example!). Report the intercept a, the regression coefficients, R², adjusted R², N, the t-statistics, the F-statistic and the significance levels
  • Send it to me by Monday of week 9
• Take-Home Exam
  • You’ll receive the take-home exam by week 10
  • You’ll have to submit it by Friday of week 1 of Trinity Term (27 April)
  • We will meet later in Trinity Term to discuss the exams (we’ll arrange an exact date once I have corrected the exams)