Lecture 8: Relationships between Scale Variables: Regression Analysis Graduate School Quantitative Research Methods Gwilym Pryce g.pryce@socsci.gla.ac.uk
Notices: • Register
Plan: • 1. Linear & Non-linear Relationships • 2. Fitting a line using OLS • 3. Inference in Regression • 4. Omitted Variables & R2 • 5. Types of Regression Analysis • 6. Properties of OLS Estimates • 7. Assumptions of OLS • 8. Doing Regression in SPSS
1. Linear & Non-linear relationships between variables • In social science, we are often most interested in investigating relationships between variables: • is social class related to political perspective? • is income related to education? • is worker alienation related to job monotony? • We are also interested in the direction of causation, but this is more difficult to establish empirically: • our empirical models are usually structured assuming a particular theory of causation
Relationships between scale variables • The most straightforward way to look for evidence of a relationship is a scatter plot: • traditional to: • put the dependent variable (i.e. the “effect”) on the vertical axis • or “y axis” • put the explanatory variable (i.e. the “cause”) on the horizontal axis • or “x axis”
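A minimal sketch of this plotting convention in Python (the lecture itself works in SPSS; the variable names and data here are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
floor_area = rng.uniform(40, 200, 100)                  # explanatory variable (the "cause")
price = 2000 * floor_area + rng.normal(0, 50000, 100)   # dependent variable (the "effect")

plt.scatter(floor_area, price)
plt.xlabel("Floor area (explanatory variable on the x axis)")
plt.ylabel("Price (dependent variable on the y axis)")
plt.show()
```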
… and so a straight line of best fit is not always very satisfactory:
But we can simulate a non-linear relationship by first transforming one of the variables:
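For example, a curved relationship can be captured with a straight-line method by transforming x first; a sketch in Python with made-up data (the square transformation is an illustrative choice, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 3 + 0.5 * x**2 + rng.normal(0, 2, 200)   # true relationship is non-linear in x

# Transform x first, then fit a straight line to (x**2, y)
b, a = np.polyfit(x**2, y, 1)                # returns slope, then intercept
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")  # close to the true 3 and 0.5
```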
2. Fitting a line using OLS • The most popular algorithm for drawing the line of best fit is one that minimises the sum of squared deviations from the line to each observation:

$$\min \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Where: $y_i$ = observed value of y; $\hat{y}_i$ = predicted value of $y_i$ = the value on the line of best fit corresponding to $x_i$
Regression estimates of a, b: or Ordinary Least Squares (OLS): • This criterion yields estimates of the slope b and y-intercept a of the straight line:

$$\hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$
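These formulas translate directly into code; a minimal sketch in Python/NumPy with made-up data:

```python
import numpy as np

x = np.array([50., 70., 90., 110., 130.])
y = np.array([60., 105., 130., 178., 196.])

# OLS estimates of slope b and intercept a from the formulas above
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
a = y.mean() - b * x.mean()
print(f"a = {a:.3f}, b = {b:.3f}")

# Cross-check against NumPy's own least-squares fit
assert np.allclose([b, a], np.polyfit(x, y, 1))
```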
3. Inference in Regression: Hypothesis tests on the slope coefficient: • Regressions are usually run on samples, so what can we say about the population relationship between x and y? • Repeated samples would yield a range of values for estimates of b: $\hat{b} \sim N(\beta, s_b)$ • i.e. $\hat{b}$ is normally distributed with mean = β = population value = the value of b if the regression were run on the population • If there is no relationship in the population between x and y, then β = 0, and this is our H0
Hypothesis test on b: • (1) H0: β = 0 (i.e. the slope coefficient, if the regression were run on the population, would = 0); H1: β ≠ 0 • (2) α = 0.05 or 0.01 etc. • (3) Reject H0 iff P < α • (N.B. rule of thumb: P < 0.05 if tc ≥ 2, and P < 0.01 if tc ≥ 2.6) • (4) Calculate P and conclude.
Example using SPSS output: • (1) H0: no relationship between house price and floor area; H1: there is a relationship • (2), (3), (4): • P = 2 × (1 − CDF.T(24.469, 554)) = 0.000000, so reject H0
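The same calculation outside SPSS, as a sketch in Python with SciPy (the t statistic 24.469 and the 554 degrees of freedom are taken from the slide's SPSS output):

```python
from scipy import stats

t_stat, df = 24.469, 554          # t statistic and degrees of freedom from the SPSS output
p = 2 * stats.t.sf(t_stat, df)    # two-tailed p-value; sf(x) = 1 - cdf(x), avoids rounding
print(f"P = {p:.6f}")             # effectively zero, so reject H0
```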
4. Omitted Variables & R² • Q/ Is floor area the only factor? How much of the variation in Price does it explain?
R-square • R-square tells you how much of the variation in y is explained by the explanatory variable x • 0 ≤ R² ≤ 1 (NB: you want R² to be near 1) • If there is more than one explanatory variable, use Adjusted R²
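A sketch of the calculation in Python with made-up data: R² is one minus the ratio of residual to total sums of squares:

```python
import numpy as np

x = np.array([50., 70., 90., 110., 130.])
y = np.array([60., 105., 130., 178., 196.])

b, a = np.polyfit(x, y, 1)           # slope and intercept of the fitted line
y_hat = a + b * x                    # predicted values

ss_res = np.sum((y - y_hat)**2)      # unexplained (residual) variation
ss_tot = np.sum((y - y.mean())**2)   # total variation in y
r2 = 1 - ss_res / ss_tot
print(f"R-square = {r2:.3f}")
```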
3D Surface Plots: Construction, Price & Unemployment • Q = −246 + 27P − 0.2P² − 73U + 3U²
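The fitted surface can be evaluated over a grid for plotting; a sketch in Python using the equation above (the grid ranges are assumptions, not from the lecture):

```python
import numpy as np

def construction(P, U):
    """Fitted quadratic surface from the slide: Q = -246 + 27P - 0.2P^2 - 73U + 3U^2."""
    return -246 + 27 * P - 0.2 * P**2 - 73 * U + 3 * U**2

# Grid of price and unemployment values for a 3D surface plot
P, U = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 20, 50))
Q = construction(P, U)
print(Q.shape)   # 50 x 50 grid of fitted construction values
```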
5. Types of regression analysis: • Univariate regression: one explanatory variable • what we’ve looked at so far in the above equations • Multivariate regression: >1 explanatory variable • more than one variable on the RHS • Log-linear regression & log-log regression: • taking logs of variables can deal with certain types of non-linearities & has useful properties (e.g. elasticities) • Categorical dependent variable regression: • dependent variable is dichotomous -- the observation has an attribute or not • e.g. MPPI take-up, unemployed or not etc.
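For instance, a log-log specification estimates an elasticity as its slope; a sketch in Python with simulated data (not the lecture's dataset):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 100, 500)
y = 5 * x**0.8 * np.exp(rng.normal(0, 0.1, 500))   # constant-elasticity relationship

# Regress log(y) on log(x): the slope is the elasticity (about 0.8 here)
elasticity, const = np.polyfit(np.log(x), np.log(y), 1)
print(f"estimated elasticity = {elasticity:.3f}")
```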
6. Properties of OLS estimators • OLS estimates of the slope and intercept parameters have been shown to be BLUE (provided certain assumptions are met): • Best • Linear • Unbiased • Estimator
“Best” in that they have the minimum variance compared with other estimators (i.e. given repeated samples, the OLS estimates for α and β vary less between samples than any other sample estimates for α and β). • “Linear” in that a straight-line relationship is assumed. • “Unbiased” because, in repeated samples, the mean of all the estimates achieved will tend towards the population values for α and β. • “Estimator” in that the true values of α and β cannot be known, and so we are using statistical techniques to arrive at the best possible assessment of their values, given the information available.
7. Assumptions of OLS: For estimation of a and b to be BLUE and for regression inference to be correct: • 1. Equation is correctly specified: • Linear in parameters (can still transform variables) • Contains all relevant variables • Contains no irrelevant variables • Contains no variables with measurement errors • 2. Error Term has zero mean • 3. Error Term has constant variance
4. Error Term is not autocorrelated • i.e. not correlated with the error term from previous time periods • 5. Explanatory variables are fixed • we observe the normal distribution of y for repeated fixed values of x • 6. No linear relationship between RHS variables • i.e. no “multicollinearity”
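Two of these assumptions are easy to check by hand once a model is fitted; a sketch in Python (hypothetical residuals standing in for a fitted model's), computing the residual mean and the Durbin-Watson statistic for autocorrelation:

```python
import numpy as np

rng = np.random.default_rng(3)
resid = rng.normal(0, 1, 200)   # stand-in for residuals from a fitted regression

# Assumption 2: the residual mean should be close to zero
print(f"residual mean = {resid.mean():.4f}")

# Assumption 4: Durbin-Watson near 2 suggests no first-order autocorrelation
dw = np.sum(np.diff(resid)**2) / np.sum(resid**2)
print(f"Durbin-Watson = {dw:.3f}")
```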
8. Doing Regression analysis in SPSS • To run regression analysis in SPSS, click on Analyse, Regression, Linear:
Select your dependent (i.e. ‘explained’) variable and independent (i.e. ‘explanatory’) variables:
e.g. Floor area and bathrooms: Floor area = a + b × Number of bathrooms + e
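Equivalently, outside SPSS, the same model can be fitted in one call; a sketch in Python with SciPy (made-up data, since the lecture's housing dataset isn't reproduced here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
bathrooms = rng.integers(1, 4, 100).astype(float)
floor_area = 30 + 65 * bathrooms + rng.normal(0, 15, 100)

# Simple regression of floor area on number of bathrooms
res = stats.linregress(bathrooms, floor_area)
print(f"a = {res.intercept:.1f}, b = {res.slope:.1f}, "
      f"se(b) = {res.stderr:.1f}, p = {res.pvalue:.3g}")
```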
Confidence Intervals for regression coefficients • Population slope coefficient CI: $\hat{b} \pm t_{\alpha/2,\,n-2} \times s_b$ • Rule of thumb (95% CI): $\hat{b} \pm 2 s_b$
e.g. regression of floor area on number of bathrooms, CI on slope: b = 64.6 ± 2 × 3.8 = 64.6 ± 7.6, so the 95% CI = (57, 72)
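The exact interval replaces the 2 with a t critical value; a sketch in Python (b = 64.6 and sb = 3.8 are the slide's estimates; the 98 degrees of freedom are an assumption, since the sample size isn't shown):

```python
from scipy import stats

b, se_b, df = 64.6, 3.8, 98              # df is assumed here; use n - 2 for your own sample
t_crit = stats.t.ppf(0.975, df)          # two-tailed 95% critical value, roughly 2
print(f"95% CI = ({b - t_crit * se_b:.1f}, {b + t_crit * se_b:.1f})")
```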
Confidence Intervals in SPSS:Analyse, Regression, Linear, click on Statistics and select Confidence intervals:
Our rule of thumb said the 95% CI for the slope = (57, 72). How does this compare?
Past Paper: (C2)Relationships (30%) • Suppose you have a theory that suggests that time watching TV is determined by gregariousness • the less gregarious, the more time spent watching TV • Use a random sample of 60 observations from the TV watching data to run a statistical test for this relationship that also controls for the effects of age and gender. • Carefully interpret the output from this model and discuss the statistical robustness of the results.
Reading: • Regression Analysis: • *Field, A., chapters on regression • *Moore, D. S. and McCabe, G. P., chapters on regression • Kennedy, P., A Guide to Econometrics • Bryman, A. and Cramer, D. (1999) Quantitative Data Analysis with SPSS for Windows: A Guide for Social Scientists, chapters 9 and 10 • Achen, C. H. (1982) Interpreting and Using Regression. London: Sage