PSC 570
Classroom presentations: We want to be able to understand the purpose of and reasoning for your research question; your argument or hypothesis(es); your conceptualization and operationalization (the variables used); your sources (data); and potential problems • Your presentation should clearly explain the “why, what, how, and when” of your research, including how your hypothesis(es) can be refuted • Provide me with a hard copy of your presentation
Topics to be covered • The conceptual structure of regression • The uses of regression • The regression coefficient • Hypothesis testing
Regression: introduction • Regression is a means of explaining one variable as a function of other variables: we want to estimate the linear effect of an independent variable X on a dependent variable Y • What we get is a “regression line” or “regression equation,” Y = a + bX • The “effect of X on Y” is just b, the slope of the regression line • Regression is a general technique permitting nonlinear as well as linear models
Varieties of regression • We will study the simplest variety of regression (also the most commonly used): • “ordinary least squares” (OLS) regression • It is appropriate for continuous (ratio-level) dependent variables and… • Independent variables that are interval-level, or ordinal- or nominal-level (after an appropriate transformation)
Varieties of regression • Of course, the dependent variable in social science research is sometimes not ratio-level, for example: • Dichotomous (binary) variables: • does the incumbent candidate “win” or “lose” • is a pair of countries “at war” or “not at war” • Categorical (nominal) variables: • e.g., party identification • Ordinal variables: • e.g., policy outcomes (failure, partial success, total success) • And, not all relationships are linear (more on this later)
Varieties of regression • For each of these cases, there is a variety of regression that is appropriate: • binary: logit or probit regression • nominal: multinomial logit or probit • ordinal: ordered logit or probit • integer (count): Poisson regression • nonlinear: nonlinear regression (including logarithmic, quadratic, etc.)
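As one illustration (not part of the original slides), here is a minimal Python sketch of a binary logit fit using statsmodels; the data are simulated and all variable names and numbers are hypothetical, standing in for a dichotomous outcome such as “win”/“lose”:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)                     # hypothetical predictor
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))       # assumed "true" probabilities
y = rng.binomial(1, p)                       # binary dependent variable (0/1)

X = sm.add_constant(x)                       # add an intercept column
logit_fit = sm.Logit(y, X).fit(disp=0)       # binary logit instead of OLS
print(logit_fit.params)                      # intercept and slope on the log-odds scale
```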
Linear regression: concepts • Recall the scatterplot, which depicts values of two variables simultaneously • In regression, we consider the variable on the vertical axis to be a linear function of the variable on the horizontal axis • Regression is a technique for fitting a line through the data in order to evaluate the linear effect of one variable on another
The regression line • There is a negative relationship between the GDP growth rate and the change in the unemployment rate • In other words, increasing GDP is associated with decreasing unemployment • But…is this a “real” relationship? • The graph alone will not tell us, but regression analysis will • The regression line is the “best” linear relationship that can be found in the data • Note that the regression line does not perfectly predict each point, and it does not cross the vertical axis at zero
The regression equation • Regression fits a line through the data in a scatterplot, where X is the independent variable and Y is the dependent variable • Like all lines, the regression line has the functional form Y = a + bX • In general, we are interested in “b”, the regression coefficient • we want to know: (1) Is it positive or negative? (2) How large is it? (3) Is it significant?
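As a rough illustration (my addition, using made-up numbers), the slope b and intercept a of the least-squares line can be computed directly from the sample covariance and variance:

```python
import numpy as np

# hypothetical data: x = GDP growth rate, y = change in unemployment rate
x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])

# least-squares slope and intercept for Y = a + bX
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # b = cov(X, Y) / var(X)
a = y.mean() - b * x.mean()                           # a = mean(Y) - b * mean(X)
print(a, b)   # the sign and size of b are what we usually care about
```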
The uses of regression • The main use of regression in social science is hypothesis testing • Testing for an effect of some independent variables on a dependent variable • In applied settings, regression is used for forecasting (prediction)
The error term in the regression equation • Since the data do not all fall exactly on the regression line, the regression equation is written Yi = a + bXi + ei, where “ei” = error • What does “ei” stand for? The regression equation predicts the dependent variable by predicting that the observed values of the variable will fall on the regression line • The error term is the difference between each observed value of the dependent variable and the predicted value • in other words, the error is the vertical distance of each point from the regression line
Calculating the Regression Line X represents the horizontal axis and Y the vertical axis. To find the correct regression line you need only know the value of b (slope) and a (y-intercept). To calculate any point along the regression line, simply multiply any X value by b (slope) and add a (y-intercept).
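A minimal sketch of that calculation, reusing the made-up data from above (np.polyfit is just one convenient way to obtain b and a):

```python
import numpy as np

# same hypothetical data as the earlier sketch
x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])

b, a = np.polyfit(x, y, 1)   # degree-1 fit returns (slope, intercept)
y_hat = a + b * x            # any point on the line: multiply X by b and add a
e = y - y_hat                # error term: observed Y minus the predicted value
print(y_hat.round(2), e.round(2))
```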
The t test on the regression coefficient • The regression coefficient b can be thought of as the mean of a distribution • In fact, it is the mean of a distribution • Each observation gives a different indication of the effect of X on Y • The regression coefficient tells us (roughly) that on average, the effect of X on Y equals b in our sample of data
The regression coefficient b • Again, the regression coefficient is an estimate of the linear effect of X on Y • That is, we have a sample of data, and we are trying to find the linear relationship that holds in the population (recall that the population is just an infinitely large sample) • We use b, the sample regression coefficient, as an estimate of β, the population regression coefficient
Interpreting the regression coefficient • Think of the regression coefficient as the linear effect of X on Y, in this sample of data • This estimate of the effect of X on Y allows us to test the hypothesis “X has an effect on Y” • or, if we want to speak in the language of cause and effect, “X causes some of the variation in Y” • The estimated regression coefficient may be large but still insignificant (why?) • It may also be small but significant (why?)
Regression permits researchers to predict future events better than simply guessing the mean • The regression line uses the slope and y-intercept • Pearson’s r will help determine whether the regression is a fluke
Interpreting the Slope of the Line: Why Does it Matter? In this example, for every one-unit change in the value of the independent variable, there should be a 2.04-unit increase in the dependent variable.
How much better than guessing the mean do we predict? Answer: (Pearson’s r)² • This is the proportion of prediction error eliminated by using the regression line rather than simply guessing the mean of Y
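A quick check with the same made-up data (assuming the bivariate OLS setup sketched earlier): one minus SSE/SST equals the squared Pearson correlation, which is why (Pearson’s r)² answers the question:

```python
import numpy as np

x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)    # error from just guessing the mean of Y
sse = np.sum((y - y_hat) ** 2)       # error left over after using the regression line
r = np.corrcoef(x, y)[0, 1]          # Pearson's r

print(1 - sse / sst)                 # proportional reduction in error
print(r ** 2)                        # equals (Pearson's r)^2 in bivariate regression
```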
Regression: • The regression procedure is like all other statistical procedures in one important respect: • It is totally ignorant about the way the world works, and in particular it knows nothing about social science! • So: crap in, crap out!
Underlying assumptions of the OLS model • Ordinary least squares is not “just” easy; it also has a theoretical justification • The theoretical justification is that the OLS estimators (the regression coefficients) are BLUE! • “B.L.U.E.” stands for “Best Linear Unbiased Estimator” • We know what a “linear estimator” is, so what is meant by “best” and “unbiased”?
Why bother with assumptions? • Violations of these assumptions are common • so common that techniques have been developed to handle most of them • What we will do with these assumptions: • I will briefly discuss the consequences of violating the assumptions • You will get an idea of which of them are “fatal” to OLS regression estimates and which are only “harmful”
Unbiasedness • An estimator is unbiased if its mean is equal to the population parameter • Remember that regression gives us a “b” coefficient that is an estimate of the population parameter β • we have a sample of data, and we want to know the population parameter, which we can’t observe • “b” is an unbiased estimate of β because its mean is equal to β, under some assumptions
“Best” linear unbiased estimators • The regression coefficient is the “best” estimator because it is “minimum variance among all linear estimators” • So when we say that regression coefficients are “BLUE,” we really mean that they are “MVLUE”, where MV stands for “minimum variance” • What does that mean?
Minimum variance • We estimate “b” with some error; each estimate of “b” will have a variance • It can be shown mathematically that (under some assumptions) the OLS estimator has the minimum variance among all linear unbiased estimators of the effect of X on Y (in other words, it fits the best line through the data) • This “minimum variance” property is commonly referred to as “efficiency”
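A small simulation (not part of the slides) that illustrates unbiasedness: with an assumed “true” slope β = 2, the OLS slope estimates from many repeated samples average out to roughly 2, and their spread is the sampling variability discussed above:

```python
import numpy as np

rng = np.random.default_rng(42)
beta = 2.0                                      # assumed "true" population slope
estimates = []
for _ in range(5000):                           # draw many samples from the same population
    x = rng.normal(size=50)
    y = 1.0 + beta * x + rng.normal(size=50)    # errors satisfy the OLS assumptions
    b, _a = np.polyfit(x, y, 1)
    estimates.append(b)

print(np.mean(estimates))   # close to 2.0: the estimator is unbiased
print(np.std(estimates))    # the sampling variability of b across repeated samples
```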
Assumption 1: correct model specification • A model might be incorrectly specified for three reasons: • (1) Excluded relevant variables • (2) Included irrelevant variables • (3) Nonlinearity • (1) and (3) are “fatal”; (2) is only “harmful”
“Fatal” vs. “harmful” errors • What do we want out of regression? • (1) estimated effects of X on Y • (2) hypothesis tests of whether X is significant • Note that (2) is based on (1): if the estimates are wrong, then the hypothesis tests built on them will be meaningless, so both steps fail to give us good results • Suppose that we estimate correctly, but with “too much” error…then what will happen?
“Best” and “unbiased” estimates • If an estimate is biased, then it is wrong, and both the estimate and the hypothesis tests are problematic • An estimate can be unbiased but “inefficient” • That is, it can be correct on average, but have a larger error than the “best” estimate would have • Then, the estimated coefficient is “correct,” but hypothesis tests are problematic
“Fatal” and “harmful” again • Biased estimates are “fatal”: they damage all of the useful features (estimation and hypothesis testing) of regression • Inefficient estimates are “harmful,” in that they give us high-variance estimators • But notice that this makes it less likely that we will reject the null hypothesis, so the damage is not “fatal,” only “harmful”
Back to assumption 1 • Recall the violations of assumption 1: • (1) Excluded relevant variables • (2) Included irrelevant variables • (3) Nonlinearity • Violations 1 and 3 are “fatal”; 2 is only “harmful” • If we include “too many” variables, the problem is that our estimates are inefficient • and thus we may fail to reject a false null hypothesis
Dealing with nonlinearity • Include squared term(s) • Estimate in logarithms (as long as all data is positive!) • This is really a theoretical question • you need to have a good reason to construct a nonlinear model
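Two sketches of these options (illustrative only, with simulated data; as the slide says, the functional form should come from theory):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, size=100)                        # hypothetical positive predictor
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(0, 0.2, 100)   # a curved relationship

# option 1: include a squared term, i.e., fit y = a + b1*x + b2*x^2
b2, b1, a = np.polyfit(x, y, 2)
print(a, b1, b2)

# option 2: estimate in logarithms (requires the logged values to be positive)
slope, intercept = np.polyfit(np.log(x), y, 1)             # y = intercept + slope*log(x)
print(intercept, slope)
```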
Assumption 2: normally distributed errors • The errors are i.i.d. (independently and identically distributed) • The errors are normally distributed • This means that • the errors are just as likely to be positive as negative • a positive error of a given size is just as likely as a negative error of the same size • the errors can, in principle, take any value from negative infinity to positive infinity • Violating this assumption is a “fatal” error: biased estimates • The usual cause: model misspecification • We will address one source of this type of misspecification: categorical dependent variables
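One simple diagnostic, offered here as an illustration rather than as part of the slides, is to inspect the residuals themselves, for example with a Shapiro-Wilk test from scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.8 * x + rng.normal(size=100)   # hypothetical, well-behaved data

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

stat, p = stats.shapiro(residuals)          # Shapiro-Wilk test of normality
print(p)   # a very small p-value would suggest the errors are not normally distributed
```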
Assumption 3 • Homoskedasticity: • the value of “x” does not influence the size or direction of the errors • that is, the errors are the same at high, low, and intermediate values of the “x” variables • This means that the “x” variable is just as good (or bad) a predictor over the entire range of “x”
Assumption 3: homoskedasticity • If the errors grow larger or smaller as X grows larger, then this assumption is violated, and we say that the model exhibits heteroskedasticity • This is harmful but not fatal: estimates are unbiased, but inefficient • This can be dealt with through more advanced techniques • Feasible generalized least squares (FGLS); White standard errors
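A sketch using statsmodels (my example, not the slides’): the cov_type="HC1" option requests White-type heteroskedasticity-robust standard errors; the simulated data deliberately let the error variance grow with X:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
errors = rng.normal(scale=0.5 * x)          # error variance grows with x: heteroskedasticity
y = 2.0 + 0.7 * x + errors

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                    # classical (homoskedastic) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")   # White (robust) standard errors

print(ols_fit.bse)      # standard errors assuming homoskedasticity
print(robust_fit.bse)   # heteroskedasticity-consistent standard errors
```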
Assumption 4 • Nonrandom X variables (often referred to as “X fixed in repeated samples”) • That is, if we took another sample of data on X and Y, we might find different Y values, but we would not find different values on the independent variables • This is one of the things that distinguishes regression from correlation • And, it is the biggest assumption that we make in regression (why?)
Assumption 4, continued • This is a “big” assumption because… • we are claiming that we have perfect information about the X variables • This seems impossible, and it is! • But there are some cases where it is a big problem, and others where it is not • techniques have been developed to handle the “big” problems
Assumption 5: no perfect multicollinearity • Regression is a method of handling multicollinearity • It allows the X variables to be correlated with each other, and it tries to determine the effect of each variable while taking account of the others • Perfect multicollinearity results when the X variables are so closely related that they predict each other (almost) exactly
Multicollinearity in practice • Sometimes, models cannot be estimated due to multicollinearity • More commonly, “real” relationships between some X and the Y variable are masked by highly multicollinear X variables • In this case, you should estimate separate models using only a subset of the X variables (for example, two-thirds or three-quarters of them in any one regression)
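One common diagnostic (again, an illustration rather than part of the slides) is the variance inflation factor, available in statsmodels; here the simulated x2 is nearly a copy of x1, so its VIF blows up:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # x2 nearly duplicates x1: near-perfect collinearity
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # very large VIFs for x1 and x2 flag the multicollinearity problem
```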
Regression review • The regression procedure uses the variables that you specify in order to estimate a variable chosen by you • Therefore, the regression results are only as good as the theory that you construct • In other words: assuming that your set of independent variables is appropriate, the regression procedure estimates their impact on another (dependent) variable
Regression review • First, regression uses the independent variables that you specify in order to “predict” the dependent variable • An important result of this first step is an estimate of the regression coefficient (the slope, in a particular direction, of the regression plane (or “hyperplane”))
Regression review • Each estimated regression coefficient can be viewed as the mean of a distribution of possible values • The distribution has a standard deviation (called the “standard error”) • Together, the mean and standard deviation are used in hypothesis testing
Regression review • The question we ask in hypothesis testing is: how much error? • There are two basic answers to this question • (1) so much error that the effect of x1 on y might be zero with high probability • (2) so little error that the effect of x1 on y is zero with very low probability
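A worked sketch of this test for the bivariate case, using the made-up data from earlier: t is the estimated slope divided by its standard error, and the p-value tells us how likely a slope this large would be if the true effect were zero:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])
n = len(x)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

s2 = np.sum(residuals**2) / (n - 2)                # estimated error variance
se_b = np.sqrt(s2 / np.sum((x - x.mean())**2))     # standard error of the slope
t = b / se_b                                       # how many standard errors from zero?
p = 2 * stats.t.sf(abs(t), df=n - 2)               # two-sided p-value

print(b, se_b, t, p)   # a small p-value: the effect is very unlikely to be zero
```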