PSC 570
Classroom presentations: We want to be able to understand the purpose of and reasoning for your research question; your argument or hypothesis(es); your conceptualization and operationalization (the variables used); your sources (data); and potential problems • Your presentation should clearly explain the “why, what, how, and when” of your research, including how your hypothesis(es) can be refuted • Provide me with a hard copy of your presentation
Topics to be covered • The conceptual structure of regression • The uses of regression • The regression coefficient • Hypothesis testing
Regression: introduction • Regression is a means of explaining one variable as a function of other variables: we want to estimate the linear effect of an independent variable X on a dependent variable Y • What we get is a “regression line” or “regression equation,” Y = a + bX • The “effect of X on Y” is just b, the slope of the regression line • Regression is a general technique permitting nonlinear as well as linear models
Varieties of regression • We will study the simplest variety of regression (also the most commonly used): • “ordinary least squares” (OLS) regression • It is appropriate for continuous (ratio-level) dependent variables and… • Independent variables that are interval-level, or ordinal- or nominal-level (after an appropriate transformation)
Varieties of regression • Of course, the dependent variable in social science research is sometimes not ratio-level, for example: • Dichotomous (binary) variables: • does the incumbent candidate “win” or “lose” • is a pair of countries “at war” or “not at war” • Categorical (nominal) variables: • e.g., party identification • Ordinal variables: • e.g., policy outcomes (failure, partial success, total success) • And, not all relationships are linear (more on this later)
Varieties of regression • For each of these cases, there is a variety of regression that is appropriate: • binary: logit or probit regression • nominal: multinomial logit or probit • ordinal: ordered logit or probit • integer (count): Poisson regression • nonlinear: nonlinear regression (including logarithmic, quadratic, etc.)
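As one illustration (not part of the original slides), here is a minimal Python sketch of a binary logit fit using statsmodels; the data are simulated and all variable names and numbers are hypothetical, standing in for a dichotomous outcome such as “win”/“lose”:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)                     # hypothetical predictor
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))       # assumed "true" probabilities
y = rng.binomial(1, p)                       # binary dependent variable (0/1)

X = sm.add_constant(x)                       # add an intercept column
logit_fit = sm.Logit(y, X).fit(disp=0)       # binary logit instead of OLS
print(logit_fit.params)                      # intercept and slope on the log-odds scale
```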
Linear regression: concepts • Recall the scatterplot, which depicts values of two variables simultaneously • In regression, we consider the variable on the vertical axis to be a linear function of the variable on the horizontal axis • Regression is a technique for fitting a line through the data in order to evaluate the linear effect of one variable on another
The regression line • There is a negative relationship between the GDP growth rate and the change in the unemployment rate • In other words, increasing GDP is associated with decreasing unemployment • But…is this a “real” relationship? • The graph alone will not tell us, but regression analysis will • The regression line is the “best” linear relationship that can be found in the data • Note that the regression line does not perfectly predict each point, and it does not cross the vertical axis at zero
The regression equation • Regression fits a line through the data in a scatterplot, where X is the independent variable and Y is the dependent variable • Like all lines, the regression line has the functional form Y = a + bX • In general, we are interested in “b”, the regression coefficient • we want to know: (1) Is it positive or negative? (2) How large is it? (3) Is it significant?
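As a rough illustration (my addition, using made-up numbers), the slope b and intercept a of the least-squares line can be computed directly from the sample covariance and variance:

```python
import numpy as np

# hypothetical data: x = GDP growth rate, y = change in unemployment rate
x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])

# least-squares slope and intercept for Y = a + bX
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # b = cov(X, Y) / var(X)
a = y.mean() - b * x.mean()                           # a = mean(Y) - b * mean(X)
print(a, b)   # the sign and size of b are what we usually care about
```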
The uses of regression • The main use of regression in social science is hypothesis testing • Testing for an effect of some independent variables on a dependent variable • In applied settings, regression is used for forecasting (prediction)
The error term in the regression equation • Since the data do not all fall exactly on the regression line, the regression equation is written Yi = a + bXi + ei, where “ei” = error • What does “ei” stand for? The regression equation predicts the dependent variable by predicting that the observed values of the variable will fall on the regression line • The error term is the difference between each observed value of the dependent variable and the predicted value • in other words, the error is the vertical distance of each point from the regression line
Calculating the Regression Line X represents the horizontal axis and Y the vertical axis. To find the correct regression line you need only know the value of b (slope) and a (y-intercept). To calculate any point along the regression line, simply multiply any X value by b (slope) and add a (y-intercept).
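A minimal sketch of that calculation, reusing the made-up data from above (np.polyfit is just one convenient way to obtain b and a):

```python
import numpy as np

# same hypothetical data as the earlier sketch
x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])

b, a = np.polyfit(x, y, 1)   # degree-1 fit returns (slope, intercept)
y_hat = a + b * x            # any point on the line: multiply X by b and add a
e = y - y_hat                # error term: observed Y minus the predicted value
print(y_hat.round(2), e.round(2))
```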
The t test on the regression coefficient • The regression coefficient b can be thought of as the mean of a distribution • In fact, it is the mean of a distribution • Each observation gives a different indication of the effect of X on Y • The regression coefficient tells us (roughly) that on average, the effect of X on Y equals b in our sample of data
The regression coefficient b • Again, the regression coefficient is an estimate of the linear effect of X on Y • That is, we have a sample of data, and we are trying to find the linear relationship that holds in the population (recall that the population is just an infinitely large sample) • We use b, the sample regression coefficient, as an estimate of β, the population regression coefficient
Interpreting the regression coefficient • Think of the regression coefficient as the linear effect of X on Y, in this sample of data • This estimate of the effect of X on Y allows us to test the hypothesis “X has an effect on Y” • or, if we want to speak in the language of cause and effect, “X causes some of the variation in Y” • The estimated regression coefficient may be large but still insignificant (why?) • It may also be small but significant (why?)
Regression permits researchers to predict future events better than simply guessing the mean • The regression line uses the slope and y-intercept • Pearson’s r will help determine whether the regression is a fluke
Interpreting the Slope of the Line: Why Does it Matter? In this example, for every one-unit change in the value of the independent variable, there should be a 2.04-unit increase in the dependent variable.
How much better than guessing the mean do we predict? Answer: (Pearson’s r)² • This is the proportion of prediction error eliminated by using the regression line rather than simply guessing the mean of Y
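A quick check with the same made-up data (assuming the bivariate OLS setup sketched earlier): one minus SSE/SST equals the squared Pearson correlation, which is why (Pearson’s r)² answers the question:

```python
import numpy as np

x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)    # error from just guessing the mean of Y
sse = np.sum((y - y_hat) ** 2)       # error left over after using the regression line
r = np.corrcoef(x, y)[0, 1]          # Pearson's r

print(1 - sse / sst)                 # proportional reduction in error
print(r ** 2)                        # equals (Pearson's r)^2 in bivariate regression
```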
Regression: • The regression procedure is like all other statistical procedures in one important respect: • It is totally ignorant about the way the world works, and in particular it knows nothing about social science! • So: crap in, crap out!
Underlying assumptions of the OLS model • Ordinary least squares is not “just” easy; it also has a theoretical justification • The theoretical justification is that the OLS estimators (the regression coefficients) are BLUE! • “B.L.U.E.” stands for “Best Linear Unbiased Estimator” • We know what a “linear estimator” is, so what is meant by “best” and “unbiased”?
Why bother with assumptions? • Violations of these assumptions are common • so common that techniques have been developed to handle most of them • What we will do with these assumptions: • I will briefly discuss the consequences of violating the assumptions • You will get an idea of which of them are “fatal” to OLS regression estimates and which are only “harmful”
Unbiasedness • An estimator is unbiased if its mean is equal to the population parameter • Remember that regression gives us a “b” coefficient that is an estimate of the population parameter β • we have a sample of data, and we want to know the population parameter, which we can’t observe • “b” is an unbiased estimate of β because its mean is equal to β, under some assumptions
“Best” linear unbiased estimators • The regression coefficient is the “best” estimator because it is “minimum variance among all linear estimators” • So when we say that regression coefficients are “BLUE,” we really mean that they are “MVLUE”, where MV stands for “minimum variance” • What does that mean?
Minimum variance • We estimate “b” with some error; each estimate of “b” will have a variance • It can be shown mathematically that (under some assumptions) the OLS estimator has the minimum variance among all linear unbiased estimators of the effect of X on Y (in other words, it fits the best line through the data) • This “minimum variance” property is commonly referred to as “efficiency”
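A small simulation (not part of the slides) that illustrates unbiasedness: with an assumed “true” slope β = 2, the OLS slope estimates from many repeated samples average out to roughly 2, and their spread is the sampling variability discussed above:

```python
import numpy as np

rng = np.random.default_rng(42)
beta = 2.0                                      # assumed "true" population slope
estimates = []
for _ in range(5000):                           # draw many samples from the same population
    x = rng.normal(size=50)
    y = 1.0 + beta * x + rng.normal(size=50)    # errors satisfy the OLS assumptions
    b, _a = np.polyfit(x, y, 1)
    estimates.append(b)

print(np.mean(estimates))   # close to 2.0: the estimator is unbiased
print(np.std(estimates))    # the sampling variability of b across repeated samples
```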
Assumption 1: correct model specification • A model might be incorrectly specified for three reasons: • (1) Excluded relevant variables • (2) Included irrelevant variables • (3) Nonlinearity • (1) and (3) are “fatal”; (2) is only “harmful”
“Fatal” vs. “harmful” errors • What do we want out of regression? • (1) estimated effects of X on Y • (2) hypothesis tests of whether X is significant • Note that (2) is based on (1): if the estimates are wrong, then the hypothesis tests built on them will be meaningless, so both steps fail to give us good results • Suppose that we estimate correctly, but with “too much” error…then what will happen?
“Best” and “unbiased” estimates • If an estimate is biased, then it is wrong, and both the estimate and the hypothesis tests are problematic • An estimate can be unbiased but “inefficient” • That is, it can be correct on average, but have a larger error than the “best” estimate would have • Then, the estimated coefficient is “correct,” but hypothesis tests are problematic
“Fatal” and “harmful” again • Biased estimates are “fatal”: they damage all of the useful features (estimation and hypothesis testing) of regression • Inefficient estimates are “harmful,” in that they give us high-variance estimators • But notice that this makes it less likely that we will reject the null hypothesis, so the damage is not “fatal,” only “harmful”
Back to assumption 1 • Recall the violations of assumption 1: • (1) Excluded relevant variables • (2) Included irrelevant variables • (3) Nonlinearity • Violations 1 and 3 are “fatal”; 2 is only “harmful” • If we include “too many” variables, the problem is that our estimates are inefficient • and thus we may fail to reject a false null hypothesis
Dealing with nonlinearity • Include squared term(s) • Estimate in logarithms (as long as all data is positive!) • This is really a theoretical question • you need to have a good reason to construct a nonlinear model
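Two sketches of these options (illustrative only, with simulated data; as the slide says, the functional form should come from theory):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, size=100)                        # hypothetical positive predictor
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(0, 0.2, 100)   # a curved relationship

# option 1: include a squared term, i.e., fit y = a + b1*x + b2*x^2
b2, b1, a = np.polyfit(x, y, 2)
print(a, b1, b2)

# option 2: estimate in logarithms (requires the logged values to be positive)
slope, intercept = np.polyfit(np.log(x), y, 1)             # y = intercept + slope*log(x)
print(intercept, slope)
```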
Assumption 2: normally distributed errors • The errors are i.i.d. (independently and identically distributed) • The errors are normally distributed • This means that • the errors are just as likely to be positive as negative • a positive error of a given size is just as likely as a negative error of the same size • the errors can, in principle, take any value from negative infinity to positive infinity • Violating this assumption is a “fatal” error: biased estimates • The usual cause: model misspecification • We will address one source of this type of misspecification: categorical dependent variables
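One simple diagnostic, offered here as an illustration rather than as part of the slides, is to inspect the residuals themselves, for example with a Shapiro-Wilk test from scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.8 * x + rng.normal(size=100)   # hypothetical, well-behaved data

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

stat, p = stats.shapiro(residuals)          # Shapiro-Wilk test of normality
print(p)   # a very small p-value would suggest the errors are not normally distributed
```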
Assumption 3 • Homoskedasticity: • the value of “x” does not influence the size or direction of the errors • that is, the errors are the same at high, low, and intermediate values of the “x” variables • This means that the “x” variable is just as good (or bad) a predictor over the entire range of “x”
Assumption 3: homoskedasticity • If the errors grow larger or smaller as X grows larger, then this assumption is violated, and we say that the model exhibits heteroskedasticity • This is harmful but not fatal: estimates are unbiased, but inefficient • This can be dealt with through more advanced techniques • Feasible generalized least squares (FGLS); White standard errors
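A sketch using statsmodels (my example, not the slides’): the cov_type="HC1" option requests White-type heteroskedasticity-robust standard errors; the simulated data deliberately let the error variance grow with X:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
errors = rng.normal(scale=0.5 * x)          # error variance grows with x: heteroskedasticity
y = 2.0 + 0.7 * x + errors

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()                    # classical (homoskedastic) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")   # White (robust) standard errors

print(ols_fit.bse)      # standard errors assuming homoskedasticity
print(robust_fit.bse)   # heteroskedasticity-consistent standard errors
```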
Assumption 4 • Nonrandom X variables (often referred to as “X fixed in repeated samples”) • That is, if we took another sample of data on X and Y, we might find different Y values, but we would not find different values on the independent variables • This is one of the things that distinguishes regression from correlation • And, it is the biggest assumption that we make in regression (why?)
Assumption 4, continued • This is a “big” assumption because… • we are claiming that we have perfect information about the X variables • This seems impossible, and it is! • But there are some cases where it is a big problem, and others where it is not • techniques have been developed to handle the “big” problems
Assumption 5: no perfect multicollinearity • Regression is a method of handling multicollinearity • It allows the X variables to be correlated with each other, and it tries to determine the effect of each variable while taking account of the others • Perfect multicollinearity results when the X variables are so closely related that they predict each other (almost) exactly
Multicollinearity in practice • Sometimes, models cannot be estimated due to multicollinearity • More commonly, “real” relationships between some X and the Y variable are masked by highly multicollinear X variables • In this case, you should estimate separate models using only a subset of the X variables (for example, two-thirds or three-quarters of them in any one regression)
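One common diagnostic (again, an illustration rather than part of the slides) is the variance inflation factor, available in statsmodels; here the simulated x2 is nearly a copy of x1, so its VIF blows up:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # x2 nearly duplicates x1: near-perfect collinearity
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # very large VIFs for x1 and x2 flag the multicollinearity problem
```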
Regression review • The regression procedure uses the variables that you specify in order to estimate a variable chosen by you • Therefore, the regression results are only as good as the theory that you construct • In other words: assuming that your set of independent variables is appropriate, the regression procedure estimates their impact on another (dependent) variable
Regression review • First, regression uses the independent variables that you specify in order to “predict” the dependent variable • An important result of this first step is an estimate of the regression coefficient (the slope, in a particular direction, of the regression plane (or “hyperplane”))
Regression review • Each estimated regression coefficient can be viewed as the mean of a distribution of possible values • The distribution has a standard deviation (called the “standard error”) • Together, the mean and standard deviation are used in hypothesis testing
Regression review • The question we ask in hypothesis testing is: how much error? • There are two basic answers to this question • (1) so much error that the effect of x1 on y might be zero with high probability • (2) so little error that the effect of x1 on y is zero with very low probability
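A worked sketch of this test for the bivariate case, using the made-up data from earlier: t is the estimated slope divided by its standard error, and the p-value tells us how likely a slope this large would be if the true effect were zero:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.5, 3.0, 4.2, 5.1, 0.5])
y = np.array([0.8, 0.1, -0.2, -0.9, -1.4, 1.1])
n = len(x)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

s2 = np.sum(residuals**2) / (n - 2)                # estimated error variance
se_b = np.sqrt(s2 / np.sum((x - x.mean())**2))     # standard error of the slope
t = b / se_b                                       # how many standard errors from zero?
p = 2 * stats.t.sf(abs(t), df=n - 2)               # two-sided p-value

print(b, se_b, t, p)   # a small p-value: the effect is very unlikely to be zero
```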