Simple Regression Department of Applied Economics National Chung Hsing University
Linear Functions
• Formula: Y = a + bX is a linear formula. If you graphed X and Y for any chosen values of a and b, you’d get a straight line.
• It is a family of functions: for each choice of a and b, you get a particular line
• a is referred to as the “constant” or “intercept”
• b is referred to as the “slope”
• To graph a linear function: pick values for X, compute the corresponding values of Y
• Then, connect the dots to graph the line (a code sketch follows)
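To make the “pick X, compute Y, connect the dots” recipe concrete, here is a minimal sketch in Python; the intercept and slope values are arbitrary, chosen only for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

a, b = 3.0, -1.5                # intercept and slope (arbitrary example values)
x = np.linspace(-10, 10, 50)    # pick values for X
y = a + b * x                   # compute the corresponding values of Y

plt.plot(x, y)                  # connecting the dots gives a straight line
plt.xlabel("X axis")
plt.ylabel("Y axis")
plt.title("Y = 3 - 1.5X")
plt.show()
```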
Linear Functions: Y = a + bX
[Graph: three parallel lines, Y = 14 - 1.5X, Y = 3 - 1.5X, and Y = -9 - 1.5X, plotted for X from -10 to 10]
• The “constant” or “intercept” (a) determines where the line intersects the Y-axis
• If a increases (decreases), the line moves up (down)
Linear Functions: Y = a + bX
[Graph: three lines with the same intercept but different slopes, Y = 3 - 1.5X, Y = 3 + .2X, and Y = 3 + 3X, plotted for X from -10 to 10]
• The slope (b) determines the steepness of the line
Linear Functions: Slopes
[Graph: Y = 3 + 3X, with a change in X of 5 producing a change in Y of 15]
• The slope (b) is the ratio of change in Y to change in X: b = 15/5 = 3
• The slope tells you how many points Y will increase for any single-point increase in X
Linear Functions as Summaries
• A linear function can be used to summarize the relationship between two variables
[Graph: happiness vs. income, where a change in X of $40,000 corresponds to a change in Y of 2 points]
• Slope: b = 2 / 40,000 = .00005 pts/$
• If you change units: b = .05 pts/$1K = .5 pts/$10K = 5 pts/$100K
Linear Functions as Summaries
• Slope and constant can be “eyeballed” to approximate a formula: Happy = 2 + .00005 Income
• Slope (b): b = 2 / 40,000 = .00005 pts/$
• Constant (a): the value where the line hits the Y axis, a = 2
Linear Functions as Summaries
• Linear functions can powerfully summarize data:
• Formula: Happy = 2 + .00005 Income
• Gives a sense of how the two variables are related
• Namely, people get a .00005-point increase in happiness for every extra dollar of income (or 5 pts per $100K)
• Also lets you “predict” values. What if someone earns $150,000?
• Happy = 2 + .00005($150,000) = 9.5
• But be careful… you shouldn’t assume that a relationship remains linear indefinitely
• Also, negative income or happiness values make no sense…
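As a quick sketch of that prediction step in Python (the numbers come from the eyeballed formula above):

```python
def predicted_happiness(income):
    """Prediction from the eyeballed line: Happy = 2 + .00005 * Income."""
    return 2 + 0.00005 * income

print(predicted_happiness(150_000))  # 9.5
# Caution: the line keeps climbing forever, even though the real
# relationship probably does not stay linear indefinitely.
```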
Linear Functions as Summaries
• Come up with a linear function that summarizes this real data: years of education vs. job prestige
• It isn’t always easy! The line you choose depends on how much you “weight” each of these points.
Computing Regressions
• Regression coefficients can be calculated in SPSS
• You will rarely, if ever, do them by hand
• SPSS will estimate:
• The value of the constant (a)
• The value of the slope (b)
• Plus, a large number of related statistics and results of hypothesis testing procedures
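SPSS is what this course uses; purely as an illustrative sketch, the same estimates can be obtained in Python with statsmodels (the education/prestige numbers below are made up for the example):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: years of education (X) and job prestige scores (Y)
education = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20])
prestige = np.array([28, 32, 40, 38, 45, 50, 52, 55, 60])

X = sm.add_constant(education)      # adds the intercept term (a)
model = sm.OLS(prestige, X).fit()

print(model.params)                 # [a, b]: the constant and the slope
print(model.summary())              # plus related statistics and tests
```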
Example: Education & Job Prestige
• Example: years of education versus job prestige
• Previously, we made an “eyeball” estimate of the line: Y = 5 + 3X
Example: Education & Job Prestige
• The actual SPSS regression results for that data:
• Estimates of a and b: “Constant” = a = 9.427; slope for “Year of School” = b = 2.487
• Equation: Prestige = 9.4 + 2.5 Education
• A year of education adds 2.5 points of job prestige
Example: Education & Job Prestige
• Comparing our “eyeball” estimate to the actual OLS regression line
[Scatterplot showing our estimate, Y = 5 + 3X, alongside the actual OLS regression line computed in SPSS]
R-Square
• The R-Square statistic indicates how well the regression line “explains” variation in Y
• It is based on partitioning variance into:
• 1. Explained (“regression”) variance: the portion of deviation from Y-bar accounted for by the regression line
• 2. Unexplained (“error”) variance: the portion of deviation from Y-bar that is “error”
• Formula: R² = SSregression / SStotal
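A minimal sketch of that partition in Python, reusing the hypothetical education/prestige data from the earlier example:

```python
import numpy as np

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20])
y = np.array([28, 32, 40, 38, 45, 50, 52, 55, 60])

b, a = np.polyfit(x, y, 1)                         # OLS slope and intercept
y_hat = a + b * x                                  # points on the regression line

ss_total = np.sum((y - y.mean()) ** 2)             # total deviation from Y-bar
ss_regression = np.sum((y_hat - y.mean()) ** 2)    # explained portion
ss_error = np.sum((y - y_hat) ** 2)                # unexplained ("error") portion

print(ss_regression / ss_total)                    # R-square
print(1 - ss_error / ss_total)                     # same value
```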
R-Square
[Graph: the line Y = 2 + .5X with a horizontal line at Y-bar; a point’s deviation from Y-bar splits into “explained variance” (from Y-bar to the regression line) and “error variance” (from the line to the point)]
• Visually: deviation is partitioned into two parts
Example: Education & Job Prestige
• R-Square & hypothesis-testing information:
• The R and R-Square indicate how well the line summarizes the data
• This information allows us to do hypothesis tests about the constant & slope
Hypothesis Tests: Slopes
• Given: observed slope relating education to job prestige = 2.487
• Question: can we generalize this to the population of all Americans?
• How likely is it that this observed slope was actually drawn from a population with slope = 0?
• Solution: conduct a hypothesis test
• Notation: sample slope = b, population slope = β
• H0: population slope β = 0
• H1: population slope β ≠ 0 (two-tailed test)
Example: Slope Hypothesis Test
• The actual SPSS regression results for that data: the t-value and “Sig.” (p-value) columns are for hypothesis tests about the slope
• Reject H0 if: t-value > critical t (N - 2 df)
• Or, if “Sig.” (the p-value) is less than α (often α = .05)
Hypothesis Tests: Slopes
• What information lets us do a hypothesis test?
• Answer: estimates of a slope (b) have a sampling distribution, like any other statistic
• It is the distribution of every value of the slope, based on all possible samples (of size N)
• If certain assumptions are met, the sampling distribution approximates the t-distribution
• Thus, we can assess the probability that a given value of b would be observed, if β = 0
• If that probability is low – below alpha – we reject H0
Hypothesis Tests: Slopes
[Figure: sampling distribution of the slope b, centered at zero. If β = 0, observed slopes should commonly fall near zero too; if the observed slope falls very far from 0, it is improbable that β is really zero, and we can reject H0]
• Visually: if the population slope (β) is zero, then the sampling distribution centers at zero
• Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero
Regression Assumptions
• Assumptions of simple (bivariate) regression
• If the assumptions aren’t met, hypothesis tests may be inaccurate
• 1. Random sample with sufficient N (N > ~20)
• 2. Linear relationship among variables
• Check the scatterplot for a non-linear pattern (a “cloud” is OK)
• 3. Conditional normality: Y is normal at all values of X
• Check histograms of Y for normality at several values of X
• 4. Homoskedasticity: equal error variance at all values of X
• Check the scatterplot for “bulges” or “fanning out” of error across values of X (a sketch of these visual checks follows)
• Additional assumptions are required for multivariate regression…
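These checks are usually done by eye; here is a minimal Python sketch of the two scatterplot-based checks, assuming the same toy x and y arrays as in the earlier examples:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20])
y = np.array([28, 32, 40, 38, 45, 50, 52, 55, 60])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)        # error at each value of X

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(x, y)                  # look for a non-linear pattern
ax1.set_title("Linearity check")
ax2.scatter(x, residuals)          # look for "fanning out" of error
ax2.axhline(0)
ax2.set_title("Equal-variance check")
plt.show()
```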
Bivariate Regression Assumptions
• Normality: examine sub-samples at different values of X; make histograms and check for normality
[Figure: two example histograms, one “Good” (roughly normal) and one “Not very good”]
Bivariate Regression Assumptions
• Homoskedasticity (equal error variance): examine the error at different values of X. Is it roughly equal?
[Figure: scatterplot where, here, things look pretty good]
Bivariate Regression Assumptions
• Heteroskedasticity (unequal error variance): at higher values of X, the error variance increases a lot
[Figure: scatterplot where this looks pretty bad]
Regression Hypothesis Tests
• If the assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution
• The standard deviation of that sampling distribution is called the standard error of the slope (σb)
• Population formula of the standard error: σb = sqrt( σe² / Σ(Xi - X-bar)² )
• Where σe² is the variance of the regression error
Regression Hypothesis Tests
• Finally, a t-value can be calculated: it is the slope divided by its standard error, t = b / sb
• Where sb is the sample point estimate of the standard error
• The t-value is based on N - 2 degrees of freedom
• Reject H0 if observed t > critical t (e.g., 1.96).
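A sketch of those two formulas in Python (same hypothetical data as before; scipy is used only to look up the two-tailed p-value):

```python
import numpy as np
from scipy import stats

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20])
y = np.array([28, 32, 40, 38, 45, 50, 52, 55, 60])
n = len(x)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

s_e2 = np.sum(residuals ** 2) / (n - 2)              # error variance estimate
s_b = np.sqrt(s_e2 / np.sum((x - x.mean()) ** 2))    # standard error of the slope

t_value = b / s_b
p_value = 2 * stats.t.sf(abs(t_value), df=n - 2)     # two-tailed "Sig."
print(t_value, p_value)
```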
Example: Education & Job Prestige
• t-values can be compared to critical t...
• SPSS estimates the standard error of the slope; this is used to calculate the t-value
• The t-value can be compared to the “critical value” to test hypotheses, or just compare “Sig.” to alpha
• If t > critical t, or Sig. < alpha, reject H0
Multiple Regression 1 Department of Applied Economics National Chung Hsing University
Multiple Regression
• Question: What if a dependent variable is affected by more than one independent variable?
• Strategy #1: do separate bivariate regressions
• One regression for each independent variable
• This yields a separate slope estimate for each independent variable
• Bivariate slope estimates implicitly assume that neither independent variable mediates the other
• In reality, there might be no effect of family wealth over and above education
Multiple Regression
• Job prestige: two separate regression models
[SPSS output for the two bivariate regressions; both variables have positive, significant slopes]
Multiple Regression
• Strategy #2: use multiple regression
• Multiple regression can examine “partial” relationships
• Partial = relationships after the effects of other variables have been “controlled” (taken into account)
• This lets you determine the effects of variables “over and above” other variables
• And shows the relative impact of different factors on a dependent variable
• And, you can use several independent variables to improve your predictions of the dependent variable
Multiple Regression
• Job prestige: 2-variable multiple regression
[SPSS output: the education slope is basically unchanged, while the family income slope decreases compared to the bivariate analysis (bivariate: b = 2.07) and the outcome of its hypothesis test changes – t < 1.96]
Multiple Regression
• Ex: job prestige, 2-variable multiple regression
• 1. Education has a large slope effect controlling for (i.e. “over and above”) family income
• 2. Family income does not have much effect controlling for education, despite a strong bivariate relationship
• Possible interpretations:
• Family income may lead to education, but education is the critical predictor of job prestige
• Or, family income is wholly unrelated to job prestige… but is coincidentally correlated with a variable that does predict it (education), which generated a spurious “effect”.
The Multiple Regression Model
• A two-independent-variable regression model: Y = a + b1X1 + b2X2 + e
• Note: there are now two X variables, and a slope (b) is estimated for each one
• The full multiple regression model is: Y = a + b1X1 + b2X2 + … + bkXk + e, for k independent variables
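A minimal sketch of fitting that two-variable model in Python (the income column is made up, like the rest of the toy data):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: education (years), family income ($1000s), job prestige
education = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20])
income = np.array([25, 30, 28, 45, 50, 60, 55, 70, 80])
prestige = np.array([28, 32, 40, 38, 45, 50, 52, 55, 60])

X = sm.add_constant(np.column_stack([education, income]))
model = sm.OLS(prestige, X).fit()
print(model.params)   # [a, b1, b2]: one slope for each independent variable
```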
Multiple Regression: Slopes
• Regression slope for the two-variable case: b1 = [(rY1 - rY2·r12) / (1 - r12²)] · (sY / s1)
• b1 = slope for X1, controlling for the other independent variable X2
• b2 is computed symmetrically: swap the X1s and X2s
• Compare to the bivariate slope: b = rY1 · (sY / s1)
• (Here rY1, rY2, and r12 are the pairwise correlations, and sY, s1 are standard deviations)
Multiple Regression Slopes
• Let’s look more closely at the formulas:
• What happens to b1 if X1 and X2 are totally uncorrelated (r12 = 0)?
• Answer: the formula reduces to the bivariate slope
• What if X1 and X2 are correlated with each other AND X2 is more correlated with Y than X1?
• Answer: b1 gets smaller (compared to the bivariate slope); the sketch below checks the formula against a direct fit
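A small sketch verifying the two-variable slope formula against a direct OLS fit (simulated data; np.corrcoef supplies the pairwise correlations):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.6 * x1 + rng.normal(size=200)       # correlated with x1
y = 2 + 1.5 * x1 + 0.5 * x2 + rng.normal(size=200)

r_y1 = np.corrcoef(y, x1)[0, 1]
r_y2 = np.corrcoef(y, x2)[0, 1]
r_12 = np.corrcoef(x1, x2)[0, 1]

# Two-variable slope formula for b1
b1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2) * (y.std() / x1.std())

# Compare with the multiple regression fit; the two values agree
X = sm.add_constant(np.column_stack([x1, x2]))
print(b1, sm.OLS(y, X).fit().params[1])
```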
Regression Slopes
• So, if two variables (X1, X2) are correlated and both predict Y:
• The X variable that is more correlated with Y will have the higher slope in the multivariate regression
• The slope of the less-correlated variable will shrink
• Thus, the slope for each variable is adjusted for how well the other variable predicts Y
• It is the slope “controlling” for the other variables.
Multiple Regression Slopes
• One last thing to keep in mind…
• What happens to b1 if X1 and X2 are almost perfectly correlated?
• Answer: the denominator (1 - r12²) approaches zero
• The slope “blows up”, approaching infinity
• Highly correlated independent variables can cause trouble for regression models… watch out
Interpreting Results
• (Over)simplified rules for interpretation
• Assumes a good sample, measures, models, etc.
• Multivariate regression with two variables: A, B
• If the slopes of A and B are the same as in the bivariate regressions, then each has an independent effect
• If A remains large and B shrinks to zero, we typically conclude that the effect of B was spurious, or that it operates through A
• If both A and B shrink a little, each has an effect, but some overlap or mediation is occurring
Interpreting Multivariate Results
• Things to watch out for:
• 1. Remember: correlation is not causation
• The ability to “control” for many variables can help detect spurious relationships… but it isn’t perfect
• Be aware that other (omitted) variables may be affecting your model; don’t over-interpret results
• 2. Reverse causality
• Many sociological processes involve bi-directional causality; regression slopes (and correlations) do not identify which variable “causes” the other
• Ex: self-esteem and test scores.
Standardized Regression Coefficients
• Regression slopes reflect the units of the independent variables
• Question: how do you compare how “strong” the effects of two variables are if they have totally different units?
• Example: education, family wealth, and job prestige
• Education measured in years: b = 2.5
• Family wealth measured on a 1-5 scale: b = .18
• Which is the “bigger” effect? The units aren’t comparable!
• Answer: create “standardized” coefficients
Standardized Regression Coefficients
• Standardized coefficients are also called “betas” or “beta weights”
• Symbol: Greek beta with an asterisk, β*
• Equivalent to Z-scoring (standardizing) all independent variables before doing the regression
• Formula of the coefficient for Xj: β*j = bj · (sXj / sY)
• Result: the unit is standard deviations
• Betas indicate the effect of a 1 standard deviation change in Xj on Y
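A quick sketch of both routes to the betas in Python (simulated data; the two approaches should produce the same values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.6 * x1 + rng.normal(size=200)
y = 2 + 1.5 * x1 + 0.5 * x2 + rng.normal(size=200)

# Route 1: rescale the raw slopes, beta_j = b_j * (s_Xj / s_Y)
X = sm.add_constant(np.column_stack([x1, x2]))
b = sm.OLS(y, X).fit().params[1:]
betas = b * np.array([x1.std(), x2.std()]) / y.std()

# Route 2: z-score all variables first, then regress
z = lambda v: (v - v.mean()) / v.std()
Xz = sm.add_constant(np.column_stack([z(x1), z(x2)]))
print(betas, sm.OLS(z(y), Xz).fit().params[1:])   # same values
```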
Standardized Regression Coefficients
• Ex: education, family income, and job prestige
[SPSS output with a “Beta” column: an increase of 1 standard deviation in education results in a .52 standard deviation increase in job prestige]
• What is the interpretation of the “family income” beta?
• Betas give you a sense of which variables “matter most”
R-Square in Multiple Regression
• The multivariate R-square is much like the bivariate one: R² = SSregression / SStotal
• But SSregression is now based on the multivariate regression
• The addition of new variables results in better prediction of Y, less error (e), and a higher R-square.
R-Square in Multiple Regression
• Example: an R-square of .272 indicates that education and parental wealth together explain 27% of the variance in job prestige
• “Adjusted R-square” is a more conservative, more accurate measure in multiple regression: it penalizes the addition of variables that add little predictive power
• Generally, you should report the Adjusted R-square.
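A minimal sketch of the adjustment; the formula below is the standard one, and statsmodels reports it as rsquared_adj:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)       # pure noise: adds nothing real
y = 2 + 1.5 * x1 + rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

n, k = 200, 2                   # sample size, number of predictors
adj = 1 - (1 - fit.rsquared) * (n - 1) / (n - k - 1)
print(fit.rsquared, adj, fit.rsquared_adj)   # adj matches rsquared_adj
```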
Dummy Variables
• Question: how can we incorporate nominal variables (e.g., race, gender) into regression?
• Option 1: analyze each sub-group separately
• Generates a different slope and constant for each group
• Option 2: dummy variables
• “Dummy” = a dichotomous variable coded to indicate the presence or absence of something
• Absence coded as zero, presence coded as 1.
Dummy Variables
• Strategy: create a separate dummy variable for each nominal category
• Ex: gender – make female & male variables
• DFEMALE: coded as 1 for all women, zero for men
• DMALE: coded as 1 for all men, zero for women
• Next: include all but one of the dummy variables in a multiple regression model
• If two dummies, include 1; if 5 dummies, include 4.
Dummy Variables
• Question: why can’t you include DFEMALE and DMALE in the same regression model?
• Answer: they are perfectly (negatively) correlated: r = -1
• Result: the regression model “blows up”
• For any set of nominal categories, a full set of dummies contains redundant information
• DMALE and DFEMALE contain the same information
• Dropping one removes the redundant information (see the sketch below).
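A small pandas sketch of the coding step (the variable names DFEMALE and DMALE follow the slides’ convention; the data are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "education": [16, 12, 14, 18],
    "prestige": [55, 40, 48, 60],
})

# A full set of dummies is redundant: DFEMALE + DMALE = 1 in every row
df["DFEMALE"] = (df["gender"] == "female").astype(int)
df["DMALE"] = (df["gender"] == "male").astype(int)

# So include all but one in the regression; here we keep only DFEMALE,
# making males the baseline group
print(df[["education", "DFEMALE", "prestige"]])
```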
Dummy Variables: Interpretation
• Consider the following regression equation: Y = a + b1X + b2DFEMALE + e
• Question: what if the case is a male?
• Answer: DFEMALE is 0, so the entire b2 term becomes zero
• Result: males are modeled using the familiar regression model: a + b1X + e.
Dummy Variables: Interpretation
• Consider the same regression equation: Y = a + b1X + b2DFEMALE + e
• Question: what if the case is a female?
• Answer: DFEMALE is 1, so b2(1) stays in the equation (and is added to the constant)
• Result: females are modeled using a different regression line: (a + b2) + b1X + e
• Thus, the coefficient b2 reflects the difference in the constant for women.