
Linear Regression


Presentation Transcript


  1. Linear Regression

  2. Relationship Between Variables - Often we are interested in how one variable is correlated with another - Is more schooling correlated with higher wages? - What variables are correlated with a person’s choice of health insurance coverage?

  3. Scatter Plots (two panels: Cigarette Usage vs Lung Cancer Deaths; Cigarette Usage vs Leukemia Deaths)

  4. Correlation Coefficient • The correlation coefficient: r = Sxy / √(Sxx · Syy), where Sxy = Σ(xi − x̄)(yi − ȳ), Sxx = Σ(xi − x̄)², and Syy = Σ(yi − ȳ)²

  5. Correlation Coefficient • The correlation coefficient r measures the “strength” of the linear correlation between 2 variables (x and y) • Implications of the correlation coefficient r: • If r is negative: as X increases, Y tends to decrease • If r is positive: as X increases, Y tends to increase • The “strength” of this relationship increases as the absolute value of r increases • The correlation coefficient r does not tell you how much Y changes for a given change in X
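The correlation coefficient can be computed directly from its definition. A minimal Python sketch follows; the data are made up purely for illustration (x = years of schooling, y = hourly wage):

```python
# Correlation coefficient r computed from its definition.
# Hypothetical data: x = years of schooling, y = hourly wage.
x = [8, 10, 12, 14, 16]
y = [20, 24, 27, 33, 36]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross / squared deviations from the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / (s_xx * s_yy) ** 0.5
print(round(r, 4))  # close to 1: strong positive linear association
```

Note that r being near 1 says the points lie close to *some* upward-sloping line; it does not say how steep that line is.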

  6. Correlation Coefficients • Cigarette Usage vs Lung Cancer Deaths: r = 0.6974 • Cigarette Usage vs Leukemia Deaths: r = −0.0685

  7. Regression • We use regression to estimate the relationship between two variables • For a 1 unit change in X, how much can we expect Y to change? • The relationship between X and Y can take any functional form • y = f(x)

  8. Linear Deterministic Model • Assume that the relationship between X and Y is linear: Y = α + βX • The parameter α tells you the Y intercept • The parameter β tells you how much Y changes for a 1 unit change in X • β is the slope

  9. Linear Model (figure: a line with Y intercept α = 4 and slope β = 0.5)

  10. Linear Probabilistic Model • Real-life data rarely fall exactly on a line • There are random errors of observation, measurement, sampling, etc. • How do we deal with this? • Add an error term (ε) to the linear model: Y = α + βX + ε • ε is a random variable with expected value 0 ( E(ε) = 0 ) and variance σ²

  11. Yi = α + βXi + εi (α: intercept, β: slope)

  12. Linear Probabilistic Model: Assumptions • For the model Yi = α + βXi + εi, where i = 1, 2, 3, … indexes observations • E(εi) = 0 for all i • Cov(εi, εj) = 0 if i ≠ j (independence) • Var(εi) = σ² for all i (constant variance)

  13. Linear Probabilistic Model: Assumptions • If we assume the εi’s are normally distributed, this is a visualization of the model (figure)

  14. Regression Model: Yi = α + βXi + εi • The X’s and Y’s come from sample data • We want to estimate α and β • We can express the goal another way: we want to draw the “best” line through the scatter plot

  15. Ordinary Least Squares (OLS) • Define: α̂ and β̂ are estimates of the unknown parameters; ŷi = α̂ + β̂xi is the estimated value of Y; the residual ei = yi − ŷi is the estimation error • One way to estimate α and β is to minimize the sum of squared residuals: RSS = Σ ei² = Σ (yi − α̂ − β̂xi)² • Note: RSS is denoted in the book as Sum of Squared Errors (SSE); these are the same thing (residual is a synonym for error in this context)

  16. OLS Estimators • β̂ = Sxy / Sxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² • α̂ = ȳ − β̂x̄
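The OLS estimators can be sketched in a few lines of Python. The data below are made up; they were chosen to lie near the line y = 2x so the estimates are easy to eyeball:

```python
# OLS estimators: beta_hat = S_xy / S_xx, alpha_hat = y_bar - beta_hat * x_bar.
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta_hat = s_xy / s_xx                 # slope estimate
alpha_hat = y_bar - beta_hat * x_bar   # intercept estimate
print(round(alpha_hat, 3), round(beta_hat, 3))
```

The slope comes out near 2 and the intercept near 0, as the data construction suggests.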

  17. Are OLS Estimators Good? If the assumptions about the error term hold, then: • Least squares estimators of the slope and intercept terms are unbiased estimators • Least squares estimators of the slope and intercept terms are the most efficient linear estimators

  18. OLS Estimates using Least Squares • Regress lung cancer on smoking: α-hat = 6.47, β-hat = 0.529 • Regress leukemia on smoking: α-hat = 7.025, β-hat = −0.007

  19. STATA Regression Results using OLS . regress lung cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 39.77 Model | 373.878411 1 373.878411 Prob > F = 0.0000 Residual | 394.833147 42 9.40078921 R-squared = 0.4864 -------------+------------------------------ Adj R-squared = 0.4741 Total | 768.711558 43 17.877013 Root MSE = 3.0661 ------------------------------------------------------------------------------ lung | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | .5290779 .0838951 6.31 0.000 .3597707 .6983851 _cons | 6.471687 2.140669 3.02 0.004 2.151641 10.79173 ------------------------------------------------------------------------------ • Interpretation: a 1 unit change in cigarette consumption per capita is correlated with a 0.529 increase in lung cancer deaths per 100k • Correlation ≠ Causation!!

  20. Inferences using OLS • How confident are we in the least squares estimates? • Do the data present sufficient evidence that y increases (or decreases) linearly as x increases over the region of observation? • To answer this question, we need to estimate σ2, the variance of the error term

  21. An estimator for σ² • σ is a measure of how far the observed values of y fall from the predicted values ŷ • Use the Residual Sum of Squares to estimate σ²: s² = RSS / (n − 2) • We’ll rely on STATA to compute this for us
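The estimator s² = RSS / (n − 2) divides by n − 2 because two parameters (α and β) were estimated from the data. A self-contained sketch with made-up data:

```python
# Estimating sigma^2 with s^2 = RSS / (n - 2).
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar

# residual = observed y minus fitted y
rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
s_squared = rss / (n - 2)  # n - 2: two parameters were estimated
print(round(s_squared, 4))
```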

  22. Hypothesis Testing: Slope of the Linear Probabilistic Model • β is the slope of the linear probabilistic model (the average change in Y for a 1 unit change in X) • Is the slope of the line (β) non-zero? • We can use hypothesis testing to determine this

  23. Hypothesis Testing: Slope of the Linear Probabilistic Model • The estimate β-hat depends on the random sample of size n that is drawn from the population • If we drew all possible random samples of size n from the population we could construct a sampling distribution of β-hat

  24. Hypothesis Testing: Slope of the Linear Probabilistic Model • If we assume the error term ε is normally distributed (in addition to the previous assumptions) then the test statistic follows a t distribution with (n − 2) degrees of freedom

  25. Hypothesis Test: Slope of the Line • Null Hypothesis: H0: β = 0 • Alternative Hypothesis: Ha: β ≠ 0 • Test Statistic: t = β̂ / SE(β̂) • Critical Value: reject H0 if |t| > tα/2 with (n − 2) degrees of freedom, or use the p-value
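The test statistic t = β̂ / SE(β̂), with SE(β̂) = √(s² / Sxx), can be sketched as follows. The data are made up; the critical value 3.182 is t with 3 d.f. at α = .05 (two-sided), taken from a t table:

```python
import math

# t test for H0: beta = 0 vs Ha: beta != 0.
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
se_beta = math.sqrt(rss / (n - 2) / s_xx)  # SE(beta_hat) = sqrt(s^2 / S_xx)

t_stat = beta_hat / se_beta
print(round(t_stat, 2))  # |t| far exceeds t_{0.025, 3} = 3.182, so reject H0
```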

  26. A (1−α)100% Confidence Interval for β • β̂ ± tα/2 · SE(β̂) • Where tα/2 is based on (n − 2) degrees of freedom
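A sketch of the confidence interval β̂ ± tα/2 · SE(β̂), with made-up data and the critical value 3.182 (t with 3 d.f. at α = .05, from a t table) hard-coded for simplicity:

```python
import math

# 95% confidence interval for beta: beta_hat +/- t_{0.025, n-2} * SE(beta_hat).
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar
rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
se_beta = math.sqrt(rss / (n - 2) / s_xx)

t_crit = 3.182  # t_{0.025} with n - 2 = 3 degrees of freedom
lo, hi = beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta
print(round(lo, 3), round(hi, 3))
```

Because 0 is not inside the interval, the CI leads to the same conclusion as the t test: reject H0: β = 0 at the 5% level.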

  27. STATA Regression Results (OLS) . regress lung cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 39.77 Model | 373.878411 1 373.878411 Prob > F = 0.0000 Residual | 394.833147 42 9.40078921 R-squared = 0.4864 -------------+------------------------------ Adj R-squared = 0.4741 Total | 768.711558 43 17.877013 Root MSE = 3.0661 ------------------------------------------------------------------------------ lung | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | .5290779 .0838951 6.31 0.000 .3597707 .6983851 _cons | 6.471687 2.140669 3.02 0.004 2.151641 10.79173 ------------------------------------------------------------------------------ • In this example the 95% confidence interval for the population parameter β is between .3597707 and .6983851 • We reject the null hypothesis that β is equal to 0. • The test statistic under the null hypothesis is (.529 – 0) / (.084) = 6.31. This is greater than the critical value 1.645 for α = .1; 1.96 for α = .05; and 2.58 for α = .01 • The p-value for the hypothesis test that the coefficient β is equal to 0 is 0.000, so we reject the null hypothesis at any conventional significance level

  28. STATA Regression Results (OLS) . regress leukemia cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 0.20 Model | .082149606 1 .082149606 Prob > F = 0.6587 Residual | 17.4349463 42 .415117769 R-squared = 0.0047 -------------+------------------------------ Adj R-squared = -0.0190 Total | 17.5170959 43 .407374323 Root MSE = .6443 ------------------------------------------------------------------------------ leukemia | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | -.0078425 .0176295 -0.44 0.659 -.0434203 .0277352 _cons | 7.025163 .4498349 15.62 0.000 6.117359 7.932966 ------------------------------------------------------------------------------ • In this example the 95% confidence interval for the population parameter β is between -.043 and .028 • We fail to reject the null hypothesis that β is equal to 0. • The test statistic under the null hypothesis is (-.008 - 0) / (.018) = -0.44. Its absolute value is less than the critical value for α = .1; α = .05; or α = .01 • The p-value for the hypothesis test that the coefficient β is equal to 0 is 0.659, so we fail to reject the null hypothesis at any conventional significance level (we would reject only if α exceeded 0.659)

  29. Linear Regression with CI: Lung Cancer Deaths vs Smoking (figure)

  30. Linear Regression with CI: Leukemia Deaths vs Smoking (figure)

  31. More on OLS • Using similar methodology we can compute an estimate, confidence intervals, and hypothesis tests for α • (I won’t show you the derivations in this class; the results are output in STATA) • Even if we fail to reject the null hypothesis that the parameter β = 0, this doesn’t necessarily mean that the variables X and Y are unrelated (the relationship may be nonlinear)

  32. Linear Probabilistic Models (variations) • Some other possibilities are: which implies which implies

  33. Prediction • We can use our estimates α̂ and β̂ to generate a predicted value for y: ŷ = α̂ + β̂x • This is the regression line • The regression line always goes through the point of sample means (x̄, ȳ) • ŷ is the estimated expected value of y given x

  34. Prediction • ŷ is a random variable • There is variability due to the random sample used to estimate α and β, and there is variability from the error term ε • The standardized prediction error follows a t distribution with (n − 2) d.f.

  35. Prediction • For a given value x0, the point estimate is ŷ0 = α̂ + β̂x0 • The interval estimate is ŷ0 ± tα/2 · s · √(1 + 1/n + (x0 − x̄)² / Sxx), where tα/2 has (n − 2) degrees of freedom and s² = RSS / (n − 2)
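The prediction interval for a new observation at x0 can be sketched as follows. The data are made up; 3.182 is t with 3 d.f. at α = .05 from a t table. The "1 +" inside the square root is for predicting an individual new y; it is dropped when the target is the mean of y at x0:

```python
import math

# Prediction interval: y0_hat +/- t * s * sqrt(1 + 1/n + (x0 - x_bar)^2 / S_xx).
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar
rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(rss / (n - 2))

x0 = 3  # stays inside the observed range of x, per the warning on the next slide
y0_hat = alpha_hat + beta_hat * x0
half_width = 3.182 * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
print(round(y0_hat, 2), round(half_width, 2))
```

The interval is narrowest at x0 = x̄ and widens as x0 moves away from the mean of the observed x’s.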

  36. Prediction • Use the ‘predict’ command after running a regression in STATA to generate predicted values for Y • Word of Warning: Predicted values of y-hat are only valid over the intervals of x that are observed • Ex: Suppose you regress “Earnings” on “age” for sample data that includes people ages 25 to 55. • If you compute a predicted value for earnings at age 100 you will get a very wrong answer

  37. Goodness of Fit • How well does the regression line explain the variation of the dependent variable? • Note: this is a different question from whether the estimates are significant or not • A perfect regression line would pass through all data points with no error; 100% of the variation of Y would be explained by the regression line

  38. • Total sum of squares (TSS) is the sum of squared deviations of the dependent variable around its mean and is a measure of the total variability of the variable: TSS = Σ(yi − ȳ)² • Explained sum of squares (ESS) is the sum of squared deviations of the predicted values of Y around its mean: ESS = Σ(ŷi − ȳ)² • Residual sum of squares (RSS) is the sum of squared deviations of the residuals around their mean value of zero: RSS = Σ(yi − ŷi)²

  39. • Decomposition of variation of Y: TSS = ESS + RSS • The measure of goodness-of-fit is the proportion of the variation of Y that is explained by the model. This measure is called R² (the coefficient of determination): R² = ESS / TSS = 1 − RSS / TSS • Properties of R²: 1) 0 ≤ R² ≤ 1 2) The closer R² is to 1, the better the model fits the data
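The decomposition TSS = ESS + RSS (which holds for OLS with an intercept) and the resulting R² can be verified numerically with made-up data:

```python
# Verifying TSS = ESS + RSS and computing R^2 = ESS / TSS.
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar
fitted = [alpha_hat + beta_hat * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)                   # total variability
ess = sum((fi - y_bar) ** 2 for fi in fitted)              # explained by the line
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))     # left in the residuals

r_squared = ess / tss
print(round(r_squared, 4))  # share of the variation in y explained by the line
```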

  40. STATA Regression Results (OLS) . regress lung cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 39.77 Model | 373.878411 1 373.878411 Prob > F = 0.0000 Residual | 394.833147 42 9.40078921 R-squared = 0.4864 -------------+------------------------------ Adj R-squared = 0.4741 Total | 768.711558 43 17.877013 Root MSE = 3.0661 ------------------------------------------------------------------------------ lung | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | .5290779 .0838951 6.31 0.000 .3597707 .6983851 _cons | 6.471687 2.140669 3.02 0.004 2.151641 10.79173 ------------------------------------------------------------------------------ • The R-squared for this regression is 0.4864. The linear regression model can account for 48.6% of the variability in Y (lung cancer deaths).

  41. STATA Regression Results (OLS) . regress leukemia cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 0.20 Model | .082149606 1 .082149606 Prob > F = 0.6587 Residual | 17.4349463 42 .415117769 R-squared = 0.0047 -------------+------------------------------ Adj R-squared = -0.0190 Total | 17.5170959 43 .407374323 Root MSE = .6443 ------------------------------------------------------------------------------ leukemia | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | -.0078425 .0176295 -0.44 0.659 -.0434203 .0277352 _cons | 7.025163 .4498349 15.62 0.000 6.117359 7.932966 ------------------------------------------------------------------------------ • The R-squared for this regression is 0.0047. The linear regression model can account for 0.47% of the variability in Y (leukemia deaths).
