
Linear Regression


Presentation Transcript


  1. Linear Regression

  2. Relationship Between Variables - Often we are interested in how one variable is correlated with another - Is more schooling correlated with higher wages? - What variables are correlated with a person’s choice of health insurance coverage?

  3. Scatter Plots (two panels: Cigarette Usage vs Lung Cancer Deaths; Cigarette Usage vs Leukemia Deaths)

  4. Correlation Coefficient • The correlation coefficient: r = Sxy / √(Sxx · Syy), where Sxy = Σ(xi − x̄)(yi − ȳ), Sxx = Σ(xi − x̄)², and Syy = Σ(yi − ȳ)²

  5. Correlation Coefficient • The correlation coefficient r measures the “strength” of the linear correlation between 2 variables (x and y) • Implications of the correlation coefficient r: • If r is negative: as X increases, Y tends to decrease • If r is positive: as X increases, Y tends to increase • The “strength” of this relationship increases as the absolute value of r increases • The correlation coefficient r does not tell you how much Y changes for a given change in X
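The correlation coefficient can be computed directly from its definition. A minimal Python sketch follows; the data are made up purely for illustration (x = years of schooling, y = hourly wage):

```python
# Correlation coefficient r computed from its definition.
# Hypothetical data: x = years of schooling, y = hourly wage.
x = [8, 10, 12, 14, 16]
y = [20, 24, 27, 33, 36]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross / squared deviations from the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / (s_xx * s_yy) ** 0.5
print(round(r, 4))  # close to 1: strong positive linear association
```

Note that r being near 1 says the points lie close to *some* upward-sloping line; it does not say how steep that line is.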

  6. Correlation Coefficients • Cigarette Usage vs Lung Cancer Deaths: r = 0.6974 • Cigarette Usage vs Leukemia Deaths: r = −0.0685

  7. Regression • We use regression to estimate the relationship between two variables • For a 1 unit change in X, how much can we expect Y to change? • The relationship between X and Y can take any functional form • y = f(x)

  8. Linear Deterministic Model • Assume that the relationship between X and Y is linear: Y = α + βX • The parameter α tells you the Y intercept • The parameter β tells you how much Y changes for a 1 unit change in X • β is the slope

  9. Linear Model (figure: a line with Y intercept α = 4 and slope β = 0.5)

  10. Linear Probabilistic Model • Real-life data rarely fall exactly on a line • There are random errors of observation, measurement, sampling, etc. • How do we deal with this? • Add an error term (ε) to the linear model: Y = α + βX + ε • ε is a random variable with expected value 0 ( E(ε) = 0 ) and variance σ²

  11. Yi = α + βXi + εi (α: intercept, β: slope)

  12. Linear Probabilistic Model: Assumptions • For the model Yi = α + βXi + εi, where i = 1, 2, 3, … indexes observations • E(εi) = 0 for all i • Cov(εi, εj) = 0 if i ≠ j (independence) • Var(εi) = σ² for all i (constant variance)

  13. Linear Probabilistic Model: Assumptions • If we assume the εi’s are normally distributed, this is a visualization of the model (figure)

  14. Regression Model: Yi = α + βXi + εi • The X’s and Y’s come from sample data • We want to estimate α and β • We can express the goal another way: we want to draw the “best” line through the scatter plot

  15. Ordinary Least Squares (OLS) • Define: α̂ and β̂ are estimates of the unknown parameters; ŷi = α̂ + β̂xi is the estimated value of Y; the residual ei = yi − ŷi is the estimation error • One way to estimate α and β is to minimize the sum of squared residuals: RSS = Σ ei² = Σ (yi − α̂ − β̂xi)² • Note: RSS is denoted in the book as Sum of Squared Errors (SSE); these are the same thing (residual is a synonym for error in this context)

  16. OLS Estimators • β̂ = Sxy / Sxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² • α̂ = ȳ − β̂x̄
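The OLS estimators can be sketched in a few lines of Python. The data below are made up; they were chosen to lie near the line y = 2x so the estimates are easy to eyeball:

```python
# OLS estimators: beta_hat = S_xy / S_xx, alpha_hat = y_bar - beta_hat * x_bar.
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta_hat = s_xy / s_xx                 # slope estimate
alpha_hat = y_bar - beta_hat * x_bar   # intercept estimate
print(round(alpha_hat, 3), round(beta_hat, 3))
```

The slope comes out near 2 and the intercept near 0, as the data construction suggests.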

  17. Are OLS Estimators Good? If the assumptions about the error term hold, then: • Least squares estimators of the slope and intercept terms are unbiased estimators • Least squares estimators of the slope and intercept terms are the most efficient linear estimators

  18. OLS Estimates using Least Squares • Regress lung cancer on smoking: α-hat = 6.47, β-hat = 0.529 • Regress leukemia on smoking: α-hat = 7.025, β-hat = −0.007

  19. STATA Regression Results using OLS . regress lung cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 39.77 Model | 373.878411 1 373.878411 Prob > F = 0.0000 Residual | 394.833147 42 9.40078921 R-squared = 0.4864 -------------+------------------------------ Adj R-squared = 0.4741 Total | 768.711558 43 17.877013 Root MSE = 3.0661 ------------------------------------------------------------------------------ lung | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | .5290779 .0838951 6.31 0.000 .3597707 .6983851 _cons | 6.471687 2.140669 3.02 0.004 2.151641 10.79173 ------------------------------------------------------------------------------ • Interpretation: a 1 unit change in cigarette consumption per capita is correlated with a 0.529 increase in lung cancer deaths per 100k • Correlation ≠ Causation!!

  20. Inferences using OLS • How confident are we in the least squares estimates? • Do the data present sufficient evidence that y increases (or decreases) linearly as x increases over the region of observation? • To answer this question, we need to estimate σ2, the variance of the error term

  21. An estimator for σ² • σ is a measure of how far the observed values of y fall from the predicted values ŷ • Use the Residual Sum of Squares to estimate σ²: s² = RSS / (n − 2) • We’ll rely on STATA to compute this for us
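The estimator s² = RSS / (n − 2) divides by n − 2 because two parameters (α and β) were estimated from the data. A self-contained sketch with made-up data:

```python
# Estimating sigma^2 with s^2 = RSS / (n - 2).
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar

# residual = observed y minus fitted y
rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
s_squared = rss / (n - 2)  # n - 2: two parameters were estimated
print(round(s_squared, 4))
```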

  22. Hypothesis Testing: Slope of the Linear Probabilistic Model • β is the slope of the linear probabilistic model (the average change in Y for a 1 unit change in X) • Is the slope of the line (β) non-zero? • We can use hypothesis testing to determine this

  23. Hypothesis Testing: Slope of the Linear Probabilistic Model • The estimate β-hat depends on the random sample of size n that is drawn from the population • If we drew all possible random samples of size n from the population we could construct a sampling distribution of β-hat

  24. Hypothesis Testing: Slope of the Linear Probabilistic Model • If we assume the error term ε is normally distributed (in addition to the previous assumptions) then the test statistic follows a t distribution with (n − 2) degrees of freedom

  25. Hypothesis Test: Slope of the Line • Null Hypothesis: H0: β = 0 • Alternative Hypothesis: Ha: β ≠ 0 • Test Statistic: t = β̂ / SE(β̂) • Critical Value: reject H0 if |t| > tα/2 with (n − 2) degrees of freedom, or use the p-value
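The test statistic t = β̂ / SE(β̂), with SE(β̂) = √(s² / Sxx), can be sketched as follows. The data are made up; the critical value 3.182 is t with 3 d.f. at α = .05 (two-sided), taken from a t table:

```python
import math

# t test for H0: beta = 0 vs Ha: beta != 0.
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar

rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
se_beta = math.sqrt(rss / (n - 2) / s_xx)  # SE(beta_hat) = sqrt(s^2 / S_xx)

t_stat = beta_hat / se_beta
print(round(t_stat, 2))  # |t| far exceeds t_{0.025, 3} = 3.182, so reject H0
```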

  26. A (1−α)100% Confidence Interval for β • β̂ ± tα/2 · SE(β̂) • Where tα/2 is based on (n − 2) degrees of freedom
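A sketch of the confidence interval β̂ ± tα/2 · SE(β̂), with made-up data and the critical value 3.182 (t with 3 d.f. at α = .05, from a t table) hard-coded for simplicity:

```python
import math

# 95% confidence interval for beta: beta_hat +/- t_{0.025, n-2} * SE(beta_hat).
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar
rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
se_beta = math.sqrt(rss / (n - 2) / s_xx)

t_crit = 3.182  # t_{0.025} with n - 2 = 3 degrees of freedom
lo, hi = beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta
print(round(lo, 3), round(hi, 3))
```

Because 0 is not inside the interval, the CI leads to the same conclusion as the t test: reject H0: β = 0 at the 5% level.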

  27. STATA Regression Results (OLS) . regress lung cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 39.77 Model | 373.878411 1 373.878411 Prob > F = 0.0000 Residual | 394.833147 42 9.40078921 R-squared = 0.4864 -------------+------------------------------ Adj R-squared = 0.4741 Total | 768.711558 43 17.877013 Root MSE = 3.0661 ------------------------------------------------------------------------------ lung | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | .5290779 .0838951 6.31 0.000 .3597707 .6983851 _cons | 6.471687 2.140669 3.02 0.004 2.151641 10.79173 ------------------------------------------------------------------------------ • In this example the 95% confidence interval for the population parameter β is between .3597707 and .6983851 • We reject the null hypothesis that β is equal to 0. • The test statistic under the null hypothesis is (.529 – 0) / (.084) = 6.31. This is greater than the critical value 1.645 for α = .1; 1.96 for α = .05; and 2.58 for α = .01 • The p-value for the hypothesis test that the coefficient β is equal to 0 is 0.000, so we reject the null hypothesis at any conventional significance level

  28. STATA Regression Results (OLS) . regress leukemia cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 0.20 Model | .082149606 1 .082149606 Prob > F = 0.6587 Residual | 17.4349463 42 .415117769 R-squared = 0.0047 -------------+------------------------------ Adj R-squared = -0.0190 Total | 17.5170959 43 .407374323 Root MSE = .6443 ------------------------------------------------------------------------------ leukemia | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | -.0078425 .0176295 -0.44 0.659 -.0434203 .0277352 _cons | 7.025163 .4498349 15.62 0.000 6.117359 7.932966 ------------------------------------------------------------------------------ • In this example the 95% confidence interval for the population parameter β is between -.043 and .028 • We fail to reject the null hypothesis that β is equal to 0. • The test statistic under the null hypothesis is (-.008 - 0) / (.018) = -0.44. Its absolute value is less than the critical value for α = .1; α = .05; or α = .01 • The p-value for the hypothesis test that the coefficient β is equal to 0 is 0.659, so we fail to reject the null hypothesis at any conventional significance level (we would reject only if α exceeded 0.659)

  29. Linear Regression with CI: Lung Cancer Deaths vs Smoking (figure)

  30. Linear Regression with CI: Leukemia Deaths vs Smoking (figure)

  31. More on OLS • Using similar methodology we can compute an estimate, confidence intervals, and hypothesis tests for α • (I won’t show you the derivations in this class; the results are output in STATA) • Even if we fail to reject the null hypothesis that the parameter β = 0, this doesn’t necessarily mean that the variables X and Y are unrelated (the relationship may be nonlinear)

  32. Linear Probabilistic Models (variations) • Some other possibilities are: which implies which implies

  33. Prediction • We can use our estimates α̂ and β̂ to generate a predicted value for y: ŷ = α̂ + β̂x • This is the regression line • The regression line always goes through the point of sample means (x̄, ȳ) • ŷ is the estimated expected value of y given x

  34. Prediction • ŷ is a random variable • There is variability due to the random sample used to estimate α and β, and there is variability from the error term ε • The standardized prediction error follows a t distribution with (n − 2) d.f.

  35. Prediction • For a given value x0, the point estimate is ŷ0 = α̂ + β̂x0 • The interval estimate is ŷ0 ± tα/2 · s · √(1 + 1/n + (x0 − x̄)² / Sxx), where tα/2 has (n − 2) degrees of freedom and s² = RSS / (n − 2)
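The prediction interval for a new observation at x0 can be sketched as follows. The data are made up; 3.182 is t with 3 d.f. at α = .05 from a t table. The "1 +" inside the square root is for predicting an individual new y; it is dropped when the target is the mean of y at x0:

```python
import math

# Prediction interval: y0_hat +/- t * s * sqrt(1 + 1/n + (x0 - x_bar)^2 / S_xx).
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
alpha_hat = y_bar - beta_hat * x_bar
rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(rss / (n - 2))

x0 = 3  # stays inside the observed range of x, per the warning on the next slide
y0_hat = alpha_hat + beta_hat * x0
half_width = 3.182 * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
print(round(y0_hat, 2), round(half_width, 2))
```

The interval is narrowest at x0 = x̄ and widens as x0 moves away from the mean of the observed x’s.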

  36. Prediction • Use the ‘predict’ command after running a regression in STATA to generate predicted values for Y • Word of Warning: Predicted values of y-hat are only valid over the intervals of x that are observed • Ex: Suppose you regress “Earnings” on “age” for sample data that includes people ages 25 to 55. • If you compute a predicted value for earnings at age 100 you will get a very wrong answer

  37. Goodness of Fit • How well does the regression line explain the variation of the dependent variable? • Note: this is a different question from whether the estimates are significant or not • A perfect regression line would pass through all data points with no error; 100% of the variation of Y would be explained by the regression line

  38. • Total sum of squares (TSS) is the sum of squared deviations of the dependent variable around its mean and is a measure of the total variability of the variable: TSS = Σ(yi − ȳ)² • Explained sum of squares (ESS) is the sum of squared deviations of the predicted values of Y around its mean: ESS = Σ(ŷi − ȳ)² • Residual sum of squares (RSS) is the sum of squared deviations of the residuals around their mean value of zero: RSS = Σ(yi − ŷi)²

  39. • Decomposition of variation of Y: TSS = ESS + RSS • The measure of goodness-of-fit is the proportion of the variation of Y that is explained by the model. This measure is called R² (the coefficient of determination): R² = ESS / TSS = 1 − RSS / TSS • Properties of R²: 1) 0 ≤ R² ≤ 1 2) The closer R² is to 1, the better the model fits the data
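The decomposition TSS = ESS + RSS (which holds for OLS with an intercept) and the resulting R² can be verified numerically with made-up data:

```python
# Verifying TSS = ESS + RSS and computing R^2 = ESS / TSS.
# Hypothetical data scattered around y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
alpha_hat = y_bar - beta_hat * x_bar
fitted = [alpha_hat + beta_hat * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)                   # total variability
ess = sum((fi - y_bar) ** 2 for fi in fitted)              # explained by the line
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))     # left in the residuals

r_squared = ess / tss
print(round(r_squared, 4))  # share of the variation in y explained by the line
```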

  40. STATA Regression Results (OLS) . regress lung cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 39.77 Model | 373.878411 1 373.878411 Prob > F = 0.0000 Residual | 394.833147 42 9.40078921 R-squared = 0.4864 -------------+------------------------------ Adj R-squared = 0.4741 Total | 768.711558 43 17.877013 Root MSE = 3.0661 ------------------------------------------------------------------------------ lung | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | .5290779 .0838951 6.31 0.000 .3597707 .6983851 _cons | 6.471687 2.140669 3.02 0.004 2.151641 10.79173 ------------------------------------------------------------------------------ • The R-squared for this regression is 0.4864. The linear regression model can account for 48.6% of the variability in Y (lung cancer deaths).

  41. STATA Regression Results (OLS) . regress leukemia cig Source | SS df MS Number of obs = 44 -------------+------------------------------ F( 1, 42) = 0.20 Model | .082149606 1 .082149606 Prob > F = 0.6587 Residual | 17.4349463 42 .415117769 R-squared = 0.0047 -------------+------------------------------ Adj R-squared = -0.0190 Total | 17.5170959 43 .407374323 Root MSE = .6443 ------------------------------------------------------------------------------ leukemia | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- cig | -.0078425 .0176295 -0.44 0.659 -.0434203 .0277352 _cons | 7.025163 .4498349 15.62 0.000 6.117359 7.932966 ------------------------------------------------------------------------------ • The R-squared for this regression is 0.0047. The linear regression model can account for 0.47% of the variability in Y (leukemia deaths).
