EC339: Lecture 6 Chapter 5: Interpreting OLS Regression
Regression as Double Compression • DoubleCompression.xls • EastNorthCentralFTWorkers.xls • [DoubleCompression]SATData • Math and Verbal scores • Note simple statistics summary and correlation • Conditional Mean E[Y|X] • Start with Scatterplot
First Compression • Examine values of Y, over various small ranges of X • Recognize the ‘variance’ of Y in these strips • Create a conditional mean function • Average values of Y, given X (VerticalStrips) • Slide bar along X-axis • Move to Accordian • Examine 0, 2, 4, and Many intervals
First Compression • Before Compressing: What is best guess at Math SAT unconditional on X? • After Compressing: What is best guess at Math SAT conditional on X? • Individual variation is hidden in “Graph of Averages” • Graphical equivalent of PivotTable (See AccordianPivot) • 500+ observations summarized with 35
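The first compression can be sketched outside Excel as well. The block below is a minimal illustration, not the workbook's method: it uses simulated scores (the coefficients and noise level are made up, and `graph_of_averages` is a hypothetical helper name) to average Y within vertical strips of X.

```python
from collections import defaultdict
import random

random.seed(0)  # reproducible simulated data, NOT the SAT data from the workbook

# Hypothetical (verbal, math) scores; the numbers below are made up for illustration.
verbal = [random.randrange(300, 800, 10) for _ in range(500)]
math_score = [320 + 0.54 * v + random.gauss(0, 70) for v in verbal]

def graph_of_averages(x, y, width):
    """Average y within vertical strips of x that are `width` units wide."""
    strips = defaultdict(list)
    for xi, yi in zip(x, y):
        strips[(xi // width) * width].append(yi)  # key = left edge of the strip
    return {left: sum(vals) / len(vals) for left, vals in sorted(strips.items())}

# Best guess before compressing: the unconditional average of Y.
unconditional_guess = sum(math_score) / len(math_score)

# Best guess after compressing: the average of Y within the strip that contains X.
averages = graph_of_averages(verbal, math_score, width=50)
for left_edge, avg in averages.items():
    print(f"Verbal {left_edge}-{left_edge + 49}: average Math = {avg:.0f}")
```

Here 500 simulated observations are summarized by 10 strip averages, which mirrors what the PivotTable does with the real data.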
Second Compression • Linearizing the graph of averages • Regression line with observations and averages • Smooth linear version of the plot of averages • Predicted Math SAT = 318.9 + 0.539 Verbal SAT • This equation gives results SIMILAR to the previous compression • Now summarized with two numbers • Interpret as predicted Y given X
Only an interpretation, not a method • “Double Compression” of the data is an interpretation of regression, not how it is done • Method is either analytical or numerical (or using an algorithm) • In [EastNorthCentralFTWorkers]Regression • Given education, estimate annual earnings • PivotTable: E[earn|edu=12] = 27,933 • Regression: E[earn|edu=12] = 28,399 • What else might be going on here?
Regression and SD Line • [DoubleCompression]SATData • Examine two lines • SD Line: y = 94.6 + 0.978x • Reg Line: y = 318.94 + 0.539x • Notice slope of Regression line is shallower • Remember equation for slope in SIMPLE regression • Notice poor prediction with SD line • Calculate residuals
Another Example (SD vs. OLS) • Go back to Reg.xls, and calculate SSR from both lines (SDx = 4.18, SDy = 50.56, xbar = 7.5, ybar = 145.6) • OLS Line: y = 88.956 + 7.558x • Calculate SD Line • Compare SSR from SD and OLS lines • What does point of averages mean here?
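OLS Regression vs. SD • Simple Linear Regression (SLR) • OLS slope = r × (SDy / SDx), that is, the SD line slope times the correlation • Because |r| is at most 1, the regression slope is never steeper than the SD line slope (and is strictly shallower unless r = ±1)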
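The relationship between the two lines can be checked numerically. This is a sketch on a small made-up data set (not Reg.xls); the helper names `line_through_averages` and `ssr` are illustrative. Both lines pass through the point of averages, and the SD line is assumed to take the sign of the correlation.

```python
import math

# Small made-up data set for illustration (not from Reg.xls).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 4, 8, 7, 12, 11, 15, 19]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sd_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / n)
sd_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n * sd_x * sd_y)

def line_through_averages(slope):
    """Return (intercept, slope) for the line with this slope through (xbar, ybar)."""
    return ybar - slope * xbar, slope

def ssr(intercept, slope):
    """Sum of squared residuals of the data around the given line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

sd_slope = math.copysign(sd_y / sd_x, r)   # SD line slope
ols_slope = r * sd_y / sd_x                # OLS slope = r * SD line slope

sd_line = line_through_averages(sd_slope)
ols_line = line_through_averages(ols_slope)

print(f"SD line:  y = {sd_line[0]:.3f} + {sd_line[1]:.3f}x, SSR = {ssr(*sd_line):.2f}")
print(f"OLS line: y = {ols_line[0]:.3f} + {ols_line[1]:.3f}x, SSR = {ssr(*ols_line):.2f}")
```

The OLS line always has the smaller (or equal) SSR, which is why the SD line predicts poorly even though it looks like a natural summary of the cloud.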
Two Regression Lines • Open TwoRegressionLines.xls • In PivotTables what do you notice? • What happens in table TwoLines? • Do equations change when you switch axes? • Compare with SDLine • How do you phrase the different regression lines? • Do these lines have different meanings? • Can you just solve one regression line to find the other? • “Someone who is 89 points above the mean in Verbal (1 sd) is predicted to be 0.55 x 87 (r x SDmath) or 48 points above the mean Math score” (thus regress!)
Two Regression Lines • Given a verbal score, what is the best guess of a person’s math score? • Predicted Math SAT = 318 + 0.54 Verbal SAT • If Verbal = 600, Predicted Math SAT = 642 • Given a math score, what is the best guess of a person’s verbal score? • Solve the equation above for verbal? • Predicted Verbal SAT = -589 + 1.85 Math SAT • NO!!! This is not correct! • Must regress verbal on math (verbal is predicted) • Predicted Verbal SAT = 176.1 + 0.564 Math SAT • If Math SAT = 642, you would predict Verbal SAT = 538
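The arithmetic on this slide can be replayed in a few lines. The coefficients are the rounded values quoted above; the function names are illustrative only.

```python
def predict_math(verbal):
    """Regression of Math on Verbal (rounded coefficients from the slide)."""
    return 318 + 0.54 * verbal

def predict_verbal(math_score):
    """Regression of Verbal on Math (coefficients from the slide)."""
    return 176.1 + 0.564 * math_score

def invert_math_line(math_score):
    """Algebraically solve the Math-on-Verbal line for Verbal. NOT a regression."""
    return (math_score - 318) / 0.54

math_hat = predict_math(600)                 # predicted Math for Verbal = 600
verbal_hat = predict_verbal(math_hat)        # correct: regress Verbal on Math
verbal_inverted = invert_math_line(math_hat) # incorrect shortcut

print(f"Predicted Math given Verbal = 600: {math_hat:.0f}")
print(f"Predicted Verbal given that Math score: {verbal_hat:.0f}")
print(f"Inverting the first line instead gives: {verbal_inverted:.0f}")
```

Inverting the first line simply returns the starting Verbal score of 600, so it shows no regression toward the mean, while the Verbal-on-Math regression predicts a lower value.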
Properties of Sample Average and Regression Line • Examine OLSFormula.xls, SampleAveIsOLS • Sample average is a weighted sum • Here, weights sum to 1 • Sample average is also the least squares estimate of central tendency • Examine Excel’s “Auditing Tool” • SSR measured around the average is never greater than SSR measured around the median
Mean minimizes SSR (or OLS estimate) • Run Solver, minimize SSR by changing “Solver Estimate” starting at 100 • Note the sum of the residuals (F16) • Try with “Draw another dataset” in Live sheet • What happens to sum of residuals? • Try using the median
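The Solver exercise can be imitated with a simple grid search. This sketch uses five made-up values and a hypothetical grid of guesses; it also tries the sum of absolute deviations, for which the median is the minimizer.

```python
import statistics

data = [90, 95, 100, 110, 160]  # made-up values for illustration

def ssr(guess):
    """Sum of squared residuals when every observation is predicted by `guess`."""
    return sum((v - guess) ** 2 for v in data)

def sad(guess):
    """Sum of absolute deviations around `guess`."""
    return sum(abs(v - guess) for v in data)

mean, median = statistics.mean(data), statistics.median(data)

# Crude stand-in for Solver: try guesses 80.0, 80.1, ..., 170.0.
grid = [g / 10 for g in range(800, 1701)]
best_ssr_guess = min(grid, key=ssr)
best_sad_guess = min(grid, key=sad)

residual_sum = sum(v - mean for v in data)  # residuals around the mean

print(f"mean = {mean}, guess minimizing SSR = {best_ssr_guess}")
print(f"median = {median}, guess minimizing absolute deviations = {best_sad_guess}")
print(f"sum of residuals around the mean = {residual_sum}")
```

The grid search lands on the mean when squared residuals are minimized, and the residuals around the mean sum to zero.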
[OLSFormula.xls]Example • Recall “weighted sum” calculation of slope coefficient • Regression goes through point of averages • Slope is weighted sum of Y’s • Weights bigger in absolute value the farther the x-value is from the average value of x • Weights sum to zero • Change in Y value has a predictable effect on OLS slope and intercept
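The weighted-sum view of the slope can be written out directly. A minimal sketch with made-up data (not the workbook's example), using the standard weights w_i = (x_i - xbar) / sum of (x - xbar)^2:

```python
# Made-up data for illustration (not from OLSFormula.xls).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# One weight per observation; it depends only on the x-values.
weights = [(xi - xbar) / sxx for xi in x]

slope = sum(w * yi for w, yi in zip(weights, y))  # slope is a weighted sum of Y's
intercept = ybar - slope * xbar                   # line goes through the point of averages

# Raising the last Y by 10 changes the slope by exactly (its weight) * 10.
y_bumped = y[:-1] + [y[-1] + 10]
slope_bumped = sum(w * yi for w, yi in zip(weights, y_bumped))

print("weights:", [round(w, 2) for w in weights])
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"slope after bumping the last Y by 10 = {slope_bumped:.2f}")
```

The weights are largest in absolute value at the extreme x-values and sum to zero, so a change in a Y far from the average x moves the slope the most.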
Residuals and Root Mean Squared Error (RMSE) (ResidualPlot.xls) • Residual Plots are “diagnostic tools” • There should be no discernible pattern in the plot • Residual plots can be done in Excel and SPSS • Try using LINEST method here to calculate the residuals. First, find the equation, then calculate predicted values, and then calculate residuals • Remember: Residual = Actual Y – Predicted Y • Now, square the residuals, find the average, and take the square root… • Root (of the) Mean (of the) Squared Errors • Measures the typical size of a prediction mistake • Examine a scatterplot and histogram of the residuals
RMSE is “like” the standard deviation of the residuals (but slightly different, see RMSE.xls for true difference) • General measure of dispersion • Also known as “Standard Error of the Regression”
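The steps for residuals and RMSE can be sketched as follows, on made-up data. One assumption is labeled here and in the code: the "slight difference" is taken to be the divisor, with the SD of the residuals dividing by n and the standard error of the regression dividing by n - 2 in simple regression. That is the conventional definition, but RMSE.xls should be checked for the course's own treatment.

```python
import math

# Made-up data for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # Residual = Actual Y - Predicted Y

sum_sq = sum(e ** 2 for e in residuals)

# OLS residuals average zero, so dividing by n gives their (population) SD.
sd_resid = math.sqrt(sum_sq / n)

# ASSUMPTION: standard error of the regression uses n - 2 (two estimated coefficients).
rmse = math.sqrt(sum_sq / (n - 2))

print(f"SD of residuals (divide by n):     {sd_resid:.4f}")
print(f"RMSE / SER      (divide by n - 2): {rmse:.4f}")
```

With many observations the two numbers are nearly identical, which is why RMSE can be read as the spread of the residuals.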
RMSE.xls • For many data sets, 68% of the observations fall within +/- 1 RMSE and 95% fall within +/- 2 RMSEs. When the RMSE is working as advertised, it should reflect these facts • Try changing spread in Computation • Examine Histograms in SATData and Accordian to understand that RMSE is the spread of the residuals • Pictures sheet shows residual plots and regressions
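The 68% and 95% rule of thumb can be checked by simulation. This is a sketch under stated assumptions: the data are simulated with normally distributed errors (the intercept, slope, and error SD are made up), which is the case where the rule should hold closely.

```python
import math
import random

random.seed(1)  # simulated data; errors ASSUMED normal with SD 50

x = [random.uniform(0, 100) for _ in range(2000)]
y = [100 + 2 * xi + random.gauss(0, 50) for xi in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
rmse = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Share of observations whose residual is within 1 and 2 RMSEs of the line.
share_within_1 = sum(abs(e) <= rmse for e in residuals) / n
share_within_2 = sum(abs(e) <= 2 * rmse for e in residuals) / n

print(f"RMSE = {rmse:.1f}")
print(f"within 1 RMSE: {share_within_1:.1%}, within 2 RMSEs: {share_within_2:.1%}")
```

If the residuals are skewed or their spread changes with X, these shares can drift away from 68% and 95%, which is one reason to look at the residual histogram and plot.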
RSquared.xls • Play the game to convey the idea of the improvement in prediction. • R2 measures the percentage improvement in prediction over just guessing average Y. • R2 ranges from 0 to 1. • R2 is a dangerous statistic because it is sometimes mistaken for a measure of the quality of a regression. Notice that Excel’s Trendline offers R2 (and only this statistic) as an option. There is no single statistic that can be used to decide if a particular regression equation is good or bad.
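R2 Calculation • Total Sum of Squares (TSS or SST): sum of (Actual Y – Average Y)^2 • Sum of Squared Residuals (SSR): sum of (Actual Y – Predicted Y)^2 • Sum of Squares Explained (SSE): sum of (Predicted Y – Average Y)^2 • TSS = SSE + SSR
Ordinary Least Squares: Fit • R2 = SSE / TSS = 1 – SSR / TSS • Note: This is the RATIO of the variation in y EXPLAINED by the regression to the total variation in y!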
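A short worked example of the R2 calculation, on made-up data, using this deck's naming (SSR for squared residuals, SSE for explained). It relies on the standard OLS decomposition TSS = SSE + SSR for a regression with an intercept.

```python
# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
intercept = ybar - slope * xbar
predicted = [intercept + slope * xi for xi in x]

tss = sum((yi - ybar) ** 2 for yi in y)                   # total variation in Y
ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted)) # squared residuals
sse = sum((pi - ybar) ** 2 for pi in predicted)           # variation explained

r_squared = sse / tss

print(f"TSS = {tss:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R2 = SSE / TSS = {r_squared:.2f}  (also 1 - SSR / TSS = {1 - ssr / tss:.2f})")
```

Guessing average Y for every observation leaves a squared error of TSS; using the regression line leaves SSR, so R2 is the share of that error removed.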
Infant Mortality IMRGDPReg.xls • Start by moving data into SPSS • Plot data: What is the relationship? • Save the residuals and plot against the independent variable • What does this plot tell us?
Real Data: HourlyEarnings.xls • Residuals: Why are data shown in “strips”? • Regress & Save residuals
Regression Interpretation: Summary • Simplified Conditional Mean (more on this) • Intercept and Slope coefficient • Need theory to guide causation • BH call this “Two Regression Lines” • OLS is weighted average • RMSE and R2 are helpful