Learn how to utilize summary statistics and visual representations to analyze data before conducting regression analysis. Understand the importance of transforming variables for nonlinearity in regression models.
FUNctional Form: Adding Forms of Nonlinearity to Regression Models
Getting to know your data
• You need to have a basic “feel” for your data to know how to analyze it.
• Two things you should do before you analyze:
  • Summary statistics (general characteristics)
  • Graphic views of data
Summary Statistics
• You should know basic information about your data:
  • Measures of Central Tendency
  • Measures of Dispersion
  • Minimum, Maximum
• If you are going to talk about 1-unit changes in a variable, you might want to know how much the variable actually varies
• If you are going to graph a relationship, you want to know what range of the relationship your data actually support (in-sample vs. out-of-sample predictions)
How to Obtain Summary Stats
• In Stata, type summarize y x1
• Note that x1 is a 7-point interval scale
• There is a problem with x1: its maximum is 77, which is impossible on a 7-point scale

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
           y |        46    4.608696     2.12371          2          9
          x1 |        46    5.130435    10.94351          1         77
How to Obtain Summary Stats
• You can get this info for specific variables
• Command: summarize var1 var2 …

. summarize x1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          x1 |        46    5.130435    10.94351          1         77
How to Obtain Summary Stats
• You can get more detailed info
• Command: summarize x1, detail

. summarize x1, detail

                             x1
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            2              1       Obs                  46
25%            3              1       Sum of Wgt.          46

50%            3                      Mean           5.130435
                        Largest       Std. Dev.      10.94351
75%            5              6
90%            6              7       Variance       119.7604
95%            7              7       Skewness       6.353026
99%           77             77       Kurtosis       42.25927
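Graphic Views of 1 Variable
• Box and Whisker Plot
• Essentially puts the data in numerical order and divides it into four parts
• The box (split at the median) shows where the middle 50% of the data are
• The whiskers reach toward the minimum and maximum; Stata draws them to the most extreme values within 1.5 × IQR of the box and plots anything beyond that as individual points
• Command:
  • graph box varname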
Graphic Views of Two Variables
• Use the scatter command
  • scatter y x
Why are Graphic Views of relationships important?

. regress y x

  Source |       SS       df       MS              Number of obs =      63
---------+------------------------------           F(  1,    61) =    0.17
   Model |  .141749236     1  .141749236           Prob > F      =  0.6840
Residual |  51.7056947    61  .847634339           R-squared     =  0.0027
---------+------------------------------           Adj R-squared = -0.0136
   Total |  51.8474439    62  .836249096           Root MSE      =  .92067

-------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
---------+---------------------------------------------------------------
       x |  -.0518887   .1268869    -0.41   0.684   -.3056147    .2018373
   _cons |    3.84064   .1168212    32.88   0.000    3.607041    4.074238
-------------------------------------------------------------------------

Interpret this regression…
Let’s Try Again

. regress y3 x

  Source |       SS       df       MS              Number of obs =      63
---------+------------------------------           F(  1,    61) =  221.56
   Model |   379.83978     1   379.83978           Prob > F      =  0.0000
Residual |  104.575914    61  1.71435924           R-squared     =  0.7841
---------+------------------------------           Adj R-squared =  0.7806
   Total |  484.415694    62  7.81315635           Root MSE      =  1.3093

-------------------------------------------------------------------------
      y3 |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
---------+---------------------------------------------------------------
       x |   2.686041   .1804527    14.89   0.000    2.325203    3.046878
   _cons |   6.027621   .1661377    36.28   0.000    5.695408    6.359833
-------------------------------------------------------------------------

Interpret these results…
What went wrong?
• What assumption did we violate?
• Sometimes there is a clear need
• Other times, weigh tradeoffs between parsimony and detail
• This line perfectly predicts every point, but it doesn’t tell us a great deal about the relationship between x and y in general
  • Hard to generalize out of sample
• This line summarizes the relationship very parsimoniously, without sacrificing too much
How can we account for non-linearity?
• Model is linear in the parameters (a and b)
• We cannot (with OLS) do:
• But we can “transform” the variables
• Examples:
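To illustrate the distinction between "linear in the parameters" and "linear in the variables", here is a short sketch. The specific functional forms are assumed examples chosen for illustration, not necessarily the ones shown on the slide; a and b are the slide's parameters and e is an error term.

```latex
% Not estimable by OLS: the parameter b enters nonlinearly (as an exponent)
y = a + x^{b} + e

% Estimable by OLS: each model is linear in a and b once x is transformed
y = a + b\,\ln(x) + e
\qquad
y = a + b\,\sqrt{x} + e
\qquad
y = a + b\,\frac{1}{x} + e
```

In each of the last three models the regressor is simply a new variable (for example z = ln(x)), so OLS treats the model as y = a + b z + e.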
Benefits of Transformation
• Allows us to account for non-linearity (we get better-fitting models)
• Allows us to stick with the OLS framework
  • We don’t know anything else
  • Even when you do, the simple and desirable properties of OLS can make OLS with transformations a better choice than some newfangled nonlinear models
Which Transformation?
• It is up to the researcher to specify this
• This may seem ad hoc, but
  • Let theory be your guide
  • It is no less ad hoc than constraining the model to be linear
• Let me show you a few transformations
Diminishing Marginal Returns
• One popular option is to take the “natural log” of x
• This transforms x so that it will have a linear relationship with y
[Figure: relationship between y and ln(x). Linear: OLS is A-O-K]
[Figure: relationship between y and x. Non-linear: OLS not OK]
How does this work?
• So you graphed the relationship between y and x and found that it looks like x has a diminishing marginal effect on y
• You decide to transform x with the natural log to deal with this
  • generate lnx = ln(x)
  • regress y lnx
• But how do we interpret?
First, Let’s Compare Fit

. regress y x

   Source |       SS       df       MS              Number of obs =      64
----------+------------------------------           F(  1,    62) =  197.65
    Model |  363.742893     1  363.742893           Prob > F      =  0.0000
 Residual |  114.098806    62  1.84030333           R-squared     =  0.7612
----------+------------------------------           Adj R-squared =  0.7574
    Total |  477.841699    63  7.58478888           Root MSE      =  1.3566

. regress y lnx

   Source |       SS       df       MS              Number of obs =      64
----------+------------------------------           F(  1,    62) =  481.32
    Model |  423.313321     1  423.313321           Prob > F      =  0.0000
 Residual |   54.528378    62  .879489967           R-squared     =  0.8859
----------+------------------------------           Adj R-squared =  0.8840
    Total |  477.841699    63  7.58478888           Root MSE      =  .93781

Which Model Fits Better?
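As a check on how the fit statistics relate to one another, the following reproduces two of them for the logged model from numbers in the output. The formulas are the standard OLS definitions; n is the number of observations and k the number of regressors (here n = 64, k = 1).

```latex
% Adjusted R-squared from R-squared
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}
          = 1 - (1 - 0.8859)\,\frac{63}{62} \approx 0.884

% Root MSE from the residual sum of squares
\text{Root MSE} = \sqrt{\frac{SS_{\text{Residual}}}{n-k-1}}
                = \sqrt{\frac{54.528378}{62}}
                = \sqrt{0.879489967} \approx 0.938
```

Because both regressions use the same dependent variable and the same 64 observations, their R-squared and Root MSE values can be compared directly.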
Second, Let’s Interpret

. regress y lnx

   Source |       SS       df       MS              Number of obs =      64
----------+------------------------------           F(  1,    62) =  481.32
    Model |  423.313321     1  423.313321           Prob > F      =  0.0000
 Residual |   54.528378    62  .879489967           R-squared     =  0.8859
----------+------------------------------           Adj R-squared =  0.8840
    Total |  477.841699    63  7.58478888           Root MSE      =  .93781

-------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
---------+---------------------------------------------------------------
     lnx |   3.023069   .1377947    21.94   0.000    2.747621    3.298516
   _cons |   2.018158   .1843283    10.95   0.000    1.649691    2.386625
-------------------------------------------------------------------------

• A 1-unit increase in the natural log of x results in a 3.02 unit increase in y.
• What does a 1-unit increase in ln(x) mean?
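One way to answer that question is to translate changes in ln(x) into proportional changes in x. The sketch below uses the estimated coefficient on lnx (3.023069) from the output; the notation x_0, x_1 for a starting and ending value of x is introduced here for illustration.

```latex
% Change in predicted y when x moves from x_0 to x_1
\Delta \hat{y} = b\,[\ln(x_1) - \ln(x_0)] = b\,\ln\!\left(\frac{x_1}{x_0}\right)

% A 1-unit increase in ln(x) means x is multiplied by e (about 2.718)
\frac{x_1}{x_0} = e \;\Rightarrow\; \Delta \hat{y} = 3.023

% Doubling x
\frac{x_1}{x_0} = 2 \;\Rightarrow\; \Delta \hat{y} = 3.023 \times \ln 2 \approx 2.095

% A 1 percent increase in x
\frac{x_1}{x_0} = 1.01 \;\Rightarrow\; \Delta \hat{y} = 3.023 \times \ln 1.01 \approx 0.030
```

So the effect of x is tied to proportional changes: moving from 1 to 2 has the same predicted effect as moving from 10 to 20.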
We can “back out” our results
• First, get predicted values of y from the “logged” equation
• Then graph the relationship between y and x (NOT y and lnx). The graph shows the predicted relationship between y and x.
• Example…
. predict predincrease
. scatter predincrease x, c(l) sort(predincrease) || scatter y x
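The "backing out" step can also be sketched by hand: plug values of x into the fitted equation and look at the predictions on the original x scale. This is a minimal Python illustration, not the Stata workflow itself; the coefficients are copied from the `regress y lnx` output, while the x values and the names `predict_y`, `xs`, `preds`, and `steps` are assumptions made up for the example.

```python
import math

# Coefficients copied from the `regress y lnx` output.
B_CONS = 2.018158
B_LNX = 3.023069


def predict_y(x):
    """Predicted y on the original x scale; x must be strictly positive."""
    if x <= 0:
        raise ValueError("ln(x) is undefined for x <= 0")
    return B_CONS + B_LNX * math.log(x)


# Illustrative x values (chosen for the example, not taken from the data set).
xs = [1, 2, 3, 4, 5]
preds = [predict_y(x) for x in xs]
for x, p in zip(xs, preds):
    print(f"x = {x}: predicted y = {p:.3f}")

# Gain in predicted y from each additional unit of x.
steps = [later - earlier for earlier, later in zip(preds, preds[1:])]
print("gain per extra unit of x:", [round(s, 3) for s in steps])
```

Each successive gain is smaller than the one before it, which is the diminishing marginal effect the log transformation is meant to capture.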
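Warning About Logs
• The natural log is only defined for strictly positive values. If your variable is not always > 0, you can get into trouble here: Stata's ln() returns missing for zero or negative values, and those observations drop out of the regression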
Another Transformation
• Taking the “square root” of x also can model diminishing marginal returns
• Use the same procedures
• Like the log, it is undefined for negative numbers (unlike the log, it is defined at 0)
1/x transformation
• Instead of putting x into the regression, we can include 1/x to get a non-linearity like this:
Problems with 1/x
• When x is 0, 1/x is undefined
Exponential transformation
• Instead of including x, we can use e^x (in Stata: exp(x))
• This gives us a relationship like:
What transformations are OK
• Any transformation of x is OK if it is justified by theory
• You should be able to get back to your original scale
• Be careful about transforming with functions that could be undefined for certain parts of your data
What you should know and be able to do:
• Examine variables using summary statistics and graphs
• Diagnose non-linear relationships using graphs
• Choose an appropriate transformation to model nonlinearity
• Estimate and interpret regression models with transformations of the data using Stata
One obvious transform is left out
• x²
• This is actually an interaction term of x with itself.
• The effect of x on y depends on the value of x that is being observed!
• As such, we must include both x and x² (why?)
The Effect of x on y depends on its own value:
• When x is low, a 1-unit increase in x has a large negative effect
• When x is just below its mid-range, a 1-unit increase in x has a small negative effect
• When x gets over its mid-range, a 1-unit increase in x has a small positive effect
• When x is high, a 1-unit increase in x has a large positive effect
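The pattern described in these bullets follows from differentiating the quadratic model. In the sketch below, b_1 and b_2 are notation assumed here for the coefficients on x and x²; the U-shape in the bullets corresponds to b_1 < 0 and b_2 > 0.

```latex
% Quadratic specification
y = a + b_1 x + b_2 x^2 + e

% Effect of a small increase in x: it depends on x itself
\frac{\partial y}{\partial x} = b_1 + 2\,b_2\,x

% The effect changes sign at the turning point
x^{*} = -\frac{b_1}{2\,b_2}
```

Below x* the slope is negative and shrinking toward zero; above x* it is positive and growing, which matches the four cases listed.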
Example
• Remember this? Let’s think about age…

. regress volunteer tvhours sibs educ age

   Source |       SS       df       MS              Number of obs =    1187
----------+------------------------------           F(  4,  1182) =   17.17
    Model |  425.853521     4   106.46338           Prob > F      =  0.0000
 Residual |  7330.93334  1182  6.20214326           R-squared     =  0.0549
----------+------------------------------           Adj R-squared =  0.0517
    Total |  7756.78686  1186  6.54029246           Root MSE      =  2.4904

--------------------------------------------------------------------------
volunteer |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
----------+---------------------------------------------------------------
  tvhours |  -.0551014   .0328821    -1.68   0.094   -.1196152    .0094124
     sibs |   .0232574    .020254     1.15   0.251   -.0164803    .0629951
     educ |   .1912462   .0266417     7.18   0.000    .1389758    .2435166
      age |   .0141891   .0043829     3.24   0.001    .0055899    .0227882
    _cons |  -.9510863   .4783878    -1.99   0.047    -1.88967   -.0125024
--------------------------------------------------------------------------
What transformations might we use on age?

. regress volunteer tvhours sibs educ age age2

   Source |       SS       df       MS              Number of obs =    1187
----------+------------------------------           F(  5,  1181) =   14.55
    Model |   449.99501     5  89.9990021           Prob > F      =  0.0000
 Residual |  7306.79185  1181   6.1869533           R-squared     =  0.0580
----------+------------------------------           Adj R-squared =  0.0540
    Total |  7756.78686  1186  6.54029246           Root MSE      =  2.4874

--------------------------------------------------------------------------
volunteer |      Coef.   Std. Err.       t    P>|t|   [95% Conf. Interval]
----------+---------------------------------------------------------------
  tvhours |  -.0528947   .0328608    -1.61   0.108   -.1173668    .0115774
     sibs |   .0191269   .0203369     0.94   0.347   -.0207736    .0590275
     educ |   .1797836   .0272345     6.60   0.000    .1263502    .2332169
      age |   .0622066   .0246994     2.52   0.012     .013747    .1106662
     age2 |  -.0004876   .0002468    -1.98   0.048   -.0009718   -3.30e-06
    _cons |  -1.824354   .6509468    -2.80   0.005   -3.101495   -.5472129
--------------------------------------------------------------------------
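To read the age and age2 coefficients together, it can help to compute the slope of age at a few ages and the age at which the fitted curve peaks. This is a rough Python sketch rather than Stata output: the coefficients are copied from the regression above, it assumes age2 was generated as age squared, and the ages and the names `conditional_slope` and `turning_point` are made up for the example.

```python
# Coefficients copied from the `regress volunteer tvhours sibs educ age age2` output.
B_AGE = 0.0622066
B_AGE2 = -0.0004876  # assumes age2 was generated as age^2


def conditional_slope(age):
    """Change in predicted volunteer per extra year of age, evaluated at `age`."""
    return B_AGE + 2 * B_AGE2 * age


# Age at which the slope equals zero (the peak of the fitted curve,
# because the coefficient on age2 is negative).
turning_point = -B_AGE / (2 * B_AGE2)

# Illustrative ages, chosen for the example rather than taken from the data.
for age in (20, 40, 60, 80):
    print(f"age {age}: slope = {conditional_slope(age):+.4f}")
print(f"turning point: about age {turning_point:.1f}")
```

With these estimates the slope is positive for younger respondents, shrinks with age, and turns negative only past the mid-60s, so over much of the age range the curve is not far from a straight line.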
What is the result?
• Compare Goodness of Fit:
  • Adj. R² improves (0.0540 vs. 0.0517 before)
  • RMSE decreases (2.4874 vs. 2.4904 before)
  • Improvement is minimal
• Why such slight improvement? Look at the graph.
• Stata Command (holds tvhours at 2, sibs at 1, and educ at 16): gen vol_hat = _b[_cons] + _b[tvhours]*2 + _b[sibs]*1 + _b[educ]*16 + _b[age]*age + _b[age2]*age2
• Stata Command: scatter vol_hat age, c(l) sort(age) msize(medium) || scatter volunteer age, msize(vsmall)
Which model…
• Weigh parsimony (simplicity, ease of interpretation)
• Weigh precision (increases in goodness of fit)
• Weigh theoretical appropriateness
One last note:
• Because polynomial terms are really interaction terms, the standard error Stata reports for x is not the standard error of the effect of x.
• Effect of x depends on the value of x
• The conditional slope is obtained by calculus: dy/dx = b1 + 2*b2*x
• So the standard error of the conditional slope is: sqrt( var(b1) + 4*x^2*var(b2) + 4*x*cov(b1, b2) )
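Calculate conditional Std. Err.
• matrix V = e(V) (stores the coefficient variance-covariance matrix from the last regression)
• scalar cov12 = V[4,5] (covariance of the age and age2 coefficients, which are the 4th and 5th regressors in the model above)
• gen condslope = _b[age] + 2*_b[age2]*age
• gen condse = sqrt( _se[age]^2 + 4*age^2*_se[age2]^2 + 4*age*cov12 )
• gen low95 = condslope - 1.96*condse
• gen hi95 = condslope + 1.96*condse
• scatter condslope age, c(l) sort(age) || scatter low95 age, c(l) sort(age) || scatter hi95 age, c(l) sort(age) yline(0)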
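The variance of the conditional slope b1 + 2*b2*x is var(b1) + 4*x^2*var(b2) + 4*x*cov(b1, b2). The Python sketch below works through that arithmetic under stated assumptions: the coefficients and standard errors are copied from the age/age2 regression output, but the covariance between the two coefficients does not appear in that output, so `COV_HYPOTHETICAL` is a made-up placeholder that must be replaced with the value from the estimated covariance matrix. The function names are also illustrative.

```python
import math

# Copied from the `regress volunteer tvhours sibs educ age age2` output.
B_AGE, SE_AGE = 0.0622066, 0.0246994
B_AGE2, SE_AGE2 = -0.0004876, 0.0002468

# HYPOTHETICAL covariance of the age and age2 coefficients. It is not shown
# in the regression output; take the real value from the covariance matrix.
COV_HYPOTHETICAL = -6.0e-6


def conditional_slope(age):
    """Slope of age at a given age: b1 + 2*b2*age."""
    return B_AGE + 2 * B_AGE2 * age


def conditional_se(age, cov=COV_HYPOTHETICAL):
    """Standard error of the conditional slope at a given age."""
    variance = SE_AGE**2 + 4 * age**2 * SE_AGE2**2 + 4 * age * cov
    if variance < 0:
        raise ValueError("variance inputs are inconsistent with each other")
    return math.sqrt(variance)


def conf_interval(age, cov=COV_HYPOTHETICAL):
    """Approximate 95% interval for the conditional slope (normal critical value)."""
    slope, se = conditional_slope(age), conditional_se(age, cov)
    return slope - 1.96 * se, slope + 1.96 * se


# Illustrative ages, chosen for the example rather than taken from the data.
for age in (20, 40, 60, 80):
    low, high = conf_interval(age)
    print(f"age {age}: slope {conditional_slope(age):+.4f}, 95% CI [{low:+.4f}, {high:+.4f}]")
```

Note that the age2 variance term is multiplied by age squared, not age, so the interval widens quickly at the extremes of the age range.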