Stata Workshop #2: Linear Regression
Chiu-Hsieh (Paul) Hsu
Associate Professor
College of Public Health
pchhsu@email.arizona.edu

Outline
• Review of linear regression
• Model fitting
• Variable selection
• Model selection
• Model interpretation

Linear Regression
• Expression
  • Y = β₀ + β₁x₁ + β₂x₂ + ε
  • Linear relationship between Y and each x
  • Holding x₂ fixed, a one-unit increase in x₁ changes the expected value of Y by β₁ units.
• Assumptions
  • ε (error term) ~ N(0, σ²), independent and identically distributed
  • The assumptions need to be checked, using the residuals
• R² (coefficient of determination) is the proportion of the variation in Y explained by all the Xs together.
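
The expression, its interpretation, and R² can be written out explicitly. This is a sketch in notation the slides do not use: the index i, the sample size n, the fitted values \hat{Y}_i and the sample mean \bar{Y} are symbols assumed here for illustration.

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i,
  \qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \\
\beta_1 &= E[\,Y \mid x_1 + 1,\; x_2\,] - E[\,Y \mid x_1,\; x_2\,], \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}.
\end{align*}
```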

Data Set
• Lead exposure data
  • Effects of lead exposure on neurological and psychological function in children
• Neurological endpoint
  • maxfwt: maximum finger-wrist tapping score
• Independent variables: Group (exposed to lead or not), age (ageyrs), sex, area

Data Management
• Drop observations with a missing outcome, coded as maxfwt = 99
  • Stata command: drop if maxfwt==99
• Generate dummy variables for area
  • Stata command: xi i.area
  • This creates two dummy variables, _Iarea_2 and _Iarea_3, so Area 1 is the reference group
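
The two data-management steps can be run with a quick check before and after. This is a minimal sketch that assumes the lead exposure data set is already loaded in memory; the count and tabulate checks are additions for illustration, not part of the slides.

```stata
* Assumption: the lead exposure data are already in memory.

* How many observations carry the missing-value code 99?
count if maxfwt == 99

* Remove them before fitting any model
drop if maxfwt == 99

* Create the area dummies _Iarea_2 and _Iarea_3 (Area 1 is the reference)
xi i.area

* Cross-check each dummy against the original area variable
tabulate area _Iarea_2
tabulate area _Iarea_3
```

An alternative that keeps the rows is mvdecode maxfwt, mv(99), which recodes 99 to Stata's missing value; regress then leaves those observations out automatically.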

Data Description
• Group
  • Stata command: tab Group
• Age by Group
  • Stata command: by Group, sort: sum ageyrs
  • Stata command: ttest ageyrs, by(Group)
• Sex by Group
  • Stata command: tab sex Group, exact
• Area by Group
  • Stata command: tab area Group, exact

Estimation of the regression line
• Stata command
  • reg maxfwt Group sex ageyrs _Iarea_2 _Iarea_3
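
The same model can also be fitted without creating the dummies first, using factor-variable notation. This is a sketch that assumes Stata 11 or later and that the maxfwt==99 rows have already been dropped; the joint test with testparm is an addition for illustration.

```stata
* Assumption: lead data in memory, observations with maxfwt==99 already dropped.

* i.area expands area into indicator variables, with the lowest level
* (Area 1) as the base, matching the _Iarea_2 and _Iarea_3 coding from xi
regress maxfwt Group sex ageyrs i.area

* Joint test that all area indicators have zero coefficients
testparm i.area
```

The xi dummies are still needed for the variable-selection commands that refer to _Iarea_2 and _Iarea_3 by name.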

Variable Selection
• Stepwise
  • Can both add and remove variables
  • Need to specify both the entry p-value (pe) and the removal p-value (pr)
• Forward
  • Begins from the simplest model and only adds “important” variables
  • Only need to specify pe
• Backward
  • Begins with the full model and only removes “not important” variables
  • Only need to specify pr

Variable Selection (cont’d)
• Keep the variable of main interest, Group, in every model
• Stepwise command
  • sw, pe(0.1) pr(0.2) lock: reg maxfwt Group sex ageyrs (_Iarea_2 _Iarea_3)
• Forward command
  • sw, pe(0.1) lock: reg maxfwt Group sex ageyrs (_Iarea_2 _Iarea_3)
• Backward command
  • sw, pr(0.2) lock: reg maxfwt Group sex ageyrs (_Iarea_2 _Iarea_3)
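
In more recent Stata releases the prefix is documented as stepwise and the locking option as lockterm1. Whether the shorter spellings on the slide are accepted depends on the installed version, so the sketch below is an assumption to check against help stepwise. It assumes the xi dummies already exist.

```stata
* Assumption: _Iarea_2 and _Iarea_3 were created earlier with xi i.area.
* The parentheses make the two area dummies enter or leave as one term.
* lockterm1 keeps the first term (Group) in every model.

* Both thresholds, starting from the full model (backward stepwise):
* remove terms with p >= 0.2, allow re-entry of terms with p < 0.1
stepwise, pe(0.1) pr(0.2) lockterm1: regress maxfwt Group sex ageyrs (_Iarea_2 _Iarea_3)

* Both thresholds, starting from the locked term only (forward stepwise)
stepwise, pe(0.1) pr(0.2) forward lockterm1: regress maxfwt Group sex ageyrs (_Iarea_2 _Iarea_3)
```

When both thresholds are given, pe must be smaller than pr, as it is here.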

Model Selection
• R² vs. adjusted R²
  • R² never decreases when a covariate is added to the model, so it is not a good criterion for choosing between models.
  • Adjusted R² penalizes covariates that add little explanatory power, so it is commonly used to compare models.
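
Both quantities are stored by regress and can be compared directly. The sketch below also recomputes adjusted R² by hand from the usual formula 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of covariates. It assumes the lead data and the xi dummies are in memory; the scalar name r2a_hand is hypothetical.

```stata
* Assumption: lead data in memory, xi dummies already created.
quietly regress maxfwt Group sex ageyrs _Iarea_2 _Iarea_3

* Stored results: e(r2) is R-squared, e(r2_a) is adjusted R-squared
display "R-squared          = " e(r2)
display "Adjusted R-squared = " e(r2_a)

* By hand: e(N) is n, e(df_r) is the residual degrees of freedom n - p - 1
scalar r2a_hand = 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
display "Adjusted, by hand  = " r2a_hand
```

Running the same lines after each candidate model gives the adjusted R² values to compare.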

Model 1 vs. Model 2
[Regression output panels for Model 1 and Model 2 not recovered]

Prediction
• Stata command
  • predict yhat, xb
    • Fitted value ŷ (the linear prediction xb) from the regression model
  • predict seyhat, stdp
    • Standard error of the predicted mean response
  • predict sey, stdf
    • Standard error of the prediction for an individual observation
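
The two standard errors lead to two different intervals: stdp gives a confidence interval for the mean response, stdf a wider prediction interval for an individual. This is a sketch that assumes yhat, seyhat and sey were just created by the predict commands above and that regress is still the most recent estimation command; the names tcrit, ci_lo, ci_hi, pi_lo, pi_hi and sey_check are hypothetical.

```stata
* Assumption: yhat, seyhat, sey exist and regress results are still in e().

* 97.5th percentile of the t distribution with the residual degrees of freedom
scalar tcrit = invttail(e(df_r), 0.025)

* 95% confidence interval for the mean response
generate ci_lo = yhat - tcrit * seyhat
generate ci_hi = yhat + tcrit * seyhat

* 95% prediction interval for an individual observation
generate pi_lo = yhat - tcrit * sey
generate pi_hi = yhat + tcrit * sey

* The two standard errors are linked by stdf^2 = stdp^2 + s^2,
* where s is the root mean squared error, e(rmse)
generate double sey_check = sqrt(seyhat^2 + e(rmse)^2)
assert reldif(sey, sey_check) < 1e-4 if !missing(sey, sey_check)
```

The final assert stops with an error if the relation fails, so a silent run confirms it.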

Residual Plots
• Stata command
  • predict studentresid, rstudent
    • Generate studentized residuals
  • scatter studentresid yhat, yline(0)
    • Generate the residual plot
• The rvfplot command also works, but it plots the raw (unstudentized) residuals against the fitted values.

Normality Assumption
• Stata command
  • qnorm studentresid
    • Generate a normal Q-Q plot for the studentized residuals
  • swilk studentresid
    • Perform the Shapiro-Wilk test
• http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter1/statareg1.htm