STATA Learning Series, Lecture 6. Jia Ming
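STATA Learning Series
• Regress (continued): regression diagnostics
• 1. Census data: hands-on processing (model analysis)
• 2. Auto data: regression diagnostics (graphical methods)
• 3. Exdata: practical application of data analysis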
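Basic data conversion: Excel and Stata
1. Importing Excel data into Stata
Step 1: Save the Excel file as a tab-delimited .txt file.
Step 2: Run the command: insheet using d:/stata/name.txt
2. Exporting Stata data so that Excel can open it
Step 1: outsheet using d:/stata/name.out (the path is where the file is written)
Step 2: Open the .out file in Excel.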
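For readers unsure what a tab-delimited text file looks like, the sketch below writes and re-reads one with Python's standard csv module. The column names and values are hypothetical placeholders, not data from the lecture. The only point is the layout that the insheet step relies on: one tab between fields and the variable names on the first line.

```python
import csv
import io

# Hypothetical rows standing in for a small Excel sheet (not from the lecture).
rows = [
    ["state", "pop", "death"],
    ["A", "1000", "8"],
    ["B", "2500", "21"],
]

# Write the rows as tab-delimited text: one tab between fields, one row per line.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
text = buffer.getvalue()

# Read the text back: the first line holds the variable names, the rest the data.
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
header, data = parsed[0], parsed[1:]
print(header)
print(len(data), "observations")
```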
1. Census data: hands-on processing
    . use d:/stata/census
1. Data description:
    . describe
    Contains data from d:/stata/census.dta
      obs:            50                          1980 Census data by state
     vars:            12                          6 Jul 2000 17:06
     size:         3,000 (99.4% of memory free)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    state           str14  %-14s                  State
    region          int    %-8.0g      cenreg     Census region
    pop             long   %12.0gc                Population
    poplt5          long   %12.0gc                Pop, < 5 year
    pop5_17         long   %12.0gc                Pop, 5 to 17 years
    pop18p          long   %12.0gc                Pop, 18 and older
    pop65p          long   %12.0gc                Pop, 65 and older
    popurban        long   %12.0gc                Urban population
    medage          float  %9.2f                  Median age
    death           long   %12.0gc                Number of deaths
    marriage        long   %12.0gc                Number of marriages
    divorce         long   %12.0gc                Number of divorces
    -------------------------------------------------------------------------------
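1. Census data: model analysis
• Purpose of the analysis: to study the linear relationship between the death rate (drate) and medage, medagesq, and pcturban (the urban share of the population).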
1. Census data: model analysis
• Basic data processing: generate the variables the model needs.
    . gen pcturban = popurban/pop
    . gen drate = death/pop
    . gen medagesq = medage*medage
1. Census data: model analysis
    . regress drate medage medagesq pcturban

          Source |       SS       df       MS              Number of obs =      50
    -------------+------------------------------           F(  3,    46) =   31.47
           Model |   .00005593     3  .000018643           Prob > F      =  0.0000
        Residual |  .000027249    46  5.9236e-07           R-squared     =  0.6724
    -------------+------------------------------           Adj R-squared =  0.6510
           Total |  .000083179    49  1.6975e-06           Root MSE      =  .00077

    ------------------------------------------------------------------------------
           drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          medage |   .0004851    .001207     0.40   0.690    -.0019446    .0029147
        medagesq |   2.37e-06   .0000206     0.12   0.909     -.000039    .0000437
        pcturban |  -.0035348   .0008293    -4.26   0.000    -.0052042   -.0018655
           _cons |   -.005598   .0178979    -0.31   0.756    -.0416246    .0304286
    ------------------------------------------------------------------------------
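The summary statistics in the header of this output can be recomputed from the sums of squares and degrees of freedom in the ANOVA panel. The sketch below uses only the numbers printed in the table; the variable names are illustrative, and small differences in the last digit come from the rounding of the displayed values.

```python
# Sums of squares and degrees of freedom copied from the regress output.
ss_model, df_model = 0.00005593, 3
ss_resid, df_resid = 0.000027249, 46
ss_total, df_total = 0.000083179, 49

# Mean squares are sums of squares divided by their degrees of freedom.
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid

f_stat = ms_model / ms_resid                  # overall F(3, 46)
r2 = ss_model / ss_total                      # R-squared
adj_r2 = 1 - (1 - r2) * df_total / df_resid   # adjusted R-squared
root_mse = ms_resid ** 0.5                    # Root MSE

print(round(f_stat, 2), round(r2, 4), round(adj_r2, 4), round(root_mse, 5))
```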
1. Census data: model analysis
• Note the coefficients of medage and medagesq. Neither is individually significant in the regression, yet the joint test rejects the hypothesis that both are zero.
    . test medage medagesq
     ( 1)  medage = 0.0
     ( 2)  medagesq = 0.0
           F(  2,    46) =   44.03
                Prob > F =    0.0000
    . test medage=2*medagesq
     ( 1)  medage - 2.0 medagesq = 0.0
           F(  1,    46) =    0.15
                Prob > F =    0.7021
    . test medage=200*medagesq
     ( 1)  medage - 200.0 medagesq = 0.0
           F(  1,    46) =    0.00
                Prob > F =    0.9982
1. Census data: model analysis
    . vce
                 |   medage  medagesq  pcturban     _cons
    -------------+----------------------------------------
          medage |  1.5e-06
        medagesq | -2.5e-08   4.2e-10
        pcturban |  3.2e-07  -5.7e-09   6.9e-07
           _cons | -.000022   3.7e-07  -5.0e-06    .00032

    . vce, rho
                 |   medage  medagesq  pcturban     _cons
    -------------+----------------------------------------
          medage |   1.0000
        medagesq |  -0.9985    1.0000
        pcturban |   0.3235   -0.3352    1.0000
           _cons |  -0.9984    0.9942   -0.3385    1.0000
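The two matrices are linked: each correlation is a covariance divided by the product of the two standard errors, and each standard error is the square root of a diagonal entry. The sketch below checks this for medage and medagesq using the displayed two-digit values, so the results agree with the printed output only approximately. Stata works from the unrounded matrix.

```python
import math

# Entries copied from the `vce` output (displayed to two significant digits).
var_medage = 1.5e-06
var_medagesq = 4.2e-10
cov_medage_medagesq = -2.5e-08

# Standard errors are the square roots of the diagonal entries.
se_medage = math.sqrt(var_medage)
se_medagesq = math.sqrt(var_medagesq)

# Correlation of the two coefficient estimates.
rho = cov_medage_medagesq / (se_medage * se_medagesq)

print(round(se_medage, 6), round(se_medagesq, 7), round(rho, 3))
```

A correlation this close to -1 means the two coefficients are estimated almost as a single quantity, which is why each one alone looks insignificant.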
1. Census data: model analysis
    . regress drate medage pcturban

          Source |       SS       df       MS              Number of obs =      50
    -------------+------------------------------           F(  2,    47) =   48.22
           Model |  .000055922     2  .000027961           Prob > F      =  0.0000
        Residual |  .000027256    47  5.7993e-07           R-squared     =  0.6723
    -------------+------------------------------           Adj R-squared =  0.6584
           Total |  .000083179    49  1.6975e-06           Root MSE      =  .00076

    ------------------------------------------------------------------------------
           drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          medage |   .0006238   .0000658     9.48   0.000     .0004915    .0007562
        pcturban |  -.0035028   .0007731    -4.53   0.000    -.0050581   -.0019476
           _cons |  -.0076466   .0019034    -4.02   0.000    -.0114756   -.0038175
    ------------------------------------------------------------------------------
1. Census data: model analysis
• Obtain the fitted values from the estimated regression model:
    . predict dhat
    (option xb assumed; fitted values)
    . summarize drate dhat

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
           drate |      50     .008436   .0013029   .0039915   .0106902
            dhat |      50     .008436   .0010683   .0044936   .0110485
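With the default xb option, predict evaluates the fitted linear equation for every observation. A minimal sketch of that calculation follows, using the coefficients from the regression of drate on medage and pcturban. The input values (median age 30, urban share 0.7) are hypothetical and do not correspond to any state in the data, and the function name is illustrative.

```python
# Coefficients copied from `regress drate medage pcturban`.
B_CONS = -0.0076466
B_MEDAGE = 0.0006238
B_PCTURBAN = -0.0035028


def fitted_drate(medage, pcturban):
    """Linear prediction (xb): constant plus coefficient times each regressor."""
    return B_CONS + B_MEDAGE * medage + B_PCTURBAN * pcturban


# Hypothetical state: median age 30 years, 70% urban population.
dhat_example = fitted_drate(30.0, 0.7)
print(round(dhat_example, 6))
```

The result lies between the minimum and maximum of dhat reported by summarize, as a fitted value for a typical state should.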
1. Census data: model analysis
• Influence analysis:
    . predict influs, cooksd
(Cook's D measures how strongly each observation influences the regression coefficients.)
    . summarize influs, detail

                              Cook's D
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     1.35e-08       1.35e-08
     5%     6.25e-06       4.54e-06
    10%     .0000502       6.25e-06       Obs                  50
    25%     .0010358       .0000109       Sum of Wgt.          50

    50%     .0043872                      Mean           .0639731
                            Largest       Std. Dev.      .2560158
    75%     .0200719       .1914291
    90%     .0610564       .3090287       Variance       .0655441
    95%     .3090287       .5059252       Skewness       5.857965
    99%     1.735909       1.735909       Kurtosis       38.08436
1. Census data: model analysis
    . list state if influs > 4/50
(The cutoff is 4/n, with n = 50.)
               state
      2.      Alaska
      9.     Florida
     11.      Hawaii
     44.        Utah
    . lvr2plot, s([state]) trim(12) border
(graph)
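The 4/n rule can be checked against the summary of Cook's D shown earlier. The sketch below uses only values printed in that output: the four largest values and the 90th percentile. The four largest all exceed the cutoff, which matches the four states listed.

```python
# Number of observations and the conventional 4/n cutoff for Cook's D.
n = 50
cutoff = 4 / n

# The "Largest" column and the 90th percentile from `summarize influs, detail`.
largest = [0.1914291, 0.3090287, 0.5059252, 1.735909]
p90 = 0.0610564

# Keep the values that exceed the cutoff.
flagged = [d for d in largest if d > cutoff]

print(cutoff, len(flagged), p90 > cutoff)
```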
1. Census data: model analysis
    . regress drate medage medagesq pcturban if influs<1

          Source |       SS       df       MS              Number of obs =      49
    -------------+------------------------------           F(  3,    45) =   30.43
           Model |  .000050006     3  .000016669           Prob > F      =  0.0000
        Residual |  .000024651    45  5.4780e-07           R-squared     =  0.6698
    -------------+------------------------------           Adj R-squared =  0.6478
           Total |  .000074657    48  1.5553e-06           Root MSE      =  .00074

    ------------------------------------------------------------------------------
           drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          medage |   .0028685   .0015954     1.80   0.079    -.0003448    .0060817
        medagesq |  -.0000364   .0000266    -1.37   0.178    -.0000899    .0000172
        pcturban |  -.0037377   .0008029    -4.66   0.000    -.0053549   -.0021205
           _cons |  -.0420036    .023994    -1.75   0.087    -.0903301    .0063229
    ------------------------------------------------------------------------------
2. Auto data: regression diagnostics (graphical analysis)
Three key issues in identifying a model's sensitivity to individual observations:
1. Residual.
2. Leverage: a point may have a small residual, yet deleting it would change the estimates markedly. Such a point is said to have high leverage.
3. Influence: we might ask which points in our data have a large effect on the estimated a, b, etc.
2. Auto data: regression diagnostics (graphical analysis)
    . use d:/stata/auto
    . describe
    Contains data from d:/stata/auto.dta
      obs:            74                          1978 Automobile Data
     vars:            12                          7 Jul 2000 13:51
     size:         3,478 (99.4% of memory free)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    make            str18  %-18s                  Make and Model
    price           int    %8.0gc                 Price
    mpg             int    %8.0g                  Mileage (mpg)
    rep78           int    %8.0g                  Repair Record 1978
    headroom        float  %6.1f                  Headroom (in.)
    trunk           int    %8.0g                  Trunk space (cu. ft.)
    weight          int    %8.0gc                 Weight (lbs.)
    length          int    %8.0g                  Length (in.)
    turn            int    %8.0g                  Turn Circle (ft.)
    displacement    int    %8.0g                  Displacement (cu. in.)
    gear_ratio      float  %6.2f                  Gear Ratio
    foreign         byte   %8.0g       origin     Car type
    -------------------------------------------------------------------------------
    Sorted by:  foreign
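2. Auto data: regression diagnostics (graphical analysis)
• Purpose of the analysis:
• To study how car price (price) relates to mileage (mpg), weight (weight), origin (foreign), and the interaction between origin and mileage (forxmpg).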
2. Auto data: regression diagnostics (graphical analysis)
    . gen forxmpg = foreign*mpg
    . regress price weight mpg forxmpg foreign

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   21.22
           Model |   350319665     4  87579916.3           Prob > F      =  0.0000
        Residual |   284745731    69  4126749.72           R-squared     =  0.5516
    -------------+------------------------------           Adj R-squared =  0.5256
           Total |   635065396    73  8699525.97           Root MSE      =  2031.4

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |   4.613589   .7254961     6.36   0.000     3.166264    6.060914
             mpg |   263.1875   110.7961     2.38   0.020     42.15527    484.2197
         forxmpg |  -307.2166   108.5307    -2.83   0.006    -523.7294   -90.70369
         foreign |   11240.33   2751.681     4.08   0.000     5750.878    16729.78
           _cons |  -14449.58    4425.72    -3.26   0.002    -23278.65    -5620.51
    ------------------------------------------------------------------------------
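Because forxmpg is the product of foreign and mpg, the effect of one extra mile per gallon on price differs by origin: it is the mpg coefficient for domestic cars (foreign = 0) and the mpg coefficient plus the forxmpg coefficient for foreign cars (foreign = 1). The sketch below works this out from the coefficients in the table, holding weight fixed. The variable names are illustrative.

```python
# Coefficients copied from `regress price weight mpg forxmpg foreign`.
b_mpg = 263.1875
b_forxmpg = -307.2166


def mpg_slope(foreign):
    """Change in predicted price per extra mpg; foreign is 0 (domestic) or 1."""
    return b_mpg + b_forxmpg * foreign


slope_domestic = mpg_slope(0)
slope_foreign = mpg_slope(1)
print(round(slope_domestic, 4), round(slope_foreign, 4))
```

Under this fitted model the slope is positive for domestic cars and slightly negative for foreign cars, which is what the significant interaction term expresses.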
2. Auto data: regression diagnostics (graphical analysis)
    . vce, rho
                 |   weight      mpg  forxmpg  foreign    _cons
    -------------+---------------------------------------------
          weight |   1.0000
             mpg |   0.8408   1.0000
         forxmpg |  -0.5594  -0.7695   1.0000
         foreign |   0.6431   0.7747  -0.9715   1.0000
           _cons |  -0.9611  -0.9536   0.6861  -0.7407   1.0000
2. Auto data: regression diagnostics (graphical analysis)
• rvfplot graphs a residual-versus-fitted plot, that is, the residuals plotted against the fitted values.
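2. Auto data: regression diagnostics (graphical analysis)
• Reading the plot:
• 1. There is a linear relationship between price and the independent variables.
• 2. The residuals show an increasing or decreasing pattern, which indicates heteroskedasticity: increasing or decreasing variation in the residuals with the fitted values.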
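Testing what the plot suggests
• ovtest checks whether variables have been omitted from the model.
    . ovtest
    Ramsey RESET test using powers of the fitted values of price
           Ho:  model has no omitted variables
                     F(3, 66) =      7.77
                     Prob > F =      0.0002
• The test rejects Ho, which indicates omitted variables.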
2. Auto data: regression diagnostics (graphical analysis)
    . hettest
    Cook-Weisberg test for heteroskedasticity using fitted values of price
         Ho: Constant variance
             chi2(1)      =      6.50
             Prob > chi2  =      0.0108
• The test rejects constant variance, which indicates heteroskedasticity.
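The reported p-value can be reproduced by hand. For a chi-squared statistic with one degree of freedom, the upper-tail probability equals erfc(sqrt(x/2)), because such a statistic is the square of a standard normal variable. The sketch below applies that identity to the test statistic from the output; the identity holds only for one degree of freedom.

```python
import math

# Test statistic copied from the hettest output: chi2(1) = 6.50.
chi2_stat = 6.50

# Upper-tail probability of a chi-squared variable with 1 degree of freedom:
# P(Z^2 > x) = 2 * (1 - Phi(sqrt(x))) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(round(p_value, 4))
```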
2. Auto data: regression diagnostics (graphical analysis)
• lvr2plot graphs a leverage-versus-squared-residual plot, a graph of leverage against the (normalized) residuals squared.
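2. Auto data: regression diagnostics (graphical analysis)
• Analysis: VW Diesel is the only diesel-engine car in the data, and the data for Plym. Arrow were entered incorrectly. (The plot can be used in this way to check the data.)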
2. Auto data: regression diagnostics (graphical analysis)
• avplot graphs an added-variable plot (a.k.a. partial-regression leverage plot, partial regression plot, or adjusted partial residual plot) after regression.
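2. Auto data: regression diagnostics (graphical analysis)
Three properties of the added-variable plot:
1. The plot is drawn for each Xi against Y, and every point still corresponds to an observation in the original data.
2. The slope of the line in the plot equals the coefficient on Xi in the regression model, and the standard error matches that of the original regression (up to a degrees-of-freedom adjustment).
3. The outlierness of each observation (how far it lies from the fitted line) in determining the slope in the original regression is preserved.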
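The second property can be verified numerically. An added-variable plot for x1 plots the residuals of y regressed on the other regressors against the residuals of x1 regressed on the other regressors; the slope of the line through those points equals the coefficient on x1 in the full regression. The sketch below checks this with one other regressor, x2. The six data points are hypothetical and chosen only for illustration.

```python
def demean(values):
    """Subtract the mean, which absorbs the constant term of the regression."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical data (not from the auto dataset).
y = [3.0, 5.0, 4.0, 8.0, 9.0, 12.0]
x1 = [1.0, 2.0, 2.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 3.0, 2.0, 4.0, 3.0]

yd, x1d, x2d = demean(y), demean(x1), demean(x2)
s11, s22, s12 = dot(x1d, x1d), dot(x2d, x2d), dot(x1d, x2d)
s1y, s2y = dot(x1d, yd), dot(x2d, yd)

# Coefficient on x1 in the regression of y on x1 and x2 (2x2 normal equations).
b1 = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

# Residuals of y and of x1 after regressing each on x2.
e_y = [a - (s2y / s22) * b for a, b in zip(yd, x2d)]
e_x1 = [a - (s12 / s22) * b for a, b in zip(x1d, x2d)]

# Slope of the line in the added-variable plot for x1.
av_slope = dot(e_x1, e_y) / dot(e_x1, e_x1)

print(round(b1, 6), round(av_slope, 6))
```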
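2. Auto data: regression diagnostics (graphical analysis)
• Note: three observations stand out: Cadillac Eldorado, Lincoln ver, and Cadillac Seville. These three cars make up 100% of the luxury-car market, which shows that the original model is misspecified. The Plymouth Arrow in the lower-right corner, as noted earlier, is a data-entry error.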
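2. Auto data: regression diagnostics (graphical analysis)
• avplots graphs all the added-variable plots in a single image.
• Use this command to view the relationship between y and each xi in one figure, to analyze the regression model further, and to check the original data.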
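2. Auto data: regression diagnostics (graphical analysis)
• avplot and avplots are well suited to detecting outliers, but they cannot be used to examine the functional relationship between variables.
• cprplot (component-plus-residual plot) cannot detect outliers, but it can be used to check the functional form of the estimated model (straight line or curve?).
• What they share: in both plots the slope of the line is the coefficient from the model.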
2. Auto data: regression diagnostics (graphical analysis)
• Rebuild the model:
    . regress price mpg weight

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  2,    71) =   14.74
           Model |   186321280     2  93160639.9           Prob > F      =  0.0000
        Residual |   448744116    71  6320339.67           R-squared     =  0.2934
    -------------+------------------------------           Adj R-squared =  0.2735
           Total |   635065396    73  8699525.97           Root MSE      =  2514.0

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -49.51222   86.15604    -0.57   0.567    -221.3025     122.278
          weight |   1.746559   .6413538     2.72   0.008      .467736    3.025382
           _cons |   1946.069    3597.05     0.54   0.590    -5226.244    9118.382
    ------------------------------------------------------------------------------
2. Auto data: regression diagnostics (graphical analysis)
• acprplot (augmented component-plus-residual plot) is more sensitive for detecting nonlinearity.
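2. Auto data: regression diagnostics (graphical analysis)
• Now examine whether the effect of mpg on price is linear. Add a new variable to the model,
• mpgsq = mpg*mpg, and fit the regression. The results are: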
2. Auto data: regression diagnostics (graphical analysis)
    . gen mpgsq = mpg*mpg
    . regress price mpg mpgsq weight

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  3,    70) =   12.70
           Model |   223815416     3  74605138.6           Prob > F      =  0.0000
        Residual |   411249980    70  5874999.72           R-squared     =  0.3524
    -------------+------------------------------           Adj R-squared =  0.3247
           Total |   635065396    73  8699525.97           Root MSE      =  2423.8

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -981.0308   377.9748    -2.60   0.011    -1734.878   -227.1838
           mpgsq |   17.32961   6.859794     2.53   0.014     3.648184    31.01104
          weight |   .8344929   .7160289     1.17   0.248    -.5935816    2.262567
           _cons |   16106.35   6591.341     2.44   0.017     2960.333    29252.36
    ------------------------------------------------------------------------------
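2. Auto data: regression diagnostics (graphical analysis)
• Comparing the two tables:
• 1. The t statistic of mpgsq is 2.53, and the t statistic of mpg becomes -2.60.
• 2. The coefficient on weight in the second model is only about half of its value in the first model (0.83 versus 1.75), and it is not significant.
• This shows that the effect of mpg on price is not linear.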
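The comparison can be checked directly from the two regression tables. The sketch below computes the ratio of the two weight coefficients and, as an illustration of what the nonlinearity means, the mpg value at which the fitted quadratic in mpg stops decreasing (the vertex at -b/(2c)). The turning point is a property of this fitted equation only, not a claim made in the lecture.

```python
# Coefficients on weight in the two models.
weight_model1 = 1.746559    # regress price mpg weight
weight_model2 = 0.8344929   # regress price mpg mpgsq weight

# Coefficients on mpg and mpgsq in the second model.
b_mpg = -981.0308
b_mpgsq = 17.32961

# How much of the weight coefficient remains once mpgsq is added.
weight_ratio = weight_model2 / weight_model1

# Vertex of b_mpg * mpg + b_mpgsq * mpg**2: predicted price falls with mpg
# below this value and rises above it, holding weight fixed.
turning_point = -b_mpg / (2 * b_mpgsq)

print(round(weight_ratio, 3), round(turning_point, 2))
```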
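2. Auto data: regression diagnostics (graphical analysis)
• rvpplot graphs a residual-versus-predictor plot. If the model is correctly specified, the points should be scattered evenly, with no increasing or decreasing trend.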
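2. Auto data: regression diagnostics (graphical analysis)
• Analysis: in the plot, the residuals decrease as mpg increases, which indicates a problem with the model.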
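3. Exdata: practical application of data analysis
1. Importing Excel data into Stata
Step 1: Save the Excel file as a tab-delimited .txt file.
Step 2: Run the command: insheet using d:/stata/name.txt
2. Exporting Stata data so that Excel can open it
Step 1: outsheet using d:/stata/name.out (the path is where the file is written)
Step 2: Open the .out file in Excel.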
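3. Exdata: practical application of data analysis
• Hypothesis 1: The effect of R&D investment differs markedly between categories.
• Hypothesis 2: The effect of R&D investment in the low-technology category increases throughout.
• Hypothesis 3: The effect of R&D investment in the high-technology category follows no single increasing or decreasing trend: it decreases in the early periods of the experiment and increases in the later periods.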
3. Exdata: practical application of data analysis
• Hypothesis 1: The effect of R&D investment differs markedly between categories.
• Data used: d:/stata/hvsl.txt
    . insheet using d:/stata/hvsl.txt
    (4 vars, 40 obs)
    . describe
    Contains data
      obs:            40
     vars:             4
     size:           680 (99.7% of memory free)
    -------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------------------------
    experimentid    str10  %10s                   ExperimentID
    period          byte   %8.0g                  Period
    rdoutcomel      byte   %8.0g                  R&d outcomel
    rdoutcomeh      byte   %8.0g                  R&d outcomeh
    -------------------------------------------------------------------------------
    Sorted by:
         Note:  dataset has changed since last saved