离散选择模型

离散选择模型 王志刚 2003.12

线性概率模型(LPM) • Yi= E(Yi|Xi) + ui. pi=p(Yi=1|Xi) • Y can take only two values. It is 1 with probability pi and 0 with probability (1 − pi): • E(Yi|Xi) = 1 × pi+ 0 × (1 − pi) =pi=β1 +β2Xi. • This means we can rewrite the model as • Yi= β1+ β2Xi+ ui. • Marginal effects: βk • 问题: • 1)Pi<0; • 2)var(ui|Xi)=(β1+ β2Xi )(1- β1-β2Xi )

概率单位模型(probit model)和对数单位模型(logit model) • .

三个模型估计系数的大概关系 • LPM*2.5=Probit • LPM*4=logit • Probit*1.6=logit • 以上只是一个大概的关系.

估计方法—极大似然法(MLE)

模型的拟合指标 • LogL1是关注的模型的极大似然值;LogL0是所有参数除了截距项以外均为0时的极大似然值;显然LogL0 <LogL1<0;LogL1会在软件包中输出. • 1.Amemiya(1981) • Pseudo R^2=1-1/[1+2(LogL1-LogL0)/N]. • 2.McFadden(1974): • McFadden R^2=1- LogL1/ LogL0 • 计算受约束极大似然函数可以直接从样本大小N和样本发生频率N1计算; • P{Yi=1|x)=p; MLE of p=N1/N, where • logL0=N1log(N1/N)+(N-N1)log(1-N1/N);

模型的估计 • 3.正确预测百分数:对每个i,都计算Yi取1的概率pi,如果pi>0.5,则Yi取1,否则取0.预测的Yi与实际Yi相符的次数所占百分比称为正确预测百分数. • Probit model估计系数的含义:

模型估计

发生比(odds)和发生比率(odds ratio)

连续自变量的发生比率 • .

离散自变量的发生比率

有序向应模型(以probit模型为例) • . • 识别问题:不同的参数组合可能产生相同的极大似然值.只要保持一定的参数比率; • 因此要进行标准化:

多重选择模型 • 当存在多种选择,而且这些选择之间没有程度的不同,不涉及排序问题,就应用多重的probit,或logit 模型. • 假设残差项独立.这意味着(控制可观测变量的基础上),任何两个可选择的效用是独立的;问题在于当多个选择相似时,例如交通方式的选择,坐车,乘船,坐飞机;但是有人对颜色有不同的偏好,就把每种方式菜系分为红黄蓝三色,此时有六种选择,但是统一交通方式之间相似,这就不能用多重选择模型,而要用分层选择模型.

Probit model using stata • probit depvar [indepvar][weight][if exp][in range][,level(#) nocoef noconstant robust cluster(varname) score(newvarname) asis offset(varname) maximize_options] • dprobit depvar indepvars [weight][if exp][in range][,at(matname) classic probit_options]. • predict [type] newvarname [if exp][in range]{,p|xb|stdp|}rules asif nooffset]

explanation • probit estimates a MLE • dprobit estimates alternative MLE,reports the change in probability for an infinitesimal change in each independent,continuous variable (default).reports discrete change for dummy variables. • scores(.) create newvar • asis:request that all specified variables and obervatioins be retained in maximization process. • offset() specifies varname to be included in model with coefficient constrained to be 1.

Explanation for options • Options for dprobit • at(matname) specifies point around which the transformation of results is made. Default is perform transformation around • classic request mean effects be calculated using formula for continuous; • if for dummy,use • Prediction: • P:probability of a positive outcome. • Xb: calculate the linear predictioin. • stdp:standard error of linear prediction.

Output after dprobit • ereturn list (在dprobit之后使用)输出各种结果; • 其中: • e(xbar):表示在样本均值处的指数值;e(b)指的是对应的分布函数值;e(r2_p):pseudo R2; • e(par):fraction of success observed in data. • e(dfdx):marginal effects. • e(se_dfdx):standard error of marginal effects.

例子:参与工会模型 • 因变量:参加工会union • 自变量:潜在经验potexp,工作经验的平方exp2,受教育年数grade,婚否married,工会化程度high. • 分别采用线性概率模型,probit(包括mfx),dprobit,logit 模型进行分析,结果如下:

LPM results • reg union potexp exp2 grade married high /*LPM*/ • ------------------------------------------------------------------------------ • union | Coef. Std. Err. t P>|t| [95% Conf. Interval] • ----------+---------------------------------------------------------------- • potexp| .0200388 .0038969 5.14 0.000 .0123916 .0276859 • exp2 | -.0003706 .0000819 -4.53 0.000 -.0005313 -.0002099 • grade| -.0124636 .0051005 -2.44 0.015 -.0224725 -.0024547 • married| .0133428 .030001 0.44 0.657 -.0455298 .0722153 • high| .1439396 .0256785 5.61 0.000 .0935492 .1943299 • _cons| .1021368 .0749337 1.36 0.173 -.0449096 .2491832 • ------------------------------------------------------------------------------

Probit results • Probit estimates Number of obs = 1000 LR chi2(5)= 93.09 Prob > chi2 = 0.0000 Log likelihood =-475.2514 Pseudo R2 = 0.0892 • ------------------------------------------------------------------------------ • union | Coef. Std. Err. z P>|z| [95% Conf. Interval] • ---------+---------------------------------------------------------------- • potexp | .0835091 .0156087 5.35 0.000 .0529166 .1141016 • exp2 | -.0015308 .0003179 -4.82 0.000 -.0021538 -.0009078 • grade | -.042078 .0189089 -2.23 0.026 -.0791388 -.0050171 married | .0622516 .1125836 0.55 0.580 -.1584083 .2829115 • high | .5612953 .099662 5.63 0.000 .3659613 .7566292 • _cons | -1.468412 .2958112 -4.96 0.000 -2.048192 -.8886332 • ------------------------------------------------------------------------------

using mfx after probit estimation • Marginal effects after probit ; y=Pr(union) (predict)=.19047616 • ------------------------------------------------------------------------------ variable| dy/dx Std. Err. z P>|z| [ 95% C.I. ] X • -----+-------------------------------------------------------------------- potexp| .0226964 .00415 5.47 0.000 .014557 .030836 18.884 exp2 | -.000416 .00008 -4.90 0.000 -.000583 -.00025 519.882 grade | -.0114361 .00514 -2.23 0.026 -.021506 -.001366 13.014 married*| .0167881 .03011 0.56 0.577 -.042234 .07581 .641 high*| .1470987 .0247 5.96 0.000 .098687 .195511 .568 • ------------------------------------------------------------------------------ • (*) dy/dx is for discrete change of dummy variable from 0 to 1

marginal effectscoefficient estimation using dprobit (和前一页结果一样) • ------------------------------------------------------------------------------ union | dF/dx Std. Err. z P>|z| x-bar [ 95% C.I. ] • ---+-------------------------------------------------------------------- potexp|.0226964 .0041529 5.35 0.000 18.884 .014557 .030836 exp2| -.000416 .000085 -4.82 0.000 519.882 -.000583 -.00025 grade| -.0114361 .0051379 -2.23 0.026 13.014 -.021506 -.001366 married*|.0167881.0301137 0.55 0.580 .641 -.042234 .07581 high*|.1470987 .0247005 5.63 0.000 .568 .098687 .195511 • -----+-------------------------------------------------------------------- • obs. P | .216 • pred. P | .1904762 (at x-bar) 默认的;当然可以指定其他值; • ------------------------------------------------------------------------------ • (*) dF/dx is for discrete change of dummy variable from 0 to 1 • z and P>|z| are the test of the underlying coefficient being 0

Marginal index effcts vs marginal probability effects evaluated at sample mean • 边际指数效应:就是对应的估计系数值; • 在样本均值处的边际概率效应等于相应的估计系数值乘以相应的在样本均值处的概率密度函数值.

以教育为例的结果 • scalar pdfxallbar=normden(e(xbar)) (注依次是dprobit 之后没有at(.);如果有at(.),则可以写为: scalar pdfxallbar=normden(e(at)) . • lincom _b[grade]*pdfxallbar (在dprobit之后使用) • ( 1) .2717829 grade = 0 -------------------------------------------------------------------- union|Coef. Std. Err. z P>|z| [95% Conf. Interval] • ---+---------------------------------------------------------- (1)|-.0114361 .00513 -2.230.026 -.02150 -.00136 • ---------------------------------------------------------------- • 蓝色的是概率偏效应系数;绿色是指数偏效应(probit)系数; • 具体细节见程序说明;

t 检验和F检验 • test potexp=grade • ( 1) potexp - grade = 0 • chi2( 1) = 24.58; Prob > chi2 = 0.0000 • . test potexp grade exp2 married high • ( 1) potexp = 0 • ( 2) grade = 0 • ( 3) exp2 = 0 • ( 4) married = 0 • ( 5) high = 0 • chi2( 5) = 81.40;Prob > chi2 = 0.0000

Ordered probit model • oprobit depvar [varlist][weight][if exp][in range][,table robust cluster(varname) score(newvarlist) level(#) maximize_options] • predict [type] newvarname [if exp][in range][,{p|xb|stdp}outcome(outcome) nooffset] • Options:k:number of categories • table showing how probability are computed from the fitted equation. • cluster : specifies observations are independent across groups(clusters) but not necessary within groups;it equal to robust; • score:first is dLnLj/dln(xjb);second is dLnLj/d_cut1j • Kth isdLnLj/d_cut(k-1)j

Options for prediction • p:default calculate predicted probabilities. If don’t specify outcome(),you must specify k new variables,k is number of categories of dependent variables;if you specify outcome(),you must specify one new variable; • xb,stdp,nooffset as the same as probit option • outcome(outcome) :specifies for which outcome the predicted probabilities to be computed,it should contain either one single value of dependent variable,or one of #1,#2,… • #1 represent the first category of dependent variable.

例子：对自然资源愿意支付的程度 • 每个人都面临着三个标的：初始的，以及上标和下标；是观测不到的个人支付意愿。因变量depvar;自变量：Age,income 均为分组变量；

输出结果 • Ordered probit estimates LR chi2(3) = 37.90 • Prob > chi2 = 0.0000 • Log likelihood = -359.24085 Pseudo R2 = 0.0501 • ------------------------------------------------------------------------------ • depvar| Coef. Std. Err. z P>|z| [95% Conf. Interval] • -- ---+---------------------------------------------------------------- • age | -.1750912 .0437968 -4.00 0.000 -.2609313 -.0892511 female | -.2436467 .128588 -1.89 0.058 -.4956745 .0083811 income | .1507535 .0507301 2.97 0.003 .0513242 .2501827 • depvar | Probability Observed • -------------+------------------------------------- • 1 | Pr( xb+u<_cut1) 0.3942 • 2 | Pr(_cut1<xb+u<_cut2) 0.0577 • 3 | Pr(_cut2<xb+u<_cut3) 0.3622 • 4 | Pr(_cut3<xb+u) 0.1859

四种选择的概率 • Coef. Std. Err. --+---------------------------------------------------------------- _cut1 | -.5733944 .2272618 (Ancillary parameters) _cut2 | -.4122675 .226091 _cut3 | .6812162 .2280778 • ------------------------------------------------------------------------------ • 上面估计出的是临界点的值 • dis proby1 proby2 proby3 proby4 • .32413623.05984678.40371147.21230552 • 或predict p1 p2 p3 p4

Logistic regression

Logistic regression using stata • logistic depvar varlist [weight][if exp][in range][,level(#) robust cluster(varname) score(newvarname) asis offset(varname) coef maximize_options] • lfit [depvar][weight][if exp][in range][,group(#) table outsample all beta(matname)] • lstat [depvar][weight][if exp][in range][,cutoff(#) all beta(matname)] • lroc [depvar][weight][if exp][in range][,nograph graph_options all beta(matname)] • Lsens [depvar][weight][if exp][in range][,nograph genprob(varname) gensens(varname) genspec(varname) replace all beta(matname)]

Syntax for predict • predict [type] newvarname [if exp][in range][, statistic rules asif nooffset]. • When statistics is : • {p|xb|stdp} • All kinds of statistics as the same as logit prediction options.

各种诊断 • lfit :displays either pearson goodness of fit test or Hosmer-Lemeshow goodness of fit test. • lstat: display various summary statistics • lroc:graphs and calculates area under ROC( • lsens:graphs sensitivity and specificity vs probability cutoff and optioinally creates new variables containing these data.

例子：前面的加入工会例子 • logistic union potexp exp2 grade married high • Logistic regression LR chi2(5) = 92.49 • Prob > chi2= 0.0000 Log likelihood = -475.55411 Pseudo R2 = 0.0886 • ------------------------------------------------------------------------------ • union | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] • ------------+---------------------------------------------------------------- • potexp| 1.15882 .0325594 5.25 0.000 1.09673 1.224425 • exp2 | .9973167 .0005639 -4.75 0.000 .9962121 .9984225 • grade | .9320947 .0299594 -2.19 0.029 .8751866 .9927031 • married | 1.122393 .2208633 0.59 0.557 .7632141 1.650606 • high | 2.664832 .4798005 5.44 0.000 1.872457 3.79252 • ------------------------------------------------------------------------------

皮尔逊拟合优度检验 • lfit • Logistic model for union, goodness-of-fit test • number of observations = 1000 • number of covariate patterns = 556 • Pearson chi2(550)=529.72 • Prob > chi2 = 0.7255 • 表明模型充分地拟合了数据。

正确预测的概率 • lstat • Logistic model for union • -------- True -------- • Classified | D ~D | Total • -----------+--------------------------+----------- • + | 0 3 | 3 • - | 216 781 | 997 • -----------+--------------------------+----------- • Total | 216 784 | 1000 • Correctly classified 78.10% • 78.10%=1-(216+3)/1000

Receiver operating characteristic(ROC) curve • Logistic model for union :area under ROC curve = 0.7083

ROC • Plot sensitivity vs (1-specificity) as the cutoff c varied,and calculate area under it; • (0,0)对应的c=1;(1,1)对应的c=0; • 横轴(specificity):fraction of observed negative outcome cases that are correctly classified. • 纵轴(sensitivity)：fraction of observed positive outcome cases that are correctly classified. • 模型越有解释力，则曲线弯曲的程度越大；最没有解释力的是在45度直线上（对应的面积为0.5）。具有完全解释力的模型对应的面积为1。 • 本例子模型解释力还可以。

dbeta 距离指标检验异常点

Wald test • test exp2 • ( 1) exp2=0;chi2(1)=22.58;Prob > chi2 =0.0000 • . test potexp=grade • ( 1) potexp - grade = 0 • chi2( 1) = 24.12;Prob > chi2 = 0.0000

Logit model • logit depvar [indepvars][weight][if exp][in range][,nocoef noconstant robust cluster(varname) score(newvarlist) level(#) maximize_options]. • predict [type] newvarname [if exp][in range][,statistic rules asif nooffset] • Statistic is : • {p|xb|stdp}as the same as probit.

Predict • dbeta :Pregibon(1991) influence statistic. • deviance:deviation residual;越大，模型越差； • dx2:Hosmer,and Leme influence statistic. • ddeviance:Hosmer,Leme influence statistic. • hat:pregibon,leverage. • residuals,pearson residuals， • rstandard:standard pearson residuals;或标准化残差;大样本时服从正态分布N(0,1). • or:reports estimated coefficients transformed into odds ratios.exp(beta) instead of beta;

Mlogit procedure • mlogit depvar [indepvars][weight][if exp][in range][,basecategory(#) constraints(clist) noconstant robust cluster(varname) score(newvarlist) rrr level(#) maximize_options]. • predict [type] newvarname [if exp][in range][,{p|xb|stdp|stddp }outcome(outcome).

explanation • basecategory(#):specifies the value of depvar that treated as base category;default is to choose the most frequent category. • constraints(clist):linear constrained estimation. • score(newvarlist):create k-1 variables,k is number of observed outcomes;the mth is dLnLj/dlnxjbm; • Predict:p,xb,stdp as before;stdpp:calculates standard error of the difference in two linear predictions,must specifies outcome(outcome); • e.g:predict sed1,stdpp(1,3) • rrr:reports estimated coefficients transformed to relative risk ratio.I.e:exp(b) instead of b;

例子：加入不同行业 • 数据仍然使用前面工会例子所使用的;若选区第一类产业（自然资源）作为基准； • 因变量：产业选择ind1; • 自变量：potexp,exp2,grade,married,union,high • Multinomial logit model results:

输出结果 • ………………………………………………………..省略 • 13 | • potexp | -.2179493 .0665262 -3.28 0.001 -.3483383 -.0875603 • exp2 | .004946 .0015428 3.21 0.001 .0019222 .0079698 • union | .7321836 .8778095 0.83 0.404 -.9882914 2.452659 • grade | .376377 .0877074 4.29 0.000 .2044736 .5482805 • high | -59.32135 4738540 -0.00 1.000 -9287427 9287308 • married | -.8431111 .4845398 -1.74 0.082 -1.792792 .1065695 • _cons | 22.52589 1.349662 16.69 0.000 19.8806 25.17118 • -------------+---------------------------------------------------------------- • 14 | • potexp | -.1024596 .1043477 -0.98 0.326 -.3069773 .102058 • exp2 | .0023012 .0021146 1.09 0.276 -.0018433 .0064458 • union | .47805 .5203087 0.92 0.358 -.5417363 1.497836 • grade | .3177647 .1038616 3.06 0.002 .1141998 .5213296 • high | .5012028 1.846817 0.27 0.786 -3.118492 4.120898 • married | -1.060803 .7061995 -1.50 0.133 -2.444929 .3233226 • _cons | -2.141218 . . . . . • (Outcome ind1==1 is the comparison group)

例子：选择行业(logit….,rrr） • …… rrr (odds ratio) • -------------+---------------------------------------------------------------- • 14 | • potexp | .8988697 .0933506 -1.03 0.305 .7333253 1.101785 • exp2 | 1.001984 .0021047 0.94 0.345 .9978678 1.006118 • union | 1.284149 .6549112 0.49 0.624 .4726132 3.489193 • high | 1.233808 1.457804 0.18 0.859 .1217621 12.50211 • married | .4224411 .295379 -1.23 0.218 .1072975 1.663193 • ------------------------------------------------------------------------------ • (Outcome ind1==1 is the comparison group)

其他过程 • Ordered logit: ologit • Conditional logistic regression:clogit • Weighted least squares for grouped data:glogit;gprobit; • Nested logit model:nlogit • Panel logit/probit:xtlogit,xtprobit • Sas/stat/ proc logistic; • 推荐参考书： • 1。Logistic 回归模型方法与应用;高教出版社。 • 2。Statistics with stata5.0 • 3.handbook of stata;

离散选择模型

离散选择模型

Presentation Transcript