COMPARISONS OF TWO-PART MODELS WITH COMPETITORS PETER A. LACHENBRUCH OREGON STATE UNIVERSITY DEPARTMENT OF PUBLIC HEALTH
Clumping at 0 • Some subjects show no response, others have a continuous, or at least ordered response • Examples: • Hospitalization expense in an HMO • Cell growth on plates • Urinary output in shock patients • Usual normal theory doesn’t apply
UO (urinary output) Analysis • Survival: 27/70 had UO=0; mean=127.9, s=148.13, skewness=1.13 • Deaths: 22/43 had UO=0; mean=31.0, s=71.76, skewness=3.37 • For these data: • t=3.01 (p=0.0032) • Wilcoxon z=2.794 (p=0.0052) • Kolmogorov-Smirnov p=0.001 • Two-part $\chi^2$=15.86 (p=0.00036)
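The two-part p-value above can be checked by hand, because the upper tail of a chi-squared variate with 2 d.f. is simply exp(-x/2). The sketch below reproduces that p-value and also computes a binomial component from the zero counts on the slide. The pooled-variance form of the binomial statistic and the function names are assumptions made for illustration; the slide does not say which form was used.

```python
import math

def chi2_2df_pvalue(x2):
    """Upper-tail p-value of a chi-squared variate with 2 d.f.: exp(-x2/2)."""
    return math.exp(-x2 / 2.0)

def binomial_chi2(zeros1, n1, zeros2, n2):
    """Squared two-sample z statistic for the proportions of zeros.

    Assumption: the variance uses the pooled proportion of zeros.
    """
    p1, p2 = zeros1 / n1, zeros2 / n2
    pooled = (zeros1 + zeros2) / (n1 + n2)
    var = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    return (p1 - p2) ** 2 / var

p_two_part = chi2_2df_pvalue(15.86)   # two-part statistic reported on the slide
b2 = binomial_chi2(27, 70, 22, 43)    # zero counts: survivors vs. deaths
print(round(p_two_part, 5), round(b2, 2))
```

Under these assumptions the binomial part is small relative to 15.86, which suggests that most of the two-part statistic in this example comes from the non-zero values.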
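Statistical Model • $f_i(x,d) = p_i^{1-d}\{(1-p_i)\,h_i(x)\}^{d}$, where $d=0$ for a zero response and $d=1$ otherwise, so $p_i$ is the probability of a zero in group $i$ • $H_0$: $p_1=p_2$ and $h_1=h_2$ • Tests: • t-test on full data set • Wilcoxon rank sum test • Kolmogorov-Smirnov • Two-part models: Bin+Z; Bin+W; Bin+KS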
What are the relative properties? • Right size? Is $\alpha$=0.05 when it’s supposed to be? • Are the null distributions correct? • What is the power of these procedures under various alternatives? (Use log-normal model) • Difference only in proportions • Difference only in means • Difference in both
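Two-part Tests • Define $B$, $Z$ and $W$ as the standardized binomial, $z$ and Wilcoxon statistics: $B$ compares the proportions of zeros, while $Z$ and $W$ compare the non-zero values • Then the two-part tests are: $B^2+Z^2$ (denoted as BZ), $B^2+W^2$ (denoted as BW) and $B^2+K^2$ (denoted as BK), where $K^2$ is the chi-squared value corresponding to the p-value of the KS statistic. • Since the two parts are independent, each two-part statistic is the sum of two 1 d.f. (central) chi-squared statistics under the null, i.e. a chi-squared statistic with 2 d.f.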
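A minimal sketch of the BZ statistic, under stated assumptions: the binomial part uses a pooled-variance z test on the proportions of zeros, and the continuous part uses a large-sample (unequal-variance) z test on the means of the non-zero values. The function name `two_part_bz` and the toy samples are hypothetical and are not taken from the talk.

```python
import math
from statistics import mean, variance

def two_part_bz(x1, x2):
    """Return (B2, Z2, X2, p) for two samples that may contain exact zeros.

    Needs at least two non-zero values per group and at least one zero
    and one non-zero overall (otherwise a division by zero occurs).
    """
    n1, n2 = len(x1), len(x2)
    pos1 = [v for v in x1 if v != 0]
    pos2 = [v for v in x2 if v != 0]
    # Binomial part: pooled z test on the proportions of zeros.
    zeros1, zeros2 = n1 - len(pos1), n2 - len(pos2)
    pooled = (zeros1 + zeros2) / (n1 + n2)
    se_b = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    b = (zeros1 / n1 - zeros2 / n2) / se_b
    # Continuous part: z test on the means of the non-zero values.
    se_z = math.sqrt(variance(pos1) / len(pos1) + variance(pos2) / len(pos2))
    z = (mean(pos1) - mean(pos2)) / se_z
    x2 = b * b + z * z
    # Upper tail of chi-squared with 2 d.f. is exp(-x/2).
    return b * b, z * z, x2, math.exp(-x2 / 2)

# Hypothetical toy data: more zeros and smaller values in group_a.
group_a = [0, 0, 0, 1.2, 2.5, 3.1, 0.8, 1.9, 2.2, 4.0]
group_b = [0, 2.9, 3.8, 5.1, 4.4, 6.0, 3.3, 4.7, 5.5, 2.6]
b2, z2, x2, p = two_part_bz(group_a, group_b)
print(round(b2, 3), round(z2, 3), round(x2, 3), round(p, 5))
```

Replacing the z test by a Wilcoxon or Kolmogorov-Smirnov component would give the BW and BK variants; only the second summand changes.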
Size of Tests • n1=n2=50, Equal means
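The size check can be sketched as a small Monte Carlo experiment. Everything specific here is an assumption made for illustration: a 10% chance of a zero in both groups, standard log-normal non-zero values, a z test applied to the logs of the non-zero values, 1000 replications and a fixed seed. It is not the simulation design used in the talk.

```python
import math
import random

def mean_var(values):
    """Sample mean and unbiased variance (needs at least two values)."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

def bz_statistic(x1, x2):
    """B^2 + Z^2, with Z computed on the logs of the non-zero values."""
    n1, n2 = len(x1), len(x2)
    log1 = [math.log(v) for v in x1 if v > 0]
    log2 = [math.log(v) for v in x2 if v > 0]
    zeros1, zeros2 = n1 - len(log1), n2 - len(log2)
    pooled = (zeros1 + zeros2) / (n1 + n2)
    b2 = 0.0  # no information about proportions if there are no zeros at all
    if 0.0 < pooled < 1.0:
        var_b = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
        b2 = (zeros1 / n1 - zeros2 / n2) ** 2 / var_b
    m1, v1 = mean_var(log1)
    m2, v2 = mean_var(log2)
    z2 = (m1 - m2) ** 2 / (v1 / len(log1) + v2 / len(log2))
    return b2 + z2

def simulate_size(n=50, p_zero=0.1, reps=1000, alpha=0.05, seed=1):
    """Fraction of null replications in which BZ exceeds the 2 d.f. cutoff."""
    rng = random.Random(seed)
    crit = -2.0 * math.log(alpha)  # chi-squared(2) critical value, about 5.99

    def draw():
        return [0.0 if rng.random() < p_zero else rng.lognormvariate(0.0, 1.0)
                for _ in range(n)]

    rejections = sum(bz_statistic(draw(), draw()) > crit for _ in range(reps))
    return rejections / reps

estimated_size = simulate_size()
print(estimated_size)
```

If the test has the right size, the printed rejection rate should be close to 0.05, up to Monte Carlo error of roughly 0.007 with 1000 replications.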
Power: n=50, 100 • p1=0.1, p2=0.2; mean difference=0
Power: n=50, 100; differ only in means • p=0.1, 0.2; mean=0.5
Power: n=100, p1=0.1, p2=0.2, mean=0.3, 0.5 • Proportion and mean are consonant
Power: n=100, p1=0.2, p2=0.1, mean=0.3, 0.5 • Proportion and mean are dissonant
Conclusions • These results are similar to those for other sample sizes and parameter combinations • Size is appropriate • Distributions match expectations, except for largest values • For differences only in proportions (low proportions), the BZ, BW and BK methods did well, Z did poorly
Conclusions (2) • For differences only in means, the W, K, Z, BW and BK did well • For consonant differences (mean and proportion in same direction), W, K, BW and BK did well, Z and BZ poorly • For dissonant differences, BW, BK and BZ were far superior to the others
Conclusions (3) • Theoretical results indicate that computing sample size or power with the non-central $\chi^2$ distribution gives excellent agreement with the simulated powers • Papers: • Comparisons - Statistics in Medicine 2001, p. 1215 • Non-central $\chi^2$ - Statistics in Medicine 2001, p. 1235
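The power calculation mentioned above can be sketched without special libraries, because a non-central chi-squared variate with 2 d.f. is a Poisson mixture of central chi-squared variates with 2, 4, 6, ... d.f., each of which has a closed-form tail. The function name and the idea of passing the noncentrality parameter directly are assumptions for illustration; how that parameter is derived from the proportions and means is left to the cited papers.

```python
import math

def power_two_part(noncentrality, alpha=0.05, terms=200):
    """P(non-central chi-squared with 2 d.f. exceeds the level-alpha cutoff).

    Uses the Poisson-mixture representation. `terms` truncates the series,
    which is adequate for moderate noncentrality values.
    """
    crit = -2.0 * math.log(alpha)       # exact chi-squared(2) critical value
    half_c = crit / 2.0
    half_l = noncentrality / 2.0
    power = 0.0
    poisson_w = math.exp(-half_l)       # Poisson weight for j = 0
    tail_term = 1.0                     # (c/2)^j / j!
    tail_sum = 1.0                      # sum of (c/2)^i / i! for i = 0..j
    for j in range(terms):
        # Tail of central chi-squared with 2(j+1) d.f.: exp(-c/2) * tail_sum
        power += poisson_w * math.exp(-half_c) * tail_sum
        poisson_w *= half_l / (j + 1)
        tail_term *= half_c / (j + 1)
        tail_sum += tail_term
    return power

for lam in (0.0, 5.0, 9.635, 15.0):
    print(lam, round(power_two_part(lam), 3))
```

With zero noncentrality the function returns the nominal level, and a noncentrality of about 9.6 corresponds to roughly 80% power at the 5% level.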
Selecting Variables for Two-Part Models Peter A. Lachenbruch and John Molitor Oregon State University
The Two-part Model • Some data have an excess of zero values. These aren’t easily modeled because of the spike at 0. • Can use a mixture model if one cannot distinguish a sampling zero from a structural zero. Example: telephone calls in a short period of time. If phone is turned on, some time periods may have no calls. If phone is turned off, there are no calls registered. • Can use two-part model if all zeros are structural. Example: hospitalization cost when an insured was not hospitalized. Size of growth on an agar plate if all activity is inhibited.
An equation or two • Let y be the response. It is zero if there is no response, and non-zero otherwise. Let h(y) be the conditional distribution of y given y>0 • Let d be an indicator of non-zero response and p the probability that d=1 • For a two-part model, we have $f(y,d) = (1-p)^{1-d}\{p\,h(y)\}^{d}$ • The log-likelihood is easy to compute: it separates into a part for p and a part for h, so the solution is simply the maximum likelihood estimate for p and for the mean (regression) of y.
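The separation of the log-likelihood can be written out as a short derivation. This is a sketch that takes p to be the probability of a non-zero response, as in the definition of d above; the symbol θ for the parameters of h is a label introduced here for illustration.

```latex
\ell(p,\theta)
  = \sum_{i=1}^{n} \Big[ (1-d_i)\log(1-p) + d_i \log p \Big]
  \;+\; \sum_{i:\,d_i=1} \log h(y_i;\theta)
```

The first sum involves only p and is a binomial log-likelihood, maximized at the observed proportion of non-zero responses (or by a logistic regression when p depends on covariates). The second sum involves only θ and uses the non-zero observations alone, so it can be maximized by an ordinary regression on those values.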
Inference • One estimates parameters using the individual components of the likelihood. These are standard estimates. For the zero-nonzero part we use a logistic regression, and for the nonzero values we use a multiple regression. • An issue is how to select variables for inclusion in a model. • Select variables separately for each part of the model? • Select variables for the model as a whole, using the 0 as if it were a regular observation?
Variable selection criteria • What criterion: • $R^2 = 1 - \mathrm{RSS}/\mathrm{SST}$ • $R^2_{adj} = 1 - \frac{n-1}{n-k-1}\,\mathrm{RSS}/\mathrm{SST}$ • $\mathrm{AIC} = n\ln(\mathrm{RSS}/n) + 2k + n + n\ln(2\pi)$ • $\mathrm{BIC} = n\ln(\mathrm{RSS}/n) + k\ln(n) + n + n\ln(2\pi)$ (these are for normal distribution models) • Use forward or backward stepping • P to enter 0.15, 0.05 • P to remove 0.15, 0.05 • Best subsets models? • For generalized linear models, the deviance is proposed.
Variable Selection • For the multiple regression, we can use stepwise regression. There are the usual concerns about stepwise. • We can use AIC, BIC, $R^2$ to select the best model. AIC and BIC penalize the selection based on the number of variables in the model. For normal distributions we have • $\mathrm{AIC} = n\ln(\mathrm{RSS}/n) + 2k + n + n\ln(2\pi)$ • $\mathrm{BIC} = n\ln(\mathrm{RSS}/n) + k\ln(n) + n + n\ln(2\pi)$ • Bias-adjusted versions of $R^2$ and AIC are also available
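The criteria can be computed directly from the residual and total sums of squares. The sketch below uses the Gaussian log-likelihood constant n·ln(2π) and assumes that k in the AIC and BIC penalties counts every estimated coefficient including the intercept, while the adjusted R² uses the number of predictors. That convention is an assumption, chosen because it is consistent with the AIC and BIC columns in the vselect output shown later in the deck. The function name is hypothetical, and the inputs are the sums of squares from the 14-predictor regression on the non-zero values.

```python
import math

def selection_criteria(rss, sst, n, n_predictors):
    """R2, adjusted R2, AIC and BIC for a normal-theory linear regression."""
    k = n_predictors + 1                      # assumed: intercept is counted
    r2 = 1.0 - rss / sst
    r2_adj = 1.0 - (n - 1) / (n - n_predictors - 1) * rss / sst
    # -2 * maximized Gaussian log-likelihood
    neg2ll = n * math.log(rss / n) + n + n * math.log(2.0 * math.pi)
    return {
        "R2": r2,
        "R2adj": r2_adj,
        "AIC": neg2ll + 2 * k,
        "BIC": neg2ll + k * math.log(n),
    }

# Sums of squares from the full regression on the non-zero values.
full = selection_criteria(rss=235.26075, sst=279.436196, n=347, n_predictors=14)
for name, value in full.items():
    print(name, round(value, 4))
```

The R² and adjusted R² values reproduce the 0.1581 and 0.1226 reported by regress for that model. The AIC and BIC differ only in the penalty, so BIC exceeds AIC by k(ln n − 2).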
More on selection • For the logistic part of the model, we use stepwise logistic regression and specify a p(enter) or p(remove) – this is based on the test of the odds ratio for each candidate variable. • For variable selection, most programs use a stepwise routine that selects on the basis of the test on the odds ratio (basically a normal theory test).
Single model methods • There are two single model methods we consider: • Include the 0 values in a multiple regression • This is obviously inappropriate, but users often have done this • In practice, it selects more variables and includes the ones that have been selected by the logistic and multiple regression models. • Conduct a Bayesian analysis of the variable selection problem. This is work in progress.
Computing - Stata • We use Stata for computing because it has some convenient selection commands. • The recently developed command, vselect, due to Lindsey and Sheather, allows one to do variable selection using AIC, BIC, $R^2$ and forward or backward stepping, as well as finding the best set of variables for each number of variables. • The Best subsets option uses the “leaps and bounds” algorithm, due to Furnival and Wilson, which vastly reduces the amount of computation.
More on selection • Unfortunately, at present, vselect works only for multiple regression and not for logistic regression. Thus, we considered two strategies: • Use stepwise logistic regression directly • Regress the 0-1 variable using regression and perform the variable selection operation on the results. • The vselect command first computes a multiple regression on all variables, then it computes the stepwise variable selection from the X’X matrix • It allows the use of $R^2$, AIC, BIC, Mallows’ C, and Best subsets regression. In the example, we use the Best option that gives all of the above • The Bayesian methods will be presented separately.
Example data • We use a data set courtesy of Lisa Rider. • laldo – ln(aldosterone) (response) • aldind – indicator for 0-1 • Dx2 – Polymyositis (1) or Dermatomyositis (2) • Agedx – age at diagnosis • Yeardx – year of diagnosis • gender – male (0) female (1) • Ild – interstitial lung disease Y/N • Arthritis – Y/N • Fever >100 – Y/N • Raynaud’s sign – Y/N • Mechhand – mechanics hands Y/N • palpitations – Y/N • Dysphagia – Y/N • Proximal weakness – Y/N • Race – W/NW • Realonspeed – onset speed 1
The prediction problem • We wish to predict laldo. However, 72 out of 420 values are 0. This leads to a clump of zero values. • We may wish to have a single set of predictors for laldo, or we may wish to have a set of predictors for the non-zero values and a (possibly distinct) set of predictors for the 0 values. • A related question is how we can evaluate the prediction ability of the resulting equations.
Example of vselect
. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed

      Source |       SS       df       MS              Number of obs =     347
-------------+------------------------------           F( 14,   332) =    4.45
       Model |  44.1754461    14  3.15538901           Prob > F      =  0.0000
    Residual |   235.26075   332  .708616718           R-squared     =  0.1581
-------------+------------------------------           Adj R-squared =  0.1226
       Total |  279.436196   346  .807619065           Root MSE      =  .84179

-------------------------------------------------------------------------
       laldo |     Coef. Std. Err.       t     P>|t| [95% Conf. Interval]
-------------+-----------------------------------------------------------
       agedx |    0.0061    0.0120    0.51   6.1e-01    -0.0176    0.0298
      yeardx |   -0.0015    0.0086   -0.18   8.6e-01    -0.0185    0.0154
         dx2 |   -0.7198    0.1617   -4.45   1.2e-05    -1.0379   -0.4016
      gender |   -0.1017    0.1016   -1.00   3.2e-01    -0.3015    0.0982
         ild |   -0.0200    0.1802   -0.11   9.1e-01    -0.3744    0.3345
   arthritis |    0.0548    0.0957    0.57   5.7e-01    -0.1334    0.2430
       fever |   -0.0830    0.1000   -0.83   4.1e-01    -0.2798    0.1138
     raynaud |    0.3457    0.1490    2.32   2.1e-02     0.0526    0.6389
    mechhand |   -0.0275    0.1822   -0.15   8.8e-01    -0.3859    0.3310
     palpita |   -0.2085    0.1973   -1.06   2.9e-01    -0.5966    0.1797
     dysphag |    0.2590    0.0983    2.63   8.8e-03     0.0656    0.4525
    proxweak |    0.4575    0.8487    0.54   5.9e-01    -1.2119    2.1270
     racewnw |   -0.0937    0.0991   -0.95   3.4e-01    -0.2887    0.1012
 realonspeed |   -0.1849    0.0445   -4.16   4.1e-05    -0.2723   -0.0974
       _cons |    6.6862   17.2356    0.39   7.0e-01   -27.2186   40.5910
-------------------------------------------------------------------------

The next slide gives the vselect command and output. Note the restriction that laldo>0 and u80 (an indicator variable that the patient was first diagnosed after 1980).
Vselect output
This is the vselect output on the non-zero values. We truncated at 6 variables selected – the actual output includes all 14 variables.
. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed, best

1 Observations Containing Missing Predictor Values
Response :            laldo
Fixed Predictors :
Selected Predictors:  dx2 realonspeed dysphag raynaud palpita gender racewnw fever arthritis proxweak agedx yeardx mechhand ild
Actual Regressions          37
Possible Regressions     16384

Optimal Models Highlighted:
   # Preds      R2ADJ          C        AIC       AICC        BIC
         1   .0663986   24.09272    888.755   1873.568   896.4537
         2   .1044985   10.09118   875.2897    1860.15   886.8377
         3   .1207073   4.734216   869.9412   1854.861   885.3385
         4   .1356839  -.1055272   864.9669   1849.957   884.2135
         5   .1361631   .7231399   865.7583   1850.832   888.8543
         6   .1365321   1.595634    866.591    1851.76   893.5363

Selected Predictors
   1 : dx2
   2 : dx2 realonspeed
   3 : dx2 realonspeed raynaud
   4 : dx2 realonspeed dysphag raynaud
   5 : dx2 realonspeed dysphag raynaud racewnw
   6 : dx2 realonspeed dysphag raynaud palpita racewnw

In this case, the program computed 37 regressions out of 16384 (= $2^{14}$) possible regressions.
Selecting predictors for 0 indicator
For the logistic regressions we use stepwise logistic regression that selects variables based on odds ratios. We use forward stepping with a p-to-enter of 0.15.
. stepwise, pe(.15): logistic aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80
note: proxweak dropped because of estimability
note: 1 obs. dropped because of estimability
                      begin with empty model
p = 0.0036 <  0.1500  adding  palpita
p = 0.0322 <  0.1500  adding  arthritis
p = 0.0340 <  0.1500  adding  gender

Logistic regression                               Number of obs   =        418
                                                  LR chi2(3)      =      17.40
                                                  Prob > chi2     =     0.0006
Log likelihood = -183.34326                       Pseudo R2       =     0.0453

-------------------------------------------------------------------------
      aldind | Odds Ratio Std. Err.      z     P>|z| [95% Conf. Interval]
-------------+-----------------------------------------------------------
     palpita |    0.3060    0.1217   -2.98   2.9e-03     0.1403    0.6674
   arthritis |    1.8598    0.5150    2.24   2.5e-02     1.0809    3.2000
      gender |    0.4839    0.1657   -2.12   3.4e-02     0.2474    0.9466
-------------------------------------------------------------------------

. estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+---------------------------------------------------------------
           . |    418   -192.0435   -183.3433      4     374.6865    390.8284
-----------------------------------------------------------------------------
Note: N=Obs used in calculating BIC; see [R] BIC note

We see that the dx2 and onset speed variables did not enter, so somewhat different variables predict 0-ness than the magnitude of response.
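The information criteria and fit summaries in this output follow directly from the two log-likelihoods. The sketch below recomputes them from the reported values, using the general definitions AIC = −2·ll + 2·df and BIC = −2·ll + df·ln(N). The variable names are illustrative.

```python
import math

# Values copied from the estat ic table above.
ll_null, ll_model = -192.0435, -183.3433
n_obs, df = 418, 4          # df = 3 selected predictors + intercept

aic = -2.0 * ll_model + 2.0 * df
bic = -2.0 * ll_model + df * math.log(n_obs)
lr_chi2 = 2.0 * (ll_model - ll_null)      # likelihood-ratio statistic, 3 d.f.
pseudo_r2 = 1.0 - ll_model / ll_null      # McFadden's pseudo R-squared

print(round(aic, 4), round(bic, 4), round(lr_chi2, 2), round(pseudo_r2, 4))
```

The results agree with the reported AIC, BIC, LR chi2 and Pseudo R2 up to rounding of the log-likelihoods.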
Selecting predictors for 0 with regression, ignoring binomial form
We display only results for first five selected variables.
. regress aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80

      Source |       SS       df       MS              Number of obs =     419
-------------+------------------------------           F( 14,   404) =    1.84
       Model |  3.56544676    14  .254674768           Prob > F      =  0.0319
    Residual |  56.0622382   404  .138767916           R-squared     =  0.0598
-------------+------------------------------           Adj R-squared =  0.0272
       Total |   59.627685   418  .142649964           Root MSE      =  .37252

-------------------------------------------------------------------------
      aldind |     Coef. Std. Err.       t     P>|t| [95% Conf. Interval]
-------------+-----------------------------------------------------------
       agedx |   -0.0053    0.0047   -1.14   2.5e-01    -0.0145    0.0038
      yeardx |    0.0017    0.0035    0.50   6.2e-01    -0.0051    0.0085
         dx2 |   -0.0281    0.0646   -0.43   6.6e-01    -0.1550    0.0988
      gender |   -0.0857    0.0416   -2.06   4.0e-02    -0.1675   -0.0039
         ild |   -0.0459    0.0714   -0.64   5.2e-01    -0.1862    0.0944
   arthritis |    0.0789    0.0380    2.08   3.8e-02     0.0043    0.1535
       fever |    0.0636    0.0396    1.61   1.1e-01    -0.0143    0.1414
     raynaud |    0.0049    0.0599    0.08   9.4e-01    -0.1129    0.1226
    mechhand |    0.0803    0.0765    1.05   2.9e-01    -0.0701    0.2306
     palpita |   -0.2003    0.0701   -2.86   4.5e-03    -0.3382   -0.0624
     dysphag |   -0.0360    0.0390   -0.92   3.6e-01    -0.1127    0.0407
    proxweak |   -0.2055    0.3751   -0.55   5.8e-01    -0.9429    0.5319
     racewnw |    0.0280    0.0395    0.71   4.8e-01    -0.0496    0.1057
 realonspeed |   -0.0053    0.0178   -0.30   7.6e-01    -0.0404    0.0297
       _cons |   -2.2499    6.9270   -0.32   7.5e-01   -15.8673   11.3676
-------------------------------------------------------------------------
Selecting predictors for 0 with regression, ignoring binomial form, 2
. vselect aldind agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best

2 Observations Containing Missing Predictor Values
Response :            aldind
Fixed Predictors :
Selected Predictors:  palpita arthritis gender fever agedx mechhand dysphag racewnw ild proxweak yeardx dx2 realonspeed raynaud
Actual Regressions          62
Possible Regressions     16384

Optimal Models Highlighted:
   # Preds      R2ADJ          C        AIC       AICC        BIC
         1   .0197545   5.197552   366.7613    1555.89    374.837
         2    .028156   2.597088   364.1486   1553.316   376.2622
         3   .0365444   .0194683   361.5079   1550.724   377.6594
         4   .0389249   .0159628   361.4605   1550.735   381.6499
         5   .0403595   .4189426   361.8213   1551.164   386.0485

Selected Predictors
   1 : palpita
   2 : palpita arthritis
   3 : palpita arthritis gender
   4 : palpita arthritis gender fever
   5 : palpita arthritis gender fever agedx

Note that the first three selected variables (palpita, arthritis, gender) are identical to those chosen by the stepwise logistic regression.
Multiple regression with 0 in the data set
We now consider the model including 0 as part of the data. This may be made a bit easier having taken logs of the non-zero values, so the 0s aren’t quite so obviously different.
. regress laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80

      Source |       SS       df       MS              Number of obs =     419
-------------+------------------------------           F( 14,   404) =    2.84
       Model |    62.68539    14  4.47752786           Prob > F      =  0.0004
    Residual |  638.017201   404   1.5792505           R-squared     =  0.0895
-------------+------------------------------           Adj R-squared =  0.0579
       Total |  700.702591   418  1.67632199           Root MSE      =  1.2567

-------------------------------------------------------------------------
       laldo |     Coef. Std. Err.       t     P>|t| [95% Conf. Interval]
-------------+-----------------------------------------------------------
       agedx |   -0.0075    0.0157   -0.48   6.4e-01    -0.0383    0.0234
      yeardx |    0.0024    0.0117    0.21   8.4e-01    -0.0206    0.0254
         dx2 |   -0.6763    0.2178   -3.11   2.0e-03    -1.1044   -0.2482
      gender |   -0.3182    0.1404   -2.27   2.4e-02    -0.5941   -0.0423
         ild |   -0.1800    0.2408   -0.75   4.6e-01    -0.6533    0.2933
   arthritis |    0.2548    0.1280    1.99   4.7e-02     0.0031    0.5065
       fever |    0.1069    0.1336    0.80   4.2e-01    -0.1557    0.3695
     raynaud |    0.3104    0.2021    1.54   1.3e-01    -0.0868    0.7076
    mechhand |    0.2043    0.2580    0.79   4.3e-01    -0.3029    0.7115
     palpita |   -0.7101    0.2366   -3.00   2.9e-03    -1.1753   -0.2449
     dysphag |    0.1165    0.1315    0.89   3.8e-01    -0.1422    0.3751
    proxweak |   -0.0250    1.2653   -0.02   9.8e-01    -2.5124    2.4625
     racewnw |   -0.0079    0.1332   -0.06   9.5e-01    -0.2698    0.2541
 realonspeed |   -0.1742    0.0601   -2.90   4.0e-03    -0.2924   -0.0560
       _cons |   -0.8421   23.3682   -0.04   9.7e-01   -46.7806   45.0964
-------------------------------------------------------------------------
Using vselect on the full data set
Displaying the first seven models.
. vselect laldo agedx yeardx dx2 gender ild arthritis fever raynaud mechhand palpita dysphag proxweak racewnw realonspeed if u80, best

2 Observations Containing Missing Predictor Values
Response :            laldo
Fixed Predictors :
Selected Predictors:  dx2 palpita realonspeed gender arthritis raynaud dysphag fever mechhand ild agedx yeardx racewnw proxweak
Actual Regressions          47
Possible Regressions     16384

Optimal Models Highlighted:
   # Preds      R2ADJ          C        AIC       AICC        BIC
         1   .0154376   20.79848   1401.003   2590.131   1409.079
         2   .0322276   14.33945    1394.79   2583.957   1406.904
         3    .048014   8.358132   1388.891   2578.106   1405.042
         4   .0580737   4.926931   1385.429   2574.703   1405.618
         5   .0673386   1.865516   1382.274   2571.617   1406.501
         6   .0695667   1.901132   1382.256   2571.677   1410.521
         7   .0699354   2.752656   1383.071   2572.582   1415.374

Selected Predictors
   1 : dx2
   2 : dx2 palpita
   3 : dx2 palpita realonspeed
   4 : dx2 palpita realonspeed arthritis
   5 : dx2 palpita realonspeed gender arthritis
   6 : dx2 palpita realonspeed gender arthritis raynaud
   7 : dx2 palpita realonspeed gender arthritis raynaud dysphag

There are some differences in the variables selected by logistic regression and multiple regression. Raynaud’s and dysphagia were selected in the multiple regression.
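The criteria in this table do not all point to the same model size, which is easy to see by picking the best row for each column. The sketch below uses only the seven displayed rows, so it says nothing about larger models that were truncated from the slide. The variable names are illustrative.

```python
# (number of predictors, R2ADJ, AIC, AICC, BIC) copied from the table above
rows = [
    (1, 0.0154376, 1401.003, 2590.131, 1409.079),
    (2, 0.0322276, 1394.790, 2583.957, 1406.904),
    (3, 0.0480140, 1388.891, 2578.106, 1405.042),
    (4, 0.0580737, 1385.429, 2574.703, 1405.618),
    (5, 0.0673386, 1382.274, 2571.617, 1406.501),
    (6, 0.0695667, 1382.256, 2571.677, 1410.521),
    (7, 0.0699354, 1383.071, 2572.582, 1415.374),
]

best_size = {
    "R2ADJ": max(rows, key=lambda r: r[1])[0],  # larger is better
    "AIC": min(rows, key=lambda r: r[2])[0],    # smaller is better
    "AICC": min(rows, key=lambda r: r[3])[0],
    "BIC": min(rows, key=lambda r: r[4])[0],
}
print(best_size)
```

Among the displayed models, BIC favors the smallest model and adjusted R² the largest, which reflects the heavier penalty that BIC places on additional variables.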
Future Steps • Develop a full Bayesian analysis/model • May include a model that involves selection of variables with 0 values in the variable selection set or may involve a Bayesian model on the non-zero values and a model for the variable of zero and non-zero values • Develop a model using a bootstrap and select based on Wald statistics • Stay tuned…