Estimation of Effect Size in Trials Stopped Early. Janet Wittes, Statistics Collaborative. University of Pennsylvania Annual Conference on Statistical Issues in Clinical Trials, April 13, 2011.
The problem • We know how to stop trials early for benefit • Most common boundary: O’Brien-Fleming • We do not generally operate algorithmically • So “boundary” is a “guideline” • We know the observed effect overestimates truth • So…how do we estimate effect size?
The problem is very hard… • So, my recommendations will not be “wrong” • But, unfortunately, they won’t be “right”
The solution • Frequentist – we have many choices • Bayesian/likelihood – we have no problem • It is what it is….
Some examples of early stopping • MERIT-HF • Study stopped at 2nd interim analysis • RR 0.66; 95% CI = (0.53, 0.81) • Sunitinib in pancreatic islet cell tumors • Study stopped with half the patients • PFS 5.5 mo vs 11.1 mo; HR = 0.4; p < 0.001 • RALES – will discuss later
The Bassler paper • Objective: to compare the treatment effect from truncated RCTs with that from meta-analyses of RCTs addressing the same question but not stopped early (nontruncated RCTs) and to explore factors associated with overestimates of effect • Conclusions: Truncated RCTs were associated with greater effect sizes than RCTs not stopped early. • This difference was independent of the presence of statistical stopping rules and was greatest in smaller studies.
The responses: paper is off base • JAMA allows at most three letters!!!! • Scott Berry, Carlin, Connor • Ellenberg, DeMets, Fleming • Goodman, Don Berry, Wittes • Korn, Freidlin, Mooney
Reasons: math faulty • Berry-Carlin-Connor • “paper incorporated an important scientific and logical error that led to invalid conclusions” • Goodman-Berry-Wittes • “unfortunately, their conclusions are based on faulty mathematical reasoning.”
Reasons: don’t prevent early stopping • Korn-Freidlin-Mooney • “stopping a trial and releasing the information early allows current and future patients to benefit from new therapies as soon as possible.” • Ellenberg-DeMets-Fleming • “they seem to be warning against early trial termination. This is a much more complex issue on which the problem of modest upward bias of the effect estimate, readily remediable by existing methodology, should have little bearing”
Frequentist approach: the p-value • No monitoring – two definitions • Smallest α for which results would be statistically significant • Probability under H0 of a test statistic at least as extreme as the one observed
Definition #1: p is the smallest α needed for statistical significance • Imagine we observe z = 2.94 • What is the smallest α for significance? • Smallest α for significance = 0.0016
Definition #2: p is the probability under H0 of a test statistic at least as extreme as the one observed • Imagine we observe z = 2.94 • Under H0, Prob(Z ≥ 2.94) = 0.0016
So the definitions are equivalent! • But this is not true in group sequential designs
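The equivalence of the two definitions in the fixed-sample case can be checked numerically. The following is a minimal sketch, assuming a one-sided test and a standard normal test statistic; the function name `smallest_alpha` and the bisection search are illustrative choices, not part of the talk.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution under H0
z_obs = 2.94               # observed test statistic from the slides

# Definition 2: probability under H0 of a statistic at least as extreme.
p_tail = 1 - std_normal.cdf(z_obs)

# Definition 1: smallest one-sided alpha at which z_obs is significant,
# found by bisection on alpha (hypothetical helper, for illustration only).
def smallest_alpha(z, lo=1e-12, hi=0.5, iters=80):
    for _ in range(iters):
        mid = (lo + hi) / 2
        if std_normal.inv_cdf(1 - mid) <= z:  # z is significant at level mid
            hi = mid
        else:
            lo = mid
    return hi

p_alpha = smallest_alpha(z_obs)
print(round(p_tail, 4), round(p_alpha, 4))
```

Both numbers round to 0.0016, matching the slides. With interim looks the critical value at each look depends on the whole boundary, so this simple correspondence no longer holds.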
Group-sequential p-values • Smallest α for which results would be statistically significant • Think of a class of similar boundaries with different α • E.g. O-F boundaries with k equally spaced looks • What is the smallest α giving statistical significance? • Example: 5 planned looks
But this is not exactly right… • What is the probability of all paths with B ≥ 2.28? • Answer: 0.010 • (Why? Some paths are not possible: if we had observed B(1/5) = 1.8, B(2/5) = 2.2, we would have stopped at look 2.)
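The value 2.28 is consistent with the usual B-value definition, B(t) = Z(t)·√t, applied to the running example. The sketch below assumes the observed z = 2.94 occurred at the third of five equally spaced looks (t = 3/5); that look number is an inference from the slide's numbers, not something the slides state.

```python
from math import sqrt

z_obs = 2.94     # observed z-value from the slides
info_frac = 3/5  # ASSUMPTION: stopping at look 3 of 5 equally spaced looks

# B-value: the z-statistic rescaled by the square root of the information fraction.
b_value = z_obs * sqrt(info_frac)
print(round(b_value, 2))
```

This only reproduces the B-value itself. The probability 0.010 quoted on the slide also requires excluding paths that would already have crossed the boundary, which this sketch does not attempt.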
And something is odd… • Our exit probabilities depend on the future • But the future hasn’t happened yet • “Prediction is hard – especially about the future” • Niels Bohr (not Yogi Berra)
Definition #2: p is the probability under H0 of a test statistic at least as extreme as the one observed • We have observed (t, Z(t)). • What is more extreme?
Orderings • Unadjusted: 0.0016 • B-value ordering: 0.010 • Z-value ordering: 0.003 • MLE ordering: 0.002
Stagewise ordering • Earlier stopping: more compelling evidence • Two trials stopping at same time • one with larger z is more extreme • If our last observation is $(t_j, z_j)$, the p-value is: • $P_{H_0}\{\text{stop before } t_j\} + P_{H_0}\{\text{don't stop before } t_j \text{ and } Z(t_j) \ge z_j\}$ • Note: this does not need to consider future looks
Stagewise p • $P_{H_0}\{\text{stop before } t_j\} + P_{H_0}\{\text{don't stop before } t_j \text{ and } Z(t_j) \ge z_j\}$ = 0.000395 + (0.999605)(0.001568) = 0.0020
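The arithmetic on the stagewise slide can be reproduced directly. This is a minimal sketch that takes the two probabilities from the slide as given inputs; the function name `stagewise_p` and its argument names are hypothetical, and computing those inputs in a real design would need the numerical integration mentioned later in the talk.

```python
def stagewise_p(p_stop_before, p_exceed_given_continue):
    """Stagewise p-value: stop earlier, or continue and see Z(t_j) >= z_j."""
    p_continue = 1 - p_stop_before
    return p_stop_before + p_continue * p_exceed_given_continue

# Inputs copied from the slide, not computed here.
p_stage = stagewise_p(0.000395, 0.001568)
print(round(p_stage, 4))
```

The result rounds to 0.0020, the stagewise value in the orderings table.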
Orderings • Unadjusted 0.0016 • B-value ordering 0.010 • Z-value ordering 0.003 • MLE ordering 0.002 • Stagewise 0.0020
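Confidence intervals • Can use stagewise ordering for confidence limits • Find lower and upper bounds such that, with $[T, Z(T)]$ the random stopping outcome and $[t, z(t)]$ the observed one, • $P_L\{[T, Z(T)] \ge [t, z(t)]\} = \alpha/2$ • $P_U\{[T, Z(T)] \le [t, z(t)]\} = \alpha/2$ • Confidence limit may exclude the MLE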
Estimates • Use Emerson & Fleming, Biometrika 1990 • See also Liu & Hall, Biometrika 1999 • These are not easy to calculate
Estimates • These are not easy to calculate • Instead of x
Proposal • For p-value, use stagewise ordering • For confidence intervals, back-calculate from p • Use numerical integration • Or grid search on landem • For estimate, back-calculate from CI • Note: the CI may not include naïve estimator
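The idea of back-calculating confidence limits from a p-value function can be illustrated in the simplest setting. The sketch below is a deliberate simplification: it uses a fixed-sample normal tail probability in place of the stagewise-ordered tail probability (which would require numerical integration over the boundary), and bisection in place of a grid search. The names `upper_tail` and `solve_drift`, and the choice α = 0.05, are assumptions for illustration.

```python
from statistics import NormalDist

def upper_tail(z_obs, drift):
    """P(Z >= z_obs) when Z ~ N(drift, 1); stands in for the sequential tail."""
    return 1 - NormalDist(mu=drift).cdf(z_obs)

def solve_drift(z_obs, target, lo=-10.0, hi=20.0, iters=80):
    """Find the drift at which the upper-tail probability equals target.

    The tail probability increases with drift, so bisection applies.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if upper_tail(z_obs, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = 0.05   # ASSUMPTION: two-sided 95% interval
z_obs = 2.94   # observed z-value from the slides

lower = solve_drift(z_obs, alpha / 2)      # P_L{Z >= z_obs} = alpha/2
upper = solve_drift(z_obs, 1 - alpha / 2)  # P_U{Z <= z_obs} = alpha/2
print(round(lower, 2), round(upper, 2))
```

In this fixed-sample case the limits are simply z ± 1.96 on the drift scale. With a monitoring boundary, `upper_tail` would be replaced by the stagewise probability evaluated under each candidate drift, and the resulting interval need not be centered on the naïve estimate.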
Example: RALES • Spironolactone • Class III and IV heart failure • Primary outcome: total mortality (α = 0.04) • O-F boundaries
Summary • Protocols that include monitoring rules should also specify how you will calculate • p-values • Confidence intervals • Estimates • My recommendation: use stagewise ordering