Estimation of Effect Size in Trials Stopped Early

Janet Wittes • Statistics Collaborative • University of Pennsylvania Annual Conference on Statistical Issues in Clinical Trials • April 13, 2011

The problem • We know how to stop trials early for benefit • Most common boundary: O’Brien-Fleming • We do not generally operate algorithmically • So “boundary” is a “guideline” • We know the observed effect overestimates truth • So…how do we estimate effect size?

The problem is very hard… • So, my recommendations will not be “wrong” • But, unfortunately, they won’t be “right”

The solution • Frequentist – we have many choices • Bayesian/likelihood – we have no problem • It is what it is….

Some examples of early stopping • MERIT-HF • Study stopped at 2nd interim analysis • RR 0.66; 95% CI = (0.53, 0.81) • Sunitinib in pancreatic islet cell tumors • Study stopped with ½ patients • PFS 5.5 mo vs 11.1 mo; HR=0.4; p<0.001 • RALES – will discuss later

The Bassler paper • Objective: to compare the treatment effect from truncated RCTs with that from meta-analyses of RCTs addressing the same question but not stopped early (nontruncated RCTs) and to explore factors associated with overestimates of effect • Conclusions: Truncated RCTs were associated with greater effect sizes than RCTs not stopped early. • This difference was independent of the presence of statistical stopping rules and was greatest in smaller studies.

The responses: paper is off base • JAMA allows at most three letters!!!! • Scott Berry, Carlin, Connor • Ellenberg, DeMets, Fleming • Goodman, Don Berry, Wittes • Korn, Freidlin, Mooney

Reasons: math faulty • Berry-Carlin-Connor • “paper incorporated an important scientific and logical error that led to invalid conclusions” • Goodman-Berry-Wittes • “unfortunately, their conclusions are based on faulty mathematical reasoning.”

Reasons: don’t prevent early stopping • Korn-Freidlin-Mooney • “stopping a trial and releasing the information early allows current and future patients to benefit from new therapies as soon as possible.” • Ellenberg-DeMets-Fleming • “they seem to be warning against early trial termination. This is a much more complex issue on which the problem of modest upward bias of the effect estimate, readily remediable by existing methodology, should have little bearing”

Frequentist approach: the p-value • No monitoring – two definitions • Smallest α for which results would be stat sign’t • Prob under H0 that the test stat is ≥ observed

Definition #1: p is smallest α needed for statistical significance • Imagine we observe a z=2.94 • What is the smallest α for significance?

Definition #1: p is smallest α needed for statistical significance • Imagine we observe a z=2.94 • Smallest α for significance = 0.0016

Definition #2: p: probability under H0 that test stat is ≥ observed • Imagine we observe a z=2.94 • Under H0, Prob(Z ≥ 2.94) = 0.0016
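A minimal sketch of the fixed-sample calculation behind both definitions, assuming a one-sided test on a standard normal statistic. The function name `one_sided_p` is illustrative, not from the slides.

```python
from math import erfc, sqrt


def one_sided_p(z):
    """Upper-tail probability P(Z >= z) under H0 for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2.0))


# With no monitoring, this tail area is also the smallest alpha
# at which z = 2.94 would be declared significant.
print(round(one_sided_p(2.94), 4))  # 0.0016
```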

So the definitions are equivalent! • But this is not true in group sequential designs

Group-sequential p-values • Smallest α for which results would be stat sign’t • Think of class of similar boundaries with different α • E.g. O-F boundaries with k equally spaced looks • What is the smallest α giving stat’l significance? • Example: 5 planned looks

Imagine we observe Z=2.94

Z=2.94 → “p” = 0.002

Z=2.94 → $B = Z\,t^{1/2} = 2.28$

Z=2.94 → B=2.28; p=0.013
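The conversion from Z to the B-value can be checked in a few lines. This is a sketch under one assumption that the slides do not state: the information fraction is taken to be t = 3/5 (third of five equally spaced looks), which is the value that reproduces B = 2.28 from Z = 2.94. The name `b_value` is illustrative.

```python
from math import sqrt


def b_value(z, t):
    """B-value at information fraction t: B(t) = Z(t) * sqrt(t)."""
    return z * sqrt(t)


# Assumed: the observation is at the 3rd of 5 equally spaced looks (t = 3/5).
print(round(b_value(2.94, 3 / 5), 2))  # 2.28
```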

But this is not exactly right… • What is the probability of all paths with B ≥ 2.28? • Answer: 0.010 • (why? some paths are not possible – if we had observed B(1/5) = 1.8, B(2/5) = 2.2, we would have stopped at look 2).

And something is odd… • Our exit probabilities depend on the future • But the future hasn’t happened yet • “Prediction is hard – especially about the future” • Niels Bohr (not Yogi Berra)

Definition #2. p: probability under H0 that test stat is ≥ observed • We have observed (t, Z(t)). • What is more extreme?

Which point is more extreme?

Orderings • Unadjusted: 0.0016 • B-value ordering: 0.010 • Z-value ordering: 0.003 • MLE ordering: 0.002

Stagewise ordering • Earlier stopping: more compelling evidence • Two trials stopping at same time • one with larger z is more extreme • If our last observation is $(t_j, z_j)$, the p-value is: • $P_{H_0}\{\text{stop before } t_j\}$ + • $P_{H_0}\{\text{don't stop before } t_j \text{ and } Z(t_j) \ge z_j\}$ • Note: this does not need to consider future looks

Stagewise p • $P_{H_0}\{\text{stop before } t_j\}$ + • $P_{H_0}\{\text{don't stop before } t_j \text{ and } Z(t_j) \ge z_j\}$ = • 0.000395 + (0.999605)(0.001568) = 0.0020
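The arithmetic on this slide can be reproduced as a short sketch. Two things here are assumptions rather than statements from the slides: the product (0.999605)(0.001568) is read as P(don't stop before t_j) times the tail probability given continuation, and the helper `obf_type_spending` uses the Lan-DeMets O'Brien-Fleming-type spending function with one-sided α = 0.025, which happens to give a value close to 0.000395 at t = 2/5. Both function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def stagewise_p(p_stop_before, p_tail_given_continue):
    """P(stop before t_j) + P(no earlier stop) * P(Z(t_j) >= z_j | no earlier stop)."""
    return p_stop_before + (1.0 - p_stop_before) * p_tail_given_continue


def obf_type_spending(t, alpha=0.025):
    """Assumed spending function: alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t)))."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - STD_NORMAL.cdf(z / sqrt(t)))


# Numbers taken from the slide.
print(round(stagewise_p(0.000395, 0.001568), 4))  # 0.002

# Assumed check: cumulative alpha spent by the second of five looks.
print(obf_type_spending(2 / 5))
```

Note that the first term depends only on looks already taken, which is why the stagewise p-value does not need the future analysis schedule.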

Orderings • Unadjusted 0.0016 • B-value ordering 0.010 • Z-value ordering 0.003 • MLE ordering 0.002 • Stagewise 0.0020

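Confidence intervals • Can use stagewise ordering for confidence limits • Find lower and upper bounds such that • $P_L\{[t, z(t)] \ge [t, z(t)]_{\mathrm{obs}}\} = \alpha/2$ • $P_U\{[t, z(t)] \le [t, z(t)]_{\mathrm{obs}}\} = \alpha/2$ • Here ≥ and ≤ mean “at least as extreme” and “no more extreme” in the stagewise ordering • Confidence limit may exclude the MLE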

Estimates • Use Emerson & Fleming, Biometrika 1990 • See also Liu & Hall, Biometrika 1999 • These are not easy to calculate

Estimates • These are not easy to calculate • Instead of x

Estimates

Proposal • For p-value, use stagewise ordering • For confidence intervals, back-calculate from p • Use numerical integration • Or grid search on landem • For estimate, back-calculate from CI • Note: the CI may not include naïve estimator
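One way the proposal could be carried out is sketched below: compute the stagewise p-value by numerical integration as a function of a drift parameter θ, then find the confidence limits by bisection. Everything specific is an assumption for illustration, not taken from the slides: the looks at t = 0.2, 0.4, 0.6, the earlier boundaries 4.88 and 3.36 (approximate O'Brien-Fleming-type values for five equally spaced looks at one-sided 0.025), the trapezoid grid, and the names `stagewise_p` and `solve_theta`. Here θ is the drift on the B-value scale, so E[Z(t)] = θ√t; converting it to a hazard ratio or risk difference needs the trial's information, which is not modeled.

```python
from math import erfc, exp, pi, sqrt


def norm_sf(x):
    """Upper-tail probability of a standard normal."""
    return 0.5 * erfc(x / sqrt(2.0))


def norm_pdf(x, mean, var):
    return exp(-((x - mean) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)


def stagewise_p(theta, times, z_bounds, z_obs, n=200):
    """P_theta{stop before t_j} + P_theta{no earlier stop and Z(t_j) >= z_obs}.

    times: information fractions t_1..t_j; z_bounds: upper Z boundaries at
    t_1..t_{j-1}. Works on the B scale, where B(t) ~ N(theta * t, t).
    """
    # B-scale thresholds: boundaries at earlier looks, observed value at look j.
    crit = [zb * sqrt(t) for zb, t in zip(z_bounds, times)]
    crit.append(z_obs * sqrt(times[-1]))
    t1 = times[0]
    p = norm_sf((crit[0] - theta * t1) / sqrt(t1))
    # Sub-density of B(t_1) on a grid over the continuation region.
    lo = min(theta * t1, crit[0]) - 8.0 * sqrt(t1)
    xs = [lo + (crit[0] - lo) * i / n for i in range(n + 1)]
    dens = [norm_pdf(x, theta * t1, t1) for x in xs]
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        h = xs[1] - xs[0]
        w = [h / 2.0] + [h] * (n - 1) + [h / 2.0]  # trapezoid weights
        # Probability of continuing to look k+1 and exceeding its threshold.
        p += sum(wi * d * norm_sf((crit[k] - x - theta * dt) / sqrt(dt))
                 for wi, d, x in zip(w, dens, xs))
        if k < len(times) - 1:
            # Propagate the sub-density to the next continuation region.
            lo = min(theta * times[k], crit[k]) - 8.0 * sqrt(times[k])
            ys = [lo + (crit[k] - lo) * i / n for i in range(n + 1)]
            dens = [sum(wi * d * norm_pdf(y - x, theta * dt, dt)
                        for wi, d, x in zip(w, dens, xs)) for y in ys]
            xs = ys
    return p


def solve_theta(target, times, z_bounds, z_obs, lo=-10.0, hi=20.0, iters=30):
    """Bisection for the drift at which the stagewise p-value equals target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if stagewise_p(mid, times, z_bounds, z_obs) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Assumed design and data: stopped at the 3rd of 5 looks with Z = 2.94.
times, z_bounds, z_obs = [0.2, 0.4, 0.6], [4.88, 3.36], 2.94
p0 = stagewise_p(0.0, times, z_bounds, z_obs)
theta_lower = solve_theta(0.025, times, z_bounds, z_obs)
theta_upper = solve_theta(0.975, times, z_bounds, z_obs)
print(p0, theta_lower, theta_upper)
```

Bisection works because the stagewise p-value increases with θ; a grid search over θ would serve equally well and is closer to what the slide describes.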

Example: RALES • Spironolactone • Class III and IV heart failure • Primary outcome: total mortality (α=0.04) • O-F boundaries

Results

Summary • For protocols that include monitoring rules • Suggest adding how you will calculate • p-values • Confidence intervals • Estimates • My recommendation: use stagewise ordering
