Stratified Analysis of a Binary Endpoint and “Beyond”. Christy Chuang-Stein, Statistical Research and Consulting Center, Pfizer Inc. ASA Biopharm Section Webinar, May 7, 2009.
Related Webinars Offered Previously • October 21, 2008 • Devan Mehrotra - Stratified Analyses: Tips for Improving Power (http://www.biopharmnet.com/doc/2008_10_21_webinar.pdf ) • April 3, 2009 • Frank Harrell – Case Study in Parametric Survival Modeling • First 16 slides or so on “Covariable Adjustment in Randomized Clinical Trials” (http://www.biopharmnet.com/doc/2009_04_03_webinar.pdf )
Outline of This Webinar • Stratified Analysis of a Binary Endpoint • Inverse vs CMH Weighting • Simpson’s Paradox and Collapsibility • Beyond • Stratified Randomization vs Stratified Analysis • Stratification and Subgroup Analysis • Sample Sizing for a Multi-regional Trial • Regulatory Guidances on Global Trials, Data Extrapolation • Conclusion
A Sepsis Study • A confirmatory, double-blind, placebo-controlled trial in severe sepsis; IV administration with 96-hour duration; randomization stratified by center. • Primary analysis was 28-day mortality rate after treatment onset, stratified by 3 pre-specified covariates: APACHE II score, age and protein C activity. • Trial was terminated by an independent DSMB for efficacy after the 2nd interim analysis of 1520 patients. • Many subgroup analyses were conducted, including APACHE II subgroups (4 defined by the observed quartiles), subgroups defined by the components of the APACHE II score, and subgroups defined by 1, 2, 3, or at least 4 organ dysfunctions.
When Dealing with Binary Outcome • Three measures are commonly used to assess efficacy within the j-th APACHE II stratum • Risk difference $d_j = p_{1j} - p_{2j}$ • Relative risk $r_j = p_{1j} / p_{2j}$ • Odds ratio $o_j = \{p_{1j}(1 - p_{2j})\} / \{(1 - p_{1j})\,p_{2j}\}$ • Denote the observed rate by $\hat{p}_{ij} = n_{ij1} / n_{ij+}$. • We will focus on risk difference. In each stratum, estimate $p_{1j} - p_{2j}$ by $\hat{d}_j = \hat{p}_{1j} - \hat{p}_{2j}$. We will get an overall treatment effect estimate and construct a test statistic.
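As a minimal sketch of the three measures above, the function below computes them from the observed counts in one stratum. The function name and the example counts (30/100 versus 40/100) are hypothetical and are not taken from the sepsis trial.

```python
def effect_measures(events_1, n_1, events_2, n_2):
    """Return (risk difference, relative risk, odds ratio) for one stratum."""
    p1 = events_1 / n_1  # observed rate, treatment 1
    p2 = events_2 / n_2  # observed rate, treatment 2
    risk_difference = p1 - p2
    relative_risk = p1 / p2
    odds_ratio = (p1 * (1 - p2)) / ((1 - p1) * p2)
    return risk_difference, relative_risk, odds_ratio


# Hypothetical stratum: 30/100 deaths on treatment 1, 40/100 on treatment 2.
rd, rr, odds = effect_measures(30, 100, 40, 100)
print(f"risk difference = {rd:.3f}")
print(f"relative risk   = {rr:.3f}")
print(f"odds ratio      = {odds:.3f}")
```

The sketch does not guard against zero rates, which would make the relative risk or odds ratio undefined.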
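Test an Overall Treatment Effect • A common approach is to form a weighted average $\hat{d} = \sum_j w_j \hat{d}_j / \sum_j w_j$ and construct a test statistic for the overall effect as $X^2 = \left(\sum_j w_j \hat{d}_j\right)^2 / \sum_j w_j^2\,\widehat{\mathrm{Var}}(\hat{d}_j)$. • $X^2$ has an asymptotic chi-square distribution with 1 degree of freedom if $\sum_j w_j d_j = 0$.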
Choice of Weights – Method 1 • Inverse variance – $w_j$ is equal to the inverse of the sample variance of $\hat{d}_j$, i.e. $w_j = 1 / \widehat{\mathrm{Var}}(\hat{d}_j)$. In this case, $X^2$ will be $\left(\sum_j w_j \hat{d}_j\right)^2 / \sum_j w_j$. • When $d_j = d$ (the risk difference is uniform across the strata), the inverse variance weighting produces the minimum variance estimate for the common risk difference $d$, which is unbiased for large samples. This method is favored by meta-analysts.
Choice of Weights – Method 2 • CMH method – $w_j$ is proportional to the harmonic mean of $n_{1j+}$ and $n_{2j+}$, i.e. $w_j = n_{1j+}\,n_{2j+} / (n_{1j+} + n_{2j+})$. This method produces the $X^2$ test by Cochran, which is asymptotically equivalent to a test developed by Mantel and Haenszel. Continuity correction could be applied.
CMH Method • Let $f_j$ represent the relative frequency of patients in the j-th stratum in the population. When the study population mimics the target population, the CMH estimate is approximately unbiased for $\sum_j f_j d_j$. • The above makes CMH weighting attractive when one is not sure if the treatment effect is the same across the strata.
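The two weighting schemes can be compared with a short sketch. The stratum counts below are hypothetical (four equal-size strata with risk differences of 0, -3%, -9% and -12%), and the variance used is the unpooled binomial variance, which does not assume equal rates in the two arms; the CMH test statistic itself uses a variance computed under the null, which this sketch does not reproduce.

```python
def stratified_risk_difference(strata, weighting):
    """Weighted risk difference and its variance.

    strata: list of (events_1, n_1, events_2, n_2) tuples, one per stratum.
    weighting: "inverse_variance" or "cmh".
    """
    diffs, variances, weights = [], [], []
    for e1, n1, e2, n2 in strata:
        p1, p2 = e1 / n1, e2 / n2
        d_j = p1 - p2
        # Unpooled variance of the stratum risk difference.
        v_j = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
        if weighting == "inverse_variance":
            w_j = 1.0 / v_j
        elif weighting == "cmh":
            w_j = n1 * n2 / (n1 + n2)
        else:
            raise ValueError(f"unknown weighting: {weighting}")
        diffs.append(d_j)
        variances.append(v_j)
        weights.append(w_j)
    total = sum(weights)
    estimate = sum(w * d for w, d in zip(weights, diffs)) / total
    variance = sum(w * w * v for w, v in zip(weights, variances)) / total ** 2
    return estimate, variance


# Hypothetical counts: (deaths on new treatment, n, deaths on placebo, n).
strata = [(20, 100, 20, 100), (25, 100, 28, 100),
          (30, 100, 39, 100), (38, 100, 50, 100)]

cmh_est, cmh_var = stratified_risk_difference(strata, "cmh")
iv_est, iv_var = stratified_risk_difference(strata, "inverse_variance")
print(f"CMH weighting:              {cmh_est:.4f} (SE {cmh_var ** 0.5:.4f})")
print(f"Inverse variance weighting: {iv_est:.4f} (SE {iv_var ** 0.5:.4f})")
```

With equal stratum sizes the CMH weights are equal, so the CMH estimate is the simple average of the stratum differences, while the inverse variance estimate is pulled toward the low-mortality strata, whose differences are estimated more precisely.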
Assumptions on True Mortality Rates When the mortality rate is low, there is not much room to improve. Most of the benefit is in the high-risk population.
Impact of Weighting • Weighting by the relative frequency of a stratum within the population leads to an overall treatment effect $\sum_j f_j d_j$ of 0.25*(0) + 0.25*(3%) + 0.25*(9%) + 0.25*(12%) = 6%. • Assume equal allocation within each stratum. The overall treatment effect estimate under the CMH weighting will approach 6% for large samples. • If we use the inverse variance weighting, we will weigh treatment effects in the 1Q, 2Q, 3Q and 4Q by 2.23 : 1.38 : 1.20 : 1.00. The effect estimate will approach 4.5% for large samples. • The inverse variance weighting will underestimate the parameter $\sum_j f_j d_j$ of interest in this case.
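The arithmetic on this slide can be checked with a few lines. The sketch below uses only the quartile risk differences and the rounded weight ratios quoted above; the helper name is hypothetical. With the rounded weights the inverse variance average comes out at roughly 4.6%, close to the 4.5% quoted.

```python
def weighted_average(values, weights):
    """Weighted average of values with arbitrary positive weights."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


risk_differences = [0.00, 0.03, 0.09, 0.12]          # 1Q..4Q, from the slide
population_weights = [0.25, 0.25, 0.25, 0.25]         # equal quartile frequencies
inverse_variance_weights = [2.23, 1.38, 1.20, 1.00]   # rounded ratios from the slide

population_effect = weighted_average(risk_differences, population_weights)
inverse_variance_effect = weighted_average(risk_differences, inverse_variance_weights)

print(f"population-weighted effect: {population_effect:.2%}")
print(f"inverse-variance effect:    {inverse_variance_effect:.2%}")
```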
Findings from the Sepsis Trial • The CMH test statistic has a value of 7.310 with 1 degree of freedom (no continuity correction). The two-sided P-value is 0.0068. The CMH test statistic computes the variance assuming $p_{1j} = p_{2j}$ for all j. • A 95% CI for the overall difference in the mortality rate (new treatment – placebo) under the CMH weighting is (-9.8%, -1.6%). The calculation of variance in this case does not assume $p_{1j} = p_{2j}$. • The inverse variance approach produces a 95% CI for the difference in the mortality rate (new treatment – placebo) of (-8.1%, -0.1%).
Comparing across Strata • The differences in the mortality rates (new treatment – placebo) in the 4 APACHE II strata range from 3% to –12%. • The graph suggests a possible interaction that might be qualitative in nature. • We will look at an approach proposed by Gail and Simon (1985, Biometrics, 41:361-372) to test for qualitative interaction. Dmitrienko et al (2005). Analysis of Clinical Trials Using SAS.
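Test for Qualitative Interaction • Let $O^+ = \{d_i \ge 0\}$ = set of non-negative differences • Let $O^- = \{d_i \le 0\}$ = set of non-positive differences • With $s_i$ the standard error of $\hat{d}_i$, let $Q^+ = \sum_{i \in O^+} \hat{d}_i^2 / s_i^2$, $Q^- = \sum_{i \in O^-} \hat{d}_i^2 / s_i^2$ and $Q = \min(Q^+, Q^-)$. • $Q > c$ can be used to test the null hypothesis of no qualitative interaction. • Q follows a fairly complex distribution based on a weighted sum of chi-square distributions. SAS code is available in the book by Dmitrienko et al.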
Test for Qualitative Interaction • Q+ can be used to test the null hypothesis of all differences being negative. Q- can be used to test the null hypothesis of all differences being positive. • For the sepsis study, the two-sided Gail-Simon test has a P-value of 0.4822. • The one-sided P-value for H0 of positive differences (new treatment – placebo) is 0.0030. The one-sided P-value for H0 of negative differences is 0.6005. • Like other interaction tests, G-S test requires strong evidence before we can reject the no qualitative interaction hypothesis.
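A minimal sketch of the Gail-Simon calculation follows, assuming the binomial-mixture-of-chi-squares reference distribution from Gail and Simon (1985): with m strata, the two-sided test mixes chi-square tails with 1 to m-1 degrees of freedom, and each one-sided test mixes 1 to m. The function names and the input values are hypothetical; they are not the sepsis trial data and will not reproduce the P-values quoted above. This is not the SAS code from Dmitrienko et al.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(chi-square with integer df >= 1 exceeds x), closed form."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def gail_simon(differences, std_errors):
    """Gail-Simon P-values from stratum differences and their standard errors."""
    m = len(differences)
    pairs = list(zip(differences, std_errors))
    q_plus = sum((d / s) ** 2 for d, s in pairs if d > 0)
    q_minus = sum((d / s) ** 2 for d, s in pairs if d < 0)
    q = min(q_plus, q_minus)

    def tail(stat, trials):
        # Mixture of chi-square tails with Binomial(trials, 1/2) weights.
        if stat <= 0:
            return 1.0
        return sum(math.comb(trials, h) * 0.5 ** trials * chi2_sf(stat, h)
                   for h in range(1, trials + 1))

    return {
        "p_two_sided": tail(q, m - 1),            # H0: no qualitative interaction
        "p_all_negative_null": tail(q_plus, m),   # H0: all differences <= 0
        "p_all_positive_null": tail(q_minus, m),  # H0: all differences >= 0
    }


# Hypothetical stratum differences (new treatment - placebo) and standard errors.
result = gail_simon([0.03, -0.03, -0.09, -0.12], [0.05, 0.05, 0.05, 0.05])
for name, p_value in result.items():
    print(f"{name}: {p_value:.4f}")
```

In this hypothetical example the single small positive difference is far from enough to reject the hypothesis of no qualitative interaction, which illustrates the point above about the strength of evidence the test requires.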
In the End… • Data from this single study led to the approval of Xigris® • Xigris® INDICATIONS AND USAGE Xigris is indicated for the reduction of mortality in adult patients with severe sepsis (sepsis associated with acute organ dysfunction) who have a high risk of death (e.g., as determined by APACHE II). Safety and efficacy have not been established in adult patients with severe sepsis and lower risk of death.
Table in the Package Insert
APACHE II quartile (score)   Xigris total   Xigris mortality rate   Placebo total   Placebo mortality rate
1st + 2nd (3-24)             436            18.8%                   437             19.0%
3rd + 4th (25-53)            414            30.9%                   403             43.7%
• Patients who have a high risk for death are represented by an APACHE II score in the 3rd and 4th APACHE II score categories. • Treatment effects need to differ more than what is shown in this case for the Gail-Simon test to conclude interaction.
Questions • Could one have anticipated this extent of treatment difference before the trial? • If yes, what would have been a good design and analysis strategy? • Options • Specify the high risk population as the primary analysis population and enroll adequate patients in this group. • Test both the high risk population and the entire population with adjustment for multiplicity. • Analysis follows the design strategy.
The LIFE Study • Losartan Intervention For Endpoint Reduction in Hypertension Study. • Conducted at 945 sites in 7 countries. • Enrolled 9193 hypertensive patients with left ventricular hypertrophy (LVH) • The primary endpoint is a composite endpoint of cardiovascular deaths, stroke, and myocardial infarction. • Results reviewed by the FDA Cardiovascular and Renal Drugs AC on Jan 6 2003 for a new proposed indication Cozaar is indicated to reduce the risk of cardiovascular morbidity and mortality as measured by the combined incidence of cardiovascular death, stroke, and myocardial infarction in hypertensive patients with left ventricular hypertrophy.
Some Background • Losartan’s label at the time stated that the blood pressure reduction in blacks was somewhat less than that in whites (a common statement for beta-blockers). • An FDA statistician quoted data from three endpoint studies of other drugs. These studies demonstrated less or no treatment effect in blacks when compared to whites. • On the primary endpoint, when compared to atenolol, losartan had a hazard ratio of 0.869 (95% CI from 0.772 to 0.979) with a P-value of 0.021. The effect came primarily from the stroke component of the composite. • The issue of how losartan compared to atenolol in blacks came up.
Gail-Simon Test • Nominal p-value for Black vs. Non-Black Qualitative Interaction = 0.016. • Impossible to correctly adjust this p-value for multiple comparisons post hoc. • 3 subgroups pre-specified for special importance (U.S. region, Diabetics, ISH) • To do it correctly, the formal analysis plan would need to list all important subgroups and specify a method to correctly adjust for the number of tests. Source: John Lawrence’s (FDA Statistical Reviewer) slides at the January 6 2003 FDA AC meeting. For more discussion, see http://www.fda.gov/ohrms/dockets/ac/03/slides/3920s1.htm
COZAAR® Package Insert Indications and Usage … COZAAR is indicated to reduce the risk of stroke in patients with hypertension and left ventricular hypertrophy, but there is evidence that this benefit does not apply to Black patients. … Clinical Pharmacology In the LIFE study, Black patients treated with atenolol were at lower risk of experiencing the primary composite endpoint compared with Black patients treated with COZAAR…. This finding could not be explained on the basis of differences in the populations other than race or on any imbalances between treatment groups… the LIFE study provides no evidence that the benefits of COZAAR on reducing the risk of cardiovascular events in hypertensive patients with left ventricular hypertrophy apply to Black patients.
Observations • In the case of Xigris, subgroups defined by APACHE II score were pre-specified. Statistical significance was not achieved by the Gail-Simon test at the 5% level. • In the case of COZAAR, race subgroups were not pre-specified. They are, however, among the “usual” demographic subgroups and there is a priori reason for looking at this subgroup. A post hoc Gail-Simon test produced a P-value less than 0.05. • The end results (language in the product package insert) are similar – the label describes differential treatment effects in the subgroups.
Clinical Summary of Safety 13% vs 9.5%: a two-sided P-value of 0.023.
Clinical Summary of Safety 95% CI for the diff (A – B) using inverse variance weighting is (-0.017, 0.018) with a point estimate of 0.001. What happens?
Clinical Summary of Safety The study with the highest AE rates had twice as many subjects on Drug A as on Drug B.
Simpson’s Paradox • Within each study, the two groups have the same event rates. • Study 1 randomized patients 1:1:1:1 to 3 doses and 1 control. • Study 2 randomized patients 1:1 to one dose and control.
Results Pooled over Studies
Treatment   Event       No Event    Combined
New         240 (48%)   260 (52%)   500
Control     120 (40%)   180 (60%)   300
• Pooling produces an event rate of 48% for the new treatment and 40% for the control. • The chi-square statistic has a two-sided P-value = 0.028. • Conducting an un-stratified (un-adjusted) analysis in this case will lead to an erroneous conclusion.
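The paradox can be reproduced numerically. The per-study counts in the sketch below are an assumption: they are backed out from the pooled table together with the stated designs (3:1 new-to-control in Study 1, 1:1 in Study 2, identical event rates within each study), which gives 60% in Study 1 and 30% in Study 2. The chi-square test is the uncorrected Pearson test on the pooled 2x2 table.

```python
import math

# Assumed per-study counts (events, patients), derived from the pooled table.
studies = {
    "Study 1": {"new": (180, 300), "control": (60, 100)},
    "Study 2": {"new": (60, 200), "control": (60, 200)},
}

within_study_differences = []
for name, arms in studies.items():
    rate_new = arms["new"][0] / arms["new"][1]
    rate_control = arms["control"][0] / arms["control"][1]
    within_study_differences.append(rate_new - rate_control)
    print(f"{name}: new {rate_new:.0%}, control {rate_control:.0%}")

# Pool over studies, ignoring the study factor.
new_events = sum(s["new"][0] for s in studies.values())
new_total = sum(s["new"][1] for s in studies.values())
control_events = sum(s["control"][0] for s in studies.values())
control_total = sum(s["control"][1] for s in studies.values())
pooled_new = new_events / new_total
pooled_control = control_events / control_total
print(f"Pooled: new {pooled_new:.0%}, control {pooled_control:.0%}")

# Pearson chi-square (1 df, no continuity correction) on the pooled 2x2 table.
a, b = new_events, new_total - new_events
c, d = control_events, control_total - control_events
n = a + b + c + d
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p_value = math.erfc(math.sqrt(chi_square / 2))  # upper tail of chi-square, 1 df
print(f"chi-square = {chi_square:.3f}, two-sided P = {p_value:.3f}")
```

The within-study differences are exactly zero, yet the pooled table shows an 8-point difference because the study with the higher event rate contributes three times as many patients to the new treatment as to the control.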
Collapsibility • In this example, the risk difference is not collapsible over the studies (i.e., we can’t ignore “study”). • Randomization (treatment assignment) is not independent of study in the two-way marginal table of treatment by study.
Collapsibility • When both randomization ratio and risk difference are the same across studies, risk difference is collapsible over studies. • In this case, the proportion of event for each treatment is a weighted average of the proportions in individual studies with weights proportional to the study sizes.
In General • If the two treatments have the same effect in all studies (null hypothesis) and in addition, the randomization ratio is the same, then risk difference, risk ratio, and odds ratio are all collapsible across studies. • In the above case, the risk difference is 0 and the relative risk and odds ratio are 1. • Otherwise, collapsibility depends on the chosen measure of association (risk difference, risk ratio, odds ratio) - Greenland, 1998, Encyclopedia of Biostatistics.
Collapsibility Depends on Measure 1:1 randomization, equal risk difference in two studies
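A small numerical illustration of this slide's scenario, under assumed values: two equally sized studies with 1:1 randomization and a common risk difference of 10 percentage points (50% vs 40% and 20% vs 10%). These rates are hypothetical and are not the figures from the original slide.

```python
def measures(p_new, p_control):
    """Return (risk difference, risk ratio, odds ratio) for a pair of rates."""
    risk_difference = p_new - p_control
    risk_ratio = p_new / p_control
    odds_ratio = (p_new * (1 - p_control)) / ((1 - p_new) * p_control)
    return risk_difference, risk_ratio, odds_ratio


# Hypothetical (new, control) event rates; equal study sizes, 1:1 randomization.
study_rates = [(0.5, 0.4), (0.2, 0.1)]

# With equal study sizes and 1:1 allocation, pooled rates are simple averages.
pooled_new = sum(p_new for p_new, _ in study_rates) / len(study_rates)
pooled_control = sum(p_ctl for _, p_ctl in study_rates) / len(study_rates)

per_study = [measures(p_new, p_ctl) for p_new, p_ctl in study_rates]
pooled = measures(pooled_new, pooled_control)

for label, (rd, rr, odds) in zip(["Study 1", "Study 2"], per_study):
    print(f"{label}: RD = {rd:.3f}, RR = {rr:.3f}, OR = {odds:.3f}")
print(f"Pooled:  RD = {pooled[0]:.3f}, RR = {pooled[1]:.3f}, OR = {pooled[2]:.3f}")
```

The pooled risk difference equals the common within-study value of 0.10. The risk ratio and odds ratio differ between the two studies, and their pooled values (about 1.40 and 1.62) are not the simple averages of the study-level values either.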
Observations • Meta analysis procedure is frequently used to combine efficacy results. • Should use meta analysis (stratified analysis) when summarizing safety data from different studies, especially when studies have different patient populations and/or different randomization ratios. • If there is no a priori information suggesting different risk differences for different studies, inverse variance weighting would be a good choice. • Should always consider stratified analysis when covariates are highly correlated with the response.
Stratified (Adjusted) Analysis • Factor defining strata is prognostic of response. • Allowing comparison within more homogeneous groups. • Factor defining strata is predictive of treatment effect. • Issue of interaction • Evaluating treatment effect with subgroups • Overall treatment effect might be less meaningful if the interaction between treatment and factor is substantial
Stratified Randomization vs Analysis • If we employ stratified randomization, the convention is to include the stratifying factor in the analysis (CPMP/EWP/2863/99 on adjustment for baseline covariates). • When there are >=50 patients in each treatment group, Grizzle found that there was little advantage to using stratified randomization with two strata when the strata are roughly equally represented (Grizzle, Controlled Clinical Trials, 1982). • The incremental benefit of stratified randomization beyond that due to the stratified analysis is minimal (Permutt, DIJ 2007).
Stratified Randomization vs Analysis • The above is due to the fact that, for a reasonable sample size, the chance that the randomization will produce the type of imbalance that will substantially affect the inference is low. • If a stratum is small, stratified randomization could reduce the chance of imbalance. • If we are forced to treat un-stratified analysis as the primary analysis, stratified randomization could generally give us results close to those from an adjusted analysis. • Stratified allocation is used to ensure adequate (or even greater) representation of a particular type of patients in the study.
Permutt, DIJ 2007 • 100 subjects will be randomized to one of two treatments (50 per treatment). • There are 50 men and 50 women. Gender is a prognostic factor and could be used as a stratifying factor for randomization and/or analysis, resulting in 4 options: stratified randomization and analysis (R&A), stratified randomization only (R Only), stratified analysis only (A Only), Neither. • Assume the standard deviation is 10, and a treatment effect that will result in 80% power with 25 per group per gender under the R&A option (i.e., D = 5.6). • Assume no treatment by gender interaction, but a gender effect that varies between 0 and 20.
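The value D = 5.6 can be recovered from the usual sample-size relationship. The sketch below assumes a two-sided 5% test, a normal approximation rather than the t distribution, and 50 subjects per treatment arm (25 per group per gender); these assumptions are for illustration and may differ from the exact calculation in the paper.

```python
import math
from statistics import NormalDist

standard_normal = NormalDist()

alpha = 0.05      # two-sided significance level (assumed)
power = 0.80
sd = 10.0
n_per_arm = 50    # 25 men + 25 women on each treatment

# Detectable difference: (z_{1-alpha/2} + z_{power}) * sd * sqrt(2 / n)
z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
z_power = standard_normal.inv_cdf(power)
detectable_difference = (z_alpha + z_power) * sd * math.sqrt(2 / n_per_arm)

print(f"D = {detectable_difference:.2f}")
```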
Permutt, DIJ 2007 • Under “A Only” (stratified analysis without stratified randomization), the power was calculated for each possible (treatment,gender) allocation combination. The power was then averaged using probability under the hypergeometric distribution as the weight. • Under option “R Only” (stratified randomization without stratified analysis), Type I error could be lower than the nominal level (two-sided 5%) because the reduction in the variance of the estimated treatment effect due to stratified randomization is not properly accounted for in the analysis. (See the original paper.)
Stratification & Subgroup Analysis • How does the treatment perform in patients with mild disease? • Do patients with mild/moderate disease respond to the treatment similarly to patients with severe disease? • This is typically phrased as an interaction between treatment and disease severity at baseline • If a heterogeneous effect (interaction) exists, is it qualitative or quantitative?
Subgroup Analysis: Issues • Multiplicity leading to inflated false positive rate • Lack of statistical power leading to inflated false negative rate • Treatment groups not comparable because randomization was not done within the subgroups • Appropriate reporting/interpretation to ensure scientifically defensible and balanced conclusions We will focus on the first two issues here.
False Positive • Multiplicity • With multiple subgroup analyses, the probability of a false positive finding is substantial. • With 10 independent tests (α=0.05), the chance of at least one false positive is > 40%. Lagakos (2006) NEJM 354;16
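The 40% figure follows from the complement rule for independent tests. A short sketch, with a hypothetical helper name:

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) among n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    print(f"{n_tests:2d} tests: {prob_any_false_positive(n_tests):.1%}")
```

Subgroup tests in a real trial are usually correlated rather than independent, so this formula is an approximation of the inflation, not an exact figure.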
Forest Plot of Treatment Effect Typical Result • Hypothetical study • 4000 patients in 20 countries (200 patients each) with a control arm risk of 20% and an experimental arm risk of 15% • Homogeneous absolute risk reduction of 5% in all countries. Marschner (DIA Annual Meeting)
Simulation Study of Country Differences • In 10,000 simulations of similar studies, the largest and smallest treatment effects among the 20 countries were calculated • On average the largest treatment effect among the 20 countries was a 15% absolute risk reduction on the experimental therapy • On average the smallest treatment effect among the 20 countries was a 5% absolute risk increase on the experimental therapy • Purely by chance, the observed experimental treatment effect in different countries can be expected to range from extremely beneficial to apparently harmful. Marschner (DIA Annual Meeting)
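A simulation along these lines can be sketched with the standard library. It assumes 1:1 allocation within each country (100 patients per arm), which the slide does not state, and uses 1,000 replicates instead of 10,000 to keep the run time short; the function name and seed are arbitrary.

```python
import random


def simulate_extremes(n_sims=1000, n_countries=20, n_per_arm=100,
                      p_control=0.20, p_experimental=0.15, seed=2009):
    """Average largest and smallest observed risk reduction across countries."""
    rng = random.Random(seed)
    sum_largest = 0.0
    sum_smallest = 0.0
    for _ in range(n_sims):
        reductions = []
        for _ in range(n_countries):
            control_events = sum(rng.random() < p_control
                                 for _ in range(n_per_arm))
            experimental_events = sum(rng.random() < p_experimental
                                      for _ in range(n_per_arm))
            # Observed absolute risk reduction (control rate minus experimental).
            reductions.append((control_events - experimental_events) / n_per_arm)
        sum_largest += max(reductions)
        sum_smallest += min(reductions)
    return sum_largest / n_sims, sum_smallest / n_sims


mean_largest, mean_smallest = simulate_extremes()
print(f"average largest risk reduction:  {mean_largest:.1%}")
print(f"average smallest risk reduction: {mean_smallest:.1%}")
```

A negative smallest value corresponds to an apparent risk increase on the experimental therapy, even though the true effect is the same 5% reduction in every country.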
Prob of Neg Result for ≥ One Subgroup Assuming two groups and a continuous endpoint: • Factors increasing the probability • Substantial imbalance between treatment groups • Substantial differences in the subgroup size • A large number of subgroups • Factors decreasing the probability • Balanced treatments and subgroup size • A large treatment effect size • A large sample size
Disjoint Subgroups • 2-sided α = 0.05 • 1:1 ratio with perfect balance between treatments • Various scenarios for subgroup size Li, Chuang-Stein, Hoseyni, DIJ (2007), 41:47-56.
Overlapping Subgroups (Simulations) • Each baseline covariate defines 3 subgroups with equal proportions (2 or 5 covariates). • Probabilities based on simulations (1000 replicates). • Unconditional on the overall result.