Use of Candidate Predictive Biomarkers in the Design of Phase III Clinical Trials

Richard Simon, D.Sc. • Chief, Biometric Research Branch • National Cancer Institute • http://brb.nci.nih.gov

Predictive biomarkers • Measured before treatment to identify who is likely or unlikely to benefit from a particular treatment • ER, HER2, KRAS, EGFR

Biomarker Validity • Analytical validity • Measures what it’s supposed to • Reproducible and robust • Clinical validity (correlation) • It correlates with something clinically • Medical utility • Actionable resulting in patient benefit

Developing a drug with a companion test increases complexity and cost of development but should improve chance of success and has substantial benefits for patients and for the economics of health care • How can we do it in a way that provides the kind of reliable answers we expect from phase III trials?

When the Biology is Clear • Develop a completely specified classifier of the patients likely (or unlikely) to benefit from a new drug • Classifier is based on either a single gene/protein or composite score • Develop an analytically validated test • Design a focused clinical trial to evaluate effectiveness of the new treatment and how it relates to the test

Targeted (Enrichment) Design • Using phase II data, develop predictor of response to new drug • Patient predicted responsive: randomized to New Drug vs Control • Patient predicted non-responsive: off study

Evaluating the Efficiency of Targeted Design • Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004; Correction and supplement 12:3229, 2006 • Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005.

Relative efficiency of targeted design depends on • proportion of patients who test positive • effectiveness of new drug (compared to control) for test negative patients • When fewer than half of patients are test positive and the drug has little or no benefit for test negative patients, the targeted design requires dramatically fewer randomized patients than the standard design in which the marker is not used

Comparing T vs C on Survival or DFS • 5% 2-sided Significance and 90% Power

Hazard ratio 0.60 for test + patients • 40% reduction in hazard • Hazard ratio 1.0 for test – patients • 0% reduction in hazard • 33% of patients test positive • Hazard ratio for unselected population is • 0.33*0.60 + 0.67*1 = 0.87 • 13% reduction in hazard

To have 90% power for detecting 40% reduction in hazard within a biomarker positive subset • Number of events within subset = 162 • To have 90% power for detecting 13% reduction in hazard overall • Number of events = 2172
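The event counts above can be approximated with the widely used Schoenfeld-type formula for a 1:1 randomized log-rank comparison, events ≈ 4(z_{α/2} + z_β)² / (ln HR)². The sketch below is an illustration under that assumption, not necessarily the calculation behind the slide; the function names `events_required` and `diluted_hazard_ratio` are hypothetical. It gives 162 events for the test-positive subset and a figure close to the 2172 quoted for the unselected population (small differences can come from rounding of the hazard ratio and from the approximation used).

```python
import math
from statistics import NormalDist

def events_required(hazard_ratio, alpha_two_sided=0.05, power=0.90):
    """Approximate events needed for a 1:1 randomized log-rank comparison
    (Schoenfeld-type approximation; an assumption of this sketch)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha_two_sided / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)

def diluted_hazard_ratio(prevalence, hr_positive, hr_negative=1.0):
    """Prevalence-weighted average of the two hazard ratios, as on the slide."""
    return prevalence * hr_positive + (1 - prevalence) * hr_negative

hr_overall = round(diluted_hazard_ratio(0.33, 0.60), 2)
subset_events = events_required(0.60)        # test-positive patients only
overall_events = events_required(hr_overall)  # marker ignored

print(hr_overall)       # 0.87
print(subset_events)    # 162
print(overall_events)   # roughly 2170
```

The contrast between the two counts is the point of the targeted design: diluting a 40% hazard reduction across a population in which only a third of patients benefit inflates the required number of events by more than tenfold.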

Stratification Design • Develop predictor of response to new Rx • Predicted responsive to new Rx: randomized to New Rx vs Control • Predicted non-responsive to new Rx: randomized to New Rx vs Control

Develop prospective analysis plan for evaluation of treatment effect and how it relates to biomarker • Type I error should be protected for multiple comparisons • Trial sized for evaluating treatment effect overall and in subsets defined by test • “Stratifying” (balancing) the randomization is useful to ensure that all randomized patients have the test performed but is not necessary for the validity of comparing treatments within marker defined subsets • Post-stratification provides more time for development of analytically validated tests but risks validity of the results if adequate specimens are not collected in close to 100% of cases

Fallback Analysis Plan • Compare the new drug to the control overall for all patients ignoring the classifier. • If p_overall ≤ 0.01 claim effectiveness for the eligible population as a whole • Otherwise perform a single subset analysis evaluating the new drug in the classifier + patients • If p_subset ≤ 0.04 claim effectiveness for the classifier + patients.
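The decision rule above can be written as a short function. This is a minimal sketch; the function name `fallback_analysis` and the returned strings are illustrative choices, not part of the original plan. The two thresholds, 0.01 and 0.04, sum to 0.05, which is how the plan keeps the overall type I error at the conventional 5% level.

```python
def fallback_analysis(p_overall, p_subset, alpha_overall=0.01, alpha_subset=0.04):
    """Apply the two-step fallback rule to two pre-computed p-values."""
    if p_overall <= alpha_overall:
        return "effective for the eligible population as a whole"
    # The subset test is consulted only when the overall test fails.
    if p_subset <= alpha_subset:
        return "effective for classifier-positive patients"
    return "no claim of effectiveness"

print(fallback_analysis(p_overall=0.005, p_subset=0.50))
print(fallback_analysis(p_overall=0.09, p_subset=0.002))
print(fallback_analysis(p_overall=0.09, p_subset=0.20))
```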

Sample size for Analysis Plan • To have 90% power for detecting uniform 33% reduction in overall hazard at 1% two-sided level requires 370 events. • If 33% of patients are positive, then when there are 370 total events there will be approximately 123 events in positive patients • 123 events provides 90% power for detecting a 45% reduction in hazard at a 4% two-sided significance level.
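As a rough check on the last bullet, the Schoenfeld-type approximation events ≈ 4(z_{α/2} + z_β)² / (ln HR)² can be inverted to give the hazard ratio detectable with a fixed number of events. The sketch assumes that approximation and 1:1 randomization; `detectable_hazard_ratio` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def detectable_hazard_ratio(events, alpha_two_sided, power=0.90):
    """Hazard ratio detectable with the given number of events
    (inverse of the Schoenfeld-type approximation; an assumption here)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha_two_sided / 2) + z.inv_cdf(power)
    return math.exp(-2 * z_sum / math.sqrt(events))

hr = detectable_hazard_ratio(events=123, alpha_two_sided=0.04)
print(round(1 - hr, 2))   # 0.45, i.e. about a 45% reduction in hazard
```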

To detect a 40% reduction in hazard in an a-priori defined subset with 90% power and a 5% significance level requires 162 events in the subset. • To detect a 40% reduction in hazard in an a-priori defined subset with 90% power and a 4% two-sided significance level requires 171 events in the subset. • If the prevalence of the marker is 33%, then the trial might be sized for 3*171 = 513 total events.

• R Simon. Using genomics in clinical trial design, Clinical Cancer Research 14:5984-93, 2008 • R Simon. Designs and adaptive analysis plans for pivotal clinical trials of therapeutics and companion diagnostics, Expert Opinion on Medical Diagnostics 2:721-29, 2008

Web Based Software for Planning Clinical Trials of Treatments with a Candidate Predictive Biomarker • http://brb.nci.nih.gov

The Biology is Often Not So Clear • Cancer biology is complex and it is not always possible to have the right single completely defined predictive classifier identified and analytically validated by the time the pivotal trial of a new drug is ready to start accrual

K Candidate Biomarkers Design • Based on Adaptive Threshold Design: W Jiang, B Freidlin & R Simon, JNCI 99:1036-43, 2007

K Candidate Biomarkers Design • Have identified K candidate binary classifiers B1 , …, BK thought to be predictive of patients likely to benefit from T relative to C • Eligibility not restricted by candidate markers

Compare T vs C for all patients • If results are significant at level .01 claim broad effectiveness of T • Otherwise proceed as follows • Compare T vs C for the subset of patients positive for marker 1; compute p1 • Similarly compare T vs C for the subset of patients positive for marker 2 (p2), positive for marker 3 (p3), …, positive for marker K (pK) • Compute p* = min{p1, p2, …, pK} • Determine whether p* is statistically significant when adjusted for multiple testing • Adjust for multiple testing using permutation of treatment labels, which accounts for correlation among the tests
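The permutation adjustment of p* can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a two-sample z-test on a continuous outcome stands in for the log-rank test, the data are simulated, and the names `subset_p_value`, `min_p`, and `adjusted_min_p` are hypothetical. Only the treatment labels are shuffled, so each patient keeps the same marker values and outcome, which preserves the correlation among the K tests.

```python
import math
import random
from statistics import NormalDist

def subset_p_value(treat, outcome, positive):
    """Two-sided p-value for T vs C among marker-positive patients.
    Assumption: a z-test on a continuous outcome replaces the log-rank test."""
    arms = {0: [], 1: []}
    for a, y, pos in zip(treat, outcome, positive):
        if pos:
            arms[a].append(y)
    if min(len(arms[0]), len(arms[1])) < 2:
        return 1.0
    stats = {}
    for a, ys in arms.items():
        m = sum(ys) / len(ys)
        stats[a] = (m, sum((y - m) ** 2 for y in ys) / (len(ys) - 1), len(ys))
    se = math.sqrt(stats[1][1] / stats[1][2] + stats[0][1] / stats[0][2])
    if se == 0:
        return 1.0
    z = (stats[1][0] - stats[0][0]) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def min_p(treat, outcome, markers):
    """p* = minimum of the K marker-positive subset p-values."""
    return min(subset_p_value(treat, outcome, m) for m in markers)

def adjusted_min_p(treat, outcome, markers, n_perm=500, seed=0):
    """Share of label permutations whose p* is at least as small as observed."""
    rng = random.Random(seed)
    observed = min_p(treat, outcome, markers)
    permuted = list(treat)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # shuffle treatment labels only
        if min_p(permuted, outcome, markers) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Simulated example: K = 4 markers, only marker 0 is truly predictive.
rng = random.Random(1)
n = 400
markers = [[rng.random() < 0.33 for _ in range(n)] for _ in range(4)]
treat = [rng.randint(0, 1) for _ in range(n)]
outcome = [rng.gauss(0, 1) + 1.5 * treat[i] * markers[0][i] for i in range(n)]

p_star, p_adjusted = adjusted_min_p(treat, outcome, markers)
print(p_star, p_adjusted)
```

In the design described above, `p_adjusted` would be compared with the portion of the type I error reserved for the subset analysis after the overall test at the .01 level.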

To detect a 40% reduction in hazard in an a-priori defined subset with 90% power and a 4% two-sided significance level requires 171 events in the subset. • If the prevalence of the marker is 33%, then the trial might be sized for 3*171 = 513 total events. • To adjust for multiplicity with 4 independent tests, the subset requirement rises from 171 to 224 events and the total from 513 to 672 events.

Designs When there are Many Candidate Markers and too Much Patient Heterogeneity for any Single Marker

Adaptive Signature Design • Boris Freidlin and Richard Simon, Clinical Cancer Research 11:7872-8, 2005

Biomarker Adaptive Signature Design • Randomized trial of T vs C • Large number of candidate predictive biomarkers available • Eligibility not restricted by any biomarker • This approach can be used with any set of candidate markers

End of Trial Analysis: Fallback Analysis • Compare T to C for all patients at significance level α0 (e.g. 0.01) • If overall H0 is rejected, then claim effectiveness of T for eligible patients • Otherwise proceed as follows

Using only a randomly selected subset of patients of pre-specified size (e.g. 1/3) as a training set T, develop a binary classifier M of whether a patient is likely to benefit from T relative to C • The classifier may use multiple markers • The classifier classifies patients into only 2 subsets: those predicted to benefit from T and those for whom T is not predicted better than C

Apply the classifier M to classify patients in the validation set V=D-T • Compare T vs C in the subset of V who are predicted to benefit from T using a threshold of significance of 0.04
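The split-sample step can be sketched in a few lines. Everything specific here is an assumption made for illustration: the data are simulated, a z statistic on a continuous outcome stands in for the log-rank test, and `develop_classifier` is a deliberately simple single-marker rule, whereas the design allows any pre-specified classifier-building algorithm, including multi-marker ones. The feature worth noticing is that the classifier sees only the training patients, so the test in the validation set is an honest one.

```python
import math
import random
from statistics import NormalDist

def z_stat(rows):
    """Standardized T-minus-C difference in mean outcome.
    Assumption: this z statistic stands in for the log-rank statistic."""
    arms = {0: [], 1: []}
    for r in rows:
        arms[r["treat"]].append(r["y"])
    if min(len(arms[0]), len(arms[1])) < 2:
        return 0.0
    def mean_var(ys):
        m = sum(ys) / len(ys)
        return m, sum((y - m) ** 2 for y in ys) / (len(ys) - 1)
    (mt, vt), (mc, vc) = mean_var(arms[1]), mean_var(arms[0])
    se = math.sqrt(vt / len(arms[1]) + vc / len(arms[0]))
    return (mt - mc) / se if se > 0 else 0.0

def develop_classifier(train):
    """Toy rule (hypothetical): keep the single marker whose positive
    patients show the largest treatment effect in the training set."""
    n_markers = len(train[0]["markers"])
    best = max(range(n_markers),
               key=lambda k: z_stat([r for r in train if r["markers"][k]]))
    return lambda r: bool(r["markers"][best])

def adaptive_signature_test(rows, train_fraction=1 / 3, alpha_subset=0.04, seed=0):
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    train, validation = shuffled[:n_train], shuffled[n_train:]
    classifier = develop_classifier(train)   # training patients only
    predicted = [r for r in validation if classifier(r)]
    p = 2 * (1 - NormalDist().cdf(abs(z_stat(predicted))))
    return p, p <= alpha_subset

# Simulated trial: 10 candidate markers, only marker 3 is predictive.
rng = random.Random(2)
rows = []
for _ in range(900):
    markers = [rng.random() < 0.3 for _ in range(10)]
    treat = rng.randint(0, 1)
    rows.append({"markers": markers, "treat": treat,
                 "y": rng.gauss(0, 1) + 2.0 * treat * markers[3]})

p_value, significant = adaptive_signature_test(rows)
print(p_value, significant)
```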

This approach can also be used to identify the subset of patients who don’t benefit from T in cases where T is superior to C overall at the 0.01 level.

Cross-Validated Adaptive Signature Design • Freidlin B, Jiang W, Simon R, Clinical Cancer Research 16(2), 2010

At the conclusion of the trial randomly partition the patients into K approximately equally sized sets P1, …, PK • Let D_{-i} denote the full dataset minus data for patients in Pi • Omit patients in P1 • Apply the defined algorithm to analyze the data in D_{-1} to obtain a classifier M_{-1} • Classify each patient j in P1 using model M_{-1} • Record the treatment recommendation T or C

Repeat the above for all K loops of the cross-validation • Every patient has then been classified exactly once according to which treatment is predicted to be optimal

Let ST denote the set of patients for whom treatment T is predicted optimal • Compare outcomes for patients in ST who actually received T to those in ST who actually received C • Compute Kaplan-Meier curves of those receiving T and those receiving C • Let zT = standardized log-rank statistic

Test of Significance for Effectiveness of T vs C • Compute statistical significance of zT by randomly permuting treatment labels and repeating the entire cross-validation procedure • Do this 1000 or more times to generate the permutation null distribution of treatment effect for the patients in each subset
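The cross-validation and permutation steps can be sketched together as below. This is a simplified illustration, not the published algorithm: the data are simulated, a z statistic on a continuous outcome replaces the standardized log-rank statistic, `develop_classifier` is a toy single-marker rule, the comparison with the permutation distribution is one-sided in favor of T, and only 200 permutations are run to keep the example fast (the slide calls for 1000 or more). All function names are hypothetical.

```python
import math
import random

def z_stat(treat, outcome):
    """Standardized T-minus-C difference in mean outcome (log-rank stand-in)."""
    arms = {0: [], 1: []}
    for a, y in zip(treat, outcome):
        arms[a].append(y)
    if min(len(arms[0]), len(arms[1])) < 2:
        return 0.0
    def mean_var(ys):
        m = sum(ys) / len(ys)
        return m, sum((y - m) ** 2 for y in ys) / (len(ys) - 1)
    (mt, vt), (mc, vc) = mean_var(arms[1]), mean_var(arms[0])
    se = math.sqrt(vt / len(arms[1]) + vc / len(arms[0]))
    return (mt - mc) / se if se > 0 else 0.0

def develop_classifier(markers, treat, outcome):
    """Toy algorithm: choose the marker whose positives show the largest z."""
    def z_for(k):
        idx = [i for i, m in enumerate(markers) if m[k]]
        return z_stat([treat[i] for i in idx], [outcome[i] for i in idx])
    best = max(range(len(markers[0])), key=z_for)
    return lambda m: bool(m[best])

def cross_validated_z(markers, treat, outcome, folds):
    """z_T among patients predicted, out of fold, to benefit from T."""
    predicted = [False] * len(treat)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(treat)) if i not in held_out]
        clf = develop_classifier([markers[i] for i in train],
                                 [treat[i] for i in train],
                                 [outcome[i] for i in train])
        for j in fold:
            predicted[j] = clf(markers[j])
    s_t = [i for i, flag in enumerate(predicted) if flag]
    return z_stat([treat[i] for i in s_t], [outcome[i] for i in s_t])

def permutation_p_value(markers, treat, outcome, n_folds=5, n_perm=200, seed=0):
    rng = random.Random(seed)
    order = list(range(len(treat)))
    rng.shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]  # fixed partition
    observed = cross_validated_z(markers, treat, outcome, folds)
    permuted = list(treat)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        # The entire cross-validation is repeated for every permutation.
        if cross_validated_z(markers, permuted, outcome, folds) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Simulated trial: 6 candidate markers, only marker 0 is predictive.
rng = random.Random(1)
n = 240
markers = [[rng.random() < 0.3 for _ in range(6)] for _ in range(n)]
treat = [rng.randint(0, 1) for _ in range(n)]
outcome = [rng.gauss(0, 1) + 2.0 * a * m[0] for a, m in zip(treat, markers)]

z_t, p_value = permutation_p_value(markers, treat, outcome)
print(z_t, p_value)
```

Rerunning classifier development inside every permutation is what makes the p-value valid: the null distribution then reflects the optimism introduced by selecting the subset from the same data.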

By applying the analysis algorithm to the full RCT dataset D, recommendations are developed for how future patients should be treated

The size of the T vs C treatment effect for the indicated population is (conservatively) estimated by the Kaplan-Meier survival curves of T and of C in ST

70% Response to T in Sensitive Patients • 25% Response to T Otherwise • 25% Response to C • 30% Patients Sensitive

506 prostate cancer patients were randomly allocated to one of four arms • Placebo and 0.2 mg of diethylstilbestrol (DES) were combined as control arm C • 1.0 mg DES and 5.0 mg DES were combined as T • The end-point was overall survival (death from any cause) • Covariates: • Age: in years • Performance status (pf): not bed-ridden at all vs other • Tumor size (sz): size of the primary tumor (cm2) • Index of a combination of tumor stage and histologic grade (sg) • Serum prostatic acid phosphatase levels (ap)

Figure 1: Overall analysis. The value of the log-rank statistic is 2.9 and the corresponding p-value is 0.09. The new treatment thus shows no statistically significant benefit overall at the 0.05 level.

Figure 2: Cross-validated survival curves for patients predicted to benefit from the new treatment. log-rank statistic = 10.0, permutation p-value is .002

Figure 3: Survival curves for cases predicted not to benefit from the new treatment. The value of the log-rank statistic is 0.54.

Acknowledgements • Boris Freidlin • Yingdong Zhao • Wenyu Jiang • Aboubakar Maitournam
