Validation of Predictive Classifiers Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://linus.nci.nih.gov/brb
Biomarker = Biological Measurement • Surrogate endpoint • A measurement made on a patient before, during, and after treatment to determine whether the treatment is working • Prognostic factor • A measurement made before treatment that correlates with outcome, often for a heterogeneous set of patients • Predictive factor • A measurement made before treatment to predict whether a particular treatment is likely to be beneficial
Prognostic Factors • Most prognostic factors are not used because they are not therapeutically relevant • Many prognostic factor studies use a convenience sample of patients for whom tissue is available. Generally the patients are too heterogeneous to support therapeutically relevant conclusions
Pusztai et al. The Oncologist 8:252-8, 2003 • 939 articles on “prognostic markers” or “prognostic factors” in breast cancer in past 20 years • ASCO guidelines only recommend routine testing for ER, PR and HER-2 in breast cancer • “With the exception of ER or progesterone receptor expression and HER-2 gene amplification, there are no clinically useful molecular predictors of response to any form of anticancer therapy.”
Predictive Biomarkers • Most cancer treatments benefit only a minority of patients to whom they are administered • Being able to predict which patients are likely to benefit would • save patients from unnecessary toxicity, and enhance their chance of receiving a drug that helps them • Improve the efficiency of clinical development • Help control medical costs
In new drug development, the role of a classifier is to select a target population for treatment • The focus should be on evaluating the new drug, not on validating the classifier • Adoption of a classifier to restrict the use of a treatment in wide use should be based on demonstrating that use of the classifier leads to better clinical outcome
Targeted clinical trials can be much more efficient than untargeted clinical trials, if we know whom to target
Developmental Strategy (I) • Develop a diagnostic classifier that identifies the patients likely to benefit from the new drug • Develop a reproducible assay for the classifier • Use the diagnostic to restrict eligibility to a prospectively planned evaluation of the new drug • Demonstrate that the new drug is effective in the prospectively defined set of patients determined by the diagnostic
Develop Predictor of Response to New Drug • Using phase II data, develop a predictor of response to the new drug • Patient predicted responsive → randomize to New Drug vs Control • Patient predicted non-responsive → Off Study
Evaluating the Efficiency of Strategy (I) • Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004. • Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005. • reprints at http://linus.nci.nih.gov/brb
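The efficiency argument in these papers can be illustrated with a back-of-envelope calculation. A Python sketch with illustrative response rates (not figures from the papers): if the drug benefits only classifier-positive patients, the treatment effect in an unselected trial is diluted by the prevalence of positives, and the required sample size grows accordingly.

```python
from statistics import NormalDist

def n_per_arm(p_control, p_treat, alpha=0.05, power=0.9):
    """Two-sided two-proportion sample size, normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return z ** 2 * var / (p_control - p_treat) ** 2

gamma = 0.25       # prevalence of classifier-positive (drug-responsive) patients
p_control = 0.20   # response rate on control (illustrative)
p_positive = 0.40  # response rate on new drug among classifier-positive patients

# If only classifier-positive patients benefit, the treated-arm rate in an
# unselected trial is diluted toward the control rate:
p_unselected = gamma * p_positive + (1 - gamma) * p_control

n_targeted = n_per_arm(p_control, p_positive)      # randomize positives only
n_untargeted = n_per_arm(p_control, p_unselected)  # randomize all comers
print(f"targeted:   {n_targeted:.0f} randomized per arm "
      f"(~{n_targeted / gamma:.0f} screened)")
print(f"untargeted: {n_untargeted:.0f} randomized per arm")
```

With these numbers the untargeted trial needs roughly an order of magnitude more randomized patients per arm, even counting the extra screening the targeted design requires.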
For Herceptin, even a relatively poor assay enabled conduct of a targeted phase III trial that was crucial for establishing effectiveness • Recent results with Herceptin in early-stage breast cancer show dramatic benefits for patients selected for HER-2 expression
Developmental Strategy (II) • Develop a predictor of response to the new Rx • Predicted responsive to new Rx → randomize to New Rx vs Control • Predicted non-responsive to new Rx → randomize to New Rx vs Control
Developmental Strategy (II) • Do not use the diagnostic to restrict eligibility; instead use it to structure a prospective analysis plan • Compare the new drug to the control for classifier-positive patients • If p+ > 0.05, make no claim of effectiveness • If p+ ≤ 0.05, claim effectiveness for the classifier-positive patients and • Test the treatment effect for classifier-negative patients at the 0.05 level
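The gating logic of this analysis plan can be written out explicitly. A minimal sketch (the function name and p-values are illustrative, not from the source):

```python
def strategy_ii_claims(p_positive, p_negative, alpha=0.05):
    """Prospective analysis plan of Developmental Strategy (II):
    test classifier-positive patients first; only on success is the
    classifier-negative subset tested, at the same 0.05 level."""
    claims = []
    if p_positive <= alpha:
        claims.append("effective in classifier-positive patients")
        # Negatives are tested only after a positive result in the positives.
        if p_negative <= alpha:
            claims.append("effective in classifier-negative patients")
    return claims

print(strategy_ii_claims(p_positive=0.01, p_negative=0.30))
```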
Key Features of Design (II) • The purpose of the RCT is to evaluate treatment T vs C for the two pre-defined subsets defined by the binary classifier; not to re-evaluate the components of the classifier, or to modify, refine or re-develop the classifier
Guiding Principle • The data used to develop the classifier must be distinct from the data used to test hypotheses about treatment effect in subsets determined by the classifier • Developmental studies are exploratory • Studies on which treatment effectiveness claims are to be based should be definitive studies that test a treatment hypothesis in a patient population completely pre-specified by the classifier
Adaptive Signature Design An adaptive design for generating and prospectively testing a gene expression signature for sensitive patients Boris Freidlin and Richard Simon Clinical Cancer Research 11:7872-8, 2005
Adaptive Signature Design: End-of-Trial Analysis • Compare E to C for all patients at significance level 0.04 • If the overall H0 is rejected, then claim effectiveness of E for eligible patients • Otherwise
Otherwise: • Using only the first half of patients accrued during the trial, develop a binary classifier that predicts the subset of patients most likely to benefit from the new treatment E compared to control C • Compare E to C for patients accrued in second stage who are predicted responsive to E based on classifier • Perform test at significance level 0.01 • If H0 is rejected, claim effectiveness of E for subset defined by classifier
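The end-of-trial decision rule above can be sketched as a small function (the 0.04 + 0.01 split keeps the overall type I error at the conventional 0.05; the function name and p-values are illustrative):

```python
def adaptive_signature_outcome(p_overall, p_subset,
                               alpha_overall=0.04, alpha_subset=0.01):
    """End-of-trial analysis of the Adaptive Signature Design:
    overall test first, then the signature-defined subset test on
    second-stage patients. The two alpha levels sum to 0.05."""
    if p_overall <= alpha_overall:
        return "E effective for all eligible patients"
    if p_subset <= alpha_subset:
        return "E effective for the signature-defined sensitive subset"
    return "no claim of effectiveness"

print(adaptive_signature_outcome(p_overall=0.20, p_subset=0.004))
```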
Biomarker Adaptive Threshold Design Wenyu Jiang, Boris Freidlin & Richard Simon JNCI 99:1036-43, 2007 http://linus.nci.nih.gov/brb
Biomarker Adaptive Threshold Design • Randomized pivotal trial comparing new treatment E to control C • Survival or DFS endpoint • Have identified a univariate biomarker index B thought to be predictive of patients likely to benefit from E relative to C • Eligibility not restricted by biomarker • No threshold for biomarker determined • Biomarker value scaled to range (0,1)
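The idea can be sketched in simplified form: take the best treatment-effect statistic over a grid of candidate biomarker cutpoints and calibrate it by permuting treatment labels. This sketch is an assumption-laden simplification — it uses a numeric outcome and a difference-in-means statistic rather than the survival log-likelihood of the actual design, and all data are synthetic:

```python
import random

def effect_stat(outcome, treat):
    """Difference in mean outcome, treated minus control."""
    e = [y for y, t in zip(outcome, treat) if t]
    c = [y for y, t in zip(outcome, treat) if not t]
    if len(e) < 2 or len(c) < 2:
        return 0.0
    return sum(e) / len(e) - sum(c) / len(c)

def max_stat(outcome, treat, marker, cuts):
    """Best treatment effect over candidate biomarker cutpoints."""
    return max(effect_stat([y for y, b in zip(outcome, marker) if b > c],
                           [t for t, b in zip(treat, marker) if b > c])
               for c in cuts)

def permutation_p(outcome, treat, marker, cuts, n_perm=200, seed=0):
    """Calibrate the maximized statistic by permuting treatment labels."""
    rng = random.Random(seed)
    observed = max_stat(outcome, treat, marker, cuts)
    hits = 0
    for _ in range(n_perm):
        perm = treat[:]
        rng.shuffle(perm)
        hits += max_stat(outcome, perm, marker, cuts) >= observed
    return (hits + 1) / (n_perm + 1)

# Synthetic data: the new treatment helps only patients with marker B > 0.5.
rng = random.Random(1)
marker = [rng.random() for _ in range(100)]
treat = [i % 2 == 0 for i in range(100)]
outcome = [rng.gauss(0, 1) + (3.0 if t and b > 0.5 else 0.0)
           for t, b in zip(treat, marker)]
p = permutation_p(outcome, treat, marker, cuts=[0.0, 0.25, 0.5, 0.75])
print(f"permutation p-value: {p:.3f}")
```

Because the maximum over cutpoints is re-computed for every permuted data set, the optimistic bias from choosing the best threshold is built into the null distribution.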
Evaluating a Classifier • Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data. • When the number of candidate predictors (p) exceeds the number of cases (n), perfect prediction on the same data used to create the predictor is always possible
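The p > n point is easy to demonstrate on pure noise: with far more candidate features than samples, some feature will separate even randomly assigned labels perfectly. A synthetic example (sizes are illustrative):

```python
import random

rng = random.Random(0)
n_samples, n_features = 10, 1000
labels = [0] * 5 + [1] * 5
rng.shuffle(labels)                      # labels carry no real signal
data = [[rng.gauss(0, 1) for _ in range(n_features)]
        for _ in range(n_samples)]       # pure-noise "expression" matrix

def separates(j):
    """True if some threshold on feature j classifies all samples correctly."""
    ordered = [lab for _, lab in sorted(zip((row[j] for row in data), labels))]
    return ordered == sorted(ordered) or ordered == sorted(ordered, reverse=True)

perfect = sum(separates(j) for j in range(n_features))
print(f"{perfect} of {n_features} pure-noise features "
      f"separate the random labels perfectly")
```

With 5-vs-5 labels, each noise feature separates them perfectly with probability 2/C(10,5) ≈ 0.8%, so about eight of the 1000 features are expected to do so by chance alone.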
Evaluating a Classifier • Validation does not mean that repeating classifier process results in similar gene sets • Validation means predictions for independent cases are accurate
Internal Validation of a Predictive Classifier • Split-sample validation • Often applied with too small a validation set • Don’t combine the training and validation set • Don’t validate multiple models and select the “best” • Cross-validation • Often misused by pre-selection of genes
Split Sample Approach • Separate training set of patients from test set • Patients should represent those eligible for a clinical trial that asks a therapeutically relevant question • Do not access information about patients in test set until a single completely specified classifier is agreed upon based on the training set data
Re-Sampling Approach • Partition data into training set and test set • Develop a single fully specified classifier of outcome on training set • Use the classifier to predict outcome for patients in the test set and estimate the error rate • Repeat the process for many random training-test partitions
Re-sampling is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates the process • With proper re-sampling, the model must be developed from scratch for each training set. This means that gene selection must be repeated for each training set.
Re-sampling (e.g., leave-one-out cross-validation) is widely misunderstood, even by statisticians, and widely misused in the published clinical literature • It is only applicable when there is a completely pre-defined algorithm for gene selection and classifier development that can be applied blindly to each training set
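The misuse is easy to reproduce on pure-noise data: selecting genes on the full data set and then cross-validating only the classifier yields optimistic accuracy, while repeating gene selection inside each leave-one-out fold gives the honest ~50%. A sketch with a simple nearest-centroid classifier (all sizes and names are illustrative):

```python
import random

rng = random.Random(0)
n, p, k = 30, 500, 10                    # samples, genes, genes selected
y = [i % 2 for i in range(n)]            # random-by-construction labels
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]  # pure noise

def top_genes(rows, labels, k):
    """Rank genes by absolute difference of class means."""
    def score(j):
        a = [r[j] for r, l in zip(rows, labels) if l == 0]
        b = [r[j] for r, l in zip(rows, labels) if l == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(p), key=score, reverse=True)[:k]

def centroid_predict(tr_rows, tr_labels, test_row, genes):
    """Nearest-centroid classification on the selected genes."""
    centroids = {}
    for c in (0, 1):
        rows = [r for r, l in zip(tr_rows, tr_labels) if l == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in genes]
    def dist(c):
        return sum((test_row[j] - m) ** 2 for j, m in zip(genes, centroids[c]))
    return min((0, 1), key=dist)

def loocv(select_inside):
    genes_full = top_genes(X, y, k)      # selection on ALL samples (the misuse)
    correct = 0
    for i in range(n):
        tr_X, tr_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(tr_X, tr_y, k) if select_inside else genes_full
        correct += centroid_predict(tr_X, tr_y, X[i], genes) == y[i]
    return correct / n

acc_biased = loocv(select_inside=False)  # gene selection outside the loop
acc_honest = loocv(select_inside=True)   # gene selection repeated per fold
print(f"partial (biased) LOOCV accuracy: {acc_biased:.2f}")
print(f"full (honest) LOOCV accuracy:    {acc_honest:.2f}")
```

Since the labels are noise, any accuracy well above 50% from the partial cross-validation is pure selection bias.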
Myth • Split sample validation is superior to LOOCV or 10-fold CV for estimating prediction error
Types of Clinical Outcome • Survival or disease-free survival • Response to therapy
90 publications identified that met criteria • Abstracted information for all 90 • Performed detailed review of statistical analysis for the 42 papers published in 2004
Major Flaws Found in 40 Studies Published in 2004 • Inadequate control of multiple comparisons in gene finding • 9/23 studies had unclear or inadequate methods to deal with false positives • 10,000 genes × 0.05 significance level = 500 expected false positives • Misleading reports of prediction accuracy • 12/28 reports based on incomplete cross-validation • Misleading use of cluster analysis • 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes • 50% of studies contained one or more major flaws
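The false-positive arithmetic in the multiple-comparisons bullet, with a Bonferroni-style correction shown for contrast (an illustrative sketch; the slide itself prescribes no particular correction):

```python
n_genes = 10_000
alpha = 0.05
# Under the global null, each gene is "significant" with probability alpha,
# so the expected number of false positives is n_genes * alpha:
expected_false_positives = n_genes * alpha
# A Bonferroni correction controls family-wise error by shrinking the
# per-gene level:
bonferroni_alpha = alpha / n_genes
print(f"expected false positives at per-gene alpha 0.05: "
      f"{expected_false_positives:.0f}")
print(f"Bonferroni-corrected per-gene level: {bonferroni_alpha:g}")
```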
Validation of Predictive Classifiers for Use with Available Treatments • Should establish that the classifier is reproducibly measurable and has clinical utility • Better patient outcome or equivalent outcome with less morbidity • Improvement relative to available staging tools
Developmental vs Validation Studies • Developmental studies should select patients sufficiently homogeneous for addressing a therapeutically relevant question • Developmental studies should develop a completely specified classifier • Developmental studies should provide an unbiased estimate of predictive accuracy • Statistical significance of association between prediction and outcome is not the same as predictive accuracy
Limitations to Developmental Studies • Sample handling and assay conduct are performed under controlled conditions that do not incorporate real world sources of variability • Poor analysis may result in biased estimates of prediction accuracy • Small study size limits precision of estimates of predictive accuracy • Cases may be unrepresentative of patients at other sites • Developmental studies may not estimate to what extent predictive accuracy is greater than that achievable with standard prognostic factors • Predictive accuracy is often not clinical utility
Independent Validation Studies • Predictive classifier completely pre-specified • Patients from different clinical centers • Specimen handling and assay simulates real world conditions • Study addresses medical utility of new classifier relative to practice standards
Types of Clinical Utility • Identify patients whose prognosis is sufficiently good without cytotoxic chemotherapy • Identify patients who are likely to benefit from a specific therapy or patients who are unlikely to benefit from it
Establishing Clinical Utility • Develop prognostic classifier for patients not receiving cytotoxic chemotherapy • Identify patients for whom • current practice standards imply chemotherapy • Classifier indicates very good prognosis without chemotherapy • Withhold chemotherapy to test predictions
Prospectively Planned Validation Using Archived Materials: Oncotype-Dx • Fully specified classifier developed using data from NSABP B20 applied prospectively to frozen specimens from NSABP B14 patients who received Tamoxifen for 5 years • Long-term follow-up available • Good-risk patients had very good relapse-free survival
Prospective Validation Design • Randomize patients with node-negative, ER+ breast cancer receiving TAM to chemotherapy vs classifier-determined therapy • Determine whether the classifier-determined arm has outcome equivalent to the arm in which all patients receive chemotherapy • Therapeutic equivalence trial • Gold standard, but rarely performed • Very inefficient: most patients get the same treatment in both arms, so the trial must be sized to detect a minuscule difference in outcome
Measure the classifier for all patients and randomize only those for whom classifier-determined therapy differs from the standard of care
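The inefficiency of the all-comers equivalence design can be quantified with a simple normal-outcome sketch (illustrative numbers, not from the source): if the two arms differ in treatment for only a fraction d of patients, the between-arm outcome difference is diluted by d, so sample size scales like 1/d². Randomizing only the discordant patients corresponds to d = 1.

```python
from statistics import NormalDist

def n_per_arm(delta, sd=1.0, alpha=0.05, power=0.9):
    """Per-arm sample size to detect mean difference delta, normal outcome."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (z * sd / delta) ** 2

effect = 0.3  # outcome difference among patients whose treatment actually differs
for d in (1.0, 0.3, 0.1):  # fraction of randomized patients with discordant therapy
    # Between-arm difference is diluted to d * effect:
    print(f"fraction discordant d={d:.1f}: {n_per_arm(d * effect):.0f} per arm")
```

Halving the discordant fraction quadruples the required sample size, which is why the all-comers equivalence trial is rarely performed.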
M-rx vs SOC • Case 1: SOC involves chemotherapy, M-rx does not • Validation by withholding chemotherapy and observing outcome of cases in a single-arm study • Case 2: SOC does not involve chemotherapy, M-rx does • Validation by withholding chemotherapy and observing outcome in a single-arm study? • Validation by randomizing chemo vs no chemo
US Intergroup Study • OncotypeDx risk score <15 • Tam alone • OncotypeDx risk score >30 • Tam + Chemo • OncotypeDx risk score 15-30 • Randomize to Tam vs Tam + Chemo
Key Steps in Development and Validation of Therapeutically Relevant Genomic Classifiers • Develop classifier for addressing a specific important therapeutic decision: • Patients sufficiently homogeneous and receiving uniform treatment so that results are therapeutically relevant • Treatment options and costs of misclassification such that a classifier is likely to be used • Perform internal validation of the classifier to assess whether it appears sufficiently accurate relative to standard prognostic factors to be worth further development • Translate the classifier to the platform that would be used for broad clinical application • Demonstrate that the classifier is reproducible • Independent validation of the completely specified classifier in a prospectively planned study
Types of Clinical Utility • Identify patients whose prognosis is sufficiently good without cytotoxic chemotherapy • Identify patients whose prognosis is so good on standard therapy S that they do not need additional treatment T • Identify patients who are likely to benefit from a specific systemic therapy and/or patients who are unlikely to benefit from it