Use of Genomics in Clinical Trial Design and How to Critically Evaluate Claims for Prognostic & Predictive Biomarkers Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov
BRB Website: brb.nci.nih.gov • PowerPoint presentations • Reprints • BRB-ArrayTools software • Data archive • Q/A message board • Web-based sample size planning • Clinical trials • Optimal 2-stage phase II designs • Phase III designs using predictive biomarkers • Phase II/III designs • Development of gene expression based predictive classifiers
Different Kinds of Biomarkers • Endpoint • Measured before, during and after treatment to monitor treatment effect • Surrogate of clinical endpoint • Pharmacodynamic • Predictive biomarkers • Measured before treatment to identify who will benefit from a particular treatment • Prognostic biomarkers • Measured before treatment to indicate long-term outcome for patients untreated or receiving standard treatment
Types of Validation for Prognostic and Predictive Biomarkers • Analytical validation • Accuracy, reproducibility, robustness • Clinical validation • Does the biomarker predict a clinical endpoint or phenotype? • Clinical utility • Does use of the biomarker result in patient benefit? • By informing treatment decisions • Is it actionable?
Prognostic and Predictive Biomarkers in Oncology • Single gene or protein measurement • Scalar index or classifier that summarizes expression levels of multiple genes
Prognostic Factors in Oncology • Many prognostic factors are not used because they are not actionable • Most prognostic factor studies are not conducted with an intended use • They use a convenience sample of heterogeneous patients for whom tissue is available • Retrospective studies of prognostic markers should be planned and analyzed with specific focus on intended use of the marker • Design of prospective studies depends on context of use of the biomarker • Treatment options and practice guidelines • Other prognostic factors
Potential Uses of a Prognostic Biomarker • Identify patients who have very good prognosis on standard treatment and do not require more intensive regimens • Identify patients who have poor prognosis on standard chemotherapy who are good candidates for experimental regimens
Prospective Marker Strategy Design • Patients are randomized to either • have marker measured and treatment determined based on marker result and clinical features • don’t have marker measured and receive standard of care treatment based on clinical features alone
Marker strategy design schematic: patients are randomized to Test or No Test; in the Test arm, Rx is determined by the test result; in the No-Test arm, Rx is determined by SOC.
Marker Strategy Design • Inefficient • Many patients get the same treatment regardless of which arm they are randomized to • Uninformative • Since patients in the standard of care arm do not have the marker measured, it is not possible to compare outcome for patients whose treatment is changed based on the marker result
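The dilution behind this inefficiency can be made concrete: if a fraction p of patients is marker-positive (so only they can receive a treatment that differs from SOC), then under a simple additive model the between-arm difference shrinks to p times the true effect, and the required sample size grows like 1/p². A minimal sketch under these illustrative assumptions, not a formula from the talk:

```python
# Marker strategy design: "test-guided Rx" arm vs. "SOC for everyone" arm.
# Only the marker-positive fraction p actually receives a different
# treatment across arms, so the observed between-arm effect is diluted
# to p * delta. With required n proportional to 1/effect^2, the sample
# size inflates by 1/p^2. Illustrative model, not from the slides.

def strategy_design_inflation(p_positive):
    """Sample-size inflation factor relative to a design that directly
    randomizes the marker-positive patients."""
    return 1.0 / p_positive ** 2

print(strategy_design_inflation(0.5))   # fourfold more patients at p = 0.5
```

Even with half the patients marker-positive, the strategy design needs roughly four times as many randomized patients as a design that directly compares the two treatments in the affected subgroup.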
Modified design schematic: apply the test to all eligible patients. If the test-determined Rx differs from SOC, randomize between the test-determined Rx and SOC; if the test-determined Rx is the same as SOC, the patient is off study.
Prospective Evaluation of OncotypeDx (TAILORx) • For patients with predicted low risk of recurrence • Withhold chemotherapy and observe long term recurrence rate • If recurrence rate is very low, potential chemotherapy benefit must be very small
Prospective Co-Development of Drugs and Companion Diagnostics • Develop a completely specified genomic classifier of the patients likely to benefit from a new drug • Establish analytical validity of the classifier • Use the completely specified classifier in the primary analysis plan of a phase III trial of the new drug
Guiding Principle • The data used to develop the classifier should be distinct from the data used to test hypotheses about treatment effect in subsets determined by the classifier • Developmental studies can be exploratory • Studies on which treatment effectiveness claims are to be based should not be exploratory
Enrichment design schematic: using phase II data, develop a predictor of response to the new drug. Patients predicted responsive are randomized between the new drug and control; patients predicted non-responsive are off study.
Evaluating the Efficiency of the Enrichment Design • Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004; Correction and supplement 12:3229, 2006 • Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005. • Reprints and interactive sample size calculations at http://linus.nci.nih.gov
Relative efficiency of targeted design depends on • proportion of patients test positive • effectiveness of new drug (compared to control) for test negative patients • When less than half of patients are test positive and the drug has little or no benefit for test negative patients, the targeted design requires dramatically fewer randomized patients
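These efficiency claims can be illustrated numerically. Assume a continuous endpoint with common variance, so required sample size is proportional to 1/(effect size)², with treatment effects δ+ in test-positive and δ− in test-negative patients and test-positive prevalence p. The all-comers design must detect the averaged effect p·δ+ + (1−p)·δ−, so its randomized sample size relative to the enrichment design is (δ+/average)². This is a simplified sketch, not the exact formulas of the Simon and Maitournam papers:

```python
# Simplified efficiency comparison: enrichment (targeted) design vs.
# all-comers design, assuming required n proportional to 1/effect^2.
# Illustrative model only; see Simon & Maitournam (2004, 2005) for
# the exact calculations.

def relative_sample_size(prevalence, effect_pos, effect_neg):
    """Ratio of randomized patients needed by the all-comers design
    relative to the enrichment design (larger = enrichment better)."""
    avg_effect = prevalence * effect_pos + (1 - prevalence) * effect_neg
    return (effect_pos / avg_effect) ** 2

# 25% test-positive, no benefit for test-negative patients:
print(relative_sample_size(0.25, 1.0, 0.0))   # 16.0
# 50% test-positive, half-size benefit for test-negative patients:
print(relative_sample_size(0.5, 1.0, 0.5))    # ~1.78
```

The two examples mirror the bullet above: with low prevalence and no benefit in test-negative patients the enrichment design needs dramatically fewer randomized patients, while with high prevalence and partial benefit in test-negative patients the advantage shrinks.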
Stratification design schematic: develop a predictor of response to the new Rx; patients predicted responsive and patients predicted non-responsive are each randomized between the new Rx and control.
Stratification Design • Use the test to structure a prospectively specified primary analysis plan • Having a prospective analysis plan is essential • “Stratifying” (balancing) the randomization is useful to ensure that all randomized patients have tissue available, but it is not a substitute for a prospective analysis plan • The purpose of the study is to evaluate the new treatment overall and for the pre-defined subsets; not to modify or refine the classifier • The purpose is not to demonstrate that repeating the classifier development process on independent data results in the same classifier
R Simon. Using genomics in clinical trial design, Clinical Cancer Research 14:5984-93, 2008 • R Simon. Designs and adaptive analysis plans for pivotal clinical trials of therapeutics and companion diagnostics, Expert Opinion on Medical Diagnostics 2:721-29, 2008
Use of Archived Specimens in Evaluation of Prognostic and Predictive Biomarkers (Richard M. Simon, Soonmyung Paik and Daniel F. Hayes) • Claims of medical utility for prognostic and predictive biomarkers based on analysis of archived tissues can be considered to have either a high or low level of evidence depending on several key factors. • Studies using archived tissues, when conducted under ideal conditions and independently confirmed, can provide the highest level of evidence. • Traditional analyses of prognostic or predictive factors, using non-analytically-validated assays on a convenience sample of tissues and conducted in an exploratory and unfocused manner, provide a very low level of evidence for clinical utility.
Use of Archived Specimens in Evaluation of Prognostic and Predictive Biomarkers (Richard M. Simon, Soonmyung Paik and Daniel F. Hayes) • For Level I evidence: • (i) Archived tissue adequate for a successful assay must be available on a sufficiently large number of patients from a phase III trial, so that the appropriate analyses have adequate statistical power and the patients included in the evaluation are clearly representative of the patients in the trial. • (ii) The test should be analytically and pre-analytically validated for use with archived tissue. • (iii) The analysis plan for the biomarker evaluation should be completely specified in writing prior to the performance of the biomarker assays on archived tissue and should be focused on evaluation of a single completely defined classifier. • (iv) The results from archived specimens should be validated using specimens from a similar, but separate, study.
Publications Reviewed • Original study on human cancer patients relating gene expression to clinical outcome • Survival or disease-free survival • Response to treatment • Published in English before December 31, 2004 • Analyzed gene expression of more than 1000 probes
90 publications identified that met criteria • Abstracted information for all 90 • Performed detailed review of statistical analysis for the 42 papers published in 2004
Major Flaws Found in 40 Studies Published in 2004 • Inadequate control of multiple comparisons in gene finding • 9/23 studies had unclear or inadequate methods to deal with false positives • 10,000 genes x .05 significance level = 500 false positives • Misleading report of prediction accuracy • 12/28 reports based on incomplete cross-validation • Misleading use of cluster analysis • 13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes • 50% of studies contained one or more major flaws
Control for Multiple Testing • If each gene is tested for significance at level α and there are n genes, then the expected number of false discoveries is nα • e.g., if n = 10,000 and α = 0.001, then 10 false “discoveries” • Control the FDR (false discovery rate) • g = number of genes reported as having expression significantly correlated with a phenotype • FDR = (number of false positives) / g
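A standard way to control the FDR is the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k with p(k) ≤ (k/n)·q, and reject the k smallest. A minimal sketch (the procedure is standard; the example p-values are made up):

```python
# Minimal Benjamini-Hochberg step-up procedure for controlling the
# false discovery rate across thousands of gene-level tests.

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return indices of hypotheses rejected at the given FDR level."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/n) * fdr ...
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / n * fdr:
            k = rank
    # ... and reject the k smallest p-values.
    return sorted(order[:k])

# Illustrative p-values for 10 "genes":
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.21, 0.5]
print(benjamini_hochberg(pvals, fdr=0.05))   # -> [0, 1]
```

Note the contrast with naive per-gene testing: at α = 0.05 three of these p-values fall below the cutoff, but the step-up thresholds admit only the first two.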
Evaluating a Classifier • Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data • Goodness of fit vs prediction accuracy
Split-Sample Evaluation • Training-set • Used to select features, select model type, determine parameters and cut-off thresholds • Test-set • Withheld until a single model is fully specified using the training-set. • Fully specified model is applied to the expression profiles in the test-set to predict class labels. • Number of errors is counted
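The protocol above can be sketched end to end. The simulated data and nearest-centroid classifier here are illustrative stand-ins; the point is the structure: feature selection and model fitting touch only the training set, and the test set is used exactly once, to count errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 40 samples x 200 genes; the first 20 genes
# carry real class signal, the rest are noise. Purely illustrative.
X = rng.normal(size=(40, 200))
y = np.repeat([0, 1], 20)
X[y == 1, :20] += 1.5

# Split ONCE, up front. The test set stays untouched until the model
# (features, type, and thresholds) is fully specified on the training set.
train = np.r_[0:15, 20:35]    # 15 samples per class
test = np.r_[15:20, 35:40]    # 5 samples per class, held out

# Feature selection and centroid estimation use TRAINING data only.
m0 = X[train][y[train] == 0].mean(axis=0)
m1 = X[train][y[train] == 1].mean(axis=0)
genes = np.argsort(-np.abs(m1 - m0))[:10]   # 10 most separated genes

def predict(x):
    """Nearest-centroid prediction using the frozen gene list."""
    d0 = np.linalg.norm(x[genes] - m0[genes])
    d1 = np.linalg.norm(x[genes] - m1[genes])
    return int(d1 < d0)

# The fully specified model is applied to the test set; errors are counted.
errors = sum(predict(X[i]) != y[i] for i in test)
print(f"{errors} errors on {len(test)} held-out samples")
```

If any step (gene selection, cutoff tuning, model choice) peeks at the test set, the error count is no longer an honest estimate of prediction accuracy.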
Leave-one-out Cross Validation • Leave-one-out cross-validation simulates the process of separately developing a model on one set of data and predicting for a test set of data not used in developing the model
Leave-one-out Cross Validation • Omit sample 1 • Develop multivariate classifier from scratch on training set with sample 1 omitted • Predict class for sample 1 and record whether prediction is correct
Leave-one-out Cross Validation • Repeat analysis for training sets with each single sample omitted one at a time • e = number of misclassifications determined by cross-validation • Subdivide e for estimation of sensitivity and specificity
Cross-validation is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation. • With proper cross-validation, the model must be developed from scratch for each leave-one-out training set. This means that feature selection must be repeated for each leave-one-out training set. • The cross-validated estimate of misclassification error is an estimate of the prediction error of the model obtained by applying the specified algorithm to the full dataset.
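Proper leave-one-out cross-validation, with gene selection repeated inside every fold, can be sketched as follows (a nearest-centroid classifier stands in for whatever modeling algorithm is actually used):

```python
import numpy as np

def loocv_error(X, y, n_genes=10):
    """Leave-one-out cross-validation in which the ENTIRE model-building
    process, including gene selection, is repeated inside every fold."""
    n = len(y)
    errors = 0
    for i in range(n):
        tr = np.r_[0:i, i + 1:n]          # training set omits sample i
        Xtr, ytr = X[tr], y[tr]
        # Gene selection uses only the fold's training samples.
        m0 = Xtr[ytr == 0].mean(axis=0)
        m1 = Xtr[ytr == 1].mean(axis=0)
        genes = np.argsort(-np.abs(m1 - m0))[:n_genes]
        # Nearest-centroid prediction for the held-out sample.
        d0 = np.linalg.norm(X[i, genes] - m0[genes])
        d1 = np.linalg.norm(X[i, genes] - m1[genes])
        errors += int((d1 < d0) != y[i])
    return errors / n
```

Moving the `genes = ...` line outside the loop, so that selection sees all samples once, is exactly the "incomplete cross-validation" flaw tallied in the review above.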
Prediction on Simulated Null Data • Generation of gene expression profiles • 14 specimens (P_i is the expression profile for specimen i) • Log-ratio measurements on 6000 genes • P_i ~ MVN(0, I_6000), i.e., pure noise with no class structure • Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)? • Prediction method • Compound covariate prediction • Compound covariate built from the log-ratios of the 10 most differentially expressed genes
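This null-data experiment is easy to reproduce in outline. With pure-noise data, selecting the 10 most differentially expressed genes on the full dataset and then cross-validating only the classifier fit yields a deceptively low error estimate, while repeating gene selection inside each fold gives an honest estimate near 50%. A sketch under illustrative assumptions (simple mean-difference gene ranking and a difference-weighted compound covariate, not necessarily the exact procedure used in the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure-noise data mimicking the slide's setup: 14 specimens, 6000 genes,
# no real class difference, so any honest error estimate should be ~50%.
X = rng.normal(size=(14, 6000))
y = np.repeat([0, 1], 7)

def compound_covariate_error(X, y, select_on_all_data):
    """LOOCV error; if select_on_all_data, gene selection (wrongly)
    sees the held-out sample, reproducing incomplete cross-validation."""
    n = len(y)
    errors = 0
    for i in range(n):
        tr = np.r_[0:i, i + 1:n]
        sel = np.arange(n) if select_on_all_data else tr
        m0 = X[sel][y[sel] == 0].mean(axis=0)
        m1 = X[sel][y[sel] == 1].mean(axis=0)
        genes = np.argsort(-np.abs(m1 - m0))[:10]  # 10 "most DE" genes
        # Compound covariate: difference-weighted sum of selected genes,
        # thresholded midway between the training-fold class means.
        w = (m1 - m0)[genes]
        scores_tr = X[tr][:, genes] @ w
        cut = (scores_tr[y[tr] == 0].mean() + scores_tr[y[tr] == 1].mean()) / 2
        errors += int((X[i, genes] @ w > cut) != y[i])
    return errors / n

print("genes selected on ALL data :", compound_covariate_error(X, y, True))
print("genes re-selected per fold :", compound_covariate_error(X, y, False))
```

The first estimate looks impressive despite the data containing no signal at all; only the second, fully cross-validated estimate reveals that the classifier is guessing.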