Translational Data Science

Translational Data Science L.J. Wei, Harvard University

Many thanks to • Lu Tian, Stanford • Tianxi Cai, Harvard • Brian Claggett, Harvard • Hajime Uno, Harvard • Takahiro Hasegawa, Shionogi, Japan • Scott Evans, Harvard • Lihui Zhao, Northwestern • Danyu Lin, UNC • Zhiliang Ying, Columbia • Zhezhen Jin, Columbia • Colleagues in the pharmaceutical industry

What is the goal of a clinical study? • To obtain a robust, clinically interpretable estimate of the treatment effect, from a risk-benefit perspective at the patient level, via efficient and reliable quantitative procedures

What are the issues? • The conventional way to conduct trials gives us fragmentary information • Lack of clinically meaningful evidence on the totality of outcomes • Difficult to use the trial results to manage future patients

A Few Methodology Issues 1. Estimation vs. testing • The p-value provides little clinical information about treatment effect/risk • The size of the effect matters • Goodness-of-fit test? Use prediction to assess model fit

TREAT study for EPO CV safety • If we follow the patients up to 48 months, the control arm's average stroke-free time is 46.9 months and the darbepoetin (Darb) arm's is 46.0 months. The difference is 0.9 months, with 95% CI (0.4, 1.4) months and p<0.001 (highly significant). • The p-value can exaggerate the treatment difference: a small increase in the Z-value may drastically decrease the p-value. The confidence interval estimate is much more stable and interpretable.
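To illustrate how steeply the p-value falls as Z grows, here is a minimal sketch that backs out the standard error from the reported 95% CI (assuming a normal approximation, which the slide does not state) and then tabulates two-sided p-values for a few Z-values. The function name `two_sided_p` is illustrative only.

```python
from statistics import NormalDist


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Reported summary: difference 0.9 months, 95% CI (0.4, 1.4) months.
estimate, lower, upper = 0.9, 0.4, 1.4

# Assumption: the CI is estimate +/- 1.96 * SE (normal approximation).
se = (upper - lower) / (2.0 * NormalDist().inv_cdf(0.975))
z = estimate / se
print(f"SE = {se:.3f} months, Z = {z:.2f}, p = {two_sided_p(z):.5f}")

# The p-value shrinks by orders of magnitude over a modest range of Z,
# while the interval estimate only narrows or shifts gradually.
for z_value in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(f"Z = {z_value:.1f} -> p = {two_sided_p(z_value):.6f}")
```

The 0.9-month difference stays the same clinical quantity whatever the p-value; only the interval conveys its size and precision.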

What is a clinically meaningful treatment effect via estimation? • Reimbursement issue beyond getting the medical product approved by regulatory agencies. • What is the “estimand?” • If the overall treatment effect is not “clinically impressive,” we may identify a “high value” subgroup via a pre-specified procedure

2. How do we define a primary endpoint with multiple outcomes? • What is current practice? • Define primary endpoints and secondary endpoints • Efficacy and toxicity (how to connect them together?) • Disease burden measure? • The conventional component-specific analysis – informative missing, censoring or competing risks

Example: A large cardiovascular study

What is the general clinical practice for treating a patient with cardiovascular diseases? • Following the patient over time • Having periodic clinical/lab exams/tests • Recording the time to multiple clinical/lab outcomes (heart attack, stroke, CV hosp, CV death…BP, HbA1C, toxicity..) • Assessing the disease burden/progression over time via totality of multiple outcomes • Making decision of treatment selections

A typical cardiovascular (CV) study • Comparing a new therapy with standard care • The question is whether the new treatment prevents bad CV outcomes/toxicity • Following each patient over time • Times to multiple clinical events are collected

Conventional approaches for clinical trials • Choosing a single outcome (e.g., time to clinical event) as the primary endpoint • Applying univariate analysis for the treatment difference • Figuring out how to handle informative censoring (competing risks) • Considering other outcomes (risk, benefit) as secondary endpoints • Not sure how to treat future patients from study results via those separate summary measures for efficacy/safety

Example: Beta-Blocker Evaluation of Survival (BEST) Trial (NEJM, 2001) • Study: bucindolol vs. placebo • Patients with advanced chronic heart failure (n = 2707) • Average follow-up: 2 years • Primary endpoint: overall survival • Hazard ratio for death = 0.90 (p-value = 0.1)

BEST Trial

Possible solutions? • Using the patient’s disease burden or progression information during the entire follow-up to define the “responder” • Creating more than one response category: ordinal categorical response • Brian Claggett’s thesis paper (published in Biostatistics)

BEST Example: 8 Categories • 1: No events • 2: Alive, non-HF hospitalization only • 3: Alive, 1 HF hosp. • 4: Alive, >1 HF hosp. • 5: Late non-CV death (>12 months) • 6: Late CV death (>12 months) • 7: Early non-CV death (<12 months) • 8: Early CV death (<12 months)
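The eight categories above can be read as a simple decision rule on each patient's follow-up history. The following is a minimal sketch of that rule; the record type `PatientHistory`, its field names, and the handling of a death at exactly 12 months (counted here as early) are assumptions made for illustration, not details given on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientHistory:
    """Hypothetical per-patient summary of follow-up."""
    death_month: Optional[float] = None  # None if alive at end of follow-up
    cv_death: bool = False               # only meaningful when death_month is set
    hf_hospitalizations: int = 0
    non_hf_hospitalizations: int = 0


def best_category(p: PatientHistory, early_cutoff: float = 12.0) -> int:
    """Map a patient history to the ordinal category 1 (best) to 8 (worst)."""
    if p.death_month is not None:
        # Assumption: a death at exactly the cutoff is treated as early.
        early = p.death_month <= early_cutoff
        if early:
            return 8 if p.cv_death else 7
        return 6 if p.cv_death else 5
    if p.hf_hospitalizations > 1:
        return 4
    if p.hf_hospitalizations == 1:
        return 3
    if p.non_hf_hospitalizations > 0:
        return 2
    return 1


print(best_category(PatientHistory(hf_hospitalizations=2)))
print(best_category(PatientHistory(death_month=20.0, cv_death=True)))
```

Death takes precedence over hospitalization history in this sketch, matching the ordering of the categories from least to most severe.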

Example: Treatment for HIV infected children • Primary endpoint: viral load reduction • Major secondary endpoint: growth profile over 48 weeks

Example: DMD rare disease • Nonsense mutation Duchenne muscular dystrophy (nmDMD) is a rare, X-linked, neuromuscular, childhood disorder.

Ambulatory Boys with Nonsense Mutation Muscular Dystrophy • Outcomes for quantifying muscle function • 6-minute walk distance (6MWD) • 10-meter walk/run • 4-stair climb • 4-stair descend

Comparative studies for DMD • Two trials done by PTC • The primary endpoint is 6MWD • Various secondary endpoints • Each study was a 48-week, multicenter, randomized, double-blind, placebo-controlled trial comparing the efficacy and safety of ataluren vs. placebo in ambulatory boys with nmDMD.

Graphical display for patient-level data [Figure: two panels, Treatment and Placebo, one row per patient (No.)]

How to analyze multiple outcome data? • For each column (specific outcome), obtaining the treatment difference D • Combining the D’s linearly (weighted average) • Evaluating how “unlikely” it is to get the observed combined statistic • Wei-Lachin (JASA, 1984) and Wei-Johnson (Biometrika, 1985) • Powerful if all the test statistics are in the “right direction”
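The weighted-average idea can be sketched generically as follows. This is not the specific estimator of the cited papers: it assumes each outcome already has a standardized difference Z_k = D_k / SE(D_k), that a correlation matrix among the Z_k is available, and that the combination is approximately normal. The Z-values and the common correlation of 0.5 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def combined_z(z, corr, weights=None):
    """Standardized linear combination w'Z / sqrt(w' R w) of correlated Z-statistics."""
    k = len(z)
    w = weights if weights is not None else [1.0 / k] * k
    numerator = sum(wi * zi for wi, zi in zip(w, z))
    variance = sum(w[i] * corr[i][j] * w[j] for i in range(k) for j in range(k))
    return numerator / sqrt(variance)


# Hypothetical standardized differences for four outcomes, all favoring treatment.
z_stats = [1.5, 1.2, 1.4, 1.1]
rho = 0.5  # hypothetical common correlation between outcomes
corr = [[1.0 if i == j else rho for j in range(4)] for i in range(4)]

t = combined_z(z_stats, corr)
p_one_sided = 1.0 - NormalDist().cdf(t)
print(f"combined Z = {t:.3f}, one-sided p = {p_one_sided:.3f}")
```

When every component points the same way, the combined statistic exceeds each individual Z; components pointing in opposite directions cancel, which is why the procedure is powerful only when the effects agree in direction.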

How unlikely is it to observe this pattern under the null hypothesis? [Figure: forest plots for Study 007 (ITT; ataluren 10, 10, 20 mg/kg, N=57; placebo, N=57) and Study 020 (ITT; ataluren, N=114; placebo, N=114). Endpoints: 6MWD, 10-meter walk/run, 4-stair climb, 4-stair descend. Horizontal axis: ∆ change at Week 48, LS mean with 95% CI, labeled Favors Placebo / Favors Ataluren.]

Another way to combine • For each outcome, we rank the observations over all patients, pooling the treatment groups • Add the ranks across each row (for each patient) so each patient has a rank score • Conduct a two-sample test using those scores • (O’Brien test)
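A minimal sketch of the rank-sum procedure follows. Ranks are taken over all patients pooled across the two arms, which is how the O’Brien rank-sum procedure is usually described; the final comparison uses a pooled-variance two-sample t statistic. The data are hypothetical, with larger values meaning better outcomes, and the helper names are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_scores(rows):
    """rows[i][k] is outcome k for patient i; returns one summed rank per patient."""
    n_outcomes = len(rows[0])
    per_outcome = [midranks([row[k] for row in rows]) for k in range(n_outcomes)]
    return [sum(per_outcome[k][i] for k in range(n_outcomes)) for i in range(len(rows))]


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))


# Hypothetical data: two outcomes per patient.
treated = [[30, 5], [25, 4], [40, 6], [35, 3]]
placebo = [[20, 2], [28, 1], [15, 3], [22, 2]]

scores = rank_sum_scores(treated + placebo)
t_stat = pooled_t(scores[:len(treated)], scores[len(treated):])
print(scores, round(t_stat, 2))
```

Ranking puts outcomes measured on different scales onto a common footing, but the resulting score has no clinical unit, which is the limitation raised next.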

Limitation of this combination approach • Different outcomes have different scales, so it may be only useful as a powerful test procedure • How to get an overall estimate for treatment effect?

3. Identifying a high value subgroup of patients? • A negative trial does not mean the treatment is no good for anyone • A positive trial does not mean it works for everyone • The usual subgroup analysis is not adequate to address this issue • Need a built-in pre-specified procedure for identifying patients who benefit from treatment • FDA’s guidance on predictive enrichment (2012)

4. How to monitor trials “quantitatively” via prediction? • The usual practice is to use the p-value (O’Brien-Fleming stopping boundaries, etc.). • Use conditional power? • Use a prediction confidence interval estimate (EAST, new version)
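For reference, conditional power at an interim look can be sketched with the standard B-value (Brownian motion) formulation. This sketch assumes a one-sided test at the final analysis with an unadjusted critical value, ignoring any group-sequential adjustment, and by default projects the current trend forward. The function and parameter names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def conditional_power(z_interim, info_fraction, drift=None, alpha_one_sided=0.025):
    """Probability of crossing the final critical value given the interim Z.

    info_fraction: fraction t of total statistical information observed (0 < t < 1).
    drift: assumed expected final Z-value; if None, the current trend
           z_interim / sqrt(t) is used.
    """
    if not 0.0 < info_fraction < 1.0:
        raise ValueError("info_fraction must lie strictly between 0 and 1")
    nd = NormalDist()
    b = z_interim * sqrt(info_fraction)  # B-value at the interim look
    theta = b / info_fraction if drift is None else drift
    z_crit = nd.inv_cdf(1.0 - alpha_one_sided)
    remaining = 1.0 - info_fraction
    return 1.0 - nd.cdf((z_crit - b - theta * remaining) / sqrt(remaining))


# Halfway through the trial with an interim Z of 2.0, under the current trend:
print(round(conditional_power(2.0, 0.5), 3))
# Same interim result, but assuming no true effect for the remainder:
print(round(conditional_power(2.0, 0.5, drift=0.0), 3))
```

Conditional power is still a statement about crossing a testing boundary; a prediction interval for the final effect estimate reports the likely size of the effect instead.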

5. How to monitor safety? • What is the conventional way? • Component-wise tabulation or analysis? • No information about multiple AE events at the patient level • Graphical method to show the temporal toxicity profile?

6. Quantifying treatment contrast (difference)? • Should be a model-free parameter • Using difference of means, medians, etc. • For censored data, using a constant hazard ratio (heavily model-based)? • A model-based measure is difficult to interpret or validate

Issues for the hazard ratio estimate • Hazard ratio estimate is routinely used for designing, monitoring and analyzing clinical studies in survival analysis

Model-Free Parameter for Treatment Difference • Considering a two-treatment comparison study in “survival analysis” • How do we quantify the treatment difference? • Median failure time (may not be estimable)? • t-year survival rate (not an overall measure)? • A constant hazard ratio over time with the log-rank test?

Eastern Cooperative Oncology Group • E4A03 trial to compare low- and high-dose dexamethasone for naïve patients with multiple myeloma • The primary endpoint is the survival time • n=445 • The trial stopped early at the second interim analysis; the low dose was superior. • Patients on the high-dose arm then received the low dose, and follow-up for overall survival was continued.

A Cancer Study Example [Figure: curves for Group 1 and Group 2]

The proportional hazards assumption is not valid • The PH estimator is estimating a quantity which cannot be interpreted and, worse, depends on the study-specific censoring distributions • Any model-based treatment contrast has such issues (need a model-free parameter) • The logrank test is not powerful

Conventional analysis: • Log-rank test: p=0.47 • Hazard Ratio: HR=0.87 (0.60, 1.27)

What is the alternative way for survival analysis? • Using the area under the curve of Kaplan-Meier estimate up to a fixed time point • Restricted mean survival time • Model-free and a global measure of efficacy • Can be estimated even under heavy censoring
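The restricted mean survival time (RMST) up to a fixed time τ is the area under the Kaplan-Meier curve on [0, τ]. A minimal sketch of that calculation follows; the function name `km_rmst` and the toy data are illustrative. If the last observation before τ is censored, this sketch simply carries the last Kaplan-Meier value forward to τ, which is one convention among several.

```python
def km_rmst(times, events, tau):
    """Area under the Kaplan-Meier curve on [0, tau].

    times:  follow-up time for each patient
    events: 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, last_t, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        deaths = removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        area += surv * (t - last_t)  # rectangle under the current step
        if deaths:
            surv *= 1.0 - deaths / at_risk
        at_risk -= removed
        last_t = t
    area += surv * (tau - last_t)    # final piece up to tau
    return area


# Hypothetical follow-up times in months (1 = event, 0 = censored).
times = [6, 13, 21, 30, 37, 38, 45, 47]
events = [1, 0, 1, 1, 0, 1, 0, 0]
print(round(km_rmst(times, events, tau=40.0), 2))
```

With no censoring the result equals the sample mean of min(T, τ), which gives a quick sanity check on any implementation.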

The area under the Kaplan-Meier curve as a summary of the survival distribution [Figure: two Kaplan-Meier panels with the area under the curve shaded; RMST 33.3 months and RMST 35.4 months]

Cancer Study Example Restricted Mean (up to 40 months): • 35.4 months vs. 33.3 months • Δ = 2.1 (0.1, 4.2) months; p=0.04 • Ratio of Survival time = 35.4/33.3 = 1.06 (1.00, 1.13) • Ratio of time lost = 6.7/4.6 = 1.46 (1.02, 2.13)
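The point estimates above follow directly from the two restricted means and the 40-month horizon, with time lost defined as τ minus the RMST. A short arithmetic check follows; the variable names are illustrative, and the confidence intervals cannot be reproduced from these summaries alone.

```python
tau = 40.0                      # truncation time in months
rmst_a, rmst_b = 35.4, 33.3     # restricted means for the two arms

difference = rmst_a - rmst_b
ratio_survival_time = rmst_a / rmst_b

# Restricted mean time lost: the part of the 40 months not survived on average.
time_lost_a = tau - rmst_a
time_lost_b = tau - rmst_b
ratio_time_lost = time_lost_b / time_lost_a

print(f"difference = {difference:.1f} months")
print(f"ratio of survival time = {ratio_survival_time:.2f}")
print(f"ratio of time lost = {ratio_time_lost:.2f}")
```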

7. Post-marketing/safety studies? • It is not appropriate to use an event-driven procedure to conduct a safety study. • The event rate is low, so the exposure time matters • Requires a lot of resources (large or long-term study)

CV safety study for anti-diabetes drugs • Event-driven studies, that is, we need to have a pre-specified number of events so the resulting confidence interval for the treatment difference is “narrow” • For example, the upper bound of the 95% confidence interval for the hazard ratio is less than 1.3
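To give a sense of scale, the number of events needed to rule out a hazard ratio of 1.3 can be approximated with the usual Schoenfeld-type formula. The sketch below assumes 1:1 randomization, a true hazard ratio of 1.0, a one-sided 2.5% level, and 90% power; none of these design values comes from the slide, and the function name is illustrative.

```python
from math import ceil, log
from statistics import NormalDist


def events_to_rule_out(margin=1.3, true_hr=1.0, alpha_one_sided=0.025, power=0.90):
    """Approximate total events so the upper CI bound for the HR falls below `margin`.

    Uses d = 4 * (z_alpha + z_beta)^2 / (log(margin) - log(true_hr))^2,
    which assumes equal allocation between the two arms.
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha_one_sided)
    z_beta = nd.inv_cdf(power)
    return ceil(4.0 * (z_alpha + z_beta) ** 2 / (log(margin) - log(true_hr)) ** 2)


n_events = events_to_rule_out()
print(n_events)
```

Because the requirement is on events rather than patients, a low event rate translates into a very large or very long study.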

The EXAMINE trial (alogliptin) NEJM, October 3, 2013

RMST (24 months): • Placebo 21.9 (21.7, 22.2) • Alogliptin 22.0 (21.8, 22.3) • Difference -0.08 (-0.39, 0.24) • Ratio 1.00 (0.98, 1.01)
RMST (30 months): • Placebo 27.1 (26.7, 27.4) • Alogliptin 27.2 (26.9, 27.5) • Difference -0.12 (-0.56, 0.33) • Ratio 1.00 (0.98, 1.01)

What if a smaller study? 95% confidence intervals for various measures

8. Evaluating new treatments for rare diseases • Utilizing the registry data or natural history data • Single-arm trial? • Multiple outcomes? • It is not at all clear how to quantify disease burden over time

How to make treatments comparable across studies? • Which patient population are we referring to? • This is not clear when using the propensity score procedure. • Fit a model relating outcome to covariates with the registry data, then apply the fitted model to the clinical trial population?

9. Meta analysis for safety issues

Nissen and Wolski (2007) performed a meta-analysis to examine whether rosiglitazone (Avandia, GSK), a drug for treating type 2 diabetes mellitus, significantly increases the risk of MI or CVD-related death.

Example: Effect of Rosiglitazone on MI or CVD Deaths • Avandia was introduced in 1999 and is widely used as monotherapy or in fixed-dose combinations with either metformin (Avandamet) or glimepiride (Avandaryl). • The original approval of Avandia was based on its ability to reduce blood glucose and glycated hemoglobin levels. • Initial studies were not adequately powered to determine the effects of this agent on micro- or macrovascular complications of diabetes, including cardiovascular morbidity and mortality.
