Facing Challenging Situations When Grading Strength of Evidence Presenters: Holger Schünemann, MD, PhD, McMaster University Nancy Berkman, PhD, RTI International
Process Overview (flow diagram) • Phases: Topic Generation; Topic Development & Refinement; Evidence Review; Dissemination & Research Needs Development; Translation & Implementation In Practice • Step labels shown in the diagram: Establish Review & Stakeholder Team With Appropriate Expertise*; Nomination Of Topics; Clarify Intent; Comprehensive Search; Grade Body Of Evidence; Evidence Report Gaps + Public, Expert* ID Topics; Narrative (& Quantitative) Synthesis; Screen & Select Studies; Horizon Scanning; Develop Protocol; Analytic Framework; Public Comments & Peer Review*; Analytic/Conceptual Framework; Appraise Risk Of Bias/Quality; Public Comment; Finalize Key Questions; Prioritization; Finalize Protocol; Final Report; Abstract Data; Future Research Needs Report • * Manage COI
Steps in AHRQ EPC Approach to Grading SOE • Grade separately for RCT and observational study evidence, aggregated across studies, for each outcome • Score 4 required domains • Risk of bias • Consistency • Directness • Precision • Consider, and possibly score, 4 additional domains • Dose-response association • Plausible confounding • Strength of association • Publication bias • Combine into a single SOE grade
Risk of bias domain score • Concerns both study design and study conduct for individual studies • Assesses the aggregate quality or risk of bias of studies separately for RCTs and observational studies and integrates those assessments into an overall risk of bias score • Scores: high, medium, or low • High risk of bias lowers SOE grade • Low risk of bias raises SOE grade
Consistency domain score • Degree of similarity in the effect sizes of different studies within the evidence base. • Consistent: same direction of effect (same side of “no effect”) and narrow range of effect sizes • Inconsistent: non-overlapping confidence intervals, significant unexplained clinical or statistical heterogeneity, etc • Unknown or not applicable: single study so cannot be assessed
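The consistency definitions above can be sketched in code. This is a minimal illustration only, assuming each study is summarized as a ratio-scale point estimate with a confidence interval; the function names and the rule of treating overlapping intervals as a stand-in for "narrow range of effect sizes" are assumptions for the example, not part of the AHRQ EPC guidance, and unexplained clinical or statistical heterogeneity still requires reviewer judgment.

```python
from typing import List, Tuple

# Hypothetical representation: (point_estimate, ci_lower, ci_upper) on a ratio scale.
Study = Tuple[float, float, float]


def same_direction(studies: List[Study], null_value: float = 1.0) -> bool:
    """True if all point estimates lie on the same side of 'no effect'.

    Estimates exactly equal to the null value are ignored (an assumption).
    """
    above = [est > null_value for est, _, _ in studies if est != null_value]
    return all(above) or not any(above)


def cis_all_overlap(studies: List[Study]) -> bool:
    """True if every pair of confidence intervals overlaps.

    For intervals on a line, this holds exactly when the largest lower
    bound does not exceed the smallest upper bound.
    """
    return max(lo for _, lo, _ in studies) <= min(hi for _, _, hi in studies)


def consistency_score(studies: List[Study]) -> str:
    """Rough consistency label following the three categories on the slide."""
    if len(studies) < 2:
        return "unknown"  # single study, so consistency cannot be assessed
    if same_direction(studies) and cis_all_overlap(studies):
        return "consistent"
    return "inconsistent"


# Hypothetical inputs, not taken from the presentation.
print(consistency_score([(0.80, 0.60, 1.00), (0.70, 0.50, 0.95)]))
print(consistency_score([(0.50, 0.40, 0.60), (1.50, 1.20, 1.90)]))
```

A label produced this way is only a starting point for discussion between the two reviewers.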
Directness domain score • Whether evidence reflects a single, direct link between the intervention of interest and the ultimate health outcome under consideration • Direct: single direct link between the intervention and health outcome • Indirect: evidence relies on • Surrogate or proxy outcomes • More than one body of evidence (no head-to-head studies)
Precision domain score • Degree of certainty for estimate of effect with respect to a specific outcome • Precise: estimate allows a clinically useful decision • Imprecise: confidence interval is so wide that it could include clinically distinct (even conflicting) conclusions
Additional “discretionary” domains • Dose-response association (pattern of larger effect with greater exposure): present, not present, NA • Plausible confounders (confounding that works in the direction opposite to the observed effect, so that it would “weaken” the effect): present, absent • Strength of association (effect so large that it cannot have occurred solely as a result of bias from confounders): strong, weak • Publication bias (not formally scored) • Unlike GRADE, applicability is considered separately
Integrating domain scores into a SOE grade • EPCs can use different approaches to incorporating multiple domains into an overall strength of evidence grade • GRADE algorithm • EPC’s own weighting system • A qualitative approach • Evaluation needs to be made by (at least) 2 reviewers • Must document approach used
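One of the options listed above is a GRADE-style algorithm. The sketch below illustrates the general idea under stated assumptions: evidence from RCTs starts at "high" and observational evidence at "low", each domain can rate down by 1 or 2 levels (matching the -1 and -2 poll options later in this deck), and the discretionary domains can rate up. The function name, the dictionary layout, and the limit of 2 levels per domain for rating up are illustrative choices, not an official implementation, and an EPC using the AHRQ labels would report "insufficient" rather than "very low" at the bottom of the scale.

```python
from typing import Dict, Optional

LEVELS = ["very low", "low", "moderate", "high"]
START = {"rct": 3, "observational": 1}  # starting index into LEVELS


def grade_soe(design: str,
              downgrades: Dict[str, int],
              upgrades: Optional[Dict[str, int]] = None) -> str:
    """Combine domain judgments into one grade (illustrative sketch only).

    downgrades: levels to rate down per domain (0, 1, or 2),
                e.g. {"risk_of_bias": 1, "imprecision": 2}
    upgrades:   levels to rate up per discretionary domain (0, 1, or 2),
                e.g. {"strength_of_association": 1}
    """
    if design not in START:
        raise ValueError("design must be 'rct' or 'observational'")
    upgrades = upgrades or {}
    for value in list(downgrades.values()) + list(upgrades.values()):
        if value not in (0, 1, 2):
            raise ValueError("each domain adjustment must be 0, 1, or 2")
    score = START[design] - sum(downgrades.values()) + sum(upgrades.values())
    score = max(0, min(score, len(LEVELS) - 1))  # clamp to the defined scale
    return LEVELS[score]


# Hypothetical judgments, not the presenters' answers to any challenge.
print(grade_soe("rct", {"imprecision": 2}))
print(grade_soe("observational", {}, {"strength_of_association": 1}))
```

Whatever approach is used, the slide's requirements still apply: at least 2 reviewers make the evaluation, and the approach is documented.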
Challenge 1: CER of benefits, 1 study, no meta-analysis or CIs • Topic: Antidepressant medication response in the elderly • Evidence description: 1 fair quality RCT (N = 108). Outcome evaluated through 2 validated scales that are clinician administered. • Scale 1: Results reported in bar graph only: (p = 0.03) • Scale 2: Results reported in bar graph only: (p = 0.04)
Challenge 1: Precision Score • AHRQ/GRADE approach: Precise • AHRQ approach: Imprecise • GRADE approach: Imprecision Serious (-1) • GRADE approach: Imprecision Very Serious (-2)
Challenge 1: Strength of evidence grade • AHRQ/GRADE approach: High • AHRQ/GRADE approach: Moderate • AHRQ/GRADE approach: Low • AHRQ approach: Insufficient • GRADE approach: Very low
Challenge (1) - Response • Rules for precision: • Based on CI, number of events, effect size • Not perfect but good guides • Judgment is simple and possible for this example • With only 108 participants, downgrade for imprecision, possibly by two levels, unless the effect is very large (we need the effect size to make this evaluation)
Optimal information size • We suggest the following: if the total number of patients included in a systematic review is less than the number of patients generated by a conventional sample size calculation for a single adequately powered trial, consider rating down for imprecision. Authors have referred to this threshold as the “optimal information size” (OIS)
For systematic reviews • If the 95% CI excludes a relative risk (RR) of 1.0 and the total number of events or patients exceeds the OIS criterion, precision is adequate. If the 95% CI includes appreciable benefit or harm (we suggest a RR of under 0.75 or over 1.25 as a rough guide) rating down for imprecision may be appropriate even if OIS criteria are met.
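The two rules above can be written as a small decision helper. This is a hedged sketch: the function name and return strings are invented for illustration, the 0.75 and 1.25 bounds are the rough guide quoted on the slide, and the final branch (CI includes 1.0 but excludes appreciable benefit and harm) is not covered by the slide text, so it is flagged for judgment rather than decided.

```python
def precision_judgment(ci_lower: float, ci_upper: float,
                       total_n: int, ois: int,
                       benefit_rr: float = 0.75,
                       harm_rr: float = 1.25) -> str:
    """Apply the OIS and confidence-interval rules to a pooled relative risk."""
    if total_n < ois:
        return "consider rating down: OIS not met"
    excludes_null = ci_lower > 1.0 or ci_upper < 1.0
    if excludes_null:
        return "precision adequate"
    if ci_lower < benefit_rr or ci_upper > harm_rr:
        return "consider rating down: CI includes appreciable benefit or harm"
    # Not addressed by the slide; left to the reviewers (assumption).
    return "judgment needed: CI includes no effect but is narrow"


# Hypothetical sample sizes and OIS; only the last CI comes from the deck
# (the pooled risk ratio in Challenge 3).
print(precision_judgment(0.60, 0.90, total_n=2000, ois=1500))
print(precision_judgment(0.70, 1.30, total_n=2000, ois=1500))
print(precision_judgment(0.92, 1.16, total_n=2000, ois=1500))
```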
Figure 4: Optimal information size given alpha of 0.05 and beta of 0.2 for varying control event rates and relative risks. For any chosen line, the evidence meets the optimal information size criterion if the sample size is above the line.
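The optimal information size described above is the sample size from a conventional calculation for a single adequately powered trial. A minimal sketch follows, assuming the standard normal-approximation formula for comparing two proportions with equal group sizes; the function name and the example control event rate and relative risk are hypothetical, and the figure in the original deck may rest on a slightly different formula.

```python
import math
from statistics import NormalDist


def optimal_information_size(control_event_rate: float,
                             relative_risk: float,
                             alpha: float = 0.05,
                             beta: float = 0.2) -> int:
    """Total patients (both arms) for a two-sided test of two proportions."""
    p1 = control_event_rate
    p2 = control_event_rate * relative_risk  # expected event rate with treatment
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("event rates must lie in (0, 1) and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(1 - beta)        # power = 1 - beta
    n_per_group = ((z_alpha + z_beta) ** 2
                   * (p1 * (1 - p1) + p2 * (1 - p2))
                   / (p1 - p2) ** 2)
    return 2 * math.ceil(n_per_group)


# Hypothetical scenario: 20% control event rate, relative risk of 0.75.
ois = optimal_information_size(0.20, 0.75)
print(ois, "patients needed; a single 108-patient trial meets OIS:", 108 >= ois)
```

Under these assumed inputs the required total is on the order of 1,800 patients, which is why a single trial of 108 participants would usually be rated down for imprecision unless the effect is very large.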
Challenge 2: CER of harms, mixed outcomes and mixed results from RCTs and observational studies • Topic: Risk of suicidality from antidepressants • Evidence description: • RCT: 1 fair quality study • Suicidal ideation worse with Drug B (p = 0.03) • Case control: 1 fair quality study (N = 1,300) • Non-fatal suicidal behavior; Drug A (OR = 1.16); Drug B (OR = 1.29) • Overlapping confidence intervals comparing each with Drug C • Nested case control: 1 good quality study (N = 10,000) • Completed suicides in adjusted analyses (p = NS)
Challenge 2: Directness Score RCTs • AHRQ/GRADE approach: Direct • AHRQ approach: Indirect • GRADE approach: Serious indirectness (-1) • GRADE approach: Very serious indirectness (-2)
Challenge 2: Directness Score Observational Studies • AHRQ/GRADE approach: Direct • AHRQ approach: Indirect • GRADE approach: Serious indirectness (-1) • GRADE approach: Very serious indirectness (-2)
Challenge 2: Strength of evidence grade • AHRQ/GRADE approach: High • AHRQ/GRADE approach: Moderate • AHRQ/GRADE approach: Low • AHRQ approach: Insufficient • GRADE approach: Very low
Challenge (2) - Response • Indirect comparison • Downgrade • Observational study can provide more direct evidence • Need to go through full framework to find that out
Challenge 3: CER of benefits, RCTs found no difference between treatments • Topic: Medication response • Evidence description: 5 fair quality RCTs; number of participants ranges from 90 to 200; in each study p = NS • Meta-analysis pooled risk ratio: 1.03 (95% CI, 0.92-1.16)
Challenge 3: Strength of evidence grade • AHRQ/GRADE approach: High • AHRQ/GRADE approach: Moderate • AHRQ/GRADE approach: Low • AHRQ approach: Insufficient • GRADE approach: Very low
Challenge 3: Are the treatments equivalent for this outcome? • Yes • No • Don’t know
Challenge (3) - Response • Superiority, inferiority and non-inferiority depend on more than one outcome. • Need to specify threshold. If threshold met, not imprecise, if not met, imprecise.
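The point about thresholds can be illustrated with the pooled risk ratio from Challenge 3 (95% CI, 0.92 to 1.16). The sketch below is an assumption-laden example: the function name and labels are invented, and both sets of margins are hypothetical (the wider pair borrows the 0.75 and 1.25 rough guide quoted earlier in this deck; the narrower pair is made up to show how the conclusion changes).

```python
def classify_against_margins(ci_lower: float, ci_upper: float,
                             lower_margin: float, upper_margin: float) -> str:
    """Compare a ratio-scale CI with prespecified equivalence margins."""
    if lower_margin <= ci_lower and ci_upper <= upper_margin:
        return "threshold met: not imprecise for an equivalence claim"
    if ci_upper < lower_margin or ci_lower > upper_margin:
        return "threshold met: treatments differ"
    return "threshold not met: imprecise"


# CI from Challenge 3; margins are hypothetical.
print(classify_against_margins(0.92, 1.16, 0.75, 1.25))
print(classify_against_margins(0.92, 1.16, 0.90, 1.10))
```

The same interval supports equivalence under the wider margins and is imprecise under the narrower ones, which is why the threshold has to be specified before grading.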
Figure 1: Rating down for imprecision in guidelines: thresholds are key. [Plot of a mortality estimate and its confidence interval on a risk difference (%) axis running from "Favors Intervention" to "Favors Control".] • Threshold if side effects, toxicity and cost are minimal, NNT = 200: the entire confidence interval lies to the left of the threshold, so do not rate down for imprecision • Threshold if side effects, toxicity and cost are appreciable, NNT = 100: the confidence interval crosses the threshold, so rate down for imprecision
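The arithmetic behind the two thresholds in Figure 1 is that a number needed to treat (NNT) corresponds to an absolute risk difference of 100/NNT percentage points, so NNT = 200 is 0.5% and NNT = 100 is 1.0%. The sketch below assumes the effect is expressed as an absolute risk reduction (positive values favor the intervention); the function names and the confidence interval are hypothetical, since the figure's actual values did not survive in this transcript.

```python
def nnt_to_risk_difference_pct(nnt: float) -> float:
    """Absolute risk difference, in percentage points, implied by an NNT."""
    if nnt <= 0:
        raise ValueError("NNT must be positive")
    return 100.0 / nnt


def rate_down_for_imprecision(arr_lower_pct: float, arr_upper_pct: float,
                              threshold_pct: float) -> bool:
    """True if the confidence interval crosses the decision threshold."""
    return arr_lower_pct < threshold_pct < arr_upper_pct


# Hypothetical CI for the absolute risk reduction: 0.6% to 1.8%.
ci = (0.6, 1.8)
for nnt in (200, 100):
    threshold = nnt_to_risk_difference_pct(nnt)
    print(nnt, threshold, rate_down_for_imprecision(ci[0], ci[1], threshold))
```

With this assumed interval, the whole CI lies beyond the 0.5% threshold (no rating down) but straddles the 1.0% threshold (rate down), mirroring the two cases in the figure.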
Challenge 4: CER of serious harms, mixed findings in RCTs and observational studies • Topic: Serious infection from rheumatoid arthritis treatments • Evidence description: • RCTs: 4 fair quality studies. Number of participants ranges from 80 to 531. Number of serious infections presented for each treatment; a very rare event. In each study p = NS • Retrospective cohort study 1: fair quality (N = 5,326). Hospitalization with a definite bacterial infection: higher for Treatment A. Adjusted HR = 1.94 (95% CI, 1.32 to 2.83) • Retrospective cohort study 2: good quality/low risk of bias (N = 2,369). Adjusted rate of serious bacterial infection: RR = 1.0 (95% CI, 0.6 to 1.71)
Challenge 4: Risk of bias score • AHRQ/GRADE approach: Low risk of bias • AHRQ approach: Medium risk of bias • AHRQ approach: High risk of bias • GRADE approach: Serious risk of bias (-1) • GRADE approach: Very serious risk of bias (-2)
Challenge 4: Strength of evidence grade • AHRQ/GRADE approach: High • AHRQ/GRADE approach: Moderate • AHRQ/GRADE approach: Low • AHRQ approach: Insufficient • GRADE approach: Very low
Challenge (4) - Response • Sequential work • Use the evidence that is of higher quality • Mention observational evidence in footnote
Challenge 5: Can you use less stringent criteria to evaluate risk of bias if the condition is likely to result in death without treatment? • Topic: Use of hematopoietic stem cell transplantation (HSCT), also known as bone marrow transplantation • Low risk of bias modified to be: natural history (or severity) of disease made spontaneous remission highly unlikely or impossible • Evidence description: • For single HSCT for Wolman’s disease: In the natural history of this disease, death occurs by approximately 6 months of age. Of five cases reported in the evidence, three patients were alive at 4 to 11 years’ follow-up, with normal function and attending school. The strength of the body of evidence is high.
Challenge 5: Do you agree that it would be appropriate to use less stringent criteria to evaluate risk of bias under these circumstances? • Yes • No • Don’t know
Challenge 5: Do you agree that it would be appropriate to use less stringent criteria to evaluate risk of bias under these circumstances? • One reviewer commented that, rather than modifying Risk of Bias criteria, “the SOE system does allow consideration of other factors through the ‘optional domains’ if applied correctly.” These optional domains are: • dose-response association, • plausible confounding that would decrease observed effect, • strength of association (magnitude of effect), and • publication bias. • Do you agree?
Challenge (5) - Response • Particular design features of extremely rigorous well-conducted observational studies may warrant consideration for rating up quality of evidence. For instance, a case-control study found that sigmoidoscopy was associated with a reduction in colon cancer mortality for lesions in range of the sigmoidoscope (OR 0.30, 95% CI 0.19 to 0.48), but not beyond the range of the sigmoidoscope (OR 0.96, 95% CI 0.61 to 1.50). Possible bias because of unmeasured confounders should have been very similar if not identical in the two situations, considerably raising confidence in the causal effect of the sigmoidoscopy.
Challenge (5) - Response • Furthermore, when considering rating up the quality of evidence for magnitude of effect, factors related to the magnitude are the rapidity of the treatment response and the previous underlying trajectory of the condition [6]. For example, we feel confident that hip replacement has a large effect not only because of the size of the treatment response, but because the natural history of hip osteoarthritis is a progressive deterioration that surgery rapidly and uniformly reverses. The rapidity of response compared to the known trajectory of the condition can also be considered (and calculated [6]) as a large effect size. • An additional factor mitigating the problem of rating up the quality because of a large effect is that indirect evidence usually provides further support for large treatment effects. For example, oral anticoagulation in mechanical heart valves has not been compared to placebo in an RCT, but evidence from observational studies suggests a large effect of oral anticoagulation in decreasing thromboembolic events [8, 7]. Supplementary indirect evidence from randomized trials that have demonstrated large reductions in the relative risk of thrombosis with anticoagulation in analogous conditions such as atrial fibrillation further increases our confidence in the beneficial effect of anticoagulation [9]. • Similarly, the effectiveness of antibiotic prophylaxis in a variety of other situations supports observational studies that suggest that antibiotic prophylaxis results in an 89% relative risk reduction in meningococcal disease in contacts of patients who have suffered the illness [10]. • Another situation allows an inference of a strong association without a formal comparative study. Consider the question of the impact of routine colonoscopy versus no screening for colon cancer on the rate of perforation associated with colonoscopy. 
Here, a large series of representative patients undergoing colonoscopy will provide high quality evidence on the risk of perforation associated with colonoscopy. When control rates are near 0 (i.e. we are certain that the incidence of spontaneous colon perforation in patients not undergoing colonoscopy is very low), case series of representative patients (one might call these cohort studies of affected patients if they include large numbers of patients) can provide high quality evidence of adverse effects associated with an intervention, thereby allowing us to infer a strong association from even a limited number of events. One should not confuse the situation highlighted in the previous example with isolated case reports of associations between exposures and rare adverse outcomes (as have, for instance, been reported with vaccine exposure).
Challenge 6: Challenges in using GRADEpro • “I find it challenging to use GRADEpro to grade the body of evidence for non-RCTs and unpooled data.” • Comments?
Challenge (6) • Response: • GRADEpro is updated for observational studies. • Unpooled data: headcount as last resort, can still make qualitative judgments as long as transparent (e.g. inconsistency, imprecision)
Challenge 7 Current grading schemes are not amenable to healthcare quality improvement studies because: • They may only distinguish between RCTs and “all other” types of studies. • They may not distinguish quality of studies within RCTs and other types of study designs. • They do not have a way to appropriately grade external validity, which is critically important in QI studies. • Comments?
Challenge 7 - Response • Concern: They may only distinguish between RCTs and “all other” types of studies. • Response: GRADE requires explicit judgments about the confidence in estimates of effects for any study design. Randomization is just one of the criteria, considered early in the process because it is the key method to protect against bias. • Concern: They may not distinguish quality of studies within RCTs and other types of study designs. • Response: GRADE’s explicit judgments do make this distinction. • Concern: They do not have a way to appropriately grade external validity, which is critically important in QI studies. • Response: Judgments about directness accomplish that (PICO), where P includes the setting.
Holger Schünemann, Chair and Professor, Department of Clinical Epidemiology and Biostatistics, McMaster University, schuneh@mcmaster.ca • Nancy Berkman, Senior Health Policy Research Analyst, Program on Healthcare Quality and Outcomes, berkman@rti.org