Meta-analysis of Test Performance When There Is a “Gold Standard”

Prepared for: The Agency for Healthcare Research and Quality (AHRQ)
Training Modules for Medical Test Reviews Methods Guide
www.ahrq.gov

Learning Objectives
• Graphically display diagnostic test performance across multiple studies using a “gold standard” reference
• Explain the dependence of sensitivity and specificity over studies and thus the need for a multivariate (joint) analysis
• Describe choices for a meta-analysis to summarize test performance depending on whether the sensitivity and specificity estimates from multiple studies vary (or do not vary) widely
Trikalinos TA, Coleman CI, Griffith L, et al. Meta-analysis of test performance when there is a “gold standard.” In: Methods guide for medical test reviews. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

Background
• This module focuses on how to conduct a meta-analysis with a “gold standard” reference.
• Module 9 discusses how to conduct a meta-analysis when no gold standard reference exists.
• There are two goals for a meta-analysis in a systematic review:
  • Provide summary estimates for key quantities
  • Explain observed heterogeneity in the results of studies included in the review
• For systematic reviews of medical tests, a meta-analysis often focuses on synthesis of test performance data.

Important Terms
• Gold Standard: A reference standard that is considered adequate in defining the presence or absence of the condition of interest (disease).
• Diagnostic Test: This type of test is potentially less accurate than using the gold standard to ascertain disease.
• Data: The main focus is on tests with positive or negative results because of the use of a cut-off level (threshold); each study provides a 2 × 2 tabulation.

Measures Used To Assess Test Performance (1 of 2)
• Sensitivity = Proportion of test positives among people with the disease (true-positive rate)
• Specificity = Proportion of test negatives among healthy people (true-negative rate)
• Positive predictive value = Proportion with disease among people with test-positive results
• Negative predictive value = Proportion without disease among people with test-negative results
• The predictive values can be computed from sensitivity, specificity, and disease prevalence.
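
As a minimal sketch of the definitions above, the following computes the four measures from one study's 2 × 2 table. The function name `accuracy_measures` and the counts are hypothetical and chosen only for illustration; zero cells are not handled.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Compute basic test performance measures from one 2 x 2 table.

    tp, fn: people with disease who test positive / negative
    fp, tn: healthy people who test positive / negative
    Empty rows or columns are not handled (ZeroDivisionError).
    """
    return {
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "ppv": tp / (tp + fp),           # disease among test positives
        "npv": tn / (tn + fn),           # no disease among test negatives
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, for illustration only
measures = accuracy_measures(tp=90, fp=30, fn=10, tn=170)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Note that the predictive values computed this way reflect the prevalence in the study sample, which is why they do not transfer directly to settings with a different prevalence.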

Measures Used To Assess Test Performance (2 of 2)
• Positive likelihood ratio = sensitivity/(1 ‒ specificity) = proportion of test positives among diseased/proportion of test positives among healthy
• Negative likelihood ratio = (1 ‒ sensitivity)/specificity = proportion of test negatives among diseased/proportion of test negatives among healthy
• Diagnostic odds ratio = (true positives/false negatives)/(false positives/true negatives) = odds of a positive test with disease over odds of a positive test without disease
• Diagnostic odds ratios do not allow weighing of the true-positive and false-positive rates separately.
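
The formulas above can be sketched in a few lines. The function name and counts are hypothetical; the example also checks the identity that the diagnostic odds ratio equals the positive likelihood ratio divided by the negative likelihood ratio, which follows directly from the definitions.

```python
def likelihood_ratios_and_dor(tp, fp, fn, tn):
    """Return (PLR, NLR, DOR) from one 2 x 2 table.

    Zero cells are not handled; a continuity correction would be
    needed before using this on sparse tables.
    """
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    plr = se / (1 - sp)           # positive likelihood ratio
    nlr = (1 - se) / sp           # negative likelihood ratio
    dor = (tp / fn) / (fp / tn)   # diagnostic odds ratio
    return plr, nlr, dor


# Hypothetical counts, for illustration only
plr, nlr, dor = likelihood_ratios_and_dor(tp=90, fp=30, fn=10, tn=170)
print(f"PLR = {plr:.2f}, NLR = {nlr:.3f}, DOR = {dor:.1f}")
print(f"PLR / NLR = {plr / nlr:.1f}")
```

Because the diagnostic odds ratio collapses both likelihood ratios into one number, very different sensitivity/specificity pairs can share the same value, which is the limitation noted in the last bullet.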

Dependence of Sensitivity and Specificity Across Studies (1 of 2)
• Meta-analysis aims to provide a meaningful summary of sensitivity and specificity across studies.
• Within each study, sensitivity and specificity are independent because they are estimated from different patients (those with a disease or those who are healthy).
• Across studies, sensitivity and specificity are generally negatively correlated: as one increases, the other is expected to decrease.
• This negative correlation is most obvious with varying thresholds (known as “threshold effect”), varying time from onset of symptom to test, et cetera.
• Positive correlations are often due to a missing covariate.

Dependence of Sensitivity and Specificity Across Studies (2 of 2)
• This is an example with 11 studies using D-dimer tests to diagnose venous thromboembolism, showing that sensitivity increases as specificity decreases.
• Summarizing the two correlated quantities is a multivariate problem, and multivariate methods should be used to address it.
Becker DM, Philbrick JT, Bachhuber TL, et al. Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.
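
A quick way to see the across-study dependence is to compute the correlation between study-level sensitivity and specificity on the logit scale. This is only an exploratory sketch: the five (sensitivity, specificity) pairs below are hypothetical, not the D-dimer data, and a simple Pearson correlation ignores within-study sampling error, which the hierarchical models discussed later account for.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical study-level (sensitivity, specificity) pairs
studies = [(0.95, 0.40), (0.90, 0.55), (0.85, 0.65), (0.75, 0.80), (0.60, 0.92)]

r = pearson([logit(se) for se, _ in studies],
            [logit(sp) for _, sp in studies])
print(f"Correlation of logit(Se) and logit(Sp) across studies: {r:.2f}")
```

A clearly negative value is what a threshold effect would produce; a positive value would instead suggest looking for a missing covariate.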

Challenges
• How to quantitatively summarize medical test performance when:
  • The sensitivity and specificity estimates of various studies do not vary widely
    • Can use a summary point to obtain summary test performance if the tests have the same threshold
    • Summary point = a summary sensitivity and summary specificity pair
  • The sensitivity and specificity of multiple studies vary widely
    • Can use a summary line to describe the relationship between average sensitivity and average specificity
    • May be less important than variations in thresholds, reference standards, study designs, et cetera, between the studies

Principles for Addressing the Challenges
• Principle 1: Favor the most informative way to summarize the data.
  • Choose between a summary point and a summary line.
  • Use the summary point when sensitivity/specificity do not vary much.
  • Use the summary line when there are different thresholds for positive tests or estimates vary widely.
  • Both can also be used, since they convey complementary information.
  • The choice is subjective; there are no hard-and-fast rules.
• Principle 2: Explore the variability in study results with graphs and suitable analyses rather than relying exclusively on “grand means” (i.e., a single summary statistic).

Deciding Which Metrics To Meta-analyze (1 of 6)
• Problem:
  • Within a study, sensitivity, specificity, positive/negative predictive values, and prevalence are all interrelated via simple formulas.
  • Meta-analyzing each metric across studies will create summaries that are inconsistent with these formulas.
• Proposed solution:
  • Obtain summaries for sensitivities and specificities across studies via meta-analysis, then back-calculate the rest of the metrics (using the formulas) over a range of prevalence values.
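
The proposed back calculation can be sketched as follows. The summary sensitivity (0.90) and specificity (0.80) are hypothetical placeholders for values that would come from a meta-analysis; the predictive values use Bayes' theorem and the likelihood ratios use their standard definitions.

```python
def back_calculate(se, sp, prevalence):
    """Derive PPV, NPV, PLR, and NLR from sensitivity, specificity, prevalence."""
    ppv = se * prevalence / (se * prevalence + (1 - sp) * (1 - prevalence))
    npv = sp * (1 - prevalence) / (sp * (1 - prevalence) + (1 - se) * prevalence)
    return {
        "ppv": ppv,
        "npv": npv,
        "plr": se / (1 - sp),   # does not depend on prevalence
        "nlr": (1 - se) / sp,   # does not depend on prevalence
    }


# Hypothetical summary estimates, standing in for meta-analysis output
summary_se, summary_sp = 0.90, 0.80

for prev in (0.05, 0.10, 0.20, 0.30, 0.50):
    m = back_calculate(summary_se, summary_sp, prev)
    print(f"prevalence {prev:.2f}: PPV = {m['ppv']:.3f}, NPV = {m['npv']:.3f}")
```

Because every derived metric comes from the same sensitivity/specificity pair, the resulting summaries are consistent with one another by construction.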

Deciding Which Metrics To Meta-analyze (2 of 6)
Figure: A Visual Summary of Sensitivity/Specificity Across Studies With Back Calculation of the Other Metrics
NLR = negative likelihood ratio; NPV = negative predictive value; PLR = positive likelihood ratio; PPV = positive predictive value; Prev = prevalence; Se = sensitivity; Sp = specificity

Deciding Which Metrics To Meta-analyze (3 of 6)
• Why does it make sense to directly meta-analyze sensitivity and specificity?
  • It aligns well with our understanding of positivity threshold effects.
  • Sensitivity and specificity are often considered independent of prevalence.
  • Summary sensitivity and specificity obtained by direct meta-analysis will always be between 0 and 1.
• These two metrics are not as easily understood as predictive values and likelihood ratios, so back calculation of these other metrics is useful.

Deciding Which Metrics To Meta-analyze (4 of 6)
• Why does it not make sense to directly meta-analyze predictive values or prevalence?
  • Predictive values are dependent on prevalence.
  • Rarely is it meaningful to meta-analyze each value across studies.
  • Prevalence is often wide ranging.
  • Prevalence cannot be estimated from case-control studies (the main design of many medical test studies).
• It is better to back-calculate these values over a range of plausible prevalence values.

Deciding Which Metrics To Meta-analyze (5 of 6)
• Why can directly meta-analyzing positive and negative likelihood ratios be problematic?
  • Combining likelihood ratios across studies does not guarantee the summary values are “internally consistent.”
  • It is possible to obtain summary likelihood ratios that correspond to impossible summary sensitivities or specificities (i.e., values <0 or >1).
  • Back calculation avoids this.
• This is not a common case, however; often direct meta-analysis yields the same conclusions as back calculation.
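
To make “internally consistent” concrete, the sketch below inverts the likelihood ratio definitions to recover the sensitivity and specificity that a (PLR, NLR) pair implies. Solving PLR = Se/(1 ‒ Sp) and NLR = (1 ‒ Se)/Sp gives Sp = (PLR ‒ 1)/(PLR ‒ NLR) and Se = PLR(1 ‒ NLR)/(PLR ‒ NLR). The function names and the second pair of likelihood ratios are hypothetical; the pair is deliberately contrived to show an impossible result and is not taken from any study.

```python
def implied_se_sp(plr, nlr):
    """Sensitivity and specificity implied by a pair of likelihood ratios."""
    if plr == nlr:
        raise ValueError("PLR and NLR are equal; Se and Sp are not identified")
    sp = (plr - 1) / (plr - nlr)
    se = plr * (1 - nlr) / (plr - nlr)
    return se, sp


def is_consistent(plr, nlr):
    """True if the implied sensitivity and specificity both lie in [0, 1]."""
    se, sp = implied_se_sp(plr, nlr)
    return 0 <= se <= 1 and 0 <= sp <= 1


# A pair derived from Se = 0.90, Sp = 0.80 round-trips correctly
print(implied_se_sp(4.5, 0.125))

# A contrived "summary" pair that no real test could produce
print(implied_se_sp(2.0, 1.5), is_consistent(2.0, 1.5))
```

Back-calculating likelihood ratios from a summary sensitivity and specificity cannot produce such a pair, because both ratios are then functions of the same two proportions.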

Deciding Which Metrics To Meta-analyze (6 of 6)
• Directly analyzing diagnostic odds ratios
  • Is straightforward and follows standard meta-analytic methods
• Characteristics of the diagnostic odds ratio
  • Closely linked to sensitivity, specificity, and likelihood ratios
  • Can easily be included in meta-regression models for analysis of heterogeneity between studies
• Disadvantages
  • Challenging to interpret
  • Impossible to weigh the true-positive rate and the false-positive rate separately

Desired Characteristics of Meta-analytic Methods
• Meta-analytic methods should:
  • Respect the multivariate nature of test performance metrics (i.e., sensitivity and specificity)
  • Allow for nonindependence between sensitivity and specificity across studies (“threshold effect”)
  • Allow for between-study heterogeneity (i.e., variability not explained by the statistical distribution of the data in each study)
• The most theoretically motivated approaches are based on multivariate methods (hierarchical modeling).

Preferred Methods for Obtaining a Summary Point (1 of 4)
• Multivariate meta-analysis of sensitivity and specificity (i.e., joint analysis of both) should be performed, rather than separate univariate meta-analyses.
• It requires hierarchical modeling.
  • Bivariate model
  • Hierarchical summary receiver operator characteristic model
• Both families of models use two levels to model data.
  • 1st level = within-study variability, from 2 × 2 table counts
  • 2nd level = between-study variability (i.e., heterogeneity), allowing for nonindependence of sensitivity and specificity across studies
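
The two levels can be written out for the bivariate model as a sketch. The notation is an assumption introduced here, not taken from the module: study i has n_{1i} people with disease and n_{0i} without, and the logit is the commonly used transformation. The summary point is the pair obtained by back-transforming the two means.

```latex
% Level 1 (within study i): binomial variability of the 2 x 2 counts
\mathrm{TP}_i \sim \mathrm{Binomial}(n_{1i},\, \mathrm{Se}_i), \qquad
\mathrm{TN}_i \sim \mathrm{Binomial}(n_{0i},\, \mathrm{Sp}_i)

% Level 2 (between studies): correlated random effects on the logit scale
\begin{pmatrix} \operatorname{logit} \mathrm{Se}_i \\ \operatorname{logit} \mathrm{Sp}_i \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
  \begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}
\right)

% Summary point
\left( \operatorname{logit}^{-1}(\mu_A),\; \operatorname{logit}^{-1}(\mu_B) \right)
```

The covariance term σ_AB is what lets the model capture nonindependence of sensitivity and specificity across studies; setting it to zero would reduce the model to two separate univariate random-effects analyses.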

Preferred Methods for Obtaining a Summary Point (2 of 4)
• Model families differ in the parameters used for between-study variability in the 2nd level.
  • The bivariate model uses parameters that are transformations of the average sensitivity and specificity.
  • The hierarchical summary receiver operator characteristic (HSROC) model uses a scale parameter and an accuracy parameter.
    • Both are functions of the sensitivity and specificity.
    • They also define an underlying HSROC curve.
• Both models are mathematically the same in the absence of covariates.
• Both models assume a normal distribution of parameters, which can be difficult to satisfy.

Preferred Methods for Obtaining a Summary Point (3 of 4)
• Researchers need to choose between the bivariate and the hierarchical summary receiver operator characteristic (HSROC) models when covariates are present (i.e., meta-regression analysis). For example:
  • The bivariate model is more appropriate when there is variation in disease severity.
    • This affects sensitivity but not specificity.
    • The bivariate model allows direct evaluation of the difference in sensitivity and/or specificity.
  • The HSROC model is more effective when “spectrum effects” (the subjects in a study do not represent the patients who will receive the test in practice) are present.
    • This is more likely to affect test accuracy rather than threshold.
    • The HSROC model allows direct evaluation of the difference in accuracy and/or threshold parameters.

Preferred Methods for Obtaining a Summary Point (4 of 4)
Methods Commonly Used To Calculate a Summary Point

Preferred Methods for Obtaining a Summary Line (1 of 2)
• Hierarchical modeling is recommended.
• Hierarchical summary lines can be calculated from bivariate random-effects model parameters.
• A range of hierarchical summary receiver operator characteristic (HSROC) lines can be calculated from fitted bivariate model parameters.
  • An example is the Rutter-Gatsonis HSROC model.
  • Represent alternative characterizations of the bivariate distribution of sensitivity and specificity
  • Show how the summary sensitivity changes with the summary specificity
Rutter CM, Gatsonis CA. Acad Radiol 1995 Mar;2 Suppl 1:S48-56; discussion S65-7, S70-1 pas. PMID: 9419705.

Preferred Methods for Obtaining a Summary Line (2 of 2)
Methods Commonly Used To Calculate a Summary Line
Littenberg B, Moses LE. Med Decis Making 1993 Oct-Dec;13(4):313-21. PMID: 8246704.
Rutter CM, Gatsonis CA. Acad Radiol 1995 Mar;2 Suppl 1:S48-56; discussion S65-7, S70-1 pas. PMID: 9419705.
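
For orientation only, here is a sketch of the simplest summary-line approach, the linear regression method associated with the Littenberg and Moses reference cited on this slide. It is not a hierarchical model and is shown only to illustrate what a summary line is; the module recommends hierarchical modeling. With D = logit(TPR) ‒ logit(FPR) and S = logit(TPR) + logit(FPR), an unweighted least-squares fit D = a + bS implies the line logit(TPR) = a/(1 ‒ b) + [(1 + b)/(1 ‒ b)] logit(FPR). Function names and data are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def fit_summary_line(points):
    """Unweighted least-squares fit of D = a + b*S.

    points: list of (tpr, fpr) pairs, all strictly between 0 and 1.
    Returns the intercept a and slope b.
    """
    d = [logit(tpr) - logit(fpr) for tpr, fpr in points]
    s = [logit(tpr) + logit(fpr) for tpr, fpr in points]
    n = len(points)
    mean_s, mean_d = sum(s) / n, sum(d) / n
    sxx = sum((x - mean_s) ** 2 for x in s)
    b = sum((x - mean_s) * (y - mean_d) for x, y in zip(s, d)) / sxx
    a = mean_d - b * mean_s
    return a, b


def summary_line_tpr(fpr, a, b):
    """Sensitivity on the fitted summary line at a given false-positive rate."""
    return expit(a / (1 - b) + (1 + b) / (1 - b) * logit(fpr))


# Hypothetical (sensitivity, 1 - specificity) pairs
points = [(0.95, 0.60), (0.90, 0.45), (0.85, 0.35), (0.75, 0.20), (0.60, 0.08)]
a, b = fit_summary_line(points)
for fpr in (0.1, 0.3, 0.5):
    print(f"FPR {fpr:.1f} -> summary sensitivity {summary_line_tpr(fpr, a, b):.3f}")
```

This approach ignores within-study sampling error and between-study heterogeneity, which is the main reason hierarchical summary lines are preferred.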

Special Case: Joint Analysis of Sensitivity and Specificity With Multiple Thresholds (1 of 2)
• It is not uncommon for studies to report multiple sensitivity/specificity pairs at several thresholds for positive tests.
• Option 1: Decide on one threshold from each study (e.g., the threshold with the highest sensitivity)
• Option 2: Use all thresholds
  • An extension of the hierarchical summary receiver operator characteristic model has been developed for this purpose.
  • A method combining whole receiver operator characteristic (ROC) curves can also be used.
• It is recommended that data be explored graphically in ROC space to highlight similarities and differences among the studies.

Special Case: Joint Analysis of Sensitivity and Specificity With Multiple Thresholds (2 of 2)
• This is a typical receiver operator characteristic (ROC) graph for four hypothetical studies. Studies in the left shaded area have an LR+ ≥ 10. Studies in the top shaded area have an LR- ≤ 0.1. Those in the intersection have both.
• This is an example of an ROC graph for studies with different thresholds for total serum bilirubin. Points on the line for each study represent sensitivity/specificity pairs at different threshold values.
Trikalinos TA, Chung M, Lau J, et al. Pediatrics 2009 Oct;124(4):1162-71. PMID: 19786450.
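
The shaded regions described above follow directly from the likelihood ratio definitions: LR+ ≥ 10 is equivalent to sensitivity ≥ 10 × (1 ‒ specificity), and LR- ≤ 0.1 is equivalent to 1 ‒ sensitivity ≤ 0.1 × specificity. The sketch below classifies a point in ROC space accordingly; the function name, region labels, and example points are hypothetical.

```python
def roc_region(se, sp, plr_cut=10.0, nlr_cut=0.1):
    """Classify a (sensitivity, specificity) point by likelihood ratio region.

    The inequalities are written without division so that
    specificity values of exactly 0 or 1 do not raise an error.
    """
    high_plr = se >= plr_cut * (1 - sp)     # LR+ >= 10 (left region)
    low_nlr = (1 - se) <= nlr_cut * sp      # LR- <= 0.1 (top region)
    if high_plr and low_nlr:
        return "both"
    if high_plr:
        return "LR+ >= 10"
    if low_nlr:
        return "LR- <= 0.1"
    return "neither"


# Hypothetical points in ROC space
for se, sp in [(0.50, 0.99), (0.99, 0.50), (0.95, 0.95), (0.70, 0.70)]:
    print(f"Se = {se:.2f}, Sp = {sp:.2f}: {roc_region(se, sp)}")
```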

Recommended Algorithm
• A three-step algorithm is recommended for meta-analyzing studies with a gold standard reference:
  • Start by considering sensitivity and specificity separately.
  • Perform a multivariate meta-analysis (when each study reports a single threshold).
  • Explore between-study heterogeneity.

Step 1: Start by Considering Sensitivity and Specificity Separately (1 of 2)
• Reviewers should familiarize themselves with the pattern of study-level sensitivities and specificities.
• Use graphical displays.
  • Forest plots of study sensitivities and specificities with their confidence intervals give a visual impression of the variability of sensitivity and specificity across studies.
  • A plot of sensitivity (vertical axis) versus 1 – specificity (horizontal axis) gives a visual impression of the relationship between sensitivity and specificity across studies. These plots are also known as receiver operator characteristic graphs.
  • A shoulder-and-arm pattern is present when there is a threshold effect.
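
The numbers behind both displays can be tabulated before any plotting. The sketch below computes, for each study, the sensitivity and specificity with 95-percent confidence intervals (the rows of a forest plot) and the ROC coordinates. The Wilson score interval is an assumption made here; the module does not say which interval method to use. Study labels and counts are hypothetical.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = events / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Hypothetical studies: (label, TP, FP, FN, TN)
studies = [
    ("Study A", 45, 30, 5, 70),
    ("Study B", 38, 18, 12, 82),
    ("Study C", 30, 8, 20, 92),
]

rows = []
for label, tp, fp, fn, tn in studies:
    se, sp = tp / (tp + fn), tn / (tn + fp)
    se_lo, se_hi = wilson_interval(tp, tp + fn)
    sp_lo, sp_hi = wilson_interval(tn, tn + fp)
    # ROC coordinates: x = 1 - specificity, y = sensitivity
    rows.append((label, se, se_lo, se_hi, sp, sp_lo, sp_hi, 1 - sp, se))
    print(f"{label}: Se {se:.2f} ({se_lo:.2f}, {se_hi:.2f}); "
          f"Sp {sp:.2f} ({sp_lo:.2f}, {sp_hi:.2f}); "
          f"ROC point ({1 - sp:.2f}, {se:.2f})")
```

In this made-up data, sensitivity falls as specificity rises from Study A to Study C, the kind of pattern that would trace a shoulder-and-arm shape in ROC space.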

Step 1: Start by Considering Sensitivity and Specificity Separately (2 of 2)
• Examples of forest plots
• An example of a receiver operator characteristic graph with the shoulder-and-arm pattern
• Increasing the threshold decreases sensitivity but increases specificity.
Becker DM, Philbrick JT, Bachhuber TL, et al. Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.

Step 2: Multivariate Meta-analysis (When Each Study Reports a Single Threshold)
• Obtain a 2-dimensional summary point (sensitivity, specificity) using the bivariate model of meta-analysis, preferably with binomial errors.
• Obtain summary lines based on multivariate meta-analytic models.
• Interpretation of a summary line is not “automatically” that of “threshold effects,” especially if there is a positive correlation between sensitivity and specificity across studies.
• If more than one threshold is reported per study, consider incorporating all of them in the analysis both qualitatively (via graphs) and quantitatively (via proper methods).

Step 3: Explore Between-Study Heterogeneity
• The hierarchical summary receiver operator characteristic (HSROC) model allows direct evaluation of heterogeneity in accuracy and threshold parameters.
• Bivariate models allow direct evaluation of sensitivity and specificity.
• Added covariates that reduce variability across studies may need to be taken into account when summarizing the studies.
• Some common sources of heterogeneity:
  • Patient population/selection
  • Methods to verify/interpret results
  • Clinical setting
  • Disease severity

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (1 of 4)
Figure: Forest Plots of Sensitivity, Specificity, and Likelihood Ratios
• D-dimers are fragments specific to fibrin degradation.
• They are measured by using an enzyme-linked immunosorbent assay (ELISA) to diagnose venous thromboembolism.
Becker DM, Philbrick JT, Bachhuber TL, et al. Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (2 of 4)
• Forest plots show more heterogeneity in sensitivity/specificity than in likelihood ratios.
  • Verified by formal heterogeneity testing
  • May be a threshold effect
• Because of the variety of thresholds used across studies, it is more informative to summarize test performance with a hierarchical summary receiver operator characteristic plot than by summarizing sensitivities and specificities.

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (3 of 4)
Figure: HSROC Plot of D-Dimer Tests Using the Highest Thresholds
• The shoulder-and-arm pattern indicates the threshold effect.
• The location of points in the upper shaded area of the receiver operator characteristic space indicates high sensitivity and low specificity.
• The test minimizes false-negative results and is good for ruling out disease.
Lijmer JG, Bossuyt PM, Heisterkamp SH. Stat Med 2002 Jun 15;21(11):1525-37. PMID: 12111918.

Example 1: D-Dimer Assays for Diagnosing Venous Thromboembolism (4 of 4)
Figure: Calculated Negative Predictive Values for the D-Dimer Test With the Prevalence of Venous Thromboembolism Between 5 and 50 Percent
• It is informative to give a summary of the negative and positive predictive values for this test.
• Calculate over a range of prevalence values using the summary sensitivity and specificity values.
• A consistently high negative predictive value line means that a high percentage of people who test negative actually are negative for the disease.
Lijmer JG, Bossuyt PM, Heisterkamp SH. Stat Med 2002 Jun 15;21(11):1525-37. PMID: 12111918.

Example 2: Serial Measurements of the Creatine Kinase-Myocardial Band To Diagnose Acute Cardiac Ischemia (1 of 3)
• Serial measurements of the creatine kinase-myocardial band (CK-MB) are used to diagnose acute cardiac ischemia in the emergency room.
• Blood levels of CK-MB increase over time from symptom onset.
• 14 studies performed CK-MB testing at varying times after symptom onset.
• There was evident heterogeneity in sensitivity that was not attributable to the threshold effect.
• The sensitivity of the test increased as the time from symptom onset increased.
• The difference in sensitivity may be attributable to time; to test this possibility, a bivariate meta-analytic model was used.

Example 2: Serial Measurements of the Creatine Kinase-Myocardial Band To Diagnose Acute Cardiac Ischemia (2 of 3)
• Sensitivity increases with longer hours from symptom onset to the last measurement of the creatine kinase-myocardial band.
Figure panels: Actual Hours; 95-Percent Confidence Regions
• Actual Hours: actual hours are indicated next to the points; circles ≤ 3 hours; Xs > 3 hours
• 95-Percent Confidence Regions: dashed lines = 95-percent confidence regions; blue ≤ 3 hours; red > 3 hours
Balk EM, Ioannidis JP, Salem D, et al. Ann Emerg Med 2001 May;37(5):478-94. PMID: 11326184.
Lau J, Ioannidis JP, Balk E, et al. Evid Rep Technol Assess (Summ) 2000 Sep;(26):1-4. PMID: 11079073.

Example 2: Serial Measurements of the Creatine Kinase-Myocardial Band To Diagnose Acute Cardiac Ischemia (3 of 3)
• The hierarchical summary receiver operator characteristic (HSROC) model (bivariate meta-regression) was used to compare summary sensitivity and specificity with a binary variable to account for timing of the last serial creatine kinase-myocardial band measurement (fixed-effects binary covariate).
• Note that properly specified bivariate/HSROC meta-regressions can be used to compare two or more index tests.
Balk EM, Ioannidis JP, Salem D, et al. Ann Emerg Med 2001 May;37(5):478-94. PMID: 11326184.
Lau J, Ioannidis JP, Balk E, et al. Evid Rep Technol Assess (Summ) 2000 Sep;(26):1-4. PMID: 11079073.

Overall Recommendations (1 of 3)
• Use the bivariate random-effects meta-analytic models to obtain a summary sensitivity and specificity.
• Back-calculate the overall positive and negative predictive values (over a range of prevalence values) from summary estimates of sensitivity and specificity, rather than meta-analyzing them directly.
• Back-calculate overall positive and negative likelihood ratios from summary estimates of sensitivity and specificity, rather than meta-analyzing them directly.

Overall Recommendations (2 of 3)
• To obtain a summary line, use multivariate meta-analysis methods such as the hierarchical summary receiver operator characteristic (HSROC) model.
• Several summary lines can be obtained based on multivariate meta-analytic models.
  • They can differ when the estimated correlation between sensitivity and specificity is positive and when there is little between-study variability.
• If there is evidence of a positive correlation, the variability in the studies cannot be attributed to a threshold effect.
  • Explore for missing important covariates.

Overall Recommendations (3 of 3)
• If more than one threshold is reported per study, this must be taken into account in the quantitative analyses.
• Qualitative analysis with graphs and quantitative analyses with proper methods are encouraged.
• Explore the impact of study characteristics on summary results using meta-regression–based analyses or subgroup analyses in the context of the primary methodology used to summarize the studies.

Practice Question 1 (1 of 2)
• Within individual studies of a systematic review, sensitivity and specificity are independent variables.
  a. True
  b. False

Practice Question 1 (2 of 2) Explanation for Question 1: This statement is true. Sensitivity and specificity within each study are independent because they are estimated from different patients. Across studies they typically are negatively correlated.

Practice Question 2 (1 of 2)
• Why does this module recommend directly meta-analyzing sensitivity and specificity?
  a. Sensitivity and specificity are dependent on the prevalence of the condition under study.
  b. Other predictive values and likelihood ratios can be back-calculated for a range of prevalence values by using known formulas.
  c. Summary sensitivity and specificity obtained by direct meta-analysis will always be greater than 1.
  d. Interpretation of sensitivity and specificity is very intuitive.

Practice Question 2 (2 of 2)
Explanation for Question 2: The correct answer is b. Once the summary sensitivity and specificity are calculated by meta-analysis, there are formulas that allow the back calculation of overall predictive values and likelihood ratios. Likelihood ratios and predictive values are more easily interpreted by the reader of the review. Sensitivity and specificity are often considered to be independent of prevalence because they do not depend on it mathematically, and they will always be between 0 and 1.

Practice Question 3 (1 of 2)
• What is the preferred method for obtaining a summary sensitivity and specificity in a meta-analysis?
  a. Multivariate meta-analysis
  b. Separate univariate meta-analyses
  c. Using a summary line
  d. The Kester and Buntinx variant

Practice Question 3 (2 of 2)
Explanation for Question 3: The correct answer is a. A multivariate meta-analysis of sensitivity and specificity is the recommended method for obtaining a summary point (summary sensitivity and specificity). This is a joint analysis of both quantities instead of separate univariate meta-analyses. Obtaining a summary line is an alternative to calculating a summary point. The Kester and Buntinx method is used to analyze sensitivity and specificity pairs when there are several thresholds for positive tests.

Practice Question 4 (1 of 2)
• In which situation would a summary line be more helpful in summarizing medical test performance?
  a. Sensitivity and specificity estimates of various studies do not vary widely.
  b. Sensitivity and specificity of various studies vary over a large range.

Practice Question 4 (2 of 2) Explanation for Question 4: The correct answer is b. Both a summary point and a summary line are informative and are useful in synthesizing data. There are no strict rules to follow in deciding which to use. A summary line may be more helpful as a summary of test performance when the sensitivity and specificity estimates of various studies vary over a large range.

Authors
• This presentation was prepared by Brooke Heidenfelder, Andrzej Kosinski, Rachael Posey, Lorraine Sease, Remy Coeytaux, Gillian Sanders, and Alex Vaz, of the Duke University Evidence-based Practice Center.
• The module is based on Trikalinos TA, Coleman CI, Griffith L, et al. Meta-analysis of test performance when there is a “gold standard.” In: Chang SM and Matchar DB, eds. Methods guide for medical test reviews. Rockville, MD: Agency for Healthcare Research and Quality; June 2012. p. 8.1-21. AHRQ Publication No. 12-EHC017. Available at www.effectivehealthcare.ahrq.gov/medtestsguide.cfm.

References (1 of 9)
• Arends LR, Hamza TH, van Houwelingen JC, et al. Bivariate random effects meta-analysis of ROC curves. Med Decis Making 2008 Sep-Oct;28(5):621-38. PMID: 18591542.
• Balk EM, Ioannidis JP, Salem D, et al. Accuracy of biomarkers to diagnose acute cardiac ischemia in the emergency department: a meta-analysis. Ann Emerg Med 2001 May;37(5):478-94. PMID: 11326184.
• Becker DM, Philbrick JT, Bachhuber TL, et al. D-dimer testing and acute venous thromboembolism. A shortcut to accurate diagnosis? Arch Intern Med 1996 May 13;156(9):939-46. PMID: 8624174.
• Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003 Jan 7;138(1):W1-12. PMID: 12513067.
