Studies of Diagnostic Tests. Thomas B. Newman, MD, MPH October 15, 2009. Reminders/Announcements. Door must be closed Write down answers to problems in the book and check your answers! Final exam to be passed out 12/3, reviewed 12/10 Send questions!. Overview.
Studies of Diagnostic Tests Thomas B. Newman, MD, MPH October 15, 2009
Reminders/Announcements • Door must be closed • Write down answers to problems in the book and check your answers! • Final exam to be passed out 12/3, reviewed 12/10 • Send questions!
Overview • Common biases of studies of diagnostic test accuracy • Prevalence, spectrum and nonindependence • Meta-analysis of diagnostic tests • Checklist & systematic approach • Examples: • Physical examination for presentation • Pain with percussion, hopping or cough for appendicitis • Pertussis • Predicting hyperbilirubinemia
Bias #1 Example • Study of BNP to diagnose congestive heart failure (CHF, Chapter 4, Problem 3)
Bias #1 Example • Gold standard: determination of CHF by two cardiologists blinded to BNP • Chest x-rays found to be highly predictive of CHF • Is there a problem with assessing accuracy of chest x-rays to diagnose CHF in this study? *Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.
Bias #1: Incorporation bias • Cardiologists not blinded to Chest X-ray • Probably used (incorporated) it to make final diagnosis • Incorporation bias for assessment of Chest X-ray (not BNP) • Biases both sensitivity and specificity upward
Bias #2 Example: • Visual assessment of jaundice in newborns • Study patients who are getting a bilirubin measurement • Ask clinicians to estimate extent of jaundice at time of blood draw
Visual Assessment of Jaundice*: Results • Sensitivity of jaundice below the nipple line for TSB ≥ 12 mg/dL = 97% • Specificity = 19% • What is the problem? • Editor’s Note: The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication. --Catherine D. DeAngelis, MD *Moyer et al., APAM 2000; 154:391
Bias #2: Verification bias • Inclusion criterion for study: gold standard test was done • in this case, blood test for bilirubin • Subjects with positive index tests are more likely to get the gold standard and to be included in the study • clinicians don’t order blood test for bilirubin if the jaundice is minimal • How does this affect sensitivity and specificity?
Bias #2: Verification Bias* Sensitivity, a/(a+c), is biased ___. Specificity, d/(b+d), is biased ___. *AKA Work-up, Referral Bias, or Ascertainment Bias
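One way to see the direction of verification bias is to verify the two rows of a 2x2 table at different rates. The sketch below uses the cell labels from this slide (a = test+ disease+, b = test+ disease-, c = test- disease+, d = test- disease-). The counts and verification fractions are hypothetical, chosen only for illustration, and are not taken from the jaundice study.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity) for a 2x2 table.

    a = test+ disease+, b = test+ disease-,
    c = test- disease+, d = test- disease-.
    """
    return a / (a + c), d / (b + d)


def apply_partial_verification(a, b, c, d, p_pos, p_neg):
    """Expected cell counts when only some subjects get the gold standard.

    p_pos: fraction of index-test-positive subjects verified (hypothetical).
    p_neg: fraction of index-test-negative subjects verified (hypothetical).
    """
    return a * p_pos, b * p_pos, c * p_neg, d * p_neg


# Hypothetical full population (not data from any study cited here).
full = (80, 100, 20, 800)
# Hypothetical: 90% of test-positives but only 10% of test-negatives verified.
observed = apply_partial_verification(*full, p_pos=0.9, p_neg=0.1)

true_sens, true_spec = sens_spec(*full)
obs_sens, obs_spec = sens_spec(*observed)

print(f"All subjects verified:  sensitivity={true_sens:.2f}, specificity={true_spec:.2f}")
print(f"Verified subjects only: sensitivity={obs_sens:.2f}, specificity={obs_spec:.2f}")
```

With these made-up inputs, sensitivity among verified subjects comes out higher and specificity lower than in the full population, the same pattern as the 97% and 19% reported for visual assessment of jaundice.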
Bias #3 • Example: PIOPED study of accuracy of V/Q scan to diagnose pulmonary embolism* • Study Population: All patients presenting to the ED who received a V/Q scan • Test: V/Q Scan • Disease: Pulmonary embolism (PE) • Gold Standards: • 1. Pulmonary arteriogram (PA-gram) if done (more likely with more abnormal V/Q scan) • 2. Clinical follow-up in other patients (more likely with normal V/Q scan) *PIOPED. JAMA 1990;263(20):2753-9.
Double Gold Standard Bias • Two different “gold standards” • One gold standard (e.g., surgery, invasive test) is more likely to be applied in patients with a positive index test • The other gold standard (e.g., clinical follow-up) is more likely to be applied in patients with a negative index test • There are some patients in whom the two gold standards do not give the same answer • spontaneously resolving disease • newly occurring disease
Double Gold Standard Bias: effect of spontaneously resolving cases • Double gold standard compared with follow-up for all: Sensitivity, a/(a+c), biased __; Specificity, d/(b+d), biased __ • Double gold standard compared with PA-gram for all: Sensitivity, a/(a+c), biased __; Specificity, d/(b+d), biased __
Double Gold Standard Bias: effect of newly occurring cases • Double gold standard compared with follow-up for all: Sensitivity, a/(a+c), biased __; Specificity, d/(b+d), biased __ • Double gold standard compared with PA-gram for all: Sensitivity, a/(a+c), biased __; Specificity, d/(b+d), biased __
Double Gold Standard Bias: Ultrasound diagnosis of intussusception
What if 10% of the 86 U/S- followed subjects actually had intussusceptions that resolved spontaneously?
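The "what if" can be worked through by moving the resolved cases from the true-negative cell (d) to the false-negative cell (c), which is how they would have been classified if everyone had received the invasive gold standard at the time of the ultrasound. In the sketch below, only the 86 followed subjects and the 10% figure come from the slide. The other cell counts are hypothetical placeholders, since the study's 2x2 table is not reproduced here.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity); a=TP, b=FP, c=FN, d=TN."""
    return a / (a + c), d / (b + d)


def reclassify_resolved(a, b, c, d, n_resolved):
    """Move n_resolved subjects from true negatives (d) to false negatives (c)."""
    if n_resolved > d:
        raise ValueError("cannot reclassify more subjects than there are true negatives")
    return a, b, c + n_resolved, d - n_resolved


FOLLOWED_NEGATIVES = 86   # from the slide: U/S-negative subjects who were followed
RESOLVED_FRACTION = 0.10  # from the slide: the "what if" assumption

# Hypothetical 2x2 counts (placeholders, NOT the study's data).
# d is assumed to include the 86 followed subjects.
a, b, c, d = 40, 5, 2, 100

n_resolved = round(FOLLOWED_NEGATIVES * RESOLVED_FRACTION)  # 8.6 rounds to 9
before = sens_spec(a, b, c, d)
after = sens_spec(*reclassify_resolved(a, b, c, d, n_resolved))

print(f"Subjects reclassified: {n_resolved}")
print(f"Double gold standard:   sensitivity={before[0]:.2f}, specificity={before[1]:.2f}")
print(f"Invasive test for all:  sensitivity={after[0]:.2f}, specificity={after[1]:.2f}")
```

Under these assumptions, most of the change shows up in sensitivity, because the reclassified subjects are few relative to the true negatives but numerous relative to the false negatives.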
Spectrum of Disease, Nondisease and Test Results • Disease is often easier to diagnose if severe • “Nondisease” is easier to diagnose if the patient is well than if the patient has other diseases • Test results will be more reproducible if ambiguous results are excluded
Spectrum Bias • Sensitivity depends on the spectrum of disease in the population being tested. • Specificity depends on the spectrum of non-disease in the population being tested. • Example: Absence of Nasal Bone (on 13-week ultrasound) as a Test for Chromosomal Abnormality
Spectrum Bias Example: Absence of Nasal Bone as a Test for Chromosomal Abnormality* Sensitivity = 229/333 = 69% BUT the D+ group only included fetuses with Trisomy 21 Cicero et al., Ultrasound Obstet Gynecol 2004;23: 218-23
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality • D+ group excluded 295 fetuses with other chromosomal abnormalities (esp. Trisomy 18) • Among these fetuses, sensitivity 32% (not 69%) • What decision is this test supposed to help with? • If it is whether to test chromosomes using chorionic villus sampling or amniocentesis, these 295 fetuses should be included!
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality, effect of including other trisomies in D+ group • Sensitivity = 324/628 = 52%, not the 69% obtained when the D+ group only included fetuses with Trisomy 21
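The three sensitivities quoted in these slides are mutually consistent, which can be checked from the counts given (229/333 for Trisomy 21 alone, 324/628 pooled). A small sketch of that arithmetic, using only numbers that appear in the slides; the variable names are illustrative:

```python
# Counts from the slides: nasal bone absent / total fetuses in each D+ group.
TRISOMY21_ABSENT, TRISOMY21_TOTAL = 229, 333
POOLED_ABSENT, POOLED_TOTAL = 324, 628

# The "other chromosomal abnormalities" group is the difference.
other_absent = POOLED_ABSENT - TRISOMY21_ABSENT  # 95
other_total = POOLED_TOTAL - TRISOMY21_TOTAL     # 295, matching the slide

sens_t21 = TRISOMY21_ABSENT / TRISOMY21_TOTAL
sens_other = other_absent / other_total
sens_pooled = POOLED_ABSENT / POOLED_TOTAL

print(f"Trisomy 21 only:       {sens_t21:.0%}")
print(f"Other abnormalities:   {sens_other:.0%}")
print(f"All abnormalities:     {sens_pooled:.0%}")
```

The pooled sensitivity is a weighted average of the two subgroup sensitivities, so the more the D+ group is weighted toward Trisomy 21, the better the test appears.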
Quiz: What if we considered the nasal bone absence as a test for Trisomy 21? • Then instead of excluding subjects with other chromosomal abnormalities or including them as D+, we should count them as D-. Compared with excluding them, • What would happen to sensitivity? • What would happen to specificity?
Prevalence, spectrum and nonindependence • Prevalence (prior probability) of disease may be related to disease severity • One mechanism is different spectra of disease or nondisease • Another is that whatever is causing the high prior probability is related to the same aspect of the disease as the test
Prevalence, spectrum and nonindependence • Examples • Iron deficiency • Diseases identified by screening • Urinalysis as a test for UTI in women with more and fewer symptoms (high and low prior probability)
Meta-analyses of Diagnostic Tests • Systematic and reproducible approach to finding studies • Summary of results of each study • Investigation into heterogeneity • Summary estimate of results, if appropriate • Unlike other meta-analyses (risk factors, treatments), results aren’t summarized with a single number (e.g., RR), but with two related numbers (sensitivity and specificity) • These can be plotted on an ROC plane
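To place studies on an ROC plane, each study's 2x2 table is reduced to one point: the false-positive rate (1 - specificity) on the x-axis and sensitivity on the y-axis. The sketch below shows only that conversion. The study names and counts are hypothetical, and no pooled summary estimate is attempted.

```python
# Hypothetical per-study counts as (TP, FP, FN, TN). These are NOT real data.
studies = {
    "Study A (hypothetical)": (45, 10, 5, 90),
    "Study B (hypothetical)": (30, 25, 20, 75),
    "Study C (hypothetical)": (60, 40, 15, 60),
}


def roc_point(tp, fp, fn, tn):
    """Return (false_positive_rate, sensitivity) for one study."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return 1 - specificity, sensitivity


points = {name: roc_point(*counts) for name, counts in studies.items()}

for name, (fpr, tpr) in points.items():
    print(f"{name}: x = 1 - specificity = {fpr:.2f}, y = sensitivity = {tpr:.2f}")
```

Points scattered widely across the plane are a visual cue for heterogeneity between studies, which is one reason a single summary number is often not appropriate.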
MRI for the diagnosis of MS Whiting et al. BMJ 2006;332:875-84
Studies of Diagnostic Test Accuracy: Checklist • Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis? • Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)? • Was the reference standard applied regardless of the diagnostic test result? • Was the test (or cluster of tests) validated in a second, independent group of patients? From Sackett et al., Evidence-based Medicine, 2nd ed. (NY: Churchill Livingstone), 2000. p 68
Systematic Approach • Authors and funding source • Research question • Study design • Study subjects • Predictor variable • Outcome variable • Results & Analysis • Conclusions
A clinical decision rule to identify children at low risk for appendicitis (Problem 5.6) • Study design: prospective cohort study • Subjects • Of 4140 patients 3-18 years presenting to Boston Children’s Hospital ED with CC abdominal pain • 767 (19%) received surgical consultation for possible appendicitis • 113 Excluded (Chronic diseases, recent imaging) • 53 missed • 601 included in the study (425 in derivation set) Kharbanda et al. Pediatrics 2005; 116(3): 709-16
A clinical decision rule to identify children at low risk for appendicitis • Predictor variable • Standardized assessment by PEM attending • Focus on “Pain with percussion, hopping or cough” (complete data in N=381) • Outcome variable: • Pathologic diagnosis of appendicitis for those who received surgery (37%) • Follow-up telephone call to family or pediatrician 2-4 weeks after the ED visit for those who did not receive surgery (63%) Kharbanda et al. Pediatrics 2005; 116(3): 709-16
A clinical decision rule to identify children at low risk for appendicitis • Results: Pain with percussion, hopping or cough • 78% sensitivity seems low to me. Is it valid for me in deciding whom to image? Kharbanda et al. Pediatrics 2005; 116(3): 709-16
Checklist • Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis? • Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)? • Was the reference standard applied regardless of the diagnostic test result? • Was the test (or cluster of tests) validated in a second, independent group of patients? From Sackett et al., Evidence-based Medicine, 2nd ed. (NY: Churchill Livingstone), 2000. p 68
Systematic approach • Study design: prospective cohort study • Subjects • Of 4140 patients 3-18 years presenting to Boston Children’s Hospital ED with CC abdominal pain • 767 (19%) received surgical consultation for possible appendicitis Kharbanda et al. Pediatrics 2005; 116(3): 709-16
A clinical decision rule to identify children at low risk for appendicitis • Predictor variable • “Pain with percussion, hopping or cough” (complete data in N=381) • Outcome variable: • Pathologic diagnosis of appendicitis for those who received surgery (37%) • Follow-up telephone call to family or pediatrician 2-4 weeks after the ED visit for those who did not receive surgery (63%) Kharbanda et al. Pediatrics 2005; 116(3): 709-16
Issues • Sample representative? • Verification bias? • Double-gold standard bias? • Spectrum bias?
For children presenting with abdominal pain to SFGH 6-M • Sensitivity probably valid (not falsely low) • But whether all of them tried to hop is not clear • Specificity probably low • PPV is high • NPV is low • Does not address surgical consultation decision
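Predictive values depend on the prior probability of disease, so PPV and NPV measured in children already referred for surgical consultation will not carry over to all children with abdominal pain. The sketch below applies Bayes' rule at two prior probabilities. The 78% sensitivity is from the slide above; the specificity and both prevalence values are hypothetical, picked only to show the direction of the effect.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity and prior probability."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


SENSITIVITY = 0.78  # from the slide
SPECIFICITY = 0.80  # hypothetical: not given in the slides

# Hypothetical prior probabilities for two settings.
ppv_referred, npv_referred = predictive_values(SENSITIVITY, SPECIFICITY, 0.35)
ppv_all, npv_all = predictive_values(SENSITIVITY, SPECIFICITY, 0.05)

print(f"Referred population (prior 35%): PPV={ppv_referred:.2f}, NPV={npv_referred:.2f}")
print(f"All abdominal pain (prior 5%):   PPV={ppv_all:.2f}, NPV={npv_all:.2f}")
```

With the same sensitivity and specificity, the higher-prevalence setting gives a higher PPV and a lower NPV.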
Does this coughing patient have pertussis? • RQ (for us): what are LR for coughing fits, whoop, and post-tussive vomiting in adults with persistent cough? • Design (for one study we reviewed*): Prospective cross-sectional study • Subjects: 217 adults ≥18 years with cough 7-21 days, no fever or other clear cause for cough enrolled by 80 French GPs. • In a subsample from 58 GPs, of 710 who met inclusion criteria only 99 (14%) enrolled *Gilberg S et al. J Inf Dis 2002;186:415-8
Pertussis diagnosis • Predictor variables: “GPs interviewed patients using a standardized questionnaire.” • Outcome variable: Evidence of pertussis based on • Culture (N=1) • PCR (N=36) • Or ≥ 2-fold change in anti-pertussis toxin IgG (N=40) • Total N = 70/217 with evidence of pertussis *Gilberg S et al. J Inf Dis 2002;186:415-8
Issues • Verification (selection) bias: only 14% of eligible subjects included • Questionable gold standard (internally inconsistent) • Nice illustration of difficulty doing a systematic review!
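The research question for the pertussis example asks for likelihood ratios. As a reminder of how those are computed and used, the sketch below derives LR+ and LR- from sensitivity and specificity and converts a prior probability to a posterior probability through odds. The prior of 70/217 is the proportion with evidence of pertussis in the study above. The sensitivity and specificity are hypothetical placeholders, not results from Gilberg et al.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a dichotomous test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def posterior_probability(prior, lr):
    """Convert prior probability to posterior probability using odds x LR."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * lr
    return post_odds / (1 + post_odds)


PRIOR = 70 / 217  # from the slide: proportion with evidence of pertussis

# Hypothetical test characteristics for a symptom (NOT study results).
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.20)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"Prior probability:            {PRIOR:.2f}")
print(f"Posterior if symptom present: {posterior_probability(PRIOR, lr_pos):.2f}")
print(f"Posterior if symptom absent:  {posterior_probability(PRIOR, lr_neg):.2f}")
```

An LR close to 1, as in this made-up example for a present symptom, barely moves the prior probability, which is why the size of the LR matters more than whether a finding is statistically associated with disease.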