Studies of Diagnostic Tests Thomas B. Newman, MD, MPH October 14, 2010
Reminders/Announcements • Write down answers to as many of the problems in the book as you can and check your answers! • Final exam to be passed out 12/2, reviewed 12/9 • Send questions!
Overview • Common biases of studies of diagnostic test accuracy • Prevalence, spectrum and nonindependence • Meta-analysis of diagnostic tests • Checklist & systematic approach • Examples: • Pain with percussion, hopping or cough for appendicitis • Pertussis
Bias #1 Example • Study of BNP to diagnose congestive heart failure (CHF, Chapter 4, Problem 3)
Bias #1 Example • Gold standard: determination of CHF by two cardiologists blinded to BNP • “The best clinical predictor of congestive heart failure was an increased heart size on chest roentgenogram (accuracy, 81 percent)” • Is there a problem with assessing accuracy of chest x-rays to diagnose CHF in this study? *Maisel AS, Krishnaswamy P, Nowak RM, McCord J, Hollander JE, Duc P, et al. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med 2002;347(3):161-7.
Bias #1: Incorporation bias • Cardiologists not blinded to chest X-ray • Probably used (incorporated) it to make final diagnosis • Incorporation bias for assessment of Chest X-ray (not BNP) • Biases both sensitivity and specificity upward
Bias #2 Example: • Visual assessment of jaundice in newborns • Study patients who are getting a bilirubin measurement • Ask clinicians to estimate extent of jaundice at time of blood draw
Visual Assessment of jaundice*: Results • Sensitivity of jaundice below the nipple line for bilirubin ≥ 12 mg/dL = 97% • Specificity = 19% • What is the problem? • Editor’s Note: “The take-home message for me is that no jaundice below the nipple line equals no bilirubin test, unless there’s some other indication.” --Catherine D. DeAngelis, MD *Moyer et al., APAM 2000; 154:391
Bias #2: Verification Bias* -1 • Inclusion criterion for study: gold standard test was done • in this case, blood test for bilirubin • Subjects with positive index tests are more likely to get the gold standard and to be included in the study • clinicians usually don’t order blood test for bilirubin if there is little or no jaundice • How does this affect sensitivity and specificity? *AKA Work-up, Referral Bias, or Ascertainment Bias
Bias #2: Verification Bias Sensitivity, a/(a+c), is biased ___. Specificity, d/(b+d), is biased ___. But is sensitivity what we really want to know to support Cathy’s conclusion? *AKA Work-up, Referral Bias, or Ascertainment Bias
Bias #2: Verification Bias • Negative predictive value was 94%. Is it biased? • The “Test negative” group (no jaundice) that still gets the gold standard may have other risk factors or indications • Therefore, c may be too high relative to d and NPV may be underestimated
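The mechanism on the verification bias slides can be checked with a small worked example. The sketch below is illustrative only: the cohort counts and the verification fractions are hypothetical and are not taken from the jaundice study. It compares accuracy in a full cohort with accuracy among only those who get the gold standard, first when verification depends on the index test alone, then when test-negative subjects with disease are also more likely to be verified (for example, because of other indications).

```python
def accuracy(a, b, c, d):
    """Cells follow the slides: a=T+D+, b=T+D-, c=T-D+, d=T-D-."""
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "npv": d / (c + d),
    }

# Hypothetical cohort if every subject had received the gold standard.
full = {"a": 80, "b": 200, "c": 20, "d": 700}

def verified(table, p_pos, p_neg_diseased, p_neg_healthy):
    """Keep only the (assumed) fraction of each cell that gets the gold standard."""
    return {
        "a": table["a"] * p_pos,
        "b": table["b"] * p_pos,
        "c": table["c"] * p_neg_diseased,
        "d": table["d"] * p_neg_healthy,
    }

truth = accuracy(**full)
# Verification depends only on the index test result (assumed 90% vs 10%).
by_test_only = accuracy(**verified(full, 0.9, 0.1, 0.1))
# Test-negatives with disease are verified more often (assumed 30% vs 10%).
with_other_indications = accuracy(**verified(full, 0.9, 0.3, 0.1))

for label, result in [
    ("full cohort", truth),
    ("verified, by test only", by_test_only),
    ("verified, other indications", with_other_indications),
]:
    print(label, {k: round(v, 3) for k, v in result.items()})
```

Under these assumed numbers, sensitivity rises and specificity falls among the verified subjects. NPV matches the full cohort when verification depends only on the test result, and drops once test-negative subjects with disease are preferentially verified.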
Bias #3 • Example: PIOPED study of accuracy of ventilation/perfusion (V/Q) scan to diagnose pulmonary embolus* • Study Population: All patients presenting to the ED who received a V/Q scan • Test: V/Q Scan • Disease: Pulmonary embolism (PE) • Gold Standards: • 1. Pulmonary arteriogram (PA-gram) if done (more likely with more abnormal V/Q scan) • 2. Clinical follow-up in other patients (more likely with normal V/Q scan) *PIOPED. JAMA 1990;263(20):2753-9.
Double Gold Standard Bias • Two different “gold standards” • One gold standard (usually an immediate, more invasive test, e.g., angiogram, surgery) is more likely to be applied in patients with positive index test • Second gold standard (e.g., clinical follow-up) is more likely to be applied in patients with a negative index test.
Double Gold Standard Bias • There are some patients in whom the two “gold standards” do not give the same answer • Spontaneously resolving disease (positive with immediate invasive test, but not with follow-up) • Newly occurring or newly detectable disease (positive with follow-up but not with immediate invasive test)
Effect of Double Gold Standard Bias 1: Spontaneously resolving disease • Test result will always agree with gold standard • Both sensitivity and specificity increase • Example: Joe has a small pulmonary embolus (PE) that will resolve spontaneously. • If his VQ scan is positive, he will get an angiogram that shows the PE (true positive) • If his VQ scan is negative, his PE will resolve and we will think he never had one (true negative) • VQ scan can’t be wrong!
Effect of Double Gold Standard Bias 2: Newly occurring or newly detectable disease • Test result will always disagree with gold standard • Both sensitivity and specificity decrease • Example: Jane has a nasty breast cancer that is currently undetectable • If her mammogram is positive, she will get biopsies that will not find the tumor (mammogram will look falsely positive) • If her mammogram is negative, she will return in several months and we will think the tumor was initially missed (mammogram will look falsely negative) • Mammogram can’t be right!
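The Joe and Jane examples can be turned into 2x2 arithmetic. The following is a minimal sketch with hypothetical counts (none come from PIOPED or any other study). It starts from patients in whom both gold standards agree, then adds patients in whom they disagree and counts them the way a double gold standard design would.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity); a=T+D+, b=T+D-, c=T-D+, d=T-D-."""
    return a / (a + c), d / (b + d)

# Hypothetical patients for whom both gold standards give the same answer.
a, b, c, d = 40, 10, 10, 140
# Hypothetical extra patients, per index-test result, for whom they disagree.
extra_pos, extra_neg = 10, 10

stable_only = sens_spec(a, b, c, d)

# Spontaneously resolving disease (Joe):
# T+ -> immediate invasive test finds disease (counted in a);
# T- -> follow-up finds nothing (counted in d).
resolving = sens_spec(a + extra_pos, b, c, d + extra_neg)

# Newly occurring or newly detectable disease (Jane):
# T+ -> immediate invasive test finds nothing (counted in b);
# T- -> follow-up finds disease (counted in c).
newly_occurring = sens_spec(a, b + extra_pos, c + extra_neg, d)

for label, (sens, spec) in [
    ("stable cases only", stable_only),
    ("plus resolving disease", resolving),
    ("plus newly occurring disease", newly_occurring),
]:
    print(f"{label}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

With these assumed counts, the resolving cases push both sensitivity and specificity up and the newly occurring cases push both down, matching the direction stated on the slides.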
Spectrum of Disease, Nondisease and Test Results • Disease is often easier to diagnose if severe • “Nondisease” is easier to diagnose if patient is well than if the patient has other diseases • Test results will be more reproducible if ambiguous results are excluded
Spectrum Bias • Sensitivity depends on the spectrum of disease in the population being tested. • Specificity depends on the spectrum of non-disease in the population being tested. • Example: Absence of Nasal Bone (on 13-week ultrasound) as a Test for Chromosomal Abnormality
Spectrum Bias Example: Absence of Nasal Bone as a Test for Chromosomal Abnormality* • Sensitivity = 229/333 = 69% • BUT the D+ group only included fetuses with Trisomy 21 *Cicero et al., Ultrasound Obstet Gynecol 2004;23:218-23
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality • The D+ group excluded 295 fetuses with other chromosomal abnormalities (mainly Trisomy 18) • Among these fetuses, the sensitivity of nasal bone absence was 32% (not 69%) • What decision is this test supposed to help with? • If it is whether to test chromosomes using chorionic villus sampling or amniocentesis, these 295 fetuses should be included!
Spectrum Bias: Absence of Nasal Bone as a Test for Chromosomal Abnormality, effect of including other trisomies in D+ group • Sensitivity = 324/628 = 52%, NOT the 69% obtained when the D+ group only included fetuses with Trisomy 21
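The three sensitivities quoted for the nasal bone example are consistent with one another, which a few lines of arithmetic can confirm. The counts below are the ones on the slides; the count of test-positive fetuses with other abnormalities is not stated directly, so it is derived here as the difference between the two numerators.

```python
# Counts quoted on the slides.
t21_pos, t21_total = 229, 333        # Trisomy 21 only
pooled_pos, pooled_total = 324, 628  # Trisomy 21 plus other abnormalities

# Derived: fetuses with other chromosomal abnormalities.
other_total = pooled_total - t21_total  # the 295 excluded fetuses
other_pos = pooled_pos - t21_pos        # nasal bone absent among them

sens_t21 = t21_pos / t21_total
sens_other = other_pos / other_total
sens_pooled = pooled_pos / pooled_total

print(f"Trisomy 21 only:     {sens_t21:.0%}")
print(f"Other abnormalities: {sens_other:.0%}")
print(f"All abnormalities:   {sens_pooled:.0%}")
```

The pooled sensitivity is a weighted average of the two subgroup sensitivities, so it moves toward the lower figure as more fetuses with other abnormalities enter the D+ group.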
Quiz: What if we considered the nasal bone absence as a test for Trisomy 21? • Then instead of excluding subjects with other chromosomal abnormalities or including them as D+, we should count them as D-. Compared with excluding them, • What would happen to sensitivity? • What would happen to specificity?
Quiz: What if we considered the nasal bone absence as a test for Trisomy 21? • Sensitivity unchanged • Specificity reduced
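The quiz answer follows directly from which cells change. In the sketch below, the D+ counts and the 295 fetuses with other abnormalities come from the slides (95 with absent nasal bone is derived as 324 - 229); the counts for chromosomally normal fetuses are hypothetical, chosen only to make the arithmetic concrete.

```python
# From the slides: Trisomy 21 fetuses and fetuses with other abnormalities.
t21_pos, t21_total = 229, 333
other_pos, other_total = 95, 295  # 95 derived as 324 - 229

# Hypothetical: chromosomally normal fetuses (not given in the slides).
normal_pos, normal_total = 150, 5000

def sens_spec(d_pos_testpos, d_pos_total, d_neg_testpos, d_neg_total):
    """Sensitivity and specificity from test-positive counts and group totals."""
    sensitivity = d_pos_testpos / d_pos_total
    specificity = (d_neg_total - d_neg_testpos) / d_neg_total
    return sensitivity, specificity

# Other abnormalities excluded from the study entirely.
excluded = sens_spec(t21_pos, t21_total, normal_pos, normal_total)

# Other abnormalities counted as D- (test is for Trisomy 21 specifically).
counted_as_d_neg = sens_spec(
    t21_pos, t21_total, normal_pos + other_pos, normal_total + other_total
)

print("excluded:         sens=%.3f spec=%.3f" % excluded)
print("counted as D-:    sens=%.3f spec=%.3f" % counted_as_d_neg)
```

Sensitivity is identical in both versions because the D+ group does not change. Specificity falls because the added D- fetuses have a much higher rate of absent nasal bone than the normal fetuses do.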
Prevalence, spectrum and nonindependence • Prevalence (prior probability) of disease may be related to disease severity • One mechanism is different spectra of disease or nondisease • Another is that whatever is causing the high prior probability is related to the same aspect of the disease as the test
Prevalence, spectrum and nonindependence • Examples • Iron deficiency, HIV • Diseases identified by screening • Urinalysis as a test for UTI in women with more and fewer symptoms (high and low prior probability)
Meta-analyses of Diagnostic Tests • Systematic and reproducible approach to finding studies • Summary of results of each study • Investigation into heterogeneity • Summary estimate of results, if appropriate • Unlike other meta-analyses (risk factors, treatments), results aren’t summarized with a single number (e.g., RR), but with two related numbers (sensitivity and specificity) • These can be plotted on an ROC plane
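To show what plotting on an ROC plane involves, here is a minimal sketch that converts each study's 2x2 table into one point, with 1 - specificity on the x-axis and sensitivity on the y-axis. The study names and counts are hypothetical and do not correspond to any study in the meta-analyses cited in this lecture.

```python
# Hypothetical studies: (true positives, false negatives, false positives, true negatives).
studies = {
    "Study A": (45, 5, 20, 80),
    "Study B": (30, 20, 5, 95),
    "Study C": (38, 12, 10, 90),
}

def roc_point(tp, fn, fp, tn):
    """Return (1 - specificity, sensitivity), the study's coordinates on the ROC plane."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return 1 - specificity, sensitivity

points = {name: roc_point(*cells) for name, cells in studies.items()}

for name, (x, y) in points.items():
    print(f"{name}: 1-specificity={x:.2f}, sensitivity={y:.2f}")
```

Each study contributes a pair of numbers instead of a single effect estimate. In this made-up set, the points trade sensitivity against specificity, a pattern that would be worth investigating as a possible threshold difference before computing any summary estimate.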
MRI for the diagnosis of MS Whiting et al. BMJ 2006;332:875-84
Dermoscopy vs Naked Eye for Diagnosis of Malignant Melanoma Br J Dermatol. 2008 Sep;159(3):669-76
Studies of Diagnostic Test Accuracy: Checklist • Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis? • Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)? • Was the reference standard applied regardless of the diagnostic test result? • Was the test (or cluster of tests) validated in a second, independent group of patients? From Sackett et al., Evidence-based Medicine,2nd ed. (NY: Churchill Livingstone), 2000. p 68
Systematic Approach • Authors and funding source • Research question • Study design • Study subjects • Predictor variable • Outcome variable • Results & Analysis • Conclusions Consider possible biases due to deviations from a perfect study and estimate the magnitude and direction of each
A clinical decision rule to identify children at low risk for appendicitis (Problem 5.6) • Study design: prospective cohort study • Subjects • 4140 patients 3-18 years presenting to Boston Children’s Hospital ED with abdominal pain • Of these, 767 (19%) received surgical consultation for possible appendicitis • 113 Excluded (Chronic diseases, recent imaging) • 53 missed • 601 included in the study (425 in derivation set) Kharbanda et al. Pediatrics 2005; 116(3): 709-16
A clinical decision rule to identify children at low risk for appendicitis • Predictor variable • Standardized assessment by pediatric ED attending • Focus on “Pain with percussion, hopping or cough” (complete data in N=381) • Outcome variable: • Pathologic diagnosis of appendicitis (or not) for those who received surgery (37%) • Follow-up telephone call to family or pediatrician 2-4 weeks after the ED visit for those who did not receive surgery (63%) Kharbanda et al. Pediatrics 2005; 116(3): 709-16
A clinical decision rule to identify children at low risk for appendicitis • Results: Pain with percussion, hopping or cough • 78% sensitivity and 83% NPV seem low to me. Are they valid for me in deciding whom to image? Kharbanda et al. Pediatrics 2005; 116(3): 709-16
Checklist • Was there an independent, blind comparison with a reference (“gold”) standard of diagnosis? • Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom we would use it in practice)? • Was the reference standard applied regardless of the diagnostic test result? • Was the test (or cluster of tests) validated in a second, independent group of patients? From Sackett et al., Evidence-based Medicine,2nd ed. (NY: Churchill Livingstone), 2000. p 68
In what direction would these biases affect results? • Sample not representative (population referred to pedi surgery)? • Verification bias? • Double-gold standard bias? • Spectrum bias?
For children presenting with abdominal pain to SFGH 6-M • Sensitivity probably valid (not falsely low) • But whether all of them tried to hop is not clear • Specificity probably low • PPV is too high • NPV is too low • Does not address surgical consultation decision
Does this coughing patient have pertussis?* • RQ (for us): what are LR for coughing fits, whoop, and post-tussive vomiting in adults with persistent cough? • Design (for one study we reviewed**): Prospective cross-sectional study • Subjects: 217 adults ≥18 years with cough 7-21 days, no fever or other clear cause for cough enrolled by 80 French GPs. • In a subsample from 58 GPs, of 710 who met inclusion criteria only 99 (14%) enrolled * Cornia et al. JAMA 2010;304(8):890-896 **Gilberg S et al. J Inf Dis 2002;186:415-8
Pertussis diagnosis • Predictor variables: “GPs interviewed patients using a standardized questionnaire.” • Outcome variable: Evidence of pertussis based on • Culture (N=1) • PCR (N=36) • Or ≥ 2-fold change in anti-pertussis toxin IgG (N=40) • Total N = 70/217 with evidence of pertussis (32%) *Gilberg S et al. J Inf Dis 2002;186:415-8
Issues • Verification (selection) bias: only 14% of eligible subjects included • Questionable gold standard • 2-fold dilution too small • Increase or decrease counted • Internally inconsistent: pts with + PCR no more likely to have change in Ab titres.
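The research question for the pertussis example asks for likelihood ratios, so it may help to see how an LR is computed and applied. The sketch below uses the standard definitions, LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The sensitivity and specificity values are hypothetical and are not results from the studies cited; the prior probability is the 70/217 from the slides.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) for a dichotomous finding."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

def post_test_probability(prior, lr):
    """Convert prior probability to odds, multiply by the LR, convert back."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * lr
    return post_odds / (1 + post_odds)

# Hypothetical accuracy of one symptom (not from the cited studies).
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.5, specificity=0.8)

# Prior probability from the slides: 70 of 217 subjects had evidence of pertussis.
prior = 70 / 217

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
print(f"prior = {prior:.1%}")
print(f"post-test if present = {post_test_probability(prior, lr_pos):.1%}")
print(f"post-test if absent  = {post_test_probability(prior, lr_neg):.1%}")
```

Because LRs are built from sensitivity and specificity, any bias that distorts those two numbers carries through to the LRs.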
Double Gold Standard Bias: effect of spontaneously resolving disease Sensitivity, a/(a+c) biased __ Specificity, d/(b+d) biased __ Double gold standard compared with follow-up for all Double gold standard compared with immediate invasive test for all
Double Gold Standard Bias: effect of newly occurring cases Sensitivity, a/(a+c) biased __ Specificity, d/(b+d) biased __ Double gold standard compared with follow-up for all Double gold standard compared with PA-Gram for all
Double Gold Standard Bias: Ultrasound diagnosis of intussusception
What if 10% of the 86 U/S- followed subjects actually had intussusceptions that resolved spontaneously?
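The question above can be worked through numerically. Only the 86 followed U/S-negative subjects and the 10% figure come from the slide; the 2x2 counts are hypothetical placeholders, since the study's table is not reproduced here. The sketch assumes all 86 followed subjects were counted as true negatives and asks what an immediate definitive test for everyone would have shown.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity); a=T+D+, b=T+D-, c=T-D+, d=T-D-."""
    return a / (a + c), d / (b + d)

# Hypothetical observed table under the double gold standard.
a, b, c, d = 40, 5, 5, 100

# From the slide: 86 U/S-negative subjects had follow-up as their gold standard.
followed_negatives = 86
fraction_resolved = 0.10  # the "what if" proportion from the slide

# Subjects counted as true negatives who would have been false negatives
# had everyone received the immediate definitive test.
resolved = fraction_resolved * followed_negatives

observed = sens_spec(a, b, c, d)
corrected = sens_spec(a, b, c + resolved, d - resolved)

print(f"reclassified subjects: {resolved:.1f}")
print("observed:  sens=%.3f spec=%.3f" % observed)
print("corrected: sens=%.3f spec=%.3f" % corrected)
```

With these placeholder counts, most of the change is in sensitivity, because the reclassified subjects are a large share of the D+ group and only a small share of the D- group.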