
Studying the Impact of Tests


Presentation Transcript


  1. Studying the Impact of Tests. Jon Deeks, Professor of Health Statistics, University of Birmingham. Work supported by a DOH NCC RCD Senior Research Scientist in Evidence Synthesis Award

  2. Answering policy decisions about the use of diagnostic tests • Should GPs refer patients with low back pain for X-ray and/or MRI? • Should patients with dyspeptic symptoms receive serology tests for H. pylori, endoscopy, or empirical therapy?

  3. Standard hierarchy for HTA of tests (Fryback and Thornbury 1991) • Technical quality of the test • Diagnostic accuracy • Change in diagnostic thinking • Change in patient management • Change in patient outcomes • Societal costs and benefits

  4. Studies on the Diagnostic Evaluation Pathway • Analytical validity • Reliability (repeatability and reproducibility) • Measurement accuracy • Diagnostic validity • Diagnostic accuracy • Comparative/incremental diagnostic accuracy • Impact • Change in diagnostic yield • Change in management • Change in patient outcomes • Economic evaluation

  5. HTA policy on evaluating tests (up until 2004) • “the emphasis of the HTA programme is to assess the effect on patient management and outcomes … improvements in diagnostic accuracy, whilst relevant, are not the primary interest of this commissioned research programme”

  6. Studies on the Diagnostic Evaluation Pathway • Analytical validity • Reliability (repeatability and reproducibility) • Measurement accuracy • Diagnostic validity • Diagnostic accuracy • Incremental diagnostic accuracy • Impact • Change in diagnostic yield • Change in management • Change in patient outcomes • Economic evaluation [highlighted on the slide: Focus of HTA programme]

  7. Outline of talk • Trials of diagnostic evaluations • Problems • What is being evaluated? • Statistical power • Study validity • Outcomes • Pragmatic suggestions • When are trials really needed? • Alternative trial designs • Alternative ways of assessing comparative accuracy • More research is needed

  8. RCT to assess patient outcomes [flow diagram: Population → Sample → Randomise → Active / Control → Outcome in each arm]

  9. Diagnostic RCT [flow diagram: Population → Sample → Randomise → TEST / Control → Outcome in each arm]

  10. RCT 1: X-ray at first GP presentation for low back pain (HTA 2000(4): 20). Population: GP attendees aged 16-64 yrs with LBP; excluded if ‘flu or previous consultation for LBP in last 4 weeks. Sample N=153, randomised to: referred for X-ray, N=73 (outcome at 6 weeks N=59, at 1 year N=50); no X-ray referral, N=80 (outcome at 6 weeks N=67, at 1 year N=58). PRIMARY outcomes: Roland score, HADS, SF-36, EuroQol. SECONDARY: time off work, therapists, medication, satisfaction.

  11. RCT 1 results (design as slide 10; HTA 2000(4): 20). At 6 weeks: differences on the SF-36 mental health and vitality subscales (P<.05); at 12 months: difference on the SF-36 mental health subscale (P<.05).

  12. RCT 2: X-ray for GP presentation for low back pain >6 weeks (HTA 2001(5): 30). Population: GP attendees aged 20-55 with a first episode of LBP of between 6 weeks and 6 months duration; excluded if ‘red flags’. Sample N=421, randomised to: referred for X-ray, N=210 (outcome at 3 months N=199, at 9 months N=195); no X-ray referral, N=211 (outcome at 3 months N=203, at 9 months N=199). PRIMARY outcome: Roland score. SECONDARY: pain (VAS), EuroQol, pain (diary), satisfaction, pain (any), belief in X-ray, time off work, therapists, medication, consultations.

  13. RCT 2 results (design as slide 12; HTA 2001(5): 30). At 3 months: difference in the proportion reporting LBP (P<.05); at 9 months: no differences.

  14. What is being evaluated? [pathway diagram: Medical Test → Information → Decision → Action → Patient Outcome; diagnostic accuracy governs the test-to-information step, diagnostic yield the information-to-decision step, and management the decision-to-action step. An RCT combines all of these effects, together with test harms and placebo effects.]

  15. What is being evaluated? • Conditions for a test to be of diagnostic benefit • Test is more accurate • Interpretation of test results is rational and consistent • Management is rational and consistent • Treatment is effective • Conditions for a trial to be informative • Rules for interpretation of test results are described • Management protocol is described • No such descriptions were given in the example trials • Applying the results requires faith that the behaviour of your patients and clinicians is the same as in the trial

  16. What is being evaluated? • If no difference is observed … • Is the test no more accurate? • Are clinicians not correctly interpreting test results? • Are management decisions inconsistent or inappropriate? • Is the treatment ineffective? • None of these questions can be answered • If one element changes, the results of the trial become redundant

  17. Statistical Power • RCT 1: • Reduction in proportion with pain at 2 weeks from 40% to 30% could be detected with 300 patients with 80% power at 5% significance • RCT 2: • Difference of 1.5 on Roland score could be detected with 388 patients with 90% power and 5% significance • sd=4.5, standardised difference=1.5/4.5=0.33 • These sample size calculations are suitable for a trial of treatment vs placebo, not a trial of test+treatment
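
As a point of reference, the Roland-score calculation in RCT 2 follows the standard normal-approximation formula for comparing two means. Below is a minimal sketch (Python, standard library only); the function name and exact output are illustrative, and the published figure of 388 presumably reflects additional rounding or adjustment.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.90):
    """Per-group N to detect a difference in means `delta` with common
    standard deviation `sd` (two-sided test, normal approximation)."""
    z = NormalDist().inv_cdf
    d = delta / sd                              # standardised difference
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2)

# RCT 2: 1.5-point difference on the Roland score, sd = 4.5, 90% power
n = n_per_group(1.5, 4.5)                       # ~190 per group
print(n, 2 * n)                                 # ~380 in total (slide: 388)
```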

  18. Diagnostic Accuracy of Clinical Judgement [2×2 table: rows Serious (requires intervention) / Minor (requires no intervention) against judged positive / negative, giving cells TP, FN (Serious row) and FP, TN (Minor row)]

  19. Diagnostic Accuracy of Clinical Judgement + X-ray [same 2×2 layout: Serious row TP, FN; Minor row FP, TN]

  20. Comparison of Diagnostic Accuracy [paired comparison table: Serious row — All TP (detected by both tests), Discrepant A (detected by one test only), All FN (missed by both); Minor row — All FP (positive on both), Discrepant B (positive on one only), All TN (negative on both)]
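
From the 2×2 tables on slides 18-19, sensitivity and specificity are simple ratios. A small sketch follows; the function name and counts are hypothetical, for illustration only.

```python
def sens_spec(tp, fn, fp, tn):
    """Sensitivity and specificity from a 2x2 table of test result
    against the reference standard (serious vs minor)."""
    sensitivity = tp / (tp + fn)   # serious cases correctly flagged
    specificity = tn / (tn + fp)   # minor cases correctly spared
    return sensitivity, specificity

# Hypothetical counts for illustration only
print(sens_spec(tp=14, fn=6, fp=16, tn=64))  # (0.70, 0.80)
```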

  21. Benefit can only occur in those whose diagnosis changes • Where can differences arise? • Discrepant A could benefit if intervention effective • Discrepant B could benefit if intervention harmful • All others have no benefit as no change in their intervention • Sample size must take into account • Prevalence of treatable condition • Detection rate (sensitivity) with control test • Detection rate (sensitivity) with new test • Treatment rate if control test negative (assume zero) • Treatment rate if new test positive (assume 100%) • Outcome for treatable condition if untreated • Treatment effect

  22. Sample size for detecting treatment effects • Sample size for treatment vs control 300-400. • Sample size must be adjusted according to the proportion in discrepant cells (particularly A). • If 20% have serious disease and the new test’s sensitivity is 20 percentage points higher, 4% will be in Discrepant A → increase N 25-fold (N=7,500-10,000) • If 10% have serious disease and sensitivity is 10 percentage points higher, 1% will be in Discrepant A → increase N 100-fold (N=30,000-40,000)
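
The inflation arithmetic on this slide can be written out directly. A minimal sketch (function name mine) that reproduces the quoted figures:

```python
from math import ceil

def inflated_n(treatment_trial_n, prevalence, sensitivity_gain):
    """Scale a treatment-trial sample size by the Discrepant A fraction
    (prevalence x gain in sensitivity): only these patients' management,
    and hence outcome, can differ between trial arms."""
    discrepant_a = prevalence * sensitivity_gain
    return ceil(treatment_trial_n / discrepant_a)

print(inflated_n(300, prevalence=0.20, sensitivity_gain=0.20))  # 7,500
print(inflated_n(400, prevalence=0.10, sensitivity_gain=0.10))  # 40,000
```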

  23. Sample size for detecting differences in accuracy • Sample size depends on whether the sample all receive both tests, or are randomised to tests • Sample sizes for difference in sensitivity • If 20% have serious disease, to detect a 20 percentage-point difference in sensitivity (70% vs 90%; 80% power, alpha 0.05) → paired cohort design N=116 [68-136] → parallel cohort design N=232 • If 10% have serious disease, to detect a 10 percentage-point difference in sensitivity (80% vs 90%; 80% power, alpha 0.05) → paired cohort design N=706 [271-814] → parallel cohort design N=1411
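
The bracketed ranges on this slide reflect assumptions, not stated here, about how far the two tests’ errors overlap, so the exact figures cannot be reproduced from the slide alone. Purely as a generic illustration of the paired calculation (this is not the speaker’s method), here is Connor’s (1987) McNemar-type approximation; `disc`, the assumed proportion of diseased patients on whom the tests disagree, drives the answer.

```python
from math import ceil, sqrt
from statistics import NormalDist

def paired_sensitivity_n(sens_old, sens_new, prevalence,
                         disc=None, alpha=0.05, power=0.80):
    """Whole-cohort N for a paired comparison of two sensitivities
    (Connor 1987). If `disc` is not given, assume the tests err
    independently on diseased patients."""
    d = sens_new - sens_old
    if disc is None:
        disc = sens_new * (1 - sens_old) + sens_old * (1 - sens_new)
    z = NormalDist().inv_cdf
    pairs = (z(1 - alpha / 2) * sqrt(disc)
             + z(power) * sqrt(disc - d ** 2)) ** 2 / d ** 2
    return ceil(pairs / prevalence)   # scale diseased pairs up to cohort

# 70% vs 90% sensitivity, 20% prevalence: N varies with assumed overlap
print(paired_sensitivity_n(0.70, 0.90, 0.20))             # independent errors
print(paired_sensitivity_n(0.70, 0.90, 0.20, disc=0.20))  # nested detections
```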

  24. Sample size for detecting differences in diagnoses and management • Sample size based on accuracy sample size inflated according to: • For diagnostic impact • diagnosis rate if control test negative • diagnosis rate if new test positive* • For therapeutic impact • treatment rate if control test negative • treatment rate if new test positive* * subject to “learning effects”
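
One way to read this slide’s inflation, consistent with the assumptions on slide 21 (treat no control-test negatives, treat all new-test positives): only the net change in management among discrepant patients counts, so any shortfall from those idealised rates inflates N further. A hedged sketch, with names and numbers of my own choosing:

```python
from math import ceil

def management_impact_n(treatment_trial_n, discrepant_a, net_change):
    """N for a management-impact trial: inflate by the Discrepant A
    fraction times the net change in treatment rate (rate if new test
    positive minus rate if control test negative)."""
    return ceil(treatment_trial_n / (discrepant_a * net_change))

print(management_impact_n(300, 0.04, net_change=1.0))  # idealised: 7,500
print(management_impact_n(300, 0.04, net_change=0.5))  # partial uptake: 15,000
```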

  25. Validity Concerns • Blinding • Participants and outcome assessors are rarely blind in diagnostic trials • Trials may be more susceptible to measuring preconceived notions of participants and expectations of trialists • Drop-out • Lack of blinding can induce differential drop-out • There are more stages at which drop-out occurs • Compliance • Lack of blinding and complexity in strategies can reduce compliance

  26. What outcomes? • The problem is multi-multi-factorial • Assessing the effect of a single intervention for a single disease requires multiple outcomes • Tests are used to differentiate between multiple diseases and disease states • A trial should assess all the important outcomes for the multiple diseases within the differential diagnosis • But trials usually have a focus on one condition

  27. Summary of problems • Diagnostic trials … • Are rarely done • Assess the effects of a “test+treatment package” • Are uninformative about the value of the test itself • Are often underpowered • Are at risk of bias • May not assess all relevant outcomes • May be more likely to detect “placebo” effects than benefits of better diagnoses • May not represent future impact on treatment and diagnostic decisions

  28. Key issues • Trials need only be done in limited circumstances • Only patients in the discrepant cell are informative • Audit and feedback studies are better than trials for assessing and changing clinicians’ behaviour • More good comparative studies of test accuracy are required

  29. When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5 • Categories of test attributes: • The new test is safer or is less costly • The new test is more specific (excludes more cases of non-disease) • The new test is more sensitive (detects more cases of disease) • If an RCT of treatments exists, when do we still need to undertake an RCT of test+treatment?

  30. Trial evidence versus linked evidence of test accuracy and treatment efficacy [figure] Lord, S. J. et al. Ann Intern Med 2006;144:850-855

  31. Assessing new tests using evidence of test accuracy, given that treatment is effective for cases detected by the old test [figure] Lord, S. J. et al. Ann Intern Med 2006;144:850-855

  32. When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5 • If the new test has similar sensitivity • Trials of test+treatment are not required • Reductions in harm or cost are benefits • Improved specificity can only be a benefit • Decision models can be used to analyse trade-offs between benefits and harms

  33. When is measuring sensitivity and specificity sufficient to evaluate a new test? Lord et al. Ann Intern Med 2006; 144: 850-5 • If the new test has improved sensitivity • Value of using the test depends on treatment response in the extra cases detected • A trial is still not needed if • Inclusion in the treatment trial was based on the reference standard for assessing test accuracy • The test is evaluated in a treatment trial as a predictor of response • The new cases represent the same spectrum or subtype of disease • Treatment response is known to be similar across the spectrum or subtype of disease
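
The “linked evidence” logic of slides 29-33 can be made concrete with a toy calculation: combine prevalence, the two tests’ sensitivities, and the treatment effect from an existing RCT to project outcomes, instead of running a new test+treatment trial. All inputs and names below are hypothetical.

```python
def adverse_outcomes_per_1000(prevalence, sensitivity,
                              risk_untreated, risk_treated):
    """Expected adverse outcomes per 1000 patients under test-and-treat:
    detected cases are treated, missed cases go untreated."""
    detected = 1000 * prevalence * sensitivity
    missed = 1000 * prevalence * (1 - sensitivity)
    return detected * risk_treated + missed * risk_untreated

# Hypothetical: 10% prevalence, untreated risk 30%, treated risk 15%
old = adverse_outcomes_per_1000(0.10, 0.70, 0.30, 0.15)
new = adverse_outcomes_per_1000(0.10, 0.90, 0.30, 0.15)
print(old, new)   # 19.5 vs 16.5: ~3 fewer adverse outcomes per 1000
```

The projected gain holds only if the extra cases found by the new test respond to treatment like the old-test-detected cases, which is exactly the condition this slide sets out.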

  34. Alternative Diagnostic RCT [flow diagram: Population → Sample; every patient receives both clinical diagnosis and X-ray, each classifying them as Serious or Minor; Randomise; Serious → intervene, Minor → do not intervene; Outcome measured in each group]

  35. Alternative Diagnostic RCT [flow diagram, continued: concordant patients are managed accordingly (Serious → intervene; Minor → do not intervene); only patients whose clinical diagnosis and X-ray disagree are randomised to intervene vs do not intervene, and their outcomes compared]

  36. Alternative Diagnostic RCT • Everybody gets all tests; randomise only those with discrepant results • Benefits • Assesses diagnostic yield and resultant patient outcomes • Less follow-up required • If a reference standard is applied in a random sample, comparative diagnostic accuracy can also be assessed • Downsides • More tests undertaken • Problems when test material is limited • Does not assess test harms or other direct effects • May not be ethical to randomise treatment

  37. Assessing clinicians’ behaviours • Informative trials require documentation and standardisation of decision-making • Particularly difficult when the comparison group is standard practice • Assessing behaviour observed in a trial may not be representative • Future behaviour will depend on the trial results • Learning curves may affect compliance • Becoming acquainted with a test • Ascertaining how best to use it • Gaining confidence in its findings • Allowing it to replace other investigations

  38. Diagnostic Before-and-After Studies • Design • Doctors’ assessments of diagnostic, prognostic and required management decisions recorded • Result of new test made available • Doctors’ changes in diagnostic, prognostic and required management decisions noted • (Reference standard applied) • Application • Assessment of an Additional Test only • Assessment of Diagnostic Yield and Management • Concerns • New test assessed independently of other tests • Doctors’ processes may not reflect standard clinical practice • Learning effects

  39. Conclusions • We have much to learn about the best way of studying diagnostic tests • Test+treatment trials are difficult to undertake, are prone to bias, and often require unattainable sample sizes. • Good comparative studies of test accuracy combined in decision models with evidence from trials of treatments may in many circumstances provide the necessary evidence for policy decisions • Good comparative studies of test accuracy should be commissioned more readily

  40. Defects and Disasters in Evaluations of the Impact of Diagnostic Tests. Jon Deeks, Professor of Health Statistics, University of Birmingham. Work supported by a DOH NCC RCD Senior Research Scientist in Evidence Synthesis Award
