CAT (Critically Appraised Topic) (adapted from Sackett, et al. 2000)

• 1-page summary of evidence resulting from critical appraisal of an article, test, etc. • Answers a specific foreground question • “Compared to no treatment, does parent-administered treatment significantly improve the language skills of toddlers with language delay?”

First part of CAT identical for tx and dx studies (see handout pp. 2-3) • Clinical bottom line: (appears 1st but completed last) • Clinical question: • Search terms: • Appraised by whom, and date: • Synopsis of key (memorable) information, in a concise, maximally useful format (e.g., types of subjects, procedures, measures, results, etc.)

CAT-egories (appraisal points) for a study of therapy (Sackett et al., 2000) • Prospective, controlled? • Random assignment? • Comparing ≥ 2 conditions? • Recognizable subjects? • Evidence of pre-tx group similarity? • Blinding (insofar as possible) of evaluators, relevant others?

Appraisal points (cont.) • Control over nuisance variables? • Valid, reliable measures of tx effects? • Statistically significant difference (p-value)? • Practically significant difference (d-value)? • Precision of treatment effects (narrow CI)? • Outcomes for all enrolled? • Cost-benefit and feasibility analyses?

A sample treatment CAT • CAT: Language of delayed toddlers improves in response to parent-administered focused stimulation • Clinical bottom line: Compared to an untreated control group, motivated mothers of low-vocabulary toddlers significantly decreased their speaking rate and language complexity and increased their vocabulary inputs in response to ~18 hr of instruction in focused stimulation techniques, and their children produced significantly more words and early grammatical forms. • Clinical question: Compared to no treatment, does parent-administered treatment significantly improve the language skills of toddlers with language delay? • Search terms: word learning AND toddlers, PubMed clinical query • Appraised by: Dollaghan

Key appraisal points • Prospective, controlled: Yes • Randomized: Yes • Comparing ≥ 2 conditions: Yes • Recognizable Ss: Yes • Pre-tx similarity: Yes • Blinding: Yes Cn; no parent • Control over nuisance variables: Yes • Valid, reliable measures: Yes • Statistically significant differences: Yes • Practically significant differences: Yes • Precision of treatment effects: No • Outcomes for all enrolled: Yes • Cost-benefit, feasibility analyses: Yes

Critical appraisal of evidence on diagnostic indicators • The key variables by which individuals are identified as members of a class, ostensibly to improve prediction and outcome for them • Myriad diagnostic indicators have been proposed in communication sciences and disorders • Diagnostic indicators in your area of interest?

Most diagnostic indicators in CSD are based on “Phase I” studies • Group mean comparison studies • People with, and people without, the condition of interest are compared with respect to a proposed indicator • Correlational studies • Association between proposed indicator and accepted indicators • Such studies can’t address the two most crucial features of a diagnostic indicator: accuracy and precision

Accuracy and precision • Accuracy • The ability of an indicator to identify a condition of interest, i.e., the amount of agreement between the proposed indicator and a reference standard • Precision • Width of confidence intervals (CI) for estimates of accuracy

Accuracy of a diagnostic indicator • The ability of an indicator to identify a condition of interest, i.e., the amount of agreement between the proposed indicator and a reference standard • Preferred measures of diagnostic accuracy: positive and negative likelihood ratios (Battaglia et al., 2002)

Positive Likelihood Ratio (LR+) • Reflects the degree of confidence that a person who scores in the positive (affected or disordered) range on a dx indicator does have the disorder • Formula: sensitivity / (1 - specificity) • The higher the LR+, the more informative the indicator for identifying people who have the disorder

Interpreting LR+ values (Sackett et al., 1991) • LR+ > 20: Very high; virtually certain that a person with this score has the disorder • LR+ = 10: High; disorder very likely in a person with this score • LR+ = 4: Intermediate; the indicator is suggestive of disorder but insufficient to diagnose • LR+ = 1: Equivocal; a person who scores in the disordered range on the measure may or may not have the disorder; the measure provides no new information

Negative Likelihood Ratio (LR-) • Reflects the degree of confidence that a person scoring in the negative (normal) range on the diagnostic indicator truly does not have the disorder • Formula: (1 - sensitivity) / specificity • The lower the LR-, the more informative the indicator for ruling out the presence of disorder

Interpreting LR- values (Sackett et al., 1991) • LR- < 0.10: Very low; virtually certain that a person scoring in this range does not have the disorder • LR- = 0.20: Low; disorder very unlikely • LR- = 0.40: Intermediate; the indicator is suggestive but insufficient to rule out the disorder • LR- = 1.0: Equivocal; a person scoring in the normal range on this measure may or may not be normal

Calculating sensitivity and specificity (nothing more than LR precursors) • Sensitivity: the percentage of people with the disorder that the new indicator correctly classifies as disordered • Specificity: the percentage of people who don’t have the disorder that the new indicator correctly classifies as not disordered • The “true” status of every individual with regard to the disorder is established according to a gold (or reference) standard

                                    Disorder Status (re: Gold Standard)
                                    + Disorder (LI)       - Disorder (LN)
New Test Result   + Disorder (LI)
                  - Disorder (LN)
                                    # with disorder       # without disorder

                                    Disorder Status (re: Gold Standard)
                                    + Disorder (LI)       - Disorder (LN)
New Test Result   + Disorder (LI)   a (True positive)     b (False positive)
                  - Disorder (LN)   c (False negative)    d (True negative)
Sensitivity = a/(a+c) (the proportion of people with the disorder that the new test identifies as having the disorder)

Disorder Status (re: Gold Standard) + Disorder - Disorder + Disorder New Test Result -Disorder Specificity = d/b+d (the proportion of people without the disorder that the new test identifies as not having the disorder)

Example • 100 children diagnosed with language impairments (LI) and enrolled in language intervention, and 100 same-age children with no history of language impairment (LN), were administered a new test of grammatical morphology. • 80 of the children with LI, and 30 of the children with LN, scored in the disordered range on the new measure.

                                    Disorder Status (re: Gold Standard)
                                    + Disorder (LI)       - Disorder (LN)
New Test Result   + Disorder (LI)   a = 80                b = 30
                  - Disorder (LN)   c = 20                d = 70
                                    100 with disorder     100 without disorder
Sens = a/(a+c) = 80/100 = .80
Spec = d/(b+d) = 70/100 = .70
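The worked example can be checked with a few lines of Python. This is only a minimal sketch: the function name `sens_spec` is illustrative, and the cell labels a-d follow the 2x2 layout used on the sensitivity slide.

```python
def sens_spec(a, b, c, d):
    """Return (sensitivity, specificity) from 2x2 cell counts.

    a = true positives,  b = false positives,
    c = false negatives, d = true negatives.
    """
    sensitivity = a / (a + c)  # share of people with the disorder that the test flags
    specificity = d / (b + d)  # share of people without the disorder that the test clears
    return sensitivity, specificity


# Example counts: 80 of 100 LI children and 30 of 100 LN children
# scored in the disordered range on the new test.
sens, spec = sens_spec(a=80, b=30, c=20, d=70)
print(f"Sens = {sens:.2f}, Spec = {spec:.2f}")
```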

Why not just use sensitivity and specificity as measures of accuracy? • It’s their interrelationship that is most important overall • Sensitivity and specificity vary substantially according to sample characteristics, including N, base rate (prevalence), severity, confusability • Likelihood Ratios are not impervious to sample characteristics, but are much less affected than are sensitivity and specificity

Calculating Likelihood Ratios • Sens = .80 • Spec = .70 • LR+ = sens/(1-spec) = .80/.30 = 2.67 • LR- = (1-sens)/spec = .20/.70 = 0.29 • Several programs, some free on web, are set up to allow entry in 2x2 table format • In addition to accuracy measures, they also provide information on precision
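The same arithmetic as a short Python sketch. The function name `likelihood_ratios` is illustrative; it simply applies the two formulas from the LR+ and LR- slides and does not guard against a specificity of 1 or 0, where a ratio is undefined.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) computed from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)   # LR+ = sens / (1 - spec)
    lr_neg = (1 - sensitivity) / specificity   # LR- = (1 - sens) / spec
    return lr_pos, lr_neg


lr_pos, lr_neg = likelihood_ratios(0.80, 0.70)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```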

Precision of a diagnostic indicator • Width of confidence intervals (CI) for sensitivity, specificity, and likelihood ratios, calculated by adding and subtracting a multiple of standard error (e.g., 1.96 SE for a 95% CI) • Standard error depends on sample size and reliability; larger samples and higher reliability will result in narrower CIs, all else being equal • Sackett et al. (2000) appendix shows how to calculate CIs by hand, and programs (some free) provide CIs given raw numbers in a 2x2 table
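The "add and subtract 1.96 SE" recipe can be sketched for a single proportion such as sensitivity. The standard error used here, sqrt(p(1-p)/n), is the usual normal-approximation formula for a proportion and is an assumption: the slide does not give a formula, and this simple version reflects sample size only, not measurement reliability. The function name `wald_ci` is illustrative.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Normal-approximation CI for a proportion: p +/- z * SE."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # assumed SE for a proportion
    return p - z * se, p + z * se


# Same sensitivity (.80) estimated from 100 cases and from 10 cases.
lo_100, hi_100 = wald_ci(80, 100)
lo_10, hi_10 = wald_ci(8, 10)
print(f"n = 100: {lo_100:.2f} to {hi_100:.2f}")
print(f"n = 10:  {lo_10:.2f} to {hi_10:.2f}")
```

With only 10 cases the upper limit exceeds 1, which is one reason programs often use other interval methods for small samples.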

Sample size and precision: 95% CIs for studies with same LRs but different Ns
Value          N = 200 (95% CI)    N = 20 (95% CI)
Sens = .80     (0.71-0.87)         (0.44-0.98)
Spec = .70     (0.60-0.79)         (0.35-0.93)
LR+ = 2.67     (1.98-3.70)         (1.12-7.66)
LR- = 0.29     (0.19-0.42)         (0.08-0.87)
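The sensitivity and specificity intervals on this slide are consistent with exact (Clopper-Pearson) binomial intervals, assuming 100 children per group for N = 200 and 10 per group for N = 20. That the slide used this method, and those group sizes for N = 20, are assumptions. The sketch below uses only the standard library and finds each limit by bisection; the names `binom_cdf` and `exact_ci` are illustrative. It does not attempt the likelihood-ratio intervals, whose method the slide does not state.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes out of n trials."""

    def root(g):
        # g is increasing in p and changes sign on (0, 1); bisect to the crossing.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else root(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else root(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


for label, k, n in [("Sens, n=100", 80, 100), ("Sens, n=10", 8, 10),
                    ("Spec, n=100", 70, 100), ("Spec, n=10", 7, 10)]:
    low, high = exact_ci(k, n)
    print(f"{label}: {low:.2f} to {high:.2f}")
```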

CAT-ing evidence on a diagnostic indicator (Sackett et al., 2000; Battaglia et al., 2002) • Does the study report a comparison between measures, or measure and gold standard? • sine qua non for evidence of diagnostic accuracy • Was the gold (or reference) standard valid, reliable, and/or reasonable? • Gold standard and new indicator also must be independent to avoid incorporation bias that can inflate accuracy measures

Criteria for diagnostic indicators (cont.) • Were patients enrolled prospectively and consecutively (or by random assignment), and • Did the sample include a spectrum of patient types and severities? • These two criteria are important in avoiding spectrum bias, in which the sample includes only clear-cut or hand-picked cases and thus does not represent the diagnostic task

Criteria for diagnostic indicators (cont.) • Were the new measure and the reference standard administered independently, by different examiners, and • Were the examiners blinded to the subject’s performance on the other test and to other relevant subject information? • Were the new measure and the reference standard both administered to all subjects and controls? • Important to avoid differential verification bias, when controls are assumed to be normal without testing on gold standard

Criteria for diagnostic indicators (cont.) • Do likelihood ratios suggest adequate diagnostic accuracy? • LR+ > 4.0 (> 10, cf. Bayes Library, 2002) • LR- < 0.40 (< 0.20, cf. Bayes Library, 2002) • Precision (narrow confidence intervals)? • Feasibility for usual clinical practice? • Value (i.e., better than current measure)?
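The likelihood-ratio cutoffs on this slide can be applied mechanically, as in the sketch below. The function name `meets_lr_criteria` and the `strict` flag are hypothetical; `strict=True` simply switches to the more demanding cutoffs given in parentheses on the slide. The check covers accuracy only, not precision, feasibility, or value.

```python
def meets_lr_criteria(lr_pos, lr_neg, strict=False):
    """Return (LR+ adequate?, LR- adequate?) against the slide's cutoffs."""
    # Lenient cutoffs: LR+ > 4.0 and LR- < 0.40; strict: LR+ > 10 and LR- < 0.20.
    pos_cut, neg_cut = (10.0, 0.20) if strict else (4.0, 0.40)
    return lr_pos > pos_cut, lr_neg < neg_cut


# Grammatical morphology example from earlier slides: LR+ = 2.67, LR- = 0.29.
print(meets_lr_criteria(2.67, 0.29))               # lenient cutoffs
print(meets_lr_criteria(2.67, 0.29, strict=True))  # stricter cutoffs
```

On the lenient cutoffs the example test is adequate for ruling the disorder out (LR-) but not for ruling it in (LR+); on the stricter cutoffs it is adequate for neither.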

Evidence on norm-referenced tests as diagnostic indicators for early LI • Many norm-referenced tests have diagnosis of LI as their explicit purpose • A growing number of tests meet typical psychometric criteria, e.g. N = 100 subjects per age level; reliability > .90; means, standard deviations, and standard errors of measurement • But very few provide evidence of diagnostic accuracy or precision, and none meet the recommended critical appraisal criteria

Norm-referenced tests not providing information on accuracy or precision • Test of Language Development (TOLD) • Sequenced Inventory of Communication Development (SICD) • Test of Early Language Development (TELD) • Reynell Scales • MacArthur Communicative Development Inventories (CDI)

A few tests provide information allowing accuracy and precision to be calculated
Test, cutoff                       Age   LI   LN   LR+ (95% CI)      LR- (95% CI)
PLS-4 Total language score < 85    3     24   24   6.7 (2.6-19.4)    0.19 (.08-.42)
                                   4     23   23   18 (3.6-102)      0.23 (.10-.44)
                                   5     28   28   4.4 (2.1-10.2)    0.26 (.12-.50)
                                   3-5   75   75   6.7 (3.7-12.5)    0.23 (.14-.35)
CELF-P Total language score < 85   3-5   80   80   5.3 (2.9-10.2)    0.45 (.34-.58)
CELF-P Total language score < 77   3-5   80   80   12.7 (4.4-37.8)   0.54 (.43-.66)
But note that these studies would fail many of the other critical appraisal criteria, their accuracy notwithstanding.

The situation is no better for other proposed diagnostic indicators • Few compare indicator to a gold standard, so accuracy can’t be determined • Few used blinded examiners, so a high potential for context and other biases • Small samples, wide CIs (rarely provided) • When sensitivity and specificity have been reported, they have sometimes been calculated incorrectly and/or misinterpreted

I choose not to despair • Knowing the limitations of our diagnostic tools is an important prerequisite to designing better diagnostic tools • Several possible ways forward, most involving clinician-researcher partnerships

A way forward to EBP in speech-language pathology and audiology • Designing studies to meet the criteria for strong evidence • e.g., STARD (Bossuyt et al., 2003) statement • Large-scale, cooperative studies of diagnostic indicators • CARE-COAD model (Straus et al., 2002) • Dealing with the absence of a gold standard • e.g., Demissie et al., 1998; Dunson, 2001; reliability and outcome studies • Diagnostic studies as multivariable, prediction research (Moons & Grobbee, 2002)

Test yourself • Critical appraisal of diagnostic test (handout p. 5) • Critical appraisal of treatment study (handout p. 4)

Critical appraisal and CAT enable the remaining steps to EBP 5. Decide whether the evidence is strong enough to influence your clinical practice 6. Integrate the evidence with the “intangibles” 7. Update!

EBP is itself a set of assumptions, not a cult • Ultimately, strong evidence will be needed to determine whether EBP results in improved clinical service. • And EBP can’t be applied blindly, to all kinds of problems...

As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.

Thanks! References
