Comparative Research on Training Simulators in Emergency Medicine: A Methodological Review Matt Lineberry, Ph.D. Research Psychologist, NAWCTSD matthew.lineberry@navy.mil Medical Technology, Training, & Treatment (MT3) May 2012
Credits and Disclaimers • Co-authors • Melissa Walwanis, Senior Research Psychologist, NAWCTSD • Joseph Reni, Research Psychologist, NAWCTSD • These are my professional views, not necessarily those of NAWCTSD, NAVMED, etc.
Objectives • Motivate conduct of comparative research in simulation-based training (SBT) for healthcare • Identify challenges evident from past comparative research • Promote more optimal research methodologies in future research
Cook et al. (2011) meta-analysis in JAMA: “…we question the need for further studies comparing simulation with no intervention (ie, single-group pretest-posttest studies and comparisons with no-intervention controls). …theory-based comparisons between different technology-enhanced simulation designs (simulation vs. simulation studies) that minimize bias, achieve appropriate power, and avoid confounding… are necessary”
Issenberg et al. (2011) research agenda in SIH: “…studies that compare simulation training to traditional training or no training (as is often the case in control groups), in which the goal is to justify its use or prove it can work, do little to advance the field of human learning and training.”
Moving forward: comparative research • How do varying degrees and types of fidelity affect learning? • Are some simulation approaches or modalities superior to others? For what learning objectives? Which learners? Which tasks? Etc. • How do cost and throughput considerations affect the utility of different approaches?
Where are we now? Searched for peer-reviewed studies comparing training effectiveness of simulation approaches and/or measured practice on human patients for emergency medical skills • Searched PubMed and CINAHL • mannequin, manikin, animal, cadaver, simulat*, virtual reality, VR, compar*, versus, and VS • Exhaustively searched Simulation in Healthcare • Among identified studies, searched references forward and backward
Reviewed studies 17 studies met criteria • Procedure trained: • Predominantly needle access (7 studies); also 4 airway adjunct, 3 TEAM, 2 FAST, etc. • Simulators compared: • Predominantly manikins, VR systems, and part-task trainers
Reviewed studies • Design: Almost entirely between-subjects (16 of 17) • Trainee performance measurement: • 7 were post-test only; all others included pre-tests • Most (9 studies) used expert ratings; also: knowledge tests (7), success/failure (6), and objective criteria (5) • 6 studies tested trainees on actual patients • 6 tested trainees on one of the simulators used in training
Apparent methodological challenges • Inherently smaller differences between conditions – and consequently, underpowered designs • An understandable desire to “prove the null” – but inappropriate approaches to testing equivalence • Difficulty measuring or approximating the ultimate criterion: performance on the job
Challenge #1: Detecting “small” differences • Cook et al. (2011) meta: Differences in outcomes of roughly 0.5-1.2 standard deviations, favoring simulation-based training over no simulation. Comparative research should expect smaller differences than these. • HOWEVER, small differences can have great practical significance if they… • correspond to important outcomes (e.g., morbidity or mortality), • can be exploited widely, and/or • can be exploited inexpensively.
The power of small differences… • Physicians’ Health Study: Aspirin trial halted prematurely due to obvious benefit for heart attack reduction • Effect size: r = .034 • Of 22k participants, 85 fewer heart attacks in the aspirin group
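For a concrete sense of scale, the r ≈ .03 figure can be reproduced as a phi coefficient from a 2x2 table. The counts below are approximate figures (roughly 11,000 physicians per arm, about 104 vs. 189 heart attacks) used only for illustration; they are not given in this talk.

```python
# Minimal sketch: effect size (phi / r) for the aspirin trial from a 2x2 table.
# Counts are approximate and for illustration only.
import math

a, b = 104, 11037 - 104      # aspirin arm: heart attack, no heart attack
c, d = 189, 11034 - 189      # placebo arm: heart attack, no heart attack

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi (r) = {abs(phi):.3f}; {c - a} fewer heart attacks in the aspirin arm")
```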
…and the tyranny of small differences • The probability of detecting differences (power) drops off sharply as effect size decreases • We generally can’t control effect sizes. Among other things, we can control: • Sample size • Reliability of measurement • Chosen error rates
Sample size • Among reviewed studies, n ranges from 8 to 62; median n = 15. • If n = 15, α = .05, true difference = 0.2 SDs, and measurement is perfectly reliable, the probability of detecting the difference is only 13% RECOMMENDATION: Pool resources in multi-site collaborations to achieve needed power to detect effects (and estimate power requirements a priori)
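The 13% figure can be checked with an off-the-shelf power calculation. A minimal sketch using statsmodels, assuming n = 15 per group and a one-tailed two-sample t-test (the exact number shifts with those assumptions):

```python
# Minimal power-analysis sketch (assumes n = 15 per group, one-tailed t-test).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power to detect a true difference of 0.2 SDs with n = 15 per group, alpha = .05
power = analysis.power(effect_size=0.2, nobs1=15, alpha=0.05, alternative="larger")
print(f"Power with n = 15 per group: {power:.2f}")

# A priori planning: per-group n needed to reach 80% power for the same effect
n_needed = analysis.solve_power(effect_size=0.2, power=0.80, alpha=0.05,
                                alternative="larger")
print(f"Per-group n for 80% power: {n_needed:.0f}")
```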
Reliability of measurement • Potential rater errors are numerous • Typical statistical estimates can be uninformative (e.g., coefficient alpha, inter-rater correlations) • If measures are unreliable – and especially if samples are also small – you’ll almost always fail to find differences, whether they exist or not
Reliability of measurement Among nine studies using expert ratings: • Only two used multiple raters for all participants • Six studies did not estimate reliability at all • One study reported an inter-rater reliability coefficient • Two studies reported correlations between raters’ scores. Both approaches make unfounded assumptions • Ratings were never collected on multiple occasions
Reliability of measurement RECOMMENDATIONS: • Use robust measurement protocols – e.g., frame-of-reference rater training, multiple raters • For expert ratings, use generalizability theory to estimate and improve reliability G-theory respects a basic truth: “Reliability” is not a single value associated with a measurement tool. Rather, it depends on how you conduct measurement, who is being measured, the type of comparison for which you use the scores, etc.
G-theory process, in a nutshell • Collect ratings, using an experimental design to expose sources of error (e.g., have multiple raters give ratings, on multiple occasions) • Use ANOVA to estimate magnitude of errors • Given results from step 2, forecast what reliability will result from different combinations of raters, occasions, etc.
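To make steps 2 and 3 concrete, here is a minimal sketch of a fully crossed persons x raters G-study using only NumPy. The ratings matrix is hypothetical, and a real G-study would typically add more facets (e.g., occasions) and use dedicated software.

```python
# Minimal p x r (trainees x raters) G-study sketch; ratings are hypothetical.
import numpy as np

ratings = np.array([      # rows = trainees, cols = raters
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 2, 3],
    [4, 4, 5],
], dtype=float)

n_p, n_r = ratings.shape
grand = ratings.mean()
person_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# Two-way ANOVA (no replication): mean squares for persons, raters, residual
ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_res = ((ratings - grand) ** 2).sum() - ss_p - ss_r
ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# Variance components (negative estimates truncated to zero)
var_res = ms_res                          # person x rater interaction + error
var_p = max((ms_p - ms_res) / n_r, 0.0)   # true trainee variance
var_r = max((ms_r - ms_res) / n_p, 0.0)   # rater leniency/severity variance

# Decision study: forecast generalizability (relative decisions) for k raters
for k in (1, 2, 3, 5):
    g_coef = var_p / (var_p + var_res / k)
    print(f"{k} rater(s): E(rho^2) = {g_coef:.2f}")
```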
Weighted scoring • Two studies used weighting schemes – more points associated with more critical procedural steps • Can improve both reliability and validity • RECOMMENDATION: Use task analytic procedures to identify criticality of subtasks; weight scores accordingly
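A toy sketch of what criticality-weighted scoring can look like; the checklist steps and weights below are hypothetical stand-ins for what a task analysis would produce.

```python
# Toy criticality-weighted checklist scoring; steps and weights are hypothetical.
steps = {                        # step -> criticality weight from task analysis
    "confirm indications": 3,
    "select insertion site": 2,
    "maintain sterile field": 3,
    "verify placement": 3,
    "document procedure": 1,
}
performed = {"confirm indications": True, "select insertion site": True,
             "maintain sterile field": False, "verify placement": True,
             "document procedure": True}

score = sum(w for step, w in steps.items() if performed[step])
max_score = sum(steps.values())
print(f"Weighted score: {score}/{max_score} ({100 * score / max_score:.0f}%)")
```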
Selecting error rates Why do we choose α = .05 as the threshold for statistical significance?
Relative severity of errors Type I error (α = .05): “Simulator X is more effective than Simulator Y” (but really, they’re equally effective) Potential outcome: Largely trivial; both are equally effective, so erroneously favoring one does not affect learning or patient outcomes Type II error (β = 1 − power, e.g., 1 − .80 = .20): “Simulators X and Y are equally effective” (but really, Simulator X is superior) Potential outcome: Adverse effects on learning and patient outcomes if Simulator X is consequently underutilized
Relative severity of errors • RECOMMENDATION: Particularly in a new line of research, adopt an alpha level that rationally balances inferential errors according to their severity Cascio, W. F., & Zedeck, S. (1983). Open a new window in rational research planning: Adjust alpha to maximize statistical power. Personnel Psychology, 36, 517-526. Murphy, K. (2004). Using power analysis to evaluate and improve research. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (Chapter 6, pp. 119-137). Malden, MA: Blackwell.
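One way to operationalize this is to scan candidate alpha levels and choose the one that minimizes a weighted combination of the two error rates. The sketch below illustrates that general idea (it is not the exact Cascio & Zedeck procedure); the assumed effect size, pooled multi-site sample size, prior probability of a true effect, and 3:1 severity ratio are all placeholders.

```python
# Rough sketch of rational alpha selection: pick the alpha that minimizes a
# weighted expected loss over Type I and Type II errors. All inputs below
# (effect size, n, prior, severity ratio) are hypothetical placeholders.
import numpy as np
from statsmodels.stats.power import TTestIndPower

d = 0.2                 # assumed true effect size
n_per_group = 400       # e.g., a pooled multi-site sample
p_h1 = 0.5              # assumed prior probability that a real difference exists
severity_ratio = 3.0    # a Type II error judged 3x as costly as a Type I error

analysis = TTestIndPower()
best = None
for alpha in np.arange(0.01, 0.51, 0.01):
    power = analysis.power(effect_size=d, nobs1=n_per_group,
                           alpha=alpha, alternative="larger")
    beta = 1.0 - power
    expected_loss = (1 - p_h1) * alpha + p_h1 * severity_ratio * beta
    if best is None or expected_loss < best[1]:
        best = (alpha, expected_loss, power)

alpha_star, _, power_star = best
print(f"alpha ~= {alpha_star:.2f} minimizes weighted loss (power = {power_star:.2f})")
# Under these made-up inputs the optimum lands well above the conventional .05.
```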
Challenge #2: Proving the null • Language in studies often reflects a desire to assert equivalence • e.g., different simulators are “reaching parity” • Standard null hypothesis significance testing (NHST) does not support this assertion • Failure to detect effects should prompt reservation of judgment, not acceptance of the null hypothesis
Which assertion is more bold? “Sim X is more effective than Sim Y” vs. “Sims X and Y are equally effective” [Diagram: two number lines of possible effect sizes, each running from “Y favored” through 0 to “X favored”]
Proving the null • Possible to prove the null: • Set a region of practical equivalence around zero • Evaluate whether all plausible differences (e.g., 95% confidence interval) fall within the region • RECOMMENDATION: • Avoid unjustified acceptance of the null • Use strong tests of equivalence when hoping to assert equivalence • Be explicit about what effect size you would consider practically significant, and why
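A strong equivalence test can be run with two one-sided tests (TOST). A minimal sketch using statsmodels, with hypothetical scores and a hypothetical equivalence margin of ±0.5 SD; in practice the margin should be the difference you would consider practically significant.

```python
# Minimal equivalence-test (TOST) sketch; scores and margin are hypothetical.
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(0)
scores_sim_x = rng.normal(loc=75, scale=10, size=30)   # hypothetical checklist scores
scores_sim_y = rng.normal(loc=74, scale=10, size=30)

margin = 0.5 * 10   # equivalence region of +/- 0.5 SD, expressed in raw score units
p_value, lower, upper = ttost_ind(scores_sim_x, scores_sim_y, -margin, margin)

# Equivalence is asserted only if BOTH one-sided tests reject, i.e. the overall
# TOST p-value falls below alpha.
verdict = "equivalent within margin" if p_value < 0.05 else "cannot assert equivalence"
print(f"TOST p = {p_value:.3f} -> {verdict}")
```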
Challenge #3: Getting to the ultimate criterion • The goal is not test performance but job performance; “the map is not the territory” • Typical to test demonstration of procedures, often on a simulator • Will trainees perform similarly on actual patients, under authentic work conditions? • Do trainees know when to execute the procedure? • Are trainees willing to act promptly?
e.g.: Roberts et al. (1997) • No differences detected in rate of successful laryngeal mask airway placement for manikin vs. manikin-plus-live-patient training • However: Confidence very low, and only increased with live-patient practice • “…if a nurse does not feel confident enough… the patient will initially receive pocket-mask or bag-mask ventilation, and this is clearly less desirable” – an issue of willingness to act decisively
Criterion relevance • RECOMMENDATION:Where possible, use criterion testbeds that correspond highly to actual job performance • Assess performance on human patients/volunteers • Replicate performance-shaping factors (not just environment) • Test knowledge of indications and willingness to act
What if patients can’t be used? • Using simulators as the criterion testbed introduces potential biases • e.g., train on cadaver or manikin; test on a different manikin
A partial solution: Crossed-criterion design • Advantages • Mitigates bias • Allows comparison of generalization of learning from each training condition • Disadvantages • Precludes pre-testing, if pre-test exposure to each simulator is sufficiently lengthy to derive learning benefits
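As a sketch of one common reading of this design, each trainee is randomized to train on one simulator but post-tested on every criterion testbed, so training effects can be separated from familiarity with the testing simulator. Group sizes and labels below are hypothetical.

```python
# Hypothetical crossed-criterion layout: randomize training condition, then
# post-test every trainee on BOTH criterion testbeds.
from itertools import product
import random

random.seed(1)
trainees = [f"T{i:02d}" for i in range(1, 21)]
random.shuffle(trainees)
half = len(trainees) // 2
assignment = {t: "train_on_A" for t in trainees[:half]}
assignment.update({t: "train_on_B" for t in trainees[half:]})

# Each (trainee, training condition) pair is crossed with both testbeds
test_plan = [(t, cond, testbed)
             for (t, cond), testbed in product(assignment.items(),
                                               ["test_on_A", "test_on_B"])]

for row in test_plan[:4]:
    print(row)   # (trainee, training condition, criterion testbed)
```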
Conclusions • “The greatest enemy of a good plan is the dream of a perfect plan” • All previous comparative research is to be lauded for pushing the field forward • Concrete steps can be taken to maximize the theoretical and practical value of future comparative research