Assessing Intervention Fidelity in RCTs: Concepts and Methods Panelists: David S. Cordray, PhD Chris Hulleman, PhD Joy Lesnick, PhD Vanderbilt University Presentation for the IES Research Conference Washington, DC June 12, 2008
Overview • Session planned as an integrated set of presentations • We’ll begin with: • Definitions and distinctions; • Conceptual foundation for assessing fidelity in RCTs, a special case. • Two examples of assessing implementation fidelity: • Chris Hulleman will illustrate an assessment of an intervention with a single core component; • Joy Lesnick will illustrate additional considerations when fidelity assessment is applied to intervention models with multiple program components. • Issues for the future • Questions and discussion
Dimensions of Intervention Fidelity • Little consensus on what is meant by the term “intervention fidelity”. • But Dane & Schneider (1998) identify 5 aspects: • Adherence/compliance – program components are delivered/used/received as prescribed; • Exposure – amount of program content delivered to/received by participants; • Quality of the delivery – theory-based ideal in terms of processes and content; • Participant responsiveness – engagement of the participants; and • Program differentiation – unique features of the intervention are distinguishable from other programs (including the counterfactual).
Distinguishing Implementation Assessment from Implementation Fidelity Assessment • Two models of intervention implementation, based on: • A purely descriptive model: answering the question “What transpired as the intervention was put in place (implemented)?” • An a priori intervention model, with explicit expectations about implementation of core program components. • Fidelity is the extent to which the realized intervention (tTx) is “faithful” to the pre-stated intervention model (TTx); the discrepancy TTx – tTx indexes infidelity. • We emphasize the latter model
What to Measure? • Adherence to the intervention model: • (1) Essential or core components (activities, processes); • (2) Activities, processes, and structures that are necessary, but not unique to the theory/model (supporting the essential components of T); and • (3) Ordinary features of the setting (shared with the counterfactual group, C). • Essential/core and necessary components are the priority parts of fidelity assessment.
An Example of Core Components: Bransford’s HPL Model of Learning and Instruction • John Bransford et al. (1999) postulate that a strong learning environment entails a combination of: • Knowledge-centered; • Learner-centered; • Assessment-centered; and • Community-centered components. • Alene Harris developed an observation system (the VOS) that registered novel (the components above) and traditional pedagogy in classes. • The next slide focuses on the prevalence of Bransford’s recommended pedagogy.
Challenge-based Instruction in “Treatment” and Control Courses: The VaNTH Observation System (VOS) • [Chart: percentage of course time using challenge-based instructional strategies in treatment and control courses. Adapted from Cox & Cordray, in press.]
Implications • Fidelity can be assessed even when there is no known benchmark (e.g., 10 Commandments) • In practice, interventions can be a mixture of components with strong, weak, or no benchmarks • Control conditions can include core intervention components due to: • Contamination • Business as usual (BAU) containing shared components at different levels • Similar theories or models of action • But to index “fidelity”, we need to measure components within the control condition
Linking Intervention Fidelity Assessment to Contemporary Models of Causality • Rubin’s Causal Model: • The true causal effect of X for unit i is (YiTx – YiC) • RCT methodology is the best approximation to the true effect • Fidelity assessment within RCT-based causal analysis entails examining the difference between causal components in the intervention and counterfactual conditions. • Differencing the causal conditions can be characterized as the “achieved relative strength” of the contrast: • Achieved Relative Strength (ARS) = tTx – tC • ARS is a default index of fidelity
[Figure: treatment strength (0–100) plotted against the outcome. The realized treatment (tTx = 85) falls short of the intended treatment benchmark (TTx = 100), and the realized control (tC = 70) departs from the intended control benchmark (TC); both gaps are labeled “infidelity.” Achieved relative strength = (85) – (70) = 15 points (.15), versus an expected relative strength of .25.]
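A minimal arithmetic sketch of the quantities in this figure; the numbers are read off the graphic and the control benchmark TC is inferred from the stated expected relative strength, so treat them as illustrative:

```python
# Illustrative treatment-strength values from the figure (0-100 scale).
T_tx = 100   # intended (benchmark) strength of the treatment condition
t_tx = 85    # realized strength of the treatment condition
T_c = 75     # intended strength of the control condition (inferred, not stated)
t_c = 70     # realized strength of the control condition

infidelity_tx = T_tx - t_tx        # 15: treatment-side infidelity
infidelity_c = T_c - t_c           # 5: control-side "infidelity"
expected_rs = (T_tx - T_c) / 100   # 0.25: expected relative strength
achieved_rs = (t_tx - t_c) / 100   # 0.15: achieved relative strength (ARS)
print(expected_rs, achieved_rs)    # 0.25 0.15
```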
In Practice…. • Identify core components in both groups • e.g., via a Model of Change • Establish benchmarks for TTx and TC • Measure core components to derive tTx and tC • e.g., via a “Logic Model” based on the Model of Change • With multiple components and multiple methods of assessment, achieved relative strength needs to be: • Standardized, and • Combined across: • Multiple indicators • Multiple components • Multiple levels (HLM-wise) • We turn to our examples….
Assessing Implementation Fidelity in the Lab and in Classrooms: The Case of a Motivation Intervention Chris S. Hulleman Vanderbilt University
The Theory of Change • [Path diagram: Manipulated Relevance → Perceived Utility Value → Interest and Performance] • Adapted from: Hulleman (2008); Hulleman, Godes, Hendricks, & Harackiewicz (2008); Hulleman & Harackiewicz (2008); Hulleman, Hendricks, & Harackiewicz (2007); Eccles et al. (1983); Wigfield & Eccles (2002)
Motivational Outcome • [Chart: effect of the intervention on the motivational outcome in the classroom study, g = 0.05 (p = .67)]
Fidelity Measurement and Achieved Relative Strength • Simple intervention – one core component • Intervention fidelity: • Defined as “quality of participant responsiveness” • Rated on a scale from 0 (none) to 3 (high) • 2 independent raters, 88% agreement
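As a concrete sketch, exact agreement between two raters on the 0–3 scale can be computed as the proportion of cases coded identically; the ratings below are invented for illustration, not the study’s data:

```python
import numpy as np

# Hypothetical 0-3 quality-of-responsiveness codes from two independent raters.
rater1 = np.array([3, 2, 0, 1, 3, 2, 2, 0])
rater2 = np.array([3, 2, 1, 1, 3, 2, 2, 0])

percent_agreement = np.mean(rater1 == rater2)   # proportion of identical codes
print(f"{percent_agreement:.0%} exact agreement")
```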
Indexing Fidelity • Absolute: compare observed fidelity (tTx) to the absolute or maximum level of fidelity (TTx) • Average: mean level of observed fidelity (tTx) • Binary: yes/no treatment receipt based on fidelity scores; requires selection of a cut-off value
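A short sketch of the three indices applied to hypothetical 0–3 fidelity scores; the scores and the cut-off of 2 for the binary index are assumptions for illustration:

```python
import numpy as np

scores = np.array([3, 2, 0, 1, 3, 2, 2, 0], dtype=float)  # hypothetical treatment-group fidelity scores
max_fidelity = 3.0   # benchmark T_Tx: the maximum possible score
cutoff = 2.0         # illustrative cut-off for "received the treatment"

absolute_index = scores.mean() / max_fidelity   # observed fidelity relative to the maximum
average_index = scores.mean()                   # mean observed fidelity (t_Tx)
binary_index = (scores >= cutoff).mean()        # proportion classified as treatment recipients

print(absolute_index, average_index, binary_index)
```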
Indexing Fidelity as Achieved Relative Strength • Intervention strength = Treatment – Control • Achieved Relative Strength (ARS) Index: • Standardized difference in the fidelity index across Tx and C • Based on Hedges’ g (Hedges, 2007) • Corrected for clustering in the classroom (ICCs from .01 to .08)
Average ARS Index • The index is the group difference, scaled by a sample-size adjustment and a clustering adjustment: • ARS_avg = [(mean tTx – mean tC) / ST] × [1 – 3/(4N – 9)] × √(1 – 2(n – 1)ρ/(N – 2)) • Where: mean tTx = fidelity mean for the treatment group; mean tC = fidelity mean for the control group; ST = pooled within-groups standard deviation; nTx = treatment sample size; nC = control sample size; n = average cluster size; ρ = intra-class correlation (ICC); N = total sample size
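A sketch of the average ARS index as reconstructed above; the exact form of the sample-size and clustering adjustments follows Hedges (2007) and is an assumption, and all numbers in the example call are invented:

```python
import numpy as np

def average_ars(mean_tx, mean_c, sd_pooled, n_tx, n_c, cluster_size, icc):
    """Standardized difference in mean fidelity between treatment and control,
    with a small-sample (Hedges' g) correction and a clustering adjustment."""
    N = n_tx + n_c
    group_diff = (mean_tx - mean_c) / sd_pooled                           # group difference
    sample_size_adj = 1 - 3 / (4 * N - 9)                                 # sample-size adjustment
    clustering_adj = np.sqrt(1 - 2 * (cluster_size - 1) * icc / (N - 2))  # clustering adjustment
    return group_diff * sample_size_adj * clustering_adj

# Illustrative values only (not the study's data).
print(average_ars(mean_tx=2.1, mean_c=1.4, sd_pooled=0.8,
                  n_tx=75, n_c=78, cluster_size=25, icc=0.05))
```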
Absolute and Binary ARS Indices • Same structure as the average index – group difference × sample-size adjustment × clustering adjustment – with group proportions in place of group means. • Where: pTx = proportion for the treatment group (tTx); pC = proportion for the control group (tC); nTx = treatment sample size; nC = control sample size; n = average cluster size; ρ = intra-class correlation (ICC); N = total sample size
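A parallel sketch for a proportion-based (absolute or binary) index; standardizing the difference in proportions by the pooled Bernoulli standard deviation is an assumption here, not necessarily the slide’s exact formula, and the values are illustrative:

```python
import numpy as np

def proportion_ars(p_tx, p_c, n_tx, n_c, cluster_size, icc):
    """ARS for a binary/absolute fidelity index: difference in proportions,
    standardized and then adjusted as in the average index."""
    N = n_tx + n_c
    p_pooled = (n_tx * p_tx + n_c * p_c) / N
    sd_pooled = np.sqrt(p_pooled * (1 - p_pooled))   # SD of the pooled binary indicator
    group_diff = (p_tx - p_c) / sd_pooled
    sample_size_adj = 1 - 3 / (4 * N - 9)
    clustering_adj = np.sqrt(1 - 2 * (cluster_size - 1) * icc / (N - 2))
    return group_diff * sample_size_adj * clustering_adj

# Illustrative values only.
print(proportion_ars(p_tx=0.70, p_c=0.10, n_tx=75, n_c=78, cluster_size=25, icc=0.05))
```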
Average ARS Index • [Figure: average fidelity on the 0–3 scale plotted against treatment strength (0–100). Observed treatment fidelity tTx = 0.74 versus observed control fidelity tC = 0.04, a raw difference of 0.70; both fall below their benchmarks (TTx, TC), with the gaps labeled “infidelity.” Standardized, the achieved relative strength = 1.32.]
Sources of Infidelity in the Classroom • Student behaviors were nested within teacher behaviors • Candidate predictors: teacher dosage (teacher level) and frequency of responsiveness (student level) • Student and teacher behaviors were used to predict treatment fidelity (i.e., quality of responsiveness)
Sources of Infidelity: Multi-level Analyses • Part I: Baseline Analyses • Identified the amount of residual variability in fidelity due to students and teachers • Due to missing data, we estimated a 2-level model (153 students, 6 teachers) • Student: Yij = b0j + b1j(TREATMENT)ij + rij • Teacher: b0j = γ00 + u0j; b1j = γ10 + u1j
Sources of Infidelity: Multi-level Analyses • Part II: Explanatory Analyses • Predicted residual variability in fidelity (quality of responsiveness) with frequency of responsiveness and teacher dosage • Student: Yij = b0j + b1j(TREATMENT)ij + b2j(RESPONSE FREQUENCY)ij + rij • Teacher: b0j = γ00 + u0j; b1j = γ10 + γ11(TEACHER DOSAGE)j + u1j; b2j = γ20 + γ21(TEACHER DOSAGE)j + u2j
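One way to fit both the baseline and explanatory models is with a general mixed-effects package; the sketch below uses Python’s statsmodels with hypothetical file and column names (the presenters’ own software and variable names are not specified here):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per student, with columns
#   fidelity (quality of responsiveness), treatment (0/1),
#   resp_freq (student frequency of responsiveness),
#   dosage (teacher-level dosage), teacher (cluster id).
df = pd.read_csv("fidelity_students.csv")

# Part I: baseline model with a random intercept and random treatment slope
# across teachers, to partition residual variability in fidelity.
baseline = smf.mixedlm("fidelity ~ treatment", data=df,
                       groups=df["teacher"], re_formula="~treatment").fit()
print(baseline.summary())

# Part II: explanatory model adding student response frequency and
# cross-level interactions with teacher dosage (dosage moderating the slopes).
explanatory = smf.mixedlm("fidelity ~ treatment * dosage + resp_freq * dosage",
                          data=df, groups=df["teacher"], re_formula="~treatment").fit()
print(explanatory.summary())
```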
Sources of Infidelity: Multi-level Analyses • [Table of multi-level model estimates omitted; * p < .001]
Case Summary • The motivational intervention was more effective in the lab (g = 0.45) than in the field (g = 0.05). • Using three indices of fidelity, and in turn achieved relative treatment strength, revealed that: • Classroom fidelity < lab fidelity • Achieved relative strength was about 1 SD less in the classroom than in the laboratory • Differences in achieved relative strength corresponded to differences in the motivational outcome, especially in the lab. • Sources of infidelity were teacher (not student) factors
Assessing Fidelity of Interventions with Multiple Components: A Case of Assessing Preschool Interventions Joy Lesnick
What Do We Mean By Multiple Components in Preschool Literacy Programs? • How do you define preschool instruction? Academic content, materials, student-teacher interactions, student-student interactions, physical development, schedules & routines, assessment, family involvement, etc. • How would you measure implementation? • Preschool interventions: • Are made up of components (e.g., sets of activities and processes) that can be thought of as constructs; • These constructs vary in meaning across actors (e.g., developers, implementers, researchers); • They are of varying levels of importance within the intervention; and • These constructs are made up of smaller parts that need to be assessed. • Multiple components make assessing fidelity more challenging.
Overview • Four areas of consideration when assessing fidelity of programs with multiple components: • Specifying multiple components • Major variations in program components • The ABCs of item and scale construction • Aggregating indices • One caveat: very unusual circumstances • Goal of this work: to build on the extensive evaluation work that had already been completed and use the case study to provide a framework for future efforts to measure fidelity of implementation.
1. Specifying Multiple Components • Our process: • Extensive review of program materials • Potentially hundreds of components • How many indicators do we need to assess fidelity?
1. Specifying Multiple Components • [Diagram: the program specification fans out from Constructs to Sub-Constructs to Facets to Elements to Indicators – for example, Instruction → Content → Literacy → Oral Language. Other branches include Math, Social & Personal Development, Healthful Living, Scientific Thinking, Social Studies, Creative Arts, Physical Development, Technology, Family Involvement; book and print awareness, phonemic awareness, letter and word recognition, writing; interactions between teacher and child, physical environment, routines and classroom management, materials, structured lessons, structured units; and Assessment and Processes.]
Grain Size Is Important • Conceptual differences between programs may occur at micro levels • Empirical differences in program implementation may occur at more macro levels • Theoretically expected differences vs. empirically observed differences • Conceptual differences between programs must be identified at the smallest grain size at the outset, although empirical differences may only be detectable at higher, more macro levels once the programs are implemented.
2. Major Variations in Program Components • One program often has some combination of these different types of components: • Scripted (highly structured) activities • Unscripted (unstructured) activities • Nesting of activities • Micro-level (discrete) activities • Macro-level (extended) activities • What you’re trying to measure will influence how to measure it – and how often it needs to be measured.
2. Major Variations in Program Components • Abs (“absolute fidelity” index): what happened compared to what should have happened – the highest standard. • Avg (average): magnitude or exposure level; indicates what happened, but on its own it is not very meaningful – how do we know whether the level is good or bad? • Bin (binary complier): can we set a benchmark to determine whether or not a program component was successfully implemented? >30%, for example? Is that realistic? Meaningful? • ARS: difference in magnitude between Tx and C – relative strength – is there enough of a difference to warrant a treatment effect?
We must measure the trees… and also the forest… • Micro-level (discrete) activities: depending on the condition, daily activities (i.e., whole group time, small group time, center activities) may be scripted or unscripted and take place within the larger structure of the theme under study. • Macro-level (extended) activities: the month-long thematic unit (structured in the treatment condition, unstructured in the control) is the underlying extended structure within which scripted or unscripted micro activities take place. • In multi-component programs, many activities are nested within larger activity structures. This nesting has implications for fidelity analysis – what to measure and how to measure it.
3. The ABCs of Item and Scale Construction • Aim for one-to-one correspondence of indicators to the component of interest • Balance items across components • Coverage and quality are more important than the quantity of items
Aim for one-to-one correspondence • Example of more than one component being assessed in one item: [Does the teacher] Talk with children throughout the day, modeling correct grammar, teaching new vocabulary, and asking questions to encourage children to express their ideas in words? (Yes/No) • Examples of one component being measured in each item: • Teacher provides an environment wherein students can talk about what they are doing. • Teacher listens attentively to students’ discussions and responses. • Teacher models and/or encourages students to ask questions during class discussions. • Difference between T and C (Oral Language)*: T: 1.80 (0.32), C: 1.36 (0.32), ARS ES = 1.38; T: 3.45 (0.87), C: 2.26 (0.57), ARS ES = 1.62 • *Data for the case study come from an evaluation conducted by Dale Farran, Mark Lipsey, Carol Bilbrey, et al.
Balance items across components • How many items are needed for each scale? • Oral language is over-represented • Scales with α < 0.80 are not considered reliable
Coverage and quality are more important than quantity • Two scales can each have 2 items but very different levels of reliability • How many items are needed for each scale? Oral Language: 20 items. Randomly selecting items and recalculating alpha: • 10 items: α = 0.92 • 8 items: α = 0.90 • 6 items: α = 0.88 • 5 items: α = 0.82 • 4 items: α = 0.73
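A small simulation sketch of this item-dropping exercise: Cronbach’s alpha for a 20-item scale, recomputed on randomly selected subsets. The data are simulated, so the alphas will not match the slide’s values exactly:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

rng = np.random.default_rng(0)
true_score = rng.normal(size=(60, 1))                       # simulated classroom-level construct
items = true_score + rng.normal(scale=1.2, size=(60, 20))   # 20 noisy items tapping it

for k in (20, 10, 8, 6, 5, 4):
    subset = rng.choice(20, size=k, replace=False)          # randomly drop items
    print(k, "items: alpha =", round(cronbach_alpha(items[:, subset]), 2))
```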
Aggregating Indices • To weight or not to weight? How do we decide? Possibilities: • Theory • Consensus • $$ spent • Time spent • Case study example – two levels of aggregation, within and between: • Unit-weight within facet: “Instruction – Content – Literacy” • Hypothetical weight across sub-construct: “Instruction – Content”
YOU ARE HERE…. • [The specification diagram from “1. Specifying Multiple Components,” annotated to show where weighting decisions arise: unit weights within the facet (“Instruction – Content – Literacy”), theory weights across the sub-construct (“Instruction – Content”), and open “how to weight?” questions at the higher levels.]
Aggregating Indices • Unit-weight within facet: Instruction – Content – Literacy • **Clustering is ignored in this example
Aggregating Indices • Theory-weight across sub-construct (hypothetical)
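A sketch of the two aggregation steps under hypothetical values: unit weights (a simple average) across the literacy indicators, then theory-based weights across the Content sub-construct’s facets. The weights and effect sizes below are invented for illustration, not the case-study results:

```python
import numpy as np

# Unit weights within the Literacy facet: a simple average of the
# indicator-level ARS effect sizes (values are illustrative).
literacy_indicator_ars = np.array([1.38, 1.62, 0.95, 1.10])
literacy_facet_ars = literacy_indicator_ars.mean()

# Theory weights across the Content sub-construct: each facet's ARS is
# weighted by its hypothesized importance (weights are hypothetical and sum to 1).
facet_ars = {"literacy": literacy_facet_ars, "math": 0.40, "social_personal": 0.25}
theory_weights = {"literacy": 0.6, "math": 0.3, "social_personal": 0.1}

content_ars = sum(theory_weights[f] * facet_ars[f] for f in facet_ars)
print(round(literacy_facet_ars, 2), round(content_ars, 2))
```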