Operational Data or Experimental Design? A Variety of Approaches to Examining the Validity of Test Accommodations
Cara Cahalan-Laitusis
Review types of evidence • Review current research designs • Pros/Cons for each approach
Types of Validity Evidence • Psychometric research • Experimental research • Survey research • Argument based approach
Psychometric Indicators (National Academy of Sciences, 1982) • Reliability • Factor Structure • Item functioning • Predicted Performance • Admission Decisions
Psychometric Evidence • Is the test as reliable when taken with and without accommodations? (Reliability) • Does the test (or test items) appear to measure the same construct for each group? (Validity) • Are test items of relatively equal difficulty for students with and without a disability who are matched on total test score? (Fairness/Validity)
Psychometric Evidence • Are completion rates relatively equal between students with and without a disability who are matched on total test score? (Fairness) • Is equal access provided to testing accommodations across different disability, racial/ethnic, language, gender, and socio-economic groups? (Fairness) • Do test scores under- or over-predict an alternate measure of performance (e.g., grades, teacher ratings, other test scores, postgraduate success) for students with disabilities? (Validity)
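A minimal sketch of the under-/over-prediction check (a Cleary-style differential prediction analysis), assuming hypothetical arrays test_scores, criterion (e.g., grades or another performance measure), and a boolean mask focal flagging students with disabilities; the prediction line is fit on the reference group, and a non-zero mean residual for the focal group suggests systematic mis-prediction:

```python
import numpy as np

def differential_prediction(test_scores, criterion, focal):
    """Mean focal-group residual under the reference-group regression.
    Positive -> the test under-predicts the focal group's performance;
    negative -> it over-predicts."""
    ref = ~focal
    # Regression of criterion on test score, reference group only
    slope, intercept = np.polyfit(test_scores[ref], criterion[ref], deg=1)
    residuals = criterion - (slope * test_scores + intercept)
    return residuals[focal].mean()
```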
Advantages of Operational Data • Cost effective • Quick results • Easy to replicate • Provides evidence of validity • Large sample size • Motivated test takers
Limitations of Operational Data • Disability and accommodation are confounded • Order effects cannot be controlled for • Sample size can be insufficient • Difficult to show reasons why data are not comparable between subgroups • Disability and accommodation codes are not always accurate • Approved accommodations may not be used • Disability category may be too broad
Types of Analyses • Correlations • Factor Analysis • Differential Item Functioning • Descriptive analyses
Relationship Among Content Areas • Correlation between content areas (e.g., reading and writing) can also assess a test's reliability. • Compare correlations among content areas by population (e.g., LD with read aloud vs. LD without an accommodation) • Does the accommodation alter the construct being measured? (e.g., correlations between reading and writing may be lower if read aloud is used for writing but not reading) • Is the correlation significantly lower for one population? (a difference of .10 or greater)
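A minimal sketch of that comparison, assuming hypothetical reading-writing correlations for two populations; Fisher's r-to-z transformation supplies a significance test to accompany the .10 rule of thumb:

```python
import numpy as np
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    """Test whether two independent correlations differ (Fisher's z)."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)    # Fisher r-to-z
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # SE of the z difference
    z = (z1 - z2) / se
    return z, 2 * norm.sf(abs(z))              # statistic, two-tailed p

# Hypothetical: r = .65 for LD with read aloud (n = 250)
# vs. r = .78 for LD without an accommodation (n = 400)
z, p = compare_correlations(0.65, 250, 0.78, 400)
```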
Reliability • Examine internal consistency measures • with and without specific accommodations • with and without a disability • Examine test-retest reliability between different populations • with and without specific accommodations • with and without a disability
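A minimal sketch of one internal-consistency check, computing Cronbach's alpha from a hypothetical items matrix (rows = examinees in a single group, columns = scored items); running it separately by accommodation and disability status allows the reliabilities to be compared:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (examinees x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)
```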
Factor Structure • Types of questions • Are the number of factors invariant? • Are the factor loadings invariant for each of the groups? • Are the intercorrelations of the factors invariant for each of the groups?
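A minimal sketch of a loading-invariance check, assuming hypothetical score matrices group_a and group_b; it fits the same number of factors in each group and compares matched loadings with Tucker's congruence coefficient (values near 1.0 suggest invariance). This simple version ignores rotational alignment, which a fuller analysis would handle before comparing:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def tucker_phi(load_a, load_b):
    """Congruence between matched rows of two (factors x items) matrices."""
    num = (load_a * load_b).sum(axis=1)
    den = np.sqrt((load_a ** 2).sum(axis=1) * (load_b ** 2).sum(axis=1))
    return num / den

def loading_invariance(group_a, group_b, n_factors=2):
    fa_a = FactorAnalysis(n_components=n_factors).fit(group_a)
    fa_b = FactorAnalysis(n_components=n_factors).fit(group_b)
    # components_ holds the (factors x items) loading matrix
    return tucker_phi(fa_a.components_, fa_b.components_)
```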
Differential Item Functioning • DIF refers to a difference in item performance between two comparable groups of test takers • DIF exists if test takers who have the same underlying ability level are not equally likely to get an item correct • Some recent DIF studies on accommodations/disability • Bielinski, Thurlow, Ysseldyke, Freidebach & Friedebach, 2001 • Bolt, 2004 • Barton & Finch, 2004 • Cahalan-Laitusis, Cook, & Aicher, 2004
Issues Related to the Use of DIF Procedures for Students with Disabilities • Group characteristics • Definition of group membership • Differences between ability levels of reference and focal groups • The characteristics of the criterion • Unidimensional • Reliable • Same meaning across groups
Procedures/Sample • DIF procedures (e.g., Mantel-Haenszel, logistic regression, DIF analysis paradigm, SIBTEST) • Reference/focal groups: minimum of 100 per group; ETS uses a minimum of 300 for most operational tests • Select groups that are specific (e.g., LD with read aloud) rather than broad (e.g., all students with an IEP or 504 plan)
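A minimal sketch of Mantel-Haenszel DIF for a single item, assuming hypothetical arrays correct (0/1 on the studied item), total (the matching score), and a boolean focal for group membership; examinees are stratified on total score and the common odds ratio is mapped onto the ETS delta scale:

```python
import numpy as np

def mh_d_dif(correct, total, focal):
    """Mantel-Haenszel D-DIF for one item, stratified on total score.
    Negative values flag items that are harder for the focal group."""
    correct, total, focal = map(np.asarray, (correct, total, focal))
    num = den = 0.0
    for score in np.unique(total):
        stratum = total == score
        ref, foc = stratum & ~focal, stratum & focal
        a, b = correct[ref].sum(), (1 - correct[ref]).sum()  # ref right/wrong
        c, d = correct[foc].sum(), (1 - correct[foc]).sum()  # focal right/wrong
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    alpha_mh = num / den              # common odds ratio across strata
    return -2.35 * np.log(alpha_mh)   # ETS delta scale (|D-DIF| > 1.5 is large)
```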
DIF with hypotheses • Generate hypotheses on why items may function differently • Code items based on hypotheses • Compare DIF results with item coding • Examine DIF results to generate new hypotheses
Other Psychometric Research • DIF to examine fatigue effects under extended time • Item completion rates between groups matched on ability • Log-linear analysis to examine whether specific demographic subgroups (SES, race/ethnicity, geographic region, gender) use specific accommodations less often than other groups
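A minimal sketch of that access question, substituting a simpler chi-square test of independence for a full log-linear model, with a hypothetical contingency table of accommodation use by subgroup:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = subgroups, cols = [used accommodation, did not]
table = np.array([[120, 480],
                  [ 35, 365],
                  [ 60, 340]])
chi2, p, dof, expected = chi2_contingency(table)
# A small p suggests accommodation use differs across subgroups;
# standardized residuals show which cells drive the difference.
resid = (table - expected) / np.sqrt(expected)
```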
Other Research Studies • Experimental Research • Differential Boost • Survey/Field Test Research • Argument-based Evidence
Advantages of Collecting Data • Disability and accommodation can be examined separately • Form and Order effects can be controlled • Sample can be specific (e.g., reading-based LD rather than all LD or LD with or without ADHD) • Opportunity to collect additional information • Reasons for differences can be tested • Data can be reused for psychometric analyses
Disadvantages • Cost of large data collection • Test takers may not be as motivated • More time consuming than psychometric research • Over-testing of students
Differential Boost (Fuchs & Fuchs, 1999) • Would students without disabilities benefit as much from the accommodation as students with disabilities? • If yes, the accommodation is not valid. • If no, the accommodation may be valid.
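A minimal sketch of a differential-boost analysis as a 2 x 2 (disability x accommodation) ANOVA, assuming a hypothetical DataFrame df with columns score, disability, and accommodation; the pattern the Fuchs & Fuchs criterion looks for is a significant interaction with the larger accommodation gain going to students with disabilities:

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def differential_boost(df):
    """Two-way ANOVA on scores; the disability-by-accommodation
    interaction term is the differential-boost test."""
    model = smf.ols("score ~ C(disability) * C(accommodation)", data=df).fit()
    return anova_lm(model, typ=2)  # inspect the interaction row's p-value
```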
Ways to reduce cost: • Decrease sample size • Randomly assign students to one of two conditions • Use operational test data for one of the two sessions
Additional data to collect: • Alternate measure of performance on construct being assessed • Teacher survey (ratings of student performance, history of accommodation use) • Student survey • Observational data (how student used accommodation) • Timing data
Additional Analyses • Differential Boost • by subgroups • controlling for ability level • Psychometric properties (e.g., DIF) • Predictive validity (alternate performance measure required)
Field Testing Survey • How well does item type measure intended construct (e.g., reading comprehension, problem solving)? • Did you have enough time to complete this item type? • How clear were the directions (for this type of test question)?
Field Testing Survey • How would you improve this item type? • To make the directions clearer • To measure the intended construct • What specific accommodations would improve this item type? • Which presentation approach did the test takers prefer?
Additional Types of Surveys • How accommodation decisions are made • Expert opinion on how/if an accommodation interferes with the construct being measured • Information on how test scores with and without accommodations are interpreted • Correlation between use of accommodations in class and on standardized tests
Additional Research Designs • Think Aloud Studies or Cognitive Labs • Item Timing Studies • Scaffolded Accommodations
Argument-Based Validity • Clearly Define Construct Assessed • Evidence Centered Design • Decision Tree