Empirical Bayes DIF Assessment Rebecca Zwick, UC Santa Barbara. Presented at Measured Progress August 2007. Overview. Definition and causes of DIF Assessing DIF via Mantel-Haenszel EB enhancement to MH DIF (1994-2002, with D. Thayer & C. Lewis) Model and Applications Simulation findings
Empirical Bayes DIF Assessment. Rebecca Zwick, UC Santa Barbara. Presented at Measured Progress, August 2007.
Overview • Definition and causes of DIF • Assessing DIF via Mantel-Haenszel • EB enhancement to MH DIF (1994-2002, with D. Thayer & C. Lewis) • Model and Applications • Simulation findings • Discussion
What’s differential item functioning? • DIF occurs when equally skilled members of 2 groups have different probabilities of answering an item correctly. (Only dichotomous items are considered today.)
IRT Definition of (absence of) DIF • Lord, 1980: $P(Y_i = 1 \mid \theta, R) = P(Y_i = 1 \mid \theta, F)$ means DIF is absent • $P(Y_i = 1 \mid \theta, G)$ is the probability of a correct response to item i, given $\theta$, in group G, G = F (focal) or R (reference). • $\theta$ is a latent ability variable, imperfectly measured by test score S. (More later...)
Reasons for DIF • “Construct-irrelevant difficulty” (e.g., sports content in a math item) • Differential interests or educational background: NAEP History items with DIF favoring Black test-takers were about M. L. King, Harriet Tubman, Underground Railroad (Zwick & Ercikan, 1989) • Often mystifying (e.g., “X + 5 = 10” has DIF; “Y + 8 = 11” doesn’t)
Mini-history of DIF analysis: • DIF research dates back to 1960’s • In late 1980’s (“Golden Rule”), testing companies started including DIF analysis as a QC procedure. • Mantel-Haenszel (Holland & Thayer, 1988): method of choice for operational DIF analyses • Few assumptions • No complex estimation procedures • Easy to explain
Mantel-Haenszel: • Compare item performance for members of 2 groups, after matching on total test score, S. • Suppose we have K levels of the score used for matching test-takers, s1, s2, …sK • In each of the K levels, data can be represented as a 2 x 2 table (Right/Wrong by Reference/Focal).
Mantel-Haenszel • For each table, compute the conditional odds ratio = (odds of correct response | $S = s_k$, G = R) / (odds of correct response | $S = s_k$, G = F) • A weighted combination of these K values is the MH odds ratio, $\hat{\alpha}_{MH}$ • The MH DIF statistic is $-2.35 \ln(\hat{\alpha}_{MH})$
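To make the weighted combination concrete, here is a minimal Python sketch of the standard Mantel-Haenszel common odds ratio and the resulting MH DIF statistic. The function name, the tuple layout of each 2 x 2 table, and the counts are illustrative assumptions, not taken from the talk.

```python
import math

def mh_d_dif(tables):
    """Return (MH odds ratio, MH DIF statistic).

    tables: one tuple per score level, laid out (assumed convention) as
    (ref_right, ref_wrong, focal_right, focal_wrong).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n  # reference right * focal wrong
        den += b * c / n  # reference wrong * focal right
    alpha_mh = num / den
    return alpha_mh, -2.35 * math.log(alpha_mh)

# Made-up counts for K = 3 score levels (illustration only).
tables = [(20, 30, 10, 40), (40, 20, 25, 25), (45, 5, 30, 10)]
alpha_mh, d_dif = mh_d_dif(tables)
print(round(alpha_mh, 3), round(d_dif, 3))
```

With these made-up counts the reference group has higher odds of success at every score level, so the odds ratio exceeds 1 and the MH DIF statistic is negative, i.e., DIF against the focal group.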
Mantel-Haenszel • The MH chi-square tests the hypothesis $H_0: \alpha_k = \alpha = 1$, k = 1, 2, …, K versus $H_1: \alpha_k = \alpha \neq 1$, k = 1, 2, …, K, where $\alpha_k$ is the population odds ratio at score level k. (The above $H_0$ is similar, but not, in general, identical to the IRT $H_0$; see Zwick, 1990, Journal of Educational Statistics.)
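As a rough illustration of the test, the sketch below computes the usual continuity-corrected MH chi-square (1 degree of freedom) from the K tables. The table layout, function name, and counts are assumptions made for the example; the 3.84 cutoff is the standard .05 critical value for 1 df.

```python
def mh_chi_square(tables):
    """Continuity-corrected Mantel-Haenszel chi-square.

    tables: one tuple per score level, laid out (assumed convention) as
    (ref_right, ref_wrong, focal_right, focal_wrong).
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        n = n_ref + n_foc
        if n < 2:
            continue  # variance is undefined for fewer than 2 examinees
        sum_a += a                                   # observed reference-group rights
        sum_e += n_ref * m_right / n                 # expected value under H0
        sum_v += n_ref * n_foc * m_right * m_wrong / (n * n * (n - 1))
    return (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v

# Made-up counts for K = 3 score levels (illustration only).
tables = [(20, 30, 10, 40), (40, 20, 25, 25), (45, 5, 30, 10)]
chi2 = mh_chi_square(tables)
print(round(chi2, 2), chi2 > 3.84)
```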
Mantel-Haenszel • ETS: Size of DIF estimate, plus chi-square results are used to categorize item: • A: negligible DIF • B: slight to moderate DIF • C: substantial DIF • For B and C, “+” or “-” used to indicate DIF direction: “-” means DIF against focal group. • Designation determines item’s fate.
Drawbacks to usual MH approach • May give impression that DIF status is deterministic or is a fixed property of the item • Reviewers of DIF items often ignore SE • Is unstable in small samples, which may arise in CAT settings
EB enhancement to MH: • Provides more stable results • May allow variability of DIF findings to be represented in a more intuitive way • Can be used in three ways • Substitute more stable point estimates for MH • Provide probabilistic perspective on true DIF status (A, B, C) and future observed status • [Loss-function-based DIF detection]
Main Empirical Bayes DIF Work (supported by ETS and LSAC) • An EB approach to MH DIF analysis (with Thayer & Lewis). JEM, 1999. [General approach, probabilistic DIF] • Using loss functions for DIF detection: An EB approach (with Thayer & Lewis). JEBS, 2000. [Loss functions] • The assessment of DIF in CATs. In van der Linden & Glas (Eds.) CAT: Theory and Practice, 2000. [review] • Application of an EB enhancement of MH DIF analysis to a CAT (with Thayer). APM, 2002. [simulated CAT-LSAT]
What’s an Empirical Bayes Model? (See Casella (1985), Am. Statistician) • In Bayesian statistics, we assume that parameters have prior distributions that describe parameter “behavior.” • Statistical theory or past research may inform us about the nature of those distributions. • Combining observed data with the prior distribution yields a posterior (“after the data”) distribution that can be used to obtain improved parameter estimates. • “EB” means the prior’s parameters are estimated from data (unlike fully Bayes models).
Recall: EB DIF estimate is a weighted combination of MHi and prior mean.
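The weighted combination can be sketched as follows, assuming a normal-normal model: each MH statistic is treated as normal around the item's true DIF with variance SE², and true DIF values follow a normal prior with mean `mu` and variance `tau2`. The moment estimators for the prior (mean of the MH values; variance of the MH values minus the mean squared SE, floored at 0) are one common choice and are an assumption here, as are the function names and the data.

```python
from statistics import mean, variance

def estimate_prior(mh, se):
    """Estimate prior mean and variance from the items' own MH results (assumed estimators)."""
    mu = mean(mh)
    tau2 = max(variance(mh) - mean(s * s for s in se), 0.0)
    return mu, tau2

def eb_posterior(mh_i, se_i, mu, tau2):
    """Posterior mean (the EB estimate) and posterior SD for one item."""
    s2 = se_i * se_i
    w = tau2 / (tau2 + s2)             # weight on the observed MH statistic
    eb_i = w * mh_i + (1.0 - w) * mu   # shrinks toward the prior mean
    return eb_i, (w * s2) ** 0.5

# Hypothetical MH statistics and standard errors for five items.
mh = [-1.8, -0.4, 0.1, 0.6, 2.5]
se = [0.9, 0.5, 0.4, 0.6, 1.5]
mu, tau2 = estimate_prior(mh, se)
for m, s in zip(mh, se):
    print(m, s, eb_posterior(m, s, mu, tau2))
```

Items with large standard errors get a small weight and are pulled strongly toward the prior mean; items with small standard errors stay close to their MH values.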
Next… • Performance of EB DIF estimator • “Probabilistic DIF” idea
How does EB DIF estimator EBi compare to MHi? • Applied to real data, including GRE • Applied to simulated data, including simulated CAT-LSAT (Zwick & Thayer, 2002): • Testlet CAT data simulated, including items with varying amounts of DIF • EB and MH both used to estimate (known) True DIF • Performance compared using RMSR, variance, and bias measures
Design of Simulated CAT • Pool: 30 5-item testlets (150 items total) • 10 Testlets at each of 3 difficulty levels • Item data generated via 3PL model • CAT algorithm was based on testlet scores • Examinees received 5 testlets (25 items) • Test score (used as DIF matching variable) was expected true score on pool (Zwick, Thayer, & Wingersky, 1994 APM)
Simulation Conditions Differed on Several Factors: • Ability distribution: • Always N(0,1) in Reference group • Focal group either N(0,1) or N(-1,1) • Initial sample size per group: 1000 or 3000 • DIF: Absent or Present (in amounts that vary across items) • 600 replications for results shown today
Definition of True DIF for Simulation • Range of True DIF: -2.3 to 2.9, SD ≈ 1.
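MSR = Variance + Squared Bias • $\mathrm{MSR} = \mathrm{RMSR}^2 = \frac{1}{R}\sum_{r=1}^{R} (\text{DIF estimate}_r - \text{True DIF})^2$, where R is the number of replications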
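A small numerical check of the decomposition, written as a sketch: the mean squared residual over replications equals the variance of the estimates (computed with divisor R) plus the squared bias. The function name and the numbers are made up for illustration.

```python
from statistics import mean, pvariance

def msr_decomposition(estimates, true_dif):
    """Return (MSR, variance, squared bias) for one item's replicated estimates."""
    msr = mean([(e - true_dif) ** 2 for e in estimates])
    var = pvariance(estimates)                    # divisor R, not R - 1
    sq_bias = (mean(estimates) - true_dif) ** 2
    return msr, var, sq_bias

# Hypothetical DIF estimates from four replications; True DIF = 1.0.
msr, var, sq_bias = msr_decomposition([0.5, 1.0, 1.5, 2.0], 1.0)
print(msr, var, sq_bias, msr ** 0.5)  # last value is the RMSR
```

The identity is exact only when the variance uses divisor R; with the R - 1 sample variance it holds approximately.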
RMSRs for No-DIF condition, Initial N=1000; Item N’s = 80 to 300
RMSRs - 50 hard items, DIF condition, Focal N(-1,1); Focal N’s = 16 to 67, Reference N’s = 80 to 151
RMSRs for DIF condition, Focal N(-1,1); Initial N=1000; Item N’s = 16 to 307
Variance and Squared Bias for Same Condition; Initial N=1000; Item N’s = 16 to 307
Summary-Performance of EB DIF Estimator • RMSRs (and variances) are smaller for EB than for MH, especially in (1) no-DIF case and (2) very small-sample case. • EB estimates more biased than MH; bias is toward 0. • Above findings are consistent with theory. • Implications to be discussed.
“External” Applications/Elaborations of EB DIF Point Estimation • Defense Dept: CAT-ASVAB (Krass & Segal, 1998) • ACT: Simulated multidimensional CAT data (Miller & Fan, NCME, 1998) • ETS: Fully Bayes DIF model (NCME, 2007) of Sinharay et al: Like EB, but parameters of prior are determined using past data (see ZTL). Also tried loss function approach.
Probabilistic DIF • In our model, posterior distribution is normal, so is fully determined by mean and variance. • Can use posterior distribution to infer the probability that DIF falls into each of the ETS categories (C-, B-, A, B+, C+), each of which corresponds to a particular DIF magnitude. (Statistical significance plays no role here.) • Can display graphically.
Probabilistic DIF status for an “A” item in LSAT sim. • MH = 4.7, SE = 2.2, Identified Status = C+ • Posterior Mean = EBi = .7, Posterior SD = .8 • NR = 101, NF = 23
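The category probabilities behind such a display can be sketched from a normal posterior. The posterior mean (.7) and SD (.8) below are the values reported for this item; the magnitude cutoffs of 1.0 and 1.5 for the A/B and B/C boundaries are an assumption based on the usual ETS magnitude rules, and the function names are illustrative.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def dif_category_probs(post_mean, post_sd, b_cut=1.0, c_cut=1.5):
    """Posterior probability that true DIF falls in each category (assumed cutoffs)."""
    cuts = [-c_cut, -b_cut, b_cut, c_cut]
    cdf = [0.0] + [normal_cdf(c, post_mean, post_sd) for c in cuts] + [1.0]
    labels = ["C-", "B-", "A", "B+", "C+"]
    return {lab: cdf[i + 1] - cdf[i] for i, lab in enumerate(labels)}

probs = dif_category_probs(0.7, 0.8)
for label, p in probs.items():
    print(label, round(p, 3))
```

Under these assumptions most of the posterior mass falls in category A, even though the observed MH value of 4.7 put the item in C+.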
Probabilistic DIF, continued • The EB approach can be used to accumulate DIF evidence across administrations. • Prior can be modified each time an item is given: Use the former posterior distribution as the new prior (Zwick, Thayer & Lewis, 1999). • Pie chart could then be modified to reflect new evidence about an item’s status.
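A minimal sketch of accumulating evidence across administrations, assuming a normal prior and a normally distributed MH statistic: each posterior becomes the prior for the next administration. The starting prior, the MH values, and the standard errors are hypothetical.

```python
def update(prior_mean, prior_var, mh, se):
    """One normal-normal update: returns the posterior mean and variance."""
    s2 = se * se
    w = prior_var / (prior_var + s2)   # weight on the new MH statistic
    return w * mh + (1.0 - w) * prior_mean, w * s2

# Hypothetical starting prior and three administrations of (MH, SE).
post_mean, post_var = 0.0, 1.0
history = []
for mh, se in [(2.0, 1.0), (2.0, 1.0), (1.2, 0.8)]:
    post_mean, post_var = update(post_mean, post_var, mh, se)
    history.append((post_mean, post_var))
    print(round(post_mean, 3), round(post_var, 3))
```

The posterior variance shrinks with each administration, so the category probabilities for the item become sharper as evidence accumulates.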
Predicting an Item’s Future Status: The Posterior Predictive Distribution • A variation on the above can be used to predict future observed DIF status • Mean of posterior predictive distribution is same as posterior mean, but variance is larger. • For details and an application to GRE items, see Zwick, Thayer, & Lewis, 1999 JEM.
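One way to see why the predictive variance is larger: a future observed MH statistic carries the uncertainty about true DIF plus fresh sampling error. The sketch below assumes a normal posterior and a known standard error for the future administration (`future_se`, a hypothetical input), and it looks only at the magnitude of the future statistic, ignoring the significance-test part of the categorization rules.

```python
import math

def posterior_predictive(post_mean, post_sd, future_se):
    """Mean and SD of a future observed MH statistic (normal model assumed)."""
    return post_mean, math.sqrt(post_sd ** 2 + future_se ** 2)

def prob_abs_exceeds(mean, sd, cut):
    """P(|X| >= cut) for X ~ N(mean, sd^2)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
    return cdf(-cut) + (1.0 - cdf(cut))

# Posterior mean .7 and SD .8; the future SE of 0.6 is hypothetical.
pred_mean, pred_sd = posterior_predictive(0.7, 0.8, 0.6)
print(pred_mean, pred_sd)
print(prob_abs_exceeds(0.7, 0.8, 1.5), prob_abs_exceeds(pred_mean, pred_sd, 1.5))
```

Because the predictive SD is larger than the posterior SD, a future observed statistic is more likely to land in an extreme range than the true DIF value is.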
Discussion • EB point estimates have advantages over MH counterparts • EB approach can be applied to non-MH DIF methods • Advisability of shrinkage estimation for DIF needs to be considered • Reducing Type I error may yield more interpretable results • Degree of shrinkage can be fine-tuned • Probabilistic DIF displays may have value in conveying uncertainty of DIF results.