Hierarchical Linear Modeling for Detecting Cheating and Aberrance. Statistical Detection of Potential Test Fraud, May 2012, Lawrence, KS. William Skorupski, University of Kansas; Karla Egan, CTB/McGraw-Hill.
Purpose of the Study • “Cheating” as a paradigm for psychometric research has focused on individuals. • Our purpose is to identify groups of cheaters, based on the premise that teachers and administrators may be motivated to inappropriately influence students’ scores.
Background • Importance of cheating detection • Cheating as classroom-, school-, or even district-wide phenomenon • Results of many large-scale educational assessments are tied to incentives, e.g., merit-based pay, accountability, AYP targets from NCLB • Teachers may be tempted to “teach to the test,” provide inappropriate materials, alter students’ answer sheets
Previous Study • Skorupski & Egan (2011) demonstrated a Bayesian hierarchical modeling approach for detecting group-level aberrance, using real data. • Results were cross-validated against external reports of impropriety. • Detection rates were reasonable, but the results were difficult to verify.
Findings • Relatively large aberrance for a few schools at certain Time points suggested that this approach may be useful for flagging potentially cheating schools. • The present simulation study was planned to evaluate detection power.
Goals of the study • Evaluate the robustness of the Bayesian HLM approach for detecting group-level cheating through Monte Carlo simulation. • Develop heuristics for flagging known “cheaters” from the analysis
Cheating & Aberrance • Certain kinds of aberrance may be evidence of cheating • Answer copying • Model-data misfit • In our analysis: unusually high group performance at a given Time, given marginal Group & Time effects • i.e., a large positive interaction effect
Important Note • No cheating/aberrance detection method can “prove” cheating; it can only flag unusual individuals or groups for further review. • Our goal is to demonstrate detection of known group-level cheating with adequate power while maintaining an acceptable Type I error rate.
Methods – Data Simulation • Data created to emulate a vertically scaled SWA • 3 linked administrations, with means increasing by $0.5\sigma$ between Time points: $\mu_t = 0, 0.5, 1$ • 60 Groups, with N(g) ranging from 10 to 260 (Total N = 4,650)
• 51 of the 60 Group means at Time 1 were drawn from $\mu_{(g)} \sim N(0,1)$ • The remaining 3 x 3 = 9 groups cross N(g) = 10, 60, 110 with $\mu_{(g)} = -1, 0, 1$ • These 9 groups (3 at each Time, so 5% overall) will be the “cheaters”
Simulate Individual Scores • $\theta \sim MVN(\mathbf{0}, R)$, where $\mathbf{0}$ is a vector of zeros and $R$ is a correlation matrix with off-diagonals = 0.77 (based on the real data study) • Each individual score $Y_{igt}$ was created by taking $\theta_{igt}$ and adding its respective Time and Group mean. • At this point, all scores are “non-aberrant”; main effects alone account for differences
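The score-generation step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' simulation code: the function name `simulate_scores`, its arguments, and the three-group example are hypothetical, and the random seed is arbitrary.

```python
import numpy as np

def simulate_scores(group_sizes, group_means, time_means, rho=0.77, seed=0):
    """Return (scores, group_ids); scores is (N, T) with main effects only."""
    rng = np.random.default_rng(seed)
    n_times = len(time_means)
    # Correlation matrix R: ones on the diagonal, rho on the off-diagonals.
    R = np.full((n_times, n_times), rho)
    np.fill_diagonal(R, 1.0)
    # One group index per examinee, e.g. [0, 0, ..., 1, 1, ...].
    group_ids = np.repeat(np.arange(len(group_sizes)), group_sizes)
    # theta ~ MVN(0, R): one row per examinee, one column per Time.
    theta = rng.multivariate_normal(np.zeros(n_times), R, size=group_ids.size)
    # Y_igt = theta_igt + Group mean + Time mean (no interaction yet).
    scores = (theta
              + np.asarray(group_means, dtype=float)[group_ids, None]
              + np.asarray(time_means, dtype=float)[None, :])
    return scores, group_ids

# Hypothetical toy design: 3 groups only, using sizes and means from the slides.
scores, group_ids = simulate_scores([10, 60, 110], [-1.0, 0.0, 1.0], [0.0, 0.5, 1.0])
```

Each row of `scores` is one examinee's trajectory over the three Time points, so the within-person correlation of 0.77 is carried by `theta`.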
Simulate “Cheating” • For cheating groups, an additional interaction effect is added to $Y_{igt}$ • 3 at each Time, for $\mu_{(g)} = -1, 0,$ or $1$ and N(g) = 10, 60, or 110 • Group-by-Time (60 x 3) matrix of effects, GT. GT = 0 means no cheating; GT > 0 means cheating. • GT = 1 for simulated cheaters (i.e., the Group mean is $+1\sigma$ above what the main effects predict)
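A minimal sketch of the Group-by-Time effect matrix follows. The helper `add_cheating` and the particular cells chosen as cheaters are hypothetical; the slides do not say which group indices were used.

```python
import numpy as np

def add_cheating(scores, group_ids, cheat_cells, n_groups, effect=1.0):
    """Add a Group-by-Time interaction effect to the chosen (group, time) cells."""
    n_times = scores.shape[1]
    GT = np.zeros((n_groups, n_times))      # 0 = no cheating
    for g, t in cheat_cells:
        GT[g, t] = effect                   # > 0 = cheating
    # Every examinee picks up the row of GT that belongs to their group.
    return scores + GT[group_ids, :], GT

# Hypothetical layout: groups 0-8 cheat, three of them at each Time point.
cheat_cells = [(g, g // 3) for g in range(9)]
toy_scores = np.zeros((60, 3))              # one placeholder examinee per group
toy_ids = np.arange(60)
cheated, GT = add_cheating(toy_scores, toy_ids, cheat_cells, n_groups=60)
share_cheating = (GT > 0).mean()            # 9 of 180 cells
```

With 9 nonzero cells out of 60 x 3 = 180, `share_cheating` reproduces the 5% figure quoted in the design.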
[Figure panels: Time 1 Cheating, Time 2 Cheating, Time 3 Cheating] Each of these 3 patterns was crossed with 3 group sizes: N(g) = 10, 60, 110
Notes on Simulation • Forms must be linked over Time • In this analysis, scale scores were simulated directly (treating scores as measured without error), but in practice item response data would first be obtained and linked on a vertical scale. • Examinees are nested within groups, and Time points are nested within individuals
[Diagram: nesting structure. Groups (1,…,G) contain Individuals (1,…,N(g)); each Person i in Group g has scores at the linked Time points (1, 2, 3): $Y_{ig1}$, $Y_{ig2}$, $Y_{ig3}$]
Methods – Analysis: Hierarchical Growth Model • Model: scale scores for individuals (i) within groups (g) over time (t): $Y_{igt} = \beta_0 + \beta_{1g} + \beta_{2t} + \beta_{3gt} + \varepsilon_{igt}$ • $\varepsilon_{igt} \sim N(0, \sigma^2)$ • Fully Bayesian estimation (MCMC) using WinBUGS (Lunn et al., 2000) • 50 replications
Baseline Model • Only Time- and Group-level effects are estimated as differences in intercepts (plus interaction term) • With real data, other models could also incorporate covariates (SES, etc.) at any level of the model
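To see what the Group-by-Time interaction term captures, it can help to look at a classical two-way decomposition of cell means. This is not the Bayesian estimator used in the study, only a simplified, unweighted stand-in; the function name and the three-group table are hypothetical.

```python
import numpy as np

def interaction_effects(cell_means):
    """Unweighted two-way decomposition of a (groups x times) table of means."""
    m = np.asarray(cell_means, dtype=float)
    grand = m.mean()
    group_eff = m.mean(axis=1, keepdims=True) - grand   # main effect for Group
    time_eff = m.mean(axis=0, keepdims=True) - grand    # main effect for Time
    # What is left after removing the grand mean and both main effects.
    return m - grand - group_eff - time_eff

# Additive table: Group means -1, 0, 1 plus Time means 0, 0.5, 1.
table = np.array([-1.0, 0.0, 1.0])[:, None] + np.array([0.0, 0.5, 1.0])[None, :]
table[1, 2] += 1.0                     # hypothetical cheating: group 1 at Time 3
resid = interaction_effects(table)
largest = np.unravel_index(np.argmax(resid), resid.shape)
```

The largest residual sits in the altered cell, and every row and column of `resid` sums to zero. With only three groups the added effect is partly absorbed into the main effects, which is one reason a larger set of non-cheating groups helps.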
Outcomes • The parameter estimates $\beta_{3gt}$ (Group-by-Time interactions) are used to infer aberrant group performance at a given Time. • $\beta_{1g}$ (main effect for Group) could also be used to detect systematic aberrance • Delta values for parameter estimates, plus “Posterior Probability of Cheating” (PPoC).
Outcomes • PPoC = proportion of posterior draws (samples from the posterior in the MCMC output) above zero. • Criterion for flagging: PPoC ≥ 0.75 • d = standardized effect size for the interaction. The previous study found d ≥ 0.5 to be a reasonable criterion
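The PPoC definition translates directly into code. The sketch below assumes the posterior draws for the interaction effects have already been exported from the sampler into an array; the function name and the toy draws are hypothetical.

```python
import numpy as np

def ppoc(draws):
    """Proportion of posterior draws above zero, per parameter.

    draws: array of shape (n_draws, ...) holding MCMC samples of the
    Group-by-Time interaction effects (first axis indexes the draws).
    """
    return (np.asarray(draws) > 0).mean(axis=0)

# Toy example: 4 draws for 2 interaction parameters.
toy_draws = np.array([[1.0, -1.0],
                      [2.0, -2.0],
                      [3.0,  1.0],
                      [-1.0, -3.0]])
toy_ppoc = ppoc(toy_draws)   # first parameter: 3 of 4 draws > 0; second: 1 of 4
```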
Cross-validation • Any Group/Time interaction effect with d≥0.5 and PPoC≥0.75 was considered flagged as aberrant (i.e., potentially cheating). • Over replications, correctly identified groups were part of the Power calculation, false positive flags were part of the Type I error rate.
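The flagging rule and the two error rates could be tallied as below. This is a sketch under assumptions: the array layout (replications x groups x times) and the function name are hypothetical, and d and PPoC are taken as already computed.

```python
import numpy as np

def flag_rates(d, ppoc, is_cheater, d_min=0.5, ppoc_min=0.75):
    """Return (power, type1) pooled over replications.

    d, ppoc: arrays of shape (replications, groups, times).
    is_cheater: boolean array of shape (groups, times), the simulated truth.
    """
    flags = (np.asarray(d) >= d_min) & (np.asarray(ppoc) >= ppoc_min)
    truth = np.asarray(is_cheater, dtype=bool)
    power = flags[:, truth].mean()    # flagged share of true cheating cells
    type1 = flags[:, ~truth].mean()   # flagged share of non-cheating cells
    return power, type1

# Toy example: 2 replications, 2 groups, 2 Time points, one true cheating cell.
truth = np.array([[True, False], [False, False]])
d_vals = np.array([[[0.6, 0.1], [0.1, 0.1]],
                   [[0.6, 0.6], [0.1, 0.1]]])
ppoc_vals = np.full((2, 2, 2), 0.9)
power, type1 = flag_rates(d_vals, ppoc_vals, truth)
```

In the toy data the cheating cell is flagged in both replications, and one of the six non-cheating cells is flagged by mistake.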
Results • MCMC: 2 chains, 30,000 iterations each, burn-in=25,000 • Very good convergence of solutions • Main effects for Time and Group were well recovered. • Detection power was very good at Times 2 & 3, quite low for Time 1 • Acceptable Type I error rate
Results by condition (flag criteria: d ≥ .5 and PPoC ≥ .75) • Marginal: Power = .59, Type I error = .04 • Time 1: Power = .07, Type I error = .04 • Time 2: Power = .71, Type I error = .04 • Time 3: Power = 1, Type I error = .05
Discussion • Power is very good at Times 2 and 3 (.71 and 1) but very poor at Time 1 (.07); marginal power is .59 • Type I error rate is acceptable (.04 to .05) • Results are encouraging; more simulations and replications are planned • Planned conditions include various effect sizes, sample sizes, non-linear trends, etc.
How might this method be used in practice? • Flagged groups may be compared to the Overall growth trajectory to infer aberrance of performance. • Groups flagged must then be investigated further. • Unusual performance could be caused by cheating, or it could indicate something exemplary! • Commend or Condemn?
Thanks! wps@ku.edu