Cross-Grade Scales in NAEP: Research and Real-Life Experience Catherine A. McClellan, John R. Donoghue, Lydia Gladkova, & Xueli Xu
Measurement invariance • One key idea in all types of modeling discussed here is that of invariance • Many of the thorny assessment problems we face require an assumption of invariance • In order to do modeling across grade levels, across time, across groups of people, and across scorers, something, somewhere must be assumed to be invariant – and usually it is some aspect of construct invariance
It’s all about the construct… • Cross-grade scaling needs construct invariance across ages/grades • Differential item functioning (DIF) needs construct invariance across groups • Trend measurement needs construct invariance across time • Constructed-response (CR) item scoring needs rater invariance in interpreting the construct as reflected in the item and rubric
There are a couple of other invariance areas to watch • Design invariance – the assessment design should not change without careful study of the impact • Ink color matters! (particularly in reading) • Context matters – what items and subject matter appear with (particularly before) others matters • Analysis invariance – changes in analysis methodology can introduce artifactual changes in results and should be carefully evaluated before implementation
Cross-grade scales • Cross-grade scales in NAEP must measure the growth between pairs of grade levels while maintaining the trend line for each grade • Assessment design must meet these constraints: • non-adjacent grades assessed (4, 8, and 12) • use of IRT methodology to link grades and trend points • the necessity of trend measurement • item release and replacement • years with missing grades
Differential Item Functioning (DIF) • DIF requires construct invariance across groups of students defined by some known variable (race, gender, parental education, SES, etc.) • Cross-grade scaling issues are age DIF
Trend issues • Items are assumed to function the same way across time • Item parameter drift is a threat – societal changes and scientific discoveries can alter item functioning • Most marginal estimation procedures are sample-dependent • Sets of items that refer to common stimulus materials are prone to dependence, and the structure of the dependence can change over time
Constructed response scoring issues • CR items must be scored the same way • Rater change (or drift) corrupts trend measures • Often can’t get the same raters; even if they are the same people, they have changed • Training may differ, especially if the trainer is not the same • Historical events (state initiatives, etc.) may change how raters perceive items and may even introduce new correct responses or remove previously correct responses • Scoring may differ across grade levels
The Ugly – Long-Term Trend Writing • The Bad – US History and Geography • The Good – Reading
The Ugly: NAEP long-term trend writing • Originally designed in 1984 • 6 writing prompts in 2 disjoint sets (4/2) • Each student receives 1-4 prompts • Scored according to primary trait rubrics • Scores are on a four-point scale, 0-3 • In 1986, there were problems with scoring • Items were declared non-trend and the 1984 responses were rescored in 1986 • 1986 then became the base year for the trend • Items continued in the same form through 1999
IRT scaling with LTT writing • IRT scaling (GPCM; Muraki, 1992) introduced in 1992 • 1992, 1990, 1988, and 1984 data calibrated simultaneously • NAEP marginal estimation and plausible values technology used to produce trend results • For each new wave of data, adjacent pairs of years (i.e., current and previous) were scaled together to place the current assessment onto the reporting scale • Applied in 1994, 1996, and 1999
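For readers unfamiliar with the GPCM, a minimal sketch of its category response function follows. The parameterization (slope a, location b, threshold deviations d) and the parameter values are illustrative only, not the operational NAEP item parameters.

```python
import numpy as np

def gpcm_probs(theta, a, b, d):
    """Generalized Partial Credit Model (Muraki, 1992) category probabilities.

    theta : latent ability value
    a     : item slope (discrimination)
    b     : overall item location
    d     : category threshold deviations, with d[0] fixed at 0
    Returns probabilities for score categories 0..len(d)-1.
    """
    # Cumulative sums of a*(theta - b + d_k); a constant shift cancels in the ratio
    z = np.cumsum(a * (theta - b + np.asarray(d, dtype=float)))
    z -= z.max()                      # stabilize the exponentials
    num = np.exp(z)
    return num / num.sum()

# Illustrative 4-category (0-3) prompt, like the LTT writing items
print(gpcm_probs(theta=0.5, a=1.0, b=0.0, d=[0.0, 0.8, 0.0, -0.8]))
```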
Cross-year invariance issues in 1999 • Basis of trend: the assumption that items function identically across time • In 1999, the rescore data and plots raised concerns about whether the assessment data supported this assumption • Cross-year drift essentially “splits” an item into two separate items, one as rated in each year • Creating two items from one can be done in analysis (see the sketch below) • Requires judgment, as there is currently no valid statistical test for this type of misfit
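The mechanics of splitting are straightforward in the response data. The sketch below uses an invented item name and toy data to show one way to recode a prompt into a pre-drift and a post-drift item, with the other era treated as not presented; it is illustrative, not the operational NAEP processing.

```python
import pandas as pd

# Toy response file: one writing prompt administered across assessment years
resp = pd.DataFrame({"year": [1996, 1996, 1999, 1999],
                     "prompt7": [2, 3, 1, 2]})

# Split the item at the suspected drift point: earlier responses stay with the
# original item; 1999 responses become a new "item" (NaN = not presented)
resp["prompt7_pre1999"] = resp["prompt7"].where(resp["year"] < 1999)
resp["prompt7_1999"] = resp["prompt7"].where(resp["year"] >= 1999)
resp = resp.drop(columns="prompt7")
print(resp)
```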
Item effects • Recall that there were a small number of prompts in the assessment and that the design was weakly linked across items • Overall trend and results proved to be sensitive to decisions made about a single item • Simultaneous calibration of all years was less sensitive, but still showed the same effect
So now what? • The alternatives seemed to be: • Report the 1999 IRT based results as they stood • Do alternative, non-IRT analysis to further evaluate the situation and possibly use as the reporting results • Account for rater effects • Incorporate sources of error • Develop standard errors that reflect these sources of error • It was decided to pursue the non-IRT analyses
Accounting for rater effects - 1 • In 1988 rater drift was noted and a portion of 1984 papers were rescored, so 1988 became the official base year for subsequent scoring • In pursuing the non-IRT analyses in 1999, data from all assessment years subsequent to 1988 (1990, 1992, 1994, 1996, and 1999) were analyzed • A small number (230-500) of 1988 papers were rescored as part of the current assessment’s scoring • These 1988 papers were used to estimate and remove the rater drift effect in subsequent years
Accounting for rater effects - 2 • Form the rescore data table with 1988 scores as the rows, 1999 scores as the columns • Compute the conditional probability of the 1988 score given the current-year score • Take multiple draws from this posterior • Analyze as if regular student scores • Repeat the analysis on each set of draws to yield an estimate of the uncertainty due to imputation
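A minimal sketch of this imputation step follows; the cross-tabulation counts, the number of draws, and the score range are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical rescore table: rows = 1988 scores, columns = 1999 scores (0-3)
table = np.array([[30,  8,  1,  0],
                  [10, 45, 12,  1],
                  [ 2, 14, 40,  9],
                  [ 0,  1,  7, 20]], dtype=float)

# P(1988 score = r | 1999 score = c) = n_rc / n_+c
cond = table / table.sum(axis=0, keepdims=True)

def impute_1988_scores(scores_1999, n_draws=5):
    """Draw plausible 1988-rater scores for each observed 1999 score."""
    draws = np.empty((n_draws, len(scores_1999)), dtype=int)
    for d in range(n_draws):
        for i, c in enumerate(scores_1999):
            draws[d, i] = rng.choice(4, p=cond[:, c])
    return draws

# Each row of draws is analyzed as if it were the observed student scores;
# variation across rows reflects the uncertainty due to imputation
print(impute_1988_scores(scores_1999=[0, 1, 2, 3, 2, 1]))
```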
Potential concerns • Tables were based on small samples, so the estimated conditional probabilities were likely to be unstable • Rescore table data had some (significant) gaps in some years • No scores in the highest score level for some tables • No exact agreement for some tables
Smoothing (Part 1) • Deal with variability using a smoothing procedure on the rescore table, then draw values from the smoothed table • Loglinear smoothing (Holland & Thayer, 1998) was applied • This method preserves the moments of the margins and the correlation • Margins of the original tables were preserved exactly • Results indicated poor model-data agreement • For the age 9, 4-point items, 14 of 30 tables yielded significant log-likelihood chi-square values
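The sketch below shows one common loglinear-smoothing specification consistent with these properties: categorical row and column main effects reproduce the margins exactly, and a numeric row-by-column term preserves the correlation. The table is invented and the exact operational model may differ.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical rescore table: rows = 1988 score, columns = current-year score
table = np.array([[30,  8,  1,  0],
                  [10, 45, 12,  1],
                  [ 2, 14, 40,  9],
                  [ 0,  1,  7, 20]], dtype=float)

rows, cols = np.indices(table.shape)
df = pd.DataFrame({"n": table.ravel(), "r": rows.ravel(), "c": cols.ravel()})

# Poisson loglinear model: C(r) and C(c) reproduce the margins exactly,
# the numeric r:c term preserves the row-by-column cross moment
fit = smf.glm("n ~ C(r) + C(c) + r:c", data=df,
              family=sm.families.Poisson()).fit()

smoothed = fit.fittedvalues.values.reshape(table.shape)
print(np.round(smoothed, 1))
# The residual deviance provides a log-likelihood chi-square check of fit
print("deviance:", round(fit.deviance, 2), "df:", int(fit.df_resid))
```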
Other concerns • The empty diagonal cells and empty margins had to be dealt with • The solution chosen was to insert a single observation into the table • The original cells were all multiplied by (N-1)/N to maintain the overall N • This preserves important aspects of the table: • Percent exact agreement • Mean difference of (current year - 1988)
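A minimal sketch of that adjustment, assuming the single observation goes into a chosen empty cell; the table and the cell choice are invented for illustration.

```python
import numpy as np

def insert_single_observation(table, cell):
    """Add one observation at `cell` and rescale the original counts by
    (N - 1)/N so the overall table total N is unchanged."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    adjusted = table * (n - 1.0) / n
    adjusted[cell] += 1.0
    return adjusted

# Hypothetical table with an empty top-score diagonal cell
t = np.array([[30.,  8.,  1.,  0.],
              [10., 45., 12.,  1.],
              [ 2., 14., 40.,  9.],
              [ 0.,  1.,  7.,  0.]])
t_adj = insert_single_observation(t, cell=(3, 3))
print(t.sum(), t_adj.sum())   # same overall N before and after
```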
Smoothing (Part 2) • Pre-smoothed tables were input to the loglinear smoothing procedure • Fit was better than with the un-pre-smoothed data, but there were still some questionable cases • We tried using a Bayesian method (Fienberg & Holland, 1970) to form a weighted combination of the two tables • These tables were used to compute the conditional probabilities to draw imputations
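A sketch of the weighted combination in the spirit of the pseudo-Bayes estimators of Fienberg & Holland (1970); the prior weight K and both tables are invented, and the operational choice of weights is not described on the slide.

```python
import numpy as np

def pseudo_bayes_combine(observed, smoothed, K):
    """Combine observed and smoothed cell proportions, with K playing the
    role of the prior sample size given to the smoothed table."""
    observed = np.asarray(observed, dtype=float)
    smoothed = np.asarray(smoothed, dtype=float)
    n = observed.sum()
    p_obs = observed / n
    p_model = smoothed / smoothed.sum()
    return (n * p_obs + K * p_model) / (n + K)

# Invented stand-ins for a raw rescore table and its loglinear-smoothed version
obs = np.array([[30.,  8.,  1.], [10., 45., 12.], [ 2., 14., 40.]])
smo = np.array([[28., 10.,  2.], [11., 43., 13.], [ 3., 13., 39.]])
p_star = pseudo_bayes_combine(obs, smo, K=20.0)

# Conditional probabilities used for the imputation draws: P(1988 row | current column)
cond = p_star / p_star.sum(axis=0, keepdims=True)
print(np.round(cond, 3))
```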
Quantifying uncertainty • Usual sources of uncertainty • Sampling of PSUs, schools, & students • Partial knowledge of student achievement: few items • Usual jackknife procedures • Plus • Uncertainty due to lack of knowledge of the scores the 1988 raters would have assigned • Error introduced by estimation of the conditional probabilities • This got ugly in a hurry…
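The slide does not spell out the variance formulas; the sketch below follows the usual multiple-imputation style combination of an average sampling (jackknife) variance with the between-draw variance, purely as an illustration of how the extra imputation uncertainty would enter the standard error.

```python
import numpy as np

def total_variance(jackknife_vars, draw_estimates):
    """Sampling variance (averaged over draws) plus the between-draw
    component inflated by (1 + 1/M), Rubin-style."""
    m = len(draw_estimates)
    sampling_var = np.mean(jackknife_vars)
    between_var = np.var(draw_estimates, ddof=1)
    return sampling_var + (1.0 + 1.0 / m) * between_var

draw_estimates = [282.1, 281.6, 282.4, 281.9, 282.0]  # statistic per imputation draw
jackknife_vars = [1.10, 1.05, 1.12, 1.08, 1.09]       # jackknife variance per draw
print("SE:", round(float(np.sqrt(total_variance(jackknife_vars, draw_estimates))), 3))
```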
The Ugly: In summary • In 1999, important drift issues rose to the fore • The treatment of a single item (trend or split) changed the direction of the overall national trend result • Acting Commissioner Phillips — “I have lost faith in the instrument” • The 1999 LTT writing results were never released
The Bad: US History and Geography • Base year is 1994 for both subjects • Assessed again in 2001 • Reported using a cross-grade scale • Two aspects for consideration: analysis design and construct considerations
Cross-grade scale design
1994 (base year): Age 9 / grade 4, Age 13 / grade 8, Age 17 / grade 12
2001 (first trend year): Age 9 / grade 4, Age 13 / grade 8, Age 17 / grade 12
Analysis design: US History, 1994 – 1 • There are no common items between grades 4 and 12, nor any across all 3 grades
Analysis design: US History, 1994 – 2 • History has four subscales: Democracy, Cultures, Technology, and World Role • The IRT scaling and the vertical linking of the grades were done at the subscale level, using a weighted generalized Stocking-Lord procedure on the test characteristic curves of the common items • Grade 4 and grade 12 were each linked separately to grade 8, since both had common items with grade 8
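A minimal sketch of a weighted Stocking-Lord TCC matching under simplified assumptions: dichotomous 2PL common items (the operational pools are mixed format), invented item parameters, and an assumed weight function over theta.

```python
import numpy as np
from scipy.optimize import minimize

def tcc(theta, a, b):
    """Test characteristic curve of a set of 2PL items on a theta grid."""
    p = 1.0 / (1.0 + np.exp(-1.7 * a[:, None] * (theta[None, :] - b[:, None])))
    return p.sum(axis=0)

def stocking_lord(a_from, b_from, a_to, b_to, theta, weights):
    """Slope/intercept (A, B) placing the 'from' calibration on the 'to'
    scale by matching common-item TCCs, weighted over the theta grid."""
    def loss(x):
        A, B = x
        # 'from' parameters re-expressed on the 'to' scale: a/A and A*b + B
        diff = tcc(theta, a_to, b_to) - tcc(theta, a_from / A, A * b_from + B)
        return float(np.sum(weights * diff**2))
    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x

theta = np.linspace(-4.0, 4.0, 41)
weights = np.exp(-0.5 * theta**2)        # assumed weighting by an ability density
A, B = stocking_lord(np.array([1.0, 1.2]), np.array([-0.3, 0.5]),
                     np.array([0.9, 1.3]), np.array([-0.1, 0.8]),
                     theta, weights)
print(round(A, 3), round(B, 3))
```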
Design concerns • There are some subscales that are quite thin across grade levels: • Technology across 4 and 8: 5 items • Cultures across 8 and 12: 4 items • World Role between 4 and 8: 7 items (note also that there are only six grade-4-specific items in this subscale) • The TCCs that result from a weighted combination of IRFs from so few items may retain substantial variability within year, and may also be subject to trend instability
Construct concerns • Vertical scales are generally based on content areas that are thought of as “developmental” in some way • The baseline construct is established early and the skill is refined and the scope of application expanded as the child matures • It is not clear that US History (or Geography, for that matter) fit this description well
The Bad: In summary • The analysis design is not poor, but there are relatively few items across grades to provide data to the linking • A larger concern is whether or not these academic subject areas are appropriate for a vertical scale
The Good: Reading • The current reading assessment has a trend line back to 1992 • The cross-grade scale design used there was also used in mathematics, which started in 1990 • The design implements a concurrent calibration of all three grade levels of data in the base year, then within-grade calibration in subsequent trend years
Cross-grade scale design
Year 1 (base year): grades 4, 8, 12
Year 2 (first trend year): grades 4, 8, 12
Year 3 (second trend year): grades 4, 8, 12
…
Some complications • NAEP does not assess every grade in every assessment year, so the design has some holes • The sample sizes vary quite a lot: combined samples run ~170,000, national ~10,000
Would an alternate design change the results? [Table: grades assessed (4, 8, 12) by assessment year 1998(a), 2000(a), 2002(a), 2003(a), 2005(a)]
Summary of study results • The majority of cross-grade items fit well in cross-grade calibration • In general, reported values for cross-grade and operational scaling are close, both in mean scale scores and percentages at achievement levels • In a number of subgroups, significant-difference tests led to different results • The reported values for cross-grade and operational scaling differ more in the later years
The Good: In summary • The current cross-grade scale design used in NAEP appears robust to the alternate design studied • Little construct drift was apparent; the results were quite similar under both analysis designs • This was an analytic study only: alternative assessment or item designs would almost certainly yield different conclusions