A Model for Scaling, Linking, and Reporting Through-Course Summative Assessments • Rebecca Zwick • Robert J. Mislevy • Educational Testing Service • NCSA – Orlando, June 20, 2011
Overview of Presentation • Desired properties of a model for analyzing through-course summative assessments (TCAs) • Description of the proposed model • A possible simplification • Recommendations
Desired Properties The model must be able to: • Yield proficiency estimates for individuals & groups, accommodating different patterns of instruction • Provide end-of-year summaries & growth measures • Provide results that are comparable across classrooms, schools, districts, & states • Incorporate items that vary in instructional sensitivity (implies multidimensionality)
Model Characteristics • Accommodates the inferences desired from TCAs; can serve as a framework for studying plausible submodels • Has 2 components: • Multidimensional item response theory (MIRT) component specifies the dependency of item responses on proficiency • Population component models the association between proficiency and background variables • Builds on NAEP, TIMSS, & PISA experience
Notation • $\theta_i$ = vector of proficiencies for student $i$ (note that $\theta_i$ is multidimensional) • $x_i$ = vector of item responses for student $i$ (includes dichotomous & polytomous items) • $y_i$ = vector of background variables for student $i$ (more on this later)
Bayesian model (see Mislevy, 1985): $p(\theta_i \mid x_i, y_i) \propto p(x_i \mid \theta_i)\, p(\theta_i \mid y_i)$ • $p(\theta_i \mid x_i, y_i)$ is the posterior distribution of $\theta_i$ for student $i$ ("$\propto$" means "is proportional to") • $p(x_i \mid \theta_i)$ is the MIRT model • $p(\theta_i \mid y_i)$ is the conditional distribution of $\theta_i$, given the background variables
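To make the update concrete, here is a minimal numerical sketch of the posterior computation on a grid. It assumes a two-dimensional compensatory 2PL MIRT component and a normal population component whose mean is a linear function of $y_i$; the item parameters, regression coefficients, and simulated responses are all illustrative assumptions, not values from the presentation.

```python
import numpy as np

# Sketch of the Bayesian update p(theta | x, y) ∝ p(x | theta) p(theta | y),
# assuming a 2-dimensional compensatory 2PL MIRT component and a normal
# population component whose mean depends on background variables y.
# All parameter values below are illustrative assumptions.

rng = np.random.default_rng(0)

n_items, n_dim = 25, 2
a = rng.uniform(0.5, 1.5, size=(n_items, n_dim))  # slopes (assumed)
b = rng.normal(0.0, 1.0, size=n_items)            # difficulties (assumed)

def mirt_likelihood(theta, x):
    """p(x | theta) under a compensatory multidimensional 2PL model."""
    logits = theta @ a.T - b                       # (n_grid, n_items)
    p = 1.0 / (1.0 + np.exp(-logits))
    return np.prod(np.where(x == 1, p, 1.0 - p), axis=1)

def population_prior(theta, y, beta, sigma=1.0):
    """p(theta | y): normal prior whose mean is a linear function of y."""
    resid = theta - y @ beta
    return np.exp(-0.5 * np.sum(resid**2, axis=1) / sigma**2)

# Grid over the two proficiency dimensions.
grid_1d = np.linspace(-4, 4, 81)
g1, g2 = np.meshgrid(grid_1d, grid_1d)
theta_grid = np.column_stack([g1.ravel(), g2.ravel()])

# One student's responses and background variables (both simulated here).
x_i = rng.integers(0, 2, size=n_items)
y_i = np.array([1.0, 0.5])                 # e.g., intercept + instructional indicator
beta = np.array([[0.2, 0.1], [0.3, 0.4]])  # assumed regression of theta on y

post = mirt_likelihood(theta_grid, x_i) * population_prior(theta_grid, y_i, beta)
post /= post.sum()                          # normalize over the grid

theta_hat = theta_grid.T @ post             # posterior mean (EAP estimate)
print("Posterior mean of theta:", theta_hat)
```

The normalized grid yields posterior-mean (EAP) estimates directly; an operational program would more likely use MCMC or adaptive quadrature than a fixed grid.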
Model Assumptions Our model relies on the traditional IRT assumption of conditional independence: i.e., item responses are independent, conditional on $\theta_i$, which implies that the same item response model holds across groups and over time. MIRT makes this possible; our goal is to find an interpretable model that satisfies the assumption. Model testing is needed.
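Written out with the notation above, the assumption factors the MIRT component over the $J$ item responses (a standard statement of local independence):

```latex
p(x_i \mid \theta_i) = \prod_{j=1}^{J} p(x_{ij} \mid \theta_i)
```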
Why include any background variables in the model? • Improves precision of estimation and avoids biases, especially when item data are sparse. • $y_i$ includes both demographic variables and instructional variables, such as the nature and order of the curriculum. ($y_i$ will vary over time.) • For fairness reasons, demographic variables are not included when estimating individual scores.
Estimating individual and group proficiency • For individual estimates, use mean or mode of posterior distribution • Group characteristics are NOT estimated using aggregates of individual estimates • Distribution of optimal individual estimates is not the optimal estimate of the population distribution.
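A small simulation makes the last point concrete (a sketch, not from the presentation): in a simple normal-normal model, optimal individual estimates are shrunken toward the prior mean, so the variance of their distribution understates the variance of the true proficiency distribution.

```python
import numpy as np

# Why aggregating individual estimates misstates group characteristics,
# illustrated with a normal-normal model (all values illustrative).
# True proficiencies theta ~ N(0, 1); observed scores add N(0, 0.5^2) error.
# The posterior mean shrinks each score toward the prior mean, so the
# variance of the posterior means is smaller than the variance of theta.

rng = np.random.default_rng(1)
n = 10_000
theta = rng.normal(0.0, 1.0, size=n)            # true proficiencies
scores = theta + rng.normal(0.0, 0.5, size=n)   # observed scores with error

# Normal-normal posterior mean: shrinkage = prior var / (prior var + error var).
shrink = 1.0 / (1.0 + 0.25)
eap = shrink * scores                            # optimal individual estimates

print("Variance of true theta:     ", theta.var().round(3))  # ~1.00
print("Variance of posterior means:", eap.var().round(3))    # ~0.80 (too small)
```

This is one reason programs such as NAEP estimate group distributions directly (e.g., via plausible values) rather than aggregating individual point estimates.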
Market-basket (MB) reporting (Mislevy, 2003) • Report results in terms of a scale based on a "market basket" of selected tasks (which must be calibrated). • Using the observed data, we can generate predicted responses to these MB tasks: the predicted response to MB task $j$ is its posterior expectation, $E[x_{ij} \mid x_i, y_i] = \int E[x_{ij} \mid \theta]\, p(\theta \mid x_i, y_i)\, d\theta$ (for a dichotomous item, the model-implied probability of a correct response).
MB Reporting - Example • Suppose 100 items are to be administered during the year; these could constitute the market basket • Assume 4 TCAs, each with 25 of the 100 items • For each TCA, each student has actual responses for 25 items and predicted responses for the other 75 • Score is the expectation of the sum of responses over all 100 items (sketched below)
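As a rough illustration of the bookkeeping in this example, the sketch below computes a market-basket score for one TCA: the 25 observed responses are summed directly, and the other 75 items contribute their expected scores under the student's posterior. The unidimensional 2PL model and all parameter values are simplifying assumptions for the sketch, not the presentation's operational model.

```python
import numpy as np

# Market-basket score for one TCA, under the example's setup: 100 calibrated
# items form the basket, 25 were administered, and the remaining 75 get
# posterior-predicted responses. A unidimensional 2PL model and a grid
# posterior keep the sketch simple; all item parameters are illustrative.

rng = np.random.default_rng(2)

n_basket, n_taken = 100, 25
a = rng.uniform(0.7, 1.4, size=n_basket)   # discriminations (assumed)
b = rng.normal(0.0, 1.0, size=n_basket)    # difficulties (assumed)

taken = np.arange(n_taken)                 # indices of the 25 administered items
x_obs = rng.integers(0, 2, size=n_taken)   # the student's observed responses

# Posterior over theta on a grid, with a standard normal prior.
grid = np.linspace(-4, 4, 161)
p_all = 1.0 / (1.0 + np.exp(-(np.outer(grid, a) - b)))  # (n_grid, n_basket)
like = np.prod(np.where(x_obs == 1, p_all[:, taken], 1 - p_all[:, taken]), axis=1)
post = like * np.exp(-0.5 * grid**2)
post /= post.sum()

# MB score = observed score on the 25 items taken
#          + expected score on the 75 items not taken, averaged over the posterior.
not_taken = np.setdiff1d(np.arange(n_basket), taken)
expected_rest = post @ p_all[:, not_taken].sum(axis=1)
mb_score = x_obs.sum() + expected_rest
print(f"Market-basket score (out of {n_basket}): {mb_score:.1f}")
```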
Market-basket reporting (cont.) • "Behind-the-scenes" machinery is complex, but the resulting scores "look like" ordinary test scores: the multivariate $\theta$ is mapped onto a unidimensional scale. • Can be used for a year-end summary of the TCAs and for growth measurement (e.g., last TCA minus first TCA). • Can predict end-of-year performance by adjusting the values of $y_i$: given his current $\theta$, how will Johnny perform with a whole year of instruction?
Can a simpler model work? • Simplifications become more feasible if the demands on the model are scaled down. • Example: Use a more traditional assessment for comparisons across schools, districts, & states • Machine-scoreable items; no complex tasks • Administer to a random sample only • Might eliminate the need for a population model • Could then use less constrained test forms, including complex tasks, to inform instruction
Recommendations • 1. Use the pilot and field-test periods to test the model and explore simplifications • 2. Recognize that a tradeoff exists between inferential demands and procedural simplicity: reducing demands makes simpler approaches more feasible