PARCC Field Test Study: Comparability of High School Mathematics End-of-Course Assessments • National Conference on Student Assessment, San Diego, June 2015
Overview • PARCC field test EOC study design • Statistical analysis • SME review of item maps
Introduction • To assist states in aligning instruction to the CCSSM (Common Core State Standards for Mathematics), model course pathways were developed for high school mathematics, with the standards organized into two sequences of coursework designed to lead to college and career readiness and to prepare students for study in more advanced mathematics courses. • The Traditional pathway follows the organization of high school coursework typically seen in the United States. • It comprises two algebra courses and a geometry course, with some data, probability, and statistics included in each course (Algebra 1, Geometry, Algebra 2). • The Integrated pathway provides a more integrated approach to secondary mathematics that is less common in the United States but typical internationally. • It comprises a sequence of three courses, each of which includes number, algebra, geometry, probability, and statistics (Integrated Mathematics 1, 2, 3).
Study Overview The HS EOC comparability study was designed to address the following research questions: 1. What degree of comparability (e.g., linked or concorded) can be achieved between the assessments of the two course sequences? Can the comparability be achieved at the course level or only at the aggregate level? 2. How do the psychometric properties of items that are used in assessments of both course sequences compare? More specifically, can a single calibration suffice for an item used in both course sequences or must an item be separately calibrated for use in each?
Overview of Field Test Design • To the extent possible, the field test was designed to reflect future operational administrations • 2 separate administrations: performance-based assessment (PBA) in March, end-of-year (EOY) assessment in April • Dual-mode administration • PBA and EOY field test forms constructed to full operational test blueprints and requirements • FT data collection design • 2 conditions: 1) full summative (FS, PBA + EOY); 2) PBA or EOY but not both • Linking through common items across forms and conditions, and through randomly equivalent groups • Oversampling to reach the target sample size of 1,200 valid cases per form • Initial design of 6 FS forms per test title for scoring/scaling and research studies; modified in response to recruitment challenges
EOC Study Data Collection • Primary FT data (computer-based testing, per the RFP) • Traditional and Integrated forms with common items • Original design had 6 Condition 1 (FS, PBA & EOY) forms for each EOC • Number of forms was reduced for all EOCs, with greater reduction and redistribution for Integrated • Linkage across same-level courses (Alg1/Math1, Geometry/Math2, Alg2/Math3) and diagonally, per the PARCC frameworks, for each EOC • Sample recruitment challenges; volunteers were sought • Target of 1,200 valid cases per form not met despite the forms reduction, with persistent gaps for Integrated Mathematics
Analysis Plan • Classical item analysis: cross-sequence examination of relative item difficulties • Cross-sequence DIF (differential item functioning) • Comparative analyses of factor structure • Cross-sequence linking • Separate calibrations (1PL), linking with the mean-mean procedure • Item maps • For examination of consistency of item difficulties • For examination of consistency of meaning of scores at key points with respect to knowledge, skills, and abilities (KSAs)
Item Difficulty for Common Items • Calculate summary statistics of item difficulties (p-values) for common items administered in each pathway • Convert common item p-values to z-scores and plot to examine the consistency of relative difficulty across the pathways
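The slides do not specify the exact transform; one common convention in classical item analysis is the inverse-normal transform of the proportion correct, z = Φ⁻¹(1 − p), so that harder items receive larger z values (this is the transform underlying the ETS delta scale). The sketch below assumes that convention; the p-value arrays are hypothetical placeholders, not study data.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical p-values (proportion correct) for the same common items
# administered in the Traditional and Integrated pathways.
p_traditional = np.array([0.72, 0.55, 0.38, 0.81, 0.47, 0.63])
p_integrated = np.array([0.69, 0.58, 0.35, 0.78, 0.50, 0.60])

def p_to_z(p):
    # Inverse-normal transform: larger z = harder item.  This is one
    # common convention; the study may have standardized differently.
    return norm.ppf(1.0 - np.asarray(p))

z_trad = p_to_z(p_traditional)
z_int = p_to_z(p_integrated)

# Consistency of relative difficulty across pathways: a high correlation
# of the common-item z-scores supports cross-pathway linking.
r = np.corrcoef(z_trad, z_int)[0, 1]
print(f"Cross-pathway correlation of common-item z-scores: r = {r:.3f}")
```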
Z-Score Summary • Algebra 1, Mathematics 1: Correlations indicate consistency of common item relative difficulty in the two EOC populations, at levels considered sufficient to support linking • Geometry, Mathematics 2: Lower correlation, typically considered insufficient for linking • Algebra 2, Mathematics 3: Correlation at level considered sufficient to support linking
Separate Calibrations, Linking • For dichotomous items, the 1PL (Rasch) model • For polytomous items, the one-parameter partial credit (1PPC) model • After separate calibrations, examined correlations of the item difficulty parameter estimates for the common items in each EOC pair • Item parameter estimates for each EOC course pair were placed on the same scale using the mean-mean common-item linking procedure
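For Rasch-family models, the mean-mean procedure reduces to a single additive constant: the difference between the mean common-item difficulties on the two scales. A minimal sketch with hypothetical difficulty estimates (the actual item parameters are not published in the slides):

```python
import numpy as np

# Hypothetical Rasch difficulty estimates for the common items from two
# separate calibrations (e.g., Algebra 1 as reference, Mathematics 1 as new).
b_ref = np.array([-1.10, -0.25, 0.40, 1.05, 0.70])  # reference scale
b_new = np.array([-0.85, -0.05, 0.62, 1.30, 0.95])  # new scale

# Mean-mean linking constant: the shift that aligns the mean common-item
# difficulty on the new scale with that on the reference scale.
shift = b_ref.mean() - b_new.mean()

# Apply the constant to place all new-form parameters (common and unique)
# on the reference scale.
b_new_linked = b_new + shift
print(f"Linking constant: {shift:.3f}")
print("Linked difficulties:", np.round(b_new_linked, 3))
```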
Correlations of Common Item Difficulty Parameter Estimates • Algebra 1 with Mathematics 1 .92 • Algebra 2 with Mathematics 3 .92 • Geometry with Mathematics 2 .84
Item Maps • Item maps for each course included both course-specific items and common items, separately identified. • The common items provide the vehicle for aligning the items from the two courses. • The criterion for locating items on the map is a specified response probability (RP67). Metric: scale score = (RP67 theta × 100) + 400
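Under the Rasch model, the RP67 location of an item is the theta at which the probability of a correct response is 0.67: solving P(θ) = 0.67 gives θ = b + ln(0.67/0.33) ≈ b + 0.71. A small sketch, assuming the logistic Rasch form with no scaling constant (the slides do not state the exact parameterization):

```python
import math

RP = 0.67  # response probability criterion (RP67)

def rp67_scale_score(b):
    # Assumes P(theta) = 1 / (1 + exp(-(theta - b))); solving P = RP
    # gives theta = b + ln(RP / (1 - RP)) ~= b + 0.708.
    theta = b + math.log(RP / (1.0 - RP))
    # Reporting metric from the slide: scale score = theta * 100 + 400.
    return theta * 100.0 + 400.0

# Hypothetical item: an item of average difficulty (b = 0) maps to ~471.
print(round(rp67_scale_score(0.0)))  # -> 471
```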
Expert Review (Subject Matter Experts): Rating Scale Question: Does obtaining a Score of X (showing what a student knows and can do in terms of item content) on Test I match what it means to obtain a Score of X on Test II? Responses: • Yes, very much so • For the most part, but there are some differences • Somewhat, but weakly • No, not at all
Expert Review: First Set of Ratings • Interpret the meaning of scores at key points on the scale in terms of the KSAs represented by the distribution of items in the vicinity of the score. Key scale scores: 550, 650, 750 • Review items located near the 3 scale points and interpret performance on the two tests. All items and item-specific information were provided. • Side-by-side comparison of maps for designated Traditional-Integrated EOC pairs • Compare the distribution of items on each item map • Examine the pattern of common-item performance across EOCs, and relative to the unique items within each
Rating Tasks Provide ratings at values of 550, 650, 750, and Overall for each of the following: • Course level • Algebra 1 / Mathematics 1 • Geometry / Mathematics 2 • Algebra 2 / Mathematics 3 • Aggregate level (end of 3-course sequence) • Traditional Sequence / Integrated Sequence
Group Discussion of Item Maps/Ratings SMEs discussed results and were given the opportunity to change their ratings during the second meeting • Second ratings for Algebra 1 / Mathematics 1 indicated less comparability than the initial ratings • Second ratings for the Traditional pathway with the Integrated pathway indicated more comparability than the initial ratings
Item Mapping Summary • Algebra 1 with Mathematics 1 • Responses were close to evenly distributed among ratings of 1 to 3 • Algebra 2 with Mathematics 3 • Modal response was (2) For the most part • 87.5% of responses were either 1 or 2 • Geometry with Mathematics 2 • Modal response was that the math skills were not comparable • 67% of responses were either (3) Somewhat, but weakly or (4) No, not at all • Aggregate level • Majority of responses were (2) For the most part, but there are some differences
Limitations • Results from field test data do not always translate directly to operational administration results. • The small sample sizes, especially for the Integrated Mathematics courses, make firm conclusions problematic. • Operational administrations should yield larger samples and therefore more stable results, allowing firmer conclusions.
Conclusions • The data suggest separate scales for Geometry and Mathematics 2. • Concordance tables may be a possibility for aligning scores if common-item correlations are high enough; however, this will likely yield concordant scores that differ substantially in meaning, that is, in the underlying knowledge, skills, and abilities needed to obtain each score. • For the Algebra 1/Mathematics 1 and Algebra 2/Mathematics 3 comparisons, the data from the small samples do not strongly support concurrent calibration. • Depending on operational results, options for reporting may include linking the separate IRT scales to support a common reporting scale, or concordance tables to align scores.