Comparability: What, Why, When and the Changing Landscape of Computer-Based Testing NCSA, Detroit, June 2010
Presenters • Kevin King, Utah State Office of Education • Sarah Susbury, Virginia Department of Education • Dona Carling, Measured Progress • Kelly Burling, Pearson • Chris Domaleski, National Center for the Improvement of Educational Assessment
Introduction • For unsurprising reasons, many states deliver statewide assessments in both paper and computer modes. • Also unsurprisingly, the comparability of scores across the two modes is a concern. • What about comparability issues when switching computer-based testing interfaces?
Introduction • This session will explore the idea of comparability in the context of statewide testing using two administration modes. • Challenges addressed will include: • satisfying peer review, • the impact of switching testing providers, • changing interfaces, • adding tools and functionality, and • adding item types.
What is Comparability? • The ability of a system to deliver data that can be compared in standard units of measurement and by standard statistical techniques with the data delivered by other systems. (online statistics dictionary) • Comparability refers to the commonality of score meaning across testing conditions including delivery modes, computer platforms, and scoring presentation. (Bennett, 2003)
What is Comparability? From Peer Review Guidance (2007) • Comparability of results Many uses of State assessment results assume comparability of different types: comparability from year to year, from student to student, and from school to school. Although this is difficult to implement and to document, States have an obligation to show that they have made a reasonable effort to attain comparability, especially where locally selected assessments are part of the system.
What is Comparability? • Section 4, Technical Quality • 4.4 • When different test forms or formats are used, the State must ensure that the meaning and interpretation of results are consistent. • Has the State taken steps to ensure consistency of test forms over time? • If the State administers both an online and paper and pencil test, has the State documented the comparability of the electronic and paper forms of the test?
What is Comparability? • Section 4, Technical Quality 4.4 • Possible Evidence • Documentation describing the State’s approach to ensuring comparability of assessments and assessment results across groups and time. • Documentation of equating studies that confirm the comparability of the State’s assessments and assessment results across groups and across time, as well as follow-up documentation describing how the State has addressed any deficiencies.
References • Bennett, R.E. (2003). Online assessment and the comparability of score meaning. Princeton, NJ: Educational Testing Service. • U.S. Department of Education (2009). Standards and Assessments Peer Review Guidance: Information and Examples for Meeting Requirements of the No Child Left Behind Act of 2001. Washington, DC: U.S. Government Printing Office.
Utah’s Comparability Story Kevin King Assessment Development Coordinator
What we know about comparability in relation to CBT and PBT • Previous concerns have centered on PBT–CBT comparability • There is potentially more variability between one CBT system and another than between CBT and PBT • There are policy considerations about when to pursue comparability studies and when not to
CBT Context in Utah • Which tests • 27 multiple-choice criterion-referenced tests (CRTs) • Grades 3 – 11 English language arts • Grades 3 – 7 math, Pre-Algebra, Algebra 1, Geometry, Algebra 2 • Grades 4 – 8 science, Earth Systems Science, Physics, Chemistry, Biology • Timeline, CBT share of approximately 1.2 million total test administrations • 2001 – 2006: 4% – 8% • 2007: 8% • 2008: 50% • 2009: 66% • 2010: 80%
How Utah has studied comparability • Focus on PBT-to-CBT comparisons • Prior to this year, that was what the context warranted
Study #1 • 2006 (8% CBT participation) • Item by item performance comparison • Matched Samples Comparability Analyses • Using NRT as a basis for the matched set • ELA (3, 5, & 8), Math (3, 5, & 8), Science (5 & 8) • Results • Actionable conclusions
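To make the matched-samples idea concrete, here is a minimal sketch of the kind of analysis Study #1 describes: examinees are stratified on a norm-referenced test (NRT) score, equal-sized PBT and CBT samples are drawn within each stratum, and a matched item p-value is compared by mode. All data and field names (nrt, mode, item) are simulated for illustration, not Utah's operational records.

```python
# Simulated matched-samples mode comparison; values are fabricated.
from collections import defaultdict
import random

random.seed(1)

# Each examinee record: an NRT score used for matching, a delivery
# mode, and a 0/1 score on one focal item.
examinees = [
    {"nrt": random.randint(20, 40), "mode": m, "item": random.randint(0, 1)}
    for m in ("PBT", "CBT") for _ in range(500)
]

# Stratify item scores by NRT score within each mode.
by_nrt = defaultdict(lambda: {"PBT": [], "CBT": []})
for e in examinees:
    by_nrt[e["nrt"]][e["mode"]].append(e["item"])

# Within each NRT stratum, draw equal-sized samples from each mode,
# then pool the strata to form the matched samples.
matched = {"PBT": [], "CBT": []}
for stratum in by_nrt.values():
    n = min(len(stratum["PBT"]), len(stratum["CBT"]))
    matched["PBT"] += random.sample(stratum["PBT"], n)
    matched["CBT"] += random.sample(stratum["CBT"], n)

# Compare the item's p-value (proportion correct) across matched samples.
p_pbt = sum(matched["PBT"]) / len(matched["PBT"])
p_cbt = sum(matched["CBT"]) / len(matched["CBT"])
print(f"matched p-values: PBT={p_pbt:.3f}  CBT={p_cbt:.3f}  diff={p_cbt - p_pbt:+.3f}")
```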
2006 Study Results • Conclusion: additional investigations of mode-by-item interactions should be conducted • Policy overtones: rapid movement to 100% CBT affects the feasibility of these investigations
Study #2 • 2008 (50% CBT participation) • Focus on mode transition (i.e., from PBT one year to CBT the next year) • Determine PBT and CBT raw-score-to-scale-score (rs-ss) tables for all courses, as sketched below • CBT examinees benefit where the PBT scale score is lower than the CBT scale score at the same raw score • Very few variances between rs-ss tables (no variances at the proficiency cut) • Conclusions and policy: move forward with CBT as the base for data decisions
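A minimal sketch of the rs-ss comparison just described, assuming invented conversion tables and an invented proficiency cut: build the two lookup tables, flag raw scores where the conversions diverge, and check whether the raw score needed to reach the cut differs by mode.

```python
# Hypothetical PBT and CBT raw-score-to-scale-score tables for a
# 50-item test; conversions and the cut score are invented.
pbt_table = {rs: 130 + 2 * rs for rs in range(51)}
cbt_table = {rs: 130 + 2 * rs + (1 if rs in (12, 13) else 0)
             for rs in range(51)}  # small low-end variance, for illustration

PROFICIENT_SS = 160  # invented proficiency cut on the scale-score metric

# Flag every raw score where the two conversions disagree.
variances = {rs: (pbt_table[rs], cbt_table[rs])
             for rs in pbt_table if pbt_table[rs] != cbt_table[rs]}
print("rs-ss variances:", variances)

# Check whether the raw score needed to reach the cut differs by mode.
def raw_at_cut(table, cut):
    return min(rs for rs, ss in table.items() if ss >= cut)

print("raw score at cut: PBT =", raw_at_cut(pbt_table, PROFICIENT_SS),
      " CBT =", raw_at_cut(cbt_table, PROFICIENT_SS))
```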
Comparability CBT to CBT • Issues • Variations due to local configurations • Screen resolution (e.g., 800x600 vs. 1280x1024) • Monitor size • Browser differences • How much of an issue is this?
Impact of Switching Testing Providers • A current procurement dilemma • How to mitigate • What could be different will be different • Item displays • How items are displayed (e.g., as graphic images, with text wrapping) • How graphics are handled by the different systems • Item transfer between providers and interoperability concerns • These brought about re-entering items, with potential variations in presentation
Evolution of the Systems • How to make decisions as the same system with the same vendor advances • Tool availability • Text wrapping • HTML coding • When do you take advantage of a technology advance at the sacrifice of item similarity?
Changing Interfaces • Portal/Interface for item access • How students navigate the system • How tools function (e.g., highlighter, cross out, item selection) • How advanced tools function • Text enlarger • Color contrast
Adding item types • Utah will be bringing on technology-enhanced items. • How will the different tests from year to year be comparable? • What about PBT versions of tests administered as accommodations?
Others • Technical disruptions during testing • Student familiarity with workstation and environment
Straddling • PBT and CBT • CBT and PBT • Vendors • Operating Systems and Browsers • Curriculum changes
Final Thoughts • Forced us to really address why comparability is important AND what that means • Is “equal” always “fair”?
Virginia’s Comparability Story Sarah Susbury Director of Test Administration, Scoring, and Reporting
CBT Context in Virginia • Which tests • Grades 3 – 8 Reading and End-of-Course (EOC) Reading • Grades 3 – 8 Math, EOC Algebra 1, EOC Geometry, EOC Algebra 2 • Grades 3, 5, and 8 Science, EOC Earth Science, EOC Biology, EOC Chemistry • Grade 3 History, Grades 5 – 8 History, EOC VA & US History, EOC World History I, EOC World History II, EOC World Geography • Phased approach (EOC → middle school → elementary school) • Participation by districts was voluntary • Timeline (growth in online tests administered) • 2001: 1,700 • Spring 2004: 265,000 • Spring 2006: 924,000 • Spring 2009: 1.67 million • Spring 2010: 2.05 million
Comparability Studies • Conducted comparability studies with the online introduction of each new EOC subject (2001 – 2004) • Students were administered a paper/pencil test form and an equivalent online test form in the same administration • Students were not provided scores until both tests were completed • Students were aware they would be awarded the higher of the two scores (motivation)
Comparability Studies • Results indicated varying levels of comparability. • Due to Virginia’s graduation requirements of passing certain EOC tests, decision was made to equate online tests and paper/pencil tests separately. • Required planning to ensure adequate n-counts in both modes would be available for equating purposes. • Comparability has improved over time.
Accommodations: CBT vs. PBT • Some accommodations transfer easily between modes: • Read-aloud and audio test administrations • Allowable manipulatives (bilingual dictionary, calculator, etc.) • Answer transcription • Visual aids, magnifiers • Other accommodations do not readily transfer: • Flexible administration (variable number of test items at a time) • Braille (cost of Braille writers) • Large-print forms (ability to hold the test form)
Comparability/Changes in Computer Hardware • Screen resolution: changing from 800 × 600 to 1024 × 768 pixels • Eliminating older hardware from use for online testing • Changing the amount of scrolling needed for full view • Revise flow of text? • Desktop vs. laptop computers • Less of an issue than in the early 2000s • Laptop computers vs. “Netbooks” • Variability of “Netbooks”
Changes in Computer Interface • New vendor or changes in current vendor’s system • Test navigation may vary • Advancing or returning to items • Submitting a test for scoring • Audio player controls • Test-taking tools & strategies may vary • Available tools (e.g., highlighter, choice eliminator, mark for review, etc.) • Changes in test security • Prominent display of student’s name on screen?
Introduction of New Item Types • Virginia is implementing technology enhanced items simultaneously with revised Standards of Learning (SOL) • Mathematics SOL • Revised mathematics standards approved in 2009 • Field testing (embedded) new technology enhanced math items during 2010-2011 test administrations • Revised math assessments implemented in 2011-2012 with new standard setting • English and Science SOL • Revised English and science standards approved in 2010 • Field testing (embedded) new English and science items during 2010-2011 test administrations • Revised assessments (reading, writing, and science) implemented in 2013
To change, or not to change. • Sometimes there is no choice: • New technology; prior technology becomes obsolete • Procurement changes/decisions • Sometimes there is a choice (or timing options): • Advances in technology • Advances in assessment concepts
Mitigating Comparability Issues • A Shared Responsibility • Provide teachers with training and exposure to changes in time to impact instruction and test prep. • Systems, test navigation, test items, accommodations, etc. • Provide students with training and exposure to changes prior to test preparation and prior to testing • Practice tests in the new environment, sample new item types, etc. • Communicate changes and change processes to all stakeholders in the educational community
Considerations for Research and Evaluation Kelly Burling Pearson
Comparability Studies • When • What • Why • How • What’s Next
WHEN? • Whenever! • Comparability Studies for Operational Systems & • Comparability Studies for Research Purposes
WHEN • Research • Introducing CBT • Curricular changes • New item types • New provider • New interface • Changes over time • Changes in technology use in schools • Operational • Introducing CBT • Curricular changes • New item types • New provider • New interface • Any time there are changes in a high-stakes environment
What • See slides 5, 6, 7, & 8
Designs • Evaluation Criteria • Validity • Psychometric • Statistical
Wang, T., & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive testing: Issues, criteria and an example. Journal of Educational Measurement, 38, 19–49.
Validity • User Experience & Systems Designs • Cognitive Psychology • Construct Dimensionality • Relationships to External Criterion Variables • Sub-group differences
Psychometric • Score distribution • Reliability • Conditional SEM • Mean difference • Propensity Scores • Item Level • Mean difference • IRT parameter differences • Response distributions • DIF
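As one concrete instance of the item-level checks listed above, here is a minimal Mantel-Haenszel DIF sketch that treats delivery mode as the grouping variable (PBT as reference, CBT as focal). The counts are fabricated for illustration; an operational study would also test the statistic for significance.

```python
import math

# Fabricated counts. Each stratum is a total-score level with
# (right, wrong) counts for PBT (reference) and CBT (focal) examinees
# on one item: (pbt_right, pbt_wrong, cbt_right, cbt_wrong).
strata = [
    (40, 60, 35, 65),
    (55, 45, 50, 50),
    (70, 30, 66, 34),
    (85, 15, 82, 18),
]

# Mantel-Haenszel common odds ratio across score strata.
num = den = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    num += a * d / n
    den += b * c / n
alpha_mh = num / den

# ETS delta metric: |delta| >= 1.5 (with statistical significance)
# is conventionally treated as large DIF.
delta_mh = -2.35 * math.log(alpha_mh)
print(f"MH odds ratio = {alpha_mh:.3f}, ETS delta = {delta_mh:+.3f}")
```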
Statistical • Evaluating the assumptions underlying the • scoring model • test characteristics • study design
Next Challenges • Performance assessments • E-portfolio with digitally created content, e-portfolio with traditional content, physical portfolio • Platforms • Devices • Data Mining
Addressing NCLB/Peer Review Chris Domaleski National Center for the Improvement of Educational Assessment
Establishing Comparability – Applicable Standards • NCLB Federal Peer Review Guidance (4.4) requires the state to document the comparability of the electronic and paper forms of the test • AERA, APA, and NCME Joint Standards (4.10) “a clear rationale and supporting evidence should be provided for any claim that scores earned on different forms of a test may be used interchangeably.”
Potential Evidence to Support Comparability Claims (1) • Design • Item and form development processes (e.g., comparable blueprints and specifications) • Procedures to ensure comparable presentation of and interaction with items (e.g., can examinees review the entire passage when responding to passage-dependent items?) • Don’t forget within-mode consistency. For example, do items render consistently for all computer users? • Adequate pilot and field testing of each mode • Administration • Certification process for computer-based administration to ensure technical requirements are met • Examinees have an opportunity to gain familiarity with the assessment mode • Resources (e.g., calculator, marking tools) are equivalent • Accommodations are equivalent
Potential Evidence to Support Comparability Claims (2) • Analyses • Comparability of item statistics • To what extent do the same items produce different statistics by mode? • Are there differences by item ‘bundles’ (e.g., passage- or stimulus-dependent items)? • DIF studies • Comparability of scores • Comparability of total score by test (e.g., grade, content) • Comparability of total score by group (e.g., SWD, ELL, etc.) • Comparability of total score by ability region (e.g., by quartiles, TCC correspondence) • DTF • Classification consistency studies (illustrated below)
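To illustrate the classification consistency studies mentioned above, here is a minimal sketch assuming a paired design in which each student tested in both modes (as in Virginia’s dual-administration studies). The cut score and score pairs are fabricated for illustration.

```python
# Fabricated (paper score, online score) pairs for the same students,
# and an invented proficiency cut.
CUT = 400
pairs = [(395, 401), (410, 408), (388, 392), (420, 419),
         (399, 396), (405, 411), (380, 379), (402, 398)]

# Agreement rate: do both modes yield the same proficient/not-proficient call?
agree = sum((p >= CUT) == (o >= CUT) for p, o in pairs)
print(f"classification agreement: {agree}/{len(pairs)} = {agree / len(pairs):.2%}")

# Students whose proficiency decision flips with mode warrant a closer look.
flips = [(p, o) for p, o in pairs if (p >= CUT) != (o >= CUT)]
print("decisions that flip by mode:", flips)
```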
Concluding Thoughts • No single approach to demonstrating comparability, and no single piece of evidence, is likely to be sufficient • Don’t assume that findings apply to all grades, content areas, and subgroups • Item type may interact with presentation mode • Design considerations • Are inferences based on equivalent groups? If so, how is this supported? • Are inferences based on repeated measures? If so, are order effects addressed? • Be clear about the standard of evidence required • Effect size? • Classification differences?
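On the “standard of evidence” point, a minimal sketch of a pre-specified effect-size criterion: compute a pooled standardized mean difference (Cohen’s d) between mode score distributions and flag values beyond a chosen threshold. The scores and the threshold are fabricated for illustration.

```python
import math
import statistics

# Fabricated scale scores by mode.
pbt_scores = [402, 398, 415, 388, 407, 399, 410, 395]
cbt_scores = [400, 401, 412, 390, 404, 403, 408, 397]

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = math.sqrt(((nx - 1) * statistics.variance(x) +
                           (ny - 1) * statistics.variance(y)) / (nx + ny - 2))
    return (statistics.mean(x) - statistics.mean(y)) / pooled_sd

THRESHOLD = 0.2  # invented pre-specified criterion for flagging a mode effect
d = cohens_d(cbt_scores, pbt_scores)
print(f"mode effect size d = {d:+.3f} -> "
      f"{'flag' if abs(d) >= THRESHOLD else 'within tolerance'}")
```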