Enhancing the Technical Quality of the North Carolina Testing Program: An Overview of Current Research Studies Nadine McBride, NCDPI Melinda Taylor, NCDPI Carrie Perkis, NCDPI
Overview • Comparability • Consequential validity • Other projects on the horizon
Comparability • Previous Accountability Conference presentations provided early results • Research funded by an Enhanced Assessment Grant from the US Department of Education • Focused on the following topics: • Translations • Simplified language • Computer-based • Alternative formats
What is Comparability? Not just “same score” • Same content coverage • Same decision consistency • Same reliability & validity • Same other technical properties (e.g., factor structure) • Same interpretations of test results, with the same level of confidence
Goal • Develop and evaluate methods for determining the comparability of scores from test variations to scores from the general assessments • Users should be able to draw the same inferences, with the same level of confidence, from variations of the same test.
Research Questions • What methods can be used to evaluate score comparability? • What types of information are needed to evaluate score comparability? • How do different methods compare in the types of information about comparability they provide?
Products • Comparability Handbook • Current Practice • State Test Variations • Procedures for Developing Test Variations and Evaluating Comparability • Literature Reviews • Research Reports • Recommendations • Designing Test Variations • Evaluating Comparability of Scores
Results – Translations • Replication methodology helpful when faced with small samples and widely different proficiency distributions • Gauge variability due to sampling (random) error • Gauge variability due to distribution differences • Multiple methods for evaluating structure are helpful • Effect size criteria helpful for DIF • Congruence between structural & DIF results
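As a concrete illustration of the effect-size criteria mentioned above, the minimal sketch below computes a Mantel-Haenszel DIF index and applies the ETS delta effect-size categories. It is not NCDPI's procedure; the function, data layout, and thresholds are illustrative assumptions, and the full ETS rules also involve significance tests, which are omitted here.

```python
# Minimal sketch (not NCDPI code): Mantel-Haenszel DIF with the ETS delta
# effect-size categories, assuming dichotomous (0/1) item scores and a
# total-score matching variable. Significance tests are omitted.
import numpy as np

def mh_dif(item, group, total):
    """Return the MH common odds ratio, ETS delta, and effect-size category.

    item  : 0/1 scores on the studied item
    group : 0 = reference group, 1 = focal group
    total : total test score used to form matching strata
    """
    item, group, total = map(np.asarray, (item, group, total))
    num = den = 0.0
    for k in np.unique(total):                      # one stratum per score
        s = total == k
        a = np.sum(s & (group == 0) & (item == 1))  # reference, correct
        b = np.sum(s & (group == 0) & (item == 0))  # reference, incorrect
        c = np.sum(s & (group == 1) & (item == 1))  # focal, correct
        d = np.sum(s & (group == 1) & (item == 0))  # focal, incorrect
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den                  # MH common odds ratio
    delta_mh = -2.35 * np.log(alpha_mh)   # ETS delta metric

    # Effect-size thresholds only (the complete ETS rules also require
    # significance tests): A = negligible, B = moderate, C = large
    if abs(delta_mh) < 1.0:
        category = "A"
    elif abs(delta_mh) <= 1.5:
        category = "B"
    else:
        category = "C"
    return alpha_mh, delta_mh, category
```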
Results – Simplified Language • Carefully documented and followed development procedures focused on maintaining the item construct can support comparability arguments. • Linking/equating approaches can be used to examine and/or establish comparability. • Comparing item statistics using the non-target group can provide information about comparability.
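One simple way to examine comparability through linking, as noted in the bullet above, is a linear (mean-sigma) transformation onto the general-test scale. The sketch below is illustrative only, assuming score samples from comparable groups (or a common-item link) are available; the function name and example numbers are hypothetical.

```python
# Minimal sketch (illustrative only): linear, mean-sigma linking that places
# scores from a simplified-language form onto the general form's scale,
# assuming score samples from comparable groups are available.
import numpy as np

def linear_link(variation_scores, general_scores):
    """Return a function mapping variation-form scores to the general scale
    via y = A * x + B (mean-sigma transformation)."""
    x = np.asarray(variation_scores, dtype=float)
    y = np.asarray(general_scores, dtype=float)
    A = y.std(ddof=1) / x.std(ddof=1)   # slope from the standard deviations
    B = y.mean() - A * x.mean()         # intercept from the means
    return lambda score: A * score + B

# Example (hypothetical numbers): place a raw score of 42 on the general scale
# link = linear_link(variation_scores, general_scores)
# equated_score = link(42)
```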
Results – Computer-based • Propensity score matching produced similar results to studies using within-subjects samples. • Propensity score method provides a viable alternative to the difficult-to-implement repeated measures study. • Propensity score method is sensitive to group differences. For instance, the method performed better when 8th and 9th grade groups were matched separately.
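The propensity score matching described above can be outlined as follows. This is a minimal sketch, not the study's actual code; the DataFrame layout, column names, and caliper value are assumptions for illustration.

```python
# Minimal sketch (not the study's code): propensity score matching of
# computer-based to paper-based examinees. The DataFrame layout, column
# names, and caliper value are illustrative assumptions.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def match_groups(df, mode_col, covariate_cols, caliper=0.05):
    """Nearest-neighbor matching on the estimated propensity score.

    df             : pandas DataFrame of examinees
    mode_col       : 1 = computer-based, 0 = paper-based
    covariate_cols : covariates such as prior scores and demographics
    """
    X = df[covariate_cols].to_numpy()
    t = df[mode_col].to_numpy()

    # Estimate each examinee's propensity to take the computer-based form
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    df = df.assign(pscore=ps)

    treated = df[df[mode_col] == 1]   # computer-based examinees
    control = df[df[mode_col] == 0]   # paper-based examinees

    # Match each computer-based examinee to the closest paper-based one
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    dist, idx = nn.kneighbors(treated[["pscore"]])

    # Discard matches whose propensity scores differ by more than the caliper
    keep = dist.ravel() <= caliper
    matched_treated = treated[keep]
    matched_control = control.iloc[idx.ravel()[keep]]
    return matched_treated, matched_control
```

Matching 8th and 9th grade groups separately, as the slide notes, would amount to calling a function like this once per grade group.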
Results – Alternative Formats • The burden of proof is much heavier for this type of test variation. • A study based on students eligible for the general test can provide some, but not solid, evidence of comparability. • Judgment-based studies combined with empirical studies are needed to evaluate comparability. • More research is needed on methods for evaluating what constructs each test type is measuring.
Lessons Learned • It takes a village… • Cooperative effort of SBE, IT, districts and schools to implement special studies • Researchers to conduct studies, evaluate results • Cooperative effort of researchers and TILSA members to review study design and results • Assessment community to provide insight and explore new ideas
Consequential Validity • What is consequential validity? • Amalgamation of evidence regarding the degree to which the use of test results has social consequences • Consequences can be both positive and negative; intended and unintended
Whose Responsibility? • Role of the Test Developer versus the Test User? • Responsibility and roles are not clearly defined in the literature • A state may be designated as both a test developer and a test user
Test Developer Responsibility • Generally responsible for… • Intended effects • Likely side effects • Persistent unanticipated effects • Promoted use of scores • Effects of testing
Test Users’ Responsibility • Generally responsible for… • Use of scores • The further a use departs from the intended uses, the greater the user’s responsibility
Role of Peer Review • Element 4.1 • For each assessment, including the alternate assessment, has the state documented the issue of validity… with respect to the following categories: • g) Has the state ascertained whether the assessment produces intended and unintended consequences?
Study Methodology • Focus Groups • Conducted in five regions across the state • Led by NC State’s Urban Affairs • Completed in December 2009 and January 2010 • Input from teachers and administrative staff • Included large, small, rural, urban, and suburban schools
Study Methodology • Survey Creation • Drafts currently modeled after surveys conducted in other states • However, most of those were conducted 10+ years ago • Surveys will be finalized after focus group results are reviewed
Study Methodology • Survey administration • Testing Coordinators to receive survey notification • Survey to be available from late March through April
Study Results • Stay tuned! • Hope to make the report publicly available on the DPI testing website
Other Research Projects • Trying out different item types • Item location effects • Auditing
Contact Information • Nadine McBride Psychometrician nmcbride@dpi.state.nc.us • Melinda Taylor Psychometrician mtaylor@dpi.state.nc.us • Carrie Perkis Data Analyst cperkis@dpi.state.nc.us