How is Testing Supposed to Improve Schooling? Edward Haertel April 15, 2012 NCME Career Award Address Vancouver, British Columbia
Measuring versus Influencing • Measuring • Relies directly on informational content of specific test scores • Influencing • Effects intended to flow from testing per se, independent of specific test results • Deliberate efforts to raise test scores • Changing perceptions or ideas
Example: Weekly Spelling Test • Measuring • Note words often missed (guides reteaching) • Assign grades • Guide students’ review following testing • Influencing • Motivate studying • Convey importance of spelling proficiency
Leap from measuring to influencing Arguments … claim … program will lead to improvements in school effectiveness and student achievement by focusing … attention … on demanding content. Yet, the validity arguments … attend only to the descriptive part of the interpretive argument …. The validity evidence … tends to focus on scoring and generalization to the content domain for the test. The claim that the imposition of the accountability requirements will improve the overall performance of schools and students is taken for granted. Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Interpretive Argument • Scoring • Alignment, DIF, scaling, norming, equating, … • Generalization • Score precision, reliability, generalizability, … • Extrapolation • Score as reflection of intended construct • Decision or Implication • Use in guiding action or informing description
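To make the Generalization step concrete, one standard index of score precision from classical test theory is the standard error of measurement; the worked numbers below are hypothetical, not drawn from the address.

$$\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}$$

With a hypothetical score SD of $\sigma_X = 15$ and reliability $\rho_{XX'} = .91$, $\mathrm{SEM} = 15\sqrt{.09} = 4.5$, so an observed score of 100 is best read as a band of roughly $100 \pm 4.5$ (one SEM). Checking that this band is tight enough for the intended use is what the Generalization question asks before any extrapolation or decision.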
“Appropriate test use and sound interpretation of test scores are likely to remain primarily the responsibility of the test user.” Standards for Educational and Psychological Testing, p. 111 Not our concern?
Process too linear? • Curriculum Framework • Test Specification • Item Writing • Forms Assembly • Tryout and Revision • Administration • Scaling
Today’s Focus • Achievement tests taken by students • Some attention to aptitude tests as well • Exclude tests taken by teachers • Include uses of student test scores to evaluate teachers • Exclude testing for individual diagnosis of special needs
Testing and Prior Instruction • Curriculum-Dependent Test Question: may assume prior knowledge and skills; may probe reasoning with what is already known; may “drill deeper,” testing application of concepts • Curriculum-Neutral Test Question: must include requisite information with item; must set up context in order to probe reasoning; often limited to testing knowledge of concept definitions
Instructional Guidance • Formative Assessment (informal) • Scoring • Sound items adequately sampling domain? • Generalization • Test scores with adequate precision? • Extrapolation • Mastery extends beyond test per se? • Decision or Implication • Used to adapt teaching work to meet learning needs?
Instructional Guidance • Formative Assessment (highly structured) • Winnetka Plan • Programmed Instruction approaches • Benjamin Bloom’s Mastery Learning • Pittsburgh LRDC’s IPI Math Curriculum • Criterion-Referenced Testing movement
Instructional Guidance • Formative Assessment (highly structured) • Scoring • Questions mapped well to behavioral objectives • Generalization • Multiple items highly redundant • Extrapolation • ??? Assume decomposability, decontextualization • Decision or Implication • Relied on cut scores, simple rules; insufficient attention to actual effects
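As a concrete illustration of the “cut scores, simple rules” critique on this slide, here is a minimal sketch of the kind of per-objective mastery rule these systems relied on; the objective names, item counts, and 80% cutoff are illustrative assumptions, not taken from any particular program.

```python
# Minimal sketch of a mastery-learning decision rule (illustrative only).
# Objective names, item counts, and the 0.80 cutoff are assumed, not drawn
# from the Winnetka Plan, IPI, or any other specific program.

CUTOFF = 0.80  # assumed proportion correct required to declare "mastery"

def mastery_decision(items_correct: int, items_total: int) -> str:
    """Classify a student on one behavioral objective from item counts."""
    proportion = items_correct / items_total
    return "advance" if proportion >= CUTOFF else "reteach"

# Hypothetical per-objective results for one student
results = {
    "two-digit addition": (9, 10),
    "two-digit subtraction": (6, 10),
}
for objective, (correct, total) in results.items():
    print(objective, "->", mastery_decision(correct, total))
```

The slide’s point lands here: the rule is mechanical, the cutoff is stipulated rather than validated, and nothing in it checks whether “reteach” decisions actually improve learning.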
Student Placement and Selection • IQ-based tracking • GATE programs • English Learner status (Entry / Exit) • Minimum Competency Tests (MCTs) / High School Exit Exams (HSEEs) • Advanced Placement / International Baccalaureate • SAT / ACT • …
IQ-Based Tracking • Rationale • Teachers deliver uniform instruction to all students in a classroom • Students learn at different rates • Or, have different “capacities” • Grouping students by ability will improve efficiency because all will receive content at a rate appropriate to their ability • This will reduce wasted effort and frustration
IQ-Based Tracking • Context • Increasing immigration (since late 19th century) • Perceived success of Army Alpha • Scientific School Management movement • Prevailing hereditarian views
IQ-Based Tracking • Scoring • Scores free from bias and distortion? • Generalization • High correlations across forms and occasions • Extrapolation • Assumed based on strong theory, some criterion-related validity evidence • Decision or Implication • Largely unexamined
Student Placement and Selection • IQ-based tracking • GATE programs • English Learner status (Entry / Exit) • MCTs / HSEEs • Advanced Placement (AP) / International Baccalaureate (IB) • SAT / ACT • …
Comparing Educational Approaches • ESEA-mandated Project Head Start evaluations • Evaluations of NSF-sponsored science curricula • National Diffusion Network • What Works Clearinghouse • Both RCTs and quasi-experimental research
Educational Management • Measuring Schools • NCLB • Adequate Yearly Progress (AYP) determinations • Intervention for schools “in need of improvement” • Measuring Teachers • “Value-Added” Models
The “measuring” purpose (Educational Management) is only part of the story; “influencing” interacts with “measuring.”
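To show the mechanical “measuring” core of an AYP determination, here is a minimal sketch under simplified, assumed rules: the 95% participation requirement was part of NCLB, but the 60% proficiency target and the subgroup counts are hypothetical (actual annual measurable objectives varied by state, year, and subject).

```python
# Minimal sketch of an NCLB-style AYP check (illustrative only).
# The 95% participation rule follows NCLB; the 0.60 proficiency target
# and the subgroup counts below are hypothetical assumptions.

TARGET = 0.60             # assumed annual measurable objective (proficiency rate)
MIN_PARTICIPATION = 0.95  # NCLB participation requirement

def meets_ayp(subgroups: dict) -> bool:
    """Every subgroup must test >= 95% of students and meet the target."""
    for tested, enrolled, proficient in subgroups.values():
        if tested / enrolled < MIN_PARTICIPATION:
            return False
        if proficient / tested < TARGET:
            return False
    return True

school = {
    "all students": (480, 500, 310),  # 96% tested, ~65% proficient
    "EL students":  (95, 100, 52),    # 95% tested, ~55% proficient
}
print(meets_ayp(school))  # False: the EL subgroup misses the assumed target
```

The conjunctive, every-subgroup rule is part of why “influencing” interacts with “measuring”: the determination itself creates incentives, not just information.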
“Value-Added” Models for Teacher Evaluation • Scoring • May require vertical scaling • Bias due to violations of model assumptions • Generalization • Extra error due to student sampling and sorting • Extrapolation • Score gains as proxy for teacher effectiveness / teaching quality broadly defined • Decision or Implication • Largely unexamined
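To ground the Scoring and Generalization concerns above, here is a minimal sketch of one common value-added formulation: a covariate-adjustment model with teacher indicator variables, fit by ordinary least squares on simulated data. The data-generating numbers are invented for illustration, and operational VAMs add further covariates, multiple cohorts, and shrinkage.

```python
import numpy as np

# Minimal covariate-adjustment VAM sketch on simulated data (illustrative).
rng = np.random.default_rng(0)

n_teachers, n_per = 20, 25
teacher = np.repeat(np.arange(n_teachers), n_per)   # student -> teacher
true_effect = rng.normal(0, 0.2, n_teachers)        # latent teacher effects
prior = rng.normal(0, 1, n_teachers * n_per)        # prior-year scores
current = 0.7 * prior + true_effect[teacher] + rng.normal(0, 0.5, prior.size)

# Design matrix: prior score plus a full set of teacher dummies
# (no separate intercept; each dummy serves as that teacher's intercept).
dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)
X = np.column_stack([prior, dummies])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)

# Teacher effects are identified only up to location; correlation with the
# true effects sidesteps that.
est_effects = beta[1:]
print(round(np.corrcoef(true_effect, est_effects)[0, 1], 2))
```

Even this idealized simulation assigns students to teachers at random; the slide’s “student sampling and sorting” point is that real classrooms violate exactly that assumption, adding both noise and bias to the estimated effects.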
Influencing • Purposes of directing effort, focusing the system, and shaping perceptions rarely stand alone • Direct use of test scores for measuring is always included • Influencing purposes may nonetheless be more significant
Shaping Public Perceptions "Test results can be reported to the press. … Based on past experience, policymakers can reasonably expect increases in scores in the first few years of a program … with or without real improvement in the broader achievement constructs that tests … are intended to measure." R. L. Linn (2000, p. 4)
Attending to Influencing Purposes in Test Validation • Importance • Influence as ultimate rationale for testing • Place in the interpretive argument where unintended consequences arise • Challenge • Purposes not clearly articulated • Required data not available for years • Required research methods unfamiliar • Disincentives to look closely • Expensive, may not matter
Clarity of Purpose SBAC and PARCC Consortia must have: “A theory of action that describes in detail the causal relationships between specific actions or strategies … and … desired outcomes …, including improvement in student achievement and college- and career-readiness.”
Availability of Data • Familiar problem in literature on program evaluation • Plan ahead • Attend to implementation cycle • Do not ask for results too soon • Plan for “audit” tests? • Phased implementation?
Expanded Methods and Theories • Can we view testing phenomena through other disciplinary lenses? • Validation requires both empirical evidence and theoretical rationales • Common sense gets us part way there • Where does theory for “Influencing” purposes come from? • What research methods can we borrow?
Costs and Incentives • Need increased investment in comprehensive validation • Need help from agents, agencies beyond test makers, test administrators • Need more explicit press for comprehensive validation in RFPs, public discourse