How is Testing Supposed to Improve Schooling? Edward Haertel April 15, 2012 NCME Career Award Address Vancouver, British Columbia
Measuring versus Influencing • Measuring • Relies directly on informational content of specific test scores • Influencing • Effects intended to flow from testing per se, independent of specific test results • Deliberate efforts to raise test scores • Changing perceptions or ideas
Example: Weekly Spelling Test • Measuring • Note words often missed (guides reteaching) • Assign grades • Guide students’ review following testing • Influencing • Motivate studying • Convey importance of spelling proficiency
Leap from measuring to influencing Arguments … claim … program will lead to improvements in school effectiveness and student achievement by focusing … attention … on demanding content. Yet, the validity arguments … attend only to the descriptive part of the interpretive argument …. The validity evidence … tends to focus on scoring and generalization to the content domain for the test. The claim that the imposition of the accountability requirements will improve the overall performance of schools and students is taken for granted. Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Interpretive Argument • Scoring • Alignment, DIF, scaling, norming, equating, … • Generalization • Score precision, reliability, generalizability, … • Extrapolation • Score as reflection of intended construct • Decision or Implication • Use in guiding action or informing description
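To make the Generalization step concrete, one standard index of score precision from classical test theory is the standard error of measurement; the worked numbers below are hypothetical, not drawn from the address.

$$\mathrm{SEM} = \sigma_X \sqrt{1 - \rho_{XX'}}$$

With a hypothetical score SD of $\sigma_X = 15$ and reliability $\rho_{XX'} = .91$, $\mathrm{SEM} = 15\sqrt{.09} = 4.5$, so an observed score of 100 is best read as a band of roughly $100 \pm 4.5$ (one SEM). Checking that this band is tight enough for the intended use is what the Generalization question asks before any extrapolation or decision.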
“Appropriate test use and sound interpretation of test scores are likely to remain primarily the responsibility of the test user.” Standards for Educational and Psychological Testing, p. 111 Not our concern?
Process too linear? • Curriculum Framework • Test Specification • Item Writing • Forms Assembly • Tryout and Revision • Administration • Scaling
Today’s Focus • Achievement tests taken by students • Some attention to aptitude tests as well • Exclude tests taken by teachers • Include uses of student test scores to evaluate teachers • Exclude testing for individual diagnosis of special needs
Testing and Prior Instruction • Curriculum-Dependent Test Question: may assume prior knowledge and skills; may probe reasoning with what is already known; may “drill deeper,” testing application of concepts • Curriculum-Neutral Test Question: must include requisite information with item; must set up context in order to probe reasoning; often limited to testing knowledge of concept definitions
Instructional Guidance • Formative Assessment (informal) • Scoring • Sound items adequately sampling domain? • Generalization • Test scores with adequate precision? • Extrapolation • Mastery extends beyond test per se? • Decision or Implication • Used to adapt teaching work to meet learning needs?
Instructional Guidance • Formative Assessment (highly structured) • Winnetka Plan • Programmed Instruction approaches • Benjamin Bloom’s Mastery Learning • Pittsburgh LRDC’s IPI Math Curriculum • Criterion-Referenced Testing movement
Instructional Guidance • Formative Assessment (highly structured) • Scoring • Questions mapped well to behavioral objectives • Generalization • Multiple items highly redundant • Extrapolation • ??? Assume decomposability, decontextualization • Decision or Implication • Relied on cut scores, simple rules; insufficient attention to actual effects
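As a concrete illustration of the “cut scores, simple rules” critique on this slide, here is a minimal sketch of the kind of per-objective mastery rule these systems relied on; the objective names, item counts, and 80% cutoff are illustrative assumptions, not taken from any particular program.

```python
# Minimal sketch of a mastery-learning decision rule (illustrative only).
# Objective names, item counts, and the 0.80 cutoff are assumed, not drawn
# from the Winnetka Plan, IPI, or any other specific program.

CUTOFF = 0.80  # assumed proportion correct required to declare "mastery"

def mastery_decision(items_correct: int, items_total: int) -> str:
    """Classify a student on one behavioral objective from item counts."""
    proportion = items_correct / items_total
    return "advance" if proportion >= CUTOFF else "reteach"

# Hypothetical per-objective results for one student
results = {
    "two-digit addition": (9, 10),
    "two-digit subtraction": (6, 10),
}
for objective, (correct, total) in results.items():
    print(objective, "->", mastery_decision(correct, total))
```

The slide’s point lands here: the rule is mechanical, the cutoff is stipulated rather than validated, and nothing in it checks whether “reteach” decisions actually improve learning.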
Student Placement and Selection • IQ-based tracking • GATE programs • English Learner status (Entry / Exit) • Minimum Competency Tests (MCTs) / High School Exit Exams (HSEEs) • Advanced Placement / International Baccalaureate • SAT / ACT • …
IQ-Based Tracking • Rationale • Teachers deliver uniform instruction to all students in a classroom • Students learn at different rates • Or, have different “capacities” • Grouping students by ability will improve efficiency because all will receive content at a rate appropriate to their ability • This will reduce wasted effort and frustration
IQ-Based Tracking • Context • Increasing immigration (since late 19th century) • Perceived success of Army Alpha • Scientific School Management movement • Prevailing hereditarian views
IQ-Based Tracking • Scoring • Scores free from bias and distortion? • Generalization • High correlations across forms and occasions • Extrapolation • Assumed based on strong theory, some criterion-related validity evidence • Decision or Implication • Largely unexamined
Student Placement and Selection • IQ-based tracking • GATE programs • English Learner status (Entry / Exit) • MCTs / HSEEs • Advanced Placement (AP) / International Baccalaureate (IB) • SAT / ACT • …
Comparing Educational Approaches • ESEA-mandated Project Head Start evaluations • Evaluations of NSF-sponsored science curricula • National Diffusion Network • What Works Clearinghouse • Both RCTs and quasi-experimental research
Educational Management • Measuring Schools • NCLB • Adequate Yearly Progress (AYP) determinations • Intervention for schools “in need of improvement” • Measuring Teachers • “Value-Added” Models
The “measuring” purpose (Educational Management) is only part of the story; “influencing” interacts with “measuring.”
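To show the mechanical “measuring” core of an AYP determination, here is a minimal sketch under simplified, assumed rules: the 95% participation requirement was part of NCLB, but the 60% proficiency target and the subgroup counts are hypothetical (actual annual measurable objectives varied by state, year, and subject).

```python
# Minimal sketch of an NCLB-style AYP check (illustrative only).
# The 95% participation rule follows NCLB; the 0.60 proficiency target
# and the subgroup counts below are hypothetical assumptions.

TARGET = 0.60             # assumed annual measurable objective (proficiency rate)
MIN_PARTICIPATION = 0.95  # NCLB participation requirement

def meets_ayp(subgroups: dict) -> bool:
    """Every subgroup must test >= 95% of students and meet the target."""
    for tested, enrolled, proficient in subgroups.values():
        if tested / enrolled < MIN_PARTICIPATION:
            return False
        if proficient / tested < TARGET:
            return False
    return True

school = {
    "all students": (480, 500, 310),  # 96% tested, ~65% proficient
    "EL students":  (95, 100, 52),    # 95% tested, ~55% proficient
}
print(meets_ayp(school))  # False: the EL subgroup misses the assumed target
```

The conjunctive, every-subgroup rule is part of why “influencing” interacts with “measuring”: the determination itself creates incentives, not just information.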
“Value-Added” Models for Teacher Evaluation • Scoring • May require vertical scaling • Bias due to violations of model assumptions • Generalization • Extra error due to student sampling and sorting • Extrapolation • Score gains as proxy for teacher effectiveness / teaching quality broadly defined • Decision or Implication • Largely unexamined
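To ground the Scoring and Generalization concerns above, here is a minimal sketch of one common value-added formulation: a covariate-adjustment model with teacher indicator variables, fit by ordinary least squares on simulated data. The data-generating numbers are invented for illustration, and operational VAMs add further covariates, multiple cohorts, and shrinkage.

```python
import numpy as np

# Minimal covariate-adjustment VAM sketch on simulated data (illustrative).
rng = np.random.default_rng(0)

n_teachers, n_per = 20, 25
teacher = np.repeat(np.arange(n_teachers), n_per)   # student -> teacher
true_effect = rng.normal(0, 0.2, n_teachers)        # latent teacher effects
prior = rng.normal(0, 1, n_teachers * n_per)        # prior-year scores
current = 0.7 * prior + true_effect[teacher] + rng.normal(0, 0.5, prior.size)

# Design matrix: prior score plus a full set of teacher dummies
# (no separate intercept; each dummy serves as that teacher's intercept).
dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)
X = np.column_stack([prior, dummies])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)

# Teacher effects are identified only up to location; correlation with the
# true effects sidesteps that.
est_effects = beta[1:]
print(round(np.corrcoef(true_effect, est_effects)[0, 1], 2))
```

Even this idealized simulation assigns students to teachers at random; the slide’s “student sampling and sorting” point is that real classrooms violate exactly that assumption, adding both noise and bias to the estimated effects.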
Influencing • Purposes of directing effort, focusing the system, and shaping perceptions rarely stand alone • Direct use of test scores for measuring is always included • Influencing purposes may nonetheless be more significant
Shaping Public Perceptions "Test results can be reported to the press. … Based on past experience, policymakers can reasonably expect increases in scores in the first few years of a program … with or without real improvement in the broader achievement constructs that tests … are intended to measure." R. L. Linn (2000, p. 4)
Attending to Influencing Purposes in Test Validation • Importance • Influence as ultimate rationale for testing • Place in the interpretive argument where unintended consequences arise • Challenge • Purposes not clearly articulated • Required data not available for years • Required research methods unfamiliar • Disincentives to look closely • Expensive, may not matter
Clarity of Purpose SBAC and PARCC Consortia must have: “A theory of action that describes in detail the causal relationships between specific actions or strategies … and … desired outcomes …, including improvement in student achievement and college- and career-readiness.”
Availability of Data • Familiar problem in literature on program evaluation • Plan ahead • Attend to implementation cycle • Do not ask for results too soon • Plan for “audit” tests? • Phased implementation?
Expanded Methods and Theories • Can we view testing phenomena through other disciplinary lenses? • Validation requires both empirical evidence and theoretical rationales • Common sense gets us part way there • Where does theory for “Influencing” purposes come from? • What research methods can we borrow?
Costs and Incentives • Need increased investment in comprehensive validation • Need help from agents, agencies beyond test makers, test administrators • Need more explicit press for comprehensive validation in RFPs, public discourse