
Improving the Ways We Report Test Scores



Presentation Transcript


  1. Improving the Ways We Report Test Scores Ronald Hambleton, April Zenisky University of Massachusetts Amherst, USA CERA Annual Meeting, June 1, 2010.

  2. Important Time in the Testing Field • New provincial tests in Canada and state tests in the USA are being introduced as part of educational reform (e.g., Massachusetts went from 7 tests to more than 24 in 10 years). • Users need to understand and use the scores and score reports correctly (or substantial funding is wasted).

  3. Considerable investment of time and money has been made to address technical problems: • IRT modeling of data, test scoring of performance data, test score equating, reliability estimation, computer technology, DIF analyses, standard-setting, and validity studies.

  4. Surprisingly, test score reporting attracts very little attention! • Can you name a research study? • Without clear and meaningful reporting of information, the other steps are of less value! • Also, on this topic more than other technical topics, many persons think they are experts; everyone has an idea here about what to do!

  5. AERA, APA, NCME Test Standards: What do they say about score scales and reporting? 5.10: "When test score information is released… those responsible should provide appropriate interpretations." Information is needed about content coverage, meaning of scores, precision of scores, common misinterpretations, and proper use.

  6. 13.14 …Score reports should be accompanied by a clear statement of the degree of measurement error associated with each score or classification level and information on how to interpret the scores.
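
To make the Standards' requirement concrete, here is a minimal sketch of the classical-test-theory quantity usually reported, the standard error of measurement (SEM = SD × √(1 − reliability)), and a 95% band around an observed score. The numbers are hypothetical, and operational programs often report IRT-based conditional standard errors instead.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def error_band(score: float, sd: float, reliability: float, z: float = 1.96):
    """Approximate 95% error band around an observed score."""
    e = sem(sd, reliability)
    return (score - z * e, score + z * e)

# Hypothetical scale: SD = 25, reliability = 0.91 -> SEM = 7.5
print(error_band(240, sd=25, reliability=0.91))  # ~ (225.3, 254.7)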

  7. Major Problems in Score Reporting! • Reporting scales and data displays (the reports) are confusing to many persons: percents vs. percentiles; IQ scores; new scales developed by states and provinces; T scores; stanine scores. (A short illustration follows.)
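
One reason these scales confuse readers is that the same performance maps to very different numbers on each of them. A short illustration, assuming normally distributed scores (the usual basis for T scores, deviation IQs, and stanines):

```python
from statistics import NormalDist

def t_score(z: float) -> float:
    return 50 + 10 * z          # T scale: mean 50, SD 10

def iq_score(z: float) -> float:
    return 100 + 15 * z         # deviation IQ: mean 100, SD 15

def stanine(z: float) -> int:
    return max(1, min(9, round(2 * z + 5)))  # nine coarse bands

def percentile_rank(z: float) -> float:
    return 100 * NormalDist().cdf(z)         # NOT percent correct!

z = 1.0  # one standard deviation above the mean
print(t_score(z), iq_score(z), stanine(z), round(percentile_rank(z), 1))
# -> 60, 115, 7, 84.1: four different numbers for the same performance
```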

  8. Major Problems in Score Reporting! • Quantitative literacy is not high (three kinds of persons!). Half of the US population can't read bus schedules. What's 20 million dollars for testing? (1/3 of 1% of the education budget.) • NRT vs. CRT scores.

  9. Major Problems in Score Reporting! • Body of evidence highlighting score reporting problems (e.g., Jaeger) • Reporting scores without error bands • Too much meaningless score information on some reports (called “chart clutter” by Tufte) • Not providing meaningful diagnostic information

  10. Goals of the Presentation 1. Consider student reports: improving the meaning of score scales and diagnostic reports. 2. Mention several emerging methodologies for researching score reports and their utility. 3. Identify a seven-step model for improving score report design and evaluation.

  11. Individual Test Score Reports • In the USA, more than 30,000,000 individual reports go to parents of school children alone. • There are over 1,000 credentialing exams, and some exceed 100,000 candidates (e.g., securities, accountants, nurses).

  12. Shortcomings in the Student Reports (Goodman & Hambleton, AME, 2004) • No stated purpose, no advance organizer, no clues about where to start reading. • Performance categories (typically) are not defined, even briefly. • No error bands on any of the reported scores, or even a hint that errors of measurement (i.e., imprecision) are present!

  13. Shortcomings in the Student Reports • Font is often too small to read easily. • Instructional needs information is not always user-friendly. For example, a parent may be told: "You need help in extending meaning by drawing conclusions and using critical thinking to connect and synthesize information within and across text, ideas, and concepts."

  14. Shortcomings in the Student Reports • Several undefined terms on the displays: percentile, prompt, z score, performance category, achievement level, and more. • Basically, the reports are crowded!

  15. Two Ideas for Score Reports • Benchmarking is one of our favorites and among the most promising: • It capitalizes on item response theory (IRT): strong modeling of data, with items and candidates reported on the same scale. • Researchers have been slow to take advantage of this.

  16. Benchmarking Solution: Makes Scale Scores More Meaningful • Place boundary points on the reporting scale. • Choose a probability associated with "knowing/can do," say, 65%. • Use the ICCs from IRT to develop descriptions of what examinees can and cannot do between boundary points. (A computational sketch follows.)
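
Here is a minimal sketch of the boundary-point computation, assuming a three-parameter logistic (3PL) model with the conventional scaling constant D = 1.7. Inverting the ICC gives the scale location at which an examinee answers the item correctly with probability 0.65; the item parameters below are hypothetical.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve: P(correct | theta)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def rp_location(a, b, c, p=0.65, D=1.7):
    """Theta at which P(correct) = p, by inverting the 3PL ICC.
    Requires p > c (the response probability must exceed guessing)."""
    if p <= c:
        raise ValueError("response probability must exceed the c parameter")
    return b - math.log((1 - p) / (p - c)) / (D * a)

# Hypothetical item: a = 1.2, b = 0.3, c = 0.2
theta65 = rp_location(a=1.2, b=0.3, c=0.2)
print(round(theta65, 3), round(icc_3pl(theta65, 1.2, 0.3, 0.2), 2))  # P = 0.65
```

Items whose 0.65 locations fall between two boundary points then supply the raw material for the "can do / cannot yet do" descriptions attached to that score interval.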

  17. [Figure: a three-parameter (3PL) item characteristic curve (ICC).]

  18. [Figure: item characteristic curves for 60 items at P = 0.65, with reporting-category boundaries (W, N, P). Items and reporting points by topic: Topic 1 (13, 16); Topic 2 (18, 21); Topic 3 (9, 12); Topic 4 (8, 11); Topic 5 (12, 15).]

  19.–24. Making Score Scales More Meaningful [Figures: a sequence of plots building up the benchmarking idea, in which the 0.65 response probability is traced through items' ICCs onto reporting-scale points 400, 500, and 600 to anchor performance descriptions along the scale.]

  25. Meaning of the Mathematics Scale
  • Level 200-290: Students at this level can sometimes solve very basic problems in each of the content areas. For example, they can solve simple arithmetic problems and read simple data displays.
  • Level 300-390: Students at this level show a beginning ability to recall and use mathematical facts and terminology to solve basic problems. For example, they can identify the rule for a simple pattern and solve very routine geometry problems.
  • Level 400-490: Students at this level display the ability to solve a greater variety of basic problems in each of the content areas. For example, they can recognize relationships and solve routine problems presented in verbal, mathematical, or graphical forms.
  • Level 500-590: Students at this level are able to solve multi-step problems in different content areas and can make connections between content areas. For example, they can solve multi-step percent problems and can use algebraic skills to solve geometry problems.
  • Level 600-690: Students at this level show a clear increase in ability to solve more demanding problems, to generalize, to understand mathematical terminology, and to make connections. For example, they can solve complex counting problems involving permutations/combinations, generalize complex patterns, and solve multi-step problems involving geometric/algebraic relationships.
  • Level 700-790: Students at this level have the ability to apply insight, reasoning, and problem solving strategies to solve a wide range of problems both within and across the content areas. For example, they can solve problems involving newly-defined functions in more than two variables and can solve conditional probability problems by constructing and analyzing a table of possible outcomes.
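
Once such descriptions exist, attaching them to reported scores is a simple lookup. A sketch, with the level descriptors abbreviated from the slide above:

```python
# Hypothetical lookup: attach each benchmark description to its scale interval.
LEVELS = {
    (200, 290): "can sometimes solve very basic problems in each content area",
    (300, 390): "beginning ability to recall and use mathematical facts",
    (400, 490): "solves a greater variety of basic problems",
    (500, 590): "solves multi-step problems and connects content areas",
    (600, 690): "generalizes and solves more demanding problems",
    (700, 790): "applies insight and reasoning across content areas",
}

def describe(scale_score: int) -> str:
    for (lo, hi), text in LEVELS.items():
        if lo <= scale_score <= hi:
            return f"Level {lo}-{hi}: {text}"
    return "Score outside the described range"

print(describe(540))  # -> "Level 500-590: solves multi-step problems ..."
```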

  26. Common Diagnostic Report [Figure: candidate results by subdomain category (e.g., math), plotted as bars on a 0%–100% scale: 10%, 75%, 75%, 44%, 18%.]

  27. Highly Problematic Report!! • No sense of measurement error • No guarantee that the items are representative • No basis for score interpretation

  28. A Better Report!! • Confidence bands (a sketch follows below). • A frame of reference: for example, the performance of borderline candidates or of passing candidates.
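
A minimal sketch of the first suggestion, using the normal approximation to the binomial for a percent-correct subdomain score. The approximation is crude for short subdomains, which is precisely the point: with only a handful of items, an honest band is very wide.

```python
import math

def subscore_band(correct: int, n_items: int, z: float = 1.96):
    """Approximate confidence band for a subdomain percent-correct score,
    via the normal approximation to the binomial."""
    p = correct / n_items
    se = math.sqrt(p * (1 - p) / n_items)
    lo = max(0.0, p - z * se)
    hi = min(1.0, p + z * se)
    return round(100 * lo), round(100 * p), round(100 * hi)

# A 10-item subdomain with 7 correct: the band spans roughly 42% to 98%
print(subscore_band(7, 10))
```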

  29. Score Report Design & Evaluation • Experiments • Focus Groups • Think-alouds • Qualitative Reviews from the Field • Tryouts

  30. Seven Steps in Report Development: (1) Define the purpose of the score report; (2) Identify the intended audience(s); (3) Review report examples and the literature; (4) Develop the report(s); (5) Collect data / field test; (6) Revise and redesign; (7) Ongoing maintenance.

  31. Necessary Research • Reducing the size of error bands for knowledge/skill areas, by: improving the quality of test items; improving the targeting of the test; capitalizing on correlational information among the skills or other priors (see the sketch below).
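
One classical instance of using "other priors" is Kelley's regressed score estimate, which shrinks an unreliable observed score toward the group mean; modern subscore augmentation (borrowing strength from correlated skills) generalizes the same idea. A sketch with hypothetical numbers:

```python
def kelley_estimate(observed: float, group_mean: float, reliability: float) -> float:
    """Kelley's regressed score: shrink an unreliable observed score
    toward the group mean in proportion to its reliability."""
    return reliability * observed + (1 - reliability) * group_mean

# A short, unreliable subscore (rel = 0.55) of 45 in a group averaging 60
print(kelley_estimate(45, 60, 0.55))  # -> 51.75, pulled toward the mean
```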

  32. Necessary Research (cont.) • Learning to move from the ICCs, to choosing the number of performance categories, to preparing the descriptive statements that can enhance the meaning of a score scale, and then to validating them.

  33. Final Remarks • Important advances have been made in score reporting. • More research is needed on matching score reports to intended audiences and on evaluating score reports prior to use. • Diagnostic reports are important to users but need more research.

  34. Final Remarks • The seven-step model should be used, and exemplar reports compiled. • We are pleased to see the developments taking place: states, provinces, and countries are beginning to use the tools, and progress can be seen. • See the NCME bibliography by Deng and Yoo, with 70+ pages of references!
