
Testing Screw-ups



  1. Testing Screw-ups Howard Wainer National Board of Medical Examiners

  2. Rudiments of Quality Control In any large enterprise, errors are inevitable. Thus the quality of the enterprise must be judged not only by the frequency and seriousness of the errors but also by how they are dealt with once they occur.

  3. I will illustrate my talk today with three principal examples:
  • A scoring error made by NCS Pearson, Inc. (under contract to the College Entrance Examination Board) on the October 8, 2005 administration of the SAT Reasoning test.
  • A September 2008 report published by the National Association for College Admission Counseling, in which one of the principal recommendations was for colleges and universities to reconsider requiring the SAT or the ACT for applicants.
  • A tragic misuse of the results of the PSSA third-grade math test at an elementary school in southeastern Pennsylvania, where a teacher was suspended without pay because her class did unexpectedly well.

  4. One excuse offered in all three cases was that resources were limited and they couldn’t afford to do as complete a job as turned out to be required. This illustrates a guiding axiom (it is almost a commandment): “If you think doing it right is expensive, try doing it wrong.”

  5. A guiding principle of quality control In any complex process, zero errors is an impossible dream. Striving for zero errors will only inflate your costs and hinder more important development (because it will use up resources that can be more effectively used elsewhere).

  6. Consider a process that has one error per thousand. How can this be fixed? Having inspectors won’t do it: looking for a rare error is so boring that they will fall asleep and miss it, or they will see proper things and incorrectly call them errors. Having recheckers check the checkers increases the costs and introduces its own errors. Quis custodiet ipsos custodes? (Who will guard the guards themselves?)

  7. As a dramatic example, let us consider the accuracy of mammograms at spotting carcinomas. Radiologists correctly identify carcinomas in mammograms about 85% of the time, but they incorrectly identify benign spots as cancer about 10% of the time. To see what this means, let’s consider what such errors yield in practice. Suppose you have a mammogram and it is positive. What is the probability that you have cancer?

  8. The probability of a false positive is 10% and of a correct positive is 85%. There are 33.5 million mammograms administered each year, and 187,000 breast cancers are correctly diagnosed. That means about 33.3 million women without cancer get a mammogram, and 10% of them (3.33 million) will get a positive initial diagnosis. Simultaneously, 85% of the 187 thousand women with breast cancer (159 thousand) will also get a positive diagnosis.

  9. So the probability that someone with a positive diagnosis actually has cancer is: 159,000 / (159,000 + 3,330,000) ≈ 159/3,500 ≈ 4.5%. Or: if you are diagnosed with breast cancer from a mammogram, more than 95% of the time you do not have it!
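  The computation above is just Bayes’ rule applied to base rates. Here is a minimal sketch in Python using the slide’s own numbers (any rounding differences are mine):

```python
# A minimal check of the slide's arithmetic (Bayes' rule), using its numbers.
mammograms = 33_500_000      # administered each year
cancers = 187_000            # breast cancers among those screened

sensitivity = 0.85           # correct positives
false_pos_rate = 0.10        # benign spots incorrectly called cancer

true_pos = sensitivity * cancers                      # ~159,000 women
false_pos = false_pos_rate * (mammograms - cancers)   # ~3,330,000 women

ppv = true_pos / (true_pos + false_pos)
print(f"P(cancer | positive mammogram) = {ppv:.2%}")  # about 4.5%, as on the slide
```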

  10. If you are diagnosed negative, you almost certainly are negative. The best advice one might give someone considering a mammogram is: “If it’s negative, believe it; if it’s positive, don’t.”

  11. It is natural to ask whether the expense of doing such a test on so many people is worth this level of accuracy. Even if it is, it is still worth doing these error calculations, especially separately by age/risk group. Current evidence strongly suggests that if you have no other risk factors and are under 50 (under 60?), don’t bother.

  12. This brings us to the QC dictum that, “You cannot inspect quality into a product, you must build it in.” Nowhere is this better illustrated than with US automobile manufacturing.

  13. Because errors are inevitable, the measure of the quality of a testing program must be broader than just the proportion of times errors are made. That is surely one part, but so are the seriousness of the errors and, most importantly, how the errors are dealt with once they are uncovered.

  14. My experience, gathered over 40 years in the testing business, has given me the distinct impression that while technical errors in test scoring, test equating, test calibrating, and all of the other minutiae required in modern large-scale assessment occur with disappointing frequency, their frequency and their import are dwarfed by errors in score interpretation and misuse, and by errors generated in attempting to cover them up. I will illustrate that today.

  15. Example 1. Mis-scoring the SAT On October 8, 2005, NCS Pearson, Inc., under contract to the College Entrance Examination Board, scored an administration of the SAT Reasoning test. Subsequently it was discovered that a scanning error had affected 5,024 examinees’ scores (out of almost 1.5 million). After rescoring, it was revealed that 4,411 scores were too low and 613 were too high.

  16. The exams that were underscored were revised upward, and the revised scores were reported to the designated colleges and universities. The College Board decided that “it would be unfair to re-report the scores of the 613 test takers” whose scores were erroneously too high and hence did not correct them. A motion for a preliminary injunction to force the re-scoring of these students’ tests was then filed in the United States District Court for the District of Minnesota (Civil Action No. 06-1481 PAM/JSM).

  17. The College Board reported that: “550 of 613 test takers had scores that were overstated by only 10 points; an additional 58 had scores that were overstated by only 20 points. Only five test takers,…, had score differences greater than 20 points” (three had 30-point gains, one 40, and one 50).

  18. Why not correct the errors? What were the statistical arguments used by the College Board for not revising the erroneously too-high scores?
  1. “None of the 613 test takers had overstated scores that were in excess of the standard error of measurement” or (in excess of)
  2. “the standard error of the difference for the SAT.”
  3. “More than one-third of the test takers – 215 out of 613 – had higher SAT scores on file.”

  19. A statistical expansion The College Board’s argument is that if two scores are not significantly different (e.g., 10-30 points apart), they are not different. This is often called “the margin-of-error fallacy”: if it could be, it is.

  20. An empirical challenge Let me propose a wager aimed at those who buy the College Board argument. Let us gather a large sample, say 100,000 pairs of students, where the members of each pair go to the same college and have the same major, but one has a 30-point higher (though not statistically significantly different) SAT score. If you truly believe that such a pair is indistinguishable, you should be indifferent about which one will eventually earn the higher GPA in college. So here is the bet – I’ll take the higher-scoring one (for, say, $10 each). The amount of money I win (or lose) should be an indicant of the truth of the College Board’s argument.
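  To see why the bet is favorable, here is a hedged simulation sketch; the score distribution, the GPA model, and the noise level are my illustrative assumptions, not figures from the talk. Any positive relationship between score and GPA makes the higher scorer win more than half the time:

```python
import random

random.seed(1)

def gpa_proxy(ability, noise_sd=150.0):
    # GPA as ability plus heavy noise, so score and GPA are only
    # modestly correlated -- assumed values, purely for illustration.
    return ability + random.gauss(0.0, noise_sd)

stake, winnings = 10, 0
for _ in range(100_000):                 # 100,000 pairs, as in the wager
    low = random.gauss(1100, 200)        # lower-scoring member of a pair
    high = low + 30                      # a 30-point, "non-significant" edge
    winnings += stake if gpa_proxy(high) > gpa_proxy(low) else -stake

print(f"Net result of betting on the higher scorer: ${winnings:,}")
# The higher scorer wins ~55% of pairs, so over 100,000 bets the
# wager pays off steadily -- the scores are not interchangeable.
```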

  21. Tests have three possible goals: They can be used as: 1. measuring instruments, 2. contests, or 3. prods. Sometimes tests are asked to serve multiple purposes, but the character of a test required to fulfill one of these goals is often at odds with what would be required for another. But there is always a primary purpose, and a specific test is usually constructed to accomplish its primary goal as well as possible.

  22. A typical instance of tests used for measurement is the diagnostic test, which probes an examinee’s strengths and weaknesses with the goal of suggesting remediation. For tests of this sort the statistical machinery of standard errors is of obvious importance, for one would want to be sure that the weaknesses noted are likely to be real, so that the remediation is not chasing noise.

  23. A test used as a contest does not require such machinery, for the goal is to choose a winner. Consider, as an example, the Olympic 100m dash – no one suggests that a difference of one-hundredth of a second at the finish portends a “significant” difference, nor that if the race were run again the same result would necessarily occur. The only outcome of interest is “who won.” The key aspect here is that the most important characteristic of a test used as a contest is that it be fair; we cannot have stopwatches of different speeds on different runners.

  24. Do these arguments hold water? The SAT is, first and foremost, a contest. Winners are admitted, given scholarships, etc.; losers are not. The College Board’s argument that the scoring errors are “not material” because the range of inflated test scores is within the standard error of measurement (or the standard error of the difference) is specious. The idea of a standard error is to provide some idea of the stability of the score if the test were taken repeatedly without any change in the examinee’s ability.

  25. But that is not relevant in a contest. For the SAT, what matters is not what your score might have been had you taken it many more times, but rather what your score actually was – indeed what it was in comparison with the others who took it and are in competition with you for admission, for scholarships, etc.

  26. The results shown in this figure tell us many things. Two of them are: (1) a ten-point gain in the middle of the distribution yields a bigger increase in someone’s percentile rank than would be the case in the tails (to be expected, since people are more densely packed in the middle, so a ten-point jump carries you over more of them), and (2) the gain can be as much as 3.5 percentile ranks.

  27. But 1,475,623 people took the SAT in 2005, which means we can scale the figure in terms of people instead of percentiles. Very low- or very high-scoring examinees will move ahead of only four or five thousand other test takers through the addition of an extra, unearned ten points. But someone who scores near the middle of the distribution (where, indeed, most people are) will leapfrog as many as 54,000 other examinees.
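  A back-of-the-envelope version of that people-scaling: the count of examinees leapfrogged is just the share of the score distribution sitting inside a ten-point step. The normal shape and the section-style mean and SD below are my assumptions for illustration; the talk’s figure used the actual 2005 distribution.

```python
from statistics import NormalDist

TAKERS = 1_475_623                   # SAT takers in 2005, from the slide

# Assumed normal approximation on a 200-800 section-style scale; the mean
# and SD are illustrative guesses, not figures from the talk.
scores = NormalDist(mu=508, sigma=111)

for s in (260, 510, 760):
    share = scores.cdf(s + 10) - scores.cdf(s)   # mass inside a 10-point step
    print(f"10 unearned points at {s}: pass ~{share * TAKERS:,.0f} examinees")
# Prints roughly 4,800 in the low tail, ~53,000 in the middle, ~3,600 in the
# high tail -- the same pattern the slide describes.
```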

  28. Note that the error in scoring was tiny, affecting only a small number of examinees. The big problem was caused by the College Board’s decision to leave erroneous scores uncorrected and to make up some statistical gobbledygook to try to justify a scientifically and ethically indefensible position.

  29. Apparently others agreed with my assessment of the validity of the College Board’s argument, for on August 24, 2007 the New York Times reported that the College Board and Pearson agreed to pay $2.85 million ($4,650/person). This was ratified by Judge Joan Ericksen on November 29, 2007. Edna Johnson, a spokeswoman for the College Board, said, “We were eager to put this behind us and focus on the future.” Pearson spokesman Dave Hakensen said the company declined comment.

  30. Example 2. National Association for College Admission Counseling’s September 2008 report on admissions testing On September 22, 2008, the New York Times carried the first of three articles about a report, commissioned by the National Association for College Admission Counseling, that was critical of the current, widely used college admissions exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard. The report was reasonably wide-ranging and drew many conclusions while offering alternatives. Although well-meaning, many of the suggestions only make sense if you say them very fast.

  31. Among their conclusions were:
  • 1. Schools should consider making their admissions “SAT optional,” that is, allowing applicants to submit their SAT/ACT scores if they wish, but not requiring them. The commission cites the success that pioneering schools with this policy have had as proof of concept.
  • 2. Schools should consider eliminating the SAT/ACT altogether and substituting achievement tests. They cite the unfair effect of coaching as the motivation. They were not naïve enough to suggest that, because there is little coaching for achievement tests now, none would be offered if those tests became high stakes; rather, they argued that such coaching would be directly related to schooling and hence more beneficial to education than coaching that focuses on test-taking skills.
  • 3. The use of the PSAT with a rigid qualification cut-score for scholarship programs such as the Merit Scholarships should be halted immediately.

  32. Recommendation 1. Make the SAT optional: It is useful to examine the schools that have instituted “SAT optional” policies and see whether the admissions process has been hampered at those schools. The first reasonably competitive school to institute such a policy was Bowdoin College, in 1969. Bowdoin is a small, highly competitive liberal arts college in Brunswick, Maine. A shade under 400 students a year elect to matriculate at Bowdoin, and roughly a quarter of them choose not to submit SAT scores. The following table summarizes the entering classes at Bowdoin and five other institutions whose entering freshman classes had approximately the same average SAT score. At the other five institutions, the students who didn’t submit SAT scores used ACT scores instead.

  33. Table 1: Six colleges/universities with similar observed mean SAT scores for the entering class of 1999.

  34. To know how Bowdoin’s SAT policy is working, we need to know two things: • How did the students who didn’t submit SAT scores do at Bowdoin in comparison with those who did? • Would the non-submitters’ performance at Bowdoin have been predicted by their SAT scores, had the admissions office had access to them?

  35. The first question is easily answered by looking at their first-year grades at Bowdoin.

  36. But would their SAT scores have provided information missing from the other submitted information? Ordinarily this would be impossible to answer, for these students did not submit their SAT scores. However, all of these students actually took the SAT, and through a special data-gathering effort at the Educational Testing Service we found that the students who didn’t submit their scores behaved sensibly: realizing that their lower-than-average scores would not help their cause at Bowdoin, they chose not to submit them. Here is the distribution of SAT scores for those who submitted them as well as for those who did not.

  37. As it turns out, the SAT scores for the students who did not submit them would have accurately predicted their lower performance at Bowdoin. In fact the correlation between grades and SAT scores was higher for those who didn’t submit them (0.9) than for those who did (0.8).

  38. So not having this information does not improve the academic performance of Bowdoin’s entering class – on the contrary it diminishes it. Why would a school opt for such a policy? Why is less information preferred to more?

  39. There are surely many answers to this, but one is seen in an augmented version of the earlier Table 1: if all of the students in Bowdoin’s entering class had their SAT scores included, the average SAT at Bowdoin would fall from 1323 to 1288, and instead of being second among these six schools, Bowdoin would be tied for next to last.
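  The arithmetic behind that shrinkage also tells us what the withheld scores must have looked like. A minimal sketch, treating the earlier slide’s “roughly a quarter” of non-submitters as exactly 25% (my simplification):

```python
# From the slides: submitters averaged 1323, the full class averages 1288,
# and roughly a quarter of the class withheld scores (treated as 25% here).
submit_frac = 0.75
submit_mean = 1323
full_mean = 1288

# The full-class mean is a weighted average of the two groups,
# so we can solve for the mean of the hidden (non-submitting) group.
hidden_mean = (full_mean - submit_frac * submit_mean) / (1 - submit_frac)
print(f"Implied mean SAT of non-submitters: {hidden_mean:.0f}")  # ~1183
```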

  40. Since mean SAT scores are a key component of school rankings, a school can game those rankings by allowing its lowest-scoring students to be excluded from the average. I believe that Bowdoin’s adoption of this policy pre-dates US News & World Report’s rankings, so that was unlikely to have been its motivation, but I cannot say the same for schools that have chosen such a policy more recently.

  41. Recommendation 2. Using Achievement Tests Instead Driving the Commission’s recommendations was the notion that the differential availability of commercial coaching made admissions testing unfair. They recognized that the 100-point gain (on the 1200-point SAT scale) that coaching schools often tout as a typical outcome was hype, and agreed with estimates from more neutral sources that about 20 points was more likely. But they deemed even 20 points too many. The Commission pointed out that there is no widespread coaching for achievement tests, but agreed that should the admissions option shift to achievement tests, the coaching would likely follow. This would be no fairer to applicants who could not afford extra coaching, but at least the coaching would cover material more germane to the subject matter and less related to test-taking strategies.

  42. One can argue with the logic of this: a test that is less subject-oriented and aimed more at estimating general aptitude might have greater generality, and a test that is less tied to specific subject matter might be fairer to students whose schools have more limited resources for teaching a broad range of courses. I find these arguments persuasive, but I have no data at hand to support them. So instead I will take a different, albeit more technical, tack: the psychometric reality of replacing general aptitude tests with achievement tests makes the kinds of comparisons that schools need among different candidates impossible.

  43. When all students take the same tests, we can compare their scores on the same basis. The SAT and ACT were constructed specifically to be suitable for a wide range of curricula; SAT-Math is based on mathematics no more advanced than 8th grade. Contrast this with what would be the case with achievement tests. There would need to be a range of tests, and students would choose the subset that best displayed both the coursework they have had and the areas they felt they were best in. Some might take chemistry, others physics; some French, others music. The current system has students typically taking three achievement tests (SAT-II). How can such very different tests be scored so that outcomes on different tests can be compared?

  44. Do you know more French than I know physics? Was Mozart a better composer than Einstein was a physicist? How can admissions officers make sensible decisions through incomparable scores?

  45. How are SAT-II exams scored currently? Or, more specifically, how had they been scored for decades as of nine years ago, when I left the employ of ETS – I don’t know if anything has changed in the interim. They were all scored on the familiar 200-800 scale, but similar scores on two different tests are only vaguely comparable. How could they be comparable? What is done is that tests in mathematics and science are roughly equated using SAT-Math, the aptitude test that everyone takes, as an equating link. In the same way, tests in the humanities and social sciences are equated using SAT-Verbal. This is not a great solution, but it is the best that can be done in a very difficult situation. Comparing history with physics cannot be trusted for anything but the roughest distinctions.

  46. One obvious approach would be to norm-reference each test, so that someone who scores at the average of all those who take a particular test gets a 500, someone a standard deviation higher gets a 600, and so on. This would work if the people who take each test were, in some sense, of equal ability. But that is not only unlikely, it is empirically false. The average student taking the French achievement test would starve to death in a French restaurant, whereas the average student who takes the Hebrew achievement test, if dropped onto the streets of Tel Aviv in the middle of the night, would do fine. Happily, the latter students also do much better on the SAT-Verbal test, and so the equating helps. This is not true for the Spanish test, where a substantial portion of those taking it come from Spanish-speaking homes.
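  Here is a minimal sketch of that norm-referencing scheme (the raw scores are invented for illustration). It shows how two groups of very different ability both land on the same 500-centered scale, which is exactly the problem the slide describes:

```python
from statistics import mean, stdev

def norm_reference(raw):
    """Map the group mean to 500 and one standard deviation to 100 points."""
    mu, sd = mean(raw), stdev(raw)
    return [round(500 + 100 * (x - mu) / sd) for x in raw]

# Invented raw scores: the second group knows far more than the first,
# yet norm-referencing hands both groups the same 500-centered scale.
french = [35, 42, 50, 58, 65]     # weaker group (illustrative numbers)
hebrew = [70, 77, 85, 93, 100]    # much stronger group (illustrative numbers)

print(norm_reference(french))     # [375, 433, 500, 567, 625]
print(norm_reference(hebrew))     # identical -- a 500 means different things
```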

  47. Substituting achievement tests is not a feasible option unless admissions officers are prepared to have subject-matter quotas. Too inflexible for the modern world, I reckon.

  48. Recommendation 3. Halt the use of a cut-score on the PSAT to qualify for Merit Scholarships One of the principal goals of the Merit Scholarship program is to distribute a limited amount of money to highly deserving students without regard to their sex, ethnicity, or geographic location. This is done by first using a very cheap and wide-ranging screening test: the PSAT is dirt-cheap and is taken by about 1.5 million students annually. The Commission objected to a rigid cut-off on the screening test. They believed that if the cut-off were, say, at a score of 50, we could not say that someone who scored 49 was different enough to warrant exclusion from further consideration. They suggested replacing the PSAT with a more thorough and accurate set of measures for initial screening.

  49. The problem with a hard and fast cut score is one that has plagued testing for more than a century. The Indian Civil Service system, on which the American Civil Service system is based, found a clever way around it. The passing mark to qualify for a civil service position was 20. But if you received a 19 you were given one ‘honor point’ and qualified. If you scored 18 you were given two honor points, and again qualified. If you scored 17, you were given three honor points, and you qualified. But if you scored 16 you did not qualify, for you were four points away.
