Statistical Detection of Cheating on Multiple Choice Exams: Software, Implementation, and Controversy
George O. Wesolowsky
Professor Emeritus of Management Science
DeGroote School of Business
McMaster University, Hamilton, Ontario
Outline of this Presentation
• Introduction: Cheating on multiple choice tests
• How I got into this
• Outline of statistical detection methodology
• Practical capabilities of SCheck
• Common attitudes to detection and prevention
• Recommendations
Ideal* Writing Conditions
* I have seen more than 30% cheating under such conditions
Less than Ideal Writing Conditions
http://math.berkeley.edu/~ribet/113/
Prevalence of MC Tests and Exams
• 30% (?) of marks in undergraduate classes are given through multiple choice (MC)
• At McGill (20,000 undergraduates) in the Fall Semester of 2002:
  Finals: 83 courses, 15,072 students
  Midterms: 70+ courses, 14,000 students
How They Do It: Copying Sampler
How They Do It: Types of Cheating not Resulting in Similar Responses
I am an impostor
Usually not vulnerable to statistical detection
A guide to cheating during tests and examinations
From Wikibooks, the open-content textbooks collection
http://en.wikibooks.org/wiki/A_guide_to_cheating_during_tests_and_examinations
Contents
1 Preamble
1.1 A few definitions to consider
1.2 The rewards/dangers of cheating
1.2.1 Rationales of cheating
1.2.2 Rationales of prosecuting cheaters
1.2.3 Possible Penalties
2 General notes
3 Techniques
3.1 Copying from a person
3.1.1 Application of codes
3.2 Copying from a pre-written source
3.2.1 Directly from textbook/notes
3.2.2 Cheat Sheet
3.3 Precautions
3.4 Copying from a planted source
3.5 Locating Cheating Material on the Web
3.6 Test previewing
Some Statistics
plagiarized text (CAI)
CAI Research Conducted By Don McCabe (Released In June, 2005) is typical of many studies:
As part of CAI’s Assessment Project, almost 50,000 undergraduates on more than 60 campuses have participated in a nationwide survey of academic integrity since the fall of 2002. The results were disturbing, provocative, and challenging.
On most campuses, 70% of students admitted to some cheating. Close to one-quarter of the participating students admitted to serious test cheating in the past year and half admitted to one or more instances of serious cheating on written assignments.
Faculty are reluctant to take action against suspected cheaters. In Assessment Project surveys involving almost 10,000 faculty, 44% of those who were aware of student cheating in their course in the last three years, have never reported a student for cheating to the appropriate campus authority. Students suggest that cheating is higher in courses where it is well known that faculty members are likely to ignore cheating.
One Method of Cheating Detection
Questionable Statistical Detection
It is not infrequent that instructors, when confronted by a suspected cheating situation, invent their own methodology on the spot. This is usually what I call ‘outlier methodology’. The basis is some way of using the number of wrong answers that two students have in common. It could be simply a count of such 'wrong matches', a proportion, a run length, a ratio with other counts, or a multivariate plot of such variables. The idea is to look for outliers and attribute them to cheating.
Example
Bonnie and Clyde engaged in suspicious behavior. A comparison of responses revealed:
“Bonnie and Clyde are surprisingly similar; 23 matches out of 23 wrong.”
.....C......B...........C........BD............D..DB.A..ABD.B..A..CB...B.........ABAC..A..C
.....C......B...........C........BD............D..DB.A..ABD.B..A..CB...B.........ABAC..A..C
Legend: a letter is a wrong answer (for example, both chose C, which is wrong); . = correct
http://www.astro.washington.edu/fraser/multiple-choice-cheating.html
But then:
The instructor wrote a program:
“My script just returns any match that has a high percentage of matching errors (and sufficient errors to convince you that some thing's up!)”
“Holmes and Watson are surprisingly similar; 8 matches out of 11 wrong.”
..........................D......D....................B.....D.A.......BBA.........BD.B.....
CD..BCC....BB...DC..CA..D.EC.....D......AD...B.DDAB...B..A..A.AA....CDB.AA.C......BDDBE.B..
Intuitive override: “I had found by chance (Bonnie and Clyde), but what about the rest? It's very unlikely that Holmes and Watson were cheating, but I think it's likely that the others were”.
This instructor then concluded that statistical detection is not really reliable.
Bad statistical detection often discredits the good.
Aside: A Better “quickie” Index
• A better, but still not good, simple index is the Harpp-Hogan index: the number of wrong matches divided by the number of differences. One is supposed to be suspicious when it is > 1. For Holmes and Watson this works out to 8/32 = 0.25.
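The index as described above can be sketched in a few lines. This is only an illustration, assuming answers are encoded as on the earlier slides ("." for a correct answer, a letter for a wrong one). The function name harpp_hogan and the two short answer strings are made up for the example, and treating a pair with no differences as an infinite index is also an assumption.

```python
import math

def harpp_hogan(a, b, correct="."):
    """Return (wrong matches, differences, index) for two answer strings.

    Assumes the encoding used on the slides: `correct` marks a right
    answer, any other character is the wrong choice that was selected.
    """
    if len(a) != len(b):
        raise ValueError("answer strings must have the same length")
    # Same wrong choice on the same question.
    wrong_matches = sum(1 for x, y in zip(a, b) if x == y and x != correct)
    # Any question on which the two recorded responses differ.
    differences = sum(1 for x, y in zip(a, b) if x != y)
    # Assumption: identical sheets give an infinite index.
    index = wrong_matches / differences if differences else math.inf
    return wrong_matches, differences, index

# Hypothetical ten-question example, not taken from the slides.
student_1 = "..B.C..A.D"
student_2 = "A.B.C....D"
print(harpp_hogan(student_1, student_2))
```

A pair with identical sheets, such as Bonnie and Clyde above, has no differences at all, which is why the sketch returns infinity rather than dividing by zero.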
Problems with “Simple” Indices or combinations thereof
• The value of the indices can depend in an unknown way on class size, number of questions, number of choices, etc.
• They use very little information. Capability of students and difficulty of questions are often not incorporated.
• The risk of “false accusations” is not predictable.
• Many combinations of indices and plots are possible, and they may point in different directions.
How I Got Into This
• Request from an administrator
• Two students were suspected in another course; how many exactly similar answers did they have in my course?
• Probability tree diagrams
• Checked the literature
Wesolowsky, G.O. (2000) "Detecting Excessive Similarity in Answers on Multiple Choice Exams", Journal of Applied Statistics, Vol. 27, 909-921.
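Probability of a match by students j and k on question i = sum of match probabilities
Probability tree for question i:
  Student j correct (probability p_ji), student k correct (probability p_ki): match
  Student j wrong (1 - p_ji) and picks wrong answer 1 (conditional probability w_1i), student k wrong (1 - p_ki) and picks wrong answer 1 (w_1i): match
  Likewise for wrong answers 2, 3 and 4 (conditional probabilities w_2i, w_3i, w_4i)
Summing over the branches that end in a match:
P(match on question i) = p_{ji} p_{ki} + (1 - p_{ji})(1 - p_{ki}) \sum_{l} w_{li}^2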
Assumptions
• The probability that a student gets an answer right depends on the ability of the student and the difficulty of the question
• The probability of a match on wrong answers depends on the ‘popularity’ of the wrong answers
• Independence assumptions are those implicit in the diagram
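Under these assumptions, the match probability for one question is the sum over the tree branches that end in a match. The sketch below is a direct reading of the tree, not SCheck's code. The function name match_probability and the example numbers are hypothetical.

```python
def match_probability(p_j, p_k, wrong_shares):
    """Probability that students j and k give the same answer on one question.

    p_j, p_k     : probability that each student answers the question correctly
    wrong_shares : conditional probability of each wrong choice, given that
                   the answer is wrong (the 'popularity' of the wrong answers)
    Assumes the two students answer independently, as in the tree diagram.
    """
    if abs(sum(wrong_shares) - 1.0) > 1e-9:
        raise ValueError("conditional wrong-answer probabilities must sum to 1")
    both_right = p_j * p_k
    # Both wrong AND both pick the same wrong choice.
    same_wrong = (1.0 - p_j) * (1.0 - p_k) * sum(w * w for w in wrong_shares)
    return both_right + same_wrong

# Hypothetical five-choice question with four wrong answers.
print(match_probability(0.6, 0.5, [0.4, 0.3, 0.2, 0.1]))
```

Note how a single very popular wrong answer raises the match probability: the sum of squares is largest when one wrong choice attracts most of the wrong responses.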
But how do we estimate w_li and p_ji?
p_ji depends on two things
[Graph: horizontal axis = proportion of class that answered correctly on question i, from 0 to 1; vertical axis from 0 to 1; one curve for an above average student, one for a below average student]
Finding
c_j = proportion of questions answered correctly by student j
Find by solving
P value for each pair of students = probability of the observed number of matches or more
[Diagram: questions 1, ..., j, ..., q, each marked as a possible match (M)]
Compound Binomial Distribution, because the probability of a match is different on each question
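One way to compute such a tail probability exactly is to build the distribution of the number of matches one question at a time. The sketch below does this by dynamic programming. It illustrates the compound binomial idea only and is not necessarily how SCheck computes its figures. The function names and the example probabilities are hypothetical.

```python
def match_count_distribution(match_probs):
    """Distribution of the number of matches when question i matches
    independently with probability match_probs[i].

    Returns a list d where d[m] = P(exactly m matches).
    """
    dist = [1.0]  # with zero questions there are certainly zero matches
    for p in match_probs:
        nxt = [0.0] * (len(dist) + 1)
        for m, mass in enumerate(dist):
            nxt[m] += mass * (1.0 - p)   # no match on this question
            nxt[m + 1] += mass * p       # match on this question
        dist = nxt
    return dist

def p_value(match_probs, observed_matches):
    """P(observed number of matches or more)."""
    return sum(match_count_distribution(match_probs)[observed_matches:])

# Hypothetical three-question exam.
print(p_value([0.2, 0.5, 0.9], 3))
```

With equal probabilities on every question this reduces to the ordinary binomial tail.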
Example of SCheck Output
** pair = 2 78 **   Harpp-Hogan stat = #wr.mat/#diff = 19.00
##################################################################
Zb = 7.891 'equivalent' z from the BVP model
Significance of Zb on a pre-selected pair = 1.5E-15
Significance bound (Bonferroni) on program selected pairs = 1.3E-11
#matches = 33 | 34   (mu,s)=( 11.410, 2.689)
prop. right for 2 = 0.441   prop. right for 78 = 0.412
Quest. range = [ 1 34 ]   #students = 132
----------------------------------------------------------------
.d.abccd.e .e.abedb.. ...da..b.. ea.e
----------------------------------------------------------------
.d.abccdee .e.abedb.. ...da..b.. ea.e
----------------------------------------------------------------
estimated match probabilities:
0.423 0.357 0.360 0.324 0.367 0.377 0.376 0.232 0.285 0.316
0.237 0.236 0.369 0.283 0.249 0.423 0.254 0.321 0.255 0.483
0.696 0.310 0.371 0.238 0.345 0.536 0.258 0.211 0.460 0.290
0.224 0.326 0.388 0.231
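The (mu,s) pair in this output appears to be the mean and standard deviation of the number of matches implied by the 34 estimated match probabilities. That reading is an inference from the numbers, not something the output states. The check below sums the printed probabilities and should come out close to 11.410 and 2.689, with small last-digit differences possible because the printed probabilities are rounded.

```python
import math

# Estimated match probabilities copied from the output for pair (2, 78).
match_probs = [
    0.423, 0.357, 0.360, 0.324, 0.367, 0.377, 0.376, 0.232, 0.285, 0.316,
    0.237, 0.236, 0.369, 0.283, 0.249, 0.423, 0.254, 0.321, 0.255, 0.483,
    0.696, 0.310, 0.371, 0.238, 0.345, 0.536, 0.258, 0.211, 0.460, 0.290,
    0.224, 0.326, 0.388, 0.231,
]

# Mean and standard deviation of a sum of independent 0/1 variables
# with different success probabilities (assumed interpretation of mu, s).
mu = sum(match_probs)
s = math.sqrt(sum(p * (1.0 - p) for p in match_probs))

print(f"questions = {len(match_probs)}")
print(f"mu = {mu:.3f}, s = {s:.3f}")
```

The program's Zb is described as an 'equivalent' z from its own model, so it is not simply (33 - mu)/s and is not reproduced here.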
suspicious Data Dredging
• The number of student pairs examined is n(n-1)/2.
• For 693 students this is 239,778 pairs
“Unusual” Z’s Depend on Class Size
Multiply the P value by n(n-1)/2
** pair = 2 78 **
Significance of Zb on a pre-selected pair = 1.5E-15
Significance bound (Bonferroni) on program selected pairs = 1.3E-11
#students = 132
With 132 students there are 132 × 131 / 2 = 8,646 pairs, and 8,646 × 1.5E-15 ≈ 1.3E-11.
A similarity this unusual will occur at most 1.3 times, on the average, per 100 billion classes.
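The arithmetic behind the Bonferroni bound is easy to check. The sketch below assumes the pre-selected-pair significance is the one-sided upper tail of a standard normal at Zb, which agrees with the printed figures but is an inference rather than documented SCheck behavior. All function names are hypothetical, and the 0.05 level used for the cutoff is an arbitrary illustration.

```python
import math
from statistics import NormalDist

def pair_count(n_students):
    """Number of distinct student pairs, n(n-1)/2."""
    return n_students * (n_students - 1) // 2

def upper_tail(z):
    """One-sided standard normal tail probability P(Z >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def bonferroni_bound(z, n_students):
    """Upper bound on the chance that some innocent pair in the class
    reaches a similarity of z or more."""
    return min(1.0, pair_count(n_students) * upper_tail(z))

def z_cutoff(n_students, alpha):
    """Smallest z whose Bonferroni bound is alpha (alpha is hypothetical)."""
    return -NormalDist().inv_cdf(alpha / pair_count(n_students))

print(pair_count(693))
print(f"{upper_tail(7.891):.1e}")
print(f"{bonferroni_bound(7.891, 132):.1e}")
for n in (30, 132, 693):
    print(n, round(z_cutoff(n, 0.05), 2))
```

The loop at the end shows why an "unusual" Z depends on class size: the more pairs the program examines, the higher the cutoff has to be for the same overall risk of a false accusation.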
Important!
• The significance (the probability that a similarity this high will occur for an innocent pair) of a pair pre-selected by, say, suspicious behavior differs from that of a pair selected purely by the program. In other words, a pre-selected pair does not need as high a level of similarity evidence. SCheck therefore allows pre-selected pairs to be forced into the analysis.
Features of SCheck
• Up to 30,000 students
• Up to 200 questions
• Up to 27 choices, numbers or letters
• True or false or multiple choice in any combination
• Select a contiguous block of questions
• Option for pre-selected student pairs
• Option for similarity scores for all students
• Options for removing student identification from input and output files
• New: Two Type I methods for setting cutoffs
• Adjustment for “speed tests”
• Interactive or stored option choice
• Batch processing of multiple files
• Optional Excel grades output
• Files with all components necessary for verification of calculations
• Diagnostic graph
• Optional fine tuning (T parameter)
• Compact and intuitive question diagnostics
• Utility programs (format translators)
Developed from experience with large scale testing, research into cheating psychology, tribunal cases, different data formats, etc.
This box and the previous one allow selection of a block of questions. This is useful if, say, some questions only gather information.
This forces suspect pairs into the output for analysis.
• Vertical red line indicates the similarity cutoff. Its position depends on class size
• Straightness = normality
• Slope indicates the stdev of the Z’s
• An innocent class is symmetrical within the lines
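The reading of such a graph can be illustrated with a generic normal probability plot. The sketch below is not SCheck's diagnostic graph. It assumes theoretical normal quantiles on the horizontal axis and the sorted Z values on the vertical axis, so that a straight line means normality and its slope estimates the standard deviation of the Z's. With the axes swapped, as a vertical cutoff line would suggest, the slope would be the reciprocal. The function names and the synthetic data are hypothetical.

```python
from statistics import NormalDist, mean

def normal_plot_points(z_scores):
    """Pair each sorted Z with the standard normal quantile at its rank."""
    ordered = sorted(z_scores)
    n = len(ordered)
    std = NormalDist()
    # (i + 0.5) / n is a common choice of plotting position.
    return [(std.inv_cdf((i + 0.5) / n), z) for i, z in enumerate(ordered)]

def fitted_slope(points):
    """Least-squares slope of Z against the theoretical quantile."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)

# Synthetic, perfectly normal Z's with standard deviation 1.2.
n = 200
synthetic = [NormalDist(0.0, 1.2).inv_cdf((i + 0.5) / n) for i in range(n)]
print(round(fitted_slope(normal_plot_points(synthetic)), 3))
```

Pairs that copied from each other would show up as points far out in the upper tail, away from the straight line formed by the innocent majority.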
Forced pairs in NAM file
Forced pairs in OUT file
Diagnostics on questions
It’s Cheating Time
Detected Pairs
Summary of significances of identified pairs
-----------------------------------------
pair        Z       A Priori   Bonferroni
                    Signif.    Signif.
-----------------------------------------
 2,  78    7.891    1.5E-15    1.3E-11
 2,  97    7.428    5.5E-14    4.8E-10
36,  69    6.253    2.0E-10    1.7E-6
36,  70    4.755    9.9E-7     8.5E-3
36,  72    5.514    1.8E-8     1.5E-4
60, 119    4.931    4.1E-7     3.5E-3
69,  70    6.527    3.3E-11    2.9E-7
69,  72    5.474    2.2E-8     1.9E-4
70,  72    6.527    3.3E-11    2.9E-7
78,  97    7.067    7.9E-13    6.9E-9
-----------------------------------------
All pairs were found in adjacent seating
teamwork
[Diagram linking the detected pairs; students shown: 2, 36, 60, 69, 70, 72, 78, 97, 119]
132 students
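The "teams" can be recovered from the Detected Pairs table by treating each detected pair as a link and collecting the connected groups. The sketch below uses a small union-find structure. It only illustrates the grouping idea and is not a described feature of SCheck. The function name clusters is hypothetical.

```python
def clusters(pairs):
    """Group students into connected sets, where each detected pair is a link."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for student in list(parent):
        groups.setdefault(find(student), []).append(student)
    return sorted(sorted(group) for group in groups.values())

# The ten pairs from the Detected Pairs table.
detected = [(2, 78), (2, 97), (36, 69), (36, 70), (36, 72),
            (60, 119), (69, 70), (69, 72), (70, 72), (78, 97)]
for team in clusters(detected):
    print(team)
```

For this class the ten pairs collapse into three groups covering nine students.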
Have you ever seen anything like it? (Contact at a testing agency)
Pair       Z        Pair       Z        Pair       Z        Pair       Z
16, 39     6.453    16, 41     5.453    16, 42     6.090    16, 44     5.051
16, 46     5.089    16, 50     5.074    16, 65     5.837    39, 41     7.385
39, 42     8.178    39, 43     6.196    39, 44     6.061    39, 46     6.916
39, 50     6.878    39, 57     5.662    39, 64     5.896    39, 65     7.887
39, 69     5.515    41, 42     6.958    41, 43     5.515    41, 44     5.043
41, 46     6.196    41, 50     5.811    41, 57     5.011    41, 64     4.907
41, 65     6.745    42, 43     5.786    42, 44     5.614    42, 46     7.236
42, 50     7.190    42, 57     5.293    42, 64     6.183    42, 65     7.458
42, 69     5.786    43, 46     5.386    43, 64     5.118    43, 65     5.660
44, 46     5.176    44, 50     5.083    44, 64     4.941    44, 65     5.590
46, 50     6.013    46, 57     4.933    46, 64     5.786    46, 65     6.337
46, 69     5.055    49, 65     4.964    50, 57     4.900    50, 64     5.737
50, 65     6.314    57, 65     5.106    64, 65     6.021    65, 69     5.660
82, 87     8.456    82, 109    7.750    82, 110    8.945    86, 92     6.381
87, 109    7.375    87, 110    8.456    91, 93     5.896    91, 100    7.385
91, 107    7.385    93, 100    6.592    93, 107    6.592    100, 107   8.716
109, 110   7.750    113, 117   5.328    113, 119   6.013    113, 124   5.386
113, 137   5.515    117, 119   7.062    118, 138   8.415    120, 122   6.319
120, 124   5.257    120, 137   7.054    120, 141   5.345    122, 137   5.664
122, 141   7.419    124, 137   5.185
Really enthusiastic teamwork!
[Diagram linking the detected pairs by student number]
35 suspects, 79 pairs, 200 students