
Demystifying Data Forensics An Overview of the Logic Underlying Cheating Detection Techniques


Presentation Transcript


  1. Demystifying Data Forensics: An Overview of the Logic Underlying Cheating Detection Techniques
Jim Wollack, Ph.D.
Associate Professor, Educational Psychology
Director, Testing and Evaluation Services
Director, UW Center for Placement Testing

  2. Is Cheating Really a Problem? Answer Copying
• Example with MBE
• Tends not to steal the headlines
• Most common type of cheating without premeditation
• ≈ 20% of undergraduates copy annually
• Moderately serious
• A copier's ability to help themselves depends on the ability of the Source(s), visual acuity, and line of sight

  3. Is Cheating Really a Problem? Item Preknowledge
• 2002 GRE braindump site
• 2010 FSBPT group email accounts
• IT certification test prep sites
• Certexperts.com: 60,000 items from 60 testing programs (2006)
• Testking.com: Microsoft exams (2006)
• Scoretop.com: GMAT (2008)
• Huge problem: upwards of 85% of examinees
• Single most common type of cheating among undergrads
• Extremely serious

  4. Is Cheating Really a Problem? Illegal Coaching and Test Tampering
• 2003 TAKS in Dallas
• Numerous public schools (Atlanta, Washington, Philly, Dallas, LA, etc.)
• Surprisingly frequent: 2-4% of educators
• Extremely serious

  5. Is Cheating Really a Problem? Proxy Testing
• SAT/ACT proxies
• Very infrequent
• Extremely serious
• Led to changes in the registration process

  6. Combatting Cheating: Data Forensics
• Statistical approaches to identify examinees whose scores are of questionable validity
• Utilized by almost all major testing programs
• Can be used as a trigger and/or to corroborate suspicion
Not a substitute for an investigation
• False positives are possible
• Legitimate reasons may also exist for statistical irregularities

  7. Types of Data Forensics Addressed
• Answer copying
• Preknowledge
• Other forms of collusion
• Illegal coaching
• Test tampering
• Proxy testing

  8. The Creation of a Forensic Tool
Tailor the method to the specific type of cheating
• Approaches should focus on specific, statistically observable elements of cheating
• Observable means that it is evident in the data record
• Using cell phones, testing with a fake ID, or talking during the test are NOT observable to the statistician
• Answers to specific test questions, test scores, testing history, etc. ARE observable to the statistician
• What strange patterns would I expect to see if someone were engaged in the cheating behavior of interest that I would not expect to see otherwise?

  9. Answer Copying
What are observable characteristics that we'd expect to see if one examinee copies from another?
How persuasive is that evidence?
How can we convert that observable into a statistic that is likely to be high for cheaters and low for non-cheaters?
• What issues might be associated with that statistic?

  10. Answer Copying
• Observable: large # of identical responses
• #Matches between examinees should vary with
• Abilities of C and S

  11. Answer Copying
• Observable: large # of identical responses
• #Matches between examinees should vary with
• Abilities of C and S
• Number of questions
• Number of item alternatives
• Difficulty of questions
• Attractiveness of alternatives
• Most common approach is to standardize the number of matches

  12. Standardization
Conversion of raw data to a scale that makes direct comparisons possible
The approach involves two steps:
• Comparing each data point against its expected value
• The expected value is the value that, on average, we would expect to see under these exact circumstances if there really were no cheating
• Evaluation against the expected value tells us if #Matches is more or less than we'd expect of this C from this S
• Dividing this difference by a measure of the expected variability (the standard error)
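A minimal sketch of this two-step computation in Python (the plugged-in numbers come from the worked example later in the deck):

```python
# Minimal sketch of the standardization described above.
def standardize(observed: float, expected: float, se: float) -> float:
    """Step 1: compare the observed value with its expected value.
    Step 2: divide the difference by the standard error."""
    return (observed - expected) / se

# Numbers from the worked copying example later in the deck:
# 3 observed matches, expected = 1.13, SE = 0.818.
print(standardize(3, 1.13, 0.818))  # ~2.29
```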

  13. Standardization
For tests of 30+ items, this statistic follows the standard normal distribution (the bell curve)

Index Value    Tail Probability    Likelihood
0              50%                 1 in 2
1              15.9%               1 in 6
2              2.3%                1 in 44
3              0.14%               1 in 740
3.09           0.1%                1 in 1,000
3.72           0.01%               1 in 10,000
4.27           0.001%              1 in 100,000
5.20           0.00001%            1 in 10,000,000
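The conversion from index value to likelihood is just the upper-tail area of the standard normal distribution; a short sketch reproducing (approximately) the table's right column:

```python
# Convert a standardized index to its upper-tail probability (bell-curve area).
from math import erfc, sqrt

def upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z; erfc is used for numerical stability."""
    return 0.5 * erfc(z / sqrt(2))

for z in (3.09, 3.72, 4.27, 5.20):
    p = upper_tail(z)
    print(f"index {z}: 1 in {round(1 / p):,}")
```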

  14. How Does One Find Expected # Matches? Empirically based
• Construct a large dataset of pairs of examinees who could not have copied
• Condition the dataset: divide the data into smaller, homogeneous groups based on, e.g.,
• Test scores for one or both examinees
• Sum or product of examinees' test scores
• Longest string of consecutive matches
• Compute the average #Matches across all examinee pairs within the group into which the C-S pair of interest falls
• St.dev(#Matches) is the standard deviation across those same values
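A minimal sketch of the empirical approach, assuming each record in `pairs` is a (score of C, score of S, #Matches) tuple for a pair who could not have copied, and conditioning (one possible choice) on the sum of the two scores:

```python
# Empirical norms for #Matches: group non-copying pairs into homogeneous
# score bins, then use each bin's mean and SD to standardize a suspect pair.
from collections import defaultdict
from statistics import mean, stdev

def empirical_norms(pairs, bin_width=5):
    """Returns {bin: (mean #Matches, SD #Matches)} across non-copying pairs."""
    bins = defaultdict(list)
    for score_c, score_s, n_matches in pairs:
        bins[(score_c + score_s) // bin_width].append(n_matches)
    return {b: (mean(v), stdev(v)) for b, v in bins.items() if len(v) > 1}

def copy_index(score_c, score_s, n_matches, norms, bin_width=5):
    """Standardized #Matches for the C-S pair of interest."""
    expected, se = norms[(score_c + score_s) // bin_width]
    return (n_matches - expected) / se
```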

  15. How Does One Find Expected # Matches? Model-based
• Use a statistical model to estimate the probability of examinees selecting each item choice
• Nominal Response Model
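A sketch of the Nominal Response Model's category probabilities (the slope and intercept values below are made up for illustration):

```python
# Nominal Response Model: P(select alternative k | ability theta) is a
# softmax over per-alternative linear functions of theta.
from math import exp

def nrm_probs(theta, slopes, intercepts):
    """Category probabilities for one item under the Nominal Response Model."""
    logits = [a * theta + c for a, c in zip(slopes, intercepts)]
    m = max(logits)                          # guard against overflow
    weights = [exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative 4-alternative item: low-ability examinees favor a distractor.
print(nrm_probs(-1.0, slopes=[0.9, -0.3, 0.2, -0.8],
                intercepts=[0.1, 0.2, -0.7, 0.4]))
```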

  16. Example Probability Function
[Figure: Prob(selecting alternative k) plotted against the ability of the alleged copier (Low to Average to High)]

  17. Example Probability Function
[Figure: same axes as the previous slide, Prob(selecting alternative k) vs. ability of the alleged copier]

  18. Example Probability Function
[Figure: response curves for alternatives A, B, C, and D, plotting Prob(selecting alternative k) against the ability of the alleged copier]

  19. Example Probability Function
[Figure: the same curves for alternatives A, B, C, and D, with the abilities of the alleged Source and alleged Copier marked on the ability axis]

  20. Example Probability Function
[Figure: S selected alternative A; at the alleged copier's ability, the model gives Prob(C selects A) = .12, so Prob(Match) = .12]

  21. How Does One Find Expected # Matches? Model-based
• Estimate the probability of C selecting each item choice
• Find the probability of C selecting S's answer
• Sum these probabilities across items:
0.12 + 0.57 + 0.34 + 0.06 + 0.04 = 1.13
• How unusual is it to observe 3 answer matches given that the expected number is 1.13?

  22. Unusual or Not?
• Find the standard error of the number of answer matches
• Find P(Match) × [1 − P(Match)] for each item and sum across items:
0.12(0.88) + 0.57(0.43) + 0.34(0.66) + 0.06(0.94) + 0.04(0.96) = 0.6699
• Take the square root: SE = 0.818
• Index = (3 − 1.13) / 0.818 ≈ 2.29
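The full worked example as a short sketch, tying together slides 21-23:

```python
# Expected matches, standard error, and standardized index for the example.
from math import sqrt, erfc

p_match = [0.12, 0.57, 0.34, 0.06, 0.04]  # P(C selects S's answer), per item
expected = sum(p_match)                                # 1.13
se = sqrt(sum(p * (1 - p) for p in p_match))           # sqrt(0.6699) ~ 0.818
z = (3 - expected) / se                                # ~2.29
p = 0.5 * erfc(z / sqrt(2))                            # upper-tail probability
print(f"index = {z:.2f}, probability = {p:.3f}")       # ~1 in 91
```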

  23. How Unusual is 2.29?
An index of 2.29 corresponds to an upper-tail probability of 1.1% = 1 in 91

  24. Statistical Detection of Answer Copying
Answer copying detection indexes provide a probability statement about the likelihood of C's and S's responses having been produced independently
• Quantifies how unusual the observed similarity is
• Very small probabilities are quite compelling
From a recent case:
• Examinees C and S completed scrambled forms
• C scored 26 of 100; S scored 76 of 100
• Matched answers on 66 of 100 items
• Index value: 9.89

  25. How Unusual is 9.89?
Probability of C's answers having been produced independently of S's?
1 in 45,000,000,000,000,000,000,000

  26. How Unusual is 9.89?
Probability of C's answers having been produced independently of S's?
1 in 45,000,000,000,000,000,000,000
How big is 45,000,000,000,000,000,000,000?
• # people ever born: 108,000,000,000
• # stars in the galaxy: 400,000,000,000
• Earth's age (in seconds): 150,000,000,000,000,000
• # grains of sand: 7,500,000,000,000,000,000
Evidence doesn't need to be this overwhelming to be useful
• Depending on other evidence, statistical evidence on the order of 1 in 1,000 or 1 in 10,000 may be adequate

  27. Other Observables for Answer Copying
Scrambled forms
• Help with the copying detection index
• Also possible to look at the likelihood of C's score under the alternate test key (a sketch follows below)
Different success rates on common items and unique items
• Can find the expected score over both sets and ask whether changes are in keeping with expectations
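A hypothetical sketch of the alternate-key check: grade C's answer sheet against the key for S's scrambled form (an honest C should score near chance under the wrong key):

```python
# Score an answer vector against a given form's key. If C transcribed S's
# bubbled positions on a differently scrambled form, C's answers may fit
# S's form key far better than chance.
def score(responses, key):
    """Number of responses matching the key (hypothetical helper)."""
    return sum(r == k for r, k in zip(responses, key))

# Illustrative use: compare C's score under C's own key vs. under S's form key.
# own = score(c_responses, key_form_c); alt = score(c_responses, key_form_s)
```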

  28. Other Types of Cheating
The trick is finding or deriving an observable that is predictive of the issue
What data patterns might we expect to observe to identify candidates with preknowledge?

  29. Preknowledge
Premise: Examinees who have studied live items should do much better on those items than on unfamiliar items
Challenge: We often do not know which items are compromised
We can design the test with preknowledge detection in mind

  30. Internal Verification Test
Test-within-a-test
• Embed a set of new items in the test
Observable: change in performance across the sets of secure and operational items
Use the score from the operational items to predict the score on the new items
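A minimal sketch of the internal-verification check, assuming `p_correct` holds the model-predicted probabilities of success on each embedded item for this examinee (derived from the operational score; the numbers below are illustrative):

```python
# Compare observed score on the embedded verification items with the score
# predicted from operational-test performance; LOW values suggest the
# operational items were compromised.
from math import sqrt

def verification_index(p_correct, observed_correct):
    """Standardized (observed - expected) score on the verification items."""
    expected = sum(p_correct)
    se = sqrt(sum(p * (1 - p) for p in p_correct))
    return (observed_correct - expected) / se

# Illustration: predicted to get ~8 of 10 new items right, but got only 3.
probs = [0.9, 0.85, 0.8, 0.8, 0.75, 0.8, 0.7, 0.9, 0.85, 0.7]
print(verification_index(probs, 3))  # ~ -4.1, far below expectation
```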

  31. Example Probability Function
[Figure: response curves for alternatives A, B, C, and D, plotting Prob(selecting alternative k) against ability (Low to Average to High)]

  32. Expected Number Correct

  33. Likelihood
Check the extremity of the likelihood
LOW scores are suggestive of preknowledge

  34. Internal Verification Test
Works well if
• Items are highly compromised
• The VT (verification test) is long enough

  35. Preknowledge
Premise: Examinees view braindump/test prep sites as "ultimate authorities" and will trust their materials unconditionally
Observable: Examinees who have studied INCORRECTLY POSTED live items should do much worse on those items than on unfamiliar items

  36. Trojan Horse Items
Testing program releases some of its actual items to known braindumps
• Released items are very easy
• Items are posted verbatim with an incorrect key marked
Items appear on the exam, but do not count towards the score
Use the score from the operational items to predict the score on the new items (a sketch of one simple flag follows below)
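One hedged sketch of a flag for this design (the threshold is illustrative, not the presenter's):

```python
# Count how often an examinee reproduces the INCORRECT keys that were posted
# with the Trojan horse items; honest examinees should rarely choose them,
# since the items are very easy.
def trojan_flag(responses, posted_wrong_keys, threshold=5):
    """responses / posted_wrong_keys: answers on the Trojan items only.
    Flags examinees matching the bogus key on >= threshold items
    (the deck notes as few as 5-6 items can catch the biggest offenders)."""
    hits = sum(r == k for r, k in zip(responses, posted_wrong_keys))
    return hits >= threshold
```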

  37. Expected Number Correct
Ideally suited for cases with very high compromise rates:
• Scores on the operational test are high
• Items are easy
• High probabilities of success on the items

  38. Likelihood
LOW scores are suggestive of preknowledge

  39. Trojan Horse Items
If the test is very highly compromised, the methodology works well to detect the biggest and dumbest offenders
• Can catch the biggest offenders with as few as 5-6 items
Ethics dilemma:
• Should testing companies really be exposing illegitimate information about their program?
• If they know where people are going to access stolen content, wouldn't bringing the site down be the right thing to do?

  40. Preknowledge
Premise: Examinees will be able to answer quickly any questions for which they have preknowledge
Observable: Response times (RTs) for compromised items should be less than for secure items

  41. Ways to Use RT Data
If you have a VT, you can compare a person's standardized RT on the VT with their standardized RT on the operational test
• Essentially the same as asking whether the percentile rank of the person's RT is markedly different
Can plot RTs to look for anomalies

  42. Response Time
[Figure: per-item response times, with items re-ordered from shortest to longest average RT]

  43. Ways to Use RT Data
If you have a VT, you can compare a person's standardized RT on the VT with their standardized RT on the operational test
• Essentially the same as asking whether the percentile rank of the person's RT is markedly different
Can plot RTs to look for anomalies
• Possible flags (sketched below):
• Within-person RT standard deviation
• Finished test too quickly (< 4 SEs below the mean)
• RT < 20 sec for too many questions
• Really long RTs could signal item harvesting
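A minimal sketch of the flagging rules above, assuming `rts` holds one examinee's per-item response times in seconds and population statistics for total time are available (cutoffs other than the two named in the slide are illustrative):

```python
# Screen one examinee's response times for the anomalies listed above.
def rt_flags(rts, pop_mean_total, pop_se_total,
             fast_cutoff=20, max_fast_items=10):
    flags = []
    total = sum(rts)
    # Finished the test too quickly: more than 4 SEs below the population mean.
    if (total - pop_mean_total) / pop_se_total < -4:
        flags.append("finished test too quickly")
    # Too many items answered in under 20 seconds.
    if sum(rt < fast_cutoff for rt in rts) > max_fast_items:
        flags.append("too many sub-20-second responses")
    return flags
```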

  44. Preknowledge
Observable: Unusual similarity among examinees with access to the same materials
Similarity index
• Conceptually very similar to an answer copying index, except the expected value and st.dev are computed differently

  45. Example Probability Function
[Figure: as on slide 20, the alleged Source selected alternative A, and Prob(Match) = .12 for the alleged Copier]

  46. Example Probability Function
[Figure: the same response curves, now labeled for Examinee 1 and Examinee 2 rather than Source and Copier]

  47. Example Probability Function
[Figure: selection probabilities for Examinees 1 and 2 across alternatives A, B, C, and D]

  48. Example Probability Function
[Figure: selection probabilities for both examinees, with alternatives B and D grouped]

  49. Similarity
• Find P(Match) for all items and sum across items
• Find P(Match) × [1 − P(Match)] and sum across items
• Take the square root: 1.064

  50. Detecting Preknowledge with Similarity Indexes
• Compute the index between all possible pairs
• Identify all pairs with probability < some criterion
• Use a clustering method to unite linked examinees (see the sketch below)
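A minimal sketch of the clustering step, using union-find (one simple choice of method) to unite linked examinees:

```python
# Treat each flagged pair as an edge and merge connected examinees.
def cluster(flagged_pairs):
    """flagged_pairs: (i, j) pairs whose similarity index exceeded the
    criterion. Returns the groups of linked examinees."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in flagged_pairs:
        parent[find(i)] = find(j)          # union the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

print(cluster([("A", "B"), ("B", "C"), ("D", "E")]))  # [{'A','B','C'}, {'D','E'}]
```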
