Plagiarism Monitoring and Detection -- Towards an Open Discussion Edward L. Jones Computer Information Sciences Florida A & M University Tallahassee, Florida
Outline • What is Plagiarism, and Why Address It • Plagiarism Detection & Countermeasures • A Metrics-Based Detection Approach • Extending the Approach • Conclusions & Future Work
Why Tackle Plagiarism? • Plagiarism undermines educational objectives • Failure to address it sends the wrong message • A non-contrived ethical issue in computing • Plagiarism is hard to define • Plagiarism is costly to pursue/prosecute • An interesting problem for tinkering
What is Plagiarism? • “use of another’s ideas, writings or inventions as one’s own” (Oxford American Dictionary, 1980) • Shades of Gray • Theft of work • Gift of work • Collusion • Collaboration • Coincidence • Intent to Deceive
How is it Detected? • By chance • Anomalies • Temporal proximity when grading • Automation methods • Direct text comparison (Unix diff) • Lexical pattern recognition • Structural pattern recognition • Numeric profiling
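The automated methods above are straightforward to prototype. As a minimal sketch (hypothetical, not the author's tool), direct text comparison can be approximated with Python's standard difflib module; the similarity threshold and the sample program texts are illustrative assumptions.

```python
# Minimal sketch of direct text comparison (a diff-like check).
# The 0.9 threshold is an illustrative assumption, not from the talk.
import difflib

def text_similarity(source_a: str, source_b: str) -> float:
    """Return a similarity ratio in [0, 1] between two program texts."""
    return difflib.SequenceMatcher(None, source_a, source_b).ratio()

prog1 = "int main() { return 0; }"
prog2 = "int main() { return 1; }"
if text_similarity(prog1, prog2) > 0.9:
    print("Very similar -- flag for manual review")
```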
Plagiarism Concealment Tactics • None • Change comments • Change formatting • Rename identifiers • Change data types • Reorder blocks • Reorder statements • Reorder expressions • Superfluous code • Alternative control structures
Prosecution -- DA in the House? • Course syllabus broaches the subject • Concrete definition generally lacking • Sense of "we'll know it when we see it" • N? Tolerance Policy • Investigation Stage • Prosecution Stage • Missed opportunity to teach?
An Awareness Approach • Monitor closeness of student programs • Objective measures • Automated • Post anonymized closeness results publicly • Nonconfrontational awareness • "A word to the wise …" • Benchmark student behavior • Establishing thresholds • Effects of course and language
Closeness Measures -- Physical • Program 1 ==> (lines1, words1, characters1) • Program 2 ==> (lines2, words2, characters2) • Closeness = Euclidean distance between the two profile vectors
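A minimal sketch of the physical measure, assuming a wc-style profile of (lines, words, characters); the function names and sample programs are illustrative, not from the talk.

```python
# Sketch of the physical closeness measure: profile each program as
# (lines, words, characters) -- what Unix wc reports -- then take the
# Euclidean distance between the two profile vectors.
import math

def physical_profile(source: str) -> tuple[int, int, int]:
    """Count lines (newlines), words, and characters, wc-style."""
    return (source.count("\n"), len(source.split()), len(source))

def euclidean_distance(p, q) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

prog1 = "int main() {\n    return 0;\n}\n"
prog2 = "int main()\n{\n    return 0;\n}\n"
print(euclidean_distance(physical_profile(prog1), physical_profile(prog2)))
```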
Closeness Measures -- Halstead • Program 1 ==> (length1, vocabulary1, volume1) • Program 2 ==> (length2, vocabulary2, volume2) • Closeness = Euclidean distance between the two profile vectors
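The Halstead profile can be sketched the same way, using the standard definitions: length N is the total operator and operand count, vocabulary n is the distinct count, and volume is N·log2(n). A real implementation would strip comments and use a language-aware lexer; the naive regex tokenizer below is an illustrative assumption.

```python
# Sketch of the Halstead profile (length, vocabulary, volume).
# Assumes comments are already stripped; the regex tokenizer is a
# naive stand-in for a real lexer.
import math
import re

def halstead_profile(source: str) -> tuple[int, int, float]:
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)
    length = len(tokens)            # N: total operator/operand occurrences
    vocabulary = len(set(tokens))   # n: distinct operators/operands
    volume = length * math.log2(vocabulary) if vocabulary > 1 else 0.0
    return (length, vocabulary, volume)

print(halstead_profile("x = x + 1"))  # (5, 4, 10.0)
```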
Comparison of Measures • Physical profile ==> weight test • Simple/cheap to compute (Unix wc command) • Sensitive to character variations • Halstead profile ==> content test • More complex/expensive to compute • Ignores comments and white space • Sensitive only to changes in program content • Detection effectiveness varies with the concealment tactic used
Closeness Computation • Normalization • Establishes an upper bound for comparison (√2 ≈ 1.414) • Distance computed on normalized (unit) vectors • Normalization I -- Self normalization • p = (a, b, c) ==> (a/L, b/L, c/L), where L = ||p|| • Largest component dominates • Normalization II -- Global scaling • p = (a, b, c) ==> q = (a/aMAX, b/bMAX, c/cMAX) • Self normalization applied to q
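A sketch of the two normalization schemes, assuming 3-component profile vectors with non-negative entries (which is why the maximum distance between unit vectors is √2 ≈ 1.414); the sample values are made up for illustration.

```python
# Sketch of the two normalizations described above.
import math

def self_normalize(p):
    """Normalization I: divide by the vector's own length L = ||p||."""
    norm = math.sqrt(sum(x * x for x in p))
    return tuple(x / norm for x in p)

def globally_scale(p, maxima):
    """Normalization II: divide each component by its maximum over all
    submissions, then self-normalize the rescaled vector q."""
    q = tuple(x / m for x, m in zip(p, maxima))
    return self_normalize(q)

print(self_normalize((30, 120, 900)))
print(globally_scale((30, 120, 900), maxima=(80, 400, 3000)))
```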
Closeness Distribution • Closeness values vary by assignment • Programming language may lead to clustering at the lower end of the spectrum • Reuse of modules leads to clustering at the lower end of the spectrum • No a priori threshold pin-points plagiarism • All measures exhibit these behaviors
Suspect Identification • Collaboration Suspects (5th Percentile)

Rank   Closeness    Student 1   Student 2
   1   0.00000000   alpha       alpha
   2   0.00000652   alpha       beta
   3   0.00026963   beta        gamma
   4   0.00026981   alpha       gamma
   5   0.00031262   gamma       epsilon
   6   0.00048815   sigma       delta
   7   0.00049825   alpha       epsilon
   8   0.00050169   beta        epsilon
   9   0.00066481   gamma       theta
  10   0.00073158   beta        theta
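A sketch of how such a suspect list might be generated: rank all pairs of student profiles by closeness and keep the closest 5th percentile. The function name, the cutoff rule, and the toy profile data are illustrative assumptions; math.dist computes the Euclidean distance.

```python
# Sketch: rank all profile pairs by closeness and flag the closest
# 5th percentile as collaboration suspects.
import math
from itertools import combinations

def collaboration_suspects(profiles, percentile=0.05):
    """profiles: {student: normalized profile vector}."""
    pairs = sorted(
        (math.dist(p, q), s1, s2)
        for (s1, p), (s2, q) in combinations(profiles.items(), 2)
    )
    cutoff = max(1, round(len(pairs) * percentile))
    return pairs[:cutoff]

profiles = {"alpha": (0.10, 0.20, 0.97), "beta": (0.10, 0.21, 0.97),
            "gamma": (0.50, 0.50, 0.70)}
for closeness, s1, s2 in collaboration_suspects(profiles):
    print(f"{closeness:.8f}  {s1}  {s2}")
```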
Independence Index • Student Independence Indices

Index   Student
    1   alpha
    2   beta
    3   gamma
    5   epsilon
    6   sigma
    6   delta
    9   theta

Index = position at which the student first appears on the closeness list
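The index is easy to compute from the ranked closeness list. A minimal sketch, assuming the same (closeness, student1, student2) tuples as the suspect-list sketch above:

```python
# Sketch: the independence index is the rank at which a student
# first appears in the ascending closeness list.
def independence_indices(ranked_pairs):
    """ranked_pairs: [(closeness, student1, student2), ...], ascending."""
    index = {}
    for rank, (_, s1, s2) in enumerate(ranked_pairs, start=1):
        index.setdefault(s1, rank)   # record only the first appearance
        index.setdefault(s2, rank)
    return index

ranked = [(0.00000652, "alpha", "beta"), (0.00026963, "beta", "gamma")]
print(independence_indices(ranked))  # {'alpha': 1, 'beta': 1, 'gamma': 2}
```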
Preponderance of Evidence • Historical Record of Student Behavior • Collaboration/partnering • Independence indices • Profile and analyze other artifacts • Compilation logs • Execution logs
Another Approach • Make students demonstrate familiarity with their submitted programs • Seed errors into the program • Time limit for removing the errors and resubmitting • Holistic approach • Intentional, not accidental
Conclusions • We can do something about plagiarism -- the first step is to develop eyes and ears • Simple metrics appear to be adequate • Tools are essential • Sophistication is not as necessary as automation • Students are curious to know how they compare with other students
On-Going & Future Work • Complete the toolset • Student Independence Index • Incorporate other Artifacts • Compilation logs • Execution logs • Integrate into Automated Grading • Disseminate Results • Package tool as shareware
Questions?
Flow Chart • Student Programs ==> Profile ==> Compute Closeness ==> Suspicious Programs
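Putting the pieces together, the flow chart might be realized as below. This reuses the earlier sketches (physical_profile, self_normalize, collaboration_suspects); the submissions directory and the .c file extension are illustrative assumptions, not from the talk.

```python
# End-to-end sketch of the flow chart: read submissions, profile each,
# compute pairwise closeness, and report the suspect pairs.
from pathlib import Path

profiles = {
    path.stem: self_normalize(physical_profile(path.read_text()))
    for path in Path("submissions").glob("*.c")
}
for closeness, s1, s2 in collaboration_suspects(profiles):
    print(f"{closeness:.8f}  {s1}  {s2}")
```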