Graphical Information on Plagiarism Activities Poon Yan Horn Jonathan
Table of Contents • Background • Motivation • System structure • Pair-wise detection • Clustering • Demo • Q & A
Background • Despite years of effort, plagiarism in student assignment submissions still causes considerable difficulties for course designers. • CAI (June 2005) – 40% of students admitted to engaging in plagiarism. • NUS FASS (AY 2008 – 2009) – 70 students were found guilty of committing plagiarism.
Motivation • Many detection systems can detect similarities between the submissions for a single assignment. • Their results, however, do not provide sufficient information on how program code is exchanged among a group of students. • Most importantly, they do not show how plagiarism works within a group of students across all assignments.
System Structure • Pair-wise plagiarism detection engine • Clustering engine (DBSCAN) • HTML / Graph generator • Database
Pair-wise Detection • Tokenize each submission. • Construct an N-Gram representation of each submission. • Determine the matching sub-sequence pairs of N-Grams between each pair of submissions. • Compute the asymmetric similarities between each pair of submissions.
Pair-wise Detection • Tokenize each submission
  • Remove whitespace
  • Convert: keywords => 'K', identifiers => 'V', strings => 'S', constants => 'C'
  • Example: int main() { int a = 1; String b = "sb";} becomes KV(){KV=C;KV=S;}
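The tokenization step can be sketched roughly as follows. This is a minimal illustration, not the system's actual tokenizer: the `KEYWORDS` set is an assumed, incomplete subset (the slide's example treats `String` as a keyword, so it is included), and the regular expression handles only simple string and numeric literals.

```python
import re

# Assumed, illustrative keyword subset; the real system's list is not given on the slide.
KEYWORDS = {"int", "String", "char", "double", "void",
            "if", "else", "for", "while", "return"}

TOKEN_RE = re.compile(r'''
    (?P<S>"(?:\\.|[^"\\])*")   # string literal
  | (?P<C>\d+(?:\.\d+)?)       # numeric constant
  | (?P<W>[A-Za-z_]\w*)        # keyword or identifier
  | (?P<P>\S)                  # any other non-space character, kept unchanged
''', re.VERBOSE)

def tokenize(source):
    """Map source code to a compact token string (whitespace is dropped)."""
    out = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "W":
            out.append("K" if m.group() in KEYWORDS else "V")
        elif kind in ("S", "C"):
            out.append(kind)
        else:
            out.append(m.group())
    return "".join(out)

print(tokenize('int main() { int a = 1; String b = "sb";}'))
# prints KV(){KV=C;KV=S;}
```

Because identifiers and literals collapse to single letters, renaming variables or changing constant values leaves the token string unchanged.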
Pair-wise Detection • N-Gram construction • Compose the sequence of 4-grams over the token string
  KV(){KV=C;KV=S;} yields:
  KV() V(){ (){K ){KV {KV= KV=C V=C; =C;K C;KV ;KV= KV=S V=S; =S;}
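As a sketch of the N-Gram step, a sliding window of width 4 over the token string reproduces the sequence on the slide. The function name `ngrams` is chosen here for illustration only.

```python
def ngrams(token_string, n=4):
    """Return every length-n window of the token string, in order."""
    return [token_string[i:i + n] for i in range(len(token_string) - n + 1)]

print(" ".join(ngrams("KV(){KV=C;KV=S;}")))
# prints KV() V(){ (){K ){KV {KV= KV=C V=C; =C;K C;KV ;KV= KV=S V=S; =S;}
```

A token string shorter than `n` yields an empty list.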
Pair-wise Detection • Determine the matching sub-sequence pairs between two N-Gram sequences, A and B: • Check whether each N-Gram in A can be found in B. • If a matched sub-sequence is longer than the minimum matching requirement, report it as a match. • The minimum matching requirement is 2 statements.
  Example sequences:
  KV() V(){ (){K ){KV {KV= KV=C V=C; =C;K C;KV ;KV= KV=S V=S; =S;K S;KV ;KV= KV=C V=C; =C;}
  KV() V(){ (){K ){KV {KV= KV=C V=C; =C;K C;KV ;KV= KV=C V=C; =C;}
  KV() V(){ (){K ){KV {KV= KV=C V=C; =C;K C;KV ;KV= KV=C V=C; =C;K C;KV ;KV= KV=C V=C; =C;}
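One way to picture the matching step is a greedy scan that, for each position in A, looks for the longest run of consecutive N-Grams that also appears consecutively in B. This is only a sketch under assumptions: the slide does not say how the real engine searches, and `min_len` (a threshold counted in N-Grams, applied here as "at least") is a hypothetical stand-in for the "2 statements" requirement. The helper `grams` is also illustrative.

```python
def grams(s, n=4):
    # Illustrative helper: sliding 4-grams over a token string.
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def matched_runs(a, b, min_len):
    """Return (start_in_a, start_in_b, length) for each reported run."""
    positions = {}
    for j, g in enumerate(b):
        positions.setdefault(g, []).append(j)
    runs, i = [], 0
    while i < len(a):
        best_len, best_j = 0, -1
        for j in positions.get(a[i], []):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:
                best_len, best_j = k, j
        if best_len >= min_len:   # assumed threshold semantics
            runs.append((i, best_j, best_len))
            i += best_len         # skip past the reported run
        else:
            i += 1
    return runs

a = grams("KV(){KV=C;KV=S;KV=C;}")  # first example sequence on the slide
b = grams("KV(){KV=C;KV=C;}")       # second example sequence on the slide
print(matched_runs(a, b, min_len=5))
# prints [(0, 0, 10)]
```

With a lower threshold such as `min_len=4`, the shorter run at the end of both sequences would be reported as well, which shows how the minimum requirement filters out incidental overlaps.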
Pair-wise Detection • Compute the asymmetric similarity for File f1 to File f2
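The formula itself is not reproduced in this transcript. Purely as an illustration of what "asymmetric" can mean here, the sketch below assumes one common definition: the fraction of f1's N-Grams that are covered by matched runs. Under that assumption the value for f1 to f2 generally differs from the value for f2 to f1, because the denominators differ. The function name and the run format `(start_in_f1, start_in_f2, length)` are hypothetical.

```python
def asymmetric_similarity(runs, total_ngrams_f1):
    """Assumed definition: share of f1's N-Grams covered by non-overlapping matched runs."""
    if total_ngrams_f1 == 0:
        return 0.0
    covered = sum(length for _, _, length in runs)
    return covered / total_ngrams_f1

# The same 10-gram match seen from a file with 18 N-Grams and from one with 13.
runs = [(0, 0, 10)]
sim_f1_to_f2 = asymmetric_similarity(runs, 18)
sim_f2_to_f1 = asymmetric_similarity(runs, 13)
print(sim_f1_to_f2 < sim_f2_to_f1)
# prints True
```

Under this assumed definition, a short file copied wholesale into a longer one scores high in one direction and lower in the other.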
Clustering • DBSCAN
• Advantages
  • Fast algorithm (O(n log n))
  • The number of clusters is determined automatically
  • A node (submitter) in a low-density region (not sufficiently similar to other submitters) is classified as noise and omitted
• Two parameters
  • Eps: user-defined grouping criterion
  • MinPts: predefined by the system as 2
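To make the roles of Eps and MinPts concrete, here is a minimal DBSCAN sketch over submitters. It is not the system's clustering engine: the neighbour search is a naive scan (quadratic overall, whereas the O(n log n) figure needs a spatial index), and the `dist` function and its sample values are made up for illustration. How the system turns similarities into distances is not stated on the slides.

```python
def dbscan(nodes, dist, eps, min_pts=2):
    """Return {node: cluster_id}, with -1 marking noise."""
    labels = {}
    cluster = -1

    def neighbours(p):
        # Includes p itself, as in the usual DBSCAN definition.
        return [q for q in nodes if dist(p, q) <= eps]

    for p in nodes:
        if p in labels:
            continue
        nbrs = neighbours(p)
        if len(nbrs) < min_pts:
            labels[p] = -1            # low-density region: noise
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in nbrs if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster   # former noise becomes a border point
            if q in labels:
                continue
            labels[q] = cluster
            q_nbrs = neighbours(q)
            if len(q_nbrs) >= min_pts:
                queue.extend(q_nbrs)  # q is a core point: keep expanding
    return labels

# Hypothetical distances between four submitters; unlisted pairs are far apart.
pair = {frozenset("AB"): 0.1, frozenset("BC"): 0.15}

def dist(p, q):
    return 0.0 if p == q else pair.get(frozenset((p, q)), 1.0)

print(dbscan(["A", "B", "C", "D"], dist, eps=0.2))
# prints {'A': 0, 'B': 0, 'C': 0, 'D': -1}
```

A and C are not directly close, but both are close to B, so all three land in one cluster, while D has no neighbour within Eps and is treated as noise. With MinPts fixed at 2, any two submitters within Eps of each other are enough to start a cluster.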