CodeSimian

CodeSimian CS491B – Andrew Weng

Motivation • Academic integrity is a universal issue • Plagiarism is still common today • Kaavya Viswanathan (Harvard Student) • Book contains many plagiarized passages • Yoshihiko Wada (Painter, Japan) • Artwork plagiarized from Alberto Sughi • Scott D. Miller (Wesley College President) • Plagiarized material found on his website

Is Plagiarism Harmful? • Who does plagiarism really hurt? • The student • The class • The University • Plagiarism is not only concerned with the protection of intellectual property rights

Plagiarism Detection Benefits of Utilizing Plagiarism Detection • Prevention • Enforcement • Objective standpoint

Platform Overview • Developed on Visual Studio .NET 2005 • Coded in Microsoft Visual C# .NET • Windows Forms application • Simple and familiar GUI (Windows) • Intended focus is ease of use

Theoretical Overview CodeSimian is based on two primary principles • Kolmogorov Complexity • Information Distance

Kolmogorov Complexity • Simple definition: The length of the shortest program that can be written on a universal Turing machine to produce a specified output • Purely theoretical • Impossible to calculate exactly

Kolmogorov Complexity Define x to be a desired output string • K(x) = the length of the shortest program that produces x • K(x|y) = the length of the shortest program that produces x given y as an input • K(xy) = the length of the shortest program that produces x concatenated with y

Kolmogorov Complexity Compare two numbers with infinitely long decimal expansions, π and a randomly generated number n between 0 and 1: π = 3.1415926535897932384626433832795… n = 0.5234958723957329875320935293853… K(π) is a small and finite number, which represents the code required to generate the value of π to an infinite number of digits

Kolmogorov Complexity π = 3.1415926535897932384626433832795… K(π) is a small and finite number, which represents the code required to generate the value of π to an infinite number of digits. Perhaps something as simple as the implementation of Leibniz’s formula: π/4 = 1 − 1/3 + 1/5 − 1/7 + …
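
Leibniz’s formula translates into only a few lines of code, which is the point of the slide: the program stays the same size no matter how many digits are wanted. A minimal sketch in Python, assuming floating-point arithmetic (so it illustrates the short program, not unlimited precision); the function name `leibniz_pi` is illustrative only:

```python
import math

def leibniz_pi(terms: int) -> float:
    """Approximate pi with the Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ..."""
    total = 0.0
    for k in range(terms):
        # k-th term of the series: alternating sign over the odd numbers
        total += (-1) ** k / (2 * k + 1)
    return 4 * total

# More terms give a better approximation; the program text does not grow.
print(leibniz_pi(100_000), math.pi)
```

The series converges slowly, so a practical digit generator would use a different formula, but the program would still be short.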

Kolmogorov Complexity n = 0.5234958723957329875320935293853… To generate the full output of a truly random number n, the program would have to be infinitely long. The code would essentially be System.out.println("0.52349587…");
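
The contrast between a patterned value and a random one can be observed with any general-purpose compressor. The sketch below uses Python's `zlib` as a stand-in for Kolmogorov complexity, which is an assumption made for illustration: compressed size only gives an upper bound on K, and it is not the compressor CodeSimian uses.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))

structured = b"0123456789" * 1000   # 10,000 bytes with an obvious pattern
random_like = os.urandom(10_000)    # 10,000 bytes from the OS entropy source

# The patterned input shrinks dramatically; the random input barely changes.
print(compressed_size(structured), compressed_size(random_like))
```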

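Kolmogorov Complexity So how does this apply to plagiarism detection? Define x = π and y = π/4. Then K(x|y) is a very small value: given y, one can calculate π with a single multiplication. In the same way, if one program was derived from another, very little extra information is needed to produce it from the original.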

Information Distance The distance (or difference) between two objects Formula used:

Information Distance • Similarity Factor: if we take the amount of information about x that is contained in y, and normalize it by the total amount of information in x and y together, we obtain a percentage of similarity
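
One way to write the similarity factor described above, using the K notation from the Kolmogorov Complexity slides, is shown below. It is a reconstruction from the prose description and should be read as an assumption, not as the exact formula from the slide; the symbols R and d are introduced here for illustration.

```latex
R(x, y) = \frac{K(x) - K(x \mid y)}{K(xy)}, \qquad d(x, y) = 1 - R(x, y)
```

Here K(x) − K(x|y) is the information about x that y supplies, K(xy) normalizes it, and d is the corresponding distance: near 0 for near-copies and near 1 for unrelated inputs.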

Implementation What does CodeSimian do to obtain the similarity factors? • Parse and Tokenize the code • Compress the tokenized strings • Compare the compressed strings

Parsing the Code • Utilized ANTLR to parse and tokenize the code • ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)

Tokenizing the Code • The tokenized output is a string of characters, each of which represents a token within the code • For example: { int c = 0; } contains 7 “letters”: • Open bracket • Integer type declaration • Variable name • Assignment operator • Integer value • Statement end • Close bracket
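
The mapping from source text to one “letter” per token can be sketched with a small regular-expression tokenizer. This is only an illustration: CodeSimian uses an ANTLR-generated parser, and the token classes and letter codes below (`T`, `V`, `N`, and so on) are hypothetical.

```python
import re

# Hypothetical token classes and their one-letter codes (order matters:
# type keywords must be tried before generic variable names).
TOKEN_SPEC = [
    ("T", r"\b(?:int|long|double|boolean|char)\b"),  # type declaration
    ("N", r"\d+"),                                   # integer value
    ("V", r"[A-Za-z_]\w*"),                          # variable name
    ("=", r"="),                                     # assignment operator
    (";", r";"),                                     # statement end
    ("{", r"\{"),                                    # open bracket
    ("}", r"\}"),                                    # close bracket
]
MASTER = re.compile("|".join(f"({pattern})" for _, pattern in TOKEN_SPEC))

def tokenize(source: str) -> str:
    """Return one letter per recognized token; whitespace is skipped."""
    letters = []
    for match in MASTER.finditer(source):
        # lastindex is the 1-based number of the group that matched
        letters.append(TOKEN_SPEC[match.lastindex - 1][0])
    return "".join(letters)

print(tokenize("{ int c = 0; }"))
print(tokenize("{ int counter = 42; }"))
```

Both calls produce the same seven-letter string, which is why renaming variables or changing literal values does not hide a copy from the later comparison.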

Compressing the String The tokenized string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers • As the string is read, a library of previously seen substrings is built up • When repeats are detected, the compressor uses pointers into the library to recreate the required section

Compressing the String • Normally, limits exist on the library size and on the length of each “word” stored • Memory utilization and efficiency are not important here, so the buffers are left unbounded • Lempel-Ziv is suitable for this application
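
The library-and-pointer idea can be sketched with an LZ78-style parse whose dictionary is never capped. The slides do not say which Lempel-Ziv variant CodeSimian implements, so the choice of LZ78 and the name `lz78_phrases` are assumptions made for illustration.

```python
def lz78_phrases(text: str) -> list[tuple[int, str]]:
    """Parse text into (library index, next character) pairs, LZ78 style."""
    library = {"": 0}   # phrase -> index; grows without any size limit
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in library:
            phrase += ch            # keep extending a phrase already seen
        else:
            # emit a pointer to the longest known phrase plus one new character
            output.append((library[phrase], ch))
            library[phrase + ch] = len(library)
            phrase = ""
    if phrase:
        # flush a trailing phrase that is already in the library
        output.append((library[phrase], ""))
    return output

print(lz78_phrases("ababab"))
print(len(lz78_phrases("a" * 100)))
```

The number of emitted pairs serves as a rough size measure: repetitive input produces far fewer pairs than its length, because each pointer stands for an increasingly long stored phrase.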

Comparing the Compressed String • K(x) is approximated by the size of the tokenized and compressed code x • K(x|y) is approximated by the size of the tokenized and compressed code x, given y as a “free” library • K(xy) is approximated by the size of the tokenized and compressed concatenation of x and y
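
The three compressed sizes can be combined into a similarity score as sketched below. Several things here are assumptions: `zlib` stands in for the tool's own Lempel-Ziv compressor, K(x|y) is approximated by the extra bytes x costs once y has already been compressed, the score is taken to be (K(x) − K(x|y)) / K(xy), and the token strings are made-up test data.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Stand-in for K: size in bytes after zlib compression."""
    return len(zlib.compress(data, 9))

def similarity(x: bytes, y: bytes) -> float:
    k_x = compressed_size(x)
    # Assumption: K(x|y) is the extra cost of x after y has been seen.
    k_x_given_y = max(compressed_size(y + x) - compressed_size(y), 0)
    k_xy = compressed_size(x + y)
    return (k_x - k_x_given_y) / k_xy

# Made-up token strings over an 8-letter alphabet, seeded for repeatability.
rng = random.Random(1)
original = "".join(rng.choices("abcdefgh", k=2000)).encode()
unrelated = "".join(rng.choices("abcdefgh", k=2000)).encode()

same_score = similarity(original, original)
different_score = similarity(original, unrelated)
print(f"copy: {same_score:.2f}  unrelated: {different_score:.2f}")
```

Because zlib only looks back 32 KB, this stand-in would miss copies in larger inputs, which is one reason an unbounded library matters for this application.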

Results Using the test on trivial examples: • LinkedList.java • LinkedList2.java • LinkedList3.java • Changes were limited to renaming variables, reformatting, removing comments, rearranging variable declarations, and adding “junk” code such as random debugging text output • All files came out as >85% similar

Results Using the test on a small real-world sample: Professor Kang’s CS201 HW1 • Relatively simple homework assignment • 30-50% similarity average • 95% similarity detected on one pair of submissions • Confirmed by Professor Kang as correct

Results Using the test on another small real-world sample: Professor Kang’s CS201 HW4 • More complex homework assignment involving 2-3 files; Java files broken down according to function • Open question: might specialized function files produce false positives? • 30-70% similarity average • 95+% similarity detected on pairs of submissions • Confirmed by Professor Kang as correct

Results • Things to note… • The results showed a similarity of 80% on one pair of submissions, which is deemed significant by the application but is not necessarily conclusive • Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable name changes

Conclusions • Successful test cases • Simple and straightforward to use • Based on an objective principle which works!

Future Work • Enhancing the application to be able to compare internal “blocks” of code • Improving the compression algorithm to better handle and adapt to “approximate matches” • Improving the functionality of the GUI • Providing a report-printing capability for directories
