CodeSimian


Presentation Transcript


  1. CodeSimian CS491B – Andrew Weng

  2. Motivation • Academic integrity is a universal issue • Plagiarism is still common today • Kaavya Viswanathan (Harvard Student) • Book contains many plagiarized passages • Yoshihiko Wada (Painter, Japan) • Artwork plagiarized from Alberto Sughi • Scott D. Miller (Wesley College President) • Plagiarized material found on his website

  3. Is Plagiarism Harmful? • Who does plagiarism really hurt? • The student • The class • The University • Plagiarism is not only concerned with the protection of intellectual property rights

  4. Plagiarism Detection Benefits of Utilizing Plagiarism Detection • Prevention • Enforcement • Objective standpoint

  5. Platform Overview • Developed on Visual Studio .NET 2005 • Coded in Microsoft Visual C# .NET • Windows Forms application • Simple and familiar GUI (Windows) • Intended focus is ease of use

  6. Theoretical Overview CodeSimian is based on two primary principles • Kolmogorov Complexity • Information Distance

  7. Kolmogorov Complexity • Simple definition: The length of the shortest program that can be written on a universal Turing machine to produce a specified output • Purely theoretical • Impossible to calculate exactly

  8. Kolmogorov Complexity Define x to be a desired output string. K(x) = the length of the shortest program that produces x. K(x|y) = the length of the shortest program that produces x given y as an input. K(xy) = the length of the shortest program that produces x concatenated with y.

  9. Kolmogorov Complexity Compare two numbers with infinitely long decimal expansions, π and a randomly generated number n between 0 and 1: π = 3.1415926535897932384626433832795… n = 0.5234958723957329875320935293853… K(π) is a small and finite number: the length of the code required to generate the digits of π to unlimited precision.

  10. Kolmogorov Complexity π = 3.1415926535897932384626433832795… K(π) is a small and finite number: the length of the code required to generate the digits of π to unlimited precision. Perhaps something as simple as an implementation of Leibniz’s formula: $\pi = 4 \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}$
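To make the point concrete, here is a minimal sketch of such a short program. It is written in Python for brevity, and the function name `leibniz_pi` and the term count are illustrative assumptions rather than anything taken from the slides. It sums the Leibniz series π = 4(1 − 1/3 + 1/5 − …).

```python
import math


def leibniz_pi(terms: int) -> float:
    """Approximate pi with the first `terms` terms of the Leibniz series."""
    total = 0.0
    for k in range(terms):
        # Terms alternate in sign: +1/1, -1/3, +1/5, ...
        total += (-1) ** k / (2 * k + 1)
    return 4 * total


# The series converges slowly, but the program text stays the same size
# no matter how many digits are requested.
approximation = leibniz_pi(100_000)
print(approximation, abs(approximation - math.pi))
```

The program's length is fixed while its output can be made as precise as desired, which is the sense in which K(π) is small.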

  11. Kolmogorov Complexity n = 0.5234958723957329875320935293853… To generate the full output of a truly random number n, the program would have to be infinitely long, because it can do no better than print the digits literally. The code would essentially be System.out.println("0.52349587…");

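  12. Kolmogorov Complexity So how does this apply to plagiarism detection? Define x = π and y = π/4. K(x|y) would be a very small value: given y, one can calculate π with a single multiplication. In the same way, if program x was derived from program y, little extra information is needed to produce x from y.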

  13. Information Distance The distance (or difference) between two objects Formula used:

  14. Information Distance • Similarity Factor If we take the amount of information in x that is also contained in y, and normalize it by the amount of information in x and y combined, we obtain a percentage of similarity
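The wording above can be written as a formula. What follows is a reconstruction from this slide's description and the K(x), K(x|y), K(xy) notation of slide 8, not a copy of the original slide's formula. The symbols S and d are labels assumed here.

```latex
S(x, y) = \frac{K(x) - K(x \mid y)}{K(xy)}, \qquad d(x, y) = 1 - S(x, y)
```

Under this reading, K(x) − K(x|y) is the information about x that y already supplies. S near 1 would suggest that x is almost entirely derivable from y, and S near 0 that the two share little.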

  15. Implementation What does CodeSimian do to obtain the similarity factors? • Parse and Tokenize the code • Compress the tokenized strings • Compare the compressed strings

  16. Parsing the Code • Utilized ANTLR to parse and tokenize the code • ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. (www.antlr.org)

  17. Tokenizing the Code • The tokenized output is a string of characters, each of which represents a token within the code • For example: { int c = 0; } contains 7 “letters”: open bracket, integer type declaration, variable name, assignment operator, integer value, statement end, close bracket
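The idea of mapping each token to one letter can be sketched as follows. CodeSimian used ANTLR for this step. The regular expression below is only a toy stand-in that covers the slide's example, and the letter assignments (`T`, `V`, `N` and so on) are hypothetical.

```python
import re

# Toy token classes, tried in order; a real tokenizer would come from a grammar.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<type>\b(?:int|float|double|char|boolean)\b)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<number>\d+)
  | (?P<assign>=)
  | (?P<end>;)
  | (?P<open>\{)
  | (?P<close>\})
    """,
    re.VERBOSE,
)

# Hypothetical one-letter code per token class.
LETTERS = {
    "type": "T", "ident": "V", "number": "N",
    "assign": "=", "end": ";", "open": "{", "close": "}",
}


def tokenize(source: str) -> str:
    """Return one letter per recognized token; whitespace is skipped."""
    return "".join(LETTERS[m.lastgroup] for m in TOKEN_PATTERN.finditer(source))


print(tokenize("{ int c = 0; }"))         # -> {TV=N;}
print(tokenize("{ int counter = 42; }"))  # -> {TV=N;}
```

Because identifiers and literal values collapse to a single letter each, renaming a variable or changing a constant leaves the token string unchanged.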

  18. Compressing the String This string is then compressed using a Lempel-Ziv compression algorithm with unbounded buffers • As the string is read, a library of previously seen sections is built up • When a repeat is detected, the compressor stores a pointer into the library instead of the repeated section

  19. Compressing the String • Normally limitations exist on library size and the “word” length stored • Memory utilization and efficiency are not important here, so these limits can be dropped • Lempel-Ziv is suitable for this application
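A dictionary-building Lempel-Ziv scheme of the kind described can be sketched as below. This is the textbook LZ78 variant with an unbounded dictionary. The slides do not say which Lempel-Ziv variant CodeSimian implements, so treat the choice as an assumption.

```python
from typing import List, Tuple


def lz78_compress(text: str) -> List[Tuple[int, str]]:
    """Encode text as (dictionary index, next character) pairs."""
    dictionary = {"": 0}  # phrase -> index; grows without any size limit
    output = []
    phrase = ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate  # keep extending a phrase already seen
        else:
            output.append((dictionary[phrase], ch))
            dictionary[candidate] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its prefix plus last char.
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output


def lz78_decompress(pairs: List[Tuple[int, str]]) -> str:
    """Rebuild the text by replaying the dictionary construction."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)


tokens = "{TV=N;}" * 3
encoded = lz78_compress(tokens)
print(len(tokens), len(encoded), lz78_decompress(encoded) == tokens)
```

Repeated token sequences are replaced by references to earlier dictionary entries, so code that repeats itself, or repeats another file supplied first, needs fewer pairs.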

  20. Comparing the Compressed String • K(x) is the size of the tokenized and compressed code x. • K(x|y) is the size of the tokenized and compressed code x, given y as a “free” library • K(xy) is the size of the tokenized and compressed concatenation of x and y.
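These three quantities can be approximated with any off-the-shelf compressor. The sketch below uses Python's `zlib` (DEFLATE, with a 32 KB window) as a stand-in for the unbounded Lempel-Ziv compressor. It estimates K(x|y) as C(yx) − C(y), meaning the extra bytes needed for x once y has been seen, and combines the values following the similarity factor described on slide 14. All of these choices are assumptions made for illustration, not CodeSimian's confirmed method.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """C(data): number of bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def similarity(x: bytes, y: bytes) -> float:
    """Estimate (K(x) - K(x|y)) / K(xy) with compressed sizes."""
    k_x = compressed_size(x)
    # Extra cost of x when y is already available as a "free" library.
    k_x_given_y = max(compressed_size(y + x) - compressed_size(y), 0)
    k_xy = compressed_size(x + y)
    return (k_x - k_x_given_y) / k_xy


# Synthetic token strings: x and z are independent random sequences.
rng = random.Random(0)
alphabet = b"ABCDEFGH"
x = bytes(rng.choice(alphabet) for _ in range(2000))
z = bytes(rng.choice(alphabet) for _ in range(2000))

print(round(similarity(x, x), 2))  # identical inputs: close to 1
print(round(similarity(x, z), 2))  # unrelated inputs: close to 0
```

With real submissions, x and y would be the token strings of two source files. Inputs much larger than zlib's window would need a compressor without that limit.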

  21. Results Using the test on trivial examples: • LinkedList.java • LinkedList2.java • LinkedList3.java • Changes were limited to renaming variables, reformatting, removing comments, rearranging variable declarations, and adding “junk” code such as random debugging text output. • All files came out as >85% similar

  22. Results Using the test on a small real-world sample Professor Kang’s CS201 HW1 • Relatively simple homework assignment • 30-50% similarity average • 95% similarity detected on one pair of submissions • Confirmed by Professor Kang as correct

  23. Results Using the test on another small real-world sample Professor Kang’s CS201 HW4 • More complex homework assignment involving 2-3 files; breakdown of Java files according to function • Open question: might specialized function files produce false positives? • 30-70% similarity average • 95+% similarity detected on pairs of submissions • Confirmed by Professor Kang as correct

  24. Results • Things to note… • The results showed a similarity of 80% on one pair of submissions, which is deemed significant by the application but is not necessarily conclusive • Careful inspection by hand of the suspected files revealed one block of code that was apparently copied with variable name changes

  25. Conclusions • Successful test cases • Simple and straightforward to use • Based on an objective principle which works!

  26. Future Work • Enhancing the application to be able to compare internal “blocks” of code • Improving the compression algorithm to better handle and adapt to “approximate matches” • Improving the functionality of the GUI • Providing the ability to print reports for whole directories
