
New Approaches for Detecting Similarities in Program Code




  1. Mike Joy, 25 February 2010 New Approaches for Detecting Similarities in Program Code

  2. Overview of Talk • What is the Problem? • Historical Overview • New Approaches • Where Next?

  3. Part 1 – What is the Problem? Document similarity What do we mean? Why is software an issue? Why is this interesting?

  4. Four stages Collection → Detection → Confirmation → Investigation From Culwin and Lancaster (2002).

  5. Stage 1: Collection Get all documents together online so they can be processed formats? security? BOSS (Warwick) Coursemaster (Nottingham) Managed Learning Environment

  6. Stage 2: Detection Compare with other submissions Compare with external documents essay-based assignments We’ll come back to this later it’s the interesting bit!

  7. Stage 3: Confirmation Software tool says “A and B similar” Are they? Never rely on a computer program! Requires expert human judgement Evidence must be compelling Might go to court

  8. Stage 4: Investigation A from B, or B from A, or joint work? If A from B, did B know? open networked file printer output Did the culprit/s understand? University processes must be followed

  9. Why is this Interesting? How do you compare two programs? This is an algorithm question Stages 2 and 3: detection and confirmation How do you use the results (of a comparison) to educate students? This is a pedagogic question Stage 4, and before stage 1!

  10. Digression: Essays Plagiarism in essays is easier to detect Lots of “tricks” a lecturer can use! Google search on phrases Abnormal style ... etc. Software tools Let's have a look ...

  11. Pedagogy Can be used by academics to detect plagiarism provide evidence Can be used by students to check their own work

  12. Part 2 – Historical Overview How has similar code been detected in the past? How well do the approaches work?

  13. Why not use Turnitin? It won’t work! String matching algorithm inappropriate Database does not contain code Commercial involvement E.g. Black Duck Software

  14. /* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}

/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}

  15. Is This Plagiarism? Is Program 2 derived from Program 1 in a manner which is “plagiarism”? Probably No It's too simple Too many copies in books / on the web Most of it is generic syntax

  16. Program 3 (Source code for MS Windows 7) Program 4 (code 98% identical to the source code for MS Windows 7)

  17. Is This Plagiarism? Is Program 4 derived from Program 3 in a manner which is “plagiarism”? Definitely Yes It's too complicated to happen by chance Millions of lines of code The source is “closed” Microsoft guard it very well!

  18. /* Program 5 */
public class Sun {
    static final double latitude=52.4;
    static final double longitude=-1.5;
    static final double tpi = 2.0*pi;
    /* ... */
    public static void main(String[] args) {
        calculate();
    }
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    }
    public static void calculate() { /* ... */ }
    /* ... */

/* Program 6 */
public class SunsetCalculator {
    static float latitude=52.4;
    static float longitude=-1.5;
    /* ... */
    public static void main(String[] args) {
        findSunsetTime();
    }
    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    }
    public static void findSunsetTime() { /* ... */ }
    /* ... */

  19. Is This Plagiarism? Is Program 6 derived from Program 5 in a manner which is “plagiarism”? Maybe Structure is similar – cosmetic changes But the algorithm is public domain Maybe 6 derived from 5, maybe the other way round

  20. History ... First known plagiarism detection system was an attribute counting program developed by Ottenstein (1976) More recent systems compare the structure of source-code programs Structure-based systems include: YAP3, MOSS, JPlag, Plague, and Sherlock.

  21. Detection Tools (1) Attribute counting systems (Halstead, 1972): Numbers of unique operators Numbers of unique operands Total numbers of operator occurrences Total numbers of operand occurrences
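To make the idea concrete, a minimal Java sketch of attribute counting follows. The keyword list and the operator/operand classification are simplified assumptions, not the exact rules of Halstead's metrics or of any particular tool.

import java.util.*;
import java.util.regex.*;

/* Sketch of Halstead-style attribute counting (simplified, assumed rules). */
public class AttributeCounter {

    private static final Set<String> KEYWORDS = Set.of(
            "public", "static", "class", "void", "int", "float", "for", "if", "else", "return");

    /* Returns {n1, n2, N1, N2}: unique operators, unique operands, total operators, total operands. */
    public static int[] count(String source) {
        Set<String> uniqueOperators = new HashSet<>(), uniqueOperands = new HashSet<>();
        int totalOperators = 0, totalOperands = 0;
        // Words and numbers are candidate operands; keywords and symbols count as operators.
        Matcher m = Pattern.compile("[A-Za-z_]\\w*|\\d+(\\.\\d+)?|\\S").matcher(source);
        while (m.find()) {
            String t = m.group();
            boolean operand = !KEYWORDS.contains(t) && t.matches("[A-Za-z_]\\w*|\\d+(\\.\\d+)?");
            if (operand) { uniqueOperands.add(t);  totalOperands++;  }
            else         { uniqueOperators.add(t); totalOperators++; }
        }
        return new int[] { uniqueOperators.size(), uniqueOperands.size(), totalOperators, totalOperands };
    }
}

Two submissions are then compared simply by comparing these four counts; near-identical profiles are flagged for human inspection.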

  22. Detection Tools (2) Structure-based systems: Each program is converted into a token string (or something similar) Token streams are then compared to find similar source-code fragments Tools: JPlag, MOSS, and Sherlock

  23. Example (code 1)
int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}

  24. Example (code 2)
Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}

  25. Example (tokenised)
type name(type name) start
type name=number
loop (type name=number name compare number operation name) start
name operation name
end
return name
end
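The mapping above can be produced by a very small tokeniser; the sketch below uses assumed token categories chosen to mirror this example, not the actual token set of JPlag, MOSS or Sherlock.

import java.util.*;
import java.util.regex.*;

/* Rough tokeniser in the spirit of the example above (token categories are assumptions). */
public class Tokeniser {

    private static final Set<String> TYPES = Set.of("int", "float", "Integer", "String", "void");

    public static List<String> tokenise(String source) {
        List<String> tokens = new ArrayList<>();
        // Words, numbers, two-character operators, then any single symbol.
        Matcher m = Pattern.compile("[A-Za-z_]\\w*|\\d+(\\.\\d+)?|[<>=!+\\-*/]=|--|\\+\\+|\\S")
                           .matcher(source);
        while (m.find()) {
            String t = m.group();
            if (TYPES.contains(t))                       tokens.add("type");
            else if (t.equals("for"))                    tokens.add("loop");
            else if (t.equals("return"))                 tokens.add("return");
            else if (t.equals("{"))                      tokens.add("start");
            else if (t.equals("}"))                      tokens.add("end");
            else if (t.equals("=") || t.equals("(") || t.equals(")"))
                                                         tokens.add(t);      // kept literally, as above
            else if (t.matches("\\d+(\\.\\d+)?"))        tokens.add("number");
            else if (t.matches("[A-Za-z_]\\w*"))         tokens.add("name");
            else if (t.matches("[<>]=?|[=!]="))          tokens.add("compare");
            else if (t.matches("[+\\-*/]=?|--|\\+\\+"))  tokens.add("operation");
            // remaining punctuation (';', ',') is discarded
        }
        return tokens;
    }
}

Applied to code 1 and code 2, this produces near-identical token streams, which is exactly why a structure-based comparison flags the pair even though names, types and loop direction differ.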

  26. Detectors MOSS (Berkeley/Stanford, USA) JPlag (Karlsruhe, Germany) Java only Programs must compile? Sherlock (Warwick, UK) MOSS and JPlag are Internet resources Data Protection?

  27. MOSS Developed by Alex Aiken in 1994 MOSS (for a Measure Of Software Similarity) determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs. MOSS is free, but you must create an account MOSS home page: http://theory.stanford.edu/~aiken/moss/

  28. MOSS – Algorithm “Winnowing” (Schleimer et al., 2003) Local document fingerprinting algorithm Efficiency proven (within 33% of the lower bound) Guarantees detection of matches longer than a certain threshold
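A compact sketch of the winnowing idea follows: hash every k-gram, slide a window of w hashes, and keep the minimum hash in each window as a fingerprint. The hash function and the similarity measure at the end are illustrative assumptions, not MOSS's actual implementation.

import java.util.*;

/* Sketch of winnowing fingerprint selection (after Schleimer et al., 2003); parameters illustrative. */
public class Winnow {

    /* Any match of length at least w + k - 1 is guaranteed to share a selected fingerprint. */
    public static Set<Integer> fingerprints(String text, int k, int w) {
        int n = Math.max(text.length() - k + 1, 0);
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();   // stand-in for a rolling hash
        }
        Set<Integer> selected = new LinkedHashSet<>();
        for (int start = 0; start + w <= n; start++) {
            int minPos = start;
            for (int j = start; j < start + w; j++) {
                if (hashes[j] <= hashes[minPos]) minPos = j;   // rightmost minimum on ties
            }
            selected.add(hashes[minPos]);
        }
        return selected;
    }

    /* Crude similarity: overlap of the two fingerprint sets. */
    public static double similarity(String a, String b, int k, int w) {
        Set<Integer> fa = fingerprints(a, k, w), fb = fingerprints(b, k, w);
        Set<Integer> common = new HashSet<>(fa);
        common.retainAll(fb);
        return (fa.isEmpty() || fb.isEmpty())
                ? 0.0
                : (double) common.size() / Math.min(fa.size(), fb.size());
    }
}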

  29. Using MOSS MOSS is provided as an Internet service The user must download the MOSS Perl script for submitting files to the MOSS server The script uses a direct network connection The MOSS server produces HTML pages listing pairs of programs with similar code MOSS highlights similar code fragments within programs that appear the same Data Protection? – US service Maintenance?

  30. JPlag Developed by Guido Malpohl in 1996 JPlag currently supports Java, C#, C, C++, Scheme, and natural language text Use of JPlag is free, but the user must create an account JPlag can be used to compare student assignments but does not compare them with code on the Internet JPlag home page: www.ipd.uni-karlsruhe.de/jplag

  31. JPlag – Algorithm 1) Parse (or scan) programs 2) Convert programs to tokens 3) Pairwise compare “Greedy String Tiling” maximises percentage of common token strings worst case Θ(n³), average case linear Prechelt et al. (2002)
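The sketch below shows the core of Greedy String Tiling over two token lists: repeatedly take the longest remaining match above a minimum length, mark it as a tile, and report the proportion of tokens covered. This is a simplified one-match-per-pass variant for illustration, not JPlag's optimised implementation.

import java.util.*;

/* Simplified Greedy String Tiling over token lists (illustrative, not JPlag's code). */
public class GreedyStringTiling {

    public static double similarity(List<String> a, List<String> b, int minMatch) {
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        boolean[] markedA = new boolean[a.size()], markedB = new boolean[b.size()];
        int tiled = 0;
        while (true) {
            int best = 0, bestI = -1, bestJ = -1;
            // Find the longest run of unmarked, equal tokens (this scan is the expensive part).
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int len = 0;
                    while (i + len < a.size() && j + len < b.size()
                            && !markedA[i + len] && !markedB[j + len]
                            && a.get(i + len).equals(b.get(j + len))) {
                        len++;
                    }
                    if (len > best) { best = len; bestI = i; bestJ = j; }
                }
            }
            if (best < Math.max(minMatch, 1)) break;   // nothing long enough remains
            for (int k = 0; k < best; k++) {           // mark the tile so it cannot be matched again
                markedA[bestI + k] = true;
                markedB[bestJ + k] = true;
            }
            tiled += best;
        }
        return 2.0 * tiled / (a.size() + b.size());    // proportion of tokens covered by tiles
    }
}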

  32. JPlag File Processing

  33. JPlag - Results Results in HTML Format Histogram of similarity values found for all pairs of programs Similar pairs and their similarity values displayed Select file pairs to view

  34. JPlag - Matches Similar lines matched with the same colour Code fragment similarity values based on similar tokens found

  35. Sherlock Developed at the University of Warwick Department of Computer Science Sherlock was fully integrated with the BOSS online submission software in 2002 and open-sourced Sherlock detects plagiarism in source-code and natural-language assignments BOSS home page: www.boss.org.uk

  36. Sherlock - Preprocessing Whitespace Comments Normalisation Tokenisation
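A minimal sketch of those four steps is given below; the normalisation rules are assumptions for illustration rather than Sherlock's actual code (tokenisation could then reuse a tokeniser like the one sketched earlier).

/* Sketch of the preprocessing steps listed above (assumed rules, not Sherlock's implementation). */
public class Preprocess {

    public static String clean(String source) {
        return source
                .replaceAll("/\\*(?s:.*?)\\*/", " ")            // remove block comments
                .replaceAll("//.*", " ")                        // remove line comments
                .replaceAll("\"(\\\\.|[^\"\\\\])*\"", "\"S\"")  // normalise string literals
                .toLowerCase()                                  // normalise case
                .replaceAll("\\s+", " ")                        // collapse whitespace
                .trim();
    }
}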

  37. Sherlock – Results Results displayed Similarity values of suspicious files Similarity values depend on the length of similar lines found as a percentage of the whole file size Select suspicious matches to examine Mark suspicious files

  38. Sherlock – Matches Suspected sections marked with **begin suspicious section** and **end suspicious section**

  39. Sherlock – Document Set User can view graph Each node represents one submission An edge indicates that two submissions are similar Options to select threshold Click on lines to view or to mark suspicious matches
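A sketch of how such a graph can be assembled from pairwise similarity scores is shown below, reusing the GreedyStringTiling sketch from the JPlag slides; the method names and threshold handling are assumptions, not Sherlock's interface.

import java.util.*;

/* Sketch: build a "document set" graph, with an edge for each pair above the threshold. */
public class SimilarityGraph {

    public static List<String[]> edges(Map<String, List<String>> tokensByFile,
                                       double threshold, int minMatch) {
        List<String> names = new ArrayList<>(tokensByFile.keySet());
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                double s = GreedyStringTiling.similarity(
                        tokensByFile.get(names.get(i)), tokensByFile.get(names.get(j)), minMatch);
                if (s >= threshold) {
                    result.add(new String[] { names.get(i), names.get(j) });  // an edge in the graph
                }
            }
        }
        return result;
    }
}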

  40. CodeMatch Commercial product Free academic use for small data sets Exact algorithm not published patent pending?

  41. Example of Identical “Instruction Sequences”
/* File 1 */
for (int i=1; i<10; i++) {
    if (a==10)
        print("done");
    else
        a++;
}

/* File 2 */
for (int x=100; x > 0; x--) {
    if (z99 > -10)
        print("ans is " + z99);
    else {
        abc += 65;
    }
}

  42. CodeMatch – Algorithm (1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions (2) Extract comments, and compare (3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar” (4) Combine (1), (2) and (3) to give correlation score
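The combination step can be pictured with the sketch below. CodeMatch's real formula is not published, so the identifier-similarity rule and the equal weights here are purely illustrative assumptions.

import java.util.*;

/* Illustrative combination of the three partial scores above (assumed rules and weights). */
public class Correlation {

    /* Assumed rule: ignore case and digits, then require a prefix relationship,
       so that x, xxx and xx12345 all count as "similar". */
    static boolean similarIdentifiers(String a, String b) {
        String p = a.toLowerCase().replaceAll("\\d", "");
        String q = b.toLowerCase().replaceAll("\\d", "");
        return p.startsWith(q) || q.startsWith(p);
    }

    /* Fraction of identifiers in file A with a similar counterpart in file B. */
    static double identifierScore(Set<String> idsA, Set<String> idsB) {
        if (idsA.isEmpty()) return 0.0;
        long matched = idsA.stream()
                .filter(a -> idsB.stream().anyMatch(b -> similarIdentifiers(a, b)))
                .count();
        return (double) matched / idsA.size();
    }

    /* Combine instruction-sequence, comment and identifier scores; equal weights are arbitrary here. */
    static double correlation(double instructionSim, double commentSim, double identifierSim) {
        return (instructionSim + commentSim + identifierSim) / 3.0;
    }
}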

  43. Heuristics Comments Spelling mistakes Unusual English (Thai, German, …) Use of Search Engines Unusual style Code errors

  44. Tool Efficiency MOSS, JPlag and Sherlock are effective Results returned are similar Results returned are not identical User interface issues may be important

  45. Part 3 – New Approaches Eschew the “syntax driven” approach Lateral thinking? Case study: Latent Semantic Analysis

  46. Digression: Similarity What do we actually mean by “similar”? This is where the problems start ...

  47. (1) Staff Survey We carried out a survey in order to: gather the perceptions of academics on what constitutes source-code plagiarism, and create a structured description of what constitutes source-code plagiarism from a UK academic perspective Cosma and Joy (2008)

  48. Data Source On-line questionnaire distributed to 120 academics Questions were in the form of small scenarios Mostly multiple-choice responses Comments box below each question Anonymous – option for providing details Received 59 responses, from more than 34 different institutions Responses were analysed and collated to create a universally acceptable source-code plagiarism description.
