Mike Joy 25 February 2010 New Approaches for Detecting Similarities in Program Code
Overview of Talk • What is the Problem? • Historical Overview • New Approaches • Where Next?
Part 1 – What is the Problem? Document similarity What do we mean? Why is software an issue? Why is this interesting?
Four stages Collection Detection Confirmation Investigation From Culwin and Lancaster (2002).
Stage 1: Collection Get all documents together online so they can be processed formats? security? BOSS (Warwick) Coursemaster (Nottingham) Managed Learning Environment
Stage 2: Detection Compare with other submissions Compare with external documents (essay-based assignments) We’ll come back to this later: it’s the interesting bit!
Stage 3: Confirmation Software tool says “A and B similar” Are they? Never rely on a computer program! Requires expert human judgement Evidence must be compelling Might go to court
Stage 4: Investigation A from B, or B from A, or joint work? If A from B, did B know? open networked file printer output Did the culprit/s understand? University processes must be followed
Why is this Interesting? How do you compare two programs? This is an algorithm question Stages 2 and 3: detection and confirmation How do you use the results (of a comparison) to educate students? This is a pedagogic question Stage 4, and before stage 1!
Digression: Essays Plagiarism in essays is easier to detect Lots of “tricks” a lecturer can use! Google search on phrases Abnormal style ... etc. Software tools Let's have a look ...
Pedagogy Can be used by academics to detect plagiarism provide evidence Can be used by students to check their own work
Part 2 – Historical Overview How has similar code been detected in the past? How well do the approaches work?
Why not use Turnitin? It won’t work! String matching algorithm inappropriate Database does not contain code Commercial involvement E.g. Black Duck Software
/* Program 1 */
public class Hello {
    public static void main(String[] argv) {
        System.out.println("Hello World");
    }
}
/* Program 2 */
public class HelloWorld {
    public static void main(String[] x) {
        System.out.println("hello world!");
    }
}
Is This Plagiarism? Is Program 2 derived from Program 1 in a manner which is “plagiarism”? Probably No It's too simple Too many copies in books / on the web Most of it is generic syntax
Program 3 (Source code for MS Windows 7) Program 4 (code 98% identical to the source code for MS Windows 7)
Is This Plagiarism? Is Program 4 derived from Program 3 in a manner which is “plagiarism”? Definitely Yes It's too complicated to happen by chance Millions of lines of code The source is “closed” Microsoft guard it very well!
/* Program 5 */
public class Sun {
    static final double latitude=52.4;
    static final double longitude=-1.5;
    static final double tpi = 2.0*pi;
    /* ... */
    public static void main(String[] args) {
        calculate();
    }
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    };
    public static void calculate() { /* ... */ }
    /* ... */
/* Program 6 */
public class SunsetCalculator {
    static float latitude=52.4;
    static float longitude=-1.5;
    /* ... */
    public static void main(String[] args) {
        findSunsetTime();
    }
    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    };
    public static void findSunsetTime() { /* ... */ }
    /* ... */
Is This Plagiarism? Is Program 6 derived from Program 5 in a manner which is “plagiarism”? Maybe Structure is similar – cosmetic changes But the algorithm is public domain Maybe 6 derived from 5, maybe the other way round
History ... First known plagiarism detection system was an attribute counting program developed by Ottenstein (1976) More recent systems compare the structure of source-code programs Structure-based systems include: YAP3, MOSS, JPlag, Plague, and Sherlock.
Detection Tools (1) Attribute counting systems (Halstead, 1972): Numbers of unique operators Numbers of unique operands Total numbers of operator occurrences Total numbers of operand occurrences
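The four counts listed above can be sketched as follows. This is only an illustration: the class name `AttributeCounts`, the tiny keyword set, and the rule "identifiers and numbers are operands, everything else is an operator" are assumptions made for the example, not the rules used by Halstead or Ottenstein.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Crude attribute counter; the lexical rules are illustrative assumptions. */
final class AttributeCounts {
    // Hypothetical keyword set; keywords are counted as operators here.
    private static final Set<String> KEYWORDS =
            Set.of("int", "float", "double", "for", "if", "else", "return", "while");

    // Identifiers, numbers, two-character operators, or any other single symbol.
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|[-+*/=<>!]=|\\+\\+|--|&&|\\|\\||\\S");

    final Set<String> uniqueOperators = new HashSet<>();
    final Set<String> uniqueOperands = new HashSet<>();
    int totalOperators;
    int totalOperands;

    static AttributeCounts of(String source) {
        AttributeCounts c = new AttributeCounts();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            boolean wordOrNumber = Character.isLetterOrDigit(first) || first == '_';
            if (wordOrNumber && !KEYWORDS.contains(t)) {
                c.uniqueOperands.add(t);   // identifier or numeric literal
                c.totalOperands++;
            } else {
                c.uniqueOperators.add(t);  // keyword, operator or punctuation
                c.totalOperators++;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        AttributeCounts c = AttributeCounts.of("ans = ans * j + 1;");
        System.out.println("unique operators: " + c.uniqueOperators.size());
        System.out.println("unique operands:  " + c.uniqueOperands.size());
        System.out.println("total operators:  " + c.totalOperators);
        System.out.println("total operands:   " + c.totalOperands);
    }
}
```

Two programs with similar count vectors would be flagged as candidates; the counts say nothing about the order of the code, which is the weakness the structure-based tools address.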
Detection Tools (2) Structure-based systems: Each program is converted into token strings (or something similar) Token streams are compared to find similar source-code fragments Tools: JPlag, MOSS, and Sherlock
Example (code 1)
int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}
Example (code 2)
Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}
Example (tokenised) type name(type name) start type name=number loop (type name=number name compare number operation name) start name operation name end return name end
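A minimal sketch of this kind of tokenisation is given below. The vocabulary (`type`, `name`, `number`, `compare`, `operation`, `loop`, `start`, `end`) follows the tokenised example; the class name `SimpleTokeniser`, the regular expression, and the small sets of recognised types and loop keywords are assumptions for illustration. Real tools use a proper lexer or parser for each language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Maps source text to a stream of abstract tokens (illustrative vocabulary). */
final class SimpleTokeniser {
    private static final Set<String> TYPES =
            Set.of("int", "float", "double", "String", "Integer");
    private static final Set<String> LOOPS = Set.of("for", "while");
    private static final Set<String> COMPARISONS =
            Set.of("<", ">", "<=", ">=", "==", "!=");
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|\\+\\+|--|[-+*/<>=!]=|\\S");

    static List<String> tokenise(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (TYPES.contains(t)) out.add("type");
            else if (LOOPS.contains(t)) out.add("loop");
            else if (t.equals("return")) out.add("return");
            else if (Character.isDigit(first)) out.add("number");
            else if (Character.isLetter(first) || first == '_') out.add("name");
            else if (t.equals("{")) out.add("start");
            else if (t.equals("}")) out.add("end");
            else if (COMPARISONS.contains(t)) out.add("compare");
            else if (t.equals("=") || t.equals("(") || t.equals(")")) out.add(t);
            else if (t.equals(";") || t.equals(",")) continue; // separators are dropped
            else out.add("operation"); // ++, --, *=, + and similar
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenise("for (int j=1; j<=100; j++) { ans *= j; }"));
        System.out.println(tokenise("for (float f=100.0; f > 0.0; f--) { result *= f; }"));
    }
}
```

The two loops in `main` differ in identifier names, types, constants and loop direction, yet produce the same token stream. Note that this sketch only emits `start` and `end` for explicit braces, so a loop body written without braces tokenises slightly differently.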
Detectors MOSS (Berkeley/Stanford, USA) JPlag (Karlsruhe, Germany) Java only Programs must compile? Sherlock (Warwick, UK) MOSS and JPlag are Internet resources Data Protection?
MOSS Developed by Alex Aiken in 1994 MOSS (for a Measure Of Software Similarity) determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs. MOSS is free, but you must create an account MOSS home page: http://theory.stanford.edu/~aiken/moss/
MOSS – Algorithm “Winnowing” (Schleimer et al., 2003) Local document fingerprinting algorithm Efficiency proven (within 33% of the lower bound) Guarantees detection of matches longer than a certain threshold
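The idea behind winnowing can be sketched as follows: hash every k-gram of the (already normalised) text, slide a window of w consecutive hashes over the result, and keep the minimum hash in each window. Any shared substring of at least w + k − 1 characters then contributes at least one common fingerprint. This is a simplified illustration, not the MOSS implementation: the class name `Winnowing`, the use of `String.hashCode` in place of a rolling hash, and the parameter values in `main` are all assumptions.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Sketch of winnowing; k, w and the hash function are illustrative choices. */
final class Winnowing {
    /** Selected fingerprints as (k-gram position) -> (k-gram hash). */
    static Map<Integer, Integer> fingerprints(String text, int k, int w) {
        int n = Math.max(0, text.length() - k + 1);
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            // String.hashCode stands in for the rolling hash a real tool would use.
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        Map<Integer, Integer> selected = new LinkedHashMap<>();
        for (int start = 0; start + w <= n; start++) {
            int minPos = start;
            for (int j = start + 1; j < start + w; j++) {
                if (hashes[j] <= hashes[minPos]) minPos = j; // "<=" keeps the rightmost minimum
            }
            selected.put(minPos, hashes[minPos]); // a position already chosen is not duplicated
        }
        return selected;
    }

    /** Fingerprint hashes that occur in both documents. */
    static Set<Integer> sharedHashes(String a, String b, int k, int w) {
        Set<Integer> shared = new HashSet<>(fingerprints(a, k, w).values());
        shared.retainAll(new HashSet<>(fingerprints(b, k, w).values()));
        return shared;
    }

    public static void main(String[] args) {
        String a = "the quick brown fox jumps";
        String b = "zzz quick brown fox qqq";
        System.out.println("fingerprints in a: " + fingerprints(a, 5, 4).size());
        System.out.println("shared fingerprints: " + sharedHashes(a, b, 5, 4).size());
    }
}
```

In this sketch a text with fewer than w k-grams yields no fingerprints at all; a production tool would need to handle such short inputs separately.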
Using MOSS MOSS is provided as an Internet service User must download the MOSS Perl script for submitting files to the MOSS server The script uses a direct network connection The MOSS server produces HTML pages listing pairs of programs with similar code MOSS highlights similar code fragments within programs that appear the same Data Protection? – US service Maintenance?
JPlag Developed by Guido Malpohl in 1996 JPlag currently supports Java, C#, C, C++, Scheme, and natural language text Use of JPlag is free, but user must create an account JPlag can be used to compare student assignments but does not compare with code on the Internet JPlag home page: www.ipd.uni-karlsruhe.de/jplag
JPlag – Algorithm 1) Parse (or scan) programs 2) Convert programs to tokens 3) Pairwise compare “Greedy String Tiling” maximises percentage of common token strings worst case Θ(n³), average case linear Prechelt et al. (2002)
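The pairwise comparison step can be sketched with a naive version of Greedy String Tiling: repeatedly find the longest common run of unmarked tokens, mark it as a tile, and stop when no run longer than a minimum match length remains. The code below is a simplified illustration without the hashing speed-ups that give the near-linear average case; the class name `GreedyStringTiling` and the minimum match length used in `main` are assumptions. The similarity value is computed as twice the number of tiled tokens divided by the combined length of the two token streams.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive Greedy String Tiling over two token lists (cubic worst case). */
final class GreedyStringTiling {
    /** Number of tokens covered by tiles of at least minMatch tokens (minMatch >= 1). */
    static int tiledTokens(List<String> a, List<String> b, int minMatch) {
        boolean[] markedA = new boolean[a.size()];
        boolean[] markedB = new boolean[b.size()];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>(); // each entry: {startA, startB, length}
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int len = 0;
                    while (i + len < a.size() && j + len < b.size()
                            && !markedA[i + len] && !markedB[j + len]
                            && a.get(i + len).equals(b.get(j + len))) {
                        len++;
                    }
                    if (len > maxMatch) {       // longer run found: discard shorter ones
                        matches.clear();
                        maxMatch = len;
                    }
                    if (len == maxMatch) matches.add(new int[] {i, j, len});
                }
            }
            for (int[] m : matches) {
                boolean occluded = false;       // overlaps a tile marked in this round?
                for (int n = 0; n < m[2] && !occluded; n++) {
                    occluded = markedA[m[0] + n] || markedB[m[1] + n];
                }
                if (occluded) continue;
                for (int n = 0; n < m[2]; n++) {
                    markedA[m[0] + n] = true;
                    markedB[m[1] + n] = true;
                }
                covered += m[2];
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    /** Similarity in [0, 1]: 2 * covered / (|a| + |b|). */
    static double similarity(List<String> a, List<String> b, int minMatch) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * tiledTokens(a, b, minMatch) / total;
    }

    public static void main(String[] args) {
        List<String> a = List.of("a", "b", "c", "d", "e");
        List<String> b = List.of("x", "c", "d", "e", "y", "a", "b");
        System.out.println("tiled tokens: " + tiledTokens(a, b, 2));
        System.out.println("similarity:   " + similarity(a, b, 2));
    }
}
```

Because tiles are found in any order of position, the comparison is insensitive to blocks of code being moved around within a file.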
JPlag - Results Results in HTML Format Histogram of similarity values found for all pairs of programs Similar pairs and their similarity values displayed Select file pairs to view
JPlag - Matches Similar lines matched with the same colour Code fragment similarity values based on similar tokens found
Sherlock Developed at the University of Warwick Department of Computer Science Sherlock was fully integrated with the BOSS online submission software in 2002 and open-sourced Sherlock detects plagiarism in source-code and natural language assignments BOSS home page: www.boss.org.uk
Sherlock - Preprocessing Whitespace Comments Normalisation Tokenisation
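The first three preprocessing steps might look like the sketch below. It is not Sherlock's code: the class name `Preprocess`, the regex-based comment removal, and the choice of lower-casing as the normalisation step are assumptions for illustration. The regular expressions also ignore comment markers that appear inside string literals or inside other comments, which a real preprocessor has to handle.

```java
import java.util.Locale;

/** Illustrative preprocessing: strip comments, collapse whitespace, lower-case. */
final class Preprocess {
    /** Replaces block comments and line comments with a single space. */
    static String stripComments(String src) {
        return src.replaceAll("(?s)/\\*.*?\\*/", " ")  // block comments, non-greedy
                  .replaceAll("//[^\\n]*", " ");        // line comments up to end of line
    }

    /** Removes comments, collapses runs of whitespace, and lower-cases the text. */
    static String normalise(String src) {
        return stripComments(src)
                .replaceAll("\\s+", " ")
                .trim()
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String src = "int  X = 1; // set x\n/* block\n comment */ X++;";
        System.out.println(normalise(src));
    }
}
```

Running several versions of each file through different combinations of these steps lets layout changes and comment edits be ignored before any comparison takes place.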
Sherlock – Results Results displayed Similarity values of suspicious files Similarity values depend on the length of similar lines found as a percentage of the whole file size Select suspicious matches to examine Mark suspicious files
Sherlock – Matches Suspected sections marked with **begin suspicious section** and **end suspicious section**
Sherlock – Document Set User can view graph Each node represents one submission An edge means two submissions are similar Options to select threshold Click on lines to view or to mark suspicious matches
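One way to picture the graph view, assuming an edge is drawn whenever a pair's similarity score reaches the selected threshold: the names `SimilarityGraph`, `Edge` and `edgesAbove`, and the percentage scale, are hypothetical and used only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Filters scored submission pairs down to the edges shown at a given threshold. */
final class SimilarityGraph {
    /** A scored pair of submissions (hypothetical structure). */
    static final class Edge {
        final String first;
        final String second;
        final int percent;

        Edge(String first, String second, int percent) {
            this.first = first;
            this.second = second;
            this.percent = percent;
        }
    }

    /** Keeps only the pairs whose similarity is at least thresholdPercent. */
    static List<Edge> edgesAbove(List<Edge> scoredPairs, int thresholdPercent) {
        List<Edge> kept = new ArrayList<>();
        for (Edge e : scoredPairs) {
            if (e.percent >= thresholdPercent) kept.add(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Edge> all = List.of(
                new Edge("A", "B", 80), new Edge("A", "C", 10), new Edge("B", "C", 40));
        for (Edge e : edgesAbove(all, 40)) {
            System.out.println(e.first + " -- " + e.second + " (" + e.percent + "%)");
        }
    }
}
```

Raising the threshold thins the graph out to the strongest matches; lowering it reveals clusters of submissions that share smaller amounts of code.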
CodeMatch Commercial product Free academic use for small data sets Exact algorithm not published patent pending?
Example of Identical “Instruction Sequences”
/* File 1*/
for (int i=1; i<10; i++) {
    if (a==10) print("done");
    else a++;
}
/* File 2*/
for (int x=100; x > 0; x--) {
    if (z99 > -10) print("ans is " + z99);
    else { abc += 65; }
}
CodeMatch – Algorithm (1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions (2) Extract comments, and compare (3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar” (4) Combine (1), (2) and (3) to give correlation score
Heuristics Comments Spelling mistakes Unusual English (Thai, German, …) Use of Search Engines Unusual style Code errors
Tool Efficiency MOSS, JPlag and Sherlock are effective Results returned are similar Results returned are not identical User interface issues may be important
Part 3 – New Approaches Eschew the “syntax driven” approach Lateral thinking? Case study: Latent Semantic Analysis
Digression: Similarity What do we actually mean by “similar”? This is where the problems start ...
(1) Staff Survey We carried out a survey in order to: gather the perceptions of academics on what constitutes source-code plagiarism, and create a structured description of what constitutes source-code plagiarism from a UK academic perspective Cosma and Joy (2008)
Data Source On-line questionnaire distributed to 120 academics Questions were in the form of small scenarios Mostly multiple-choice responses Comments box below each question Anonymous – option for providing details Received 59 responses, from more than 34 different institutions Responses were analysed and collated to create a universally acceptable source-code plagiarism description.