GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis
Chao Liu, Chen Chen, Jiawei Han, Philip S. Yu
University of Illinois at Urbana-Champaign
IBM T.J. Watson Research Center
Presented by Chao Liu
Motivations • Blossoming of open-source projects • SourceForge.net: 125,090 projects as of July 2006 • Convenience for software plagiarism? • You can always find something online • Core-part plagiarism • Stripping away GUIs and other irrelevant parts • (Illegally) reusing the implementations of core algorithms • Our goal • Efficient detection of core-part plagiarism
Challenges • Effectiveness • Professional plagiarists • Automated plagiarism • Efficiency • Only a small part of the code is plagiarized; how can it be detected efficiently?
Outline • Plagiarism Disguises • Review of Plagiarism Detection • GPLAG: PDG-based Plagiarism Detection • Efficiency and Scalability • Experiments • Conclusions
Original Program
A procedure in a program called join
    01 static void
    02 make_blank (struct line *blank, int count)
    03 {
    04   int i;
    05   unsigned char *buffer;
    06   struct field *fields;
    07   blank->nfields = count;
    08   blank->buf.size = blank->buf.length = count + 1;
    09   blank->buf.buffer = (char*) xmalloc (blank->buf.size);
    10   buffer = (unsigned char *) blank->buf.buffer;
    11   blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
    12   for (i = 0; i < count; i++){
    13     ...
    14   }
    15 }
Disguise 1: Format Alteration
Insert comments and blanks
    01 static void
    02 make_blank (struct line *blank, int count)
    03 {
    04   int i;
    05   unsigned char *buffer;
    06   struct field *fields;
    07   blank->nfields = count; // initialization
    08   blank->buf.size = blank->buf.length = count + 1;
    09   blank->buf.buffer = (char*) xmalloc (blank->buf.size);
    10   buffer = (unsigned char *) blank->buf.buffer;
    11   blank->fields = fields = (struct field *) xmalloc (sizeof (struct field) * count);
    12   for (i = 0; i < count; i++){
    13     ...
    14   }
    15 }
Disguise 2: Identifier Renaming
Rename variables consistently
    01 static void
    02 fill_content (struct line *fill, int num)
    03 {
    04   int i;
    05   unsigned char *buffer;
    06   struct field *fields;
    07   fill->nfields = num; // initialization
    08   fill->buf.size = fill->buf.length = num + 1;
    09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
    10   buffer = (unsigned char *) fill->buf.buffer;
    11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
    12   for (i = 0; i < num; i++){
    13     ...
    14   }
    15 }
Disguise 3: Statement Reordering
Reorder non-dependent statements
    01 static void
    02 fill_content (struct line *fill, int num)
    03 {
    04   int i;
    05   unsigned char *buffer;
    06   struct field *fields;
    11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
    08   fill->buf.size = fill->buf.length = num + 1;
    09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
    10   buffer = (unsigned char *) fill->buf.buffer;
    07   fill->nfields = num; // initialization
    12   for (i = 0; i < num; i++){
    13     ...
    14   }
    15 }
Disguise 4: Control Replacement
Use equivalent control structure
    01 static void
    02 fill_content (struct line *fill, int num)
    03 {
    04   int i;
    05   unsigned char *buffer;
    06   struct field *fields;
    11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
    08   fill->buf.size = fill->buf.length = num + 1;
    09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
    10   buffer = (unsigned char *) fill->buf.buffer;
    07   fill->nfields = num; // initialization
         i = 0;
         while (i < num){
           ...
    15     i++;
    16   }
    17 }
Disguise 5: Code Insertion
Insert immaterial code
    01 static void
    02 fill_content (struct line *fill, int num)
    03 {
    04   int i;
    05   unsigned char *buffer;
    06   struct field *fields;
    11   fill->fields = fields = (struct field *) xmalloc (sizeof (struct field) * num);
    08   fill->buf.size = fill->buf.length = num + 1;
    09   fill->buf.buffer = (char*) xmalloc (fill->buf.size);
    10   buffer = (unsigned char *) fill->buf.buffer;
    07   fill->nfields = num; // initialization
         i = 0;
         while (i < num){
           ...
           for (int j = 0; j < i; j++);
    15     i++;
    16   }
    17 }
Review of Plagiarism Detection • String-based [Baker 1995] • A program is represented as a string • Blanks and comments are ignored • AST-based [Baxter et al. 1998, Kontogiannis et al. 1995] • A program is represented as an Abstract Syntax Tree (AST) • Fragile to statement reordering, control replacement and code insertion • Token-based [Kamiya et al. 2002, Prechelt et al. 2002] • Variables of the same type are mapped to the same token • A program is represented as a token string • Fingerprints of token strings are used for robustness [Schleimer et al. 2003] • Partially robust to statement reordering, control replacement and code insertion • Representatives: Moss and JPlag
Graphic representation of source code
    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }
Control Dependency
    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }
Data Dependency
    int sum(int array[], int count)
    {
      int i, sum;
      sum = 0;
      for(i = 0; i < count; i++){
        sum = add(sum, array[i]);
      }
      return sum;
    }

    int add(int a, int b)
    {
      return a + b;
    }
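The dependency figures for these slides did not survive, so the following is a simplified, hand-derived sketch of the control and data dependence edges of sum(), written as C data. The vertex names, the edge list, and the helpers count_edges and has_edge are illustrative assumptions, not the graph CodeSurfer would produce; parameters, declarations, and the loop test's dependence on itself are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One vertex per executable statement of sum() (hypothetical naming). */
enum vertex { ENTRY, SUM_INIT, I_INIT, LOOP_TEST, CALL_ADD, I_INC,
              RETURN_SUM, NUM_VERTICES };
enum dep_kind { CONTROL, DATA };

struct edge { enum vertex from, to; enum dep_kind kind; };

static const struct edge sum_pdg[] = {
    /* Control: `to` executes only if `from` allows it. */
    { ENTRY, SUM_INIT, CONTROL },   { ENTRY, I_INIT, CONTROL },
    { ENTRY, LOOP_TEST, CONTROL },  { ENTRY, RETURN_SUM, CONTROL },
    { LOOP_TEST, CALL_ADD, CONTROL }, { LOOP_TEST, I_INC, CONTROL },
    /* Data: `to` reads a value that `from` may have written. */
    { SUM_INIT, CALL_ADD, DATA },   { SUM_INIT, RETURN_SUM, DATA },
    { CALL_ADD, CALL_ADD, DATA },   { CALL_ADD, RETURN_SUM, DATA },
    { I_INIT, LOOP_TEST, DATA },    { I_INIT, CALL_ADD, DATA },
    { I_INIT, I_INC, DATA },        { I_INC, LOOP_TEST, DATA },
    { I_INC, CALL_ADD, DATA },      { I_INC, I_INC, DATA },
};
static const size_t sum_pdg_len = sizeof sum_pdg / sizeof sum_pdg[0];

/* Number of edges of the given kind. */
size_t count_edges(enum dep_kind kind)
{
    size_t count = 0;
    for (size_t e = 0; e < sum_pdg_len; e++)
        if (sum_pdg[e].kind == kind)
            count++;
    return count;
}

/* True if the edge from -> to of the given kind is in the list. */
bool has_edge(enum vertex from, enum vertex to, enum dep_kind kind)
{
    assert(from < NUM_VERTICES && to < NUM_VERTICES);
    for (size_t e = 0; e < sum_pdg_len; e++)
        if (sum_pdg[e].from == from && sum_pdg[e].to == to &&
            sum_pdg[e].kind == kind)
            return true;
    return false;
}
```

None of these edges depends on identifier names, comments, or the textual order of independent statements, which is why the graph is a sturdier basis for comparison than the source text.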
Corresponding PDGs
[Figure: PDG for the original code; PDG for the plagiarized code]
PDG-based Plagiarism Detection • A program is represented as a set of PDGs • Let g be the PDG of procedure P in the original program • Let g’ be the PDG of procedure P’ in the plagiarism suspect • Subgraph isomorphism implies plagiarism • If g is subgraph isomorphic to g’, P’ is likely plagiarized from P • γ-isomorphism: graph g is γ-isomorphic to g’ if there exists a subgraph s of g such that s is subgraph isomorphic to g’ and |s| ≥ γ|g| • If g is γ-isomorphic to g’, the pair (g, g’) is regarded as a plagiarized PDG pair and is returned to a human reviewer for examination.
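The γ-isomorphism test can be sketched as a brute-force search on tiny graphs. This is not how GPLAG implements it (the slides name VFLib for that); it only illustrates the definition. The sketch assumes |s| counts vertices, s is the subgraph induced by a chosen vertex subset of g, matched vertices must have the same kind, and every edge of s must map to an edge of g’. The names struct graph, best_match, gamma_isomorphic, and demo_inserted_code are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_V 8 /* toy limit: the search is exponential */

struct graph {
    int n;                   /* number of vertices */
    int kind[MAX_V];         /* vertex kind, e.g. assignment or call-site */
    bool adj[MAX_V][MAX_V];  /* adj[u][v]: dependence edge u -> v */
};

/* Largest number of g's vertices that can be mapped injectively into h so
   that kinds match and every edge between mapped vertices of g exists in h. */
static int best_match(const struct graph *g, const struct graph *h,
                      int u, int map[], bool used[], int matched)
{
    if (u == g->n)
        return matched;
    /* Option 1: leave vertex u out of the subgraph s. */
    int best = best_match(g, h, u + 1, map, used, matched);
    /* Option 2: map u to a compatible, unused vertex w of h. */
    for (int w = 0; w < h->n; w++) {
        if (used[w] || g->kind[u] != h->kind[w])
            continue;
        bool ok = !(g->adj[u][u] && !h->adj[w][w]);
        for (int p = 0; p < u && ok; p++) {
            if (map[p] < 0)
                continue; /* p was left out of s */
            if ((g->adj[p][u] && !h->adj[map[p]][w]) ||
                (g->adj[u][p] && !h->adj[w][map[p]]))
                ok = false;
        }
        if (!ok)
            continue;
        map[u] = w;
        used[w] = true;
        int r = best_match(g, h, u + 1, map, used, matched + 1);
        map[u] = -1;
        used[w] = false;
        if (r > best)
            best = r;
    }
    return best;
}

/* True if some subgraph s of g with |s| >= gamma * |g| embeds into h. */
bool gamma_isomorphic(const struct graph *g, const struct graph *h, double gamma)
{
    assert(g->n <= MAX_V && h->n <= MAX_V && gamma > 0.0 && gamma <= 1.0);
    int map[MAX_V];
    bool used[MAX_V] = { false };
    for (int i = 0; i < MAX_V; i++)
        map[i] = -1;
    return best_match(g, h, 0, map, used, 0) >= gamma * g->n;
}

/* Usage: g is a three-vertex chain; h is the same chain plus one inserted
   vertex, mimicking the code-insertion disguise. */
bool demo_inserted_code(double gamma)
{
    struct graph g = { .n = 3, .kind = { 0, 1, 2 },
                       .adj = { [0][1] = true, [1][2] = true } };
    struct graph h = { .n = 4, .kind = { 0, 1, 2, 1 },
                       .adj = { [0][1] = true, [1][2] = true, [0][3] = true } };
    return gamma_isomorphic(&g, &h, gamma);
}
```

Inserted code only adds vertices and edges to g’, so the original g still embeds in full. Choosing γ below 1 additionally tolerates a suspect that dropped or altered a small part of g.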
Advantages • Robust because it is hard to overhaul PDGs • Dependencies encode program logic • Program logic is the incentive for plagiarism, so a plagiarist has to preserve it
Efficiency and Scalability • Search space • If the original program has n procedures and the plagiarism suspect has m procedures • n × m subgraph isomorphism tests are needed • Pruning the search space • Lossless filter • Statistical lossy filter
Lossless Filter • Interestingness • PDGs smaller than an interesting size K are excluded from both sides • γ-isomorphism definition • A PDG pair (g, g’) is discarded if |g’| < γ|g|
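The two lossless rules amount to a pair of size comparisons, sketched below. The function name lossless_prune and its parameters are hypothetical, and PDG size is assumed to be the vertex count.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when the pair (g, g') is dropped by the lossless filter.
   size_g and size_g2 are the vertex counts of g and g' (an assumption of
   this sketch); min_size_k is the interesting size K. */
bool lossless_prune(int size_g, int size_g2, int min_size_k, double gamma)
{
    assert(size_g >= 0 && size_g2 >= 0 && gamma > 0.0 && gamma <= 1.0);
    if (size_g < min_size_k || size_g2 < min_size_k)
        return true;                  /* too small to be interesting */
    return size_g2 < gamma * size_g;  /* g' cannot contain a subgraph of g
                                         with at least gamma * |g| vertices */
}
```

The second rule loses nothing because a subgraph s that embeds into g’ can have at most |g’| vertices, so |g’| < γ|g| already rules out |s| ≥ γ|g|.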
Lossy Filter • Observation • If procedure P’ is plagiarized from procedure P, its PDG g’ should look similar to g • So discard dissimilar PDG pairs • Requirement • The filter must be lightweight
Vertex Histogram • Represent PDG g by h(g) = (n1, n2, …, nk), where ni is the frequency of the ith kind of vertex • Similarly, represent PDG g’ by h(g’) = (m1, m2, …, mk) • Direct similarity measurement? • How should a proper similarity threshold be defined? • Is a threshold defined this way program-independent?
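Building h(g) is a single counting pass over the vertices, as in this minimal sketch. NUM_KINDS and the integer encoding of vertex kinds are assumptions made for illustration; the slides do not list the actual kinds.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_KINDS 4 /* hypothetical number k of vertex kinds */

/* Fills hist[i] with the number of vertices of kind i.
   kinds[v] is the kind of vertex v, encoded as 0 .. NUM_KINDS-1. */
void vertex_histogram(const int kinds[], size_t n, int hist[NUM_KINDS])
{
    for (int i = 0; i < NUM_KINDS; i++)
        hist[i] = 0;
    for (size_t v = 0; v < n; v++) {
        assert(kinds[v] >= 0 && kinds[v] < NUM_KINDS);
        hist[kinds[v]]++;
    }
}
```

The pass is linear in the number of vertices, which fits the requirement that the lossy filter be lightweight compared with an isomorphism test.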
Hypothesis Testing-based Approach • Basic idea • Estimate a k-dimensional multinomial distribution from h(g) • Test whether h(g’) is likely an observation from that distribution • If it is, g’ looks similar to g, and an isomorphism test is needed • Otherwise, (g, g’) is discarded
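The slides do not say which test statistic GPLAG uses, so the sketch below swaps in Pearson's chi-square goodness-of-fit statistic, a standard test of whether counts are consistent with a given multinomial. Add-one smoothing of the estimated probabilities is a further assumption, made to avoid division by zero, and the names chi_square_statistic and lossy_prune are hypothetical. The caller supplies the critical value, for example 7.815 for 3 degrees of freedom at the 5% level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pearson statistic sum_i (m_i - M*p_i)^2 / (M*p_i), where p_i is estimated
   from h(g) = n[] with add-one smoothing and M is the total of h(g') = m[]. */
double chi_square_statistic(const int n[], const int m[], size_t k)
{
    assert(k > 0);
    double total_n = 0.0, total_m = 0.0;
    for (size_t i = 0; i < k; i++) {
        total_n += n[i];
        total_m += m[i];
    }
    double stat = 0.0;
    if (total_m == 0.0)
        return stat; /* empty histogram: nothing to compare */
    for (size_t i = 0; i < k; i++) {
        double p = (n[i] + 1.0) / (total_n + (double)k); /* smoothed estimate */
        double expected = total_m * p;
        double diff = m[i] - expected;
        stat += diff * diff / expected;
    }
    return stat;
}

/* True when h(g') is unlikely under the distribution estimated from h(g),
   i.e. the pair (g, g') is discarded without an isomorphism test. */
bool lossy_prune(const int n[], const int m[], size_t k, double critical_value)
{
    return chi_square_statistic(n, m, k) > critical_value;
}
```

Framing the filter as a hypothesis test replaces an ad hoc similarity threshold with a significance level, which has the same meaning regardless of the program being analyzed.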
Workflow of GPLAG • PDGs are generated with CodeSurfer • Isomorphism testing is implemented with VFLib
Experiment Design • Subject programs • Effectiveness • Filter efficiency • Core-part plagiarism detection
Effectiveness • The manual plagiarism took 2 hours, but could it be automated? • GPLAG detects all plagiarized PDG pairs within 1 second • PDG isomorphism also reveals which plagiarism disguises were applied
Efficiency • Subject programs • bc, less and tar • An exact copy serves as the plagiarized version • Lossless and lossy filters • Pruning PDG pairs • Implications for overall time cost
Pruning Uninteresting PDG-pairs • Lossless only • Lossless and lossy
Implications for Overall Time Cost • Subgraph isomorphism tests that run into the time-out are time hogs • The lossless filter does not save much time • The lossy filter significantly reduces the time cost • The major time saving comes from avoiding time hogs
Detection of Core-part Plagiarism • Lower time cost with the lossy filter • Fewer false positives with the lossy filter
Conclusions • We developed a new algorithm, GPLAG, for software plagiarism detection • It is more effective against “professional” plagiarists • We developed a statistical lossy filter, which improves the efficiency of GPLAG • We experimentally verified the effectiveness and efficiency of GPLAG
Q & A Thank You!
References
[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.
[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.
[3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.
[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.
[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD, 2003.
[7] V. B. Livshits and T. Zimmermann. Dynamine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.
[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.
[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for “backtrace” of noncrashing bugs. In SDM, 2005.