Clone Detection by Exploiting Assembler

Ian Davis, Mike Godfrey
University of Waterloo
Ontario, Canada

The Original Assembler

.LC107:
    .string "merge "
    …
    pushl $.LC107
    pushl command_buf+8
.LCFI378:
    call prefixcmp
    addl $16,%esp
    testl %eax,%eax
    jne .L485
    subl $8,%esp
    pushl $32
    pushl command_buf+8
    call strchr
    addl $16,%esp
    incl %eax
    movl %eax,-16(%ebp)
    subl $12,%esp
    pushl $24
    call xmalloc
    addl $16,%esp
    movl %eax,-8(%ebp)
    subl $12,%esp
    pushl -16(%ebp)
    call lookup_branch
    …
.L485

• Identify function boundaries
• Relate assembler back to source
• Remove comments, white space, etc.
• Normalize instruction set if needed
• Convert to relative addressing
• Inline string constants
• Reconstruct parameter names
• Reconstruct local variable names

The Annotated Assembler

    pushl $"merge "
    pushl command_buf+8
    call prefixcmp
    addl $16,%esp
    testl %eax,%eax
    jne +124
    subl $8,%esp
    pushl $32
    pushl command_buf+8
    call strchr
    addl $16,%esp
    incl %eax
    movl %eax,from(%ebp)
    subl $12,%esp
    pushl $24
    call xmalloc
    addl $16,%esp
    movl %eax,n(%ebp)
    subl $12,%esp
    pushl from(%ebp)
    call lookup_branch

• Identify function boundaries
• Relate assembler to source
• Remove comments, white space, etc.
• Normalize instruction set if needed
• Convert to relative addressing
• Inline string constants
• Reconstruct parameter names
• Reconstruct local variable names
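
Two of the steps above, inlining string constants and converting label jumps to relative form, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tool: the function name `normalize`, the closing `.L485:` / `ret` lines of the sample, and the choice to count jump distances in instructions are all hypothetical (the slides do not say what unit `+124` is measured in).

```python
import re

LABEL = re.compile(r'^(\.\w+):$')            # e.g. ".LC107:"
STRING = re.compile(r'^\.string\s+(".*")$')  # e.g. '.string "merge "'

def normalize(lines):
    """Inline string constants and make jumps relative (illustrative sketch)."""
    lines = [ln.strip() for ln in lines if ln.strip()]

    # Pass 1: map each label that introduces a .string to its literal.
    strings = {}
    for prev, cur in zip(lines, lines[1:]):
        label, literal = LABEL.match(prev), STRING.match(cur)
        if label and literal:
            strings[label.group(1)] = literal.group(1)

    # Pass 2: keep instructions only, remembering which instruction
    # each label points at.
    code, targets = [], {}
    for ln in lines:
        label = LABEL.match(ln)
        if label:
            targets[label.group(1)] = len(code)
        elif not STRING.match(ln):
            code.append(ln)

    # Pass 3: rewrite operands that refer to labels.
    out = []
    for i, ln in enumerate(code):
        parts = ln.split(None, 1)
        mnemonic = parts[0]
        operands = parts[1] if len(parts) > 1 else ''
        if operands.startswith('$') and operands[1:] in strings:
            operands = '$' + strings[operands[1:]]      # inline the string
        elif mnemonic.startswith('j') and operands in targets:
            # Assumption: distance is counted in instructions.
            operands = '%+d' % (targets[operands] - i)
        out.append((mnemonic + ' ' + operands).strip())
    return out

# Shortened sample; the last two lines are hypothetical filler.
sample = [
    '.LC107:', '.string "merge "',
    'pushl $.LC107', 'pushl command_buf+8',
    '.LCFI378:', 'call prefixcmp', 'addl $16,%esp',
    'testl %eax,%eax', 'jne .L485', 'subl $8,%esp',
    '.L485:', 'ret',
]
result = normalize(sample)
```

Once labels are gone, two copies of the same code placed at different points in a file produce identical instruction text, which is what lets later matching compare instructions as plain strings.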

The Matching Algorithm

• Scan entire source once
• Use hashing to find first pairing
• Ignore pairings in identified clones
• Don’t cross function boundaries
• Terminate clone before later in function
• Weight matches (+) and mismatches (-)
• Special logic for matching branches
• Advance greedily while weight ≥ 0
• Then employ hill climbing
• Continue while improvement possible
• Accept if clones satisfy minimum length
• Alternative minimum for matching functions
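
The outline above can be approximated by the following Python sketch. It is a simplified reading, not the authors' implementation: the weights (+1 / -1), the minimum length of 3, and the names `extend` and `find_clones` are assumptions. The slides do not describe the hill-climbing step or the branch-matching logic, so the sketch substitutes a plain stand-in (keep the prefix with the highest running weight) and compares branches as ordinary text.

```python
from collections import defaultdict

MATCH, MISMATCH = 1, -1   # assumed weights
MIN_LENGTH = 3            # assumed minimum clone length

def extend(a, i, b, j, limit):
    """Advance greedily from a[i] and b[j] while the running weight is >= 0.

    Returns the length of the best-scoring prefix (a stand-in for the
    hill-climbing refinement, which the slides do not detail).
    """
    weight = best = best_len = k = 0
    while k < limit and i + k < len(a) and j + k < len(b):
        weight += MATCH if a[i + k] == b[j + k] else MISMATCH
        if weight < 0:
            break
        k += 1
        if weight > best:
            best, best_len = weight, k
    return best_len

def find_clones(functions):
    """functions maps a function name to its list of normalized instructions."""
    # Hash every instruction to the places where it occurs.
    index = defaultdict(list)
    for name, code in functions.items():
        for pos, ins in enumerate(code):
            index[ins].append((name, pos))

    clones, covered = [], set()
    for name, code in functions.items():
        for pos, ins in enumerate(code):
            for other, opos in index[ins]:
                if (other, opos) <= (name, pos):
                    continue  # visit each pairing once
                if (name, pos) in covered or (other, opos) in covered:
                    continue  # ignore pairings inside identified clones
                # Within one function, stop the first clone before the
                # second one starts (one reading of the slide's bullet).
                limit = opos - pos if other == name else len(code)
                length = extend(code, pos, functions[other], opos, limit)
                if length >= MIN_LENGTH:
                    clones.append(((name, pos), (other, opos), length))
                    for k in range(length):
                        covered.add((name, pos + k))
                        covered.add((other, opos + k))
    return clones

# Hypothetical two-function input built from instructions seen in the slides.
functions = {
    'f': ['pushl $32', 'call strchr', 'addl $16,%esp', 'incl %eax', 'ret'],
    'g': ['subl $8,%esp', 'pushl $32', 'call strchr', 'addl $16,%esp',
          'incl %eax', 'leave'],
}
clones = find_clones(functions)
```

Because each function is stored as its own list, an extension can never run past a function boundary; the trailing `ret` / `leave` mismatch is tolerated by the greedy scan but excluded from the reported length.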

Source Clone 1

from = strchr(command_buf.buf, ' ') + 1;
n = xmalloc(sizeof(*n));
s = lookup_branch(from);
if (s)
    hashcpy(n->sha1, s->sha1);
else if (*from == ':') {
    uintmax_t idnum = strtoumax(from + 1, NULL, 10);
    struct object_entry *oe = find_mark(idnum);
    if (oe->type != OBJ_COMMIT)
        die("Mark :%" PRIuMAX " not a commit", idnum);
    hashcpy(n->sha1, oe->sha1);
} else if (!get_sha1(from, n->sha1)) {
    unsigned long size;
    char *buf = read_object_with_reference(n->sha1,
        commit_type, &size, n->sha1);
    if (!buf || size < 46)
        die("Not a valid commit: %s", from);
    free(buf);
} else
    die("Invalid ref name or SHA1 expression: %s", from);

Source Clone 2

from = strchr(command_buf.buf, ' ') + 1;
s = lookup_branch(from);
if (s)
    hashcpy(sha1, s->sha1);
else if (*from == ':') {
    struct object_entry *oe;
    from_mark = strtoumax(from + 1, NULL, 10);
    oe = find_mark(from_mark);
    if (oe->type != OBJ_COMMIT)
        die("Mark :%" PRIuMAX " not a commit", from_mark);
    hashcpy(sha1, oe->sha1);
} else if (!get_sha1(from, sha1)) {
    unsigned long size;
    char *buf;
    buf = read_object_with_reference(sha1,
        commit_type, &size, sha1);
    if (!buf || size < 46)
        die("Not a valid commit: %s", from);
    free(buf);
} else
    die("Invalid ref name or SHA1 expression: %s", from);

Benefits and Conclusions

• Assembler is easy to derive from source / object / executable
• Complements other clone detection approaches
• Compiler performs useful normalization of source for free
• The analysis is semantic, not syntactic
• By function (forbidding overlapped clone pairs)
• Can handle branching sensibly
• Case statements easier to handle
• Can weight different assembler instructions differently
• Can reason about assembler when performing detection
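
As a small illustration of weighting instructions differently, the Python sketch below scores two aligned instruction sequences with a per-mnemonic weight table. The table values, the default weight, and the names `pair_weight` and `score` are hypothetical; the slides state only that such weighting is possible, not which weights are used.

```python
# Hypothetical weights: a matching call says more than a matching stack fix-up.
WEIGHTS = {'call': 3, 'pushl': 1, 'addl': 1}
DEFAULT = 1

def mnemonic(ins):
    """Return the opcode part of an instruction such as 'call strchr'."""
    return ins.split(None, 1)[0]

def pair_weight(a, b):
    """Positive for identical instructions, negative otherwise."""
    wa = WEIGHTS.get(mnemonic(a), DEFAULT)
    wb = WEIGHTS.get(mnemonic(b), DEFAULT)
    return wa if a == b else -max(wa, wb)

def score(xs, ys):
    """Sum the pairwise weights of two aligned instruction sequences."""
    return sum(pair_weight(a, b) for a, b in zip(xs, ys))

left = ['pushl $32', 'call strchr', 'addl $16,%esp']
right = ['pushl $32', 'call strchr', 'addl $12,%esp']
total = score(left, right)   # 1 + 3 - 1
```

With such a table, a differing stack adjustment costs little while a differing call target costs more, so the running weight reflects how significant each instruction is.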

Thank You