CS590 Z Matching Program Versions Xiangyu Zhang
Problem Statement • Suppose a program P’ is created by modifying P. Determine the difference between P and P’. For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the artifact in P that corresponds to c’. • Static mapping • Non-trivial • Name comparison? • What if • Clone analysis, comparison checking
Motivations • Validate compiler transformations • Facilitate regression testing • Reverse obfuscation • Information propagation • Debugging • Code plagiarism detection • Information Assurance
Approaches • Static Approaches • Entity name based • String based (MOSS) • AST based (DECKARD) • CFG based (JDIFF) • PDG based (PDIFF) • Binary based (BMAT) • Log based (editor plugin, comparison checking) • Dynamic Approaches (not today)
Static Approaches • Entity name matching • Model a function/field as tuples • Coarse-grained matching • String matching • Diff (CVS, Subversion) • Longest common subsequence (LCS) • Available operations are addition and deletion • Matched pairs cannot cross one another • Programs are far more complicated than strings • Copy, paste, move • CP-Miner (scales to clone detection in the Linux kernel) • Frequent subsequence mining
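The LCS-based line matching behind diff-style tools can be sketched as follows. This is a minimal illustration, not the implementation used by CVS or Subversion; the function name lcs_pairs and the sample lines are assumptions made for the example.

```python
def lcs_pairs(a, b):
    """Return (i, j) index pairs of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the front to recover one optimal matching.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1  # treat a[i] as deleted
        else:
            j += 1  # treat b[j] as added
    return pairs


# Hypothetical old and new versions of a code fragment, one statement per line.
old = ["int x;", "x = a * b;", "print(x);"]
new = ["int x;", "log(a);", "x = a * b;", "print(x);"]
pairs = lcs_pairs(old, new)

# Two swapped lines: matched pairs cannot cross, so only one line is matched.
moved = lcs_pairs(["a", "b"], ["b", "a"])
```

Unmatched lines in the old version are reported as deletions and unmatched lines in the new version as additions. The swapped-lines case shows why a moved block appears as one deletion plus one addition rather than as a move.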
MOSS • Code plagiarism detection • It also handles other digital content • Challenges • White space (variable name) • Noise (“the”, “int i”) • Order scrambling (paragraph reorders) • Problem statement • Given a set of documents, identify substring matches that satisfy two properties: • If there is a substring match at least as long as the guarantee threshold t, then this match is detected; • No match shorter than the noise threshold k is detected.
MOSS • k-gram • A contiguous substring of length k
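As a small illustration of the definition, the k-grams of a string can be enumerated as below. The helper name kgrams is an assumption for the example, and it assumes k is at least 1.

```python
def kgrams(text, k):
    """Return every contiguous substring of length k, in order of position."""
    # A text of length n has n - k + 1 k-grams (none if n < k).
    return [text[i:i + k] for i in range(len(text) - k + 1)]


grams = kgrams("abcdef", 3)
```

A document of length n therefore yields almost n overlapping k-grams, which is why hashing and fingerprint selection matter in the following steps.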
MOSS • Incremental hashing • Hashing strings of length k is expensive for large k. • “rolling” hash function • The (i+1)th k-gram hash = F (the ith k-gram hash, …)
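One common way to obtain such a rolling function F is a Karp-Rabin style polynomial hash; the slide does not say which function MOSS uses, so the sketch below is an assumption, as are the constants BASE and MOD and the helper names. It assumes k is at least 1.

```python
BASE = 257              # assumed radix for the polynomial hash
MOD = (1 << 61) - 1     # assumed large prime modulus


def direct_hash(s):
    """Hash a string from scratch: O(len(s)) work."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, k):
    """Hash every k-gram of text, updating each hash from the previous one."""
    if len(text) < k:
        return []
    h = direct_hash(text[:k])
    hashes = [h]
    top = pow(BASE, k - 1, MOD)  # weight of the character that leaves the window
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * top) % MOD   # drop the oldest character
        h = (h * BASE + ord(text[i])) % MOD      # shift and append the new one
        hashes.append(h)
    return hashes


text = "do run run run"
rolled = rolling_hashes(text, 5)
naive = [direct_hash(text[i:i + 5]) for i in range(len(text) - 5 + 1)]
```

Each update costs a constant number of arithmetic operations regardless of k, while producing the same values as hashing every k-gram from scratch.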
MOSS • Fingerprint selection • A subset of hash values • Our goals: find all matching substrings at least as long as t; ignore matches shorter than k • Every t-th hash value • Hash values that are 0 mod p
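The 0 mod p selection rule can be sketched as follows. The hash values are illustrative numbers chosen for the example, not output from a real document, and the function name is an assumption.

```python
def select_mod_p(hashes, p):
    """Keep (position, hash) for every hash value divisible by p."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


# Illustrative k-gram hash values.
hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
selected = select_mod_p(hashes, 4)

# Largest distance between two consecutive selected positions.
largest_gap = max(b[0] - a[0] for a, b in zip(selected, selected[1:]))
```

On average one hash in p is kept, but nothing bounds the distance between consecutive fingerprints; here positions 1 and 8 are selected with nothing in between, so a match that falls entirely inside such a gap goes undetected.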
MOSS • Winnowing • Observation: given a sequence of hashes h1,…,hn, if n > t-k, then at least one of the hi must be chosen • Use a sliding window of size w = t-k+1 • In each window select the minimum hash value; break ties by selecting the rightmost occurrence.
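The window rule above can be sketched directly. This is a simple quadratic-time illustration rather than an optimized implementation; the function name and the hash values are assumptions for the example.

```python
def winnow(hashes, w):
    """Select fingerprints as (position, hash) using windows of size w."""
    fingerprints = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Break ties by taking the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        # Consecutive windows often pick the same position; record it once.
        if not fingerprints or fingerprints[-1][0] != pos:
            fingerprints.append((pos, smallest))
    return fingerprints


# Illustrative k-gram hash values.
hashes = [77, 72, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 72, 42, 17, 98]
fps = winnow(hashes, 4)
gaps = [b[0] - a[0] for a, b in zip(fps, fps[1:])]
```

Because every window contributes a fingerprint, two consecutive selected positions are never more than w apart, which is what yields the guarantee for matches of length at least t.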
MOSS • Algorithm • Build an index mapping fingerprints to locations for all documents. • Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document. • Sort the resulting triples (d, d1, fx), (d, d2, fy) by their first two elements. • Matches between documents are rank-ordered by size (number of fingerprints)
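The steps above can be sketched as follows, assuming each document has already been reduced to a list of (position, fingerprint) pairs. The function name, the document names, and the fingerprint values are all hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def rank_matches(doc_fingerprints):
    """Rank ordered document pairs by the number of fingerprints they share."""
    # Step 1: index every fingerprint to the locations where it occurs.
    index = defaultdict(set)
    for doc, fps in doc_fingerprints.items():
        for pos, fp in fps:
            index[fp].add((doc, pos))
    # Step 2: look up each document's fingerprints, collecting (d, other, fp).
    triples = set()
    for doc, fps in doc_fingerprints.items():
        for pos, fp in fps:
            for other, other_pos in index[fp]:
                if other != doc:
                    triples.add((doc, other, fp))
    # Step 3: sorting places triples with the same document pair side by side.
    ordered = sorted(triples)
    sizes = [(pair, len(list(group)))
             for pair, group in groupby(ordered, key=lambda t: t[:2])]
    # Step 4: rank pairs by match size, largest first.
    return sorted(sizes, key=lambda item: (-item[1], item[0]))


docs = {
    "A": [(0, 17), (3, 8), (7, 39)],
    "B": [(1, 8), (4, 39), (9, 55)],
    "C": [(2, 17)],
}
ranking = rank_matches(docs)
```

Each pair appears in both directions because every document is looked up against the index in turn; a real tool would also keep the positions so that matching regions can be shown side by side.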
MOSS • Advantages • Guaranteed to detect any substring match at least as long as t • Limitations • Minor edits defeat MOSS. • x = a*b + c vs. z = c + a*b • Insertion, deletion
AST based matching • [YANG, 1991, Software: Practice and Experience] • Given two functions, build the ASTs • Match the roots • If the roots match, apply LCS to align subtrees • Continue recursively • Fragile
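A minimal sketch of this recursive scheme is shown below, assuming trees are written as (label, children) tuples and that the LCS step is weighted by how many nodes each pair of subtrees matches. The tree encoding, the labels, and the function name are assumptions for the example, not the cited paper's data structures.

```python
def match_count(t1, t2):
    """Count matched node pairs between two trees given as (label, children)."""
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return 0  # roots differ: nothing below them is matched
    n, m = len(kids1), len(kids2)
    # LCS-style alignment of the two child sequences, where aligning a pair
    # of children is worth the number of nodes their subtrees match.
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + match_count(kids1[i - 1], kids2[j - 1]),
            )
    return 1 + best[n][m]


def leaf(name):
    return (name, [])


mul = ("mul", [leaf("var:a"), leaf("var:b")])
# x = a*b + c
old = ("assign", [leaf("var:x"), ("add", [mul, leaf("var:c")])])
# z = c + a*b
new = ("assign", [leaf("var:z"), ("add", [leaf("var:c"), mul])])

same = match_count(old, old)
edited = match_count(old, new)
```

Only 5 of the 7 nodes are matched for the edited version: the renamed variable has a different label, and the reordered operand cannot be aligned because LCS matches may not cross. This is one way to see why the approach is called fragile.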
DECKARD • Advantages • Scalability • Insensitive to minor structural changes such as reordering, insertion, deletion • Limitations • Structural similarity only • Insertion that incurs structure change.
CFG matching • Hammock graph (JDIFF, ASE 2004) • Match classes by names • Match fields by types • Match methods by signatures • Match instructions in methods by hammock graphs • A hammock is a single-entry, single-exit subgraph of a CFG.
CFG matching • Pros • Orthogonal • Can be combined with other matching techniques • Simple • Cons • Coarse grained matching only • Not good at clone detection • In case of code transformation
Semantic Based Matching • Using PDG (SAS’01)
Semantic Based • Pros • Non-contiguous, intertwined, reordered • Insensitive to code transformations. • Cons • Scalability • Points-to analysis • Starting from a matching pair seems to be a problem
Wrap Up • For clone detection • Maybe structural / text similarity is a good idea • For whole program matching / method matching with code transformations • Semantic based is more appropriate • Scalability • PDG < CFG | AST < STRING < NAME