
CS590 Z Matching Program Versions



Presentation Transcript


  1. CS590 Z Matching Program Versions Xiangyu Zhang

  2. Problem Statement • Suppose a program P’ is created by modifying P. Determine the difference between P and P’. For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the artifact in P that corresponds to c’. • Static mapping • Non-trivial • Name comparison? • What if • Clone analysis, comparison checking

  3. Motivations • Validate compiler transformations • Facilitate regression testing • Reverse obfuscation • Information propagation • Debugging • Code plagiarism detection • Information Assurance

  4. Approaches • Static Approaches • Entity name based • String based (MOSS) • AST based (DECKARD) • CFG based (JDIFF) • PDG based (PDIFF) • Binary based (BMAT) • Log based (editor plugin, comparison checking) • Dynamic Approaches (not today)

  5. Static Approaches • Entity name matching • Model a function/field as a tuple • Coarse grained matching • String matching • Diff (CVS, Subversion) • Longest common subsequence (LCS) • Available operations are addition and deletion • Matched pairs cannot cross one another • Programs are far more complicated than strings • Copy, paste, move • CP-Miner (scales to Linux kernel clone detection) • Frequent subsequence mining
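The LCS bullet above can be illustrated with a minimal sketch. It is a textbook dynamic-programming LCS over lines, not the actual implementation of diff, CVS, or Subversion; the function name `lcs_pairs` is hypothetical. Because matched pairs must keep their relative order, a moved line shows up as unmatched, which is the "copy, paste, move" weakness noted on the slide.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) with a[i] == b[j] forming one longest common subsequence."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the front to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # a[i] is reported as deleted
        else:
            j += 1  # b[j] is reported as added
    return pairs


old_lines = ["a", "b", "c"]
new_lines = ["c", "a", "b"]  # line "c" was moved to the top
matched = lcs_pairs(old_lines, new_lines)
```

In this example only "a" and "b" are matched; the moved line "c" is treated as one deletion plus one addition, since a match for it would cross the other two.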

  6. MOSS • Code plagiarism detection • It also handles other digital content • Challenges • White space (variable name) • Noise (“the”, “int i”) • Order scrambling (paragraph reorders) • Problem statement • Given a set of documents, identify substring matches that satisfy two properties: • If there is a substring match at least as long as the guarantee threshold t, then this match is detected; • Do not detect any matches shorter than the noise threshold k.

  7. MOSS • k-gram • A contiguous substring of length k
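As a small illustration of the k-gram definition, the sketch below lists every contiguous substring of length k. The `normalize` step (dropping white space and lowercasing) is an assumption added here to reflect the white-space challenge from the previous slide; both function names are hypothetical.

```python
def normalize(text):
    """Drop white space and lowercase (an assumed preprocessing step)."""
    return "".join(ch.lower() for ch in text if not ch.isspace())


def kgrams(text, k):
    """Return all contiguous substrings of length k, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


grams = kgrams(normalize("A do run"), 5)
```

A text of length n yields n - k + 1 k-grams, so nearly every position starts one; this is why a later step has to pick a subset of them as fingerprints.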

  8. MOSS • Incremental hashing • Hashing strings of length k is expensive for large k. • “rolling” hash function • The (i+1)th k-gram hash = F (the ith k-gram hash, …)
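One common way to realize such a function F is a Karp-Rabin style polynomial hash, sketched below. This is an assumption for illustration: the slide does not say which hash MOSS uses, and the constants `BASE` and `MOD` as well as the function names are hypothetical choices.

```python
BASE = 257            # assumed polynomial base
MOD = (1 << 61) - 1   # assumed large prime modulus


def direct_hash(s):
    """Hash a string from scratch: O(len(s)) work."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_hashes(text, k):
    """Hash every k-gram of text, updating each hash from the previous one in O(1)."""
    if k <= 0 or len(text) < k:
        return []
    h = direct_hash(text[:k])
    hashes = [h]
    top = pow(BASE, k - 1, MOD)  # weight of the character that leaves the window
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * top) % MOD   # remove the outgoing character
        h = (h * BASE + ord(text[i])) % MOD      # shift and add the incoming character
        hashes.append(h)
    return hashes


hashes = rolling_hashes("abcabc", 3)
```

Each update only needs the previous hash, the outgoing character, and the incoming character, so the cost per k-gram no longer grows with k.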

  9. MOSS • Fingerprint selection • A subset of hash values • Our goals: find all matching substrings of length at least t; ignore matches shorter than k • One of every t hash values • Hash values equal to 0 mod p

  10. MOSS • Winnowing • Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen • Use a sliding window of size w = t-k+1 • In each window, select the minimum hash value; break ties by selecting the rightmost occurrence.
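The window rule above can be sketched as follows. The function name `winnow` and the sample hash sequence are illustrative; the sketch assumes the hashes have already been computed for consecutive k-grams and records each selected hash together with its position.

```python
def winnow(hashes, w):
    """Select (hash, position) fingerprints: the minimum of every window of size w,
    taking the rightmost occurrence on ties."""
    selected = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Offset of the rightmost occurrence of the minimum inside the window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        # Selected positions never move left, so comparing with the last one
        # is enough to avoid recording the same position twice.
        if not selected or selected[-1][1] != pos:
            selected.append((smallest, pos))
    return selected


sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
fingerprints = winnow(sample, 4)
```

With t and k as on the slide, w would be t - k + 1. Every window contributes its minimum, so any run of w consecutive hashes shared by two documents produces at least one shared fingerprint.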

  11. MOSS • Algorithm • Build an index mapping fingerprints to locations for all documents. • Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document. • Sort (d,d1,fx), (d, d2,fy) by the first two elements. • Matches between documents are rank-ordered by size (number of fingerprints)
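A minimal sketch of the index-and-lookup steps, assuming each document has already been reduced to a list of (hash, position) fingerprints. The function name `rank_matches` and the sample data are hypothetical, and document pairs are ranked simply by the number of distinct shared fingerprint hashes.

```python
from collections import Counter, defaultdict


def rank_matches(fingerprints):
    """fingerprints maps a document id to a list of (hash, position) pairs.
    Returns [((doc_a, doc_b), shared_count), ...] sorted by shared_count, largest first."""
    # Step 1: index every fingerprint hash to the locations where it occurs.
    index = defaultdict(list)
    for doc, fps in fingerprints.items():
        for h, pos in fps:
            index[h].append((doc, pos))
    # Step 2: look up each document's fingerprints and count hits per document pair.
    shared = Counter()
    for doc, fps in fingerprints.items():
        for h in {h for h, _ in fps}:
            others = {d for d, _ in index[h]}
            for other in others:
                if doc < other:  # count each unordered pair once
                    shared[(doc, other)] += 1
    # Step 3: rank pairs by the number of shared fingerprints.
    return shared.most_common()


docs = {
    "a": [(1, 0), (2, 5), (3, 9)],
    "b": [(2, 1), (3, 4), (7, 8)],
    "c": [(3, 0), (9, 2)],
}
ranking = rank_matches(docs)
```

The stored positions are not used for ranking here, but they are what would let a tool point back to the matching regions in each document.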

  12. MOSS • Advantages • Guaranteed to detect any substring match at least t long • Limitations • Minor edits defeat MOSS. • x= a*b + c vs. z= c + a*b • Insertion, deletion

  13. AST based matching • [YANG, 1991, Software Practice and Experience] • Given two functions, build the ASTs • Match the roots • If the roots match, apply LCS to align subtrees • Continue recursively • Fragile
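The recursive scheme above can be sketched on toy trees. This is a simplified illustration in the spirit of the cited approach, not a reproduction of it: trees are assumed to be `(label, children)` tuples, nodes match only when their labels are equal, and the function name `match_count` is hypothetical.

```python
def match_count(t1, t2):
    """Number of nodes matched between two ordered trees given as (label, children)."""
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return 0  # roots differ: nothing below them is matched
    n, m = len(kids1), len(kids2)
    # LCS-style alignment of the two child lists, weighted by recursive match size.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + match_count(kids1[i - 1], kids2[j - 1]),
            )
    return 1 + dp[n][m]


def leaf(name):
    return (name, [])


product = ("*", [leaf("a"), leaf("b")])
expr1 = ("+", [product, leaf("c")])   # a*b + c
expr2 = ("+", [leaf("c"), product])   # c + a*b
same = match_count(expr1, expr1)
reordered = match_count(expr1, expr2)
```

Here `same` counts all five nodes, while `reordered` loses the leaf `c` because the child alignment cannot cross. A changed root label yields zero matches, which is one way to read the "Fragile" bullet.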

  14. DECKARD (ICSE 2007)

  15. DECKARD • Advantages • Scalability • Insensitive to minor structural changes such as reordering, insertion, deletion • Limitations • Structural similarity only • Insertion that incurs structure change.

  16. CFG matching • Hammock graph (JDIFF, ASE 2004) • Match classes by names • Match fields by types • Match methods by signatures • Match instructions in methods by hammock graphs • A hammock is a single entry single exit subgraph of a CFG.
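To make the single entry single exit property concrete, the sketch below checks whether a given node set forms a hammock. It assumes one common formalization (the entry lies inside the region, the exit lies just outside it, every edge entering the region targets the entry, and every edge leaving it targets the exit); it is not JDIFF's code, and the name `is_hammock` and the sample CFG are hypothetical.

```python
def is_hammock(cfg, nodes, entry, exit_node):
    """cfg maps each node to its list of successors."""
    nodes = set(nodes)
    if entry not in nodes or exit_node in nodes:
        return False
    for src, succs in cfg.items():
        for dst in succs:
            # An edge coming in from outside must target the entry node.
            if src not in nodes and dst in nodes and dst != entry:
                return False
            # An edge going out of the region must target the exit node.
            if src in nodes and dst not in nodes and dst != exit_node:
                return False
    return True


# if-then-else: s -> c; c -> t | e; t -> j; e -> j; j -> x
cfg = {"s": ["c"], "c": ["t", "e"], "t": ["j"], "e": ["j"], "j": ["x"], "x": []}
whole_if = is_hammock(cfg, {"c", "t", "e"}, "c", "j")
broken = is_hammock(cfg, {"t", "e"}, "t", "j")
```

The full if-then-else region qualifies, while the two branches taken together do not, because the edge from `c` to `e` enters the region at a node other than the entry.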

  17. CFG matching • Pros • Orthogonal • Can be combined with other matching techniques • Simple • Cons • Coarse grained matching only • Not good at clone detection • In case of code transformation

  18. Semantic Based Matching • Using PDG (SAS’01)

  19. Semantic Based

  20. Semantic Based • Pros • Non-contiguous, intertwined, reordered • Insensitive to code transformations. • Cons • Scalability • Points-to analysis • Starting from a matching pair seems to be a problem

  21. Wrap Up • For clone detection • Maybe structural / text similarity is a good idea • For whole program matching / method matching with code transformations • Semantic based is more appropriate • Scalability • PDG < CFG | AST < STRING < NAME
