
Duplicate code detection using Clone Digger

Duplicate code detection using Clone Digger. Peter Bulychev, Lomonosov Moscow State University, CS department. Outline: theoretical part (the clone detection problem in general, the theory behind the tool); practical part.


Presentation Transcript


  1. Duplicate code detection using Clone Digger • Peter Bulychev • Lomonosov Moscow State University, CS department

  2. Outline • Theoretical part • The clone detection problem in general • The theory behind the tool • Practical part • Clone Digger and the results of applying it to several open-source Python projects • Other ongoing projects

  3. What is a software clone? • Two fragments of code form a clone if they are similar enough (according to a given measure of similarity)

  4. Why is it important to detect code clones? • 5%-20% of the code in software systems is cloned [1] • Why do programmers produce clones? [2] • Development strategy • Maintenance benefits • Overcoming underlying limitations • Cloning by accident • Why is the presence of code clones bad? • Errors in the original must be fixed in every clone • References: [1] I.D. Baxter et al. Clone Detection Using Abstract Syntax Trees, 1998. [2] C.K. Roy and J.R. Cordy. A Survey on Software Clone Detection Research, 2007.

  5. Our definition of a clone • Clone definitions can be classified by the level of granularity at which code is compared: • List of strings • Sequence of tokens • Abstract syntax trees (AST) • Semantic information • We work at the AST level • We consider two sequences of statements to be a clone if one of them can be obtained from the other by replacing some subtrees
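The AST-level definition can be illustrated with Python's standard `ast` module. The sketch below is only an approximation of the definition on the slide: the helper `shape` is a hypothetical name, and it abstracts away only leaf nodes (names and constants), whereas the definition allows arbitrary subtrees to be replaced.

```python
import ast


def shape(node):
    """Describe a tree as nested tuples, replacing names and constants by '?'."""
    if isinstance(node, (ast.Name, ast.Constant)):
        return "?"
    children = [shape(child) for child in ast.iter_child_nodes(node)]
    return (type(node).__name__, *children)


# Same structure, different identifiers: equal shapes.
same = shape(ast.parse("x = a + b")) == shape(ast.parse("y = i + j"))

# A different operator changes an inner node, so the shapes differ.
different = shape(ast.parse("x = a + b")) != shape(ast.parse("x = a * b"))

print(same, different)
```

Two fragments with equal shapes differ only in the leaves that were replaced, which is the simplest case of "obtained from the other by replacing some subtrees".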

  6. Example • [Figure: AST diagrams of two statement blocks, each with two assignments and a print statement; the tree layout is not recoverable from the transcript]

  7. Sketch of the algorithm • Partition similar statements into clusters • Find pairs of identical cluster sequences • Refine the result by checking the identified code sequences for structural similarity • [Figure: example statement sequences built from i=0, f(i), i+=1 and k=0, f(k), k+=1]
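The first two steps of the sketch can be illustrated as follows. This is not Clone Digger's implementation: statements are grouped by exact structural equality after names and constants are abstracted away, which stands in for the tool's similarity-based clustering, and the third (refinement) step is omitted. Only top-level statements are considered, and all function and class names are hypothetical.

```python
import ast
import copy
from collections import defaultdict
from itertools import combinations


class _Abstract(ast.NodeTransformer):
    """Replace every name and constant by a fixed placeholder."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=0), node)


def cluster_key(stmt):
    """Statements with the same key fall into the same cluster."""
    return ast.dump(_Abstract().visit(copy.deepcopy(stmt)))


def candidate_clones(source, length=3):
    """Return pairs of starting line numbers of runs of `length`
    top-level statements whose cluster sequences are identical."""
    stmts = ast.parse(source).body
    keys = [cluster_key(s) for s in stmts]
    windows = defaultdict(list)
    for i in range(len(keys) - length + 1):
        windows[tuple(keys[i:i + length])].append(stmts[i].lineno)
    return [pair for starts in windows.values()
            for pair in combinations(starts, 2)]


SAMPLE = """i = 0
f(i)
i += 1
print("done")
k = 0
f(k)
k += 1
"""

print(candidate_clones(SAMPLE))
```

Because identifiers are discarded, this stage would also pair `i=0; f(i); i+=1` with a sequence such as `i=0; f(k); k+=1`. Candidates like that are what the refinement step is meant to examine more closely.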

  8. Main problems • How to compute the similarity between two trees? • Use an editing distance • How to compute the similarity between a new tree and an existing cluster of trees? • Comparing with each tree in the cluster is expensive • Compare the new tree with an average value stored for the cluster

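  9. Anti-unification • The anti-unifier of two trees is the most specific generalization that matches both of them • [Figure: two example expression trees and their anti-unifier, with ? marking the placeholders; the tree layout is not recoverable from the transcript]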

  10. Anti-unification features • The anti-unifier of a set of trees keeps their common features: the common upper part • Anti-unification can be used to compute the editing distance between two trees: if E0 is the anti-unifier of E1 and E2, and Ө1 and Ө2 are substitutions such that E0 Ө1 = E1 and E0 Ө2 = E2, then distance = |Ө1| + |Ө2|
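A minimal sketch of anti-unification and the distance derived from it, under two assumptions that do not come from the slides: trees are represented as nested tuples `(label, child, ...)` with strings as leaves, and the size |Ө| of a substitution is taken to be the total number of nodes in the subtrees it substitutes. The function names are hypothetical.

```python
def anti_unify(t1, t2, subst=None):
    """Return (pattern, subst), where pattern generalizes t1 and t2 and
    subst maps each differing pair of subtrees to its placeholder."""
    if subst is None:
        subst = {}
    if t1 == t2:
        return t1, subst
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        children = [anti_unify(a, b, subst)[0]
                    for a, b in zip(t1[1:], t2[1:])]
        return (t1[0], *children), subst
    # Reuse the placeholder if the same pair of subtrees was seen before,
    # which keeps the generalization as specific as possible.
    var = subst.setdefault((t1, t2), f"?{len(subst)}")
    return var, subst


def size(tree):
    """Number of nodes in a tree."""
    if isinstance(tree, tuple):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


def distance(t1, t2):
    """|theta1| + |theta2| under the size assumption stated above."""
    _, subst = anti_unify(t1, t2)
    return sum(size(a) + size(b) for a, b in subst)


e1 = ("f", ("+", "x", "y"), "x")                # f(x + y, x)
e2 = ("f", ("+", "x", "z"), ("*", "x", "2"))    # f(x + z, x * 2)

pattern, _ = anti_unify(e1, e2)
print(pattern)           # ('f', ('+', 'x', '?0'), '?1')
print(distance(e1, e2))  # 6
```

Here the common upper part `f(x + ?0, ?1)` is kept, and the distance counts the nodes that had to be substituted back in: `y` and `z` for `?0`, `x` and `x * 2` for `?1`.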

  11. Clone Digger • Is the first clone detection tool focused on Python (except Pylint) • Is provided under the GPL license • Writes a report on the clones found as HTML, in a two-column format with the differences highlighted • http://clonedigger.sourceforge.net

  12. Comparison with existing tools working with ASTs • CloneDR by Semantic Designs, I. Baxter, 1998 • Hash functions on subtrees, some kind of editing distance • Asta by Microsoft Research, S. Evans et al., 2007 • Subtree patterns (similar to anti-unification), hash functions on subtrees

  13. Quick Start • $ easy_install clonedigger • $ clonedigger --recursive source_tree • $ firefox output.html • Additional parameters such as thresholds can also be set (use --help to learn more)

  14. Running on real-life open-source projects • [Table of results not included in the transcript] • These numbers mean nothing, except that every large project has clones and they should be detected

  15. What to do with the clones found? • Remove clones by refactoring: Extract Method and Pull Up Method can be used • Detect library candidates • Search for bugs

  16. Any questions?
