
DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool


Presentation Transcript


  1. DCCFinder: A Very-Large Scale Code Clone Analysis and Visualization Tool • Simone Livieri, Yoshiki Higo, Makoto Matsushita, Katsuro Inoue

  2. Background • Open-Source Software (OSS) is used in many software systems • Relations between software systems can be exposed through code clone analysis • Large collections of OSS exist • Analyzing them requires huge amounts of memory and long running times • Computing power is cheap • Large numbers of computers are often easily accessible • Code clone analysis can be distributed

  3. In the beginning was CCFinder • CCFinder is a code-clone analysis tool • Widely used and cited • Token based • Many languages supported (e.g. C, C++, Java) • Good scalability (but can’t handle very large input)

  4. DCCFinder • D(istributed)CCFinder is a tool for distributed code clone analysis • Master-slave distributed system • Data sharing through a shared file system • Uses CCFinder to perform the code clone analysis • The prototype ran on 80 computers of the Student Laboratory of our department

  5. Computational Model • The target is the set of source files undergoing code clone analysis • A project is a single software system • A category is a set of source files sharing a specific feature or use • A unit is a set of source files that may cross multiple projects • Two units make a piece; a piece is the collection of files that will be analyzed on each slave node • [Diagram: the target is divided into categories 1 to 4, projects 1 to 8, and units 1 to n; piece i,j combines unit i and unit j and is passed to CCFinder on a slave node]
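
The pairing of units into pieces can be sketched as follows. This is a minimal illustration, not DCCFinder's code: the function name `enumerate_pieces` is hypothetical, and it assumes that a unit is also paired with itself so that clones inside a single unit are analyzed too.

```python
from itertools import combinations_with_replacement


def enumerate_pieces(num_units):
    """Return every piece (i, j) with 1 <= i <= j <= num_units.

    Assumption (not stated on the slide): a unit is also paired with
    itself, so each unordered pair of units appears exactly once.
    """
    unit_ids = range(1, num_units + 1)
    return list(combinations_with_replacement(unit_ids, 2))


pieces = enumerate_pieces(4)
print(len(pieces))  # 10, i.e. n * (n + 1) / 2 for n = 4
print(pieces[:4])   # [(1, 1), (1, 2), (1, 3), (1, 4)]
```

Under this assumption the number of pieces grows quadratically with the number of units, which is why the unit size directly controls how many jobs the slave nodes have to process.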

  6. System Implementation (1) • Written in Java (about 20 kLOC) • Master-Slave-Registry communication handled with Java RMI • Basic fault tolerance

  7. Analysis Process

  8. System Implementation (2) • Indexer • Examines the target and collects file size, LoC, project and category name • Computes unit boundaries • Master Node • Creates the input files for CCFinder and assigns jobs to the slaves • Slave Node • Copies the files to local storage • Executes CCFinder • Copies the output to the shared storage
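
The slides do not say how the Indexer computes unit boundaries. One plausible approach, shown here purely as an assumption, is to walk the indexed files in order and close a unit whenever adding the next file would exceed a size limit. The name `compute_unit_boundaries` and the sample sizes are hypothetical.

```python
def compute_unit_boundaries(file_sizes, max_unit_bytes):
    """Greedily group consecutive files into units of bounded size.

    Returns a list of (start, end) index pairs, with end exclusive.
    A single file larger than the limit gets a unit of its own.
    This greedy rule is an assumption, not DCCFinder's documented method.
    """
    boundaries = []
    start = 0    # index of the first file in the unit being built
    current = 0  # total size of the unit being built
    for index, size in enumerate(file_sizes):
        # Close the current unit if it is non-empty and would overflow.
        if index > start and current + size > max_unit_bytes:
            boundaries.append((start, index))
            start = index
            current = 0
        current += size
    if start < len(file_sizes):
        boundaries.append((start, len(file_sizes)))
    return boundaries


sizes = [6, 5, 7, 3, 20, 4]  # made-up file sizes
print(compute_unit_boundaries(sizes, 15))
# [(0, 2), (2, 4), (4, 5), (5, 6)]
```

Keeping files in index order means a unit can span the end of one project and the start of the next, which matches the definition of a unit as a set of files that may cross multiple projects.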

  10. System Implementation (3) • Clone Coverage Analyzer • Computes the number of shared lines of code between each pair of files, projects and categories • Image Generator • Generates scatter plots, heat maps or bar charts from the clone coverage data
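
Counting shared lines needs some care because clone segments reported for one file can overlap. The sketch below assumes that each segment is an inclusive `(first_line, last_line)` pair with 1-based line numbers; that representation and the name `covered_lines` are illustrative assumptions, not CCFinder's output format.

```python
def covered_lines(segments):
    """Count the distinct lines covered by clone segments in one file.

    Each segment is an inclusive (first_line, last_line) pair with
    1-based line numbers. Lines covered by several segments are
    counted only once.
    """
    total = 0
    last_counted = 0  # highest line number already counted
    for first, last in sorted(segments):
        first = max(first, last_counted + 1)  # skip lines already counted
        if last >= first:
            total += last - first + 1
            last_counted = last
    return total


print(covered_lines([(10, 20), (15, 30), (40, 45)]))  # 27
```

Summing this count over the files of a project or category gives the shared-line totals from which coverage percentages can be derived.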

  12. Case Study I: The FreeBSD Target • Vast collection of Open-Source software used by the FreeBSD OS • Unit size: 15 MB • Minimum code clone length: 50 tokens • Total number of tasks: 269,745
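
The task count can be sanity-checked against the computational model. If each task is one piece, and pieces are unordered pairs of units that include a unit paired with itself (an assumption), then n units yield n(n + 1)/2 tasks. The slides do not give the number of units; the helper below, with the hypothetical name `units_for_task_count`, inverts the formula.

```python
import math


def units_for_task_count(tasks):
    """Solve n * (n + 1) / 2 == tasks for a whole number n.

    Returns n, or None when tasks is not a triangular number.
    """
    n = (math.isqrt(8 * tasks + 1) - 1) // 2
    return n if n * (n + 1) // 2 == tasks else None


print(units_for_task_count(269_745))  # 734
```

The reported 269,745 tasks is exactly 734 × 735 / 2, so the figure is consistent with a target split into 734 units under the stated assumption.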

  13. Case Study I: Result

  14. Case Study I: Result php4 and php5 duplicated source tree

  15. Case Study I: Result gstream’s main source tree is duplicated inside all the gstream plugin projects

  16. Case Study I: Result Multiple copies of the X Window System source tree

  17. Case Study I: Result

  18. Case Study I: Result • Database Category • CCC1: 41% • Causes: • Different versions of the same software • Database drivers for different languages • Multiple copies of the phpX source tree

  19. Case Study I: Result • Development Category • CCC1: 38% • Causes: • Mainly the presence of different versions of the GNU binary utilities and compilers

  20. Case Study I: Result • Lang and Development Categories • CCC1: 28% • Causes: • The presence in both categories of the suite of GNU compilers

  21. Case Study I: Result • X11 Fonts Category • CCC1: 46% • Causes: • Small category size • Seven copies of the X Window System source tree

  22. Case Study II: SPARS-J and the FreeBSD Target • SPARS-J is a Java component analysis tool • About 47,000 lines of code; written in C • Code clones between SPARS-J and the whole FreeBSD target were detected

  23. Case Study II: Code Clone Coverage (before) Most of the code clones were from a single file: getopt.c

  24. Case Study II: Code Clone Coverage (after) • Code clones from CGI handling source code • Specialized version of getopt.c

  25. Summary • Proposed a new approach to distributed large-scale code clone analysis • Obtained a global overview of code clones in the FreeBSD target • In SPARS-J, easily identified the use of code from the FreeBSD target

  26. Summary (2) • The acceleration gain (speedup) was 20. Limited by: • data transfer, network congestion, master-slave coordination • Generating scatter plots of a reasonable size involved a trade-off between speed and accuracy. Effects: • Source code organization easily visible, enhanced artifacts, finer details not distinguishable • Currently cannot efficiently filter out unnecessary or uninteresting code clones • Being addressed by exploring fingerprint-based source code analysis

  27. Future Work • DCCFinder is currently being rewritten • Better fault tolerance • GUI • Distributed post-processing and image generation • Exploring the evolution of different software systems with code clone analysis

  28. Metrics • CCC1 is the percentage of lines of code shared between M0 and M1, computed over the total lines of code of M0 and M1 • CCC2 is the percentage of lines of code that M0 shares with M1, computed over the total lines of code of M0 • The metrics are defined in terms of: a pair of files, projects or categories (M0 and M1); the segments of the code clones between M0 and M1; the segments of the code clones between M0 and M1 that lie in M0; and the number of lines of code in x
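
The slide's formulas did not survive in the transcript. Based only on the textual definitions above, they can plausibly be written as follows. The symbols are placeholders chosen here, not the authors' notation: S(M0, M1) stands for the segments of the code clones between M0 and M1, S_{M0}(M0, M1) for those segments that lie in M0, and LOC(x) for the number of lines of code in x.

```latex
\[
\mathrm{CCC}_1(M_0, M_1) =
  \frac{\mathrm{LOC}\bigl(S(M_0, M_1)\bigr)}
       {\mathrm{LOC}(M_0) + \mathrm{LOC}(M_1)} \times 100
\]
\[
\mathrm{CCC}_2(M_0, M_1) =
  \frac{\mathrm{LOC}\bigl(S_{M_0}(M_0, M_1)\bigr)}
       {\mathrm{LOC}(M_0)} \times 100
\]
```

Read this way, CCC1 is symmetric in M0 and M1, while CCC2 is directional: it measures how much of M0 is shared with M1, so swapping the arguments generally changes its value.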
