CCFinder: Efficient Code Clone Detection Tool for Large Scale Source Code

“I can just copy these lines. That is the safest thing to do. The code has been tested after all.”
“What a mess. This code has been copied, then changed a bit, all over the code base.”
Lectures 9 & 10
CCFinder: A Tool to Detect Clones
MSc Software Maintenance
Dr Andy Brooks

Case Study: Cloning
Reference
CCFinder: A Multi-Linguistic Token-based Code Clone Detection System For Large Scale Source Code, Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University.
http://sel.ist.osaka-u.ac.jp/~kamiya/
http://sel.ics.es.osaka-u.ac.jp/cdtools/index.html.en

Reasons For Clones
• Copy-and-paste
  • one of the easiest ways to re-use code
  • one of the safest ways to re-use code in legacy applications as the original code base is unaltered
• Mental macro
  • frequently coded computations are remembered and coded the same way
• Repeated code portions for performance
  • inlined code is faster than called code
• Systematic code generation from a single base
  • several variations of code needed

The Problem With Clones
• It is difficult to modify source files with many clones consistently.
• When a fault is found, the engineer has to identify all occurrences in every subsystem.
• In large and complex systems, there can be dozens of engineers, each working on only one subsystem.
• Documenting the existence of clones as they are introduced does not happen in practice.

Motivation For CCFinder
• Government software system
• 1 million lines of code
• 2 thousand modules
• Written in COBOL and a PL/I-like language
• Developed over 20 years ago
• Continually maintained by a large number of engineers
• Suspected that clones heavily reduce maintainability of the system

Underlying Concepts CCFinder
• Industrial strength
  • deals with million-line size systems
  • without excessive demands on time or memory
  • token-by-token matching more expensive than line-by-line
  • several optimization techniques employed
• Report only interesting clones
  • apply heuristic knowledge to remove unwanted clones
• Copy-and-paste detection
  • deal with variable renaming and other small changes
• Limited language dependence
  • easy to adapt tool to specific languages
  • adaptation for Java took two person-days

Definitions and Terms
• A clone relation holds between two code portions if and only if they are the same sequence.
• A pair of code portions is called a clone pair if the clone relation holds between the portions.
• A clone class is a maximal set of code portions in which a clone relation holds between any pair of code portions.
• In CCFinder, clone relations are determined for transformed token sequences.

a x y z b x y z c x y d (12 tokens)
• Clone class C1
  • a [x y z] b [x y z] c x y d
• Clone class C2
  • a [x y] z b [x y] z c [x y] d
  • Note how the 3rd x y is not in C1
• Clone class C3
  • a x [y z] b x [y z] c x y d
  • Portions are in C1 and this class is not of interest because it is not maximal.
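The 12-token example can be checked mechanically. The following is a brute-force sketch for illustration only, not CCFinder's suffix-tree algorithm; the function names `repeated_portions` and `covered_by_longer` are hypothetical. It groups equal token runs of at least two tokens and flags a class as not maximal when every one of its portions lies inside a portion of a longer repeated run.

```python
from collections import defaultdict

def repeated_portions(tokens, min_len=2):
    """Map each repeated token run to the 0-based start indices where it occurs."""
    found = defaultdict(list)
    n = len(tokens)
    for length in range(min_len, n):
        for start in range(n - length + 1):
            found[tuple(tokens[start:start + length])].append(start)
    # keep only runs that occur more than once
    return {run: starts for run, starts in found.items() if len(starts) > 1}

def covered_by_longer(run, starts, classes):
    """True if every occurrence of `run` lies inside an occurrence of a longer repeated run."""
    for other, other_starts in classes.items():
        if len(other) <= len(run):
            continue
        if all(any(o <= s and s + len(run) <= o + len(other) for o in other_starts)
               for s in starts):
            return True
    return False

tokens = "a x y z b x y z c x y d".split()
classes = repeated_portions(tokens)
for run, starts in sorted(classes.items(), key=lambda kv: -len(kv[0])):
    note = "not maximal" if covered_by_longer(run, starts, classes) else "maximal"
    print(" ".join(run), starts, note)
```

With this input the run `x y z` (two occurrences) and the run `x y` (three occurrences) are reported as maximal, while `y z` is not, because both of its occurrences sit inside the `x y z` portions.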

Identification Of Structures
• A code portion that begins in the middle of a function definition and ends some way through another function definition can be very difficult to rewrite as shared code.
• CCFinder separates each function definition.
• A code portion that is part of table initialization code can be very difficult to rewrite as shared code.
• CCFinder identifies table definition code.

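Clone Detection Process
1. Lexical Analysis
2. Transformation (2.1 Transformation By Rules, 2.2 Parameter Replacement)
3. Match Detection
4. Formatting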

1. Lexical Analysis
• Source files are divided into tokens according to the rules of the language.
• The tokens from all source files are concatenated into a single sequence of tokens.
• Whitespace (spaces, newlines, tabs) and comments between tokens are removed.
• The removed characters are sent to ‘Formatting’ to enable reconstruction of the original source files.
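A minimal sketch of this step, assuming a simplified C-like token grammar (the pattern `TOKEN_RE` and the function `lex` are hypothetical and far cruder than a real lexer). Whitespace and comments are dropped from the sequence, but the file name and line number of each token are kept so that positions can be reported later.

```python
import re

# Hypothetical, simplified token pattern: comments and whitespace are matched
# only so that they can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<space>\s+)
  | (?P<token>[A-Za-z_]\w*|\d+|'[^'\n]*'|.)
""", re.VERBOSE | re.DOTALL)

def lex(files):
    """Return one concatenated list of (token, file_name, line_number) triples."""
    sequence = []
    for name, text in files.items():
        line = 1
        for m in TOKEN_RE.finditer(text):
            if m.lastgroup == "token":
                sequence.append((m.group(), name, line))
            # comments and whitespace may span lines; keep the line counter in step
            line += m.group().count("\n")
    return sequence

files = {"a.c": "int x = 1; // set x\nx = x + 2;\n"}
seq = lex(files)
print([tok for tok, _, _ in seq])
```

Keeping the line number beside each token is one simple way to make the later conversion from token positions to source lines possible.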

2.1 Transformation By Rules

2.2 Parameter Replacement
• After 2.1, identifiers for types, variables, and constants are replaced with a special token.

3. Match Detection
• All clone pairs are detected.
• Each is reported as (LeftBegin, LeftEnd, RightBegin, RightEnd) with respect to the token sequence.

4. Formatting
• Locations of clone pairs are converted into line numbers in the original source files.
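Steps 2.2 to 4 can be sketched on a tiny token list. Everything here is an assumption made for illustration: the placeholder spelling `$p`, the `KEYWORDS` subset, the choice to treat literal constants like identifiers, and the quadratic brute-force matcher, which stands in for CCFinder's suffix-tree matching and does not filter overlapping portions.

```python
KEYWORDS = {"int", "if", "else", "for", "while", "return",
            "break", "case", "switch"}  # hypothetical subset

def replace_parameters(tokens):
    """Replace identifiers and literal constants with one placeholder token."""
    out = []
    for tok in tokens:
        is_name = (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS
        is_const = tok[0].isdigit() or tok[0] in "'\""
        out.append("$p" if is_name or is_const else tok)
    return out

def clone_pairs(tokens, min_len):
    """Brute-force matches as (LeftBegin, LeftEnd, RightBegin, RightEnd), inclusive indices."""
    n = len(tokens)
    pairs = []
    for left in range(n):
        for right in range(left + 1, n):
            # skip matches that only continue a match starting one token earlier
            if left > 0 and tokens[left - 1] == tokens[right - 1]:
                continue
            length = 0
            while right + length < n and tokens[left + length] == tokens[right + length]:
                length += 1
            if length >= min_len:
                pairs.append((left, left + length - 1, right, right + length - 1))
    return pairs

def to_lines(pair, line_of):
    """Convert token indices to (first_line, last_line) for each side of the pair."""
    lb, le, rb, re_ = pair
    return (line_of[lb], line_of[le]), (line_of[rb], line_of[re_])

tokens = replace_parameters("x = a + 1 ; y = b + 2 ;".split())
line_of = [1] * 6 + [2] * 6            # first statement on line 1, second on line 2
pairs = clone_pairs(tokens, min_len=6)
print(pairs)                            # [(0, 5, 6, 11)]
print(to_lines(pairs[0], line_of))      # ((1, 1), (2, 2))
```

The two statements differ in every name and constant, yet after replacement they are the same token sequence, which is why renamed copy-and-paste code is still matched.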

Sample Code [figure]

Sample Code Transformed 2.1 [figure]
• remove template parameters

Sample Code Transformed 2.2 [figure]
• Clone pairs
  • Lines 1:7 and 11:17
  • Lines 8:10 and 19:21

Matrix Visualization [figure: axes labelled token and line; lines 11, 17, 19, and 21 marked]

Metrics For Clone Pairs/Classes
• LEN(p), LEN(C)
  • Length can be measured in tokens, SLOC, and LOC (LOC excludes null or comment lines).
  • The token length of each portion of a clone class is identical when measured on the transformed token sequence.
  • LOC is used in the following metric definitions.
• POP(C)
  • The number of elements in a clone class C.
  • A large POP means similar code portions appear in many places.

Metrics For Clone Pairs/Classes
• DFL(C)
  • Deflation is an estimate of how much code is removed when a clone class is rewritten as shared code.
  • Suppose USELEN(C) is the length of the caller statement.
  • DFL(C) = LEN(C) × POP(C) - (USELEN(C) × POP(C) + LEN(C))
• COVERAGE (%LOC)
  • percentage of lines that include any portion of a clone
• COVERAGE (%FILE)
  • percentage of files that include any clones
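A small worked example of these metrics, with made-up numbers and hypothetical function names (`dfl`, `coverage`). The first term of DFL is the code that exists today (every copy); the bracketed term is what would remain after rewriting (one caller statement per copy plus a single shared body).

```python
def dfl(length, population, use_len):
    """DFL(C) = LEN(C) * POP(C) - (USELEN(C) * POP(C) + LEN(C)), all in LOC."""
    return length * population - (use_len * population + length)

def coverage(total_lines, cloned_lines):
    """Return (COVERAGE %LOC, COVERAGE %FILE).

    total_lines:  {file: number of lines}
    cloned_lines: {file: set of line numbers that belong to any clone}
    """
    all_lines = sum(total_lines.values())
    covered = sum(len(cloned_lines.get(f, ())) for f in total_lines)
    files_with_clones = sum(1 for f in total_lines if cloned_lines.get(f))
    return 100 * covered / all_lines, 100 * files_with_clones / len(total_lines)

# Illustrative values only: a 10-line portion copied to 4 places, 1-line caller.
print(dfl(length=10, population=4, use_len=1))   # 40 - (4 + 10) = 26

total = {"A.java": 100, "B.java": 50, "C.java": 50}
cloned = {"A.java": set(range(1, 11)), "B.java": set(range(1, 11))}
loc_pct, file_pct = coverage(total, cloned)
print(loc_pct, round(file_pct, 1))               # 10.0 66.7
```

Note that DFL can be negative for short, rarely repeated portions, where a shared routine plus callers would be longer than the copies.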

Metrics For Clone Pairs/Classes RAD(C) [figure]

Metrics For Clone Pairs/Classes
• RAD(C) is the maximum length of path from each file (containing a clone code portion belonging to C) to the lowest common ancestor.
• If all code portions of C are included in one file then RAD(C) = 0.
• A large RAD implies code portions spread throughout different subsystems.
  • This makes maintenance difficult if each subsystem is maintained by different engineers.
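RAD can be sketched directly from file paths. This version assumes that path length is counted in directory-tree edges and that the lowest common ancestor of a single file is the file itself, which gives RAD = 0 in that case; the function name `rad` and the example paths are illustrative.

```python
from pathlib import PurePosixPath

def rad(files):
    """Maximum number of path steps from any file in the class up to the
    lowest common ancestor of all files in the class."""
    paths = [PurePosixPath(f).parts for f in set(files)]
    common = 0
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common += 1
    return max(len(p) - common for p in paths)

print(rad(["java/awt/Button.java", "java/awt/Button.java"]))    # 0: one file
print(rad(["java/awt/Button.java", "java/awt/Label.java"]))     # 1: same directory
print(rad(["java/awt/Button.java", "javax/swing/JLabel.java"])) # 3: only the root is shared
```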

CCFinder Time and Space Complexities
• CCFinder uses a suffix-tree algorithm with a time and space complexity of O(n).
• Complexity measurements were made on a PC (Pentium 4, 1.5 GHz, 640 MB RAM) given various sized subsets of Linux 2.4.9 source files (2600K lines).

Leading Token Restriction Optimization
• Reporting code portions that begin or end in the middle of a statement as clones is not very useful.
• Leading tokens at the beginning of clones are restricted to labels or keywords that either initiate or terminate statements.
• Leading token restriction reduces the number of nodes in the suffix tree to one third in the C, C++, and Java case studies.
• This restriction is very important for making the tool scalable.

Repeated Code Removal Optimization
a1 switch (c) {
a2 case '0': value = 0; break;
a3 case '1': value = 1; break;
a4 case '2': value = 2; break;
a5 case '3': value = 3; break;
a6 case '4': value = 4; break;
a7 }
b1 case 'a':
b2 flag = 2;
b3 break;
• The clone class {a2, a3, a4, a5, a6, b1-b3} will be detected.
• 6C2 = 15 clone pairs
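The figure of 15 is the number of ways to choose two of the six members of the clone class. A short check, using the line labels from the example as plain strings:

```python
from itertools import combinations

members = ["a2", "a3", "a4", "a5", "a6", "b1-b3"]
pairs = list(combinations(members, 2))
print(len(pairs))   # 15
print(pairs[:3])    # [('a2', 'a3'), ('a2', 'a4'), ('a2', 'a5')]
```

Because the count grows as POP × (POP - 1) / 2, a long run of near-identical lines can dominate the list of reported pairs.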

Repeated Code Removal Optimization
• To reduce the number of clone pairs, when building a suffix tree, after the first identification (repetition of a2 at a3), succeeding repetitions are not inserted.
• Clone pair (a2, b1-b3) is still reported.
• Repeated code removal is also said to stop reporting of self clones, e.g. (a2-a5, a3-a6).
Andy asks: how would you test this is working?

Token Concatenation Optimization
• Abutting tokens that are not punctuator keywords are joined together.
• The token sequence is made shorter in exchange for greater variation in what a token stands for.
Andy asks: what if this optimization were not applied?

Clones in the JDK 1.3.0 >= 30 tokens [figure: regions labelled java/awt/*.java, javax/swing/*.java, org/omg/CORBA/*.java]

Clones in the JDK 1.3.0 >= 30 tokens
• 570k lines in 1877 files.
• CCFinder took 3 minutes on a PC.
• Files in the same directory are next to one another on the diagram axes.
• Most line segments look like dots because of the scale of the graph.
• Most cloning is near the main diagonal, which means most clones occur within a file or between neighbouring directories.

Similar source files in the JDK 1.3.0
• These section D files are identical apart from lines 32, 161, 163.

Longest clone in the JDK 1.3.0
• 1647 tokens, 627 lines
• WindowsFileChooserUI.java and MetalFileChooserUI.java each have nine internal classes, one constructor and 45 methods.
• All but three methods are clones.

Effects Of Rules And Preprocessing Techniques
• Disabling various rules and techniques has dramatic effects on the number of clone pairs and classes detected.
Andy asks: what is the best combination to use?

Population And Length Of Clone Classes JDK 1.3.0 [figure: axes POP and LEN (tokens)]

Clone Classes Of Top 5% DFL
• Source file investigation reveals various kinds of cloning:
  • sequence of several methods
  • single method body
  • source files generated by tool
  • routines within a method
  • entire class body
• Evidence points to different kinds of copy-and-paste style reuse in the JDK.

POP And RAD In JDK 1.3.0 [figure: labelled clusters: exception classes (two), swing]
• Over 20 transformed tokens.

Conclusions
• Tools to detect clones are themselves complex pieces of software.
• Clone detection in CCFinder is sensitive to the rules, techniques, and clone threshold size employed.
• CCFinder has been successfully used to detect clones in the JDK 1.3.0.
• As software systems get even bigger, clone detection will play an increasingly important part in code reengineering.
