CCFinder is a powerful code clone detection system that can identify and analyze clones in large-scale source code, helping to improve software maintenance and reduce redundancy. This tool utilizes token-based matching and employs several optimization techniques for enhanced performance.
“I can just copy these lines. That is the safest thing to do. The code has been tested after all.”
“What a mess. This code has been copied, then changed a bit, all over the code base.”
Lectures 9 & 10
CCFinder: A Tool to Detect Clones
MSc Software Maintenance
Dr Andy Brooks
Cloning: Case Study
Reference
CCFinder: A Multi-Linguistic Token-based Code Clone Detection System For Large Scale Source Code, Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University.
http://sel.ist.osaka-u.ac.jp/~kamiya/
http://sel.ics.es.osaka-u.ac.jp/cdtools/index.html.en
Reasons For Clones
• Copy-and-paste
  • one of the easiest ways to re-use code
  • one of the safest ways to re-use code in legacy applications as the original code base is unaltered
• Mental macro
  • frequently coded computations are remembered and coded the same way
• Repeated code portions for performance
  • inlined code is faster than called code
• Systematic code generation from a single base
  • several variations of code needed
The Problem With Clones
• It is difficult to consistently modify source files with many clones.
• When a fault is found, the engineer has to identify all occurrences in every subsystem.
• In large and complex systems, there can be dozens of engineers, each working on only one subsystem.
• Documenting the existence of clones as they are introduced does not happen in practice.
Motivation For CCFinder
• Government software system
  • 1 million lines of code
  • 2 thousand modules
  • Written in COBOL and a PL/I-like language
  • Developed over 20 years ago
  • Continually maintained by a large number of engineers
• Suspected that clones heavily reduce the maintainability of the system
Underlying Concepts Of CCFinder
• Industrial strength
  • deals with million-line size systems
  • without excessive demands on time or memory
  • token-by-token matching is more expensive than line-by-line
  • several optimization techniques employed
• Report only interesting clones
  • apply heuristic knowledge to remove unwanted clones
• Copy-and-paste detection
  • deal with variable renaming and other small changes
• Limited language dependence
  • easy to adapt tool to specific languages
  • adaptation for Java took two person days
Definitions and Terms
• A clone relation holds between two code portions if and only if they are the same sequence.
• A pair of code portions is called a clone pair if the clone relation holds between the portions.
• A clone class is a maximal set of code portions in which a clone relation holds between any pair of code portions.
• In CCFinder, clone relations are determined for transformed token sequences.
Example: a x y z b x y z c x y d (12 tokens)
• Clone class C1
  • the two occurrences of x y z
• Clone class C2
  • the three occurrences of x y
  • Note how the 3rd x y is not in C1.
• Clone class C3
  • shorter portions that lie inside the portions of C1
  • Portions are in C1 and this class is not of interest because it is not maximal.
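The clone classes of the 12-token example can be reproduced with a small brute-force sketch. It is an illustration only, not CCFinder's suffix-tree algorithm: it counts every repeated token run and keeps a run only when no one-token extension (left or right) occurs equally often, which is one reading of "maximal". The function name `clone_classes` and the `min_len` parameter are assumptions made for this example.

```python
from collections import defaultdict

def clone_classes(tokens, min_len=2):
    """Map each maximal repeated token run to its 0-based start positions."""
    occurrences = defaultdict(list)
    n = len(tokens)
    for length in range(min_len, n + 1):
        for start in range(n - length + 1):
            occurrences[tuple(tokens[start:start + length])].append(start)

    # Keep only runs that occur at least twice.
    repeated = {run: pos for run, pos in occurrences.items() if len(pos) >= 2}

    maximal = {}
    for run, pos in repeated.items():
        # A run is not maximal if a run one token longer contains it
        # (as prefix or suffix) and occurs just as often.
        extendable = any(
            len(longer) == len(run) + 1
            and len(lpos) == len(pos)
            and (longer[:-1] == run or longer[1:] == run)
            for longer, lpos in repeated.items()
        )
        if not extendable:
            maximal[run] = pos
    return maximal

classes = clone_classes("a x y z b x y z c x y d".split())
for run, positions in sorted(classes.items()):
    print(" ".join(run), positions)
```

The run y z is dropped because every occurrence of it extends to x y z, which matches the remark that C3 is not of interest.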
Identification Of Structures
• A code portion that begins in the middle of a function definition and ends some way through another function definition can be very difficult to rewrite as shared code.
  • CCFinder separates each function definition.
• A code portion that is part of table initialization code can be very difficult to rewrite as shared code.
  • CCFinder identifies table definition code.
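Clone Detection Process
1. Lexical Analysis
2. Transformation (2.1 Transformation By Rules, 2.2 Parameter Replacement)
3. Match Detection
4. Formatting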
1. Lexical Analysis
• Source files are divided into tokens according to the rules of the language.
• The tokens from all source files are concatenated into a single sequence of tokens.
• Whitespace (spaces, newlines, tabs) and comments between tokens are removed.
  • The removed characters are sent to ‘Formatting’ to enable reconstruction of the original source files.
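A minimal sketch of the lexical analysis step, assuming a C-like language with `//` and `/* */` comments. The regular expression, the token kinds, and the function name `tokenize` are illustrative assumptions, not CCFinder's actual lexer. Each token keeps its line number so that a later formatting step could map token positions back to source lines.

```python
import re

# Illustrative token rules for a C-like language (an assumption).
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)   # line and block comments
  | (?P<ws>\s+)                       # spaces, tabs, newlines
  | (?P<ident>[A-Za-z_]\w*)           # identifiers and keywords
  | (?P<number>\d+(?:\.\d+)?)         # numeric literals
  | (?P<string>"(?:\\.|[^"\\])*")     # string literals
  | (?P<punct>.)                      # any other single character
""", re.VERBOSE | re.DOTALL)

def tokenize(source):
    """Return (kind, text, line) triples, dropping whitespace and comments."""
    tokens = []
    line = 1
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind not in ("comment", "ws"):
            tokens.append((kind, text, line))
        line += text.count("\n")  # keep the line counter in step
    return tokens

sample = "int x = 1; // set x\ny = x + 2; /* multi\nline */ z"
for kind, text, line in tokenize(sample):
    print(line, kind, text)
```

To build the single sequence described above, the token lists of all files would be appended one after another, with the file name stored next to each line number.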
2.1 Transformation By Rules
2.2 Parameter Replacement
• After 2.1, identifiers for types, variables, and constants are replaced with a special token.
3. Match Detection
• All clone pairs are detected.
  • (LeftBegin, LeftEnd, RightBegin, RightEnd) with respect to the token sequence
4. Formatting
• Locations of clone pairs are converted into line numbers in the original source files.
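Parameter replacement is what lets renamed copies match. The sketch below is a simplified illustration: the keyword set is a small hypothetical sample, the spelling `$p` for the special token is an assumption, and only identifiers are replaced (literals are left alone).

```python
# Hypothetical keyword sample; a real front end would use the language's full list.
KEYWORDS = {"if", "else", "for", "while", "return", "switch", "case", "break"}
PLACEHOLDER = "$p"  # assumed spelling of the special token

def replace_parameters(tokens):
    """Replace every non-keyword identifier with PLACEHOLDER."""
    return [
        PLACEHOLDER if tok.isidentifier() and tok not in KEYWORDS else tok
        for tok in tokens
    ]

original = "if ( total > limit ) return total ;".split()
renamed = "if ( sum > cap ) return sum ;".split()

print(replace_parameters(original))
# Both fragments map to the same transformed sequence, so they can be
# reported as a clone pair despite the renamed variables.
print(replace_parameters(original) == replace_parameters(renamed))
```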
Sample Code
[code listing; four lines marked with *]
Sample Code Transformed 2.1
[code listing; lines marked with * have had template parameters removed]
Sample Code Transformed 2.2
Clone pairs
• Lines 1:7 and 11:17
• Lines 8:10 and 19:21
Matrix Visualization
[scatter plot; axes labeled by token and line, with lines 11, 17, 19, and 21 marked]
Metrics For Clone Pairs/Classes
• LEN(p), LEN(C)
  • Length can be measured in tokens, SLOC, and LOC (LOC excludes null or comment lines).
  • The token length of each portion of a clone class is identical when measured on the transformed token sequence.
  • LOC is used in the following metric definitions.
• POP(C)
  • The number of elements in a clone class C.
  • A large POP means similar code portions appear in many places.
Metrics For Clone Pairs/Classes
• DFL(C)
  • Deflation is an estimate of how much code is removed when a clone class is rewritten as shared code.
  • Suppose USELEN(C) is the length of the caller statement.
  • DFL(C) = LEN(C) × POP(C) − (USELEN(C) × POP(C) + LEN(C))
• COVERAGE (%LOC)
  • percentage of lines that include any portion of a clone
• COVERAGE (%FILE)
  • percentage of files that include any clones
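The deflation formula can be checked with a short worked example. The function name `dfl` and the default of one line for the caller statement are assumptions made for this sketch; lengths are in LOC, as in the metric definitions.

```python
def dfl(length, pop, uselen=1):
    """Estimated lines removed when a clone class becomes shared code.

    length: LEN(C), lines in each cloned portion
    pop:    POP(C), number of portions in the class
    uselen: USELEN(C), lines needed to call the shared code (assumed 1)
    """
    before = length * pop             # every copy written out in full
    after = uselen * pop + length     # one call per copy plus one shared body
    return before - after

# A 10-line portion copied to 5 places: 50 lines shrink to 15.
print(dfl(10, 5))
# A 3-line portion copied twice saves only one line.
print(dfl(3, 2))
```

The estimate grows with both LEN and POP, so long clones that appear in many places are the most rewarding candidates for rewriting.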
Metrics For Clone Pairs/Classes
• RAD(C)
  • RAD(C) is the maximum length of path from each file (containing a clone code portion belonging to C) to the lowest common ancestor.
  • If all code portions of C are included in one file then RAD(C) = 0.
  • A large RAD implies code portions spread throughout different subsystems.
  • This makes maintenance difficult if each subsystem is maintained by different engineers.
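One way to compute RAD from file paths is sketched below. It assumes the directory tree is the hierarchy meant by "lowest common ancestor" and that path length is counted in path components; the function name `rad` and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath

def rad(file_paths):
    """Longest distance from any file to the lowest common ancestor.

    file_paths must be a non-empty collection of '/'-separated paths.
    """
    parts = [PurePosixPath(p).parts for p in set(file_paths)]
    common = 0
    for level in zip(*parts):
        if len(set(level)) != 1:
            break          # paths diverge at this depth
        common += 1
    return max(len(p) - common for p in parts)

print(rad(["src/java/awt/A.java"]))                            # one file
print(rad(["src/java/awt/A.java", "src/java/awt/B.java"]))     # same directory
print(rad(["src/java/awt/A.java", "src/javax/swing/B.java"]))  # different packages
```

When every portion is in the same file the common prefix is the whole path, which gives the RAD(C) = 0 case from the definition.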
CCFinder Time and Space Complexities
• CCFinder uses a suffix-tree algorithm with a time and space complexity of O(n).
• Complexity measurements were made on a PC (Pentium 4, 1.5GHz, 640 MB RAM) given various sized subsets of Linux 2.4.9 source files (2600K lines).
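A full suffix tree is too long to show here, so the sketch below swaps in a naive sorted-suffix approach to illustrate the same idea: suffixes that share a long common prefix sit next to each other once sorted. It is not O(n) and it is not CCFinder's implementation. It only compares neighbouring suffixes, so it does not list every clone pair. The function name `repeated_runs` is an assumption; the result tuples follow the (LeftBegin, LeftEnd, RightBegin, RightEnd) form, with 0-based token positions.

```python
def repeated_runs(tokens, min_len):
    """Report repeated runs found between neighbouring sorted suffixes."""
    n = len(tokens)
    # Naive suffix ordering: sort start positions by the suffix they begin.
    order = sorted(range(n), key=lambda i: tokens[i:])
    pairs = []
    for a, b in zip(order, order[1:]):
        k = 0  # length of the common prefix of the two suffixes
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k >= min_len:
            left, right = sorted((a, b))
            pairs.append((left, left + k - 1, right, right + k - 1))
    return sorted(pairs)

tokens = "a x y z b x y z c x y d".split()
for pair in repeated_runs(tokens, min_len=2):
    print(pair)
```

A suffix tree reaches the same matches without sorting or re-comparing prefixes, which is where the linear behaviour comes from.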
Leading Token Restriction Optimization
• Identifying as clones code portions which begin and end in the middle of statements is not that useful.
• Leading tokens at the beginning of clones are restricted to labels or keywords that either initiate or terminate statements.
• Leading token restriction reduces the number of nodes in the suffix tree to one third in the C, C++, and Java case studies.
• Very important restriction to make the tool scalable.
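The effect of the restriction can be sketched as a filter on the positions where a clone is allowed to start. The set `LEADING` below is a small hypothetical sample of statement-initiating keywords, and `$p` is an assumed placeholder token; the real tool's token set is language specific.

```python
# Hypothetical sample of tokens allowed to begin a clone.
LEADING = {"if", "for", "while", "return", "case", "switch", "break"}

def restricted_starts(tokens, leading=LEADING):
    """Positions whose suffix would be inserted under the restriction."""
    return [i for i, tok in enumerate(tokens) if tok in leading]

tokens = "if ( $p > $p ) return $p ; return $p ;".split()
starts = restricted_starts(tokens)
print(starts)
print(len(starts), "of", len(tokens), "suffixes kept")
```

Only the suffixes beginning at these positions need to be stored, which is why the structure shrinks so sharply.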
Repeated Code Removal Optimization
a1 switch (c) {
a2 case '0': value = 0; break;
a3 case '1': value = 1; break;
a4 case '2': value = 2; break;
a5 case '3': value = 3; break;
a6 case '4': value = 4; break;
a7 }
b1 case 'a':
b2 flag = 2;
b3 break;
• The clone class {a2, a3, a4, a5, a6, b1-b3} will be detected.
• 6C2 = 15 clone pairs
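The count of 15 follows from choosing 2 of the 6 portions in the class. A short check, using the portion labels from the listing:

```python
from itertools import combinations
from math import comb

clone_class = ["a2", "a3", "a4", "a5", "a6", "b1-b3"]

# Every unordered pair of portions in a clone class is a clone pair.
pairs = list(combinations(clone_class, 2))
print(len(pairs), comb(len(clone_class), 2))

# The number of pairs grows quadratically with the size of the class.
for pop in (6, 10, 20):
    print(pop, comb(pop, 2))
```

This quadratic growth is why a long run of near-identical statements can flood the output with pairs.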
Repeated Code Removal Optimization
• To reduce the number of clone pairs, when building a suffix tree, after the first identification (repetition of a2 at a3), succeeding repetitions are not inserted.
• Clone pair (a2,b1-b3) is still reported.
• Repeated code removal is also said to stop reporting of self clones, e.g. (a2-a5,a3-a6).
Andy asks: How would you test that this is working?
Token Concatenation Optimization
• Abutting tokens that are not punctuator keywords are joined together.
• The token sequence is made shorter in exchange for greater variation in what a token stands for.
Andy asks: What if this optimization was not applied?
Clones in the JDK 1.3.0 (>= 30 tokens)
[scatter plot; labeled regions: java/awt/*.java, javax/swing/*.java, org/omg/CORBA/*.java]
Clones in the JDK 1.3.0 (>= 30 tokens)
• 570k lines in 1877 files.
• CCFinder took 3 minutes on a PC.
• Files in the same directory are next to one another on the diagram axes.
• Most line segments look like dots because of the scale of the graph.
• Most cloning is near the main diagonal, which means most clones occur within a file or between neighbouring directories.
Similar source files in the JDK 1.3.0
• These section D files are identical apart from lines 32, 161, 163.
Longest clone in the JDK 1.3.0
• 1647 tokens, 627 lines
• WindowsFileChooserUI.java and MetalFileChooserUI.java each have nine internal classes, one constructor and 45 methods.
• All but three methods are clones.
Effects Of Rules And Preprocessing Techniques
• Disabling various rules and techniques has dramatic effects on the number of clone pairs and classes detected.
Andy asks: What is the best combination to use?
Population And Length Of Clone Classes, JDK 1.3.0
[plot; axes: POP and LEN (tokens)]
Clone Classes Of Top 5% DFL
• Source file investigation reveals various kinds of cloning:
  • sequence of several methods
  • single method body
  • source files generated by tool
  • routines within a method
  • entire class body
• Evidence points to different kinds of copy-and-paste style reuse in the JDK.
POP And RAD In JDK 1.3.0
[plot of POP and RAD for clones of over 20 transformed tokens; labeled groups: exception classes, swing]
Conclusions
• Tools to detect clones are themselves complex pieces of software.
• Clone detection in CCFinder is sensitive to the rules, techniques, and clone threshold size employed.
• CCFinder has been successfully used to detect clones in the JDK 1.3.0.
• As software systems get even bigger, clone detection will play an increasingly important part in code reengineering.