Aries: Refactoring Support Environment Based on Code Clone Analysis. Yoshiki Higo, Toshihiro Kamiya, Shinji Kusumoto, Katsuro Inoue Graduate School of Information Science and Technology, Osaka University Presto, Japan Science and Technology Agency
Contact: {y-higo,kamiya,kusumoto,inoue}@ist.osaka-u.ac.jp

Background
• What is a code clone?
  • A code fragment that has identical or similar fragments in the same or different files in a system
  • Introduced into source programs for various reasons, such as reusing code by copy-and-paste
  • Makes software maintenance more difficult

Requirements for Code Clone Detection
• The code clones that are detected should match the purpose of the analysis.
  • To understand the amount and distribution of code clones, it is desirable to detect all code clones.
  • To remove code clones (restructuring or refactoring), it is useful to detect the code clones that can be removed and whose removal improves software maintainability.

Research Objective and Approach
• We aim to extract code clones that can be easily refactored.
• Approach
  • To detect code clones efficiently, we use a code clone detection tool, CCFinder.
  • Then, we extract the specific code clones that can be easily refactored and provide applicable refactoring patterns for them.
  • Finally, we develop a refactoring support tool and apply it to an open source program.

Refactoring Process Support
• Commonly used refactoring process
  Step 1: Determine where refactoring should be applied
  Step 2: Determine which refactoring patterns can/should be applied
  Step 3: Investigate the effectiveness of the refactoring patterns
  Step 4: Modify source code
  Step 5: Conduct regression tests
• The proposed method supports Steps 1 and 2
  • High scalability: it avoids analyses with high time complexity.
  • Fine-grained detection: it detects code clones at a finer granularity than the method unit.

Outline of CCFinder
• CCFinder compares source code directly, token by token, and detects code clones
  • Normalization of name space
  • Replacement of user-defined names
  • Removal of table initialization
  • Consideration of module delimiters
• CCFinder can analyze systems on the scale of millions of lines in practical time

CCFinder: Clone Detection Process
Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Clone pairs

Sample input:

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }

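The lexical analysis, transformation, and match detection stages can be illustrated with a minimal sketch. It is not CCFinder's implementation: the class name `TokenCloneSketch`, the toy regex lexer, the small keyword subset, and the quadratic longest-common-run matching (CCFinder itself uses suffix-tree matching) are all assumptions made for illustration. The sketch only replaces user-defined names with `$`; it does not normalize name space, so `Sample.parseNumber` and `parseNumber` would still differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCloneSketch {
    // Small keyword subset that must survive normalization (assumption: not the full Java keyword list).
    static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("int", "for", "if", "new", "static", "void"));

    // Toy lexer: string literals, identifiers, numbers, two multi-character operators, then any single symbol.
    static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*\"|[A-Za-z_]\\w*|\\d+|\\+\\+|\\+=|\\S");

    // Lexical analysis plus transformation: every user-defined name becomes "$".
    static List<String> normalize(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            boolean identifier = Character.isJavaIdentifierStart(t.charAt(0));
            out.add(identifier && !KEYWORDS.contains(t) ? "$" : t);
        }
        return out;
    }

    // Match detection: length of the longest common contiguous token run (O(n*m) dynamic programming).
    static int longestCommonRun(List<String> a, List<String> b) {
        int best = 0;
        int[][] run = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    run[i][j] = run[i - 1][j - 1] + 1;
                    best = Math.max(best, run[i][j]);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> foo = normalize("for (int i = 0; i < a.length; ++i) if (pat.match(a[i]))");
        List<String> goo = normalize("for (int i = 0; i < a.length; ++i) if (exp.match(a[i]))");
        // After normalization the two loops are the same token sequence.
        System.out.println(foo.equals(goo));
        System.out.println(longestCommonRun(foo, goo) + " of " + foo.size() + " tokens match");
    }
}
```

A real detector would also report where each common run starts and ends, and would keep only runs above a minimum token length.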
Definitions: Clone Pair and Clone Set
• Clone Pair: a pair of identical or similar fragments
• Clone Set: a set of identical or similar fragments
• CCFinder detects code clones as clone pairs
• After the detection process, clone pairs are transformed into clone sets
(Figure: code fragments C1, C2, C3, C4, C5)

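The transformation from clone pairs into clone sets can be sketched with a union-find structure. This assumes that the clone relation is treated as transitive, so that a clone set is a connected group of fragments; the class name `CloneSetBuilder` and the pairing of C1 to C5 in `main` are hypothetical and not taken from the slide's figure.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

class CloneSetBuilder {
    // Maps each fragment to its parent in the union-find forest.
    private final Map<String, String> parent = new HashMap<>();

    // Find the representative fragment of the set containing f (with path compression).
    private String find(String f) {
        parent.putIfAbsent(f, f);
        String p = parent.get(f);
        if (!p.equals(f)) {
            p = find(p);
            parent.put(f, p);
        }
        return p;
    }

    // Record one clone pair: both fragments end up in the same clone set.
    void addPair(String f1, String f2) {
        String r1 = find(f1);
        String r2 = find(f2);
        parent.put(r1, r2);
    }

    // Group all known fragments by their representative.
    Collection<TreeSet<String>> cloneSets() {
        Map<String, TreeSet<String>> sets = new HashMap<>();
        for (String f : new ArrayList<>(parent.keySet())) {
            sets.computeIfAbsent(find(f), k -> new TreeSet<>()).add(f);
        }
        return sets.values();
    }

    public static void main(String[] args) {
        CloneSetBuilder builder = new CloneSetBuilder();
        builder.addPair("C1", "C2"); // hypothetical clone pairs
        builder.addPair("C2", "C3");
        builder.addPair("C4", "C5");
        System.out.println(builder.cloneSets());
    }
}
```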
Extraction of Code Clones That Are Easily Refactored
• Structural code clones are regarded as the targets of refactoring
  • Detect clone pairs with CCFinder
  • Transform the detected clone pairs into clone sets
  • Extract structural parts as structural code clones from the detected clone sets
• What is a structural code clone?
  • Example: Java language
    • Declaration: class declaration, interface declaration
    • Method: method body, constructor, static initializer
    • Statement: do, for, if, switch, synchronized, try, while

Example: code clones which CCFinder detects vs. code clones which the proposed method detects

fragment 1

    609: reset();
    610: grammar = g;
    611: // Lookup make-switch threshold in the grammar generic options
    612: if (grammar.hasOption("codeGenMakeSwitchThreshold")) {
    613:     try {
    614:         makeSwitchThreshold = grammar.getIntegerOption("codeGenMakeSwitchThreshold");
    615:         //System.out.println("setting codeGenMakeSwitchThreshold to " + makeSwitchThreshold);
    616:     } catch (NumberFormatException e) {
    617:         tool.error(
    618:             "option 'codeGenMakeSwitchThreshold' must be an integer",
    619:             grammar.getClassName(),
    620:             grammar.getOption("codeGenMakeSwitchThreshold").getLine()
    621:         );
    622:     }
    623: }
    624:
    625: // Lookup bitset-test threshold in the grammar generic options
    626: if (grammar.hasOption("codeGenBitsetTestThreshold")) {
    627:     try {
    628:         bitsetTestThreshold = grammar.getIntegerOption("codeGenBitsetTestThreshold");

fragment 2

    623: }
    624:
    625: // Lookup bitset-test threshold in the grammar generic options
    626: if (grammar.hasOption("codeGenBitsetTestThreshold")) {
    627:     try {
    628:         bitsetTestThreshold = grammar.getIntegerOption("codeGenBitsetTestThreshold");
    629:         //System.out.println("setting codeGenBitsetTestThreshold to " + bitsetTestThreshold);
    630:     } catch (NumberFormatException e) {
    631:         tool.error(
    632:             "option 'codeGenBitsetTestThreshold' must be an integer",
    633:             grammar.getClassName(),
    634:             grammar.getOption("codeGenBitsetTestThreshold").getLine()
    635:         );
    636:     }
    637: }
    638:
    639: // Lookup debug code-gen in the grammar generic options
    640: if (grammar.hasOption("codeGenDebug")) {
    641:     Token t = grammar.getOption("codeGenDebug");
    642:     if (t.getText().equals("true")) {

Example: code clones which CCFinder detects

fragment 3

    1007: if ( inputState.guessing==0 ) {
    1008:     buf.append(a.getText());
    1009: }
    1010: {
    1011: _loop144:
    1012: do {
    1013:     if ((LA(1)==WILDCARD)) {
    1014:         match(WILDCARD);
    1015:         a=id();
    1016:         if ( inputState.guessing==0 ) {
    1017:             buf.append('.'); buf.append(a.getText());
    1018:         }
    1019:     }

fragment 4

    1527: if ( inputState.guessing==0 ) {
    1528:     t=a.getText();
    1529: }
    1530: {
    1531: _loop84:
    1532: do {
    1533:     if ((LA(1)==COMMA)) {
    1534:         match(COMMA);
    1535:         id();
    1536:         if ( inputState.guessing==0 ) {
    1537:             t+=","+b.getText();
    1538:         }
    1539:     }

Provision of Applicable Refactoring Patterns
• The following refactoring patterns [1][2] can be used to remove clone sets that include structural code clones
  • Extract Class
  • Extract Method
  • Extract Super Class
  • Form Template Method
  • Move Method
  • Parameterize Method
  • Pull Up Constructor
  • Pull Up Method
• For each clone set, the proposed method determines which refactoring pattern is applicable by using several metrics.
[1] M. Fowler: Refactoring: Improving the Design of Existing Code, Addison-Wesley, 1999.
[2] http://www.refactoring.com/, 2004.

Metrics (1): Volume Metrics for Clone Set (LEN, POP, DFL)
• LEN(S): the average length of the token sequences of the fragments in a clone set S
• POP(S): the number of elements (code fragments) of a clone set S
• DFL(S): an estimate of how many tokens would be removed from the source files if all code fragments in a clone set S were reconstructed as a new subroutine plus caller statements

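The three volume metrics can be sketched as follows. LEN and POP follow the definitions above directly. The slide gives no formula for DFL, so the one used here is an assumption: the tokens before refactoring are LEN × POP, and the tokens afterwards are one subroutine body (LEN) plus POP caller statements of `callTokens` tokens each. The class name `VolumeMetrics` and the parameter `callTokens` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class VolumeMetrics {
    // LEN(S): average token length of the fragments in the clone set.
    static double len(List<Integer> fragmentTokenCounts) {
        double sum = 0;
        for (int count : fragmentTokenCounts) {
            sum += count;
        }
        return sum / fragmentTokenCounts.size();
    }

    // POP(S): number of fragments in the clone set.
    static int pop(List<Integer> fragmentTokenCounts) {
        return fragmentTokenCounts.size();
    }

    // DFL(S), assumed form: tokens before refactoring minus tokens after it.
    static double dfl(List<Integer> fragmentTokenCounts, int callTokens) {
        double len = len(fragmentTokenCounts);
        int pop = pop(fragmentTokenCounts);
        double before = len * pop;
        double after = len + (double) callTokens * pop;
        return before - after;
    }

    public static void main(String[] args) {
        // Three fragments of 40 tokens each; a caller statement is assumed to cost 5 tokens.
        List<Integer> counts = Arrays.asList(40, 40, 40);
        System.out.println(len(counts) + " " + pop(counts) + " " + dfl(counts, 5));
    }
}
```

Under these assumptions, three 40-token fragments give 120 tokens before and 40 + 15 = 55 tokens after, so the estimate is 65 tokens removed.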
Metrics (2): Coupling Metrics for Clone Set (NRV, NSV)
• NRV(S): the average number of externally defined variables referred to in the fragments of a clone set S
• NSV(S): the average number of externally defined variables assigned to in the fragments of a clone set S
• Definition
  • Clone set S includes fragments f1, f2, ..., fn
  • si is the number of externally defined variables that fragment fi refers to
  • ti is the number of externally defined variables that fragment fi assigns to
• Example
  • Clone set S includes fragments f1 and f2.
  • In fragment f1, the externally defined variable b is referred to and a is assigned to.
  • Fragment f2 is the same as f1.
  • Then NRV(S) = (1 + 1) / 2 = 1 and NSV(S) = (1 + 1) / 2 = 1

Fragment f1:

    int a, b;
    …
    if ( … ) {
        …;
        … = b;    (reference)
        a = …;    (assignment)
        …;
    }
    …

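The formulas themselves did not survive in this transcript. Read as plain averages of the per-fragment counts si and ti, which is what the worked example (1 + 1) / 2 = 1 suggests, they would take the following form. This is a reconstruction from the surrounding definitions, not a quotation of the original slide.

```latex
\mathrm{NRV}(S) = \frac{1}{n} \sum_{i=1}^{n} s_i,
\qquad
\mathrm{NSV}(S) = \frac{1}{n} \sum_{i=1}^{n} t_i
```

With n = 2 and s_1 = s_2 = t_1 = t_2 = 1, both expressions evaluate to 1, in agreement with the example.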
Metrics (3): Inheritance Metric for Clone Set (DCH)
• DCH(S): represents the position of, and distance between, the fragments of a clone set S in the class hierarchy
• Definition
  • Clone set S includes fragments f1, f2, ..., fn
  • Fragment fi exists in class Ci
  • Class Cp is the common parent class of C1, C2, ..., Cn located at the lowest position in the class hierarchy
  • If no common parent class of C1, C2, ..., Cn exists, the value of DCH(S) is -1
  • This metric is measured only for the class hierarchy in which the target software exists.
• Example 1: Clone set S includes fragments f1 and f2. If all fragments of clone set S are included in the same class, then DCH(S) = 0.
• Example 2: Clone set S includes fragments f1 and f2. If all fragments of clone set S are included in a class and its direct child classes, then DCH(S) = 1.

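A minimal sketch of how DCH could be computed. The slide's formula is missing from this transcript, so the sketch assumes that DCH(S) is the largest distance from any Ci up to the lowest common parent class Cp, which is consistent with the examples (0, 1, and -1). The class name `DchSketch` and the `parentOf` map, which is meant to contain only classes of the target software, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DchSketch {
    // Chain from a class up to its topmost ancestor inside the target software, starting with the class itself.
    // Assumes the parent map is acyclic.
    static List<String> ancestors(String cls, Map<String, String> parentOf) {
        List<String> chain = new ArrayList<>();
        for (String c = cls; c != null; c = parentOf.get(c)) {
            chain.add(c);
        }
        return chain;
    }

    // Assumed definition: maximum distance from any class to the lowest common parent, or -1 if there is none.
    static int dch(List<String> classes, Map<String, String> parentOf) {
        // Walking upward from the first class, the first ancestor shared by all classes is the lowest common one.
        for (String candidate : ancestors(classes.get(0), parentOf)) {
            int maxDistance = 0;
            boolean common = true;
            for (String cls : classes) {
                int distance = ancestors(cls, parentOf).indexOf(candidate);
                if (distance < 0) {
                    common = false;
                    break;
                }
                maxDistance = Math.max(maxDistance, distance);
            }
            if (common) {
                return maxDistance;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("B", "A"); // class B extends A
        parentOf.put("C", "A"); // class C extends A
        System.out.println(dch(Arrays.asList("A", "A"), parentOf)); // both fragments in the same class
        System.out.println(dch(Arrays.asList("B", "C"), parentOf)); // fragments in sibling classes
        System.out.println(dch(Arrays.asList("B", "X"), parentOf)); // X is unrelated to A
    }
}
```

Leaving library classes such as `java.lang.Object` out of `parentOf` is what keeps unrelated classes from sharing a common parent.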
Aries: Refactoring Support Tool (Overview)
• Target: Java programs
• Runtime environment: JDK 1.4 or above
• Implementation
  • Analysis component: 32,000 lines of Java
    • CCFinder is used as the code clone detection component
    • JavaCC is used to construct the syntax and semantic analysis component
  • GUI component: 14,000 lines of Java
• Users can specify target clone sets through GUI operations.

Case Study: Ant (Overview)
• Ant is a build tool like 'make'
• Input for Aries
  • Source files of Ant: 627
  • LOC: about 180,000
• It took 30 seconds to extract structural code clones
• We got 151 clone sets.
• Environment
  • OS: FreeBSD 4.9
  • CPU: Xeon 2.8 GHz x 2
  • Memory: 4 GB

Case Study: Ant, Extract Method (conditions)
• To apply the 'Extract Method' pattern, we filtered clone sets using the following conditions
  • The unit of the clone is a statement (do, for, if, …)
  • DCH(S) = 0
    • All fragments of a clone set are included in one class
  • NSV(S) < 2
    • Each fragment of a clone set assigns a value to at most one externally defined variable.
• 32 clone sets satisfied these conditions

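The three conditions above can be written as a simple predicate over clone sets. This is only a sketch of the filtering idea, not Aries code; `ExtractMethodFilter`, `CloneSetInfo`, and its fields are hypothetical names, and the statement units are the ones listed earlier for Java.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExtractMethodFilter {
    // Hypothetical record of one clone set together with its measured metric values.
    static class CloneSetInfo {
        final String unit; // structural unit, e.g. "if" or "method body"
        final int dch;
        final double nsv;

        CloneSetInfo(String unit, int dch, double nsv) {
            this.unit = unit;
            this.dch = dch;
            this.nsv = nsv;
        }
    }

    static final List<String> STATEMENT_UNITS =
            Arrays.asList("do", "for", "if", "switch", "synchronized", "try", "while");

    // Statement unit, all fragments in one class (DCH = 0), and NSV below 2.
    static boolean isCandidate(CloneSetInfo s) {
        return STATEMENT_UNITS.contains(s.unit) && s.dch == 0 && s.nsv < 2;
    }

    static List<CloneSetInfo> filter(List<CloneSetInfo> all) {
        List<CloneSetInfo> kept = new ArrayList<>();
        for (CloneSetInfo s : all) {
            if (isCandidate(s)) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<CloneSetInfo> all = Arrays.asList(
                new CloneSetInfo("if", 0, 1.0),           // satisfies all conditions
                new CloneSetInfo("if", 1, 0.0),           // fragments spread over a class hierarchy
                new CloneSetInfo("method body", 0, 0.0)); // not a statement unit
        System.out.println(filter(all).size() + " candidate clone set(s)");
    }
}
```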
Case Study: Ant, Extract Method (result)
• The 32 clone sets can be categorized as follows (labels in the original figure: "assignment", "local variable")

    if (!isChecked()) {
        // make sure we don't have a circular reference here
        Stack stk = new Stack();
        stk.push(this);
        dieOnCircularReference(stk, getProject());
    }

    if (name == null) {
        if (other.name != null) {
            return false;
        }
    } else if (!name.equals(other.name)) {
        return false;
    }

    if (iSaveMenuItem == null) {
        try {
            iSaveMenuItem = new MenuItem();
            iSaveMenuItem.setLabel("Save BuildInfo To Repository");
        } catch (Throwable iExc) {
            handleException(iExc);
        }
    }

    // javacopts
    if (javacopts != null && !javacopts.equals("")) {
        genicTask.createArg().setValue("-javacopts");
        genicTask.createArg().setLine(javacopts);
    }

Conclusion
• We have
  • proposed a refactoring support method
  • implemented a refactoring support tool, Aries
  • conducted a case study on Ant, an open source program, in which most of the filtered clone sets could be removed.

Future Work
• As future work, we are going to
  • evaluate whether or not each refactoring should be done from the viewpoint of software quality (support for Step 3)
  • find groups of clone sets that can be refactored at once, to conduct refactoring more effectively
• Commonly used refactoring process
  Step 1: Determine where refactoring should be applied
  Step 2: Determine which refactoring patterns can/should be applied
  Step 3: Investigate the effectiveness of the refactoring patterns
  Step 4: Modify source code
  Step 5: Conduct regression tests

Code Clone Detection for Refactoring: Related Work
• Detect similar sub-graphs of the program dependency graph as clones [1].
  • High accuracy: this approach takes data dependence and control dependence in the source code into account.
  • High time complexity: it takes O(n^2) time to construct the program dependency graph.
• Detect similar methods and functions as clones using metrics [2].
  • Low accuracy: if the target method or function is small, the metric values make no difference.
  • Detection unit restriction: only clones at the method or function unit can be detected.
[1] R. Komondoor and S. Horwitz, "Using slicing to identify duplication in source code", In Proc. of the 8th International Symposium on Static Analysis, Paris, France, July 16-18, 2001.
[2] Magdalena Balazinska, Ettore Merlo, Michel Dagenais, Bruno Lague, and Kostas Kontogiannis, "Advanced Clone-Analysis to Support Object-Oriented System Refactoring", WCRE 2000, pp. 98-107.

The Difference between 'diff' and Clone Detection Tools
• Diff finds the longest common subsequence of two inputs.
  • Given a code portion, diff does not report two or more identical code portions (clones).
• A clone detection tool finds all identical or similar code portions.

Suffix Tree
• A suffix tree is a tree that satisfies the following conditions.
  • A leaf node represents the starting position of a sub-string.
  • A path from the root node to a leaf node represents a sub-string.
  • The first characters of the labels of all edges leaving one node are different from each other.
→ A common path means a clone

Example of Transformation Rules in Java
• All user-defined identifiers are transformed to the same token.
• A unique identifier is inserted at each end of the top-level definitions and declarations.
  • Prevents detecting clones that begin in the middle of one class definition and end in the middle of another.
• "java.lang.Math.PI" is transformed to "Math.PI".
  • With import statements, a class can be referred to by either its full package name or a shorter name.
• "new int[] {1, 2, 3}" is transformed to "new int[] {$}".
  • Eliminates table initialization code.

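Two of these rules can be approximated with regular expressions, as a rough illustration of their effect. CCFinder applies its rules to token sequences, not to raw text, so this is not how the tool works internally. The class name `TransformRulesSketch` is hypothetical, and the package-stripping rule relies on the naming convention that package segments are lower case and type names start with an upper-case letter.

```java
class TransformRulesSketch {
    // Rule: drop package qualifiers, e.g. "java.lang.Math.PI" becomes "Math.PI".
    // Heuristic (assumption): lower-case dotted segments followed by an upper-case name are a package prefix.
    static String stripPackage(String code) {
        return code.replaceAll("\\b(?:[a-z_][a-z0-9_]*\\.)+(?=[A-Z])", "");
    }

    // Rule: collapse array initializer contents, e.g. "new int[] {1, 2, 3}" becomes "new int[] {$}".
    // Only handles initializers without nested braces.
    static String collapseInitializer(String code) {
        return code.replaceAll("(new\\s+[\\w.]+\\s*\\[\\s*\\]\\s*)\\{[^{}]*\\}", "$1{\\$}");
    }

    public static void main(String[] args) {
        System.out.println(stripPackage("double r = java.lang.Math.PI;"));
        System.out.println(collapseInitializer("int[] t = new int[] {1, 2, 3};"));
        System.out.println(stripPackage("org.apache.regexp.RE pat = new org.apache.regexp.RE(\"[0-9,]+\");"));
    }
}
```

The last call shows why the rule matters for the earlier `foo`/`goo` sample: once the qualifier is removed, the two declarations differ only in user-defined names.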
The Output of CCFinder
• Output of CCFinder

    #version: ccfinder 3.1
    #langspec: JAVA
    #option: -b 30,1
    #option: -k +
    #option: -r abcdfikmnprsv
    #option: -c wfg
    #begin{file description}
    0.0 52 C:\Gemini.java
    0.1 94 C:\GeneralManager.java
    :
    :
    #end{file description}
    #begin{clone}
    0.1 53,9 63,13 1.10 542,9 553,13 35
    0.1 53,9 63,13 1.10 624,9 633,13 35
    0.2 124,9 152,31 0.2 154,9 216,51 42
    :
    :
    #end{clone}

• Annotations in the original figure
  • Object file ID (file 0 in Group 0)
  • Location of a clone pair (lines 53 - 63 in file 0.1 and lines 542 - 553 in file 1.10 are identical or similar to each other)
• It is difficult to analyze source code using only this text-based information about the locations of clone pairs.

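A line in the clone section can be read into a small record, as sketched below. The field meanings are inferred from the annotation on the slide (file ID, then start and end positions, for each side of the pair); what the number after each comma and the trailing number mean is not stated here, so the sketch ignores them. `ClonePairLine` and its field names are hypothetical.

```java
class ClonePairLine {
    final String fileA;
    final int startLineA;
    final int endLineA;
    final String fileB;
    final int startLineB;
    final int endLineB;

    ClonePairLine(String fileA, int startLineA, int endLineA,
                  String fileB, int startLineB, int endLineB) {
        this.fileA = fileA;
        this.startLineA = startLineA;
        this.endLineA = endLineA;
        this.fileB = fileB;
        this.startLineB = startLineB;
        this.endLineB = endLineB;
    }

    // Takes the line number from a position such as "53,9" (the part after the comma is ignored).
    private static int lineOf(String position) {
        return Integer.parseInt(position.split(",")[0]);
    }

    // Parses one line such as "0.1 53,9 63,13 1.10 542,9 553,13 35".
    static ClonePairLine parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 6) {
            throw new IllegalArgumentException("not a clone pair line: " + line);
        }
        return new ClonePairLine(fields[0], lineOf(fields[1]), lineOf(fields[2]),
                                 fields[3], lineOf(fields[4]), lineOf(fields[5]));
    }

    public static void main(String[] args) {
        ClonePairLine pair = parse("0.1 53,9 63,13 1.10 542,9 553,13 35");
        System.out.println("file " + pair.fileA + " lines " + pair.startLineA + "-" + pair.endLineA
                + " ~ file " + pair.fileB + " lines " + pair.startLineB + "-" + pair.endLineB);
    }
}
```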
The Analysis of Comparison among Students (non-gapped clones only)
• The corresponding code
  • A (2 students): Similar code fragments came from the source code of a sample compiler described in a textbook.
  • B (4 students): Many code fragments were similar even with respect to variable names and comments.

Clone Class Metrics
• LEN(C): length of the token sequence of each element in clone class C
• LNR(C): length of the non-repetitive token sequence of LEN(C)
• POP(C): number of elements in clone class C
• DFL(C): estimate of how many tokens would be removed from the source files if all code fragments of clone class C were replaced with caller statements of a new identical routine
• RAD(C): distribution in the file system of the elements in clone class C

Comparison with the AST Approach
• Features of the AST approach
  • Extracts identical sub-trees of the AST as clones
  • The result is precise because of strict syntax analysis.
  • High space and time complexity
• Features of our approach
  • A hybrid of CCFinder's quick but inaccurate clone detection and CCShaper's filtering, which considers syntax structure.

The Other Approaches
• AST (abstract syntax tree) approach
  • Clone = identical sub-trees in an AST
  • Deep dependence on the programming language
• PDG (program dependency graph) approach
  • Clone = identical sub-graphs in a PDG
  • Graph comparison is difficult
• Code metrics
  • Clone = routines that have the same metric values
  • Severe restriction in granularity
• CCFinder & CCShaper
  • Clone = code fragments that have the same syntax structure
  • Limited precision

Why I chose "a"
• I selected the clones by the following criteria
  • All clone code fragments appear in the same class
  • The metric LEN is high
  • The code fragment includes a whole method body

Metrics (3): Inheritance Metric for Clone Set (DCH), with examples
• DCH(S): represents the position of, and distance between, the fragments of a clone set S in the class hierarchy
• Definition
  • Clone set S includes fragments f1, f2, ..., fn
  • Fragment fi exists in class Ci
  • Class Cp is the common parent class of C1, C2, ..., Cn located at the lowest position in the class hierarchy
  • If no common parent class of C1, C2, ..., Cn exists, the value of DCH(S) is -1
  • This metric is measured only for the class hierarchy in which the target software exists.
• Example 1: Clone set S includes fragments f1 and f2. If all fragments of clone set S are included in the same class, then DCH(S) = 0.
• Example 2: Clone set S includes fragments f1 and f2. If all fragments of clone set S are included in a class and its direct child classes, then DCH(S) = 1.
• Example 3: Clone set S includes fragments f1 and f2. If the classes that include f1 and f2 have no common parent class, then DCH(S) = -1.