Finding Code Clones for Refactoring with Clone Metrics: A Case Study of Open Source Software. Eunjong Choi†, Norihiro Yoshida‡, Takashi Ishio†, Katsuro Inoue†, and Tateki Sano*. †Osaka University, Japan; ‡Nara Institute of Science and Technology, Japan; *NEC Corporation, Japan.
Contents • Background • Clone Metrics • Industrial Case Study • Case Study of Open Source Software • Summary and Future Work
Background: Code Clone • Identical or similar code fragments in source code • The presence of code clones • is an indication of low software maintainability • if a bug is found in one code clone, the other code clones have to be checked for the same defect.
Background: Refactoring [Fowler1999] (1/2) • Refactoring is the process of restructuring existing code. • It alters the software's internal structure without changing its external behavior • It improves the maintainability of software [Fowler1999] M. Fowler, et al., Refactoring: Improving the Design of Existing Code, Addison-Wesley, 1999.
Background: Refactoring [Fowler1999] (2/2) • Refactoring code clones • Merge code clones into a single program unit and replace each clone with a call statement [Fowler1999] M. Fowler, et al., Refactoring: Improving the Design of Existing Code, Addison-Wesley, 1999.
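As a minimal sketch of merging clones into a single program unit, the following Java fragment shows two similar methods before and after the shared logic is extracted. All class and method names here are hypothetical and are not taken from the case study.

```java
// Hypothetical illustration: names and logic are assumptions made for this sketch.
final class ReportBefore {
    // Clone 1
    static String formatUser(String name) {
        String trimmed = name.trim();
        if (trimmed.isEmpty()) {
            trimmed = "(unknown)";
        }
        return "User: " + trimmed;
    }

    // Clone 2: same logic, only the label differs
    static String formatHost(String name) {
        String trimmed = name.trim();
        if (trimmed.isEmpty()) {
            trimmed = "(unknown)";
        }
        return "Host: " + trimmed;
    }
}

final class ReportAfter {
    // The duplicated fragment is merged into one program unit.
    private static String normalize(String name) {
        String trimmed = name.trim();
        return trimmed.isEmpty() ? "(unknown)" : trimmed;
    }

    // Each former clone is reduced to a call statement.
    static String formatUser(String name) { return "User: " + normalize(name); }
    static String formatHost(String name) { return "Host: " + normalize(name); }
}
```

The external behavior is unchanged: both versions return the same strings for the same inputs, which is the property refactoring is expected to preserve.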
Background: Language-dependent Code Clone • Some code clones unavoidably exist in source code • because of the specifications of the programming language used.
replacement.setTaskType(taskType);
replacement.setTaskName(taskName);
replacement.setLocation(location);
replacement.setOwningTarget(target);
replacement.setRuntime(wrapper);
wrapper.setProxy(replacement);
Example of a language-dependent code clone (consecutive setter invocations)
Background: Clone Set • A set of code clones [Figure: a clone set containing Code Clone 1, Code Clone 2, and Code Clone 3]
Background: Clone Metrics [Higo2007] • Quantitative information on clone sets • E.g., LEN(S), RNR(S), POP(S) • Purposes • To check the features of code clones in software • To extract code clones for several purposes • E.g., the code clones with the greatest length… [Higo2007] Y. Higo, T. Kamiya, S. Kusumoto, K. Inoue, "Method and Implementation for Investigating Code Clones in a Software System", Information and Software Technology, pp. 985-998 (2007-9)
Clone Metrics: LEN(S) • The average length of the token sequences of the code clones in a clone set S • Example: the token sequence [a b b] is detected as a code clone three times, giving the clone set S = {[a b b], [a b b], [a b b]} • LEN(S) = 3
Clone Metrics: RNR(S) • The ratio of non-repeated token sequences of code clones in a clone set S • Selecting clone sets with a high RNR value eliminates language-dependent code clones • Example for the clone set S of [a b b] clones: $\mathrm{RNR}(S) = \frac{\text{length of non-repeated token sequence}}{\text{length of whole token sequence}} \times 100 = \frac{1}{3} \times 100 = 33.3$
Clone Metrics: POP(S) • The number of code clones in a clone set S • Example: a clone set S that contains three code clones has POP(S) = 3
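The three metrics can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the `Clone` class and its per-token `repeated` flags are hypothetical, and the flags are assumed to come from the clone detector rather than being computed here. The example data uses three clones written as (c c*), (c* c*), (c c*), where * marks a token inside a repeated token sequence.

```java
import java.util.Arrays;
import java.util.List;

final class CloneMetricsSketch {
    /** One code clone; repeated[i] is true if token i lies in a repeated token sequence (assumed input). */
    static final class Clone {
        final boolean[] repeated;
        Clone(boolean... repeated) { this.repeated = repeated; }
        int length() { return repeated.length; }
        int nonRepeatedLength() {
            int n = 0;
            for (boolean r : repeated) {
                if (!r) n++;
            }
            return n;
        }
    }

    /** LEN(S): average token length of the clones in S. */
    static double len(List<Clone> s) {
        return s.stream().mapToInt(Clone::length).average().orElse(0.0);
    }

    /** POP(S): number of clones in S. */
    static int pop(List<Clone> s) {
        return s.size();
    }

    /** RNR(S): non-repeated tokens divided by all tokens, as a percentage. */
    static double rnr(List<Clone> s) {
        int whole = s.stream().mapToInt(Clone::length).sum();
        int nonRepeated = s.stream().mapToInt(Clone::nonRepeatedLength).sum();
        return whole == 0 ? 0.0 : 100.0 * nonRepeated / whole;
    }

    /** Example clone set: (c c*), (c* c*), (c c*). */
    static List<Clone> example() {
        return Arrays.asList(new Clone(false, true), new Clone(true, true), new Clone(false, true));
    }

    public static void main(String[] args) {
        List<Clone> s = example();
        // LEN = 2, POP = 3, RNR = (1 + 0 + 1) / (2 + 2 + 2) * 100, about 33.3
        System.out.println(len(s) + " " + pop(s) + " " + rnr(s));
    }
}
```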
Single Clone Metric (1/3) • Clone sets whose LEN(S) is higher • They include many consecutive if (or if-else) blocks • that involve similar but different conditional expressions.
if ((p = getProject().getProperty("ant.netrexxc.binary")) != null) {
    this.binary = Project.toBoolean(p);
}
// classpath makes no sense
if ((p = getProject().getProperty("ant.netrexxc.comments")) != null) {
    this.comments = Project.toBoolean(p);
}
…………The last part is omitted……………………
Code clone in a clone set whose POP(S) is the highest in Ant 1.7.0
Single Clone Metric (2/3) • Clone sets whose RNR(S) is higher • They do not form a single semantic unit • Semantic unit: a group of instructions that together implement a single functionality
else {
    // is the zip file in the cache
    ZipFile zipFile = (ZipFile) zipFiles.get(file);
    if (zipFile == null) {
        zipFile = new ZipFile(file);
        zipFiles.put(file, zipFile);
    }
    ZipEntry entry = zipFile.getEntry(resourceName);
    if (entry != null) {
Code clone in a clone set whose RNR(S) is the second highest in Ant 1.7.0 (only a part of a semantic unit)
Single Clone Metric (3/3) • Clone sets whose POP(S) is higher • They include many language-dependent code clones
out.println("\">");
out.println("");
out.print("<!ELEMENT project (target | ");
out.print(TASKS);
out.print(" | ");
out.print(TYPES);
Code clone in a clone set whose POP(S) is higher than others
Key Idea • According to our experience, it is not appropriate to extract code clones for refactoring using just a single clone metric • We propose a method based on combined clone metrics • To address the weaknesses of single-metric-based extraction
Combined Clone Metrics • Clone sets whose RNR(S) and POP(S) are both higher • Each code clone forms a single semantic unit
if (ifProperty != null && p.getProperty(ifProperty) == null) {
    return false;
} else if (unlessProperty != null && p.getProperty(unlessProperty) != null) {
    return false;
}
return true;
}
Appropriate for refactoring!
Code clone in a clone set whose RNR(S) and POP(S) are higher than others
Industrial Case Study (1/2) • Goal: validating our key idea • Using combined clone metrics is a feasible method to extract code clones for refactoring • Target system • Industrial Java software developed by NEC • 110 KLOC, 736 clone sets
Industrial Case Study (2/2) • Experimental steps • Selected 62 clone sets from CCFinder's output using clone metrics. • Conducted a survey about these clone sets and got feedback from a developer. [Figure: source files → CCFinder → clone sets selected using clone metrics → survey → feedback]
Subject Code Clones (1/2) • Clone sets for which a single clone metric value is high • SLEN: the 10 clone sets with the highest LEN(S) values • SRNR: the 10 clone sets with the highest RNR(S) values • SPOP: the 10 clone sets with the highest POP(S) values
Subject Code Clones (2/2) • Clone sets whose combined clone metric values are high • SLEN∙RNR: 15 clone sets whose LEN(S) and RNR(S) values both rank in the top 15 • SLEN∙POP: 7 clone sets whose LEN(S) and POP(S) values both rank in the top 15 • SRNR∙POP: 18 clone sets whose RNR(S) and POP(S) values both rank in the top 15 • SLEN∙RNR∙POP: 1 clone set whose LEN(S), RNR(S), and POP(S) values all rank in the top 15
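One way to read "rank in the top 15 for both metrics" is as the intersection of two top-k rankings. The Java sketch below illustrates that reading only; it is not the tooling used in the study. The `CloneSet` class, the example values, and the tie handling (ties keep input order) are all assumptions introduced here.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

final class CombinedSelectionSketch {
    /** Hypothetical record of one clone set and its metric values. */
    static final class CloneSet {
        final String id;
        final double len;
        final double rnr;
        final int pop;
        CloneSet(String id, double len, double rnr, int pop) {
            this.id = id; this.len = len; this.rnr = rnr; this.pop = pop;
        }
    }

    /** Ids of the k clone sets with the highest value of the given metric. */
    static Set<String> topK(List<CloneSet> sets, ToDoubleFunction<CloneSet> metric, int k) {
        return sets.stream()
                .sorted(Comparator.<CloneSet>comparingDouble(metric).reversed())
                .limit(k)
                .map(cs -> cs.id)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Ids of the clone sets that are in the top k for every given metric. */
    @SafeVarargs
    static Set<String> combined(List<CloneSet> sets, int k, ToDoubleFunction<CloneSet>... metrics) {
        Set<String> result = null;
        for (ToDoubleFunction<CloneSet> metric : metrics) {
            Set<String> top = topK(sets, metric, k);
            if (result == null) {
                result = top;
            } else {
                result.retainAll(top); // keep only ids present in both rankings
            }
        }
        return result == null ? new LinkedHashSet<String>() : result;
    }

    /** Made-up example values, not data from the case study. */
    static List<CloneSet> example() {
        return Arrays.asList(
                new CloneSet("A", 50, 90, 2),
                new CloneSet("B", 10, 80, 9),
                new CloneSet("C", 40, 30, 8),
                new CloneSet("D", 5, 20, 3));
    }

    public static void main(String[] args) {
        // With k = 2: top RNR is {A, B}, top POP is {B, C}, so the combined set is {B}.
        System.out.println(combined(example(), 2, cs -> cs.rnr, cs -> cs.pop));
    }
}
```

Passing three metric functions instead of two gives the analogue of SLEN∙RNR∙POP; with the example values above that intersection is empty.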
In Survey: About Clone Set XXX Q. Which practice is appropriate for this clone set? [] Perform refactoring [] Write comments about code clones, but don't perform refactoring. [] Change nothing. [] Others. ( ) • A check (√) on "Perform refactoring" = appropriate for refactoring • A check (√) on any of the other three options = inappropriate for refactoring
Results of Case Study (1/2) • #Selected Clone Sets: the number of selected clone sets • #Refactoring: the number of clone sets marked as "Perform refactoring" in the survey
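Results of Case Study (2/2) • Precision: "How many refactoring candidates were accepted by a developer?" • $\mathrm{Precision} = \frac{\#\mathrm{Refactoring}}{\#\mathrm{Selected\ Clone\ Sets}}$ • Clone sets selected with combined clone metrics were accepted as refactoring candidates by the developer more often than those selected with a single metric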
Case Study of Open Source Software • Goal: validating our key idea • Using combined clone metrics is a feasible method to extract code clones for refactoring • Using open source software • Experimental steps • Selected clone sets from CCFinder's output using clone metrics. • Checked whether the selected clone sets are appropriate for refactoring.
Target systems • Implemented in Java • Apache Ant: • 198 KLOC, 998 clone sets • JBoss: • 633 KLOC, 4284 clone sets
Subject clone sets • Apache Ant: 87 clone sets • JBoss: 299 clone sets • Clone sets for which a single clone metric value ranks in the top 10 • Clone sets whose combined clone metric values all rank in the top 15
Subject Code Clones (JBoss) Q. Why do the results differ between the software systems? Is it because the open source software does not enforce coding rules?
Analysis of Results: defects of RNR metric (1/2) • The RNR metric sometimes extracts unintended code clones • E.g., language-dependent code clones
Analysis of Results: defects of RNR metric (2/2)
lIndex = lReturn.indexOf( "*" );
while( lIndex >= 0 ) {
    lReturn = ( lIndex > 0 ? lReturn.substring( 0, lIndex ) : "" ) + "%2a" + ( ( lIndex + 1 ) < lReturn.length() ? lReturn.substring( lIndex + 1 ) : "" );
    lIndex = lReturn.indexOf( "*" );
}
lIndex = lReturn.indexOf( ":" );
while( lIndex >= 0 ) {
    lReturn = ( lIndex > 0 ? lReturn.substring( 0, lIndex ) : "" ) + "%3a" + ( ( lIndex + 1 ) < lReturn.length() ? lReturn.substring( lIndex + 1 ) : "" );
    lIndex = lReturn.indexOf( ":" );
}
• Is the value of RNR really 96?
Code clone in a clone set whose LEN(S) and RNR(S) (= 96) values both rank in the top 15 in JBoss
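For illustration, the two consecutive loops in this clone differ only in the character searched for and the string inserted, so they could in principle be merged into one parameterized routine. The Java sketch below shows one such merge; the names `replaceEach` and `encode` are hypothetical, and the slides do not state whether this clone set was judged suitable for refactoring.

```java
final class EncodeSketch {
    /**
     * The loop shared by both fragments, parameterized by the literals that differ.
     * Like the original code, it assumes a one-character target that does not
     * occur in the encoded replacement (otherwise the loop would not terminate).
     */
    static String replaceEach(String value, String target, String encoded) {
        int index = value.indexOf(target);
        while (index >= 0) {
            value = (index > 0 ? value.substring(0, index) : "")
                    + encoded
                    + ((index + 1) < value.length() ? value.substring(index + 1) : "");
            index = value.indexOf(target);
        }
        return value;
    }

    /** Each former loop becomes a single call statement. */
    static String encode(String lReturn) {
        lReturn = replaceEach(lReturn, "*", "%2a");
        lReturn = replaceEach(lReturn, ":", "%3a");
        return lReturn;
    }

    public static void main(String[] args) {
        System.out.println(encode("a*b:c")); // prints a%2ab%3ac
    }
}
```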
Analysis of Results: defects of RNR metric (2/2) • Code clone in a clone set whose LEN(S) and RNR(S) (= 96) values both rank in the top 15 in JBoss • RNR value of this clone set: code clone in a clone set whose LEN(S) and RNR(S) (= 50)
Summary and Future Work • Summary • We conducted case studies to validate our key idea and discussed their results • Future Work • Update the metrics used • Investigate recall • Use more metrics. • Conduct further case studies of open source software
Thank You for Your Attention! 감사합니다. ありがとうございます
Example of a clone set that is not selected… • It is too short to form a semantic unit. • The RNR metric sometimes extracts unintended code clones • E.g., language-dependent code clones
boolean isEqual(final DeweyDecimal other) {
    final int max = Math.max(other.components.length, components.length);
    for (int i = 0; i < max; i++) {
        final int component1 = (i < components.length) ? components[i] : 0;
        final int component2 = (i < other.components.length) ? other.components[i] : 0;
        if (
Clone sets whose RNR(S) is higher than others • Each code clone in a clone set S consists of more non-repeated token sequences
/* Code clone in a clone set whose RNR(S) is the second highest in Ant 1.7.0 */
else {
    // is the zip file in the cache
    ZipFile zipFile = (ZipFile) zipFiles.get(file);
    if (zipFile == null) {
        zipFile = new ZipFile(file);
        zipFiles.put(file, zipFile);
    }
    ZipEntry entry = zipFile.getEntry(resourceName);
    if (entry != null) {
        /* … */
Clone sets whose RNR(S) is lower than others • They consist of more repeated token sequences • They tend to be language-dependent code clones
/* Code clone in a clone set whose RNR(S) is the lowest in Ant 1.7.0 */
String sosCmdDir = null;
…… skip code….
private String filename = null;
private boolean noCompress = false;
private boolean noCache = false;
private boolean recursive = false;
private boolean verbose = false;
/* … */
Consecutive variable declarations
Clone metric: RNR(S) (1/2) • Files: • F1: a b c a b • F2: c c* c* a b • F3: d a b e f • F4: c c* d e f • A superscript * indicates that the token is in a repeated token sequence • Clone set S1 = {a b, a b, a b, a b} • $\mathrm{RNR}(S_1) = \frac{2 + 2 + 2 + 2}{2 + 2 + 2 + 2} \times 100 = 100$
Clone metric: RNR(S) (2/2) • Files: • F1: a b c a b • F2: c c* c* a b • F3: d a b e f • F4: c c* d e f • A superscript * indicates that the token is in a repeated token sequence • Clone set S2 = {c c*, c* c*, c c*} • $\mathrm{RNR}(S_2) = \frac{1 + 0 + 1}{2 + 2 + 2} \times 100 = 33.3$
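Both worked examples are consistent with the following general form, given here as a sketch. The symbols $\mathrm{LEN}_{\text{non-repeated}}(f)$ and $\mathrm{LEN}_{\text{whole}}(f)$ are names introduced for this note: the number of tokens of clone $f$ outside any repeated token sequence, and the total number of tokens of $f$.

```latex
\mathrm{RNR}(S) =
  \frac{\sum_{f \in S} \mathrm{LEN}_{\text{non-repeated}}(f)}
       {\sum_{f \in S} \mathrm{LEN}_{\text{whole}}(f)} \times 100
% S1: (2 + 2 + 2 + 2) / (2 + 2 + 2 + 2) * 100 = 100
% S2: (1 + 0 + 1)     / (2 + 2 + 2)     * 100 = 33.3
```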
The Number of Duplicate Clone Sets (Industrial) • |SRNR ∩ SPOP ∩ SRNR∙POP| = 1 • |SRNR ∩ SRNR∙POP| = 2 • |SPOP ∩ SRNR∙POP| = 2 • |SLEN∙RNR ∩ SLEN∙POP ∩ SRNR∙POP ∩ SLEN∙RNR∙POP| = 1 CS Seminar 2010/12/01
The Number of Duplicate Clone Sets (Apache Ant) • |SRNR ∩ SRNR∙POP| = 1 • |SPOP ∩ SRNR∙POP| = 1 • |SPOP ∩ SLEN∙POP| = 1
The Number of Duplicate Clone Sets (JBoss) • |SRNR ∩ SLEN∙RNR| = 3 • |SRNR ∩ SRNR∙POP| = 1 • |SLEN∙RNR ∩ SLEN∙POP ∩ SRNR∙POP ∩ SLEN∙RNR∙POP| = 2
Clone set metrics • LEN(C): Length of the token sequence of each element in clone set C • POP(C): Number of elements in clone set C • RAD(C): Distribution in the file system of the elements in clone set C • DFL(C): Estimate of how many tokens would be removed from the source files if all code fragments of clone set C were replaced with caller statements of a new identical routine [Figure: code fragments replaced by caller statements that invoke a new subroutine]
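The idea behind DFL(C) can be sketched as a before/after token count. The Java fragment below is only an illustration of that idea and not the published definition: the parameters `callTokens` (tokens in one caller statement) and `routineOverheadTokens` (tokens for the new routine's declaration) are assumptions introduced here, as are the example numbers.

```java
final class DflSketch {
    /**
     * Estimated number of tokens removed when every clone in C is replaced by a call.
     * before: each of the pop clones contributes len tokens.
     * after:  one call per clone, plus one copy of the body in the new routine,
     *         plus an assumed overhead for the routine's declaration.
     */
    static int dfl(int len, int pop, int callTokens, int routineOverheadTokens) {
        int before = len * pop;
        int after = callTokens * pop + len + routineOverheadTokens;
        return before - after;
    }

    public static void main(String[] args) {
        // Hypothetical values: 20-token clones, 3 clones, 5-token calls, 4-token overhead.
        // before = 60, after = 15 + 20 + 4 = 39, so the estimate is 21.
        System.out.println(dfl(20, 3, 5, 4));
    }
}
```

Under this sketch the estimate grows with both LEN(C) and POP(C), and it can be negative for short clones with few occurrences, where the call overhead outweighs the saving.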