Extracting Code Clones for Refactoring Using Combinations of Clone Metrics
Eunjong Choi†, Norihiro Yoshida‡, Takashi Ishio†, Katsuro Inoue†, and Tateki Sano*
†Osaka University, Japan
‡Nara Institute of Science and Technology, Japan
*NEC Corporation, Japan

Background: Clone Set
• A set of code clones that are similar or identical to each other
[Figure: Code Clones 1 to 5, linked by "similar" and "identical" relations]
• Clone sets:
  • S1 = {Code Clone 1, Code Clone 3}
  • S2 = {Code Clone 2, Code Clone 4, Code Clone 5}

Background: Refactoring Code Clones
• Merge code clones into a single program unit
[Figure: code clones before and after refactoring]

Background: Language-dependent Code Clone
• A code clone that unavoidably exists in source code because of features of the programming language used
• Example of a language-dependent code clone (consecutive setter invocations):

    /* Code Clone A */
    replacement.setTaskType(taskType);
    replacement.setTaskName(taskName);
    replacement.setLocation(location);
    replacement.setOwningTarget(target);
    replacement.setRuntime (wrapper);
    wrapper.setProxy(replacement);
    /* … */

    /* Code Clone B */
    def.setName(name);
    def.setClassName(classname);
    def.setClass(cl);
    def.setAdapterClass(adapterClass);
    def.setAdaptToClass(adaptToClass);
    def.setClassLoader(al);
    /* … */

Background: Clone Metrics [Higo2007]
• Quantitative information on clone sets
  • E.g., LEN(S), RNR(S), POP(S)
• Purposes
  • To check features of code clones in software
  • To extract code clones for several purposes
    • E.g., refactoring, defect-prone code clones
[Higo2007] Yoshiki Higo, Toshihiro Kamiya, Shinji Kusumoto, Katsuro Inoue, "Method and Implementation for Investigating Code Clones in a Software System", Information and Software Technology, pp. 985-998 (2007-9)

Clone Metrics: LEN(S)
• The average length of the token sequences of the code clones in a clone set S
• Example: a token sequence [c c*] is detected as a code clone from the token sequence <c c* c* a b>, so LEN(S) = 2
• A superscript * indicates that the token is in a repeated token sequence

Clone Metrics: RNR(S)
• The ratio of non-repeated token sequences of code clones in a clone set S
• Example: a token sequence [c c*] is detected as a code clone from the token sequence <c c* c* a b>
  RNR(S) = (length of non-repeated token sequence / length of whole token sequence) × 100 = (1 / 2) × 100 = 50

Clone Metrics: POP(S)
• The number of code clones in a clone set S
[Figure: clone set S containing code clones 1 to 6]
  POP(S) = 6

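The three metric definitions above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not CCFinder's or the authors' implementation: the class name `CloneMetrics`, the method names, and the modelling of a code clone as a `boolean[]` (one entry per token, `true` for a token marked with * in the slides) are all hypothetical. The sample clone set assumes two occurrences of the clone [c c*].

```java
import java.util.List;

/**
 * Illustrative sketch only (hypothetical names, not the authors' tool).
 * A code clone is modelled as a boolean array with one entry per token:
 * true means the token lies in a repeated token sequence (the "*" tokens).
 */
final class CloneMetrics {

    /** LEN(S): average token length of the code clones in S. */
    static double len(List<boolean[]> cloneSet) {
        requireNonEmpty(cloneSet);
        int total = 0;
        for (boolean[] clone : cloneSet) {
            total += clone.length;
        }
        return (double) total / cloneSet.size();
    }

    /** RNR(S): percentage of tokens in S that are not in a repeated sequence. */
    static double rnr(List<boolean[]> cloneSet) {
        requireNonEmpty(cloneSet);
        int nonRepeated = 0;
        int whole = 0;
        for (boolean[] clone : cloneSet) {
            whole += clone.length;
            for (boolean repeated : clone) {
                if (!repeated) {
                    nonRepeated++;
                }
            }
        }
        return 100.0 * nonRepeated / whole;
    }

    /** POP(S): number of code clones in S. */
    static int pop(List<boolean[]> cloneSet) {
        return cloneSet.size();
    }

    private static void requireNonEmpty(List<boolean[]> cloneSet) {
        if (cloneSet.isEmpty()) {
            throw new IllegalArgumentException("clone set must not be empty");
        }
    }

    public static void main(String[] args) {
        // Assumed clone set: two occurrences of the clone [c c*].
        List<boolean[]> s = List.of(
                new boolean[] {false, true},
                new boolean[] {false, true});
        System.out.println("LEN(S) = " + len(s)); // LEN(S) = 2.0
        System.out.println("RNR(S) = " + rnr(s)); // RNR(S) = 50.0
        System.out.println("POP(S) = " + pop(s)); // POP(S) = 2
    }
}
```

Summing token counts over the whole clone set before dividing matches the worked RNR(S) examples in the slides, where the numerator and denominator are each sums over all code clones in the set.
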
Single Clone Metric (1/2)
• Clone sets whose RNR(S) is high
  • They do not form a single semantic unit
  • Semantic unit: a group of instructions that together implement a single functionality

    /* Code clone in a clone set whose RNR(S) is the second highest in Ant 1.7.0 */
    else {
        // is the zip file in the cache
        ZipFile zipFile = (ZipFile) zipFiles.get(file);
        if (zipFile == null) {
            zipFile = new ZipFile(file);
            zipFiles.put(file, zipFile);
        }
        ZipEntry entry = zipFile.getEntry(resourceName);
        if (entry != null) {

• Not appropriate for refactoring! (The clone is only a part of a semantic unit.)

Single Clone Metric (2/2)
• Clone sets whose POP(S) is high
  • They include many language-dependent code clones

    /* Code clone in a clone set whose POP(S) is the highest in Ant 1.7.0 */
    out.println("\">");
    out.println("");
    out.print("<!ELEMENT project (target | ");
    out.print(TASKS);
    out.print(" | ");
    out.print(TYPES);

• Not appropriate for refactoring!

Key Idea
• In our experience, a single clone metric is not sufficient for extracting refactorable code clones
• We propose a method based on combined clone metrics
  • To overcome the weakness of single-metric-based extraction

Combined Clone Metrics
• Clone sets whose RNR(S) and POP(S) are both high
  • Each code clone forms a single semantic unit

    /* Code clone in a clone set whose RNR(S) and POP(S) are higher than others */
    if (ifProperty != null && p.getProperty(ifProperty) == null) {
        return false;
    } else if (unlessProperty != null && p.getProperty(unlessProperty) != null) {
        return false;
    }
    return true;
}

• Appropriate for refactoring!

Case Study (1/2)
• Goal: validate our key idea
  • Using combined clone metrics is a feasible method for extracting code clones for refactoring
• Target system
  • Industrial Java software developed by NEC
  • 110 KLOC, 736 clone sets

Case Study (2/2)
• Experimental steps
  • Selected 62 clone sets from CCFinder's output using clone metrics
  • Conducted a survey about these clone sets and got feedback from a developer
[Figure: workflow from source files through CCFinder to clone sets selected using clone metrics, followed by the survey and developer feedback]

Subject Code Clones (1/2)
• Clone sets for which an individual clone metric value is high
  • Clone sets whose LEN(S) value is in the top 10
  • Clone sets whose RNR(S) value is in the top 10
  • Clone sets whose POP(S) value is in the top 10

Subject Code Clones (2/2)
• Clone sets for which combined clone metric values are high
  • 15 clone sets whose LEN(S) and RNR(S) values both rank in the top 15
  • 7 clone sets whose LEN(S) and POP(S) values both rank in the top 15
  • 18 clone sets whose RNR(S) and POP(S) values both rank in the top 15
  • 1 clone set whose LEN(S), RNR(S), and POP(S) values all rank in the top 15

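One possible reading of the combined selection above is a set intersection: rank the clone sets by each metric separately, keep the top k for each, and select the clone sets that appear in every list. The sketch below follows that reading. It is an assumption, not the authors' documented procedure; the slides do not state how the cutoff is applied, so the cutoff is left as a parameter `k`. The names `CombinedMetricSelection`, `CloneSet`, `topK`, and `inTopKOfAll`, and the sample metric values, are hypothetical.

```java
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

/** Illustrative sketch: select clone sets that rank in the top k for several metrics. */
final class CombinedMetricSelection {

    /** Hypothetical container for one clone set and its metric values. */
    static final class CloneSet {
        final String id;
        final double len;
        final double rnr;
        final int pop;

        CloneSet(String id, double len, double rnr, int pop) {
            this.id = id;
            this.len = len;
            this.rnr = rnr;
            this.pop = pop;
        }
    }

    /** Ids of the k clone sets with the highest value of one metric (ties keep input order). */
    static Set<String> topK(List<CloneSet> all, ToDoubleFunction<CloneSet> metric, int k) {
        Comparator<CloneSet> ascending = Comparator.comparingDouble(metric);
        return all.stream()
                .sorted(ascending.reversed())
                .limit(k)
                .map(c -> c.id)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Ids of the clone sets that are in the top k for every given metric. */
    @SafeVarargs
    static Set<String> inTopKOfAll(List<CloneSet> all, int k,
                                   ToDoubleFunction<CloneSet>... metrics) {
        Set<String> selected = null;
        for (ToDoubleFunction<CloneSet> metric : metrics) {
            Set<String> top = topK(all, metric, k);
            if (selected == null) {
                selected = top;
            } else {
                selected.retainAll(top); // intersect with the previous metrics
            }
        }
        return selected == null ? new LinkedHashSet<>() : selected;
    }

    public static void main(String[] args) {
        // Made-up metric values, for illustration only.
        List<CloneSet> all = List.of(
                new CloneSet("A", 50, 90, 2),
                new CloneSet("B", 40, 30, 10),
                new CloneSet("C", 10, 80, 8),
                new CloneSet("D", 5, 20, 3));
        System.out.println(inTopKOfAll(all, 2, c -> c.len, c -> c.rnr)); // [A]
        System.out.println(inTopKOfAll(all, 2, c -> c.rnr, c -> c.pop)); // [C]
    }
}
```
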
Results of Case Study (1/2)
• #Selected Clone Sets: the number of selected clone sets
• #Refactoring: the number of clone sets marked as "Perform refactoring" in the survey

Results of Case Study (2/2)
• Precision: "How many refactoring candidates were accepted by a developer?"
  Precision = #Refactoring / #Selected Clone Sets
• Clone sets selected with combined clone metrics were accepted as refactoring candidates by the developer more often

Summary and Future Work
• Summary
  • Our industrial case study shows that our key idea is appropriate
• Future work
  • Investigate recall
  • Conduct case studies on open source software
  • Suggest a new metric

Clone sets whose RNR(S) is higher than others
• Each code clone in a clone set S consists of more non-repeated token sequences

    /* Code clone in a clone set whose RNR(S) is the second highest in Ant 1.7.0 */
    else {
        // is the zip file in the cache
        ZipFile zipFile = (ZipFile) zipFiles.get(file);
        if (zipFile == null) {
            zipFile = new ZipFile(file);
            zipFiles.put(file, zipFile);
        }
        ZipEntry entry = zipFile.getEntry(resourceName);
        if (entry != null) {
        /* … */

Clone sets whose RNR(S) is lower than others
• Each code clone consists of more repeated token sequences
• They often involve language-dependent code clones
• Example: consecutive variable declarations

    /* Code clone in a clone set whose RNR(S) is the lowest in Ant 1.7.0 */
    String sosCmdDir = null;
    /* … code skipped … */
    private String filename = null;
    private boolean noCompress = false;
    private boolean noCache = false;
    private boolean recursive = false;
    private boolean verbose = false;
    /* … */

Survey Format: About Clone Set XXX
(1) Do you think that this clone set needs a practice?
    [ ] Yes
    [ ] No (→ Jump to the next clone set)
(2) If you marked "Yes" in your answer to (1), what practice is appropriate for this clone set?
    [ ] Refactoring
    [ ] Write comments about the code clones, but do not perform refactoring.
    [ ] Change nothing.
    [ ] Others. ( )
(3) Write the reason for your answer to (2).
    Reason:

Clone Metric: RNR(S) (1/2)
• Files:
  • F1: a b c a b
  • F2: c c* c* a b
  • F3: d a b, e f
  • F4: c c* d e f
• A superscript * indicates that the token is in a repeated token sequence
• Clone set S1 = {[a b], [a b], [a b], [a b]}
• RNR(S1) = ((2 + 2 + 2 + 2) / (2 + 2 + 2 + 2)) × 100 = 100

Clone Metric: RNR(S) (2/2)
• Files:
  • F1: a b c a b
  • F2: c c* c* a b
  • F3: d a b, e f
  • F4: c c* d e f
• A superscript * indicates that the token is in a repeated token sequence
• Clone set S2 = {[c c*], [c* c*], [c c*]}
• RNR(S2) = ((1 + 0 + 1) / (2 + 2 + 2)) × 100 = 33.3

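The two worked examples follow one general pattern, written out below in LaTeX. The symbols are notation assumed here for illustration: LOS_whole(c) is the number of tokens in code clone c, and LOS_non-repeated(c) is the number of those tokens that are not in a repeated token sequence.

```latex
% requires amsmath for \text
\[
\mathrm{RNR}(S) \;=\;
\frac{\sum_{c \in S} \mathrm{LOS}_{\text{non-repeated}}(c)}
     {\sum_{c \in S} \mathrm{LOS}_{\text{whole}}(c)} \times 100
\]
\[
\mathrm{RNR}(S_1) = \frac{2+2+2+2}{2+2+2+2} \times 100 = 100,
\qquad
\mathrm{RNR}(S_2) = \frac{1+0+1}{2+2+2} \times 100 \approx 33.3
\]
```
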
Subject Code Clones
• 62 clone sets
• Clone sets for which an individual clone metric value is high
  • SLEN: clone sets whose LEN(S) value is in the top 10
  • SRNR: clone sets whose RNR(S) value is in the top 10
  • SPOP: clone sets whose POP(S) value is in the top 10
• Clone sets for which combined clone metric values are high
  • SLEN∙RNR: 15 clone sets whose LEN(S) and RNR(S) values both rank in the top 15
  • SLEN∙POP: 7 clone sets whose LEN(S) and POP(S) values both rank in the top 15
  • SRNR∙POP: 18 clone sets whose RNR(S) and POP(S) values both rank in the top 15
  • SLEN∙RNR∙POP: 1 clone set whose LEN(S), RNR(S), and POP(S) values all rank in the top 15

The Number of Duplicate Clone Sets
• |SRNR ∩ SPOP ∩ SRNR∙POP| = 1
• |SRNR ∩ SRNR∙POP| = 2
• |SPOP ∩ SRNR∙POP| = 2
• |SLEN∙RNR ∩ SLEN∙POP ∩ SRNR∙POP ∩ SLEN∙RNR∙POP| = 1
CS Seminar, 2010/12/01

Example of a clone set that is not selected…
• It is too short to form a semantic unit
• The RNR metric sometimes extracts unintended code clones
  • E.g., language-dependent code clones

    boolean isEqual(final DeweyDecimal other) {
        final int max = Math.max(other.components.length, components.length);
        for (int i = 0; i < max; i++) {
            final int component1 = (i < components.length) ? components[i] : 0;
            final int component2 = (i < other.components.length) ? other.components[i] : 0;
            if (