Software Engineering Laboratory Eunjong Choi

Detection and evolution analysis of code clones for efficient management of large-scale software systems Software Engineering Laboratory Eunjong Choi

Code Clone
• A code fragment that has other code fragments identical or similar to it in the source code
• A representative factor that hampers software maintenance
[Figure: a clone set consisting of three code clones across Source File 1 and Source File 2]

Clone Detection Tools
• Tools detect clones at various granularities
  • String, token, program dependence graph
• CCFinder [Kamiya2002]
  • Token-based code clone detection tool
  • Known for its high recall
[Kamiya2002] T. Kamiya et al.: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE TSE, 28(7), pp. 654-670, 2002.

Example of CCFinder [Ueda2002]
• Pipeline: Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Code clones
• Source files:
    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }
[Ueda2002] Y. Ueda et al.: Gemini: Maintenance Support Environment Based on Code Clone Analysis. http://sel.ist.osaka-u.ac.jp/~lab-db/betuzuri/contents.ja/346.html



Clone Management
• Tools for managing code clones
  • Consistent change, clone refactoring
• Clone refactoring
  • Merges code clones into a single unit (i.e., a method or function)
  • Reduces the effort and time needed for clone management
[Figure: clone refactoring replaces each code clone with a call to the merged unit]

Motivation of the Thesis (1/2)
• Many companies release new models at rapid intervals [Bosch 2010]
• They frequently reuse robust parts of existing source code for new development
[Figure: each new model = reused parts + unique features]
[Bosch 2010] J. Bosch and P. Bosch-Sijtsema: From integration to composition: On the impact of software product lines, global development and ecosystems. J. Syst. Softw., 83(1), pp. 67-76, January 2010.

Motivation of the Thesis (2/2)
• Existing tools are insufficient for large-scale software systems
  • Detection takes a long time on systems that contain a large amount of code clones
  • Tools for clone management are commonly underused
RQ1. How can code clones be detected quickly in large-scale software systems?
RQ2. How can more widely used tools that support clone refactoring be developed?

Thesis Outline
• Chapter 1: Introduction
• Chapter 2: Related Work [1-2]
• Chapter 3: Proposing and Evaluating Clone Detection Approaches [1-1]
  • To answer RQ1
• Chapter 4: Investigating Merged Code Clones during Software Evolution [1-3][1-4]
  • To answer RQ2
• Chapter 5: Conclusion and Future Work

Chapter 3: Proposing and Evaluating Clone Detection Approaches

Motivation of This Study (1/2)
• It is important to detect code clones across different release models/versions
• A large amount of code clones increases detection time
• Identical files increase the computational cost of code clone detection
  • Code clones are detected repeatedly within them

Motivation of This Study (2/2)
• Different degrees of normalization cause subtly different source code to be detected as code clones
  • Normalization: transformation of program elements
[Figure: three fragments, each adjacent pair marked "Code Clone by CCFinder"]
    org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); ……
    org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); ……
    RE exp = new RE("[0-9,]+"); ……
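To illustrate how the degree of normalization changes what counts as a match, here is a minimal sketch. It is not CCFinder's actual transformation: the tokenizer, the `KEYWORDS` set, and the two switches `strip_qualifiers` and `mask_identifiers` are assumptions made for this example only.

```python
import re

# Hypothetical, very rough Java tokenizer: string literals, (qualified)
# identifiers, integers, and single punctuation characters.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*|\d+|\S')

# Small illustrative keyword set; a real tool would use the full language grammar.
KEYWORDS = {"new", "int", "for", "if", "static", "void", "throws"}


def normalize(line, strip_qualifiers=True, mask_identifiers=True):
    """Return the token list of `line` after the selected normalizations."""
    out = []
    for tok in TOKEN.findall(line):
        if re.match(r"[A-Za-z_]", tok):
            if strip_qualifiers:
                # org.apache.regexp.RE -> RE
                tok = tok.rsplit(".", 1)[-1]
            if mask_identifiers and tok not in KEYWORDS:
                # Replace every user-defined name with one placeholder token.
                tok = "$"
        out.append(tok)
    return out


a = 'org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");'
b = 'RE exp = new RE("[0-9,]+");'

print(normalize(a, False, False) == normalize(b, False, False))  # no normalization
print(normalize(a, True, False) == normalize(b, True, False))    # qualifiers stripped
print(normalize(a) == normalize(b))                              # fully normalized
```

Under these assumptions the two lines only become identical token sequences once identifiers are masked, which is the effect the slide describes.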

Overview of This Study (1/2)
• Proposes six approaches and evaluates them
  • To investigate how normalization affects code clone detection
• One approach without normalization
• Five approaches with normalization

Overview of This Study (2/2)
• The proposed approaches share three pipeline phases
  • Preprocessing: partitions the input files into equivalence classes (i.e., sets of files that are identical to each other) based on their MD5 hash values, then generates a corpus (i.e., a set of files containing one representative of each equivalence class)
  • Clone detection: detects code clones in the corpus using CCFinder
  • Post-processing: generates all clone sets by mapping the output of CCFinder onto the equivalence classes and, if necessary, other information
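The preprocessing and post-processing phases can be sketched as follows. This is a simplified illustration, not the thesis implementation: files are passed as an in-memory mapping, the first path of each class is arbitrarily chosen as representative, and the function names `partition_by_md5`, `build_corpus`, and `expand` are hypothetical. The clone detection phase itself (running CCFinder on the corpus) is left out.

```python
import hashlib
from collections import defaultdict


def partition_by_md5(files):
    """Group file paths into equivalence classes by the MD5 of their content.

    `files` maps path -> bytes. Returns digest -> sorted list of paths.
    """
    classes = defaultdict(list)
    for path, content in files.items():
        classes[hashlib.md5(content).hexdigest()].append(path)
    return {digest: sorted(paths) for digest, paths in classes.items()}


def build_corpus(classes):
    """Pick one representative per equivalence class (here: the first path)."""
    return sorted(paths[0] for paths in classes.values())


def expand(representatives, classes):
    """Post-processing sketch: map representatives back to all identical files."""
    members = {paths[0]: paths for paths in classes.values()}
    return sorted(p for rep in representatives for p in members[rep])


files = {
    "v1/a.c": b"int x;",
    "v2/a.c": b"int x;",   # identical to v1/a.c
    "v2/b.c": b"int y;",
}
classes = partition_by_md5(files)
corpus = build_corpus(classes)
print(corpus)                       # files that would be handed to the detector
print(expand(["v1/a.c"], classes))  # a result on v1/a.c also applies to v2/a.c
```

Because only the corpus is passed to the detector, identical files across versions are analyzed once, and their results are restored afterwards through the class mapping.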

Approach with Non-normalization
[Figure: input source files → Preprocessing (calculate MD5 hash values, partition into equivalence classes, select representatives) → Clone detection (detect code clones) → Post-processing (mapping)]

Approach with Normalizations (1/2)
[Figure: input source files → Preprocessing (steps shown: parse & normalize, calculate MD5 hash values, partition into equivalence classes, select representatives) → Clone detection (detect code clones) → Post-processing (mapping)]

Approach with Normalizations (2/2)
• Identical Except for Comments (IEC) approach
• Identical Except for Macros (IEM) approach
• Identical Except for Macros and Comments (IEMC) approach
• Identical Source Code (ISC) approach
• Identical Normalized Source Code (INSC) approach

Case Study (1/2)
• Research Questions (RQs)
  • RQ1. Can the proposed approaches detect code clones faster than an approach that uses only CCFinder?
  • RQ2. Which of the proposed approaches is the fastest?
• The following approaches are applied to different versions of three open source software (OSS) systems
  • Our proposed approaches
  • An approach that uses only CCFinder

Case Study (2/2)
• Statistics of subject systems
• Detection environment
  • 64-bit Windows 7 Professional workstation equipped with 2 processors and 24 gigabytes of main memory

Detection Time in Seconds (Samsung Galaxy)

Answers to RQs
• RQ1. Can the proposed approaches detect code clones faster than an approach that uses only CCFinder?
  • Answer to RQ1: Our proposed approaches detect code clones faster than the approach that uses only CCFinder.
• RQ2. Which of the proposed approaches is the fastest?
  • Answer to RQ2: The approach with non-normalization is the fastest.

Chapter 4: Investigating Merged Code Clones during Software Evolution

Motivation of This Study
• Clone refactoring tools are commonly underused compared to refactoring tools
• This study investigates instances of clone refactoring in open source software systems
  • To uncover clues that could contribute to the development of more widely used tools for clone refactoring

Research Questions (RQs)
• RQ1: Which refactoring patterns are the most frequently used in clone refactoring?
• RQ2: How similar are the token sequences between pairs of merged code clones?
• RQ3: How different are the lengths of token sequences between pairs of merged code clones?
• RQ4: How far apart are pairs of code clones located before clone refactoring?

Steps of Investigation
• Step 1: Detecting Instances of Refactoring
  • Ref-Finder is applied to the source code of versions k and k+1 extracted from the software repository; output: detected instances of refactoring
• Step 2: Identifying Instances of Clone Refactoring
  • Input: detected instances of refactoring; output: identified instances of clone refactoring
• Step 3: Measuring the Characteristics of Merged Code Clones
  • Input: identified instances of clone refactoring

Step 1: Detecting Instances of Refactoring (1/2)
• Ref-Finder [Prete2010] was applied to identify instances of the following refactoring patterns
  • Extract Method (EM)
  • Extract Class (EC)
  • Extract Superclass (ES)
  • Form Template Method (FTM)
  • Pull Up Method (PUM)
  • Parameterize Method (PM)
  • Replace Method with Method Object (RMMO)
[Prete2010] K. Prete et al.: Template-based reconstruction of complex refactorings. In Proc. of ICSM, pp. 1-10, 2010.

Step 1: Detecting Instances of Refactoring (2/2)
• Manually validated the output of Ref-Finder
  • To exclude false positives
  • Referred to existing validated output data [Bavota2012]
• Subject systems
[Bavota2012] G. Bavota et al.: When Does a Refactoring Induce Bugs? An Empirical Study. In Proc. of SCAM, pp. 104-113, 2012.

Step 2: Identifying Instances of Clone Refactoring (1/3)
• Undirected similarity (usim) [Mende2010]: determines the similarity between two token sequences
  • Uses the Levenshtein distance
    • Measures the amount of difference between two character sequences
    • The Levenshtein distance between "survey" and "surgery" is 2 [B-yates]
      survey → surgey → surgery (+1 for each step)
[Mende2010] T. Mende et al.: An evaluation of code similarity identification for the grow-and-prune model. Journal of Software Maintenance, 21(2), pp. 143-169, 2009.
[B-yates] R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition). Addison Wesley, 2010.

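Step 2: Identifying Instances of Clone Refactoring (2/3)
• The Levenshtein distance between two token sequences is normalized by the length of the longer sequence
  $usim(f_x, f_y) = \frac{\max(l_x, l_y) - \Delta_{f_x, f_y}}{\max(l_x, l_y)} \times 100$
  • $s_x$: a normalized token sequence of function $f_x$
  • $l_x$: length of the normalized token sequence $s_x$
  • $\Delta_{f_x, f_y}$: number of items that have to be changed to turn function $f_x$ into $f_y$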
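A minimal sketch of the similarity computation, under one assumption that should be checked against [Mende2010]: usim is read here as (longer length − Levenshtein distance) / longer length, expressed as a percentage. The function names and the choice to return 100 for two empty sequences are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = cur
    return prev[-1]


def usim(sx, sy):
    """Similarity in percent, normalized by the longer of the two sequences."""
    longest = max(len(sx), len(sy))
    if longest == 0:
        return 100.0  # assumption: two empty sequences count as identical
    return (longest - levenshtein(sx, sy)) / longest * 100


print(levenshtein("survey", "surgery"))  # the character-level example from the slides

# Token-level use: sequences are lists of (normalized) tokens.
tokens_x = ["int", "$", "=", "0"]
tokens_y = ["int", "$", "=", "1"]
print(usim(tokens_x, tokens_y))
```

The same distance function serves both the character-level example and the token-level comparison, since it only relies on element equality.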

Step 2: Identifying Instances of Clone Refactoring (3/3)
• Each pair of refactored clones was defined as an instance of clone refactoring only if it satisfied the following three conditions
  • Syntax condition: each pair of code fragments was refactored into the same new method in the new version
  • Similarity condition: the computed usim value of each pair of code fragments in the old version was more than 65% [Mende2010]
  • Volume condition: the token length of each refactored pair was greater than 10 in the old version [Mende2010]
[Mende2010] T. Mende et al.: An evaluation of code similarity identification for the grow-and-prune model. Journal of Software Maintenance, 21(2), pp. 143-169, 2009.
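The three conditions can be read as a simple filter over candidate pairs. The sketch below is illustrative: the `RefactoredPair` record and its fields are hypothetical, the usim value is assumed to be precomputed, and the volume condition is interpreted as applying to each fragment of the pair, which the slide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class RefactoredPair:
    target_a: str   # method that fragment A was refactored into (new version)
    target_b: str   # method that fragment B was refactored into (new version)
    usim: float     # similarity of the two fragments in the old version, in percent
    tokens_a: int   # token length of fragment A in the old version
    tokens_b: int   # token length of fragment B in the old version


def is_clone_refactoring(pair, min_usim=65.0, min_tokens=10):
    """Return True if the pair meets the syntax, similarity, and volume conditions."""
    syntax = pair.target_a == pair.target_b
    similarity = pair.usim > min_usim
    # Assumption: both fragments must exceed the token threshold.
    volume = pair.tokens_a > min_tokens and pair.tokens_b > min_tokens
    return syntax and similarity and volume


candidates = [
    RefactoredPair("Util.parse", "Util.parse", 80.0, 20, 25),
    RefactoredPair("Util.parse", "Util.format", 80.0, 20, 25),  # fails syntax
    RefactoredPair("Util.parse", "Util.parse", 65.0, 20, 25),   # fails similarity
    RefactoredPair("Util.parse", "Util.parse", 80.0, 10, 25),   # fails volume
]
accepted = [p for p in candidates if is_clone_refactoring(p)]
print(len(accepted))
```

Both thresholds are strict ("more than 65%", "greater than 10"), so boundary values are rejected.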

RQ1. The Most Frequently Used Refactoring Patterns
• Categorized sets of code clones based on whether they were merged into the same newly created method using each refactoring pattern
• A total of 35 sets of merged code clones were identified
Answer to RQ1: RMMO was the most frequently used refactoring pattern, followed by EM.

RQ2. Similarities of Token Sequences between Pairs of Merged Code Clones
• Average usim values of sets of merged code clones
• The token similarities of EM and RMMO were relatively low compared to those of ES and FTM
Answer to RQ2: EM and RMMO were mainly used to merge pairs of code clones with various token similarities.

Suggestions for Tool Development
• It is vital for tools to support the RMMO and EM patterns
• To support the EM pattern, tools should suggest pairs of code clones with various token similarities as candidates for clone refactoring
• To support the RMMO pattern, tools should likewise suggest pairs of code clones with various token similarities as candidates

Future Work
• Higher speed: extend the approach in Chapter 3
  • Using a distributed approach such as D-CCFinder [Livieri2007]
• Tool development: based on the investigation results of Chapter 4
Any Questions?
[Livieri2007] S. Livieri et al.: Very-Large Scale Code Clone Analysis and Visualization of Open Source Programs Using Distributed CCFinder: D-CCFinder. In Proc. of ICSE, pp. 106-115, 2007.
