A clone detection approach for a collection of similar large-scale software products
Eunjong Choi†, Norihiro Yoshida‡, Yoshiki Higo†, Katsuro Inoue†
†Osaka University ‡Nara Institute of Science and Technology
Software Development for Mobile Device (1/2)
• A new model is released at regular, rapid intervals
• Software is adapted to the constraints and needs of various countries
  • e.g. Osaifu-Keitai for Japan
Software Development for Mobile Device (2/2)
[Figure: each product combines reused pieces with its own unique features]
Software is developed by reusing common pieces and implementing the unique pieces for each feature.
Reused Source Code Pieces
A code clone: a code fragment that has lexically, syntactically, or semantically similar code fragments elsewhere in the source code
• Source code is reused at the code fragment level (code clones) and at the file level
• Detecting and managing reused pieces is necessary
  • e.g. inconsistency management, plagiarism detection
Code Clone
A clone set: a set of code clones that are similar or identical to each other
• Generated by:
  • Code reuse by copy & paste
  • Stereotyped functions or tool-generated code
Code Clone Detection
• Various detection techniques and tools have been proposed
  • e.g. text-based and line-based (dup) [Baker1995], token-based (CP-Miner) [Li2006]
• CCFinder [Kamiya2002]
  • A token-based clone detection tool
  • Multi-language support (C, C++, COBOL, Java, ...)
  • Good speed (5 MLOC in 20 minutes)

[Baker1995] B. S. Baker. On finding duplication and near-duplication in large software systems. In Proc. of WCRE, pages 86, July 1995.
[Li2006] Z. Li, S. Lu, S. Myagmar and Y. Zhou. CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code. IEEE Transactions on Software Engineering, 32: pages 176-192, 2006.
[Kamiya2002] T. Kamiya, S. Kusumoto and K. Inoue. CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE Transactions on Software Engineering, 28: pages 654-670, 2002.
Example of Clone Detection Technique: CCFinder
Source files → Lexical analysis → Token sequence → Transformation → Transformed token sequence → Match detection → Clones on transformed sequence → Formatting → Code clones

    static void foo() throws RESyntaxException {
        String a[] = new String [] { "123,400", "abc", "orange 100" };
        org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (pat.match(a[i]))
                sum += Sample.parseNumber(pat.getParen(0));
        System.out.println("sum = " + sum);
    }
    static void goo(String [] a) throws RESyntaxException {
        RE exp = new RE("[0-9,]+");
        int sum = 0;
        for (int i = 0; i < a.length; ++i)
            if (exp.match(a[i]))
                sum += parseNumber(exp.getParen(0));
        System.out.println("sum = " + sum);
    }
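The transformation step is what lets lines of foo and goo match even though their variable names differ. The following is a minimal sketch of that idea, assuming a much simpler rule than CCFinder actually applies: every identifier outside a tiny keyword list is replaced by the placeholder `$`. The class name `TokenNormalizer`, the keyword list, and the regex-based tokenizer are illustrative assumptions, not CCFinder's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenNormalizer {
    // Tiny keyword list for this example only; a real lexer knows the whole language.
    private static final Set<String> KEYWORDS =
            Set.of("if", "for", "int", "new", "static", "void");

    // Identifier, integer literal, string literal, "+=", "++", or any single symbol.
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z_][A-Za-z_0-9]*|\\d+|\"[^\"]*\"|\\+=|\\+\\+|\\S");

    // Splits one line into tokens and replaces every non-keyword identifier with "$".
    static List<String> normalize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            boolean isIdentifier = Character.isLetter(first) || first == '_';
            tokens.add(isIdentifier && !KEYWORDS.contains(t) ? "$" : t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The matching "if" lines of foo and goo in the example above.
        List<String> fromFoo = normalize("if (pat.match(a[i]))");
        List<String> fromGoo = normalize("if (exp.match(a[i]))");
        System.out.println(fromFoo);
        System.out.println("same token sequence: " + fromFoo.equals(fromGoo));
    }
}
```

This sketch does not strip qualifiers such as `Sample.` or `org.apache.regexp.`, so it would not by itself make the `parseNumber` lines or the `RE` declarations of foo and goo match.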
Problem of Code Clone Detection Tools
An identical file set: a set of files that are identical to each other
• Existing tools take an enormous amount of time to detect code clones in large-scale software
• We suggest an approach for detecting code clones in a collection of similar large-scale software products
  • It excludes clone detection among the files within each identical file set
Overview of Our Approach
[Figure: Source Files → Hashed Files → Identical File Sets and Input Files for CCFinder → Clone Sets → All Clone Sets]
Step1. Calculate MD5 hash
Step2. Prepare input files for CCFinder
Step3. Detect code clones using CCFinder
Step4. Generate all clone sets
Step1. Calculate MD5 Hash
[Figure: an MD5 hash value is calculated for each source file]
• Creates an MD5 hash value for each input file
• MD5 hash does not require any large substitution tables
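As an illustration of this step, here is a minimal sketch of hashing file contents with the JDK's standard `java.security.MessageDigest`. The class name `Md5Hasher` and the method name `md5Hex` are hypothetical; the slides do not say how the authors computed the hashes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5Hasher {
    // Returns the MD5 digest of the given bytes as 32 lowercase hex characters.
    static String md5Hex(byte[] content) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(content)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Hashes the whole content of one source file.
    static String md5Hex(Path file) {
        try {
            return md5Hex(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            // No files given: hash a fixed string as a demonstration.
            System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
        }
        for (String name : args) {
            System.out.println(md5Hex(Paths.get(name)) + "  " + name);
        }
    }
}
```

Two files receive the same hash value exactly when their bytes are identical (ignoring the negligible chance of an accidental collision), so whitespace or comment changes already produce a different hash.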
Step2. Prepare Input Files for CCFinder (1/2)
Detect identical file sets
[Figure: files C1 and D1 have unique MD5 hashes (175 and A9); A1, A2, and A3 share hash 0cc and form Identical File Set A; B1 and B2 share hash be05 and form Identical File Set B]
Step2. Prepare Input Files for CCFinder (2/2)
Prepare input files for CCFinder
[Figure: the input files for CCFinder are prepared from files C1 and D1 and from Identical File Sets A and B]
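Step 2 can be sketched as grouping files by hash value and then building the CCFinder input. The sketch below assumes that one representative per identical file set is passed to CCFinder, which the slides imply but do not state outright. The class `IdenticalFileSets` and its methods are hypothetical names, and the abbreviated hash values are the ones shown in the figure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IdenticalFileSets {
    // Groups file names by hash value; files sharing a hash form one identical file set.
    static Map<String, List<String>> groupByHash(Map<String, String> fileToHash) {
        Map<String, List<String>> sets = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fileToHash.entrySet()) {
            sets.computeIfAbsent(e.getValue(), h -> new ArrayList<>()).add(e.getKey());
        }
        return sets;
    }

    // Assumption: the first file of each set stands in for the whole set.
    static List<String> representatives(Map<String, List<String>> sets) {
        List<String> reps = new ArrayList<>();
        for (List<String> files : sets.values()) {
            reps.add(files.get(0));
        }
        return reps;
    }

    public static void main(String[] args) {
        Map<String, String> fileToHash = new LinkedHashMap<>();
        fileToHash.put("C1", "175");
        fileToHash.put("D1", "A9");
        fileToHash.put("A1", "0cc");
        fileToHash.put("A2", "0cc");
        fileToHash.put("A3", "0cc");
        fileToHash.put("B1", "be05");
        fileToHash.put("B2", "be05");

        Map<String, List<String>> sets = groupByHash(fileToHash);
        System.out.println("identical file sets: " + sets);
        System.out.println("input files for CCFinder: " + representatives(sets));
    }
}
```

Under that assumption, the seven files in the figure shrink to four input files, so CCFinder never compares A1 with A2 or A3, or B1 with B2.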
Step3. Detect Code Clones
Use CCFinder to detect code clones
[Figure: CCFinder detects code clones in the input files and reports them as clone sets (Clone Set 1, Clone Set 2, ...)]
Step4. Generate all clone sets
[Figure: the clone sets detected by CCFinder are combined with Identical File Sets A and B to generate all clone sets]
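One way to picture this step: because the files in an identical file set have the same content, a clone found at some line range in one of them occurs at the same line range in all the others. The sketch below expands a detected clone set on that basis. It assumes CCFinder was run on one representative per identical file set; `CloneSetExpander`, `Clone`, and the line ranges are hypothetical and not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CloneSetExpander {
    // A code clone: a line range inside one file.
    static final class Clone {
        final String file;
        final int startLine;
        final int endLine;

        Clone(String file, int startLine, int endLine) {
            this.file = file;
            this.startLine = startLine;
            this.endLine = endLine;
        }

        @Override
        public String toString() {
            return file + ":" + startLine + "-" + endLine;
        }
    }

    // identicalFiles maps a representative file to every file in its identical
    // file set, including itself. Files without an entry are kept unchanged.
    static List<Clone> expand(List<Clone> cloneSet, Map<String, List<String>> identicalFiles) {
        List<Clone> all = new ArrayList<>();
        for (Clone c : cloneSet) {
            List<String> files = identicalFiles.getOrDefault(c.file, List.of(c.file));
            for (String f : files) {
                // Same content, so the clone sits at the same lines in each copy.
                all.add(new Clone(f, c.startLine, c.endLine));
            }
        }
        return all;
    }

    public static void main(String[] args) {
        Map<String, List<String>> identicalFiles = Map.of(
                "A1", List.of("A1", "A2", "A3"),
                "B1", List.of("B1", "B2"));
        // A clone set as it might be reported for the representative files.
        List<Clone> detected = List.of(new Clone("C1", 10, 20), new Clone("A1", 5, 15));
        System.out.println(expand(detected, identicalFiles));
    }
}
```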
Overview of Case Study (2/2)
• Approach
  • Compare the detection time of our method with that of using only CCFinder
  • Confirm that the detection result of our method is the same as that of using only CCFinder
• Detection environment
  • 64-bit Windows 7 Professional workstation equipped with 2 processors and 24 gigabytes of main memory
Results of Case Study (1/2)
• Detection time
  • Our approach detects code clones faster than using only CCFinder.
Results of Case Study (2/2)
• Accuracy of results: outputs were checked manually
  • 30 arbitrarily selected clone sets detected by our approach from each OSS project
  • 30 selected identical file sets from each OSS project
Summary
• We suggested an approach for detecting code clones in a collection of similar large-scale software products
  • MD5 hash to identify identical file sets
  • CCFinder to detect code clones
• We applied our approach to three OSS projects and compared the code clone detection time of using only CCFinder with that of our approach
  • Our approach takes less time to detect code clones
Future Work
• Improve the approach so that files with slight modifications are also detected as identical file sets
  • Our current approach only treats files that are exactly identical to each other as an identical file set
• Apply the approach to software projects of various sizes in different domains
• Introduce other code clone detection tools and compare their results in the case study