
CMCD: Count Matrix based Code Clone Detection

Yang Yuan and Yao Guo, Key Laboratory of High-Confidence Software Technologies (Ministry of Education), Peking University.


Presentation Transcript


  1. CMCD: Count Matrix based Code Clone Detection • Yang Yuan and Yao Guo • Key Laboratory of High-Confidence Software Technologies (Ministry of Education) • Peking University

  2. Code Clones • In software development, it is common to reuse code fragments by copying them, with or without minor modifications. • Such code fragments are called code clones. [Jurgens et al., ICSE 2009]

  3. Scenario-based Evaluation • Example of Scenario #1 (figure: original vs. copy)

  4. Scenario-based Evaluation • Example of Scenario #2 (figure: original vs. copy)

  5. Scenario-based Evaluation • Example of Scenario #3 (figure: original vs. copy)

  6. Scenario-based Evaluation • Example of Scenario #4 (figure: original vs. copy)

  7. Importance of Code Clones • Code clones cause problems: • Increase the complexity of source code • Increase the maintenance cost of software systems • Increase the likelihood of bugs • 7%-23% of the code in large software systems is cloned. [Roy et al., SCP 2009] • Detecting code clones may help: • Analyze the programming habits of the programmers • Find the design patterns of the source code

  8. Previous Work in Clone Detection • Lower level: • Textual approach • SDD [Lee and Jeong, OOPSLA 2005] • NICAD [Roy and Cordy, ICPC 2008] • … • Lexical approach • DUP [Baker, WCRE 1995] • CCFinder [Kamiya et al., TSE 2002] • CP-Miner [Li et al., OSDI 2004, TSE 2006] • …

  9. Previous Work in Clone Detection • Higher level: • Syntactic approach • CloneDr [Baxter et al., ICSM 1998] • Deckard [Jiang et al., ICSE 2007] • CloneDigger [Bulychev, SyRCoSE 2008] • … • Semantic approach • Duplix [Krinke, WCRE 2001] • GPLAG [Liu et al., KDD 2006] • …

  10. Challenges • Low-level approaches: • Faster • Usually focus on local characteristics • No idea of global meaning • High-level approaches: • Slower • Better understanding of the programs • Difficult to scale • There is a gap between the two

  11. Our Idea • A novel count matrix based clone detection approach. • Benefits of counting • By ignoring the order of variables, it can identify clones in which statements have been swapped, which is difficult for both lexical and syntactic approaches. • Easy to calculate and implement • Reduces space and time complexity

  12. Count Matrix Construction • Example token sequence: tot, =, n, +, Find, (, n, ), for, i, =, 1, to, n, -, 1, if, a, [, i, ], >, a, [, j, ], k, =, a, [, i, ], …
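The counting idea can be illustrated with a minimal Java sketch. It is not the authors' implementation: the class name `CountMatrixSketch` and the three counting conditions (total occurrences, assignments, uses as an array subscript) are assumptions chosen for illustration, and the slides do not list the conditions CMCD actually uses. Each variable gets one row of counts, and the rows together form the count matrix.

```java
import java.util.*;

// Hypothetical sketch: one row of counts per variable.
// Columns (illustrative only): 0 = total occurrences,
// 1 = occurrences directly followed by "=", 2 = occurrences inside "[ ]".
class CountMatrixSketch {
    static Map<String, int[]> build(List<String> tokens, Set<String> variables) {
        Map<String, int[]> matrix = new TreeMap<>();
        int bracketDepth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals("[")) { bracketDepth++; continue; }
            if (token.equals("]")) { bracketDepth--; continue; }
            if (!variables.contains(token)) continue; // skip operators, keywords, literals
            int[] row = matrix.computeIfAbsent(token, k -> new int[3]);
            row[0]++;
            if (i + 1 < tokens.size() && tokens.get(i + 1).equals("=")) row[1]++;
            if (bracketDepth > 0) row[2]++;
        }
        return matrix;
    }
}
```

Because only counts are recorded, the token sequences for `k = a[i]; i = i + 1` and `i = i + 1; k = a[i]` produce identical rows, which is the property that lets a count-based comparison tolerate swapped statements.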

  13. Comparison Algorithms • Goal: • Find more scenario #4 clones with more transformations, such as statement swapping • Run fast • General principles: • Compare individual variables, instead of variable sequences • Ignore variable order in the count matrix

  14. Bipartite Graph Matching • Use bipartite graph matching to find code clones at different granularities: • Bottom-up approach • Can be used to compute the similarity between two projects, two classes, or two methods • Use two kinds of bipartite graph matching • KM algorithm (low-level, slow, accurate) • Hungarian algorithm (high-level, fast, inaccurate)
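As a rough illustration of the faster of the two matchers, the sketch below computes a maximum matching on an unweighted bipartite graph using augmenting paths. It assumes that "Hungarian algorithm" on the slide refers to this unweighted variant, which the slides do not confirm. The class name `BipartiteMatchSketch` and the boolean `compatible` matrix (for example, "these two methods are similar enough to pair") are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: maximum matching on an unweighted bipartite graph.
// compatible[i][j] is true when left item i may be paired with right item j.
class BipartiteMatchSketch {
    static int maxMatching(boolean[][] compatible) {
        int rightSize = compatible.length == 0 ? 0 : compatible[0].length;
        int[] matchOfRight = new int[rightSize]; // left partner of each right item, -1 if free
        Arrays.fill(matchOfRight, -1);
        int matched = 0;
        for (int i = 0; i < compatible.length; i++) {
            if (augment(i, compatible, matchOfRight, new boolean[rightSize])) matched++;
        }
        return matched;
    }

    // Try to match left item i, re-routing earlier matches along an augmenting path if needed.
    private static boolean augment(int i, boolean[][] compatible, int[] matchOfRight, boolean[] visited) {
        for (int j = 0; j < matchOfRight.length; j++) {
            if (!compatible[i][j] || visited[j]) continue;
            visited[j] = true;
            if (matchOfRight[j] == -1 || augment(matchOfRight[j], compatible, matchOfRight, visited)) {
                matchOfRight[j] = i;
                return true;
            }
        }
        return false;
    }
}
```

One possible use, again an assumption, is to divide the number of matched pairs by the size of the larger side to obtain a similarity score for two classes or two projects. The KM (Kuhn-Munkres) algorithm would instead work on weighted edges and is not shown here.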

  15. Optimization • Use the Euclidean metric to compute the similarity of count vectors (CVs) • Use a quick rejection algorithm to improve speed • Eliminate false positives: • Cut and check • Slice and match
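The Euclidean comparison of two count vectors can be sketched as follows. The distance is the standard Euclidean distance; the mapping from distance to a similarity score, `1 / (1 + d)`, is an assumption made for this example, since the slides do not give the formula CMCD uses. The class name `CountVectorDistance` is likewise hypothetical.

```java
// Hypothetical sketch: compare two count vectors of equal length.
class CountVectorDistance {
    // Standard Euclidean distance between the two vectors.
    static double distance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("count vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Assumed mapping to (0, 1]: identical vectors score 1.0, distant vectors approach 0.
    static double similarity(int[] a, int[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }
}
```

For example, `CountVectorDistance.distance(new int[]{0, 0}, new int[]{3, 4})` evaluates to 5.0, and two identical vectors have similarity 1.0.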

  16. Implementation • Use Soot to convert Java to Jimple • [Vallée-Rai et al., CASCON 1999] • 3-address intermediate representation • Smaller language set • Breaks complex statements into basic ones • Does not change the meaning of the program • A new version of CMCD without using Soot

  17. Overview

  18. Performance Comparison to Deckard

  19. Scenario-based Evaluation • Based on the scenario classification from the paper by Roy et al., “Comparison and Evaluation of Code Clone Detection Techniques”

  20. Detecting Plagiarism • Student-submitted compiler lab projects • 29 submissions • 106 – 251 Java classes • 7,825 – 38,086 lines of code • Experimental Results • Running time: 123 minutes • 2 clusters of code clones, each with 3 copies • Confirmed • Now used by two courses at Peking University for detecting plagiarism in students’ homework

  21. Analyzing JDK 1.6 Source Code • JDK 1.6.0_18 • 7,197 files • 2,079,166 LoC • Experimental Results • Running time: 163 minutes • Found: 786 methods in 174 clusters (Small methods are omitted)

  22. Code Comparison: Two Clones

      Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

          public static SyncFactory getSyncFactory() {
              if (syncFactory == null) {
                  synchronized (SyncFactory.class) {
                      if (syncFactory == null) {
                          syncFactory = new SyncFactory();
                      } //end if
                  } //end synchronized block
              } //end if
              return syncFactory;
          }

      Method 2: (in javax.swing.JComponent)

          static Set<KeyStroke> getManagingFocusBackwardTraversalKeys() {
              synchronized (JComponent.class) {
                  if (managingFocusBackwardTraversalKeys == null) {
                      managingFocusBackwardTraversalKeys = new HashSet<KeyStroke>(1);
                      managingFocusBackwardTraversalKeys.add(KeyStroke.getKeyStroke(
                          KeyEvent.VK_TAB, InputEvent.SHIFT_MASK | InputEvent.CTRL_MASK));
                  }
              }
              return managingFocusBackwardTraversalKeys;
          }

  23. Detected a Bug

      Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)

          public static SyncFactory getSyncFactory() {
              if (syncFactory == null) {
                  synchronized (SyncFactory.class) {
                      if (syncFactory == null) {
                          syncFactory = new SyncFactory();
                      } //end if
                  } //end synchronized block
              } //end if
              return syncFactory;
          }

      Method 3: (in com.sun.corba.se.impl.ior.iiop.JavaSerializationComponent)

          public static JavaSerializationComponent singleton() {
              if (singleton == null) {
                  synchronized (JavaSerializationComponent.class) {
                      singleton = new JavaSerializationComponent(Message.JAVA_ENC_VERSION);
                  }
              }
              return singleton;
          }

      http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999537
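Method 3 differs from Method 1 in that it has no second null check inside the synchronized block, so two threads that both pass the outer check can each construct an instance. The sketch below shows the usual double-checked locking pattern for comparison. It is not the actual JDK fix: the class `LazySingleton` is hypothetical, and the `volatile` modifier reflects the general requirement of the Java memory model (Java 5 and later) for this pattern rather than anything stated on the slide.

```java
// Hypothetical sketch of double-checked locking; not taken from the JDK sources.
class LazySingleton {
    // volatile ensures other threads see a fully constructed instance.
    private static volatile LazySingleton instance;

    private LazySingleton() {}

    static LazySingleton get() {
        if (instance == null) {                      // first check, without locking
            synchronized (LazySingleton.class) {
                if (instance == null) {              // second check, under the lock
                    instance = new LazySingleton();
                }
            }
        }
        return instance;
    }
}
```

With the second check in place, a thread that waited for the lock sees the instance created by the first thread and does not replace it.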

  24. Conclusion • We propose a code clone detection approach, CMCD: • Extracts count-based information • Language independent • Scales to large programs (> 1M LoC) • Capabilities • Performs well in scenario-based evaluation • Detects code plagiarism in students’ homework • Identifies a potential bug in JDK source code
