CMCD: Count Matrix based Code Clone Detection • Yang Yuan and Yao Guo • Key Laboratory of High-Confidence Software Technologies (Ministry of Education), Peking University
Code Clones • In software development, it is common to reuse code fragments by copying them, with or without minor modifications. • Such code fragments are called code clones. [Jurgens et al., ICSE 2009]
Scenario-based Evaluation • Example of Scenario #1 (figure: Original vs. Copy)
Scenario-based Evaluation • Example of Scenario #2 (figure: Original vs. Copy)
Scenario-based Evaluation • Example of Scenario #3 (figure: Original vs. Copy)
Scenario-based Evaluation • Example of Scenario #4 (figure: Original vs. Copy)
Importance of Code Clones • Code clones cause problems: • Increase the complexity of the source code • Increase the maintenance cost of the software system • Increase the likelihood of bugs • 7%-23% of the code in large software systems is cloned. [Roy et al., SCP 2009] • Detecting code clones may help: • Analyze the programming habits of the programmers • Find the design patterns in the source code
Previous Work in Clone Detection • Lower level: • Textual approach • SDD [Lee and Jeong, OOPSLA 2005] • NICAD [Roy and Cordy, ICPC 2008] • … • Lexical approach • DUP [Baker, WCRE 1995] • CCFinder [Kamiya et al., TSE 2002] • CP-Miner [Li et al., OSDI 2004, TSE 2006] • …
Previous Work in Clone Detection • Higher level: • Syntactic approach • CloneDR [Baxter et al., ICSM 1998] • Deckard [Jiang et al., ICSE 2007] • CloneDigger [Bulychev, SyRCoSE 2008] • … • Semantic approach • Duplix [Krinke, WCRE 2001] • GPLAG [Liu et al., KDD 2006] • …
Challenges • Low-level approaches: • Faster • Usually focus on local characteristics • No idea about global meaning • High-level approaches: • Slower • Better understanding of the programs • Difficult to scale • There is a GAP between the two
Our Idea • A novel count matrix based clone detection approach. • Benefits of counting: • By ignoring the order of variables, it can identify clones in which statements have been swapped, which is difficult for both lexical and syntactic approaches. • Easy to calculate and implement • Reduces space and time complexity
Count Matrix Construction • Example token stream: tot, =, n, +, Find, (, n, ), for, i, =, 1, to, n, -, 1, if, a, [, i, ], >, a, [, j, ], k, =, a, [, i, ], …
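The construction step can be illustrated with a minimal sketch that turns a token stream like the one on this slide into one count vector per variable. The slide does not list the counting conditions CMCD uses, so the three conditions here (USED, ASSIGNED, SUBSCRIPT), the class name CountMatrixSketch, and the keyword set passed in are all assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CountMatrixSketch {
    // Hypothetical counting conditions; the slides do not enumerate the real ones.
    static final int USED = 0;       // every occurrence of the variable
    static final int ASSIGNED = 1;   // occurrence immediately followed by "="
    static final int SUBSCRIPT = 2;  // occurrence inside "[ ... ]"

    /** Builds a map from variable name to its count vector (one row of the matrix). */
    static Map<String, int[]> build(List<String> tokens, Set<String> keywords) {
        Map<String, int[]> matrix = new LinkedHashMap<>();
        int bracketDepth = 0;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals("[")) { bracketDepth++; continue; }
            if (token.equals("]")) { bracketDepth--; continue; }
            String next = i + 1 < tokens.size() ? tokens.get(i + 1) : "";
            // Skip operators, literals, keywords, and names followed by "(" (treated as calls).
            if (!isIdentifier(token) || keywords.contains(token) || next.equals("(")) continue;
            int[] counts = matrix.computeIfAbsent(token, name -> new int[3]);
            counts[USED]++;
            if (next.equals("=")) counts[ASSIGNED]++;
            if (bracketDepth > 0) counts[SUBSCRIPT]++;
        }
        return matrix;
    }

    private static boolean isIdentifier(String token) {
        return !token.isEmpty() && Character.isJavaIdentifierStart(token.charAt(0));
    }
}
```

With for, to, and if supplied as keywords, the token stream above gives variable i the row (3, 1, 2): three occurrences, one assignment, two uses as a subscript. Because each count depends only on a token's immediate context, reordering whole statements leaves every row unchanged.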
Comparison Algorithms • Goal: • Find more scenario #4 clones, covering more transformations such as statement swapping • Run fast • General principles: • Compare individual variables instead of variable sequences • Ignore the order of variables in the count matrix
Bipartite Graph Matching • Use bipartite graph matching to find code clones at different granularities: • Bottom-up approach • Can be used to compute the similarity between two projects, two classes, or two methods • Use two kinds of bipartite graph matching: • KM algorithm (low-level, slow, accurate) • Hungarian algorithm (high-level, fast, inaccurate)
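As a rough illustration of the faster variant, the sketch below computes a maximum unweighted bipartite matching with augmenting paths. It assumes that "Hungarian algorithm" on this slide refers to that unweighted method (the KM algorithm being the weighted one) and that a table similar[i][j] has already been computed, saying whether item i of one fragment is close enough to item j of the other. The class name and the input format are hypothetical and not taken from CMCD.

```java
import java.util.Arrays;

final class VariableMatchingSketch {
    /** Returns the size of a maximum matching; similar[i][j] marks an allowed pair. */
    static int maxMatching(boolean[][] similar) {
        int rightCount = similar.length == 0 ? 0 : similar[0].length;
        int[] matchOfRight = new int[rightCount]; // left index matched to each right node
        Arrays.fill(matchOfRight, -1);
        int matched = 0;
        for (int left = 0; left < similar.length; left++) {
            // A fresh visited array per left node keeps each search linear in the edges.
            if (augment(left, similar, new boolean[rightCount], matchOfRight)) {
                matched++;
            }
        }
        return matched;
    }

    /** Tries to match 'left', re-routing earlier matches along an augmenting path. */
    private static boolean augment(int left, boolean[][] similar,
                                   boolean[] visited, int[] matchOfRight) {
        for (int right = 0; right < visited.length; right++) {
            if (!similar[left][right] || visited[right]) continue;
            visited[right] = true;
            if (matchOfRight[right] == -1
                    || augment(matchOfRight[right], similar, visited, matchOfRight)) {
                matchOfRight[right] = left;
                return true;
            }
        }
        return false;
    }
}
```

The ratio of matched pairs to the total number of items could then serve as a similarity score. Applying the same routine to variables, then methods, then classes would be one way to realize the bottom-up scheme named on the slide.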
Optimization • Use the Euclidean metric to compute the similarity of count vectors (CVs) • Use a quick-rejection algorithm to improve speed • Eliminate false positives: • Cut and check • Slice and match
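The first two bullets can be sketched as follows. The slide does not say how quick rejection works in CMCD, so the test used here is an assumption: by the reverse triangle inequality, the difference between two vectors' norms never exceeds their Euclidean distance, so a pair whose precomputed norms differ by more than the threshold can be discarded without computing the full distance. The class name, method names, and threshold parameter are hypothetical.

```java
final class CountVectorDistanceSketch {
    /** Euclidean norm of a count vector; can be computed once per vector and cached. */
    static double norm(int[] v) {
        double sum = 0;
        for (int x : v) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    /** Euclidean distance between two count vectors of equal length. */
    static double distance(int[] u, int[] v) {
        double sum = 0;
        for (int k = 0; k < u.length; k++) {
            double d = u[k] - v[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** True when u and v lie within 'threshold'; normU and normV are precomputed. */
    static boolean withinThreshold(int[] u, double normU,
                                   int[] v, double normV, double threshold) {
        // Assumed quick rejection: | ||u|| - ||v|| | <= ||u - v||, so a large
        // gap between the norms already rules the pair out.
        if (Math.abs(normU - normV) > threshold) {
            return false;
        }
        return distance(u, v) <= threshold;
    }
}
```

The rejection test costs one subtraction per pair, which matters when every variable of one fragment is compared against every variable of another.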
Implementation • Use Soot to convert Java to Jimple • [Vallée-Rai et al., CASCON 1999] • 3-address intermediate representation • Smaller language set • Breaks complex statements into basic ones • Does not change the meaning of the program • A new version of CMCD without using Soot
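To show what breaking complex statements into basic ones means, the sketch below writes the same computation twice in plain Java: once as a nested expression and once with at most one operator per statement, in the spirit of a 3-address form. This is not actual Jimple output. The class name and the temporaries t1 and t2 are hypothetical.

```java
final class ThreeAddressSketch {
    /** Original form: one statement containing a nested expression. */
    static int original(int a, int b, int c) {
        return a + b * c;
    }

    /** Lowered form: each statement applies at most one operator. */
    static int lowered(int a, int b, int c) {
        int t1 = b * c;   // inner multiplication gets its own temporary
        int t2 = a + t1;  // the addition then uses only simple operands
        return t2;
    }
}
```

Both methods return the same value for any inputs, which reflects the slide's point that the conversion does not change the meaning of the program while giving each operation its own statement to count.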
Scenario-based Evaluation • Based on the scenario classification from the paper by Roy et al., “Comparison and Evaluation of Code Clone Detection Techniques”
Detecting Plagiarism • Student-submitted compiler lab projects • 29 submissions • 106-251 Java classes • 7,825-38,086 lines of code • Experimental Results • Running time: 123 minutes • 2 clusters of code clones, each with 3 copies • Confirmed • Now used by two courses at Peking University for detecting plagiarism in students’ homework
Analyzing JDK 1.6 Source Code • JDK 1.6.0_18 • 7,197 files • 2,079,166 LoC • Experimental Results • Running time: 163 minutes • Found: 786 methods in 174 clusters (Small methods are omitted)
Code Comparison: Two Clones
Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)
    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } // end if
            } // end synchronized block
        } // end if
        return syncFactory;
    }
Method 2: (in javax.swing.JComponent)
    static Set<KeyStroke> getManagingFocusBackwardTraversalKeys() {
        synchronized (JComponent.class) {
            if (managingFocusBackwardTraversalKeys == null) {
                managingFocusBackwardTraversalKeys = new HashSet<KeyStroke>(1);
                managingFocusBackwardTraversalKeys.add(KeyStroke.getKeyStroke(
                    KeyEvent.VK_TAB, InputEvent.SHIFT_MASK | InputEvent.CTRL_MASK));
            }
        }
        return managingFocusBackwardTraversalKeys;
    }
Detected a Bug
Method 1: (in com.sun.corba.se.impl.ior.iiop.SyncFactory)
    public static SyncFactory getSyncFactory() {
        if (syncFactory == null) {
            synchronized (SyncFactory.class) {
                if (syncFactory == null) {
                    syncFactory = new SyncFactory();
                } // end if
            } // end synchronized block
        } // end if
        return syncFactory;
    }
Method 3: (in com.sun.corba.se.impl.ior.iiop.JavaSerializationComponent)
    public static JavaSerializationComponent singleton() {
        if (singleton == null) {
            synchronized (JavaSerializationComponent.class) {
                singleton = new JavaSerializationComponent(Message.JAVA_ENC_VERSION);
            }
        }
        return singleton;
    }
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999537
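Method 3 checks singleton before taking the lock but never re-checks it inside, so two threads that both pass the outer test can each construct an instance. The sketch below shows the conventional double-checked locking repair on a hypothetical class, LazySingletonSketch. It is not the actual JDK fix. The volatile modifier is an assumption here, since the slide does not show how the original fields are declared.

```java
final class LazySingletonSketch {
    // volatile makes the fully constructed object visible to other threads.
    private static volatile LazySingletonSketch instance;

    private LazySingletonSketch() {
    }

    static LazySingletonSketch get() {
        LazySingletonSketch local = instance;   // single volatile read on the fast path
        if (local == null) {
            synchronized (LazySingletonSketch.class) {
                local = instance;               // second check, now under the lock
                if (local == null) {
                    local = new LazySingletonSketch();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

The second null check is the part Method 1 has and Method 3 lacks, and that difference between two otherwise similar methods is what flagging them as clones brings to a reviewer's attention.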
Conclusion • We propose a code clone detection approach, CMCD: • Extracts count-based information • Language independent • Scales to large programs (> 1M LoC) • Capabilities: • Performs well in scenario-based evaluation • Detects code plagiarism in students’ homework • Identifies a potential bug in JDK source code