Retrieving Similar Code Fragments based on Identifier Similarity for Defect Detection

Retrieving Similar Code Fragments based on IdentifierSimilarity for Defect Detection Norihiro Yoshida Takashi Ishio Makoto Matsushita Katsuro Inoue (Osaka University)

Similar code fragment • A code fragment that has similar part to it in source code • introduced in source code because of various reasons. • e.g. “copy-and-paste” • makes software maintenance difficult. It is necessary to check CF2 and CF3 It is necessary to check a2. Source file Source file Similar code fragment If CF1 is defective… CF2 CF1 CF3

Similar defects in Linux 2.6.6 for(iter=0; iter<num_regs; iter++) { prom_prom_taken[iter].start_adr = prom_reg_memlist[iter].phys_addr; prom_prom_taken[iter].num_bytes = prom_reg_memlist[iter].reg_size; prom_prom_taken[iter].theres_more = &prom_phys_total[iter+1]; // should be:&prom_prom_taken[iter+1]; } for(iter=0; iter<num_regs; iter++) { prom_prom_taken[iter].start_adr = (char *) prom_reg_memlist[iter].phys_addr; prom_prom_taken[iter].num_bytes = (unsigned long) prom_reg_memlist[iter].reg_size; prom_prom_taken[iter].theres_more = &prom_phys_total[iter+1]; // should be:&prom_prom_taken[iter+1]; }

Similar defects in Linux 2.6.6 for(iter=0; iter<num_regs; iter++) { prom_prom_taken[iter].start_adr = prom_reg_memlist[iter].phys_addr; prom_prom_taken[iter].num_bytes = prom_reg_memlist[iter].reg_size; prom_prom_taken[iter].theres_more = &prom_phys_total[iter+1]; // should be:&prom_prom_taken[iter+1]; } Type cast operations are inserted. for(iter=0; iter<num_regs; iter++) { prom_prom_taken[iter].start_adr = (char *) prom_reg_memlist[iter].phys_addr; prom_prom_taken[iter].num_bytes = (unsigned long) prom_reg_memlist[iter].reg_size; prom_prom_taken[iter].theres_more = &prom_phys_total[iter+1]; // should be:&prom_prom_taken[iter+1]; } Clone detection tools cannot treat the code fragments as a clone pair.

An overview of proposed method Input code fragment (Query) Input identifier list Lexical Analysis Ii[0] Ii[ni] Similar sublists Is1[0] Is1[ns1] Comparison Is2[0] Is2[ns2] Target source files Target identifier lists Isn[0] Isn[nsn] It1[0] It1[nt1] Lexical Analysis It2[0] It2[nt2] Ranking Itn[0] Itn[ntn] Similarity Ranking The method retrieves code fragments similar to an input code fragment.

Comparison • Scan a target identifier list with a sliding window • We compare identifiers in the sliding window with the input identifier list. • Extract a code fragment corresponding to the sliding window if the window involves one or more identifiers in the input list Input identifier list Ii[0] Ii[1] Ii[2] Target identifier list It[n-1] It[n] It[0] It[1] It[2] It[3] Sliding Window （fixed length） The direction of movement of the sliding window

Similarity-based ranking • The extracted code fragments aresorted according to the following similarity. • Si :a set of elements in an input identifier list • Sw: a set of elements in a sliding window • Developers investigate the resultant similarity-based ranking.

Case Study • Target open source software systems • arch/ directory in Linux 2.6.6 • Architecture-specific implementations in OS • 2 incorrect pointer accesses • server/ directory in Canna 3.6 • Japanese input system • 19 buffer overflow errors • Procedure • extract code fragments sharing similar defects • enter each code fragment into the tool implementing our method • inspect if the similarity ranking ranks highly code fragments involving defects

Result • Linux 2.6.6 • We used 2 code fragments as queries. • Each code fragment involves an incorrect pointer access. • In both of those queries, the 2 code fragments are the top 2. • Canna 3.6 • We used 19 code fragments as queries. • Each code fragment involves a buffer overflow error. • In all of those queries, 18 or 19 code fragments are the top 30. In our case studies, we could detect most of similar defects.

Summary & Future work • We proposed a method to retrieve similar code fragments based on identifier similarity. • Sliding window comparison • Similarity-based ranking • We need further case studies. • Application to similar defects in other software systems • Effects from changing “similarity” definition

Retrieving Similar Code Fragments based on Identifier Similarity for Defect Detection

Retrieving Similar Code Fragments based on Identifier Similarity for Defect Detection

Presentation Transcript

Object Recognition Based on Shape Similarity

Face Detection on Similar Color Images

CMCD: Count Matrix based Code Clone Detection

Defect Prevention Defect Detection

Scalable Defect Detection

[Image Similarity Based on Histogram]

On Link-based Similarity Join

Bayesian Bot Detection Based on DNS Traffic Similarity

Finding Similar Defects Using Synonymous Identifier Retrieval

Matching Similarity for Keyword - based Clustering

Continuous Similarity-Based Queries on

Feature Based Similarity

Feature Based Similarity

A Similar Fragments Merging Approach to Learn Automata on Proteins

Similarity-based matching for face authentication

Learning Embeddings for Similarity-Based Retrieval

Failure Detection Oriented Defect Removal

Improving Spam Detection Based on Structural Similarity

Identifying Source Code Reuse across Repositories using LCS-based Source Code Similarity

100% Inspection For Printing & Defect Detection

[Image Similarity Based on Histogram]

Similarity based deduplication