110 likes | 222 Views
Finding Similar Defects Using Synonymous Identifier Retrieval. Norihiro Yoshida , Takeshi Hattori, Katsuro Inoue Osaka University, Japan. Similar code fragment SF 1. Similar code fragment SF 2. Similar Code fragment. One of factors that make software maintenance more difficult.
E N D
Finding Similar Defects Using Synonymous Identifier Retrieval Norihiro Yoshida, Takeshi Hattori, Katsuro Inoue Osaka University, Japan
Similar code fragment SF1 Similar code fragment SF2 Similar Code fragment • One of factors that make software maintenance more difficult Source file B Source file A Modify it Code fragment CF It is necessary to determine whether or not modify them It is necessary to develop automatic code retrieval tool based on code similarity
Key Idea • In many cases, code fragments involving similar identifier names have the similar functionalities. • e.g., type, variable, function names • Developers often need to inspect those code fragments simultaneously. It is necessary to develop automatic code retrieval tool based on identifier similarity
SC-Retriever: Code retrieval tool based on identifier similarity • Retrieves code fragments that are similar to a query code fragment • Based on identifier similarity • e.g., type, variable, function name • Determines synonymous words in target source files Query Code Fragment Similar code fragments Identifier extraction Retrieval Target source files Identifier extraction Synonymous identifier determination
Why should we determine synonymous words in source code? • SC-Retriever needs to identify a set of code fragments that have similar functionalities • Different developer often uses different identifier names even if they implement the same functionalities It is necessary to determine synonymous words for identifying code fragments that have similar functionalities.
How to determine synonymous words? • We use an automatic synonymous words determination technique in NLP. • Dagan’s method[1], which is based on co-occurrence relation and do not use thesauruses and dictionaries. • His method detects a set of synonymous words often occurs a similar set of words in statements. • e.g., “Kids play soccer”, “Children play soccer” • Note that we should set threshold for synonymous words determination. Both “kids” and “children” co-occur with a set of words “play“ “soccer”. They are synonymous. [1] I. Dagan, L. Lee, and F. C. N. Pereira. Similarity-based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43–69, 1999.
How to match a query code fragment with target source files? • if code fragments have the same or synonymous identifiers as the query identifiers,…. • those code fragments are extracted as similar code fragments from the target source files Identifiers in query code fragment host alloc add host alloc add node node Identifiers in target source files
Case Study • Overview • conduct a case study with SC-Retriever and CCFinder • retrieve defective functions in 2 software systems • compare the efficiency of the retrieval • Target Systems • Canna(90KLOC, 2361 functions) • client-server Japanese character input system • Ver. 3.6 involves 19 buffer overflow defects • Those defects exist in 18 functions. • SPARS-J (36KLOC, 859 functions)
Experimental Step • choose defective code fragments from Canna source code • retrieve C functions in Canna source code. • SC-Retriever • we give 3 chosen code fragments as the queries. • CCFinder • we detect code clones for those 3 code fragments. • calculate the precisions, recalls and F-scores with the retrieved results and the bug records • F-score is harmonic average between precision and recall.
Resultant precisions, recalls, and F-scores • We set threshold for synonymous words determination, to 0.1 and 0.2. • If th is set to high value, a lot of synonymous words are detected • F-Scores of SC-Retriever are higher than those of CCFinder. • Recalls of SC-Retriever are relatively high • Precisions of CCFinder are relatively high • The results of SC-Retriever depends on queries and th
Future Work • Further case studies on defects in other software systems. • Code clone detection tool based on synonymous words determination • Method to calculate code fragment ranking based on identifier similarity • Other methods to determine synonymous words • LSI, dictionary, or thesauruses based method