Building Domain Specific Dictionaries of Verb-Object Relation from Source Code
Yasuhiro Hayase†, Yu Kashima‡, Yuki Manabe‡, Katsuro Inoue‡
†: Faculty of Information Sciences and Arts, Toyo University
‡: Graduate School of Information Science and Technology, Osaka University
Program Comprehension
• Program comprehension consumes at least half of the time allocated to the software maintenance process. [1]
• Identifiers in source code are very important for program comprehension. [2][3]
• Software developers try to understand a program by guessing the roles of program elements from their identifiers.
[1] R. K. Fjeldstad and W. T. Hamlen. Application Program Maintenance Study: Report to Our Respondents
[2] A. von Mayrhauser and A. M. Vans. Identification of Dynamic Comprehension Processes During Large Scale Maintenance
[3] N. Pennington. Comprehension Strategies in Programming
Presenting Behavior Using Combinations of Identifiers
Combinations of multiple identifiers in source code represent program behaviors.
• A method has Verb-Object (V-O) relations. [4]
• Ex. class JMenu { void addMenuListener(MenuListener) { … reads as "add MenuListener to JMenu"
  • Verb: add
  • Direct Object (DO): MenuListener
  • Indirect Object (IO): JMenu
• Complex combinations of identifiers carry rich meaning.
• Understanding these combinations is important for program comprehension.
[4] D. Shepherd, L. Pollock, and K. Vijay-Shanker. Analyzing Source Code: Looking for Useful Verb-Direct Object Pairs in All the Right Places
Problem for Naming
• Developers need to learn the rules for various words and their combinations in different domains:
  • Programming language
  • Organization
  • Application domain
• If the rules are not documented, the only way to learn them is through examples.
Approach for Supporting Naming
• Problem
  • Learning through examples is a difficult and time-consuming task.
• Approach
  • Build a dictionary by collecting V-O relations from software products in a domain
  • Present good examples for appropriate naming
• Input
  • Software products in the same domain, written in an object-oriented programming language
• Output
  • A dictionary of <Verb, DO, IO> tuples
Overview of Method
• Input: software products in the same domain, plus extraction patterns prepared by hand
• Step 1. Obtaining method properties (return type, method name, argument, class name)
  • Ex. void addProduct, with the method name split into "add Product" and tagged Verb Noun; Product and Stock fill the argument and class-name fields, each tagged Noun
• Step 2. Extracting V-O relations by matching method properties against the extraction patterns
  • Ex. pattern: void | Verb1 Noun2 | Noun2 | Noun3
• Step 3. Filtering V-O relations
• Output: a dictionary of <Verb, DO, IO> tuples
Method Property
• A tuple of four sequences of words together with their parts of speech (POS, i.e., word classes): return type, method name, argument, and class name
• Split compound words, then perform POS tagging (OpenNLP [5] + several heuristics)
• Ex. Server class: void createTicketForUser(User)
  • Return type: void
  • Method name: create (Verb), Ticket (Noun), For (PrePos), User (Noun)
  • Argument: User (Noun)
  • Class name: Server (Noun)
[5] http://opennlp.sourceforge.net/projects.html
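The splitting and tagging step can be illustrated with a minimal sketch. It assumes camelCase identifiers and replaces the OpenNLP tagger and heuristics with a tiny hand-written lexicon; the class name `MethodPropertySketch`, the lexicon, and the "unknown words are nouns" default are hypothetical stand-ins, not the authors' implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class MethodPropertySketch {
    // Hypothetical toy lexicon standing in for a real POS tagger such as OpenNLP.
    private static final Map<String, String> TOY_LEXICON = Map.of(
            "create", "Verb",
            "add", "Verb",
            "for", "PrePos",
            "to", "PrePos");

    /** Splits a camelCase identifier at each lower-case-to-upper-case boundary. */
    static List<String> splitWords(String identifier) {
        List<String> words = new ArrayList<>();
        for (String word : identifier.split("(?<=[a-z0-9])(?=[A-Z])")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Tags each word from the toy lexicon; unknown words default to Noun (an assumption). */
    static List<String> tag(List<String> words) {
        List<String> tags = new ArrayList<>();
        for (String word : words) {
            tags.add(TOY_LEXICON.getOrDefault(word.toLowerCase(Locale.ROOT), "Noun"));
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> words = splitWords("createTicketForUser");
        System.out.println(words + " " + tag(words));
    }
}
```

For `createTicketForUser` this yields the words create, Ticket, For, User with the tags Verb, Noun, PrePos, Noun, matching the example on the slide. A real tagger is needed for words whose POS depends on context.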
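Extraction Pattern
• Structure spec: a constraint on each field of a method property
  • Return type: void, Noun, or wild card
  • Method name: POS sequence
  • Argument: Noun sequence or wild card
  • Class name: Noun or wild card
• Extraction spec: which words of a matched method property form the <Verb, DO, IO> tuple
• Ex. structure spec
  • Return type: void
  • Method name: Verb1 Noun2 PrePos3 Noun4
  • Argument: wild card
  • Class name: wild card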
Extracting V-O Relations
• Match a method property against a structure spec
• If the match succeeds, extract a <Verb, DO, IO> tuple according to the extraction spec
• Ex. method property of void createTicketForUser(User) in class Server
  • Return type: void
  • Method name: create (Verb), Ticket (Noun), For (PrePos), User (Noun)
  • Argument: User (Noun)
  • Class name: Server (Noun)
• Ex. extraction pattern (structure spec)
  • Return type: void
  • Method name: Verb1 Noun2 PrePos3 Noun4
  • Argument: wild card
  • Class name: wild card
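The matching step above can be sketched as follows. This is a simplified illustration, not the authors' tool: argument and class name are treated as wild cards, so only the return type and the POS sequence of the method name are checked. The extraction spec for this pattern is not preserved in the slides, so the sketch assumes Verb = word 1, DO = word 2, IO = word 4; `ExtractionSketch` and `ExtractionPattern` are hypothetical names.

```java
import java.util.List;
import java.util.Optional;

final class ExtractionSketch {
    /** Hypothetical, simplified extraction pattern; argument and class name are wild cards. */
    static final class ExtractionPattern {
        final String returnType;     // structure spec: required return type
        final List<String> namePos;  // structure spec: POS sequence of the method name
        final int verb;              // extraction spec: 1-based position of the Verb
        final int directObject;      // extraction spec: 1-based position of the DO
        final int indirectObject;    // extraction spec: 1-based position of the IO

        ExtractionPattern(String returnType, List<String> namePos,
                          int verb, int directObject, int indirectObject) {
            this.returnType = returnType;
            this.namePos = namePos;
            this.verb = verb;
            this.directObject = directObject;
            this.indirectObject = indirectObject;
        }
    }

    /** Returns the <Verb, DO, IO> tuple if the method property matches, otherwise empty. */
    static Optional<List<String>> extract(ExtractionPattern pattern, String returnType,
                                          List<String> nameWords, List<String> namePos) {
        boolean matches = pattern.returnType.equals(returnType)
                && nameWords.size() == namePos.size()
                && pattern.namePos.equals(namePos);
        if (!matches) {
            return Optional.empty();
        }
        return Optional.of(List.of(
                nameWords.get(pattern.verb - 1),
                nameWords.get(pattern.directObject - 1),
                nameWords.get(pattern.indirectObject - 1)));
    }

    public static void main(String[] args) {
        // Assumed extraction spec: Verb = word 1, DO = word 2, IO = word 4.
        ExtractionPattern pattern = new ExtractionPattern(
                "void", List.of("Verb", "Noun", "PrePos", "Noun"), 1, 2, 4);
        System.out.println(extract(pattern, "void",
                List.of("create", "Ticket", "For", "User"),
                List.of("Verb", "Noun", "PrePos", "Noun")));
    }
}
```

Under the assumed extraction spec, `createTicketForUser` produces the tuple <create, Ticket, User>; a method with a different return type or POS sequence produces no tuple.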
Evaluation
Evaluate the validity of the dictionaries built with our method.
• Overview
  • Prepared 31 extraction patterns by hand
  • Built 4 domain dictionaries with these patterns as the experimental targets
  • Six students in a software engineering laboratory evaluated tuples from the dictionaries through a questionnaire
Experimental Target
Built 4 domain dictionaries:
• Web application (WEB)
• XML processing (XML)
• Database (DB)
• GUI
Questionnaire Investigation
• Extracted 90 tuples at random from each dictionary
• Six students in a software engineering laboratory evaluated the tuples
Task Assignment
• Each participant was assigned two dictionaries for domains in which he/she has experience.
• The 90 tuples from each dictionary were divided among three participants.
• Each participant was assigned 30 tuples per dictionary.
Result (Q1)
• Q1. Is the V-O relation of the tuple actually used in the dictionary domain or in common Java programs? (higher is better)
• Ratio of tuples used in the dictionary domain: 62% to 75%
• Ratio of tuples used in common Java programs: 38% to 76%
• The dictionaries include:
  • Many tuples used in the dictionary domain
  • Tuples used in common Java programs
Result (Q2)
• Q2. Does the tuple include an inappropriate Verb, DO, or IO? (lower is better)
• Ratio of tuples including an inappropriate Verb, DO, or IO: 6% to 13%
• Most tuples consist of appropriate words.
Result (Q3)
• Q3. Is the tuple useful for appropriate naming of identifiers? (higher is better)
• Ratio of useful tuples in the dictionary domain: 53% to 71%
• Ratio of useful tuples in common Java programs: 30% to 61%
• The dictionaries include many useful tuples used in each domain.
Tuples Evaluated Useful in Q3
[Table: tuples evaluated useful in the dictionary domain]
Tuples Evaluated Not Useful in Q3
Reasons why tuples were evaluated as not useful. These tuples:
• belong to other domains
• contain uncertain words
• are common sense for average developers
• are used not across the whole domain, but only in programs that depend on a specific library
Discussion
• The dictionaries included tuples from other domains.
  • The threshold for filtering was too low to remove noise.
    • More input products are needed to use a higher threshold.
  • Some of the input products belong to multiple domains (e.g., both WEB and DB).
    • If a tuple appears in multiple dictionaries, treat the tuple specially.
• The POS tagger assigned inaccurate POS tags to words in a method.
  • Our POS tagger uses OpenNLP with several heuristics, but it was not effective in some cases.
    • Optimize the POS tagging method for words in a method
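The slides do not say what the filtering threshold measures. As one plausible reading, the sketch below assumes it is the number of distinct input products in which a tuple occurs, which is consistent with the remark that more input products are needed before a higher threshold can be used. `FilterSketch`, `minProducts`, and the product names in the usage example are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FilterSketch {
    /**
     * Keeps the tuples that occur in at least minProducts distinct products.
     * The key of tuplesByProduct is a product name; each tuple is a
     * three-element list of the form [Verb, DO, IO].
     */
    static Set<List<String>> filter(Map<String, List<List<String>>> tuplesByProduct,
                                    int minProducts) {
        Map<List<String>, Integer> productCount = new HashMap<>();
        for (List<List<String>> tuples : tuplesByProduct.values()) {
            // De-duplicate within a product so each product counts at most once per tuple.
            for (List<String> tuple : new HashSet<>(tuples)) {
                productCount.merge(tuple, 1, Integer::sum);
            }
        }
        Set<List<String>> kept = new HashSet<>();
        for (Map.Entry<List<String>, Integer> entry : productCount.entrySet()) {
            if (entry.getValue() >= minProducts) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> common = List.of("create", "Ticket", "User");
        List<String> rare = List.of("add", "Product", "Stock");
        // Hypothetical product names, for illustration only.
        Map<String, List<List<String>>> input = Map.of(
                "productA", List.of(common, common, rare),
                "productB", List.of(common));
        System.out.println(filter(input, 2));
    }
}
```

With a threshold of 2, only the tuple seen in both products survives. Raising the threshold removes more noise but also discards more genuine tuples when only a few products are available.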
Conclusion and Future Work
• Conclusion
  • Proposed an approach for building domain-specific dictionaries of V-O relations in methods
• Future Work
  • Develop a method for filtering out tuples from other domains
  • Develop an environment that supports naming with a dictionary built by our method