Improving Automatic Abbreviation Expansion within Source Code to Aid in Program Search Tools Zak Fry
Outline • Problem and Motivation • Automatically Identifying Abbreviation Expansions • A Scoped Approach • Analysis and Refinement: iScope • Evaluations • Conclusions
Maintenance Tasks • 60-90% of software lifecycle • Problem: identifying where the relevant code is, i.e. where changes need to be made • Code that performs a given task can be scattered across the program • This scattering causes difficulty for current maintenance search tools
Challenges - Coding Practices • Identifier names important for code documentation and understanding • Problem: Programmers’ use of abbreviations in code • Frequency of occurrence • character, integer, string • Complex inheritance – long class names • SecureMessageServiceClientMessageImpl • Negates usefulness of identifier names and complicates program understanding
Abbreviations and Maintenance Tools • Problem: Search based maintenance tools rely on natural language • Abbreviations change the natural language • Search Term: “distributed hash”
    dht = (DHTPlugin)dht_pi.getPlugin();
    Thread t = new AEThread( "DHTTrackerPlugin:init" ) {
        public void runSupport() {
            try{
                if ( dht.isEnabled()){
                    log.log( "DDB Available" );
                }
            } catch( Throwable e ){
                log.log( "DDB Failed", e );
            }
            ...
        }
    }
Automatically Identifying Abbreviation Expansions • First, how do we identify candidates for expansion? • Non-dictionary words • Terminology: • Abbreviation = short form • Expansion = long form
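The candidate-identification step can be sketched as follows. This is a minimal illustration, not the deck's implementation: the class `CandidateFinder`, its splitting rules, and the tiny dictionary passed in are all assumptions made for the example. It splits an identifier into words and keeps the ones that are not dictionary words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: flags identifier tokens that are not dictionary words.
final class CandidateFinder {

    // Split an identifier on camel-case boundaries, underscores, and digits.
    static List<String> split(String identifier) {
        String spaced = identifier
                .replaceAll("([a-z])([A-Z])", "$1 $2")        // fooBar -> foo Bar
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1 $2")  // DHTTracker -> DHT Tracker
                .replaceAll("[_0-9]+", " ");                  // dht_pi -> dht pi
        List<String> tokens = new ArrayList<>();
        for (String token : spaced.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase());
            }
        }
        return tokens;
    }

    // A token is a candidate short form when the dictionary does not contain it.
    static List<String> candidates(String identifier, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        for (String token : split(identifier)) {
            if (!dictionary.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }
}
```

With a dictionary containing "tracker" and "plugin", the identifier `DHTTrackerPlugin` yields the single candidate "dht".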
State of the Art • Lawrie, Feild, and Binkley • Abbreviation Expansion • Problem: • Lack of precision • No support for choosing between multiple matches
Scoped Approach • How to choose between multiple possible long forms: • By manual inspection we found correct long forms are more likely to be found in certain locations • Also, correctly identifying the long forms for certain types of abbreviations is easier than for others
General Algorithm • [Figure: general algorithm, covering the acronym and prefix abbreviation types]
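The slide names two abbreviation types, acronym and prefix, without giving their matching rules. The sketch below shows one plausible reading of each; the class `AbbreviationPatterns` and both rules are assumptions for illustration and may differ from the patterns used in the actual tool.

```java
import java.util.List;

// Hypothetical matching rules for the two abbreviation types named on the slide.
final class AbbreviationPatterns {

    // Prefix type: the short form is the beginning of one longer word,
    // e.g. "str" for "string".
    static boolean isPrefixOf(String shortForm, String word) {
        return word.length() > shortForm.length()
                && word.toLowerCase().startsWith(shortForm.toLowerCase());
    }

    // Acronym type: each letter of the short form is the first letter of
    // the corresponding word, e.g. "dht" for "distributed hash table".
    static boolean isAcronymOf(String shortForm, List<String> words) {
        if (shortForm.isEmpty() || shortForm.length() != words.size()) {
            return false;
        }
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            if (word.isEmpty()
                    || Character.toLowerCase(word.charAt(0))
                       != Character.toLowerCase(shortForm.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Note that a prefix rule this permissive accepts both "string" and "stream" for "str", which is exactly the multiple-match situation the scoped approach has to resolve.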
Multiple matches • We assume one best candidate though multiple might be present at the same level of scope • If multiple matches: • Examine frequencies • Stem long forms and reexamine frequencies • Broaden Scope and reexamine frequencies • Most frequent expansion
Most Frequent Expansion (MFE) • If still no ideal candidate is found: • We mined long forms from 1.5 million LOC of Java 5 code base • Return most frequent long form as last resort
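The selection among multiple matches, ending in the MFE fallback, can be sketched as below. This is a simplified illustration under stated assumptions: the class `ExpansionChooser` is hypothetical, each inner list is assumed to hold every long form visible at that scope (narrowest scope first), the stemming step is omitted, and the MFE table is a plain map supplied by the caller.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: pick a long form by frequency, broadening scope on ties.
final class ExpansionChooser {

    // Returns the strictly most frequent candidate, or empty on a tie or no candidates.
    static Optional<String> uniqueMostFrequent(List<String> candidates) {
        Map<String, Integer> counts = new HashMap<>();
        for (String candidate : candidates) {
            counts.merge(candidate, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            int count = entry.getValue();
            if (count > bestCount) {
                best = entry.getKey();
                bestCount = count;
                tie = false;
            } else if (count == bestCount) {
                tie = true;
            }
        }
        return (best == null || tie) ? Optional.empty() : Optional.of(best);
    }

    // Try each scope from narrowest to broadest; fall back to the MFE table last.
    static Optional<String> choose(List<List<String>> candidatesByScope,
                                   Map<String, String> mostFrequentExpansion,
                                   String shortForm) {
        for (List<String> scope : candidatesByScope) {
            Optional<String> winner = uniqueMostFrequent(scope);
            if (winner.isPresent()) {
                return winner;
            }
        }
        return Optional.ofNullable(mostFrequentExpansion.get(shortForm));
    }
}
```

Treating a tie as "no answer at this scope" is a design choice of the sketch; it forces the search to broaden rather than pick arbitrarily.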
Evaluation of Scoped Approach • 250 abbreviations from 5 subject programs • Gold standard developed by a human developer inspecting the code manually • Implemented LFB according to its description, except for combination words, because the required database was missing • [Chart: accuracy]
Analysis and Refinement - iScope • Analyzed results and found 3 major sources of problems • Developed iScope by addressing these 3 major problem areas
Order of Scoping • Problem: • Scoped approach ordering: examine every context for an abbreviation type then go to next type • Investigating broader contexts for one type before even the narrowest context for another type is likely to yield incorrect matches • Insight: Context is more sensitive than type • Solution: Check each type at each context level, then go to next context level (switch order)
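The switch in ordering amounts to swapping two nested loops, as the sketch below shows. The class `ScopeOrder`, the `search` callback, and the type and context names used in the usage note are hypothetical; the deck does not list the concrete context levels.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical sketch contrasting the two search orders.
final class ScopeOrder {

    // Original ordering: exhaust every context for one type before trying the next type.
    static Optional<String> typeFirst(List<String> types, List<String> contexts,
            BiFunction<String, String, Optional<String>> search) {
        for (String type : types) {
            for (String context : contexts) {
                Optional<String> hit = search.apply(type, context);
                if (hit.isPresent()) {
                    return hit;
                }
            }
        }
        return Optional.empty();
    }

    // Switched ordering: try every type at the narrowest context before broadening.
    static Optional<String> contextFirst(List<String> types, List<String> contexts,
            BiFunction<String, String, Optional<String>> search) {
        for (String context : contexts) {
            for (String type : types) {
                Optional<String> hit = search.apply(type, context);
                if (hit.isPresent()) {
                    return hit;
                }
            }
        }
        return Optional.empty();
    }
}
```

If a prefix match exists only at a broad context while an acronym match exists at the narrowest one, `typeFirst` with types ordered prefix then acronym returns the broad prefix match, whereas `contextFirst` returns the narrow acronym match.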
Single Letter Abbreviations • Problem: • Developers use single letter abbreviations differently than multiple letter abbreviations • A large subset are actually semantically meaningless • Single letters are very easily matched, especially because prefix matching is greedy • Example: Reader r = new BufferedReader() • Insight: Based on manual inspection, we found that meaningful single letter short forms were identifiers whose long forms were also their type name • Solution: Limit contextual scope to type only
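Limiting single letter abbreviations to the type name can be sketched as below. The class `SingleLetterExpander` and the exact rule (the type name must start with the letter) are assumptions made for illustration; the deck states only that the search scope is restricted to the type.

```java
import java.util.Optional;

// Hypothetical sketch: a single letter expands only to its declared type name.
final class SingleLetterExpander {

    static Optional<String> expand(String shortForm, String declaredTypeName) {
        if (shortForm.length() != 1 || declaredTypeName.isEmpty()) {
            return Optional.empty();
        }
        char letter = Character.toLowerCase(shortForm.charAt(0));
        char typeInitial = Character.toLowerCase(declaredTypeName.charAt(0));
        // No match against the type name: treat the letter as semantically meaningless.
        return letter == typeInitial ? Optional.of(declaredTypeName) : Optional.empty();
    }
}
```

For the declaration `Reader r`, the sketch expands `r` to `Reader`; a loop counter `i` declared as `int` would also expand to its type, while `i` declared as `Reader` is left unexpanded.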
Hyper-Common Abbreviations • Problem: Some abbreviations are used so often in code that the long form rarely co-occurs, leading to incorrect expansions based on coincidence • Solution: Mine a small set of extremely common abbreviations and use it as a preprocessing step
Evaluations • Is our method accurate enough to be useful? • Reevaluation of previous experiment • Does abbreviation expansion help maintenance tasks? • Simple Search • Concern Location Task
1. Reevaluation of Previous Test • Based on our previous experimental methodology and metrics, how much improvement was made from Scope to iScope? • Modified the gold standard based on the new assumptions about single letter abbreviations
1. Reevaluation of Previous Test - Results • Compare LFB with Scope and iScope using non-combination word (NCW) accuracy values • Compare JavaMFE, ProgMFE, Scope, and iScope using the total accuracy values
2. Simple Search Evaluation • When abbreviations are expanded in software, how many more search results are returned than without expansion? • Focus: Recall • Not missing important results: want as many potentially relevant results as possible • Metric: Percent increase in results • P.I. = (Raw returned results with expansion / Raw returned results without expansion) × 100% - 100%
2. Simple Search Evaluation (cont) • Subjects: 215 concerns (Eaddy et al.) annotated by 3 people each for a total of 645 queries • Developed independently of the idea of abbreviation expansion, so many queries might not be affected by abbreviation expansion at all • “Match”: if any word in the query matches any word in the method, the method is considered a match and returned as a result
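The match rule and the percent-increase metric can be sketched in a few lines. The class `SimpleSearch` is hypothetical, and representing queries and methods as sets of lower-cased words is an assumption of the sketch.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of the simple search match rule and the P.I. metric.
final class SimpleSearch {

    // A method matches when it shares at least one word with the query.
    static boolean matches(Set<String> queryWords, Set<String> methodWords) {
        return !Collections.disjoint(queryWords, methodWords);
    }

    // Percent increase in raw returned results due to expansion.
    static double percentIncrease(int resultsWithExpansion, int resultsWithoutExpansion) {
        if (resultsWithoutExpansion <= 0) {
            throw new IllegalArgumentException("baseline result count must be positive");
        }
        return 100.0 * resultsWithExpansion / resultsWithoutExpansion - 100.0;
    }
}
```

For example, a method containing only the words "dht" and "plugin" does not match the query "distributed hash", but it does once "dht" has been expanded to "distributed hash table". If expansion raises the result count from 100 to 150, the percent increase is 50.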
2. Simple Search Evaluation - Results • Smaller increase with iScope, because it produces fewer single letter abbreviation false positives • Ideally, this means result quality is better • See experiment 3
3. Evaluation with Concern Location • Concern location task: identification of methods that are deemed to be relevant for the given search term • How much increase in effectiveness can be gained from expanding abbreviations in source code when performing concern location tasks?
3. Evaluation Methodology • Tools: Latent Semantic Indexing (LSI) and Log Entropy-based concern location • Goals: Attempt to calculate similarity values based on location and frequency of potential query matches • Subjects: same as previous experiment
3. Methodology (cont) • Metric: Mean Average Precision (MAP) • Precision: # True positives / Total # of returned positives • MAP: • Collect a precision value at every new true positive, going down the ranked returned results • Then take the average of all collected precision values • Attempts to reward highly ranked true positives
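The MAP computation described above can be sketched as follows. The class `MeanAveragePrecision` is hypothetical, each ranked result list is represented as booleans (true for a true positive), and the sketch averages over the true positives found in the ranking, which is an assumption about how the evaluation handled relevant methods that were never returned.

```java
import java.util.List;

// Hypothetical sketch of average precision per query and its mean over queries.
final class MeanAveragePrecision {

    // Walk down the ranking; at every true positive, record precision at that rank.
    static double averagePrecision(List<Boolean> rankedRelevance) {
        int truePositives = 0;
        double precisionSum = 0.0;
        for (int rank = 1; rank <= rankedRelevance.size(); rank++) {
            if (rankedRelevance.get(rank - 1)) {
                truePositives++;
                precisionSum += (double) truePositives / rank;
            }
        }
        return truePositives == 0 ? 0.0 : precisionSum / truePositives;
    }

    // Mean of the per-query average precision values.
    static double meanAveragePrecision(List<List<Boolean>> rankingsPerQuery) {
        if (rankingsPerQuery.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (List<Boolean> ranking : rankingsPerQuery) {
            total += averagePrecision(ranking);
        }
        return total / rankingsPerQuery.size();
    }
}
```

A ranking of true, false, true collects precision values 1/1 and 2/3, for an average precision of 5/6; moving the second true positive up to rank 2 would raise it to 1, which is how the metric rewards highly ranked true positives.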
Conclusions • Abbreviation expansion is proven to be helpful in maintenance tools and processes • iScope approach improves upon Scope and greatly upon state-of-the-art
Future Work • Further refinement of expansion process to achieve highest possible accuracy • Full integration into maintenance tool • Extension into other programming languages
Acknowledgments • Emily Hill and Haley Boyd • Dr. Vijay K. Shanker and Dr. Lori Pollock
Inherent Inaccuracy • Problem: Additional errors in code are not generalizable into solvable problems • Insight: There will always be inherent error when developing automatic systems for non-standard input