A Human Study of Patch Maintainability Zachary P. Fry, Bryan Landau, Westley Weimer University of Virginia {zpf5a,bal2ag,weimer}@virginia.edu
Bug Fixing • Fixing bugs manually is difficult and costly. • Recent techniques explore automated patches: • Evolutionary techniques – GenProg • Dynamic modification – ClearView • Enforcement of pre/post-conditions – AutoFix-E • Program transformation via static analysis – AFix • While these techniques save developers time, there is some concern as to whether the patches produced are human-understandable and maintainable in the long run.
Questions Moving Forward How can we concretely measure these notions of human understandability and future maintainability? Can we automatically augment machine-generated patches to improve maintainability? In practice, are machine-generated patches as maintainable as human-generated patches?
Measuring quality and maintainability • Functional Quality (✓) – Does the implementation match the specification? • Does the code execute “correctly”? • Non-functional Quality (?) – Is the code understandable to humans? • How difficult is it to understand and alter the code in the future?
Software Functional Quality • Perfect: • Implementation matches specification • Direct software quality metrics: • Testing • Defect density • Mean time to failure • Indirect software quality metrics: • Cyclomatic complexity • Coupling and cohesion (CK metrics) • Software readability
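To ground one of these indirect metrics, here is a small illustrative C function (not an example from the paper; the function name and logic are made up). It contains three decision points (the loop condition and two if statements), so its cyclomatic complexity is 3 + 1 = 4.

#include <stdio.h>

/* Illustrative only: cyclomatic complexity is roughly one plus the number
 * of decision points. This function has three (the for condition and two
 * ifs), so its cyclomatic complexity is 4. */
static int count_positive_evens(const int *a, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {          /* decision 1 */
        if (a[i] <= 0) {                   /* decision 2 */
            continue;
        }
        if (a[i] % 2 == 0) {               /* decision 3 */
            count++;
        }
    }
    return count;
}

int main(void)
{
    int values[] = { 4, -2, 7, 10 };
    printf("%d\n", count_positive_evens(values, 4));   /* prints 2 */
    return 0;
}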
Software Non-functional Quality • Maintainability: • Human-centric factors affecting the ease with which bugs can be fixed and features can be added • Broadly related to the “understandability” of code • Harder to measure concretely with heuristics than functional correctness • Automatically-generated patches have been shown to be of high functional quality – what about non-functional quality?
Patch Maintainability Defined Rather than using an approximation to measure understandability, we will directly measure humans’ abilities to perform maintenance tasks Task: ask human participants questions that require them to read and understand a piece of code and measure the effort required to provide correct answers Simulate the maintenance process as closely as possible
PHP Bug #54454 • Title: “substr_compare incorrectly reports equality in some cases” • Bug description: • “if main_str is shorter than str, substr_compare [mistakenly] checks only up to the length of main_str” • substr_compare(“cat”, “catapult”) = true
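To make the failure concrete, here is a minimal C sketch of the flawed length handling (the two-argument buggy_substr_compare below is a hypothetical simplification for illustration; the actual PHP interpreter source appears in the motivating example that follows). The comparison length is silently clamped to the length of main_str, so a shorter main_str can never be found to differ from a longer str.

#include <stdio.h>
#include <string.h>

/* Hypothetical simplification of PHP bug #54454: the comparison length is
 * clamped to strlen(main_str), so "cat" compares equal to "catapult". */
static int buggy_substr_compare(const char *main_str, const char *str)
{
    size_t len = strlen(str);
    if (len > strlen(main_str)) {
        len = strlen(main_str);            /* BUG: truncates the comparison */
    }
    return memcmp(main_str, str, len);     /* 0 is reported as "equal" */
}

int main(void)
{
    printf("%d\n", buggy_substr_compare("cat", "catapult"));   /* prints 0 */
    return 0;
}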
Motivating Example
if (offset >= s1_len) {
    php_error_docref(NULL TSRMLS_CC, E_WARNING,
        "The start position cannot exceed string length");
    RETURN_FALSE;
}
if (len > s1_len - offset) {
    len = s1_len - offset;
}
cmp_len = (uint) (len ? len : MAX(s2_len, (s1_len - offset)));
Motivating Example
len--;
if (mode & 2) {
    for (i = len - 1; i >= 0; i--) {
        if (mask[(unsigned char)c[i]]) {
            len--;
        } else {
            break;
        }
    }
}
if (return_value) {
    RETVAL_STRINGL(c, len, 1);
} else {
Automatic Documentation • Intuitions suggest that patches augmented with documentation are more maintainable • Human patches can contain comments with hints as to the developer’s intention when changing code • Automatic approaches cannot easily reason about why a change is made, but can describe what was changed • Automatically Synthesized Documentation: • DeltaDoc (Buse et al. ASE 2010) • Measures semantic program changes • Outputs natural language descriptions of changes
Automatic Documentation
if (!con->conditional_is_valid[dc->comp]) {
    if (con->conf.log_condition_handling) {
        TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
    }
    /* If not con->conditional_is_valid[dc->comp]
       No longer return COND_RESULT_UNSET; */
    return COND_RESULT_UNSET;
}
/* pass the rules */
switch (dc->comp) {
case COMP_HTTP_HOST: {
    char *ck_colon = NULL, *val_colon = NULL;
Questions Moving Forward How can we concretely measure these notions of human understandability and future maintainability? Can we automatically augment machine-generated patches to improve maintainability? In practice, are machine-generated patches as maintainable as human-generated patches?
Evaluation Focused research questions to answer: • 1) How do different types of patches affect maintainability? • 2) Which source code characteristics are predictive of our maintainability measurements? • 3) Do participants’ intuitions about maintainability and its causes agree with measured maintainability? • To answer these questions directly we performed a human study using over 150 participants with real patches from existing systems
Experiment - Subject Patches • We used patches from six benchmarks over a variety of subject domains
Experiment - Subject Patches Original – the defective, un-patched code used as a baseline for measuring relative changes Human-Accepted – human patches that have not been reverted to date Human-Reverted – human-created patches that were later reverted Machine – automatically-generated patches created by the GenProg tool Machine+Doc – the same patches as above, but augmented with automatically synthesized documentation
Experiment – Maintenance Task • Sillito et al. – “Questions programmers ask during software evolution tasks” • Recorded and categorized the questions developers actually asked while performing real maintenance tasks • “What is the value of the variable “y” on line X?” • Not: “Does this type have any siblings in the type hierarchy?”
Human Study
…
15  if (dc->prev) {
16      if (con->conf.log_condition_handling) {
17          log_error_write(srv, __FILE__, __LINE__, "sb", "go prev", dc->prev->key);
18      }
19      /* make sure prev is checked first */
20      config_check_cond_cached(srv, con, dc->prev);
21      /* one of prev set me to FALSE */
22      if (COND_RESULT_FALSE == con->cond_cache[dc->context_ndx].result) {
23          return COND_RESULT_FALSE;
24      }
25
26  }
27
28  if (!con->conditional_is_valid[dc->comp]) {
29      if (con->conf.log_condition_handling) {
30          TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
31      }
32      return COND_RESULT_UNSET;
33  }
…
Human Study - Question Presentation
Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33? (recall, you can use inequality symbols in your answer)
Answer to the Question Above:
Human Study - Question Presentation
Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33? (recall, you can use inequality symbols in your answer)
Answer to the Question Above: False
Evaluation Metrics • Correctness – is the right answer reported? • Time – what is the “maintenance effort” associated with understanding this code? • We favor correctness over time • Participants were instructed to spend as much time as they deemed necessary to correctly answer the questions • The percentages of correct answers over all types of patches were not different in a statistically significant way • We focus on time, as it is an analog for the software engineering effort associated with program understanding
Type of Patch vs. Maintainability Effort = average number of minutes it took participants to report a correct answer for all patches of a given type relative to the original code
Characteristics of Maintainability • We measured various code features for all patches used in the human study • Using a logistic regression model, we can predict whether participants answered a given study question correctly 73.16% of the time • A Principal Component Analysis shows that 17 features account for 90% of the variance in the data • Modeling maintainability is a complex problem
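As a rough sketch of how such a model is applied (the feature names and coefficient values below are invented for illustration and do not come from the paper's fitted model), a logistic regression combines per-patch code features linearly and passes the sum through a sigmoid to estimate the probability that a participant answers a question correctly:

#include <math.h>
#include <stdio.h>

/* Hypothetical coefficients: each negative weight corresponds to a feature
 * assumed to make a patch harder to understand. */
static double predict_correct_probability(double num_statements,
                                          double num_identifiers,
                                          double max_nesting_depth)
{
    const double intercept = 2.1;
    const double w_stmts   = -0.015;
    const double w_idents  = -0.008;
    const double w_nesting = -0.220;

    double z = intercept
             + w_stmts   * num_statements
             + w_idents  * num_identifiers
             + w_nesting * max_nesting_depth;

    return 1.0 / (1.0 + exp(-z));          /* logistic (sigmoid) function */
}

int main(void)
{
    /* e.g., a 40-statement patch with 120 identifiers and nesting depth 3 */
    printf("P(correct) = %.2f\n", predict_correct_probability(40, 120, 3));
    return 0;
}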
Human Intuition vs. Measurement After completing the study, participants were asked to report which code features they thought increased maintainability the most
Conclusions • From a human study involving over 150 participants and patches fixing high-priority defects in real systems, we conclude: • Humans take less time, on average, to answer questions about machine-generated patches with automated documentation than about human-created patches, which supports the use of automatic patch generation techniques in practice • There is a strong disparity between human intuitions about maintainability and our measurements, so we think further study is merited in this area
Modified DeltaDoc • We modify DeltaDoc in the following ways: • Include all changes, regardless of length of output • Ignore all internal optimizations that lead to loss of information (e.g. ignore suspected unrelated statements) • Include all relevant programmatic information (e.g. function arguments) • Ignore all high-level output optimizations • Favor comprehensive explanations over brevity • Insert output directly above patches as comments
Experiment - Participants • Over 150 participants • 27 fourth-year undergraduate CS students • 14 CS graduate students • 116 Mechanical Turk internet participants • Accuracy cutoff imposed • Ensuring people don’t try to “game the system” requires special consideration • Any participant who failed to answer all questions, or who scored more than one standard deviation below the average undergraduate score, was removed
Experiment - Questions • What conditions must hold to always reach line X during normal execution? • What is the value of the variable “y” on line X? • What conditions must be true for the function “z()” to be called on line X? • At line X, which variables must be in scope? • Given the following values for relevant variables (e.g., Y=5 && Z=True), what lines are executed beginning at line X?