Empirical Evaluation of innovations in automatic repair Claire Le Goues Site visit February 7, 2013
“Benchmarks set standards for innovation, and can encourage or stifle it.” - Blackburn et al.
Automatic program repair over time • 2009: 15 papers on automatic program repair* • 2011: Dagstuhl seminar on self-repairing programs • 2012: 30 papers on automatic program repair* • 2013: dedicated program repair track at ICSE • *Counts come from a manual review of the results of a search of the ACM Digital Library for “automatic program repair”.
Current approach • Manually sift through bugtraq data. • Indicative example: the Axis project for automatically repairing concurrency bugs. • 9 weeks of sifting to find 8 bugs to study. • Direct quote from Charles Zhang, senior author, on the process: “it's very painful.” • It is very difficult to compare against previous or related work, or to generate sufficiently large datasets.
Benchmark Requirements • Indicative of important real-world bugs, found systematically in open-source programs. • Support a variety of research objectives. • “Latitudinal” studies: many different types of bugs and programs. • “Longitudinal” studies: many iterative bugs in one program. • Scientifically meaningful: passing the test cases should indicate a repair. • Admit push-button, simple integration with tools like GenProg.
Systematic Benchmark Selection • Goal: a large set of important, reproducible bugs in non-trivial programs. • Approach: use historical data to approximate discovery and repair of bugs in the wild. • http://genprog.cs.virginia.edu
New bugs, new programs • Indicative of important real-world bugs, found systematically in open-source programs: • Add new programs to the set, with as wide a variety of types as possible (support “latitudinal” studies) • Support a variety of research objectives: • Allow studies of iterative bugs, development, and repair: generate a very large (100) set of bugs in one program (php) (support “longitudinal” studies).
Test Case Challenges • They must exist. • Sometimes, but not always, true (see: Jonathan Dorn). • They should be of high quality. • This has been a challenge from day 0: nullhttpd. • Lincoln Labs noticed it too: sort. • In both cases, adding test cases led to better repairs. • They must be automated to run one at a time, programmatically, from within another framework.
Push-button Integration • Need to be able to compile and run new variants programmatically. • Need to be able to run test cases one at a time. • It’s not simple, and it becomes increasingly tricky as we scale up to real-world systems. • Much of the challenge is unrelated to the program in question, instead requiring highly technical knowledge of OS-level details.
Digression on wait() • Calling a process from within another process: • system("run test 1") ...; wait() • wait() returns the process exit status. • This is complex. • Example: a system call can fail because the OS ran out of memory in creating the process, or because the process itself ran out of memory. • How do we tell the difference? • Answer: bit masking.
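The bit masking mentioned above can be sketched in Python. This is a minimal illustration that assumes the traditional Unix layout of the status word (low 7 bits hold the terminating signal, the next byte holds the exit code); that layout, the function name `decode_wait_status`, and the returned labels are assumptions, not taken from the slides. Production code would normally rely on the platform's own macros (`WIFEXITED`, `WEXITSTATUS`, `WIFSIGNALED`, `WTERMSIG`) rather than hand-written masks.

```python
def decode_wait_status(status):
    """Split a wait()-style status word into (kind, value).

    Assumes the traditional Unix encoding of the status word; the layout
    is an assumption for illustration, not something the slides specify.
    """
    if status == -1:
        # system() reports -1 when the child could not be created at all,
        # for example because the OS ran out of memory while forking.
        return ("not_started", None)
    signal_number = status & 0x7F  # low 7 bits: terminating signal
    if signal_number == 0:
        # Normal exit: the exit code lives in bits 8..15.
        return ("exited", (status >> 8) & 0xFF)
    if signal_number == 0x7F:
        # Marker value used for stopped (not terminated) children.
        return ("stopped", (status >> 8) & 0xFF)
    # Killed by a signal, e.g. signal 9 sent by an out-of-memory killer.
    return ("signaled", signal_number)
```

Under this layout, a test that exits with code 3 produces the raw status 768, so reading the raw integer as the exit code (or shifting by the wrong amount) misreports the outcome.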
Real-world Complexity • Moral: integration is tricky, and lends itself to human mistakes. • Possibility 1: original programmers make mistakes in developing the test suite. • Test cases can have bugs, too. • Possibility 2: we (GenProg devs/users) make mistakes in integration. • A few old php test cases are not to our standards; faulty bitshift math for extracting the return value components.
Integration Concerns • Interested in more, better benchmark design, with easy integration (without gnarly OS details). • Virtual machines provide one approach. • Need a better definition of “high quality test case” vs. “low quality test case”: • Can the empty program pass it? • Can every program pass it? • Can the “always crashes” program pass it?
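The three questions above suggest a simple sanity check, sketched here in Python. Everything in it is hypothetical scaffolding: a "program" is modeled as a function from an input string to an output string, a "test" as a function that takes a program and returns True on success, and the two baseline programs stand in for the empty and always-crashing programs.

```python
def empty_program(_input):
    """Stand-in for the empty program: does nothing, produces no output."""
    return ""


def crashing_program(_input):
    """Stand-in for the program that always crashes."""
    raise RuntimeError("simulated crash")


def run_test(test, program):
    """Run one test against one program; an escaping crash counts as failure."""
    try:
        return bool(test(program))
    except Exception:
        return False


def degenerate_passers(test):
    """Names of the degenerate programs that pass the given test.

    A non-empty result is a hint that the test is too weak to tell a
    repair from a broken program.
    """
    baselines = {"empty": empty_program, "always_crashes": crashing_program}
    return [name for name, prog in baselines.items() if run_test(test, prog)]


def weak_test(program):
    # Only checks that the harness returns; ignores output and crashes.
    try:
        program("3 1 2")
    except Exception:
        pass
    return True


def strong_test(program):
    # Checks the actual output of a (hypothetical) sorting program.
    return program("3 1 2") == "1 2 3"
```

Here `degenerate_passers(weak_test)` reports both baselines, while `degenerate_passers(strong_test)` reports none. A test that every program passes necessarily shows up as passed by both baselines.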
Current Repair Success • Over the past year, we have conducted studies of representation and operators for automatic program repair: • One-point crossover on patch representation. • Non-uniform mutation operator selection. • Alternative fault localization framework. • Results on the next slide incorporate “all the bells and whistles:” • Improvements based on those large-scale studies. • Manually confirmed quality of testing framework.
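Two of the operators listed above can be sketched in Python. This is an illustration under stated assumptions, not GenProg's implementation: a patch is modeled as an ordered list of edits, crossover picks one cut point in each parent and swaps the tails, and non-uniform operator selection is modeled as a weighted random choice. The function names, the edit strings, and the weights are all hypothetical.

```python
import random


def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two patches at one random cut point per parent."""
    cut_a = rng.randint(0, len(parent_a))  # randint includes both endpoints
    cut_b = rng.randint(0, len(parent_b))
    child_1 = parent_a[:cut_a] + parent_b[cut_b:]
    child_2 = parent_b[:cut_b] + parent_a[cut_a:]
    return child_1, child_2


def pick_operator(weights, rng=random):
    """Choose a mutation operator with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names])[0]


# Hypothetical patches: each edit is just a descriptive string here.
patch_a = ["delete(12)", "insert(40, 7)", "replace(3, 9)"]
patch_b = ["insert(5, 22)", "delete(18)"]
rng = random.Random(0)
child_1, child_2 = one_point_crossover(patch_a, patch_b, rng)
```

Whatever cut points are drawn, the two children together contain exactly the edits of the two parents, which is a convenient invariant to check.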
Repair Templates Claire Le Goues Shirley Park DARPA Site visit February 7, 2013
Immunology: T-cells • Immune response is equally fast for large and small animals. • Human lung is 100x larger than mouse lung, still finds influenza infections in ~8 hours. • Successfully balances local search and global response. • Balance between generic and specialized T-cells: • Rapid response to new pathogens vs. long-term memory of previous infections (cf. vaccines).
[Diagram: loop with stages INPUT, EVALUATE FITNESS, DISCARD, ACCEPT, OUTPUT, MUTATE]
Automatic software repair • Tradeoff between generic mutation actions and more specific action templates: • Generic: INSERT, DELETE, REPLACE • Specific: if ( != NULL) { <code using > }
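The contrast between generic actions and a specific template can be sketched in Python. This is a toy model under assumptions: a program is a list of C statement strings, the three generic actions are plain list edits, and the null-check template is filled in with a variable name and a guarded statement. The function names and the sample statements are hypothetical.

```python
def insert_stmt(program, index, stmt):
    """Generic INSERT: add a statement at the given position."""
    return program[:index] + [stmt] + program[index:]


def delete_stmt(program, index):
    """Generic DELETE: drop the statement at the given position."""
    return program[:index] + program[index + 1:]


def replace_stmt(program, index, stmt):
    """Generic REPLACE: swap the statement at the given position."""
    return program[:index] + [stmt] + program[index + 1:]


def null_check_template(variable, guarded_code):
    """Specific template: guard code that uses a pointer with a NULL check."""
    return f"if ({variable} != NULL) {{ {guarded_code} }}"


# Hypothetical two-statement program with an unguarded pointer use.
program = ["p = lookup(key);", "p->count++;"]
repaired = replace_stmt(program, 1, null_check_template("p", program[1]))
```

The generic actions know nothing about what they move, whereas the template encodes a recognizable fix shape and only needs its holes filled.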
Hypothesis: GenProg can repair more bugs, and repair bugs more quickly, if we augment mutation actions with “repair templates.”
Option 1: Previous Changes • Insight: Just like T-cells “remember” previous infections, abstract previous fixes to generate new mutations. • Approach: • Model previous changes using structured documentation. • Cluster a large set of changes by similarity. • Abstract the center of each cluster. • Example: if ( < 0) return 0; else <code using >
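The cluster-then-abstract steps above can be sketched in Python. This is one possible reading, not the authors' method: each previous change is reduced to a tuple of tokens, similarity is Jaccard overlap of token sets, clustering is a greedy single pass against each cluster's first member, and the "center" is the member most similar to the rest of its cluster. The threshold and the sample changes are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sequences, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_changes(changes, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for change in changes:
        for cluster in clusters:
            if jaccard(change, cluster[0]) >= threshold:
                cluster.append(change)
                break
        else:
            clusters.append([change])
    return clusters


def cluster_center(cluster):
    """Member with the highest total similarity to its cluster (first on ties)."""
    return max(cluster, key=lambda c: sum(jaccard(c, other) for other in cluster))


# Hypothetical tokenized fixes taken from a project's history.
changes = [
    ("if", "x", "<", "0", "return", "0"),
    ("if", "y", "<", "0", "return", "0"),
    ("free", "(", "p", ")"),
    ("free", "(", "q", ")"),
]
clusters = cluster_changes(changes)
```

Turning a center into a template would then mean replacing its project-specific identifiers (`x`, `p`) with holes, which this sketch leaves out.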
Option 2: Existing behavior • Insight: Looking up things at a library provides people with the best example of what they are looking to reproduce. • Approach: • Generate static paths through C programs. • Mine API usage patterns from those paths. • Abstract the patterns into mutation templates. • Example: while(it.hasnext()) <code using it.next()>
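The mining step above can be sketched in Python under a deliberately simple assumption: each static path is a list of API call names, and a "usage pattern" is a pair of consecutive calls that appears in at least `min_support` paths. The function name, the support threshold, and the sample paths are hypothetical; real pattern miners use richer models.

```python
from collections import Counter


def mine_call_pairs(paths, min_support=2):
    """Return consecutive call pairs that occur in at least min_support paths."""
    support = Counter()
    for path in paths:
        # Count each pair at most once per path, so support means "number of paths".
        support.update(set(zip(path, path[1:])))
    return {pair: count for pair, count in support.items() if count >= min_support}


# Hypothetical static paths, each a sequence of API calls.
paths = [
    ["it.hasnext", "it.next", "use"],
    ["open", "it.hasnext", "it.next"],
    ["open", "close"],
]
patterns = mine_call_pairs(paths)
```

With these paths the only surviving pair is `it.hasnext` followed by `it.next`, which is the kind of regularity a loop-shaped template could be abstracted from.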
Conclusions • We are generating a benchmark suite to support GenProg research, integration and tech transfer, and the automatic repair community at large. • Current GenProg results for the 12-hour repair scenario: 87/163 (53%) of real-world bugs in the dataset repaired. • Repair templates will augment GenProg’s mutation operators to help repair more bugs, and repair bugs more quickly.