Automatic Software Repair with Evolutionary Computation. Stephanie Forrest Westley Weimer. Introduction. Automatic bug repair is an important unsolved problem in software engineering Automated repair is needed for self-healing systems “The problem of security is the problem of software”
Automatic Software Repair with Evolutionary Computation Stephanie Forrest Westley Weimer
Introduction • Automatic bug repair is an important unsolved problem in software engineering • Automated repair is needed for self-healing systems • “The problem of security is the problem of software” • We combine state-of-the-art methods from programming languages with innovations in evolutionary computation • To repair bugs in publicly released software
Summary of Method • Assume: • Access to C source code • Negative test case (input = 10593 ; output = infinite loop) • Positive test cases (encode required program functionality) • Construct Abstract Syntax Tree using CIL • Evolve a repair that avoids the negative test case and passes the positive test cases • Minimize repair using structural differencing and delta debugging
What is evolutionary computation? • Evolution in a computer: • Individuals (genotypes) stored in the computer’s memory • Evaluation of individuals (artificial selection) • Differential reproduction by copying and deleting • Variation introduced by analogy with mutation and crossover
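The four ingredients above can be sketched as a generic loop. This is a minimal illustration on a toy bit-string problem, not the repair system itself: `evolve`, `flip_one_bit`, and the truncation-style selection are assumptions made for the example, and crossover is omitted for brevity.

```python
import random

def evolve(seed, fitness, mutate, pop_size=40, generations=10, rng=None):
    """Toy evolutionary loop: individuals are held in a list, evaluated by
    `fitness`, varied by `mutate`, and reproduced by copying the fitter half."""
    rng = rng or random.Random(0)
    population = [seed] * pop_size          # individuals stored in memory
    for _ in range(generations):
        # Variation: one mutated child per parent (no crossover in this sketch).
        offspring = [mutate(ind, rng) for ind in population]
        # Evaluation: rank parents and children together by fitness.
        ranked = sorted(population + offspring, key=fitness, reverse=True)
        # Differential reproduction: delete the weaker individuals, copy the rest.
        survivors = ranked[: pop_size // 2]
        population = survivors + survivors
    return max(population, key=fitness)

def flip_one_bit(bits, rng):
    """Hypothetical mutation operator for the toy problem: flip one random bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

best = evolve((0,) * 8, fitness=sum, mutate=flip_one_bit)
```

Because parents compete with their children, the best fitness in this sketch never decreases from one generation to the next.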
Example: Microsoft Zune • Dec. 31, 2008. Microsoft Zune players mysteriously freeze up. • Bug: Infinite loop when input is last day of a leap year. • Negative test case: 10593, which corresponds to Dec 31, 2008. • Repair is not trivial. Microsoft’s recommendation was to let the Zune drain its battery and then reset it. Source code downloaded from http://pastie.org/349916 (Jan. 2009).
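The slide does not reproduce the faulty code, so the following is a hedged Python transliteration of the loop shape that was widely reported for this bug (the original is C). The names `year_from_days`, `is_leap_year`, and `ORIGIN_YEAR`, and the `max_steps` guard that turns the hang into a detectable result, are assumptions added for illustration.

```python
ORIGIN_YEAR = 1980  # assumed epoch: day 1 is Jan 1, 1980

def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def year_from_days(days, max_steps=100_000):
    """Convert a day count since the epoch into a year.
    Returns None if the loop fails to terminate within max_steps
    (the guard is an addition; the real code simply hangs)."""
    year = ORIGIN_YEAR
    steps = 0
    while days > 365:
        steps += 1
        if steps > max_steps:
            return None          # stands in for the infinite loop
        if is_leap_year(year):
            if days > 366:
                days -= 366
                year += 1
            # Bug: when days == 366 in a leap year, nothing changes,
            # so the loop condition stays true forever.
        else:
            days -= 365
            year += 1
    return year

negative_result = year_from_days(10593)   # Dec 31, 2008: hangs
positive_result = year_from_days(10592)   # Dec 30, 2008: terminates
```

Here 10593 plays the role of the negative test case, while inputs such as 10592 are the kind of positive test cases that a repair must keep passing.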
Evolutionary Computation Innovations • Start with a working program • Focus on execution path through AST • Restrict mutation and crossover to execution path • Represent AST at level of statements • Leaves out expressions, variable declarations • Genetic operators • Never invent new code; reuse statements already in the program • Crossback: cross over with the original program • Macromutation operators • Minimize repair size using structural differencing
Weighted Path • Nodes visited by negative test case have weight 1.0 • Nodes visited by negative and positive test cases have weight 0.01 • All other nodes have weight 0.0
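A minimal sketch of the weighting rule, assuming statements are identified by integer IDs and that coverage is available as sets of visited IDs. The function names are hypothetical, and sampling a mutation site in proportion to its weight is an assumption about how the weights are used, not something the slide states.

```python
import random

def path_weights(all_stmts, neg_visited, pos_visited):
    """Assign each statement ID a weight following the rule on the slide."""
    weights = {}
    for s in all_stmts:
        if s in neg_visited and s in pos_visited:
            weights[s] = 0.01   # on both the failing and the passing paths
        elif s in neg_visited:
            weights[s] = 1.0    # only on the failing path
        else:
            weights[s] = 0.0    # never visited by the negative test case
    return weights

def pick_mutation_site(weights, rng):
    """Sample a statement ID with probability proportional to its weight
    (assumed usage; zero-weight statements are never chosen)."""
    candidates = [s for s, w in weights.items() if w > 0]
    return rng.choices(candidates, weights=[weights[s] for s in candidates], k=1)[0]

weights = path_weights(range(1, 6), neg_visited={2, 3, 4}, pos_visited={1, 2, 5})
site = pick_mutation_site(weights, random.Random(0))
```

With these weights, statements 3 and 4 are each about 100 times more likely to be selected than statement 2, and statements 1 and 5 are never touched.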
Summary of Repairs to Date • Twenty distinct defects in 7 classes: • Segfault: 7 • Buffer overflows: 3 • Infinite loops: 4 • Incorrect output: 2 • Integer overflow: 2 • Non-overflow DOS: 1 • Format string attack: 1 • Twenty distinct programs totaling 186,603 LOC (180k LOC) • Scientific Computing: 1 • Scripting Languages: 3 • Games, Graphics, Sound: 4 • Servers (web, ftp, authentication): 4 • Operating system utilities: 8
Benchmark programs • GECCO 2009, ICSE 2009, ACSAC (submitted)
Time to Discover Repair • Time to repair: • 3 - 10 minutes • Time includes: • GP algorithm (selection, mutation, calculating fitness, etc.) • Running test cases • Pretty printing and memoizing ASTs • gcc (compiling ASTs into executable code) • No special hardware
Research Questions • Does it really work? Why does it work? How can we break it? • How does the representation affect size of search space? • Order-of-magnitude reductions • What is the role of evolution? • Variable. Random search often performs as well • How does the number of test cases affect results? • Can improve results and reduce variability, but increases search time • How does the method scale with problem size? • Search time scales more than linearly but less than quadratically
Search Time Scaling m = 1.26
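If m = 1.26 is read as the slope of a log-log fit of search time T against problem size n (an assumption; the slide gives only the value of m), the consequence of "more than linear, less than quadratic" can be worked out directly:

```latex
T(n) \propto n^{m}, \quad m = 1.26
\qquad\Longrightarrow\qquad
\frac{T(2n)}{T(n)} = 2^{1.26} \approx 2.39,
\qquad
\frac{T(10n)}{T(n)} = 10^{1.26} \approx 18.2
```

Under that reading, doubling the problem size costs roughly 2.4 times the search time, compared with 2 times for linear scaling and 4 times for quadratic scaling.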
Why it Works • Generic approach • Powerful intermediate representation • Weighted path greatly reduces search space • Minimization eliminates unnecessary fixes • Most bugs can be fixed with a few local modifications • An average of 667 atomic genetic operations to discover a repair • Repairs discovered on average in 3.6 generations • 2.9 genetic operations per fitness evaluation • At least half the time, random search does as well as GP
Quality of Repair • Manual checks for repair correctness. • Microsoft requires that security-critical changes be subjected to 100,000 fuzz inputs (randomly generated structured input strings). • Used SPIKE black-box fuzzer (immunitysec.com) to generate 100,000 held-out fuzz requests for web server examples. • In no case did GP repairs introduce errors that were detected by the fuzz tests, and in every case the GP repairs defeated variant attacks based on the same exploit. • Thus, the GP repairs are not fragile memorizations of the input. • GP repairs also correctly handled all subsequent requests from an indicative workload.
Papers and Awards • W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest. “Automatically finding patches using genetic programming.” ICSE (2009). Best Paper Award. • S. Forrest, W. Weimer, T. Nguyen, and C. Le Goues. “A Genetic Programming Approach to Automated Software Repair.” GECCO (2009). Best Paper Award. • C. Le Goues, T. Nguyen, W. Weimer, and S. Forrest. “Closed-Loop Repair of Security Vulnerabilities.” ACSAC 25 (submitted June 2009). • AWARD: Human-Competitive Results Produced by Genetic and Evolutionary Computation (Humie Award). $5000 • IFIP TC2 Manfred Paul Award for Excellence in Software: Theory and Practice. 1024 Euros • 2nd International Workshop on Search-Based Software Testing. Best paper and best presentation.
The Future • Self-healing systems for security (next talk) • Integrating anomaly detection to find negative test cases • Runtime repair using software dynamic translation, e.g., Strata • Repair templates, other search methods • Evaluate repair quality carefully • Consistency in distributed applications? N-version diversity? • Systematic study of large software code bases • Hypothesis: Most bugs are small • A small step for GP, a large step for software?
Evolutionary computation details • Fitness: Weighted sum of the test cases that the program passes • F(programs that don’t compile) = 0 • 5 positive test cases (weight = 1), 1 or 2 negative test cases (weight = 10) • Mutation operations: • Delete a statement • Insert a statement • Swap a statement along the weighted path with a statement from another part of the program • Crossover: Crosses back to original parent • Population size is 40. Standard run is 10 gens + 10 gens
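The fitness rule above can be written out as a short sketch. The function signature and the boolean-list representation of test outcomes are assumptions for illustration; the weights and the zero score for non-compiling programs come from the slide.

```python
W_POS = 1    # weight of each positive test case (from the slide)
W_NEG = 10   # weight of each negative test case (from the slide)

def fitness(compiles, positive_passed, negative_passed):
    """Weighted sum of passed test cases.
    `positive_passed` and `negative_passed` are lists of booleans,
    one entry per test case (True means the variant passed)."""
    if not compiles:
        return 0
    return W_POS * sum(positive_passed) + W_NEG * sum(negative_passed)

original = fitness(True, [True] * 5, [False])   # buggy program: 5
repaired = fitness(True, [True] * 5, [True])    # full repair: 15
broken = fitness(False, [True] * 5, [True])     # does not compile: 0
```

With five positive tests and one negative test, the unmodified program scores 5 and a complete repair scores 15, so passing the negative test outweighs all positive tests combined.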
Minimizing the final repair • Use tree-structured differencing (Al-Ekram et al. 2005) • View primary repair as a set of tree-structured operations • Consider the one-minimal subset of repairs • Let Cp = {c1, c2, ..., cn} be the set of changes in a primary repair • A one-minimal subset is a subset of Cp that passes all test cases and from which no single change can be removed without failing a test • Delta debugging: Search for a one-minimal subset using binary search • n^2 time in worst case • Often linear
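The search for a one-minimal subset can be sketched as a compact ddmin-style loop. This is an illustrative version under stated assumptions, not the authors' implementation: `passes` is a hypothetical callback that applies a subset of changes and runs all test cases, and the changes are assumed to be distinct values.

```python
def ddmin(changes, passes):
    """Shrink `changes` to a one-minimal subset for which passes(subset) is True."""
    assert passes(changes), "the full primary repair must pass all test cases"
    n = 2  # current number of chunks
    while len(changes) >= 2:
        chunk = max(1, len(changes) // n)
        subsets = [changes[i:i + chunk] for i in range(0, len(changes), chunk)]
        reduced = False
        # First try each chunk on its own (the binary-search-like step).
        for s in subsets:
            if passes(s):
                changes, n, reduced = s, 2, True
                break
        # Otherwise try dropping one chunk at a time.
        if not reduced:
            for s in subsets:
                complement = [c for c in changes if c not in s]
                if passes(complement):
                    changes, n, reduced = complement, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(changes):
                break            # every single-change removal failed: one-minimal
            n = min(len(changes), n * 2)   # refine granularity and retry
    return changes

# Toy example: only changes 2 and 5 are actually needed for the repair.
needed = {2, 5}
minimal = ddmin(list(range(8)), lambda subset: needed <= set(subset))
```

Each call to `passes` corresponds to one compile-and-test run, which is why the number of calls (quadratic in the worst case) dominates the cost of minimization.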