Automatic Software Repair with Evolutionary Computation

Automatic Software Repair with Evolutionary Computation, by Stephanie Forrest and Westley Weimer

Introduction • Automatic bug repair is an important unsolved problem in software engineering • Automated repair is needed for self-healing systems • “The problem of security is the problem of software” • We combine state-of-the-art methods from programming languages with innovations in evolutionary computation to repair bugs in publicly released software

Summary of Method • Assume: • Access to C source code • Negative test case (e.g., input = 10593; output = infinite loop) • Positive test cases (encode required program functionality) • Construct Abstract Syntax Tree using CIL • Evolve a repair that avoids the negative test case and passes the positive test cases • Minimize the repair using structural differencing and delta debugging

What is evolutionary computation? • Evolution in a computer: • Individuals (genotypes) stored in the computer’s memory • Evaluation of individuals (artificial selection) • Differential reproduction by copying and deleting • Variation introduced by analogy with mutation and crossover
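The four ingredients above can be sketched as a generic loop. This is a toy illustration in Python, not the repair system described in the talk: the bit-string genotype, the `evolve` function, and every parameter value are assumptions chosen only to make the sketch runnable.

```python
import random

def evolve(fitness, genome_len=20, pop_size=40, generations=50, seed=0):
    """Toy evolutionary loop over bit strings (illustrative assumptions only)."""
    rng = random.Random(seed)
    # Individuals (genotypes) stored in memory.
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation: rank individuals by fitness (artificial selection).
        pop.sort(key=fitness, reverse=True)
        # Differential reproduction: delete the worse half, copy from the better half.
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            # Variation: one-point crossover followed by a single bit-flip mutation.
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]
            i = rng.randrange(genome_len)
            child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# "Count the ones" stands in for a real fitness function.
best = evolve(fitness=sum)
print(sum(best), "ones out of", len(best))
```

Because the better half of the population is carried over unchanged, the best fitness never decreases from one generation to the next in this sketch.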

Example: Microsoft Zune • Dec. 31, 2008: Microsoft Zune players mysteriously freeze up. • Bug: Infinite loop when the input is the last day of a leap year. • Negative test case: 10593, which corresponds to Dec. 31, 2008. • Repair is not trivial. Microsoft’s recommendation was to let the Zune drain its battery and then reset. • Source code downloaded from http://pastie.org/349916 (Jan. 2009).
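The widely circulated Zune date loop can be transliterated to Python to show why input 10593 hangs. This is a sketch, not the original C source: the function name `year_from_days`, the 1980 start year, and the `max_steps` guard (added so the demonstration terminates instead of hanging) are assumptions of this example.

```python
def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def year_from_days(days, start_year=1980, max_steps=1000):
    """Convert a day count starting at Jan 1 of start_year into (year, day_of_year).

    Returns None if the loop stops making progress; the guard stands in
    for the real device freezing.
    """
    year = start_year
    steps = 0
    while days > 365:
        if is_leap_year(year):
            if days > 366:
                days -= 366
                year += 1
            # Bug: when days == 366 in a leap year, nothing changes.
        else:
            days -= 365
            year += 1
        steps += 1
        if steps > max_steps:
            return None
    return year, days

print(year_from_days(10592))  # (2008, 365): Dec 30, 2008
print(year_from_days(10593))  # None: Dec 31, 2008, the negative test case
```

After 28 full years (1980 through 2007, seven of them leap years) 10227 days have been subtracted, so input 10593 leaves `days == 366` in leap year 2008, which satisfies the loop condition but neither branch that would change `days`.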

Evolutionary Computation Innovations • Start with a working program • Focus on execution path through AST • Restrict mutation and crossover to execution path • Represent AST at level of statements • Leaves out expressions, variable declarations • Genetic operators • Don’t invent any new code, crossback, macromutation operators • Minimize repair size using structural differencing

AST Representation

Weighted Path • Nodes visited by negative test case have weight 1.0 • Nodes visited by negative and positive test cases have weight 0.01 • All other nodes have weight 0.0
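The three weighting rules can be written down directly. The sketch below is hypothetical: the statement ids, the coverage sets, and the helper names `path_weights` and `pick_mutation_site` are invented for illustration, and sampling a mutation site in proportion to its weight is one plausible use of the weights, not necessarily the exact rule used in the talk.

```python
import random

def path_weights(all_stmts, neg_visited, pos_visited):
    """Assign each statement id a weight following the three rules above."""
    weights = {}
    for s in all_stmts:
        if s in neg_visited and s in pos_visited:
            weights[s] = 0.01   # visited by both negative and positive tests
        elif s in neg_visited:
            weights[s] = 1.0    # visited only by the negative test
        else:
            weights[s] = 0.0    # not on the negative path at all
    return weights

def pick_mutation_site(weights, rng):
    """Sample one statement id with probability proportional to its weight (assumption)."""
    stmts = [s for s, w in weights.items() if w > 0]
    return rng.choices(stmts, weights=[weights[s] for s in stmts], k=1)[0]

# Hypothetical statement ids and coverage sets.
weights = path_weights(
    all_stmts=range(1, 8),
    neg_visited={2, 3, 4, 5},
    pos_visited={1, 2, 3, 6},
)
print(weights)
site = pick_mutation_site(weights, random.Random(0))
print("mutate statement", site)
```

Statements with weight 0.0 are never selected, which is how the weighted path shrinks the search space to code the failing run actually executes.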

The Final Evolved Repair

Summary of Repairs to Date • Twenty distinct defects in 7 classes: • Segfault: 7 • Buffer overflows: 3 • Infinite loops: 4 • Incorrect output: 2 • Integer overflow: 2 • Non-overflow DOS: 1 • Format string attack: 1 • Twenty distinct programs totaling 186,603 LOC (180k LOC) • Scientific Computing: 1 • Scripting Languages: 3 • Games, Graphics, Sound: 4 • Servers (web, ftp, authentication): 4 • Operating system utilities: 8

Benchmark programs • GECCO 2009, ICSE 2009, ACSAC (submitted)

Time to Discover Repair • Time to repair: • 3 - 10 minutes • Time includes: • GP algorithm (selection, mutation, calculating fitness, etc.) • Running test cases • Pretty printing and memoizing ASTs • gcc (compiling ASTs into executable code) • No special hardware

Research Questions • Does it really work? Why does it work? How can we break it? • How does the representation affect the size of the search space? • Order-of-magnitude reductions • What is the role of evolution? • Variable. Random search often performs as well • How does the number of test cases affect results? • More test cases can improve results and reduce variability, but increase search time • How does the method scale with problem size? • Search time scales more than linearly but less than quadratically

Search Time Scaling m = 1.26

Why it Works • Generic approach • Powerful intermediate representation • Weighted path greatly reduces search space • Minimization eliminates unnecessary fixes • Most bugs can be fixed with a few local modifications • On average, 667 atomic genetic operations to discover a repair • Repair discovered on average in 3.6 generations • 2.9 genetic operations per fitness evaluation • At least half the time, random search does as well as GP

Quality of Repair • Manual checks for repair correctness. • Microsoft requires that security-critical changes be subjected to 100,000 fuzz inputs (randomly generated structured input strings). • Used SPIKE black-box fuzzer (immunitysec.com) to generate 100,000 held-out fuzz requests for web server examples. • In no case did GP repairs introduce errors that were detected by the fuzz tests, and in every case the GP repairs defeated variant attacks based on the same exploit. • Thus, the GP repairs are not fragile memorizations of the input. • GP repairs also correctly handled all subsequent requests from indicative workload.

Papers and Awards • W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, "Automatically finding patches using genetic programming." ICSE (2009). Best Paper Award. • S. Forrest, W. Weimer, T. Nguyen, and C. Le Goues, "A Genetic Programming Approach to Automated Software Repair." GECCO (2009). Best Paper Award. • C. Le Goues, T. Nguyen, W. Weimer, and S. Forrest, "Closed-Loop Repair of Security Vulnerabilities." ACSAC 25 (submitted June 2009). • AWARD: Human-Competitive Results Produced by Genetic and Evolutionary Computation (Humie Award). $5000 • IFIP TC2 Manfred Paul Award for Excellence in Software: Theory and Practice. 1024 Euros • 2nd International Workshop on Search-Based Software Testing. Best paper and best presentation.

The Future • Self-healing systems for security (next talk) • Integrating anomaly detection to find negative test cases • Runtime repair using software dynamic translation, e.g., Strata • Repair templates, other search methods • Evaluate repair quality carefully • Consistency in distributed applications? N-version diversity? • Systematic study of large software code bases • Hypothesis: Most bugs are small • A small step for GP, a large step for software?

Evolutionary computation details • Fitness: Weighted sum of the test cases that the program passes: • F(programs that don’t compile) = 0 • 5 positive test cases (weight = 1), 1 or 2 negative test cases (weight = 10) • Mutation operations: • Delete a statement • Insert a statement • Swap a statement along the weighted path with a statement from another part of the program • Crossover: Crosses back to original parent • Population size is 40. Standard run is 10 generations + 10 generations
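The fitness rule above amounts to a few lines of arithmetic. The following is a minimal sketch under stated assumptions: the function name `fitness`, its argument layout, and the example variants are hypothetical, while the weights (1 per positive test, 10 per negative test, 0 for programs that do not compile) come from the slide.

```python
def fitness(compiles, pos_results, neg_results, w_pos=1.0, w_neg=10.0):
    """Weighted count of passed tests; variants that fail to compile score 0."""
    if not compiles:
        return 0.0
    # True counts as 1 and False as 0 when summed.
    return w_pos * sum(pos_results) + w_neg * sum(neg_results)

# Original program: passes all 5 positive tests, fails the negative test.
original = fitness(True, [True] * 5, [False])
# Candidate repair: passes everything.
repaired = fitness(True, [True] * 5, [True])
# Variant that fixes the bug but breaks two required behaviours.
partial = fitness(True, [True, True, True, False, False], [True])
# Variant that does not compile.
broken = fitness(False, [], [])

print(original, repaired, partial, broken)  # 5.0 15.0 13.0 0.0
```

With these weights a variant that passes the negative test while breaking two positive tests (13.0) still outscores the unrepaired original (5.0), so the search is pulled toward fixing the defect first.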

Minimizing the final repair • Use tree-structured differencing (Al-Ekram et al. 2005) • View the primary repair as a set of tree-structured edit operations • Let Cp = {c1, c2, ..., cn} be the set of changes in a primary repair • A one-minimal subset of Cp passes all test cases, and removing any single change from it makes some test fail • Delta debugging: Search for a one-minimal subset using binary search • O(n^2) time in the worst case • Often linear
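One-minimality can be illustrated with a simple greedy reduction. Note that this sketch is not the delta-debugging algorithm itself, which tries to discard larger chunks first: it removes one change at a time, which is easier to read and also needs a quadratic number of test runs in the worst case. The names `one_minimal` and `passes_all_tests` and the five-change example are hypothetical.

```python
def one_minimal(changes, passes_all_tests):
    """Greedily drop changes until no single remaining change can be removed."""
    current = list(changes)
    assert passes_all_tests(current), "the primary repair must pass all tests"
    shrunk = True
    while shrunk:
        shrunk = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if passes_all_tests(candidate):
                # This change was unnecessary: keep the smaller set and rescan.
                current = candidate
                shrunk = True
                break
    return current

# Hypothetical primary repair with five changes, of which only c2 and c4 matter.
needed = {"c2", "c4"}

def passes_all_tests(subset):
    # Stand-in for applying the changes, compiling, and running the test suite.
    return needed.issubset(subset)

minimal = one_minimal(["c1", "c2", "c3", "c4", "c5"], passes_all_tests)
print(minimal)  # ['c2', 'c4']
```

In a real setting each call to `passes_all_tests` means applying the candidate subset to the original program and rerunning every test, so the number of calls dominates the cost of minimization.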
