Automatic Software Self-Healing using Rescue Points

Automatic Software Self-Healing using Rescue Points • Angelos Keromytis, Jason Nieh, Sal Stolfo • Department of Computer Science • Columbia University

Motivation • Software remains buggy and crash-prone • Problem for high-availability systems, remote attacks, high-volume events, non-exploitable bugs • High cost of downtime • In the absence of perfect software, error toleration and recovery techniques become a necessary complement to existing techniques

Dealing with Failures • Development bugs • Programming Language Design (avoid failures) • Software Verification (prove failure-free) • Software Testing (expose failures) • Deployment • Failure Detection (detect failures)

I detected a failure, now what? • User/Administrator • Restart application • File bug report • Developer • Locate bug • Create patch • Test patch • Deploy patch

Restart Malady

Integrity vs Availability • Terminate execution when fault is detected • Recurring faults (worms, etc.) • Applications that build a lot of state • Collateral damage • But is this not the only sane thing to do? • Life after death?

Our Work: ASSURE • New automatic software self-healing system • Augments software integrity with availability • Works on commercial-off-the-shelf software

Software Elasticity • Assumption: Behind every complex system lies a well-tested core • Programmers build error handling • They just can’t cover every corner case • Feature creep, complexity, etc.

Rescue Points • Recover using the program’s own code • Mapping between the set of faults that could occur and those explicitly handled by the program code • Profile programs during “bad” test runs • Build behavioral model • Discover candidate recovery (rescue) points • Induce faults at locations that are known (or suspected) to propagate faults correctly • Work on binaries (COTS)

High-level Example • Call stack shown on the slide: a(), b(), c(), d() • int c() { • if ((res = d()) < 0) • return -1; /* Error */ • else • /* Do useful work */ • return 0; /* OK */ • } • int d() { • /* Slice-off functionality */ • return -1; /* Error */ • }
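The slide's example can be fleshed out into a small compilable sketch. It assumes the call chain a() → b() → c() → d() suggested by the slide; the `d_fails` switch and the body of a() are hypothetical additions that stand in for an induced fault and for a server's per-request error handling.

```c
#include <stdio.h>

/* Hypothetical switch standing in for a fault induced in d(). */
static int d_fails = 0;

static int d(void) {
    if (d_fails)
        return -1;   /* Error: functionality sliced off */
    return 0;        /* OK */
}

/* c() acts as the rescue point: it already checks d()'s return value. */
static int c(void) {
    int res;
    if ((res = d()) < 0)
        return -1;   /* Error path the programmer wrote and tested */
    /* Do useful work */
    return 0;        /* OK */
}

static int b(void) { return c(); }

/* a() models handling of one request: a failure is reported, not fatal. */
static int a(void) {
    if (b() < 0) {
        fprintf(stderr, "request failed, continuing\n");
        return -1;
    }
    return 0;
}
```

Calling a() with `d_fails` set returns -1 through code paths that already exist in the program, and a later call with `d_fails` cleared succeeds again.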

Why does this work? • Focus on server applications • Short error propagation distance • Errors in one request do not affect the computation of subsequent requests • Servers inherently support error handling (bad requests)

Self-Healing Process • Monitor • Rescue point discovery • Fault monitoring • Diagnose • Fault reproduction • Rescue point selection • Adapt • Rescue point creation • Test • Rescue point testing and deployment

ASSURE: Time Line • Production System • Fault Detected • Rescue-point Analysis (offline), during the Vulnerability Window • Dynamic Patch • Patched Production System

ASSURE Architecture

Rescue Point Discovery • Dynamic analysis via fuzzing • No access to source code • Dynamic binary instrumentation (Dyninst) • Examine behavior under “bad” input • Identify candidate rescue points • Log most frequent error return values • Happens off-line • Need only do it once
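The "log most frequent error return values" step could look roughly like the following sketch. It is not ASSURE's implementation: the `ret_profile` structure, both function names, and the fixed `MAX_DISTINCT` capacity are assumptions made for illustration, and the instrumentation that would feed it (Dyninst in the talk) is left out.

```c
#include <stddef.h>

#define MAX_DISTINCT 16   /* assumed cap on distinct return values tracked */

/* Tally of return values seen for one function during "bad" runs. */
struct ret_profile {
    int    values[MAX_DISTINCT];
    size_t counts[MAX_DISTINCT];
    size_t n;                     /* number of distinct values recorded */
};

/* Record one observed return value; values beyond the cap are dropped. */
static void profile_record(struct ret_profile *p, int value) {
    for (size_t i = 0; i < p->n; i++) {
        if (p->values[i] == value) {
            p->counts[i]++;
            return;
        }
    }
    if (p->n < MAX_DISTINCT) {
        p->values[p->n] = value;
        p->counts[p->n] = 1;
        p->n++;
    }
}

/* Store the most frequent value in *out and return 0, or return -1 if
 * nothing was recorded. Ties go to the value recorded first. */
static int profile_most_frequent(const struct ret_profile *p, int *out) {
    if (p->n == 0)
        return -1;
    size_t best = 0;
    for (size_t i = 1; i < p->n; i++)
        if (p->counts[i] > p->counts[best])
            best = i;
    *out = p->values[best];
    return 0;
}
```

The value chosen this way is what a rescue point would later be forced to return.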

Fault Detection • Fault detection viewed as blackbox • Map detected faults to signals • Lightweight sensors on application • Simply give indication of failure • Watchdog process • ProPolice, StackGuard, etc.
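As a minimal sketch of "map detected faults to signals", a sensor can be as simple as a signal handler that records which fault signal arrived. The names `last_fault`, `fault_sensor`, and `install_fault_sensor` are hypothetical, and the ISO C `signal()` call is used only for brevity; the talk does not say how ASSURE installs its sensors.

```c
#include <signal.h>
#include <stddef.h>

/* Last fault signal seen by the sensor; 0 means none so far. */
static volatile sig_atomic_t last_fault = 0;

/* The sensor only records the failure; recovery is decided elsewhere. */
static void fault_sensor(int signo) {
    last_fault = signo;
}

/* Install the sensor for signals that commonly indicate a fault.
 * SIGABRT is included because abort() raises it. Returns 0 on success. */
static int install_fault_sensor(void) {
    static const int fault_signals[] = { SIGSEGV, SIGFPE, SIGILL, SIGABRT };
    for (size_t i = 0; i < sizeof fault_signals / sizeof fault_signals[0]; i++)
        if (signal(fault_signals[i], fault_sensor) == SIG_ERR)
            return -1;
    return 0;
}
```

Returning normally from the handler is only safe for signals sent with raise() or kill(); after a real hardware fault the handler would have to transfer control elsewhere, for example back to a checkpoint.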

Fault Reproduction • Network inputs • Good for deterministic failures • Cannot fully reproduce system state • Deterministic replay • Record all interactions between processes and their environment

Rescue Point Selection • Replay and detect failure • Extract stack trace • Find candidate rescue point that is closest to failure
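The "closest to failure" rule can be sketched as a walk over the extracted stack trace from the faulting frame outward. The function name and the representation of frames as function-name strings are assumptions for illustration; they are not taken from ASSURE.

```c
#include <stddef.h>
#include <string.h>

/* trace[0] is the faulting function, trace[depth-1] the outermost caller.
 * Returns the index in trace of the innermost frame that is also a
 * candidate rescue point, or -1 if no frame qualifies. */
static int select_rescue_point(const char *const trace[], size_t depth,
                               const char *const candidates[], size_t ncand) {
    for (size_t i = 0; i < depth; i++)
        for (size_t j = 0; j < ncand; j++)
            if (strcmp(trace[i], candidates[j]) == 0)
                return (int)i;
    return -1;
}
```

With a trace of d, c, b, a and candidates a and c, the sketch picks c, the candidate nearest the fault.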

Rescue Point Creation • Inject using dynamic binary instrumentation • Take checkpoint at rescue point • If fault detected, restore and steer execution • Cause application to rollback to checkpoint • Force error return using return value from rescue point discovery

Rescue Point Implementation • int rescue_point( int a, int b ) { • int rid = rescue_capture(id, fault); • if (rid < 0) /* checkpoint/restore error */ • handle_error(id); • else if (rid == 0) /* error virtualization */ • return rescue_ret_val(fault); • else • /* rescue-point identifier */ • ... • }
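The control flow of a rescue point (checkpoint, run, and on a fault restore and force an error return) can be imitated in a single process with `setjmp`/`longjmp` and a copy of the application state. This is a swapped-in technique for illustration only: ASSURE checkpoints whole processes through its OS virtualization layer, and every name below (`app_state`, `fault_detected`, `buggy_work`, this `rescue_point` signature) is hypothetical.

```c
#include <setjmp.h>
#include <string.h>

/* Hypothetical application state that the checkpoint must cover. */
struct app_state {
    int  requests_served;
    char buf[32];
};

static struct app_state state;       /* live state */
static struct app_state checkpoint;  /* in-memory copy taken at the rescue point */
static jmp_buf rescue_env;           /* execution context of the rescue point */

/* Stand-in for a fault sensor firing: jump back to the rescue point. */
static void fault_detected(void) {
    longjmp(rescue_env, 1);
}

/* Work that modifies state and may fault halfway through. */
static void buggy_work(int trigger_fault) {
    state.requests_served++;
    strcpy(state.buf, "partial");
    if (trigger_fault)
        fault_detected();            /* does not return */
    strcpy(state.buf, "done");
}

/* Rescue point: checkpoint, run the work, and on a fault restore the
 * checkpoint and return error_ret (error virtualization). */
static int rescue_point(int trigger_fault, int error_ret) {
    checkpoint = state;              /* take checkpoint */
    if (setjmp(rescue_env) != 0) {   /* re-entered after a fault */
        state = checkpoint;          /* roll back state changes */
        return error_ret;            /* force the error return */
    }
    buggy_work(trigger_fault);
    return 0;
}
```

After a faulting call the caller sees only the error return value, and `state` is exactly what it was when the rescue point was entered.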

Checkpoint/Rollback • Based on Zap • OS virtualization layer • Checkpoints kept in memory • Standard copy-on-write semantics • Consistent checkpoints of multi-process applications • File-system snapshot

Rescue Point Testing and Deployment • Test for survivability, correctness and performance • Repeat selection process if needed • Deployment via binary injection into running application on production server • Avoid: patch, compile, stop, restart

Evaluation • Implemented ASSURE for Linux • Tested several popular server applications • Metrics • Survivability • Correctness • Performance • All tests on stripped binaries

Real Bugs and Benchmarks

Rescue Point to Fault

Self-Healing Time

Recovery Time

Normalized Performance

Checkpoint Time

Rescue Point State

Related Work • Number of proposals: • Failure-oblivious computing [OSDI 04] • Rx: Treating bugs as allergies [SOSP 05] • Automatic data-structure repair [OOPSLA 03] • Most try to mask the occurrence of faults • Problem with ensuring program semantics on recovery (unanticipated execution paths) • Our approach is to force an error!

Conclusions • Full system that enables automatic software self-healing • Introduced rescue-points • Programmer-tested recovery points • Experimental evaluation • Automatically fixed 8 real bugs • With minimal performance overhead
