Automatic Software Self-Healing using Rescue Points • Angelos Keromytis, Jason Nieh, Sal Stolfo • Department of Computer Science • Columbia University
Motivation • Software remains buggy and crash-prone • Problem for high-availability systems, remote attacks, high-volume events, non-exploitable bugs • High cost of downtime • In the absence of perfect software, error toleration and recovery techniques become a necessary complement to existing techniques
Dealing with Failures • Development: Programming Language Design (avoid failures), Software Verification (prove failure-free), Software Testing (expose failures) • Bugs that escape development reach deployment • Deployment: Failure Detection (detect failures)
I detected a failure, now what? • User/Administrator: restart application, file bug report • Developer: locate bug, create patch, test patch, deploy patch
Integrity vs. Availability • Terminate execution when fault is detected • Recurring faults (worms, etc.) • Applications that build a lot of state • Collateral damage • But is this not the only sane thing to do? • Life after death?
Our Work: ASSURE • New automatic software self-healing system • Augments software integrity with availability • Works on commercial-off-the-shelf software
Software Elasticity • Assumption: Behind every complex system lies a well-tested core • Programmers build error handling • They just can’t cover every corner case • Feature creep, complexity, etc.
Rescue Points • Recover using the program’s own code • Mapping between the set of faults that could occur and those explicitly handled by the program code • Profile programs during “bad” test runs • Build behavioral model • Discover candidate recovery (rescue) points • Induce faults at locations that are known (or suspected) to propagate faults correctly • Work on binaries (COTS)
High-level Example • Call chain: a() calls b(), b() calls c(), c() calls d() • int c() { • int res; • if ((res = d()) < 0) • return -1; /* Error */ • else • /* Do useful work */ • return 0; /* OK */ • } • int d() { • /* Slice-off functionality */ • return -1; /* Error */ • }
Why does this work? • Focus on server applications • Short error propagation distance • Errors in one request do not affect the computation of subsequent requests • Servers inherently support error handling (bad requests)
Self-Healing Process • Monitor: rescue point discovery, fault monitoring • Diagnose: fault reproduction, rescue point selection • Adapt: rescue point creation • Test: rescue point testing and deployment
ASSURE: Time Line • [Figure: timeline. Production system, fault detected, rescue-point analysis (offline), dynamic patch, patched production system. The vulnerability window spans the time from fault detection to the dynamic patch.]
Rescue Point Discovery • Dynamic analysis via fuzzing • No access to source code • Dynamic binary instrumentation (Dyninst) • Examine behavior under “bad” input • Identify candidate rescue points • Log most frequent error return values • Happens off-line • Need only do it once
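The "log most frequent error return values" step can be pictured as a per-function histogram filled in during the fuzzing runs. The following is only a minimal sketch under stated assumptions: `struct ret_profile`, `profile_record`, `profile_most_frequent`, and the `MAX_DISTINCT` cap are hypothetical names invented for illustration, not part of ASSURE or Dyninst, and the instrumentation that would feed in the observed return values is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DISTINCT 16  /* hypothetical cap on distinct return values per function */

/* Histogram of return values observed for one function during "bad" runs. */
struct ret_profile {
    long   values[MAX_DISTINCT];
    size_t counts[MAX_DISTINCT];
    size_t n;                    /* number of distinct values seen so far */
};

/* Record one observed return value; new values beyond the cap are ignored. */
void profile_record(struct ret_profile *p, long value)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; i++) {
        if (p->values[i] == value) {
            p->counts[i]++;
            return;
        }
    }
    if (p->n < MAX_DISTINCT) {
        p->values[p->n] = value;
        p->counts[p->n] = 1;
        p->n++;
    }
}

/* Store the most frequent value in *out and return 1,
 * or return 0 if nothing has been recorded yet. */
int profile_most_frequent(const struct ret_profile *p, long *out)
{
    assert(p != NULL && out != NULL);
    if (p->n == 0)
        return 0;
    size_t best = 0;
    for (size_t i = 1; i < p->n; i++) {
        if (p->counts[i] > p->counts[best])
            best = i;
    }
    *out = p->values[best];
    return 1;
}
```

A zero-initialized `struct ret_profile` is an empty profile. The value reported by `profile_most_frequent` would be the candidate later forced as the error return at that rescue point.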
Fault Detection • Fault detection viewed as a black box • Map detected faults to signals • Lightweight sensors on application • Simply give indication of failure • Watchdog process • ProPolice, StackGuard, etc.
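To illustrate "map detected faults to signals", here is a minimal sketch of a sensor that only reports that a failure occurred, using the POSIX `sigaction` interface. The names `sensor_install`, `sensor_take_fault`, and `on_fault` are hypothetical and not taken from ASSURE. One caveat: for a real memory fault such as `SIGSEGV`, simply returning from the handler would re-execute the faulting instruction, so an actual system has to transfer control to recovery instead of just setting a flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Last fault signal seen by the sensor, or 0 if none is pending. */
static volatile sig_atomic_t last_fault = 0;

/* Handler: only note which signal fired (async-signal-safe). */
static void on_fault(int signo)
{
    last_fault = signo;
}

/* Install the sensor for one signal; returns 0 on success, -1 on error. */
int sensor_install(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Return the pending fault signal (0 if none) and clear it. */
int sensor_take_fault(void)
{
    int signo = last_fault;
    last_fault = 0;
    return signo;
}
```

The sensor can be exercised safely with `raise(SIGUSR1)` after `sensor_install(SIGUSR1)`. Any detector, whether a watchdog or a compiler-inserted check, can report through the same path by raising a signal.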
Fault Reproduction • Network inputs • Good for deterministic failures • Cannot fully reproduce system state • Deterministic replay • Record all interactions between processes and their environment
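The network-input approach to fault reproduction amounts to keeping a log of incoming messages and feeding them back in order. The sketch below assumes a small fixed-size in-memory log. `struct input_log`, `log_record`, `log_replay`, and the capacity constants are hypothetical names for illustration. As the slide notes, this alone cannot reproduce full system state; deterministic replay would additionally record every interaction between the process and its environment.

```c
#include <stddef.h>
#include <string.h>

#define LOG_CAP 64    /* hypothetical maximum number of recorded inputs */
#define MSG_CAP 128   /* hypothetical maximum size of one input */

/* In-memory log of inputs, replayed in the order they were recorded. */
struct input_log {
    char   msgs[LOG_CAP][MSG_CAP];
    size_t lens[LOG_CAP];
    size_t n;      /* number of recorded inputs */
    size_t next;   /* index of the next input to replay */
};

/* Record one input; returns 0 on success, -1 if it does not fit. */
int log_record(struct input_log *l, const void *buf, size_t len)
{
    if (l->n == LOG_CAP || len > MSG_CAP)
        return -1;
    memcpy(l->msgs[l->n], buf, len);
    l->lens[l->n] = len;
    l->n++;
    return 0;
}

/* Copy the next recorded input into buf and return its length,
 * or -1 if the log is exhausted or buf is too small. */
long log_replay(struct input_log *l, void *buf, size_t cap)
{
    if (l->next == l->n || l->lens[l->next] > cap)
        return -1;
    size_t len = l->lens[l->next];
    memcpy(buf, l->msgs[l->next], len);
    l->next++;
    return (long)len;
}
```

During normal operation the server would call `log_record` on each request it receives. During diagnosis the same requests are pulled back with `log_replay` until the failure is triggered again.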
Rescue Point Selection • Replay and detect failure • Extract stack trace • Find candidate rescue point that is closest to failure
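The selection rule above (the candidate rescue point closest to the failure) can be sketched as a walk over the extracted stack trace from the innermost frame outward. This is a simplified illustration: `select_rescue_point` is a hypothetical helper, and frames are assumed to be identified by function name, whereas a system working on stripped binaries would more likely compare addresses.

```c
#include <stddef.h>
#include <string.h>

/* stack[0] is the faulting (innermost) frame, stack[depth - 1] the outermost.
 * Returns the index of the first frame that is a candidate rescue point,
 * or -1 if no frame on the stack is a candidate. */
int select_rescue_point(const char *const stack[], size_t depth,
                        const char *const candidates[], size_t ncand)
{
    for (size_t i = 0; i < depth; i++) {
        for (size_t j = 0; j < ncand; j++) {
            if (strcmp(stack[i], candidates[j]) == 0)
                return (int)i;
        }
    }
    return -1;
}
```

With the stack `d, c, b, a` from the high-level example and candidates `a` and `c`, the function picks index 1, that is `c`, because it is the nearest enclosing frame known to propagate errors.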
Rescue Point Creation • Inject using dynamic binary instrumentation • Take checkpoint at rescue point • If fault detected, restore and steer execution • Cause application to roll back to checkpoint • Force error return using return value from rescue point discovery
Rescue Point Implementation • int rescue_point(int a, int b) { • int rid = rescue_capture(id, fault); • if (rid < 0) /* checkpoint/restore error */ • handle_error(id); • else if (rid == 0) /* error virtualization */ • return rescue_ret_val(fault); • else • /* rescue-point identifier */ • ... • }
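The control flow of a rescue point (capture state, detect a fault further down, return to the rescue point, force an error return) can be imitated in plain C with `setjmp`/`longjmp`. This is only an analogy under loud assumptions: `setjmp` saves the register context but not memory, so unlike the checkpoint described in these slides it does not undo side effects. `handle_request` and `parse_request` are hypothetical functions, and the fault is simulated by a check on the input rather than raised by a real detector.

```c
#include <setjmp.h>

/* Stand-in for the checkpoint taken at the rescue point.
 * Note: this saves only the execution context, not memory state. */
static jmp_buf rescue_env;

/* Hypothetical faulty callee: a negative length stands in for a detected fault. */
static int parse_request(int len)
{
    if (len < 0)
        longjmp(rescue_env, 1);   /* stand-in for the fault detector firing */
    return len * 2;               /* "useful work" */
}

/* Rescue point: on a fault below, execution resumes here and the function
 * returns the error value learned during rescue point discovery (-1 here). */
int handle_request(int len, int *out)
{
    if (setjmp(rescue_env) != 0)
        return -1;                /* forced error return (error virtualization) */
    *out = parse_request(len);
    return 0;                     /* normal path */
}
```

The caller sees an ordinary error code from `handle_request` and runs its existing, programmer-tested error handling, which is the behavior the rescue point is meant to trigger.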
Checkpoint/Rollback • Based on Zap • OS virtualization layer • Checkpoints kept in memory • Standard copy-on-write semantics • Consistent checkpoints of multi-process applications • File-system snapshot
Rescue Point Testing and Deployment • Test for survivability, correctness and performance • Repeat selection process if needed • Deployment via binary injection into running application on production server • Avoid: patch, compile, stop, restart
Evaluation • Implemented ASSURE for Linux • Tested several popular server applications • Metrics: survivability, correctness, performance • All tests on stripped binaries
Related Work • Number of proposals: • Failure-oblivious computing [OSDI 04] • Rx: Treating bugs as allergies [SOSP 05] • Automatic data-structure repair [OOPSLA 03] • Most try to mask the occurrence of faults • Problem with ensuring program semantics on recovery (unanticipated execution paths) • Our approach is to force an error!
Conclusions • Full system that enables automatic software self-healing • Introduced rescue-points • Programmer-tested recovery points • Experimental evaluation • Automatically fixed 8 real bugs • With minimal performance overhead