
Automatic Software Self-Healing using Rescue Points

Automatic Software Self-Healing using Rescue Points. Angelos Keromytis, Jason Nieh, Sal Stolfo. Department of Computer Science, Columbia University.


Presentation Transcript


  1. Automatic Software Self-Healing using Rescue Points • Angelos Keromytis, Jason Nieh, Sal Stolfo • Department of Computer Science • Columbia University

  2. Motivation • Software remains buggy and crash-prone • Problem for high-availability systems, remote attacks, high-volume events, non-exploitable bugs • High cost of downtime • In the absence of perfect software, error toleration and recovery techniques become a necessary complement to existing techniques

  3. Dealing with Failures • Programming Language Design (avoid failures) • Software Verification (prove failure-free) • Software Testing (expose failures) • Failure Detection (detect failures) • Diagram labels: Development, Bugs, Deployment

  4. I detected a failure, now what? • User/Administrator: restart application, file bug report • Developer: locate bug, create patch, test patch, deploy patch

  5. Restart Malady

  6. Integrity vs Availability • Terminate execution when fault is detected • Recurring faults (worms, etc.) • Applications that build a lot of state • Collateral damage • But is this not the only sane thing to do? • Life after death?

  7. Our Work: ASSURE • New automatic software self-healing system • Augments software integrity with availability • Works on commercial-off-the-shelf software

  8. Software Elasticity • Assumption: Behind every complex system lies a well-tested core • Programmers build error handling • They just can’t cover every corner case • Feature creep, complexity, etc.

  9. Rescue Points • Recover using the program’s own code • Mapping between the set of faults that could occur and those explicitly handled by the program code • Profile programs during “bad” test runs • Build behavioral model • Discover candidate recovery (rescue) points • Induce faults at locations that are known (or suspected) to propagate faults correctly • Work on binaries (COTS)

  10. High-level Example • Call chain: a() → b() → c() → d() • int c() { • if ((res = d()) < 0) • return -1; /* Error */ • else • /* Do useful work */ • return 0; /* OK */ • } • int d() { • /* Slice-off functionality */ • return -1; /* Error */ • }

  11. Why does this work? • Focus on server applications • Short error propagation distance • Errors in one request do not affect the computation of subsequent requests • Servers inherently support error handling (bad requests)

  12. Self-Healing Process • Monitor: rescue point discovery, fault monitoring • Diagnose: fault reproduction, rescue point selection • Adapt: rescue point creation • Test: rescue point testing and deployment

  13. ASSURE: Time Line • Diagram labels along the Time axis: Production System • Fault Detected • Rescue-point Analysis (offline) • Vulnerability Window • Dynamic Patch • Patched Production System

  14. ASSURE Architecture

  15. Rescue Point Discovery • Dynamic analysis via fuzzing • No access to source code • Dynamic binary instrumentation (Dyninst) • Examine behavior under “bad” input • Identify candidate rescue points • Log most frequent error return values • Happens off-line • Need only do it once
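The "log most frequent error return values" step can be illustrated with a small tally. This is only a sketch: it assumes the return values observed for one function during the "bad input" runs have already been collected into an array, and the name `most_frequent_return` is hypothetical, not part of ASSURE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the return values observed for one function
 * while fuzzing it with bad input, pick the value seen most often.
 * Ties go to the value that appears first. Requires n > 0. */
int most_frequent_return(const int *vals, size_t n)
{
    assert(vals != NULL && n > 0);

    int best = vals[0];
    size_t best_count = 0;

    for (size_t i = 0; i < n; i++) {
        size_t count = 0;
        for (size_t j = 0; j < n; j++) {
            if (vals[j] == vals[i])
                count++;
        }
        if (count > best_count) {
            best_count = count;
            best = vals[i];
        }
    }
    return best;
}
```

The chosen value is what a rescue point would later return when it forces an error. The quadratic scan keeps the sketch short; a real profiler would likely keep per-function counters instead.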

  16. Fault Detection • Fault detection viewed as blackbox • Map detected faults to signals • Lightweight sensors on application • Simply give indication of failure • Watchdog process • ProPolice, StackGuard, etc.
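One way to picture "map detected faults to signals" is a sensor that raises a signal when its check fails, and a handler that only records that a failure happened. The sketch below assumes a POSIX system; `install_fault_sensor`, `sensor_check`, and `last_fault` are hypothetical names, and a real deployment would trigger recovery from the handler rather than just setting a flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Last fault signal seen; 0 means no fault has been reported. */
volatile sig_atomic_t last_fault = 0;

/* Handler: simply gives an indication of failure. */
static void on_fault(int sig)
{
    last_fault = sig;
}

/* Register the handler for one signal. Returns 0 on success, -1 on error. */
int install_fault_sensor(int sig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Hypothetical lightweight sensor: if the invariant does not hold,
 * report the failure by raising the mapped signal. */
void sensor_check(int invariant_holds, int sig)
{
    if (!invariant_holds)
        raise(sig);
}
```

For example, after `install_fault_sensor(SIGUSR1)`, a call such as `sensor_check(canary_intact, SIGUSR1)` leaves `last_fault` at 0 while the check passes and sets it to `SIGUSR1` once it fails. `SIGUSR1` is used here only as a safe stand-in for whatever signal a given detector is mapped to.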

  17. Fault Reproduction • Network inputs • Good for deterministic failures • Cannot fully reproduce system state • Deterministic replay • Record all interactions between processes and their environment
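The network-input variant of fault reproduction can be sketched as a log of received requests that is later fed back to the application until the failure shows up again. Everything here is an illustrative assumption: the fixed-size `input_log`, the function names, and `toy_handler`, which stands in for the server code and "fails" on one specific request.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_INPUTS 16
#define MAX_LEN    64

/* Hypothetical in-memory log of inputs received by the server. */
struct input_log {
    char   data[MAX_INPUTS][MAX_LEN];
    size_t count;
};

void input_log_init(struct input_log *log)
{
    assert(log != NULL);
    log->count = 0;
}

/* Record one input. Returns 0 on success, -1 if the log is full
 * or the message does not fit. */
int record_input(struct input_log *log, const char *msg)
{
    if (log->count >= MAX_INPUTS || strlen(msg) >= MAX_LEN)
        return -1;
    strcpy(log->data[log->count++], msg);
    return 0;
}

/* Replay the log in order. Returns the index of the first input for
 * which the handler reports a failure (nonzero), or -1 if none does. */
int replay_until_failure(const struct input_log *log,
                         int (*handler)(const char *))
{
    for (size_t i = 0; i < log->count; i++) {
        if (handler(log->data[i]) != 0)
            return (int)i;
    }
    return -1;
}

/* Stand-in for the application: fails on the request "CRASH". */
int toy_handler(const char *msg)
{
    return strcmp(msg, "CRASH") == 0 ? -1 : 0;
}
```

As the slide notes, replaying inputs alone only works for deterministic failures; capturing all interactions between processes and their environment is what the deterministic-replay option adds, and it is not modeled here.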

  18. Rescue Point Selection • Replay and detect failure • Extract stack trace • Find candidate rescue point that is closest to failure
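Selecting the candidate closest to the failure amounts to walking the stack trace from the faulting frame outward and stopping at the first function that is a known candidate. A minimal sketch, assuming the trace is an array of function names ordered innermost first; the function name and the string-based representation are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* trace[0] is the faulting function, trace[depth-1] the outermost caller.
 * Returns the index in trace of the candidate rescue point closest to
 * the failure, or -1 if no frame is a candidate. */
int select_rescue_point(const char *const *trace, size_t depth,
                        const char *const *candidates, size_t ncand)
{
    assert(trace != NULL && candidates != NULL);

    for (size_t i = 0; i < depth; i++) {
        for (size_t j = 0; j < ncand; j++) {
            if (strcmp(trace[i], candidates[j]) == 0)
                return (int)i;
        }
    }
    return -1;
}
```

With the call chain from the high-level example, a trace of `d, c, b, a` and candidates `a` and `c` yields index 1, that is, `c()`, since it is nearer to the fault than `a()`.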

  19. Rescue Point Creation • Inject using dynamic binary instrumentation • Take checkpoint at rescue point • If fault detected, restore and steer execution • Cause application to rollback to checkpoint • Force error return using return value from rescue point discovery
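The checkpoint, rollback, and forced error return can be mimicked inside a single process. The sketch below is an analogy, not the ASSURE mechanism: it swaps in `setjmp`/`longjmp` for control-flow rollback and a plain struct copy for the memory checkpoint, whereas the slides describe process-level checkpoints injected through binary instrumentation. All names (`app_state`, `handle_request`, `parse_request`, `fault_detected`) are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <string.h>

/* Hypothetical application state that must be rolled back on a fault. */
struct app_state {
    int  requests_served;
    char last_request[64];
};

struct app_state state;              /* live state */
static struct app_state checkpoint;  /* in-memory checkpoint */
static jmp_buf rescue_env;           /* saved execution context */

/* Stand-in for a fault sensor firing: steer execution back
 * to the rescue point. */
static void fault_detected(void)
{
    longjmp(rescue_env, 1);
}

/* Buggy function below the rescue point: it mutates state and then
 * "crashes" on the request "BAD". */
static void parse_request(const char *req)
{
    assert(req != NULL);
    strncpy(state.last_request, req, sizeof state.last_request - 1);
    state.last_request[sizeof state.last_request - 1] = '\0';
    state.requests_served++;
    if (strcmp(req, "BAD") == 0)
        fault_detected();
}

/* Rescue point: checkpoint on entry; on a fault, restore the state
 * and force the error return the caller already knows how to handle. */
int handle_request(const char *req)
{
    checkpoint = state;              /* take checkpoint */
    if (setjmp(rescue_env) != 0) {
        state = checkpoint;          /* roll back side effects */
        return -1;                   /* forced error return */
    }
    parse_request(req);
    return 0;
}
```

Calling `handle_request("BAD")` after a successful request returns -1 and leaves `state` exactly as it was before the call, so the caller sees an ordinary error rather than a crash. Note that `setjmp` does not restore memory by itself, which is why the sketch copies the state explicitly.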

  20. Rescue Point Implementation • int rescue_point( int a, int b ) { • int rid = rescue_capture(id, fault); • if (rid < 0) /* checkpoint/restore error */ • handle_error(id); • else if (rid == 0) /* error virtualization */ • return rescue_ret_val(fault); • else • /* rescue-point identifier */ • ... • }

  21. Checkpoint/Rollback • Based on Zap • OS virtualization layer • Checkpoints kept in memory • Standard copy-on-write semantics • Consistent checkpoints of multi-process applications • File-system snapshot

  22. Rescue Point Testing and Deployment • Test for survivability, correctness and performance • Repeat selection process if needed • Deployment via binary injection into running application on production server • Avoid: patch, compile, stop, restart

  23. Evaluation • Implemented ASSURE for Linux • Tested several popular server applications • Metrics: survivability, correctness, performance • All tests on stripped binaries

  24. Real Bugs and Benchmarks

  25. Rescue Point to Fault

  26. Self-Healing Time

  27. Recovery Time

  28. Normalized Performance

  29. Checkpoint Time

  30. Rescue Point State

  31. Related Work • Number of proposals: • Failure-oblivious computing [OSDI 04] • Rx: Treating bugs as allergies [SOSP 05] • Automatic data-structure repair [OOPSLA 03] • Most try to mask the occurrence of faults • Problem with ensuring program semantics on recovery (unanticipated execution paths) • Our approach is to force an error!

  32. Conclusions • Full system that enables automatic software self-healing • Introduced rescue-points • Programmer-tested recovery points • Experimental evaluation • Automatically fixed 8 real bugs • With minimal performance overhead
