Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures. Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP’05. Shimin Chen, LBA Reading Group Presentation
Motivation • High availability is important • Critical applications: process control, etc. • Financial company: an hour of downtime costs $6 million • SW defects account for up to 40% of system failures • Common: memory-related bugs and concurrency bugs • Bugs still occur in production runs • Even after software companies spend enormous effort on testing • Hence the need for mechanisms to survive software bugs
Previous Work on Surviving SW Failures • Four categories: • Rebooting • Checkpointing and recovery • Application-specific mechanisms • Recent proposals: • Failure-oblivious computing • Reactive immune system
Previous Work 1: Rebooting • Schemes: • Whole program restart • Micro-rebooting of partial system components • SW rejuvenation (proactively restart processes) • Problem: • Cannot deal with deterministic bugs • Restart time
Previous Work 2: General checkpointing and recovery • Schemes: • Checkpoint, rollback, re-execute • Or use a backup server • Problems: • Cannot deal with deterministic bugs • Progressive retry in distributed systems: • Reorders messages to get around SW bugs, but does not help with bugs on a single system • N-version programming: • Too expensive
Previous Work 3: Application-Specific Recovery Mechanisms • Multi-process model (MPM) • Kill a request-handling process and start a new one • Problems: • Cannot handle deterministic bugs • What if shared data structure is corrupted?
Previous Work 4: Recent Non-Conventional Proposals • Failure-oblivious computing • Manufacture values for out-of-bound reads • Discard out-of-bound writes • Reactive immune system • Detect failures of function calls • Forcefully return from the function with a manufactured error return value (e.g. -1 for int, 0 for unsigned int etc.) • Problem: • Unsafe for correctness-critical applications (e.g. banking)
New Proposal: Rx • Roll back the program to a recent checkpoint when a bug is detected • Dynamically change the execution environment based on the failure symptoms • Re-execute the buggy code in the new environment • Features: • Comprehensive: can deal with deterministic bugs • Safe: does not speculatively “fix” bugs, but changes the environment • Noninvasive: no changes to app source code • Efficient • Informative: helps locate the bugs
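The rollback / change environment / re-execute cycle can be sketched as a small loop. This is a toy model and not the Rx implementation: `run_with_rx`, `copy_request`, the `padding` environment key, and the use of `copy.deepcopy` as a stand-in for a process checkpoint are all assumptions made for illustration, and a Python `IndexError` stands in for a detected failure.

```python
import copy


def run_with_rx(state, region, environments):
    """Run region(state, env); on failure, roll back and retry in a changed environment."""
    checkpoint = copy.deepcopy(state)   # stand-in for a real process checkpoint
    log = []                            # attempts recorded for offline diagnosis
    for env in [None] + list(environments):   # None = the original environment
        try:
            result = region(state, env)
            log.append((env, "ok"))
            return result, log
        except Exception as exc:        # stand-in for a failure sensor
            log.append((env, type(exc).__name__))
            state.clear()               # roll back to the checkpoint
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("no environment change avoided the failure")


def copy_request(state, env):
    """Hypothetical buggy region: copies a request into a buffer that is too small."""
    pad = env["padding"] if env else 0
    buf = [0] * (8 + pad)
    state["attempts"] += 1              # side effect that rollback must undo
    for i, byte in enumerate(state["request"]):
        buf[i] = byte                   # fails when the request exceeds the buffer
    return len(state["request"])


state = {"request": list(range(10)), "attempts": 0}
result, log = run_with_rx(state, copy_request, [{"padding": 4}])
```

The application code (`copy_request`) is left unchanged between attempts; only the environment passed to it differs, which is the sense in which the approach avoids speculative fixes.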
Outline • Introduction • Main Idea of Rx • Rx Design & Implementation • Evaluation • Summary
Main Idea • Record the changes for offline diagnosis
Useful Execution Environmental Changes • Must be safe, and may avoid the bug • Memory management based • Buffer overflows, dangling pointers, etc. • Timing based • Concurrency bugs • User request based • Dropping unexpected (possibly malicious) user requests • As a last resort
Rx Components Overview • [Figure: Rx components, numbered 1 to 5]
Sensors for Detecting SW Failures • OS-raised exceptions: • Assertion failures, segfault, divide-by-zero, etc. • Fine-grain detection: • Buffer overflow, accesses to freed memory, etc. • Only the OS-raised exception sensors are implemented
Checkpoint and Rollback (Flashback) • Memory state: fork-like operation • Files: keep a copy of each accessed file and its file pointer for each checkpoint • Checkpoint management: • Equal intervals or exponential landmarks • Limit the age of the oldest checkpoint according to the recovery time goal • Multi-threaded process checkpointing • Send a signal to all threads to make them exit from blocked syscalls with EINTR • Take checkpoint • Library wrapper in Rx retries the interrupted syscalls • High cost, so cannot be done frequently
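The retry step in multi-threaded checkpointing can be illustrated with a small wrapper that re-issues a call interrupted by a signal. This is a sketch under assumptions: `retry_on_eintr`, `FlakyRead`, and `max_retries` are hypothetical names, and the fake `FlakyRead` object stands in for a blocking system call that the checkpoint signal interrupts. (Python 3.5 and later already retry most interrupted system calls internally, so the explicit loop here is purely illustrative.)

```python
import errno


def retry_on_eintr(syscall, *args, max_retries=10):
    """Call syscall(*args), retrying when it is interrupted by a signal (EINTR)."""
    for _ in range(max_retries):
        try:
            return syscall(*args)
        except InterruptedError:        # Python's exception for errno EINTR
            continue                    # checkpoint was taken; issue the call again
    raise InterruptedError(errno.EINTR, "still interrupted after retries")


class FlakyRead:
    """Hypothetical blocking read that is interrupted a fixed number of times."""

    def __init__(self, interruptions):
        self.remaining = interruptions
        self.calls = 0

    def __call__(self, nbytes):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise InterruptedError(errno.EINTR, "interrupted by checkpoint signal")
        return b"x" * nbytes


read = FlakyRead(interruptions=2)
data = retry_on_eintr(read, 4)
```

From the application's point of view the call simply returns its normal result; the two interruptions caused by checkpointing stay hidden inside the wrapper.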
Environment Wrappers • Memory wrapper: (intercepting library calls) • Delaying free: • Keep a freed buffer for a threshold of (process) time • FIFO recycling • Padding buffers: • Add fixed-size padding to both ends of each allocated buffer • Allocation isolation: • Place allocated buffers in isolated locations • Zero-filling • Apply the above only during re-execution of the failed code region
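A toy model may help show why padding and delayed free can mask memory bugs. The sketch below is an assumption-laden illustration, not the Rx memory wrapper: `ToyHeap` is a hypothetical bump allocator over a `bytearray`, writes are deliberately unchecked within the arena to mimic C, and the delay is counted in number of freed buffers rather than in process time.

```python
from collections import deque


class ToyHeap:
    """Toy bump allocator over a bytearray; writes are unchecked, as in C."""

    def __init__(self, size, pad=0, delay_slots=0):
        self.mem = bytearray(size)      # bytearray(size) starts zero-filled
        self.pad = pad                  # padding bytes on each side of a buffer
        self.delay_slots = delay_slots  # how many freed buffers to hold back
        self.next_free = 0
        self.delayed = deque()          # FIFO of freed buffers not yet reusable
        self.reusable = []              # freed buffers that may be handed out

    def alloc(self, n):
        for i, (start, size) in enumerate(self.reusable):
            if size >= n:               # recycle a released buffer
                del self.reusable[i]
                self.mem[start:start + size] = bytes(size)  # zero-fill on reuse
                return start
        start = self.next_free + self.pad
        end = start + n + self.pad
        if end > len(self.mem):
            raise MemoryError("toy heap exhausted")
        self.next_free = end
        return start

    def free(self, start, n):
        self.delayed.append((start, n))
        while len(self.delayed) > self.delay_slots:   # FIFO recycling
            self.reusable.append(self.delayed.popleft())

    def write(self, start, data):
        if start + len(data) > len(self.mem):
            raise IndexError("write outside the toy arena")
        self.mem[start:start + len(data)] = data      # no per-buffer bounds check

    def read(self, start, n):
        return bytes(self.mem[start:start + n])


def overflow_demo(pad):
    heap = ToyHeap(64, pad=pad)
    a, b = heap.alloc(4), heap.alloc(4)
    heap.write(b, b"GOOD")
    heap.write(a, b"AAAAAA")            # bug: writes 2 bytes past the end of a
    return heap.read(b, 4)


def dangling_demo(delay_slots):
    heap = ToyHeap(64, delay_slots=delay_slots)
    p = heap.alloc(4)
    heap.free(p, 4)
    q = heap.alloc(4)
    heap.write(q, b"LIVE")
    heap.write(p, b"OLD!")              # bug: write through the dangling pointer p
    return heap.read(q, 4)
```

With `pad=0` the overflow clobbers the neighbouring buffer, while `pad=8` absorbs it; with `delay_slots=0` the stale write lands in the recycled buffer, while `delay_slots=1` keeps the freed block out of circulation. In both cases the buggy code is unchanged.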
Other Wrappers • Message wrapper (in proxy) • Randomly shuffle the order of messages across different connections while keeping the message order within each connection • Randomize packet sizes • Process scheduling: change process priority • Signal delivery: randomize hw interrupt delivery time while preserving order • Dropping user requests • Binary search for bad requests • Drop at most 10% of requests
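The message-shuffling idea can be sketched as a random interleaving that never reorders messages of the same connection. The function name `shuffle_across_connections` and the `(connection, payload)` tuple format are assumptions for illustration; how the Rx proxy actually buffers and releases messages is not described on the slide.

```python
import random
from collections import deque


def shuffle_across_connections(messages, rng=random):
    """Randomly interleave (connection, payload) pairs, keeping per-connection order."""
    queues = {}
    for conn, payload in messages:
        queues.setdefault(conn, deque()).append(payload)
    shuffled = []
    while queues:
        conn = rng.choice(list(queues))          # pick any connection with pending messages
        shuffled.append((conn, queues[conn].popleft()))  # always take its oldest message
        if not queues[conn]:
            del queues[conn]
    return shuffled


messages = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 2), ("a", 3)]
reordered = shuffle_across_connections(messages, random.Random(0))
```

Because each step takes the oldest pending message of the chosen connection, any output is a valid delivery order from each client's point of view, while the relative timing between clients changes.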
Control Unit • Coordinates checkpoint/rollback, environment changes, etc. • Failure vector <S1, S2, …, Sm> per failure symptom (exception type, PC address, call chain, etc.) • Si is the score for environmental change #i • If change #i is successful, Si++; if it fails, Si-- • Try the changes with scores greater than a certain threshold first
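The failure-vector bookkeeping can be sketched as follows. The class and change names are hypothetical, the symptom is reduced to a tuple of exception type and PC, and sorting the above-threshold changes by descending score is an assumption: the slide only says that changes scoring above the threshold are tried first.

```python
from collections import defaultdict

# Hypothetical labels for the environmental changes, in their default order.
CHANGES = ["delay_free", "pad_buffers", "zero_fill", "shuffle_messages", "drop_request"]


class ControlUnit:
    """Keeps one score vector <S1, ..., Sm> per failure symptom."""

    def __init__(self, changes, threshold=0):
        self.changes = list(changes)
        self.threshold = threshold
        self.vectors = defaultdict(lambda: [0] * len(self.changes))

    def record(self, symptom, change, success):
        """Si++ if change #i avoided the failure, Si-- otherwise."""
        i = self.changes.index(change)
        self.vectors[symptom][i] += 1 if success else -1

    def order(self, symptom):
        """Changes scoring above the threshold first, then the rest in default order."""
        scores = self.vectors[symptom]
        indices = range(len(self.changes))
        preferred = sorted((i for i in indices if scores[i] > self.threshold),
                           key=lambda i: -scores[i])   # assumption: best score first
        rest = [i for i in indices if scores[i] <= self.threshold]
        return [self.changes[i] for i in preferred + rest]


unit = ControlUnit(CHANGES)
symptom = ("SIGSEGV", 0x4005D0)          # made-up exception type and PC
unit.record(symptom, "delay_free", success=False)
unit.record(symptom, "pad_buffers", success=True)
plan = unit.order(symptom)
```

After one failed and one successful attempt, a recurrence of the same symptom starts with padding, so recovery from a repeated failure should need fewer re-executions.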
Setup • A client machine and a server machine • 2.4GHz x86 CPU, 512KB L2 cache, 1GB DRAM • 100Mbps Ethernet • [Table: injected bugs]
Checkpoint Overhead • Time: with a checkpoint interval of 200ms, 5% overhead (MySQL) • Workloads: • apache, squid: 5 threads, GET files with sizes uniformly distributed in [1KB, 512KB] • CVS: client exports a 30KB file • MySQL: 5 client threads, transactions on a small table
Summary • Rx: re-executing the buggy program region in a modified execution environment • Not a panacea: • Semantic bugs, resource leaks • Latent bugs (long delay from bug to symptom)