This presentation by Daniel Taylor discusses the motivation for treating software bugs as allergies, surveys existing approaches to surviving software failures, and introduces the Rx approach, which aims to be comprehensive, safe, noninvasive, efficient, and informative.
Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures Presented by: Daniel Taylor
Outline • Motivation • Approaches to surviving failures • Rx Approach • Rx Design • Experimental Results • Future Work • Evaluation • Discussion
Motivation • System Availability • Gartner report: 1 hour of downtime = $6 million • Affected by software failures • Software defects cause up to 40% of system failures • Memory-related and concurrency bugs account for over 60% of system vulnerabilities
Motivation • Treat bugs as allergies • Examples of environmental bugs • Memory management • Buffer overflows • Dangling pointers • Timing • Races • Message ordering • User Request • Malicious users • Bad requests
Approaches to surviving failures • 1) Rebooting/System restart • Designed for hardware failures • Fails to fix deterministic bugs • Unavailability • Warm-up period • Micro-rebooting
Approaches to surviving failures • 2) Checkpointing and recovery • Checkpoint, rollback on failure, re-execute • Designed for hardware failures • Fails to fix deterministic bugs • Progressive retry – method to re-order messages • Only works for message ordering bugs • N-version programming – run a different implementation on re-execution • Requires extra software development
Approaches to surviving failures • 3) Application-specific recovery • Multi-process model • Spawn new processes if old ones fail • Cannot deal with deterministic bugs • Cannot deal with shared data corruption • Exception handling • Programmer must expect failures
Approaches to surviving failures • 4) Non-conventional methods • Failure-oblivious computing • Artificial values for buffer overflows • Reactive immune systems • Speculative error code for crashed functions • Unsafe methods, not appropriate for critical applications • Hard to debug if the “fix” does something strange
Rx Approach • Treat bugs like real-life allergies • Remove the allergen to see if it helps • Goals: • Comprehensive – survive software bugs • Safe - no uncertainty or introduced errors • Noninvasive – no modifications • Efficient – good performance, reduce downtime • Informative – help diagnose bugs
Rx Approach • Keep checkpoints • Fail > Rollback > Change Environment > Re-Execute • Disable modifications if it succeeds
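The Fail > Rollback > Change Environment > Re-Execute cycle can be sketched in a few lines. This is a minimal illustration, not the Rx implementation: `run_with_rx`, `handle`, and the `pad_buffers` / `zero_fill` keys are hypothetical names, the checkpoint is a plain deep copy of a dictionary, and a Python exception stands in for a crash.

```python
import copy

def run_with_rx(state, step, env_changes):
    """Run step(state, env); on failure, roll back and retry under each change."""
    checkpoint = copy.deepcopy(state)       # keep a checkpoint before executing
    try:
        step(state, {})                     # normal execution: no environment change
        return state, None
    except Exception:
        pass                                # failure detected, start recovery
    for change in env_changes:
        state = copy.deepcopy(checkpoint)   # roll back to the checkpoint
        try:
            step(state, dict(change))       # re-execute in the changed environment
            return state, change            # survived; caller resumes without the change
        except Exception:
            continue                        # this change did not help, try the next
    raise RuntimeError("no environment change avoided the failure")

def handle(state, env):
    """Hypothetical request handler with a 4-byte buffer that can overflow."""
    buf_size = 4 + (4 if env.get("pad_buffers") else 0)
    if len(state["request"]) > buf_size:
        raise MemoryError("buffer overflow")
    state["handled"] = True

state = {"request": "abcdef", "handled": False}
result, change = run_with_rx(state, handle, [{"zero_fill": True}, {"pad_buffers": True}])
```

Returning the change that worked, rather than leaving it switched on, mirrors the slide's last bullet: once the failing region has been passed, the modification is disabled again.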
Rx Approach • Execution Environment • Anything external to the application affecting it • Low Level – Hardware • Middle Level – OS Kernel: scheduling, VM system, FS, drivers, etc. • High Level – libraries • Change must be: • Correctness-preserving – follow APIs, do the same thing • Avoid bugs – potentially fix a software defect
Rx Approach • Environmental changes and bugs
Rx Design • 5 parts • Sensors • Checkpoint and Rollback (CR) • Environment Wrapper • Proxy • Control Unit
Rx Design: Sensors • Detect failures and inform the control unit • Two types of sensors: • Detect software errors • Detect bugs before they cause crashes • Only the 1st is implemented • Provide information about the type of exception, memory address, and stack signature
Rx Design: Checkpoint and Rollback • Checkpoints are automatically and transparently taken • Application memory, accessed files and file pointers are copied by copy-on-write • Kept in memory (no disk accesses), old checkpoints can be written to disk • Using checkpoints too far back takes too long
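To illustrate keeping a bounded number of in-memory checkpoints, here is a rough sketch. It swaps in techniques the slide does not describe: `CheckpointStore` is a hypothetical class, it uses a full deep copy instead of copy-on-write, and it simply discards the oldest checkpoint where Rx can write it to disk.

```python
import copy
from collections import deque

class CheckpointStore:
    """Keeps the most recent checkpoints in memory (deep copies, not copy-on-write)."""

    def __init__(self, max_kept=3):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._kept = deque(maxlen=max_kept)

    def take(self, state):
        self._kept.append(copy.deepcopy(state))

    def rollback(self, steps_back=1):
        """Return a copy of the checkpoint taken steps_back checkpoints ago (1 = newest)."""
        if not 1 <= steps_back <= len(self._kept):
            raise LookupError("no checkpoint that far back")
        return copy.deepcopy(self._kept[-steps_back])

store = CheckpointStore(max_kept=2)
for n in (1, 2, 3):
    store.take({"n": n})
newest = store.rollback(1)
older = store.rollback(2)
```

Rolling back further means re-executing more work, which is the trade-off behind the slide's note that checkpoints too far back take too long.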
Rx Design: Checkpoint and Rollback • Based on previous work, Flashback in 2004 • Because Rx doesn’t require determinism, it avoids overhead
Rx Design: Environment Wrappers • Carry out the environment changes during re-execution • Memory Wrapper • Intercepts memory library calls (malloc, free, etc) • Supports 4 environmental changes • Delaying free • Padding buffers • Allocation isolation • Zero-filling • Safe, no changes to API
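The effect of padding and delayed free can be shown with a toy model. Rx intercepts the real memory library calls; the sketch below does not. `ToyHeap` is a hypothetical in-process simulation in which blocks are `bytearray` objects, an out-of-bounds write raises `MemoryError`, and touching a released block raises `KeyError`.

```python
from collections import deque

class ToyHeap:
    """Toy stand-in for malloc/free: blocks are bytearrays keyed by an integer handle."""

    def __init__(self, pad=0, delay_free=0):
        self.pad = pad                  # extra bytes added to every allocation
        self.delay_free = delay_free    # number of freed blocks kept alive
        self.blocks = {}
        self._quarantine = deque()
        self._next = 1

    def malloc(self, size):
        handle = self._next
        self._next += 1
        self.blocks[handle] = bytearray(size + self.pad)  # bytearray is zero-filled
        return handle

    def free(self, handle):
        # Delayed free: park the block and release only the oldest parked ones.
        self._quarantine.append(handle)
        while len(self._quarantine) > self.delay_free:
            del self.blocks[self._quarantine.popleft()]   # a double free surfaces as KeyError

    def write(self, handle, offset, data):
        block = self.blocks[handle]     # KeyError here models a use-after-free crash
        if offset + len(data) > len(block):
            raise MemoryError("write past end of block")
        block[offset:offset + len(data)] = data

padded = ToyHeap(pad=8)
p = padded.malloc(4)
padded.write(p, 0, b"12345")            # one byte too many, absorbed by the padding

delayed = ToyHeap(delay_free=1)
q = delayed.malloc(4)
delayed.free(q)
delayed.write(q, 0, b"ab")              # dangling write lands in a still-live block
```

Neither setting changes what a correct program observes, which is the sense in which these changes are safe: `malloc` and `free` keep the same interface and only the layout and timing behind them differ.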
Rx Design: Environment Wrappers • Message Wrapper • Implemented in the proxy, controls message ordering • Changes include • Shuffling requests • Randomized packet sizes • Helps avoid non-deterministic bugs • No change to execution – server should not expect any ordering or size
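Both message-related changes are easy to sketch. The function names below are hypothetical, and the code works on in-memory lists and byte strings rather than on real sockets, so it only shows the shape of the idea.

```python
import random

def shuffle_requests(requests, rng):
    """Return the buffered requests in a random order, leaving the input untouched."""
    reordered = list(requests)
    rng.shuffle(reordered)
    return reordered

def random_packets(payload, rng, max_size=8):
    """Split one byte string into packets of random size between 1 and max_size."""
    packets = []
    pos = 0
    while pos < len(payload):
        size = rng.randint(1, max_size)          # randint is inclusive on both ends
        packets.append(payload[pos:pos + size])  # the last packet may be shorter
        pos += size
    return packets

rng = random.Random(0)                           # seeded only to make the demo repeatable
requests = ["a", "b", "c", "d"]
reordered = shuffle_requests(requests, rng)
payload = b"GET /index.html HTTP/1.0"
packets = random_packets(payload, rng)
```

Concatenating the packets gives back the original payload, which is why a server that makes no assumption about packet boundaries sees the same stream either way.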
Rx Design: Environment Wrappers • Process Scheduling • Change priority • Signal Delivery • Signals are recorded and can be delivered randomly • Dropping User Requests • Binary search to narrow down the likely bad user request and drop it
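The binary search over user requests might look like the sketch below. It rests on assumptions the slide does not state: exactly one request triggers the failure, requests can be replayed independently of each other, and `survives` is a hypothetical callback that rolls back, replays only the given requests, and reports whether the run completed.

```python
def find_bad_request(requests, survives):
    """Return the index of the single failure-inducing request.

    survives(subset) is assumed to roll back, replay only subset, and return
    True when no failure occurs.
    """
    if not requests:
        raise ValueError("no requests to search")
    lo, hi = 0, len(requests)            # the culprit lies somewhere in requests[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if survives(requests[lo:mid]):
            lo = mid                     # first half is clean, culprit is in the second
        else:
            hi = mid                     # first half fails, culprit is in it
    return lo

requests = ["ok1", "ok2", "BAD", "ok3", "ok4"]
replays = []

def survives(subset):
    replays.append(list(subset))         # count how many re-executions were needed
    return "BAD" not in subset

bad = find_bad_request(requests, survives)
kept = requests[:bad] + requests[bad + 1:]   # drop the culprit, keep everything else
```

Each probe costs one rollback and re-execution, so the number of replays grows with the logarithm of the number of requests since the checkpoint rather than linearly.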
Rx Design: Proxy • Records and replays messages on re-execution • Simply forwards messages during normal execution • On recovery, the proxy • Replays requests • Carries out message-related environment changes • Buffers incoming messages for after failure recovery • Keeps track of which requests received responses
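A stripped-down version of the record-and-replay behaviour is sketched here. `RecordingProxy` is a hypothetical class, the server is modelled as a plain callable, and the use made of the "answered" flag (returning only replies the client has not yet seen) is an assumption about why the proxy tracks responses.

```python
class RecordingProxy:
    """Forwards requests to a server callable and logs them for later replay."""

    def __init__(self, server):
        self.server = server     # callable: request -> response
        self.log = []            # entries since the last checkpoint: [request, answered]

    def checkpoint(self):
        self.log.clear()         # requests before the checkpoint never need replaying

    def forward(self, request):
        entry = [request, False]
        self.log.append(entry)   # record before forwarding, so a crash keeps the request
        response = self.server(request)
        entry[1] = True
        return response

    def replay(self, server):
        """Re-send logged requests to a rolled-back server; return only unseen responses."""
        fresh = []
        for entry in self.log:
            response = server(entry[0])
            if not entry[1]:
                fresh.append(response)   # this reply was never delivered before
                entry[1] = True
        return fresh

def crashy(request):
    if request == "c":
        raise RuntimeError("crash")
    return request.upper()

def recovered(request):
    return request.upper()

proxy = RecordingProxy(crashy)
first = proxy.forward("a")
second = proxy.forward("b")
try:
    proxy.forward("c")
except RuntimeError:
    pending = proxy.replay(recovered)    # re-execute against the recovered server
```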
Rx Design: Control Unit • Coordinates the other components and performs 3 functions: • Directs checkpointing and requests rollbacks • Diagnoses failures based on symptoms and past experience and chooses changes to use • Gives an information report for programmers • Keeps a failure table to judge how well each environmental change works for future reference
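One possible shape for the failure table is sketched below. The slide does not say how Rx scores environmental changes, so the success-rate ranking, the neutral score for untried changes, and the `FailureTable` name are all assumptions made for illustration.

```python
from collections import defaultdict

class FailureTable:
    """Remembers how often each environment change survived a given failure symptom."""

    def __init__(self):
        # (symptom, change) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, symptom, change, survived):
        entry = self._stats[(symptom, change)]
        entry[1] += 1
        if survived:
            entry[0] += 1

    def ranked(self, symptom, changes):
        """Order candidate changes by past success rate for this symptom, best first."""
        def score(change):
            wins, tries = self._stats.get((symptom, change), (0, 0))
            return wins / tries if tries else 0.5   # untried changes get a neutral score
        return sorted(changes, key=score, reverse=True)

table = FailureTable()
symptom = "SIGSEGV in memcpy"            # hypothetical symptom string from a sensor
table.record(symptom, "pad_buffers", survived=True)
table.record(symptom, "zero_fill", survived=False)
order = table.ranked(symptom, ["zero_fill", "delay_free", "pad_buffers"])
```

Trying the historically best change first shortens recovery when the same bug fires again, which is the "learning from previous failures" behaviour the experiments measure.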
Rx Design • Multi-threaded process checkpointing • Threads must be at the user level before taking a checkpoint because of kernel locks and state issues • A signal makes threads exit blocked calls to take the checkpoint, then Rx retries them • Big I/O problems with this method, cannot set checkpoint interval too short
Experimental Results • 4 different sets of tests • Surviving failures • Performance overhead • Malicious requests • Learning from previous failures • Tested with 4 real applications • Apache httpd – web server • Squid – proxy server • MySQL – database server • CVS – version control server • 6 bugs: data race, buffer overflow, uninitialized read, dangling pointer, stack overflow, double free
Experimental Results • Alternatives are the whole program restart or a rollback and re-execute method • Rx provides availability and is faster than restart methods except in the case of very simple programs (CVS) • If the bug is deterministic, restarting will likely cause a crash again
Future Work • Inter-server communication • If Rx is on all systems, it can roll back any that it needs to when a failure occurs • Coordinated checkpoints • Unavoidable bugs/failures • Memory leaks – requires whole program restart • Deadlocks • Semantic bugs that have nothing to do with the environment • Undetectable bugs – need better sensors • Implement the proxy at the kernel level
Evaluation • Safe/fast recovery from certain bugs, but not all bugs • Masks failures from users, provides availability • Rx was only tested on I/O-bound applications; overhead may be larger for computation-bound applications
Discussion • Questions?