1 / 12

Bugs (part 2)

This paper explores the concept of treating bugs as allergies in software failures to achieve timely recovery. It discusses the motivations behind this approach, the insight that many bugs are environmentally dependent, and the solution of applying checkpoint and rollback with tweaked environments. The paper also evaluates the effectiveness of Rx in recovering from bugs and discusses related work in the field.

rodericke
Download Presentation

Bugs (part 2)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Bugs (part 2) CPS210 Spring 2006

  2. Papers • Rx: Treating Bugs As Allergies—A Safe Method to Survive Software Failures • Fen Qin … Yuanyuan Zhou

  3. Motivation • Bugs are not going away anytime soon • We would still like it be highly available • Server downtime is very costly • Recovery-oriented computing (Stanford/Berkeley) • Insight • Many bugs are environmentally dependent • Try to make the environment more forgiving • Avoid the bug • Example bugs • Memory issues • Data races

  4. Solution: Rx • Apply checkpoint and rollback • On failure, rollback to most recent checkpoint • Replay with tweaked environment • Iterate through environments until it works • Timely recovery • Low overhead nominal overhead • Legacy-code friendly

  5. Rx: “The main idea” • Timely recovery • Low nominal overhead • Legacy code friendly

  6. Environmental tweaks • Memory-related environment • Reschedule memory recycling (double free) • Pad-out memory blocks (buffer overflow) • Readdress allocated memory (corruption) • Zero-fill allocated memory (uninitialized read) • Timing-related environment • Thread scheduling (races) • Signal delivery (races) • Message ordering (races)

  7. System architecture Proxy Server Application Clients Sensors Environment Wrapper Checkpoint, rollback Errors Control Unit

  8. Proxy • Introspect on http, mysql, cvs protocols • What about request dependencies?

  9. “A: get ticket” “A got ticket” “B: get ticket” Fail Tickets = 1 Replay “B: get ticket” “B got ticket” “A: get ticket” “A: none left” “B got ticket” Example: ticket sales P Tickets = 1 CP A B Tickets = 0 Tickets = 0

  10. Evaluation results • Recovered from 6 bugs • Rx recovers much faster than restart • Except for CVS, which is strange • Problem: times are for second bug occurrence • Table is already in place (factor of two better) • Overhead with no bugs essentially zero • Throughput, response time unchanged • Have they made the right comparisons? • Performance of simpler solutions?

  11. Questions • Why not apply changes all the time? • Much simpler and prevents 5/6 bugs • Can recompile with CCured (4 bugs) • 30-150% slower for SPECINT95 benchmarks • SPECINT95 is CPU bound (no IO) • Network I/O would probably mask CPU slowdown • Can schedule w/ larger quanta? (1 bug) • Where is the performance comparison? • Sixth bug is fixed by dropping the request • Could multiplex requests across n servers

  12. Related work • Rx lies at the intersection of two hot topics • Availability + dependability • ROC (Berkeley, Stanford) • Nooks (Washington) • Process replay + dependencies • Pulse (Duke) • ReVirt, CoVirt, Speculator (Michigan)

More Related