
Minimizing Faulty Executions of Distributed Systems

Explore methods for minimizing faulty executions in distributed systems to improve system reliability and bug detection, using techniques such as Delta Debugging and Dynamic Partial Order Reduction to reduce execution sequences efficiently.

sandovala

Presentation Transcript


  1. Minimizing Faulty Executions of Distributed Systems NSDI’16

  2. Bugs in Distributed Systems • Why so many bugs in distributed systems? • The number of event orderings grows exponentially with the number of events • High level of non-determinism

  3. Existing Methods of Bug Detection • Unit and integration tests (for anticipated bugs) • Fuzzing (for unanticipated bugs) • But systems can run for long periods before problems manifest themselves • Automatic minimization of faulty executions is therefore important

  4. Minimization of Faulty Executions • What is minimized • External events (process starts, process restarts, external messages) • Internal events (internal communications between concurrent processes) • The executions produced are within 1X to 4.6X of the smallest possible executions • 1X to 16X smaller than those produced by prior techniques • Distributed Execution Minimizer (DEMi)

  5. System Model • Configuration – internal state + contents of the network buffer • Event – moves the system between configurations • Internal events (messages (m, p)) • External events (process starts, restarts or external messages) [Figure: processes P1, P2, P3, a timer, and the network buffer]

  6. System Model … • Schedule – finite sequence of events • Execution – the sequence of configurations produced by applying each event of a schedule in order • Invariant – a condition (for example, a safety condition) that can be checked in each configuration • Minimal Causal Sequence (MCS) • Minimal sequence of external events that produces an invariant violation • If any one event is removed from the MCS, the violation is no longer produced
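The MCS definition above can be sketched as a small check. This is only an illustration under stated assumptions: events are plain string labels, and `violates` is a hypothetical toy oracle standing in for "replay the schedule and test the invariant"; neither name comes from DEMi.

```python
from typing import Callable, Sequence

Event = str  # assumption: an external event is just a label in this sketch


def violates(externals: Sequence[Event]) -> bool:
    """Toy oracle standing in for 'replay the schedule and check the invariant'.

    Assumed bug: the invariant breaks when send_x is injected after start_a.
    """
    seq = list(externals)
    return "start_a" in seq and "send_x" in seq[seq.index("start_a"):]


def is_minimal_causal_sequence(
    externals: Sequence[Event],
    oracle: Callable[[Sequence[Event]], bool],
) -> bool:
    """True if the sequence triggers the violation and no single event can be dropped."""
    seq = list(externals)
    if not oracle(seq):
        return False
    # Removing any one event must make the violation disappear.
    return all(not oracle(seq[:i] + seq[i + 1:]) for i in range(len(seq)))
```

With this toy oracle, `["start_a", "send_x"]` passes the check, while `["start_a", "restart_b", "send_x"]` does not, because `restart_b` can be removed without losing the violation.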

  7. Solution Approach • Delta debugging • Iteratively explores schedules of external events • log(|E|) average-case time • Dynamic Partial Order Reduction (DPOR) • Used to check each candidate schedule of external events • Minimize internal events, using the same strategies
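As a rough illustration of the delta debugging step, here is a minimal sketch of the classic ddmin loop applied to a list of external events. The `fails` callback is an assumption: in this setting it would mean "some explored schedule containing only these external events still violates the invariant", but here it is any caller-supplied predicate. This is the textbook algorithm, not DEMi's actual implementation.

```python
from typing import Callable, List, Sequence


def ddmin(events: Sequence[str], fails: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `events` to a subsequence from which no single event can be removed."""
    events = list(events)
    assert fails(events), "the starting sequence must reproduce the violation"
    n = 2  # current number of chunks
    while len(events) >= 2:
        chunk = max(1, len(events) // n)
        subsets = [events[i:i + chunk] for i in range(0, len(events), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            complement = [e for j, s in enumerate(subsets) if j != i for e in s]
            if fails(subset):        # one chunk alone is enough
                events, n, reduced = subset, 2, True
                break
            if fails(complement):    # this chunk can be dropped
                events, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(events):     # already at single-event granularity
                break
            n = min(len(events), n * 2)
    return events


def needs_c_and_f(candidate: List[str]) -> bool:
    """Hypothetical oracle: the bug needs both event 'c' and event 'f'."""
    return "c" in candidate and "f" in candidate
```

Calling `ddmin(list("abcdefgh"), needs_c_and_f)` narrows the eight events down to the two that matter.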

  8. Dynamic Partial Order Reduction (DPOR) • Used for pruning commutative schedules from the search space • Adds backtrack points to messages that are not concurrent • Performance of DPOR is improved by • Allocating the time budget efficiently • Maximizing the probability of finding a violation [Figure: events e1 and e2 with labels a, b, c, d]

  9. Schedule Exploration Strategies • The strategies are determined by: • Which pending events are executed first? • How are the backtrack points explored? • Observation #1: Stay close to the original execution • Improves the probability of finding a violation • How to realize it? • Start with a uniquely defined initial schedule • Use match functions to match messages of pending events (source, destination and content of messages)

  10. Schedule Exploration Strategies … • Observation #2: Data independence • Not all parts of a message affect the control flow (e.g., sequence numbers, authentication cookies) • Use a fingerprint function to match only the required contents of messages • This further reduces the search space • What if the message contents depend on past data? • These cannot be ignored: they are relevant to the control flow of programs • Logical clocks • Batching (batching all client commands into a single message) • How to solve this?
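A fingerprint function of the kind described above might look like the following sketch. The `Message` fields (`term`, `seq_no`, `cookie`) and the choice of which fields count as control-flow relevant are assumptions made for illustration; in practice that choice is application specific.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Message:
    src: str
    dst: str
    kind: str     # e.g. "AppendEntries"
    term: int     # assumed to influence control flow
    seq_no: int   # assumed NOT to influence control flow
    cookie: str   # assumed NOT to influence control flow


def fingerprint(m: Message) -> Tuple[str, str, str, int]:
    """Keep only the fields assumed to matter; drop sequence number and cookie."""
    return (m.src, m.dst, m.kind, m.term)


def find_match(expected: Message, pending: List[Message]) -> Optional[Message]:
    """Return the first pending message whose fingerprint equals the expected one."""
    wanted = fingerprint(expected)
    for m in pending:
        if fingerprint(m) == wanted:
            return m
    return None
```

Two messages that differ only in `seq_no` or `cookie` are treated as the same event, so a replay can still follow the original schedule when those fields change between runs.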

  11. Schedule Exploration Strategies … • Observation #3: Coarsen message matching • Give priority to messages that match by “Type” • Type is the language-level type tag of the message object • Eliminates drawbacks of the match functions • What if there are several matches? • Observation #4: Prioritize the backtrack points • Always give priority to messages that match by “Type” but not by fingerprint • Keep the others as backtrack points • However, it will eventually explore all the schedules that fit within the given time budget
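One possible reading of Observations #3 and #4 is sketched below: prefer an exact fingerprint match, fall back to a type-only match, and remember the remaining type-only candidates as backtrack points. The `Msg` tuple and `pick_next` are hypothetical names, and the ordering policy is an assumption rather than DEMi's exact rule.

```python
from typing import List, NamedTuple, Optional, Tuple


class Msg(NamedTuple):
    kind: str           # stand-in for the language-level type tag
    fingerprint: tuple  # assumed to be computed elsewhere


def pick_next(expected: Msg, pending: List[Msg]) -> Tuple[Optional[Msg], List[Msg]]:
    """Return (message to deliver next, alternatives kept as backtrack points)."""
    exact = [
        m for m in pending
        if m.kind == expected.kind and m.fingerprint == expected.fingerprint
    ]
    if exact:
        return exact[0], []
    same_type = [m for m in pending if m.kind == expected.kind]
    if same_type:
        # Ambiguous match: deliver one, keep the rest to explore later.
        return same_type[0], same_type[1:]
    return None, []
```

When no candidate matches even by type, the sketch returns `None`, leaving the caller to decide whether to skip the expected event.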

  12. Schedule Exploration Strategies … • Observation #5: Shrink external message contents when possible • Contents of external messages can affect the execution length • Access to external message content is needed for this • Internal event minimization • Uses the same scheduling strategies

  13. Evaluation • Based on 2 aspects • How small is the sequence produced? • How quickly is it produced? (12-hour total budget) • Size of the sequence produced • DEMi produces executions that are within 1X to 4.6X of the optimal case • Between 1X and 16X smaller than those of prior work

  14. Evaluation • Observations made • STSSched is already effective for minimizing externals • TFB is significantly effective for minimizing internals • DEMi is efficient for both early-stage systems and well-established systems (the Spark and Raft projects) • Spark test cases • STSSched does extremely well • Internal communications are independent of each other • Raft test cases • Fuzz testing is effective for bug detection in the early stages of a project • Divergent schedules are important for the Raft test cases

  15. Evaluation • Run times • 7 of the 10 test cases reached the minimum sequence in less than 10 minutes

  16. Evaluation • Minimization pace • Significant improvement in the early steps, but the pace decreases later

  17. Evaluation • External message shrinking • External message shrinking helps to reduce the number of externals further

  18. Room for Improvement • Test case raft-58a: • The optimal value is 4X lower than DEMi's result • Better schedule exploration techniques are needed • Does not capture performance or liveness bugs • Does not support production traces (all the events must be recorded and partially ordered) • Shared memory

  19. Q & A
