Team 7: SUPER BORBAMAN 18-749: Fault-Tolerant Distributed Systems



  1. Team 7: SUPER BORBAMAN 18-749: Fault-Tolerant Distributed Systems

  2. Team Members Mike Seto mseto@ Jeremy Ng jwng@ Wee Ming wmc@ Ian Kalinowski igk@ http://www.ece.cmu.edu/~ece749/teams/team7/ or just for ‘borbaman’

  3. Baseline Application 1 • SYNOPSIS: • Fault-tolerant, real-time multiplayer game • Inspired by Hudson Soft’s Bomberman • Players interact with other players, and their actions affect the shared environment • Players plant timed bombs which can destroy walls, players and other bombs • Last player standing is the winner! • PLATFORM: • Middleware: Orbacus (CORBA/C++) on Linux • Graphics: NCURSES • Backend: MySQL 3

  4. Baseline Application 2 • COMPONENTS: • Front end: one client per player • Middle-tier game servers • Back end: database • STRUCTURE: • A client participates in a game with other clients • 4 clients per game (more challenging and has real-time elements) • A client may only belong to one game • A server may support multiple games • Game does not start until 4 players have joined

  5. Baseline Architecture

  6. Fail-Over Measurements – Fault-free

  7. Fault-Tolerance Goals • ORIGINAL FAULT-TOLERANCE GOALS: • Preserve current game state under server failure • Coordinates of players, and player state • Bomb locations, timers • State of map • Score • Switch from failed server to another server within 1 second • Players who “drop” may rejoin game within 2 seconds

  8. FT-Baseline Architecture • Stateless Servers • Two servers run on two machines with a single shared database • Passive replication: • No distinction between “primary” and “backup” servers • No checkpointing • Each server replica can receive and process client requests • But…clients only talk to one replica at any one time • The Naming Service and the database system are single points of failure.

  9. FT-Baseline Architecture • Guaranteeing Determinism • State is committed to the reliable database at every client invocation • State is read from the DB before processing any request, and committed back to the DB after processing • Table locking per replica in the database ensures atomic access per game and guarantees determinism between our replicas • Transactions with non-increasing sequence numbers are discarded • Transaction processing • Database transactions are guaranteed atomic • Consistent state is achieved by having the servers read state from the database before beginning any transaction
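
The read, process, commit cycle and the sequence-number check described on this slide can be sketched as follows. This is a minimal illustration, not the team's code: `GameStore`, `transact`, and `move_player` are hypothetical names, an in-memory dictionary stands in for MySQL, a single `threading.Lock` stands in for the table lock, and keeping one sequence counter per game is an assumption (the slides do not say how sequence numbers are scoped).

```python
import threading


class GameStore:
    """In-memory stand-in for the shared database (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the table lock
        self._rows = {}                # game_id -> (last_seq, state dict)

    def transact(self, game_id, seq, update):
        """Read state, apply `update`, and commit, all under one lock.

        Returns False (and changes nothing) when `seq` is not strictly
        greater than the last committed sequence number.
        """
        with self._lock:
            last_seq, state = self._rows.get(game_id, (0, {}))
            if seq <= last_seq:
                return False           # stale or duplicate request: discard
            new_state = update(dict(state))
            self._rows[game_id] = (seq, new_state)
            return True

    def read(self, game_id):
        """Return a copy of the committed state for one game."""
        with self._lock:
            return dict(self._rows.get(game_id, (0, {}))[1])


def move_player(player, x, y):
    """Build an update function that sets one player's coordinates."""
    def update(state):
        state[player] = (x, y)
        return state
    return update
```

Because the server object holds no game state between calls, any replica that shares the store can process the next request, which is the property that makes the servers interchangeable.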

  10. FT-Baseline Architecture

  11. Mechanisms for Fail-Over • Failing-Over • Client detects failure by catching a CORBA COMM_FAILURE or TRANSIENT exception • Client queries Naming Service for list of servers • Client connects to first available server in order listed in Naming Service • If this list is empty, the client waits until a new server registers with the Naming Service
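
The fail-over steps above can be sketched as a small client loop. Everything here is a stand-in chosen for illustration: `ServerDown` replaces the CORBA system exceptions, and `FakeServer`, `FakeNamingService`, and `Client` are hypothetical classes rather than Orbacus APIs. Unlike the behavior described on the slide, this sketch raises when no server is available instead of waiting for one to register.

```python
class ServerDown(Exception):
    """Stand-in for the CORBA exceptions the real client catches."""


class FakeServer:
    """Hypothetical server stub that can be switched off to inject a fault."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def ping(self):
        if not self.alive:
            raise ServerDown(self.name)

    def invoke(self, request):
        self.ping()
        return f"{self.name} handled {request}"


class FakeNamingService:
    """Hypothetical registry that returns servers in registration order."""

    def __init__(self, servers):
        self.servers = list(servers)

    def list_servers(self):
        return list(self.servers)


class Client:
    def __init__(self, naming):
        self.naming = naming
        self.server = None

    def _fail_over(self):
        """Connect to the first server in the list that answers a ping."""
        for candidate in self.naming.list_servers():
            try:
                candidate.ping()
            except ServerDown:
                continue
            self.server = candidate
            return True
        self.server = None
        return False

    def call(self, request):
        if self.server is None and not self._fail_over():
            raise ServerDown("no servers registered")
        try:
            return self.server.invoke(request)
        except ServerDown:
            # Failure detected: re-resolve, then retry the request once.
            if not self._fail_over():
                raise
            return self.server.invoke(request)
```

Note that the naming-service lookup and the connection attempt both happen on the request path here, which is where the failure-induced round-trip time comes from.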

  12. Fail-Over Measurements – 16 Faults

  13. Fail-Over Measurements – Breakdown with 16 Faults

  14. Fail-Over Measurements • Problem: • Average Fault-Free RTT: 14.7 ms • Average Failure-Induced RTT: 78.8 ms • Maximum Failure-Induced RTT: 1045.8 ms (too high!) • Solution: Have clients pre-resolve servers and pre-establish connections with working servers.

  15. RT-FT-Baseline Architecture • What we tried: • Clients create a low-priority Update thread which contacts the Naming Service at a regular interval, caches references to working servers, and attempts to pre-establish connections. • This thread also performs maintenance on existing connections and repopulates the cache with newly launched servers • What we expected: [figure not included in transcript]
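
A background refresher of this kind might look like the sketch below. It assumes two caller-supplied functions, `list_servers` (the naming-service lookup) and `probe` (a liveness or connection check); both names, the `ServerCache` class, and the 0.5 s default period are illustrative choices. Python's `threading` module offers no thread priorities, so the "low-priority" aspect of the original design is not modeled.

```python
import threading


class ServerCache:
    """Keeps a list of live servers fresh in the background (hypothetical)."""

    def __init__(self, list_servers, probe, period_s=0.5):
        self._list_servers = list_servers  # callable returning known servers
        self._probe = probe                # callable: True if server is usable
        self._period_s = period_s
        self._lock = threading.Lock()
        self._live = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def refresh_once(self):
        """Re-query the registry and keep only servers that pass the probe."""
        live = [s for s in self._list_servers() if self._probe(s)]
        with self._lock:
            self._live = live

    def _run(self):
        while not self._stop.is_set():
            self.refresh_once()
            # Sleep for one period, but wake immediately if stop() is called.
            self._stop.wait(self._period_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def next_server(self, exclude=None):
        """Return a cached live server other than `exclude`, or None."""
        with self._lock:
            for server in self._live:
                if server != exclude:
                    return server
        return None
```

With a cache like this, the fail-over path only reads a local list instead of contacting the naming service, at the cost of the periodic background work.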

  16. RT-FT Optimization – Part 1 Before and after multi-threaded optimization What went wrong?

  17. Bounded “Real-Time” Fail-Over Measurements • Jitter: • Maximum Jitter BEFORE: 36 ms • Maximum Jitter AFTER: 176 ms • We have “improved” the jitter in our system by -389%! • RTT: • Average RTT BEFORE: 13 ms • Average RTT AFTER: 21 ms • We have “improved” the average RTT by -59%! • Why?? • High overhead from the Update thread • Queried the Naming Service every 200 µs! • Oops….

  18. RT-FT Optimization – Part 2 Increased the update period from 200 µs to 500 ms
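
To see why the polling period mattered so much, it helps to convert the two periods into naming-service queries per second per client. The helper name below is illustrative; the two periods are the ones quoted on the slides.

```python
def queries_per_second(period_us):
    """Polling rate implied by a period given in microseconds."""
    return 1_000_000 / period_us


rate_before = queries_per_second(200)      # 200 µs period
rate_after = queries_per_second(500_000)   # 500 ms period
reduction = rate_before / rate_after

print(rate_before, rate_after, reduction)  # 5000.0 2.0 2500.0
```

Each client went from about 5000 lookups per second to 2, a 2500-fold reduction in background load.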

  19. RT-FT Optimization – Part 2 With faults…but why the high periodic jitter? 13 spikes above 200 ms

  20. RT-FT Optimization – Part 2 Bug discovered and fixed from analyzing results 3 spikes above 200 ms

  21. RT-FT Fail-Over Measurements • Average RTT: 41 ms • Jitter: • Average Faulty Jitter: 81 ms • Maximum Jitter: 480 ms • Failover time: • Previous max: 210 ms • Current max: 230 ms • Outliers: • Before: 0.1286% • After: 0.0296% • However, these numbers are not realistically useful because: • Cluster variability influences jitter considerably • Measurements are environment dependent

  22. RT-FT-Performance Strategy • LOAD BALANCING: • Load balancing is performed on the client side • Client randomly selects initial server to connect to • Upon failure, client randomly connects to another live server • MOTIVATION: • Take advantage of multiple servers • Explore the effects of spreading a single game across multiple servers
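
Client-side random selection is simple enough to sketch in a few lines. The function name `pick_server` and its parameters are hypothetical; the only behavior taken from the slide is that both the initial choice and the post-failure choice are random.

```python
import random


def pick_server(servers, failed=None, rng=random):
    """Pick a random server, skipping the one that just failed (if any)."""
    candidates = [s for s in servers if s != failed]
    if not candidates:
        return None  # nothing left to fail over to
    return rng.choice(candidates)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is useful when comparing measurement runs.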

  23. Performance Measurements Load Balancing Worsens RTT Performance

  24. Performance Measurements • Load balancing decreased performance • This is counter-intuitive • One single-threaded server should be slower than multiple single-threaded servers • Load balancing should have improved RTT since multiple servers could service separate games simultaneously • Server code was not modified in implementing load balancing • Problem has to be with concurrent accesses to the database • This pointed us to a bug in the database table locking code • Transactions and table locks were out of order, causing table locks to be released prematurely
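
The slide does not show the corrected code, so the following is only a sketch of the ordering one would expect after the fix: acquire the lock, run the transaction, commit, and only then release the lock. The `locked_transaction` helper is hypothetical, a `threading.Lock` stands in for the MySQL table lock, and the `log` list simply records the order of events so it can be inspected.

```python
import threading
from contextlib import contextmanager


@contextmanager
def locked_transaction(lock, log):
    """Hold `lock` for the whole transaction, releasing it only at the end."""
    lock.acquire()
    log.append("lock")
    try:
        log.append("begin")
        yield
        log.append("commit")
    except Exception:
        log.append("rollback")
        raise
    finally:
        # Runs after commit or rollback, never before.
        lock.release()
        log.append("unlock")


events = []
table_lock = threading.Lock()
with locked_transaction(table_lock, events):
    events.append("write")

print(events)  # ['lock', 'begin', 'write', 'commit', 'unlock']
```

Releasing the lock in `finally` guarantees that no other replica can read the table between the write and the commit, which is the window a prematurely released lock would open.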

  25. Performance Measurements Average RTT (µs) Load-Balancing with DB lock bug fixed

  26. Performance Measurements • Corrected Load Balancing • Load balancing resulted in improved performance • Non-balanced average RTT: 454 ms • Balanced average RTT: 255 ms

  27. Insights from Measurements • FT-Baseline • Can’t assume failover time is consistent or bounded • RT-FT-Optimization (update thread) • Reducing jitter resulted in increased average RTT • Scheduling the update thread too frequently results in increased jitter and overhead • Load Balancing • Load balancing can easily be done incorrectly • Spreading games across multiple machines does not necessarily improve performance • It can be difficult to select the right server to fail-over to • Single shared resource can be the bottleneck

  28. Open Issues • Let’s discuss some issues… • Newer cluster software would be nice! • Newer MySQL with finer-grained locking, stored procedures, bug fixes, etc. • Newer gcc needed for certain libraries • Clients can’t rejoin the game if the client crashes • Database server is a huge scalability bottleneck • If only we had more time… • GUI using real graphics instead of ASCII art • Login screen and lobby for clients to interact before game starts • Rankings for top ‘Borbamen’ based on some scoring system • Multiple maps, power-ups (e.g. drop more bombs, move faster, etc.)

  29. Conclusions • What we’ve learned… • Stateless servers make server recovery relatively simple • But this moves the performance bottleneck to the database • Testing to see that your system works is not good enough: looking at performance measurements can also point out implementation bugs • Really hard to test performance on shared servers… • Testing can be fun when you have a fun project • Accomplishments • Failover is really fast for non-loaded servers • We created a “real-time” application • BORBAMAN = CORBA + BOMBERMAN is really fun!

  30. Conclusions • For the next time around… • Pick a different middleware that will let us run clients on machines outside of the games cluster • Run our own SQL server that doesn’t crash during stress testing O:-) • Language interoperability (e.g. Java clients with C++ servers) could be cool • Orbacus supposedly supports this • Store some state on the servers to reduce the load on the database

  31. And the winner goes to …..

  32. Finale! • No Questions Please. • GG EZ$$$ kthxbye

  33. Appendix Varying Server Load Affects RTT

  34. RT-FT Fail-Over Measurements
