Team 7: SUPER BORBAMAN 18-749: Fault-Tolerant Distributed Systems
Team Members • Mike Seto mseto@ • Jeremy Ng jwng@ • Wee Ming wmc@ • Ian Kalinowski igk@ • http://www.ece.cmu.edu/~ece749/teams/team7/ or just search for ‘borbaman’
Baseline Application 1 • SYNOPSIS: • Fault-tolerant, real-time multiplayer game • Inspired by Hudson Soft’s Bomberman • Players interact with other players, and their actions affect the shared environment • Players plant timed bombs which can destroy walls, players, and other bombs • Last player standing is the winner! • PLATFORM: • Middleware: Orbacus (CORBA/C++) on Linux • Graphics: NCURSES • Backend: MySQL 3
Baseline Application 2 • COMPONENTS: • Front end: one client per player • Middle-tier game servers • Back end: database • STRUCTURE (see the sketch below): • A client participates in a game with other clients • 4 clients per game (makes the game more challenging and adds real-time elements) • A client may belong to only one game • A server may support multiple games • A game does not start until 4 players have joined
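For illustration only, a tiny C++ sketch of the client/game/server relationships described above; the type and field names are hypothetical and not taken from the project code.

```cpp
#include <string>
#include <vector>

struct Client {
    std::string player_name;                  // one client per player
};

struct Game {
    static const int PLAYERS_PER_GAME = 4;    // game starts only when full
    std::vector<Client*> players;             // each client is in exactly one game
    bool can_start() const { return (int)players.size() == PLAYERS_PER_GAME; }
};

struct Server {
    std::vector<Game> games;                  // one server may host several games
};
```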
Fault-Tolerance Goals • ORIGINAL FAULT-TOLERANCE GOALS: • Preserve current game state under server failure • Coordinates of players, and player state • Bomb locations, timers • State of map • Score • Switch from a failed server to another server within 1 second • Players who “drop” may rejoin the game within 2 seconds
FT-Baseline Architecture • Stateless Servers • Two servers run on two machines with a single shared database • Passive replication: • No distinction between “primary” and “backup” servers • No checkpointing • Each server replica can receive and process client requests • But… clients only talk to one replica at any one time • The Naming Service and the database are single points of failure
FT-Baseline Architecture • Guaranteeing Determinism (see the sketch below) • State is committed to the reliable database at every client invocation • State is read from the DB before processing any request, and committed back to the DB after processing • Table locking per replica in the database ensures atomic access per game and guarantees determinism between our replicas • Transactions with non-increasing sequence numbers are discarded • Transaction processing • Database transactions are guaranteed atomic • Consistent state is achieved by having the servers read state from the database before beginning any transaction
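A minimal sketch of the per-invocation round trip described above, written against the MySQL C API. The game_state table, its columns, and the sequence-number check are illustrative assumptions, not the project's actual schema or code.

```cpp
#include <mysql/mysql.h>
#include <cstdio>
#include <cstdlib>
#include <string>

// Handle one client request: lock the table, read the stored state,
// apply the request, write the new state back, then release the lock.
bool process_request(MYSQL* db, int game_id, long client_seq,
                     const std::string& new_state_blob) {
    // Per-replica table locking serializes access to a game's state and is
    // what keeps the replicas deterministic with respect to each other.
    if (mysql_query(db, "LOCK TABLES game_state WRITE")) return false;

    char query[512];
    snprintf(query, sizeof(query),
             "SELECT seq_num FROM game_state WHERE game_id = %d", game_id);
    if (mysql_query(db, query)) { mysql_query(db, "UNLOCK TABLES"); return false; }

    MYSQL_RES* res = mysql_store_result(db);
    MYSQL_ROW  row = res ? mysql_fetch_row(res) : NULL;
    long stored_seq = row ? atol(row[0]) : -1;
    if (res) mysql_free_result(res);

    // Discard stale or duplicate requests (non-increasing sequence numbers).
    if (client_seq <= stored_seq) { mysql_query(db, "UNLOCK TABLES"); return false; }

    // Real code should escape the blob (mysql_real_escape_string).
    snprintf(query, sizeof(query),
             "UPDATE game_state SET seq_num = %ld, state_blob = '%s' WHERE game_id = %d",
             client_seq, new_state_blob.c_str(), game_id);
    bool ok = (mysql_query(db, query) == 0);

    mysql_query(db, "UNLOCK TABLES");          // lock held until after the write
    return ok;
}
```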
Mechanisms for Fail-Over • Failing Over (sketched below) • The client detects a failure by catching a COMM_FAILURE/TRANSIENT exception • The client queries the Naming Service for the list of servers • The client connects to the first available server, in the order listed in the Naming Service • If this list is empty, the client waits until a new server registers with the Naming Service
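A rough sketch of this fail-over path, assuming the standard CORBA C++ mapping with Orbacus-style headers. GameServer, its stub header, and the liveness ping via _non_existent() are illustrative stand-ins for the project's actual IDL interface and checks.

```cpp
#include <OB/CORBA.h>          // Orbacus ORB header (names may differ per setup)
#include <OB/CosNaming.h>      // CosNaming stubs
#include "GameServer.h"        // hypothetical IDL-generated stub for the game server

// Try each server registered with the Naming Service, in order, and return
// the first one that answers; return nil if none do (caller then waits).
GameServer_var failover(CORBA::ORB_ptr orb,
                        const CosNaming::BindingList& servers) {
    CORBA::Object_var obj = orb->resolve_initial_references("NameService");
    CosNaming::NamingContext_var nc = CosNaming::NamingContext::_narrow(obj);

    for (CORBA::ULong i = 0; i < servers.length(); ++i) {
        try {
            CORBA::Object_var ref = nc->resolve(servers[i].binding_name);
            GameServer_var srv = GameServer::_narrow(ref);
            if (!CORBA::is_nil(srv)) {
                srv->_non_existent();   // ping: forces a connection attempt
                return srv._retn();
            }
        } catch (const CORBA::COMM_FAILURE&) {
            // dead server: fall through and try the next entry
        } catch (const CORBA::TRANSIENT&) {
            // unreachable server: try the next entry
        }
    }
    return GameServer::_nil();
}
```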
Fail-Over Measurements • Problem: • Average Fault-Free RTT: 14.7 ms • Average Failure-Induced RTT: 78.8 ms • Maximum Failure-Induced RTT: 1045.8 ms (too high!) • Solution: Have the client pre-resolve server references and pre-establish connections with working servers
RT-FT-Baseline Architecture • What we tried: • Clients create a low-priority Update thread which contacts the Naming Service at a regular interval, caches references to working servers, and attempts to pre-establish connections (sketched below) • This thread also performs maintenance on existing connections and repopulates the cache with newly launched servers • What we expected: (chart not shown)
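A rough sketch of such an Update thread using POSIX threads. refresh_server_cache() and the period constant are hypothetical; the "low priority" aspect is approximated here by simply sleeping between iterations, since under the default scheduler there is no strict priority control.

```cpp
#include <pthread.h>
#include <unistd.h>

extern void refresh_server_cache();   // re-resolve servers, pre-establish connections

// Period between Naming Service queries.  Picking this is the critical
// tuning knob: 200 µs turned out to be far too aggressive (see the
// measurements that follow); 500 ms worked much better.
static const useconds_t UPDATE_PERIOD_US = 500 * 1000;   // 500 ms

static void* update_thread(void*) {
    for (;;) {
        refresh_server_cache();           // cache references, repair connections
        usleep(UPDATE_PERIOD_US);         // sleeping keeps the thread cheap
    }
    return 0;
}

void start_update_thread() {
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, update_thread, 0);
}
```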
RT-FT Optimization – Part 1 Before and after multi-threaded optimization What went wrong?
Bounded “Real-Time” Fail-Over Measurements • Jitter: • Maximum Jitter BEFORE: 36 ms • Maximum Jitter AFTER: 176 ms • We have “improved” the jitter in our system by -389%! • RTT: • Average RTT BEFORE: 13 ms • Average RTT AFTER: 21 ms • We have “improved” the average RTT by -59%! • Why?? • High overhead from the Update thread • Queried the Naming Service every 200 µs! • Oops….
RT-FT Optimization – Part 2 Increased the update period from 200 µs to 500 ms (i.e., queried the Naming Service far less often)
RT-FT Optimization – Part 2 With faults…but why the high periodic jitter? 13 spikes above 200 ms
RT-FT Optimization – Part 2 Bug discovered and fixed from analyzing results 3 spikes above 200 ms
RT-FT Fail-Over Measurements • Average RTT: 41 ms • Jitter: • Average Faulty Jitter: 81 ms • Maximum Jitter: 480 ms • Failover time: • Previous max: 210 ms • Current max: 230 ms • However, these numbers are not realistically useful because: • Cluster variability influences jitter considerably • Measurements are environment-dependent • % of outliers before = 0.1286% • % of outliers after = 0.0296%
RT-FT-Performance Strategy • LOAD BALANCING (sketched below): • Load balancing is performed on the client side • The client randomly selects an initial server to connect to • Upon failure, the client randomly connects to another live server • MOTIVATION: • Take advantage of multiple servers • Explore the effects of spreading a single game across multiple servers
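A minimal sketch of the random selection policy, assuming the client keeps a liveness flag for each known server; the function and parameter names are illustrative. Seed std::rand() once at client start-up (e.g. with std::srand(time(0))).

```cpp
#include <cstdlib>
#include <vector>

// Return the index of a randomly chosen live server, or -1 if none are up
// (in which case the caller waits for a server to register).
int pick_server(const std::vector<bool>& alive) {
    std::vector<int> candidates;
    for (size_t i = 0; i < alive.size(); ++i)
        if (alive[i]) candidates.push_back((int)i);
    if (candidates.empty()) return -1;
    return candidates[std::rand() % candidates.size()];   // uniform random pick
}
```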
Performance Measurements Load Balancing Worsens RTT Performance
Performance Measurements • Load balancing decreased performance • This is counter-intuitive • One single-threaded server should be slower than multiple single-threaded servers • Load balancing should have improved RTT since multiple servers could service separate games simultaneously • Server code was not modified in implementing load balancing • The problem had to be with concurrent accesses to the database • This pointed us to a bug in the database table-locking code • Transactions and table locks were issued out of order, causing table locks to be released prematurely (see the sketch below)
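One plausible shape of the fix, offered as an assumption rather than the team's actual patch: MySQL documents that beginning a transaction releases existing table locks (and that LOCK TABLES commits an open transaction), so issuing the two in the wrong order drops the lock before the read-modify-write finishes. The statements below are illustrative; the documented safe pattern is SET autocommit=0, then LOCK TABLES, do the work, COMMIT, then UNLOCK TABLES.

```cpp
#include <mysql/mysql.h>

// Run one locked read-modify-write against the (hypothetical) game_state table.
bool run_locked_update(MYSQL* db, const char* work_sql) {
    // Disable autocommit instead of issuing START TRANSACTION: a START
    // TRANSACTION after LOCK TABLES would implicitly release the lock.
    if (mysql_query(db, "SET autocommit=0")) return false;
    if (mysql_query(db, "LOCK TABLES game_state WRITE")) return false;

    bool ok = (mysql_query(db, work_sql) == 0);    // the actual game update

    if (ok) ok = (mysql_query(db, "COMMIT") == 0); // commit first...
    mysql_query(db, "UNLOCK TABLES");              // ...then release the lock
    return ok;
}
```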
Performance Measurements Load balancing with the DB lock bug fixed (chart: Average RTT in µs)
Performance Measurements • Corrected Load Balancing • Load balancing resulted in improved performance • Non-balanced average RTT: 454 ms • Balanced average RTT: 255 ms
Insights from Measurements • FT-Baseline • Can’t assume failover time is consistent or bounded • RT-FT-Optimization (update thread) • Reducing jitter resulted in increased average RTT • Scheduling the update thread too frequently results in increased jitter and overhead • Load Balancing • Load balancing can easily be done incorrectly • Spreading games across multiple machines does not necessarily improve performance • It can be difficult to select the right server to fail-over to • Single shared resource can be the bottleneck
Open Issues • Let’s discuss some issues… • Newer cluster software would be nice! • Newer MySQL with finer-grained locking, stored procedures, bug fixes, etc. • Newer gcc needed for certain libraries • A player can’t rejoin the game if their client crashes • The database server is a huge scalability bottleneck • If only we had more time… • GUI using real graphics instead of ASCII art • Login screen and lobby for clients to interact before the game starts • Rankings for top ‘Borbamen’ based on some scoring system • Multiple maps, power-ups (e.g. drop more bombs, move faster, etc.)
Conclusions • What we’ve learned… • Stateless servers make server recovery relatively simple • But this moves the performance bottleneck to the database • Testing to see that your system works is not good enough – looking at performance measurements can also point out implementation bugs • Really hard to test performance on shared servers… • Testing can be fun when you have a fun project • Accomplishments • Failover is really fast for non-loaded servers • We created a “real-time” application • BORBAMAN = CORBA + BOMBERMAN is really fun!
Conclusions • For the next time around… • Pick a different middleware that will let us run clients on machines outside of the games cluster • Run our own SQL server that doesn’t crash during stress testing O:-) • Language interoperability (e.g. Java clients with C++ servers) could be cool • Orbacus supposedly supports this • Store some state on the servers to reduce the load on the database
Finale! • No Questions Please. • GG EZ$$$ kthxbye
Appendix Varying Server Load Affects RTT