Team 6: Slackers
18749: Fault Tolerant Distributed Systems
Team Members: Puneet Aggarwal, Karim Jamal, Steven Lawrance, Hyunwoo Kim, Tanmay Sinha
Team Members URL: http://www.ece.cmu.edu/~ece749/teams-06/team6/ Team Slackers - Park 'n Park
Overview
• Baseline Application
• Baseline Architecture
• FT-Baseline Goals
• FT-Baseline Architecture
• Fail-Over Mechanisms
• Fail-Over Measurements
• Fault Tolerance Experimentation
• Bounded “Real-Time” Fail-Over Measurements
• FT-RT-Performance Strategy
• Other Features
• Conclusions
Baseline Application
Baseline Application: What is Park 'n Park?
• A system that manages the information and status of multiple parking lots.
• Keeps track of how many spaces are available in each lot and on each level.
• Recommends nearby lots with available spaces if the current lot is full.
• Allows drivers to enter and exit lots and to move up or down levels once inside a lot.
Baseline Application: Why is it interesting?
• Easy to implement
• Easy to distribute over multiple systems
• Potential of having multiple clients
• Middle tier can be made stateless
• Hasn’t been done before in this class
• And most of all…who wants this?
Baseline Application: Development Tools
• Java
  • Familiarity with language
  • Platform independence
• CORBA
  • Long story (to be discussed later…)
• MySQL
  • Familiarity with the package
  • Free!!!
  • Available on ECE cluster
• Linux, Windows, and OS X
  • No one has the same system nowadays
• Eclipse, Matlab, CVS, and PowerPoint
  • Powerful tools in their target markets
Baseline Application: High-Level Components
• Client
  • Provides an interface to interact with the user
  • Creates an instance of Client Manager
• Server
  • Manages Client Manager Factory
  • Handles CORBA functions
• Client Manager
  • Part of middle tier
  • Manages various client functions
  • Unique for each client
• Client Manager Factory
  • Part of middle tier
  • Factory for Client Manager instances
• Database
  • Stores the state for each client
  • Stores the state of the parking lots (i.e. occupancy of lots and levels, distances to other parking lots)
• Naming Service
  • Allows client to obtain reference to a server
Baseline Architecture
Baseline Architecture: High-Level Components
[Architecture diagram. Components: Client, Client Manager, Client Manager Factory, Server, Database, Naming Service, Middleware. Legend: processes and threads; data flow. Numbered data-flow labels:]
1. Create instance
2. Register name
3. Contact naming service
4. Create client manager instance
5. Create instance
6. Invoke service method
7. Request data
FT-Baseline Goals
FT-Baseline Goals: Main Goals
• Replicate the entire middle tier to make the system fault-tolerant. The middle tier includes
  • Client Manager
  • Client Manager Factory
  • Server
• The naming service, replication manager, and database are not replicated because of the added complexity and limited development time
• Maintain the stateless nature of the middle tier by storing all state in the database
• For the fault-tolerant baseline application
  • 3 replicas of the server run on clue, chess, and go
  • The naming service (boggle), Replication Manager (boggle), and database (previously on mahjongg, now on girltalk) run on the sacred servers
    • They have not been replicated and are single points of failure
FT-Baseline Goals: FT Framework
• Replication Manager
  • Responsible for checking liveness of servers
  • Performs fault detection and recovery of servers
  • Can handle an arbitrary number of server replicas
  • Can be restarted
• Fault Injector
  • kill -9
  • Script to periodically kill the primary server
  • Added in the RT-FT-Baseline implementation
FT-Baseline Architecture
FT-Baseline Architecture: High-Level Components
[Architecture diagram. Components: Client, Client Manager, Client Manager Factory, Server, Database, Naming Service, Replication Manager, Middleware. Legend: processes and threads; data flow. Additional labels: poke(), bind() / unbind(). Numbered data-flow labels:]
1. Create instance
2. Register name
3. Notify of existence
4. Contact naming service
5. Create client manager instance
6. Create instance
7. Invoke service method
8. Request data
Fail-Over Mechanism
Fail-Over Mechanism: Fault Tolerant Client Manager
• Resides on the client side
• Invokes service methods on the Client Manager on behalf of the client
• Responsible for fail-over
• Detects faults by catching exceptions
• If an exception is thrown during a service invocation, it gets the primary server reference from the naming service and retries the failed operation using the new server reference
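The catch-and-retry behaviour described on this slide can be sketched in Java roughly as follows. This is a minimal illustration, not the team's code: `FailoverInvoker`, `ServerFaultException` (a stand-in for faults such as COMM_FAILURE), the `primaryLookup` supplier (a stand-in for the naming-service lookup), and the `maxAttempts` bound are all assumed names introduced here.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Sketch of a client-side wrapper that retries a failed call on a fresh primary. */
final class FailoverInvoker<S> {
    /** Hypothetical stand-in for a server fault such as CORBA COMM_FAILURE. */
    static final class ServerFaultException extends RuntimeException {
        ServerFaultException(String message) { super(message); }
    }

    private final Supplier<S> primaryLookup; // assumed: asks the naming service for the primary
    private final int maxAttempts;           // assumed bound so the retry loop terminates
    private S server;

    FailoverInvoker(Supplier<S> primaryLookup, int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        this.primaryLookup = primaryLookup;
        this.maxAttempts = maxAttempts;
        this.server = primaryLookup.get(); // initial reference, obtained at start-up
    }

    /** Runs the call; on a server fault, fetches a new reference and retries. */
    <R> R invoke(Function<S, R> call) {
        ServerFaultException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.apply(server);
            } catch (ServerFaultException e) {
                last = e;
                server = primaryLookup.get(); // fail over to whatever is primary now
            }
        }
        throw last; // every attempt failed
    }
}
```

Application-level exceptions (for example a full lot) are deliberately not caught here, so they propagate to the caller instead of triggering a fail-over.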
Fail-Over Mechanism: Replication Manager
• Detects faults using a method called “poke”
• Maintains a dynamic list of active servers
• Restarts failed/corrupted servers
• Performs naming service maintenance
  • Unbinds names of crashed servers
  • Rebinds name of primary server
• Uses the most-recently-active methodology to choose a new primary server when the primary server experiences a fault
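The slides do not spell out the most-recently-active rule. One plausible reading, sketched below, is that the replication manager keeps a timestamp of the last successful poke for each live server and promotes the one with the newest timestamp. `ReplicaRegistry` and its methods are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch of a dynamic list of active servers with most-recently-active selection. */
final class ReplicaRegistry {
    // Server name -> time of its last successful poke, in milliseconds (assumed bookkeeping).
    private final Map<String, Long> lastSeenMillis = new LinkedHashMap<>();

    /** Records that a server answered a poke at the given time. */
    void markAlive(String name, long nowMillis) {
        lastSeenMillis.put(name, nowMillis);
    }

    /** Drops a crashed or corrupted server from the active list. */
    void markFailed(String name) {
        lastSeenMillis.remove(name);
    }

    /** Picks the live server seen most recently, or empty if none is left. */
    Optional<String> choosePrimary() {
        return lastSeenMillis.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

In the deck's design the chosen name would then be rebound in the naming service; that step is left out of the sketch.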
Fail-Over Mechanism: The Poke Method
• “Pokes” the server periodically
• Checks not only whether the server is alive, but also whether the server’s database connectivity is intact
• Throws exceptions in case of faults (e.g. can’t connect to database)
• The replication manager handles faults accordingly
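A poke that doubles as a database connectivity check might look like the sketch below. The database probe is passed in as a `BooleanSupplier` so the example stays self-contained; a JDBC-based server could, for instance, back it with `java.sql.Connection.isValid(int)`. `PokeSketch`, `Server`, `Health`, and this `ServiceUnavailableException` definition are stand-ins written for the example, not the project's classes.

```java
import java.util.function.BooleanSupplier;

/** Sketch of a liveness check that also verifies database connectivity. */
final class PokeSketch {
    /** Stand-in for the deck's ServiceUnavailableException (assumed shape). */
    static final class ServiceUnavailableException extends Exception {
        ServiceUnavailableException(String message) { super(message); }
    }

    /** Server side: poke() succeeds only if the database probe succeeds. */
    static final class Server {
        private final BooleanSupplier databaseReachable; // assumed probe, e.g. a cheap query

        Server(BooleanSupplier databaseReachable) {
            this.databaseReachable = databaseReachable;
        }

        void poke() throws ServiceUnavailableException {
            if (!databaseReachable.getAsBoolean()) {
                throw new ServiceUnavailableException("database unreachable");
            }
        }
    }

    enum Health { ALIVE, FAULTY }

    /** Replication-manager side: any exception from poke() counts as a fault. */
    static Health check(Server server) {
        try {
            server.poke();
            return Health.ALIVE;
        } catch (ServiceUnavailableException | RuntimeException e) {
            return Health.FAULTY;
        }
    }
}
```

Treating a thrown runtime exception the same as a failed probe mirrors the slide's point that a server with broken database connectivity is handled like a crashed one.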
Fail-Over Mechanism: Exceptions Handled
• COMM_FAILURE: CORBA exception
• OBJECT_NOT_EXIST: CORBA exception
• SystemException: CORBA exception
• Exception: Java exception
• AlreadyInLotException: Client is already in a lot
• AtBottomLevelException: Car cannot move to a lower level because it's on the bottom floor
• AtTopLevelException: Car cannot move to a higher level because it's on the top floor
• InvalidClientException: ID provided by Client doesn’t match the ID stored in the system
• LotFullException: System throws exception when the lot is full
• LotNotFoundException: Lot number not found in the database
• NotInLotException: Client's car is not in the lot
• NotOnExitLevelException: Client is not on an exit level in the lot
• ServiceUnavailableException: Thrown when an unrecoverable database exception or some other error prevents the server from successfully completing a client-requested operation
Fail-Over Mechanism: Response to Exceptions
• Get a new server reference and then retry the failed operation when one of the following exceptions occurs
  • COMM_FAILURE
  • OBJECT_NOT_EXIST
  • ServiceUnavailableException
• Report the error to the user and prompt for the next command when one of the following exceptions occurs
  • AlreadyInLotException
  • AtBottomLevelException
  • AtTopLevelException
  • LotFullException
  • LotNotFoundException
  • NotInLotException
  • NotOnExitLevelException
• Client terminates when one of the following exceptions occurs
  • InvalidClientException
  • SystemException
  • Exception
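The three-way policy on this slide can be written down as a small lookup, sketched here with exception names as strings so that the example compiles without the CORBA or application classes. `ExceptionPolicy` and `Action` are hypothetical names. In a real catch chain the order would matter, because COMM_FAILURE and OBJECT_NOT_EXIST are subclasses of org.omg.CORBA.SystemException and must be caught before it.

```java
import java.util.Set;

/** Sketch mapping each handled exception name to the client's response. */
final class ExceptionPolicy {
    enum Action { RETRY_ON_NEW_SERVER, REPORT_TO_USER, TERMINATE }

    // Faults that suggest the server is gone or unusable: fail over and retry.
    private static final Set<String> RETRY = Set.of(
            "COMM_FAILURE", "OBJECT_NOT_EXIST", "ServiceUnavailableException");

    // Application-level conditions: tell the user and prompt for the next command.
    private static final Set<String> REPORT = Set.of(
            "AlreadyInLotException", "AtBottomLevelException", "AtTopLevelException",
            "LotFullException", "LotNotFoundException", "NotInLotException",
            "NotOnExitLevelException");

    static Action actionFor(String exceptionName) {
        if (RETRY.contains(exceptionName)) return Action.RETRY_ON_NEW_SERVER;
        if (REPORT.contains(exceptionName)) return Action.REPORT_TO_USER;
        // InvalidClientException, SystemException, and any other Exception end the client.
        return Action.TERMINATE;
    }
}
```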
Fail-Over Mechanism: Server References
• The client obtains the reference to the primary server when
  • it is initially started
  • it notices that the server has crashed or been corrupted (e.g. COMM_FAILURE, ServiceUnavailableException)
• When the client notices that there is no primary server reference in the naming service, it displays an appropriate message and then terminates
RT-FT-Baseline Architecture
RT-FT-Baseline Architecture: High-Level Components
[Architecture diagram. Components: Client, Client Manager, Client Manager Factory, Server, Database, Naming Service, Replication Manager, Testing Manager, Middleware. Legend: processes and threads; data flow; launches. Additional labels: poke(), bind()/unbind(). Numbered data-flow labels:]
1. Create instance
2. Register name
3. Notify of existence
4. Contact naming service
5. Create client manager instance
6. Create instance
7. Invoke service method
8. Request data
Fault Tolerance Experimentation
Fault Tolerance Experimentation: The Fault Free Run - Graph 1
[Graph] While the mean latency stayed almost constant, the maximum latency varied.
Fault Tolerance Experimentation: The Fault Free Run - Graph 2
[Graph] This demonstrates the conformance with the magical 1% theory.
Fault Tolerance Experimentation: The Fault Free Run - Graph 3
[Graph] Mean latency increases as the reply size increases.
Fault Tolerance Experimentation: The Fault Free Run - Conclusions
• Our data conforms to the magical 1% theory, indicating that outliers account for less than 1% of the data points
• We hope that this helps with Tudor’s research
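A check of the "less than 1% outliers" claim could be sketched as below. The slides do not say how an outlier was defined, so the cut-off used here (mean plus three population standard deviations) is purely an assumption, as are the names `OutlierShare`, `threshold`, and `outlierFraction`.

```java
/** Sketch: share of latency samples above an assumed outlier threshold. */
final class OutlierShare {
    /** Mean plus three population standard deviations (assumed cut-off). */
    static double threshold(double[] latencies) {
        double mean = 0.0;
        for (double v : latencies) mean += v;
        mean /= latencies.length;
        double variance = 0.0;
        for (double v : latencies) variance += (v - mean) * (v - mean);
        variance /= latencies.length;
        return mean + 3.0 * Math.sqrt(variance);
    }

    /** Fraction of samples strictly above the threshold. */
    static double outlierFraction(double[] latencies) {
        if (latencies.length == 0) throw new IllegalArgumentException("no samples");
        double limit = threshold(latencies);
        int outliers = 0;
        for (double v : latencies) {
            if (v > limit) outliers++;
        }
        return (double) outliers / latencies.length;
    }
}
```

A run would conform to the 1% observation under this definition when `outlierFraction` returns a value below 0.01.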
Bounded “Real-Time” Fail-Over Measurements
Bounded “Real-Time” Fail-Over Measurements: The Fault Induced Run - Graph
[Graph] High latency is observed during faults.
Bounded “Real-Time” Fail-Over Measurements: The Fault Induced Run - Pie Chart
[Pie chart] The client’s fault recovery timeout causes most of the latency.
Bounded “Real-Time” Fail-Over Measurements: The Fault Induced Run - Conclusions
• Latency rises noticeably when a fault occurs
• Most of the latency was caused by the client’s fault recovery timeout
• The second-highest contributor was the time that the client has to wait for the client manager to be restored on the new server
FT-RT-Performance Strategy
FT-RT-Performance Strategy: Reducing Fail-Over Time
• Implemented strategies
  • Adjust client fault recovery timeout
  • Use IOGRs and cloning-like strategies
  • Pre-create TCP/IP connections to all servers
• Other strategies that could potentially be implemented
  • Database connection pool
  • Load balancing
  • Remove client ID consistency check
Measurements after Strategies: Adjusting Waiting Time
• The following graphs are for different values of the wait time at the client end
• The wait time is how long the client waits in order to give the replication manager sufficient time to update the naming service with the new primary
Measurements after Strategies: Plots by Waiting Time
[Nine plots, one per client waiting time: 0 ms, 500 ms, 1000 ms, 2000 ms, 2500 ms, 3000 ms, 3500 ms, 4000 ms, and 4500 ms]
Measurements after Strategies: Observations after Adjusting Wait Times
• The best results can be seen with a 4000 ms wait time.
• Even though the fail-over time drops considerably for lower values, we observe a significant amount of jitter.
• The reason for the jitter is that the client doesn’t get the updated primary from the naming service.
• Since our primary concern is bounded fail-over, we chose the strategy that has the least jitter, rather than the strategy that has the lowest latencies.
• The average recovery time is reduced modestly (from about 5-6 s to 4.5-5 s for the 4000 ms wait time).
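The selection rule stated above (least jitter rather than lowest latency) can be illustrated with a short Java sketch. Jitter is taken here to be the spread between the slowest and fastest observed fail-over for a given wait time, which is an assumed definition, and `WaitTimeChooser` with its methods is a hypothetical helper, not part of the project.

```java
import java.util.Map;

/** Sketch: pick the client wait time whose fail-over times vary the least. */
final class WaitTimeChooser {
    /** Jitter as max minus min of the observed fail-over times (assumed definition). */
    static double jitter(double[] failoverMillis) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : failoverMillis) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return failoverMillis.length == 0 ? 0.0 : max - min;
    }

    /** Returns the wait time (ms) with the smallest jitter across its samples. */
    static int leastJitterWait(Map<Integer, double[]> samplesByWaitMs) {
        if (samplesByWaitMs.isEmpty()) throw new IllegalArgumentException("no measurements");
        int bestWait = -1;
        double bestJitter = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, double[]> entry : samplesByWaitMs.entrySet()) {
            double j = jitter(entry.getValue());
            if (j < bestJitter) {
                bestJitter = j;
                bestWait = entry.getKey();
            }
        }
        return bestWait;
    }
}
```

Ranking by spread rather than by mean is what makes the choice favour a bounded fail-over time, even when a shorter wait is faster on average.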
Measurements after Strategies: Implementing IOGR
• IOGR stands for Interoperable Object Group Reference
• With this strategy, the client gets the list of all active servers from the naming service
• The client refreshes this list if all the servers in the list have failed
• The following graphs were produced after this strategy was implemented
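The IOGR-like behaviour described on this slide (hold a list of active servers, move on to the next one on a fault, refresh only when the whole list is exhausted) might be sketched as follows. `ServerListInvoker`, `ServerFaultException`, and the `activeServers` supplier standing in for the naming-service query are assumptions made for the example; a real IOGR is a CORBA object reference, which this sketch does not model.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Sketch of a client that fails over along a cached list of active servers. */
final class ServerListInvoker<S> {
    /** Hypothetical stand-in for a server fault such as CORBA COMM_FAILURE. */
    static final class ServerFaultException extends RuntimeException {
        ServerFaultException(String message) { super(message); }
    }

    private final Supplier<List<S>> activeServers; // assumed naming-service query
    private final Deque<S> candidates = new ArrayDeque<>();

    ServerListInvoker(Supplier<List<S>> activeServers) {
        this.activeServers = activeServers;
        candidates.addAll(activeServers.get()); // fetched once up front
    }

    /** Tries cached servers in order; refreshes the list once if all have failed. */
    <R> R invoke(Function<S, R> call) {
        boolean refreshed = false;
        while (true) {
            if (candidates.isEmpty()) {
                if (refreshed) throw new ServerFaultException("no server reachable");
                candidates.addAll(activeServers.get()); // every cached server failed
                refreshed = true;
                continue;
            }
            try {
                return call.apply(candidates.peekFirst());
            } catch (ServerFaultException e) {
                candidates.pollFirst(); // drop the failed server, try the next one
            }
        }
    }
}
```

Because the next reference is already in hand, a fail-over in this scheme does not need a naming-service round trip unless every cached server has failed.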
Measurements after Strategies: Plot after IOGR strategy (same axis)
[Plot]
Measurements after Strategies: Plot after IOGR strategy (different axis)
[Plot]
Measurements after Strategies: Pie Chart after IOGR strategy
[Pie chart]