
Team 6: Slackers

Team 6: Slackers. 18749: Fault Tolerant Distributed Systems. Team members: Puneet Aggarwal, Karim Jamal, Steven Lawrance, Hyunwoo Kim, Tanmay Sinha. URL: http://www.ece.cmu.edu/~ece749/teams-06/team6/.


Presentation Transcript


  1. Team 6: Slackers. 18749: Fault Tolerant Distributed Systems. Team Members: Puneet Aggarwal, Karim Jamal, Steven Lawrance, Hyunwoo Kim, Tanmay Sinha

  2. Team Members URL: http://www.ece.cmu.edu/~ece749/teams-06/team6/ Team Slackers - Park 'n Park

  3. Overview • Baseline Application • Baseline Architecture • FT-Baseline Goals • FT-Baseline Architecture • Fail-Over Mechanisms • Fail-Over Measurements • Fault Tolerance Experimentation • Bounded “Real Time” Fail-Over Measurements • FT-RT-Performance Strategy • Other Features • Conclusions

  4. Baseline Application

  5. Baseline Application: What is Park 'n Park? • A system that manages the information and status of multiple parking lots. • Keeps track of how many spaces are available in each lot and on each level. • Recommends nearby lots with available spaces if the current lot is full. • Allows drivers to enter and exit lots and to move up or down levels once inside a parking lot.

  6. Baseline Application: Why is it interesting? • Easy to implement • Easy to distribute over multiple systems • Potential for multiple clients • Middle tier can be made stateless • Hasn't been done before in this class • And most of all…who wants this?

  7. Baseline Application: Development Tools • Java: familiarity with language; platform independence • CORBA: long story (to be discussed later…) • MySQL: familiarity with the package; free; available on ECE cluster • Linux, Windows, and OS X: no one has the same system nowadays • Eclipse, Matlab, CVS, and PowerPoint: powerful tools in their target markets

  8. Baseline Application: High-Level Components • Client: provides an interface to interact with the user; creates an instance of Client Manager • Server: manages Client Manager Factory; handles CORBA functions • Client Manager: part of middle tier; manages various client functions; unique for each client • Client Manager Factory: part of middle tier; factory for Client Manager instances • Database: stores the state for each client; stores the state of the parking lots (i.e. occupancy of lots and levels, distances to other parking lots) • Naming Service: allows client to obtain reference to a server

  9. Baseline Architecture

  10. Baseline Architecture: High-Level Components [Diagram. Components: Client, Client Manager, Client Manager Factory, Server, Database, Naming Service. Numbered steps: 1. Create instance; 2. Register name; 3. Contact naming service; 4. Create client manager instance; 5. Create instance; 6. Invoke service method; 7. Request data. Legend: middleware, processes and threads, data flow.]

  11. FT-Baseline Goals

  12. FT-Baseline Goals: Main Goals • Replicate the entire middle tier to make the system fault-tolerant. The middle tier includes: Client Manager; Client Manager Factory; Server • The naming service, replication manager, and database are not replicated because of the added complexity and limited development time • Maintain the stateless nature of the middle tier by storing all state in the database • For the fault-tolerant baseline application: 3 replicas of the servers on clue, chess, and go; naming service (boggle), replication manager (boggle), and database (previously on mahjongg, now on girltalk) on the sacred servers; these have not been replicated and are single points of failure

  13. FT-Baseline Goals: FT Framework • Replication Manager: responsible for checking liveness of servers; performs fault detection and recovery of servers; can handle an arbitrary number of server replicas; can be restarted • Fault Injector: kill -9; script to periodically kill the primary server; added in the RT-FT-Baseline implementation

  14. FT-Baseline Architecture

  15. FT-Baseline Architecture: High-Level Components [Diagram. Components: Client, Client Manager, Client Manager Factory, Server, Database, Naming Service, Replication Manager. Numbered steps: 1. Create instance; 2. Register name; 3. Notify of existence; 4. Contact naming service; 5. Create client manager instance; 6. Create instance; 7. Invoke service method; 8. Request data. Other labels: poke(), bind() / unbind(). Legend: middleware, processes and threads, data flow.]

  16. Fail-Over Mechanism

  17. Fail-Over Mechanism: Fault Tolerant Client Manager • Resides on the client side • Invokes service methods on the Client Manager on behalf of the client • Responsible for fail-over • Detects faults by catching exceptions • If an exception is thrown during a service invocation, it gets the primary server reference from the naming service and retries the failed operation using the new server reference
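The retry behavior described on slide 17 can be sketched as a small wrapper. This is a minimal illustration, not the team's code: `FailOverInvoker`, `RecoverableFault`, and the `primaryLookup` supplier (standing in for a naming-service lookup) are hypothetical names, and the attempt limit is an assumption that the slides do not state.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical stand-in for a recoverable fault such as COMM_FAILURE. */
class RecoverableFault extends RuntimeException {
    RecoverableFault(String message) {
        super(message);
    }
}

/** Hypothetical client-side wrapper that retries a call on a freshly resolved primary. */
class FailOverInvoker<S> {
    private final Supplier<S> primaryLookup; // assumed to wrap the naming-service lookup
    private final int maxAttempts;           // assumed bound; not given in the slides
    private S server;

    FailOverInvoker(Supplier<S> primaryLookup, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.primaryLookup = primaryLookup;
        this.maxAttempts = maxAttempts;
        this.server = primaryLookup.get(); // initial reference, obtained at start-up
    }

    /** Runs the call; on a recoverable fault, fetches the primary again and retries. */
    <R> R invoke(Function<S, R> call) {
        RecoverableFault last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.apply(server);
            } catch (RecoverableFault fault) {
                last = fault;
                server = primaryLookup.get(); // new primary reference, then retry
            }
        }
        throw last; // every attempt failed
    }
}
```

In a real CORBA client the lookup would be a naming-service resolve and the caught types would be the CORBA system exceptions; both are abstracted away here so the control flow stays visible.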

  18. Fail-Over Mechanism: Replication Manager • Detects faults using a method called “poke” • Maintains a dynamic list of active servers • Restarts failed/corrupted servers • Performs naming service maintenance: unbinds names of crashed servers; rebinds name of primary server • Uses the most-recently-active methodology to choose a new primary server if the primary server experiences a fault
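The slides do not define "most-recently-active" precisely. One plausible reading, sketched below, is that the replication manager records when each server last responded and promotes the server with the latest timestamp when the primary fails. `ReplicaRegistry` and its methods are hypothetical names, and the selection rule is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical bookkeeping for the replication manager's list of active servers. */
class ReplicaRegistry {
    private final Map<String, Long> lastActiveMillis = new LinkedHashMap<>();
    private String primary; // null while no server is active

    /** Records a successful liveness check; the first active server becomes primary. */
    void markActive(String server, long nowMillis) {
        lastActiveMillis.put(server, nowMillis);
        if (primary == null) {
            primary = server;
        }
    }

    /** Drops a failed server and, if it was the primary, promotes a replacement. */
    Optional<String> markFailed(String server) {
        lastActiveMillis.remove(server);
        if (server.equals(primary)) {
            // Assumed rule: the server seen alive most recently becomes the new primary.
            primary = lastActiveMillis.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(null);
        }
        return Optional.ofNullable(primary);
    }

    Optional<String> primary() {
        return Optional.ofNullable(primary);
    }
}
```

In the system described, a change of primary would be followed by unbinding the crashed server's name and rebinding the primary's name in the naming service; that step is omitted here.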

  19. Fail-Over Mechanism: The Poke Method • “Pokes” the server periodically • Checks not only whether the server is alive, but also whether the server's database connectivity is intact • Throws exceptions in case of faults (e.g. can't connect to database) • The replication manager handles faults accordingly
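A single round of poking might look like the sketch below. The `Pokeable` interface, `PokeFailedException`, `DatabaseBackedServer`, and `PokeMonitor` are hypothetical names chosen for illustration; the database check is reduced to a boolean supplier so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical fault raised when a server is up but cannot do its job. */
class PokeFailedException extends Exception {
    PokeFailedException(String message) {
        super(message);
    }
}

/** Hypothetical view of a server as seen by the replication manager. */
interface Pokeable {
    void poke() throws PokeFailedException;
}

/** A server whose poke() also verifies database connectivity. */
class DatabaseBackedServer implements Pokeable {
    private final BooleanSupplier databaseReachable; // stands in for a real connection test

    DatabaseBackedServer(BooleanSupplier databaseReachable) {
        this.databaseReachable = databaseReachable;
    }

    @Override
    public void poke() throws PokeFailedException {
        if (!databaseReachable.getAsBoolean()) {
            throw new PokeFailedException("cannot connect to database");
        }
    }
}

class PokeMonitor {
    /** Pokes every server once and returns the names of those that failed. */
    static List<String> pokeAll(Map<String, Pokeable> servers) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Pokeable> entry : servers.entrySet()) {
            try {
                entry.getValue().poke();
            } catch (PokeFailedException | RuntimeException fault) {
                // Either the server reported a fault or the call itself failed.
                failed.add(entry.getKey());
            }
        }
        return failed;
    }
}
```

To poke periodically, `pokeAll` could be scheduled with `java.util.concurrent.ScheduledExecutorService.scheduleAtFixedRate`; the poke interval is not given in the slides.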

  20. Fail-Over Mechanism: Exceptions Handled • COMM_FAILURE: CORBA exception • OBJECT_NOT_EXIST: CORBA exception • SystemException: CORBA exception • Exception: Java exception • AlreadyInLotException: Client is already in a lot • AtBottomLevelException: Car cannot move to a lower level because it's on the bottom floor • AtTopLevelException: Car cannot move to a higher level because it's on the top floor • InvalidClientException: ID provided by Client doesn't match the ID stored in the system • LotFullException: System throws exception when the lot is full • LotNotFoundException: Lot number not found in the database • NotInLotException: Client's car is not in the lot • NotOnExitLevelException: Client is not on an exit level in the lot • ServiceUnavailableException: Exception that gets thrown when an unrecoverable database exception or some other error prevents the server from successfully completing a client-requested operation

  21. Fail-Over Mechanism: Response to Exceptions • Get a new server reference and then retry the failed operation when one of the following exceptions occurs: COMM_FAILURE; OBJECT_NOT_EXIST; ServiceUnavailableException • Report the error to the user and prompt for the next command when one of the following exceptions occurs: AlreadyInLotException; AtBottomLevelException; AtTopLevelException; LotFullException; LotNotFoundException; NotInLotException; NotOnExitLevelException • Client terminates when one of the following exceptions occurs: InvalidClientException; SystemException; Exception
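The three responses on slide 21 amount to a lookup from exception type to action. The sketch below encodes that table using exception names as strings; `RecoveryAction` and `ExceptionPolicy` are hypothetical names, and treating any unlisted exception as fatal is an assumption that mirrors the catch-all `Exception` entry.

```java
import java.util.Set;

/** The three client responses listed on the slide. */
enum RecoveryAction {
    RETRY_ON_NEW_SERVER,
    REPORT_AND_PROMPT,
    TERMINATE
}

/** Hypothetical policy table mapping exception names to client responses. */
class ExceptionPolicy {
    private static final Set<String> RETRY = Set.of(
            "COMM_FAILURE", "OBJECT_NOT_EXIST", "ServiceUnavailableException");

    private static final Set<String> REPORT = Set.of(
            "AlreadyInLotException", "AtBottomLevelException", "AtTopLevelException",
            "LotFullException", "LotNotFoundException", "NotInLotException",
            "NotOnExitLevelException");

    /** Returns the response for the given exception name. */
    static RecoveryAction actionFor(String exceptionName) {
        if (RETRY.contains(exceptionName)) {
            return RecoveryAction.RETRY_ON_NEW_SERVER;
        }
        if (REPORT.contains(exceptionName)) {
            return RecoveryAction.REPORT_AND_PROMPT;
        }
        // InvalidClientException, SystemException, Exception, and (by assumption)
        // anything unlisted end the client.
        return RecoveryAction.TERMINATE;
    }
}
```

In actual Java code the same ordering would appear as catch clauses, with the specific CORBA exceptions caught before `SystemException` and `Exception`.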

  22. Fail-Over Mechanism: Server References • The client obtains the reference to the primary server when: it is initially started; it notices that the server has crashed or been corrupted (e.g. COMM_FAILURE, ServiceUnavailableException) • When the client notices that there is no primary server reference in the naming service, it displays an appropriate message and then terminates

  23. RT-FT-Baseline Architecture

  24. RT-FT-Baseline Architecture: High-Level Components [Diagram. Components: Client, Client Manager, Client Manager Factory, Server, Database, Naming Service, Replication Manager, Testing Manager. Numbered steps: 1. Create instance; 2. Register name; 3. Notify of existence; 4. Contact naming service; 5. Create client manager instance; 6. Create instance; 7. Invoke service method; 8. Request data. Other labels: poke(), bind()/unbind(). Legend: middleware, processes and threads, data flow, launches.]

  25. Fault Tolerance Experimentation

  26. Fault Tolerance Experimentation: The Fault Free Run - Graph 1 [Graph] While the mean latency stayed almost constant, the maximum latency varied.

  27. Fault Tolerance Experimentation: The Fault Free Run - Graph 2 [Graph] This demonstrates the conformance with the magical 1% theory.

  28. Fault Tolerance Experimentation: The Fault Free Run - Graph 3 [Graph] Mean latency increases as the reply size increases.

  29. Fault Tolerance Experimentation: The Fault Free Run - Conclusions • Our data conforms to the magical 1% theory, indicating that outliers account for less than 1% of the data points • We hope that this helps with Tudor's research

  30. Bounded “Real-Time” Fail-Over Measurements

  31. Bounded “Real-Time” Fail-Over Measurements: The Fault Induced Run - Graph [Graph] High latency is observed during faults.

  32. Bounded “Real-Time” Fail-Over Measurements: The Fault Induced Run - Pie Chart [Pie chart] The client's fault recovery timeout causes most of the latency.

  33. Bounded “Real-Time” Fail-Over Measurements: The Fault Induced Run - Conclusions • Latency rises noticeably when a fault occurs • Most of the latency was caused by the client's fault recovery timeout • The second-highest contributor was the time that the client has to wait for the client manager to be restored on the new server

  34. FT-RT-Performance Strategy

  35. FT-RT-Performance Strategy: Reducing Fail-Over Time • Implemented strategies: adjust client fault recovery timeout; use IOGRs and cloning-like strategies; pre-create TCP/IP connections to all servers • Other strategies that could potentially be implemented: database connection pool; load balancing; remove client ID consistency check

  36. Measurements after Strategies: Adjusting Waiting Time • The following graphs show the results for different wait times at the client end • The wait time is how long the client waits in order to give the replication manager sufficient time to update the naming service with the new primary

  37. Measurements after Strategies: Plot for 0ms waiting time

  38. Measurements after Strategies: Plot for 500ms waiting time

  39. Measurements after Strategies: Plot for 1000ms waiting time

  40. Measurements after Strategies: Plot for 2000ms waiting time

  41. Measurements after Strategies: Plot for 2500ms waiting time

  42. Measurements after Strategies: Plot for 3000ms waiting time

  43. Measurements after Strategies: Plot for 3500ms waiting time

  44. Measurements after Strategies: Plot for 4000ms waiting time

  45. Measurements after Strategies: Plot for 4500ms waiting time

  46. Measurements after Strategies: Observations After Adjusting Wait Times • The best results can be seen with the 4000ms wait time. • Lower wait times reduce fail-over time considerably, but they also show a significant amount of jitter. • The jitter arises because the client does not always get the updated primary from the naming service. • Since our primary concern is bounded fail-over, we chose the strategy that has the least jitter, rather than the strategy that has the lowest latencies. • The average recovery time drops from about 5-6 s to 4.5-5 s with the 4000ms wait time.

  47. Measurements after Strategies: Implementing IOGR • IOGR stands for Interoperable Object Group Reference • With this strategy, the client gets the list of all active servers from the naming service • The client refreshes this list if all the servers in the list have failed • The following graphs were produced after this strategy was implemented
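The IOGR-like strategy can be sketched as a client that walks a cached list of servers and goes back to the naming service only when the list is exhausted. `ServerGroupClient` and the `activeServerLookup` supplier are hypothetical names, and catching `RuntimeException` is a simplification; a real client would catch only the fail-over exceptions such as COMM_FAILURE.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical client that fails over within a cached group of server references. */
class ServerGroupClient<S> {
    private final Supplier<List<S>> activeServerLookup; // stands in for the naming service
    private final Deque<S> candidates = new ArrayDeque<>();

    ServerGroupClient(Supplier<List<S>> activeServerLookup) {
        this.activeServerLookup = activeServerLookup;
        candidates.addAll(activeServerLookup.get());
    }

    /** Tries each cached server in turn; refreshes the list once if all have failed. */
    <R> R invoke(Function<S, R> call) {
        boolean refreshed = false;
        while (true) {
            if (candidates.isEmpty()) {
                if (refreshed) {
                    throw new IllegalStateException("no active servers available");
                }
                candidates.addAll(activeServerLookup.get());
                refreshed = true;
                continue;
            }
            try {
                return call.apply(candidates.peekFirst());
            } catch (RuntimeException fault) {
                candidates.pollFirst(); // drop the failed server and try the next one
            }
        }
    }
}
```

Because the next reference is already in hand, a fail-over in this scheme does not need a naming-service round trip unless every cached server has failed.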

  48. Measurements after Strategies: Plot after IOGR strategy (same axis)

  49. Measurements after Strategies: Plot after IOGR strategy (different axis)

  50. Measurements after Strategies: Pie Chart after IOGR strategy
