
Arseniy Khobotkov, Arunesh Gupta, Dhananjay Khaitan, Saurabh Sharma

This presentation explores the dependability and fault-tolerance mechanisms of a middleware-based stock trading system, with detailed analysis of the software artifacts. It covers the system's design, security features, replication strategies, and fail-over mechanisms. Key focus areas include fault injection, replication management, fault detection, and performance measurement.


Presentation Transcript


  1. Team 7: supari
  17-654: Analysis of Software Artifacts
  18-846: Dependability Analysis of Middleware
  Arseniy Khobotkov, Arunesh Gupta, Dhananjay Khaitan, Saurabh Sharma

  2. Team Members
  • Arunesh Gupta – arunesh@andrew.cmu.edu
  • Saurabh Sharma – saurabhs@andrew.cmu.edu
  • Arseniy Khobotkov – akhobotk@andrew.cmu.edu
  • Dhananjay Khaitan – dkhaitan@andrew.cmu.edu
  • http://www.ece.cmu.edu/~ece846/team7/index.html

  3. Graphical User Interface
  • Registration Screen
    • New User
    • Old User
  • Transaction Screen
    • Updated stock feed
    • Buy / Sell
    • History

  4. Baseline Application
  • Financial Stock Trading Application:
    • Clients can register accounts with username/password
    • Clients receive constantly updated stock prices from an independent data feed
    • Clients can buy/sell stock at guaranteed prices
    • Clients can view transaction history
  • Innovative Features:
    • Guarantee reliability of the system within 5 s
    • Guarantee the price when the client decides to make a transaction
    • Advanced security, as the Stock Feed server is not connected to the main system
    • Stock feed and transaction system are SHA1-signed
    • Stock Feed server generates random variation in stock prices
    • Encrypted stock feed to prevent tampering
  • Technical Design Information:
    • Java-based components with CORBA, running on Unix/Linux
    • PostgreSQL database
    • Java front end
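The independent data feed can be pictured with the minimal sketch below: a plain Java socket server that pushes a randomly varying quote plus a 10-second expiry timestamp once per second. The port, message format, and class name are illustrative assumptions, not details from the project.

```java
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Random;

/**
 * Minimal sketch of the independent stock feed: accepts one client connection
 * and pushes randomly varying prices for a single symbol. All names, the port,
 * and the message format are illustrative, not taken from the original project.
 */
public class StockFeedServer {
    public static void main(String[] args) throws Exception {
        Random rng = new Random();
        double price = 100.0;                                  // starting quote
        try (ServerSocket listener = new ServerSocket(9090);
             Socket client = listener.accept();
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            while (true) {
                // Random walk: small positive or negative variation per tick
                price += (rng.nextDouble() - 0.5) * 2.0;
                long expiresAt = System.currentTimeMillis() + 10_000;  // 10 s expiry
                out.println("SUPARI," + String.format("%.2f", price) + "," + expiresAt);
                Thread.sleep(1000);
            }
        }
    }
}
```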

  5. Security Against Malicious Users
  • Every client has a socket connection to the Stock Feed server
    • Mimics the NASDAQ system
  • The prices are signed by the stock server
    • We sign using SHA1 with RSA (sketched below)
  • Stock Feed server adds an expiration time of 10 seconds to all data
    • Clocks on the database are synchronized with the Stock Feed clock
    • Further prevents stock price tampering
    • No possibility of using an old price to perform a transaction
  • Server verifies the signature to authenticate
    • Reports an exception to the client if authentication fails
    • Client locks further transaction attempts
  • Database watches for time expiration
    • Returns an exception to the client if the time has expired
    • Allows the client to retry with new data
  • We created a malicious client to test this behavior by trying to alter the encrypted data feeds
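A minimal sketch of the signing scheme, using the standard java.security.Signature API with the "SHA1withRSA" algorithm. The quote layout, key handling, and method names are assumptions for illustration; the project's actual message format is not shown in the slides.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

/** Sketch: sign a price quote on the feed side and verify it server-side. */
public class QuoteSigner {

    /** Feed side: sign "symbol,price,expiryMillis" with the feed's private key. */
    static byte[] sign(String quote, PrivateKey feedKey) throws Exception {
        Signature signer = Signature.getInstance("SHA1withRSA");
        signer.initSign(feedKey);
        signer.update(quote.getBytes(StandardCharsets.UTF_8));
        return signer.sign();
    }

    /** Server side: authenticate the quote and reject expired prices. */
    static boolean verify(String quote, byte[] sig, PublicKey feedPubKey) throws Exception {
        Signature verifier = Signature.getInstance("SHA1withRSA");
        verifier.initVerify(feedPubKey);
        verifier.update(quote.getBytes(StandardCharsets.UTF_8));
        if (!verifier.verify(sig)) {
            return false;                                   // tampered quote
        }
        long expiresAt = Long.parseLong(quote.substring(quote.lastIndexOf(',') + 1));
        return System.currentTimeMillis() <= expiresAt;     // 10 s expiry window
    }

    public static void main(String[] args) throws Exception {
        KeyPair kp = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        String quote = "SUPARI,101.25," + (System.currentTimeMillis() + 10_000);
        byte[] sig = sign(quote, kp.getPrivate());
        System.out.println("authentic and fresh: " + verify(quote, sig, kp.getPublic()));
    }
}
```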

  6. Baseline Architecture

  7. Fault-Tolerance Goals – Passive Replication
  • A system with ‘n’ replicas tolerates the failure of ‘n-1’ replicas
  • All state is stored in the database, keyed by a (transaction ID, user ID) tuple
  • A transaction can simply be retried from the client in case of failure (see the retry sketch below)
    • Case 1: Failure on the route to the database – simple retry by the client
    • Case 2: Failure on the route back – the client retries, but the transaction ID prevents duplication
  • Replicated Components
    • The server is replicated three times, as it is the core component of the system
    • The replicas run on different unix4x machines
  • Sacred Components
    • ORB daemon
    • Stock Server
    • Replication Manager
    • Database Systems
  • Primary components
    • Replication Manager – kills faulty replicas and launches new ones
    • Fault Detector – located in the client
    • Fault Injector – for testing fault tolerance
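The retry-safe behavior can be sketched as follows: a (transaction ID, user ID) key lets a retried request return the original result instead of executing twice. The in-memory map stands in for the PostgreSQL state store, and all names are illustrative.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of retry-safe transactions keyed by (transactionId, userId). */
public class DuplicateSuppression {

    /** Completed transactions; the real system keeps this state in PostgreSQL. */
    static final Map<String, String> completed = new ConcurrentHashMap<>();

    /** Apply a trade only once; a retry with the same id returns the stored result. */
    static String execute(String txnId, String userId, String order) {
        return completed.computeIfAbsent(txnId + ":" + userId,
                key -> "FILLED " + order);   // the database update would happen here
    }

    public static void main(String[] args) {
        String txnId = UUID.randomUUID().toString();
        // Case 2 from the slide: the first reply is lost, so the client retries.
        String first = execute(txnId, "alice", "BUY 10 SUPARI");
        String retry = execute(txnId, "alice", "BUY 10 SUPARI");
        System.out.println(first.equals(retry));   // true: no duplicate trade
    }
}
```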

  8. Fault Injector
  • Runs on a separate thread of the Replication Manager
    • Fault Injector thread spawned by the Replication Manager
    • Accesses Replication Manager state to ensure 2 replicas are always running
  • Kills a random replica every 15 seconds
    • It takes some time for a dead server to be discovered and restarted
    • Makes for a consistent pattern of 2 transactions with all servers up before a server is killed
    • Provides for consistent analysis of data
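A minimal sketch of such an injector thread, assuming the Replication Manager tracks its replicas as a shared list of processes; the kill mechanism, interval handling, and names are illustrative. The manager would start it with something like `new Thread(new FaultInjector(replicas)).start()` after launching its replicas.

```java
import java.util.List;
import java.util.Random;

/** Sketch of the fault injector running on its own thread inside the Replication Manager. */
public class FaultInjector implements Runnable {
    private final List<Process> replicas;   // live replica processes tracked by the manager
    private final Random rng = new Random();

    FaultInjector(List<Process> replicas) {
        this.replicas = replicas;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(15_000);                     // one fault every 15 seconds
                synchronized (replicas) {
                    if (replicas.size() >= 2) {           // never drop below two running replicas
                        Process victim = replicas.remove(rng.nextInt(replicas.size()));
                        victim.destroy();                 // kill a random replica
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```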

  9. FT-Baseline Architecture

  10. Mechanisms for Fail-Over
  • Choose an arbitrary replica and, in case of failure, try another one
  • Client detects faults as CORBA IDL exceptions
    • COMM_FAILURE – CORBA-thrown exception
    • Generic_Exception – non-application-specific errors
  • At an exception …
    • Contact the Replication Manager to deal with the faulty replica
    • Use the ORBd Naming Service to locate a new working replica
  • Replication Manager …
    • Receives indication of the failed replica from the client
    • Kills the faulty replica
    • Spawns a new replica on another machine
  • A minimum of two replicas is maintained
    • The Replication Manager ensures that we always have two working replicas in case one fails
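The client-side fail-over path might look like the sketch below, which catches CORBA's COMM_FAILURE, reports the dead replica, and retries on another one (Java 8, where org.omg.CORBA is still bundled with the JDK). StockServer, lookupReplica(), and reportFailure() are stand-ins for the project's IDL-generated stub, the ORBd naming-service lookup, and the Replication Manager notification.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.omg.CORBA.COMM_FAILURE;

/** Sketch of the client-side fail-over loop; all names are illustrative stand-ins. */
public class FailOverClient {

    interface StockServer {                         // placeholder for the IDL interface
        String buy(String txnId, String order);
    }

    private final Deque<StockServer> replicas = new ArrayDeque<>();

    String buyWithFailOver(String txnId, String order) {
        while (true) {
            StockServer replica = lookupReplica();  // arbitrary replica from the naming service
            try {
                return replica.buy(txnId, order);   // same txnId keeps a retry duplicate-safe
            } catch (COMM_FAILURE e) {              // CORBA-thrown communication failure
                reportFailure(replica);             // Replication Manager kills and respawns it
            }
        }
    }

    StockServer lookupReplica() { return replicas.peekFirst(); }
    void reportFailure(StockServer dead) { replicas.pollFirst(); }

    public static void main(String[] args) {
        FailOverClient client = new FailOverClient();
        client.replicas.add((txn, order) -> { throw new COMM_FAILURE(); }); // faulty replica
        client.replicas.add((txn, order) -> "FILLED " + order);             // healthy replica
        System.out.println(client.buyWithFailOver("txn-42", "BUY 10 SUPARI"));
    }
}
```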

  11. Fail-Over Measurements
  • We expected that a faulty case would take longer at higher loads
    • This was not the case
    • CPU load measurements (taken on a per-minute basis) did not correlate with the spikes
  • Database transaction times and network latencies accounted for the bulk of the execution time
  • Average FT execution = 289 ms
  • Average fail-over execution = 458 ms

  12. Fail-Over Measurements
  • Transaction time dominates (database updates / queries)
    • Not much we can do about that
  • Receive exception / get invocation / contact replica
    • These form the focus of our later optimizations

  13. RT-FT-Baseline Architecture (Active Replication)
  • The three replicas from FT-Baseline now run actively
  • Spawn a thread to contact each of the replicas
  • In case of a fault this removes
    • The time in which we contact the naming service to find the next active replica
    • The time taken to retry a transaction – 2 duplicate transactions are already ‘in flight’
  • We get the fastest active response to the client
  • The number of replicas is always maintained by the Replication Manager
  • CORBA does not support multicasting, thus the need for three threads
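One way to realize the three-thread active invocation is sketched below with a fixed thread pool and invokeAny(), which returns the first reply that completes successfully and cancels the rest, matching the "fastest active response wins" idea. The Replica interface stands in for the CORBA stub; this is an illustration under those assumptions, not the project's exact code.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

/**
 * Sketch of active replication on the client: the same transaction is sent to
 * every replica on its own thread and the fastest successful reply wins.
 * Duplicate trades are suppressed server-side by the transaction ID.
 */
public class ActiveReplicationClient {

    interface Replica {                                  // stand-in for the CORBA stub
        String buy(String txnId, String order) throws Exception;
    }

    private final ExecutorService pool = Executors.newFixedThreadPool(3);

    String buy(List<Replica> replicas, String txnId, String order) throws Exception {
        List<Callable<String>> calls = replicas.stream()
                .map(r -> (Callable<String>) () -> r.buy(txnId, order))
                .collect(Collectors.toList());
        // invokeAny returns the first reply that completes without an exception,
        // so a single failed replica does not delay the client at all.
        return pool.invokeAny(calls);
    }

    public static void main(String[] args) throws Exception {
        ActiveReplicationClient client = new ActiveReplicationClient();
        List<Replica> replicas = List.of(
                (txn, order) -> { throw new Exception("replica down"); },
                (txn, order) -> { Thread.sleep(200); return "FILLED (slow replica)"; },
                (txn, order) -> "FILLED (fast replica)");
        System.out.println(client.buy(replicas, "txn-7", "BUY 10 SUPARI"));
        client.pool.shutdown();
    }
}
```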

  14. RT-FT-Baseline Architecture (Active Replication)

  15. Bounded “Real-Time” Fail-Over Measurements
  • Much more bounded behavior with active replication
  • One Naming Service / Replication Manager spike
  • 850 ms fail-over bound
  • Average active replication execution = 197 ms, a 31.8% improvement
  • Average active replication fault execution time = 382 ms, a 16.6% improvement

  16. RT-FT-Performance Strategy
  • Active replication with caching of responses at the database
  • Database time is the main bottleneck in the system
  • When multiple threads reach the database with the same transaction, they now get a cached copy of the response rather than going through the database again
    • Previously, we’d touch the database for every thread
    • Very inefficient
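The response cache could be sketched as a FutureTask-based memoizer keyed by transaction ID: the first thread to arrive performs the real database call, and the duplicate threads from active replication reuse the cached response. The names and the stand-in database call are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.FutureTask;

/**
 * Sketch of the response cache in front of the database: the first thread for a
 * transaction ID runs the real query/update, the duplicate threads block briefly
 * and reuse the cached response instead of touching the database again.
 */
public class DbResponseCache {

    private final ConcurrentMap<String, FutureTask<String>> responses = new ConcurrentHashMap<>();

    String execute(String txnId, Callable<String> databaseCall) throws Exception {
        FutureTask<String> task = new FutureTask<>(databaseCall);
        FutureTask<String> existing = responses.putIfAbsent(txnId, task);
        if (existing == null) {
            task.run();            // this thread actually touches PostgreSQL
            return task.get();
        }
        return existing.get();     // duplicate thread: wait for / reuse the cached response
    }

    public static void main(String[] args) throws Exception {
        DbResponseCache cache = new DbResponseCache();
        Callable<String> dbCall = () -> { Thread.sleep(100); return "FILLED"; }; // stand-in query
        for (int i = 0; i < 3; i++) {                  // the three active-replication threads
            new Thread(() -> {
                try { System.out.println(cache.execute("txn-9", dbCall)); }
                catch (Exception e) { e.printStackTrace(); }
            }).start();
        }
    }
}
```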

  17. RT-FT Performance Architecture

  18. Performance Measurements
  • Caching results in significantly reduced times
  • Fault recovery and execution time average = 278 ms
    • A 27.2% improvement over the RT-FT-Baseline fault execution time (382 ms)
  • Transaction time average = 168 ms
    • A 14.7% improvement over the un-cached times

  19. Active Replication vs. Passive Replication
  • Active replication is far more relevant in a stock trading context
    • Seconds can be the difference between millions of dollars
    • When we started this project, we knew we’d have to go this route
    • No point in making a stock trading system with high response times and irregular, non-transparent faulty behavior
  • Measurements become far more difficult with active replication
    • Three threads are running
    • Timing the behavior of each thread is not possible
    • Threads pre-empt each other at undefined intervals dependent on the JVM

  20. Other Features
  • Just some things we decided to throw into the project…
    • Java socket communication
    • Hashed caching
    • SHA1 data signatures
  • What lessons did you learn from this additional effort?
    • These are all cool features
    • But when you’re trying to make the basic system work, they are a huge distraction
    • Get the base working, then move on to the complex features

  21. Open Issues
  • Issues for resolution
    • Truly persistent active threads
      • Currently we kill and re-invoke threads
      • Inefficient; 5 threads continually running would be much better
      • A thread pool
  • Additional Features
    • Separate client and fault detector
      • Heartbeat system for even faster fault detection (sketched after this list)
    • Cache write-back policy
      • Save database writes for periods of low activity
      • Greatly reduce database access time
    • Multicast!
      • Get rid of the thread-based implementation
      • Reduce thread overhead and complexities
      • Have all three requests sent in parallel
    • Load Reduction System
      • Clients send requests to only 2 replicas per transaction to distribute load
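The proposed heartbeat-based fault detector could look roughly like the sketch below, assuming replicas report in periodically and a scheduled sweep flags any replica silent beyond a timeout. The timeout, sweep interval, and names are assumptions; the slides only name the idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a standalone heartbeat fault detector; all values are illustrative. */
public class HeartbeatDetector {

    private static final long TIMEOUT_MS = 3_000;          // assumed detection bound
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /** Called by a replica (or a ping thread) on every heartbeat. */
    void beat(String replicaId) {
        lastBeat.put(replicaId, System.currentTimeMillis());
    }

    /** Periodically sweep for replicas that have gone silent. */
    void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            lastBeat.forEach((replica, time) -> {
                if (now - time > TIMEOUT_MS) {
                    reportToReplicationManager(replica);   // manager respawns the replica
                    lastBeat.remove(replica);
                }
            });
        }, 1, 1, TimeUnit.SECONDS);
    }

    void reportToReplicationManager(String replicaId) {
        System.out.println("replica " + replicaId + " missed its heartbeat");
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatDetector detector = new HeartbeatDetector();
        detector.start();
        detector.beat("replica-1");
        Thread.sleep(5_000);        // replica-1 stops beating and gets reported
        detector.scheduler.shutdown();
    }
}
```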

  22. Insights from Measurements
  • What insights did you gain from the three sets of measurements, and from analyzing the data?
  • Baseline FT (Passive Replication)
    • Inefficient recovery from faults
    • Contacting the Naming Service at each fault was a major bottleneck
    • This led us to believe active replication had to be done for RT behavior
  • Baseline FT-RT (Active Replication)
    • Times for recovery were significantly lowered
    • However, the transaction time was now the overwhelming bottleneck
    • Performance enhancements were thus based on methods to reduce transaction costs
  • Baseline FT-RT Performance (Active Replication with Caching)
    • Caching allowed us to avoid contacting the database from every thread
    • Allows much faster thread performance
    • Greatly improved transaction times

  23. Conclusions
  • What did you learn?
    • Building up and improving on a baseline program is very cool
    • Interesting to see the progression
    • Caveats of replication and duplicate suppression
    • Reliable distributed systems are difficult to design and code
      • The difficulty is not in the application but in the reliability systems
  • What did you accomplish?
    • We are the only team to have done active replication
    • Digitally signed and encrypted communications
    • A dynamic system with so many components works reliably
  • What would you do differently, if you could start the project from scratch now?
    • Add state in the replicas. It’s much harder, but also much more relevant
    • Modularize the project better
    • Better-defined interfaces
