160 likes | 271 Views
Reliable and Scalable Data Streaming in Multi-Hop Architecture. Sudhir Sangra, BMC Software Lalit Shukla , BMC Software. Contents. Introduction Challenges Approaches Acknowledge based Store and forward Distributed consumer model (Peer to Peer Acknowledge) Conclusion Questions.
E N D
Reliable and Scalable Data Streaming in Multi-Hop Architecture Sudhir Sangra, BMC Software Lalit Shukla , BMC Software
Contents • Introduction • Challenges • Approaches • Acknowledge based • Store and forward • Distributed consumer model (Peer to Peer Acknowledge) • Conclusion • Questions
Introduction • Multi-Hop architecture for various reasons but not limited to • Time to market - Buy decision – multiple component to integrate • Integration of the different component using different technologies underline, to provide end to end solution • Independent products performance and scalability needs to be remained intact while new features are added to solve the customer problem and provide value • Multiple hop to increase scalability, failover, load balance • Fully developed solution organically with technology parity in end to end multi-hop architecture for streaming and scalability but still there are challenges to handle when middleware need to be stateless
Challenges • Real time data streaming consumes resources on both the end (provider and consumer) impact scalability • Multi hop protocol difference • No TCP-IP like protocol to handle guaranteed delivery in multi-hop environment • Network bandwidth (Concerns when on WAN) specially on Software as a Service model • In-consistence detection trigger full data sync by provider • Provider failover from one middleware to another middleware initiated the full sync of data • Priority data such as event data is treated as metric data for transmission leading to delay in IT problem resolution
System behavior The end to end system performance used to be function of the state of middle ware. Analytics resource requirement (CPU/Memory) were reflecting the spikes and used to take long time to settle after fixing the fault at the middle ware or network
System behavior Database Performance was also a function of middleware state and network. Abrupt failure of any components were very expensive for DB operations.
Approach • There could be various approaches to provide reliability and scalability for data streaming but we will be focusing in context of performance and monitoring domain where data can be categorized in 3 different sets • Event • Performance metric • Discovered resource instance
End to End Acknowledge Model Approach • For each message sent, acknowledge is needed for reliability delivery, in case acknowledge is not received message is retransmitted. • Provide and consumer is sole responsible , middleware is stateless • Message are not discarded from provider until consumer sends an ACK or time to live timer expires • Instead of network layer TCP/IP protocol, application protocol is devised to handle the sequencing and ACK’ing as provider and consumer are not directly connected over socket • Round trip to receive ACK message is used to determine the time interval for retransmission of message. This allows to take the server load in consideration • Dynamic sliding window to handle the bust of messages • Message is added to window if window is empty and also added to disk cache • In case window is full, message is written to disk cache • Discovered resources instances messages are best suitable for this model as • No tolerance in data loss
Store and Forward Model Approach • In case of network glitch, server under maintenance mode or server unavailability due to unknown reasons, data should not be lost • Application layer is transparent to these network issues • Virtual socket abstract the file system and TCP/IP socket to peer • Reestablishment of the connection via same hop or different hop in case of fail over will not re-send the whole state but only delta which is generated after the link was broken increase scalability • Basically suitable for “performance metric” for which acknowledgment is not desired because • Huge data • Some % of tolerance is acceptable for data loss • Messages such as discovered resources instance piggy back on this model but does not degrade the model because change in system characteristics/resources are not frequent
Distributed Data Consumer Model Approach • Event messages are highest priority message. Loss in these messages leads to business loss • These message requires high grade reliability and message loss which are in transition is not acceptable • Peer – to – Peer acknowledgment distributed consumer model is best suited for these messages • Hop connected on TCP/IP socket is responsible for acknowledge • 100 % fault tolerant and reliability • Each hop is designated as processing hop or presentation hop • Event processing happens on each hop which reduces the network bandwidth and/or resources needed on end consumer to process the message because pre-processing of message is already done
System behavior Process pronet_cntl Sun 01/26/14 10:00 AM to Mon 01/27/14 10:00 AM The system resources utilization is now uniformly distributed and no longer remains the function of faulty middleware states.
Conclusion • Categorizing the data and applying the relevant model provided a significant improvement • The architecture has increased the number of allowed attributes under analytics from 1.2 M to 1.7 M, which is nearly 30% more than the traditional integration method of pulling data from the source nodes with no or minimal data loss • Number of device supported went from 200 to 1000 which is 5 X improvement with no or minimal data loss • The architecture became stateless and thus could fit to extend linearly with a mix of distributed analytics and distributed middleware
Questions ?s