
Supporting Fault-Tolerance in Streaming Grid Applications

This paper discusses the importance of fault-tolerance in data streaming applications and proposes a light-weight summary structure (LSS) for fault-recovery. The proposed approach requires much less memory than checkpointing the full application state, allows approximate processing on data streams, and is platform-independent.


Presentation Transcript


  1. Supporting Fault-Tolerance in Streaming Grid Applications
     Qian Zhu, Liang Chen, Gagan Agrawal
     Department of Computer Science and Engineering, The Ohio State University
     IPDPS 2008 Conference, April 15th, 2008, Miami, Florida

  2. Data Streaming Applications
     • Computational steering
       • Interactively control scientific simulations
     • Computer-vision-based surveillance
       • Track people and monitor critical infrastructure
       • Images captured by multiple cameras
     • Online network intrusion detection
       • Analyze connection request logs
       • Identify unusual patterns

  3. Fault-Tolerance
     • Definition
       • The ability of a system to respond gracefully to an unexpected hardware or software failure
     • Fault-tolerance in grid applications
       • Redundancy-based fault-tolerance
       • Checkpointing-based fault-tolerance

  4. Fault-Tolerance in Data Streaming Applications
     • Fault-tolerance is important for data stream processing:
       • Distributed data sources
       • Pipelined, real-time processing and long-running nature
       • Frequent, large-volume data transfers
       • Dynamic and unpredictable resource availability

  5. Overview of the GATES Middleware
     • Distributed data stream processing
     • Automatic resource discovery
     • Self-adaptation algorithm: achieves the best accuracy while maintaining the real-time constraints
     • Easy to use (Java, XML, Web Services)
     • Our previous work: HPDC'04, SC'06, IPDPS'06

  6. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  7. Overall Design for Fault-Tolerance
     • Design alternatives
       • Redundancy-based
       • Checkpointing-based
     • Drawbacks
       • Resource requirements
       • Synchronization of states across all replicas
       • Platform dependence
       • Large-volume checkpoints

  8. Our Proposed Approach
     • Light-Weight Summary Structure (LSS)
       • Locally updated each processing round
       • Transferred to remote nodes
     • Heartbeat-based fault detection
     • Failure recovery using LSS
     • Other issues and enhancements
       • Data backup buffer
       • Efficient resource allocation algorithm
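The slides do not show how heartbeat-based fault detection is implemented, but the idea is standard: each node periodically reports liveness, and a node whose report is overdue is suspected failed. The following is a minimal Java sketch of that pattern; the class and method names are illustrative assumptions, not the actual GATES API.

```java
// Minimal sketch of heartbeat-based fault detection. Illustrative only:
// these names are assumptions, not part of the GATES middleware.
import java.util.HashMap;
import java.util.Map;

public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a node.
    public void recordHeartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    // A node is suspected failed once its heartbeat is overdue
    // (or was never received at all).
    public boolean isSuspected(String nodeId, long nowMillis) {
        Long t = lastSeen.get(nodeId);
        return t == null || nowMillis - t > timeoutMillis;
    }
}
```

Once a node is suspected, the middleware can trigger the LSS-based recovery described on the following slides.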

  9. Definition of Light-Weight Summary Structure (LSS)
     • Data stream processing structure
     • Summary information accumulated in each processing loop iteration
     • A small memory size

         ...
         while(true)
         {
             read_data_from_streams();
             process_data();
             accumulate_intermediate_results();
             reset_auxiliary_structures();
         }
         ...

  10. LSS: An Example
     • Application: counting samples
     • [Figure: a data source feeds a pipeline of processing stages that compute the 10 (and then the m) most frequent numbers; counting-lss is maintained at each stage]
     • counting-lss contains:
       • int: the value of m
       • int array: the m most frequent numbers
       • int array: the corresponding frequencies
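The counting-lss layout above can be sketched directly in Java. The field layout follows the slide; the update logic shown here (merge one round's counts into the summary, then keep the top m) is an assumption about how the accumulation might work, and it also illustrates why processing becomes approximate: counts that fall out of the top m are lost.

```java
// Illustrative sketch of the counting-lss structure from the example.
// Field layout follows the slide; the accumulate() logic is an assumption.
import java.util.HashMap;
import java.util.Map;

public class CountingLSS {
    final int m;             // how many frequent numbers to track
    final int[] values;      // the m most frequent numbers seen so far
    final int[] frequencies; // their accumulated frequencies

    CountingLSS(int m) {
        this.m = m;
        this.values = new int[m];
        this.frequencies = new int[m];
    }

    // Fold one round's auxiliary counts into the summary: merge the
    // existing LSS entries with the new counts, then keep the top m.
    void accumulate(Map<Integer, Integer> roundCounts) {
        Map<Integer, Integer> merged = new HashMap<>(roundCounts);
        for (int i = 0; i < m; i++) {
            if (frequencies[i] > 0) {
                merged.merge(values[i], frequencies[i], Integer::sum);
            }
        }
        java.util.Arrays.fill(values, 0);
        java.util.Arrays.fill(frequencies, 0);
        // Selection loop: repeatedly extract the highest remaining count.
        for (int i = 0; i < m && !merged.isEmpty(); i++) {
            int best = 0, bestCount = -1;
            for (Map.Entry<Integer, Integer> e : merged.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            values[i] = best;
            frequencies[i] = bestCount;
            merged.remove(best);
        }
    }
}
```

Because only the top m entries survive each round, a number that is dropped and later reappears restarts from its new count, which is exactly the kind of bounded-memory approximation the deck describes.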

  11. Using LSS for Fault-Tolerance
     • Much smaller memory size than that of the application
       • Auxiliary structures are reset at the end of each iteration
     • Approximate processing on data streams

  12. Using LSS for Fault-Tolerance (cont'd)
     • Comparing LSS-based fault-tolerance to checkpointing in grid environments:
       • Much smaller memory size than that of the application
       • A small amount of data is lost during failure recovery
       • LSS is independent of platforms

  13. GATES Implementation for Fault-Tolerance

     Application:
         // Initialize auxiliary structures
         initialize_auxiliary_structures();
         // Get an LSS instance from GATES
         counting-lss lss = GATES.getLSS("counting-lss");
         // Process streaming data
         while(true) {
             // Stop processing if the input buffer is invalid
             if (inBuffer.getInputBufferStatus() == INVALID)
                 break;
             read_data_from_streams();
             process_data();
             accumulate_intermediate_results_to_LSS(lss);
             update_local_LSS(lss);
             initialize_auxiliary_structures();
         }

     GATES:
         // Monitor service
         if (local LSS updated)
             send_LSS_to_Candidates(lss);
         // Replication service
         remote_store_LSS(lss);
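The application-side loop on this slide can be made concrete with the GATES services stubbed out. In this minimal Java sketch, every name is an illustrative stand-in (the LSS is reduced to a running sum), not the real GATES API; the point is the shape of the loop: process a round, fold the result into the LSS, replicate the LSS, and reset the per-round state.

```java
// Runnable sketch of the application-side processing loop, with the
// GATES monitor/replication services stubbed. All names are illustrative.
import java.util.Queue;

public class LssLoopSketch {
    // Stand-in for the remote replication service: holds the last LSS copy.
    static int replicatedLss = 0;

    // Process a finite "stream"; the LSS here is just a running sum.
    static int run(Queue<Integer> stream) {
        int lss = 0;                       // GATES.getLSS(...)
        while (!stream.isEmpty()) {        // input buffer still valid
            int auxiliary = stream.poll(); // read_data_from_streams(); process_data()
            lss += auxiliary;              // accumulate_intermediate_results_to_LSS(lss)
            replicatedLss = lss;           // monitor + replication services copy the LSS
            // auxiliary structures would be re-initialized here each round
        }
        return lss;
    }
}
```

Because the replicated copy is refreshed every round, a replacement node can resume from the last replicated LSS and loses at most the data of the round in flight.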

  14. Failure Recovery Procedure

  15. Other Issues and Enhancements
     • Data backup buffer
       • Data is stored in the backup buffer until an acknowledgment is received
       • Obsolete data in the backup buffer is replaced
     • Efficient resource allocation algorithm
       • Candidate nodes
       • Dijkstra's shortest-path algorithm
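The backup-buffer behavior described above (keep sent data until acknowledged, replace obsolete entries when the buffer fills, replay the rest on failure) can be sketched as a small bounded FIFO. This is a hedged illustration of the stated policy, not the GATES implementation; all names are assumptions.

```java
// Illustrative sketch of the data backup buffer: the sender keeps each
// item until acknowledged; when the bounded buffer fills, the oldest
// (obsolete) entry is replaced. Names are assumptions, not GATES code.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class BackupBuffer<T> {
    private final int capacity;
    private final Deque<T> buffer = new ArrayDeque<>();

    public BackupBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Store a sent item; when full, drop the oldest (obsolete) entry.
    public void store(T item) {
        if (buffer.size() == capacity) {
            buffer.pollFirst();
        }
        buffer.addLast(item);
    }

    // On acknowledgment, the oldest buffered item is no longer needed.
    public void acknowledgeOldest() {
        buffer.pollFirst();
    }

    // On failure, the unacknowledged items are replayed to the new node.
    public List<T> replay() {
        return new ArrayList<>(buffer);
    }
}
```

Together with the replicated LSS, replaying this buffer is what bounds the data lost during recovery to at most the items evicted as obsolete.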

  16. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  17. Streaming Applications
     • Counting samples (count-samps)
       • Determines the n most frequent numbers
       • LSS: the m most frequent numbers
     • Clustering evolving streams (clustream)
       • Groups data into n clusters
       • LSS: m micro-clusters
     • Distributed frequency counting (dist-freq-counting)
       • Finds the most frequent itemsets above a support threshold
       • LSS: the most frequent itemsets above the threshold

  18. Goals for the Experiments
     • Show that LSS uses a small amount of memory
     • Evaluate the overhead of LSS for fault-tolerance
     • Show the impact on accuracy

  19. Experiment Setup and Datasets
     • 64-node computing cluster
     • Simulated different inter-node bandwidths
     • Datasets
       • count-samps: data generated by a simulator
       • clustream: KDD-CUP'99 Network Intrusion Detection dataset
       • dist-freq-counting: IBM synthetic data generator

  20. Memory Usage of LSS
     • LSS occupied only approximately 0.6%, 1.7%, 2.5%, and 2.9%, respectively, of the memory used by the entire application
     • LSS consumed 0.9% of the memory of the clustream application and 1.1% of that of dist-freq-counting

  21. Using LSS for Fault-Tolerance: Performance
     • Execution time of count-samps [chart; labeled data points: 4%, 7%, 10%]

  22. Using LSS for Fault-Tolerance: Performance
     • Execution time of clustream [chart; labeled data point: 2.5%]

  23. Using LSS for Fault-Tolerance: Performance
     • Execution time of dist-freq-counting [chart; labeled data point: 3.5%]

  24. Using LSS for Fault-Tolerance: Accuracy
     • Accuracy of count-samps [chart; labeled data points: 1%, 6%]

  25. Using LSS for Fault-Tolerance: Accuracy
     • Accuracy of clustream [chart]

  26. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  27. Related Work
     • Application-level checkpointing
       • Bronevetsky et al. (PPoPP'03, ASPLOS'04, SC'06)
     • Replication-based fault tolerance
       • Abawajy et al. (IPDPS'04), Murty et al. (HotDep'06), Hwang et al. (ICDE'05), Zheng et al. (Cluster'04)
     • Fault tolerance in distributed data stream processing
       • Balazinska et al. (SIGMOD'05, ICDE'05)

  28. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  29. Conclusion
     • Use of LSS to enable efficient failure recovery
     • Use of additional buffers to control data loss
     • Efficient resource allocation algorithm
     • Modest overhead associated with fault detection and failure recovery
     • Small loss of accuracy

  30. Thank you!
