
Supporting Fault-Tolerance in Streaming Grid Applications

This paper discusses the importance of fault-tolerance in data streaming applications and proposes a light-weight summary structure (LSS) for fault-recovery. The proposed approach requires much less memory than checkpointing the full application state, allows approximate processing on data streams, and is platform-independent.


Presentation Transcript


  1. Supporting Fault-Tolerance in Streaming Grid Applications
     Qian Zhu, Liang Chen, Gagan Agrawal
     Department of Computer Science and Engineering, The Ohio State University
     IPDPS 2008 Conference, April 15th, 2008, Miami, Florida

  2. Data Streaming Applications
     • Computational steering
       • Interactively control scientific simulations
     • Computer-vision-based surveillance
       • Track people and monitor critical infrastructure
       • Images captured by multiple cameras
     • Online network intrusion detection
       • Analyze connection request logs
       • Identify unusual patterns

  3. Fault-Tolerance
     • Definition
       • The ability of a system to respond gracefully to an unexpected hardware or software failure
     • Fault-tolerance in grid applications
       • Redundancy-based fault-tolerance
       • Checkpointing-based fault-tolerance

  4. Fault-Tolerance in Data Streaming Applications
     • Fault-tolerance is important for data stream processing:
       • Distributed data sources
       • Pipelined, real-time processing and long-running nature
       • Frequent, large-volume data transfers
       • Dynamic and unpredictable resource availability

  5. Overview of the GATES Middleware
     • Distributed data stream processing
     • Automatic resource discovery
     • Self-adaptation algorithm: achieves the best accuracy while maintaining the real-time constraints
     • Easy to use (Java, XML, Web Services)
     • Our previous work: HPDC'04, SC'06, IPDPS'06

  6. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  7. Overall Design for Fault-Tolerance
     • Design alternatives
       • Redundancy-based
       • Checkpointing-based
     • Drawbacks
       • Resource requirements
       • Synchronization of states across all replicas
       • Platform dependence
       • Large-volume checkpoints

  8. Our Proposed Approach
     • Light-Weight Summary Structure (LSS)
       • Locally updated each processing round
       • Transferred to remote nodes
     • Heartbeat-based fault detection
     • Failure recovery using LSS
     • Other issues and enhancements
       • Data backup buffer
       • Efficient resource allocation algorithm
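The slides do not show how heartbeat-based fault detection is implemented, but the idea is standard: each node periodically reports liveness, and a node whose report is overdue is suspected failed. The following is a minimal Java sketch of that pattern; the class and method names are illustrative assumptions, not the actual GATES API.

```java
// Minimal sketch of heartbeat-based fault detection. Illustrative only:
// these names are assumptions, not part of the GATES middleware.
import java.util.HashMap;
import java.util.Map;

public class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a node.
    public void recordHeartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    // A node is suspected failed once its heartbeat is overdue
    // (or was never received at all).
    public boolean isSuspected(String nodeId, long nowMillis) {
        Long t = lastSeen.get(nodeId);
        return t == null || nowMillis - t > timeoutMillis;
    }
}
```

Once a node is suspected, the middleware can trigger the LSS-based recovery described on the following slides.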

  9. Definition of Light-Weight Summary Structure (LSS)
     • Data stream processing structure
     • Summary information accumulated in each processing loop iteration
     • A small memory size

         ...
         while(true)
         {
             read_data_from_streams();
             process_data();
             accumulate_intermediate_results();
             reset_auxiliary_structures();
         }
         ...

  10. LSS: An Example
     • Application: counting samples
     • [Figure: a data source feeds a pipeline of processing stages that compute the 10 (and then the m) most frequent numbers; counting-lss is maintained at each stage]
     • counting-lss contains:
       • int: the value of m
       • int array: the m most frequent numbers
       • int array: the corresponding frequencies
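The counting-lss layout above can be sketched directly in Java. The field layout follows the slide; the update logic shown here (merge one round's counts into the summary, then keep the top m) is an assumption about how the accumulation might work, and it also illustrates why processing becomes approximate: counts that fall out of the top m are lost.

```java
// Illustrative sketch of the counting-lss structure from the example.
// Field layout follows the slide; the accumulate() logic is an assumption.
import java.util.HashMap;
import java.util.Map;

public class CountingLSS {
    final int m;             // how many frequent numbers to track
    final int[] values;      // the m most frequent numbers seen so far
    final int[] frequencies; // their accumulated frequencies

    CountingLSS(int m) {
        this.m = m;
        this.values = new int[m];
        this.frequencies = new int[m];
    }

    // Fold one round's auxiliary counts into the summary: merge the
    // existing LSS entries with the new counts, then keep the top m.
    void accumulate(Map<Integer, Integer> roundCounts) {
        Map<Integer, Integer> merged = new HashMap<>(roundCounts);
        for (int i = 0; i < m; i++) {
            if (frequencies[i] > 0) {
                merged.merge(values[i], frequencies[i], Integer::sum);
            }
        }
        java.util.Arrays.fill(values, 0);
        java.util.Arrays.fill(frequencies, 0);
        // Selection loop: repeatedly extract the highest remaining count.
        for (int i = 0; i < m && !merged.isEmpty(); i++) {
            int best = 0, bestCount = -1;
            for (Map.Entry<Integer, Integer> e : merged.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            values[i] = best;
            frequencies[i] = bestCount;
            merged.remove(best);
        }
    }
}
```

Because only the top m entries survive each round, a number that is dropped and later reappears restarts from its new count, which is exactly the kind of bounded-memory approximation the deck describes.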

  11. Using LSS for Fault-Tolerance
     • Much smaller memory size than that of the application
       • Auxiliary structures are reset at the end of each iteration
     • Approximate processing on data streams

  12. Using LSS for Fault-Tolerance (cont'd)
     • Comparing LSS-based fault-tolerance to checkpointing in grid environments:
       • Much smaller memory size than that of the application
       • A small amount of data is lost during failure recovery
       • LSS is independent of platforms

  13. GATES Implementation for Fault-Tolerance

     Application:
         // Initialize auxiliary structures
         initialize_auxiliary_structures();
         // Get an LSS instance from GATES
         counting-lss lss = GATES.getLSS("counting-lss");
         // Process streaming data
         while(true) {
             // Stop processing if the input buffer is invalid
             if (inBuffer.getInputBufferStatus() == INVALID)
                 break;
             read_data_from_streams();
             process_data();
             accumulate_intermediate_results_to_LSS(lss);
             update_local_LSS(lss);
             initialize_auxiliary_structures();
         }

     GATES:
         // Monitor service
         if (local LSS updated)
             send_LSS_to_Candidates(lss);
         // Replication service
         remote_store_LSS(lss);
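The application-side loop on this slide can be made concrete with the GATES services stubbed out. In this minimal Java sketch, every name is an illustrative stand-in (the LSS is reduced to a running sum), not the real GATES API; the point is the shape of the loop: process a round, fold the result into the LSS, replicate the LSS, and reset the per-round state.

```java
// Runnable sketch of the application-side processing loop, with the
// GATES monitor/replication services stubbed. All names are illustrative.
import java.util.Queue;

public class LssLoopSketch {
    // Stand-in for the remote replication service: holds the last LSS copy.
    static int replicatedLss = 0;

    // Process a finite "stream"; the LSS here is just a running sum.
    static int run(Queue<Integer> stream) {
        int lss = 0;                       // GATES.getLSS(...)
        while (!stream.isEmpty()) {        // input buffer still valid
            int auxiliary = stream.poll(); // read_data_from_streams(); process_data()
            lss += auxiliary;              // accumulate_intermediate_results_to_LSS(lss)
            replicatedLss = lss;           // monitor + replication services copy the LSS
            // auxiliary structures would be re-initialized here each round
        }
        return lss;
    }
}
```

Because the replicated copy is refreshed every round, a replacement node can resume from the last replicated LSS and loses at most the data of the round in flight.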

  14. Failure Recovery Procedure

  15. Other Issues and Enhancements
     • Data backup buffer
       • Data is stored in the backup buffer until an acknowledgment is received
       • Obsolete data in the backup buffer is replaced
     • Efficient resource allocation algorithm
       • Candidate nodes
       • Dijkstra's shortest-path algorithm
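The backup-buffer behavior described above (keep sent data until acknowledged, replace obsolete entries when the buffer fills, replay the rest on failure) can be sketched as a small bounded FIFO. This is a hedged illustration of the stated policy, not the GATES implementation; all names are assumptions.

```java
// Illustrative sketch of the data backup buffer: the sender keeps each
// item until acknowledged; when the bounded buffer fills, the oldest
// (obsolete) entry is replaced. Names are assumptions, not GATES code.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class BackupBuffer<T> {
    private final int capacity;
    private final Deque<T> buffer = new ArrayDeque<>();

    public BackupBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Store a sent item; when full, drop the oldest (obsolete) entry.
    public void store(T item) {
        if (buffer.size() == capacity) {
            buffer.pollFirst();
        }
        buffer.addLast(item);
    }

    // On acknowledgment, the oldest buffered item is no longer needed.
    public void acknowledgeOldest() {
        buffer.pollFirst();
    }

    // On failure, the unacknowledged items are replayed to the new node.
    public List<T> replay() {
        return new ArrayList<>(buffer);
    }
}
```

Together with the replicated LSS, replaying this buffer is what bounds the data lost during recovery to at most the items evicted as obsolete.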

  16. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  17. Streaming Applications
     • Counting samples (count-samps)
       • Determines the n most frequent numbers
       • LSS: the m most frequent numbers
     • Clustering evolving streams (clustream)
       • Groups data into n clusters
       • LSS: m micro-clusters
     • Distributed frequency counting (dist-freq-counting)
       • Finds the most frequent itemsets above a support threshold
       • LSS: the most frequent itemsets above the threshold

  18. Goals for the Experiments
     • Show that LSS uses a small amount of memory
     • Evaluate the overhead of LSS for fault-tolerance
     • Show the impact on accuracy

  19. Experiment Setup and Datasets
     • 64-node computing cluster
     • Simulated different inter-node bandwidths
     • Datasets
       • count-samps: data generated by a simulator
       • clustream: KDD-CUP'99 Network Intrusion Detection dataset
       • dist-freq-counting: IBM synthetic data generator

  20. Memory Usage of LSS
     • LSS occupied only approximately 0.6%, 1.7%, 2.5%, and 2.9%, respectively, of the memory used by the entire application
     • LSS consumed 0.9% of the memory of the clustream application and 1.1% of that of dist-freq-counting

  21. Using LSS for Fault-Tolerance: Performance
     • Execution time of count-samps [chart; labeled data points: 4%, 7%, 10%]

  22. Using LSS for Fault-Tolerance: Performance
     • Execution time of clustream [chart; labeled data point: 2.5%]

  23. Using LSS for Fault-Tolerance: Performance
     • Execution time of dist-freq-counting [chart; labeled data point: 3.5%]

  24. Using LSS for Fault-Tolerance: Accuracy
     • Accuracy of count-samps [chart; labeled data points: 1%, 6%]

  25. Using LSS for Fault-Tolerance: Accuracy
     • Accuracy of clustream [chart]

  26. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  27. Related Work
     • Application-level checkpointing
       • Bronevetsky et al. (PPoPP'03, ASPLOS'04, SC'06)
     • Replication-based fault tolerance
       • Abawajy et al. (IPDPS'04), Murty et al. (HotDep'06), Hwang et al. (ICDE'05), Zheng et al. (Cluster'04)
     • Fault tolerance in distributed data stream processing
       • Balazinska et al. (SIGMOD'05, ICDE'05)

  28. Outline
     • Motivation and Introduction
     • Overall Design for Fault-Tolerance
     • Experimental Evaluation
     • Related Work
     • Conclusion

  29. Conclusion
     • Use of LSS to enable efficient failure recovery
     • Use of additional buffers to control data loss
     • Efficient resource allocation algorithm
     • Modest overhead associated with fault detection and failure recovery
     • Small loss of accuracy

  30. Thank you!
