MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project



  1. MS Thesis Defense • Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon

  2. Outline • Introduction to the Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Evaluation • Conclusion

  3. Water Threat Management • Motivation • Urban Water Distribution Systems (WDSs) are easy targets for terror attacks, e.g. contamination of the water. • Methods • Detect contamination using sensors located across the WDSs. • Run algorithms (developed by NCSU) that determine the sensor locations so as to minimize the time needed to find the contaminant source locations.

  4. Existing Water Threat Management System Architecture • Optimization Engine: Runs Evolutionary Algorithm (EA) • Simulation Engine: Runs EPANET

  5. Water Threat Management System Requirements • Requirements • Time sensitive • Massive calculation • Dynamic adaptation to a Grid environment • Fault tolerance • Our goal • The current system is not fault-tolerant - develop a fault-tolerant framework in the dynamic environment.

  6. Motivation • Resource (Site) Outage • 5% down during 2009 • Queue Wait Time • Source: TeraGrid User & System News (http://news.teragrid.org/)

  7. Research Objectives • Develop a fault-tolerant framework dealing with resource outages • Strategy: generation distribution on multiple sites • Reduce queue wait time • Strategy: dynamic job dependency

  8. Water Threat Management Application • Sequential & parallel processing

  9. Generation Distribution • Divide generations into multiple parts as multiple jobs. Distribute them on multiple sites.
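
The generation split on this slide can be illustrated with a minimal Python sketch. It assumes generations are numbered from 0 and that one job is created per site; the function name, the site names, and the total of 100 generations are hypothetical and do not come from the thesis.

```python
def split_generations(total_generations, sites):
    """Divide generations 0..total-1 into contiguous chunks, one job per site.

    Assumption: chunks are as equal as possible; any remainder goes to the
    first sites in the list.
    """
    base, extra = divmod(total_generations, len(sites))
    jobs, start = [], 0
    for i, site in enumerate(sites):
        count = base + (1 if i < extra else 0)
        jobs.append({"site": site, "first_gen": start, "num_gens": count})
        start += count
    return jobs


# Hypothetical example: 100 generations over three placeholder sites.
jobs = split_generations(100, ["site_a", "site_b", "site_c"])
for job in jobs:
    print(job)
```

Each job only needs to know its first generation and how many generations to run, so losing one site affects only that job's chunk.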

  10. Dynamic Job Dependency • Problems of generation distribution on multiple sites • Additional queue wait times • Each job depends on the previous one. • A job cannot be submitted before the prior job finishes. • Solution: determine job dependency at run time. • Submit all jobs at the same time. • Whichever job starts first computes the first set of generations.
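
One way to picture run-time dependency resolution is a shared coordinator that hands out generation chunks in the order in which jobs actually start. The sketch below is an assumption about how this could be done, not the mechanism used in the thesis; the class name `GenerationCoordinator` and the job identifiers are hypothetical.

```python
import threading


class GenerationCoordinator:
    """Hands out generation chunks in start order (hypothetical helper)."""

    def __init__(self, num_chunks):
        self._lock = threading.Lock()
        self._next_chunk = 0
        self._num_chunks = num_chunks
        self.assignment = {}  # job_id -> chunk index

    def claim(self, job_id):
        """Called by a job when it leaves the queue and starts running."""
        with self._lock:
            if self._next_chunk >= self._num_chunks:
                return None  # all chunks are already taken
            chunk = self._next_chunk
            self._next_chunk += 1
            self.assignment[job_id] = chunk
            return chunk

    @staticmethod
    def depends_on(chunk):
        """Chunk k needs the output of chunk k-1; the first chunk needs nothing."""
        return chunk - 1 if chunk > 0 else None


# All jobs are submitted together; the queues decide which one starts first.
coord = GenerationCoordinator(num_chunks=3)
for job_id in ["job_c", "job_a", "job_b"]:  # hypothetical start order
    coord.claim(job_id)
print(coord.assignment)
```

Because the chunk is bound to a job only when that job starts, queue wait times overlap instead of adding up.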

  11. Dynamic WTM Workflow Management • Example scenario

  12. Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitations of checkpointing under time-criticality • Checkpointing degrades performance • A checkpoint may not be compatible with a different site (heterogeneity) • A job cannot be rescheduled on the same site during a site outage • Choice: the replication strategy within the fault-tolerant queue

  13. Fault-tolerant Queue Design • Components • Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)
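
The interplay of the task pool, resource pool, and scheduler under a replication strategy might look roughly like the following Python sketch. The class layout, the replica count of 2, and the "first available sites" policy are assumptions made for illustration; the slides do not describe the actual scheduling policy.

```python
from collections import deque


class ResourcePool:
    """Tracks which sites are currently usable (hypothetical structure)."""

    def __init__(self, sites):
        self.up = {site: True for site in sites}

    def set_status(self, site, is_up):
        # In the design above, a resource checker would call this.
        self.up[site] = is_up

    def available(self):
        return [site for site, ok in self.up.items() if ok]


class Scheduler:
    """Assigns each pending task to several sites at once (replication)."""

    def __init__(self, task_pool, resource_pool, replicas=2):
        self.task_pool = task_pool
        self.resource_pool = resource_pool
        self.replicas = replicas  # assumed value, not from the thesis

    def dispatch(self):
        plan = {}
        for _ in range(len(self.task_pool)):
            task = self.task_pool.popleft()
            sites = self.resource_pool.available()[: self.replicas]
            if sites:
                plan[task] = sites
            else:
                self.task_pool.append(task)  # no site is up: keep the task queued
        return plan


resources = ResourcePool(["site_a", "site_b", "site_c"])
resources.set_status("site_a", False)  # pretend site_a is in an outage
tasks = deque(["generations_0_33"])
plan = Scheduler(tasks, resources).dispatch()
print(plan)
```

With replicas on distinct sites, a single site outage does not stop the task, which is the property that checkpointing on one site cannot provide.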

  14. Fault Detection in Fault-tolerant Queue • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • GRAM service may fail when the resource is down • Publishes XML documents containing the outage information
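
The two detection signals on this slide can be combined as in the sketch below. Important assumptions: the XML element and attribute names (`outages`, `outage`, `site`, `start`, `end`) are invented for this example and are not the actual TeraGrid Information Services schema, and the job state string `"FAILED"` is a placeholder for whatever GRAM reports. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage document; the real schema is not shown in the slides.
SAMPLE = """<outages>
  <outage site="site_a" start="2009-01-01T00:00:00" end="2009-01-01T06:00:00"/>
  <outage site="site_b" start="2009-01-02T00:00:00" end="2009-01-02T01:00:00"/>
</outages>"""


def sites_down(xml_text, now):
    """Return the set of sites whose outage window contains `now`."""
    root = ET.fromstring(xml_text)
    down = set()
    for outage in root.findall("outage"):
        start = datetime.fromisoformat(outage.get("start"))
        end = datetime.fromisoformat(outage.get("end"))
        if start <= now < end:
            down.add(outage.get("site"))
    return down


def job_failed(gram_state, site, down_sites):
    """A job counts as failed if GRAM says so or if its site is in an outage.

    The second check matters because GRAM itself may be unreachable
    when the resource is down.
    """
    return gram_state == "FAILED" or site in down_sites


down = sites_down(SAMPLE, datetime(2009, 1, 1, 3, 0))
print(down, job_failed("ACTIVE", "site_a", down))
```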

  15. Evaluation – WTM performance • WTM application performance (original)

  16. Evaluation – Queue Wait Time • Queue wait time statistics

  17. Evaluation – Performance Overhead • Performance overhead • Integrating a fault-tolerant framework usually causes performance degradation • No performance loss in our framework

  18. Evaluation – Workflow Performance • Run time comparison of different workflow types • Original deployment vs. fault-tolerant deployment • Dynamic job dependency vs. static job dependency • Each type of deployment is tested in the real Grid system, including queue wait time

  19. Evaluation – Workflow Performance • Workflow comparison results (figure panels: Experiment 1, Experiment 2, Experiment 3)

  20. Simulation – Worst Case Run Time Comparison • A threat management system must deliver results under any circumstances. • Thus, the worst-case run time is a critical factor in the Water Threat Management system.

  21. Simulation – Worst Case Run Time Comparison • Simulation setup • The generations are equally distributed among the machines. • Use the 2009 TeraGrid outage data. • Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.
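
The submission schedule in this setup can be sketched in Python as follows. The one-day end time is an assumption chosen to keep the example small (the slide only gives the start time and the 5-minute step), EST is modeled as a fixed UTC-5 offset, and `summarize` is a hypothetical helper that reports the median and worst-case (maximum) of whatever run times a simulation produces.

```python
from datetime import datetime, timedelta, timezone
from statistics import median

EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, no daylight saving


def submission_times(start, end, step=timedelta(minutes=5)):
    """Yield one submission time every `step` in the interval [start, end)."""
    t = start
    while t < end:
        yield t
        t += step


def in_outage(t, outages):
    """True if time t falls inside any (start, end) outage window."""
    return any(begin <= t < finish for begin, finish in outages)


def summarize(run_times):
    """Return (median, worst case) of a list of simulated run times."""
    return median(run_times), max(run_times)


start = datetime(2009, 1, 1, 0, 0, tzinfo=EST)
one_day = list(submission_times(start, start + timedelta(days=1)))
print(len(one_day), one_day[0], one_day[1])
```

A full run would replace the one-day window with the period covered by the outage data and feed each submission time into the workflow model.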

  22. Simulation – Worst Case Run Time Comparison • Simulation queue wait time setup (unit: minutes)

  23. Simulation – Worst Case Run Time Comparison • Source: TeraGrid User & System News (http://news.teragrid.org/)

  24. Simulation – Worst Case Run Time Comparison

  25. Simulation – Worst Case Run Time Comparison

  26. Simulation – Median Run Time, Worst Case (Max.) Run Time

  27. Conclusion • Achievement: • Worst-case run time is significantly reduced. • Limitation: • In "general" cases, the dynamic workflow shows performance degradation. • This is due to the low failure rate and the difference in compute performance between machines. • Possible improvement: • Migrate the generation process to a faster machine whenever possible.
