
MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project. Young Suk Moon Chair: Dr. Hans-Peter Bischof Reader: Dr. Gregor von Laszewski Observer: Dr. Minseok Kwon. Outline. Introduction to Water Threat Management Project Motivation


Presentation Transcript


  1. MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon

  2. Outline • Introduction to Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Analysis of Fault-Tolerant Workflow • Evaluation • Conclusion

  3. Water Threat Management • Managing contamination incidents in urban Water Distribution Systems (WDSs) • Simulating water quality and hydraulic behavior using EPANET. • Determining locations of sensors in WDSs to detect contaminations. • Searching the contaminant source location.

  4. Existing Water Threat Management System Architecture

  5. EPANET Simulation in the Simulation Engine

  6. Water Threat Management Requirements • Time sensitivity • Large computational power • Dynamic adaptation to a Grid environment • Fault tolerance

  7. Motivation • A Grid is a fault-prone environment. Source: TeraGrid User & System News (http://news.teragrid.org/)

  8. Motivation • A Grid is a fault-prone environment. Source: TeraGrid User & System News (http://news.teragrid.org/)

  9. Motivation • Outage rate on TeraGrid during 2009. Source: TeraGrid User & System News (http://news.teragrid.org/)

  10. Motivation • Queue wait time

  11. Research Objectives • Fault-tolerant Water Threat Management system deployment • Generation distribution on multiple sites • Reducing queue wait time • Dynamic job dependency

  12. Water Threat Management Application • Sequential & parallel processing

  13. Generation Distribution • Divide generations into multiple parts as multiple jobs.

  14. Generation Distribution • File communication
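
The generation distribution on slides 13 and 14 can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `split_generations` and the choice of contiguous, near-equal ranges are assumptions. The file-based hand-off between parts mentioned on slide 14 is not modeled; the sketch only computes which generations each job would cover.

```python
def split_generations(total_generations, num_jobs):
    """Split generations 0..total_generations-1 into contiguous half-open ranges.

    Hypothetical helper: each (start, stop) range would become one Grid job.
    """
    if num_jobs < 1 or total_generations < num_jobs:
        raise ValueError("need at least one generation per job")
    base, extra = divmod(total_generations, num_jobs)
    ranges = []
    start = 0
    for job in range(num_jobs):
        # The first `extra` jobs take one additional generation each.
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    print(split_generations(10, 3))
```

Under these assumptions, ten generations over three jobs give the ranges (0, 4), (4, 7) and (7, 10), so each job would pick up where the previous one stopped.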

  15. Dynamic Job Dependency • Determine job dependency at run time.

  16. Dynamic Job Dependency • Without dynamic job dependency

  17. Dynamic Job Dependency • With dynamic job dependency
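
The transcript does not preserve the figures for slides 16 and 17, so the following is only one plausible reading of dynamic job dependency, written as a toy model. It assumes that every stage is queued on several sites at time zero, that a queued job can start once its queue wait has elapsed and its predecessor has finished, and that the predecessor is whichever replica finishes first. The names `static_makespan` and `dynamic_makespan` and all numbers are hypothetical.

```python
def static_makespan(stages, fixed_site):
    """Each stage is submitted to one fixed site only after the previous stage ends."""
    finish = 0.0
    for stage in stages:
        wait, run = stage[fixed_site]
        finish += wait + run  # the full queue wait is paid for every stage
    return finish


def dynamic_makespan(stages):
    """All stages are queued on all sites at time 0; the dependency is bound at run time."""
    finish = 0.0
    chosen = []
    for stage in stages:
        previous = finish
        best_site, best_finish = None, None
        for site, (wait, run) in stage.items():
            # Start when both the queue wait and the predecessor are done.
            candidate = max(previous, wait) + run
            if best_finish is None or candidate < best_finish:
                best_site, best_finish = site, candidate
        chosen.append(best_site)
        finish = best_finish
    return finish, chosen


if __name__ == "__main__":
    # site -> (queue wait, run time) for each of two stages; illustrative values
    stages = [{"A": (5, 10), "B": (2, 10)}, {"A": (5, 10), "B": (30, 10)}]
    print(static_makespan(stages, "A"), dynamic_makespan(stages))
```

In this model the queue wait of a later stage overlaps with the run time of the earlier one, which is the effect the conclusion attributes to the strategy.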

  18. Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitation of checkpointing with time-criticality • Checkpointing performance degradation • Checkpointing may not be compatible on a different site • Cannot reschedule job on the same site in case of site outage • Choosing the replication strategy within the fault-tolerant queue

  19. Fault-tolerant Queue Design • Architecture

  20. Fault-tolerant Queue Design • Components • Cyberaide Shell Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)

  21. Fault-tolerant Queue Design • Command Line Interface (CLI) • The queue command: cybershell> queue • Setting policy: queue> policy -task -replicate • Submitting a job: queue> submit -cmd /mydir/mpijob -mpi 16

  22. Fault-tolerant Queue Design • Lifecycle of a job

  23. Fault-tolerant Queue Design • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • Web Application
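
The replication strategy and fault detection described on slides 18 to 23 can be illustrated with a small in-memory sketch. Nothing here calls GRAM, the TeraGrid Information Services, or Cyberaide Shell: the classes `Resource` and `FaultTolerantQueue`, the `available` flag (standing in for what a resource checker would report) and the `will_fail` flag (standing in for a failure message from the job manager) are all hypothetical. Replicas in a batch would run concurrently on a real Grid; they are processed in order here to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    available: bool = True   # stand-in for the resource checker's answer
    will_fail: bool = False  # stand-in for a job failure reported by the job manager


@dataclass
class FaultTolerantQueue:
    resources: list
    replicas: int = 2
    log: list = field(default_factory=list)

    def run(self, task):
        """Replicate `task` on available resources; return the site that succeeds."""
        # Skip sites that the resource checker reports as unavailable.
        candidates = [r for r in self.resources if r.available]
        while candidates:
            batch = candidates[:self.replicas]
            candidates = candidates[self.replicas:]
            for resource in batch:
                ok = not resource.will_fail
                self.log.append((task, resource.name, "done" if ok else "failed"))
                if ok:
                    return resource.name
            # Every replica in this batch failed: reschedule on the next sites.
        return None


if __name__ == "__main__":
    queue = FaultTolerantQueue([
        Resource("siteA", available=False),
        Resource("siteB", will_fail=True),
        Resource("siteC"),
    ])
    print(queue.run("mpijob"), queue.log)
```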

  24. Fault-tolerant Queue Design • TeraGrid Information Services

  25. Dynamic WTM Workflow Management • Example scenario

  26. Run Time Analysis of WTM Job Distribution • Hypothesis • The total run time can be reduced by the generation distribution in case of failure • The run time of the original WTM job: T • Divide the original job into N parts (jobs) • The run time of each divided job is T/N • There is a lower probability of failure with T/N than T • If N increases, the total run time decreases.

  27. Run Time Analysis of WTM Job Distribution • Failure probability during time x: • The expected number of times to run a job until it succeeds (geometric distribution): T(n): run time of n generations, k: number of jobs

  28. Run Time Analysis of WTM Job Distribution • Run time for X: • Run time with m times:

  29. Run Time Analysis of WTM Job Distribution • Run time of k jobs:
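
The formulas on slides 27 to 29 did not survive in the transcript, so the sketch below does not reproduce them. It is a stand-in built on two stated assumptions: failures follow an exponential model with a constant rate, so the probability of a failure during time x is 1 - exp(-rate * x), and every failed attempt costs the full run time of that part. The one fact taken from the slides is the geometric distribution, whose expected number of attempts until the first success is 1 / (1 - p). All function names are hypothetical.

```python
import math


def failure_probability(x, failure_rate):
    """Probability of at least one failure during time x (assumed exponential model)."""
    return 1.0 - math.exp(-failure_rate * x)


def expected_attempts(p_fail):
    """Mean of a geometric distribution: attempts until the first success."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)


def expected_total_run_time(total_time, num_jobs, failure_rate):
    """Expected run time when a job of length total_time is split into num_jobs parts.

    Assumes the parts run one after another and a failed attempt wastes
    the whole run time of its part.
    """
    part = total_time / num_jobs
    p_fail = failure_probability(part, failure_rate)
    return num_jobs * part * expected_attempts(p_fail)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, round(expected_total_run_time(10.0, k, 0.1), 2))
```

Under these assumptions the expected total simplifies to T * exp(rate * T / k), which decreases as k grows and so agrees with the hypothesis on slide 26. Queue wait time is ignored here.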

  30. Evaluation • WTM application performance (generation)

  31. Evaluation • Queue wait time statistics

  32. Evaluation … • Performance overhead • More..

  33. Evaluation … • Goal: compare the run times of different types of workflow • Setup: what to measure and why

  34. Evaluation … • Workflow comparison (jobs submitted at different times)

  35. Evaluation • Simulation • Statistical model for the original WTM deployment (t: run time of a job, p: failure rate, q: average queue wait time) • Statistical model for the dynamic WTM deployment (k: number of jobs, q_i: average queue wait time of the i-th job, t_i: run time of the i-th job)
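
The two statistical models on slide 35 are not preserved in the transcript. As a rough illustration of how such models could be explored, the Monte Carlo sketch below uses the slide's symbols (t, p, q for the original deployment; t_i, q_i for the k jobs of the dynamic deployment) and adds assumptions of its own: p is treated as a per-attempt failure probability, every attempt pays the queue wait plus the full run time, jobs run one after another, and each dynamic job has its own failure probability. All function names are hypothetical.

```python
import random


def simulate_job(run_time, queue_wait, p_fail, rng):
    """Time until one job succeeds, resubmitting after every failure."""
    elapsed = 0.0
    while True:
        elapsed += queue_wait + run_time
        if rng.random() >= p_fail:  # the attempt succeeded
            return elapsed


def simulate_original(t, p, q, trials=20000, seed=0):
    """Mean completion time of the single-job deployment."""
    rng = random.Random(seed)
    return sum(simulate_job(t, q, p, rng) for _ in range(trials)) / trials


def simulate_dynamic(run_times, queue_waits, fail_probs, trials=20000, seed=0):
    """Mean completion time of k sequential jobs with per-job t_i, q_i and p_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(
            simulate_job(t_i, q_i, p_i, rng)
            for t_i, q_i, p_i in zip(run_times, queue_waits, fail_probs)
        )
    return total / trials


if __name__ == "__main__":
    print(simulate_original(t=2.0, p=0.5, q=1.0))
    print(simulate_dynamic([1.0, 1.0], [1.0, 1.0], [0.3, 0.3]))
```

With these assumptions the single-job mean should approach (q + t) / (1 - p), which gives a quick sanity check on the simulation.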

  36. Evaluation • Generation distribution optimization

  37. Evaluation … • Performance simulation

  38. Conclusion • The queue wait time in the workflow can be reduced by the dynamic job dependency strategy (with the generation distribution on multiple sites). • Fault tolerance in the WTM deployment can be achieved by the fault-tolerant queue integrating GRAM and TeraGrid Information Services.

  39. References • L. Rossman, “EPANET 2 users manual,” US Environmental Protection Agency, Cincinnati, Ohio, Tech. Rep., 2000. • “TeraGrid Information Services,” Web Page. [Online]. Available: http://info.teragrid.org/ • ——, A Globus Primer: Describing Globus Toolkit 4, Globus, August 2005. [Online]. Available: http://www.globus.org/toolkit/docs/4.0/key/GT4 Primer 0.6.pdf
