MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon
Outline • Introduction to Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Analysis of Fault-Tolerant Workflow • Evaluation • Conclusion
Water Threat Management • Managing contamination incidents in urban Water Distribution Systems (WDSs) • Simulating water quality and hydraulic behavior using EPANET. • Determining locations of sensors in WDSs to detect contaminations. • Searching the contaminant source location.
Water Threat Management Requirements • Time sensitivity • Large computational power • Dynamic adaptation to a Grid environment • Fault tolerance
Motivation • A Grid is a fault-prone environment. Source: TeraGrid User & System News (http://news.teragrid.org/)
Motivation • TeraGrid outage rate during 2009. Source: TeraGrid User & System News (http://news.teragrid.org/)
Motivation • Queue wait time
Research Objectives • Fault-tolerant Water Threat Management system deployment • Generation distribution on multiple sites • Reducing queue wait time • Dynamic job dependency
Water Threat Management Application • Sequential & parallel processing
Generation Distribution • Divide generations into multiple parts as multiple jobs.
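The division of generations into jobs can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `split_generations` is hypothetical, and it assumes generations are numbered from 0 and that each job receives a contiguous, near-equal block.

```python
def split_generations(total_generations: int, num_jobs: int) -> list[range]:
    """Split generations 0..total_generations-1 into contiguous parts.

    Hypothetical helper: the slides do not say how the parts are sized,
    so near-equal contiguous blocks are assumed here.
    """
    if num_jobs < 1 or num_jobs > total_generations:
        raise ValueError("num_jobs must be between 1 and total_generations")
    base, extra = divmod(total_generations, num_jobs)
    parts = []
    start = 0
    for i in range(num_jobs):
        # The first `extra` jobs each take one additional generation.
        size = base + (1 if i < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


if __name__ == "__main__":
    for job_id, part in enumerate(split_generations(10, 3)):
        print(job_id, list(part))
```

Each part would then be submitted as its own job, with a later part starting from the output of the part before it.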
Generation Distribution • File communication
Dynamic Job Dependency • Determine job dependency at run time.
Dynamic Job Dependency • Without dynamic job dependency
Dynamic Job Dependency • With dynamic job dependency
Fault-tolerant Queue • Most common fault-tolerance strategies in a Grid: replication and checkpointing • Limitations of checkpointing for time-critical jobs: checkpointing degrades performance; a checkpoint may not be compatible with a different site; a job cannot be rescheduled on the same site during a site outage • The fault-tolerant queue therefore uses the replication strategy
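The replication strategy can be illustrated with a small sketch: the same task is started on several sites and the first successful result is used. Everything here is an assumption made for illustration. `run_replicated` is a hypothetical name, each "site" is modeled as a plain Python callable, and threads stand in for remote submissions; none of it is the Cyberaide Shell or GRAM API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Sequence


def run_replicated(replicas: Sequence[Callable[[], str]]) -> str:
    """Run all replicas concurrently and return the first successful result.

    Hypothetical sketch: each callable models one site running the same job.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    errors = []
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica) for replica in replicas]
        for future in as_completed(futures):
            try:
                # Leaving the `with` block still waits for the other replicas;
                # a real queue would cancel them on the remote sites instead.
                return future.result()
            except Exception as exc:  # a failed replica models a site fault
                errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


if __name__ == "__main__":
    def site_a() -> str:
        raise RuntimeError("site outage")

    def site_b() -> str:
        return "result from site B"

    print(run_replicated([site_a, site_b]))
```

Unlike checkpointing, this needs no restart on the failed site, at the cost of consuming resources on every site that runs a replica.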
Fault-tolerant Queue Design • Architecture
Fault-tolerant Queue Design • Components • Cyberaide Shell Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)
Fault-tolerant Queue Design • Command Line Interface (CLI) • The queue command: cybershell> queue • Setting policy: queue> policy -task -replicate • Submitting a job: queue> submit -cmd /mydir/mpijob -mpi 16
Fault-tolerant Queue Design • Lifecycle of a job
Fault-tolerant Queue Design • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • Web Application
Fault-tolerant Queue Design • TeraGrid Information Services
Dynamic WTM Workflow Management • Example scenario
Run Time Analysis of WTM Job Distribution • Hypothesis • The total run time can be reduced by the generation distribution in case of failure • The run time of the original WTM job: T • Divide the original job into N parts (jobs) • The run time of each divided job is T/N • There is a lower probability of failure with T/N than T • If N increases, the total run time decreases.
Run Time Analysis of WTM Job Distribution • Failure probability during time x: • The expected number of times to run a job until it succeeds (geometric distribution): T(n): run time of n generations, k: number of jobs
Run Time Analysis of WTM Job Distribution • Run time for X: • Run time with m times:
Run Time Analysis of WTM Job Distribution • Run time of k jobs:
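The formulas on these slides did not survive in this transcript, so the sketch below is not a reconstruction of them. It is one possible model under stated assumptions: failures arrive at a constant rate (so the failure probability during time x is 1 - exp(-rate * x)), a failed attempt costs the full run time of that job, every attempt is independent, and queue wait time is ignored. The function names and the `failure_rate` parameter are hypothetical. The only textbook fact used is that a geometric distribution with success probability 1 - p has mean 1 / (1 - p).

```python
import math


def failure_probability(x: float, failure_rate: float) -> float:
    """Probability of at least one failure during time x (assumed exponential model)."""
    return 1.0 - math.exp(-failure_rate * x)


def expected_attempts(p_fail: float) -> float:
    """Mean number of runs until the first success (geometric distribution)."""
    return 1.0 / (1.0 - p_fail)


def expected_total_run_time(total_time: float, k: int, failure_rate: float) -> float:
    """Expected run time when a job of length total_time is split into k equal jobs.

    Assumption: a failed attempt is charged its full run time, which
    overestimates the cost of failures that happen early in a run.
    """
    part = total_time / k
    p_fail = failure_probability(part, failure_rate)
    return k * part * expected_attempts(p_fail)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, round(expected_total_run_time(10.0, k, 0.1), 3))
```

Under these assumptions the expected total simplifies to total_time * exp(failure_rate * total_time / k), which decreases as k grows, in line with the hypothesis stated earlier.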
Evaluation • WTM application performance (generation)
Evaluation • Queue wait time statistics
Evaluation • Performance overhead
Evaluation • Goal: compare the run time of different types of workflow • Setup, what to measure, and why
Evaluation • Workflow comparison (jobs submitted at different times)
Evaluation • Simulation • Statistical model for the original WTM deployment (t: run time of a job, p: failure rate, q: average queue wait time) • Statistical model for the dynamic WTM deployment (k: number of jobs, q_i: average queue wait time of the i-th job, t_i: run time of the i-th job)
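The statistical models themselves are not preserved in this transcript. As a hedged illustration of how such a simulation could be set up with the symbols t, p, and q, the sketch below assumes that every attempt waits q in the queue, runs for t, and fails independently with probability p, after which the job is resubmitted. For the dynamic deployment it assumes k jobs run one after another, each with its own (t_i, p_i, q_i); the per-job failure probability p_i is an added assumption, and adding the queue waits in series ignores any overlap that dynamic job dependency may achieve. All function names are hypothetical.

```python
import random


def simulate_job(t: float, p: float, q: float, rng: random.Random) -> float:
    """Elapsed time until one job succeeds, resubmitting after each failure."""
    elapsed = 0.0
    while True:
        elapsed += q + t          # one attempt: queue wait plus run time
        if rng.random() >= p:     # the attempt succeeds with probability 1 - p
            return elapsed


def simulate_workflow(jobs: list[tuple[float, float, float]], rng: random.Random) -> float:
    """Elapsed time for jobs (t_i, p_i, q_i) executed strictly in sequence."""
    return sum(simulate_job(t, p, q, rng) for t, p, q in jobs)


def mean_time(jobs: list[tuple[float, float, float]], trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean workflow time, seeded for repeatability."""
    rng = random.Random(seed)
    return sum(simulate_workflow(jobs, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    original = [(10.0, 0.5, 2.0)]     # one long job (illustrative numbers only)
    dynamic = [(2.5, 0.16, 2.0)] * 4  # four shorter jobs (illustrative numbers only)
    print("original:", mean_time(original))
    print("dynamic: ", mean_time(dynamic))
```

Under this model the mean time of a single job is (q + t) / (1 - p), which gives a quick check on the simulated estimate. The numbers in the usage example are made up for illustration and are not measurements from the thesis.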
Evaluation • Generation distribution optimization
Evaluation • Performance simulation
Conclusion • The queue wait time in the workflow can be reduced by the dynamic job dependency strategy (with the generation distribution on multiple sites). • Fault tolerance in the WTM deployment can be achieved by the fault-tolerant queue integrating GRAM and TeraGrid Information Services.
References • L. Rossman, “EPANET 2 users manual,” US Environmental Protection Agency, Cincinnati, Ohio, Tech. Rep., 2000. • “TeraGrid Information Services,” Web Page. [Online]. Available: http://info.teragrid.org/ • ——, A Globus Primer: Describing Globus Toolkit 4, Globus, August 2005. [Online]. Available: http://www.globus.org/toolkit/docs/4.0/key/GT4 Primer 0.6.pdf