MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

MS Thesis Defense • Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon

Outline • Introduction to Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Analysis of Fault-Tolerant Workflow • Evaluation • Conclusion

Water Threat Management • Managing contamination incidents in urban Water Distribution Systems (WDSs) • Simulating water quality and hydraulic behavior using EPANET. • Determining locations of sensors in WDSs to detect contaminations. • Searching the contaminant source location.

Existing Water Threat Management System Architecture

EPANET Simulation in the Simulation Engine

Water Threat Management Requirements • Time sensitivity • Large computational power • Dynamic adaptation to a Grid environment • Fault tolerance

Motivation • A Grid is a fault-prone environment. • Source: TeraGrid User & System News (http://news.teragrid.org/)

Motivation • Outage rate (TeraGrid) during 2009 • Source: TeraGrid User & System News (http://news.teragrid.org/)

Motivation • Queue wait time

Research Objectives • Fault-tolerant Water Threat Management system deployment • Generation distribution on multiple sites • Reducing queue wait time • Dynamic job dependency

Water Threat Management Application • Sequential & parallel processing

Generation Distribution • Divide generations into multiple parts as multiple jobs.
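A minimal sketch of how generations might be divided into jobs, assuming contiguous, near-equal chunks. The function name `split_generations` and the chunking rule are illustrative assumptions; the slides do not specify how the split is made.

```python
def split_generations(total_generations: int, num_jobs: int) -> list[range]:
    """Split generation indices 0..total_generations-1 into contiguous chunks.

    Hypothetical helper: the first (total % num_jobs) jobs get one extra
    generation so that chunk sizes differ by at most one.
    """
    if num_jobs <= 0:
        raise ValueError("num_jobs must be positive")
    if total_generations < num_jobs:
        raise ValueError("need at least one generation per job")

    base, extra = divmod(total_generations, num_jobs)
    chunks = []
    start = 0
    for i in range(num_jobs):
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    for job_id, gens in enumerate(split_generations(10, 3)):
        print(f"job {job_id}: generations {gens.start}..{gens.stop - 1}")
```

Each chunk would then be submitted as its own job, with the output of one chunk passed to the next.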

Generation Distribution • File communication

Dynamic Job Dependency • Determine job dependency at run time.

Dynamic Job Dependency • Without dynamic job dependency

Dynamic Job Dependency • With dynamic job dependency
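The effect of dynamic job dependency on queue wait time can be illustrated with a toy timing model. This is a sketch under stated assumptions, not the thesis's own model: without dynamic dependency, each job is submitted only after its predecessor finishes, so every queue wait adds to the total; with dynamic dependency, all jobs are assumed to be submitted at time zero and each starts as soon as both its queue wait has elapsed and its predecessor has finished. The function names are hypothetical.

```python
def static_makespan(queue_waits, run_times):
    """Jobs submitted one after another: every queue wait is paid in sequence."""
    return sum(q + t for q, t in zip(queue_waits, run_times))


def dynamic_makespan(queue_waits, run_times):
    """All jobs submitted at time 0 (assumption); queue waits overlap.

    Job i starts when its own queue wait is over AND job i-1 has finished.
    """
    finish = 0.0
    for q, t in zip(queue_waits, run_times):
        start = max(q, finish)
        finish = start + t
    return finish


if __name__ == "__main__":
    waits = [30, 50, 20]   # illustrative queue wait times (minutes)
    runs = [10, 10, 10]    # illustrative run times (minutes)
    print("static :", static_makespan(waits, runs))
    print("dynamic:", dynamic_makespan(waits, runs))
```

In this toy model the dynamic schedule is never longer than the static one, because each queue wait can overlap with earlier jobs.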

Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitation of checkpointing with time-criticality • Checkpointing performance degradation • Checkpointing may not be compatible on a different site • Cannot reschedule job on the same site in case of site outage • Choosing the replication strategy within the fault-tolerant queue
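The replication strategy can be sketched as follows: the same job is submitted to several sites at once, and the first replica that succeeds provides the result. This is an illustrative sketch only. The `submit` callable stands in for a real Grid submission and is a hypothetical placeholder, not part of Cyberaide Shell or the Globus Toolkit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_replicated(job, sites, submit):
    """Run `job` on every site concurrently and return the first success.

    `submit(job, site)` is a hypothetical callable that returns a result
    or raises an exception if the job or the site fails.
    Returns a (site, result) tuple.
    """
    if not sites:
        raise ValueError("at least one site is required")

    errors = {}
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {pool.submit(submit, job, site): site for site in sites}
        for future in as_completed(futures):
            site = futures[future]
            try:
                # Note: leaving the `with` block still waits for the
                # remaining replicas; a real queue would cancel them.
                return site, future.result()
            except Exception as exc:  # a failed replica is recorded, not fatal
                errors[site] = exc
    raise RuntimeError(f"all replicas failed: {errors}")


if __name__ == "__main__":
    def fake_submit(job, site):
        if site == "siteA":
            raise RuntimeError("simulated site outage")
        return f"{job} finished on {site}"

    print(run_replicated("wtm-job", ["siteA", "siteB"], fake_submit))
```

Unlike checkpointing, this approach needs no restart on the failed site, at the cost of consuming resources on several sites for one job.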

Fault-tolerant Queue Design • Architecture

Fault-tolerant Queue Design • Components • Cyberaide Shell Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)

Fault-tolerant Queue Design • Command Line Interface (CLI) • The queue command cybershell> queue • Setting policy queue> policy -task -replicate • Submitting a job queue> submit -cmd /mydir/mpijob -mpi 16

Fault-tolerant Queue Design • Lifecycle of a job

Fault-tolerant Queue Design • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • Web Application

Fault-tolerant Queue Design • TeraGrid Information Services

Dynamic WTM Workflow Management • Example scenario

Run Time Analysis of WTM Job Distribution • Hypothesis • The total run time can be reduced by the generation distribution in case of failure • The run time of the original WTM job: T • Divide the original job into N parts (jobs) • The run time of each divided job is T/N • There is a lower probability of failure with T/N than T • If N increases, the total run time decreases.

Run Time Analysis of WTM Job Distribution • Failure probability during time x: • The expected number of times to run a job until it succeeds (geometric distribution): • Notation: T(n): run time of n generations; k: number of jobs

Run Time Analysis of WTM Job Distribution • Run time for X: • Run time with m times:

Run Time Analysis of WTM Job Distribution • Run time of k jobs:
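The formulas on these slides did not survive extraction, so the following is only a sketch of the kind of calculation involved, under assumptions that are not from the source: failures arrive at a constant rate (so the failure probability during time x is 1 - exp(-rate * x)), each failed attempt costs the full run time of that part (a pessimistic bound), and queue wait time is ignored. The only standard fact used is that the expected number of attempts in a geometric distribution with failure probability p is 1 / (1 - p).

```python
import math


def failure_probability(x: float, rate: float) -> float:
    """Probability of at least one failure during time x (assumed exponential model)."""
    return 1.0 - math.exp(-rate * x)


def expected_attempts(p_fail: float) -> float:
    """Expected number of runs until the first success (geometric distribution)."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)


def expected_total_run_time(total_time: float, num_jobs: int, rate: float) -> float:
    """Expected run time when a job of length total_time is split into num_jobs parts.

    Assumes equal parts of length total_time / num_jobs that run one after
    another, each retried from scratch until it succeeds.
    """
    part = total_time / num_jobs
    return num_jobs * part * expected_attempts(failure_probability(part, rate))


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, round(expected_total_run_time(10.0, k, 0.1), 2))
```

Under these assumptions the expected run time shrinks as the number of parts grows, which matches the hypothesis that dividing the job reduces total run time in the presence of failures.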

Evaluation • WTM application performance (generation)

Evaluation • Queue wait time statistics

Evaluation • Performance overhead

Evaluation • Goal: compare the run time of different types of workflow • Setup: what to measure and why

Evaluation • Workflow comparison (jobs submitted at different times)

Evaluation • Simulation • Statistical model for the original WTM deployment (t: run time of a job, p: failure rate, q: avg. queue wait time) • Statistical model for the dynamic WTM deployment (k: number of jobs, q_i: avg. queue wait time of the i-th job, t_i: run time of the i-th job)
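The statistical models themselves are not recoverable from the slide text. As a hedged illustration of how such a simulation could be set up for the original (single-job) deployment, the sketch below uses the slide's symbols t, p and q and assumes that every attempt pays the queue wait q plus the full run time t, and that attempts fail independently with probability p. These modelling choices and the function names are assumptions, not the thesis's model.

```python
import random


def simulate_completion_time(q: float, t: float, p: float, rng: random.Random) -> float:
    """Simulate one job: resubmit (queue wait q + run time t) until an attempt succeeds."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    total = 0.0
    while True:
        total += q + t
        if rng.random() >= p:  # attempt succeeded
            return total


def mean_completion_time(q: float, t: float, p: float,
                         trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean completion time over `trials` runs."""
    rng = random.Random(seed)
    return sum(simulate_completion_time(q, t, p, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the thesis.
    print(mean_completion_time(q=30.0, t=60.0, p=0.2))
```

Under these assumptions the estimate should approach (q + t) / (1 - p), which gives a quick sanity check on the simulation.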

Evaluation • Generation distribution optimization

Evaluation • Performance simulation

Conclusion • The queue wait time in the workflow can be reduced by the dynamic job dependency strategy (with the generation distribution on multiple sites). • Fault tolerance in the WTM deployment can be achieved by the fault-tolerant queue integrating GRAM and TeraGrid Information Services.

References • L. Rossman, “EPANET 2 users manual,” US Environmental Protection Agency, Cincinnati, Ohio, Tech. Rep., 2000. • “TeraGrid Information Services,” Web Page. [Online]. Available: http://info.teragrid.org/ • ——, A Globus Primer: Describing Globus Toolkit 4, Globus, August 2005. [Online]. Available: http://www.globus.org/toolkit/docs/4.0/key/GT4 Primer 0.6.pdf
