MS Thesis Defense Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon

Outline • Introduction to the Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Evaluation • Conclusion

Water Threat Management • Motivation • Urban Water Distribution Systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water. • Methods • Detect contamination using sensors located across the WDSs. • Run algorithms (developed by NCSU) to determine the sensor locations that minimize the search time for finding the contaminant source locations.

Existing Water Threat Management System Architecture • Optimization Engine: Runs Evolutionary Algorithm (EA) • Simulation Engine: Runs EPANET

Water Threat Management System Requirements • Requirements • Time sensitive • Massive calculation • Dynamic adaptation to a Grid environment • Fault tolerance • Our goal • The current system is not fault-tolerant, so we develop a fault-tolerant framework for the dynamic Grid environment.

Motivation • Resource (Site) Outage • 5% down during 2009 • Queue Wait Time • Source: TeraGrid User & System News (http://news.teragrid.org/)

Research Objectives • Develop a fault-tolerant framework dealing with resource outages • Strategy: generation distribution on multiple sites • Reduce queue wait time • Strategy: dynamic job dependency

Water Threat Management Application • Sequential & parallel processing

Generation Distribution • Divide generations into multiple parts as multiple jobs. Distribute them on multiple sites.
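
The division step above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names `split_generations` and `assign_to_sites`, the contiguous near-equal split, and the round-robin site assignment are all assumptions made for the example.

```python
def split_generations(total_generations, num_jobs):
    """Split generations 0..total-1 into num_jobs contiguous, near-equal ranges.

    Returns a list of (start, end) pairs, end exclusive. Hypothetical helper.
    """
    if num_jobs <= 0 or total_generations < num_jobs:
        raise ValueError("need at least one generation per job")
    base, extra = divmod(total_generations, num_jobs)
    ranges = []
    start = 0
    for i in range(num_jobs):
        # The first `extra` jobs take one additional generation each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def assign_to_sites(ranges, sites):
    """Pair each generation range with a site, round-robin (an assumption)."""
    return [(sites[i % len(sites)], r) for i, r in enumerate(ranges)]


# Site names are placeholders, not real TeraGrid resources.
plan = assign_to_sites(split_generations(100, 4), ["siteA", "siteB"])
```

With 100 generations and 4 jobs, each job covers 25 generations, and the jobs alternate between the two placeholder sites.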

Dynamic Job Dependency • Problems of generation distribution on multiple sites • Additional queue wait times • Each job is dependent on another. • Cannot submit a job before the prior job finishes. • Solution: determine job dependency at run time. • Submit all jobs at the same time. • Whichever job starts first computes the first set of generations.
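
One way to picture run-time dependency resolution is a small coordinator that hands the next set of generations to whichever job asks first. The sketch below is an assumption-laden illustration: the class `GenerationCoordinator` and its methods are hypothetical, and it allows only one set to be in progress at a time because each set depends on the result of the previous one.

```python
class GenerationCoordinator:
    """Hands out generation sets in order to whichever job starts first.

    Hypothetical sketch; the thesis implementation may differ.
    """

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.next_set = 0
        self.in_progress = None  # (set_index, job_id) currently being computed
        self.history = []        # completed (set_index, job_id) pairs

    def claim(self, job_id):
        """Return the next set index, or None if the job must wait or all is done."""
        if self.in_progress is not None or self.next_set >= self.num_sets:
            return None
        self.in_progress = (self.next_set, job_id)
        return self.next_set

    def complete(self, job_id, set_index):
        """Mark a set as finished so the next one can be claimed."""
        if self.in_progress != (set_index, job_id):
            raise ValueError("job does not own this generation set")
        self.history.append(self.in_progress)
        self.in_progress = None
        self.next_set += 1


# Jobs A and B are submitted together; B leaves its queue first.
coord = GenerationCoordinator(num_sets=2)
first = coord.claim("B")      # B starts first and takes set 0
blocked = coord.claim("A")    # A started, but set 0 is still running
coord.complete("B", first)
second = coord.claim("A")     # A now takes set 1
coord.complete("A", second)
```

No job is bound to a particular set at submission time, so queue waits on different sites overlap instead of adding up.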

Dynamic WTM Workflow Management • Example scenario

Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitations of checkpointing under time-criticality • Checkpointing degrades performance • A checkpoint may not be compatible with a different site (heterogeneity) • Cannot reschedule a job on the same site in case of a site outage • We therefore choose the replication strategy within the fault-tolerant queue

Fault-tolerant Queue Design • Components • Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)
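
A rough sketch of how a task pool, a resource pool, and a scheduler using replication could fit together is shown below. Everything here is hypothetical (the class names, the `replicas` parameter, the placement policy); the slides name the components but do not describe their interfaces.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    up: bool = True  # would be maintained by a resource checker


class FaultTolerantQueue:
    """Toy queue that places copies of each task on several live resources."""

    def __init__(self, resources, replicas=2):
        self.task_pool = deque()
        self.resource_pool = list(resources)
        self.replicas = replicas

    def submit(self, task):
        self.task_pool.append(task)

    def schedule(self):
        """Place each pending task on up to `replicas` resources that are up."""
        placements = {}
        while self.task_pool:
            task = self.task_pool.popleft()
            available = [r.name for r in self.resource_pool if r.up]
            if not available:
                # Nothing is reachable; keep the task queued for a later pass.
                self.task_pool.appendleft(task)
                break
            placements[task] = available[: self.replicas]
        return placements


# Placeholder site names; siteB is marked as down.
queue = FaultTolerantQueue(
    [Resource("siteA"), Resource("siteB", up=False), Resource("siteC")]
)
queue.submit("job-1")
placements = queue.schedule()
```

Because each task runs on more than one site, the loss of a single site does not stop the task, which is the property replication is chosen for.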

Fault Detection in Fault-tolerant Queue • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • GRAM service may fail when the resource is down • Publishes XML documents containing the outage information
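
As an illustration of the second detection path, the sketch below reads outage records from an XML document and reports which resources are down at a given time. The XML layout (`outages`, `outage`, `resource`, `start`, `end`) and the host names are invented for this example; the real TeraGrid Information Services schema is not given in the slides and will differ.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical document; not the actual TeraGrid outage format.
SAMPLE_XML = """
<outages>
  <outage>
    <resource>site-a.example.org</resource>
    <start>2009-03-01T00:00:00</start>
    <end>2009-03-02T00:00:00</end>
  </outage>
  <outage>
    <resource>site-b.example.org</resource>
    <start>2009-03-05T08:00:00</start>
    <end>2009-03-05T12:00:00</end>
  </outage>
</outages>
"""


def resources_down(xml_text, at):
    """Return the set of resources with an outage window covering `at`."""
    root = ET.fromstring(xml_text)
    down = set()
    for outage in root.findall("outage"):
        start = datetime.fromisoformat(outage.findtext("start"))
        end = datetime.fromisoformat(outage.findtext("end"))
        if start <= at < end:
            down.add(outage.findtext("resource"))
    return down


down_now = resources_down(SAMPLE_XML, datetime(2009, 3, 1, 12, 0))
```

A resource checker could poll such a document periodically and mark the listed resources as unavailable, which covers the case where GRAM itself cannot respond because the site is down.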

Evaluation – WTM performance • WTM application performance (original)

Evaluation – Queue Wait Time • Queue wait time statistics

Evaluation – Performance Overhead • Performance overhead • Integrating a fault-tolerant framework usually causes performance degradation • No performance loss in our framework

Evaluation – Workflow Performance • Run time comparison of different workflow types • Original deployment vs. fault-tolerant deployment • Dynamic job dependency vs. static job dependency • Test each type of deployment on the real Grid system, including queue wait time

Evaluation – Workflow Performance • Workflow comparison results (figures: Experiment 1, Experiment 2, Experiment 3)

Simulation – Worst Case Run Time Comparison • A threat management system must deliver results under any circumstances. • Thus, the worst-case run time is a critical factor in the Water Threat Management system.

Simulation – Worst Case Run Time Comparison • Simulation setup • The generations are equally distributed among the machines. • Use the 2009 TeraGrid outage data. • Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.
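
The submission schedule in this setup can be sketched as below. The helper names are hypothetical, the datetimes are naive (the EST time zone from the slide is not modelled), and the outage interval in the usage lines is made up rather than taken from the 2009 TeraGrid data.

```python
from datetime import datetime, timedelta


def submission_times(start, end, step=timedelta(minutes=5)):
    """Yield submission times from `start` (inclusive) to `end` (exclusive)."""
    t = start
    while t < end:
        yield t
        t += step


def overlaps_outage(job_start, job_end, outages):
    """True if the job window intersects any (outage_start, outage_end) pair."""
    return any(
        job_start < o_end and o_start < job_end for o_start, o_end in outages
    )


# One simulated hour of submissions starting 1/1/2009 12:00 am.
times = list(submission_times(datetime(2009, 1, 1), datetime(2009, 1, 1, 1)))

# Illustrative outage window, not real data.
outages = [(datetime(2009, 1, 1, 0, 30), datetime(2009, 1, 1, 2, 0))]
hit = overlaps_outage(
    datetime(2009, 1, 1, 0, 10), datetime(2009, 1, 1, 0, 40), outages
)
```

Running every submission time against recorded outage windows is what lets a simulation like this report both a median and a worst-case run time.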

Simulation – Worst Case Run Time Comparison • Simulation queue wait time setup (unit: minutes)

Simulation – Worst Case Run Time Comparison • Source: TeraGrid User & System News (http://news.teragrid.org/)

Simulation – Median Run Time, Worst Case (Max.) Run Time

Conclusion • Achievement: • Worst-case run time is significantly reduced. • Limitation: • In "general" cases, the dynamic workflow has performance degradation. • This is due to the low failure rate and the difference in compute performance between different machines. • Possible improvement: • Migrate the generation process to a faster machine whenever possible.
