MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon
Outline • Introduction to the Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Evaluation • Conclusion
Water Threat Management • Motivation • Urban Water Distribution Systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water. • Methods • Detect contamination using sensors located across the WDSs. • Run algorithms (developed by NCSU) to determine the sensor locations so as to minimize the time spent searching for the contaminant source locations.
Existing Water Threat Management System Architecture • Optimization Engine: Runs Evolutionary Algorithm (EA) • Simulation Engine: Runs EPANET
Water Threat Management System Requirements • Requirements • Time sensitive • Massive calculation • Dynamic adaptation to a Grid environment • Fault tolerance • Our goal • The current system is not fault-tolerant - develop a fault-tolerant framework in the dynamic environment.
Motivation • Resource (Site) Outage • 5% down during 2009 • Queue Wait Time • Source: TeraGrid User & System News (http://news.teragrid.org/)
Research Objectives • Develop a fault-tolerant framework dealing with resource outages • Strategy: generation distribution on multiple sites • Reduce queue wait time • Strategy: dynamic job dependency
Water Threat Management Application • Sequential & parallel processing
Generation Distribution • Divide generations into multiple parts as multiple jobs. Distribute them on multiple sites.
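The split described on this slide can be sketched as follows. This is a minimal illustration, assuming the generations are divided into contiguous, near-equal ranges; the function name `split_generations` and the site names are hypothetical and do not come from the thesis.

```python
def split_generations(total_generations, sites):
    """Assign contiguous generation ranges to sites as evenly as possible.

    Hypothetical helper: the slides only say that generations are divided
    into multiple parts and distributed over multiple sites.
    """
    if not sites:
        raise ValueError("at least one site is required")
    base, extra = divmod(total_generations, len(sites))
    plan, start = [], 0
    for i, site in enumerate(sites):
        count = base + (1 if i < extra else 0)  # first `extra` sites get one more
        plan.append((site, start, start + count))  # half-open range [start, end)
        start += count
    return plan


plan = split_generations(100, ["site_a", "site_b", "site_c"])
for site, first, end in plan:
    print(f"{site}: generations {first}..{end - 1}")
```

Each tuple would correspond to one job; how the result of one part is handed to the next is outside this sketch.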
Dynamic Job Dependency • Problems of generation distribution on multiple sites • Additional queue wait times • Each job depends on another. • A job cannot be submitted before the prior job finishes. • Solution: determine job dependency at run time. • Submit all jobs at the same time. • Whichever job starts first computes the first set of generations.
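The run-time claim step can be sketched as below. This is an illustration only: the class `GenerationCoordinator` is hypothetical, and an in-process lock stands in for whatever cross-site coordination the actual system uses. It shows only that the part a job computes is decided by start order, not by submission order.

```python
import threading


class GenerationCoordinator:
    """Hands out generation parts in the order jobs actually start (hypothetical)."""

    def __init__(self, num_parts):
        self._next_part = 0
        self._num_parts = num_parts
        self._lock = threading.Lock()  # stand-in for cross-site coordination

    def claim_next_part(self):
        """Return the index of the next unclaimed part, or None if all are taken."""
        with self._lock:
            if self._next_part >= self._num_parts:
                return None
            part = self._next_part
            self._next_part += 1
            return part


coordinator = GenerationCoordinator(num_parts=3)
# Assumed order in which the batch queues happen to release the jobs.
start_order = ["site_c", "site_a", "site_b"]
assignment = {site: coordinator.claim_next_part() for site in start_order}
print(assignment)
```

Here the job on `site_c` leaves its queue first, so it takes the first set of generations even though all three jobs were submitted together.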
Dynamic WTM Workflow Management • Example scenario
Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitations of checkpointing for time-critical applications • Checkpointing degrades performance • Checkpoints may not be compatible across sites (heterogeneity) • A job cannot be rescheduled on the same site in case of a site outage • Decision: use the replication strategy within the fault-tolerant queue
Fault-tolerant Queue Design • Components • Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)
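How these components might fit together with a replication strategy is sketched below. All names are hypothetical, the replica count of 2 is an assumption, and the `is_available` callback is a stand-in for the Resource Checker; the slides do not specify the actual scheduling policy.

```python
from collections import deque


class FaultTolerantQueue:
    """Toy model of the task pool, resource pool, and scheduler (hypothetical)."""

    def __init__(self, resources, is_available, replicas=2):
        self.task_pool = deque()             # Task Pool
        self.resource_pool = list(resources) # Resource Pool
        self.is_available = is_available     # stand-in for the Resource Checker
        self.replicas = replicas             # assumed number of copies per task

    def submit(self, task):
        self.task_pool.append(task)

    def schedule(self):
        """Scheduler: place each pooled task on up to `replicas` available resources."""
        placements = {}
        while self.task_pool:
            task = self.task_pool.popleft()
            up = [r for r in self.resource_pool if self.is_available(r)]
            placements[task] = up[: self.replicas]
        return placements


down = {"site_b"}  # pretend the checker reports this site as unavailable
queue = FaultTolerantQueue(["site_a", "site_b", "site_c"], lambda r: r not in down)
queue.submit("wtm_run_1")
placements = queue.schedule()
print(placements)
```

Because each task is replicated on distinct sites, the loss of any single site leaves at least one copy running, which is the property replication is chosen for here.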
Fault Detection in Fault-tolerant Queue • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • GRAM service may fail when the resource is down • Publishes XML documents containing the outage information
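The two detection sources can be combined roughly as follows. This sketch makes strong assumptions: the XML layout in `SAMPLE_OUTAGES` is invented for illustration and is not the real TeraGrid Information Services schema, and `gram_reported_failure` is a plain boolean standing in for the status obtained from GRAM.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage document; the real schema is not given in the slides.
SAMPLE_OUTAGES = """
<outages>
  <outage resource="site_b" start="2009-03-01T08:00:00" end="2009-03-01T20:00:00"/>
</outages>
"""


def resources_down(xml_text, now):
    """Return the set of resources with an outage covering `now`."""
    root = ET.fromstring(xml_text.strip())
    down = set()
    for outage in root.iter("outage"):
        start = datetime.fromisoformat(outage.get("start"))
        end = datetime.fromisoformat(outage.get("end"))
        if start <= now < end:
            down.add(outage.get("resource"))
    return down


def is_faulty(resource, gram_reported_failure, down):
    """A job is treated as failed if GRAM says so or its resource is in an outage."""
    return gram_reported_failure or resource in down


down_now = resources_down(SAMPLE_OUTAGES, datetime(2009, 3, 1, 12, 0))
print(is_faulty("site_b", gram_reported_failure=False, down=down_now))
```

The second check matters because, as the slide notes, the GRAM service itself may be unreachable when the resource is down, so a missing failure message alone is not evidence that the job is healthy.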
Evaluation – WTM performance • WTM application performance (original)
Evaluation – Queue Wait Time • Queue wait time statistics
Evaluation – Performance Overhead • Performance overhead • Integrating a fault-tolerant framework usually causes performance degradation • No performance loss in our framework
Evaluation – Workflow Performance • Run time comparison of different workflow types • Original deployment vs. fault-tolerant deployment • Dynamic job dependency vs. static job dependency • Each type of deployment is tested in the real Grid system, including queue wait time
Evaluation – Workflow Performance • Workflow comparison results (figures: Experiment 1, Experiment 2, Experiment 3)
Simulation – Worst Case Run Time Comparison • A threat management system must deliver results under any circumstances. • Thus, the worst-case run time is a critical factor in the Water Threat Management system.
Simulation – Worst Case Run Time Comparison • Simulation setup • The generations are equally distributed among the machines. • Use the 2009 TeraGrid outage data. • Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.
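The submission schedule and outage check in this setup can be sketched as below. The outage interval, queue wait, and run time are placeholder values chosen for illustration, not the 2009 TeraGrid data or the thesis parameters, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5), "EST")


def submission_times(start, end, step=timedelta(minutes=5)):
    """Yield one submission time every `step`, from `start` up to (not including) `end`."""
    t = start
    while t < end:
        yield t
        t += step


def overlaps_outage(run_start, run_end, outages):
    """True if the half-open run interval intersects any half-open outage interval."""
    return any(run_start < o_end and o_start < run_end for o_start, o_end in outages)


start = datetime(2009, 1, 1, 0, 0, tzinfo=EST)  # 1/1/2009 12:00 am EST
# Placeholder values, not taken from the thesis:
outages = [(start + timedelta(minutes=30), start + timedelta(minutes=90))]
queue_wait = timedelta(minutes=10)
run_time = timedelta(minutes=20)

submissions = list(submission_times(start, start + timedelta(hours=2)))
failed = [
    t for t in submissions
    if overlaps_outage(t + queue_wait, t + queue_wait + run_time, outages)
]
print(f"{len(failed)} of {len(submissions)} submissions would hit the outage")
```

Sweeping the submission time across the whole year in this way is what exposes the worst case: the longest completion time over all start times, rather than the average.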
Simulation – Worst Case Run Time Comparison • Simulation queue wait time setup (unit: minutes)
Simulation – Worst Case Run Time Comparison • Source: TeraGrid User & System News (http://news.teragrid.org/)
Conclusion • Achievement: • Worst-case run time is significantly reduced. • Limitation: • In “general” cases, the dynamic workflow shows performance degradation. • This is due to the low failure rate and the difference in compute performance between machines. • Possible improvement: • Migrate the generation process to a faster machine whenever possible.