MS Thesis Defense: Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project • Young Suk Moon • Chair: Dr. Hans-Peter Bischof • Reader: Dr. Gregor von Laszewski • Observer: Dr. Minseok Kwon
Outline • Introduction to Water Threat Management Project • Motivation • Research Objectives • Fault-Tolerant Queue • Evaluation • Conclusion
Water Threat Management • Motivation • Urban Water Distribution Systems (WDSs) can be an easy target of terror attacks, e.g. contamination of the water. • Methods • Detect contamination using sensors located across the WDSs. • Run algorithms (developed by NCSA) to determine sensor locations that minimize the time needed to find the contaminant source locations (sensors are expensive).
Water Threat Management • Requirements • Time sensitive • Massive calculation • Dynamic adaptation to a Grid environment • Fault tolerance • Our goals • The current system is not fault-tolerant. • Develop a fault-tolerant framework and improve performance in a fault-prone environment.
Motivation – (1) Resource Outages • TeraGrid resource outages during 2009. TeraGrid User & System News (http://news.teragrid.org/)
Motivation – (1) Resource Outages • Outage Rate (total outage time / year) in 2009 TeraGrid User & System News (http://news.teragrid.org/)
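The outage rate defined above (total outage time / year) can be computed from a list of outage windows. The following is a minimal sketch: the function name `outage_rate` and the example windows are hypothetical and are not taken from the TeraGrid news archive, and the windows are assumed not to overlap.

```python
from datetime import datetime, timedelta

def outage_rate(outages, year):
    """Fraction of the given year during which a resource was down.

    outages: (start, end) datetime pairs, assumed non-overlapping.
    """
    year_start = datetime(year, 1, 1)
    year_end = datetime(year + 1, 1, 1)
    down = timedelta()
    for start, end in outages:
        # Clip each outage to the year so spill-over is not counted.
        s = max(start, year_start)
        e = min(end, year_end)
        if e > s:
            down += e - s
    return down / (year_end - year_start)

# Hypothetical outage windows, for illustration only.
example_outages = [
    (datetime(2009, 3, 1, 8), datetime(2009, 3, 1, 20)),    # 12 hours
    (datetime(2009, 7, 10, 0), datetime(2009, 7, 11, 12)),  # 36 hours
]
rate = outage_rate(example_outages, 2009)
print(f"Outage rate: {rate:.4%}")
```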
Motivation – (1) Resource Outages • WTM deployment problem with outages TeraGrid User & System News (http://news.teragrid.org/)
Motivation – (2) Queue Wait Time • Queue wait time
Research Objectives • Develop a fault-tolerant framework dealing with resource outages • Strategy: generation distribution on multiple sites • Reduce queue wait time • Strategy: dynamic job dependency
Water Threat Management Application • Sequential & parallel processing
Generation Distribution • Divide generations into multiple parts as multiple jobs.
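Dividing the generations into parts can be sketched as follows. This is an illustration only, assuming the generations are numbered from 0 and split as evenly as possible; the function name `split_generations` is hypothetical and the actual WTM partitioning may differ.

```python
def split_generations(total_generations, num_jobs):
    """Split generations 0..total_generations-1 into contiguous parts,
    one part per job, with sizes differing by at most one."""
    if num_jobs < 1 or total_generations < num_jobs:
        raise ValueError("need at least one generation per job")
    base, extra = divmod(total_generations, num_jobs)
    parts = []
    start = 0
    for i in range(num_jobs):
        # The first `extra` jobs take one additional generation.
        size = base + (1 if i < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts

parts = split_generations(10, 3)
for job_index, part in enumerate(parts):
    print(f"job {job_index}: generations {part.start}..{part.stop - 1}")
```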
Generation Distribution • File communication
Dynamic Job Dependency • Problems of generation distribution on multiple sites • Additional queue wait times • Each job is dependent on another. • Cannot submit a job before the prior job finishes. • Solution: determine job dependency at run time. • Submit jobs at the same time. • Whichever job starts first computes the first set of generations.
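The run-time dependency idea can be sketched as a shared claim list: all jobs are submitted together, and each job claims the next unclaimed set of generations when it actually starts. The class `GenerationClaims` is hypothetical, and an in-process lock stands in for whatever cross-site coordination the real workflow uses.

```python
import threading

class GenerationClaims:
    """Hands out generation sets in order to whichever job starts next."""

    def __init__(self, parts):
        self._parts = list(parts)
        self._next = 0
        self._lock = threading.Lock()
        self.assignments = {}  # job id -> generation set it claimed

    def claim(self, job_id):
        # The lock makes the claim atomic when several jobs start at once.
        with self._lock:
            if self._next >= len(self._parts):
                return None  # nothing left to compute
            part = self._parts[self._next]
            self._next += 1
            self.assignments[job_id] = part
            return part

claims = GenerationClaims(["generations 0-49", "generations 50-99"])
# Suppose job B leaves its queue before job A does.
first = claims.claim("job-B")
second = claims.claim("job-A")
print(first, "|", second)
```

In this sketch the order in which jobs leave their queues, not the order in which they were submitted, decides which job computes which set.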
Dynamic WTM Workflow Management • Example scenario
Fault-tolerant Queue • Most common fault-tolerant strategies in a Grid • Replication • Checkpointing • Limitations of checkpointing under time-criticality • Checkpointing degrades performance • Checkpoints may not be compatible with a different site (heterogeneity) • Cannot reschedule a job on the same site in case of a site outage • We therefore choose the replication strategy for the fault-tolerant queue
Fault-tolerant Queue Design • Architecture
Fault-tolerant Queue Design • Components • Command Line Interface • Task Pool • Resource Pool • Scheduler • Resource Checker (integration with the TeraGrid Information Services)
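How the listed components might fit together can be sketched as below. This is not the thesis implementation: the class `FaultTolerantQueue`, its methods, and the resource name "site-c" are hypothetical, the resource checker is reduced to a manual `mark_down` call, and no Globus or TeraGrid API is used.

```python
from collections import deque

class FaultTolerantQueue:
    """Toy task pool + resource pool + scheduler with replication."""

    def __init__(self, resources, replicas=2):
        # Resource pool: resource name -> True while the resource is up.
        self.resource_pool = {name: True for name in resources}
        self.task_pool = deque()
        self.replicas = replicas
        self.placements = {}  # task -> resources running a replica

    def submit(self, task):
        self.task_pool.append(task)

    def mark_down(self, resource):
        # Stand-in for the resource checker reporting an outage.
        self.resource_pool[resource] = False

    def schedule(self):
        available = [r for r, up in self.resource_pool.items() if up]
        # Tasks stay in the pool while no resource is available.
        while self.task_pool and available:
            task = self.task_pool.popleft()
            # Replication: place the task on several resources at once.
            self.placements[task] = available[: self.replicas]
        return self.placements

queue = FaultTolerantQueue(["Abe", "Big Red", "site-c"])
queue.submit("wtm-run")
queue.mark_down("Abe")
placements = queue.schedule()
print(placements)
```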
Fault-tolerant Queue Design • Fault detection • Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit • Communicate with GRAM to detect job failure • TeraGrid Information Services • GRAM service may fail when the resource is down • Publishes XML documents containing the outage information
Evaluation – WTM performance • WTM application performance (generation)
Evaluation – Queue Wait Time • Queue wait time statistics
Evaluation - Overhead • Performance overhead • Integrating a fault-tolerant framework usually causes performance degradation • No performance loss in our framework
Evaluation – Workflow Performance • Run time comparison of different workflow types • Original deployment vs. fault-tolerant deployment • Dynamic job dependency vs. static job dependency • Test each type of deployment on the real Grid system, including queue wait time
Evaluation – Workflow Performance • Setup points • What to measure • Job run time + queue wait time • 4 different types of deployment • Original on Abe • Original on Big Red • Static fault-tolerant workflow on Abe + Big Red • Dynamic fault-tolerant workflow on Abe + Big Red • 6 different jobs • 6 = 1 (original) + 1 (original) + 2 (static) + 2 (dynamic)
Evaluation – Workflow Performance • Setup points • “Submit” 4 different deployments at the same time • 5 jobs are submitted at the same time (the remaining job belongs to the static workflow, which cannot submit it until its first job finishes). • Repeat this at different times • Queue wait times vary, so the results differ between runs
Evaluation – Workflow Performance • Workflow comparison results
Simulation – Run Time Comparison • Average run time • Statistical model for the original WTM deployment t: run time of a job, p: failure rate, q: avg. queue wait time • Statistical model for the dynamic WTM deployment k: number of jobs, qi: avg. queue wait time of ith job, ti: run time of ith job
Simulation – Run Time Comparison • Results (queue wait time + job run time + “failure” time)
Simulation – Worst Case Run Time Comparison • A threat management system must deliver results under any circumstances. • Thus, the worst-case run time is a critical factor in the Water Threat Management system.
Simulation – Worst Case Run Time Comparison • Simulation setup • Use the 2009 TeraGrid outage data for this simulation • Submit jobs every 5 minutes during 2009 and compare the worst case run time between the original deployment and the dynamic workflow deployment
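The simulation setup can be sketched in simplified form. Several assumptions here are not from the slides: times are in minutes, a job cut off by an outage restarts from scratch once the site returns, queue wait time is ignored, and the multi-site deployment is modeled as plain replication (the earliest site to finish wins) rather than the full dynamic workflow. The outage windows are hypothetical, not the 2009 TeraGrid data.

```python
def completion_time(submit, run_time, outages):
    """Minutes from submission until a job finishes on one site.

    outages: sorted, non-overlapping (down, up) pairs in minutes.
    """
    start = submit
    for down, up in outages:
        if up <= start:
            continue  # this outage is already over
        if down >= start + run_time:
            break  # the job finishes before this outage begins
        start = up  # site down or job interrupted: restart afterwards
    return start + run_time - submit

def worst_case(run_time, sites, horizon, step=5):
    """Worst completion time for single-site vs. replicated submission,
    submitting one job every `step` minutes up to `horizon`."""
    worst_single = worst_replicated = 0
    for submit in range(0, horizon, step):
        times = [completion_time(submit, run_time, o) for o in sites]
        worst_single = max(worst_single, times[0])        # first site only
        worst_replicated = max(worst_replicated, min(times))
    return worst_single, worst_replicated

# Hypothetical outage windows (minutes), for illustration only.
site_a = [(30, 500)]
site_b = [(400, 450)]
single, replicated = worst_case(60, [site_a, site_b], horizon=1000)
print(f"worst single-site: {single} min, worst replicated: {replicated} min")
```

Under these assumptions the single-site worst case is dominated by the longest outage on that site, while the replicated worst case only suffers when outages on both sites coincide with a job's run.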
Conclusion • In general, the dynamic fault-tolerant workflow performs similarly to the original deployment. • However, in the worst-case scenario the dynamic workflow performs much better than the original deployment.