Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Young Suk Moon

Urban Water Distribution Systems • Supplying water • Pipe networks • Redundant flow paths • Millions of pipes • Image source: http://www.crwr.utexas.edu/gis/gishydro03/Classroom/trmproj/Garcia-Fresca/UrbanRecharge.htm

Water Threat Management Project • Analyzing water contamination in water distribution systems (WDSs) • EPANET simulation (developed by the U.S. Environmental Protection Agency) • [Figure: Sensor Data, Optimization Engine, Simulation Engine (MPI), Middleware, and multiple EPANET instances on Grid Resources; annotations: "find the contaminant source", "find the optimal solution"] • Figure from my presentation slides in the project/thesis seminar class

Project Requirements for WTM • Replace the MPI-based system with a loosely coupled system • Run EPANET simulations in parallel • Support a large number of evaluations • Integrate fault tolerance
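Below is a minimal sketch of the loosely coupled style these requirements point toward. It makes two assumptions. First, Python's standard `concurrent.futures` thread pool stands in for Grid nodes. Second, a hypothetical `evaluate` function stands in for one EPANET run, with a failure injected on purpose. Each evaluation is an independent task, so a failed task is recorded rather than aborting the batch. By contrast, a single rank failure typically aborts an entire MPI job.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate(candidate: int) -> float:
    """Hypothetical stand-in for one EPANET evaluation of a candidate."""
    if candidate == 3:
        raise RuntimeError("simulated node failure")  # injected fault
    return float(candidate * candidate)


def run_batch(candidates, workers=4):
    """Run independent evaluations; a failed job is recorded, not fatal."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(evaluate, c): c for c in candidates}
        for fut in as_completed(futures):
            c = futures[fut]
            try:
                results[c] = fut.result()
            except RuntimeError:
                failed.append(c)  # could be resubmitted elsewhere
    return results, sorted(failed)


if __name__ == "__main__":
    results, failed = run_batch(range(6))
    print(sorted(results.items()), failed)
```

Collecting failures, instead of letting them propagate, is what allows a fault-tolerance layer to resubmit or replicate only the jobs that were lost.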

Fault Tolerant Strategies • Replication • Run the same job on multiple machines concurrently • Fast, but less reliable, and needs enough resources • Checkpoint-restart • Periodically store the current computation state • Slow (checkpoint overhead), but more reliable
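The checkpoint-restart trade-off can be sketched with a toy job. This is a sketch under stated assumptions, not the project's mechanism:

- A JSON file on local disk stands in for stable storage.
- `long_job` and its `fail_at` crash injection are hypothetical.

Work done since the last checkpoint is lost and redone after a restart. That loss, together with the cost of writing checkpoints, is why this strategy is slower.

```python
import json
import os
import tempfile


def long_job(total_steps, ckpt_path, every=10, fail_at=None):
    """Toy long-running job: sums 0..total_steps-1.

    Saves its state every `every` steps and, on start, resumes from
    `ckpt_path` if a checkpoint exists. `fail_at` injects a crash at
    that step to simulate a node failure (hypothetical, for illustration).
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)              # restart from last checkpoint
    while state["step"] < total_steps:
        if state["step"] == fail_at:
            raise RuntimeError(f"node failed at step {state['step']}")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:        # periodic checkpoint = overhead
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)        # swap in the new checkpoint whole
    return state["acc"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        long_job(100, path, fail_at=57)
    except RuntimeError as err:
        print("first attempt:", err)
    # The restart resumes from the step-50 checkpoint; steps 50..56 are redone.
    print("after restart:", long_job(100, path))
```

Writing to a temporary file and then calling `os.replace` avoids leaving a half-written checkpoint if the node dies mid-write.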

Fault Tolerant Strategies in the Project • Replication • Copies (replicas) of one job are submitted to multiple nodes, which run the same job concurrently • Multiple queuing • A set of jobs is submitted to multiple nodes in different job orders, so that the nodes run different jobs concurrently • [Figure: machines running j1 on every node under replication, and jobs j1, j2, j3 in different orders under multiple queuing]
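The two strategies can be compared in a small discrete-time simulation. The sketch rests on assumptions that are not in the slides:

- Every job takes one time step.
- A crashed machine stays down.
- Machines skip jobs that another machine has already finished.
- Multiple queuing uses rotated job orders; the slide only says the orders differ.

`simulate` and `rotated_queues` are hypothetical names.

```python
def rotated_queues(jobs, n_machines):
    """Multiple queuing: machine m starts at a different offset in the list."""
    if not jobs:
        return [[] for _ in range(n_machines)]
    k = len(jobs)
    return [jobs[m % k:] + jobs[:m % k] for m in range(n_machines)]


def simulate(queues, failures=None):
    """Each step, every live machine runs the next unfinished job in its own
    queue; a job is finished when ANY machine completes it (first one wins).

    failures maps machine index -> first step at which that machine is down.
    Returns {job: step at which it finished}; unfinished jobs are absent.
    """
    failures = failures or {}
    finished = {}
    pos = [0] * len(queues)                   # next index in each queue
    step = 0
    while True:
        started = []
        for m, q in enumerate(queues):
            if failures.get(m, float("inf")) <= step:
                continue                      # machine m has crashed
            while pos[m] < len(q) and q[pos[m]] in finished:
                pos[m] += 1                   # skip jobs done elsewhere
            if pos[m] < len(q):
                started.append(q[pos[m]])
                pos[m] += 1
        if not started:
            return finished
        for job in started:
            finished.setdefault(job, step)
        step += 1


if __name__ == "__main__":
    jobs = ["j1", "j2", "j3"]
    print(simulate([jobs] * 3, failures={0: 0, 1: 0}))         # replication
    print(simulate(rotated_queues(jobs, 3), failures={1: 0}))  # multiple queuing
```

With machine 1 down from the start, multiple queuing still finishes all three jobs by step 1. Giving every machine the same order (pure replication) needs three steps even with no failures. In exchange, replication survives the loss of all but one machine.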

System Design • Figure from my pre-proposal

Model Description • J = {j1, j2, j3, ..., jn}: the set of jobs • Q = {q1, q2, q3, ..., qm}: the set of queues • R = {r1, r2, r3, ..., rl}: the set of resources

Model Description • Mapping jobs to queues • $f_t : J \to Q$, such that $\forall q \in Q,\ \exists j \in J : f_t(j) = q$ • Mapping queues to available resources • $g_t : Q \to \mathcal{P}(R)$, such that $\forall q \in Q,\ \exists R_a \subseteq R : g_t(q) = R_a \in \mathcal{P}(R)$ • Mapping a queue to a resource • $h_t : (Q, R_a) \to Q \times R_a$, with $h_t(q, r) = (q, r) \in Q \times R_a^t$ for $q \in Q,\ r \in R_a^t$
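The three mappings can be made concrete with plain dictionaries. This is a minimal sketch under stated assumptions:

- $f_t$ and $g_t$ use round-robin policies.
- For $h_t$, the resource is picked by hand.
- All function names are hypothetical.

The slides define only the signatures, not how the mappings are chosen. One call of each function corresponds to a single snapshot $t$.

```python
from itertools import cycle


def f_t(jobs, queues):
    """f_t : J -> Q. Round-robin assignment; it is onto Q (every queue
    gets a job) whenever len(jobs) >= len(queues)."""
    return dict(zip(jobs, cycle(queues)))


def g_t(queues, available):
    """g_t : Q -> P(R). Deals the currently available resources out
    round-robin, so each queue receives a subset R_a of R."""
    subsets = {q: set() for q in queues}
    for r, q in zip(available, cycle(queues)):
        subsets[q].add(r)
    return {q: frozenset(s) for q, s in subsets.items()}


def h_t(q, r, r_a):
    """h_t : (Q, R_a) -> Q x R_a. Pairs queue q with a resource r in r_a."""
    if r not in r_a:
        raise ValueError(f"{r} is not in the resource set of {q}")
    return (q, r)


if __name__ == "__main__":
    jobs = ["j1", "j2", "j3", "j4", "j5"]
    queues = ["q1", "q2"]
    f = f_t(jobs, queues)                      # j1, j3, j5 -> q1; j2, j4 -> q2
    g = g_t(queues, ["r1", "r2", "r3", "r4"])  # q1 -> {r1, r3}; q2 -> {r2, r4}
    print(f, g, h_t("q1", min(g["q1"]), g["q1"]))
```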

An Example of a Dynamic Fault Tolerance Selection Algorithm
  n_a: number of nodes that are available
  n_r: number of jobs that can be run in parallel

  while (there is any job remaining)
      n_a ← check resource availability
      n_r ← check job parallelism
      if n_r < n_a < 2n_r then
          do partial replication and partial queuing
      else if n_a ≥ 2n_r then
          do full replication
      else
          do queuing
      end if
  end while
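The selection rule can be written directly in Python. `select_strategy` follows the slide's thresholds exactly. `replicas_per_job` is an added, hypothetical policy for splitting the n_a nodes among the n_r runnable jobs as evenly as possible; the slides do not specify this split. The two "check" steps are replaced here by fixed example readings.

```python
def select_strategy(n_a: int, n_r: int) -> str:
    """Thresholds from the slide.

    n_a: number of available nodes; n_r: number of jobs runnable in parallel.
    """
    if n_r < n_a < 2 * n_r:
        return "partial replication + partial queuing"
    if n_a >= 2 * n_r:
        return "full replication"
    return "queuing"


def replicas_per_job(n_a: int, n_r: int) -> list[int]:
    """Hypothetical split of n_a nodes over n_r jobs, as even as possible.

    Entry i is how many nodes run job i this round; 0 means the job stays
    queued until a node frees up.
    """
    if n_r < 1:
        raise ValueError("need at least one runnable job")
    if n_a <= n_r:                           # queuing: at most one node per job
        return [1] * n_a + [0] * (n_r - n_a)
    base, extra = divmod(n_a, n_r)           # replication: spread extra nodes
    return [base + 1] * extra + [base] * (n_r - extra)


if __name__ == "__main__":
    n_r = 4
    for n_a in (3, 6, 9):                    # example availability readings
        print(n_a, select_strategy(n_a, n_r), replicas_per_job(n_a, n_r))
```

The case n_a = n_r falls to the queuing branch. There, each job gets exactly one node, so no node is idle and none is duplicated.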

Resource Selection • Several ways to choose resources • Minimization functions related to • Performance of resources • Temperature of resources • Laplace's equation
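One way to read this slide is sketched below. It rests on assumptions that are not in the slides:

- The machine room is a small 2D grid with fixed wall temperatures.
- The steady-state temperature at each interior cell comes from Jacobi relaxation of the discretized Laplace equation ∇²T = 0, in which each cell becomes the average of its four neighbors.
- Nodes are ranked by a hypothetical weighted cost w_perf/perf + w_temp·temp, and the lowest costs are kept.

The weights, grid, and function names are illustrative only.

```python
def laplace_jacobi(grid, iters=500):
    """Jacobi relaxation for the discrete Laplace equation on a 2D grid.

    Boundary cells keep their values (Dirichlet conditions); each interior
    cell is repeatedly replaced by the mean of its four neighbours.
    """
    g = [row[:] for row in grid]             # do not modify the caller's grid
    for _ in range(iters):
        new = [row[:] for row in g]
        for i in range(1, len(g) - 1):
            for j in range(1, len(g[0]) - 1):
                new[i][j] = 0.25 * (g[i - 1][j] + g[i + 1][j]
                                    + g[i][j - 1] + g[i][j + 1])
        g = new
    return g


def select_resources(nodes, k, w_perf=1.0, w_temp=0.1):
    """Pick the k nodes with the lowest cost w_perf/perf + w_temp*temp.

    nodes: list of (name, relative_performance, temperature) tuples.
    """
    def cost(node):
        _, perf, temp = node
        return w_perf / perf + w_temp * temp
    return [name for name, _, _ in sorted(nodes, key=cost)[:k]]


if __name__ == "__main__":
    # Hot wall (40) on the left, cool wall (20) on the right; top and bottom
    # walls vary linearly, so the exact steady state is linear in j.
    edge = [40.0, 35.0, 30.0, 25.0, 20.0]
    room = [edge[:]] + [[40.0, 0.0, 0.0, 0.0, 20.0] for _ in range(3)] + [edge[:]]
    T = laplace_jacobi(room)
    nodes = [("n1", 2.0, T[1][1]), ("n2", 2.0, T[1][3]), ("n3", 1.0, T[2][3])]
    print(select_resources(nodes, 2))  # ['n2', 'n3']
```

Here the fast but hot node n1 loses to the cooler nodes n2 and n3. Changing `w_perf` and `w_temp` shifts that balance between performance and temperature.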

References • G. von Laszewski, K. Mahinthakumar, R. Ranjithan, D. Brill, J. Uber, K. Harrison, S. Sreepathi, and E. Zechman, “An Adaptive Cyberinfrastructure for Threat Management in Urban Water Distribution Systems,” in Proceedings of ICCS 2006, vol. 3993, 2006, pp. 401–. • S. Sreepathi, “Cyberinfrastructure for Contamination Source Characterization in Water Distribution Systems,” Master’s thesis, North Carolina State University, 2006. • L. Ramakrishnan and D. A. Reed, “Performability Modeling for Scheduling and Fault Tolerance Strategies for Scientific Workflows,” in Proceedings of the 17th International Symposium on High Performance Distributed Computing, Boston, MA, USA: ACM, June 2008, pp. 23-34. • G. Wrzesińska, R. V. van Nieuwpoort, J. Maassen, T. Kielmann, and H. E. Bal, “Fault-Tolerant Scheduling of Fine-Grained Tasks in Grid Environments,” International Journal of High Performance Computing Applications, vol. 20, no. 1, February 2006, pp. 103-114. • “Laplace’s Equation,” Wolfram MathWorld, http://mathworld.wolfram.com/LaplacesEquation.html
