
Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications



  1. Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications
  Sajal K. Das and Daniel J. Harvey
  Department of Computer Science and Engineering, The University of Texas at Arlington
  E-mail: {das,harvey}@cse.uta.edu
  Rupak Biswas
  NASA Ames Research Center
  E-mail: rbiswas@nas.nasa.gov

  2. Presentation Overview
  • The Information Power Grid (IPG)
  • Motivations
  • Load Balancing and Partitioning
  • Our Contributions
  • The New MinEX Partitioner
  • Experimental Study
  • Performance Results
  • Conclusions and Ongoing Research

  3. The Information Power Grid (IPG)
  • Harness the power of geographically separated resources
  • Developed by NASA and other collaborative partners
  • Utilize a distributed environment to solve large-scale computational problems
  • Additional relevant applications identified by the I-Way experiment:
    • Remote access to large databases with high-end graphics facilities
    • Remote virtual-reality access to instruments
    • Remote interaction with supercomputer simulations

  4. Motivations
  • Develop techniques to enhance the feasibility of running applications on the IPG
  • Provide an effective load balancer/partitioner for a distributed environment
  • Allow for latency tolerance to overcome low bandwidths
  • Predict application performance by simulation of the IPG

  5. Load Balancing and Partitioning
  GOAL: Distribute the workload evenly among processors
  • Static load balancers
    • Balance load prior to execution
    • Examples: smart compilers, schedulers
  • Dynamic load balancers
    • Balance load as the application executes
    • Examples: adaptive contracting, gradient, symmetric broadcast networks
  • Semi-dynamic load balancers
    • Temporarily stop processing to balance the workload
    • Utilize a partitioning technique
    • Examples: MeTiS, Jostle, PLUM

  6. Our Contributions
  • Limitations of existing partitioners:
    • Separate partitioning and data-redistribution steps
    • Lack of latency tolerance
    • Balance loads at the cost of excessive communication and data movement
  • Propose a new partitioner (MinEX) for the IPG environment:
    • Minimize total runtime rather than balancing workload
    • Compensate for high latency on the IPG
    • Compare with existing methods

  7. The MinEX Partitioner
  • Diffusive algorithm whose goal is to minimize total runtime
  • User-supplied function for latency tolerance
  • Accounts for data-redistribution cost during partitioning
  • Collapses pairs of vertices incrementally
  • Partitions the contracted graph
  • Refines the graph gradually back to the original, in reverse order of contraction
  • Vertex reassignment is considered at each refinement step
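To make the contract-partition-refine ordering concrete, here is a minimal C++ driver sketch. Everything in it is a hypothetical stand-in (Graph, canCollapse, collapseBestEdge, partitionContracted, and expandPair are invented names with stub bodies); it models only the phase ordering described above, not the actual MinEX implementation.

```cpp
// Hypothetical driver modeling MinEX's phase ordering: contract by
// collapsing vertex pairs, partition the coarse graph, then refine in
// reverse order while reconsidering reassignments. Stub bodies only.
#include <stack>
#include <utility>

struct Graph { int activeVertices; };

bool canCollapse(const Graph& g)              { return g.activeVertices > 2; }
std::pair<int,int> collapseBestEdge(Graph& g) { --g.activeVertices; return {0, 1}; }
void partitionContracted(Graph&)              { /* diffusive reassignment */ }
void expandPair(Graph& g, std::pair<int,int>) { ++g.activeVertices; }

void minexPartition(Graph& g) {
    std::stack<std::pair<int,int>> merged;  // contraction records, LIFO
    while (canCollapse(g))                  // contraction phase
        merged.push(collapseBestEdge(g));
    partitionContracted(g);                 // partition the contracted graph
    while (!merged.empty()) {               // refinement phase: pop pairs in
        expandPair(g, merged.top());        // reverse order of contraction
        merged.pop();
        partitionContracted(g);             // reconsider vertex reassignment
    }
}

int main() { Graph g{16}; minexPartition(g); }
```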

  8. Metrics Utilized
  • Processing weight: $Wgt_v = PWgt_v \times Proc_c$
  • Communication cost: $Comm_v = \sum_{w} CWgt_{(v,w)} \times Connect(c_p, c_q)$
  • Redistribution cost: $Remap_v = RWgt_v \times Connect(C_p, C_q)$ if $p \neq q$
  • Weighted queue length: $QWgt(p) = \sum_{v \in p} \left( Wgt_v + Comm_v + Remap_v \right)$
  • Heaviest load: $MaxQWgt$; lightest load: $MinQWgt$; average load: $AvgQWgt$
  • Total system load: $QWgtTot = \sum_{p} QWgt(p)$
  • Load imbalance factor: $LoadImb = MaxQWgt / AvgQWgt$
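As a worked example of the load metrics, the sketch below computes MaxQWgt, MinQWgt, AvgQWgt, and LoadImb from a toy vector of per-processor QWgt(p) values; the data and layout are invented for illustration, only the formulas follow the slide.

```cpp
// Toy computation of the slide's load metrics from per-processor QWgt(p).
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // QWgt(p) for P = 4 processors, each already accumulated from
    // Wgt_v + Comm_v + Remap_v over the vertices assigned to p.
    std::vector<double> qwgt = {120.0, 95.0, 140.0, 101.0};

    double maxQ = *std::max_element(qwgt.begin(), qwgt.end());    // MaxQWgt
    double minQ = *std::min_element(qwgt.begin(), qwgt.end());    // MinQWgt
    double tot  = std::accumulate(qwgt.begin(), qwgt.end(), 0.0); // QWgtTot
    double avgQ = tot / qwgt.size();                              // AvgQWgt
    double loadImb = maxQ / avgQ;                                 // LoadImb

    std::printf("MaxQWgt=%.1f MinQWgt=%.1f AvgQWgt=%.1f LoadImb=%.3f\n",
                maxQ, minQ, avgQ, loadImb);
}
```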

  9. MinVar, Gain, and ThroTTle
  • Processor workload variance from $MinQWgt$: $MinVar = \sum_{p} (QWgt(p) - MinQWgt)^2$
  • $\Delta MinVar$ is the improvement in $MinVar$ after a vertex reassignment
  • $Gain$ is the change in total system load ($\Delta QWgtTot$) resulting from a vertex reassignment
  • $ThroTTle$ is a user-defined parameter
  • Vertex moves that improve $MinVar$ are allowed if $Gain / ThroTTle \leq \Delta MinVar$
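The acceptance test on this slide can be expressed directly in code. The following sketch is an illustrative stand-in (not MinEX source): it computes MinVar for a load vector and accepts a candidate move only when MinVar improves and Gain/ThroTTle does not exceed the improvement.

```cpp
// Illustrative stand-in for the ThroTTle acceptance test (not MinEX source).
#include <cstdio>
#include <vector>

// MinVar = sum over p of (QWgt(p) - MinQWgt)^2
double minVar(const std::vector<double>& qwgt) {
    double minQ = qwgt[0];
    for (double q : qwgt) minQ = (q < minQ) ? q : minQ;
    double v = 0.0;
    for (double q : qwgt) v += (q - minQ) * (q - minQ);
    return v;
}

// Accept a vertex move when MinVar improves and Gain/ThroTTle <= dMinVar.
bool acceptMove(const std::vector<double>& before,
                const std::vector<double>& after,
                double gain, double throttle) {
    double dMinVar = minVar(before) - minVar(after);  // improvement
    return dMinVar > 0.0 && gain / throttle <= dMinVar;
}

int main() {
    std::vector<double> before = {140, 95, 120, 101};  // QWgt(p) now
    std::vector<double> after  = {125, 110, 120, 101}; // after candidate move
    std::printf("accept=%d\n",
                acceptMove(before, after, /*gain=*/4.0, /*throttle=*/2.0));
}
```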

  10. MinEX Data Structures
  • Mesh: {|V|, |E|, vTot, *VMap, *VList, *EList}
    • |V|: number of active vertices
    • |E|: total number of edges
    • vTot: total number of vertices
    • *VMap: pointer to the list of active vertices
    • *VList: pointer to the complete list of vertices
    • *EList: pointer to the list of edges
  • Each EList entry contains {w, CWgt(v,w)}
    • w: adjacent vertex
    • CWgt(v,w): edge communication weight
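One plausible C++ rendering of the Mesh record follows, assuming the pointer fields map to containers and indices; the concrete types are guesses, and the Vertex record is filled in after the next slide.

```cpp
// Assumed C++ rendering of the Mesh record; pointers become containers and
// indices. Field names follow the slide, concrete types are guesses.
#include <vector>

struct Edge {
    int    w;     // adjacent vertex
    double cwgt;  // CWgt(v,w): edge communication weight
};

struct Vertex;    // per-vertex record; fields appear on the next slide

struct Mesh {
    int nActive;                 // |V|: number of active vertices
    int nEdges;                  // |E|: total number of edges
    int vTot;                    // total number of vertices (incl. merged)
    std::vector<int>     vMap;   // *VMap: indices of active vertices
    std::vector<Vertex*> vList;  // *VList: complete list of vertices
    std::vector<Edge>    eList;  // *EList: entries {w, CWgt(v,w)}
};
```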

  11. MinEX Data Structures (continued)
  • VList (for each vertex v): {PWgt, RWgt, |e|, *e, merge, lookup, *VMap, *heap, border}
    • PWgt: computational weight
    • RWgt: redistribution weight
    • |e|: number of incident edges
    • *e: pointer to the first edge
    • merge: vertex that merged with v (or -1)
    • lookup: active vertex containing v (or -1)
    • *VMap: pointer to v's position in VMap
    • *heap: pointer to the heap entry for v
    • border: indicates whether v is a border vertex
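Continuing the assumed C++ rendering, a per-vertex VList entry might look as follows; the field names follow the slide, while the types, the -1 sentinels, and the use of array indices in place of pointers are assumptions.

```cpp
// Assumed C++ rendering of a VList entry; -1 sentinels and array indices
// stand in for the slide's pointers.
struct Vertex {
    double pwgt;       // PWgt: computational weight
    double rwgt;       // RWgt: redistribution weight
    int    numEdges;   // |e|: number of incident edges
    int    firstEdge;  // *e: index of v's first edge in Mesh::eList
    int    merge;      // vertex that merged with v, or -1
    int    lookup;     // active vertex containing v, or -1
    int    vMapPos;    // *VMap: v's position in the active-vertex list
    int    heapPos;    // *heap: v's position in the Gain min-heap
    bool   border;     // whether v has an edge into another partition
};
```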

  12. MinEX Contraction Phase
  • Form meta-vertices by collapsing edges
  • Collapse the edge with maximal CWgt(v,w) / (RWgt_v + RWgt_w)
  • Procedure Find(v):
      If (merge == -1) Return v
      If (lookup != -1) And (lookup <= vTot) Then Return lookup = Find(lookup)
      Else Return lookup = Find(merge)
  [Figure: example contraction in which vertices C and F collapse into meta-vertex H and the pair (C,F) is pushed on the stack; the active vertex set VMap goes from {A,B,C,D,E,F,G} (|V|=7) to {A,B,H,D,E,G} (|V|=6)]
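The Find procedure above translates almost directly into C++. This self-contained version (with a minimal Vertex record rather than the full structure from slides 10-11) follows merge/lookup chains to the active meta-vertex containing v and caches the answer, a form of path compression.

```cpp
// Self-contained C++ version of Find(v): follow merge/lookup chains to the
// active meta-vertex now containing v, caching the result (path
// compression). Vertex is reduced to the two fields Find needs.
#include <vector>

struct Vertex { int merge = -1; int lookup = -1; };

std::vector<Vertex> vlist;  // indexed by vertex id
int vTot = 0;               // highest currently valid vertex id

int find(int v) {
    Vertex& u = vlist[v];
    if (u.merge == -1) return v;             // v is still an active vertex
    if (u.lookup != -1 && u.lookup <= vTot)  // cached result still valid
        return u.lookup = find(u.lookup);
    return u.lookup = find(u.merge);         // recompute via the merge chain
}
```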

  13. MinEX Partition Phase
  • The contracted graph allows efficient partitioning
  • A heap with pointers is created:
    • For each vertex, compute the optimal reassignment
    • If the MinVar, Gain, and ThroTTle criteria are satisfied, the vertex is added to the Gain min-heap
    • The VList *heap pointer is set
  • The heap is adjusted as vertices are reassigned
  • The process stops when the heap becomes empty
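Here is a sketch of that partition-phase loop, with std::priority_queue standing in for the pointer-based Gain min-heap; all helper functions are hypothetical stubs, and the real partitioner also re-keys vertices affected by each move.

```cpp
// Sketch of the partition phase with std::priority_queue standing in for
// the pointer-based Gain min-heap; helpers are hypothetical stubs.
#include <queue>
#include <vector>

struct Move { int v; int toProc; double gain; };
struct ByGain {  // orders the heap so the smallest Gain is on top
    bool operator()(const Move& a, const Move& b) const {
        return a.gain > b.gain;
    }
};

std::vector<Move> candidateMoves()   { return {}; }    // optimal move per vertex
bool passesThrottleTest(const Move&) { return false; } // MinVar/Gain/ThroTTle
void applyMove(const Move&)          {}                // reassign the vertex

void partitionPhase() {
    std::priority_queue<Move, std::vector<Move>, ByGain> heap;
    for (const Move& m : candidateMoves())
        if (passesThrottleTest(m))  // criteria satisfied: add to min-heap
            heap.push(m);
    while (!heap.empty()) {         // reassign until the heap empties; the
        Move m = heap.top();        // real partitioner also adjusts the heap
        heap.pop();                 // after each move
        applyMove(m);
    }
}

int main() { partitionPhase(); }
```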

  14. MinEX Refinement Phase
  • Refinement proceeds in reverse order from contraction, popping vertex pairs off the stack
  • Reassignment of each refined vertex is considered, and the partitioning process is restarted
  • Vertex lookup and merge values are reset by following the merge chain when edges are accessed (if lookup > vTot)
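One detail worth showing in code is the stale-cache reset mentioned above: after a pair is re-expanded, any cached lookup pointing past the restored vTot refers to a meta-vertex that no longer exists. A minimal sketch, assuming the merge/lookup fields from the earlier data-structure sketches:

```cpp
// Minimal sketch of the stale-cache reset, assuming the merge/lookup fields
// from the earlier data-structure sketches.
#include <vector>

struct Vertex { int merge = -1; int lookup = -1; };

// After refinement restores vTot, cached lookups pointing past it refer to
// meta-vertices that no longer exist and must be recomputed via find().
void invalidateStaleLookups(std::vector<Vertex>& vlist, int vTot) {
    for (Vertex& u : vlist)
        if (u.lookup > vTot)
            u.lookup = -1;
}
```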

  15. Analysis of ThroTTle Values (P=32)
  [Charts: expected MaxQWgt and expected LoadImb as ThroTTle varies]

  16. Latency Tolerance Approach
  • Move data sets and edge data first
  • Achieve latency tolerance by overlapping processing with communication:
    1. Send data sets to be moved
    2. Send edge data
    3. Process vertices not waiting for edge communication
    4. Receive and unpack remapped data sets
    5. Receive and unpack communication data
    6. Repeat steps 2-5 until all vertices are processed
  • Optimistic view: processing completely hides the latency
  • Pessimistic view: no latency hiding occurs
  • The application passes the latency-hiding function to MinEX
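The overlap pattern in steps 1-6 maps naturally onto non-blocking message passing. The sketch below uses MPI (an assumption; the slides do not name the communication layer) to start transfers, process locally ready vertices while the messages are in flight, and test for completion; the buffers, tags, and the commented-out work function are hypothetical.

```cpp
// Overlap sketch using non-blocking MPI (the slides do not name the
// communication layer, so MPI is an assumption). Buffers, tags, and the
// commented-out work function are hypothetical.
#include <mpi.h>

void exchangeAndCompute(double* sendBuf, int sendCount, int dest,
                        double* recvBuf, int recvCount, int src) {
    MPI_Request reqs[2];
    // Steps 1-2: start moving data sets and edge data without blocking.
    MPI_Isend(sendBuf, sendCount, MPI_DOUBLE, dest, /*tag=*/0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvBuf, recvCount, MPI_DOUBLE, src, /*tag=*/0,
              MPI_COMM_WORLD, &reqs[1]);

    // Step 3: process vertices that need no remote edge data while the
    // transfers are in flight -- this is where the latency is hidden.
    int done = 0;
    while (!done) {
        // processReadyVertices();  // application-supplied local work
        MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
    }
    // Steps 4-5: unpack recvBuf, then repeat 2-5 for remaining vertices.
}
```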

  17. Experimental Study: Simulation of an IPG Environment
  • A configuration file defines clusters, processors, and interconnect slowdowns
  • Processors in a cluster are assumed homogeneous
  • Connect(c1, c2) = interconnect slowdown between clusters c1 and c2 (unity for no slowdown)
  • If c1 = c2, Connect(c1, c2) = intraconnect slowdown
  • Proc_c represents the processing slowdown (normalized to unity) within a cluster
  • The configuration file is mapped to the processing graph by MinEX so that actual vertex assignments in the distributed environment can be modeled
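A minimal sketch of how such a configuration might be represented in code, assuming a symmetric Connect matrix whose diagonal holds the intraconnect slowdown; the field names and the file-free initialization are invented for illustration.

```cpp
// Invented representation of the simulated-IPG configuration: per-cluster
// processing slowdowns plus a symmetric Connect(c1,c2) slowdown matrix
// whose diagonal holds the intraconnect slowdown (unity = no slowdown).
#include <vector>

struct IPGConfig {
    int nClusters;
    int procsPerCluster;
    std::vector<double> proc;                  // Proc_c per cluster
    std::vector<std::vector<double>> connect;  // Connect(c1, c2)
};

int main() {
    IPGConfig cfg;
    cfg.nClusters       = 2;
    cfg.procsPerCluster = 16;          // P = 32 processors in total
    cfg.proc    = {1.0, 1.0};          // homogeneous within each cluster
    cfg.connect = {{1.0, 8.0},         // diagonal: intraconnect slowdown
                   {8.0, 1.0}};        // off-diagonal: interconnect slowdown
}
```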

  18. Test Application: Unstructured Adaptive Mesh
  • Time-dependent shock wave propagated through a cylindrical volume
  • Tetrahedral mesh discretization
  • Previously refined elements are coarsened
  • Mesh grows from 50K to 1.8M tetrahedra over nine adaptation levels
  • Workload becomes unbalanced as the mesh is adapted

  19. Characteristics of the Test Application
  • Mesh elements interact only with immediate neighbors
  • High communication and remapping costs
  • Numerical solver not included

  20. MinEX Partitioner Performance
  • SBN: dynamic load balancer based on a Symmetric Broadcast Network, adapted for mesh applications
  • PLUM: semi-dynamic framework for processing adaptive, unstructured meshes
  • MinEX is compared with SBN and PLUM:

  21. Experimental Results (P=32)
  [Charts: expected runtimes, in thousands of units, versus interconnect slowdowns, with no latency tolerance and with maximum latency tolerance]

  22. Conclusions & Ongoing Research
  • Introduced a new partitioner, MinEX, and experimented with it in simulated IPG environments
  • Runtimes increase with larger interconnect slowdowns and as clusters are added
  • Additional clusters increase the benefits of latency tolerance
  • Estimated runtimes with MinEX improved by a factor of five over no partitioning
  • Currently applying MinEX to the N-body problem (Barnes-Hut algorithm)
