Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications. Sajal K. Das and Daniel J. Harvey Department of Computer Science and Engineering The University of Texas at Arlington E-mail: {das,harvey}@cse.uta.edu Rupak Biswas NASA Ames Research Center
Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications Sajal K. Das and Daniel J. Harvey Department of Computer Science and Engineering The University of Texas at Arlington E-mail: {das,harvey}@cse.uta.edu Rupak Biswas NASA Ames Research Center E-mail: rbiswas@nas.nasa.gov
Presentation Overview • The Information Power Grid (IPG) • Motivations • Load Balancing and Partitioning • Our Contributions • The new MinEX Partitioner • Experimental Study • Performance Results • Conclusions and Ongoing Research
The Information Power Grid (IPG) • Harness the power of geographically separated resources • Developed by NASA and other collaborative partners • Utilize a distributed environment to solve large-scale computational problems • Additional relevant applications identified by I-Way experiment • Remote access to large databases with high-end graphics facilities • Remote virtual reality access to instruments • Remote interactions with supercomputer simulations
Motivations • Develop techniques to enhance the feasibility of running applications on the IPG • Effective load-balancer/partitioner for a distributed environment • Allow for latency tolerance to overcome low bandwidths • Predict application performance by simulation of the IPG
Load Balancing and Partitioning GOAL: Distribute workload evenly among processors • Static load balancers • Balance load prior to execution • Examples: smart-compilers, schedulers • Dynamic load balancers • Balance as application is processed • Examples: adaptive contracting, gradient, symmetric broadcast networks • Semi-dynamic load balancers • Temporarily stop processing to balance workload • Utilize a partitioning technique • Examples: MeTiS, Jostle, PLUM
Our Contributions • Limitations of existing partitioners • Separate partitioning and data redistribution steps • Lack of latency tolerance • Balance loads with excessive communication and data movement • Propose a new partitioner (MinEX) for IPG environment • Minimize total runtime rather than balancing workload • Compensate for high latency on the IPG • Compare with existing methods
The MinEX Partitioner • Diffusive algorithm with goal to minimize total runtime • User-supplied function for latency tolerance • Account for data redistribution cost during partitioning • Collapse pairs of vertices incrementally • Partition the contracted graph • Refine graph gradually to original in reverse order • Vertex reassignment considered at each refinement
Metrics Utilized
• Processing weight: $Wgt_v = PWgt_v \times Proc_c$
• Communication cost: $Comm_v^p = \sum_{w \in q,\, q \neq p} CWgt(v,w) \times Connect(c_p, c_q)$
• Redistribution cost: $Remap_v^p = RWgt_v \times Connect(c_p, c_q)$ if $p \neq q$
• Weighted queue length: $QWgt(p) = \sum_{v \in p} (Wgt_v + Comm_v^p + Remap_v^p)$
• Heaviest load (MaxQWgt), lightest load (MinQWgt), average load (AvgQWgt)
• Total system load: $QWgtToT = \sum_p QWgt(p)$
• Load imbalance factor: $LoadImb = MaxQWgt / AvgQWgt$
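As a rough illustration of how these metrics combine, the following Python sketch computes QWgt(p) and LoadImb for one assignment of vertices to processors. The function names, the dictionary layout, and the `home` map (the processor currently holding each vertex's data) are assumptions made for this example and are not taken from MinEX itself.

```python
def queue_weights(assign, home, pwgt, rwgt, edges, cluster_of, proc, connect):
    """Return {processor: QWgt(p)} for a vertex-to-processor assignment.

    assign[v]     : processor that v is assigned to
    home[v]       : processor currently holding v's data (hypothetical name)
    edges[v]      : list of (w, CWgt(v, w)) pairs
    cluster_of[p] : cluster index of processor p
    proc[c]       : processing slowdown of cluster c
    connect[a][b] : communication slowdown between clusters a and b
    """
    qwgt = {p: 0.0 for p in cluster_of}
    for v, p in assign.items():
        cp = cluster_of[p]
        wgt = pwgt[v] * proc[cp]
        # Only neighbors assigned to a different processor cost communication.
        comm = sum(cw * connect[cp][cluster_of[assign[w]]]
                   for w, cw in edges[v] if assign[w] != p)
        # Data must be moved only if v is not already on processor p.
        remap = 0.0
        if home[v] != p:
            remap = rwgt[v] * connect[cp][cluster_of[home[v]]]
        qwgt[p] += wgt + comm + remap
    return qwgt


def load_imbalance(qwgt):
    """LoadImb = MaxQWgt / AvgQWgt."""
    avg = sum(qwgt.values()) / len(qwgt)
    return max(qwgt.values()) / avg
```

For instance, with two single-processor clusters, an interconnect slowdown of 3, and three vertices a, b, c where only c moves to the second processor, c pays both a communication and a redistribution penalty across the slow link.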
MinVar, Gain, and ThroTTle • Processor workload variance from MinQWgt • $MinVar = \sum_p (QWgt(p) - MinQWgt)^2$ • $\Delta MinVar$ reflects the improvement in MinVar after a vertex reassignment • Gain is the change to the total system load (QWgtToT) resulting from a vertex reassignment • ThroTTle is a user-defined parameter • Vertex moves that improve MinVar are allowed if $Gain / ThroTTle \le \Delta MinVar$
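The acceptance test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the improvement in MinVar is taken as MinVar before the move minus MinVar after it (positive means better), Gain is taken as the increase in total system load, and the function names are hypothetical.

```python
def min_var(qwgt):
    """Sum of squared deviations of each processor's load from the lightest load."""
    lightest = min(qwgt)
    return sum((q - lightest) ** 2 for q in qwgt)


def move_allowed(before, after, throttle):
    """Decide whether a vertex move is accepted.

    before, after : lists of QWgt(p) values prior to and following the move
    throttle      : user-defined ThroTTle parameter (assumed positive)
    """
    improvement = min_var(before) - min_var(after)  # assumed sign convention
    gain = sum(after) - sum(before)                 # change in total system load
    return improvement > 0 and gain / throttle <= improvement
```

Under these assumptions a larger ThroTTle value tolerates moves that add more total load in exchange for a given improvement in balance: moving from loads [4, 6] to [10, 10] is rejected with a ThroTTle of 1 but accepted with a ThroTTle of 4.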
MinEX Data Structures
• Mesh: {|V|, |E|, vTot, *VMap, *VList, *EList}
  |V| : Number of active vertices
  |E| : Total number of edges
  vTot : Total number of vertices
  *VMap : Pointer to list of active vertices
  *VList : Pointer to complete list of vertices
  *EList : Pointer to list of edges
• EList entries contain {w, CWgt(v,w)}
  w : Adjacent vertex
  CWgt(v,w) : Edge communication weight
MinEX Data Structures (continued)
• VList (for each vertex v): {PWgt, RWgt, |e|, *e, merge, lookup, *VMap, *heap, border}
  PWgt : Computational weight
  RWgt : Redistribution weight
  |e| : Number of incident edges
  *e : Pointer to the first edge
  merge : Vertex that merged with v (or -1)
  lookup : Active vertex containing v (or -1)
  *VMap : Pointer to v's position in VMap
  *heap : Pointer to heap entry for v
  border : Indicates if v is a border vertex
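One possible way to mirror these records in Python is shown below. This is a sketch, not the MinEX implementation: pointers are replaced by list indices, each vertex owns its edge list instead of pointing into a shared EList, and the counts |V|, |E|, |e|, and vTot are left implicit in the list lengths. The class and method names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Edge:
    w: int        # index of the adjacent vertex
    cwgt: float   # CWgt(v, w), edge communication weight


@dataclass
class Vertex:
    pwgt: float                 # computational weight
    rwgt: float                 # redistribution weight
    edges: List[Edge] = field(default_factory=list)
    merge: int = -1             # vertex that merged with v, or -1
    lookup: int = -1            # active vertex containing v, or -1
    vmap_pos: int = -1          # position of v in the active-vertex list
    heap_pos: int = -1          # position of v's heap entry, or -1
    border: bool = False        # True if v is a border vertex


@dataclass
class Mesh:
    vlist: List[Vertex] = field(default_factory=list)  # all vertices
    vmap: List[int] = field(default_factory=list)      # active vertex indices

    def add_vertex(self, pwgt: float, rwgt: float) -> int:
        """Append a new active vertex and return its index."""
        v = len(self.vlist)
        self.vlist.append(Vertex(pwgt, rwgt, vmap_pos=len(self.vmap)))
        self.vmap.append(v)
        return v

    def add_edge(self, u: int, v: int, cwgt: float) -> None:
        """Store the edge in both adjacency lists."""
        self.vlist[u].edges.append(Edge(v, cwgt))
        self.vlist[v].edges.append(Edge(u, cwgt))
```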
MinEX Contraction Phase
• Form meta-vertices by collapsing edges
• Use maximal CWgt(v,w) / (RWgt_v + RWgt_w)
Procedure Find(v)
  If (merge == -1) Return v
  If (lookup != -1) And (lookup <= vTot)
    Then Return lookup = Find(lookup)
    Else Return lookup = Find(merge)
[Figure: example graph with vertices A-G, redistribution weights R1-R4, and edge weights C2-C8, shown before and after one contraction step. Before: VMap = A,B,C,D,E,F,G; |V| = 7; |E| = 16. After C and F are collapsed into meta-vertex H (R3), with the pair C,F on the stack: VMap = A,B,H,D,E,G; |V| = 6; |E| = 19.]
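The Find procedure can be sketched in Python as follows. The sketch assumes that vertices are numbered from 1, that `merge` and `lookup` are stored as parallel lists, and that a cached `lookup` value larger than `v_tot` is stale because the meta-vertex it names has since been removed; these storage details are illustrative choices, not the authors' code.

```python
def find(v, merge, lookup, v_tot):
    """Return the active vertex that currently contains v.

    merge[v]  : meta-vertex that absorbed v, or -1 if v is active
    lookup[v] : cached result of an earlier search, or -1
    v_tot     : current total number of vertices (1-based numbering assumed)
    """
    if merge[v] == -1:
        return v
    if lookup[v] != -1 and lookup[v] <= v_tot:
        # The cached vertex still exists; continue the search from it.
        lookup[v] = find(lookup[v], merge, lookup, v_tot)
    else:
        # No usable cache entry; follow the merge chain instead.
        lookup[v] = find(merge[v], merge, lookup, v_tot)
    return lookup[v]
```

Caching the result in `lookup` shortens later searches, and the `lookup <= v_tot` check lets cached entries expire lazily once refinement has removed the meta-vertex they refer to.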
MinEX Partition Phase • Contracted graph allows efficient partitioning • Heap with pointers is created • For each vertex, compute optimal reassignment • MinVar, Gain, and ThroTTle criteria satisfied • Vertices are added to the Gain min-heap • The VList *heap pointer is set • Heap is adjusted as vertices are reassigned • Process stops when heap becomes empty
MinEX Refinement Phase • Refinement proceeds in reverse order from contraction by popping vertex pairs off the stack • Reassignment of each refined vertex is considered and the partitioning process is restarted • Vertex lookup and merge values are reset by following the merge chain when edges are accessed (if lookup > vTot)
Analysis of ThroTTle Values (P=32) [Figure: two panels, "Expected MaxQWgt Varying ThroTTle" and "Expected LoadImb Varying ThroTTle", each plotted against ThroTTle values]
Latency Tolerance Approach
• Move data sets and edge data first
• Achieve latency tolerance by overlapping processing with communication
  1. Send data sets to be moved
  2. Send edge data
  3. Process vertices not waiting for edge communication
  4. Receive, unpack remapped data sets
  5. Receive, unpack communication data
  6. Repeat steps 2-5 until all vertices are processed
• Optimistic view: Processing completely hides the latency
• Pessimistic view: No latency hiding occurs
• Application passes the latency hiding function to MinEX
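The two views can be written as alternative latency hiding functions. The slides do not give the function's signature, so everything below is a hypothetical sketch: each function takes a processor's total processing, communication, and redistribution costs and returns the time that processor is expected to need. Modeling the optimistic view as the maximum of the overlapped costs is an assumption of this example.

```python
def pessimistic(processing, communication, remap):
    """No latency hiding: every cost is paid one after the other."""
    return processing + communication + remap


def optimistic(processing, communication, remap):
    """Full overlap: transfers cost nothing while processing can cover them."""
    return max(processing, communication + remap)


def expected_runtime(loads, hide=pessimistic):
    """Longest per-processor time under the chosen latency hiding function.

    loads : list of (processing, communication, remap) totals, one per processor
    hide  : latency hiding function supplied by the application
    """
    return max(hide(*load) for load in loads)
```

Passing the function as a parameter keeps the partitioner independent of how much overlap a particular application can actually achieve.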
Experimental Study: Simulation of an IPG Environment • Configuration File defines clusters, processors, and interconnect slowdowns • Processors in a cluster are assumed homogeneous • Connect(c1, c2) = interconnect slowdown between clusters c1 and c2 (unity for no slowdown) • If c1 = c2, Connect(c1, c2) = intraconnect slowdown • Proc_c represents the processing slowdown (normalized to unity) within cluster c • Configuration File mapped to processing graph by MinEX so actual vertex assignments in the distributed environment can be modeled
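A configuration of this kind might be turned into lookup tables as sketched below. The actual file format is not shown in the slides, so the dictionary keys (`clusters`, `processors`, `slowdown`, `intraconnect`, `interconnect`) and the use of a single interconnect value for every pair of clusters are assumptions made for illustration.

```python
def build_model(config):
    """Derive the processor-to-cluster map, Proc_c, and the Connect matrix.

    config : dict with hypothetical keys
        "clusters"     : list of {"processors": int, "slowdown": float}
        "intraconnect" : slowdown for communication inside a cluster
        "interconnect" : slowdown for communication between clusters
    """
    cluster_of = []   # cluster_of[p] = cluster index of processor p
    proc = []         # proc[c] = processing slowdown of cluster c
    for c, cluster in enumerate(config["clusters"]):
        proc.append(cluster["slowdown"])
        cluster_of.extend([c] * cluster["processors"])
    n = len(proc)
    connect = [[config["intraconnect"] if a == b else config["interconnect"]
                for b in range(n)] for a in range(n)]
    return cluster_of, proc, connect
```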
Test Application: Unstructured Adaptive Mesh • Time-dependent shock wave propagated through a cylindrical volume • Tetrahedral mesh discretization • Coarsen previously refined elements • Mesh grows from 50K to 1.8M tets over nine adaptation levels • Workload becomes unbalanced as mesh is adapted
Characteristics Of Test Application • Mesh elements interact only with immediate neighbors • High communication and remapping costs • Numerical solver not included
MinEX Partitioner Performance • SBN: Dynamic load-balancer based on Symmetric Broadcast Network that was adapted for mesh applications • PLUM: Semi-dynamic framework for processing adaptive, unstructured meshes • MinEX comparisons with SBN and PLUM:
Experimental Results (P=32) [Figure: two panels, "Expected runtimes (no latency tolerance)" and "Expected runtimes (maximum latency tolerance)", each plotted against interconnect slowdowns; runtimes in thousands of units]
Conclusions & Ongoing Research • Introduced a new partitioner called MinEX and experimented in simulated IPG environments • Runtimes increase with larger slowdowns as clusters are added • Additional clusters increase benefits of latency tolerance • Estimated runtimes with MinEX improved by a factor of five over no partitioning • Currently applying MinEX to the N-body problem (Barnes-Hut algorithm)