Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems Daniel J. Harvey Department of Computer Science Southern Oregon University E-mail: harveyd@sou.edu Sajal K. Das Department of Computer Science and Engineering The University of Texas at Arlington E-mail: das@cse.uta.edu Rupak Biswas NASA Ames Research Center E-mail: rbiswas@nas.nasa.gov
Presentation Overview • The information power grid (IPG) • The MinEX partitioner • This paper’s contributions • Metrics utilized • The N-Body problem • MinEX refinements • Experimental study • Performance results • Conclusions and on-going research
The Information Power Grid (IPG) • Harness power of geographically separated resources • Developed by NASA and other collaborative partners • Utilize geographically separated processors to solve large-scale computational problems • Characteristics • limited bandwidth and high latency • heterogeneous configurations • Relevant applications identified by I-Way experiment • Remote access to large databases requiring high-end graphics • Remote virtual reality access to instruments • Remote interactions with super-computer simulations
Load Balancing Approaches • Especially important in grid environments • Traditional load-balancing objectives • Distribute workload evenly among processors • Minimize idle time • Static load balancing • Balance load prior to execution • Examples: smart compilers, schedulers • Dynamic load balancing • Balance as application is processed • Examples: adaptive contracting, gradient, symmetric broadcast networks • Semi-dynamic load balancing (our focus in this paper) • Temporarily stop application processing to balance workload • Utilizes a partitioning technique • Examples: MeTiS, Jostle, PLUM
The MinEX Partitioner • We previously introduced a novel partitioner called MinEX • MinEX: A latency-tolerant dynamic partitioner for grid computing applications, FGCS, 18 (2002), pp. 477-489 • MinEX’s unique characteristics include • Environment: designed specifically for heterogeneous, geographically distributed environments • Grid: maps configuration graph onto the partition graph; produces partitions reflecting the grid • Goal: minimize runtime rather than balance processing workload and minimize edge cut • Latency: accounts for latency tolerance during partitioning • Accounts for: data movement and communication overhead
This Paper’s Contributions • Compare MinEX performance to METIS, a state-of-the-art partitioner • Result: Speed of execution is competitive • Result: Quality of partitions reduces application runtime by up to a factor of 6 • Estimate performance utilizing a wide range of heterogeneous grid configurations • Apply MinEX to a real-life application (the N-Body problem) executing in simulated grid environments • Introduce refinements to our initial algorithm
The MinEX Partitioner • Multi-level scheme • Collapse edges incrementally • Partitions the contracted graph • Refines the graph in reverse • Reassignments executed to improve partition quality • Creates diffusive or from scratch partitions • User-supplied function estimates solver latency tolerance • Accounts for data redistribution cost during partitioning
Metrics Utilized • Processing weight: $Wgt_p^v = PWgt_v \times Proc_c$ • Communication cost: $Comm_p^v = \sum_{w \in p} CWgt(v,w) \times Connect(c,d)$ • Redistribution cost: $Remap_p^v = RWgt_v \times Connect(c,d)$ if $p \neq q$ • Weighted queue length: $QWgt(p) = \sum_{v \in p} (Wgt_p^v + Comm_p^v + Remap_p^v)$ • Heaviest load: MaxQWgt • Queue length: $QLen_p$ = number of vertices $v \in p$ • Average load: WSysLL • Total system load: $QWgtToT = \sum_{p \in P} QWgt(p)$ • Imbalance factor: $LoadImb = MaxQWgt / WSysLL$
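The load metrics on this slide can be illustrated with a minimal Python sketch. It assumes the per-vertex costs Wgt, Comm, and Remap have already been computed, and it assumes WSysLL is the plain mean of the per-processor QWgt values (the slide only calls it the "average load"). The names `VertexCost` and `load_metrics` are hypothetical and are not part of MinEX.

```python
from dataclasses import dataclass


@dataclass
class VertexCost:
    """Precomputed costs of one vertex on its assigned processor."""
    wgt: float    # processing weight (Wgt)
    comm: float   # communication cost (Comm)
    remap: float  # redistribution cost (Remap); 0 if the vertex stays put


def qwgt(vertices):
    """Weighted queue length of one processor: sum of Wgt + Comm + Remap."""
    return sum(v.wgt + v.comm + v.remap for v in vertices)


def load_metrics(assignment):
    """assignment maps a processor id to the list of VertexCost it holds."""
    q = {p: qwgt(vs) for p, vs in assignment.items()}
    total = sum(q.values())      # QWgtToT
    max_q = max(q.values())      # MaxQWgt
    wsysll = total / len(q)      # WSysLL (assumed here: unweighted mean)
    return {
        "QWgt": q,
        "QWgtToT": total,
        "MaxQWgt": max_q,
        "WSysLL": wsysll,
        "LoadImb": max_q / wsysll,
    }


example = {
    0: [VertexCost(4.0, 1.0, 0.0), VertexCost(2.0, 1.0, 0.0)],
    1: [VertexCost(3.0, 0.5, 0.5)],
}
metrics = load_metrics(example)
```

In this example processor 0 carries twice the load of processor 1, so LoadImb comes out above 1; a perfectly balanced assignment would give exactly 1.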
MinVar, Gain, and ThroTTle • Processor workload variance from WSysLL: $Var = \sum_p (QWgt(p) - WSysLL)^2$ • $\Delta Var$ is the change in Var after a vertex reassignment; a positive value means that Var has increased, a negative value that the balance has improved • Gain is the change ($\Delta QWgtToT$) in total system load resulting from a vertex reassignment • ThroTTle is a user-defined parameter: if Gain > 0, vertex moves that improve Var ($\Delta Var < 0$) are allowed only if $Gain^2 / (-\Delta Var) \le ThroTTle$
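As a rough illustration of how Var, Gain, and ThroTTle interact, the sketch below evaluates one tentative move from the per-processor loads before and after it. It assumes WSysLL is the mean load and is recomputed for each state, which the slide does not specify; `variance` and `move_allowed` are hypothetical helper names.

```python
def variance(qwgts, wsysll):
    """Var: sum over processors of (QWgt(p) - WSysLL)^2."""
    return sum((q - wsysll) ** 2 for q in qwgts)


def move_allowed(old_q, new_q, throttle):
    """Decide whether a tentative vertex move is acceptable.

    old_q / new_q: per-processor QWgt values before / after the move.
    """
    # Assumption: WSysLL is the mean load of the state being evaluated.
    old_avg = sum(old_q) / len(old_q)
    new_avg = sum(new_q) / len(new_q)
    d_var = variance(new_q, new_avg) - variance(old_q, old_avg)
    gain = sum(new_q) - sum(old_q)  # change in total system load

    if d_var >= 0:
        return False  # the move does not improve the variance
    if gain > 0 and gain ** 2 / -d_var > throttle:
        return False  # extra total load is too costly for the improvement
    return True


# Moving work from a loaded processor adds 1 unit of total load
# but sharply reduces the variance.
loose = move_allowed([10, 2], [7, 6], throttle=0.1)
strict = move_allowed([10, 2], [7, 6], throttle=0.01)
```

A larger ThroTTle tolerates moves that raise the total system load in exchange for better balance, while a small value accepts almost none of them.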
The N-Body Problem • Classical problem of simulating the movement of a set of bodies • Based upon gravitational or electrostatic forces • Iterates over a series of time steps • At each step, for each body • Compute forces from all other bodies using the gravitational laws • Calculate acceleration and integrate twice to compute the position at the next time step • If all the force calculations are performed, $O(n^2)$ computations are required at each time step.
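The direct $O(n^2)$ approach can be sketched in a few lines of Python. This is only an illustration: it assumes a simple Euler-style integrator, a gravitational constant of 1 in arbitrary units, and no softening term, none of which come from the slides. The function name `nbody_step` is hypothetical.

```python
import math

G = 1.0  # gravitational constant in arbitrary units (assumption)


def nbody_step(pos, vel, mass, dt):
    """Advance all bodies by one time step using all-pairs forces.

    pos, vel: lists of [x, y, z]; mass: list of floats.
    Bodies must not coincide (there is no softening term).
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):              # O(n^2): every body against every other
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            r = math.sqrt(r2)
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] / (r2 * r)

    # Integrate twice: acceleration -> velocity -> position.
    new_vel = [[vel[i][k] + acc[i][k] * dt for k in range(3)] for i in range(n)]
    new_pos = [[pos[i][k] + new_vel[i][k] * dt for k in range(3)] for i in range(n)]
    return new_pos, new_vel


p, v = nbody_step(
    pos=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    vel=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    mass=[1.0, 1.0],
    dt=0.1,
)
```

With two equal masses at rest, the bodies move toward each other by the same amount, so total momentum stays zero.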
Barnes & Hut Solution (Framework for experiments) • Reduces computational complexity from $O(n^2)$ to $O(n \lg n)$ • Clusters of bodies that are far from a cell are treated as a single body using the total mass and the center-of-mass position • Cell Cv is considered far from Cell Cw if the size of the cell divided by the distance between the cells is less than a constant F • Our implementation (for each time step) • Create the octree of cells • Form a graph using the cells of the octree • Partition the graph and distribute the cells to be relocated among processors • Run the solver
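The two ingredients of the Barnes & Hut approximation mentioned above, the far test and the single-body replacement, might look as follows. This is a sketch: the slide does not say which cell's size enters the ratio, so `cell_size` is left as a caller-supplied parameter, and the names `is_far` and `aggregate` are hypothetical.

```python
import math


def is_far(cell_size, center_a, center_b, threshold):
    """True if cell_size / distance between the cells is below the threshold."""
    dist = math.dist(center_a, center_b)
    return dist > 0 and cell_size / dist < threshold


def aggregate(bodies):
    """Replace a cluster by one body: total mass and center of mass.

    bodies: list of (mass, (x, y, z)) tuples.
    """
    total = sum(m for m, _ in bodies)
    com = tuple(sum(m * p[k] for m, p in bodies) / total for k in range(3))
    return total, com


total_mass, center = aggregate([(1.0, (0.0, 0.0, 0.0)), (3.0, (4.0, 0.0, 0.0))])
far = is_far(1.0, (0.0, 0.0, 0.0), (4.0, 0.0, 0.0), threshold=0.5)
near = is_far(1.0, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), threshold=0.5)
```

A smaller threshold treats fewer cells as far, which increases accuracy at the cost of more direct interactions.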
The Partitioning Graph • Each vertex v in the partitioning graph corresponds to a leaf cell $C_v$, containing $|C_v|$ bodies, in the N-Body octree and has two associated weights: $PWgt_v$ models the computations associated with the bodies, and $RWgt_v$ represents the data redistribution cost • $PWgt_v = |C_v| \times (|C_v| - 1 + CloseB + Far_v + 2)$ • $RWgt_v = |C_v|$ • Each edge weight $CWgt(v,w)$ models the communication cost between cells $C_v$ and $C_w$ • $CWgt(v,w) = |C_w|$ if $C_w$ is close to $C_v$; 0 otherwise.
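The weight formulas above translate directly into code. In the sketch below, `close_b` and `far` stand for the CloseB and Far terms of the slide, whose exact definitions are not given here, so they are treated as opaque counts supplied by the caller; the function names are hypothetical.

```python
def pwgt(n_bodies, close_b, far):
    """Processing weight: |Cv| * (|Cv| - 1 + CloseB + Far + 2)."""
    return n_bodies * (n_bodies - 1 + close_b + far + 2)


def rwgt(n_bodies):
    """Redistribution weight: the number of bodies in the cell."""
    return n_bodies


def cwgt(n_bodies_w, w_is_close_to_v):
    """Edge weight CWgt(v, w): |Cw| if Cw is close to Cv, else 0."""
    return n_bodies_w if w_is_close_to_v else 0


# Two mutually close cells holding 3 and 5 bodies.
size_v, size_w = 3, 5
edge_vw = cwgt(size_w, True)   # weight on the edge v -> w
edge_wv = cwgt(size_v, True)   # weight on the edge w -> v
work_v = pwgt(size_v, close_b=4, far=2)
```

Because the edge weight depends on the size of the destination cell, the two directions of an edge generally carry different weights, which makes the graph directed.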
Graph Modifications • METIS limitations • Cannot operate on directed graphs • Cannot tolerate edge weights of zero • N-Body graph • CWgt(v,w) can differ from CWgt(w,v) because |Cv| may not equal |Cw| • CWgt(v,w) can equal 0 if Cv is close to Cw but Cw is far from Cv • For direct comparisons, experiments are run using • Original N-Body graph (Graph G) • Modified graph (Graph Gm)
MinEX Basic Partition Criteria • Minimize MaxQWgt rather than balance processor workloads • Collapse edges that result in the best Gain value using a min-heap • Call the user-defined latency tolerance function to estimate latency tolerance • Move vertices from overloaded processors (QWgt(p) > WSysLL) to underloaded processors (QWgt(p) < WSysLL) • Reject potential reassignments that (i) have a positive $\Delta Var$ or (ii) are rejected by the reassignment filter function
Reassignment Filter Function • Goal: avoid unnecessary edge processing and reject deleterious reassignments that cause increased edge processing • Projects newQWgt, $\Delta Var$, and newGain • Vertex totals used: edge weights (same cluster), edge weights (other clusters), local edge weights, total outgoing edge weight, relocation and processing weights • Filter steps, applied in order • IF newQWgt(from) > QWgt(from): reject assignment • IF newQWgt(to) < QWgt(to): reject assignment • IF $\Delta Var \ge 0$: reject assignment • IF newGain > 0 AND $newGain^2 / (-\Delta Var) > ThroTTle$: reject assignment • Dnew = newQWgt(from) - newQWgt(to); Dold = QWgt(from) - QWgt(to) • IF |Dnew| > |Dold|: reject assignment when newQWgt(from) < QWgt(to) or newQWgt(to) > QWgt(from) • Otherwise the assignment passes the filter
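The filter's sequence of checks can be written as a single predicate. The sketch below is one reading of the slide: it assumes the difference test compares |Dnew| against |Dold| and that the last two checks apply only when that test holds. The function name and argument names are hypothetical, and the projected values are taken as inputs rather than computed from edge totals.

```python
def passes_filter(q_from, q_to, new_q_from, new_q_to, d_var, new_gain, throttle):
    """Return True if a projected vertex reassignment passes the filter.

    q_from / q_to: current QWgt of the source / destination processor.
    new_q_from / new_q_to: projected QWgt values after the move.
    d_var: projected change in Var; new_gain: projected change in total load.
    """
    if new_q_from > q_from:
        return False  # the source processor would get heavier
    if new_q_to < q_to:
        return False  # the destination processor would get lighter
    if d_var >= 0:
        return False  # no variance improvement
    if new_gain > 0 and new_gain ** 2 / -d_var > throttle:
        return False  # total load grows too much for the improvement

    d_new = new_q_from - new_q_to
    d_old = q_from - q_to
    if abs(d_new) > abs(d_old):
        # Assumed nesting: only checked when the load gap widens.
        if new_q_from < q_to:
            return False
        if new_q_to > q_from:
            return False
    return True


accepted = passes_filter(10, 2, 7, 5, d_var=-10.0, new_gain=0.0, throttle=1.0)
rejected = passes_filter(10, 2, 11, 3, d_var=-1.0, new_gain=0.0, throttle=1.0)
```

Every check uses only a handful of precomputed scalars, which fits the stated goal of rejecting poor moves before any per-edge work is done.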
Additional Refinements (to enhance performance) • Graph contraction phase • Bucket sort vertices by processor • Quickly find candidates for merging • Maintain a list of processors sorted by QWgt • Few processors change position after vertex moves • Maintaining this list incurs minimal overhead • User-defined latency tolerance function (called before each potential reassignment) • double MinEX(User *user, Ipg *ipg, Qtot *tot) • user = user options passed to the partitioner • ipg = grid configuration graph • tot contains Pprocp, Commp, Remapp, QLenp
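One way to keep processors ordered by QWgt at low cost is to remove and reinsert only the processor whose load changed. The sketch below uses Python's `bisect` module on a sorted list of (QWgt, processor) pairs; it is an illustration of the idea, not the data structure MinEX actually uses, and the class name `ProcessorOrder` is hypothetical.

```python
import bisect


class ProcessorOrder:
    """Processors kept sorted by QWgt; updates touch a single entry."""

    def __init__(self, qwgt):
        self.qwgt = dict(qwgt)                       # processor id -> QWgt
        self.order = sorted((q, p) for p, q in self.qwgt.items())

    def update(self, p, new_q):
        """Record a new QWgt for processor p after a vertex move."""
        i = bisect.bisect_left(self.order, (self.qwgt[p], p))
        del self.order[i]                            # drop the stale entry
        bisect.insort(self.order, (new_q, p))        # reinsert in sorted position
        self.qwgt[p] = new_q

    def heaviest(self):
        return self.order[-1][1]

    def lightest(self):
        return self.order[0][1]


procs = ProcessorOrder({0: 8.0, 1: 4.0, 2: 6.0})
first_heaviest = procs.heaviest()
procs.update(0, 3.0)   # processor 0 gave away most of its load
```

Since a vertex move changes the load of only two processors, two such updates suffice to restore the ordering after each move.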
Experimental Study: Simulation of a Grid Environment • Simulated grid environment vs. actual grids • Low-cost alternative to constructing a wide range of heterogeneous configurations • Limited grid facilities are available in the field and are usually homogeneous • Methodology • Discrete time simulation • Utilize configuration graph to model processing speed, communication latency, and bandwidth • Configurations (Processors = 32, 64, 128; Interconnect slowdowns = 10, 100; Clusters = 4, 8) • HO: Constant processing and intra-communication capability • UP: Faster processors have faster intra-communication capability • DN: Faster processors have slower intra-communication capability
Reassignment Filter Effectiveness • The reassignment filter eliminates virtually all overhead associated with vertex moves that are rejected • Almost all assignments passing the filter were accepted
Scalability Test (Scales well to 128 processors) • P varied between 8 and 1024, runtimes compared
ThroTTle Test (Initially Improves as throttle increases until curve flattens out)
Multiple Time Step Test (P=64, I=10, C=8, B=16K) • Running multiple iterations does not significantly impact the results • The rest of the experiments will be based on a single time step
Partitioner Speed Comparisons • MinEX has the advantage for P=32 and P=64 • METIS has the advantage for P=1k • Overall, MinEX is competitive
Partition Quality Comparisons (C=8) • MinEX and METIS show similar results for Homogeneous configurations. • Heterogeneous configurations show clear advantage to MinEX
Partition Quality Comparisons (C=8) • Similar results to I=10 experiments • MinEX-Gm results are in general somewhat worse than MinEX-G because of less accurate application modeling • METIS results are significantly worse than MinEX, but by a smaller margin than with faster interconnects; a slower interconnect makes the grid more homogeneous
Partition Quality Comparisons: Additional Observations • DN configuration results are similar to UP experiments with a few exceptions • DN runs are worse than the UP runs in a few cases (998 vs. 1489 if P=128, C=4, I=100, B=64K) • MinEX projected 975, but converged to 1489 • When simulating a second input channel, the solver converges at 975 for DN; no such improvement for METIS • HO runs with P=32 and 64, I=100, B=256K give METIS an advantage (7399 to 5199 and 4231 to 3334, respectively) • MinEX is converging tightly (LoadImb=1.0001) to a high value • Perhaps the criteria for reassignments need to be further refined.
Conclusions • Direct comparisons between MinEX and METIS • MinEX produces partitions that reduce runtime by up to a factor of 6 in highly heterogeneous grids • MinEX and METIS are competitive in homogeneous grids • MinEX is competitive with METIS in speed of execution • Implemented performance refinements to MinEX • The reassignment filter minimizes overhead associated with potential reassignments that are rejected • Sorting processors by QWgt speeds up partitioning decisions • A bucket sort speeds up finding edges to collapse • MinEX can partition directed graphs • Not commonly allowed by current partitioners • Account for latency tolerance during partitioning • Established the benefit and feasibility of this approach • N-body solver implementation • Uses the partitioning and message-passing model.
On-going Research • MinEX refinements • Analyze the effect of using multiple I/O channels and network dynamics • Refine the method of selecting vertices for reassignment • Refine the discrete time simulator • Develop a general-purpose tool for simulating heterogeneous grids • Establish the accuracy of the simulator by comparing its projections to the performance of applications running on real parallel systems