
Daniel J. Harvey Department of Computer Science Southern Oregon University

Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems. Daniel J. Harvey Department of Computer Science Southern Oregon University E-mail: harveyd@sou.edu Sajal K. Das Department of Computer Science and Engineering


Presentation Transcript


  1. Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems Daniel J. Harvey Department of Computer Science Southern Oregon University E-mail: harveyd@sou.edu Sajal K. Das Department of Computer Science and Engineering The University of Texas at Arlington E-mail: das@cse.uta.edu Rupak Biswas NASA Ames Research Center E-mail: rbiswas@nas.nasa.gov

  2. Presentation Overview • The information power grid (IPG) • The MinEX partitioner • This paper’s contributions • Metrics utilized • The N-Body problem • MinEX refinements • Experimental study • Performance results • Conclusions and on-going research

  3. The Information Power Grid (IPG) • Harness power of geographically separated resources • Developed by NASA and other collaborative partners • Utilize geographically separated processors to solve large-scale computational problems • Characteristics • limited bandwidth and high latency • heterogeneous configurations • Relevant applications identified by I-Way experiment • Remote access to large databases requiring high-end graphics • Remote virtual reality access to instruments • Remote interactions with super-computer simulations

  4. Load Balancing Approaches • Especially important in grid environments • Traditional load-balancing objectives: distribute workload evenly among processors; minimize idle time • Static load-balancing • Balance load prior to execution • Examples: smart compilers, schedulers • Dynamic load-balancing • Balance as application is processed • Examples: adaptive contracting, gradient, symmetric broadcast networks • Semi-dynamic load-balancing (our focus in this paper) • Temporarily stop application processing to balance workload • Utilizes a partitioning technique • Examples: MeTiS, Jostle, PLUM

  5. The MinEX Partitioner • We previously introduced a novel partitioner called MinEX • MinEX: A latency-tolerant dynamic partitioner for grid computing applications, FGCS, 18 (2002), pp. 477-489 • MinEX's unique characteristics include • Environment: designed specifically for heterogeneous, geographically distributed environments • Grid: maps configuration graph onto the partition graph; produces partitions reflecting the grid • Goal: minimize runtime rather than balance processing workload and minimize edge cut • Latency: accounts for latency tolerance during partitioning • Accounts for: data movement and communication overhead

  6. This Paper's Contributions • Compare MinEX performance to METIS, a state-of-the-art partitioner • Result: Speed of execution is competitive • Result: Quality of partitions reduces application runtime by up to a factor of 6 • Estimate performance utilizing a wide range of heterogeneous grid configurations • Apply MinEX to a real-life application (the N-Body problem) executing in simulated grid environments • Introduce refinements to our initial algorithm

  7. The MinEX Partitioner • Multi-level scheme • Collapses edges incrementally • Partitions the contracted graph • Refines the graph in reverse • Reassignments are executed to improve partition quality • Creates diffusive or from-scratch partitions • User-supplied function estimates solver latency tolerance • Accounts for data redistribution cost during partitioning

  8. Metrics Utilized • Processing weight: $Wgt_v^p = PWgt_v \times Proc_c$ • Communication cost: $Comm_v^p = \sum_{w \notin p} CWgt(v,w) \times Connect(c,d)$ • Redistribution cost: $Remap_v^p = RWgt_v \times Connect(c,d)$ if $p \neq q$ • Weighted queue length: $QWgt(p) = \sum_{v \in p} (Wgt_v^p + Comm_v^p + Remap_v^p)$ • Queue length: $QLen_p$ = number of vertices $v \in p$ • Heaviest load: MaxQWgt • Average load: WSysLL • Total system load: $QWgtToT = \sum_{p \in P} QWgt(p)$ • Imbalance factor: $LoadImb = MaxQWgt / WSysLL$
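
The aggregate metrics on this slide can be illustrated with a small sketch. It assumes the per-vertex costs (Wgt, Comm, Remap) have already been computed, and it assumes WSysLL is the plain mean QWgtToT / P, which the slide does not state explicitly. The function name `load_metrics` and the input layout are hypothetical.

```python
def load_metrics(costs_by_proc):
    """Aggregate per-vertex costs into the slide's load metrics.

    costs_by_proc: one list per processor; each entry is a
    (wgt, comm, remap) tuple for a vertex assigned to that processor.
    """
    # QWgt(p): sum of Wgt + Comm + Remap over the vertices on p
    qwgt = [sum(w + c + r for (w, c, r) in verts) for verts in costs_by_proc]
    # QLen_p: number of vertices on p
    qlen = [len(verts) for verts in costs_by_proc]
    qwgt_tot = sum(qwgt)                # QWgtToT
    wsysll = qwgt_tot / len(qwgt)       # assumption: average load = total / P
    max_qwgt = max(qwgt)                # MaxQWgt
    return {
        "QWgt": qwgt,
        "QLen": qlen,
        "QWgtToT": qwgt_tot,
        "WSysLL": wsysll,
        "MaxQWgt": max_qwgt,
        "LoadImb": max_qwgt / wsysll,
    }


if __name__ == "__main__":
    example = [[(4, 1, 0), (3, 0, 0)], [(2, 0, 0)], [(1, 1, 0)]]
    print(load_metrics(example))
```

A LoadImb of 1.0 would mean the heaviest processor carries exactly the average load; larger values indicate a more loaded bottleneck processor.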

  9. MinVar, Gain, and ThroTTle • Processor workload variance from WSysLL: $Var = \sum_p (QWgt(p) - WSysLL)^2$ • $\Delta Var$ reflects the change in Var after a vertex reassignment; a positive value implies that Var has increased • Gain is the change ($\Delta QWgtToT$) in total system load resulting from a vertex reassignment • ThroTTle is a user-defined parameter: if Gain > 0, vertex moves that reduce Var ($\Delta Var < 0$) are allowed only if $Gain^2 / (-\Delta Var) \le ThroTTle$
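
As a rough illustration of the ThroTTle rule, the sketch below computes Var and applies the test to a single candidate move. The function names are hypothetical, and how Gain and the change in Var are projected for a real vertex move is outside its scope.

```python
def variance(qwgt, wsysll):
    """Var: sum over processors of (QWgt(p) - WSysLL)^2."""
    return sum((q - wsysll) ** 2 for q in qwgt)


def throttle_allows(gain, dvar, throttle):
    """Decide whether a candidate move is allowed under the ThroTTle rule.

    gain:     change in total system load caused by the move
    dvar:     change in Var caused by the move (negative = improvement)
    throttle: user-defined ThroTTle parameter
    """
    if dvar >= 0:
        return False          # the move does not reduce the variance
    if gain <= 0:
        return True           # total load does not grow, nothing to throttle
    # total load grows: accept only if the variance improvement justifies it
    return gain * gain / -dvar <= throttle


if __name__ == "__main__":
    print(variance([8, 2, 2], 4))
    print(throttle_allows(gain=2, dvar=-8, throttle=1))
    print(throttle_allows(gain=4, dvar=-8, throttle=1))
```

A larger ThroTTle value lets the partitioner trade more added system load for a given reduction in variance.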

  10. The N-Body Problem • Classical problem of simulating the movement of a set of bodies • Based upon gravitational or electrostatic forces • Iterates over a series of time steps • At each step, for each body • Compute forces from all other bodies using the gravitational laws • Calculate acceleration and integrate twice to compute the position at the next time step • If all the force calculations are performed, $O(n^2)$ computations are required at each time step
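
The all-pairs time step described on this slide might be sketched as follows. This is a minimal illustration, not the authors' solver: it assumes Newtonian gravity with G = 1 by default, a small softening term `eps` to avoid division by zero, and a simple Euler scheme for the two integrations (velocity first, then position).

```python
import math


def direct_step(pos, vel, mass, dt, g=1.0, eps=1e-9):
    """Advance all bodies by one time step with O(n^2) force evaluation.

    pos, vel: lists of [x, y, z] per body; mass: list of masses.
    Returns (new_pos, new_vel) without modifying the inputs.
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d) + eps      # softened squared distance
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                # acceleration of body i due to body j
                acc[i][k] += g * mass[j] * d[k] * inv_r3
    # first integration: acceleration -> velocity
    new_vel = [[vel[i][k] + acc[i][k] * dt for k in range(3)] for i in range(n)]
    # second integration: velocity -> position
    new_pos = [[pos[i][k] + new_vel[i][k] * dt for k in range(3)] for i in range(n)]
    return new_pos, new_vel


if __name__ == "__main__":
    p, v = direct_step([[-1, 0, 0], [1, 0, 0]], [[0, 0, 0], [0, 0, 0]], [1, 1], dt=1.0)
    print(p, v)
```

The doubly nested loop over bodies is what makes the per-step cost quadratic in the number of bodies.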

  11. Barnes & Hut Solution (Framework for experiments) • Reduces computational complexity from $O(n^2)$ to $O(n \lg n)$ • Clusters of bodies that are far from a cell are treated as a single body using the total mass and the center-of-mass position • Cell $C_v$ is considered far from cell $C_w$ if the size of the cell divided by the distance between the cells is less than a constant $\Phi$ • Our implementation (for each time step) • Create the octree of cells • Form a graph using the cells of the octree • Partition the graph, distribute cells to be relocated among processors • Run the solver
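
The "far" criterion can be written as a one-line test. The sketch below assumes that "the size of the cell" refers to the first cell's edge length and that the distance is measured between cell centers; neither detail is stated on the slide, and the function name `is_far` is hypothetical.

```python
import math


def is_far(cell_size, center, other_center, threshold):
    """Return True if a cell of edge length cell_size centered at `center`
    counts as far from the cell centered at `other_center`.

    The test is size / distance < threshold, the slide's constant.
    """
    dist = math.dist(center, other_center)
    return dist > 0 and cell_size / dist < threshold


if __name__ == "__main__":
    # A small cell and a large cell 10 units apart
    print(is_far(1.0, (0, 0, 0), (10, 0, 0), 0.5))   # small cell's view
    print(is_far(8.0, (10, 0, 0), (0, 0, 0), 0.5))   # large cell's view
```

Because the test depends on the size of one cell only, the relation is not symmetric under this reading: a small cell can be far from a large neighbor while the large cell is still close to the small one.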

  12. The Partitioning Graph • Each vertex v in the partitioning graph corresponds to a leaf cell $C_v$, containing $|C_v|$ bodies, in the N-Body octree and has two associated weights: $PWgt_v$ models the computation associated with the cell's bodies, and $RWgt_v$ represents the data redistribution cost • $PWgt_v = |C_v| \times (|C_v| - 1 + CloseB + Far_v + 2)$ • $RWgt_v = |C_v|$ • Each edge (v,w) has a weight $CWgt(v,w)$ that models the communication cost between cells $C_v$ and $C_w$ • $CWgt(v,w) = |C_w|$ if $C_w$ is close to $C_v$; 0 otherwise
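
The three weight formulas translate directly into code. In this sketch the CloseB and Far terms are taken as precomputed inputs, since the slide does not define how they are counted, and the closeness decision is passed in as a flag. All function and parameter names are hypothetical.

```python
def pwgt(n_bodies, close_b, far_v):
    """Processing weight: |Cv| * (|Cv| - 1 + CloseB + Far_v + 2)."""
    return n_bodies * (n_bodies - 1 + close_b + far_v + 2)


def rwgt(n_bodies):
    """Redistribution weight: |Cv|, the number of bodies in the cell."""
    return n_bodies


def cwgt(n_bodies_w, cells_are_close):
    """Edge weight (v, w): |Cw| when the two cells are close, else 0."""
    return n_bodies_w if cells_are_close else 0


if __name__ == "__main__":
    # A leaf cell with 4 bodies, CloseB = 10, Far_v = 3
    print(pwgt(4, 10, 3), rwgt(4))
    print(cwgt(5, True), cwgt(5, False))
```

With 4 bodies, CloseB = 10, and Far_v = 3, the processing weight is 4 x (3 + 10 + 3 + 2) = 72.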

  13. Graph Modifications • METIS limitations • Cannot operate on directed graphs • Cannot tolerate edge weights of zero • N-Body graph • $CWgt(v,w)$ can differ from $CWgt(w,v)$ because $|C_v|$ may not equal $|C_w|$ • $CWgt(v,w)$ can equal 0 if $C_v$ is close to $C_w$ but $C_w$ is far from $C_v$ • For direct comparisons, experiments are run using • Original N-Body graph (Graph G) • Modified graph (Graph Gm)
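
The slides do not say how Gm is derived from G. Purely as an assumption, one plausible construction merges the two directed weights of each vertex pair into a single undirected weight and raises any resulting zero to a minimum of 1, which removes both of the limitations listed above. The helper name `make_undirected` and the summing rule are hypothetical.

```python
def make_undirected(cwgt, min_weight=1):
    """Collapse a directed weighted graph into an undirected one.

    cwgt: dict mapping directed edges (v, w) to weights, with v != w.
    Returns a dict keyed by (min(v, w), max(v, w)). The two directions
    are summed (an assumption, not the authors' stated rule) and the
    result is floored at min_weight so that no edge has weight zero.
    """
    merged = {}
    for (v, w), weight in cwgt.items():
        key = (min(v, w), max(v, w))
        merged[key] = merged.get(key, 0) + weight
    return {edge: max(weight, min_weight) for edge, weight in merged.items()}


if __name__ == "__main__":
    g = {(0, 1): 3, (1, 0): 0, (1, 2): 0, (2, 1): 0}
    print(make_undirected(g))
```

Any such transformation changes the communication model the partitioner sees, which is one reason results on a modified graph can differ from results on the original.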

  14. MinEX Basic Partition Criteria • Minimize MaxQWgt rather than balance processor workloads • Collapse edges that result in the best Gain value using a min-heap • Call the user-defined latency tolerance function to estimate latency tolerance • Move vertices from overloaded processors ($QWgt(p) > WSysLL$) to underloaded processors ($QWgt(p) < WSysLL$) • Reject potential reassignments that (i) have a positive $\Delta Var$ or (ii) are rejected by the reassignment filter function

  15. Reassignment Filter Function • Goal: avoid unnecessary edge processing and reject deleterious reassignments that cause increased edge processing • Projects newQWgt, $\Delta Var$, and newGain • Vertex totals used: edge weights to the same cluster; edge weights to other clusters; local edge weights; total outgoing edge weight; relocation and processing weights • IF (newQWgt_from > QWgt_from) reject assignment • IF (newQWgt_to < QWgt_to) reject assignment • IF ($\Delta Var$ >= 0) reject assignment • IF newGain > 0 && newGain^2 / -$\Delta Var$ > ThroTTle reject assignment • Dnew = newQWgt_from - newQWgt_to; Dold = QWgt_from - QWgt_to • IF fabs(Dnew) > fabs(Dold): IF newQWgt_from < QWgt_to reject assignment; IF newQWgt_to > QWgt_from reject assignment • Otherwise the assignment passes the filter
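
The filter's tests can be collected into a single predicate, sketched below. Two readings of the flattened slide are assumptions here: the comparison `fabs(Dnew)>abs(Dnew)` is taken to mean |Dnew| > |Dold|, and the last two tests are taken to be nested under it. The projected values are passed in as arguments; computing them from the vertex totals is not shown.

```python
def passes_filter(qwgt_from, qwgt_to, new_from, new_to, dvar, new_gain, throttle):
    """Return True if a projected vertex reassignment passes the filter.

    qwgt_from, qwgt_to: current QWgt of the source and destination processors
    new_from, new_to:   projected QWgt of both processors after the move
    dvar:               projected change in Var
    new_gain:           projected change in total system load
    """
    if new_from > qwgt_from:          # source processor must not get heavier
        return False
    if new_to < qwgt_to:              # destination must not get lighter
        return False
    if dvar >= 0:                     # variance must improve
        return False
    if new_gain > 0 and new_gain ** 2 / -dvar > throttle:
        return False                  # added system load is not justified
    d_new = new_from - new_to
    d_old = qwgt_from - qwgt_to
    if abs(d_new) > abs(d_old):       # assumed reading: gap between the two grew
        if new_from < qwgt_to:        # assumed nesting of the last two tests
            return False
        if new_to > qwgt_from:
            return False
    return True


if __name__ == "__main__":
    print(passes_filter(10, 2, 8, 4, dvar=-5.0, new_gain=0.0, throttle=1.0))
    print(passes_filter(5, 4, 1, 9, dvar=-1.0, new_gain=0.0, throttle=1.0))
```

Every test uses only a handful of scalar values, so a rejected move costs a few comparisons rather than a pass over the vertex's edges.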

  16. Additional Refinements (to enhance performance) • Graph contraction phase • Bucket sort vertices by process • Quickly find candidates for merging • Maintain a list of processors sorted by QWgt • Few processors change position after vertex moves • Maintaining this list incurs minimal overhead • Defined a user-supplied latency tolerance function (called before each potential reassignment) • double MinEX(User *user, Ipg *ipg, Qtot *tot) • user = user options passed to the partitioner • ipg = grid configuration graph • tot contains $Pproc_p$, $Comm_p$, $Remap_p$, $QLen_p$
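
One way to keep processors ordered by QWgt with little overhead is to reposition only the processor whose load changed. The sketch below uses Python's `bisect` module for this; the class `ProcessorQueue` is hypothetical and is not the authors' data structure.

```python
import bisect


class ProcessorQueue:
    """Processors kept in ascending order of QWgt."""

    def __init__(self, qwgt):
        self.qwgt = list(qwgt)
        # sorted list of (QWgt, processor id) pairs
        self.order = sorted((w, p) for p, w in enumerate(self.qwgt))

    def update(self, p, new_wgt):
        """Change processor p's QWgt and reposition only that entry."""
        i = bisect.bisect_left(self.order, (self.qwgt[p], p))
        self.order.pop(i)                       # remove the stale entry
        self.qwgt[p] = new_wgt
        bisect.insort(self.order, (new_wgt, p)) # reinsert at the new rank

    def heaviest(self):
        return self.order[-1][1]

    def lightest(self):
        return self.order[0][1]


if __name__ == "__main__":
    q = ProcessorQueue([5.0, 9.0, 2.0])
    print(q.heaviest(), q.lightest())
    q.update(1, 1.0)    # a vertex move unloads processor 1
    print(q.heaviest(), q.lightest())
```

After a vertex move only the source and destination processors need an `update` call, so the rest of the ordering is left untouched.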

  17. Experimental Study: Simulation of a Grid Environment • Simulated grid environment vs. actual grids • Low-cost alternative to constructing a wide range of heterogeneous configurations • Limited grid facilities are available in the field and are usually homogeneous • Methodology • Discrete time simulation • Utilize configuration graph to model processing speed, communication latency, and bandwidth • Configurations (Processors = 32, 64, 128; Interconnect slowdowns = 10, 100; Clusters = 4, 8) • HO: Constant processing and intra-communication capability • UP: Faster processors have faster intra-communication capability • DN: Faster processors have slower intra-communication capability

  18. Reassignment Filter Effectiveness • Reassignment filter eliminates virtually all overhead associated with vertex moves that are rejected • Almost all assignments passing the filter were accepted

  19. Scalability Test (Scales well to 128 processors) • P varied between 8 and 1024; runtimes compared

  20. ThroTTle Test (Initially Improves as throttle increases until curve flattens out)

  21. Multiple Time Step Test (P=64, I=10, C=8, B=16K) • Running multiple iterations does not significantly impact the results • The rest of the experiments will be based on a single time step

  22. Partitioner Speed Comparisons • MinEX has the advantage for P=32 and P=64 • METIS has the advantage for P=1k • Overall, MinEX is competitive

  23. Partition Quality Comparisons (C=8) • MinEX and METIS show similar results for Homogeneous configurations. • Heterogeneous configurations show clear advantage to MinEX

  24. Partition Quality Comparisons (C=8) • Similar results to I=10 experiments • MinEX-Gm results are in general somewhat worse than MinEX-G because of less accurate application modeling • METIS results are significantly worse than MinEX, but by a smaller margin than with faster interconnects • Slower interconnect speed makes the grid more homogeneous

  25. Partition Quality Comparisons: Additional Observations • DN configuration results are similar to UP experiments with a few exceptions • DN runs are worse than the UP runs in a few cases (998 vs. 1489 if P=128, C=4, I=100, B=64K) • MinEX projected 975, but converged to 1489 • When simulating a second input channel, the solver converges at 975 for DN; no such improvement for METIS • HO runs with P=32 and 64, I=100, B=256K give METIS an advantage (7399 to 5199 and 4231 to 3334, respectively) • MinEX is converging tightly (LoadImb=1.0001) to a high value • Perhaps the criteria for reassignments need to be further refined

  26. Conclusions • Direct comparisons between MinEX and METIS • MinEX produces partitions that reduce runtime by up to a factor of 6 in highly heterogeneous grids • MinEX and METIS are competitive in homogeneous grids • MinEX is competitive with METIS in speed of execution • Implemented performance refinements to MinEX • The reassignment filter minimizes overhead associated with potential reassignments that are rejected • Sorting processors by QWgt speeds up partitioning decisions • A bucket sort speeds up finding edges to collapse • MinEX can partition directed graphs • Not commonly allowed by current partitioners • Account for latency tolerance during partitioning • Established the benefit and feasibility of this approach • N-body solver implementation • using the partitioning and message passing model

  27. On-going Research • MinEX refinements • Analyze effect of using multiple I/O channels and network dynamics • Refine the method of selecting vertices for reassignment • Refine the discrete time simulator • Develop a general-purpose tool for simulating heterogeneous grids • Establish the accuracy of the simulator by comparing its projections to the performance of applications running on real parallel systems
