Performance of a Heterogeneous Grid Partitioner for N-body Applications

Presentation Transcript


  1. Performance of a Heterogeneous Grid Partitioner for N-body Applications Daniel J. Harvey Department of Computer Science Southern Oregon University E-mail: harveyd@sou.edu Sajal K. Das Department of Computer Science and Engineering The University of Texas at Arlington E-mail: das@cse.uta.edu Rupak Biswas NASA Ames Research Center E-mail: rbiswas@nas.nasa.gov

  2. Presentation Overview • The information power grid (IPG) • The MinEX partitioner • This paper’s contributions • MinEX refinements • The N-Body problem • Experimental study • Performance results • Conclusions and on-going research

  3. The Information Power Grid (IPG) • Harnesses the power of geographically separated resources • Developed by NASA and other collaborative partners • Uses geographically separated processors to solve large-scale computational problems • Characteristics • Limited bandwidth and high latency • Heterogeneous configurations • Relevant applications identified by the I-Way experiment • Remote access to large databases requiring high-end graphics • Remote virtual-reality access to instruments • Remote interactions with supercomputer simulations

  4. The MinEX Partitioner • We previously introduced a novel partitioner called MinEX • MinEX: A latency-tolerant dynamic partitioner for grid computing applications, FGCS 18 (2002), pp. 477-489 • MinEX's unique characteristics include • Environment: designed specifically for heterogeneous, geographically distributed environments • Grid: maps the configuration graph onto the partition graph; produces partitions that reflect the grid • Goal: minimize runtime rather than balance processing workload and minimize edge cut • Latency: accounts for latency tolerance during partitioning • Accounts for data movement and communication overhead

  5. This Paper's Contributions • Evaluate MinEX performance over a wide range of heterogeneous grid configurations • Compare MinEX to METIS, a popular state-of-the-art partitioner • Run experiments using a real-life application solver executing in simulated grid environments • Introduce refinements to our initial algorithm • Results • MinEX's execution speed is competitive with METIS • MinEX produces superior grid-based partitions that reduce application runtime by up to a factor of 6

  6. The MinEX Partitioner • Multi-level scheme • Collapses edges incrementally • Partitions the contracted graph • Refines the graph in reverse order of contraction • Reassignments during refinement improve partition quality • Creates diffusive or from-scratch partitions • A user-supplied function estimates solver latency tolerance • Accounts for data redistribution cost during partitioning
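The slides do not say how MinEX chooses which edges to collapse. Purely as an illustration of one coarsening level in a multi-level scheme, the sketch below swaps in a greedy heaviest-edge-first matching (an assumption, not necessarily MinEX's rule): matched vertex pairs merge into one coarse vertex, their weights add, edges inside a merged pair disappear, and parallel edges combine. The function name `contract_once` and the dictionary graph format are hypothetical.

```python
from collections import defaultdict


def contract_once(weights, edges):
    """One coarsening level (sketch, heaviest-edge-first matching assumed).

    weights: dict vertex -> vertex weight
    edges:   dict (u, v) with u < v -> undirected edge weight
    Returns (coarse_weights, coarse_edges, mapping fine vertex -> coarse id).
    """
    matched = {}
    # Collapse heavy edges first so their weight can no longer be cut.
    for (u, v), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        if u not in matched and v not in matched:
            matched[u] = v
            matched[v] = u

    # Give each matched pair (or unmatched singleton) one coarse id.
    mapping = {}
    next_id = 0
    for v in sorted(weights):
        if v in mapping:
            continue
        mapping[v] = next_id
        if v in matched:
            mapping[matched[v]] = next_id
        next_id += 1

    coarse_w = defaultdict(int)
    for v, w in weights.items():
        coarse_w[mapping[v]] += w

    coarse_e = defaultdict(int)
    for (u, v), w in edges.items():
        cu, cv = mapping[u], mapping[v]
        if cu != cv:  # edges inside a merged pair vanish
            coarse_e[(min(cu, cv), max(cu, cv))] += w
    return dict(coarse_w), dict(coarse_e), mapping
```

Refinement would then run in the reverse direction, projecting the coarse partition back through `mapping` level by level.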

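  7. Metrics Utilized (vertex v assigned to processor p in cluster c) • Processing weight: $Wgt_v^p = PWgt_v \times Proc_c$ • Communication cost: $Comm_v^p = \sum_{w \notin p} CWgt(v,w) \times Connect(c,d)$, where w is assigned to a processor in cluster d • Redistribution cost: $Remap_v^p = RWgt_v \times Connect(c,d)$ if $p \neq q$, where v's data currently resides on processor q in cluster d (0 otherwise) • Weighted queue length: $QWgt(p) = \sum_{v \in p} (Wgt_v^p + Comm_v^p + Remap_v^p)$ • Heaviest load: $MaxQWgt = \max_p QWgt(p)$ • Queue length: $QLen_p$ = number of vertices $v \in p$ • Total system load: $QWgtToT = \sum_{p \in P} QWgt(p)$ • Average load: $WSysLL = QWgtToT / |P|$ • Imbalance factor: $LoadImb = MaxQWgt / WSysLL$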
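To make the metrics concrete, here is a minimal sketch that evaluates them for a given assignment. It assumes dictionary inputs (all names hypothetical): `cluster_of` maps processor to cluster, `proc` gives $Proc_c$, `connect` gives $Connect(c,d)$, and `cwgt` holds directed edge weights. It also assumes that only edges to vertices on other processors contribute to $Comm$ and that $WSysLL$ is the mean of $QWgt$.

```python
def qwgt_metrics(assign, old_assign, pwgt, rwgt, cwgt, proc, connect, cluster_of):
    """Compute QWgt, QLen, MaxQWgt, QWgtToT, WSysLL and LoadImb (sketch).

    assign / old_assign: vertex -> processor (new / current data location)
    cwgt: (v, w) -> CWgt(v, w), directed
    """
    procs = sorted(cluster_of)
    qwgt = {p: 0.0 for p in procs}
    qlen = {p: 0 for p in procs}
    for v, p in assign.items():
        c = cluster_of[p]
        wgt = pwgt[v] * proc[c]                          # Wgt_v^p
        comm = sum(w * connect[(c, cluster_of[assign[x]])]
                   for (u, x), w in cwgt.items()
                   if u == v and assign[x] != p)         # off-processor edges only
        q = old_assign[v]
        remap = rwgt[v] * connect[(c, cluster_of[q])] if q != p else 0.0
        qwgt[p] += wgt + comm + remap
        qlen[p] += 1
    total = sum(qwgt.values())
    max_q = max(qwgt.values())
    avg = total / len(procs)                             # WSysLL (mean assumed)
    return {"QWgt": qwgt, "QLen": qlen, "MaxQWgt": max_q,
            "QWgtToT": total, "WSysLL": avg, "LoadImb": max_q / avg}
```

For example, with two single-processor clusters where vertex `b` has just moved to processor 1, its load picks up both communication and remap costs, which pushes `LoadImb` above 1.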

  8. MinVar, Gain and ThroTTle • Var measures the variance of processor workloads from WSysLL: $Var = \sum_p (QWgt(p) - WSysLL)^2$ • ΔVar is the change in Var after a vertex reassignment; a positive value means Var has increased • Gain is the change in total system load ($\Delta QWgtToT$) resulting from a vertex reassignment • ThroTTle is a user-defined parameter: if Gain > 0, a vertex move that reduces Var (ΔVar < 0) is allowed only if $Gain^2 / (-\Delta Var) \le ThroTTle$
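The ThroTTle rule trades an increase in total load against a reduction in variance. A minimal sketch, assuming $WSysLL$ is the mean of $QWgt$ in each state and that the rule places no restriction when Gain ≤ 0 (the slide only covers Gain > 0). Function names are hypothetical.

```python
def load_variance(qwgt):
    """Var = sum_p (QWgt(p) - WSysLL)^2, with WSysLL taken as the mean QWgt."""
    wsysll = sum(qwgt.values()) / len(qwgt)
    return sum((q - wsysll) ** 2 for q in qwgt.values())


def move_effect(qwgt_old, qwgt_new):
    """Return (Gain, dVar) for a reassignment turning qwgt_old into qwgt_new."""
    gain = sum(qwgt_new.values()) - sum(qwgt_old.values())
    dvar = load_variance(qwgt_new) - load_variance(qwgt_old)
    return gain, dvar


def passes_throttle(gain, dvar, throttle):
    """Allow a load-increasing move only if Gain^2 / -dVar <= ThroTTle."""
    if gain <= 0:
        return True   # assumption: the rule only restricts moves that add load
    if dvar >= 0:
        return False  # adds load without reducing the variance
    return gain ** 2 / -dvar <= throttle
```

Moving work from a processor at 10 to one at 4, at a cost of one extra unit of total load, gives Gain = 1 and ΔVar = -17.5, so the move passes with ThroTTle = 0.1 but not with ThroTTle = 0.05.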

  9. MinEX Basic Partition Criteria • Minimize MaxQWgt rather than balance processor workloads • Move vertices from overloaded processors (QWgt(p) > WSysLL) to underloaded processors (QWgt(p) < WSysLL)

  10. Reassignment Filter Function • Goal: minimize edge-related processing; reject deleterious assignments • Projects newQWgt, ΔVar and newGain from vertex totals: edge weights within the same cluster, edge weights to other clusters, local edge weights, total outgoing edge weight, and relocation and processing weights • IF newQWgt_from > QWgt_from: reject the assignment • IF newQWgt_to < QWgt_to: reject • IF ΔVar >= 0: reject • IF newGain > 0 AND newGain² / (-ΔVar) > ThroTTle: reject • Dnew = newQWgt_from - newQWgt_to; Dold = QWgt_from - QWgt_to • IF |Dnew| > |Dold|: IF newQWgt_from < QWgt_to, reject; IF newQWgt_to > QWgt_from, reject • Otherwise the assignment passes the filter
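The filter's cascade of checks translates directly into a predicate. This is a sketch, not the authors' code. It reads the slide's `fabs(Dnew)>abs(Dnew)` test, which compares Dnew with itself, as a comparison against Dold (an interpretation). It also assumes the projected values are already computed, and the function name is hypothetical.

```python
def passes_filter(q_from, q_to, new_q_from, new_q_to, dvar, new_gain, throttle):
    """Cheap pre-check for moving one vertex from processor 'from' to 'to'.

    q_*: current QWgt of the source/destination; new_q_*: projected QWgt.
    """
    if new_q_from > q_from:
        return False  # the source must not get heavier
    if new_q_to < q_to:
        return False  # the destination must not get lighter
    if dvar >= 0:
        return False  # the move must reduce the load variance
    if new_gain > 0 and new_gain ** 2 / -dvar > throttle:
        return False  # too much added load for the variance gained
    d_new = new_q_from - new_q_to
    d_old = q_from - q_to
    if abs(d_new) > abs(d_old):
        # The pair ends up further apart than before: reject overshoots.
        if new_q_from < q_to:
            return False
        if new_q_to > q_from:
            return False
    return True
```

Only moves that pass this predicate would go on to a full evaluation, which is where the overhead savings reported on slide 17 would come from.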

  11. Additional MinEX Refinements • Graph contraction phase • Bucket sort vertices by processor • Find edges to merge without searching • User-defined latency tolerance function (called before each potential reassignment) • double MinEX(User *user, Ipg *ipg, Qtot *tot) • user = user options passed to the partitioner • ipg = grid configuration graph • tot contains Pproc_p, Comm_p, Remap_p, QLen_p
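One way to read "bucket sort vertices by processor, find edges to merge without searching" is sketched below. Vertices are bucketed by processor in one linear pass, and each unmatched vertex is paired with its heaviest unmatched neighbour on the same processor. Restricting merges to same-processor pairs, and the function names, are assumptions made for illustration.

```python
def bucket_by_processor(assign, num_procs):
    """Linear-time bucket sort: one list of vertices per processor."""
    buckets = [[] for _ in range(num_procs)]
    for v, p in assign.items():
        buckets[p].append(v)
    return buckets


def mergeable_edges(assign, adj, num_procs):
    """Pair vertices with their heaviest unmatched same-processor neighbour.

    adj: vertex -> {neighbour: edge weight}
    """
    matched = set()
    pairs = []
    for bucket in bucket_by_processor(assign, num_procs):
        for v in bucket:
            if v in matched:
                continue
            best = None
            for w, wt in adj.get(v, {}).items():
                same_proc = assign[w] == assign[v]
                if w != v and w not in matched and same_proc:
                    if best is None or wt > best[1]:
                        best = (w, wt)
            if best is not None:
                matched.update((v, best[0]))
                pairs.append((v, best[0]))
    return pairs
```

In the test case below, the heavy edge 0-2 crosses processors and is therefore never merged under this assumption.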

  12. The N-Body Problem • Classical problem of simulating the movement of a set of bodies • The solution is based upon gravitational or electrostatic forces • The application iterates over a series of time steps • At each step, for each body: • Compute the forces from all other bodies using the gravitational laws • Calculate the acceleration and integrate twice to compute the position at the next time step • Call the partitioner to balance the next step's computations among the processors
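A direct O(n²) time step makes the "integrate twice" step concrete: acceleration updates velocity, and the new velocity updates position (semi-implicit Euler). This is only a sketch. The unit gravitational constant `G`, the softening length `eps` and the function name are assumptions, and the partitioning call is omitted.

```python
import math

G = 1.0  # gravitational constant in simulation units (assumption)


def step(pos, vel, mass, dt, eps=1e-3):
    """Advance all bodies one time step with direct pairwise forces."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d) + eps * eps  # softened distance squared
            f = G * mass[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += f * d[k]
    # Integrate twice: acceleration -> velocity -> position.
    new_vel = [[vel[i][k] + acc[i][k] * dt for k in range(3)] for i in range(n)]
    new_pos = [[pos[i][k] + new_vel[i][k] * dt for k in range(3)] for i in range(n)]
    return new_pos, new_vel
```

The inner double loop is the O(n²) cost that the Barnes & Hut approximation on the next slide reduces.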

  13. Barnes & Hut Solution (Framework for experiments) • Reduces computational complexity from O(n²) to O(n log n) • A cluster of bodies that is far from a cell is treated as a single body, using the cluster's total mass and center-of-mass position • Cell C_v is considered far from cell C_w if the size of the cell divided by the distance between the cells is less than a constant • Our implementation • Initialization • Create the octree of cells • Form a graph using the cells of the octree • Each time step • Partition the graph and distribute the cells to be relocated among the processors • Run the solver
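The far-field rule and the single-body approximation can be sketched for one target point and one cell. A cell whose size-to-distance ratio is below the constant (here called `theta`, a name we assume) is replaced by a pseudo-body at its center of mass. Otherwise its bodies are summed directly. All names are hypothetical, `G = 1` is assumed, and the target is assumed not to coincide with any body.

```python
import math


def is_far(cell_size, distance, theta):
    """Far if cell size / distance is below the opening constant."""
    return cell_size / distance < theta


def aggregate(bodies):
    """bodies: list of (mass, (x, y, z)) -> (total mass, center of mass)."""
    total = sum(m for m, _ in bodies)
    com = tuple(sum(m * p[k] for m, p in bodies) / total for k in range(3))
    return total, com


def accel_from_cell(target, bodies, cell_size, theta, G=1.0):
    """Acceleration at `target` due to one cell (Barnes & Hut style)."""
    total, com = aggregate(bodies)
    if is_far(cell_size, math.dist(target, com), theta):
        sources = [(total, com)]  # far: one pseudo-body
    else:
        sources = bodies          # close: direct summation
    acc = [0.0, 0.0, 0.0]
    for m, p in sources:
        d = [p[k] - target[k] for k in range(3)]
        r = math.sqrt(sum(x * x for x in d))
        for k in range(3):
            acc[k] += G * m * d[k] / r ** 3
    return acc
```

For a distant target the approximation agrees closely with direct summation, which is what makes the O(n log n) scheme accurate enough.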

  14. The Partitioning Graph (constructed from the Barnes & Hut octree) • One vertex v per cell C_v, containing |C_v| bodies • Two weights per vertex • PWgt_v models the required computations: PWgt_v = |C_v| × (|C_v| - 1 + CloseB + Far_v + 2) • RWgt_v models data redistribution: RWgt_v = |C_v| • Edges model communication between close cells • Each edge (v,w) relates cells C_v and C_w: CWgt(v,w) = |C_w| if C_w is close to C_v; else 0
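The weight formulas can be applied mechanically once the octree relations are known. In this sketch the CloseB and Far_v terms are taken as given per-cell inputs, since the slide does not define how they are counted. The set `close` holds ordered pairs (v, w) meaning "C_w is close to C_v", and the function name is hypothetical. The example builds a pair of cells where one direction is close and the other is not.

```python
def build_graph(sizes, close, close_b, far):
    """Vertex and edge weights for the N-body partitioning graph (sketch).

    sizes:   cell -> |C_v|
    close:   set of (v, w) meaning "C_w is close to C_v"
    close_b: cell -> CloseB term; far: cell -> Far_v term (given inputs)
    """
    pwgt = {v: n * (n - 1 + close_b[v] + far[v] + 2) for v, n in sizes.items()}
    rwgt = dict(sizes)  # RWgt_v = |C_v|
    cwgt = {}
    for v, w in close:
        cwgt[(v, w)] = sizes[w]     # C_w close to C_v
        cwgt.setdefault((w, v), 0)  # stays 0 unless C_v is also close to C_w
    return pwgt, rwgt, cwgt
```

With cell sizes 3 and 5 and only one direction close, the resulting edge weights are 5 one way and 0 the other, exactly the asymmetry and zero weights discussed on the next slide.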

  15. Graph Modifications • N-Body graph • CWgt(v,w) can differ from CWgt(w,v) because |C_v| may not equal |C_w| • CWgt(v,w) can equal 0 if C_v is close to C_w but C_w is far from C_v • METIS limitations • Cannot operate on directed graphs • Cannot tolerate edge weights of zero • For direct comparisons, experiments are run using • The original N-Body graph (Graph G) • A modified graph (Graph Gm)
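The slides do not say exactly how Graph Gm was derived. As one hypothetical construction that satisfies both METIS limitations, the sketch below merges the two directions of each edge by summing their weights and raises any remaining zero weight to 1. The function name is ours, and this should not be read as the authors' Gm.

```python
def to_undirected_positive(cwgt):
    """Fold directed CWgt into undirected, strictly positive edge weights.

    Hypothetical Gm construction: w(v,w) = CWgt(v,w) + CWgt(w,v), min 1.
    """
    und = {}
    for (v, w), wt in cwgt.items():
        key = (v, w) if v <= w else (w, v)
        und[key] = und.get(key, 0) + wt
    return {k: max(1, wt) for k, wt in und.items()}
```

Any such transformation loses the directional information that MinEX models, which is consistent with the later observation that MinEX-Gm is somewhat worse than MinEX-G.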

  16. Experimental Study: Simulation of a Grid Environment • Simulated vs. actual grid environments • Simulation is a low-cost alternative to constructing a wide range of heterogeneous configurations • Few grid facilities are available in the field, and they are usually homogeneous • Methodology • Discrete time simulation • A configuration graph models processing speed, communication latency, and bandwidth • Configurations (processors = 32, 64, 128; interconnect slowdowns = 10, 100; clusters = 4, 8) • HO: constant processing and intra-communication capability • UP: faster processors have faster intra-communication capability • DN: faster processors have slower intra-communication capability
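The configuration space enumerates directly as 3 × 2 × 2 × 3 = 36 combinations. The per-cluster costs below are purely hypothetical placeholders that illustrate the HO/UP/DN orderings (lower cost means faster). The same goes for the inter-cluster rule that scales the slower cluster's intra-cluster cost by the slowdown. None of these numbers come from the paper, and all names are ours.

```python
from itertools import product

PROCESSORS = (32, 64, 128)
SLOWDOWNS = (10, 100)
CLUSTERS = (4, 8)
KINDS = ("HO", "UP", "DN")


def configurations():
    """All (P, slowdown, C, kind) combinations studied."""
    return list(product(PROCESSORS, SLOWDOWNS, CLUSTERS, KINDS))


def cluster_costs(kind, clusters):
    """Hypothetical (Proc_c, intra-comm cost) per cluster; lower is faster."""
    costs = []
    for c in range(1, clusters + 1):
        proc = 1.0 if kind == "HO" else float(c)
        if kind == "HO":
            intra = 1.0                       # uniform capability
        elif kind == "UP":
            intra = float(c)                  # faster CPUs, faster links
        else:
            intra = float(clusters + 1 - c)   # DN: faster CPUs, slower links
        costs.append((proc, intra))
    return costs


def connect(c, d, intra, slowdown):
    """Hypothetical Connect(c, d): intra cost, or slowdown x slower side."""
    return intra[c] if c == d else slowdown * max(intra[c], intra[d])
```

A discrete time simulator would then charge each processor its QWgt under these costs and report the slowest processor's finish time.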

  17. Filter Effectiveness (C=8) • The reassignment filter eliminates virtually all of the overhead associated with vertex moves that are rejected • Almost all assignments that passed the filter were accepted

  18. Scalability Test (scales well to 128 processors) • P varied between 8 and 1024, C=8; runtimes compared

  19. ThroTTle Test (C=8) • Results initially improve as ThroTTle increases, until the curve flattens out

  20. Multiple Time Step Test (P=64, I=10, C=8, B=16K) • Multiple iterations have limited impact • Subsequent experiments run a single time step

  21. Partitioner Speed Comparisons • MinEX has the advantage for P=32 and P=64 • METIS has the advantage for P=1k • Overall, MinEX is competitive

  22. Partition Quality Comparisons (C=8) • MinEX and METIS show similar results for homogeneous configurations • Heterogeneous configurations show a clear advantage for MinEX

  23. Partition Quality Comparisons (C=8) • Results are similar to the I=10 experiments • MinEX-Gm results are generally somewhat worse than MinEX-G because Gm models the application less accurately • METIS results are significantly worse than MinEX's, but the gap is smaller than with faster interconnects, because a slower interconnect makes the grid more homogeneous

  24. Partition Quality Comparisons: Additional Observations • DN configuration results are similar to the UP experiments, with a few exceptions • DN runs are worse than UP runs in a few cases (998 vs. 1489 for P=128, C=4, I=100, B=64K) • MinEX projected 975 but converged to 1489 • When a second input channel is simulated, the solver converges at 975 for DN; METIS shows no such improvement • HO runs with P=32 and P=64, I=100, B=256K give METIS an advantage (7399 vs. 5199 and 4231 vs. 3334, respectively) • MinEX converges tightly (LoadImb = 1.0001) to a high value • Perhaps the criteria for reassignments need to be further refined

  25. Conclusions • Direct comparisons between MinEX and METIS • An N-body solver running on simulated grid environments forms the basis for our experiments • MinEX produces partitions that reduce runtime by up to a factor of 6 in highly heterogeneous grids • MinEX and METIS are competitive in homogeneous grids • MinEX is competitive with METIS in execution speed • Implemented performance refinements to MinEX • The reassignment filter minimizes the overhead associated with potential reassignments that are rejected • Sorting processors by QWgt speeds up partitioning decisions • A bucket sort speeds up finding edges to collapse • MinEX can partition directed graphs • Not commonly allowed by current partitioners • Accounts for latency tolerance during partitioning • Established the benefit and feasibility of this approach

  26. On-going Research • MinEX Refinements • Analyze effect of using multiple I/O channels and network dynamics • Refine the method of selecting vertices for reassignment • Refine the discrete time simulator • Develop a general-purpose tool for simulating heterogeneous grids • Establish the accuracy of the simulator by comparing its projections to the performance of applications running on actual grids
