260 likes | 352 Views
Performance of a Heterogeneous Grid Partitioner for N-body Applications. Daniel J. Harvey Department of Computer Science Southern Oregon University E-mail: harveyd@sou.edu Sajal K. Das Department of Computer Science and Engineering The University of Texas at Arlington
E N D
Performance of a Heterogeneous Grid Partitioner for N-body Applications Daniel J. Harvey Department of Computer Science Southern Oregon University E-mail: harveyd@sou.edu Sajal K. Das Department of Computer Science and Engineering The University of Texas at Arlington E-mail: das@cse.uta.edu Rupak Biswas NASA Ames Research Center E-mail: rbiswas@nas.nasa.gov
Presentation Overview • The information power grid (IPG) • The MinEX partitioner • This paper’s contributions • MinEX refinements • The N-Body problem • Experimental study • Performance results • Conclusions and on-going research
The Information Power Grid (IPG) • Harness power of geographically separated resources • Developed by NASA and other collaborative partners • Utilize geographically separated processors to solve large-scale computational problems • Characteristics • limited bandwidth and high latency • heterogeneous configurations • Relevant applications identified by I-Way experiment • Remote access to large databases requiring high-end graphics • Remote virtual reality access to instruments • Remote interactions with super-computer simulations
The MinEX Partitioner • We previously introduced a novel partitioner called MinEX • Minex: A latency-tolerant dynamic partitioner for grid computing applications, FGCS, 18 (2002), pp. 477—489 • MinEX’s unique characterisitcs include • Environment: designed specifically for heterogeneous geographically distributed environments • Grid: maps configuration graph onto the partition graph; produces partitions reflecting the grid • Goal: minimize runtime rather than balance processing workload and minimize edge cut • Latency: accounts for latency tolerance during partitioning • Accounts for: data movement & communication overhead
This Paper’s Contributions • Evaluate MinEX performance with a wide range of heterogeneous grid configurations • Compare MinEX to METIS, a popular state the art partitioner • Run experiments using a real-life application solver executing in simulated grid environments • Introduce refinements to our initial algorithm • Results • MinEX speed of execution is competitive with METIS • MinEX produces superior grid-based partitions that reduce application runtime by up to a factor of 6
The MinEX Partitioner • Multi-level scheme • Collapse edges incrementally • Partitions the contracted graph • Refines the graph in reverse • Reassignments during refinement improves partition quality • Creates diffusive or from scratch partitions • User-supplied function estimates solver latency tolerance • Accounts for data redistribution cost during partitioning
Processing weight Wgt = PWgtv x Procc Communication cost Comm = SwepCWgt(v,w) x Connect(c,d) Redistribution cost Remap = RWgtv x Connect(c,d) if pq Weighted queue length QWgt(p) = Svep(Wgt + Comm + Remap ) Heaviest load (MaxQWgt) Qlenp = Vertices e p Average load (WSysLL) Total system load QWgtToT = SpePQWgt(p) Imbalance factor LoadImb = MaxQWgt/WSysLL Metrics Utilized v p v p v p v p
MinVar, Gain andThroTTle • Processor workload variance from WSysLL • Var = Sp(QWgt(p) - WSysLL)2 • DVar reflects the improvement in MinVar after a vertex reassignment. A positive value implies that the Var value has increased • Gain is the change(DQWgtToT) to total system load resulting from a vertex reassignment • ThroTTle is a user defined parameter. If Gain>0, Vertex moves that improve DVar are allowed if Gain2/-DVar <= ThroTTle
MinEX Basic Partition Criteria • Minimize MaxQWgt rather than balance processor workloads. • Move verticices from overloaded processors (QWgtp > WSysLL) to underloaded processors (QWgtp < WSysLL)
Projects Qwgtnew, DVar, newGain Vertex totals used: Edge weights same cluster Edge weights other clusters Local Edge weights Total outgoing edge weight Relocation, Processing weights IF (newQWgtfrom > Qwgtfrom) Reject Assignment IF (newQWgtto < Qwgtto) Reject Assignment IF (Dvar >= 0) Reject Assignment IF newGain>0 && newGain2/-Dvar>ThroTTle Reject Assignment Dnew=newQWgtfrom-newQWgtto Dold=QWgtfrom-QWgtto) IF fabs(Dnew)>abs(Dnew) IF newQWgtfrom<Qwgtto Reject Assignment IF newQWgtto>Qwgtfrom Reject Assignment Assignment Passes Filter Reassignment Filter FunctionGoal: Minimize edge related processing; reject deleterious assignments
Additional MinEX Refinements • Graph contraction phase • Bucket sort vertices by processor • Find edges to merge without searching • Defined user-defined latency tolerance function (called before each potential reassignment) • Double MinEX(User *user, Ipg *ipg, Qtot *tot) • User = User options passed to the partitioner • Ipg = Grid configuration graph • tot contains Pprocp, Commp, Remapp, QLenp
The N-Body ProblemClassical problem of simulating the movement of a set of bodies • The Solution is based upon gravitational or electrostatic forces • The application Iterates over a series of time steps • At each step for each body • Compute forces from all other bodies using the gravitational laws • Calculates Acceleration and integrates twice to compute the position at the next time step • Call the partitioner to balance the next-step computations among the processors.
Barnes & Hut Solution (Framework for experiments) • Reduces computational complexity from O(n2) to O(n lg n) • Clusters of bodies that are far from a cell are treated as a single body using the total center of mass and the center of mass position • Cell Cv is considered far from Cell Cw if the size of the cell divided by the distance between cells is less than a constantF • Our implementation • Initialization • Create the octtree of cells • Form a graph graph using the cells of the octtree • Each time step • Partition the graph, distribute cells to be relocated among processors • Run the solver
The Partitioning GraphConstructed from the Barnes&Hut OctTree • One vertex per cell, Cv with |Cv| bodies • Two associated weights • PWgtv models the required computations PWgtv = |Cv| x (|Cv|-1+CloseB+Farv+2) • RWgt models data distribution RWgtv = |Cv| • Edges model communication between close cells • Each edge (v,w) relates to cells Cv and Cw. CWgt(v,w) = |cw| if Cw is close to cw; else 0
Graph Modifications • N-Body graph • CWgt(v,w) can be different than CWgt(w,v) because |Cv| may not equal |cw| • CWgt(v,w) can equal 0 if Cv is close to cW but Cw is far from Cv. • METIS Limitations • Cannot operate on directed graphs • Cannot tolerate edge weights of zero • For direct comparisons, experiments are run using • Original N-Body graph (Graph G) • Modified Graph (Graph Gm)
Experimental StudySimulation of a Grid Environment • Simulated Grid Environment vs actual grids • Low cost alternative to constructing a wide range heterogeneous configurations • Limited grid facilities are available in the field and are usually homogeneous • Methodology • Discrete time simulation • Utilize configuration graph to model processing speed, communication latency, and bandwidth • Configurations (Processors=32,64,128; Interconnect slowdowns=10,100;Clusters=4,8) • HO: Constant processing and intra-communication capabilityUP: Faster processors have faster intra-communication capability • DN: Faster processors have slower intra-communication capability
Filter Effectiveness (C=8) • Reassignment filter eliminates virtually all overhead with vertex moves that are rejected • Almost all assignments passing the filter were accepted
Scalability Test (Scales well to 128 processors)P varied between 8 and 1024, C=8, Runtimes compared
ThroTTle Test (C=8)(Initially Improves as throttle increases until curve flattens out)
Multiple Time Step TestP=64, I=10, C=8, B=16K • Multiple iterations have limited impact • Subsequent experiments run a single time step
Partitioner Speed Comparisons • MinEX has the advantage for P=32 and P=64 • METIS has the advantage for P=1k • Overall, MinEX is competitive
Partition Quality Comparisons (C=8) • MinEX and METIS show similar results for Homogeneous configurations. • Heterogeneous configurations show clear advantage to MinEX
Partition Quality Comparisons (C=8) • Similar results to I=10 experiments • MinEX-Gm results are in general somewhat worse than MinEX-G because of less accurate application modeling • METIS results are significantly worse than MinEX; but less compared to faster interconnects. Slower interconnect speed makes grid more homogeneous
Partition Quality ComparisonsAdditional Observations • DN configuration results are similar to UP experiments with a few exceptions • DN runs are worse than the UP runs in a few cases (998 vs 1489 if P=128, C=4, I=100, B=64K) • The MinEX projected 975, but converged to 1489. • When Simulating a second input channel, the solver converges at 975 for DN. No such improvement for METIS • HO runs with P=32 & 64, I=100, B=256K give METIS an advantage (7399 to 5199 and 4231 and 3334 respectively). • MinEX is converging tightly (LoadImb=1.0001) to a high value • Perhaps the criteria for reassignments needs to be further refined.
Conclusions • Direct comparisons between MinEX and METIS • An N-body solver on simulated grid environments form the basis for our experiments • MinEX produces partitions that reduce runtime by up to a factor of 6 in highly-heterogeneous grids • MinEX and METIS are competitive in homogeneous grids • MinEX is competitive to METIS as far as speed of execution • Implemented performance refinements to MinEX • The reassignment filter minimizes overhead associated with potential reassignments that are rejected • Sorting processors by QWgt speed up partitioning decisions • A bucket sort speeds up finding edges to collapse • Minex can partition directed graphs • Not commonly allowed by current partitioners • Account for latency tolerance during partitioning • Established the benefit and feasibility of this approach
On-going Research • MinEX Refinements • Analyze effect of using multiple I/O channels and network dynamics • Refine the method of selecting vertices for reassignment • Refine the discrete time simulator • Develop a general-purpose tool for simulating heterogeneous grids • Establish the accuracy of the simulator by comparing its projections to the performance of applications running on actual grids