Daniel J. Harvey Department of Computer Science Southern Oregon University

Designing an Efficient Partitioning Algorithm for Grid Environments with Application to N-Body Problems • Daniel J. Harvey, Department of Computer Science, Southern Oregon University, E-mail: harveyd@sou.edu • Sajal K. Das, Department of Computer Science and Engineering, The University of Texas at Arlington, E-mail: das@cse.uta.edu • Rupak Biswas, NASA Ames Research Center, E-mail: rbiswas@nas.nasa.gov

Presentation Overview • The information power grid (IPG) • The MinEX partitioner • This paper’s contributions • Metrics utilized • The N-Body problem • MinEX refinements • Experimental study • Performance results • Conclusions and on-going research

The Information Power Grid (IPG) • Harness power of geographically separated resources • Developed by NASA and other collaborative partners • Utilize geographically separated processors to solve large-scale computational problems • Characteristics • limited bandwidth and high latency • heterogeneous configurations • Relevant applications identified by I-Way experiment • Remote access to large databases requiring high-end graphics • Remote virtual reality access to instruments • Remote interactions with super-computer simulations

Load Balancing Approaches • Especially important in grid environments • Traditional load-balancing objectives • Distribute workload evenly among processors • Minimize idle time • Static load-balancing • Balance load prior to execution • Examples: smart compilers, schedulers • Dynamic load-balancing • Balance as application is processed • Examples: adaptive contracting, gradient, symmetric broadcast networks • Semi-dynamic load-balancing (our focus in this paper) • Temporarily stop application processing to balance workload • Utilizes a partitioning technique • Examples: MeTiS, Jostle, PLUM

The MinEX Partitioner • We previously introduced a novel partitioner called MinEX • MinEX: A latency-tolerant dynamic partitioner for grid computing applications, FGCS, 18 (2002), pp. 477-489 • MinEX's unique characteristics include • Environment: designed specifically for heterogeneous, geographically distributed environments • Grid: maps configuration graph onto the partition graph; produces partitions reflecting the grid • Goal: minimize runtime rather than balance processing workload and minimize edge cut • Latency: accounts for latency tolerance during partitioning • Accounts for: data movement and communication overhead

This Paper's Contributions • Compare MinEX performance to METIS, a state-of-the-art partitioner • Result: speed of execution is competitive • Result: MinEX partitions reduce application runtime by up to a factor of 6 • Estimate performance over a wide range of heterogeneous grid configurations • Apply MinEX to a real-life application (the N-Body problem) executing in simulated grid environments • Introduce refinements to our initial algorithm

The MinEX Partitioner • Multi-level scheme • Collapse edges incrementally • Partitions the contracted graph • Refines the graph in reverse • Reassignments executed to improve partition quality • Creates diffusive or from scratch partitions • User-supplied function estimates solver latency tolerance • Accounts for data redistribution cost during partitioning

Metrics Utilized • Processing weight: $Wgt^v_p = PWgt^v \times Proc_c$ • Communication cost: $Comm^v_p = \sum_{w \in p} CWgt(v,w) \times Connect(c,d)$ • Redistribution cost: $Remap^v_p = RWgt^v \times Connect(c,d)$ if $p \neq q$ • Weighted queue length: $QWgt(p) = \sum_{v \in p} (Wgt^v_p + Comm^v_p + Remap^v_p)$ • Queue length: $QLen_p$ = number of vertices $v \in p$ • Total system load: $QWgtToT = \sum_{p \in P} QWgt(p)$ • Heaviest load: MaxQWgt • Average load: WSysLL • Imbalance factor: $LoadImb = MaxQWgt / WSysLL$
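The aggregate metrics on this slide can be illustrated with a minimal Python sketch. It assumes the per-vertex costs (Wgt, Comm, Remap) have already been computed, and it assumes WSysLL is the plain mean of QWgt(p) over processors, which the slide only describes as "average load". The function names and sample values are hypothetical.

```python
def queue_weights(assign, wgt, comm, remap, num_procs):
    """QWgt(p): sum of Wgt + Comm + Remap over the vertices assigned to p."""
    qwgt = [0.0] * num_procs
    for v, p in assign.items():
        qwgt[p] += wgt[v] + comm[v] + remap[v]
    return qwgt


def load_metrics(qwgt):
    """Return (QWgtToT, MaxQWgt, WSysLL, LoadImb) for a list of QWgt values."""
    total = sum(qwgt)              # QWgtToT
    heaviest = max(qwgt)           # MaxQWgt
    average = total / len(qwgt)    # WSysLL (assumed: mean over processors)
    return total, heaviest, average, heaviest / average


# Hypothetical example: three vertices on two processors.
assign = {"a": 0, "b": 0, "c": 1}
wgt = {"a": 4.0, "b": 2.0, "c": 3.0}
comm = {"a": 1.0, "b": 0.0, "c": 2.0}
remap = {"a": 0.0, "b": 1.0, "c": 0.0}

qwgt = queue_weights(assign, wgt, comm, remap, num_procs=2)
total, heaviest, average, load_imb = load_metrics(qwgt)
```

A LoadImb of 1.0 means the heaviest processor carries exactly the average load; larger values indicate a more uneven partition.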

MinVar, Gain, and ThroTTle • Processor workload variance from WSysLL • $Var = \sum_p (QWgt(p) - WSysLL)^2$ • ΔVar reflects the change in Var after a vertex reassignment; a positive value implies that Var has increased, so an improving move has ΔVar < 0 • Gain is the change (ΔQWgtToT) in total system load resulting from a vertex reassignment • ThroTTle is a user-defined parameter: if Gain > 0, vertex moves that improve Var (ΔVar < 0) are allowed only if $Gain^2 / (-\Delta Var) \le ThroTTle$
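As a worked illustration of these quantities, the following Python sketch computes Var, ΔVar, and Gain for one hypothetical vertex move and then applies the ThroTTle rule. It assumes WSysLL is recomputed as the mean of QWgt(p) after the move; the load values and function names are made up for the example.

```python
def variance(qwgt):
    """Var: sum over processors of (QWgt(p) - WSysLL)^2, WSysLL taken as the mean."""
    wsysll = sum(qwgt) / len(qwgt)
    return sum((q - wsysll) ** 2 for q in qwgt)


def throttle_allows(gain, delta_var, throttle):
    """Apply the ThroTTle rule to a move that already improves Var (delta_var < 0)."""
    if gain <= 0:
        return True  # total system load did not grow, so the rule does not apply
    return gain ** 2 / (-delta_var) <= throttle


# Hypothetical loads before and after moving one vertex between two processors.
before = [8.0, 5.0]
after = [7.0, 6.5]

delta_var = variance(after) - variance(before)  # negative: the move improves Var
gain = sum(after) - sum(before)                 # positive: total load grew slightly
allowed = throttle_allows(gain, delta_var, throttle=0.1)
```

Here the move reduces the variance substantially while adding only a little total load, so a small ThroTTle value still admits it.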

The N-Body Problem • Classical problem of simulating the movement of a set of bodies • Based upon gravitational or electrostatic forces • Iterates over a series of time steps • At each step, for each body • Compute forces from all other bodies using the gravitational laws • Compute the acceleration and integrate twice to obtain the position at the next time step • If all pairwise force calculations are performed, $O(n^2)$ computations are required at each time step
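The direct all-pairs approach described above can be sketched in a few lines of Python. This is only an illustration of the $O(n^2)$ cost per step: the gravitational constant is set to 1, and the integrator (update velocity from acceleration, then position from velocity) is an assumed simple scheme, not necessarily the one used in the presented solver.

```python
def accelerations(pos, mass, g=1.0):
    """Direct summation: each body accumulates the pull of every other body."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d)
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += g * mass[j] * d[k] * inv_r3
    return acc


def step(pos, vel, mass, dt, g=1.0):
    """One time step: integrate acceleration to velocity, then velocity to position."""
    acc = accelerations(pos, mass, g)
    new_vel = [[v[k] + a[k] * dt for k in range(3)] for v, a in zip(vel, acc)]
    new_pos = [[p[k] + nv[k] * dt for k in range(3)] for p, nv in zip(pos, new_vel)]
    return new_pos, new_vel


# Two bodies one unit apart, initially at rest.
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mass = [1.0, 2.0]
acc = accelerations(pos, mass)
new_pos, new_vel = step(pos, vel, mass, dt=0.1)
```

The nested loop over `i` and `j` is what makes each time step quadratic in the number of bodies.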

Barnes & Hut Solution (Framework for experiments) • Reduces computational complexity from $O(n^2)$ to $O(n \lg n)$ • Clusters of bodies that are far from a cell are treated as a single body using the cluster's total mass and center-of-mass position • Cell Cv is considered far from cell Cw if the size of the cell divided by the distance between the cells is less than a constant F • Our implementation (for each time step) • Create the octree of cells • Form a graph using the cells of the octree • Partition the graph and distribute the cells to be relocated among processors • Run the solver
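The two ingredients of the approximation, the "far" test and the replacement of a cluster by a single equivalent body, can be sketched as follows. The function names, the body representation as `(mass, (x, y, z))` tuples, and the threshold value are assumptions made for this illustration.

```python
def is_far(cell_size, distance, threshold):
    """A cell is treated as far when size / distance is below the constant."""
    return cell_size / distance < threshold


def summarize_cluster(bodies):
    """Replace a cluster by its total mass and center-of-mass position.

    bodies: list of (mass, (x, y, z)) tuples.
    """
    total_mass = sum(m for m, _ in bodies)
    center = tuple(
        sum(m * p[k] for m, p in bodies) / total_mass for k in range(3)
    )
    return total_mass, center


# Hypothetical cluster of two bodies on the x-axis.
cluster = [(1.0, (0.0, 0.0, 0.0)), (3.0, (4.0, 0.0, 0.0))]
total_mass, center = summarize_cluster(cluster)

far = is_far(cell_size=1.0, distance=4.0, threshold=0.5)
near = is_far(cell_size=1.0, distance=1.5, threshold=0.5)
```

When a cell passes the far test, a single force evaluation against `(total_mass, center)` stands in for one evaluation per body in the cluster.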

The Partitioning Graph • Each vertex v in the partitioning graph corresponds to a leaf cell Cv, with |Cv| bodies, in the N-Body octree and has two associated weights: PWgtv models the computations associated with the bodies in the cell, and RWgtv represents the data redistribution cost • PWgtv = |Cv| x (|Cv| - 1 + CloseB + Farv + 2) • RWgtv = |Cv| • Each edge (v,w) has a weight CWgt(v,w) that models the communication cost between cells Cv and Cw • CWgt(v,w) = |Cw| if Cw is close to Cv; 0 otherwise
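A small Python sketch of the redistribution and edge weights may help show why the resulting graph is directed. The `is_close(a, b)` predicate ("cell a is close to cell b") and the sample cells are hypothetical; PWgt is omitted because its CloseB and Far terms are not defined on the slide.

```python
def rwgt(bodies, v):
    """RWgt_v: redistribution cost, the number of bodies in cell C_v."""
    return bodies[v]


def cwgt(bodies, is_close, v, w):
    """CWgt(v, w): |C_w| if C_w is close to C_v, otherwise 0."""
    return bodies[w] if is_close(w, v) else 0


# Hypothetical cells: C_w is close to C_v, but C_v is far from C_w.
bodies = {"v": 5, "w": 2}
close_pairs = {("w", "v")}


def is_close(a, b):
    return (a, b) in close_pairs


forward = cwgt(bodies, is_close, "v", "w")   # edge v -> w
backward = cwgt(bodies, is_close, "w", "v")  # edge w -> v
```

The two directions of the same edge carry different weights, and one of them is zero, so the graph cannot be treated as an ordinary undirected graph with positive edge weights.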

Graph Modifications • METIS limitations • Cannot operate on directed graphs • Cannot tolerate edge weights of zero • N-Body graph • CWgt(v,w) can differ from CWgt(w,v) because |Cv| may not equal |Cw| • CWgt(v,w) can equal 0 if Cv is close to Cw but Cw is far from Cv • For direct comparisons, experiments are run using • Original N-Body graph (Graph G) • Modified graph (Graph Gm)

MinEX Basic Partition Criteria • Minimize MaxQWgt rather than balance processor workloads • Collapse edges that result in the best Gain value using a min-heap • Call user-defined latency tolerance function to estimate latency tolerance • Move vertices from overloaded processors (QWgt(p) > WSysLL) to underloaded processors (QWgt(p) < WSysLL) • Reject potential reassignments that (i) have a positive ΔVar or (ii) are rejected by the reassignment filter function

Reassignment Filter Function • Goal: avoid unnecessary edge processing and reject deleterious reassignments that would cause increased edge processing • Projects newQWgt, ΔVar, and newGain • Vertex totals used: edge weights within the same cluster, edge weights to other clusters, local edge weights, total outgoing edge weight, relocation and processing weights • IF (newQWgt_from > QWgt_from) reject assignment • IF (newQWgt_to < QWgt_to) reject assignment • IF (ΔVar >= 0) reject assignment • IF (newGain > 0 && newGain^2 / (-ΔVar) > ThroTTle) reject assignment • Dnew = newQWgt_from - newQWgt_to; Dold = QWgt_from - QWgt_to • IF (fabs(Dnew) > fabs(Dold)): IF (newQWgt_from < QWgt_to) reject assignment; IF (newQWgt_to > QWgt_from) reject assignment • Otherwise the assignment passes the filter
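The filter's decision sequence can be written as a single Python predicate. This is a sketch under assumptions: the projected values (new queue weights, ΔVar, new Gain) are taken as inputs rather than derived from vertex totals, the comparison of Dnew is assumed to be against Dold, and the last two tests are assumed to be nested under that comparison. All names are hypothetical.

```python
def passes_filter(q_from, q_to, new_q_from, new_q_to, delta_var, new_gain, throttle):
    """Return True if a projected vertex reassignment passes the filter."""
    if new_q_from > q_from:      # source processor would get heavier
        return False
    if new_q_to < q_to:          # destination processor would get lighter
        return False
    if delta_var >= 0:           # variance would not improve
        return False
    if new_gain > 0 and new_gain ** 2 / (-delta_var) > throttle:
        return False             # total load grows too much for the improvement
    d_new = new_q_from - new_q_to
    d_old = q_from - q_to
    if abs(d_new) > abs(d_old):  # assumed nesting: the gap between the two widened
        if new_q_from < q_to:
            return False
        if new_q_to > q_from:
            return False
    return True


# Hypothetical projections for one candidate move.
accepted = passes_filter(100.0, 50.0, 80.0, 70.0, delta_var=-10.0, new_gain=0.0, throttle=1.0)
rejected_var = passes_filter(100.0, 50.0, 80.0, 70.0, delta_var=5.0, new_gain=0.0, throttle=1.0)
rejected_throttle = passes_filter(100.0, 50.0, 80.0, 70.0, delta_var=-10.0, new_gain=10.0, throttle=1.0)
```

Because every test uses only projected per-processor totals, a rejected move never requires walking the vertex's edges.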

Additional Refinements (to enhance performance) • Graph contraction phase • Bucket sort vertices by process • Quickly find candidates for merging • Maintain a list of processors sorted by QWgt • Few processors change position after vertex moves • Maintaining this list incurs minimal overhead • User-defined latency tolerance function (called before each potential reassignment) • double MinEX(User *user, Ipg *ipg, Qtot *tot) • user = user options passed to the partitioner • ipg = grid configuration graph • tot contains Pproc_p, Comm_p, Remap_p, QLen_p
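The idea of keeping processors sorted by QWgt and repositioning only the two processors touched by a vertex move can be sketched in Python with the standard `bisect` module. This illustrates the bookkeeping only; the data layout (a list of `(load, processor)` tuples) and the function name are assumptions, not the data structure used inside MinEX.

```python
import bisect


def reposition(sorted_procs, qwgt, proc, new_load):
    """Update one processor's load and restore the sorted order.

    sorted_procs: list of (load, processor) tuples in ascending order.
    qwgt: dict mapping processor id to its current load.
    """
    old_entry = (qwgt[proc], proc)
    index = bisect.bisect_left(sorted_procs, old_entry)
    sorted_procs.pop(index)                      # remove the stale entry
    qwgt[proc] = new_load
    bisect.insort(sorted_procs, (new_load, proc))


# Hypothetical loads for three processors.
qwgt = {0: 10.0, 1: 30.0, 2: 20.0}
sorted_procs = sorted((load, proc) for proc, load in qwgt.items())

# A vertex moves from processor 1 (overloaded) to processor 0 (underloaded).
reposition(sorted_procs, qwgt, 1, 18.0)
reposition(sorted_procs, qwgt, 0, 22.0)

lightest = sorted_procs[0][1]
heaviest = sorted_procs[-1][1]
```

Only the source and destination processors are touched per move, so the most and least loaded processors stay available at the ends of the list without a full re-sort.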

Experimental Study: Simulation of a Grid Environment • Simulated grid environment vs. actual grids • Low-cost alternative to constructing a wide range of heterogeneous configurations • Limited grid facilities are available in the field and are usually homogeneous • Methodology • Discrete time simulation • Utilize configuration graph to model processing speed, communication latency, and bandwidth • Configurations (Processors = 32, 64, 128; Interconnect slowdowns = 10, 100; Clusters = 4, 8) • HO: Constant processing and intra-communication capability • UP: Faster processors have faster intra-communication capability • DN: Faster processors have slower intra-communication capability

Reassignment Filter Effectiveness • The reassignment filter eliminates virtually all overhead associated with vertex moves that are rejected • Almost all assignments passing the filter were accepted

Scalability Test (Scales well to 128 processors) • P varied between 8 and 1024; runtimes compared

ThroTTle Test (Initially Improves as throttle increases until curve flattens out)

Multiple Time Step Test (P=64, I=10, C=8, B=16K) • Running multiple iterations does not significantly impact the results • The rest of the experiments will be based on a single time step

Partitioner Speed Comparisons • MinEX has the advantage for P=32 and P=64 • METIS has the advantage for P=1k • Overall, MinEX is competitive

Partition Quality Comparisons (C=8) • MinEX and METIS show similar results for Homogeneous configurations. • Heterogeneous configurations show clear advantage to MinEX

Partition Quality Comparisons (C=8) • Similar results to I=10 experiments • MinEX-Gm results are in general somewhat worse than MinEX-G because of less accurate application modeling • METIS results are significantly worse than MinEX results, but by a smaller margin than with faster interconnects; slower interconnect speed makes the grid more homogeneous

Partition Quality Comparisons: Additional Observations • DN configuration results are similar to UP experiments, with a few exceptions • DN runs are worse than the UP runs in a few cases (998 vs. 1489 if P=128, C=4, I=100, B=64K) • MinEX projected 975, but the solver converged to 1489 • When simulating a second input channel, the solver converges at 975 for DN; no such improvement for METIS • HO runs with P=32 and P=64, I=100, B=256K give METIS an advantage (7399 to 5199 and 4231 to 3334, respectively) • MinEX is converging tightly (LoadImb=1.0001) to a high value • Perhaps the criteria for reassignments need to be further refined

Conclusions • Direct comparisons between MinEX and METIS • MinEX produces partitions that reduce runtime by up to a factor of 6 in highly heterogeneous grids • MinEX and METIS are competitive in homogeneous grids • MinEX is competitive with METIS in speed of execution • Implemented performance refinements to MinEX • The reassignment filter minimizes overhead associated with potential reassignments that are rejected • Sorting processors by QWgt speeds up partitioning decisions • A bucket sort speeds up finding edges to collapse • MinEX can partition directed graphs • Not commonly allowed by current partitioners • Account for latency tolerance during partitioning • Established the benefit and feasibility of this approach • N-body solver implementation • Using the partitioning and message-passing model

On-going Research • MinEX refinements • Analyze the effect of using multiple I/O channels and network dynamics • Refine the method of selecting vertices for reassignment • Refine the discrete time simulator • Develop a general-purpose tool for simulating heterogeneous grids • Establish the accuracy of the simulator by comparing its projections to the performance of applications running on real parallel systems
