Modeling a Million-Node Dragonfly Network using Massively Parallel Discrete-Event Simulation

Misbah Mubarak, Christopher D. Carothers (Rensselaer Polytechnic Institute) • Robert Ross, Philip Carns (Argonne National Laboratory)

Outline • Dragonfly Network Topology • Validation of the dragonfly model • Performance Comparison with booksim • Scaling dragonfly model on BG/P and BG/Q • Conclusion & future work

The Dragonfly Network Topology • A two-level directly connected topology • Uses high-radix routers • Large number of ports per router • Each port has moderate bandwidth • “p”: Number of compute nodes connected to a router • “a”: Number of routers in a group • “h”: Number of global channels per router • Router radix: k = a + p + h – 1 • Recommended configuration: a = 2p = 2h
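The parameter relations above can be turned into a small sizing helper. This is a minimal sketch, not the authors' code: the class and function names are hypothetical, and the group count g = a·h + 1 assumes the maximum-size dragonfly in which every pair of groups shares exactly one global channel, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DragonflyConfig:
    """Hypothetical container for the slide's three parameters."""
    p: int  # compute nodes connected to each router
    a: int  # routers in each group
    h: int  # global channels per router

    @property
    def radix(self) -> int:
        # k = a + p + h - 1: p node ports, a - 1 local ports, h global ports
        return self.a + self.p + self.h - 1

    @property
    def groups(self) -> int:
        # Assumption: maximum-size network, one global channel per group pair
        return self.a * self.h + 1

    @property
    def routers(self) -> int:
        return self.a * self.groups

    @property
    def nodes(self) -> int:
        return self.p * self.routers


def balanced(h: int) -> DragonflyConfig:
    """Build the recommended configuration a = 2p = 2h."""
    return DragonflyConfig(p=h, a=2 * h, h=h)


cfg = balanced(2)
print(cfg.radix, cfg.groups, cfg.routers, cfg.nodes)  # 7 9 36 72
```

With h = 2 the recommended configuration gives radix-7 routers, 9 groups, 36 routers and 72 compute nodes; larger h values grow the node count roughly as h to the fourth power.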

Simulating interconnect networks • Expected size of exascale systems • Millions of compute cores • Up to 1 million compute nodes • A low-latency, small-diameter and low-cost interconnect network is critical at this scale • Exascale HPC systems cannot be effectively simulated with small-scale prototypes • We use the Rensselaer Optimistic Simulation System (ROSS) to simulate a dragonfly model with millions of nodes • Our dragonfly model attains a peak event rate of 1.33 billion events/sec • Total committed events: 872 billion

Dragonfly Model Configuration • Traffic Patterns • Uniform Random Traffic (UR) • Nearest Neighbor Traffic (or Worst Case traffic WC) • Virtual channels • To avoid deadlocks • Credit based flow control • Upstream nodes/routers keep track of buffer slots • An input-queued virtual channel router • Each router port supports up to ‘v’ virtual channels
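Credit-based flow control, as listed above, can be illustrated with a toy model of a single virtual channel. This is a sketch under stated assumptions rather than the ROSS model itself: the `CreditLink` class, its method names, and the choice to queue blocked packets at the sender are all hypothetical.

```python
from collections import deque


class CreditLink:
    """Upstream view of one virtual channel's downstream buffer (hypothetical)."""

    def __init__(self, buffer_slots: int):
        self.credits = buffer_slots  # free downstream slots known to the sender
        self.downstream = deque()    # packets held in the downstream buffer
        self.waiting = deque()       # packets blocked until a credit returns

    def send(self, packet) -> bool:
        """Send if a credit is available; otherwise wait for a credit."""
        if self.credits == 0:
            self.waiting.append(packet)
            return False
        self.credits -= 1
        self.downstream.append(packet)
        return True

    def drain_one(self):
        """Downstream forwards one packet and returns a credit upstream."""
        packet = self.downstream.popleft()
        self.credits += 1
        if self.waiting:
            # The returned credit immediately unblocks the oldest waiting packet
            self.send(self.waiting.popleft())
        return packet


link = CreditLink(buffer_slots=2)
accepted = [link.send(i) for i in range(3)]
print(accepted)          # [True, True, False]
print(link.drain_one())  # 0
print(list(link.downstream), link.credits)  # [1, 2] 0
```

The sender never overruns the downstream buffer because it only transmits while it holds a credit, which is the property the upstream bookkeeping is meant to guarantee.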

[Figure: packet flow in the dragonfly model. The sending node generates a packet at each interval and sends it to the source router; the packet then passes through the source router, intermediate router(s) and destination router to the destination node. At each hop the sender checks whether the downstream buffer is full: if yes, it waits for a credit; if no, it sends the packet, which arrives after a channel delay. Credits are sent back upstream.]

Dragonfly Model Routing Algorithms • Minimal Routing (MIN) • Uniform random traffic: high throughput, low latency • Nearest neighbor traffic: causes congestion, high latency, low throughput • Non-minimal routing (VAL) • Half the throughput of MIN under UR traffic • Nearest neighbor traffic: optimal performance (about 50% throughput) • Global Adaptive routing • Chooses between MIN and VAL by sensing the traffic conditions on the global channels

Dragonfly Model Minimal Routing • [Figure: four panels showing a packet P routed between group G0 (routers R0–R3) and group G1 (routers R4–R7)] • (i) Packet arrives at R0, destination router = R7 • (ii) Packet traverses to R1 over local channel • (iii) Packet traverses from R1 to R4 over the global channel • (iv) Packet traverses to R7 over local channel
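The minimal-routing example (R0 to R1 over a local channel, R1 to R4 over the global channel, R4 to R7 over a local channel) can be reproduced with a short path function. This is an illustrative sketch: the function names and the `global_links` table are hypothetical, routers are assumed to be numbered consecutively within each group, and every router is assumed to reach any other router in its group in one local hop.

```python
def group_of(router: int, routers_per_group: int) -> int:
    """Group index of a router, assuming consecutive numbering within groups."""
    return router // routers_per_group


def minimal_route(src: int, dst: int, routers_per_group: int, global_links: dict) -> list:
    """Return the router sequence of a minimal route from src to dst.

    global_links maps (source group, destination group) to the pair
    (exit router, entry router) joined by the global channel (hypothetical table).
    """
    path = [src]
    src_group = group_of(src, routers_per_group)
    dst_group = group_of(dst, routers_per_group)
    if src_group != dst_group:
        exit_router, entry_router = global_links[(src_group, dst_group)]
        if path[-1] != exit_router:
            path.append(exit_router)   # local hop to the router owning the global channel
        path.append(entry_router)      # global hop into the destination group
    if path[-1] != dst:
        path.append(dst)               # local hop to the destination router
    return path


# Two groups of four routers, joined by the R1 <-> R4 global channel from the figure
links = {(0, 1): (1, 4), (1, 0): (4, 1)}
print(minimal_route(0, 7, 4, links))  # [0, 1, 4, 7]
```

A minimal route therefore uses at most one local hop, one global hop and one more local hop, which is why it performs well under uniform random traffic and congests the single global channel under nearest neighbor traffic.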

Dragonfly Model Validation • Dragonfly network topologies in design • PERCS network topology • Machines from Echelon project • Booksim: • A cycle-accurate simulator with a dragonfly model • Used by Dally et al. to validate the dragonfly topology proposal • Runs in serial mode only • Supports minimal and global adaptive routing • Performance results shown on 1,024 nodes and 264 routers • We validated our ROSS dragonfly model against booksim

Global Adaptive Routing: Threshold selection (ROSS vs. Booksim) • Booksim uses an adaptive threshold to bias the UGAL algorithm towards minimal or non-minimal routing • We incorporated a similar threshold in ROSS • We ran experiments to find the threshold value that biases traffic most strongly towards non-minimal routing • The value that yields the maximum number of non-minimal packets is -180 • Decision rule: if min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold then route minimally, else route non-minimally
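The decision rule above translates directly into a small function. This is a minimal sketch: the function name and the "MIN"/"VAL" return labels are hypothetical, and the queue sizes in the usage lines are made-up values chosen only to show how the threshold shifts the choice.

```python
ADAPTIVE_THRESHOLD = -180  # value reported above as yielding the most non-minimal packets


def choose_route(min_queue_size: int, nonmin_queue_size: int,
                 adaptive_threshold: int = ADAPTIVE_THRESHOLD) -> str:
    """Return "MIN" to route minimally, "VAL" to route non-minimally."""
    if min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold:
        return "MIN"
    return "VAL"


# Illustrative queue occupancies (not measurements from the talk)
print(choose_route(50, 100))                        # VAL: 50 is not below 2*100 - 180 = 20
print(choose_route(50, 100, adaptive_threshold=0))  # MIN: 50 is below 2*100 + 0 = 200
print(choose_route(10, 100))                        # MIN: 10 is below 20
```

A more negative threshold shrinks the right-hand side of the comparison, so fewer packets pass the test for minimal routing and more are sent over the non-minimal path.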

ROSS vs. booksim – Uniform Random traffic • With minimal routing, ROSS has an average of 4.2% and a maximum of 7% difference from booksim results • With global adaptive routing, ROSS has an average of 3% and a maximum of 7.8% difference from booksim results

ROSS vs. booksim – Nearest neighbor traffic • Nearest neighbor traffic yields very high latency and low throughput with minimal routing • This traffic pattern can be load balanced by either non-minimal or adaptive routing • Non-minimal routing gives slightly under 50% throughput with nearest neighbor traffic

Dragonfly performance: ROSS vs. booksim • ROSS attains the following speedups over booksim • Between 5x and 11x with MIN routing • Between 5.3x and 12.38x with global adaptive routing

ROSS Dragonfly model on BG/P and BG/Q • We evaluated the strong scaling characteristics of the dragonfly model on • Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid) • Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q • We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P • Performance was evaluated through the following metrics • Committed event rate • Percentage of remote events • ROSS event efficiency • Simulation run time

ROSS Parameters • ROSS employs the Time Warp optimistic synchronization protocol • To reduce state-saving overheads, ROSS employs an event rollback mechanism • ROSS event efficiency determines the amount of useful work performed by the simulation • Global Virtual Time (GVT) imposes a lower bound on the simulation time • GVT is controlled by the batch and gvt-interval parameters • On average, batch * gvt-interval events are processed between each GVT epoch
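The two quantities above can be worked through numerically. This is a hedged sketch: the function names are hypothetical, the batch and gvt-interval values are illustrative rather than the settings used in the experiments, and the efficiency formula (one minus rolled-back events divided by net events) is an assumed definition that the slides do not spell out.

```python
def events_per_gvt_epoch(batch: int, gvt_interval: int) -> int:
    """Average number of events processed between two GVT computations."""
    return batch * gvt_interval


def event_efficiency(processed_events: int, rolled_back_events: int) -> float:
    """Assumed definition: 1 - rolled_back / net, where net = processed - rolled_back."""
    net_events = processed_events - rolled_back_events
    if net_events <= 0:
        raise ValueError("no net events: efficiency is undefined")
    return 1.0 - rolled_back_events / net_events


# Illustrative settings only
print(events_per_gvt_epoch(batch=16, gvt_interval=1024))  # 16384

# 1,100 events processed, 100 of them rolled back, so 1,000 net events
print(round(event_efficiency(1100, 100), 3))  # 0.9
```

Under this assumed definition, efficiency is 1.0 when nothing is rolled back and falls as optimistic execution wastes more work, which matches how the metric is used to compare the BG/P and BG/Q runs.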

ROSS Dragonfly Performance Results on BG/P vs. BG/Q • Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks • Less off-node communication on BG/Q vs. BG/P • Each MPI task has more processing power on BG/P and simulation advances quickly

ROSS Dragonfly Performance Results on BG/P vs. BG/Q • The event efficiency stays high on both BG/P and BG/Q as each MPI task has a substantial workload • The computation performed at each MPI task dominates the number of rolled back events

Conclusion & Future work • Conclusion • We presented a parallel discrete-event simulation for a dragonfly network topology • We validated our simulator against the cycle-accurate simulator booksim • We demonstrated the ability of our simulator to scale on very large models with up to 50M nodes • Future work • Introduce an improved queue congestion sensing policy for global adaptive routing • Experiment with other variations of nearest neighbor traffic in dragonfly • Compare the dragonfly network model with other candidate topology models for exascale computing
