
Modeling a Million-Node Dragonfly Network using Massively Parallel Discrete-Event Simulation


Presentation Transcript


  1. Modeling a Million-Node Dragonfly Network using Massively Parallel Discrete-Event Simulation • Misbah Mubarak, Christopher D. Carothers (Rensselaer Polytechnic Institute) • Robert Ross, Philip Carns (Argonne National Laboratory)

  2. Outline • Dragonfly Network Topology • Validation of the dragonfly model • Performance Comparison with booksim • Scaling dragonfly model on BG/P and BG/Q • Conclusion & future work

  3. The Dragonfly Network Topology • A two-level directly connected topology • Uses high-radix routers • Large number of ports per router • Each port has moderate bandwidth • "p": number of compute nodes connected to a router • "a": number of routers in a group • "h": number of global channels per router • Router radix: k = a + p + h - 1 • Recommended configuration: a = 2p = 2h
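The parameter relations on this slide can be turned into a small sizing sketch. The radix formula and the balanced rule a = 2p = 2h come from the slide; the group count g = a*h + 1 (one global channel between every pair of groups) is an assumption taken from the original dragonfly proposal, not from this transcript. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DragonflyConfig:
    p: int  # compute nodes per router
    a: int  # routers per group
    h: int  # global channels per router

    @property
    def radix(self) -> int:
        # k = a + p + h - 1, as given on the slide
        return self.a + self.p + self.h - 1

    @property
    def groups(self) -> int:
        # ASSUMPTION: maximum-size network, one global link per group pair
        return self.a * self.h + 1

    @property
    def routers(self) -> int:
        return self.a * self.groups

    @property
    def nodes(self) -> int:
        return self.p * self.routers


def balanced(h: int) -> DragonflyConfig:
    """Build the recommended configuration a = 2p = 2h."""
    return DragonflyConfig(p=h, a=2 * h, h=h)


if __name__ == "__main__":
    cfg = balanced(4)
    print(cfg.radix, cfg.groups, cfg.routers, cfg.nodes)
```

Under these assumptions, growing h by one step quickly scales the network, which is why a modest router radix can reach very large node counts.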

  4. Simulating interconnect networks • Expected size of exascale systems • Millions of compute cores • Up to 1 million compute nodes • Critical to have a low-latency, small-diameter, low-cost interconnect network • Exascale HPC systems cannot be effectively simulated with small-scale prototypes • We use the Rensselaer Optimistic Simulation System (ROSS) to simulate a dragonfly model with millions of nodes • Our dragonfly model attains a peak event rate of 1.33 billion events/sec • Total committed events: 872 billion

  5. Dragonfly Model Configuration • Traffic Patterns • Uniform Random Traffic (UR) • Nearest Neighbor Traffic (or Worst Case traffic WC) • Virtual channels • To avoid deadlocks • Credit based flow control • Upstream nodes/routers keep track of buffer slots • An input-queued virtual channel router • Each router port supports up to ‘v’ virtual channels

  6. [Figure: flowchart of packet flow under credit-based flow control. The sending node generates a packet at each interval. At every hop (sending node, source router, intermediate router(s), destination router) the sender checks "Buffer full?": if yes, it waits for a credit; if no, it sends the packet to the outbound buffer, and the packet arrives at the next hop after the channel delay. Each receiver (source router, intermediate router(s), destination router, destination node) sends a credit back upstream.]
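The credit-based flow control described on the two slides above can be sketched as a counter kept by the upstream sender. This is a minimal illustration, not the ROSS model's code: the class `CreditLink`, its methods, and the single-channel simplification are all hypothetical, and channel delay is omitted.

```python
from collections import deque


class CreditLink:
    """Upstream view of one downstream buffer (one virtual channel)."""

    def __init__(self, buffer_slots: int):
        self.credits = buffer_slots   # free downstream slots known upstream
        self.waiting = deque()        # packets stalled waiting for a credit
        self.delivered = []           # packets handed to the downstream buffer

    def send(self, packet) -> bool:
        """Send if a slot is free; otherwise queue the packet and wait."""
        if self.credits == 0:         # "Buffer full?" -> wait for credit
            self.waiting.append(packet)
            return False
        self.credits -= 1             # one downstream slot is now occupied
        self.delivered.append(packet)
        return True

    def credit_returned(self) -> None:
        """Downstream freed a slot: restore a credit, then retry a stalled packet."""
        self.credits += 1
        if self.waiting:
            self.send(self.waiting.popleft())


if __name__ == "__main__":
    link = CreditLink(buffer_slots=2)
    for pkt in ("a", "b", "c"):
        print(pkt, link.send(pkt))
    link.credit_returned()
    print(link.delivered)
```

Because the sender never transmits without a credit, the downstream buffer cannot overflow, so no packet has to be dropped.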

  7. Dragonfly Model Routing Algorithms • Minimal routing (MIN) • Uniform random traffic: high throughput, low latency • Nearest neighbor traffic: causes congestion, high latency, low throughput • Non-minimal routing (VAL) • Half the throughput of MIN under UR traffic • Nearest neighbor traffic: optimal performance (about 50% throughput) • Global adaptive routing • Chooses between MIN and VAL by sensing the traffic conditions on the global channels

  8. Dragonfly Model Minimal Routing [Figure: four panels showing packet P moving between two groups, G0 (routers R0-R3) and G1 (routers R4-R7)] • (i) Packet arrives at R0, destination router = R7 • (ii) Packet traverses to R1 over local channel • (iii) Packet traverses from R1 to R4 over the global channel • (iv) Packet traverses to R7 over local channel
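The four steps of the minimal-routing example can be reproduced with a short sketch. It assumes that routers within a group are fully connected and that one global link joins each pair of groups. The function `minimal_route` and the `group_of` and `global_links` tables are hypothetical names for illustration, with the R1-to-R4 global link taken from the slide.

```python
def minimal_route(src, dst, group_of, global_links):
    """Return the router path from src to dst using at most one global hop.

    global_links maps (source_group, destination_group) to the pair
    (exit_router, entry_router) joined by that global channel.
    ASSUMPTION: any two routers in the same group share a local channel.
    """
    path = [src]
    if group_of[src] != group_of[dst]:
        exit_r, entry_r = global_links[(group_of[src], group_of[dst])]
        if exit_r != path[-1]:
            path.append(exit_r)    # local hop to the router owning the global link
        path.append(entry_r)       # global hop into the destination group
    if path[-1] != dst:
        path.append(dst)           # final local hop inside the destination group
    return path


# Two groups as drawn on the slide: G0 = R0..R3, G1 = R4..R7.
GROUP_OF = {f"R{i}": ("G0" if i < 4 else "G1") for i in range(8)}
GLOBAL_LINKS = {("G0", "G1"): ("R1", "R4"), ("G1", "G0"): ("R4", "R1")}

if __name__ == "__main__":
    print(minimal_route("R0", "R7", GROUP_OF, GLOBAL_LINKS))
```

With these tables the path from R0 to R7 is local, global, local, matching panels (ii) through (iv).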

  9. Outline • Dragonfly Network Topology • Validation of the dragonfly model • Performance Comparison with booksim • Scaling dragonfly model on BG/P and BG/Q • Conclusion & future work

  10. Dragonfly Model Validation • Dragonfly network topologies in design • PERCS network topology • Machines from the Echelon project • Booksim: • A cycle-accurate simulator with a dragonfly model • Used by Dally et al. to validate the dragonfly topology proposal • Runs in serial mode only • Supports minimal and global adaptive routing • Performance results shown on 1,024 nodes and 264 routers • We validated our ROSS dragonfly model against booksim

  11. Global Adaptive Routing: Threshold selection (ROSS vs. Booksim) • Booksim uses an adaptive threshold to bias the UGAL algorithm towards minimal or non-minimal routing • We incorporated a similar threshold in ROSS • We ran experiments to find the threshold value that biases traffic towards non-minimal routing • The value that yields the maximum number of non-minimal packets is -180 • Global adaptive routing rule: if min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold, then route minimally; else route non-minimally
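The routing rule on this slide translates directly into a small decision function. The comparison and the -180 default are from the slide. The function name `choose_route`, the string return values, and the treatment of queue sizes as plain integers are illustrative assumptions.

```python
def choose_route(min_queue_size: int,
                 nonmin_queue_size: int,
                 adaptive_threshold: int = -180) -> str:
    """Pick a path using the slide's rule.

    Route minimally when
        min_queue_size < 2 * nonmin_queue_size + adaptive_threshold,
    otherwise route non-minimally. A more negative threshold makes the
    condition harder to satisfy, biasing traffic towards non-minimal paths.
    """
    if min_queue_size < 2 * nonmin_queue_size + adaptive_threshold:
        return "minimal"
    return "non-minimal"


if __name__ == "__main__":
    print(choose_route(0, 100))                        # lightly loaded minimal path
    print(choose_route(10, 10))                        # threshold -180 dominates
    print(choose_route(10, 10, adaptive_threshold=0))  # unbiased comparison
```

With equal queue sizes of 10, the unbiased rule (threshold 0) routes minimally, while the -180 threshold flips the decision to non-minimal, which illustrates the bias the slide describes.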

  12. ROSS vs. booksim: Uniform Random traffic • With minimal routing, ROSS has an average of 4.2% and a maximum of 7% difference from booksim results • With global adaptive routing, ROSS has an average of 3% and a maximum of 7.8% difference from booksim results

  13. ROSS vs. booksim: Nearest neighbor traffic • The nearest neighbor traffic yields a very high latency and low throughput with minimal routing • This traffic pattern can be load balanced by either non-minimal or adaptive routing • Non-minimal routing gives slightly under 50% throughput with nearest neighbor traffic

  14. Outline • Dragonfly Network Topology • Validation of the dragonfly model • Performance Comparison with booksim • Scaling dragonfly model on BG/P and BG/Q • Conclusion & future work

  15. Dragonfly performance: ROSS vs. booksim • ROSS attains the following performance speedup • Minimum of 5x up to a maximum of 11x speedup over booksim with MIN routing • Minimum of 5.3x speedup and a maximum of 12.38x speedup with global adaptive routing

  16. Outline • Dragonfly Network Topology • Validation of the dragonfly model • Performance Comparison with booksim • Scaling dragonfly model on BG/P and BG/Q • Conclusion & future work

  17. ROSS Dragonfly model on BG/P and BG/Q • We evaluated the strong scaling characteristics of the dragonfly model on • Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid) • Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q • We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P • Performance was evaluated through the following metrics • Committed event rate • Percentage of remote events • ROSS event efficiency • Simulation run time

  18. ROSS Parameters • ROSS employs the Time Warp optimistic synchronization protocol • To reduce state-saving overheads, ROSS employs an event rollback mechanism • ROSS event efficiency determines the amount of useful work performed by the simulation • Global Virtual Time (GVT) imposes a lower bound on the simulation time • GVT is controlled by the batch and gvt-interval parameters • On average, batch * gvt-interval events are processed between each GVT epoch
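The two quantities on this slide can be illustrated with a little arithmetic. The product batch * gvt-interval is stated on the slide. The efficiency formula below (one minus rolled-back events divided by net committed events) is an assumed definition chosen for illustration, and the function names and sample numbers are hypothetical.

```python
def events_per_gvt_epoch(batch: int, gvt_interval: int) -> int:
    """Average number of events processed between GVT computations (from the slide)."""
    return batch * gvt_interval


def event_efficiency(committed_events: int, rolled_back_events: int) -> float:
    """ASSUMED definition: 1 - rolled_back / committed.

    Equals 1.0 when nothing is rolled back and falls as rollbacks grow.
    It becomes negative when more events are rolled back than committed.
    """
    return 1.0 - rolled_back_events / committed_events


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the arithmetic.
    print(events_per_gvt_epoch(batch=16, gvt_interval=512))
    print(event_efficiency(committed_events=1000, rolled_back_events=100))
```

Larger batch or gvt-interval values let each task run further ahead between GVT computations, which under optimistic synchronization can raise the number of events that later have to be rolled back.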

  19. ROSS Dragonfly Performance Results on BG/P vs. BG/Q • Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks • Less off-node communication on BG/Q vs. BG/P • Each MPI task has more processing power on BG/P, and the simulation advances quickly

  20. ROSS Dragonfly Performance Results on BG/P vs. BG/Q • The event efficiency stays high on both BG/P and BG/Q as each MPI task has substantial work load • The computation performed at each MPI task dominates the number of rolled back events

  21. Outline • Dragonfly Network Topology • Validation of the dragonfly model • Performance Comparison with booksim • Scaling dragonfly model on BG/P and BG/Q • Conclusion & future work

  22. Conclusion & Future work • Conclusion • We presented a parallel discrete-event simulation for a dragonfly network topology • We validated our simulator against the cycle-accurate simulator booksim • We demonstrated the ability of our simulator to scale to very large models with up to 50M nodes • Future work • Introduce an improved queue congestion sensing policy for global adaptive routing • Experiment with other variations of nearest neighbor traffic in dragonfly • Compare the dragonfly network model with other candidate topology models for exascale computing
