
Advancing Supercomputer Performance Through Interconnection Topology Synthesis

Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng
Department of Computer Science and Engineering, University of California, San Diego


Presentation Transcript


  1. Advancing Supercomputer Performance Through Interconnection Topology Synthesis • Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng • Department of Computer Science and Engineering, University of California, San Diego

  2. Outline • Introduction • Design Flow, Formulation & Algorithms • Example: Blue Gene/L Packaging • Overview • Models & Constraints • Experiments • Benchmark Instances • Generated Instances • Conclusion & Future Work

  3. Interconnection Networks • Interconnection networks are becoming a more critical factor than computing or memory modules (W. Dally, HPCA 2007 keynote speech) • Popular network topologies: • Hypercube (SGI Origin2000) • 2D torus (Cray X1) • 3D torus (Cray T3E and XT3, IBM Blue Gene/L) • Crossbar (NEC Earth Simulator) • Folded Clos (Cray BlackWidow) • Fat tree, flattened butterfly, etc.

  4. Our Work • We propose a design methodology that selects the topology minimizing the average latency • The design flow is fully automated • Physical constraints can be specified by users • An efficient multi-commodity flow algorithm evaluates candidate topologies • We demonstrate its efficiency using the Blue Gene/L packaging framework

  5. Design Flow [Flow diagram; blocks: Delay Models, Topology Pool, Communication Patterns, Physical Constraints, MCF Evaluation Solver, Best Topology]

  6. Multi-Commodity Flow (MCF) • Graph G(V,E) • K commodities, each with a source, a sink, and a demand d(k) • Each edge e has a capacity u(e) • Each edge e has a weight w(e) • Minimum-cost MCF: each commodity k routes d(k) units under the capacity constraints, minimizing $\sum_{e \in E} w(e) f(e)$, where f(e) is the flow routed on edge e
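The formulation on this slide can be written as a linear program. The block below is a sketch of the standard edge-based form. The per-commodity flow $f_k(e)$, the source $s_k$ and sink $t_k$, and the edge sets $\delta^{+}(v)$ (leaving v) and $\delta^{-}(v)$ (entering v) are notation assumed here, not taken from the slides.

```latex
\begin{align*}
\min \quad & \sum_{e \in E} w(e)\, f(e) \\
\text{s.t.} \quad
  & f(e) = \sum_{k=1}^{K} f_k(e) \le u(e)
    && \forall e \in E \\
  & \sum_{e \in \delta^{+}(v)} f_k(e) - \sum_{e \in \delta^{-}(v)} f_k(e) =
    \begin{cases}
      d(k)  & v = s_k \\
      -d(k) & v = t_k \\
      0     & \text{otherwise}
    \end{cases}
    && \forall v \in V,\ k = 1, \dots, K \\
  & f_k(e) \ge 0
    && \forall e \in E,\ k = 1, \dots, K
\end{align*}
```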

  7. Mapping Supercomputer Performance Evaluation to the MCF Problem • Nodes: processors • Edges: interconnection links • Commodities: communications • Demands: communication bandwidth (injection rate) • Flow amount: wire assignments • Capacity constraints: physical constraints (wires, pins, board dimensions) • Edge weight: unit latency (unit power)
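To make the mapping concrete, here is a minimal Python sketch that evaluates one candidate routing on a small network. Everything in it is illustrative: the function name `evaluate_routing`, the 4-node ring, the path-based routing format, and the choice to treat links as undirected with a shared capacity are assumptions, not details from the slides.

```python
from collections import defaultdict


def evaluate_routing(edges, routing):
    """Return the MCF cost sum(w(e) * f(e)) of a path-based routing.

    edges:   {(u, v): (capacity, weight)} with u < v; links are treated as
             undirected with one shared capacity (an assumption of this sketch).
    routing: list of (path, amount); each path is a list of processor ids.
    Raises ValueError if a path uses a missing link or a capacity is exceeded.
    """
    load = defaultdict(float)  # f(e): total flow on each link
    for path, amount in routing:
        for u, v in zip(path, path[1:]):
            key = (min(u, v), max(u, v))
            if key not in edges:
                raise ValueError(f"no link between {u} and {v}")
            load[key] += amount

    for key, flow in load.items():
        capacity, _ = edges[key]
        if flow > capacity + 1e-9:
            raise ValueError(f"link {key} overloaded: {flow} > {capacity}")

    return sum(edges[key][1] * flow for key, flow in load.items())


# Hypothetical 4-processor ring; every link has capacity 1 and unit latency 1.
ring = {(0, 1): (1.0, 1.0), (1, 2): (1.0, 1.0),
        (2, 3): (1.0, 1.0), (0, 3): (1.0, 1.0)}

# Commodity 0->2 (demand 1) is split over both directions of the ring;
# commodity 1->2 (demand 0.5) uses the direct link.
routing = [([0, 1, 2], 0.5), ([0, 3, 2], 0.5), ([1, 2], 0.5)]

total_cost = evaluate_routing(ring, routing)
print(total_cost)
```

Dividing the returned cost by the total demand gives a demand-weighted average latency, which is one plausible reading of the objective on the earlier slides.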

  8. An Example on Maximum Concurrent Flow • Two commodities: s1->t1, s2->t2, both have demand d(1)=d(2)=1 • Optimal throughput = 1.5
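Maximum concurrent flow asks for the largest common factor $\lambda$ by which every demand can be scaled and still be routed simultaneously. A sketch in the usual path-based notation follows; the path set $\mathcal{P}_k$ of commodity k, the path flow $x(p)$, and the symbol $\lambda$ are assumed notation, not taken from the slides.

```latex
\begin{align*}
\max \quad & \lambda \\
\text{s.t.} \quad
  & \sum_{p \in \mathcal{P}_k} x(p) \ge \lambda\, d(k)
    && \forall k \\
  & \sum_{k} \sum_{p \in \mathcal{P}_k :\, e \in p} x(p) \le u(e)
    && \forall e \in E \\
  & x(p) \ge 0
    && \forall k,\ \forall p \in \mathcal{P}_k
\end{align*}
```

Under this definition, a throughput of 1.5 with d(1)=d(2)=1 means that 1.5 units of each commodity can be routed at the same time.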

  9. Approximation Algorithms • The duality theory in LP: for a maximization problem, a primal feasible value P, a dual feasible value D, and the optimal value OPT satisfy P ≤ OPT ≤ D • Increase P and decrease D iteratively until the duality gap is small enough
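Why a small duality gap works as a stopping rule follows from weak duality. In the sketch below, P is the value of a primal feasible solution, D the value of a dual feasible solution, and $\epsilon$ is an assumed tolerance parameter that the slides do not name.

```latex
\begin{align*}
P \le \mathrm{OPT} \le D,
\qquad
\frac{D}{P} \le 1 + \epsilon
\;\Longrightarrow\;
P \;\ge\; \frac{D}{1 + \epsilon} \;\ge\; \frac{\mathrm{OPT}}{1 + \epsilon}
\end{align*}
```

So once the ratio D/P drops below $1 + \epsilon$, the current primal solution is within a factor $1 + \epsilon$ of optimal, without OPT ever being computed.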

  10. Blue Gene/L: An Example • Midplane: 8x8x8 torus

  11. Assumptions • We follow the same hierarchical structure: midplane, node card, compute card • The properties of the boards (dimensions, number of layers, dielectric) remain unchanged • We seek better topologies than the existing 3D torus to implement the networks in the midplane

  12. Topology Generation • Generate 8-node 1D topologies and duplicate them to each row and column • Topologies are isomorph-free and have a maximum degree bound for each node [Table: number of isomorph-free topologies]
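The generation step can be illustrated with a brute-force enumerator. This is a sketch only: the slides do not say how isomorphic duplicates are filtered, so the canonical-form approach, the connectivity requirement, and the names `canonical_form` and `enumerate_topologies` are assumptions. Trying every relabeling is practical only for very small node counts; the 8-node case would need a smarter generator.

```python
from itertools import combinations, permutations


def canonical_form(n, edges):
    """Smallest sorted edge list over all relabelings of the n nodes."""
    best = None
    for perm in permutations(range(n)):
        relabeled = tuple(sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best


def is_connected(n, edges):
    """Depth-first search from node 0 over an undirected edge list."""
    adjacency = {v: set() for v in range(n)}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adjacency[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def enumerate_topologies(n, max_degree):
    """List one representative per isomorphism class of connected n-node
    graphs whose node degrees do not exceed max_degree."""
    all_edges = list(combinations(range(n), 2))
    seen, result = set(), []
    # A connected graph on n nodes needs at least n - 1 edges.
    for m in range(n - 1, len(all_edges) + 1):
        for edges in combinations(all_edges, m):
            degree = [0] * n
            for u, v in edges:
                degree[u] += 1
                degree[v] += 1
            if max(degree) > max_degree or not is_connected(n, edges):
                continue
            key = canonical_form(n, edges)
            if key not in seen:
                seen.add(key)
                result.append(edges)
    return result


# Small illustration on 4 nodes (not the 8-node case from the slide).
print(len(enumerate_topologies(4, 2)))  # path and cycle
print(len(enumerate_topologies(4, 3)))  # all connected 4-node graphs
```

Tightening the degree bound shrinks the pool quickly, which is the effect the bound has on the search space.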

  13. Node Card Graph Model • Horizontal: strongly connected • Vertical: generated topology

  14. Midplane Graph Model • Coteus et al., “Packaging the Blue Gene/L Supercomputer,” IBM J. of Res. & Dev., Vol. 49, pp. 213-248

  15. Experiment 1: Benchmark Instances • NAS Parallel Benchmarks (121/128 processes) [Flow diagram; blocks and labels: Benchmark source code; Compiled with Intel Trace Collector & Analyzer; Executable; Run on multi-processor machines; Traffic Patterns; Simulated annealing placement; Task placement; Our design flow; Best topology; Output]
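The simulated annealing placement step can be sketched as follows. The slides give no details of the placement cost or cooling schedule, so this example assumes a traffic-weighted hop count on a ring, pairwise swap moves, and a geometric cooling schedule; `ring_hops`, `placement_cost`, and `anneal_placement` are hypothetical names.

```python
import math
import random


def ring_hops(a, b, n):
    """Hop distance between slots a and b on an n-slot ring (assumed metric)."""
    d = abs(a - b)
    return min(d, n - d)


def placement_cost(traffic, placement, n):
    """Traffic-weighted hop count; placement[i] is the slot of task i."""
    return sum(volume * ring_hops(placement[i], placement[j], n)
               for (i, j), volume in traffic.items())


def anneal_placement(traffic, n, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over task-to-slot placements using swap moves."""
    rng = random.Random(seed)
    placement = list(range(n))
    cost = placement_cost(traffic, placement, n)
    best, best_cost = placement[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        a, b = rng.sample(range(n), 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = placement_cost(traffic, placement, n)
        accept = (new_cost <= cost or
                  rng.random() < math.exp((cost - new_cost) / temperature))
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = placement[:], cost
        else:
            # Undo the rejected swap.
            placement[a], placement[b] = placement[b], placement[a]
    return best, best_cost


# Hypothetical traffic pattern: three task pairs with different volumes.
traffic = {(0, 4): 3.0, (1, 5): 2.0, (2, 6): 1.0}
initial_cost = placement_cost(traffic, list(range(8)), 8)
best, best_cost = anneal_placement(traffic, 8)
print(initial_cost, best_cost)
```

Because the best placement seen so far is tracked separately, the returned cost never exceeds that of the initial identity placement.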

  16. Benchmark Characteristics • Communication pattern: MG

  17. Results • Optimal: a different topology for each instance • Aggregate: one topology for all instances • 3D Torus: the existing 3D torus topology

  18. Experiment 2: Generated Instances • Randomly generated communications • Scalar values that represent the bandwidth demand between each pair of nodes • More general, time independent • Control parameters • Number of communication demands: O(n) pairs • Communication amount: uniform traffic, varied from case to case (different congestion levels)
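A generated instance of this kind might be produced as in the sketch below. The function name `generate_instance`, the `pairs_per_node` knob used to obtain O(n) pairs, and the fixed seed are assumptions for illustration; the slides do not describe the generator itself.

```python
import random


def generate_instance(num_nodes, pairs_per_node, demand, seed=0):
    """Pick pairs_per_node * num_nodes distinct (source, sink) pairs at random
    and give each the same scalar bandwidth demand (uniform traffic)."""
    rng = random.Random(seed)
    candidates = [(s, t) for s in range(num_nodes)
                  for t in range(num_nodes) if s != t]
    num_pairs = min(pairs_per_node * num_nodes, len(candidates))
    chosen = rng.sample(candidates, num_pairs)
    return {pair: demand for pair in chosen}


# Two instances with the same number of pairs but different congestion levels.
light = generate_instance(num_nodes=8, pairs_per_node=2, demand=0.1, seed=1)
heavy = generate_instance(num_nodes=8, pairs_per_node=2, demand=0.5, seed=1)
print(len(light), len(heavy))
```

Reusing the seed while changing `demand` keeps the communication pairs fixed and varies only the congestion level, which makes runs at different injection rates directly comparable.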

  19. Latency & Throughput Tradeoffs Distribution: 40% / 50% / 10%

  20. Topologies with Different Injection Rates • With a larger injection rate, more (red) links must cross the cut between nodes 4 and 5 in order to reduce the number of hops

  21. Conclusion • A design flow for interconnection network synthesis • Fully automated • Explores a large design space • Efficient evaluation algorithm • Future work • Power consumption • Accurate simulation

  22. Q&A Thank you!
