
Hitachi SR8000 Supercomputer



Presentation Transcript


  1. Hitachi SR8000 Supercomputer LAPPEENRANTA UNIVERSITY OF TECHNOLOGY Department of Information Technology 010652000 Introduction to Parallel Computing Group 2: Juha Huttunen, Tite 4 Olli Ryhänen, Tite 4

  2. History of SR8000 • Successor of the Hitachi S-3800 vector supercomputer and the SR2201 parallel computer.

  3. Overview of system architecture • Distributed-memory parallel computer with pseudo-vector SMP nodes.

  4. Processing Unit • IBM PowerPC CPU architecture with Hitachi’s extensions • 64-bit PowerPC RISC processors • Available in speeds: 250 MHz, 300 MHz, 375 MHz and 450 MHz • Hitachi extensions • Additional 128 floating-point registers (total of 160 FPRs) • Fast hardware barrier synchronisation mechanism • Pseudo Vector Processing (PVP)

  5. 160 Floating-Point registers • 160 FP registers • FR0 – FR31 global part • FR32 – FR159 slide part • FPR operations extended to handle slide part • Example: inner product of two arrays (sketched below)
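
A minimal C sketch of the inner-product kernel mentioned above. The source needs no special syntax; on the SR8000 it is the compiler that maps such a loop onto the slide registers for software pipelining, so the code below is only a plain reference version.

```c
/* Plain C inner-product (dot-product) kernel.  On the SR8000 the
 * compiler would software-pipeline a loop like this, keeping partial
 * results and streamed operands in the 128 "slide" FP registers;
 * the source itself is an ordinary scalar loop. */
double ddot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```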

  6. Pseudo Vector Processing (PVP) • Introduced in the Hitachi SR2201 supercomputer • Designed to solve memory bandwidth problems in RISC CPUs • Performance similar to a vector processor • Non-blocking arithmetic execution • Reduces the chance of cache misses • Pipelined data loading • pre-fetch • pre-load
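
PVP's pre-fetch and pre-load are inserted by the compiler, not by the programmer. The generic C sketch below only illustrates the underlying idea of requesting data a few iterations ahead of its use so that loads overlap with arithmetic; the GCC-style __builtin_prefetch is a portable stand-in, not the SR8000 mechanism itself.

```c
/* Generic illustration of pipelined data loading: ask for data "dist"
 * iterations ahead so memory traffic overlaps with the arithmetic.
 * On the SR8000 the compiler emits the pre-fetch / pre-load
 * instructions automatically; this builtin is only an analogy. */
void scale(int n, double *a, const double *b)
{
    const int dist = 16;                     /* illustrative prefetch distance */
    for (int i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&b[i + dist], 0, 1);
        a[i] = 2.0 * b[i];
    }
}
```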

  7. Pseudo Vector Processing (PVP) • Performance effect of PVP

  8. Node Structure • Pseudo vector SMP-nodes • 8 instruction processors (IP) for computation • 1 system control processor (SP) for management • Co-operative Micro-processors in single Address Space (COMPAS) • Maximum number of nodes is 512 (4096 processors) • Node types • Processing Nodes (PRN) • I/O Nodes (ION) • Supervisory Node (SVN) • One per system

  9. Node Partitioning/Grouping • A physical node can belong to many logical partitions • A node can belong to multiple node groups • Node groups are created dynamically by the master node

  10. COMPAS • Auto parallelization by the compiler • Hardware support for fast fork/join sequences • Small start-up overhead • Cache coherency • Fast signalling between child and parent processes
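
COMPAS needs no source-level directives: the compiler itself forks a loop such as the one below across the node's eight instruction processors and joins at the end. The OpenMP pragma is only a familiar stand-in used here to make the fork/join structure visible; it is not how COMPAS is invoked.

```c
/* A loop of this shape is what COMPAS parallelises automatically:
 * the compiler forks the iterations across the 8 IPs of a node and
 * joins them afterwards, using the hardware support for fast
 * fork/join.  The pragma below is purely illustrative. */
void axpy(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for            /* stand-in for the automatic fork/join */
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```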

  11. COMPAS • Performance effect of COMPAS

  12. Interconnection Network • Interconnection network • Multidimensional crossbar • 1, 2 or 3-dimensional • Maximum of 8 nodes/dimension • External connections via I/O nodes • Ethernet, ATM, etc. • Remote Direct Memory Access (RDMA) • Data transfer between nodes • Minimizes operating system overhead • Support in MPI and PVM libraries
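
The slides state that RDMA is supported through the MPI and PVM libraries but do not show an interface. The sketch below uses standard MPI-2 one-sided communication (MPI_Put) as a generic, non-Hitachi-specific way to express a direct remote-memory transfer; run it with at least two ranks.

```c
/* Generic MPI-2 one-sided transfer: rank 0 writes directly into a
 * memory window exposed by rank 1, without rank 1 posting a receive.
 * This is the portable expression of an RDMA-style transfer; nothing
 * here is specific to the SR8000. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024] = {0.0};
    MPI_Win win;
    /* Expose buf so other ranks can put data into it directly. */
    MPI_Win_create(buf, sizeof(buf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double local[1024];
        for (int i = 0; i < 1024; i++) local[i] = (double)i;
        /* Write 1024 doubles into rank 1's window. */
        MPI_Put(local, 1024, MPI_DOUBLE, 1, 0, 1024, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```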

  13. RDMA

  14. Overview of Architecture

  15. Software on SR8000 • Operating System • HI-UX with MPP (Massively Parallel Processing) features • Built-in maintenance tools • 64-bit addressing with 32-bit code support • Single system for the whole computer • Programming tools • Optimized F77, F90, Parallel Fortran, C and C++ compilers • MPI-2 (Message Passing Interface) • PVM (Parallel Virtual Machine) • Variety of debugging tools (e.g. Vampir and TotalView)

  16. Hybrid Programming Model • Supports several parallel programming methods • MPI + COMPAS • Each node has one MPI process • Pseudo vectorization by PVP • Auto parallelization by COMPAS • MPI + OpenMP • Each node has one MPI process • Divided into threads across the 8 CPUs by OpenMP (see the sketch below) • MPI + MPP • Each CPU has one MPI process (max 8 processes/node) • COMPAS • Each node has one process • Pseudo vectorization by PVP • Auto parallelization by COMPAS
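
A minimal sketch of the MPI + OpenMP model described above: one MPI process per node, with OpenMP spreading the work over the node's eight CPUs. The reduction computed here is an arbitrary example workload, not taken from the slides.

```c
/* Hybrid MPI + OpenMP skeleton: one MPI process per node, 8 OpenMP
 * threads inside each process.  Each process sums part of a series in
 * parallel, then the partial sums are combined across nodes with MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                 /* one MPI process per node */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    /* Inside the node, OpenMP divides the loop over the 8 IPs. */
    #pragma omp parallel for reduction(+:local) num_threads(8)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + i);

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```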

  17. Hybrid Programming Model • OpenMP • Each node has one process • Divided into threads across the 8 CPUs by OpenMP • Scalar • One application with a single thread on one CPU • Can use the 9th CPU • ION • Default model for commands like ’ls’, ’vi’ etc. • Can use the 9th CPU

  18. Hybrid Programming Model Performance Effects • Parallel vector-matrix multiplication used as an example (sketched below)
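
The slides name parallel vector-matrix multiplication as the benchmark without giving code; the kernel below is an assumed per-node version of y = A·x, parallelised over rows with OpenMP. In the hybrid models above, MPI would additionally distribute blocks of rows across nodes.

```c
/* Per-node matrix-vector product y = A * x, with A stored row-major.
 * OpenMP parallelises over rows; each inner loop is the kind of
 * streaming kernel that PVP pipelines on a single CPU. */
#include <stddef.h>

void matvec(int m, int n, const double *A, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[(size_t)i * n + j] * x[j];
        y[i] = s;
    }
}
```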

  19. Performance Figures • 10 systems on the Top 500 list • Highest rankings 26 and 27 • Theoretical maximum performance 7.3 TFlop/s with 512 nodes • Node performance ranges from 8 GFlop/s to 14.4 GFlop/s depending on the CPU speed • Maximum memory capacity 8 TB • Latency from processor to various locations • To memory: 30 – 200 nanoseconds • To remote memory via RDMA feature: ~3-5 microseconds • MPI (without RDMA): ~6-20 microseconds • To disk: ~8 milliseconds • To tape: ~30 seconds
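
These figures are mutually consistent: the quoted peaks imply 4 floating-point operations per cycle per CPU, so a node of 8 CPUs peaks at 8 × 4 × 250 MHz = 8 GFlop/s for the slowest model and 8 × 4 × 450 MHz = 14.4 GFlop/s for the fastest, and 512 such nodes give roughly 7.3 TFlop/s.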

  20. Scalability • Highly scalable architecture • Fast interconnection network and modular node structure • By externally coupling two G1 frames, a performance of 1709 GFlop/s out of a 2074 GFlop/s peak was achieved (82% efficiency)

  21. Leibniz-Rechenzentrum • SR8000-F1 in Leibniz-Rechenzentrum (LRZ), Munich • German federal top-level compute server in Bavaria • System information • 168 nodes (1344 processors, 375 MHz) • 1344 GB of memory • 8 GB/node • 4 nodes with 16 GB • 10 TB of disk storage

  22. Leibniz-Rechenzentrum • Performance • Peak performance per CPU – 1.5 GFlop/s (per node 12 GFlop/s) • Total peak performance – 2016 GFlop/s (Linpack 1645 GFlop/s) • I/O bandwidth – to /home 600 MB/s, to /tmp 2.4 GB/s • Expected efficiency (from LRZ benchmarks) • >600 GFlop/s • Performance from main memory (most unfavourable case) • >244 GFlop/s

  23. Leibniz-Rechenzentrum • Unidirectional communication bandwidth: • MPI without RDMA – 770 MB/s • MPI with RDMA – 950 MB/s • Hardware – 1000 MB/s • 2 × unidirectional bisection bandwidth • MPI with RDMA – 2 × 79 = 158 GB/s • Hardware – 2 × 84 = 168 GB/s
