Chapter 2 Computer Clusters

Lecture 2.3 GPU Clusters for Massive Parallelism

Overview of GPU Clusters
• GPUs are becoming high-performance accelerators for data-parallel computing.
• Modern GPU chips contain hundreds of processor cores.
• Each GPU chip can achieve up to 1 Tflops for single-precision (SP) arithmetic and more than 80 Gflops for double-precision (DP) calculations.
• Recent HPC-optimized GPUs contain up to 4 GB of on-board memory and can sustain memory bandwidths exceeding 100 GB/second.
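The peak rates and memory bandwidth quoted above can be combined into a rough estimate of when a kernel is limited by memory rather than by arithmetic. The sketch below uses a simple roofline-style bound, which is an assumption added here for illustration and not a model stated in the lecture. The function names and the example kernel are hypothetical.

```python
# Figures quoted in the overview; the roofline-style model itself is an
# assumption added for illustration, not something the lecture states.
PEAK_SP_FLOPS = 1e12   # up to 1 Tflops, single precision
PEAK_DP_FLOPS = 80e9   # more than 80 Gflops, double precision
MEM_BANDWIDTH = 100e9  # bytes/second ("exceeding 100 GB/second")


def break_even_intensity(peak_flops, bandwidth):
    """Flops per byte above which a kernel is limited by compute, not memory."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops, bandwidth):
    """Upper bound on throughput for a kernel doing `intensity` flops per byte."""
    return min(peak_flops, intensity * bandwidth)


if __name__ == "__main__":
    print(break_even_intensity(PEAK_SP_FLOPS, MEM_BANDWIDTH))
    print(break_even_intensity(PEAK_DP_FLOPS, MEM_BANDWIDTH))
    # Hypothetical kernel: 2 flops for every 12 bytes moved.
    print(attainable_flops(2 / 12, PEAK_SP_FLOPS, MEM_BANDWIDTH))
```

Under these figures, a single-precision kernel would need roughly 10 flops per byte moved before the 1 Tflops peak becomes the limiting factor. Kernels with lower intensity are bounded by the memory bandwidth instead.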

GPU clusters are built with a large number of GPU chips.
• Most GPU clusters are structured with homogeneous GPUs of the same hardware class, make, and model.
• The software used in a GPU cluster includes the OS, GPU drivers, and a cluster API such as MPI.
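One common way the cluster API and the GPUs meet is that each MPI process is bound to one GPU on its node. The sketch below only simulates that bookkeeping with plain integers. It makes no MPI or GPU driver calls, and the function name, its parameters, and the round-robin policy are assumptions chosen for illustration. It relies on the homogeneity noted above, so that every node exposes the same number of GPUs.

```python
def assign_gpus(num_ranks, ranks_per_node, gpus_per_node):
    """Map each process rank to a (node index, local GPU index) pair.

    Assumes a homogeneous cluster: every node hosts `ranks_per_node`
    processes and `gpus_per_node` identical GPUs. Ranks on a node are
    spread over its GPUs round-robin.
    """
    mapping = {}
    for rank in range(num_ranks):
        node = rank // ranks_per_node          # which node hosts this rank
        local_rank = rank % ranks_per_node     # position of the rank on its node
        mapping[rank] = (node, local_rank % gpus_per_node)
    return mapping


if __name__ == "__main__":
    # Hypothetical layout: 8 ranks, 4 ranks per node, 2 GPUs per node.
    for rank, (node, gpu) in assign_gpus(8, 4, 2).items():
        print(f"rank {rank}: node {node}, GPU {gpu}")
```

In a real job, the rank would come from the MPI library and the GPU index would be passed to the GPU runtime. Both steps are omitted here.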

GPU clusters have already demonstrated their capability to achieve Pflops performance in some of the Top 500 systems.
• The high performance of a GPU cluster is attributed mainly to the following factors:
  • a massively parallel multicore architecture,
  • high throughput in multithreaded floating-point arithmetic,
  • significantly reduced time for massive data movement, thanks to large on-chip cache memory.
• GPU clusters deliver not only a large jump in speed but also significantly reduced space, power, and cooling demands.
• These reductions in power, environmental, and management demands make GPU clusters very attractive for future HPC applications.

Case Study – Echelon GPU Cluster
• The NVIDIA Echelon GPU cluster is a state-of-the-art design for Exascale computing.
• The Echelon project is led by Bill Dally at NVIDIA and is partially funded by DARPA under the Ubiquitous High-Performance Computing (UHPC) program.
• The Echelon GPU design shows the architecture of a future GPU accelerator.

Echelon GPU Chip Design

Echelon GPU Cluster Architecture
(Image from http://insidehpc.com/2010/11/26/nvidia-reveals-details-of-echelon-gpu-designs-for-exascale/)

To achieve Eflops performance, at least N = 400 cabinets are needed, for a total of 327,680 processor cores.
• The Echelon system is supported by a self-aware OS and runtime system.
• The Echelon system is also designed to preserve locality with the support of a compiler and an autotuner.
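The 400-cabinet figure above implies a minimum throughput per cabinet, which can be checked with integer arithmetic. The sketch below is only that arithmetic. The function names are hypothetical, and the alternative per-cabinet rates in the usage lines are illustrative inputs, not Echelon specifications.

```python
EFLOPS = 10**18  # 1 Eflops in flops
PFLOPS = 10**15  # 1 Pflops in flops


def required_flops_per_cabinet(target_flops, cabinets):
    """Smallest per-cabinet rate (in flops) that reaches the target, rounded up."""
    return -(-target_flops // cabinets)  # ceiling division on integers


def cabinets_needed(target_flops, flops_per_cabinet):
    """Smallest number of cabinets whose combined peak reaches the target."""
    return -(-target_flops // flops_per_cabinet)


if __name__ == "__main__":
    per_cabinet = required_flops_per_cabinet(EFLOPS, 400)
    print(per_cabinet / PFLOPS, "Pflops per cabinet")
    # Illustrative what-if inputs, not Echelon specifications.
    print(cabinets_needed(EFLOPS, 2 * PFLOPS), "cabinets at 2 Pflops each")
    print(cabinets_needed(EFLOPS, 3 * PFLOPS), "cabinets at 3 Pflops each")
```

With N = 400, each cabinet has to contribute at least 2.5 Pflops for the whole system to reach 1 Eflops.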

CUDA Support for GPU Clusters
• CUDA version 3.2 was used for a single GPU module in 2010.
• CUDA version 4.0 allows multiple GPUs to be used with a unified virtual address space of shared memory.

Applications on GPU Clusters
• Distributed calculations to predict the native conformation of proteins
• Medical analysis simulations based on CT and MRI scan images
• Physical simulations in fluid dynamics and environment statistics
• Accelerated 3D graphics, cryptography, compression, and conversion between video file formats
• Building the single-chip cloud computer (SCC) through virtualization in a many-core architecture
