Echelon: NVIDIA & Team’s UHPC Project
Steve Keckler, Director of Architecture Research, NVIDIA
GPU Supercomputing
• Tsubame 2.0: 4224 GPUs, Linpack 1.3 PFlops
• Tianhe-1A: 7168 GPUs, Linpack 2.5 PFlops
• Dawning Nebulae: 4640 GPUs, Linpack 1.3 PFlops
• 8 more GPU-accelerated machines in the November Top500
• Many (corporate) machines not listed
[Figure: NVIDIA GPU module]
Key Challenges
• Energy to solution is too large
• Programming parallel machines is too difficult
• Programs are not scalable to billion-fold parallelism
• Resilience (application mean time to interrupt, AMTTI) is too low
• Machines are vulnerable to attacks and undetected program errors
System Sketch
[Figure: Echelon system, from processor chip to full system]
• Processor Chip (PC): 128 SMs (SM0–SM127, each with cores C0–C7 and L0 caches), latency cores LC0–LC7, L2 SRAM banks L2_0–L2_1023, MC, NIC, NoC
• Node 0 (N0): 16 TF, 1.6 TB/s, 256 GB — DRAM cubes + NV RAM (… N7)
• Module 0 (M0): 128 TF, 12.8 TB/s, 2 TB, with high-radix router module (RM) (… M15)
• Cabinet 0 (C0): 2 PF, 205 TB/s, 32 TB (… CN)
• Dragonfly interconnect (optical fiber)
• Software stack: Self-Aware OS, Self-Aware Runtime, Locality-Aware Compiler & Autotuner
Execution Model
[Figure: threads and objects in a global address space over an abstract memory hierarchy; communication via load/store, bulk transfer, and active messages]
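The distinguishing communication primitive on this slide is the active message: rather than pulling remote data across the machine, a small message carries work to the node that owns the data. A minimal sketch of the idea in Python, with all names (`Node`, `am_send`, the handler) hypothetical rather than Echelon's actual API:

```python
# Sketch of the active-message idea: send a handler to run where the data
# lives, instead of fetching the data to the computation.

class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}   # this node's partition of the global address space

    def am_send(self, dest, handler, *args):
        # In hardware this traverses the NoC/interconnect; here we simply
        # invoke the handler at the destination node.
        return handler(dest, *args)

def increment_handler(node, addr, value):
    # Runs on the node that owns `addr`: compute moves to the data.
    node.memory[addr] = node.memory.get(addr, 0) + value
    return node.memory[addr]

a, b = Node("A"), Node("B")
b.memory[0x10] = 41
result = a.am_send(b, increment_handler, 0x10, 1)  # A asks B to update in place
```

The payoff, per the next slide, is energy: one small message replaces a round trip of operand traffic.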
The High Cost of Data Movement
Fetching operands costs more than computing on them (28 nm):
• 64-bit DP operation: 20 pJ
• 256-bit access to an 8 kB SRAM: 50 pJ
• 256-bit on-chip buses: 26 pJ (short hop) to 256 pJ, ~1 nJ across a 20 mm chip
• Efficient off-chip link: 500 pJ
• DRAM Rd/Wr: 16 nJ
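A quick back-of-envelope with the per-operation energies from this slide makes the imbalance concrete: the flop is a rounding error next to the operand traffic.

```python
# Per-operation energies from the slide, in picojoules (28 nm).
E_FLOP = 20        # 64-bit DP operation
E_SRAM = 50        # 256-bit access to an 8 kB SRAM
E_CHIP = 1000      # moving 256 bits across a 20 mm chip (~1 nJ)
E_DRAM = 16000     # DRAM read/write (~16 nJ)

# How many flops you could do for the energy of one operand fetch:
sram_vs_flop = E_SRAM / E_FLOP   # 2.5x  -- local SRAM is nearly free
chip_vs_flop = E_CHIP / E_FLOP   # 50x   -- crossing the chip is not
dram_vs_flop = E_DRAM / E_FLOP   # 800x  -- DRAM dominates everything
```

Hence the slide's conclusion: efficiency comes from keeping operands local, not from making the FPU cheaper.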
Magnitude of Thread Count
Billion-fold parallelism of fine-grained threads for Exascale
Echelon Disruptive Technologies
1. Locality, locality, locality
2. Fine-grained concurrency
3. APIs for resilience and memory safety
4. Order-of-magnitude improvement in energy efficiency
Data Locality (central to performance and efficiency)
• Programming System
  • Abstract expression of spatial, temporal, and producer-consumer locality
  • Programmer expresses locality
  • Programming system maps threads and objects to locations to exploit locality
• Architecture
  • Hierarchical global address space
  • Configurable memory hierarchy
  • Energy-provisioned bandwidth
Fine-Grained Concurrency (how we get to 10^10 threads)
• Programming System
  • Programmer expresses ALL of the concurrency
  • Programming system decides how much to exploit in space and how much to iterate over in time
• Architecture
  • Fast, low-overhead thread-array creation and management
  • Fast, low-overhead communication and synchronization
  • Message-driven computing (active messages)
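The 10^10 figure falls out of the machine's lane count, which later slides supply (8 lanes/SM, 128 SMs/chip, 128 nodes/cabinet, ~500 cabinets), times an assumed per-lane multithreading depth; that depth (256 here) is an illustrative assumption for latency hiding, not a number from the deck:

```python
# Where billion-fold parallelism comes from: multiply out the lane count
# (figures from the Echelon slides) and assume a few hundred threads in
# flight per lane to hide memory latency (depth is an assumption).
cabinets         = 500    # ~1 EF system
nodes_per_cab    = 128
sms_per_node     = 128
lanes_per_sm     = 8
threads_per_lane = 256    # assumed multithreading depth

lanes   = cabinets * nodes_per_cab * sms_per_node * lanes_per_sm  # ~6.6e7
threads = lanes * threads_per_lane                                 # ~1.7e10
```

With thread counts at that scale, per-thread creation and synchronization costs must be near zero, which is why the architecture bullets above stress low-overhead thread arrays and message-driven computing.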
Dependability (how we get to an AMTTI of one day)
• Programming System
  • API to express:
    • State to preserve
    • When to preserve it
    • Computations to check
    • Assertions
  • Responsibilities:
    • Preserves state
    • Generates recovery code
    • Generates redundant computation where appropriate
• Architecture
  • Error checking on all memories and communication paths
  • Hardware configurable to run duplex computations
  • Hardware support for error containment and recovery
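To make the division of labor concrete: the programmer names what to preserve and what to check, and the system supplies the recovery path. A hypothetical sketch of such an API — none of these names (`preserve`, `check`) are Echelon's:

```python
# Resilience-API sketch: programmer declares state and assertions;
# the runtime snapshots and rolls back on a failed check.
checkpoints = []

def preserve(state):
    """Snapshot state the programmer declared worth preserving."""
    checkpoints.append(dict(state))

def check(assertion, state):
    """Programmer-supplied assertion; on failure, restore last checkpoint."""
    if assertion(state):
        return state
    return dict(checkpoints[-1])  # generated recovery path: roll back

state = {"step": 0, "x": 1.0}
preserve(state)
state["x"] = float("inf")                       # simulate a silent error
state = check(lambda s: s["x"] < 1e9, state)    # caught; rolled back
```

Hardware duplex execution and checked memories (the architecture bullets) catch what assertions cannot, so the two layers share the one-day AMTTI target.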
Security (from attackers and ourselves)
• Key challenge: memory safety
  • Malicious attacks
  • Programmer memory bugs (bad pointer dereferencing, etc.)
• Programming System
  • Express partitioning of subsystems
  • Express privileges on data structures
• Architecture: guarded-pointer primitive
  • Hardware can check all memory references/address computations
  • Fast, low-overhead subsystem entry
  • Errors reported through error-containment features
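A guarded pointer carries its base, bounds, and permissions with it, so hardware can validate every dereference. A minimal software model of that check, with the class and fields illustrative rather than the Echelon encoding:

```python
# Guarded-pointer model: base, length, and permissions travel with the
# pointer; every load/store is bounds- and permission-checked.
class GuardedPointer:
    def __init__(self, memory, base, length, writable=False):
        self.memory, self.base = memory, base
        self.length, self.writable = length, writable

    def load(self, offset):
        if not 0 <= offset < self.length:
            raise MemoryError("out-of-bounds load")  # contained, reported
        return self.memory[self.base + offset]

    def store(self, offset, value):
        if not self.writable or not 0 <= offset < self.length:
            raise MemoryError("illegal store")
        self.memory[self.base + offset] = value

mem = [0] * 16
p = GuardedPointer(mem, base=4, length=4, writable=True)
p.store(0, 7)
ok = p.load(0)
try:
    p.load(5)            # one past the segment: trapped, not corrupted
    trapped = False
except MemoryError:
    trapped = True
```

In hardware the check is folded into address computation, which is what makes it cheap enough to apply to every reference.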
> 10x Energy Efficiency Gain (GFlops/Watt)
• Contemporary GPU: ~300 pJ/Flop
• Future parallel systems: ~20 pJ/Flop
• To get anywhere near Exascale in 2018:
  • ~4x can come from process scaling to 10 nm
  • Remainder from architecture/programming system
    • Locality, both horizontal and vertical: reduce data movement, migrate fine-grained tasks to data
    • Extremely energy-efficient throughput cores
    • Efficient instruction/data supply: simple hardware (static instruction scheduling, simple instruction control), multithreading, and hardware support for thread arrays
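The arithmetic behind the ">10x" claim, using only the numbers on this slide:

```python
# Efficiency gap and its split between process and architecture.
today_pj, target_pj = 300, 20          # pJ/flop: contemporary GPU vs. target
total_gain   = today_pj / target_pj    # 15x needed overall
process_gain = 4.0                     # ~4x from scaling to 10 nm
arch_gain    = total_gain / process_gain  # ~3.75x must come from arch + SW

# Sanity check: an exaflop machine at the target energy per flop.
power_mw = 1e18 * target_pj * 1e-12 / 1e6   # ~20 MW
```

The ~20 MW result is consistent with the ~19 MW quoted for the 500-cabinet system on the final architecture slide.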
Streaming Multiprocessor
• 8 lanes – 128 GFLOPS
[Figure: eight processing lanes (P) sharing a switch and L1$]
Echelon Chip – 128 SMs + 8 Latency Cores
• 16 TFLOPS (128 SMs at 128 GF each)
• 1024 SRAM banks, 256 KB each
[Figure: SMs, latency cores (LC), SRAM banks, NoC, memory controllers (MC), and network interface (NI)]
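The chip-level figures compose exactly from the SM slide, which is worth checking:

```python
# Chip totals from per-SM and per-bank figures on the slides.
sms, gf_per_sm = 128, 128
chip_tflops = sms * gf_per_sm / 1000        # 16.384 ~= 16 TFLOPS

banks, kb_per_bank = 1024, 256
sram_mb = banks * kb_per_bank / 1024        # 256 MB on-chip SRAM
```

The 256 MB figure reappears on the node slide as the GPU chip's on-chip memory.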
Node MCM – 16 TF + 256 GB
• GPU chip: 16 TF DP, 256 MB on-chip SRAM
• 1.6 TB/s DRAM BW (stacked DRAM)
• 160 GB/s network BW
• NV memory
[Figure: GPU chip with DRAM stacks and NV memory on the MCM]
Cabinet – 128 Nodes – 2 PF – 38 kW
• 32 modules, 4 nodes/module
• Central router module(s)
• Dragonfly interconnect
[Figure: nodes, modules, and routers within a cabinet]
System – to Exascale and Beyond
• Dragonfly interconnect
• 500 cabinets is ~1 EF and ~19 MW
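Scaling the cabinet figures up to the full system confirms the headline numbers:

```python
# System totals from the cabinet slide (128 nodes, 16 TF/node, 38 kW).
node_tf, nodes_per_cabinet = 16, 128
cabinet_pf = node_tf * nodes_per_cabinet / 1000     # 2.048 ~= 2 PF

cabinets  = 500
system_ef = cabinet_pf * cabinets / 1000            # ~1 EF
system_mw = 38e3 * cabinets / 1e6                   # 19 MW
```

Note that 38 kW for 2 PF is ~19 pJ/flop at the cabinet level, matching the ~20 pJ/Flop efficiency target from the earlier slide.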
GPU Technology Conference 2011
Oct. 11–14 | San Jose, CA
The one event you can’t afford to miss
• Learn about leading-edge advances in GPU computing
• Explore the research as well as the commercial applications
• Discover advances in computational visualization
• Take a deep dive into parallel programming
Ways to participate
• Speak – share your work and gain exposure as a thought leader
• Register – learn from the experts and network with your peers
• Exhibit/Sponsor – promote your company as a key player in the GPU ecosystem
www.gputechconf.com
NVIDIA Parallel Developer Program
All GPGPU developers should become NVIDIA Registered Developers.
Benefits include:
• Early access to pre-release software
  • Beta software and libraries
• Submit & track issues and bugs
• Newly announced benefits:
  • Exclusive Q&A webinars with NVIDIA Engineering
  • Exclusive deep-dive CUDA training webinars
  • In-depth engineering presentations on beta software
Sign up now: www.nvidia.com/paralleldeveloper