
Echelon: NVIDIA & Team’s UHPC Project


Presentation Transcript


  1. Echelon: NVIDIA & Team’s UHPC Project. Steve Keckler, Director of Architecture Research, NVIDIA

  2. GPU Supercomputing
  • Tsubame 2.0: 4,224 GPUs, Linpack 1.3 PFlops
  • Tianhe-1A: 7,168 GPUs, Linpack 2.5 PFlops
  • Dawning Nebulae: 4,640 GPUs, Linpack 1.3 PFlops
  • 8 more GPU-accelerated machines in the November Top500
  • Many (corporate) machines not listed
  [Figure: NVIDIA GPU module]

  3. Top 5 Performance and Power

  4. Existing GPU Application Areas

  5. Key Challenges
  • Energy to solution is too large
  • Programming parallel machines is too difficult
  • Programs are not scalable to billion-fold parallelism
  • Resilience (AMTTI, application mean time to interrupt) is too low
  • Machines are vulnerable to attacks/undetected program errors

  6. Echelon Team

  7. System Sketch
  [Diagram: the Echelon system hierarchy. A processor chip (PC) holds 128 SMs (SM0–SM127) and 8 latency cores (LC0–LC7), connected by a NoC to on-chip SRAM banks, memory controllers (MC), and a NIC. A node (N0–N7) adds DRAM cubes and NV RAM: 16 TF, 1.6 TB/s, 256 GB. A module (M0–M15) groups nodes: 128 TF, 12.8 TB/s, 2 TB. A cabinet (C0–CN) adds high-radix router modules (RM): 2 PF, 205 TB/s, 32 TB. Cabinets are joined by a Dragonfly interconnect over optical fiber. Software layers: self-aware OS, self-aware runtime, locality-aware compiler & autotuner.]
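As a rough sanity check, the per-level capacities on the sketch compose as shown below; the counts of 8 nodes per module and 16 modules per cabinet are read off the diagram labels (N0–N7, M0–M15) and are assumptions rather than stated specifications.

```python
# Roll-up of the Echelon hierarchy as drawn on the system-sketch slide.
# The per-level counts (8 nodes/module, 16 modules/cabinet) are read from
# the diagram labels and are assumptions, not official specifications.

NODE_TFLOPS = 16             # per slide: 16 TF, 1.6 TB/s, 256 GB per node
NODES_PER_MODULE = 8         # N0..N7
MODULES_PER_CABINET = 16     # M0..M15

module_tflops = NODE_TFLOPS * NODES_PER_MODULE                # 128 TF (matches slide)
cabinet_pflops = module_tflops * MODULES_PER_CABINET / 1000   # ~2 PF (matches slide)

print(f"module: {module_tflops} TF, cabinet: {cabinet_pflops:.3f} PF")
```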

  8. Execution Model
  [Diagram: threads and objects (A, B) live in a global address space layered over an abstract memory hierarchy; data moves by bulk transfer or load/store, and work moves to data via active messages.]

  9. Two (of many) Fundamental Challenges

  10. The High Cost of Data Movement (fetching operands costs more than computing on them)
  [Diagram, 28 nm process, 20 mm die: a 64-bit DP operation costs ~20 pJ; a 256-bit access to an 8 kB SRAM ~50 pJ; moving 256 bits over on-chip buses ranges from ~26 pJ for a short hop to ~256 pJ across the die and ~1 nJ off-chip; an efficient off-chip link costs ~500 pJ; a DRAM read/write ~16 nJ.]
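A minimal sketch of what those energies imply, comparing one operand access against the cost of the flop itself; treating a single 256-bit SRAM or DRAM access as the cost of supplying an operand is an illustrative simplification.

```python
# Relative cost of fetching an operand versus computing on it, using the
# per-access energies from the slide (28 nm). Counting one access per
# operand is an illustrative simplification.

PJ_FLOP = 20         # 64-bit DP operation
PJ_SRAM = 50         # one 256-bit access to an 8 kB SRAM
PJ_DRAM = 16_000     # one DRAM read/write (16 nJ)

print(f"SRAM access costs {PJ_SRAM / PJ_FLOP:.1f}x a flop")    # 2.5x
print(f"DRAM access costs {PJ_DRAM / PJ_FLOP:.0f}x a flop")    # 800x
```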

  11. Magnitude of Thread Count: billion-fold parallel fine-grained threads for Exascale
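A back-of-the-envelope for where the billion-thread figure comes from; the clock rate, flops per thread per cycle, and latency-hiding factor below are illustrative assumptions, not Echelon parameters.

```python
# Why an exascale machine implies on the order of 10^9-10^10 threads in
# flight. The 1 GHz clock, 2 flops per thread per cycle, and the 4x
# latency-hiding oversubscription are illustrative assumptions.

TARGET_FLOPS = 1e18           # 1 EF/s sustained
CLOCK_HZ = 1e9                # ~1 GHz-class throughput cores
FLOPS_PER_THREAD_CYCLE = 2    # one fused multiply-add per thread per cycle

threads_to_issue = TARGET_FLOPS / (CLOCK_HZ * FLOPS_PER_THREAD_CYCLE)
threads_in_flight = threads_to_issue * 4   # oversubscribe to hide memory latency

print(f"{threads_to_issue:.0e} threads just to issue the flops")   # ~5e+08
print(f"{threads_in_flight:.0e} threads in flight")                # ~2e+09
```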

  12. Echelon Disruptive Technologies
  1. Locality, locality, locality
  2. Fine-grained concurrency
  3. APIs for resilience, memory safety
  4. Order of magnitude improvement in efficiency

  13. Data Locality (central to performance and efficiency)
  • Programming system
    • Abstract expression of spatial, temporal, and producer-consumer locality
    • Programmer expresses locality
    • Programming system maps threads and objects to locations to exploit locality
  • Architecture
    • Hierarchical global address space
    • Configurable memory hierarchy
    • Energy-provisioned bandwidth
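A minimal sketch of the placement idea on this slide: the programmer declares which data a task touches and a runtime maps the task to the place where the data lives. The Place/colocate names are invented for this sketch and are not an Echelon API.

```python
# Hypothetical illustration of locality-aware placement: tasks are mapped
# to the place that already holds their data, instead of moving the data.
# All names here are invented for this sketch.

from dataclasses import dataclass, field

@dataclass
class Place:                          # one location in the machine hierarchy (SM, node, ...)
    name: str
    objects: list = field(default_factory=list)
    tasks: list = field(default_factory=list)

def colocate(task, data, places):
    """Schedule the task at the place where its data already lives."""
    home = next(p for p in places if data in p.objects)
    home.tasks.append(task)
    return home

places = [Place("sm0", objects=["tileA"]), Place("sm1", objects=["tileB"])]
print(colocate("stencil_on_tileB", "tileB", places).name)   # -> sm1
```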

  14. Fine-Grained Concurrency (how we get to 10^10 threads)
  • Programming system
    • Programmer expresses ALL of the concurrency
    • Programming system decides how much to exploit in space and how much to iterate over in time
  • Architecture
    • Fast, low-overhead thread-array creation and management
    • Fast, low-overhead communication and synchronization
    • Message-driven computing (active messages)
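A minimal sketch of message-driven computing in the active-message sense: a small handler plus its arguments is sent to the data's home and executed there. The Node class and handler below are invented for illustration.

```python
# Toy active-message model: instead of pulling remote data to the sender,
# a short handler travels to the data's home node and runs there.
# All names are invented for this illustration.

from collections import deque

class Node:
    def __init__(self, name, data):
        self.name, self.data, self.inbox = name, data, deque()

    def send(self, target, handler, *args):
        target.inbox.append((handler, args))   # ship work, not data

    def run(self):
        while self.inbox:
            handler, args = self.inbox.popleft()
            handler(self, *args)               # handler executes next to the data

def accumulate(node, key, value):
    node.data[key] = node.data.get(key, 0) + value

home = Node("n1", {"x": 40})
Node("n0", {}).send(home, accumulate, "x", 2)  # active message: add 2 to x at its home
home.run()
print(home.data["x"])                          # 42
```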

  15. Dependability (how we get to an AMTTI of one day)
  • Programming system
    • API to express:
      • State to preserve
      • When to preserve it
      • Computations to check
      • Assertions
    • Responsibilities:
      • Preserves state
      • Generates recovery code
      • Generates redundant computation where appropriate
  • Architecture
    • Error checking on all memories and communication paths
    • Hardware configurable to run duplex computations
    • Hardware support for error containment and recovery
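A minimal sketch of what such a programming-system API might look like from the application side: preserved state, a checked computation, and recovery by re-execution. The checked decorator is hypothetical, not the Echelon API.

```python
# Hypothetical resilience API: the programmer marks what state to preserve
# and how to check the result; the runtime recovers by re-executing from
# the preserved state. The decorator and its name are invented for this sketch.

import copy
import functools

def checked(validate, retries=1):
    """Preserve the input state, check the result, and re-execute on failure."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(state):
            preserved = copy.deepcopy(state)            # state to preserve
            for _ in range(retries + 1):
                result = fn(copy.deepcopy(preserved))   # recompute from preserved state
                if validate(result):                    # computation to check / assertion
                    return result
            raise RuntimeError("unrecoverable error")   # escalate after retries
        return inner
    return wrap

@checked(validate=lambda xs: abs(sum(xs) - 6.0) < 1e-9)
def step(values):
    return [v * 2 for v in values]

print(step([1.0, 1.0, 1.0]))   # [2.0, 2.0, 2.0]
```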

  16. Security (from attackers and ourselves)
  • Key challenge: memory safety
    • Malicious attacks
    • Programmer memory bugs (bad pointer dereferencing, etc.)
  • Programming system
    • Express partitioning of subsystems
    • Express privileges on data structures
  • Architecture: guarded pointers primitive
    • Hardware can check all memory references/address computations
    • Fast, low-overhead subsystem entry
    • Errors reported through error containment features
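A minimal software model of the guarded-pointer idea: base, bounds, and permissions travel with the pointer, and every reference is checked. In Echelon this check would be done by hardware; the class below is only an illustration, not the hardware encoding.

```python
# Software model of a guarded pointer: each pointer carries a base, a
# length, and permission bits, and every load/store is checked. The
# encoding here is illustrative; real guarded pointers are a hardware primitive.

class GuardedPointer:
    def __init__(self, memory, base, length, perms="r"):
        self.memory, self.base, self.length, self.perms = memory, base, length, perms

    def load(self, offset):
        if "r" not in self.perms or not (0 <= offset < self.length):
            raise PermissionError("contained error: bad load")   # reported, not silent corruption
        return self.memory[self.base + offset]

    def store(self, offset, value):
        if "w" not in self.perms or not (0 <= offset < self.length):
            raise PermissionError("contained error: bad store")
        self.memory[self.base + offset] = value

mem = [0] * 16
p = GuardedPointer(mem, base=4, length=4, perms="rw")
p.store(0, 7)
print(p.load(0))        # 7
try:
    p.load(5)           # out of bounds
except PermissionError as err:
    print(err)          # contained error: bad load
```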

  17. > 10x Energy Efficiency Gain (GFlops/Watt)
  • Contemporary GPU: ~300 pJ/flop
  • Future parallel systems: ~20 pJ/flop, in order to get anywhere near Exascale in 2018
  • ~4x can come from process scaling to 10 nm
  • Remainder from architecture/programming system
    • Locality, both horizontal and vertical
      • Reduce data movement, migrate fine-grain tasks to data
    • Extremely energy-efficient throughput cores
      • Efficient instruction/data supply
      • Simple hardware: static instruction scheduling, simple instruction control
      • Multithreading and hardware support for thread arrays
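The arithmetic behind those targets is straightforward; the sketch below just converts the slide's pJ/flop figures into system power at a sustained exaflop.

```python
# Power implied by the energy-per-flop figures on the slide at a sustained exaflop.

EXAFLOP = 1e18                               # flops per second
gpu_today_mw = 300e-12 * EXAFLOP / 1e6       # ~300 pJ/flop -> 300 MW
echelon_target_mw = 20e-12 * EXAFLOP / 1e6   # ~20 pJ/flop  -> 20 MW

print(gpu_today_mw, echelon_target_mw)       # 300.0 20.0
```

The 20 MW result lines up with the ~19 MW quoted for the 500-cabinet system later in the deck.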

  18. An NVIDIA ExaScale Machine

  19. Lane – 4 DFMAs, 16 GFLOPS
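The lane figure balances if each DFMA retires 2 flops per cycle at a 2 GHz clock; the clock rate is inferred from the slide's numbers rather than stated.

```python
# 4 double-precision FMA units x 2 flops per FMA x an assumed 2 GHz clock
# = 16 GFLOPS per lane. The 2 GHz clock is an inference, not a stated spec.

DFMA_UNITS = 4
FLOPS_PER_FMA = 2
CLOCK_GHZ = 2.0     # assumption chosen to make the slide's numbers balance

print(DFMA_UNITS * FLOPS_PER_FMA * CLOCK_GHZ, "GFLOPS per lane")   # 16.0
```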

  20. Streaming Multiprocessor: 8 lanes, 128 GFLOPS
  [Diagram: 8 processor lanes (P) connected through a switch to an L1$.]

  21. Echelon Chip: 128 SMs + 8 Latency Cores, 16 TFLOPS
  [Diagram: 128 SMs of 128 GFLOPS each plus 8 latency cores (LC), connected by a NoC to 1024 SRAM banks of 256 KB each, memory controllers (MC), and a network interface (NI).]

  22. Node MCM: 16 TF + 256 GB
  [Diagram: a GPU chip (16 TF DP, 256 MB on-chip SRAM) packaged with DRAM stacks and NV memory, with 1.6 TB/s of DRAM bandwidth and 160 GB/s of network bandwidth.]

  23. Cabinet: 128 Nodes, 2 PF, 38 kW
  [Diagram: 32 modules with 4 nodes per module, central router module(s), Dragonfly interconnect.]

  24. System: to ExaScale and Beyond
  • Cabinets joined by a Dragonfly interconnect
  • 500 cabinets is ~1 EF at ~19 MW
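Rolling the per-level figures from slides 19-24 up to the full system is straight arithmetic on the numbers in the deck:

```python
# Lane -> SM -> chip -> node -> cabinet -> system, using the figures from
# slides 19-24. Pure arithmetic on the deck's numbers, no new data.

LANE_GF = 16
SM_GF = 8 * LANE_GF                    # 128 GF per SM
CHIP_TF = 128 * SM_GF / 1000           # ~16 TF per chip (plus 8 latency cores)
CABINET_PF = 128 * CHIP_TF / 1000      # 128 nodes per cabinet -> ~2 PF
SYSTEM_EF = 500 * CABINET_PF / 1000    # 500 cabinets -> ~1 EF
SYSTEM_MW = 500 * 38 / 1000            # 38 kW per cabinet -> 19 MW

print(CHIP_TF, CABINET_PF, SYSTEM_EF, SYSTEM_MW)   # 16.384 2.097... 1.048... 19.0
```

The implied 38 kW / 2 PF, roughly 19 pJ per flop, is consistent with the ~20 pJ/flop target on slide 17.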

  25. GPU Technology Conference 2011, Oct. 11-14, San Jose, CA
  The one event you can’t afford to miss:
  • Learn about leading-edge advances in GPU computing
  • Explore the research as well as the commercial applications
  • Discover advances in computational visualization
  • Take a deep dive into parallel programming
  Ways to participate:
  • Speak – share your work and gain exposure as a thought leader
  • Register – learn from the experts and network with your peers
  • Exhibit/Sponsor – promote your company as a key player in the GPU ecosystem
  www.gputechconf.com

  26. Questions

  27. NVIDIA Parallel Developer Program
  All GPGPU developers should become NVIDIA Registered Developers. Benefits include:
  • Early access to pre-release software
    • Beta software and libraries
  • Submit & track issues and bugs
  • Announcing new benefits:
    • Exclusive Q&A webinars with NVIDIA engineering
    • Exclusive deep-dive CUDA training webinars
    • In-depth engineering presentations on beta software
  Sign up now: www.nvidia.com/paralleldeveloper
