HTMT-class Latency Tolerant Parallel Architecture for Petaflops-scale Computation

Dr. Thomas Sterling, California Institute of Technology and NASA Jet Propulsion Laboratory. October 1, 1999

Dr. Thomas Sterling - HTMT Petaflops Architecture

Rational Drug Design Nanotechnology Tomographic Reconstruction Phylogenetic Trees Biomolecular Dynamics Neural Networks Crystallography Fracture Mechanics MRI Imaging Reservoir Modelling Molecular Modelling Biosphere/Geosphere Diffraction Inversion Problems Distribution Networks Chemical Dynamics Atomic Scattering Electrical Grids Flow in Porous Media Pipeline Flows Data Assimilation Signal Processing Condensed Matter Electronic Structure Plasma Processing Chemical Reactors Cloud Physics Electronic Structure Boilers Combustion Actinide Chemistry Radiation CVD Graph Theoretic Fourier Methods Quantum Chemistry Reaction-Diffusion Chemical Reactors Cosmology Transport n-body Astrophysics Multiphase Flow Manufacturing Systems CFD Basic Algorithms & Numerical Methods Discrete Events Weather and Climate PDE Air Traffic Control Military Logistics Structural Mechanics Seismic Processing Population Genetics Monte Carlo ODE Multibody Dynamics Geophysical Fluids VLSI Design Transportation Systems Aerodynamics Raster Graphics Economics Fields Orbital Mechanics Nuclear Structure Ecosystems QCD Pattern Matching Symbolic Processing Neutron Transport Economics Models Genome Processing Virtual Reality Cryptography Astrophysics Electromagnetics Computer Vision Virtual Prototypes Intelligent Search Multimedia Collaboration Tools Computer Algebra Databases Magnet Design Computational Steering Scientific Visualization Data Mining Automated Deduction Number Theory CAD Intelligent Agents

A 10 Gflops Beowulf: 172 Intel Pentium Pro microprocessors. Center for Advanced Computing Research, California Institute of Technology

Emergence of Beowulf Clusters

MIT Press. 1st printing: May 1999; 2nd printing: Aug. 1999

Beowulf Scalability

Figure: Integrated SMP - WDM. Labels: VLIW/RISC core, 24 GFLOPS, 6 GHz (repeated per core); 2nd-level cache, 96 MBytes, 64 bytes wide, 160 GBytes/sec (one per core); crossbar, coherence, 640 GBytes/sec; DRAM, 4 GBytes, highly interleaved; multi-lambda AON.

COTS PetaFlop System
Figure: multi-die multi-processor boxes numbered 1 to 64, arranged around a central all-optical switch, with I/O. Labels: 128 die/box; 4 CPU/die; 10 meters = 50 ns delay.
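The "10 meters = 50 ns" label is consistent with light travelling through optical fiber. A minimal sketch of that arithmetic follows, assuming a fiber refractive index of about 1.5 (the index is an assumption, not stated on the slide) and using the 6 GHz core clock from the node diagram to express the delay in cycles. The function names are illustrative only.

```python
# Rough check of the "10 meters = 50 ns" label on the COTS system figure.
C_VACUUM_M_PER_S = 299_792_458.0  # speed of light in vacuum


def fiber_delay_ns(length_m: float, refractive_index: float = 1.5) -> float:
    """One-way propagation delay in nanoseconds.

    refractive_index=1.5 is an assumed typical value for glass fiber.
    """
    speed_m_per_s = C_VACUUM_M_PER_S / refractive_index
    return length_m / speed_m_per_s * 1e9


def delay_in_cycles(delay_ns: float, clock_ghz: float) -> float:
    """Clock cycles that elapse during delay_ns at a clock of clock_ghz."""
    return delay_ns * clock_ghz


if __name__ == "__main__":
    d = fiber_delay_ns(10.0)
    cycles = delay_in_cycles(d, 6.0)
    print(f"10 m of fiber: {d:.1f} ns, roughly {cycles:.0f} cycles at 6 GHz")
```

Under these assumptions a single 10 m hop costs on the order of 300 processor cycles, which is the kind of latency the rest of the talk sets out to tolerate.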

COTS PetaFlops System
• 8192 dies (4 CPU/die minimum)
• Each die is 120 GFlops
• 1 PetaFlop peak
• Power: 8192 x 200 Watts = 1.6 MegaWatts
• Extra main memory: >3 MegaWatts (512 TBytes)
• 15.36 TFlops/rack (128 die)
• 30 KWatts/rack, thus 64 racks (30 inch)
• Common system I/O
• 2-level main memory
• Optical interconnect
• OC768 channels (40 GHz)
• 128 channels per die (DWDM): 5.12 THz
• All-optical switching
• Bisection bandwidth of 50 TBytes/sec
  • 15 TFlops/rack x 0.1 bytes/flop x 32 racks
• Rack bandwidth: 15 TFlops x 0.1 bytes/flop = 1.5 TBytes/sec = 12 Tbits/sec (12 THz)
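The headline figures on this slide can be cross-checked from the slide's own inputs. The sketch below re-derives them; the constant and key names are made up for illustration, and the only inputs are numbers quoted on the slide.

```python
# Re-derive the COTS PetaFlops System figures from the inputs given on the slide.
DIES = 8192
GFLOPS_PER_DIE = 120
WATTS_PER_DIE = 200
DIES_PER_RACK = 128
CHANNELS_PER_DIE = 128
CHANNEL_GHZ = 40        # OC768 channel rate as quoted on the slide
BYTES_PER_FLOP = 0.1    # bandwidth-to-compute ratio used on the slide
ROUNDED_RACK_TFLOPS = 15  # the slide rounds 15.36 down to 15 for bandwidth


def cots_summary() -> dict:
    """Return the derived system figures keyed by hypothetical names."""
    racks = DIES // DIES_PER_RACK
    rack_tbytes_per_s = ROUNDED_RACK_TFLOPS * BYTES_PER_FLOP
    return {
        "peak_pflops": DIES * GFLOPS_PER_DIE / 1e6,
        "cpu_power_mw": DIES * WATTS_PER_DIE / 1e6,
        "racks": racks,
        "rack_tflops": DIES_PER_RACK * GFLOPS_PER_DIE / 1000,
        "rack_cpu_kw": DIES_PER_RACK * WATTS_PER_DIE / 1000,
        "die_optical_thz": CHANNELS_PER_DIE * CHANNEL_GHZ / 1000,
        "rack_tbytes_per_s": rack_tbytes_per_s,
        "rack_tbits_per_s": rack_tbytes_per_s * 8,
        # Bisection: half of the racks talking to the other half.
        "bisection_tbytes_per_s": rack_tbytes_per_s * (racks // 2),
    }


if __name__ == "__main__":
    for name, value in cots_summary().items():
        print(f"{name:>24}: {value}")
```

The derived peak is about 0.98 PFlops and the bisection figure is 48 TBytes/sec, which the slide rounds to 1 PetaFlop and 50 TBytes/sec. The dies alone account for 25.6 kW per rack; the slide's 30 kW per rack presumably includes other components, which is an inference rather than something the slide states.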

The SIA CMOS Roadmap

Requirements for High End Systems
• Bulk capabilities
  • performance
  • storage capacities
  • throughput/bandwidth
  • cost, power, complexity
• Efficiency
  • overhead
  • latency
  • contention
  • starvation/parallelism
• Usability
  • generality
  • programmability
  • reliability

Points of Inflection in the History of Computing
• Heroic Era (1950)
  • technology: vacuum tubes, mercury delay lines, pulse transformers
  • architecture: accumulator based
  • model: von Neumann, sequential instruction execution
  • examples: Whirlwind, EDSAC
• Mainframe (1960)
  • technology: transistors, core memory, disk drives
  • architecture: register bank based
  • model: virtual memory
  • examples: IBM 7090, PDP-1

Points of Inflection in the History of Computing
• Supercomputers (1980)
  • technology: ECL, semiconductor integration, RAM
  • architecture: pipelined
  • model: vector
  • example: Cray-1
• Massively Parallel Processing (1990)
  • technology: VLSI, microprocessor
  • architecture: MIMD
  • model: Communicating Sequential Processes, message passing
  • examples: TMC CM-5, Intel Paragon
• ? (2000)

HTMT Objectives
• Scalable architecture with high sustained performance in the presence of disparate cycle times and latencies
• Exploit diverse device technologies to achieve a substantially superior operating point
• Execution model to simplify parallel system programming and expand generality and applicability

Hybrid Technology MultiThreaded Architecture
Figure: block diagram with components I/O Farm, 3D Mem, DRAM PIM, Optical Switch, SRAM PIM, and RSFQ Nodes. Functions listed beside the components, in the order they appear:
• Compress/Decompress
• ECC/Redundancy
• Compress/Decompress
• Spectral Transforms
• Compress/Decompress
• Routing
• Data Structure Initializations
• “In the Memory” Operations
• RSFQ Thread Management
• Context Percolation
• Scatter/Gather Indexing
• Pointer chasing
• Push/Pull Closures
• Synchronization Activities

Storage Capacity by Subsystem 2007 Design Point

HTMT Strategy
• High performance
  • Superconductor RSFQ logic
  • Data Vortex optical interconnect network
  • PIM smart memory
• Low power
  • Superconductor RSFQ logic
  • Optical holographic storage
  • PIM smart memory

HTMT Strategy (cont)
• Low cost
  • reduce wire count through chip-to-chip fiber
  • reduce processor count through x100 clock speed
  • reduce memory chips by 3/2 holographic memory layer
• Efficiency
  • processor level multithreading
  • smart memory managed second stage context pushing multithreading
  • fine grain regular & irregular data parallelism exploited in memory
  • high memory bandwidth and low latency ops through PIM
  • memory to memory interactions without processor intervention
  • hardware mechanisms for synchronization, scheduling, data/context migration, gather/scatter
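One common way to reason about the "processor level multithreading" bullet is the textbook saturation model for a multithreaded processor. It is not taken from the slides. The sketch assumes each thread runs for `run` cycles, then waits `latency` cycles on a remote access, and that a context switch costs `switch` cycles; all names and numbers are illustrative.

```python
import math


def utilization(threads: int, run: float, latency: float, switch: float = 0.0) -> float:
    """Fraction of cycles spent on useful work under a simple saturation model.

    Linear region:    threads * run / (run + switch + latency)
    Saturated region: run / (run + switch)
    """
    if threads < 1 or run <= 0 or latency < 0 or switch < 0:
        raise ValueError("need threads >= 1, run > 0, latency >= 0, switch >= 0")
    linear = threads * run / (run + switch + latency)
    saturated = run / (run + switch)
    return min(linear, saturated)


def threads_to_saturate(run: float, latency: float, switch: float = 0.0) -> int:
    """Smallest thread count at which the latency is fully hidden."""
    return math.ceil((run + switch + latency) / (run + switch))


if __name__ == "__main__":
    # Illustrative numbers only: 10 cycles of work per 300-cycle remote access.
    for n in (1, 8, 16, 31, 64):
        print(n, round(utilization(n, run=10, latency=300), 3))
    print("threads needed:", threads_to_saturate(run=10, latency=300))
```

In this model a single thread keeps the processor busy about 3% of the time, and utilization rises linearly until enough threads are available to cover the latency. Beyond that point extra threads add nothing, which is one motivation for moving contexts toward the data rather than only adding threads.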

HTMT Strategy (cont)
• Programmability
  • global shared name space
  • hierarchical parallel thread flow control model
  • no explicit processor naming
  • automatic latency management
  • automatic processor load balancing
  • runtime fine grain multithreading
  • automatic context pushing for process migration (percolation)
  • configuration transparent, runtime scalable

RSFQ Roadmap (VLSI Circuit Clock Frequency)

RSFQ Building Block
Figure: circuit schematic with labels JJ1, JJ2 (Josephson junctions) and L1.

Advantages
• x100 clock speeds achievable
• x100 power efficiency advantage
• Easier fabrication
• Leverage semiconductor fabrication tools
• First technology to encounter ultra-high speed operation

Superconductor Processor
• 100 GHz clock, 33 GHz inter-chip
• 0.8 micron niobium on silicon
• 100K gates per chip
• 0.05 watts per processor
• 100 kW per petaflops

Data Vortex Optical Interconnect

Data Vortex Latency Distribution (network height = 1024)

Single-mode rib waveguides on silicon-on-insulator wafers‡
Hybrid sources and detectors
Mix of CMOS-like and ‘micromachining’-type processes for fabrication
‡ e.g.: R. A. Soref, J. Schmidtchen & K. Petermann, IEEE J. Quantum Electron. 27, p. 1971 (1991); A. Rickman, G. T. Reed, B. L. Weiss & F. Namavar, IEEE Photonics Technol. Lett. 4, p. 633 (1992); B. Jalali, P. D. Trinh, S. Yegnanarayanan & F. Coppinger, IEE Proc. Optoelectron. 143, p. 307 (1996)

PIM Provides Smart Memory
Figure: a basic silicon macro (memory stacks with sense amps, decode, and node logic), replicated on a single chip.
• Merge logic and memory
• Integrate multiple logic/mem stacks on single chip
• Exposes high intrinsic memory bandwidth
• Reduction of memory access latency
• Low overhead for memory oriented operations
• Manages data structure manipulation, context coordination and percolation

Multithreaded Control of PIM Functions
• multiple operation sequences with low context switching overhead
• maximize memory utilization and efficiency
• maximize processor and I/O utilization
Figure: PIM node block diagram. Labels: Memory Stack, Row Buffers, Row Registers, Boolean ALU, GP-ALU, FP (two units), Context Registers, Node Logic, Hi Speed Links (Firewire), Memory Bus I/F (PCI).
Multithreaded PIM DRAM
• multiple banks of row buffers to hold data, instructions, and addresses
• data parallel basic operations at row buffer
• manages shared resources such as FP
Direct PIM to PIM Interaction
• memory communicates with memory within and across chip boundaries, without external control processor intervention, by “parcels”
• exposes fine grain parallelism intrinsic to vector and irregular data structures
• e.g. pointer chasing, block moves, synchronization, data balancing
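The parcel idea can be illustrated with pointer chasing over a linked list spread across several memory nodes. The sketch below is a toy software model and not the HTMT parcel format: the `Parcel` and `PimNode` classes, the `(node id, offset)` address tuple, and the demo layout are all hypothetical. The point is only that the work travels to the memory holding the data, so a message is sent when the list crosses a node boundary rather than once per element.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Address = Tuple[int, int]  # (node id, local offset); hypothetical global address


@dataclass
class Parcel:
    """Hypothetical message carrying work to the memory that holds the data."""
    target: Address
    running_sum: int = 0


class PimNode:
    """Toy memory node: local offset -> (value, address of next element or None)."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.memory: Dict[int, Tuple[int, Optional[Address]]] = {}

    def handle(self, parcel: Parcel) -> Optional[Parcel]:
        """Chase pointers locally; forward the parcel only when the list leaves this node."""
        addr: Optional[Address] = parcel.target
        while addr is not None and addr[0] == self.node_id:
            value, addr = self.memory[addr[1]]
            parcel.running_sum += value
        if addr is None:
            return None  # end of list; the result is in parcel.running_sum
        parcel.target = addr
        return parcel


def sum_list(nodes: Dict[int, PimNode], head: Address) -> Tuple[int, int]:
    """Return (sum of the list values, number of parcel deliveries)."""
    parcel = Parcel(target=head)
    deliveries = 0
    current: Optional[Parcel] = parcel
    while current is not None:
        deliveries += 1
        current = nodes[current.target[0]].handle(current)
    return parcel.running_sum, deliveries


def build_demo() -> Tuple[Dict[int, PimNode], Address]:
    """Six-element list holding 1..6: the first three on node 0, the rest on node 1."""
    nodes = {0: PimNode(0), 1: PimNode(1)}
    for i in range(6):
        owner, offset = divmod(i, 3)
        nxt: Optional[Address] = divmod(i + 1, 3) if i < 5 else None
        nodes[owner].memory[offset] = (i + 1, nxt)
    return nodes, (0, 0)


if __name__ == "__main__":
    print(sum_list(*build_demo()))
```

In the demo the six elements are summed with two parcel deliveries, one per node. A conventional processor issuing one remote load per element would need a request and a reply for each of the six elements.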

Silicon Budget for HTMT DRAM PIM
Figure: chip floorplan. Labels: 32MB (four blocks), FtPt (four), ASAP (four).
• Designed to provide proper balance of memory & support for fiber bandwidth
• Different Vortex configurations => different #s
• In 2004, 16 TB = 4096 groups of 64 chips
• Each chip (diagram labels): Fiber WDM Optical Receiver Interface HRAM & Vortex Output SuperScalar Core Memory
Figure: chart titled Logic By Area.

Holographic 3/2 Memory
Performance Scaling
• Advantages
  • petabyte memory
  • competitive cost
  • 10 sec access time
  • low power
  • efficient interface to DRAM
• Disadvantages
  • recording rate is slower than the readout rate for LiNbO3
  • recording must be done in GB chunks
  • long term trend favors DRAM unless new materials and lasers are used

Figure (side view): regions labeled 4 K (50 W) and 77 K, with fiber/wire interconnects. Dimension labels: 0.3 m, 0.5 m, 1 m, 1 m, 1.4 m, 3 m.

Figure (side view): labels include Nitrogen, Helium, 4 K (50 W), 77 K, Fiber/Wire Interconnects, Tape Silo Array (400 silos), Hard Disk Array (40 cabinets), Front End Computer Server, Console, Cable Tray Assembly, two 220 Volt Generators, WDM Source, 980 nm Pumps (20 cabinets), Optical Amplifiers. Dimension labels: 3 m, 3 m, 0.5 m.

HTMT Facility (Top View)
Figure: floor plan including a Cryogenics Refrigeration Room. Dimension labels: 15 m, 27 m, 27 m, 25 m.

Floor Area

Power Dissipation by Subsystem Petaflops Design Point

Subsystem Interfaces 2007 Design Point
• Same colors indicate a connection between subsystems
• Horizontal lines group interfaces within a subsystem
