Stream Register Files with Indexed Access
Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally
Scaling Trends
• ILP increasingly harder and more expensive to extract
• Graphics processors exploit data parallelism
(Chart: CPU vs. GPU performance scaling over time, with NV10 and NV35 marked. CPU data courtesy of Francois Labonte, Stanford University.)
NSJ
Renewed Interest in Data Parallelism
• Data parallel application classes
  • Media, signal, and network processing, scientific simulations, encryption, etc.
• High-end vector machines
  • Have always been data parallel
• Academic research
  • Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.
• Mainstream industry
  • Sony Emotion Engine, Tarantula, etc.
Storage Hierarchy
• Bandwidth taper
• Only supports sequential streams/vectors
• But many data parallel apps have
  • Data reorderings
  • Irregular data structures
  • Conditional accesses
(Diagram: DRAM → cache → stream/vector storage feeding the compute units.)
Sequential Streams/Vectors Inefficient
• Evaluate arbitrary-order access to streams
(Figure: matrices a, b, and c flow between memory/cache, stream/vector storage, and the compute units over time; data produced in row-major order must be reordered through memory before it can be consumed in column-major order.)
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Stream Programming
• Streams of records passing through compute kernels
• Parallelism
  • Across stream elements
  • Across kernels
• Locality
  • Within kernels
  • Between kernels
(Diagram: inputs in1 and in2 pass through a pipeline of FFT_stage kernels to the output.)
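The model above can be sketched in ordinary Python. This is purely illustrative: the real system uses the StreamC/KernelC toolchain, and every name below (`kernel`, `scale`, `combine`) is made up for the sketch.

```python
# Minimal sketch of the stream model: kernels consume input streams
# record-by-record and produce output streams, so parallelism exists
# both across records and across chained kernels.

def kernel(fn):
    """Lift a per-record function into a stream-to-stream kernel."""
    def run(*streams):
        return [fn(*recs) for recs in zip(*streams)]
    return run

@kernel
def combine(a, b):      # a kernel with two input streams
    return a + b

@kernel
def scale(x):           # stand-in for a later pipeline stage
    return 2 * x

in1 = [1, 2, 3, 4]
in2 = [10, 20, 30, 40]
out = scale(combine(in1, in2))   # kernels chained, as in the FFT pipeline
print(out)  # [22, 44, 66, 88]
```

Intermediate streams (`combine`'s output) never leave the pipeline, which is the locality the bandwidth hierarchy exploits.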
Bandwidth Hierarchy
• Stream programming is well matched to the bandwidth hierarchy
(Diagram: memory → stream register file (SRF) → compute units, with successive FFT_stage kernels executing over time.)
Stream Processors
• Several lanes
  • Execute in SIMD
  • Operate on records
• Inter-cluster network
(Diagram: memory system and memory switch feed SRF banks 0 to N-1; each SRF bank and its compute cluster form a lane, and lanes communicate over the inter-cluster network.)
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Stream-Level Data Reuse
• Sequential streams only capture in-order reuse
• Arbitrary access patterns in the SRF capture more of the available temporal locality
(Taxonomy: stream data reuse splits into sequential (in-order) reuse, e.g. linear streams, and non-sequential reuse; the latter splits into reordered reuse, e.g. 2-D/3-D accesses and multi-grid, and intra-stream reuse, e.g. irregular neighborhoods and table lookups.)
Reordered Reuse
• Indexed SRF access eliminates reordering through memory
(Figure: a 1D FFT produces matrix b in row-major order, a reorder to column-major order feeds a second 1D FFT producing c; with a sequential SRF the reorder round-trips through memory/cache, while with indexing it happens inside the SRF.)
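A minimal sketch of reordered reuse, with assumed parameters (4x4 matrix, one lane, a Python list standing in for the SRF): a row-major result already resident in the SRF is consumed in column-major order via computed indices, instead of being stored to memory, transposed, and re-loaded.

```python
# Illustrative only; not hardware-accurate.
N = 4
srf = list(range(N * N))           # rows of b, left row-major in the SRF

# Sequential-only SRF: the reorder must round-trip through memory.
mem = srf[:]                       # store b to memory
transposed_via_mem = [mem[r * N + c] for c in range(N) for r in range(N)]

# Indexed SRF: the cluster issues computed addresses directly.
def indexed_read(srf, idx):
    return [srf[i] for i in idx]

col_major_idx = [r * N + c for c in range(N) for r in range(N)]
transposed_in_srf = indexed_read(srf, col_major_idx)

assert transposed_in_srf == transposed_via_mem   # same data, no memory traffic
```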
Intra-stream Reuse
• Indexed SRF access eliminates
  • Replication in the SRF
  • Redundant memory transfers
(Figure: with sequential access, reused records such as B and D are replicated in the SRF stream and re-fetched from memory; with indexing, each record is stored once and reached by computed addresses.)
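A small sketch of the intra-stream reuse argument (the table contents and access pattern are invented for illustration): a lookup table referenced through computed indices is stored once, whereas a sequential gather must materialize one record per reference.

```python
table = [v * v for v in range(8)]    # shared records, stored once
inputs = [3, 1, 3, 7, 1]             # indices repeat -> intra-stream reuse

# Sequential-only SRF: gather through memory, one transfer per reference,
# and the resulting SRF stream replicates the reused records.
gathered = [table[i] for i in inputs]
mem_transfers_seq = len(inputs)              # 5 transfers for 3 unique records

# Indexed SRF: each distinct record is fetched once; clusters then
# index the shared copy in the SRF.
mem_transfers_idx = len(set(inputs))         # 3 transfers
results = [table[i] for i in inputs]

assert results == gathered
assert mem_transfers_idx < mem_transfers_seq
```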
Conditional Accesses
• Fine-grain conditional accesses
  • Expensive in SIMD architectures
  • Translate to conditional address computation
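The translation can be sketched as follows (illustrative; the scratch-slot convention and all names are assumptions, not the paper's mechanism): every SIMD lane executes the same write instruction, and the predicate only changes the address it writes to, compacting kept records into the output stream.

```python
def conditional_append(srf, head, value, keep):
    # All lanes write unconditionally; discarded values land on a
    # scratch slot, so no lane diverges.
    addr = head if keep else len(srf) - 1    # last slot = scratch
    srf[addr] = value
    return head + (1 if keep else 0)         # head advances only on keep

srf = [None] * 8                              # small SRF region + scratch
head = 0
for v in [5, -2, 7, -1, 3]:
    head = conditional_append(srf, head, v, keep=(v > 0))

assert srf[:head] == [5, 7, 3]                # compacted output stream
```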
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Base Architecture
• Each SRF bank accesses a block of b contiguous words (b*W bits)
(Diagram: SRF banks 0 to N-1 above compute clusters 0 to N-1, joined by the inter-cluster network.)
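Block-granularity access can be sketched like this (b and the bank contents are assumed values for illustration): a request for word w within a lane returns the whole aligned block containing it.

```python
b = 4                                    # words per SRF block (assumed)
bank = list(range(32))                   # one lane's SRF bank contents

def block_access(bank, word_addr, b=b):
    start = (word_addr // b) * b         # align down to a block boundary
    return bank[start:start + b]         # the whole block comes out together

assert block_access(bank, 9) == [8, 9, 10, 11]
# A sequential stream consumes every word of each block; a single
# indexed word would waste the other b-1 words of that bandwidth.
```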
Indexed SRF Architecture
• Address path from clusters, via per-bank address FIFOs
• Lower indexed access bandwidth
(Diagram: the base architecture with address FIFOs between each compute cluster and its SRF bank.)
Base SRF Bank
• Several SRAM sub-arrays (sub-arrays 0-3, with local word-line drivers)
• Each access is to one sub-array
Indexed SRF Bank
• Per-sub-array pre-decode and row decoders
• Extra 8:1 mux at the sub-array outputs
• Allows 4x 1-word accesses, one per sub-array
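A sketch of what independent sub-array addressing buys (the geometry and conflict policy here are assumptions for illustration): with its own decoder and output mux, each of the 4 sub-arrays can serve one word per access, so four unrelated indexed words can replace one block access, as long as they hit different sub-arrays.

```python
SUBARRAYS, WORDS = 4, 64
bank = [[s * 1000 + w for w in range(WORDS)] for s in range(SUBARRAYS)]

def indexed_cycle(bank, addrs):
    """Serve up to one word per sub-array; a conflict would stall."""
    served, busy = [], set()
    for sub, word in addrs:
        if sub in busy:                       # sub-array conflict
            raise RuntimeError("sub-array conflict: stall")
        busy.add(sub)
        served.append(bank[sub][word])
    return served

# Four addresses hitting four different sub-arrays all complete together.
print(indexed_cycle(bank, [(0, 5), (1, 9), (2, 0), (3, 63)]))
```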
Cross-lane Indexed SRF
• Address switch added
• Inter-cluster network used for cross-lane SRF data
(Diagram: address FIFOs feed an SRF address network spanning the banks; data returns over the inter-cluster network.)
Overhead - Area
• In-lane indexing overheads
  • 11% over the sequential SRF
  • Per-sub-array independent addressing
• Cross-lane indexing overheads
  • 22% over the sequential SRF
  • Address switch
• 1.5% to 3% increase in die area (Imagine processor)
Overhead - Energy
• 0.1 nJ per indexed SRF access (0.13 µm process)
  • ~4x a sequential SRF access
  • More than an order of magnitude lower than a DRAM access
  • 0.25 nJ per cache access
• Each indexed access replaces many SRF and DRAM/cache accesses
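A back-of-the-envelope check of the trade-off using the slide's numbers. Only the SRF and cache energies come from the slide; the DRAM figure is an assumed placeholder consistent with "more than an order of magnitude above an indexed access".

```python
E_IDX_SRF = 0.1              # nJ per indexed SRF access (from slide)
E_SEQ_SRF = E_IDX_SRF / 4    # ~4x cheaper, per slide
E_CACHE   = 0.25             # nJ per cache access (from slide)
E_DRAM    = 2.0              # assumed: > 10x an indexed SRF access

# Intra-stream reuse: one indexed SRF access stands in for a repeated
# cache (or DRAM) fetch of the same record, so it pays off whenever it
# replaces even a single cache access.
saved_vs_cache = E_CACHE - E_IDX_SRF
saved_vs_dram  = E_DRAM - E_IDX_SRF
assert saved_vs_cache > 0 and saved_vs_dram > saved_vs_cache
```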
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Benchmarks
• 64x64 2D FFT
  • 2D accesses
• Rijndael (AES)
  • Table lookups
• Merge sort
  • Fine-grain conditionals
• 5x5 convolution filter
  • Regular neighborhood
• Irregular graph
  • Irregular neighborhood access
  • Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips
Machine Organizations
• Base (sequential SRF): DRAM, memory switch, SRF banks, compute clusters, inter-cluster net
• Base + cache: adds a cache between DRAM and the memory switch
• Indexed SRF: adds the SRF address net
(Diagram: the three organizations side by side.)
Execution Time
(Chart: execution time across benchmarks and machine organizations; data not recoverable from the transcript.)
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Conclusions
• Data parallelism increasingly important
• Current data parallel architectures inefficient for some application classes
  • Irregular accesses
• Indexed SRF accesses
  • Reduce memory traffic
  • Reduce SRF data replication
  • Efficiently support complex/conditional stream accesses
• Performance improvements
  • 3% to 410% for target application classes
• Low implementation overhead
  • 1.5% to 3% die area
Backups
Indexed Access Instruction Overhead • Excludes address issue instructions
Kernel C API

    while(!eos(in)) {
      in >> a;
      LUT[a] >> b;      // indexed SRF read
      c = foo(a, b);
      out << c;
    }

• An indexed read compiles to 2 separate instructions
  • Address issue: LUT.index << a;
  • (independent instructions may be scheduled between them)
  • Data read: LUT >> b;
• Address-data separation may require loop unrolling, software pipelining, etc.
Why Graphics Hardware?
• Pentium 4 SSE theoretical*: 3 GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS
• GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0: 20 GFLOPS
  • Equivalent to a 10 GHz P4
  • And getting faster: 3x improvement over NV30 (6 months)
*from Intel P4 Optimization Manual
(Chart labels: Pentium 4, NV30, NV35.)
Slide from Ian Buck, Stanford University
NVIDIA Graphics Growth (225%/yr)
• Essentially Moore's Law cubed
• 1: Dual textured
• 2: Programmable
Slide from Pat Hanrahan, Kurt Akeley
NVIDIA Historicals
(Table: annual growth factors, including 1.8 and 2.4; remaining data not recoverable from the transcript.)
Slide from Pat Hanrahan, Kurt Akeley
Base Architecture
• Stream buffers match SRF bandwidth to compute needs
(Diagram: SRF banks 0-7 with 128b ports to stream buffers and 32b ports to compute clusters 0-7, joined by the inter-cluster network.)
Indexed SRF Architecture
• Address path from clusters
• Lower indexed access bandwidth
(Diagram: as the base architecture, with address FIFOs added; 128b SRF ports, 32b cluster ports.)
Base SRF Bank
• Several SRAM sub-arrays
(Diagram: sub-arrays 0-3 with local word-line drivers; 256b internal rows, 128b port to the compute cluster.)
Indexed SRF Bank
• Extra 8:1 mux at the sub-array outputs
• Allows 4x 1-word accesses
(Diagram: per-sub-array pre-decode and row decoders; 256b internal rows muxed to the 128b port.)
Cross-lane Indexed SRF
• Address switch added
• Inter-cluster network used for cross-lane SRF data
(Diagram: SRF banks 0-7, SRF address network, stream buffers, address FIFOs, 32b ports to compute clusters 0-7.)