Stream Register Files with Indexed Access
Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally
Scaling Trends
• ILP increasingly harder and more expensive to extract
• Graphics processors exploit data parallelism
(Chart: CPU vs. GPU performance scaling over time, with NV10 and NV35 marked. CPU data courtesy of Francois Labonte, Stanford University.)
NSJ
Renewed Interest in Data Parallelism
• Data parallel application classes
  • Media, signal, and network processing, scientific simulations, encryption, etc.
• High-end vector machines
  • Have always been data parallel
• Academic research
  • Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.
• Mainstream industry
  • Sony Emotion Engine, Tarantula, etc.
Storage Hierarchy
• Bandwidth taper
• Only supports sequential streams/vectors
• But many data parallel apps have
  • Data reorderings
  • Irregular data structures
  • Conditional accesses
(Diagram: DRAM → cache → stream/vector storage feeding the compute units.)
Sequential Streams/Vectors Inefficient
• Evaluate arbitrary-order access to streams
(Figure: matrices a, b, and c flow between memory/cache, stream/vector storage, and the compute units over time; data produced in row-major order must be reordered through memory before it can be consumed in column-major order.)
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Stream Programming
• Streams of records passing through compute kernels
• Parallelism
  • Across stream elements
  • Across kernels
• Locality
  • Within kernels
  • Between kernels
(Diagram: inputs in1 and in2 pass through a pipeline of FFT_stage kernels to the output.)
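The model above can be sketched in ordinary Python. This is purely illustrative: the real system uses the StreamC/KernelC toolchain, and every name below (`kernel`, `scale`, `combine`) is made up for the sketch.

```python
# Minimal sketch of the stream model: kernels consume input streams
# record-by-record and produce output streams, so parallelism exists
# both across records and across chained kernels.

def kernel(fn):
    """Lift a per-record function into a stream-to-stream kernel."""
    def run(*streams):
        return [fn(*recs) for recs in zip(*streams)]
    return run

@kernel
def combine(a, b):      # a kernel with two input streams
    return a + b

@kernel
def scale(x):           # stand-in for a later pipeline stage
    return 2 * x

in1 = [1, 2, 3, 4]
in2 = [10, 20, 30, 40]
out = scale(combine(in1, in2))   # kernels chained, as in the FFT pipeline
print(out)  # [22, 44, 66, 88]
```

Intermediate streams (`combine`'s output) never leave the pipeline, which is the locality the bandwidth hierarchy exploits.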
Bandwidth Hierarchy
• Stream programming is well matched to the bandwidth hierarchy
(Diagram: memory → stream register file (SRF) → compute units, with successive FFT_stage kernels executing over time.)
Stream Processors
• Several lanes
  • Execute in SIMD
  • Operate on records
• Inter-cluster network
(Diagram: memory system and memory switch feed SRF banks 0 to N-1; each SRF bank and its compute cluster form a lane, and lanes communicate over the inter-cluster network.)
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Stream-Level Data Reuse
• Sequential streams only capture in-order reuse
• Arbitrary access patterns in the SRF capture more of the available temporal locality
(Taxonomy: stream data reuse splits into sequential (in-order) reuse, e.g. linear streams, and non-sequential reuse; the latter splits into reordered reuse, e.g. 2-D/3-D accesses and multi-grid, and intra-stream reuse, e.g. irregular neighborhoods and table lookups.)
Reordered Reuse
• Indexed SRF access eliminates reordering through memory
(Figure: a 1D FFT produces matrix b in row-major order, a reorder to column-major order feeds a second 1D FFT producing c; with a sequential SRF the reorder round-trips through memory/cache, while with indexing it happens inside the SRF.)
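A minimal sketch of reordered reuse, with assumed parameters (4x4 matrix, one lane, a Python list standing in for the SRF): a row-major result already resident in the SRF is consumed in column-major order via computed indices, instead of being stored to memory, transposed, and re-loaded.

```python
# Illustrative only; not hardware-accurate.
N = 4
srf = list(range(N * N))           # rows of b, left row-major in the SRF

# Sequential-only SRF: the reorder must round-trip through memory.
mem = srf[:]                       # store b to memory
transposed_via_mem = [mem[r * N + c] for c in range(N) for r in range(N)]

# Indexed SRF: the cluster issues computed addresses directly.
def indexed_read(srf, idx):
    return [srf[i] for i in idx]

col_major_idx = [r * N + c for c in range(N) for r in range(N)]
transposed_in_srf = indexed_read(srf, col_major_idx)

assert transposed_in_srf == transposed_via_mem   # same data, no memory traffic
```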
Intra-stream Reuse
• Indexed SRF access eliminates
  • Replication in the SRF
  • Redundant memory transfers
(Figure: with sequential access, reused records such as B and D are replicated in the SRF stream and re-fetched from memory; with indexing, each record is stored once and reached by computed addresses.)
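A small sketch of the intra-stream reuse argument (the table contents and access pattern are invented for illustration): a lookup table referenced through computed indices is stored once, whereas a sequential gather must materialize one record per reference.

```python
table = [v * v for v in range(8)]    # shared records, stored once
inputs = [3, 1, 3, 7, 1]             # indices repeat -> intra-stream reuse

# Sequential-only SRF: gather through memory, one transfer per reference,
# and the resulting SRF stream replicates the reused records.
gathered = [table[i] for i in inputs]
mem_transfers_seq = len(inputs)              # 5 transfers for 3 unique records

# Indexed SRF: each distinct record is fetched once; clusters then
# index the shared copy in the SRF.
mem_transfers_idx = len(set(inputs))         # 3 transfers
results = [table[i] for i in inputs]

assert results == gathered
assert mem_transfers_idx < mem_transfers_seq
```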
Conditional Accesses
• Fine-grain conditional accesses
  • Expensive in SIMD architectures
  • Translate to conditional address computation
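The translation can be sketched as follows (illustrative; the scratch-slot convention and all names are assumptions, not the paper's mechanism): every SIMD lane executes the same write instruction, and the predicate only changes the address it writes to, compacting kept records into the output stream.

```python
def conditional_append(srf, head, value, keep):
    # All lanes write unconditionally; discarded values land on a
    # scratch slot, so no lane diverges.
    addr = head if keep else len(srf) - 1    # last slot = scratch
    srf[addr] = value
    return head + (1 if keep else 0)         # head advances only on keep

srf = [None] * 8                              # small SRF region + scratch
head = 0
for v in [5, -2, 7, -1, 3]:
    head = conditional_append(srf, head, v, keep=(v > 0))

assert srf[:head] == [5, 7, 3]                # compacted output stream
```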
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Base Architecture
• Each SRF bank accesses a block of b contiguous words (b*W bits)
(Diagram: SRF banks 0 to N-1 above compute clusters 0 to N-1, joined by the inter-cluster network.)
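Block-granularity access can be sketched like this (b and the bank contents are assumed values for illustration): a request for word w within a lane returns the whole aligned block containing it.

```python
b = 4                                    # words per SRF block (assumed)
bank = list(range(32))                   # one lane's SRF bank contents

def block_access(bank, word_addr, b=b):
    start = (word_addr // b) * b         # align down to a block boundary
    return bank[start:start + b]         # the whole block comes out together

assert block_access(bank, 9) == [8, 9, 10, 11]
# A sequential stream consumes every word of each block; a single
# indexed word would waste the other b-1 words of that bandwidth.
```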
Indexed SRF Architecture
• Address path from clusters, via per-bank address FIFOs
• Lower indexed access bandwidth
(Diagram: the base architecture with address FIFOs between each compute cluster and its SRF bank.)
Base SRF Bank
• Several SRAM sub-arrays (sub-arrays 0-3, with local word-line drivers)
• Each access is to one sub-array
Indexed SRF Bank
• Per-sub-array pre-decode and row decoders
• Extra 8:1 mux at the sub-array outputs
• Allows 4x 1-word accesses, one per sub-array
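A sketch of what independent sub-array addressing buys (the geometry and conflict policy here are assumptions for illustration): with its own decoder and output mux, each of the 4 sub-arrays can serve one word per access, so four unrelated indexed words can replace one block access, as long as they hit different sub-arrays.

```python
SUBARRAYS, WORDS = 4, 64
bank = [[s * 1000 + w for w in range(WORDS)] for s in range(SUBARRAYS)]

def indexed_cycle(bank, addrs):
    """Serve up to one word per sub-array; a conflict would stall."""
    served, busy = [], set()
    for sub, word in addrs:
        if sub in busy:                       # sub-array conflict
            raise RuntimeError("sub-array conflict: stall")
        busy.add(sub)
        served.append(bank[sub][word])
    return served

# Four addresses hitting four different sub-arrays all complete together.
print(indexed_cycle(bank, [(0, 5), (1, 9), (2, 0), (3, 63)]))
```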
Cross-lane Indexed SRF
• Address switch added
• Inter-cluster network used for cross-lane SRF data
(Diagram: address FIFOs feed an SRF address network spanning the banks; data returns over the inter-cluster network.)
Overhead - Area
• In-lane indexing overheads
  • 11% over the sequential SRF
  • Per-sub-array independent addressing
• Cross-lane indexing overheads
  • 22% over the sequential SRF
  • Address switch
• 1.5% to 3% increase in die area (Imagine processor)
Overhead - Energy
• 0.1 nJ per indexed SRF access (0.13 µm process)
  • ~4x a sequential SRF access
  • More than an order of magnitude lower than a DRAM access
  • 0.25 nJ per cache access
• Each indexed access replaces many SRF and DRAM/cache accesses
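A back-of-the-envelope check of the trade-off using the slide's numbers. Only the SRF and cache energies come from the slide; the DRAM figure is an assumed placeholder consistent with "more than an order of magnitude above an indexed access".

```python
E_IDX_SRF = 0.1              # nJ per indexed SRF access (from slide)
E_SEQ_SRF = E_IDX_SRF / 4    # ~4x cheaper, per slide
E_CACHE   = 0.25             # nJ per cache access (from slide)
E_DRAM    = 2.0              # assumed: > 10x an indexed SRF access

# Intra-stream reuse: one indexed SRF access stands in for a repeated
# cache (or DRAM) fetch of the same record, so it pays off whenever it
# replaces even a single cache access.
saved_vs_cache = E_CACHE - E_IDX_SRF
saved_vs_dram  = E_DRAM - E_IDX_SRF
assert saved_vs_cache > 0 and saved_vs_dram > saved_vs_cache
```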
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Benchmarks
• 64x64 2D FFT
  • 2D accesses
• Rijndael (AES)
  • Table lookups
• Merge sort
  • Fine-grain conditionals
• 5x5 convolution filter
  • Regular neighborhood
• Irregular graph
  • Irregular neighborhood access
  • Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips
Machine Organizations
• Base (sequential SRF): DRAM, memory switch, SRF banks, compute clusters, inter-cluster net
• Base + cache: adds a cache between DRAM and the memory switch
• Indexed SRF: adds the SRF address net
(Diagram: the three organizations side by side.)
Execution Time
(Chart: execution time across benchmarks and machine organizations; data not recoverable from the transcript.)
Outline • Stream processing overview • Applications • Implementation • Results • Conclusion
Conclusions
• Data parallelism increasingly important
• Current data parallel architectures inefficient for some application classes
  • Irregular accesses
• Indexed SRF accesses
  • Reduce memory traffic
  • Reduce SRF data replication
  • Efficiently support complex/conditional stream accesses
• Performance improvements
  • 3% to 410% for target application classes
• Low implementation overhead
  • 1.5% to 3% die area
Backups
Indexed Access Instruction Overhead • Excludes address issue instructions
Kernel C API

    while(!eos(in)) {
      in >> a;
      LUT[a] >> b;      // indexed SRF read
      c = foo(a, b);
      out << c;
    }

• An indexed read compiles to 2 separate instructions
  • Address issue: LUT.index << a;
  • (independent instructions may be scheduled between them)
  • Data read: LUT >> b;
• Address-data separation may require loop unrolling, software pipelining, etc.
Why Graphics Hardware?
• Pentium 4 SSE theoretical*: 3 GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS
• GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0: 20 GFLOPS
  • Equivalent to a 10 GHz P4
  • And getting faster: 3x improvement over NV30 (6 months)
*from Intel P4 Optimization Manual
(Chart labels: Pentium 4, NV30, NV35.)
Slide from Ian Buck, Stanford University
NVIDIA Graphics Growth (225%/yr)
• Essentially Moore's Law cubed
• 1: Dual textured
• 2: Programmable
Slide from Pat Hanrahan, Kurt Akeley
NVIDIA Historicals
(Table: annual growth factors, including 1.8 and 2.4; remaining data not recoverable from the transcript.)
Slide from Pat Hanrahan, Kurt Akeley
Base Architecture
• Stream buffers match SRF bandwidth to compute needs
(Diagram: SRF banks 0-7 with 128b ports to stream buffers and 32b ports to compute clusters 0-7, joined by the inter-cluster network.)
Indexed SRF Architecture
• Address path from clusters
• Lower indexed access bandwidth
(Diagram: as the base architecture, with address FIFOs added; 128b SRF ports, 32b cluster ports.)
Base SRF Bank
• Several SRAM sub-arrays
(Diagram: sub-arrays 0-3 with local word-line drivers; 256b internal rows, 128b port to the compute cluster.)
Indexed SRF Bank
• Extra 8:1 mux at the sub-array outputs
• Allows 4x 1-word accesses
(Diagram: per-sub-array pre-decode and row decoders; 256b internal rows muxed to the 128b port.)
Cross-lane Indexed SRF
• Address switch added
• Inter-cluster network used for cross-lane SRF data
(Diagram: SRF banks 0-7, SRF address network, stream buffers, address FIFOs, 32b ports to compute clusters 0-7.)