The Potential of the Cell Processor for Scientific Computing Sam Williams samw@cs.berkeley.edu Computational Research Division Future Technologies Group July 12, 2006
Outline • Cell Architecture • Programming Cell • Benchmarks & Performance • Stencils on Structured Grids • Sparse Matrix-Vector Multiplication • Matrix-Matrix Multiplication • FFTs • Summary
Cell Chip Architecture • 9-core SMP • One core is a conventional cache-based PPC • The other 8 are local-memory-based SIMD processors (SPEs) • Cores connected via 4 rings (4 x 128b @ 1.6GHz) • 25.6GB/s memory bandwidth (128b @ 1.6GHz) to XDR • I/O channels can be used to build multichip SMPs
Cell Chip Characteristics • Core frequency up to 3.2GHz • Limited access to 2.1GHz hardware • Faster parts (2.4GHz & 3.2GHz) became available after the paper • ~221mm² chip • Much smaller than Itanium2 (1/2 to 1/3 the area) • About the size of a Power5, or an Opteron/Pentium before the shrink • ~500W blades (2 chips + DRAM + network)
SPE Architecture • 128b SIMD • Dual issue (FP/ALU + load/store/permute/etc…), but not DP • Statically Scheduled, in-order • 4 FMA SP datapaths, 6 cycle latency • 1 FMA DP datapath, 13 cycle latency (including 7 stall cycles) • Software managed BTB (branch hints) • Local memory based architecture (two steps to access DRAM) • DMA commands are queued in the MFC • 128b @ 1.6GHz to local store (256KB) • Aligned loads from local store to RF (128 x 128b) • ~3W @ 3.2GHz
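As a sanity check on these datapath figures (arithmetic derived from the numbers above, not stated on the slide), the per-chip peaks follow directly:

\[ \textrm{SP peak} = 8~\textrm{SPEs} \times 4~\textrm{lanes} \times 2~\tfrac{\textrm{flops}}{\textrm{FMA}} \times 3.2~\textrm{GHz} = 204.8~\textrm{GFLOP/s} \]
\[ \textrm{DP peak} = 8~\textrm{SPEs} \times 2~\textrm{lanes} \times 2~\tfrac{\textrm{flops}}{\textrm{FMA}} \times \tfrac{3.2~\textrm{GHz}}{7~\textrm{cycles/issue}} \approx 14.6~\textrm{GFLOP/s} \]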
Estimation, Simulation and Exploration • Perform analysis before writing any code • Conceptualize the algorithm assuming double buffering + long DMAs + an in-order machine • Use static timing analysis + memory traffic modeling • Model performance • For regular data structures, spreadsheet modeling works • SpMV requires more advanced modeling • Full System Simulator • Cell+ (a hypothetical variant): how severely does a DP throughput of 1 SIMD instruction every 7 or 8 cycles impair performance? • Cell+ issues 1 DP SIMD instruction every 2 cycles • Allows dual issuing of DP instructions with loads/stores/permutes/branches
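For intuition, the spreadsheet model amounts to taking the max of compute time and memory time under perfect double buffering. A toy sketch (the function name and constants are illustrative, not from the paper):

    #include <stdio.h>

    /* Toy double-buffered performance model: with communication and
     * computation overlapped, kernel time is the max of the two. */
    static double model_gflops(double flops, double bytes,
                               double peak_gflops, double bw_gbs) {
        double t_compute = flops / (peak_gflops * 1e9);   /* seconds */
        double t_memory  = bytes / (bw_gbs * 1e9);        /* seconds */
        double t = t_compute > t_memory ? t_compute : t_memory;
        return flops / t / 1e9;
    }

    int main(void) {
        /* Naive DP stencil: 7 flops per 48 bytes of traffic, 25.6 GB/s,
         * ~14.6 GFLOP/s DP peak (figures from the talk). Prints ~3.73. */
        printf("%.2f GFLOP/s\n", model_gflops(7.0, 48.0, 14.6, 25.6));
        return 0;
    }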
Cell Programming • Modified SPMD (Single Program Multiple Data) • Dual Program Multiple Data • Hierarchical SPMD • Kind of clunky approach compared to MPI or pthreads • Power core is used to: • Load/initialize data structures • Spawn SPE threads • Parallelize data structures • Pass pointers to SPEs • Synchronize SPE threads • Communicate with other processors • Perform other I/O operations
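On the PPC side, spawning an SPE thread looks roughly like the following. This is a minimal sketch against the libspe2 interface; spe_kernel is a hypothetical embedded SPE program, and production code runs one pthread per SPE since spe_context_run blocks:

    #include <libspe2.h>

    extern spe_program_handle_t spe_kernel;   /* hypothetical embedded SPE ELF */

    /* Run one SPE program synchronously; argp carries a pointer to the
     * parallelized data structures set up by the Power core. */
    int run_spe(void *argp)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_program_load(ctx, &spe_kernel);
        int rc = spe_context_run(ctx, &entry, 0, argp, NULL, NULL);
        spe_context_destroy(ctx);
        return rc;
    }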
SPE Programming • Software-controlled memory • Use pointers from the PPC to construct global addresses • Programmer handles transfers from global to local • Compiler handles transfers from local to RF • Most applicable when the address stream is long and independent • DMAs • Granularity is 1, 2, 4, 8, or 16·N bytes • Issued with very low-level intrinsics (no error checking) • GETL = stanza gather/scatter = distributed in global memory, packed in local store • Double buffering • SPU and MFC operate in parallel • Operate on the current data set while transferring the next/last • Time is the max of computation time and communication time • Prologue and epilogue penalties • Although more verbose, intrinsics are an effective way of delivering peak performance (see the sketch below)
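A minimal SPE-side DMA sketch using the SDK's spu_mfcio.h intrinsics (the buffer size, tag choice, and effective address are placeholders):

    #include <spu_mfcio.h>

    #define CHUNK 4096                       /* bytes per DMA, 16B multiple */
    static volatile char buf[CHUNK] __attribute__((aligned(128)));

    /* Pull one chunk from a global (effective) address into local store. */
    void fetch_chunk(unsigned long long ea)
    {
        const unsigned int tag = 0;          /* MFC tag group, 0..31        */
        mfc_get(buf, ea, CHUNK, tag, 0, 0);  /* enqueue DMA: global -> LS   */
        mfc_write_tag_mask(1 << tag);        /* select tag group to wait on */
        mfc_read_tag_status_all();           /* block until DMA completes   */
    }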
Benchmark Kernels • Stencil Computations on Structured Grids • Sparse Matrix-Vector Multiplication • Matrix-Matrix Multiplication • 1D FFTs • 2D FFTs
Processors Used • Note: Cell performance does not include the Power core
Stencil Operations • Simple example - the heat equation • ∂T/∂t = k∇²T • Parabolic PDE on a 3D discretized scalar domain • Jacobi style (read from the current grid, write to the next grid) • 7 FLOPs per point, typically double precision • Next[x,y,z] = Alpha*Current[x,y,z] + Current[x-1,y,z] + Current[x+1,y,z] + Current[x,y-1,z] + Current[x,y+1,z] + Current[x,y,z-1] + Current[x,y,z+1] • Doesn't exploit the FMA pipeline well • Basically 6 streams presented to the memory subsystem • Explicit ghost zones bound the grid
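In plain C, one Jacobi sweep of this kernel reads as follows (the grid dimensions and indexing macro are illustrative):

    #define IDX(x,y,z) ((z)*NY*NX + (y)*NX + (x))   /* x varies fastest */

    /* One Jacobi sweep of the 7-point heat stencil over the interior;
     * the one-point ghost shell on each face is read but never written.
     * 7 flops per point: 1 multiply + 6 adds. */
    void sweep(const double *cur, double *next, double alpha,
               int NX, int NY, int NZ)
    {
        for (int z = 1; z < NZ - 1; z++)
            for (int y = 1; y < NY - 1; y++)
                for (int x = 1; x < NX - 1; x++)
                    next[IDX(x,y,z)] =
                        alpha * cur[IDX(x,y,z)]
                      + cur[IDX(x-1,y,z)] + cur[IDX(x+1,y,z)]
                      + cur[IDX(x,y-1,z)] + cur[IDX(x,y+1,z)]
                      + cur[IDX(x,y,z-1)] + cur[IDX(x,y,z+1)];
    }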
Optimization - Planes • Naïve approach (cacheless vector machine) is to load 5 streams and store one. • This is 7 flops per 48 bytes • memory limits performance to 3.7 GFLOP/s • A better approach is to make each DMA the size of a plane • cache the 3 most recent planes (z-1, z, z+1) • there are only two streams (one load, one store) • memory now limits performance to 11.2 GFLOP/s • Still must compute on each plane after it is loaded • e.g. forall Current_local[x,y] update Next_local[x,y] • Note: computation can severely limit performance
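Both bounds are simple arithmetic from the 25.6 GB/s figure:

\[ \frac{25.6~\textrm{GB/s}}{48~\textrm{B/point}} \times 7~\tfrac{\textrm{flops}}{\textrm{point}} \approx 3.7~\textrm{GFLOP/s}, \qquad \frac{25.6~\textrm{GB/s}}{16~\textrm{B/point}} \times 7~\tfrac{\textrm{flops}}{\textrm{point}} = 11.2~\textrm{GFLOP/s} \]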
Optimization - Double Buffering • Add an input stream buffer and an output stream buffer (keep 6 planes in the local store) • The two phases (transfer & compute) are now overlapped • Thus it is possible to hide the shorter of DMA transfer time and computation time (sketch below)
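A sketch of the resulting ping-pong structure on the SPE, using one MFC tag group per buffer. The block dimensions, contiguous-plane layout, and kernel name are assumptions for illustration:

    #include <spu_mfcio.h>

    #define BX 64                            /* illustrative block dims:    */
    #define BY 32                            /* 64*32*8B = 16KB = 1 max DMA */
    #define PLANE_BYTES (BX * BY * sizeof(double))

    static volatile double plane[2][BX * BY] __attribute__((aligned(128)));
    extern void compute_plane(double *p);    /* user-supplied stencil kernel */

    /* Double-buffered plane loop: fetch plane z+1 under one MFC tag while
     * computing on plane z under the other, then swap (ping-pong).
     * Assumes planes are contiguous in DRAM (e.g. a blocked layout). */
    void stream_planes(unsigned long long ea, int nz)
    {
        int cur = 0;
        mfc_get(plane[cur], ea, PLANE_BYTES, cur, 0, 0);
        for (int z = 0; z < nz; z++) {
            int nxt = cur ^ 1;
            if (z + 1 < nz)                  /* prefetch the next plane      */
                mfc_get(plane[nxt],
                        ea + (unsigned long long)(z + 1) * PLANE_BYTES,
                        PLANE_BYTES, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);    /* wait on current buffer only  */
            mfc_read_tag_status_all();
            compute_plane((double *)plane[cur]);
            cur = nxt;
        }
    }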
Optimization - Cache Blocking • Domains can be quite large (~1GB) • A single plane, let alone 6, might not fit in the local store • Partition the domain (and thus the planes) into cache blocks so that 6 planes fit in the local store • Has the added benefit that cache blocks are independent and thus can be parallelized across multiple SPEs • Memory efficiency can be diminished, as an intra-grid ghost zone is implicitly created
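The block-size bound is straightforward arithmetic (a rough bound, ignoring space for code, stack, and the double-buffering planes):

\[ 6 \times XY \times 8~\textrm{B} \le 256~\textrm{KB} \;\Rightarrow\; XY \lesssim 5460~\textrm{points per double-precision plane} \]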
Optimization - Register Blocking • Instead of computing on pencils, compute on ribbons (4x2) • Hides functional unit & local store latency • Minimizes local store memory traffic • Minimizes loop overhead • May not be beneficial / noticeable on cache-based machines
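A scalar illustration of one 4x2 ribbon; the real code uses SIMD intrinsics, so this only shows the unrolling/reuse structure (all names are illustrative):

    /* 7-point update for a single cell, inlined and reused below. */
    static inline double upd(const double *c, double a, long i, int NX, int NY)
    {
        return a*c[i] + c[i-1] + c[i+1] + c[i-NX] + c[i+NX]
             + c[i-(long)NX*NY] + c[i+(long)NX*NY];
    }

    /* 4x2 register-blocked ribbon: eight independent update chains stay
     * live in registers, hiding the 6-cycle FMA latency and amortizing
     * loop overhead 8x. */
    void ribbon(const double *cur, double *next, double a,
                int x0, int y, int z, int NX, int NY)
    {
        long i = (long)z*NX*NY + (long)y*NX + x0;
        next[i+0]    = upd(cur, a, i+0,    NX, NY);
        next[i+1]    = upd(cur, a, i+1,    NX, NY);
        next[i+2]    = upd(cur, a, i+2,    NX, NY);
        next[i+3]    = upd(cur, a, i+3,    NX, NY);
        next[i+NX+0] = upd(cur, a, i+NX+0, NX, NY);
        next[i+NX+1] = upd(cur, a, i+NX+1, NX, NY);
        next[i+NX+2] = upd(cur, a, i+NX+2, NX, NY);
        next[i+NX+3] = upd(cur, a, i+NX+3, NX, NY);
    }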
Optimization - Time Skewing • If the application allows it, perform multiple time steps within the local store • Only appropriate for memory-bound implementations: single precision, or improved (Cell+) double precision • Improves computational intensity • Simple approach: overlapping trapezoids in the time-space plot • Can be inefficient due to duplicated work • If performing two steps, the local store must now hold 9 planes (thus smaller cache blocks) • If performing n steps, the local store must hold 3(n+1) planes
Stencil Performance • Notes: • The performance model, when accounting for limited dual issue, matches the full system simulator and hardware well • Double precision runs don't exploit time skewing • In single precision, a time-skewing depth of 4 was used • Problem size was the best of 128³ and 256³
Sparse Matrix-Vector Multiplication • Most of the matrix entries are zero, thus the nonzeros are sparsely distributed • Can be used for unstructured grid problems • Issues • Like DGEMM, can exploit an FMA well • Very low computational intensity (1 FMA for every 12+ bytes) • Non-FP instructions can dominate • Can be very irregular • Row lengths can be unique and very short
Compressed Sparse Row • Compressed Sparse Row (CSR) is the standard format • Array of nonzero values • Array of corresponding column for each nonzero value • Array of row starts containing index (in the above arrays) of first nonzero in the row
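For orientation, the textbook scalar kernel over this format (not Cell-specific):

    /* y = A*x for an m-row matrix in CSR: val/col hold the nonzeros and
     * their columns; rowstart[r]..rowstart[r+1] delimit row r's entries. */
    void spmv_csr(int m, const int *rowstart, const int *col,
                  const double *val, const double *x, double *y)
    {
        for (int r = 0; r < m; r++) {
            double sum = 0.0;
            for (int k = rowstart[r]; k < rowstart[r+1]; k++)
                sum += val[k] * x[col[k]];   /* one FMA per nonzero */
            y[r] = sum;
        }
    }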
Optimization - Double Buffer Nonzeros • Computation and communication are approximately equally expensive • While operating on the current set of nonzeros, load the next (~1K-nonzero buffers) • Requires careful code to stop and restart a row between buffers • Can nearly double performance
Optimization - SIMDization • BCSR • Nonzeros are grouped into r x c blocks • O(nnz) explicit zeros are added • Choose r & c so they mesh well with 128b registers • Performance can suffer, especially in DP, as computing on zeros is very wasteful • Can hide latency and amortize loop overhead • Only used in the initial performance model • Row padding • Pad rows to the nearest multiple of 128b • Might require O(N) explicit zeros • Loop overhead still present • Generally works better in double precision
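For concreteness, a scalar 2x2 BCSR kernel, the blocked analogue of the CSR loop above (r = c = 2 matches a 128b DP register; names are illustrative):

    /* y = A*x for BCSR with fixed 2x2 blocks: bval holds each block
     * row-major, bcol gives its block column, browstart delimits the
     * mb block rows. Explicit zeros inside blocks are multiplied too. */
    void spmv_bcsr2x2(int mb, const int *browstart, const int *bcol,
                      const double *bval, const double *x, double *y)
    {
        for (int r = 0; r < mb; r++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = browstart[r]; k < browstart[r+1]; k++) {
                const double *b  = &bval[4*k];
                const double *xv = &x[2*bcol[k]];
                y0 += b[0]*xv[0] + b[1]*xv[1];
                y1 += b[2]*xv[0] + b[3]*xv[1];
            }
            y[2*r] = y0;  y[2*r+1] = y1;
        }
    }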
Optimization - Cache Blocked Vectors • Doubleword DMA gathers from DRAM can be expensive • Cache block the source and destination vectors • Finite LS, so what's the best aspect ratio? • DMA large blocks into the local store • Gather operations into the local store • ISA vs. memory subsystem inefficiencies • Exploits temporal and spatial locality within the SPE • In effect, the sparse matrix is explicitly blocked into submatrices, and we can skip, or otherwise optimize, empty submatrices • Indices are now relative to the cache block, so they fit in halfwords, reducing memory traffic by 16%
Optimization - Load Balancing • Potentially irregular problem • Load imbalance can severely hurt performance • Partitioning by rows is clearly insufficient • Partitioning by nonzeros is inefficient when average number of nonzeros per row per cache block is small • Define a cost function of number of row starts and number of nonzeros. • Determine the parameters via static timing analysis or tuning.
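A sketch of such a cost-based partitioner; the weights a and b are the tuned parameters, and all names are illustrative:

    /* Split m rows among nspe SPEs so each gets roughly equal cost, where
     * cost(row r) = a (per-row-start overhead) + b * nnz(r).
     * first_row must hold nspe+1 entries; rowstart[m] == total nnz. */
    void partition_rows(int m, const int *rowstart, int nspe,
                        double a, double b, int *first_row)
    {
        double total  = a * m + b * rowstart[m];
        double target = total / nspe, acc = 0.0;
        int s = 0;
        first_row[0] = 0;
        for (int r = 0; r < m && s < nspe - 1; r++) {
            acc += a + b * (rowstart[r+1] - rowstart[r]);
            if (acc >= (s + 1) * target)     /* cut here: quota reached */
                first_row[++s] = r + 1;
        }
        while (s < nspe)                     /* close any remaining cuts */
            first_row[++s] = m;
    }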
Benchmark Matrices • 4 nonsymmetric SPARSITY matrices • 6 symmetric SPARSITY matrices • 7pt Heat equation matrix
Other Approaches • BeBop / OSKI on the Itanium2 & Opteron • uses BCSR • auto tunes for optimal r x c blocking • Cell implementation is similar • Cray’s routines on the X1E • Report best of CSRP, Segmented Scan & Jagged Diagonal
Double Precision SpMV Performance • 16 SPE version needs broadcast optimization • Frequency helps (mildly computationally bound) • Cell+ doesn’t help much more (non FP instruction BW)
Dense Matrix-Matrix Multiplication (performance model only)
Dense Matrix-Matrix Multiplication • Blocking • Explicit (BDL) or implicit blocking (gather stanzas) • A hybrid method would convert and store to DRAM on the fly • Choose a block size so that the kernel is computationally bound • 64² blocks in single precision • Much easier in double precision (14x computational time, 2x transfer time) • Parallelization • Partition A & C among SPEs • Future work - Cannon's algorithm
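The on-chip kernel then reduces to multiplying local-store-resident blocks. A scalar sketch with the 64x64 single-precision block size mentioned above (three such blocks occupy 48KB of the 256KB LS; the real kernel is SIMDized):

    #define B 64   /* block edge: three 64x64 SP blocks fit easily in LS */

    /* C += A*B for one BxB block triple resident in local store. */
    void block_mm(const float *a, const float *b, float *c)
    {
        for (int i = 0; i < B; i++)
            for (int k = 0; k < B; k++) {          /* k middle: streams b */
                float aik = a[i*B + k];
                for (int j = 0; j < B; j++)
                    c[i*B + j] += aik * b[k*B + j];   /* FMA per element */
            }
    }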
Fourier Transforms (performance model only)
1D Fast Fourier Transforms • Naïve Algorithm • Load roots of unity • Load data (cyclic) • Local work, on-chip transpose, local work • i.e. SPEs cooperate on a single FFT • No overlap of communication or computation
2D Fast Fourier Transforms • Each SPE performs 2 * (N/8) FFTs • Double buffer rows • Overlap communication and computation • 2 incoming, 2 outgoing • Straightforward algorithm (N²-point 2D FFT): • N simultaneous FFTs, transpose, N simultaneous FFTs, transpose • Long DMAs necessitate the transposes • Transposes represent about 50% of total SP execution time • SP simultaneous FFTs run at ~75 GFLOP/s
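Structurally, the per-SPE flow is four phases; fft_rows and transpose_via_dma below are hypothetical placeholders for the real kernels, and only the phase ordering is the point:

    typedef struct { float re, im; } complex_t;

    /* Hypothetical helpers standing in for the real FFT and DMA-based
     * transpose routines. */
    extern void fft_rows(complex_t *g, int n, int spe_id, int nspe);
    extern void transpose_via_dma(complex_t *g, int n);

    /* N^2-point 2D FFT: row FFTs, transpose, row FFTs, transpose. */
    void fft2d(complex_t *grid, int n, int spe_id, int nspe)
    {
        fft_rows(grid, n, spe_id, nspe);   /* each SPE: ~n/nspe row FFTs  */
        transpose_via_dma(grid, n);        /* long DMAs need unit stride  */
        fft_rows(grid, n, spe_id, nspe);   /* column FFTs done as rows    */
        transpose_via_dma(grid, n);        /* restore the original layout */
    }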
Conclusions • Cell is far more predictable than conventional out-of-order machines • Even in double precision, it obtains much better performance on a surprising variety of codes • Cell can eliminate unneeded memory traffic and hide memory latency, and thus achieves a much higher percentage of memory bandwidth • The instruction set can be very inefficient for poorly SIMDizable or misaligned codes • Loop overheads can heavily dominate performance • The programming model is clunky
Acknowledgments • This work (paper in CF06, poster in EDGE06) is a collaboration with the following FTG members: • John Shalf, Lenny Oliker, Shoaib Kamil, Parry Husbands, and Kathy Yelick • Additional thanks to • Joe Gebis and David Patterson • X1E FFT numbers provided by: • Bracy Elton, and Adrian Tate • Cell access provided by: • Mike Perrone (IBM Watson) • Otto Wohlmuth (IBM Germany) • Nehal Desai (Los Alamos National Lab)