The Potential of the Cell Processor for Scientific Computing. Leonid Oliker, Samuel Williams, John Shalf, Shoaib Kamil, Parry Husbands, Katherine Yelick. Computational Research Division, Lawrence Berkeley National Laboratory.
Motivation • Stagnating application performance is a well-known problem in scientific computing • By the end of the decade, numerous mission-critical applications are expected to have 100X the computational demands of current levels • Many HEC platforms are poorly balanced for the demands of leading applications • Memory-CPU gap, deep memory hierarchies, poor network-processor integration, low-degree network topology • Traditional superscalar trends are slowing down • Most benefits of ILP and pipelining have been mined; clock frequency is limited by the power wall • Specialized HPC market cannot support huge tech investments • HPC community is looking for alternative high-performance architectures with a healthy market outside scientific computing (approx 0$B market) • Sophistication of gaming technology is demanding more floating-point-intensive computation • However, ultimately limited by the level of human recognition, not scientific fidelity • Recently released Cell processor has tremendous computational capability • This work examines four key scientific algorithms on Cell
Introduction to Cell • Cell will be used in the PS3: compelling because it will be produced in high volume • Radical departure from conventional designs, including the XBOX 360’s Xenon
Cell/PS3 | XBOX 360
Heterogeneous | Homogeneous
PPC + 8 SIMD cores | 3 x PPC
Software-controlled memory architecture | Conventional cache-based memory hierarchy (1MB)
221 mm² (+30%) | 168 mm²
• Key limiting factors of current platforms: off-chip memory bandwidth and power usage • Memory hurdles: latency and bandwidth utilization • Homogeneous cache-based CMPs do not address these deficiencies • Software-controlled memory improves memory bandwidth and power usage: • Allows finely tuned deep prefetching • Data double buffering: hides memory latency, potentially fully utilizing bandwidth • More efficient cache utilization policies (smaller “caches” required) • Memory fetching takes advantage of application-level information • More predictable performance for performance modeling, real-time systems, etc • Less architectural complexity vs. automatic caching memory = less power • However, software-controlled memory increases programming complexity
Cell Processor Architecture • All units connected via EIB • 4 x 128b rings @ 1.6GHz • PPC core @ 3.2GHz • 8 x SPEs (128b SIMD core) • Off-load engine between a P and a coP • 128 x 128b single cycle register file • 256KB Local store • 16K x 128b 6-cycle local store • Private address space • access to global store via DMA • No unaligned access (only via permutes) • Dual SIMD issue (private PC) • one arithmetic/float/etc… • one load/store/permute/branch/channel • Statically scheduled • Execution: in-order 7 cycle pipelines • 3W @ 3.2 GHz • Memory controller (25.6GB/s dual XDR) • ~40W total if PPC is idle
Cell architecture benefits include: • Reduced pipeline length • Lower branch mis-predict penalty • Lower memory latency • Numerous in-flight DMAs • Hide memory latency • Utilize available mem BW • Overlap comp with comm • Impressive performance/power ratio
Microarchitectural Issues • PPE (PowerPC Processing Element) • 512KB cache: coherent with DMAs, not Local Store (LS) • Dual thread, dual issue, in order • VMX unit + scalar DP FMA • SPE (Synergistic Processing Element) • 7 cycle in-order dual SIMD pipelines • Single Precision • 4 FMA datapaths, 6 cycle latency • 25.6 Gflop/s per SPE, 204.8 Gflop/s overall peak • Double Precision • 1 FMA datapath, 13 cycle latency • 13 cycle pipeline doesn’t fit in a 7 cycle forwarding network, so a 6 cycle stall is inserted after issue for correctness = 1.83 Gflop/s per SPE, 14.6 Gflop/s overall peak • One DP issue every 7 cycles (1/14th of SP peak) • Dual issue of DP instructions is prohibited = 1.6 Gflop/s per SPE (12.8 Gflop/s overall) • For streaming apps (loads, permutes, etc) one issue every 8 cycles • DP is obviously not a first-class citizen in Cell microarchitecture • Software managed branch hints
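The peak rates quoted on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that one SP SIMD FMA counts as 8 flops (4 lanes, multiply plus add) and one DP SIMD FMA as 4 flops (2 lanes, multiply plus add); the function name `peak_gflops` is illustrative only.

```python
# Peak-rate arithmetic for one SPE and for the 8-SPE chip.
# Assumption: flops per SIMD FMA are 8 in SP (4 lanes) and 4 in DP (2 lanes).

CLOCK_GHZ = 3.2
NUM_SPES = 8

def peak_gflops(flops_per_instr, cycles_per_issue):
    """Gflop/s for one SPE issuing one SIMD FMA every cycles_per_issue cycles."""
    return CLOCK_GHZ * flops_per_instr / cycles_per_issue

sp = peak_gflops(8, 1)         # SP: one SIMD FMA per cycle
dp = peak_gflops(4, 7)         # DP: one SIMD FMA every 7 cycles
dp_stream = peak_gflops(4, 8)  # DP streaming codes: one every 8 cycles

for name, rate in [("SP", sp), ("DP", dp), ("DP streaming", dp_stream)]:
    print(f"{name}: {rate:.2f} Gflop/s per SPE, {rate * NUM_SPES:.1f} Gflop/s chip")
```

The same function with one DP issue every 2 cycles gives 6.4 Gflop/s per SPE, or 51.2 Gflop/s for 8 SPEs.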
Programming Cell • Local store appears to be the SPU’s entire memory space. • However, with DMAs, it can be programmed as a software (user) controlled memory. • A series of DMAs can be issued to transfer data from global store (DRAM) to local store (SRAM), i.e. a remote get • Allows expressing a list of addresses & sizes for amenable algorithms • Analogous to vector loads from DRAM to a large SRAM (vector register file) • Local store has constant 6 cycle latency (no cache misses) • Greatly simplifies performance estimation vs. out-of-order, cache-based • Double buffering allows explicit overlapping of computation and communication • Our work does not benchmark the compiler’s ability to SIMDize • For all critical sections, wrote code with SIMD/quadword intrinsics • Ideal performance, somewhat more work than C, far less work than assembly • Programming overhead: • SpMV: 1 month, 600 lines • Learning programming model, architecture, compiler, tools, algorithm choice • Stencil versions: about 1 week, 450 lines (2 versions) - original 15 lines • Required significant loop unrolling and intrinsics use • SP time-skewed Stencil version: one day, total rewrite 450 lines attained 65 Gflop/s!
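The double-buffering schedule described on this slide can be sketched as follows. This is a minimal illustration, not Cell SDK code: `dma_get` is a hypothetical stand-in for an asynchronous DMA from global store to local store, and in plain Python the copy completes immediately rather than overlapping with the computation.

```python
def dma_get(global_store, start, size):
    """Hypothetical stand-in for an asynchronous DMA from DRAM to local store."""
    return global_store[start:start + size]

def double_buffered_sum(global_store, chunk):
    """Sum all elements, requesting chunk i+1 before processing chunk i."""
    buffers = [None, None]          # two local-store buffers
    total = 0.0
    n_chunks = (len(global_store) + chunk - 1) // chunk
    if n_chunks == 0:
        return total
    buffers[0] = dma_get(global_store, 0, chunk)   # prime the pipeline
    for i in range(n_chunks):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < n_chunks:
            # On real hardware this transfer would run during the compute below.
            buffers[nxt] = dma_get(global_store, (i + 1) * chunk, chunk)
        total += sum(buffers[cur])                 # compute on the current buffer
    return total

print(double_buffered_sum([float(v) for v in range(7)], 3))
```

The point of the pattern is the ordering: the request for the next chunk is issued before the current chunk is consumed, so only the first transfer is exposed.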
Parallel Programming Model • Possible programming paradigms include • Task parallelism, with independent tasks on each SPE • Pipelined parallelism, where large data blocks are passed from one SPE to the next • Similar to Streaming model • Data parallelism: identical operations on distinct data (SPMD) • We examine this as hierarchical SPMD • Simplest and most direct way to decompose the problem • Data parallel paradigm is a good match for many scientific algorithms • Similar to OpenMP or Multistreaming Cray X1(E) parallelism • PPE is used to partition and load balance • PPE is not used for any computation • Allows us to treat the system as a homogeneous parallel machine
Estimation, Simulation and Exploration
Performance Modeling (PM): • Double buffered + long DMAs + in-order machine • Use MAX(static timing analysis, memory traffic modeling) • Latency of operation, issue width limits, operand alignment of SIMD/quadword • DMA load/store overheads (including constraints such as single DRAM controller) • For regular data structures, “spreadsheet” modeling works • Some kernels (SpMV, FFT) require more advanced modeling to capture input data pattern • Iteration performance varies based on matrix non-zero format
Full System Simulator (FSS): • Based on IBM’s Mambo, cycle accurate, includes static timing analyzer, compilers, etc…
Cell+ • DP pipeline is not very important for video games • Redesigned pipeline helps HPC - but increases design complexity and power consumption • Propose modest modification: alternate forwarding network design • How severely does DP throughput of 1 SIMD instruction every 7/8 cycles impair execution? • Cell+ model fully utilizes the DP datapath: 1 SIMD instruction every 2 cycles • Allows dual issuing of DP instructions with loads/stores/permutes/branch • Same SP throughput, frequency, bandwidth and power as Cell
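The MAX rule of the performance model can be illustrated with a small sketch. It is a simplification under stated assumptions: the compute term uses a peak flop rate as a crude substitute for static timing analysis, DMA overheads are ignored, and the function name and the per-nonzero byte count in the example are illustrative, not taken from the slides.

```python
def kernel_time(flops, bytes_moved, gflops_rate, gbytes_per_s):
    """Estimated time in seconds: the slower of compute and memory traffic."""
    compute = flops / (gflops_rate * 1e9)      # stand-in for static timing analysis
    memory = bytes_moved / (gbytes_per_s * 1e9)
    bound = "compute" if compute >= memory else "memory"
    return max(compute, memory), bound

# Illustrative DP SpMV-like case: 2 flops and 12 bytes (8B value + 4B index)
# per nonzero, one million nonzeros, 14.6 Gflop/s and 25.6 GB/s.
t, bound = kernel_time(2e6, 12e6, 14.6, 25.6)
print(f"{t * 1e6:.0f} microseconds, {bound} bound")
```

Because double buffering overlaps the two terms, the model takes their maximum rather than their sum.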
Comparing Processors • Cray X1E world’s most powerful vector processor • Cell performance does not include the Power core • Cell+ 51.2 Gflop/s peak (DP) - 3.5x Cell performance • Impressive potential in performance and power efficiency
Dense Matrix-Matrix Multiplication • GEMM characterized by high comp intensity and regular data access • Expect to reach close to peak on most platforms • Explored two blocking formats: column major and block data layout • Column major: implicit blocking via gather stanzas • Issues: tile size within SPE local store, multiple short DMAs • Maximizes FLOP/byte, reduces TLB misses based on size of blocks • Block data layout (BDL): explicit blocking • Two-stage addressing scheme, requires a single long DMA • Choose a block size large enough so that the kernel is computationally bound • 64² in single precision • Much easier in DP (14x computational time, 2x transfer time) • Future work - Cannon’s algorithm • Reduce DRAM BW by using EIB • Could significantly increase number of pages touched
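The two-stage addressing of a block data layout can be sketched as below. The ordering chosen here (blocks in row-major order, elements row-major within a block) and the requirement that the matrix size be a multiple of the block size are assumptions made for illustration; the slides do not specify them.

```python
def bdl_offset(i, j, n, b):
    """Offset of element (i, j) of an n x n matrix stored in b x b blocks."""
    blocks_per_row = n // b                          # assumes n is a multiple of b
    block_id = (i // b) * blocks_per_row + (j // b)  # stage 1: which block
    within = (i % b) * b + (j % b)                   # stage 2: where inside it
    return block_id * b * b + within

def to_bdl(a, n, b):
    """Repack a nested-list matrix into block data layout."""
    out = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            out[bdl_offset(i, j, n, b)] = a[i][j]
    return out

a = [[4 * i + j for j in range(4)] for i in range(4)]
print(to_bdl(a, 4, 2))
```

Each b x b block occupies one contiguous range of b*b elements, which is why a block can be fetched with a single long transfer instead of one short transfer per row or column.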
GEMM - Results • CellPM represents results from our analytical performance model • IBM’s published hardware numbers come very close to these (within 3%) • Cell results compared with highly optimized vendor GEMM libraries • Impressive performance results and power efficiency (>200x power efficiency vs IA64!) • SP: 69x, 26x, 7x faster than X1E, IA64, AMD64 • DP: 0.9x, 3.7x, 2.7x vs X1E, IA64, AMD64 • Cell+ approach improves DP performance 3.5x w/ modest architectural mods • 50 Gflop/s!
Sparse Matrix-Vector Multiplication • SpMV is the most expensive step in iteratively solving the sparse linear/eigen systems arising from PDEs • Poses performance challenge for cache-based systems due to: • Low computational intensity and irregular (indexed) data accesses • Potential challenge on Cell: no caches or word-granularity gather/scatter support • Potential advantages: • Low functional unit and local store latency • Task parallelism of 8 SPEs, 8 independent load/store units • Ability to stream nonzeros via DMA • Local store is not a write-back cache: overwrite temps without DRAM BW usage • Cell implementation work examines CSR or Block CSR • SIMDization • Requires all row lengths to be a multiple of 4 (simplifies quadword alignment) • Explicitly cache block columns • Exploit spatial locality within the local store • Implicitly cache block the rows • Cache block parallelization strategies: • Partition by rows: potential load imbalance (depends on matrix structure) • Partition by nonzeros: each SPE contains a copy of the source + reduction across SPEs • Double buffer nonzeros • Overlaps computation and communication • Requires restarting in the middle of a row
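For reference, a plain CSR sparse matrix-vector product and a simple load-balancing helper can be sketched as follows. The helper is a simplification of the strategies listed above: it splits at row boundaries while balancing nonzero counts, so it avoids the mid-row restart and cross-SPE reduction that a true partition by nonzeros needs. All names are illustrative, and no SIMD padding is modeled.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A * x for a matrix A stored in compressed sparse row format."""
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += vals[k] * x[col_idx[k]]
    return y

def balance_rows_by_nonzeros(row_ptr, parts):
    """Row boundaries giving each part roughly nnz/parts nonzeros."""
    n_rows = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    r = 0
    for p in range(1, parts):
        target = nnz * p / parts
        while r < n_rows and row_ptr[r + 1] <= target:
            r += 1
        bounds.append(r)
    bounds.append(n_rows)
    return bounds

row_ptr = [0, 2, 3, 6, 8]
col_idx = [0, 2, 1, 0, 1, 2, 0, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(spmv_csr(row_ptr, col_idx, vals, [1.0] * 4))
print(balance_rows_by_nonzeros(row_ptr, 2))
```

Part p would then process rows `bounds[p]` up to but not including `bounds[p + 1]`.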
SpMV - example figure (panels labeled SPU0 through SPU7) • Explicitly choose column blocking via cost function • Cache block perimeter is fixed (Local Store) • What is optimal r x c? • Parallelize across SPUs • Cost function of execution time • Rows + NZ • Partially double buffer row pointers to find structure • Completely eliminate empty blocks • Prune empty rows
SpMV - FSS Implementation • Use performance model estimates to guide actual implementation • In DP row lengths must be even (QW aligned) • No BCSR software implementation yet • Parallelization • Dynamically analyze costFunction(rows,NZs) ~ execution time • Runtime blocking • Cost function based • LS=256KB=32K doubles, max column block = 32K • Only need a 15b relative index to store absolute column index (not 32b) • Runtime search for structure • Empty cache blocks, search for first non-empty row • Itanium/AMD version: highly optimized OSKI auto-tuner used • Cray version best known to date: optimized CSRP and Jagged Diagonal • Examined suite of (un)symmetric matrices from real numerical calculations
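The 15-bit relative index follows from the column block width: 32K columns is 2^15, so an offset from the start of the block fits in 15 bits. The sketch below stores offsets in 16-bit array slots; the function names and the choice of container are illustrative assumptions, not the authors' data structure.

```python
from array import array

LS_BYTES = 256 * 1024
MAX_BLOCK_COLS = LS_BYTES // 8      # 32K doubles, the maximum column block width

def compress_columns(col_idx, col_start):
    """Store column indices as offsets from the start of the cache block."""
    rel = array("H")                # 16-bit slots; each value needs only 15 bits
    for c in col_idx:
        offset = c - col_start
        if not 0 <= offset < MAX_BLOCK_COLS:
            raise ValueError("column lies outside this cache block")
        rel.append(offset)
    return rel

def expand_columns(rel, col_start):
    """Recover the absolute column indices."""
    return [col_start + r for r in rel]

cols = [65536, 65540, 98303]
packed = compress_columns(cols, 65536)
print(list(packed), expand_columns(packed, 65536))
```

Halving the index width relative to a 32-bit absolute index reduces the memory traffic per nonzero, which matters for a bandwidth-bound kernel.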
SpMV - Results *Unsymmetric kernel used on symmetric matrix, FSS = Full System Simulator, PM = Performance Model • Cell DP achieves impressive 6-8x speedup vs AMD/Itanium, 20x power efficiency • Even though mem BW is only 4x higher, Cell achieves a larger performance gain via double buffering • Multicore systems will not see a performance increase w/o improved mem bandwidth • Cell outperforms X1E by around 2x, and is 5x more power efficient • X1 performance much more sensitive to #NNZ (affects vector length) • PM very close to FSS performance using static implementation - confirming PM accuracy • However FSS is 30% faster due to dynamic partitioning
SpMV - Future Optimizations • Auto-tuning • Other parallelization strategies • BCSR (better for SIMD, worse for memory traffic) • Other storage formats (DIA/JAG/etc…) • Symmetry (currently only present in the performance model) • Easier to exploit in single precision & w/BCSR • Cache blocking limits benefit (~50%) • Segmented Scan • Reduces loop overhead at the expense of nonzero processing time • Good if NZ/Row (within a cache block) is small • Single segment (Vector Length=1) would be beneficial • Make runtime decision for a given cache block • Complicated by presence of empty rows within a cache block
Stencils on Structured Grids • Stencil computation codes represent a wide array of scientific applications • Each point in a multidimensional grid is updated from a subset of its neighbors • Finite difference operations used to solve complex numerical systems • We examine simple heat equation and 3D hyperbolic PDE • Relatively low computational intensity results in low % of peak on superscalars • Memory bandwidth bound • Algorithm requires keeping 4 planes in local store to optimize performance • (Z-1,t), (Z,t), (Z+1,t) -> (Z,t+1) • Cell approach utilizing double buffering: previous output with next input • (Z-1,t+1) & (Z+2,t) • Cell algorithm virtually identical to that on traditional architectures • ISA forces explicit memory loads/stores rather than cache misses and evictions • Parallelization - process one plane at a time • Break middle loop up among SPEs (divide each plane 8 ways) • Maintains long DMAs (unit-stride direction) and double buffering in Z direction • Computational intensity drops to decrease complexity • SIMDization: permutes required to pack left & right neighbors into a SIMD register • Neighbor communication is a poor fit for aligned quadword load requirements • Potential Cell bottleneck: unaligned loads emulated w/ permute instruction • Problem can be partially avoided with data padding • For SP, permute is the second bottleneck after BW, not the FPU
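The plane-by-plane sweep, in which three input planes at time t produce one output plane at time t+1, can be sketched for a 7-point heat stencil. The coefficient `alpha`, the fixed boundary treatment, and the nested-list grid are assumptions for illustration; DMA, SIMD and the 8-way split of each plane are not modeled.

```python
def heat_step(grid, alpha):
    """One Jacobi time step of a 7-point heat stencil; boundaries are held fixed."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    new = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        # Three input planes: (Z-1,t), (Z,t), (Z+1,t)
        below, here, above = grid[z - 1], grid[z], grid[z + 1]
        out = new[z]                                 # output plane (Z,t+1)
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (below[y][x] + above[y][x] +
                             here[y - 1][x] + here[y + 1][x] +
                             here[y][x - 1] + here[y][x + 1])
                out[y][x] = here[y][x] + alpha * (neighbors - 6.0 * here[y][x])
    return new

grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][1] = 1.0
print(heat_step(grid, 0.1)[1][1][1])
```

Advancing z by one reuses two of the three input planes, so only one new input plane has to be fetched per output plane.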
Stencils - Time Skewing • Low computational intensity limits performance - memory bandwidth • Time Skewing: multiple steps combined to increase performance • Increased computational intensity with almost no additional memory traffic • Note some numerical methods are not amenable to merging multiple timesteps
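A rough way to see why merging time steps helps is to count flops per byte of DRAM traffic. The sketch assumes each grid point is read once and written once per pass regardless of how many steps are merged, ignores the extra boundary traffic, and uses 8 flops per point purely as an illustrative figure.

```python
def stencil_intensity(flops_per_point, bytes_per_value, steps):
    """Flops per byte of DRAM traffic when `steps` time steps are merged."""
    traffic_per_point = 2 * bytes_per_value   # read the input once, write the output once
    return steps * flops_per_point / traffic_per_point

for steps in (1, 2, 4):
    print(steps, stencil_intensity(8, 4, steps))
```

Under these assumptions the intensity grows linearly with the number of merged steps while the memory traffic stays constant.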
Stencils - Results • In DP Cell is computationally bound in a single time step • In SP 2 timesteps are required before computationally bound • Permute unit quickly becomes fully utilized (quadword alignment) • In SP Cell achieves 6.5x, 11x, 20x compared w/ X1E, Itanium2, Opteron • Using time skewing achieves 66 Gflop/s! 60x faster and 130x more power efficient vs Opteron • Note stencil code highly optimized on cache-based platforms • In DP Cell achieves 2x, 7x, 14x compared w/ X1E, Itanium2, Opteron • Even though DP peak throughput is only 1/14th (!) compared with SP • Note Cell prohibits double issue of DP with loads or permutes • For codes with streaming behavior, one DP SIMD instruction every 8 cycles • Unlike scalar systems - time skewing can at least double performance on Cell • Cell performance: software controlled memory for codes w/ predictable memory accesses
1D Fast Fourier Transforms • Fast Fourier transform (FFT) is of great importance to a wide variety of applications • One of the main techniques for solving PDEs • Relatively low computational intensity with non-trivial volume of data movement • 1D FFT: Naïve Algorithm - cooperatively executed across the SPEs • Load roots of unity, load data (cyclic) • 3 stages: local work, on-chip transpose, local work • No double buffering (i.e. no overlap of communication and computation) • 2D FFT: 1D FFTs are each run on a single SPE • Each SPE performs 2 * (N/8) FFTs • Double buffer (2 incoming and 2 outgoing) • Straightforward algorithm (N² 2D FFT): • N simultaneous FFTs, transpose, • Transposes represent about 50% of SP execution time, but only 20% of DP • Cell performance compared with highly optimized FFTW and vendor libraries
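The structure of a 2D FFT built from independent 1D FFTs and transposes can be sketched in a few lines. This is a plain recursive radix-2 FFT for power-of-two sizes, chosen for brevity; it is not the implementation measured in the slides, and the final transpose back to the original orientation is an assumption about the output layout.

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft2d(m):
    """N row FFTs, transpose, N more row FFTs, transpose back."""
    rows = [fft(r) for r in m]                  # each row is an independent 1D FFT
    cols = [fft(c) for c in transpose(rows)]
    return transpose(cols)

ones = [[1.0] * 4 for _ in range(4)]
print(abs(fft2d(ones)[0][0]))
```

Each pass consists of N independent row transforms, which is what allows them to be distributed across SPEs, with the transposes accounting for the data movement between passes.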
FFT - Results • FSS implementation in progress • In SP Cell is unparalleled - 91x faster and 300x more power efficient vs Itanium2! • Cell+ offers significant performance advantage (2.5X versus Cell) • Cell+ 2D FFT: 30x faster than Itanium2, 23x AMD2, >2x X1E • Cell DP performance approx equal to X1E (simple Cell implementation) • Cell performance underscores advantage of software controlled memory • Does not suffer from associativity issues of cache architectures (powers of 2) • Effectively fully-associative cache • Opportunity to explicitly overlap communication with computation
Summary I • Our work presents the broadest quantitative study of scientific kernels on Cell to date • Developed analytic framework to predict performance and validated its accuracy • Kernel times predictable, as load time from LS is constant • Results show tremendous potential of Cell for scientific computing • Proposed Cell+: modest microarch variant designed to improve DP performance • Fully utilizable DP pipeline greatly improves power and efficiency • Cell’s heterogeneous multicore seems more promising than emerging CMP designs • CMP w/ 2 cores < 2x performance, compare with 10-20x on Cell • Would need serious increase in power and memory bandwidth to scale up • Of course ultimately need to evaluate full application performance
Summary II • Cell’s 3 level software controlled memory decouples L/S from computation: • Extremely predictable kernel performance • Long DMA transfers achieve high % of mem BW (like fully engaged prefetch) • Ability to decouple gather (ex. stream nonzeros) - large # concurrent mem fetches • For predictable memory access can overlap comp with memory (double buffer) • Future designs benefit from larger local store and lower DMA startup • Disadvantages: • Increased programming complexity to specify local memory movement • Lack of unaligned load support: additional instructions necessary to permute • Permute pipeline can become bottleneck (Stencil SP example) • Future work: Real Cell hardware, more algorithms, comparisons with modern CMPs