High Performance Linear Transform Program Generation for the Cell BE
Vas Chellappa, Franz Franchetti, Markus Püschel
Electrical & Computer Engineering, Carnegie Mellon University
Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.
Cell Broadband Engine
[Figure: Cell BE chip with SPEs, each with a local store (LS), connected by the EIB to main memory]
• Multicore CPU (8 SPEs + 1 PPE)
• SPEs: SIMD cores designed for numerical computing
• 256 KB “local store” per SPE (scratchpad-like)
• Programmer-driven DMA
• 204 Gflop/s peak
How do we harness the Cell’s impressive peak performance?
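The 204 Gflop/s figure matches the commonly quoted single-precision peak of the Cell BE. The clock rate, SIMD width, and fused multiply-add throughput below are not stated on the slide and are assumptions used only to show where the number can come from:

```latex
\[
8\ \text{SPEs} \times 3.2\ \text{GHz} \times 4\ \text{SIMD lanes} \times 2\ \tfrac{\text{flops}}{\text{lane}\cdot\text{cycle}}
= 204.8\ \text{Gflop/s}
\]
```

Under these assumptions each SPE contributes 25.6 Gflop/s, so code that leaves lanes idle or stalls on DMA falls short of the peak quickly.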
DFT on the Cell BE
[Figure: DFT performance plot; series: Spiral generated (this paper), FFTC, FFTW, Numerical Recipes; annotated gap: 350x]
• Platform-tuned code is 350x faster, but hard to write!
Overview
• Background, Spiral Overview
• Generating DFTs for the Cell
• Performance Results
• Concluding Remarks
Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.
“Fitting” Dataflow to Hardware
[Figure: three dataflow diagrams with numbered stages: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution on Core 0 and Core 1 (multicore)]
• How to map dataflow to architecture automatically?
• To “fit” the DFT to the architecture:
  • Various traversals
  • Various factorizations
“Fitting” Dataflow to Platform (contd.)
[Figure: numbered dataflow stages, with stages assigned to Core 0 and Core 1]
• Intuition: rewrite formulas to obtain suitable dataflow
Program Generation in Spiral
Transform (user specified) → Fast algorithm in SPL (many choices) → ∑-SPL → C code
• Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, …
• Iteration of this process to search for the fastest
• But that’s not all …
Common Abstraction: SPL
• SPL: tensor-product representation
• Example: the Cooley-Tukey fast Fourier transform (FFT)
• Tensor products in SPL represent loop structures
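The Cooley-Tukey example named on this slide is commonly written in SPL as follows. This is the standard form from the Spiral literature, given here as an assumption about what the slide displayed; the names T (diagonal twiddle matrix) and L (stride permutation) follow that literature:

```latex
\[
\mathrm{DFT}_{mn} \;=\; (\mathrm{DFT}_m \otimes I_n)\; T^{mn}_n\; (I_m \otimes \mathrm{DFT}_n)\; L^{mn}_m
\]
% Loop reading of the two tensor products:
%   I_m \otimes DFT_n : m applications of DFT_n on contiguous blocks of length n
%   DFT_m \otimes I_n : n applications of DFT_m on subvectors taken at stride n
% L^{mn}_m reorders the input by reading it at stride m.
```

Read right to left, the formula reorders the input, runs m small DFTs on contiguous blocks, scales by twiddle factors, and then runs n small DFTs at stride n. Each tensor product therefore corresponds to one loop over a smaller kernel.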
Mapping DFTs to the Cell
Objective: high-performance transform library for the Cell BE
[Figure: Cell BE chip with SPEs and their local stores (LS) on the EIB, connected to main memory, with a DFT mapped onto the SPEs]
Cell’s architectural paradigms:
• Vectorization: vectorize the DFT for a given vector length
• Parallelization: parallelize the DFT across p SPEs, using a given DMA packet size
• Multibuffering: optimize the DFT for throughput (s DFTs required)
Tags guide formula rewriting
SPL to Parallel Code
• Natural parallel construct in SPL:
[Figure: y = (block-diagonal matrix with four copies of A) x; each block is assigned to one of Processor 0 to Processor 3]
Independent, load-balanced, communication-free operation
• Parallelizing other constructs in SPL:
  • Permutations require message exchange (on-chip DMA communication)
[Figure: permutation mapping x to y]
Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
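The block-diagonal construct above can be sketched in plain C as a loop whose iterations touch disjoint data. This is an illustrative sketch, not Spiral output: the names `block_kernel`, `dft2`, `run_block`, and `tensor_I_A` are hypothetical, the 2-point butterfly stands in for the kernel A, and the loop runs serially where a real implementation would give each iteration to its own SPE.

```c
#include <stddef.h>

/* Hypothetical block kernel type: y = A x for one block of length n. */
typedef void (*block_kernel)(const float *x, float *y, size_t n);

/* Stand-in for A: the 2-point DFT (butterfly), y = [x0 + x1, x0 - x1]. */
void dft2(const float *x, float *y, size_t n)
{
    (void)n;                      /* block length is fixed at 2 here */
    y[0] = x[0] + x[1];
    y[1] = x[0] - x[1];
}

/* Work for one processor: apply A to block number `proc` only. */
void run_block(block_kernel A, const float *x, float *y,
               size_t n, size_t proc)
{
    A(x + proc * n, y + proc * n, n);
}

/* y = (I_p (x) A) x: p independent applications of A on contiguous
 * blocks of length n. No iteration reads or writes another's data,
 * so the work is load-balanced and needs no communication. */
void tensor_I_A(block_kernel A, const float *x, float *y,
                size_t p, size_t n)
{
    for (size_t proc = 0; proc < p; proc++)
        run_block(A, x, y, n, proc);
}
```

Calling `tensor_I_A(dft2, x, y, 4, 2)` on an 8-element vector applies the butterfly to four adjacent pairs. Constructs that are not block-diagonal, such as permutations, do not split this cleanly, which is why the slide rewrites them into this form plus explicit data exchange.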
SPL to Streaming Code
• Streaming: overlapping computation with communication
  • On-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory)
• Idea: tensor loops become multi-buffered loops
[Figure: y = (block-diagonal matrix with three copies of A) x. In the i-th iteration: write A_{i-1}, compute A_i, read A_{i+1}. (Trickier for other SPL constructs.)]
• Useful for:
  • Throughput-optimized code
  • Large, out-of-chip sizes
Idea: rewrite algorithm at SPL level to achieve largest DMA packets
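The read/compute/write schedule on this slide can be sketched as a double-buffered loop. The sketch below is an assumption-laden illustration: `dma_get` and `dma_put` are hypothetical stand-ins built on `memcpy`, so nothing actually overlaps here, and `scale_kernel` and `BLOCK` are placeholders for the real kernel and packet size. On the Cell the transfers would be asynchronous DMA commands, with a wait on their completion before a buffer is reused.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 4   /* elements per block; hypothetical packet size */

/* Stand-in for the compute kernel A: scales one block by 2. */
void scale_kernel(const float *in, float *out)
{
    for (size_t k = 0; k < BLOCK; k++)
        out[k] = 2.0f * in[k];
}

/* Synchronous stand-ins for asynchronous DMA get/put (assumption). */
void dma_get(float *local, const float *main_mem)
{
    memcpy(local, main_mem, BLOCK * sizeof(float));
}

void dma_put(float *main_mem, const float *local)
{
    memcpy(main_mem, local, BLOCK * sizeof(float));
}

/* Streams nblocks blocks through two input and two output buffers.
 * In iteration i: read block i+1, write result i-1, compute block i. */
void stream_blocks(const float *x, float *y, size_t nblocks)
{
    float in[2][BLOCK], out[2][BLOCK];

    if (nblocks == 0)
        return;

    dma_get(in[0], x);                               /* prologue: read block 0 */

    for (size_t i = 0; i < nblocks; i++) {
        size_t cur = i % 2, other = (i + 1) % 2;

        if (i + 1 < nblocks)
            dma_get(in[other], x + (i + 1) * BLOCK); /* read block i+1 */
        if (i > 0)
            dma_put(y + (i - 1) * BLOCK, out[other]);/* write result i-1 */

        scale_kernel(in[cur], out[cur]);             /* compute block i */
        /* With real DMA: wait here for both transfers before the
         * buffers indexed by `other` are reused in iteration i+1. */
    }

    /* epilogue: write the last result */
    dma_put(y + (nblocks - 1) * BLOCK, out[(nblocks - 1) % 2]);
}
```

The compute step only ever touches the `cur` buffers while transfers use the `other` buffers, which is what lets real DMA run concurrently with the kernel. Larger blocks amortize per-transfer overhead, in line with the slide's aim of the largest DMA packets.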
Generating Cell Code
Transform (user specified) → Fast algorithm in SPL (tag-guided rewriting) → Loop operations in ∑-SPL → Cell-specific optimized C code (intrinsics, DMA, etc.)
• All-to-all communication (on-chip)
• SIMD kernel optimized for memory hierarchy
• Load balanced across p SPEs
• Streamed from memory for throughput
Generated Code Sample
• DFT of size 2^16: 4,000+ lines of code!
Annotations on the sample: vectorized, DMA, parallelized

    /* Complex-to-complex DFT size 64 on 2 SPEs */
    /* T1, T2: temporary buffers (declarations not shown in this sample) */
    void dft_c2c_64(float *X, float *Y, int spuid)
    {
        int i;

        // Block 1: (I x A) L
        for (i = 0; i <= 7; i++)   // Right-most gather
        {
            DMA_GATHER(gath_func(X, i), gath_func(T1, i), 4);   // uses spu_mfcdma()
        }
        spu_mfcstat(MFC_TAG_UPDATE_ALL);   // Wait on gather

        // compute vectorized DFT kernel of size m

        for (i = 0; i <= 7; i++)   // Scatter at interface
        {
            DMA_SCATTER(scat_func(T1, i), scat_func(T2, i), 4);
        }
        all_to_all_synchronization_barrier();   // uses mailbox msgs

        // Block 2: (A x I)
        /* Gather is a no-op since the scatter above accounted for it */

        // compute vectorized DFT kernel of size n

        for (i = 0; i <= 7; i++)   // Left-most scatter
        {
            DMA_SCATTER(scat_func(T1, i), scat_func(Y, i), 4);
        }
        all_to_all_synchronization_barrier();
    }
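The right-most gather in Block 1 realizes the permutation L in (I x A) L. A scalar sketch of such a stride permutation is shown below. It is not the implementation of `gath_func`, which the sample does not show; the function name `stride_perm` is hypothetical, and the index convention (read the input at stride m) is an assumption, since conventions differ between papers.

```c
#include <stddef.h>

/* Stride permutation on a vector of length m*n (assumed convention):
 *   y[j*n + i] = x[i*m + j],  0 <= j < m,  0 <= i < n.
 * Output block j collects the n input elements j, j+m, j+2m, ...,
 * so each block is ready for one contiguous kernel application. */
void stride_perm(const float *x, float *y, size_t m, size_t n)
{
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            y[j * n + i] = x[i * m + j];
}
```

With m = 2 and n = 4, the input 0, 1, ..., 7 becomes 0, 2, 4, 6, 1, 3, 5, 7. In the generated sample the same reordering is expressed as DMA gathers, so the data movement and the permutation happen in one step.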
Problem Space: Options
[Figure: diagrams of DFTs mapped onto SPEs for each option below]
• Base (vectorized): vectorization assumed
• Parallelization:
  • Single DFT parallelized across multiple SPEs
  • Multiple independent DFTs on multiple SPEs
  • Multiple parallelized independent DFTs
• Main memory operations (only for small DFTs):
  • Latency optimized (default)
  • Throughput, multibuffered
Problem Space: Combinations
[Figure: diagrams of DFTs mapped onto SPEs for each scenario below]
Latency-optimized and throughput-optimized usage scenarios:
• Single DFT from main memory
• Parallel, multibuffered DFT
• Independent DFTs multibuffered in parallel
• Devise rewrite rules for tags. Nestings describe all scenarios
[Figure: DFT performance plot for a DFT spread across SPEs; series: 8 SPEs, 4 SPEs, 2 SPEs, 1 SPE]
[Figure: DFT performance plot; series: Spiral (8 SPEs), Spiral (1 SPE), FFTW, FFTC]
• 4.5x faster than FFTW, 1.63x faster than FFTC
More Performance Results
• Single-SPE DFT code
• Split/interleaved complex formats
• Non-2-power sizes
• Double precision (PowerXCell 8i)
[Figure: performance plots; series: Mercury, Spiral, Chow, IBM SDK]
Other Linear Transforms
• Discrete sine and cosine transforms, DFT with real inputs (single-SPE)
• 2-D DFTs
• Out-of-core sizes
• Limited to 2-D DFTs on 1 SPE (for now)
More performance results:
Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of the International Conference on Supercomputing (ICS), 2009.
Conclusion
[Figure: algorithm space and architecture space]
• Automatic generation of transform libraries
  • High performance
  • Variety of scenarios, formats
• High performance on Cell requires:
  • Vectorization, multicore parallelization, streaming, DMA code
  • Future processors likely to have similar paradigms, tradeoffs
• Spiral approach:
  • Common abstraction of transform, algorithm, architecture (SPL)
  • Rewrite rules to go from transform to architecture