
Programming the Cell Processor: Achieving High Performance and Efficiency

Learn how to achieve high performance and efficiency on the Cell Broadband Engine processor, including details on architecture, processing elements, and optimization techniques. Explore the use of SIMD, DMA, and math libraries for enhanced results.


Presentation Transcript


  1. Programming the Cell Processor: Achieving High Performance and Efficiency
  Jeremy S. Meredith, Future Technologies, Computer Science and Mathematics Division
  Research supported by the Department of Energy's Office of Science, Office of Advanced Scientific Computing Research

  2. Cell Broadband Engine processor: An overview
  • One POWER-architecture processing element (PPE)
  • Eight synergistic processing elements (SPEs)
  • All connected through a high-bandwidth element interconnect bus (EIB)
  • Over 200 GFLOPS (single precision) on one chip

  3. Cell Broadband Engine processor: Details
  One 64-bit PPE:
  • Dual-threaded
  • Vector instructions (VMX)
  Eight SPEs:
  • Dual-issue pipeline
  • Simple instruction set heavily focused on single instruction, multiple data (SIMD)
  • Capable of double precision, but optimized for single
  • Unified 128-entry, 128-bit register file
  • 256-KB fixed-latency local store
  • Memory flow controller with direct memory access (DMA) engine to access main memory or other SPEs' local stores

  4. Genetic algorithm, traveling salesman: Single vs. double precision performance
  [Chart: run time (sec), single vs. double precision, on a Pentium 4 2.8 GHz, the PPE only, 1 SPE, and 1 SPE avoiding the "if" test]
  • The Cell processor has a much higher latency for double-precision results
  • The "if" test in the sorting predicate is highly penalized in double precision
  • Replacing this test with extra arithmetic results in a large speedup

  5. Genetic algorithm, Ackley's function: Using SPE-optimized math libraries
  • Ackley's function involves cos, exp, and sqrt
  • Switching to a math library optimized for the SPEs results in a more than 10x improvement for single precision
  [Chart: run time (sec), log scale — original: 0.681; fast cosine: 0.248; fast exp/sqrt: 0.064; SIMD: 0.047]

  6. Covariance matrix creation: DMA communication overhead
  [Chart: run time using 1 SPE (sec), synchronous vs. overlapped DMA, for scalar and SIMD optimizations]
  • The Cell processor can overlap communication with computation
  • Covariance matrix creation has a low ratio of communication to computation
  • However, even with a SIMD-optimized implementation, the high bandwidth of the Cell's EIB makes this overhead negligible
  Meredith_Cell_0611

  7. Stochastic Boolean SAT solver: Hiding latency in logic-intensive apps
  • PPE: attempting to hide latency manually works against the compiler
  • SPE: manual loop unrolling and instruction reordering achieved speedups, although no computational SIMD optimizations were possible

  8. Support vector machine: Parallelism and concurrent bandwidth
  [Chart: full execution, launch + DMA, and thread launch times vs. number of SPE threads]
  • As more simultaneous SPE threads are added, total run time decreases
  • Total DMA time also decreases, showing that the concurrent bandwidth to all SPEs is higher than to any one SPE
  • Thread launch time increases, but the latest Linux kernel for the Cell system does reduce this overhead

  9. Molecular dynamics: SIMD intrinsics and PPE-to-SPE signaling
  [Chart: full run time and thread launch overhead (sec) for 1 SPE original, 1 SPE SIMDized, and 1, 2, 4, and 8 SPEs; threads launched only on the first call]
  • Using SIMD intrinsics easily achieved 2x speedups in total run time
  • Using PPE–SPE mailboxes, SPE threads can be reused across iterations
  • Thus, thread launch overheads will be completely amortized on longer runs

  10. Conclusion
  • Be aware of arithmetic costs
    • Use the optimized math libraries from the SDK if they help
    • Double precision requires different kinds of optimizations
  • The Cell has very high bandwidth to the SPEs
    • Use asynchronous DMA to overlap communication and computation in applications that are still bandwidth-bound
  • Amortize expensive SPE thread launch overheads
    • Launch once, and signal the SPEs to start the next iteration
  • Use of SIMD intrinsics can result in large speedups
    • Manual loop unrolling and instruction reordering can help even if no other SIMDization is possible

  11. Contact
  Jeremy Meredith, Computer Scientist, Computer Science and Mathematics Division
  (865) 241-5842
  jsmeredith@ornl.gov
