Introduction to the Cell Multiprocessor
J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy
IBM Systems and Technology Group
IBM Journal of Research and Development Vol. 49, No. 4/5, Pg. 589 (Jul-Sep 2005)
Presented by John Ingalls, ECE 259 - April 8, 2010
Design Summary: PPE
• ISA: 64-bit IBM Power Architecture with SIMD.
• 1 PPE, 8 SPEs, 1 memory controller, and 1 I/O controller, all on a coherent bus (single address space).
• Power Processor Element (PPE): 2-issue, in-order, 2-thread SMT; 32KB L1 I$/D$; 512KB L2$ with software management hooks; 128-bit total SIMD width; Vector/SIMD issue queue separate from scalar execution.
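To make the 128-bit SIMD width concrete, the sketch below works out how many elements of a given size fit in one 128-bit vector and models a lane-wise add in plain Python. The names `lanes` and `simd_add` are hypothetical helpers for illustration only; they are not part of any Cell toolchain.

```python
SIMD_WIDTH_BITS = 128  # total SIMD width stated on the slide


def lanes(element_bits):
    """Number of elements of the given size that fit in one 128-bit vector."""
    if element_bits <= 0 or SIMD_WIDTH_BITS % element_bits != 0:
        raise ValueError("element size must evenly divide the SIMD width")
    return SIMD_WIDTH_BITS // element_bits


def simd_add(a, b, element_bits=32):
    """Model one vector add: every lane is added in the same step."""
    n = lanes(element_bits)
    if len(a) != n or len(b) != n:
        raise ValueError(f"expected {n} lanes per operand")
    return [x + y for x, y in zip(a, b)]


# Lane counts for common element sizes.
lane_table = {bits: lanes(bits) for bits in (8, 16, 32, 64)}
```

With 32-bit elements, one vector operation covers four values, which is why data laid out in groups of four floats or ints maps naturally onto this width.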
Design Summary: SPE
• Synergistic Processor Element (SPE): in-order SIMD, 128-bit total width, like the PPE.
• Local Store (LS): 256KB, single-ported; the port serves either 128-bit SIMD-word accesses or 128-byte accesses for instruction fetch and DMA I/O.
• 128-entry register file for static (compiler) instruction reordering.
• Area efficient: 15% control; the rest is execution units and Local Store.
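Because the Local Store is a fixed 256KB and is filled by DMA rather than by a hardware cache, SPE code typically has to stream large data sets through it in chunks. The sketch below simulates one common way to do that, double buffering, in plain Python. It is an illustration under stated assumptions: `dma_get` is a hypothetical stand-in for a DMA transfer (a plain copy here, so nothing actually overlaps), and the 64KB reserved for code and stack is an assumed figure, not one from the paper.

```python
LOCAL_STORE_BYTES = 256 * 1024  # Local Store size from the slide
RESERVED_BYTES = 64 * 1024      # ASSUMPTION: space kept for code and stack


def max_chunk_elems(elem_bytes, reserved_bytes=RESERVED_BYTES):
    """Largest chunk (in elements) such that two buffers fit in the Local Store."""
    return (LOCAL_STORE_BYTES - reserved_bytes) // (2 * elem_bytes)


def dma_get(main_memory, start, count):
    """Hypothetical stand-in for a DMA transfer into a Local Store buffer."""
    return list(main_memory[start:start + count])


def stream_sum_of_squares(main_memory, chunk_elems):
    """Process main memory chunk by chunk, alternating between two buffers."""
    buffers = [dma_get(main_memory, 0, chunk_elems), None]
    current = 0
    offset = 0
    total = 0
    while offset < len(main_memory):
        next_offset = offset + chunk_elems
        if next_offset < len(main_memory):
            # On real hardware this fetch would run while the kernel computes.
            buffers[1 - current] = dma_get(main_memory, next_offset, chunk_elems)
        total += sum(x * x for x in buffers[current])
        current = 1 - current  # swap buffer roles
        offset = next_offset
    return total
```

The point of the two buffers is that the next chunk can be in flight while the current one is being processed, so the chunk size has to be chosen so that both fit alongside code and stack.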
Other Features
• I/O supports direct connection to another Cell, making it easy to build a cache-coherent multiprocessor.
• Native binary compatibility with Power ISA applications.
• Modular design, but still fully custom.
• Extensive test and monitoring circuitry.
Programming
• Challenges:
  • SPE Local Store is software managed.
  • Each SPE supports one thread context, and context switches are expensive.
• Models:
  • Function Offload: the PPE calls a function that runs on an SPE.
  • Device Extension: the SPE is isolated and behaves like a device.
  • Compute Acceleration: the PPE aggregates SPE results.
  • Streaming: each SPE is one stage of a software pipeline.
  • Shared Memory Multiprocessor: conventional.
  • Asymmetric Thread Runtime: pthreads.
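Two of the models listed above, Compute Acceleration and Streaming, can be sketched in ordinary Python to show how their structure differs. This is only an analogy: a thread pool stands in for the eight SPEs, `spe_kernel` is a hypothetical piece of SPE-side work, and nothing here uses a real Cell API.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SPES = 8  # from the slides: one PPE coordinating eight SPEs


def spe_kernel(chunk):
    """Hypothetical SPE-side work: sum of squares over one chunk."""
    return sum(x * x for x in chunk)


def compute_acceleration(data, num_workers=NUM_SPES):
    """PPE role: split the data, dispatch chunks to workers, aggregate results."""
    size = max(1, -(-len(data) // num_workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(spe_kernel, chunks))
    return sum(partials)


def streaming(data, stages):
    """Streaming: each stage stands in for one SPE in a software pipeline."""
    stream = iter(data)
    for stage in stages:
        stream = map(stage, stream)  # output of one stage feeds the next
    return list(stream)
```

In the first model every worker runs the same kernel on a different slice and the coordinator combines the partial results; in the second, each worker runs a different step and data flows from one to the next.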
Good
• Paper is easy to follow and doesn’t throw too much complicated material at the reader.
• Built and shipped on time by a joint venture of IBM, Sony, and Toshiba.
• Many applications in media and supercomputing.
Bad
• The authors keep presenting static limitations imposed by their models, such as explicitly managed caches, as advantages.
• No hard performance data or comparison to the competition. Only “anecdotal evidence” shows that it is possible to fully utilize Cell.
Conclusion / Questions
• Keywords:
  • Heterogeneous multi-core SIMD processor.
  • Single address space across all cores on the chip.
  • 1 conventional PPE for control.
  • 8 SPEs for streaming SIMD: very fast and power efficient when they are actually utilized.
  • Several programming models are feasible.
• Questions:
  • How could the programming models be made easier?
  • In what direction should this architecture grow?