IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor

First Consumer Product • PlayStation 3!

Introduction • Developed through a partnership of • Sony Computer Entertainment. • Toshiba. • IBM. • Aim • Highly tuned for media processing. • Meet expected demand for more complex and larger data handling.

What is Cell? • Cell is an architecture for high-performance distributed computing. • It comprises hardware and software cells. • Supports a wide range of single- or multiple-processor and memory configurations.

“Supercomputer” in daily life • Parallelism with high frequency. • Real-time response. • Supports multiple operating systems. • 10 simultaneous threads. • 128 memory requests. • Optimally addresses many different system and application requirements.

Architecture Overview • 8 SPEs with Local Storage (LS). • PPE with its L2 cache. • Internal Element Interconnect Bus (EIB). • Memory Interface Controller (MIC). • Bus Interface Controller (BIC). • Power Management Unit (PMU). • Thermal Management Unit (TMU). • Pervasive Unit.

High Level Diagram

Die Photograph

Synergistic Processing Elements (SPE) (1/2) • Share system memory with the PPE through DMA. • Data and instructions reside in a private real address space backed by a 256 KB LS. • According to IBM, a single SPE can perform as well as a top-end (single-core) desktop CPU given the right task.

Synergistic Processing Elements (SPE) (2/2) • Access main storage by issuing DMA commands to the associated MFC block (asynchronous transfer). • Fully pipelined, 128-bit-wide, dual-issue SIMD. • SPEs in a Cell can be chained together to act as a stream processor.

Power Processor Element (PPE) (1/2) • 32-kB instruction and data cache. • 64-bit “Power Architecture” with 512-kB L2 cache.

Power Processor Element (PPE) (2/2) • Can initiate DMA for an SPE through MMIO control registers. • Hypervisor extension. • Moderate pipeline length.

Element Interconnect Bus (EIB) • Can transfer up to 96 bytes per cycle. • Four 16-byte-wide rings • Two rings going clockwise. • Two rings going counterclockwise. • Separate address and command network. • 12 on/off ramps.

Memory Interface Controller (MIC) • Two 36-bit-wide XDR memory banks. • Can also support just a single bank. • Speed-matching SRAM and two clocks.

Power Reduction • Power Management Unit. • PMU allows software controls to reduce chip power. • Can cause OS to throttle, pause or stop for single or multiple units.

Thermal Monitoring • Thermal Sensors and Thermal Monitoring Unit. • One sensor located at relatively constant temp. location, for external cooling. • 10 DTS at various critical locations.

Optimum Point (1/3) • Triple constraint: Power, Performance, Area. • Gate oxide thickness • Thinner oxide • Higher performance. • Higher gate tunneling too. • Reliability concerns.

Optimum Point (2/3) • Channel Length • Short channel length • Improved performance. • Increased leakage current too. • Supply Voltage • Higher voltage • Improved performance. • Higher AC/DC power.
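
The supply-voltage tradeoff on this slide follows the usual first-order CMOS power model, shown here as a sketch. The symbols (activity factor \alpha, switched capacitance C_L, clock frequency f, leakage current I_leak) are standard textbook notation rather than anything defined in the slides, and mapping "AC" to switching power and "DC" to leakage power is an assumption about the slide's wording.

```latex
P_{\text{total}}
  = \underbrace{\alpha \, C_L \, V_{DD}^{2} \, f}_{\text{AC (switching) power}}
  + \underbrace{V_{DD} \, I_{\text{leak}}}_{\text{DC (leakage) power}}
```

Under this model, raising V_DD increases the switching term quadratically and the leakage term at least linearly, which is why a higher supply buys performance at the cost of both components.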

Optimum Point (3/3) • Wire Levels • Few levels • Increased chip area. • Many levels • More cost.

Final Technology Parameters

Chip Integration • 241M transistors. • 8912 discrete floorplanned blocks. • Custom-tailored nets. • 20 separate power domains.

POWER-CONSCIOUS DESIGN OF THE CELL PROCESSOR’S SPE Osamu Takahashi IBM Systems and Technology Group Scott Cottier Sang H. Dhong Brian Flachs Joel Silberman IBM T.J. Watson Research Center

The CELL Processor - Properties • Mostly static CMOS gates. • Dynamic gates used for timing-critical paths. • Tight coupling of • ISA • Microarchitecture • Physical implementation achieves a compact and power-efficient design.

APPLICATIONS • To name a few (the list is endless) • Image processing for high-definition TV • Image processing for medical use • High-performance computing • Gaming • Flexible enough to be a general-purpose microprocessor that supports high-level-language programming.

Cell processor - Architecture • 64-bit Power core • Eight Synergistic Processor Elements (SPEs) • L2 cache • Interconnection bus • I/O controller • Rambus FlexIO

Architecture contd. • SPE has two clock domains: • one with an 11 FO4 cycle time. • the other with a 22 FO4 cycle time. • The high-frequency domain is implemented with custom design. • The SPE contains • 256 Kbytes of dedicated local store memory. • A 128-bit, 128-entry general-purpose register file with six read ports and two write ports.
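
The storage and clock figures on this slide can be checked with a little arithmetic. The sketch below uses only the numbers quoted above; the function and constant names are illustrative, not from the paper.

```python
# Figures quoted on the slide.
REGISTER_WIDTH_BITS = 128      # width of each general-purpose register
REGISTER_ENTRIES = 128         # number of registers in the file
LOCAL_STORE_BYTES = 256 * 1024 # 256 Kbytes of local store
FAST_CYCLE_FO4 = 11            # high-frequency domain cycle time
SLOW_CYCLE_FO4 = 22            # low-frequency domain cycle time


def register_file_bytes(width_bits: int = REGISTER_WIDTH_BITS,
                        entries: int = REGISTER_ENTRIES) -> int:
    """Total register file capacity in bytes."""
    return width_bits * entries // 8


def cycle_time_ratio(slow_fo4: int = SLOW_CYCLE_FO4,
                     fast_fo4: int = FAST_CYCLE_FO4) -> float:
    """How many fast-domain cycles fit in one slow-domain cycle."""
    return slow_fo4 / fast_fo4


if __name__ == "__main__":
    print("register file:", register_file_bytes(), "bytes")
    print("local store / register file:",
          LOCAL_STORE_BYTES // register_file_bytes())
    print("slow/fast cycle ratio:", cycle_time_ratio())
```

The register file works out to 2 Kbytes, so the local store is 128 times larger, and the 22 FO4 domain has exactly twice the cycle time of the 11 FO4 domain.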

SPE • The SMF operates at half the SPE’s frequency. • The SPE operates at up to 5.6 GHz at a 1.4 V supply and 56 °C. • The SPE’s measured power consumption is in the range of 1 W to 11 W, depending on • Operating clock frequency. • Temperature. • Workload.

Triple design constraints • Cell contains eight copies of the SPE. • Optimization of the SPE’s power and area is critical to the overall chip design. • Conscious effort to reduce SPE area and power while meeting the 11 FO4 cycle time performance objectives. • Optimized design to balance three constraints of • Power. • Area. • Performance. • Tradeoffs to achieve the overall best results • Some techniques used • latch selection. • fine-grained clock-gating scheme. • multiclock-domain design. • use of dual-threshold voltage. • Selective use of dynamic circuits.

Latch selection • Logic has 8-9FO4 time. • Rest of the time used by latches. • Several Latches with various insertion delays used.
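
The latch budget implied by this slide is simple subtraction. The sketch below assumes the 11 FO4 cycle time quoted on the earlier SPE slides; the function names are illustrative.

```python
CYCLE_FO4 = 11  # SPE high-frequency cycle time, from the earlier slides


def latch_budget_fo4(logic_fo4: float, cycle_fo4: float = CYCLE_FO4) -> float:
    """FO4 delays left for latch insertion once logic is accounted for."""
    return cycle_fo4 - logic_fo4


def logic_fraction(logic_fo4: float, cycle_fo4: float = CYCLE_FO4) -> float:
    """Share of the cycle spent doing useful logic."""
    return logic_fo4 / cycle_fo4


if __name__ == "__main__":
    for logic in (8, 9):
        print(f"logic {logic} FO4 -> latch budget "
              f"{latch_budget_fo4(logic)} FO4, "
              f"logic share {logic_fraction(logic):.3f}")
```

With 8 to 9 FO4 of logic, only 2 to 3 FO4 remain for latches, which is why latch insertion delay drives the choice among the latch types that follow.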

Transmission Gate Latch • SPE’s main workhorse latch. • Come in two varieties • Scannable. • Non scannable. • Each has several power levels. • Used almost throughout the SPE.

Pulsed Clock Latch • Non-scannable. • Small insertion delay. • Small area. • Relatively low power consumption. • Used in the most timing-critical and power-critical areas.

Dynamic multiplexer latch • Scannable. • Multiplexing widths from 4 to 10. • Small insertion delay. • Used in timing-critical areas that require multiplexing. • Typical use is in dataflow operand latches.

Dynamic PLA Latch • Scannable latch. • Used to generate control signals (clock-gating signals). • The dynamic multiplexer and dynamic PLA latches use slightly more power. • In exchange, they complete complex tasks within the critical time. • Example of a tradeoff among the triple constraints.

Fine-grained clock gating • Effective method of reducing power, used extensively in the CELL. • Use of local clock buffer (LCB) • Supplies the clock to a bank of latches. • If the enable signal is asserted, the LCB buffers the global clock and sends it to the bank of latches. • SPE activates only the necessary pipeline stages. • Registers are normally turned off. • Functional blocks were simulated and verified. • 50% active power reduction using this design process.
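
The enable-gated local clock buffer can be modeled behaviorally in a few lines. This is a sketch only: `LatchBank`, `LocalClockBuffer`, and `run` are hypothetical names, and the count of clock pulses delivered stands in for clock power.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class LatchBank:
    """A bank of latches that captures data on each clock pulse it receives."""
    value: int = 0
    clock_pulses: int = 0  # proxy for clock power spent on this bank

    def clock(self, data: int) -> None:
        self.value = data
        self.clock_pulses += 1


@dataclass
class LocalClockBuffer:
    """Forwards the global clock to its latch bank only when enabled."""
    bank: LatchBank = field(default_factory=LatchBank)

    def tick(self, enable: bool, data: int) -> None:
        # With enable low, the bank sees no clock edge and holds its value.
        if enable:
            self.bank.clock(data)


def run(enables: Iterable[bool], data_stream: Iterable[int]) -> LatchBank:
    """Drive one LCB for a sequence of global clock cycles."""
    lcb = LocalClockBuffer()
    for enable, data in zip(enables, data_stream):
        lcb.tick(enable, data)
    return lcb.bank


if __name__ == "__main__":
    bank = run([True, False, False, True], [1, 2, 3, 4])
    print("final value:", bank.value, "pulses delivered:", bank.clock_pulses)
```

In this toy run the bank is clocked on two of four cycles, so it receives half the clock pulses of an ungated bank while still holding the correct value when idle.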

Multiple clock frequency domains • High frequency increases performance. • Has some penalties • Higher clock power. • Higher percentage of clock insertion delays. • Shorter distance that a signal can travel. • SPE has some units whose performance does not solely depend on frequency. • SMF operates at half the frequency.

Multiple clock frequency domains • 11 FO4 blocks • Register file. • Fixed point unit. • Floating point unit. • Data forwarding. • Load/Store. • 22 FO4 blocks • Direct memory access unit. • Bus control. • Distribution of one clock to both domains. • SMF activated every second clock cycle.

Multiple clock frequency domains • Avoids physical implementation difficulties. • Helps escape • Latch insertion delay. • Travel distance penalties. • Advantages • Large percentage of clock dedicated to logic. • Most of SMF paths become non-critical. • Smaller transistors can be used. • SMF optimized for both area and power without sacrificing performance.
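
The benefit of the half-frequency domain can be sketched numerically. The latch overhead of 3 FO4 used below is an assumption taken from the latch-selection slide (11 FO4 cycle minus 8 FO4 of logic), and `smf_active` and `logic_share` are hypothetical helper names.

```python
LATCH_OVERHEAD_FO4 = 3  # assumed: 11 FO4 cycle minus 8 FO4 of logic


def smf_active(cycle: int) -> bool:
    """The SMF is activated on every second cycle of the shared clock."""
    return cycle % 2 == 0


def logic_share(cycle_fo4: float,
                latch_overhead_fo4: float = LATCH_OVERHEAD_FO4) -> float:
    """Fraction of a cycle left for logic after latch insertion delay."""
    return (cycle_fo4 - latch_overhead_fo4) / cycle_fo4


if __name__ == "__main__":
    active = sum(smf_active(c) for c in range(8))
    print("SMF active cycles out of 8:", active)
    print("logic share at 11 FO4:", round(logic_share(11), 3))
    print("logic share at 22 FO4:", round(logic_share(22), 3))
```

With the same latch overhead, the 22 FO4 domain devotes a larger share of each cycle to logic than the 11 FO4 domain, which matches the slide's point that most SMF paths become non-critical.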

Dual-threshold-voltage devices • Leakage is a significant portion of power consumption in deep-submicron technology. • Cannot be solved by clock gating or two clock domains. • Use high-threshold-voltage transistors. • Penalty: slower switching time. • Used in paths with enough timing slack. • SMF paths that became non-critical because of the two clock domains were replaced with these.
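
A slack-based assignment rule of the kind described here might look like the sketch below. The 20% delay penalty for high-threshold devices is a made-up illustrative number, not a figure from the paper, and `assign_threshold` is a hypothetical helper.

```python
HIGH_VT_DELAY_PENALTY = 1.2  # assumed slowdown factor, for illustration only


def assign_threshold(path_delay_fo4: float,
                     cycle_fo4: float,
                     penalty: float = HIGH_VT_DELAY_PENALTY) -> str:
    """Pick high-Vt devices when the slowed path still fits in the cycle."""
    slowed_delay = path_delay_fo4 * penalty
    return "high-Vt" if slowed_delay <= cycle_fo4 else "low-Vt"


if __name__ == "__main__":
    # The same 10 FO4 path is critical at 11 FO4 but has slack at 22 FO4.
    print("10 FO4 path, 11 FO4 cycle:", assign_threshold(10, 11))
    print("10 FO4 path, 22 FO4 cycle:", assign_threshold(10, 22))
```

The example shows why the two-clock-domain design helps leakage: a path that must stay on fast devices in the 11 FO4 domain can move to low-leakage devices once it sits in the 22 FO4 domain.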

Selective use of dynamic circuits • Advantages of static circuits over dynamic • Design ease. • Low switching factor. • Tool compatibility. • Technology independence. • Advantages of dynamic circuits over static counterparts • Faster speed due to low capacitance at dynamic nodes. • Larger gains because of inverters after logic. • Microarchitecture efficiency: fewer stages. • Smaller area.

Selective use of dynamic circuits • Dynamic logic requires a clock, which means higher power consumption. • Requires both true and complementary signals. • Static implementation tends to hit the speed wall earlier. • Design approach • Implement logic circuits in static CMOS as much as possible. • Use dynamic alternatives where static circuits did not meet the speed requirements.

Selective use of dynamic circuits • Dynamic circuits have static interfaces. • They occupy 19 percent of the non-SRAM area. • They include the following macros • Dataflow forwarding. • Multiport register file. • Floating-point unit. • Dynamic PLA. • Multiplexer latch. • Instruction line buffer.

SPE hardware measurements • Tested with complicated 3D picture rendering. • The fastest operation ran at 5.6 GHz with a 1.4 V supply at 56 °C. • The global clock mesh’s measured power is 1.3 W per SPE at a 1.2 V supply and 2.0-GHz clock frequency. • The Cell architecture is compatible with the 64-bit Power architecture, so applications can build on existing Power investments. • It can be considered a non-homogeneous coherent chip multiprocessor. • High design frequency has been achieved through a highly optimized implementation. • Its streaming DMA architecture helps enhance the processor’s memory effectiveness. • Refer to the shmoo plot for power analysis.

SPE shmoo plot

Applications of the CELL Processor and Its Potential for Scientific Computing

THE POWER! • FOLDING@HOME broke the Guinness world record for the “world’s most powerful distributed network” with computing power of > 1 PF (a thousand trillion floating-point operations per second). • Blue Gene is 500 TF.

WHY THE POWER? • Cell combines the considerable floating-point resources required for demanding numerical algorithms with a power-efficient, software-controlled memory hierarchy. • Contains a powerful 64-bit dual-threaded IBM PowerPC core and eight proprietary “Synergistic Processing Elements” (SPEs), which are eight more highly specialized mini-computers on the same die. • Cell’s peak double-precision performance (14.6 Gflop/s at 3.2 GHz) is very impressive relative to its commodity peers.
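
The quoted peak double-precision figure can be turned into a per-cycle rate with simple division. The sketch below assumes the 14.6 Gflop/s figure counts only the eight SPEs, which the slide does not state explicitly; the function names are illustrative.

```python
PEAK_DP_GFLOPS = 14.6  # peak double-precision rate quoted on the slide
CLOCK_GHZ = 3.2        # clock frequency quoted on the slide
NUM_SPES = 8           # assumed: the peak figure covers the eight SPEs only


def flops_per_cycle(peak_gflops: float = PEAK_DP_GFLOPS,
                    clock_ghz: float = CLOCK_GHZ) -> float:
    """Chip-wide double-precision flops completed per clock cycle."""
    return peak_gflops / clock_ghz


def flops_per_cycle_per_spe(num_spes: int = NUM_SPES) -> float:
    """Average double-precision flops per cycle for a single SPE."""
    return flops_per_cycle() / num_spes


if __name__ == "__main__":
    print("chip flops/cycle:", round(flops_per_cycle(), 2))
    print("per-SPE flops/cycle:", round(flops_per_cycle_per_spe(), 2))
```

Under that assumption each SPE averages well under one double-precision flop per cycle, which gives some context for the architectural changes proposed to improve double-precision performance.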

OVERVIEW • Quantitative performance comparison of the Cell to AMD Opteron (superscalar), Intel Itanium 2 (VLIW), and Cray X1E (vector). • Minor architectural changes (Cell+) to improve DP performance. • Complexity of mapping scientific algorithms onto the Cell. • A few interesting applications.
