Emotion Engine™ AKA the “PlayStation 2” Architecture, or the progeny of a MIPS and a DSP. By Idan Gazit – June 2002
Overview • Based around a modified and extended 64-bit MIPS core (the R5900). • Designed from the ground up to run “media applications” (read: games) VERY fast – but can function as a general-purpose CPU. • Bears much resemblance to DSPs (Digital Signal Processors) – more on this later.
Basic Layout – Parallelism is Key! • MIPS CPU core • 1 FPU (floating-point) coprocessor • 2 VUs (Vector Units) – more on these later • Graphics Interface unit (GIF) – passes rendered data on to the Graphics Synthesizer, which does the work of actually “drawing” it to the screen. • 128b-wide main bus • 10-channel DMA controller
The Nitty-Gritty • The main job of the EE is to produce, for every frame, a “display list” – a list of geometry (points, polygons, textures) and where it needs to be placed on the screen. • All of this needs to happen very fast, hence the very wide data paths (128b main bus, plus additional “private” links between certain units). • Also a 10-channel DMA controller – the CPU shouldn’t waste time on I/O. Multiple connections between different units allow more than one I/O transaction at once, so long as they’re on different buses.
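A minimal C sketch of what one display-list entry might hold. The struct and field names are purely illustrative – this is not the actual packet format the EE and GIF use – it just shows the kind of data a display list carries: geometry plus where and how to draw it.

/* Hypothetical display-list entry -- illustration only, NOT the real
 * GS packet format. */
typedef struct {
    float    vertices[3][4];     /* a triangle: three x,y,z,w positions */
    float    uv[3][2];           /* texture coordinates for each vertex */
    unsigned texture_id;         /* which texture to sample             */
    unsigned screen_x, screen_y; /* placement on screen                 */
} DisplayListEntry;

/* A frame's display list is simply an array of such entries that the EE
 * builds in memory and the DMA controller streams to the GIF. */
typedef struct {
    DisplayListEntry *entries;
    unsigned          count;
} DisplayList;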
The CPU • Honest, it’s just a plain MIPS with some minor extensions. • 32 x 128b general-purpose regs • 2 x 64b ALUs (Arithmetic Logic Units) • 1 x 128b load/store unit (parallelism again – load/store 4 words at once) • 1 branch execution unit • 2 coprocessors: FPU and VU0 – proper MIPS coprocessors controlled by COP instructions!
The CPU • Able to do two 64b integer ops per cycle, or one 64b integer op and one 128b load/store. • The ALUs are interesting: they are pipelined, and can be used in two ways: • Separately, as in normal CPUs (2 x 64b ops) • Locked together, to perform one 128b instruction: • 16 x 8b ops in one cycle • 8 x 16b ops in one cycle • 4 x 32b ops in one cycle
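To make the “locked” mode concrete, here is a small C sketch of what one 128b operation amounts to: the same add applied to every 32b lane of a 128b register. The union and function names are illustrative, not real EE instructions.

#include <stdint.h>

/* A 128b general-purpose register viewed as lanes of different widths. */
typedef union {
    uint8_t  b[16];   /* 16 x 8b lanes  */
    uint16_t h[8];    /*  8 x 16b lanes */
    uint32_t w[4];    /*  4 x 32b lanes */
} Reg128;

/* What a single "locked" 128b add does, written out as a scalar loop.
 * On the EE the two locked 64b ALUs cover all four 32b lanes in one cycle. */
Reg128 padd_w(Reg128 a, Reg128 b)
{
    Reg128 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = a.w[i] + b.w[i];   /* each lane is independent */
    return r;
}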
The CPU • Example supported instructions: • MUL/DIV instructions • 3-op MUL/MADD instructions • Arithmetic ADD/SUB instructions • Pack and extend instructions • Min/Max instructions • Absolute-value instructions • Shift instructions • Logical instructions • Compare instructions • Quadword load/store (remember, the 128b L/S unit)
The CPU • 8k data / 16k instruction cache, 2-way set associative • 6-stage pipeline (shallow, compared to modern PC architectures) • Speculative execution possible, but the penalty for a branch miss isn’t bad because it’s a short pipeline. • Pipeline stages: 1. PC select, 2. Instruction fetch, 3. Instruction decode and register read, 4. Execute, 5. Cache access, 6. Writeback
The CPU • 16k of SPRAM – “Scratch Pad” RAM – VERY VERY FAST. • Located in the CPU core. • What is this stuff? It is a very fast on-chip work area shared by the CPU and VU0. • The 128b “private” link between the CPU and VU0 allows VU0 to use the SPRAM and the CPU to directly reference the VU’s registers. • Which leads us nicely to the fact that the really difficult work is performed by…
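A rough C sketch of how the scratchpad might be used as a staging area for VU0. The 0x70000000 base address is the commonly documented EE scratchpad mapping, but treat it – and the helper function – as assumptions for illustration, not code from the slides.

/* Stage a batch of vertex data in the 16k scratchpad before letting
 * VU0 process it over the private 128b link.  Illustrative only. */
#define SPRAM_BASE  ((float *)0x70000000)           /* assumed mapping */
#define SPRAM_WORDS (16 * 1024 / sizeof(float))

void stage_batch(const float *src, unsigned n_floats)
{
    /* copy at most one scratchpad's worth of data into SPRAM */
    for (unsigned i = 0; i < n_floats && i < SPRAM_WORDS; i++)
        SPRAM_BASE[i] = src[i];      /* fast on-chip writes */
    /* ... then kick off VU0 to process the staged data ... */
}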
Vector Units: The Heart of the EE • FMAC: Floating-Point Multiply-Accumulate • As it turns out, this operation is critical to 3D rendering, and is performed many times in tight loops. • An obvious candidate for parallelism and pipelining! • Between the two VUs and the FPU there are a total of 10 FMAC units, each able to do 1 FMAC per cycle – plus other useful instructions.
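Why this one operation dominates: transforming a vertex by a 4x4 matrix is nothing but repeated multiply-accumulates. A scalar C sketch – illustrative only; real VU code runs four lanes of this at a time on the parallel FMACs.

/* Multiply a 4-element vertex by a 4x4 transform matrix. */
void transform_vertex(const float m[4][4], const float v[4], float out[4])
{
    for (int row = 0; row < 4; row++) {
        float acc = 0.0f;
        for (int col = 0; col < 4; col++)
            acc += m[row][col] * v[col];   /* one multiply-accumulate per term */
        out[row] = acc;
    }
    /* 16 FMACs per vertex, times thousands of vertices per frame:
     * exactly the tight loop this hardware is built for. */
}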
Example VU “Useful Instructions” • FMAC: 1 cycle • Min/Max: 1 cycle • FDIV – a separate execution unit, 1 per VU: • Floating-point divide: 7 cycles • Square root: 7 cycles • Inverse square root: 13 cycles
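A rough cycle-accounting example built from the latencies above: normalizing a 3D vector. The function and the counts are a sketch – real VU code overlaps these units, so the actual cost is lower.

#include <math.h>

/* Normalize a 3-element vector; comments give ballpark cycle costs
 * using the per-instruction latencies listed above. */
void normalize(float v[3])
{
    float len2 = v[0]*v[0] + v[1]*v[1] + v[2]*v[2]; /* ~3 FMACs -> ~3 cycles   */
    float inv  = 1.0f / sqrtf(len2);                /* inv. square root -> 13  */
    for (int i = 0; i < 3; i++)
        v[i] *= inv;                                /* ~1 cycle on 4-wide FMAC */
    /* ballpark total: under ~20 cycles per vector, ignoring overlap */
}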
Vector Units • However, there are differences between the two VUs and how they are utilized. • Both are VLIW (Very Long Instruction Word) – each long instruction word bundles multiple operations. • The processing units are split into two “working groups”: • Group 1: CPU + FPU + VU0 (“Emotion Synthesis” on the diagram) • Group 2: VU1 + GIF (“Geometry Processing” on the diagram)
Group 1 • Here, the FPU and VU0 act as proper MIPS coprocessors, and are linked to the CPU by a private 128b wide bus to avoid crowding the main bus. • FPU is nothing special, just another FPU coprocessor. 1 FMAC unit, 1 FDIV unit, each identical to VU FMAC/FDIV. • VU0 does the real heavy lifting when it comes to the math; the CPU acts as more of a traffic director in feeding data as fast as it can to the VU for processing.
Group 1 • Although group 1 does geometry processing, it is also responsible for more general-purpose calculations, such as enemy AI, game physics, etc. • Therefore group 1 has the (more generalized) CPU, whereas group 2 focuses only on geometry (and has only VU1 and the GIF) • Definite hierarchy of control in group 1 – CPU controls FPU and VU0.
Group 1 – Vector Unit 0 • 32 x 128b FP registers, each holds 4 x 32b single-precision FP numbers. • 16 x 16b integer regs for int math • Instructions are just standard 32b “COP” (coprocessor) instructions • Data is passed from CPU in 128b bundles, which the VIF (VU Interface) “unpacks” into 4x32b data words. • 8k each for data cache/inst cache
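A small C sketch of what one 128b quadword looks like once the VIF has unpacked it into a VU0 register image: four 32b floats, typically an (x, y, z, w) group. The union and function are illustrative, not Sony's API.

/* One 128b VU register image: four 32b single-precision floats. */
typedef union {
    struct { float x, y, z, w; } v;   /* the four 32b fields        */
    unsigned char raw[16];            /* the same 128b as raw bytes */
} Quadword;

/* Treat a raw 128b bundle arriving from the CPU as a 4-float register image. */
Quadword unpack_v4_32(const unsigned char bundle[16])
{
    Quadword q;
    for (int i = 0; i < 16; i++)
        q.raw[i] = bundle[i];   /* byte-for-byte copy into the register image */
    return q;
}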
Group 2 • Consists of VU1 and the GIF (Graphics Interface). • VU1 acts like a standalone VLIW processor, and is not directly controlled by the CPU. • Perhaps a better name for VU1 is the “geometry processor” for the GIF – this is pure data processing, and it has to happen quickly to keep the GIF saturated with graphics to draw out to your TV.
Group 2 – Vector Unit 1 • Same general features as VU0, but with some differences according to VU1’s role: • Addition of an “EFU” (Elementary Functional Unit) – basically one FMAC and one FDIV unit doing the more rudimentary geometry calculations. Note a striking resemblance to the FPU from group 1… • 16k each of data & inst cache, up from 8k – since VU1 must handle geometry independently of the CPU, it ends up handling much more data than VU0.
Group 2 – Vector Unit 1 • Special direct connection between data cache and the GIF. • Why is this special? VU1 can work on a display list in cache and have it sent over to the GIF by DMA. Quicker than using the main bus to shuttle data around, less dependent on CPU, and leaves the main bus free for load instructions.
Vector Unit Comparison • The designers opted for flexibility, and thus the architecture is slightly confusing: • VU0 is a coprocessor; VU1 is a VLIW mini-processor. • BUT… VU0 can be switched into VLIW mode, where the CPU then communicates with it like VU1 (e.g. receiving 64b instruction “bundles” and parsing them with the VIF).
Vector Unit Instructions • We really should treat the VUs as limited processors. • Each 64b VLIW word breaks down into two 32b COP instructions: an “upper” instruction and a “lower” instruction. • The upper/lower distinction is important; the types of work they do are different.
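A tiny C sketch of that split: one 64b word divided into its upper and lower 32b halves. The exact bit layout shown here is an assumption for illustration – only the upper/lower split itself comes from the slides.

#include <stdint.h>

/* One 64b VLIW word as its two 32b halves. */
typedef struct {
    uint32_t upper;   /* SIMD instruction for the FMAC/FDIV units              */
    uint32_t lower;   /* utility instruction (load/store, branch, RNG, EFU...) */
} VliwPair;

VliwPair split_vliw(uint64_t word)
{
    VliwPair p;
    p.upper = (uint32_t)(word >> 32);         /* assumed: high 32 bits */
    p.lower = (uint32_t)(word & 0xFFFFFFFFu); /* assumed: low 32 bits  */
    return p;
}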
Vector Unit Instructions • Upper Instructions: SIMD (Single Instruction – Multiple Data) instructions • Aptly named – these are the “fast” multimedia instructions that do the same operation on lots and lots of data. • Logically, these types of instructions are handled by the special VU units: FMAC, FDIV, etc. • Note that these instructions ONLY use the “special” units in each VU.
Vector Unit Instructions • Lower instructions: non-SIMD type • More “utility” than processing: • Load/store instructions • Jump/branch instructions • Random number generation • EFU instructions (only in VU1 – remember, 1 FMAC and 1 FDIV). • Note that these instructions use units in the VUs that I didn’t mention (RNG unit, load/store unit, etc.) – they’re the more “mundane” units for the more “mundane” tasks.
Flow of Execution • So with all of this confusing flexibility, what do we get? • Two ways of doing work: • Group 1 & Group 2 both render in parallel, both passing on display lists to the GIF • Group 1 (CPU,VU0,FPU) prepares instructions for VU1 – load/store, branching, etc – which VU1 renders and passes on to the GIF.
Flow of Execution • [Diagrams: Method 1 (parallel), Method 2 (serial)]
DSPs, PS2s and PCs, oh my! • Essentially, the PS2 (like a DSP) performs a small number of instructions on a large amount of “uniform” data. • Exactly the opposite of PCs – which perform large numbers of instructions on varying data. • Side-effect bonus: good “locality of reference” – instructions on the PS2 don’t jump around as much as on PCs, so there is less chance of cache misses or branch mispredictions.
DSPs, PS2s and PCs, oh my! • Note the design decisions that promote data-intensive computing: • Wide buses, and private connections between units that move a lot of data. • VLIW – instructions come packaged with lots and lots of data. • Large registers and load/store units, geared towards SIMD-style work (e.g. one 128b load moves 4 words of data at once). • MASSIVE ability to calculate inner-loop operations (FMAC) in ONE CYCLE – there are 10 FMAC units, so 10 of these can be done every cycle. Even FDIVs are fast (7 cycles).
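To put a rough number on that: assuming a clock of about 295 MHz for the EE (a commonly cited figure, not stated in these slides), 10 single-cycle FMAC units give roughly 10 × 295×10⁶ ≈ 3 billion FMACs per second – on the order of 6 GFLOPS of peak throughput if each FMAC is counted as a multiply plus an add.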
Conclusion • The entire EE design is centered around a specialized purpose: games! It can run general-purpose apps, but with a penalty. • How much of a penalty? Interesting question. Perhaps not much, because there is a general-purpose MIPS core at the heart of it. • More similar in design to a DSP – a small, fixed set of instructions performed on large amounts of uniform data.
The End & References • http://www.arstechnica.com/reviews/1q00/playstation2/ee-1.html • http://www.arstechnica.com/cpu/2q00/ps2/ps2vspc-1.html • http://www.scea.com/news/press_example.asp?ps2=ps2&ReleaseID=9 • http://users.ece.gatech.edu/~scotty/7102/pres/5 • http://www.eecg.toronto.edu/~stoodla/processors/Sony/EmotionEngine.html • http://ntsrv2000.educ.ualberta.ca/nethowto/examples/m_ho/ps2eengine.html • http://www.geocities.com/SiliconValley/Bay/6114/cpu2.html