Explore high-performance CPUs like DEC Alpha 21164 and Marvell Embedded CPU in this lecture covering data hazards, floating-point pipelines, and solutions for structural hazards in modern processors.
www-inst.eecs.berkeley.edu/~cs152/ CS 152 Computer Architecture and Engineering Lecture 15 -- Advanced CPUs 2014-3-11 John Lazzaro (not a prof - “John” is always OK) TA: Eric Love
DEC Alpha 21164 Top performing microprocessor in its day (1995). 300 MFLOPS in 0.5µ CMOS, @ 300 MHz.
DEC Alpha 21164 Uses techniques we cover in Part I of lecture. Lockup-free cache integration. Use of many functional units. Many instructions issued per cycle (superscalar)
DEC Alpha 21164 Most of chip is cache (in blue). This 4-issue chip was the high watermark for in-order designs. In 2014, in-order superscalar lives in the cost-sensitive sector ...
Chromecast: Web browser in a flash-drive form factor. Plugs into the HDMI port on a TV. Includes a Wi-Fi chip so you can control the browser from your cell phone. Components: 2 GB Flash, ARM CPU (Marvell), 512 MB DRAM, Wi-Fi. Marvell Embedded CPU: In-order dual-core superscalar. $35 retail implies Bill of Materials (BOM) in the $20 range ...
Key Issue: Overcoming data hazards
Read After Write (RAW) hazards. Instruction I2 expects to read a data value written by an earlier instruction I1, but I2 executes “too early” and reads the wrong copy of the data.
Write After Read (WAR) hazards. Instruction I2 expects to write over a data value after an earlier instruction I1 reads it. But instead, I2 writes too early, and I1 sees the new value.
Write After Write (WAW) hazards. Instruction I2 writes over data an earlier instruction I1 also writes. But instead, I1 writes after I2, and the final data value is incorrect.
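The three definitions above can be checked mechanically from register names alone. A minimal Python sketch follows; the `Instr` tuple and the `hazards` helper are illustrative names, not part of the lecture, and the sketch only reports which hazards are possible between an earlier I1 and a later I2, not whether a given pipeline actually exposes them.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, any number of sources."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def hazards(i1: Instr, i2: Instr) -> Set[str]:
    """Return the hazard kinds a later instruction i2 has against an earlier i1."""
    found = set()
    if i1.dest is not None and i1.dest in i2.srcs:
        found.add("RAW")  # i2 reads what i1 writes
    if i2.dest is not None and i2.dest in i1.srcs:
        found.add("WAR")  # i2 overwrites what i1 still has to read
    if i1.dest is not None and i1.dest == i2.dest:
        found.add("WAW")  # both write the same register
    return found


div = Instr("DIV", "R1", ("R2", "R3"))
sub = Instr("SUB", "R1", ("R2", "R3"))
print(hazards(div, sub))  # the DIV/SUB pair used later in the lecture
```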
Key issue: Structural Hazards ... Insufficient register write ports to service all sources every clock cycle. Not every arithmetic unit is fully pipelined. Floating-point pipeline of Alpha 21164 (figure).
Topic #1: CPU side of our hit-over-miss cache ... (“We” == L1 D-Cache controller.) CPU requests a read by placing MTYPE, TAG, MADDR in Queue 1 (from CPU). We do a normal cache access. If there is a hit, we place the load result in Queue 2 (to CPU) ... In the case of a miss, we use the Inverted Miss Status Holding Register.
Integrating queues into the pipeline ... CPU uses 5 bits of TAG to encode the target/source register for LW/SW. A memory pipe splits off from the main pipeline, after the ALU calculates the index.
LockBits: a scoreboard data structure. Each register has a lock bit, initialized to 0. In decode stage, we lock the target register of any LW we issue. In decode stage, we stall any instruction that reads or writes a locked register.
How lock bits are cleared ... When data is returned to the CPU via Queue 2 (to CPU), the CPU writes the data into the register file, and clears the associated lock bit. Dedicated write ports are needed to avoid structural hazards.
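The lock-bit rules on the last two slides (lock on LW issue, stall on any read or write of a locked register, clear when Queue 2 returns data) fit in a few lines. This is a behavioral sketch only; the class and method names are made up for illustration, and the 32-register default is an assumption.

```python
class LockBits:
    """One lock bit per register, all initialized to 0 (unlocked)."""

    def __init__(self, num_regs=32):  # register count is an assumption
        self.locked = [False] * num_regs

    def must_stall(self, srcs, dest=None):
        """Decode-stage check: stall if any register read or written is locked."""
        regs = list(srcs) + ([dest] if dest is not None else [])
        return any(self.locked[r] for r in regs)

    def issue_load(self, dest):
        """Decode stage issues a LW: lock its target register."""
        self.locked[dest] = True

    def load_returned(self, dest):
        """Data came back via Queue 2: register is written, lock bit cleared."""
        self.locked[dest] = False


lb = LockBits()
lb.issue_load(5)                      # LW R5, ... is in flight
print(lb.must_stall(srcs=[5, 6]))     # a reader of R5 must wait
lb.load_returned(5)
print(lb.must_stall(srcs=[5, 6]))     # now free to proceed
```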
Memory semantics and lockup-free caches. The CPU expects that loads and stores to the same memory location are applied in queued order. The simple (low-performance) approach for the data cache is to “snoop” Queue 1 (from CPU), and delay accepting writes to addresses that are being read. Finally, note the lack of sequential consistency.
Topic #2: Pipelines and latency ... This pipeline splits after the RF stage, feeding functional units with different latencies.
Split pipelines: a write-after-write hazard. The pipeline splits after the RF stage, feeding functional units with different latencies. WAW Hazard: DIV R1, R2, R3 followed by SUB R1, R2, R3. If long latency DIV and short latency SUB are sent to parallel pipes, SUB may finish first. Solution: SUB detects R1 clash in decode stage and stalls, via a pipe-write scoreboard.
Register write port: a structural hazard. Structural Hazard: DIV R1, R2, R3 [...] SUB R5, R2, R3. DIV and SUB may need to write the register file at the same time. Solution: A scoreboard structure to reserve future slots of the write port. Stall SUB in decode until a slot opens. Other solutions are possible ... earlier, we used the solution of separate write ports.
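One way to picture "reserving future slots of the write port" is a shift register with one bit per future cycle. The sketch below assumes a single write port and made-up latencies (DIV = 4 cycles, SUB = 1 cycle); the class name and its methods are illustrative, not the lecture's design.

```python
class WritePortScoreboard:
    """slots[k] is True if the single write port is reserved k cycles from now."""

    def __init__(self, max_latency=16):  # depth is an assumption
        self.slots = [False] * (max_latency + 1)

    def try_issue(self, latency):
        """Decode-stage check: reserve the write slot or report a stall."""
        if self.slots[latency]:
            return False          # port already taken in that cycle: stall
        self.slots[latency] = True
        return True

    def tick(self):
        """Advance one clock: every reservation moves one cycle closer."""
        self.slots = self.slots[1:] + [False]


port = WritePortScoreboard()
port.try_issue(4)            # DIV reserves the port 4 cycles ahead
for _ in range(3):
    port.tick()
print(port.try_issue(1))     # SUB would collide with DIV's write-back: stall
port.tick()
print(port.try_issue(1))     # one cycle later the slot is open
```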
Functional unit input: a structural hazard. The pipeline splits after the RF stage, feeding functional units with different latencies. Structural Hazard: DIV R1, R2, R3 followed by DIV R5, R2, R3. Divide is usually not fully pipelined, and cannot accept new operands every cycle. Solution: A scoreboard structure to detect busy functional units. Stall DIV R5, ... in decode until the divider is ready.
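A busy-unit check can be as small as remembering the cycle at which a unit can next accept operands. In this sketch the 8-cycle initiation interval for the divider is an assumed number, and `FunctionalUnit` is a hypothetical name.

```python
class FunctionalUnit:
    """Tracks when a not-fully-pipelined unit can next accept operands."""

    def __init__(self, name, initiation_interval):
        self.name = name
        self.initiation_interval = initiation_interval  # cycles between accepts
        self.free_at = 0                                 # first cycle it is free

    def try_issue(self, cycle):
        """Decode-stage check at the given cycle: issue, or stall if busy."""
        if cycle < self.free_at:
            return False
        self.free_at = cycle + self.initiation_interval
        return True


divider = FunctionalUnit("div", initiation_interval=8)  # 8 is an assumption
print(divider.try_issue(0))   # DIV R1, R2, R3 issues
print(divider.try_issue(1))   # DIV R5, R2, R3 must stall in decode
print(divider.try_issue(8))   # divider is ready again
```

A fully pipelined unit is the special case `initiation_interval=1`, which never stalls back-to-back issues.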
Imprecise exceptions: A difficult issue. The pipeline splits after the RF stage, feeding functional units with different latencies. Exceptions: DIV R1, R2, R3 followed by SUB R4, R2, R3. If DIV throws an exception after SUB writes back, the contract with the programmer breaks. Solutions: Too complicated for a slide. See page C-58 in CA-AQA.
Superscalar: Multiple issues per cycle. Goal: Improve CPI by issuing several instructions per cycle. Example: CPU with floating point ALUs: Issue 1 FP + 1 Integer instruction per cycle. Difficulties: Load and branch delays affect more instructions. Ultimate Limiter: Programs may be a poor match to issue rules.
Recall VLIW: Super-sized Instructions. Example: All instructions are 64-bit. Each instruction consists of two 32-bit MIPS instructions that execute in parallel. A 64-bit VLIW instruction: two halves, each with fields opcode, rs, rt, rd, shamt, funct. One half: Syntax: ADD $8 $9 $10. Semantics: $8 = $9 + $10. Other half: Syntax: ADD $7 $8 $9. Semantics: $7 = $8 + $9. But what if we can’t change ISA execution semantics?
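To make the 64-bit bundle concrete, here is a sketch that packs the two ADDs from the slide using the standard MIPS R-format field widths (6/5/5/5/5/6 bits, ADD funct = 0x20). Which instruction sits in the upper 32 bits is an assumption, and the helper names are hypothetical.

```python
def encode_r(rd, rs, rt, funct=0x20, opcode=0, shamt=0):
    """Encode a MIPS R-format instruction (defaults give ADD rd, rs, rt)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def decode_r(word):
    """Split a 32-bit R-format word back into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def pack_vliw(first, second):
    """Assumed layout: first instruction in the upper 32 bits of the bundle."""
    return (first << 32) | second


def unpack_vliw(bundle):
    return (bundle >> 32) & 0xFFFFFFFF, bundle & 0xFFFFFFFF


bundle = pack_vliw(encode_r(8, 9, 10), encode_r(7, 8, 9))  # ADD $8 $9 $10 | ADD $7 $8 $9
for half in unpack_vliw(bundle):
    print(hex(half), decode_r(half))
```

Note that the second ADD reads $8 while the first writes it; since both halves execute in parallel, the second would see the old value of $8, which is one way the bundle differs from running the same two instructions sequentially.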
Superscalar R machine (figure: two-pipe datapath with stages IF (Fetch), ID (Decode), EX (ALU), MEM, WB; PC and Sequencer, Instr Mem, and Instruction Issue Logic feed both pipes; the RegFile has four read ports (rs1-rs4, rd1-rd4) and two write ports (ws1/wd1/WE1, ws2/wd2/WE2)).
Sustaining Dual Instr Issues (no forwarding). (Figure: both pipes of the IF/ID/EX/MEM/WB datapath are kept full by a stream of ADD instructions: ADD R8,R0,R0; ADD R11,R0,R0; ADD R9,R8,R7; ADD R12,R11,R10; ADD R15,R14,R13; ADD R18,R17,R16; ADD R21,R20,R19; ADD R24,R23,R22; ADD R27,R26,R25; ADD R30,R29,R28.) It’s rarely this good ...
Worst-Case Instruction Issue. Dependencies force “serialization”: ADD R8,R0,R0; ADD R9,R8,R0; ADD R10,R9,R0; ADD R11,R10,R0, with NOPs filling the other pipe. We add 12 forwarding buses (not shown): 6 to each ID from stages of both pipes.
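The gap between the best and worst cases can be counted with a toy in-order issue model. The sketch below assumes full forwarding (so a dependent instruction can issue in the very next cycle) and counts only issue cycles, not pipeline fill or drain; `issue_cycles` and the tuple format are illustrative.

```python
def issue_cycles(program, width=2):
    """Count issue cycles for an in-order machine issuing up to `width` per cycle.

    program: list of (dest, src1, src2) register-name tuples in program order.
    An instruction cannot join the current issue group if it reads or writes
    a register written by an earlier instruction in the same group.
    """
    cycles = 0
    i = 0
    while i < len(program):
        group_dests = set()
        issued = 0
        while i < len(program) and issued < width:
            dest, *srcs = program[i]
            if dest in group_dests or any(s in group_dests for s in srcs):
                break  # dependence on this cycle's group: wait for next cycle
            group_dests.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


worst = [("R8", "R0", "R0"), ("R9", "R8", "R0"),
         ("R10", "R9", "R0"), ("R11", "R10", "R0")]
best = [("R9", "R8", "R7"), ("R12", "R11", "R10"),
        ("R15", "R14", "R13"), ("R18", "R17", "R16")]
print(issue_cycles(worst), issue_cycles(best))
```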
Superscalar: A simple example ... Example: Superscalar MIPS. Fetches 2 instructions at a time. If first integer and second floating point, issue in same cycle. Why is the control for this CPU not so hard to do?

Integer instruction    FP instruction
LD F0,0(R1)
LD F6,-8(R1)
LD F10,-16(R1)         ADDD F4,F0,F2
LD F14,-24(R1)         ADDD F8,F6,F2
LD F18,-32(R1)         ADDD F12,F10,F2
SD 0(R1),F4            ADDD F16,F14,F2
SD -8(R1),F8           ADDD F20,F18,F2
SD -16(R1),F12
SD -24(R1),F16

Rows with both columns filled are cycles with two issues per cycle; the remaining rows have one issue per cycle.
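The issue rule above is simple enough to simulate. This sketch ignores hazards, latencies, and fetch alignment, and just applies "if first integer and second floating point, issue in same cycle" to a stream of instruction kinds; the function name and the "int"/"fp" labels are illustrative.

```python
def lockstep_cpi(kinds):
    """Cycles per instruction for a stream of "int"/"fp" instruction kinds.

    Two instructions issue together only when the next one in program order
    is an integer instruction and the one after it is floating point.
    """
    if not kinds:
        return 0.0
    cycles = 0
    i = 0
    while i < len(kinds):
        if i + 1 < len(kinds) and kinds[i] == "int" and kinds[i + 1] == "fp":
            i += 2   # dual issue
        else:
            i += 1   # single issue
        cycles += 1
    return cycles / len(kinds)


print(lockstep_cpi(["int", "fp"] * 4))   # perfectly interleaved mix
print(lockstep_cpi(["int"] * 4))         # integer-only code gains nothing
print(lockstep_cpi(["fp", "int"] * 2))   # same mix, unlucky ordering
```

Under these assumptions the ideal CPI of 0.5 appears only for the perfectly interleaved stream; the same 50/50 mix in a different order does worse.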
Superscalar: Visualizing the pipeline

Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB

Three instructions are potentially affected by a single cycle of load delay, as FP register loads are done in the “integer” pipeline.
Limitations of “lockstep” superscalar. Gets 0.5 CPI only for a 50/50 float/int mix with no hazards. For games/media, may be OK. Extending the scheme to speed up general apps (Microsoft Office, ...) is complicated. If one accepts building a complicated machine, there are better ways to do it. Dynamic Scheduling: After spring break.
DEC Alpha 21164 This 4-issue chip was the high watermark for in-order superscalar designs.
Final paragraph DEC was sold off to Compaq a few years later ... who sold off Digital Semiconductor to Intel ... who still makes Alpha chips in small batches for HP (who bought Compaq).
Break
The CDC 6600 was the world’s fastest computer for 5 years (1964-1969). The design team was located in a small town in Wisconsin, the home town of its leader, Seymour Cray. The lab was placed far from CDC headquarters in Minneapolis, to limit interference from upper management.
Tape Drives Punched card reader Main frame Top view: a “+” sign Operator Console
Top-down view: Entire main frame was liquid-cooled with Freon. Transistor-based design, running at 100 ns clock speed. Bus wires: twisted wire pairs that were trimmed by hand to meet cycle time. 64K of 60-bit words, implemented with magnetic core memory.
First commercial use of display consoles ... ran “space wars” vector games.
Twisted pair bus wires. Trimmed by hand.
Memory modules were hand-woven by former textile workers ... this is why machine cost $7M in 1962 dollars!
Logic gate circuit modules ... 50 transistors: 2.5 x 2.5 x 0.8 inch
Architecture. The first RISC machine. 10 functional units. Out-of-order execution. Long, variable latency. “Scoreboard”. Register File: includes eight 60-bit floating point registers. Peripheral processor invented multithreading.
Instruction Fetch and the Scoreboard. The scoreboard controls the execution flow of all instructions. Its goal is to maintain a CPI of 1. The instruction fetch unit is decoupled. Its goal is to pass one decoded instruction to the scoreboard every cycle. The scoreboard holds decoded copies of all in-flight instructions, and tracks the status of all elements cycle-by-cycle.
Lifecycle of an instruction in the scoreboard (part 1). States: Pending Issue -> Awaiting operands -> Execution in progress -> Execution has completed -> Result is written. Pending Issue: Newly arrived instructions are placed in this state, until (1) a functional unit becomes free, and (2) no other issued instruction wants to write the register it wants to write. Prevents WAW hazards. If an instruction is in pending issue, the scoreboard stalls the instruction fetch unit.
Lifecycle of an instruction in the scoreboard (part 2). States: Pending Issue -> Awaiting operands -> Execution in progress -> Execution has completed -> Result is written. Awaiting operands: An instruction remains in this state until neither of its operand registers is waiting to be written by a functional unit. Prevents RAW hazards.
Lifecycle of an instruction in the scoreboard (part 3). States: Pending Issue -> Awaiting operands -> Execution in progress -> Execution has completed -> Result is written. Execution in progress: This state can last many cycles, as functional units have long latency.
Lifecycle of an instruction in the scoreboard (part 4). States: Pending Issue -> Awaiting operands -> Execution in progress -> Execution has completed -> Result is written. Execution has completed: An instruction may pass through this state, unless there is an instruction in Pending or Awaiting mode that (1) preceded it in the instruction stream, and (2) needs to read the register this instruction plans to write. Prevents WAR hazards.
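The three gating conditions from parts 1, 2 and 4 can be written as predicates over a program-ordered window of in-flight instructions. This is a sketch of the rules as stated on the slides, not of the CDC 6600 hardware; `Entry`, `State`, and the function names are hypothetical, and a functional unit is assumed to stay busy until its result is written.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Tuple


class State(Enum):
    PENDING_ISSUE = auto()
    AWAITING_OPERANDS = auto()
    EXECUTING = auto()
    COMPLETED = auto()
    WRITTEN = auto()


@dataclass
class Entry:
    unit: str
    dest: str
    srcs: Tuple[str, ...]
    state: State = State.PENDING_ISSUE


def _issued_unwritten(e):
    return e.state in (State.AWAITING_OPERANDS, State.EXECUTING, State.COMPLETED)


def can_issue(window, k):
    """Part 1: unit is free and no issued instruction will write our dest (WAW)."""
    me = window[k]
    others = [e for j, e in enumerate(window) if j != k and _issued_unwritten(e)]
    return all(e.unit != me.unit and e.dest != me.dest for e in others)


def operands_ready(window, k):
    """Part 2: no earlier in-flight instruction still has to write a source (RAW)."""
    me = window[k]
    return not any(_issued_unwritten(e) and e.dest in me.srcs for e in window[:k])


def can_write(window, k):
    """Part 4: no earlier Pending/Awaiting instruction still reads our dest (WAR)."""
    me = window[k]
    waiting = (State.PENDING_ISSUE, State.AWAITING_OPERANDS)
    return not any(e.state in waiting and me.dest in e.srcs for e in window[:k])


window = [Entry("div", "R1", ("R2", "R3"), State.EXECUTING),
          Entry("add", "R1", ("R2", "R3"))]
print(can_issue(window, 1))  # SUB-like write to R1 is held back while DIV is in flight
```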
What the scoreboard keeps score of. The full status of each functional unit. (1) Is it running an instruction? Which one? (2) What are its source/destination registers? (3) For each source: waiting/ready-to-read/read. (4) For each source: who will be writing it? For each register, which functional unit is planning to write it? Current state of all in-flight instructions.
Limitations of scoreboard control ... If one accepts building a complicated machine, there are better ways to do it. Dynamic Scheduling: After spring break.
On Thursday Midterm Review Lecture