Flow Path Model of Superscalars

[Figure: flow path model of a superscalar processor. Instruction flow: I-cache and branch predictor feed FETCH, the instruction buffer, and DECODE. Register data flow: integer, floating-point, media, and memory units in EXECUTE, then the reorder buffer (ROB) and COMMIT. Memory data flow: store queue and D-cache.]

Instruction Fetch Buffer
[Figure: the fetch unit feeds the out-of-order core through the instruction fetch buffer.]
• The fetch buffer smooths out the rate mismatch between fetch and execution
  • Neither the fetch bandwidth nor the execution bandwidth is constant
• Fetch bandwidth should be higher than execution bandwidth
  • We prefer to have a stockpile of instructions in the buffer to hide cache miss latencies
• This requires both raw cache bandwidth and control-flow speculation
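The smoothing effect of the fetch buffer can be illustrated with a toy cycle-by-cycle model. This is only a sketch under stated assumptions: the function `simulate`, the fetch pattern, and the widths are hypothetical, and fetched instructions that do not fit in the buffer are simply counted as wasted fetch bandwidth.

```python
def simulate(fetch_per_cycle, exec_width, capacity):
    """Count instructions executed given a per-cycle fetch pattern.

    Hypothetical model: each cycle the fetch unit delivers some number of
    instructions (0 models an I-cache miss cycle), the buffer holds at most
    `capacity` instructions, and execution drains up to `exec_width`.
    """
    buffered = 0
    executed = 0
    for fetched in fetch_per_cycle:
        # Instructions beyond the buffer capacity are wasted fetch bandwidth.
        buffered = min(capacity, buffered + fetched)
        issued = min(exec_width, buffered)
        buffered -= issued
        executed += issued
    return executed

# Fetch delivers 4 per cycle, but misses for two cycles out of every four.
pattern = [4, 4, 0, 0, 4, 4, 0, 0]

print(simulate(pattern, exec_width=2, capacity=2))  # 8: execution starves during misses
print(simulate(pattern, exec_width=2, capacity=8))  # 16: the stockpile hides the misses
```

With the larger buffer, the bursts of four fetched instructions build a stockpile that keeps the 2-wide execution stage busy through the miss cycles.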

Instruction Flow Bandwidth

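Instruction Cache Basic
[Figure: instruction cache array. A row decoder selects one of rows 000 to 111 and a multiplexer selects one of columns 00 to 11. PC = ..xxRRRCC00, where RRR is the row index and CC the column index.]
Example: 4 instructions per cache line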
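The PC layout on the slide (PC = ..xxRRRCC00) can be decoded with a few shifts and masks. The sketch below assumes 4-byte instructions (the two low zero bits), 2 column bits, and 3 row bits, matching the 8-row, 4-column example; the name `split_pc` is hypothetical.

```python
# Field widths assumed from the slide's PC = ..xxRRRCC00 layout.
BYTE_BITS = 2  # low "00": byte offset within a 4-byte instruction
COL_BITS = 2   # "CC": which of the 4 instructions in the line
ROW_BITS = 3   # "RRR": which of the 8 rows

def split_pc(pc):
    """Return (row, column) selected by a PC in the toy cache."""
    col = (pc >> BYTE_BITS) & ((1 << COL_BITS) - 1)
    row = (pc >> (BYTE_BITS + COL_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col

print(split_pc(0b0000100))  # (0, 1): row 000, column 01
print(split_pc(0b1011000))  # (5, 2): row 101, column 10
```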

Spatial Locality and Fetch Bandwidth
[Figure: the same cache array with PC = ..xxRRRCC00. One row access delivers Inst0, Inst1, Inst2, and Inst3 together.]

Fetch Group Misalignment
[Figure: with PC = ..xx0000100 the fetch group starts at column 01 of row 000. Cycle i delivers Inst0, Inst1, and Inst2; Inst3 lies in the next row and must wait for cycle i+1.]
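The bandwidth loss from misalignment is simple arithmetic on the starting column. The sketch below assumes one row access per cycle, 4 instructions per line as in the slide's example, and no auto-alignment hardware; both function names are hypothetical.

```python
LINE_SIZE = 4  # instructions per cache line, as in the slide's example

def fetched_this_cycle(start_col, width=4):
    """Instructions delivered in one cycle when a fetch cannot cross a row."""
    return min(width, LINE_SIZE - start_col)

def cycles_for_group(start_col, width=4):
    """Cycles needed to fetch `width` instructions starting at `start_col`."""
    remaining, col, cycles = width, start_col, 0
    while remaining > 0:
        remaining -= min(remaining, LINE_SIZE - col)
        col = 0  # the next access starts at the beginning of the following row
        cycles += 1
    return cycles

print(fetched_this_cycle(1))  # 3: only Inst0..Inst2 arrive in cycle i
print(cycles_for_group(1))    # 2: Inst3 needs cycle i+1
print(cycles_for_group(0))    # 1: an aligned group arrives in one cycle
```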

IBM RS/6000 Auto-alignment
[Figure: IBM RS/6000 instruction cache. The IFAR drives the T-logic blocks and the odd and even directory sets (A & B), with TLB hit and buffer control logic. The SRAM modules hold interleaved instructions A0/B0 through A15/B15, and four muxes feed the instruction buffer network, which supplies instructions to the branch, interlock, dispatch, and execution logic.]
• 2-way set associative I-cache, 8 256-instruction SRAM modules
• 16 instructions per cache line (what is a cache line?)

Instruction Decoding Issues
• Primary tasks:
  • Identify individual instructions
  • Determine instruction types
  • Detect inter-instruction dependences
• Two important factors:
  • Instruction set architecture
  • Width of parallel pipeline

Intel Pentium Pro Fetch/Decode Unit
[Figure: x86 macro-instruction bytes from the IFU enter a 16-byte instruction buffer. Decoder 0 (with the uROM) produces up to 4 uops; decoders 1 and 2 produce 1 uop each. Branch address calculation and next-address calculation run alongside the decoders. A 6-entry uop queue issues up to 3 uops to dispatch.]

Predecoding in the AMD K5
[Figure: 8 instruction bytes (64 bits) arrive from memory. The predecode logic attaches 5 predecode bits to each of Byte1 to Byte8, so 64 + 40 bits are written into the I-cache. The I-cache delivers 16 instruction bytes (128 + 80 bits) to decode, translate, and dispatch, which issues up to 4 ROPs (ROP1 to ROP4).]
• Predecoding is also useful for RISC ISAs
• Cost: cache size, refill time
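The cache-size cost of predecoding follows directly from the 5 predecode bits per instruction byte shown on the slide. A small sketch of that arithmetic (the function names are hypothetical):

```python
PREDECODE_BITS_PER_BYTE = 5  # from the slide
BITS_PER_BYTE = 8

def predecode_bits(instr_bytes):
    """Predecode bits stored alongside a block of instruction bytes."""
    return instr_bytes * PREDECODE_BITS_PER_BYTE

def storage_overhead():
    """Extra I-cache storage relative to the raw instruction bits."""
    return PREDECODE_BITS_PER_BYTE / BITS_PER_BYTE

print(predecode_bits(8))    # 40, matching "64 + 40" on the refill path
print(predecode_bits(16))   # 80, matching "128 + 80" on the fetch path
print(storage_overhead())   # 0.625, i.e. 62.5% more bits per cached byte
```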

Control Dependence

IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]
• Code characteristics (dynamic)
  • loads - 25%
  • stores - 15%
  • ALU/RR - 40%
  • branches - 20%
    • 1/3 unconditional (always taken)
    • 1/3 conditional taken
    • 1/3 conditional not taken
• unconditional - 100% schedulable
• conditional - 50% schedulable
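Combining the branch mix with the schedulability figures gives the overall share of branches that can be scheduled. The sketch below only multiplies out the slide's numbers; the variable names are hypothetical, and it assumes the 50% figure applies equally to taken and not-taken conditional branches.

```python
from fractions import Fraction

BRANCH_SHARE = Fraction(20, 100)  # branches as a share of dynamic instructions

# (share of all branches, fraction schedulable), taken from the slide.
branch_mix = {
    "unconditional":         (Fraction(1, 3), Fraction(1)),
    "conditional taken":     (Fraction(1, 3), Fraction(1, 2)),
    "conditional not taken": (Fraction(1, 3), Fraction(1, 2)),
}

# Weighted share of branches that are schedulable: 1/3 * 1 + 2/3 * 1/2.
schedulable = sum(share * frac for share, frac in branch_mix.values())

# Unschedulable branches as a share of all dynamic instructions.
unschedulable_overall = BRANCH_SHARE * (1 - schedulable)

print(schedulable)             # 2/3
print(unschedulable_overall)   # 1/15, about 6.7% of all instructions
```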

Control Flow Graph
• Shows the possible paths of control flow through basic blocks
• Control dependence
  • Node X is control dependent on node Y if the computation in Y determines whether X executes

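Mapping CFG to Linear Instruction Sequence
[Figure: a CFG with basic blocks A, B, C, and D, shown next to two alternative linear orderings of the same blocks in memory.]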
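How a layout choice decides which CFG edges become fall-throughs and which need a taken branch can be sketched as follows. The diamond-shaped CFG (A branches to B or C, both rejoin at D) and the two layouts are assumptions for illustration, since the edges of the slide's figure are not recoverable here; `fall_through_edges` is a hypothetical helper.

```python
# Assumed diamond CFG: A -> B or C, then B -> D and C -> D.
CFG = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def fall_through_edges(layout):
    """Edges whose target sits immediately after its source in memory.

    Every other edge needs a taken branch or jump in this layout.
    """
    pos = {block: i for i, block in enumerate(layout)}
    return [(src, dst)
            for src, succs in CFG.items()
            for dst in succs
            if pos[dst] == pos[src] + 1]

print(fall_through_edges(["A", "B", "C", "D"]))  # [('A', 'B'), ('C', 'D')]
print(fall_through_edges(["A", "C", "D", "B"]))  # [('A', 'C'), ('C', 'D')]
```

In the first layout the path through B ends in a jump over C; in the second, the path through C runs entirely sequentially and B is placed out of line.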

Branch Types and Implementation
• Types of branches
  • Conditional or unconditional?
  • Subroutine call (aka link): needs to save PC?
  • How is the branch target computed?
    • Static target, e.g. immediate, PC-relative
    • Dynamic target, e.g. register indirect
• Conditional branch architectures
  • Condition code ‘N-Z-C-V’, e.g. PowerPC
  • General-purpose register, e.g. Alpha, MIPS
  • Special-purpose register, e.g. Power’s Loop Count

Condition Resolution

Target Address Generation

What’s So Bad About Branches?
• Performance penalties
  • Use up execution resources
  • Fragmentation of I-cache lines
  • Disruption of sequential control flow
    • Need to determine branch direction (conditional branches)
    • Need to determine branch target
• Robs instruction fetch bandwidth and ILP

Riseman and Foster’s Study
• 7 benchmark programs on CDC-3600
• Assume an infinite machine:
  • Infinite memory and instruction stack, register file, function units
  • Consider only true dependences, at the data-flow limit
• If bounded to a single basic block, i.e. no bypassing of branches, the maximum speedup is 1.72
• Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known, so that branches do not impede instruction execution):

Branches bypassed:  0     1     2     8     32    128
Max speedup:        1.72  2.72  3.62  7.21  24.4  51.2
