
Flow Path Model of Superscalars

ida

Presentation Transcript


  1. Flow Path Model of Superscalars
     [Figure: superscalar pipeline with three flow paths. Instruction flow: I-cache, branch predictor, FETCH, instruction buffer, DECODE. Register data flow: integer, floating-point, media, and memory units, EXECUTE, reorder buffer (ROB), COMMIT. Memory data flow: store queue, D-cache.]

  2. Instruction Fetch Buffer (figure labels: Fetch Unit, Out-of-order Core)
     • The fetch buffer smooths out the rate mismatch between fetch and execution
       • neither the fetch bandwidth nor the execution bandwidth is consistent
     • Fetch bandwidth should be higher than execution bandwidth
       • we prefer to have a stockpile of instructions in the buffer to hide cache miss latencies; this requires both raw cache bandwidth and control flow speculation
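The smoothing effect of the fetch buffer can be illustrated with a small simulation. This is a minimal sketch under a toy model that is not from the slides: the function name `executed_instructions`, the trace, and the rule that leftover instructions beyond the buffer capacity count as wasted fetch bandwidth are all assumptions made for illustration.

```python
def executed_instructions(fetched_per_cycle, exec_width, capacity):
    """Count instructions executed over a fetch trace.

    Toy model (an assumption, not from the slides): each cycle the execute
    stage takes up to `exec_width` instructions from the buffer plus the
    newly fetched ones; at most `capacity` leftovers survive to the next
    cycle, and anything beyond that counts as wasted fetch bandwidth.
    """
    buffered = 0
    executed = 0
    for fetched in fetched_per_cycle:
        available = buffered + fetched
        issued = min(available, exec_width)
        executed += issued
        buffered = min(available - issued, capacity)
    return executed


# Hypothetical trace: fetch delivers 4 instructions per cycle, then
# stalls for 2 cycles on an I-cache miss, and repeats.
trace = [4, 4, 0, 0, 4, 4, 0, 0]

print(executed_instructions(trace, exec_width=2, capacity=0))  # no buffer
print(executed_instructions(trace, exec_width=2, capacity=8))  # 8-entry buffer
```

With no buffer, the execute stage idles during every miss cycle. With the buffer, the surplus fetched in the good cycles covers the miss cycles, which is why the slide asks for fetch bandwidth above execution bandwidth.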

  3. Instruction Flow Bandwidth

  4. Instruction Cache Basic
     [Figure: cache array with rows 000, 001, ..., 111 selected by a row decoder and columns 00, 01, 10, 11 feeding a multiplexer that outputs one instruction; PC = ..xxRRRCC00]
     Example: 4 instructions per cache line

  5. Spatial Locality and Fetch Bandwidth
     [Figure: cache array with rows 000, 001, ..., 111 selected by a row decoder and columns 00, 01, 10, 11; PC = ..xxRRRCC00; outputs Inst0, Inst1, Inst2, Inst3]

  6. Fetch Group Misalignment
     [Figure: cache array with rows 000, 001, ..., 111 selected by a row decoder and columns 00, 01, 10, 11; PC = ..xx0000100; Inst0, Inst1, Inst2 labeled "Cycle i", Inst3?? labeled "Cycle i+1"]
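The address split PC = ..xxRRRCC00 and the misalignment case PC = ..xx0000100 can be worked through in a short sketch. The function names, the assumption that only one row can be read per cycle, and the absence of any auto-alignment hardware are illustrative choices, not details taken from the slides.

```python
# Assumed layout, read off the slides' PC = ..xx RRR CC 00:
#   bits 1:0  byte offset inside a 4-byte instruction (always 00)
#   bits 3:2  CC, the column (instruction slot within the cache line)
#   bits 6:4  RRR, the row selected by the row decoder
INSTS_PER_LINE = 4
INST_BYTES = 4


def decode_pc(pc):
    """Split a PC into (row, column) for the toy 8-row, 4-wide I-cache."""
    column = (pc >> 2) & 0b11
    row = (pc >> 4) & 0b111
    return row, column


def fetch_cycles(pc, group_size=INSTS_PER_LINE):
    """Return, per cycle, the (row, column) slots read for one fetch group.

    Assumes one row is read per cycle and no auto-alignment hardware.
    """
    cycles = []
    remaining = group_size
    while remaining > 0:
        row, column = decode_pc(pc)
        n = min(remaining, INSTS_PER_LINE - column)
        cycles.append([(row, column + i) for i in range(n)])
        pc += INST_BYTES * n
        remaining -= n
    return cycles


aligned = fetch_cycles(0b0000000)     # row 000, column 00
misaligned = fetch_cycles(0b0000100)  # row 000, column 01, as on slide 6
print(len(aligned), len(misaligned))
```

In the misaligned case the first cycle yields only three instructions from row 000, and the fourth instruction of the group sits in row 001, so it needs a second cycle.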

  7. IBM RS/6000 Auto-alignment
     [Figure: I-cache organization with the IFAR, T-logic blocks, odd and even directory sets (A & B), TLB hit and buffer control, four SRAM columns with rows 0 to 255 holding instructions A0 to A15 and B0 to B15, one mux per column, and an instruction buffer network feeding the branch, dispatch, interlock, and execution logic]
     - 2-way set associative I-Cache, 8 256-inst SRAM modules
     - 16 instructions per cache line (**What is a cache line?)

  8. Instruction Decoding Issues
     • Primary tasks:
       • Identify individual instructions
       • Determine instruction types
       • Detect inter-instruction dependences
     • Two important factors:
       • Instruction set architecture
       • Width of parallel pipeline

  9. Intel Pentium Pro Fetch/Decode Unit
     [Figure: x86 macro-instruction bytes from the IFU enter a 16-byte instruction buffer feeding Decoder 0, Decoder 1, Decoder 2, and the uROM; Decoder 0 outputs 4 uops, Decoders 1 and 2 output 1 uop each; branch address calculation feeds the next-address calculation; the uop queue (6) issues up to 3 uops to dispatch]

  10. Predecoding in the AMD K5
      [Figure: 8 instruction bytes (64 bits) arrive from memory and pass through the predecode logic, which attaches 5 predecode bits to each of Byte1 to Byte8, giving 8 instruction bytes + predecode bits (64 + 40 bits); the I-cache supplies 16 instruction bytes + predecode bits (128 + 80 bits) to decode, translate and dispatch, which produces up to 4 ROPs (ROP1 to ROP4)]
      Predecoding is also useful for RISC ISAs!!
      Cost: cache size, refill time
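The storage cost mentioned on the slide follows directly from the 5 predecode bits per instruction byte. A small sketch of that arithmetic (the function name `predecoded_bits` is made up for illustration):

```python
PREDECODE_BITS_PER_BYTE = 5  # from the slide: 5 predecode bits per instruction byte


def predecoded_bits(instruction_bytes):
    """Return (raw instruction bits, added predecode bits)."""
    raw_bits = 8 * instruction_bytes
    predecode_bits = PREDECODE_BITS_PER_BYTE * instruction_bytes
    return raw_bits, predecode_bits


for n in (8, 16):
    raw, extra = predecoded_bits(n)
    print(f"{n} bytes: {raw} + {extra} bits, overhead {extra / raw:.1%}")
```

The two cases reproduce the slide's 64 + 40 and 128 + 80 figures; the predecode bits add 5/8 of the raw instruction storage.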

  11. Control Dependence

  12. IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]
      • Code Characteristics (dynamic)
        • loads - 25%
        • stores - 15%
        • ALU/RR - 40%
        • branches - 20%
          • 1/3 unconditional (always taken); unconditional - 100% schedulable
          • 1/3 conditional taken
          • 1/3 conditional not taken; conditional - 50% schedulable
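The branch figures above can be combined into one number. The sketch below assumes that "schedulable" means the branch delay can be hidden by scheduling useful work, and that the 50% figure applies equally to taken and not-taken conditional branches; both readings are assumptions, since the slide does not spell them out.

```python
from fractions import Fraction

branch_fraction = Fraction(20, 100)  # branches are 20% of dynamic instructions

# (share of all branches, fraction assumed schedulable)
branch_mix = {
    "unconditional": (Fraction(1, 3), Fraction(1)),
    "conditional taken": (Fraction(1, 3), Fraction(1, 2)),
    "conditional not taken": (Fraction(1, 3), Fraction(1, 2)),
}

# Weighted share of branches whose delay can be scheduled away.
schedulable = sum(share * frac for share, frac in branch_mix.values())

# Unschedulable branches as a fraction of all dynamic instructions.
unschedulable_of_all = branch_fraction * (1 - schedulable)

print(schedulable, unschedulable_of_all)
```

Under these assumptions two thirds of branches are schedulable, so roughly one instruction in fifteen is a branch that still exposes its delay.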

  13. Control Flow Graph
      • Shows possible paths of control flow through basic blocks
      • Control Dependence
        • Node X is control dependent on Node Y if the computation in Y determines whether X executes

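  14. Mapping CFG to Linear Instruction Sequence
      [Figure: control flow graph with basic blocks A, B, C, D and linear layouts of those blocks; residual figure labels: A A C B D D B C]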

  15. Branch Types and Implementation
      • Types of Branches
        • Conditional or Unconditional?
        • Subroutine Call (aka Link), needs to save PC?
        • How is the branch target computed?
          • Static target, e.g. immediate, PC-relative
          • Dynamic target, e.g. register indirect
      • Conditional Branch Architectures
        • Condition Code ‘N-Z-C-V’, e.g. PowerPC
        • General Purpose Register, e.g. Alpha, MIPS
        • Special-Purpose Register, e.g. Power’s Loop Count
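The distinction between static and dynamic targets, and the link behavior of subroutine calls, can be sketched as follows. The mode names, the fixed 4-byte instruction size, and the choice to add the PC-relative offset to the branch's own PC are hypothetical simplifications; real ISAs differ in these details.

```python
INST_BYTES = 4  # assumption: fixed 4-byte instructions


def branch_target(pc, mode, imm=0, reg_value=0):
    """Compute a branch target for three illustrative addressing modes."""
    if mode == "immediate":          # static: target encoded in the instruction
        return imm
    if mode == "pc_relative":        # static: offset from the branch's PC
        return pc + imm
    if mode == "register_indirect":  # dynamic: target read from a register
        return reg_value
    raise ValueError(f"unknown mode: {mode}")


def next_pc(pc, taken, mode, imm=0, reg_value=0, link=False):
    """Return (next PC, saved return address or None)."""
    fall_through = pc + INST_BYTES
    saved = fall_through if link else None  # a call saves the return address
    if taken:
        return branch_target(pc, mode, imm, reg_value), saved
    return fall_through, saved


print(next_pc(0x100, True, "pc_relative", imm=0x20))
print(next_pc(0x100, False, "pc_relative", imm=0x20))
print(next_pc(0x100, True, "register_indirect", reg_value=0x800, link=True))
```

A static target is known as soon as the instruction is decoded, while a register-indirect target depends on a value that may not be available until later in the pipeline.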

  16. Condition Resolution

  17. Target Address Generation

  18. What’s So Bad About Branches?
      • Performance Penalties
        • Use up execution resources
        • Fragmentation of I-Cache lines
        • Disruption of sequential control flow
          • Need to determine branch direction (conditional branches)
          • Need to determine branch target
      Robs instruction fetch bandwidth and ILP

  19. Riseman and Foster’s Study
      • 7 benchmark programs on CDC-3600
      • Assume infinite machine:
        • Infinite memory and instruction stack, register file, functional units
        • Consider only true dependency at data-flow limit
      • If bounded to single basic block, i.e. no bypassing of branches → maximum speedup is 1.72
      • Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known such that branches do not impede instruction execution)

      Br. Bypassed:   0      1      2      8      32     128
      Max Speedup:    1.72   2.72   3.62   7.21   24.4   51.2
