Trace Cache Ka-Ming Keung Swamy D Ponpandi
Lecture Topics • Trace Cache • Block based trace cache • Optimizations CPRE 585 Fall 2004
Motivation • Complex superscalar processor designs extract higher ILP with diminishing returns • Key issue: fetch bandwidth
Producer-Consumer View • [Diagram: instruction fetch & decode (producer) fills instruction buffers drained by instruction execution (consumer); branch outcomes and jump addresses feed back to the fetch stage]
Fetch Bandwidth Limiters • Branch outcomes • Cache hit rate • Branch throughput • Instruction alignment
The Trace Cache • Captures dynamic instruction sequences as a snapshot • A cache line holds up to N instructions and M basic blocks • [Diagram: on the first occurrence of trace{A: taken, taken}, the trace cache misses and a new trace is filled from the instruction cache, one basic block at a time; later, the existing trace is accessed with start address A and predictions (t, t), and the entire trace is sent to the decoder]
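The fill-then-hit behavior described above can be sketched in a few lines of Python. This is a minimal model, not the original hardware design: the class and parameter names (`TraceCache`, `max_insns`, `max_blocks`) are illustrative assumptions, and traces are keyed directly by (start address, branch predictions).

```python
# Minimal sketch of a trace cache indexed by fetch address plus branch
# predictions. Names and structure are illustrative, not from the design.

class TraceCache:
    def __init__(self, max_insns=16, max_blocks=3):
        self.max_insns = max_insns      # N: instruction limit per line
        self.max_blocks = max_blocks    # M: basic-block limit per line
        self.lines = {}                 # (start_pc, predictions) -> trace

    def lookup(self, start_pc, predictions):
        """Hit only when both the fetch address and the predicted branch
        outcomes match a stored trace."""
        return self.lines.get((start_pc, predictions))

    def fill(self, start_pc, predictions, blocks):
        """Store a snapshot of the dynamic instruction sequence, truncated
        to the N-instruction / M-block line limits."""
        trace, n = [], 0
        for block in blocks[:self.max_blocks]:
            if n + len(block) > self.max_insns:
                break
            trace.extend(block)
            n += len(block)
        self.lines[(start_pc, predictions)] = trace

tc = TraceCache()
# First execution of trace {A: taken, taken}: miss, then fill.
blocks = [["i0", "i1"], ["i2"], ["i3", "i4"]]
assert tc.lookup(0xA0, ("t", "t")) is None
tc.fill(0xA0, ("t", "t"), blocks)
# Later execution with the same predictions: hit, whole trace in one fetch.
assert tc.lookup(0xA0, ("t", "t")) == ["i0", "i1", "i2", "i3", "i4"]
# Different predicted path: miss, fetch falls back to the instruction cache.
assert tc.lookup(0xA0, ("t", "n")) is None
```

Note how a trace that diverges from the stored predictions misses even at the same start address, which is exactly why the same instructions may end up stored in several trace lines.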
Pros and Cons • Pro: delivers multiple basic blocks of instructions in the same cycle without compiler support or instruction-set changes • Pro: not on the critical path; complexity is moved out of the fetch-issue pipeline, where additional latency would hurt performance
Pros and Cons • Pro: no need to rotate or shift basic blocks to create the dynamic instruction sequence • Con: redundant information, since the same instructions may appear in the instruction cache and in multiple traces
Trace Cache – Dissection • [Diagram: the fetch address indexes the trace cache and the core fetch unit in parallel; each can supply n instructions to a 2:1 mux, which feeds the instruction latch and then the decoder; trace lines are filled from the core fetch path]
Trace Cache – More Details • [Diagram: the fetch address from the predictor probes the trace cache (tag, branch flags, branch mask, target and fall-through addresses, hit logic) and the core fetch unit (instruction cache, BTB logic, return address stack, mask/shift/interchange) in parallel; a line-fill buffer with merge logic and fill control builds new traces; the selected n instructions go to the instruction latch for decoding, and the trace cache supplies the next fetch address]
Trace Cache Line Components • Valid bit • Tag • Branch flags • Branch mask • Trace fall-through address • Trace target address • Line-fill buffer
Trace Cache Hit • Conditions 1. The fetch address matches the tag 2. The branch predictions match the branch flags • On a hit, an entire trace of instructions is fed into the instruction latch
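The two hit conditions can be sketched as a single predicate over a trace line. The field names (`valid`, `tag`, `branch_flags`, `branch_mask`) follow the component list above; representing a line as a Python dict is purely illustrative, and here the branch mask is modeled simply as the number of conditional branches the trace contains.

```python
# Sketch of the trace cache hit test: the fetch address must match the
# line's tag, and the branch predictions must match the stored branch
# flags for every conditional branch covered by the branch mask.

def trace_hit(line, fetch_addr, predictions):
    if not line["valid"] or line["tag"] != fetch_addr:
        return False
    # branch_mask: number of conditional branches embedded in the trace
    n = line["branch_mask"]
    return predictions[:n] == line["branch_flags"][:n]

line = {"valid": True, "tag": 0x400, "branch_flags": (1, 1), "branch_mask": 2}
assert trace_hit(line, 0x400, (1, 1, 0))   # tag and flags match -> hit
assert not trace_hit(line, 0x400, (1, 0))  # second prediction differs
assert not trace_hit(line, 0x404, (1, 1))  # wrong fetch address
```

Only the first `branch_mask` predictions are compared, so extra lookahead predictions beyond the trace's own branches do not affect the hit decision.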
Trace Cache Miss • On a miss, fetching proceeds normally from the instruction cache
Design Issues • Associativity • Multiple paths • Partial matches • Indexing methods • Cache line size • Filling • Trace selection • Victim trace cache
Results • [Chart: simulation results not reproduced]
Trace cache effectiveness • [Chart not reproduced]
Pentium 4 Trace Cache • 6 uops per line (what is the benefit of storing uops instead of x86 instructions?) • Max 12K uops (~21 KB) • Uses virtual addresses, so no address translation is needed
Block-Based Trace Cache • Fetch-address renaming: a unique renamed pointer is assigned to each fetch address • Advantage: fewer bits, no tag comparison, faster instruction fetch • The rename table is maintained at completion time • Moves complexity and latency from instruction-fetch time to completion time
Block • Instructions are stored in blocks • Each block contains a trace of instructions • A block is updated at the completion of its instructions • Each block has a block ID
How Is the Trace Found? • Originally: by fetch address • Now: by trace ID • The trace ID is determined by the branch history and past block-ID predictions • The trace table stores the block-ID predictions
Rename Table • Checks whether the predicted block already exists • Assigns a block ID to a new trace
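The interaction between the rename table and the trace table can be sketched as follows. This is a simplified model under stated assumptions: `BlockBasedFrontEnd` and its methods are invented names, block IDs are allocated sequentially, and the branch history is modeled as a plain integer key.

```python
# Sketch of block-based trace construction: a rename table maps fetch
# addresses to short block IDs at completion time, and a trace table
# predicts the next trace as a tuple of block IDs indexed by branch
# history. All structure names are illustrative.

class BlockBasedFrontEnd:
    def __init__(self):
        self.rename_table = {}  # fetch address -> block ID (set at completion)
        self.next_id = 0
        self.trace_table = {}   # branch history -> tuple of predicted block IDs

    def rename(self, fetch_addr):
        """At completion: assign a new block ID the first time an address
        is seen; reuse the existing ID afterwards."""
        if fetch_addr not in self.rename_table:
            self.rename_table[fetch_addr] = self.next_id
            self.next_id += 1
        return self.rename_table[fetch_addr]

    def record_trace(self, history, addrs):
        self.trace_table[history] = tuple(self.rename(a) for a in addrs)

    def predict_trace(self, history):
        """At fetch: look up short block IDs directly -- no tag comparison
        against full fetch addresses is needed."""
        return self.trace_table.get(history)

fe = BlockBasedFrontEnd()
fe.record_trace(history=0b101, addrs=[0x100, 0x140, 0x200])
assert fe.predict_trace(0b101) == (0, 1, 2)
# The same address renames to the same short ID in a different trace.
fe.record_trace(history=0b110, addrs=[0x100, 0x180])
assert fe.predict_trace(0b110) == (0, 3)
```

Because fetch only manipulates the short IDs, the expensive work of mapping full addresses to IDs happens off the critical path, at completion.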
Dynamic Optimizations • Register moves: a register move requires no computation • Can be performed in the ROB and renaming unit • The trace cache fill unit can do it more cheaply
Register Moves • Advantages • By placing detection logic in the fill unit, the decode and rename logic execute move instructions without paying the latency of detecting them • Moves require no execution resources, so they incur no delays from resource contention
Re-association • Before: addi Rx, Ry, 4 ; addi Rz, Rx, 4 • After: addi Rx, Ry, 4 ; addi Rz, Ry, 8 • Folding the constants removes the dependency between the two adds
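The re-association rewrite can be sketched as a pass over the instructions in a trace. The tuple encoding of instructions here is purely illustrative, and the sketch handles only chained `addi` instructions, which is the case shown on the slide.

```python
# Sketch of fill-unit re-association: when a second add-immediate consumes
# the result of a first, rewrite it to read the original source register
# and fold the constants, breaking the serial dependence.

def reassociate(trace):
    consts = {}  # dest register -> (source register, accumulated constant)
    out = []
    for op, dst, src, imm in trace:
        if op == "addi" and src in consts:
            base, k = consts[src]
            imm, src = imm + k, base  # read the earlier source directly
        if op == "addi":
            consts[dst] = (src, imm)
        out.append((op, dst, src, imm))
    return out

trace = [("addi", "Rx", "Ry", 4), ("addi", "Rz", "Rx", 4)]
# After re-association the second add no longer waits on Rx.
assert reassociate(trace) == [("addi", "Rx", "Ry", 4), ("addi", "Rz", "Ry", 8)]
```

Both adds can now issue in the same cycle, since each depends only on Ry.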
Scaled Adds • Before: shifti Rw, Rx << 1 ; add Ry, Rw, Rz • After: scaled-add Ry, (Rx << 1), Rz • Collapses the dependent shift-add pair into one instruction
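The fusion step can be sketched in the same style as the re-association pass. The tuple encoding and the `scaled_add` opcode name are illustrative assumptions; for simplicity the sketch does not check whether the shift result (Rw) is still live after the add, which real fill-unit logic would have to verify.

```python
# Sketch of scaled-add fusion in the fill unit: a shift-immediate whose
# result feeds a dependent add is collapsed into a single scaled-add,
# removing one instruction from the dependence chain.

def fuse_scaled_add(trace):
    out = []
    for insn in trace:
        if (insn[0] == "add" and out and out[-1][0] == "shifti"
                and out[-1][1] == insn[2]):   # add reads the shift result
            _, rw, rx, sh = out.pop()         # drop the shift (assumes Rw dead)
            _, ry, _, rz = insn
            out.append(("scaled_add", ry, (rx, sh), rz))
        else:
            out.append(insn)
    return out

trace = [("shifti", "Rw", "Rx", 1), ("add", "Ry", "Rw", "Rz")]
assert fuse_scaled_add(trace) == [("scaled_add", "Ry", ("Rx", 1), "Rz")]
```

An add that does not consume a preceding shift result passes through unchanged, so the pass is safe to run over a whole trace line.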
Instruction Placement • Many instructions cannot execute in the cycle right after their source operands are produced • Reordering instructions within the trace cache line can close this gap
Conclusion • Increases IPC • Opens a new area for dynamic code optimization