Trace Cache Ka-Ming Keung Swamy D Ponpandi
Lecture Topics • Trace Cache • Block based trace cache • Optimizations CPRE 585 Fall 2004
Motivation • Complex superscalar processor designs extract higher ILP with diminishing returns • Key issue: fetch bandwidth
Producer-Consumer View • [Diagram: instruction fetch & decode (producer) fills instruction buffers drained by instruction execution (consumer); branch outcomes and jump addresses feed back to the fetch stage]
Fetch Bandwidth Limiters • Branch outcomes • Cache hit rate • Branch throughput • Instruction alignment
The Trace Cache • Captures dynamic instruction sequences as a snapshot • A cache line holds up to N instructions and M basic blocks • [Diagram: on the first occurrence of trace{A: taken, taken}, the trace cache misses and a new trace is filled from the instruction cache, one basic block at a time; later, the existing trace is accessed with start address A and predictions (t, t), and the entire trace is sent to the decoder]
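The fill-then-hit behavior described above can be sketched in a few lines of Python. This is a minimal model, not the original hardware design: the class and parameter names (`TraceCache`, `max_insns`, `max_blocks`) are illustrative assumptions, and traces are keyed directly by (start address, branch predictions).

```python
# Minimal sketch of a trace cache indexed by fetch address plus branch
# predictions. Names and structure are illustrative, not from the design.

class TraceCache:
    def __init__(self, max_insns=16, max_blocks=3):
        self.max_insns = max_insns      # N: instruction limit per line
        self.max_blocks = max_blocks    # M: basic-block limit per line
        self.lines = {}                 # (start_pc, predictions) -> trace

    def lookup(self, start_pc, predictions):
        """Hit only when both the fetch address and the predicted branch
        outcomes match a stored trace."""
        return self.lines.get((start_pc, predictions))

    def fill(self, start_pc, predictions, blocks):
        """Store a snapshot of the dynamic instruction sequence, truncated
        to the N-instruction / M-block line limits."""
        trace, n = [], 0
        for block in blocks[:self.max_blocks]:
            if n + len(block) > self.max_insns:
                break
            trace.extend(block)
            n += len(block)
        self.lines[(start_pc, predictions)] = trace

tc = TraceCache()
# First execution of trace {A: taken, taken}: miss, then fill.
blocks = [["i0", "i1"], ["i2"], ["i3", "i4"]]
assert tc.lookup(0xA0, ("t", "t")) is None
tc.fill(0xA0, ("t", "t"), blocks)
# Later execution with the same predictions: hit, whole trace in one fetch.
assert tc.lookup(0xA0, ("t", "t")) == ["i0", "i1", "i2", "i3", "i4"]
# Different predicted path: miss, fetch falls back to the instruction cache.
assert tc.lookup(0xA0, ("t", "n")) is None
```

Note how a trace that diverges from the stored predictions misses even at the same start address, which is exactly why the same instructions may end up stored in several trace lines.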
Pros and Cons • Pro: delivers multiple basic blocks of instructions in the same cycle without compiler support or instruction-set changes • Pro: not on the critical path; complexity is moved out of the fetch-issue pipeline, where additional latency would hurt performance
Pros and Cons • Pro: no need to rotate or shift basic blocks to create the dynamic instruction sequence • Con: redundant information, since the same instructions may appear in the instruction cache and in multiple traces
Trace Cache – Dissection • [Diagram: the fetch address indexes the trace cache and the core fetch unit in parallel; each can supply n instructions to a 2:1 mux, which feeds the instruction latch and then the decoder; trace lines are filled from the core fetch path]
Trace Cache – More Details • [Diagram: the fetch address from the predictor probes the trace cache (tag, branch flags, branch mask, target and fall-through addresses, hit logic) and the core fetch unit (instruction cache, BTB logic, return address stack, mask/shift/interchange) in parallel; a line-fill buffer with merge logic and fill control builds new traces; the selected n instructions go to the instruction latch for decoding, and the trace cache supplies the next fetch address]
Trace Cache Line Components • Valid bit • Tag • Branch flags • Branch mask • Trace fall-through address • Trace target address • Line-fill buffer
Trace Cache Hit • Conditions 1. The fetch address matches the tag 2. The branch predictions match the branch flags • On a hit, an entire trace of instructions is fed into the instruction latch
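The two hit conditions can be sketched as a single predicate over a trace line. The field names (`valid`, `tag`, `branch_flags`, `branch_mask`) follow the component list above; representing a line as a Python dict is purely illustrative, and here the branch mask is modeled simply as the number of conditional branches the trace contains.

```python
# Sketch of the trace cache hit test: the fetch address must match the
# line's tag, and the branch predictions must match the stored branch
# flags for every conditional branch covered by the branch mask.

def trace_hit(line, fetch_addr, predictions):
    if not line["valid"] or line["tag"] != fetch_addr:
        return False
    # branch_mask: number of conditional branches embedded in the trace
    n = line["branch_mask"]
    return predictions[:n] == line["branch_flags"][:n]

line = {"valid": True, "tag": 0x400, "branch_flags": (1, 1), "branch_mask": 2}
assert trace_hit(line, 0x400, (1, 1, 0))   # tag and flags match -> hit
assert not trace_hit(line, 0x400, (1, 0))  # second prediction differs
assert not trace_hit(line, 0x404, (1, 1))  # wrong fetch address
```

Only the first `branch_mask` predictions are compared, so extra lookahead predictions beyond the trace's own branches do not affect the hit decision.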
Trace Cache Miss • On a miss, fetching proceeds normally from the instruction cache
Design Issues • Associativity • Multiple paths • Partial matches • Indexing methods • Cache line size • Filling • Trace selection • Victim trace cache
Results • [Chart: simulation results not reproduced]
Trace cache effectiveness • [Chart not reproduced]
Pentium 4 Trace Cache • 6 uops per line (what is the benefit of storing uops instead of x86 instructions?) • Max 12K uops (~21 KB) • Uses virtual addresses, so no address translation is needed
Block-Based Trace Cache • Fetch-address renaming: a unique renamed pointer is assigned to each fetch address • Advantage: fewer bits, no tag comparison, faster instruction fetch • The rename table is maintained at completion time • Moves complexity and latency from instruction-fetch time to completion time
Block • Instructions are stored in blocks • Each block contains a trace of instructions • A block is updated at the completion of its instructions • Each block has a block ID
How Is the Trace Found? • Originally: by fetch address • Now: by trace ID • The trace ID is determined by the branch history and past block-ID predictions • The trace table stores the block-ID predictions
Rename Table • Checks whether the predicted block already exists • Assigns a block ID to a new trace
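The interaction between the rename table and the trace table can be sketched as follows. This is a simplified model under stated assumptions: `BlockBasedFrontEnd` and its methods are invented names, block IDs are allocated sequentially, and the branch history is modeled as a plain integer key.

```python
# Sketch of block-based trace construction: a rename table maps fetch
# addresses to short block IDs at completion time, and a trace table
# predicts the next trace as a tuple of block IDs indexed by branch
# history. All structure names are illustrative.

class BlockBasedFrontEnd:
    def __init__(self):
        self.rename_table = {}  # fetch address -> block ID (set at completion)
        self.next_id = 0
        self.trace_table = {}   # branch history -> tuple of predicted block IDs

    def rename(self, fetch_addr):
        """At completion: assign a new block ID the first time an address
        is seen; reuse the existing ID afterwards."""
        if fetch_addr not in self.rename_table:
            self.rename_table[fetch_addr] = self.next_id
            self.next_id += 1
        return self.rename_table[fetch_addr]

    def record_trace(self, history, addrs):
        self.trace_table[history] = tuple(self.rename(a) for a in addrs)

    def predict_trace(self, history):
        """At fetch: look up short block IDs directly -- no tag comparison
        against full fetch addresses is needed."""
        return self.trace_table.get(history)

fe = BlockBasedFrontEnd()
fe.record_trace(history=0b101, addrs=[0x100, 0x140, 0x200])
assert fe.predict_trace(0b101) == (0, 1, 2)
# The same address renames to the same short ID in a different trace.
fe.record_trace(history=0b110, addrs=[0x100, 0x180])
assert fe.predict_trace(0b110) == (0, 3)
```

Because fetch only manipulates the short IDs, the expensive work of mapping full addresses to IDs happens off the critical path, at completion.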
Dynamic Optimizations • Register moves: a register move requires no computation • Can be performed in the ROB and renaming unit • The trace cache fill unit can do it more cheaply
Register Moves • Advantages • By placing detection logic in the fill unit, the decode and rename logic execute move instructions without paying the latency of detecting them • Moves require no execution resources, so they incur no delays from resource contention
Re-association • Before: addi Rx, Ry, 4 ; addi Rz, Rx, 4 • After: addi Rx, Ry, 4 ; addi Rz, Ry, 8 • Folding the constants removes the dependency between the two adds
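The re-association rewrite can be sketched as a pass over the instructions in a trace. The tuple encoding of instructions here is purely illustrative, and the sketch handles only chained `addi` instructions, which is the case shown on the slide.

```python
# Sketch of fill-unit re-association: when a second add-immediate consumes
# the result of a first, rewrite it to read the original source register
# and fold the constants, breaking the serial dependence.

def reassociate(trace):
    consts = {}  # dest register -> (source register, accumulated constant)
    out = []
    for op, dst, src, imm in trace:
        if op == "addi" and src in consts:
            base, k = consts[src]
            imm, src = imm + k, base  # read the earlier source directly
        if op == "addi":
            consts[dst] = (src, imm)
        out.append((op, dst, src, imm))
    return out

trace = [("addi", "Rx", "Ry", 4), ("addi", "Rz", "Rx", 4)]
# After re-association the second add no longer waits on Rx.
assert reassociate(trace) == [("addi", "Rx", "Ry", 4), ("addi", "Rz", "Ry", 8)]
```

Both adds can now issue in the same cycle, since each depends only on Ry.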
Scaled Adds • Before: shifti Rw, Rx << 1 ; add Ry, Rw, Rz • After: scaled-add Ry, (Rx << 1), Rz • Collapses the dependent shift-add pair into one instruction
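The fusion step can be sketched in the same style as the re-association pass. The tuple encoding and the `scaled_add` opcode name are illustrative assumptions; for simplicity the sketch does not check whether the shift result (Rw) is still live after the add, which real fill-unit logic would have to verify.

```python
# Sketch of scaled-add fusion in the fill unit: a shift-immediate whose
# result feeds a dependent add is collapsed into a single scaled-add,
# removing one instruction from the dependence chain.

def fuse_scaled_add(trace):
    out = []
    for insn in trace:
        if (insn[0] == "add" and out and out[-1][0] == "shifti"
                and out[-1][1] == insn[2]):   # add reads the shift result
            _, rw, rx, sh = out.pop()         # drop the shift (assumes Rw dead)
            _, ry, _, rz = insn
            out.append(("scaled_add", ry, (rx, sh), rz))
        else:
            out.append(insn)
    return out

trace = [("shifti", "Rw", "Rx", 1), ("add", "Ry", "Rw", "Rz")]
assert fuse_scaled_add(trace) == [("scaled_add", "Ry", ("Rx", 1), "Rz")]
```

An add that does not consume a preceding shift result passes through unchanged, so the pass is safe to run over a whole trace line.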
Instruction Placement • Many instructions cannot execute in the cycle right after their source operands are produced • Reordering instructions within the trace cache line can close this gap
Conclusion • Increases IPC • Opens a new area for dynamic code optimization