Trace Caches

Trace Caches • Michele Co • CS 451

Motivation • High performance superscalar processors • High instruction throughput • Exploit ILP • Wider dispatch and issue paths • Execution units designed for high parallelism • Many functional units • Large issue buffers • Many physical registers • Fetch bandwidth becomes performance bottleneck

Fetch Performance Limiters • Cache hit rate • Branch prediction accuracy • Branch throughput • Need to predict more than one branch per cycle • Non-contiguous instruction alignment • Fetch unit latency
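A rough sketch of why branch throughput and noncontiguous alignment cap fetch bandwidth. The model and every number in it (fetch width, basic block size, taken probability) are illustrative assumptions, not figures from the slides.

```python
def expected_fetch(width, block_size, taken_prob, max_branches):
    """Expected instructions fetched per cycle under a simplified model.

    Assumptions: every basic block has `block_size` instructions and ends
    in a branch that is taken with probability `taken_prob`. Fetch stops at
    the first taken branch, after `max_branches` predicted branches, or
    once `width` instructions have been fetched.
    """
    expected = 0.0
    reach = 1.0   # probability that fetch gets as far as the current block
    fetched = 0
    for _ in range(max_branches):
        take = min(block_size, width - fetched)
        if take <= 0:
            break
        expected += reach * take
        fetched += take
        reach *= 1.0 - taken_prob  # fetch continues only past a not-taken branch
    return expected


one_branch = expected_fetch(width=16, block_size=5, taken_prob=0.6, max_branches=1)
three_branches = expected_fetch(width=16, block_size=5, taken_prob=0.6, max_branches=3)
# taken_prob=0.0 approximates a fetch unit that is not stopped by taken branches
past_taken = expected_fetch(width=16, block_size=5, taken_prob=0.0, max_branches=3)
```

Under these assumptions, predicting more branches per cycle helps only a little while taken branches still end the fetch group; fetching past taken branches is what recovers most of the width.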

Problems with Traditional Instruction Cache • Contains instructions in compiled (static) order • Works well for sequential code with little branching, or code with large basic blocks • Taken branches make the dynamic instruction stream noncontiguous in the cache

Suggested Solutions • Multiple branch target address prediction • Branch address cache (1993, Yeh, Marr, Patt) • Provides quick access to multiple target addresses • Disadvantages • Complex alignment network, additional latency

Suggested Solutions (cont’d) • Collapsing buffer (1995, Conte, Mills, Menezes, Patel) • Multiple accesses to BTB • Allows fetching non-adjacent cache lines • Disadvantages • Bank conflicts • Poor scalability for interblock branches • Significant logic added before and after instruction cache • Fill unit (1988, Melvin, Shebanow, Patt) • Caches RISC-like instructions derived from CISC instruction stream

Problems with Prior Approaches • Need to generate pointers for all noncontiguous instruction blocks BEFORE fetching can begin • Extra stages, additional latency • Complex alignment network necessary • Multiple simultaneous access to instruction cache • Multiporting is expensive • Sequencing • Additional stages, additional latency

Potential Solution – Trace Cache • Rotenberg, Bennett, Smith (1996) • Advantages • Caches dynamic instruction sequences • Fetches past multiple branches • No additional fetch unit latency • Disadvantages • Redundant instruction storage • Between trace cache and instruction cache • Within trace cache
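A minimal sketch of how caching dynamic sequences leads to redundant storage within the trace cache. The `Inst` record, the capacity limits, and the sample stream are hypothetical; a real fill unit has more termination rules.

```python
from collections import namedtuple

# Hypothetical retired-instruction record: address, is it a conditional
# branch, and (if so) was it taken.
Inst = namedtuple("Inst", "pc is_branch taken")


def fill_trace_cache(stream, max_insts=16, max_branches=2):
    """Group a dynamic instruction stream into traces.

    Each trace is keyed by (start address, branch outcomes). Both limits
    are assumed values chosen for the example.
    """
    cache = {}
    current, outcomes = [], []
    for inst in stream:
        current.append(inst.pc)
        if inst.is_branch:
            outcomes.append(inst.taken)
        if len(current) == max_insts or len(outcomes) == max_branches:
            cache[(current[0], tuple(outcomes))] = current
            current, outcomes = [], []
    return cache  # an unfinished trailing trace is simply dropped


# Two paths (block X then Z, block Y then Z) share block Z at pc 10..11.
stream = [
    Inst(0, False, False), Inst(1, True, True),    # block X
    Inst(10, False, False), Inst(11, True, True),  # block Z
    Inst(5, False, False), Inst(6, True, True),    # block Y
    Inst(10, False, False), Inst(11, True, True),  # block Z again
]
cache = fill_trace_cache(stream)
stored = sum(len(t) for t in cache.values())
unique = len({pc for t in cache.values() for pc in t})
```

Here block Z is stored once per trace that passes through it, so `stored` exceeds `unique`; the same instructions would also still live in the instruction cache.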

Trace Cache Details • Trace • Sequence of instructions potentially containing branches and their targets • Terminate on branches with indeterminate number of targets • Returns, indirect jumps, traps • Trace identifier • Start address + branch outcomes • Trace cache line • Valid bit • Tag • Branch flags • Branch mask • Trace fall-through address • Trace target address
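The line fields above can be sketched as a record plus a hit test. This is an assumption-laden illustration: the branch mask is modeled as two fields (`num_branches`, `ends_in_branch`), only branches followed by more instructions carry a flag, and the addresses are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TraceLine:
    valid: bool
    tag: int                  # start address of the trace
    branch_flags: List[bool]  # outcomes of branches embedded in the trace
    num_branches: int         # branch mask, part 1
    ends_in_branch: bool      # branch mask, part 2
    fall_through: int         # next fetch address if the last branch is not taken
    target: int               # next fetch address if the last branch is taken
    insts: List[int]


def lookup(line, fetch_addr, predictions):
    """Return (instructions, next fetch address) on a hit, else None."""
    if not line.valid or line.tag != fetch_addr:
        return None
    # A trailing branch has no instructions after it, so it has no flag.
    n_flags = line.num_branches - 1 if line.ends_in_branch else line.num_branches
    if list(predictions[:n_flags]) != line.branch_flags[:n_flags]:
        return None
    if line.ends_in_branch and predictions[n_flags]:
        return line.insts, line.target
    return line.insts, line.fall_through


line = TraceLine(valid=True, tag=0x100, branch_flags=[True], num_branches=2,
                 ends_in_branch=True, fall_through=0x120, target=0x200,
                 insts=[0x100, 0x104, 0x180, 0x184])
hit_taken = lookup(line, 0x100, [True, True, False])
hit_not_taken = lookup(line, 0x100, [True, False, False])
miss = lookup(line, 0x100, [False, True, False])
```

The start address and the embedded branch outcomes together act as the trace identifier; the prediction for the final branch only selects the next fetch address.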

Next Trace Prediction (NTP) • History register • Correlating table • Complex history indexing • Secondary Table • Indexed by most recently committed trace ID • Index generating function
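A simplified sketch of the two-table arrangement listed above. The table sizes, the shift-and-XOR index function, and the 2-bit counter policy are assumptions made for illustration; the actual index generating function is not recoverable from the transcript.

```python
class NextTracePredictor:
    def __init__(self, depth=3, index_bits=10):
        self.depth = depth                     # trace IDs kept in the history
        self.size = 1 << index_bits
        self.history = []                      # most recent trace ID last
        self.correlating = [None] * self.size  # entries: [trace_id, counter]
        self.secondary = [None] * self.size    # entries: trace_id

    def _index(self):
        # Assumed index function: older IDs are shifted further left, so
        # fewer of their bits survive the modulo than the newest ID's bits.
        idx = 0
        for tid in self.history:
            idx = ((idx << 3) ^ tid) % self.size
        return idx

    def predict(self):
        if not self.history:
            return None
        entry = self.correlating[self._index()]
        if entry is not None and entry[1] > 0:
            return entry[0]
        # Fall back to the table indexed by the most recent trace ID only.
        return self.secondary[self.history[-1] % self.size]

    def update(self, actual):
        if self.history:
            i = self._index()
            entry = self.correlating[i]
            if entry is None or (entry[0] != actual and entry[1] == 0):
                self.correlating[i] = [actual, 1]
            elif entry[0] == actual:
                entry[1] = min(entry[1] + 1, 3)
            else:
                entry[1] -= 1
            self.secondary[self.history[-1] % self.size] = actual
        self.history = (self.history + [actual])[-self.depth:]


ntp = NextTracePredictor()
for trace_id in [1, 2, 3] * 4:
    ntp.update(trace_id)
first = ntp.predict()
ntp.update(1)
second = ntp.predict()
```

After a few passes over the repeating sequence 1, 2, 3, the correlating table predicts the next trace ID from the path of recent traces; the secondary table covers histories the correlating table has not learned yet.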

NTP Index Generation

Return History Stack

Trace Cache vs. Existing Techniques

Trace Cache Optimizations • Performance • Partial matching [Friendly, Patel, Patt (1997)] • Inactive issue [Friendly, Patel, Patt (1997)] • Trace preconstruction [Jacobson, Smith (2000)] • Power • Sequential access trace cache [Hu, et al., (2002)] • Dynamic direction prediction based trace cache [Hu, et al., (2003)] • Micro-operation cache [Solomon, et al., (2003)]
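A minimal sketch of partial matching, assuming the trace's start address has already matched. The argument names and the sample trace are hypothetical.

```python
def partial_match(trace_insts, branch_positions, stored_outcomes, predictions):
    """Return the usable prefix of a trace whose tag already matched.

    branch_positions: index of each branch within trace_insts.
    stored_outcomes:  direction each branch took when the trace was built.
    predictions:      current predicted direction for each branch.
    """
    for pos, stored, predicted in zip(branch_positions, stored_outcomes, predictions):
        if stored != predicted:
            # Keep everything up to and including the diverging branch.
            return trace_insts[:pos + 1]
    return trace_insts


insts = ["i0", "i1", "b0", "i3", "i4", "b1", "i6", "i7"]
full = partial_match(insts, [2, 5], [True, False], [True, False])
prefix = partial_match(insts, [2, 5], [True, False], [True, True])
short = partial_match(insts, [2, 5], [True, False], [False, False])
```

Without partial matching, the second and third lookups would be treated as misses. Inactive issue goes one step further by also dispatching the instructions after the diverging branch, marked inactive until that branch resolves.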

Trace Processors • Trace Processor Architecture • Processing elements (PE) • Trace-sized instruction buffer • Multiple dedicated functional units • Local register file • Copy of global register file • Use hierarchy to distribute execution resources • Addresses superscalar processor issues • Complexity • Simplified multiple branch prediction (next trace prediction) • Elimination of local dependence checking (local register file) • Decentralized instruction issue and result bypass logic • Architectural limitations • Reduced bandwidth pressure on global register file (local register files)
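The split between local and global register files can be illustrated by classifying the values of a single trace. This is a sketch under assumptions: a trace is modeled as a list of hypothetical `(dest, sources)` tuples, and a value counts as local only if a later instruction in the same trace overwrites its register.

```python
def classify(trace):
    """Classify register values in a trace.

    Returns (live_ins, local_writers, live_outs):
      live_ins      registers read before being written in the trace
      local_writers indices of instructions whose result is overwritten
                    later in the same trace (value never leaves the PE)
      live_outs     registers whose final value other traces may read
    """
    live_ins, last_write, local_writers = set(), {}, []
    for i, (dest, sources) in enumerate(trace):
        for src in sources:
            if src not in last_write:
                live_ins.add(src)  # produced by an earlier trace
        if dest is not None:
            if dest in last_write:
                local_writers.append(last_write[dest])
            last_write[dest] = i
    return live_ins, sorted(local_writers), set(last_write)


trace = [
    ("r1", ["r2", "r3"]),  # r1 = r2 op r3
    ("r4", ["r1"]),        # r4 = op r1
    ("r1", ["r4", "r5"]),  # r1 = r4 op r5 (overwrites the first r1)
]
live_ins, local_writers, live_outs = classify(trace)
```

Only live-ins and live-outs need the global register file, which is the source of the reduced bandwidth pressure; dependences among local values can be resolved once when the trace is built instead of on every fetch.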

Trace Processor

Trace Cache Variations • Block-based trace cache (BBTC) • Black, Rychlik, Shen (1999) • Less storage capacity needed
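A rough sketch of why a block-based organization needs less storage: each basic block is stored once in a block cache, and traces hold only block pointers. The bit widths, block names, and sizes below are assumptions for the example.

```python
def storage_bits(traces, block_sizes, inst_bits=32, pointer_bits=8):
    """Compare storage for the same set of traces.

    traces:      list of tuples of block IDs
    block_sizes: block ID -> number of instructions
    Returns (conventional_bits, block_based_bits).
    """
    # Conventional trace cache: every trace stores its own instruction copies.
    conventional = inst_bits * sum(block_sizes[b] for t in traces for b in t)
    # Block-based: one copy per block, plus one pointer per block per trace.
    unique_blocks = {b for t in traces for b in t}
    block_based = (inst_bits * sum(block_sizes[b] for b in unique_blocks)
                   + pointer_bits * sum(len(t) for t in traces))
    return conventional, block_based


def assemble(trace, block_cache):
    """Rebuild a trace's instruction sequence from the block cache."""
    return [inst for block_id in trace for inst in block_cache[block_id]]


sizes = {"A": 4, "B": 6, "C": 5, "D": 3}
traces = [("A", "B", "D"), ("A", "C", "D"), ("B", "D", "A")]
conventional, block_based = storage_bits(traces, sizes)
```

The saving grows with the number of traces that share blocks; the cost is the extra step of assembling a trace from several block cache reads.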

Trace Table: BBTC Trace Prediction

Block Cache

Rename Table

BBTC Optimization • Completion time multiple branch prediction (Rakvic, et al., 2000) • Improvement over trace table predictions

Tree-based Multiple Branch Prediction

Tree-PHT

Tree-PHT Update
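The figures for these slides are not in the transcript, so the following is only one plausible reading of a tree-structured pattern history table entry: a binary tree of 2-bit counters in which each predicted direction selects the counter used for the next branch. The depth, heap layout, and initial counter values are assumptions.

```python
class TreePHTEntry:
    def __init__(self, depth=3):
        self.depth = depth
        # Heap layout: node 0 is the root; the children of node i are
        # 2i+1 (not taken) and 2i+2 (taken). Counters start weakly not-taken.
        self.counters = [1] * ((1 << depth) - 1)

    def predict(self):
        """Walk the tree, returning one predicted direction per level."""
        node, path = 0, []
        for _ in range(self.depth):
            taken = self.counters[node] >= 2
            path.append(taken)
            node = 2 * node + (2 if taken else 1)
        return path

    def update(self, outcomes):
        """Train only the counters along the path actually followed."""
        node = 0
        for taken in outcomes[:self.depth]:
            c = self.counters[node]
            self.counters[node] = min(c + 1, 3) if taken else max(c - 1, 0)
            node = 2 * node + (2 if taken else 1)


entry = TreePHTEntry()
before = entry.predict()
entry.update([True, False, True])
after = entry.predict()
```

Because the update follows the resolved outcomes, counters on paths that were not executed are left untouched, which fits updating at completion time rather than at prediction time.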

Trace Cache Variations (cont’d) • Software trace cache • Ramirez, Larriba-Pey, Navarro, Torrellas (1999) • Profile-directed code reordering to maximize sequentiality • Convert taken branches to not-taken • Move unused basic blocks out of execution path • Inline frequent basic blocks • Map most popular traces to reserved area of i-cache
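A greedy sketch of profile-directed reordering. It is not the algorithm from the cited paper, only an illustration of the idea; the profile format, threshold, and block names are hypothetical.

```python
def layout_blocks(edge_counts, entry, threshold=1):
    """Chain basic blocks along the most frequent profiled successors.

    edge_counts: {block: {successor: execution count}}
    Returns (hot_order, cold_blocks). Blocks left out of the hot chain
    would be moved out of the main execution path.
    """
    order, placed = [], set()
    block = entry
    while block is not None and block not in placed:
        order.append(block)
        placed.add(block)
        candidates = [(count, succ)
                      for succ, count in edge_counts.get(block, {}).items()
                      if succ not in placed and count >= threshold]
        # The hottest successor becomes the fall-through (not-taken) path.
        block = max(candidates)[1] if candidates else None
    all_blocks = set(edge_counts) | {s for d in edge_counts.values() for s in d}
    return order, sorted(all_blocks - placed)


profile = {
    "A": {"B": 90, "C": 10},
    "B": {"D": 90},
    "C": {"D": 10},
    "D": {},
}
hot, cold = layout_blocks(profile, "A", threshold=20)
```

With this layout the frequent path A, B, D is contiguous in memory, so a conventional instruction cache can fetch it sequentially, and the rarely executed block C no longer interrupts it.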
