CS 7810 Lecture 7 Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching E. Rotenberg, S. Bennett, J.E. Smith, Proceedings of MICRO-29, 1996
Fetching Multiple Blocks • Aggressive o-o-o processors will perform poorly if they only fetch a single basic block every cycle • Solution: • Predict multiple branches and targets in a cycle • Fetch multiple cache lines in the cycle • Initiate the next set of fetches in the next cycle
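As a toy illustration of the solution steps above, here is a minimal Python sketch of a fetch unit that assembles one fetch packet from up to three basic blocks per cycle, consuming one branch prediction per block boundary. The block table, layout, and addresses are invented for illustration:

```python
# Minimal sketch of a multi-block fetch unit (hypothetical program layout).
# Each basic block is (start_pc, num_instrs, taken_target, fallthrough_pc).
BLOCKS = {
    0x00: (0x00, 4, 0x40, 0x10),
    0x10: (0x10, 3, 0x50, 0x20),
    0x20: (0x20, 5, 0x00, 0x30),
    0x30: (0x30, 4, 0x00, 0x40),
    0x40: (0x40, 2, 0x00, 0x50),
    0x50: (0x50, 6, 0x10, 0x60),
    0x60: (0x60, 4, 0x00, 0x00),
}

def fetch_cycle(pc, predictions, max_blocks=3):
    """Assemble one fetch packet: follow up to max_blocks basic blocks,
    consuming one branch prediction (True = taken) per block boundary."""
    packet, next_pc = [], pc
    for taken in predictions[:max_blocks]:
        start, n, target, fall = BLOCKS[next_pc]
        packet.extend(range(start, start + n))   # instruction addresses
        next_pc = target if taken else fall      # next cycle starts here
    return packet, next_pc

# One cycle: 4 + 3 + 6 = 13 instructions fetched across three blocks.
packet, next_pc = fetch_cycle(0x00, [False, True, False])
```

The last prediction in the group determines where the next cycle's fetch begins, matching the "initiate the next set of fetches in the next cycle" step.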
Without the Trace Cache • Stage 1 requires identification of predictions and target addresses • Stage 2 requires multi-ported access of the I-cache • Stage 3 requires shifting and alignment
Trace Cache • [Figure: a control-flow tree of basic blocks A–G; branch outcomes (0/1) select paths through the tree, yielding traces such as ABD, ABE, ACF] • Takes advantage of temporal locality and biased branches • Does not require multiple I-cache accesses
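The idea can be sketched in a few lines of Python: a trace is identified by its starting PC plus the outcomes of the branches inside it, so a repeated (biased) path hits in a single lookup. All names, addresses, and instruction labels here are invented:

```python
# Sketch of trace-cache lookup, keyed by (start PC, branch outcomes).
trace_cache = {}

def record_trace(start_pc, outcomes, instrs):
    """Fill path: store an assembled trace under its path identifier."""
    trace_cache[(start_pc, tuple(outcomes))] = instrs

def lookup(start_pc, predicted_outcomes):
    """Hit only if the start PC and every branch prediction match the
    stored trace; otherwise fall back to the conventional I-cache."""
    return trace_cache.get((start_pc, tuple(predicted_outcomes)))

# Trace ABD executed once with outcomes (0, 0); biased branches mean the
# same path repeats, so later fetches get the whole trace in one access.
record_trace(0xA0, [0, 0], ["A0", "A1", "B0", "D0"])
hit  = lookup(0xA0, [0, 0])   # same path: hit
miss = lookup(0xA0, [0, 1])   # different path: miss (None)
```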
Base Case • In each cycle, fetch up to three sequential basic blocks
Multiple Branch Predictor • [Figure: a k-bit global branch history indexes the PHT; muxes driven by the k, k-1, ... most recent history bits select multiple predictions per cycle]
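A simplified sketch of the idea: the first branch is predicted with the full k-bit history, and each later branch in the group is predicted by speculatively shifting the just-made prediction into the history. This is a simplification of the paper's PHT organization, with invented sizes:

```python
# Simplified multiple-branch predictor: one pattern history table (PHT)
# of 2-bit counters, indexed by a k-bit global history register.
K = 4
PHT = [2] * (1 << K)              # 2-bit counters, init weakly taken

def predict_n(history, n=3):
    """Predict n branches in one cycle by speculatively extending the
    global history with each prediction as it is made."""
    preds = []
    h = history & ((1 << K) - 1)
    for _ in range(n):
        taken = PHT[h] >= 2                        # MSB of the counter
        preds.append(taken)
        h = ((h << 1) | taken) & ((1 << K) - 1)    # speculative update
    return preds
```

On a misprediction, the speculative history bits would be squashed and the real outcomes shifted in instead.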
Trace Cache Design • The branch predictions can be used to index into the trace cache or for tag comparison (Fig. 4) • Keep track of next address (taken and not-taken) • Line buffer and merge logic assembles traces
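The line-fill buffer's merge step can be sketched as follows: retired basic blocks are appended to a pending trace until the line limits (16 instructions and 3 branches, per the paper's configuration) would be exceeded. The helper below is a hypothetical simplification:

```python
# Sketch of line-fill merge logic: accumulate retired basic blocks into a
# trace line, bounded by 16 instructions and 3 branches (paper's limits).
MAX_INSTRS, MAX_BRANCHES = 16, 3

def build_trace(blocks):
    """blocks: list of (instr_count, ends_in_branch). Returns the prefix
    of blocks that fits in one trace line, plus the totals."""
    trace, instrs, branches = [], 0, 0
    for n, is_branch in blocks:
        if instrs + n > MAX_INSTRS:
            break                      # instruction limit reached
        if is_branch and branches == MAX_BRANCHES:
            break                      # branch limit reached
        trace.append((n, is_branch))
        instrs += n
        branches += is_branch
    return trace, instrs, branches
```

The completed trace is then written into the trace cache under its (start PC, branch outcomes) identifier.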
Design Alternatives • Associativity (including path associativity) • Partial matches – use all instructions up to the first mismatch between the stored trace and the predicted path • Multiple line-fill buffers • Trace selection to reduce conflicts • Multi-cycle trace caches?
Branch Address Cache • The BTB maintains 14 addresses per entry – a tree of basic-block targets, 2 + 4 + 8 addresses one, two, and three branches deep • Based on the branch predictions, three addresses are forwarded to the I-Cache • BTB extension that allows multiple target predictions • adds pipeline stages • can still have I-Cache bank contention
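A sketch of how the 14 stored addresses (2 + 4 + 8, one to three branches deep) could be selected by three predictions; the entry layout and function names are invented for illustration:

```python
# Sketch of Branch Address Cache selection (layout invented): each entry
# holds 14 addresses, the possible targets 1, 2, and 3 branches past the
# current fetch address.
def select_addresses(entry, preds):
    """entry: 14 addresses laid out as [level1 (2), level2 (4), level3 (8)].
    preds: three taken (1) / not-taken (0) bits. Returns the two follow-on
    fetch addresses for this cycle plus the next cycle's start address."""
    p1, p2, p3 = preds
    lvl1, lvl2, lvl3 = entry[0:2], entry[2:6], entry[6:14]
    a2 = lvl1[p1]                                  # second block this cycle
    a3 = lvl2[(p1 << 1) | p2]                      # third block this cycle
    next_pc = lvl3[(p1 << 2) | (p2 << 1) | p3]     # next cycle's address
    return a2, a3, next_pc

# Dummy addresses 0..13: preds (1, 0, 1) pick entries 1, 4, and 11.
picked = select_addresses(list(range(14)), (1, 0, 1))
```

The three selected addresses then index the I-cache banks, which is where the bank-contention problem noted above arises.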
Collapsing Buffer • Can detect taken branches within a single cache line • Also suffers from merge-logic delays and bank contention
Methodology • Very aggressive o-o-o processor – large window (2048 instrs), unlimited resources, no artificial dependences, no cache misses • SPEC92-Int and Instruction Benchmark Suite (IBS) • Trace cache – 64 entries, 16 instrs and 3 branches per entry – 712 tag bytes and 4KB worth of instructions – I-Cache is 128KB
Results • Fetching three sequential basic blocks (SEQ.3) is not much more complex than fetching one – IPC improvement of ~15% • Trace cache outperforms BAC and CB – note that the latter two can't handle all kinds of trace patterns and suffer from I-Cache bank contention • TC outperforms SEQ.3 by 12% • BAC and CB do worse than SEQ.3 if they increase front-end latency
Ideal Fetch • The trace cache is within 20% of ideal fetch • The trace miss rate is fairly high – 18-76% • Up to 60% of instructions do not come from the trace cache • A larger trace cache comes within 10% of ideal fetch – note that the front-end is the bottleneck in this processor