Data Cache Prefetching using a Global History Buffer

Written by: Kyle Nesbit and James Smith, Department of Electrical and Computer Engineering, University of Wisconsin, Madison
Presented by: Chuck (Chengyan) Zhao, Mar 30, 2004

Introduction
• Cache hierarchy:
  - CPU registers: very small in number, fastest
  - L1 cache: usually 8 KB; larger than the register file, slower than registers
  - L2 cache: usually 256/512 KB; larger than L1, slower than L1
  - L3 cache (optional): usually 1/2 MB; larger than L2, slower than L2
  - Main memory: usually 256/512 MB or more; larger than L3, slowest
CPU-Memory Cache Hierarchy

Each level of the cache hierarchy:
• Latency is around 10 times that of the level above it
Problems with the cache hierarchy architecture:
• Limited capacity (size)
• Limited associativity
Solution to these problems: effective prefetching

2. Prefetching techniques
• Sequential prefetching
  - What: prefetch the cache lines that immediately follow the current (missed) cache line
  - Algorithm:
    - Early: prefetch after each cache miss
    - Mature: issue prefetches only after a sequential access pattern has been established
  - Degree of prefetching:
    - Maximum number of cache lines prefetched in response to a single prefetch request
    - Chosen to completely hide the latency of a miss to main memory
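The "early" form of sequential prefetching above can be sketched in a few lines. This is only an illustration: the function name `sequential_prefetch`, the 64-byte line size, and the default degree of 4 are assumptions made for the example, not values taken from the slides' figures.

```python
def sequential_prefetch(miss_addr, line_size=64, degree=4):
    """Return the start addresses of the `degree` cache lines after the missed line.

    Hypothetical helper: line_size and degree are illustrative defaults.
    """
    # Align the miss address down to the start of its cache line.
    line_base = miss_addr - (miss_addr % line_size)
    # Prefetch the next `degree` lines in ascending address order.
    return [line_base + k * line_size for k in range(1, degree + 1)]


print([hex(a) for a in sequential_prefetch(0x1010)])
# ['0x1040', '0x1080', '0x10c0', '0x1100']
```

A larger `degree` hides more miss latency but risks fetching lines that are never used.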

2. Table-based prefetching
• What: record history information related to data accesses
• Operation:
  - The table is accessed with a key (the program counter of the load instruction, or the miss address)
  - The history information is used to predict what to prefetch
• Evaluation:
  - Pro: simple
  - Con: inefficient
    - Fixed amount of history for each prefetch key
    - Stale data: an entry can sit in the table for a very long time; by the time it is used, the memory access behavior has changed

3. Global History Buffer (GHB) prefetching
• Organization: Fig. 1(b)
• Features:
  - FIFO table: cache misses enter at the bottom and move up toward the top
  - Separate Index Table (IT) and GHB
  - Fixed table size
  - Circular table: existing entries are overwritten when the table overflows

Benefits of GHB:
• Reduces stale data
• More accurate construction of access history patterns
• More effective prefetching algorithms

4. Table-based prefetching techniques
• Stride prefetching: Fig. 2
  - The following addresses are prefetched: a + s, a + 2s, …, a + d·s
  - a: target address; s: detected stride; d: degree of prefetching
  - Note: in this case, the stride s is constant
• Correlation prefetching (Markov prefetching): Fig. 3 (explain)
  - Uses a history table to record cache misses
  - The miss address indexes the correlation table
  - Each entry:
    - List of addresses that have immediately followed the current miss address
    - Most recent miss first
  - Markov graph:
    - Each node: a cache miss address
    - Each edge: probability that the source will be immediately followed by the target
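The two table-based techniques above can be sketched as follows. This is a simplified software model, not the hardware organization in Fig. 2 or Fig. 3: `stride_prefetch`, `MarkovTable`, and the `ways=2` limit on successors per entry are names and parameters chosen for the example.

```python
def stride_prefetch(a, s, d):
    """Constant-stride targets a + s, a + 2s, ..., a + d*s."""
    return [a + k * s for k in range(1, d + 1)]


class MarkovTable:
    """Correlation table: miss address -> addresses that followed it."""

    def __init__(self, ways=2):
        self.ways = ways      # assumed fixed history per entry
        self.table = {}       # miss address -> successors, most recent first
        self.prev = None      # previous miss address

    def miss(self, addr):
        """Record a miss and return the predicted successors of `addr`."""
        if self.prev is not None:
            succ = self.table.setdefault(self.prev, [])
            if addr in succ:
                succ.remove(addr)
            succ.insert(0, addr)      # most recent miss first
            del succ[self.ways:]      # keep only `ways` successors
        self.prev = addr
        return list(self.table.get(addr, []))


print(stride_prefetch(100, 8, 3))     # [108, 116, 124]

markov = MarkovTable()
for address in (1, 2, 3, 1, 4):
    markov.miss(address)
print(markov.miss(1))                 # [4, 2]
```

In the Markov example, address 1 was followed first by 2 and later by 4, so the final miss on 1 predicts 4 (most recent) and then 2.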

3. Distance prefetching: Fig. 4 (explain)
• Generalized correlation prefetching
• Uses the distance between two consecutive global miss addresses to index the correlation table

Problems with table-based prefetching:
• Table data becomes stale: neither used nor refreshed
• Table entry conflicts: multiple access keys map to the same table entry
• Fixed, small amount of history per entry: in Fig. 3, two pieces of history per data item

5. Global History Buffer (GHB) based prefetching
• Table structure: Fig. 1(b)
• IT: Index Table
  - Accessed by key, as in traditional table-based prefetching
  - Key: program counter, cache miss address, or a combination of the two
  - Holds pointers into the GHB
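Distance prefetching, item 3 on this slide, differs from Markov prefetching only in what indexes the table: the delta between consecutive misses rather than the miss address itself. A minimal sketch, assuming a dictionary-backed table and a hypothetical limit of two successor deltas per entry:

```python
class DistancePrefetcher:
    """Correlation table indexed by the distance between consecutive misses."""

    def __init__(self, ways=2):
        self.ways = ways          # assumed number of successor deltas kept
        self.table = {}           # delta -> deltas that followed it, most recent first
        self.last_addr = None
        self.last_delta = None

    def miss(self, addr):
        """Record a global miss and return prefetch candidates."""
        prefetches = []
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if self.last_delta is not None:
                succ = self.table.setdefault(self.last_delta, [])
                if delta in succ:
                    succ.remove(delta)
                succ.insert(0, delta)
                del succ[self.ways:]
            # Predict: apply the deltas that previously followed this delta.
            prefetches = [addr + d for d in self.table.get(delta, [])]
            self.last_delta = delta
        self.last_addr = addr
        return prefetches


p = DistancePrefetcher()
print([p.miss(a) for a in (100, 104, 108, 112)])
# [[], [], [112], [116]]
```

Because the table is keyed by delta, a pattern learned at one address range can be reused at another, which is what makes the scheme a generalization of address correlation.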

GHB (cont.)
• GHB: n-entry circular FIFO table
  - Holds the n most recent misses
  - Each entry contains:
    - Global miss address
    - Link pointer: chains GHB entries into an address list (access history for the same key, e.g. the same miss address)

Notation used later:
• Prefetching method: X / Y
  - X:
    - PC: program counter based indexing
    - G: global address
  - Y:
    - CS: constant stride
    - DC: delta correlation
    - AC: address correlation
• Different combinations of X and Y yield different prefetching methods
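The IT plus GHB structure described above can be modeled in software roughly as follows. This is a sketch under stated assumptions, not the hardware design: the class name `GlobalHistoryBuffer`, the dictionary used as the Index Table, and the use of unbounded absolute indices as pointers are all simplifications chosen for the example.

```python
class GlobalHistoryBuffer:
    """n-entry circular FIFO of misses, chained per key through link pointers."""

    def __init__(self, size=8):
        self.size = size
        self.addr = [None] * size   # global miss address of each entry
        self.link = [None] * size   # pointer to the previous entry with the same key
        self.head = 0               # number of misses inserted so far
        self.index_table = {}       # IT: key -> pointer to the most recent entry

    def insert(self, key, miss_addr):
        """Add a miss at the head, overwriting the oldest entry when full."""
        slot = self.head % self.size
        self.addr[slot] = miss_addr
        self.link[slot] = self.index_table.get(key)
        self.index_table[key] = self.head
        self.head += 1

    def history(self, key):
        """Miss addresses recorded for `key`, most recent first."""
        out = []
        ptr = self.index_table.get(key)
        # Stop when the chain ends or points at an overwritten entry.
        while ptr is not None and self.head - ptr <= self.size:
            slot = ptr % self.size
            out.append(self.addr[slot])
            ptr = self.link[slot]
        return out


ghb = GlobalHistoryBuffer(size=4)
for pc, address in (("A", 100), ("B", 500), ("A", 108), ("B", 510), ("A", 116)):
    ghb.insert(pc, address)
print(ghb.history("A"))   # [116, 108]
print(ghb.history("B"))   # [510, 500]
```

With a key of "PC" this corresponds to the PC-indexed methods in the X / Y notation; the oldest miss for key "A" (address 100) has been overwritten by the fifth insert, so the chain walk stops before reaching it.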

2. GHB for correlation prefetching
• Fig. 5
• Explain: breadth first, shaded area

3. GHB for stride prefetching
• PC / CS
• Use Fig. 5 again to explain (depth first)

6. Global History Buffer (GHB) error handling
• An error can occur:
  - How: when a GHB entry is overwritten, pointers to it become obsolete because the information they referred to has been replaced
• Solution:
  - Make pointers wider than needed to index the table; the low-order bits reference the entry
  - Compare: if (head pointer - ref pointer) > table size, the pointer is invalid (an error)
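The stale-pointer comparison above can be sketched with fixed-width pointers. The function name `pointer_is_stale` and the `ptr_bits` parameter are assumptions for this example, as is the convention that the head pointer counts insertions, so that the newest entry is at distance 1 from the head.

```python
def pointer_is_stale(head, ref, table_size, ptr_bits):
    """True if the entry `ref` points to has been overwritten.

    Pointers are `ptr_bits` wide and wrap around, so the subtraction is
    done modulo 2**ptr_bits. The low-order bits of `ref` would select
    the table slot; the extra bits make overwrites detectable.
    """
    distance = (head - ref) % (1 << ptr_bits)
    return distance > table_size


# 4-entry table with 4-bit pointers (2 index bits plus 2 extra bits).
print(pointer_is_stale(head=5, ref=2, table_size=4, ptr_bits=4))   # False
print(pointer_is_stale(head=9, ref=2, table_size=4, ptr_bits=4))   # True
print(pointer_is_stale(head=1, ref=14, table_size=4, ptr_bits=4))  # False
```

The last call shows the head pointer wrapping past 15 back to 1 while the referenced entry is only three insertions old. A limitation of this sketch is that a pointer older than 2**ptr_bits insertions aliases with a recent one, so the pointer width bounds how stale a pointer may become before the check fails.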

7. GHB evaluation
• GHB benefits:
  - FIFO:
    - First-in, first-out buffer
    - Naturally gives table space to the most recent history
  - Separation of IT and GHB:
    - IT (Index Table): holds the working set of prefetching lists; relatively small
    - GHB: larger than the IT; sized to hold the miss address stream
    - Benefit of this design: enables more sophisticated prefetching methods (shown later)
• GHB drawback:
  - Multiple accesses are needed to collect prefetching information (internal linked-list traversal)

7. GHB evaluation (cont.)
3. Types of GHB prefetching:
• Width prefetching:
  - Prefetch only the immediately adjacent nodes
  - E.g. in Fig. 5
• Depth prefetching:
  - Begin with the current miss
  - Follow the sequence of most likely nodes along its path
  - Prefetch at each node
  - E.g. in Fig. 5
• Hybrid:
  - Mix of width and depth prefetching

4. New prefetching technique: Global / Delta Correlation
• What: non-constant-stride prefetching
• Example: Table 1
  - Pattern: {0, 1, 1, 62, 1, 1, …}, from accessing the first 3 elements of each row of a 2-dimensional array
  - Constant stride: predicts {1, 1, 1, 1, …} and so prefetches incorrect addresses

Non-const address stream
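A non-constant address stream of this kind can be made concrete with a small delta-correlation sketch. The address list below is an assumed example (first three elements of each 64-element row), chosen to produce the repeating deltas 1, 1, 62; it is not copied from Table 1. The function names are hypothetical, and the matching rule (find the most recent earlier occurrence of the last two deltas, then replay what followed) is a simplified reading of the delta-pair idea.

```python
def constant_stride_prefetch(miss_addrs, degree=3):
    """Baseline: repeat the most recent delta `degree` times."""
    stride = miss_addrs[-1] - miss_addrs[-2]
    return [miss_addrs[-1] + k * stride for k in range(1, degree + 1)]


def delta_correlation_prefetch(miss_addrs, degree=3):
    """Replay the deltas that followed the last occurrence of the current delta pair."""
    deltas = [b - a for a, b in zip(miss_addrs, miss_addrs[1:])]
    if len(deltas) < 3:
        return []
    key = (deltas[-2], deltas[-1])
    # Search backward for an earlier occurrence of the same delta pair.
    for i in range(len(deltas) - 2, 0, -1):
        if (deltas[i - 1], deltas[i]) == key:
            prefetches, addr = [], miss_addrs[-1]
            for d in deltas[i + 1:i + 1 + degree]:
                addr += d
                prefetches.append(addr)
            return prefetches
    return []


stream = [0, 1, 2, 64, 65, 66, 128, 129]      # assumed example stream
print(constant_stride_prefetch(stream))       # [130, 131, 132]
print(delta_correlation_prefetch(stream))     # [130, 192, 193]
```

The constant-stride baseline runs past the third element of the row, while the delta-pair match (62, 1) correctly predicts the jump to the next row.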

4. New prefetching technique (cont.)
Using the GHB:
• Follow the sequence of the load's miss addresses
• Detect variable stride steps
• Use delta pairs (Table 1) to predict

8. Simulation and testing
• Simulator and its configuration:
  - Configuration: Table 4
  - SimpleScalar 3.0
• Other details:
  - Each access to the IT: 1 cycle
  - Each access to the GHB: 1 cycle
  - Degree of prefetching: 4
• Benchmarks under an ideal L2 cache: Table 2 and Table 3
• GHB training set:
  - Use some benchmarks to decide the optimal table size for the IT and the GHB
  - Table size results: Table 6

4. GHB Testing: Global / Delta Correlation

5. GHB Testing: PC / Local Prefetching
• Compares GHB PC / CS and GHB PC / DC with table-based PC / CS

Conclusion
Global History Buffer based prefetching:
• 2-level table hierarchy:
  - IT: Index Table
  - GHB: Global History Buffer
• Performance improvements:
  - Generally performs as well as or better than the compared table-based prefetching on 14 out of 15 tested benchmarks
  - Increases IPC
  - Reduces memory traffic
• Advantages:
  - Reduces stale data
  - Increases prediction accuracy
  - Reduces memory traffic
  - Enables further prediction opportunities: variable-step striding
• Disadvantage:
  - Multiple table accesses are needed to build the history information
  - However, the extra delay is relatively small and tolerable
