Prefetching Using a Global History Buffer

Kyle J. Nesbit and James E. Smith

Outline
• Motivation
• Related Work
• Global History Buffer Prefetching
• Results
• Conclusion

Motivation
• D-cache misses to main memory are of increasing importance
  • Main memory is getting farther away (in clock cycles)
  • Many demanding, memory-intensive workloads
• Computation is inexpensive compared to data accesses
  • Good opportunity to reevaluate prefetching data structures
  • Simple computation can supplement table information
• We consider prefetches from main memory to the lowest-level cache (the L2 cache in this study)

Markov Prefetching
• Markov prefetching forms address correlations
  • Joseph and Grunwald (ISCA ’97)
• Uses global memory addresses as states in the Markov graph
• Correlation table approximates the Markov graph
[Figure: miss address stream A B C A B C B C . . .; its Markov graph, in which A leads to B with probability 1, B leads to C with probability 1, and C leads to A or B with probability .5 each; and the correlation table, with columns miss address, 1st predict., and 2nd predict.]
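The correlation table on this slide can be sketched in a few lines. This is a minimal illustration, not the Joseph and Grunwald design: `build_correlation_table` and its `width` parameter are hypothetical names, and ordering predictions by how often a successor was seen is an assumption made here for simplicity.

```python
from collections import Counter, defaultdict


def build_correlation_table(miss_stream, width=2):
    """Map each miss address to the addresses that most often followed it."""
    successors = defaultdict(Counter)
    for current, following in zip(miss_stream, miss_stream[1:]):
        successors[current][following] += 1
    # Keep at most `width` predictions per miss address (assumed ordering: by count).
    return {
        address: [nxt for nxt, _ in counts.most_common(width)]
        for address, counts in successors.items()
    }


# Miss address stream from the slide.
stream = ["A", "B", "C", "A", "B", "C", "B", "C"]
table = build_correlation_table(stream)
```

On a miss to an address, the prefetcher would look up that address and prefetch the listed predictions; here A maps to B, B maps to C, and C maps to both A and B.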

Correlation Prefetching
• Distance prefetching forms delta correlations
  • Kandiraju and Sivasubramaniam (ISCA ’02)
• Delta-based prefetching leads to a much smaller table than “classical” Markov prefetching
• Delta-based prefetching can remove compulsory misses
[Figure: Markov prefetching vs. distance prefetching. Miss address stream: 27 28 29 27 28 29 28 29. Global delta stream: 1 1 -2 1 1 -1 1. The Markov table is indexed by miss address (27, 28, 29) and the distance prefetching table by global delta (1, -2, -1); each table has 1st predict. and 2nd predict. columns.]
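To make the delta idea concrete, here is a small sketch that turns the slide's miss address stream into a global delta stream and builds a delta-indexed table. The helper names (`global_deltas`, `build_delta_table`, `predict`) are hypothetical, and frequency-ordered predictions are an assumption, not a detail taken from the Kandiraju and Sivasubramaniam design.

```python
from collections import Counter, defaultdict


def global_deltas(miss_stream):
    """Differences between consecutive miss addresses."""
    return [b - a for a, b in zip(miss_stream, miss_stream[1:])]


def build_delta_table(miss_stream, width=2):
    """Map each delta to the deltas that most often followed it."""
    deltas = global_deltas(miss_stream)
    successors = defaultdict(Counter)
    for current, following in zip(deltas, deltas[1:]):
        successors[current][following] += 1
    return {
        delta: [nxt for nxt, _ in counts.most_common(width)]
        for delta, counts in successors.items()
    }


def predict(table, last_delta, current_address):
    """Apply the predicted deltas to the current miss address."""
    return [current_address + d for d in table.get(last_delta, [])]


stream = [27, 28, 29, 27, 28, 29, 28, 29]
deltas = global_deltas(stream)        # [1, 1, -2, 1, 1, -1, 1]
table = build_delta_table(stream)
candidates = predict(table, deltas[-1], stream[-1])
```

The first candidate is address 30, which never appears in the stream. That is the sense in which a delta-based scheme can cover compulsory misses that an address-indexed table cannot.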

Global History Buffer (GHB)
• Holds miss address history in FIFO order
• Linked lists within the GHB connect related addresses
  • Same static load
  • Same global miss address
  • Same global delta
• Linked list walk is short compared with L2 miss latency
[Figure: an index table (keyed, for example, by load PC) points into the Global History Buffer, a FIFO of miss addresses.]

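GHB - Example
[Figure: miss address stream 27 28 29 27 28 29 28, with 29 as the current miss. The index table is keyed by global miss address (27, 28, 29) and holds a pointer into the Global History Buffer. Each GHB entry holds a miss address and a pointer; the head pointer marks the newest entry. Legend: Key, Current, Prefetches.]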
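The structure in this example can be sketched as a circular FIFO plus an index table. This is a toy model under stated assumptions: the class and method names are hypothetical, the index table is an unbounded dictionary rather than a fixed-size hardware table, and stale links are detected with sequence numbers, which is one possible way to notice that an entry has been overwritten.

```python
class GlobalHistoryBuffer:
    """Toy GHB: a circular FIFO of (miss_address, link) entries plus an index table."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size  # slot = sequence number modulo size
        self.index_table = {}         # key -> sequence number of the newest entry
        self.count = 0                # entries inserted so far (head position)

    def _live(self, seq):
        # An entry is still in the FIFO only if fewer than `size` inserts followed it.
        return seq is not None and self.count - self.size <= seq < self.count

    def insert(self, key, miss_address):
        link = self.index_table.get(key)  # previous entry with the same key
        self.entries[self.count % self.size] = (miss_address, link)
        self.index_table[key] = self.count
        self.count += 1

    def walk(self, key):
        """Yield sequence numbers of live entries for `key`, newest first."""
        seq = self.index_table.get(key)
        while self._live(seq):
            yield seq
            seq = self.entries[seq % self.size][1]

    def address_at(self, seq):
        return self.entries[seq % self.size][0]


def address_correlation_prefetch(ghb, miss_address, width=2):
    """Record a miss keyed by its own address, then prefetch what followed it before."""
    ghb.insert(miss_address, miss_address)
    prefetches = []
    for seq in ghb.walk(miss_address):
        follower = seq + 1
        if follower < ghb.count:  # the current miss has no follower yet
            address = ghb.address_at(follower)
            if address not in prefetches:
                prefetches.append(address)
        if len(prefetches) == width:
            break
    return prefetches


stream = [27, 28, 29, 27, 28, 29, 28, 29]
ghb = GlobalHistoryBuffer(size=8)
for address in stream:
    result = address_correlation_prefetch(ghb, address)
# After the final miss to 29, result is [28, 27].
```

With a smaller buffer the oldest occurrence of 29 falls out of the FIFO, so the walk stops early and only 28 is predicted.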

GHB – Deltas
[Figure: miss address stream 27 28 36 44 45 49 53 54 62 70, with 71 as the current miss. Global delta stream: 1 8 8 1 4 4 1 8 8, with 1 as the current delta. A Markov graph over the deltas 1, 4, and 8 has edges labeled .3 and .7. Panels show width, depth, and hybrid prefetching. Legend: Key, Current, Prefetches.]
Prefetch computations shown: 71 + 8 => 79, 71 + 4 => 75, 79 + 8 => 87, 75 + 4 => 79

GHB – Hybrid Delta
• Width prefetching suffers from poor accuracy and short look-ahead
• Depth prefetching has good look-ahead, but may miss prefetch opportunities when a number of “next” addresses have similar probability
• The hybrid method combines depth and width

GHB - Hybrid Example
[Figure: miss address stream 27 28 36 44 45 49 53 54 62 70, with 71 as the current miss. Global delta stream: 1 8 8 1 4 4 1 8 8, with 1 as the current delta. The index table is keyed by global delta (1, 4, 8) and points into the Global History Buffer, which holds the miss addresses with the head pointer at 71. Legend: Key, Current, Prefetches.]
Prefetches:
• 71 + 8 => 79
• 71 + 4 => 75
• 79 + 8 => 87
• 75 + 4 => 79
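Width, depth, and hybrid prefetching differ only in how many earlier occurrences of the key delta are visited and how far each one is followed. The sketch below is a simplification under stated assumptions: it scans a flat Python list instead of walking GHB linked lists, and `delta_prefetch`, `width`, and `depth` are hypothetical names.

```python
def delta_prefetch(misses, width, depth):
    """Visit up to `width` earlier occurrences of the newest delta (newest first)
    and replay up to `depth` deltas that followed each occurrence."""
    deltas = [b - a for a, b in zip(misses, misses[1:])]
    key = deltas[-1]
    current = misses[-1]
    # Earlier positions of the key delta, newest first, excluding the current one.
    occurrences = [i for i in range(len(deltas) - 2, -1, -1) if deltas[i] == key]
    prefetches = []
    for i in occurrences[:width]:
        address = current
        for delta in deltas[i + 1 : i + 1 + depth]:
            address += delta
            if address not in prefetches:  # skip duplicate prefetch addresses
                prefetches.append(address)
    return prefetches


misses = [27, 28, 36, 44, 45, 49, 53, 54, 62, 70, 71]
width_only = delta_prefetch(misses, width=2, depth=1)  # [79, 75]
depth_only = delta_prefetch(misses, width=1, depth=2)  # [79, 87]
hybrid = delta_prefetch(misses, width=2, depth=2)      # [79, 87, 75]
```

In this sketch the hybrid case produces three distinct addresses, because following the second occurrence two steps deep arrives at 79 again and the duplicate is dropped.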

Simulation Methodology
• Simulated SPEC CPU2000 benchmarks
• Fast-forwarded 1 billion instructions, then simulated 1 billion instructions
• Used peak binaries compiled with -O4 optimization
• Results include all benchmarks that have at least a 5% IPC improvement with an ideal L2 cache

Simulation Methodology
• Table walk: one cycle per access
• Index table (IT) size reduces table conflicts
• GHB size reflects the prefetch history working set
• In general, GHB prefetching requires less history

Results
• Our results compare:
  • IPC improvement (harmonic mean) vs. prefetch degree
  • Increase in memory traffic per instruction (arithmetic mean) vs. prefetch degree
• Prefetch accuracy: the percent of prefetches that are used by the program

Distance Prefetching (Performance)
[Figure: IPC improvement (axis from 5% to 35%) vs. prefetch degree (1, 2, 4, 8, 16) for Table (width), GHB (width), GHB (depth), and GHB (hybrid).]

Distance Prefetching (Performance)
[Figure: IPC improvement (axis from -10% to 110%, with one annotation of ~300%) per benchmark for Table (width), GHB (width), GHB (depth), and GHB (hybrid). Benchmarks: art, vpr, gap, mcf, apsi, twolf, applu, swim, bzip2, lucas, mgrid, galgel, parser, ammp, wupwise, plus the harmonic mean (hmean).]

Distance Prefetching (Memory Traffic)
[Figure: increase in memory traffic (axis from 0% to 180%) vs. prefetch degree (1, 2, 4, 8, 16) for Table (width), GHB (width), GHB (depth), and GHB (hybrid).]

Conclusions
• More complete picture of history
  • Allows width, depth, and hybrid prefetching
  • Can also improve other prefetching methods (covered in depth in the paper)
• Eliminates stale history in a natural way
  • FIFO discards old history to make room for new history
  • In a conventional table, old history can remain for a very long time and trigger inaccurate prefetches

Acknowledgements
• This research was funded by:
  • An Intel Undergraduate Research scholarship
  • A University of Wisconsin Hilldale Undergraduate Research fellowship
  • The National Science Foundation under grants CCR-0311361 and EIA-0071924

Backup Slides

Prefetching Metrics
• Accuracy is the percent of prefetches that are actually used.
• Coverage is the percent of memory references prefetched rather than demand fetched.
• Timeliness indicates if prefetched data arrives early enough to prevent the processor from stalling.
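The first two metrics reduce to simple ratios over event counts. The sketch below assumes a simulator that exposes three counters; the function and parameter names are hypothetical and the numbers in the usage lines are made up purely to exercise the formulas.

```python
def prefetch_accuracy(used_prefetches, issued_prefetches):
    """Fraction of issued prefetches that the program actually used."""
    return used_prefetches / issued_prefetches if issued_prefetches else 0.0


def prefetch_coverage(used_prefetches, demand_fetches):
    """Fraction of memory references served by a prefetch instead of a demand fetch."""
    total = used_prefetches + demand_fetches
    return used_prefetches / total if total else 0.0


# Illustrative counts only, not measured results.
accuracy = prefetch_accuracy(used_prefetches=30, issued_prefetches=40)  # 0.75
coverage = prefetch_coverage(used_prefetches=30, demand_fetches=70)
```

Timeliness is not a single ratio of this kind; it depends on when each prefetched line arrives relative to its first use, so it is left out of the sketch.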

GHB – Deltas
[Figure: miss address stream 27 28 36 44 45 49 53 54 62 70, with 71 as the current miss. Global delta stream: 1 8 8 1 4 4 1 8 8, with 1 as the current delta. A Markov graph over the deltas 1, 4, and 8 has edges labeled .3 and .7. Legend: Key, Current, Prefetches.]

Prefetch Taxonomy
• To simplify the discussion and illustrate the relation between prefetching methods, we introduce a consistent naming convention.
• Each name is an X/Y pair.
  • X is the key used for localizing the address stream.
  • Y is the method for detecting address patterns.

Prefetch Taxonomy
• We study two localizing methods
  • No localization, or global (G)
  • Program counter (PC)
• And three pattern detection methods
  • Address correlation (AC)
  • Delta correlation (DC)
  • Constant stride (CS)

Prefetch Taxonomy
• Markov Prefetching - G/AC
• Distance Prefetching - G/DC
• Stride Prefetching - PC/CS

Stride Prefetching
• Table tracks the local history of loads.
• If a constant stride is detected in a load’s local history, then n + s, n + 2s, …, n + ds are prefetched.
  • n is the current target address
  • s is the detected stride
  • d is the prefetch degree, or aggressiveness, of the prefetching

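Stride Prefetching
[Figure: reference prediction table indexed by the PC of the load, with fields Tag, Last Address, Stride, and State. A subtractor (sub) and an adder (add) combine the target address with the table fields to produce the prefetch address.]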
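A table-based stride prefetcher can be sketched as follows. This is a simplified model under stated assumptions: the per-load State field is reduced to "the same nonzero stride was seen twice in a row", the table is an unbounded dictionary with no tags or conflicts, and `StridePrefetcher` and `access` are hypothetical names.

```python
class StridePrefetcher:
    """Toy PC-indexed stride table: PC -> (last address, last stride)."""

    def __init__(self, degree):
        self.degree = degree  # d: how many strides ahead to prefetch
        self.table = {}

    def access(self, pc, address):
        """Record a load and return the addresses to prefetch (possibly none)."""
        prefetches = []
        if pc in self.table:
            last_address, last_stride = self.table[pc]
            stride = address - last_address  # the "sub" step
            if stride != 0 and stride == last_stride:
                # The "add" step: n + s, n + 2s, ..., n + ds.
                prefetches = [address + k * stride for k in range(1, self.degree + 1)]
            self.table[pc] = (address, stride)
        else:
            self.table[pc] = (address, None)
        return prefetches


prefetcher = StridePrefetcher(degree=3)
first = prefetcher.access(pc=0x400, address=100)   # no history yet
second = prefetcher.access(pc=0x400, address=108)  # one stride seen, not confirmed
third = prefetcher.access(pc=0x400, address=116)   # stride 8 confirmed
```

The third access confirms a stride of 8 and returns 124, 132, and 140, which is n + s through n + ds for d = 3.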

GHB – Stride Prefetching
• GHB-Stride uses the PC to access the index table.
• The linked lists contain the local history of each load.
• Compare the last two local strides. If they are the same, prefetch n + s, n + 2s, …, n + ds.
[Figure: the index table, keyed by load PC, points into the Global History Buffer; the last two strides taken from a load’s linked list are compared (=?).]
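The same stride check can be expressed on top of a PC-keyed history buffer. The sketch below is self-contained and illustrative only: `PcLocalizedGHB`, `local_history`, and `ghb_stride_prefetch` are hypothetical names, the index table is an unbounded dictionary, and overwritten entries are detected with sequence numbers as an assumed mechanism.

```python
class PcLocalizedGHB:
    """Toy GHB whose linked lists chain misses from the same load PC."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size  # (miss_address, link) per slot
        self.index_table = {}         # load PC -> sequence number of newest entry
        self.count = 0

    def insert(self, pc, miss_address):
        self.entries[self.count % self.size] = (miss_address, self.index_table.get(pc))
        self.index_table[pc] = self.count
        self.count += 1

    def local_history(self, pc, limit):
        """Return up to `limit` miss addresses for this PC, newest first."""
        history = []
        seq = self.index_table.get(pc)
        while (seq is not None and seq >= self.count - self.size
               and len(history) < limit):
            address, seq = self.entries[seq % self.size]
            history.append(address)
        return history


def ghb_stride_prefetch(ghb, pc, miss_address, degree):
    """Insert the miss, then prefetch if the last two local strides match."""
    ghb.insert(pc, miss_address)
    history = ghb.local_history(pc, limit=3)
    if len(history) < 3:
        return []
    newest_stride = history[0] - history[1]
    older_stride = history[1] - history[2]
    if newest_stride != 0 and newest_stride == older_stride:
        return [miss_address + k * newest_stride for k in range(1, degree + 1)]
    return []


ghb = PcLocalizedGHB(size=8)
# Two loads, "A" and "B", with interleaved misses.
trace = [("A", 100), ("B", 5000), ("A", 164), ("B", 4000), ("A", 228)]
results = [ghb_stride_prefetch(ghb, pc, address, degree=2) for pc, address in trace]
```

Only the last miss triggers prefetches: load A has two matching strides of 64, so 292 and 356 are returned, while load B has too little history.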

GHB – Local Delta Correlation
• Form delta correlations within each load’s local history.
• For example, consider the local miss address stream:
