Predictor-Directed Stream Buffers Timothy Sherwood Suleyman Sair Brad Calder
Overview • Introduction • Past Stream Buffer work • Predictor-Directed Stream Buffers • Policy Improvements • Results • Contribution Sherwood, Sair, and Calder
Introduction • Memory Wall • Latency reduction through prefetching • without eating too much bandwidth • Stream Buffers are among the most widely used prefetchers • simple to implement • very efficient • Pointer-based codes remain a challenge
Past Stream Buffer work • Jouppi 1990 • consecutive cache line FIFO • Palacharla and Kessler 1994 • non-unit stride (based on memory chunk) • allocation filters • Farkas et al. 1997 • PC-based stride • fully associative / non-overlapping
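The PC-based stride scheme above can be sketched as a small table indexed by load PC; the table layout, counter width, and confidence threshold here are illustrative assumptions, not the exact Farkas et al. design.

```python
# Hypothetical sketch of a PC-indexed stride predictor of the kind used
# to drive stride-based stream buffer allocation.

class StridePredictor:
    def __init__(self):
        self.table = {}  # load PC -> (last address, stride, confidence)

    def update(self, pc, address):
        """Observe a load; return the predicted next address, or None."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None:
            new_stride = address - last
            if new_stride == stride:
                conf = min(conf + 1, 3)   # saturating confidence counter
            else:
                conf = 0                  # stride changed: reset confidence
                stride = new_stride
        self.table[pc] = (address, stride, conf)
        # Only predict once the same stride has repeated (assumed threshold).
        return address + stride if conf >= 2 else None
```

For a load walking an array with stride 4, the predictor stays silent until the stride has been confirmed, then predicts the next element.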
Past Stream Buffer work [Figure: stride-based stream buffer — N buffers of (tag, cache block, comparator) entries with Last Address and Predicted Stride registers; the predicted stride is stored in the stream buffer on allocation; buffers sit between the next lower level of memory and the data cache, register file, and MSHRs]
Past Stream Buffer work • Past work targeted at streaming in arrays • either in sequential order • or stride order (multidimensional array) • Could not handle Pointer Codes • repetitive non-striding references • Need a more General Predictor
Predictor-Directed Stream Buffer • The Goal: simple and efficient hardware-based prefetching of complex but predictable streams • Approach: take a general predictor and hook it up to the well-established stream buffer front end • Separate the predictor from the prefetcher • Can use almost any predictor • 2-Delta • Context • Markov
PSB Generalized Architecture [Figure: load info (PC, address) from the write-back stage updates the Address Predictor, whose entries hold Load PC, History, Stride, Confidence, and Last Address; the predictor supplies predicted addresses and a subset of prediction info to N stream buffers of (tag, cache block, comparator) entries, which prefetch from the next lower level of memory and feed the data cache, register file, and MSHRs]
PSB Stages • Allocation • Prediction • Probe • Prefetching • Lookup
Stage Descriptions • Allocation • Stream Buffer is allocated to a particular load • the buffer is initialized • subject to Allocation Filters • Prediction • an empty buffer entry asks for an address • subject to limited predictor speed
Stage Descriptions (Continued) • Probe • if there are free ports, remove useless prefetches • not mandatory • Prefetching • subject to scheduling for ports and priority, prefetches are sent to memory • Lookup • when a load performs an L1 access, the Stream Buffers are checked in parallel
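The five stages above can be sketched as one per-cycle loop; this is a deliberately simplified model (one buffer, an idealized predictor, instant memory), and all class and method names are hypothetical.

```python
# Simplified single-buffer model of the PSB stages: Allocation seeds the
# buffer, each tick runs Prediction -> Probe -> Prefetching, and Lookup
# is checked in parallel with the L1 access.

class NextLinePredictor:
    """Trivial stand-in predictor: always predicts the next cache line."""
    def __init__(self, line=64):
        self.line = line
    def predict(self, last):
        return last + self.line

class StreamBuffer:
    def __init__(self, predictor, depth=4):
        self.predictor = predictor
        self.entries = []           # prefetched addresses (FIFO)
        self.depth = depth
        self.last_predicted = None

    def allocate(self, pc, miss_address):
        """Allocation: tie the buffer to a load and initialize it."""
        self.entries.clear()
        self.last_predicted = miss_address

    def tick(self, cache):
        """One cycle of Prediction, Probe, and Prefetching."""
        if len(self.entries) < self.depth:        # an empty entry asks
            addr = self.predictor.predict(self.last_predicted)
            if addr is None:
                return                            # predictor has no answer
            self.last_predicted = addr
            if addr not in cache:                 # Probe: drop useless prefetch
                self.entries.append(addr)         # Prefetching (instant here)

    def lookup(self, address):
        """Lookup: checked in parallel with the L1 access."""
        if address in self.entries:
            self.entries.remove(address)
            return True
        return False
```

Note how the Probe stage keeps the buffer from wasting an entry on a line that is already cached, at the cost of a cache-port access.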
PSB Implementation • Tried many different address predictors • Best is Stride-Filtered Markov • similar to Joseph and Grunwald's predictor • first-order Markov • striding behavior is filtered out • Differences are stored to reduce table size
Difference Storing
PSB with SFM [Figure: load info (PC, address) from the write-back stage drives a Stride Predictor (Last Address, Predicted Stride) and a Markov Predictor; a MUX returns the predicted Markov address on a Markov hit, otherwise the predicted stride address; the predicted stride is stored in the stream buffer on allocation; 8 buffers of (tag, cache block, comparator) entries sit between the next lower level of memory and the data cache, register file, and MSHRs]
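A minimal sketch of the Stride-Filtered Markov idea, assuming the stride filter simply keeps striding transitions out of the Markov table and the table stores differences (deltas) rather than full target addresses; the paper's exact filtering logic and table organization may differ.

```python
# Hedged sketch of a Stride-Filtered Markov (SFM) predictor: a first-order
# Markov table trained only on non-striding miss transitions, storing the
# difference to the next miss instead of the full address.

class SFMPredictor:
    def __init__(self):
        self.markov = {}         # last miss address -> delta to next miss
        self.prev_miss = None
        self.prev_stride = None

    def train(self, address):
        """Observe a miss address and update the Markov table."""
        if self.prev_miss is not None:
            delta = address - self.prev_miss
            if delta != self.prev_stride:
                # Non-striding transition: record the difference.
                # (Striding transitions are filtered out — the stride
                # predictor handles those.)
                self.markov[self.prev_miss] = delta
            self.prev_stride = delta
        self.prev_miss = address

    def predict(self, address, stride_prediction):
        """On a Markov hit, apply the stored delta; else fall back to the
        stride prediction (the MUX in the hardware diagram)."""
        if address in self.markov:
            return address + self.markov[address]
        return stride_prediction
```

Storing a delta rather than a full address is the compression the slides call "difference storing": pointer transitions in a program tend to reuse a small set of offsets, so a delta needs fewer bits than a complete target address.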
Methods • SimpleScalar 3.0 • Rewrote memory hierarchy • Model bandwidth between all levels • Added perfect store sets • Ran over a set of pointer benchmarks • 2K-entry predictor table • 8 buffers x 4-entry Stream Buffers • 32KB 4-way set-associative cache
Speedup from PSB
Allocation Filtering • Farkas et al. showed that two-miss filtering • prevents too many streams from requesting resources • Does not work as well for pointer codes • irregular miss patterns • We use Priority and Accuracy Counters • track the behavior of loads • allocate to loads that are behaving well
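The accuracy-counter idea might be sketched as follows; the counter width, threshold, and optimistic initialization for unseen loads are assumptions for illustration, not the paper's exact parameters.

```python
# Illustrative sketch of accuracy-based allocation filtering: each load PC
# keeps a saturating counter that rises when its prefetches are used and
# falls when they are not; only well-behaved loads get a stream buffer.

class AllocationFilter:
    def __init__(self, threshold=2):
        self.accuracy = {}       # load PC -> saturating accuracy counter
        self.threshold = threshold

    def record(self, pc, prefetch_used):
        """Update the load's counter based on whether a prefetch was used."""
        c = self.accuracy.get(pc, self.threshold)  # assumed: start optimistic
        c = min(c + 1, 7) if prefetch_used else max(c - 1, 0)
        self.accuracy[pc] = c

    def may_allocate(self, pc):
        """Allocate a buffer only to loads that are behaving well."""
        return self.accuracy.get(pc, self.threshold) >= self.threshold
```

Starting unseen loads at the threshold (rather than zero) avoids the chicken-and-egg problem where a load can never prove itself because it never gets a buffer.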
Allocation Filtering Speedup
Stream Buffer Priority • Round Robin • gives each active buffer equal resources • for prediction and prefetching • Priority Counters • small counters associated with each buffer • the counters rank the buffers • more resources go to better-performing buffers
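A rough sketch of priority-counter scheduling under assumed update rules (increment on a stream buffer hit, decrement on a useless prefetch); the counter width and update policy are illustrative, not the paper's exact scheme.

```python
# Illustrative sketch: per-buffer saturating priority counters rank the
# buffers so better-performing ones get predictor and prefetch slots first.

class PriorityScheduler:
    def __init__(self, n_buffers):
        self.priority = [0] * n_buffers

    def on_lookup_hit(self, buf):
        """A load hit in this buffer: its prefetches are useful."""
        self.priority[buf] = min(self.priority[buf] + 1, 7)

    def on_useless_prefetch(self, buf):
        """A prefetch was evicted unused: demote the buffer."""
        self.priority[buf] = max(self.priority[buf] - 1, 0)

    def schedule(self, slots):
        """Return the buffer indices granted slots, highest priority first
        (ties broken by buffer index, since Python's sort is stable)."""
        order = sorted(range(len(self.priority)),
                       key=lambda b: self.priority[b], reverse=True)
        return order[:slots]
```

Unlike round robin, this concentrates the limited predictor bandwidth and prefetch ports on the buffers whose streams are actually being consumed.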
Priority Scheduling Speedup
Latency Reduction
Contributions • Predictor-Directed Stream Buffers allow decoupling of the Stream Buffer front end from address generation • Accuracy-based allocation filtering and priority scheduling can make a large difference in performance • With some simple compression, even small Markov tables can be very effective
Accuracy
Bus Results