Predictor-Directed Stream Buffers Timothy Sherwood Suleyman Sair Brad Calder
Overview • Introduction • Past Stream Buffer work • Predictor-Directed Stream Buffers • Policy Improvements • Results • Contribution Sherwood, Sair, and Calder
Introduction • Memory Wall • Latency reduction through prefetching • without eating too much bandwidth • Stream Buffers are among the most widely used prefetchers • simple to implement • very efficient • Pointer-based codes remain a challenge
Past Stream Buffer work • Jouppi 1990 • consecutive cache line FIFO • Palacharla and Kessler 1994 • non-unit stride (based on memory chunk) • allocation filters • Farkas et al. 1997 • PC-based stride • fully associative / non-overlapping
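The PC-based stride scheme above can be sketched as a small table indexed by load PC; the table layout, counter width, and confidence threshold here are illustrative assumptions, not the exact Farkas et al. design.

```python
# Hypothetical sketch of a PC-indexed stride predictor of the kind used
# to drive stride-based stream buffer allocation.

class StridePredictor:
    def __init__(self):
        self.table = {}  # load PC -> (last address, stride, confidence)

    def update(self, pc, address):
        """Observe a load; return the predicted next address, or None."""
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None:
            new_stride = address - last
            if new_stride == stride:
                conf = min(conf + 1, 3)   # saturating confidence counter
            else:
                conf = 0                  # stride changed: reset confidence
                stride = new_stride
        self.table[pc] = (address, stride, conf)
        # Only predict once the same stride has repeated (assumed threshold).
        return address + stride if conf >= 2 else None
```

For a load walking an array with stride 4, the predictor stays silent until the stride has been confirmed, then predicts the next element.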
Past Stream Buffer work [Figure: stride-based stream buffer — N buffers of (tag, cache block, comparator) entries with Last Address and Predicted Stride registers; the predicted stride is stored in the stream buffer on allocation; buffers sit between the next lower level of memory and the data cache, register file, and MSHRs]
Past Stream Buffer work • Past work targeted at streaming in arrays • either in sequential order • or stride order (multidimensional array) • Could not handle Pointer Codes • repetitive non-striding references • Need a more General Predictor
Predictor-Directed Stream Buffer • The Goal: simple and efficient hardware-based prefetching of complex but predictable streams • Approach: take a general predictor and hook it up to the well-established stream buffer front end • Separate the predictor from the prefetcher • Can use almost any predictor • 2-Delta • Context • Markov
PSB Generalized Architecture [Figure: load info (PC, address) from the write-back stage updates the Address Predictor, whose entries hold Load PC, History, Stride, Confidence, and Last Address; the predictor supplies predicted addresses and a subset of prediction info to N stream buffers of (tag, cache block, comparator) entries, which prefetch from the next lower level of memory and feed the data cache, register file, and MSHRs]
PSB Stages • Allocation • Prediction • Probe • Prefetching • Lookup
Stage Descriptions • Allocation • Stream Buffer is allocated to a particular load • the buffer is initialized • subject to Allocation Filters • Prediction • an empty buffer entry asks for an address • subject to limited predictor speed
Stage Descriptions (Continued) • Probe • if there are free ports, remove useless prefetches • not mandatory • Prefetching • subject to scheduling for ports and priority, prefetches are sent to memory • Lookup • when a load performs an L1 access, the Stream Buffers are checked in parallel
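The five stages above can be sketched as one per-cycle loop; this is a deliberately simplified model (one buffer, an idealized predictor, instant memory), and all class and method names are hypothetical.

```python
# Simplified single-buffer model of the PSB stages: Allocation seeds the
# buffer, each tick runs Prediction -> Probe -> Prefetching, and Lookup
# is checked in parallel with the L1 access.

class NextLinePredictor:
    """Trivial stand-in predictor: always predicts the next cache line."""
    def __init__(self, line=64):
        self.line = line
    def predict(self, last):
        return last + self.line

class StreamBuffer:
    def __init__(self, predictor, depth=4):
        self.predictor = predictor
        self.entries = []           # prefetched addresses (FIFO)
        self.depth = depth
        self.last_predicted = None

    def allocate(self, pc, miss_address):
        """Allocation: tie the buffer to a load and initialize it."""
        self.entries.clear()
        self.last_predicted = miss_address

    def tick(self, cache):
        """One cycle of Prediction, Probe, and Prefetching."""
        if len(self.entries) < self.depth:        # an empty entry asks
            addr = self.predictor.predict(self.last_predicted)
            if addr is None:
                return                            # predictor has no answer
            self.last_predicted = addr
            if addr not in cache:                 # Probe: drop useless prefetch
                self.entries.append(addr)         # Prefetching (instant here)

    def lookup(self, address):
        """Lookup: checked in parallel with the L1 access."""
        if address in self.entries:
            self.entries.remove(address)
            return True
        return False
```

Note how the Probe stage keeps the buffer from wasting an entry on a line that is already cached, at the cost of a cache-port access.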
PSB Implementation • Tried many different address predictors • Best is Stride-Filtered Markov • similar to Joseph and Grunwald's predictor • first-order Markov • striding behavior is filtered out • Differences are stored to reduce table size
Difference Storing
PSB with SFM [Figure: load info (PC, address) from the write-back stage drives a Stride Predictor (Last Address, Predicted Stride) and a Markov Predictor; a MUX returns the predicted Markov address on a Markov hit, otherwise the predicted stride address; the predicted stride is stored in the stream buffer on allocation; 8 buffers of (tag, cache block, comparator) entries sit between the next lower level of memory and the data cache, register file, and MSHRs]
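A minimal sketch of the Stride-Filtered Markov idea, assuming the stride filter simply keeps striding transitions out of the Markov table and the table stores differences (deltas) rather than full target addresses; the paper's exact filtering logic and table organization may differ.

```python
# Hedged sketch of a Stride-Filtered Markov (SFM) predictor: a first-order
# Markov table trained only on non-striding miss transitions, storing the
# difference to the next miss instead of the full address.

class SFMPredictor:
    def __init__(self):
        self.markov = {}         # last miss address -> delta to next miss
        self.prev_miss = None
        self.prev_stride = None

    def train(self, address):
        """Observe a miss address and update the Markov table."""
        if self.prev_miss is not None:
            delta = address - self.prev_miss
            if delta != self.prev_stride:
                # Non-striding transition: record the difference.
                # (Striding transitions are filtered out — the stride
                # predictor handles those.)
                self.markov[self.prev_miss] = delta
            self.prev_stride = delta
        self.prev_miss = address

    def predict(self, address, stride_prediction):
        """On a Markov hit, apply the stored delta; else fall back to the
        stride prediction (the MUX in the hardware diagram)."""
        if address in self.markov:
            return address + self.markov[address]
        return stride_prediction
```

Storing a delta rather than a full address is the compression the slides call "difference storing": pointer transitions in a program tend to reuse a small set of offsets, so a delta needs fewer bits than a complete target address.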
Methods • SimpleScalar 3.0 • Rewrote memory hierarchy • Model bandwidth between all levels • Added perfect store sets • Ran over a set of pointer benchmarks • 2K-entry predictor table • 8 buffers x 4-entry Stream Buffers • 32KB 4-way set-associative cache
Speedup from PSB
Allocation Filtering • Farkas et al. showed that two-miss filtering • prevents too many streams from requesting resources • Does not work as well for pointer codes • irregular miss patterns • We use Priority and Accuracy Counters • track the behavior of loads • allocate to loads that are behaving well
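The accuracy-counter idea might be sketched as follows; the counter width, threshold, and optimistic initialization for unseen loads are assumptions for illustration, not the paper's exact parameters.

```python
# Illustrative sketch of accuracy-based allocation filtering: each load PC
# keeps a saturating counter that rises when its prefetches are used and
# falls when they are not; only well-behaved loads get a stream buffer.

class AllocationFilter:
    def __init__(self, threshold=2):
        self.accuracy = {}       # load PC -> saturating accuracy counter
        self.threshold = threshold

    def record(self, pc, prefetch_used):
        """Update the load's counter based on whether a prefetch was used."""
        c = self.accuracy.get(pc, self.threshold)  # assumed: start optimistic
        c = min(c + 1, 7) if prefetch_used else max(c - 1, 0)
        self.accuracy[pc] = c

    def may_allocate(self, pc):
        """Allocate a buffer only to loads that are behaving well."""
        return self.accuracy.get(pc, self.threshold) >= self.threshold
```

Starting unseen loads at the threshold (rather than zero) avoids the chicken-and-egg problem where a load can never prove itself because it never gets a buffer.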
Allocation Filtering Speedup
Stream Buffer Priority • Round Robin • gives each active buffer equal resources • for prediction and prefetching • Priority Counters • small counters associated with each buffer • the counters rank the buffers • more resources go to better-performing buffers
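A rough sketch of priority-counter scheduling under assumed update rules (increment on a stream buffer hit, decrement on a useless prefetch); the counter width and update policy are illustrative, not the paper's exact scheme.

```python
# Illustrative sketch: per-buffer saturating priority counters rank the
# buffers so better-performing ones get predictor and prefetch slots first.

class PriorityScheduler:
    def __init__(self, n_buffers):
        self.priority = [0] * n_buffers

    def on_lookup_hit(self, buf):
        """A load hit in this buffer: its prefetches are useful."""
        self.priority[buf] = min(self.priority[buf] + 1, 7)

    def on_useless_prefetch(self, buf):
        """A prefetch was evicted unused: demote the buffer."""
        self.priority[buf] = max(self.priority[buf] - 1, 0)

    def schedule(self, slots):
        """Return the buffer indices granted slots, highest priority first
        (ties broken by buffer index, since Python's sort is stable)."""
        order = sorted(range(len(self.priority)),
                       key=lambda b: self.priority[b], reverse=True)
        return order[:slots]
```

Unlike round robin, this concentrates the limited predictor bandwidth and prefetch ports on the buffers whose streams are actually being consumed.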
Priority Scheduling Speedup
Latency Reduction
Contributions • Predictor-Directed Stream Buffers allow decoupling of the Stream Buffer front end from address generation • Accuracy-based allocation filtering and priority scheduling can make a large difference in performance • With some simple compression, even small Markov tables can be very effective
Accuracy
Bus Results