Efficient Memory Prefetching with Perceptron Filtering System

Enhancing Signature Path Prefetching with Perceptron Prefetch Filtering Eshan Bhatia1, Gino Chacon1, Elvira Teran2, Paul V. Gratz1, Daniel A. Jiménez3 1Texas A&M University 2Texas A&M International University 3Texas A&M University / Barcelona Supercomputing Center

Introduction Design Space: Standalone L1D, L2C and LLC Prefetchers Distribution of hardware budget across three prefetchers Interaction among the prefetchers Control over placing the incoming prefetch line (L1D vs L2C vs LLC)

Key Ideas • Aggressive L2C Prefetching • Signature Path Prefetcher (SPP)[Kim, MICRO ‘16] • Perceptron-based Prefetch Filtering (PPF)[Bhatia, ISCA ‘19] • Optimizing Prefetch Queue Sharing • Page based resource sharing • Minimal LLC Prefetching • Lack of information • LLC is a shared resource among cores • Coordination between levels • Minimizing impact of noisy prefetches on lower level prefetchers

Page Based Resource Sharing • Prefetch Queue (PQ) limited in number • Valuable resource for L1D / L2C • Aggressive (but still accurate) prefetching • Takes the current page deep into the speculation path • Blocks PQ resources for other pages • Timing disparity between multiple pages with interleaved accesses • Efficient Resource Utilization • Track number of distinct pages in last few memory accesses • Divide PQ resource over those pages

L1D Prefetcher: Next-N-Lines • Fetches N consecutive lines wrt current demand address • N determined through PQ resource availability • Page level throttling • Tracks per page access pattern for the last two accesses • Scores page as +1 delta friendly or averse • Throttles prefetching for averse pages

L2C Underlying Prefetcher: SPP • Lookahead Prefetcher • Uses previous prefetch suggestion to trigger new speculation • Recursively iterate and keep compounding the confidence • Stop when the confidence falls below a certain threshold • Threshold (hyperparameter) is an indication of aggressiveness • Less threshold -> more aggressive -> more coverage -> less accuracy • Pre-defined trade-off between coverage and accuracy

Enhanced SPP • Decoupled coverage and accuracy concerns • SPP enhanced to its most aggressive extreme • Helps capture complex memory access patterns • Increases coverage • Perceptron Filtering (PPF) takes care of accuracy

Hashed Perceptron Model • Use feature values to index into distinct tables • Example: PC, memory address etc • Prediction: Lookup, summation, threshold • Use xi value to index into table of corresponding Wi • Learning occurs when ground truth known • Positive Outcome: Increment each feature’s partial prediction weight • Negative Outcome: Decrement each feature’s partial prediction weight • No multiplication, no division, no complex back-propagation

PPF Architecture • Baseline prefetcher: SPP • Modified for high coverage • Perceptron Weights Tables • Tables of 5-bit up-down saturating counters • 1 table per feature • Variable depth, independent indexing • Prefetch and Reject Tables • Record prefetches for future training

PPF Design • Subsequent feedback of a prior prefetch • Same perceptron weights re-indexed and updated by +1 / -1 • Prefetch suggestions tested using PPF • Outcome and indexing metadata recorded in Prefetch / Reject Table

Putting Pieces Together Single Core Configuration L1D: Enhanced Next-N-line L2C: PPF with SPP • Triggered on all accesses to L2C • Can place prefetches in L2C or LLC LLC: Next Line prefetcher • Triggered on demand accesses and only last prefetch from L1D reaching LLC • Uses the metadata communication path between the prefetchers Overhead: 49.94 KBs Multi Core Configuration L1D: No Prefetching L2C: PPF with SPP • Triggered on all accesses to L2C • Can place prefetches in L2C or LLC LLC: SPP (without PPF) • Separate tables for each core • Modified to be less aggressive than the original SPP (LLC is a shared resource) Overhead: 62.83 KBs

Results Improvement reported over no prefetching Single Core: 40.4% Multi Core: 20.3%

Future Works • Better baseline prefetchers for PPF • Interaction between the prefetchers • Metadata communication path between the levels

Thank you!

Backup Slides

L2C Underlying Prefetcher: SPP

Efficient Memory Prefetching with Perceptron Filtering System