Improving Cache Management Policies Using Dynamic Reuse Distances. Nam Duong 1 , Dali Zhao 1 , Taesu Kim 1 , Rosario Cammarota 1 , Mateo Valero 2 , Alexander V. Veidenbaum 1 1 University of California, Irvine 2 Universitat Politecnica de Catalunya and Barcelona Supercomputing Center.
Cache Management • Has been a hot research topic • Single-core — Replacement: LRU, NRU, EELRU, DIP, RRIP, …; Bypass: SDP, …; Prefetch • Shared-cache — Partitioning: UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, … • PDP (this work) applies to replacement, bypass, and partitioning
Overview • Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution • Introduced a new concept, Protecting Distance (PD), which is shown to achieve such a balance • Developed single- and multi-core hit rate models as a function of PD, cache configuration and program behavior • Models are used to dynamically compute the best PD • Showed that PD-based cache management policies improve performance for both single- and multi-core systems
Outline • The concept of Protecting Distance • The single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation
Definitions • The (line) reuse distance: the number of accesses to the same cache set between two accesses to the same line • This metric is directly related to hit rate • The reuse distance distribution (RDD) • A distribution of observed reuse distances • A program signature for a given cache configuration • (Figure: RDDs of representative benchmarks; X-axis: the RD, &lt; 256)
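The definitions above can be sketched in code. A minimal Python sketch (the function and variable names are ours, and the exact off-by-one convention for "between" is an assumption, not taken from the paper):

```python
from collections import Counter

def reuse_distance_distribution(trace, num_sets, block_bits=6):
    """Compute an RDD from a sequence of byte addresses.

    The reuse distance of an access is taken as the number of accesses
    to the same cache set since the previous access to the same line,
    so a back-to-back reuse has RD = 1 (papers may differ by one).
    """
    set_access_count = [0] * num_sets  # accesses seen so far, per set
    last_seen = {}                     # line -> set counter at its last access
    rdd = Counter()
    for addr in trace:
        line = addr >> block_bits      # drop the block offset
        s = line % num_sets            # simple modulo set indexing
        if line in last_seen:
            rd = set_access_count[s] - last_seen[line]
            rdd[rd] += 1
        last_seen[line] = set_access_count[s]
        set_access_count[s] += 1
    return rdd
```

With `num_sets=1` and `block_bits=0`, the trace `[1, 2, 3, 1]` yields one observation at RD = 3.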
Future Behavior Prediction • Cache management policies use past reference behavior to predict future accesses • Prediction accuracy is critical • Prediction in some of the prior policies • LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity) • Early eviction LRU (EELRU): Counts evictions in two non-LRU regions (early/late) to predict a line to evict • RRIP: Predicts if a line will be reused in a near, long, or distant future
Balancing Reuse and Cache Pollution • Key to good performance (high hit rate) • Cache lines must be reused as much as possible before eviction • AND must be evicted soon after the “last” reuse to give space to new lines • The former can be achieved by using the reuse distance and actively preventing eviction • “Protecting” a line from eviction • The latter can be achieved by evicting when not reused within this distance • There is an optimal reuse distance balancing the two • It is called a Protecting Distance (PD)
Example: 436.CactusADM • A majority of lines are reused at 64 or fewer accesses • There are multiple peaks at different reuse distances • Reuse maximized if lines are kept in the cache for 64 accesses • Lines may not be reused if evicted before that • Lines kept beyond that are likely to pollute cache • Assume that no lines are kept longer than a given RD
The Protecting Distance (PD) • A distance at which a majority of lines are covered • A single value for all sets • Predicted based on the current RDD • Questions to answer/solve • Why does using the PD achieve the balance? • How to dynamically find the PD for an application and a cache configuration? • How to build the PD-based management policies?
Outline • The concept of Protecting Distance • Single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation
The Single-core PDP • A cache tag stores the line's remaining PD (RPD) • A line can be evicted when its RPD = 0 • The RPD of an inserted or promoted line is set to the predicted PD • The RPDs of the other lines in the set are decremented • Example: a 4-way cache, predicted PD = 7; a line is promoted on a hit • RPDs of a set before the hit: 1 4 6 3; after the hit: 0 6 5 2 (the hit line is promoted, the others are decremented)
The Single-core PDP (Cont.) • Selecting a victim on a miss • A line with RPD = 0 can be replaced • Two cases when all RPDs &gt; 0 (no unprotected lines): • Caches without bypass (inclusive): unused lines are less likely to be reused than reused lines; replace the unused line with the highest RPD first; if there is no unused line, replace the line with the highest RPD • Caches with bypass (non-inclusive): bypass the new line • (Figure: example sets with RPDs illustrating each case)
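Putting the insertion, promotion, and victim-selection rules together, one plausible sketch of a single PDP set in Python (an illustration of the rules on these slides, not the authors' implementation; class and field names and the bypass interface are our assumptions):

```python
PD = 7  # predicted protecting distance for the current interval (example value)

class PDPSet:
    """One cache set under PDP: each way tracks its remaining PD (RPD)
    and whether the line has been reused since insertion."""
    def __init__(self, ways):
        self.lines = [None] * ways   # stored tag, or None for an empty way
        self.rpd = [0] * ways
        self.reused = [False] * ways

    def _decrement_others(self, skip):
        for w in range(len(self.lines)):
            if w != skip and self.rpd[w] > 0:
                self.rpd[w] -= 1

    def access(self, tag, bypass_allowed=False):
        if tag in self.lines:        # hit: promote and protect again
            w = self.lines.index(tag)
            self.rpd[w] = PD
            self.reused[w] = True
            self._decrement_others(w)
            return "hit"
        # miss: an unprotected (RPD = 0) or empty way can be replaced
        unprotected = [w for w in range(len(self.lines))
                       if self.rpd[w] == 0 or self.lines[w] is None]
        if unprotected:
            victim = unprotected[0]
        elif bypass_allowed:         # non-inclusive cache: bypass the new line
            self._decrement_others(-1)
            return "bypass"
        else:
            # inclusive cache: prefer an unused line with the highest RPD
            unused = [w for w in range(len(self.lines)) if not self.reused[w]]
            pool = unused if unused else range(len(self.lines))
            victim = max(pool, key=lambda w: self.rpd[w])
        self.lines[victim] = tag
        self.rpd[victim] = PD
        self.reused[victim] = False
        self._decrement_others(victim)
        return "miss"
```

For example, on a 2-way set, two misses fill the ways, a re-access hits and re-protects its line, and a further miss with both lines still protected is bypassed.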
Evaluation of the Static PDP • Static PDP: use the best static PD for each benchmark (PD &lt; 256) • SPDP-NB: static PDP with replacement only • SPDP-B: static PDP with replacement and bypass • Performance: in general, DRRIP &lt; SPDP-NB &lt; SPDP-B • 436.cactusADM: a 10% additional miss reduction • The two static PDP policies have similar performance • 483.xalancbmk: 3 different execution windows show different behavior for SPDP-B
436.cactusADM: Explaining the Performance Difference • How do the evicted lines occupy the cache? • DRRIP: • Early-evicted lines: 75% of accesses, but occupy only 4% of the cache • Late-evicted lines: 2% of accesses, but occupy 8% of the cache → pollution • SPDP-NB: early- and late-evicted lines: 42% of accesses, but occupy only 4% • SPDP-B: late-evicted lines: 1% of accesses, occupy 3% of the cache → yielding cache space to useful lines • PDP suffers less pollution from long-RD lines in the cache than RRIP
Case Study: 483.xalancbmk • The best PD differs across execution windows • And across programs • Need a dynamic policy that finds the best PD • Need a model to drive the search • There is a close relationship between the hit rate, the PD, and the RDD
A Hit Rate Model for a Non-inclusive Cache • The model estimates the hit rate E as a function of dp and the RDD • {Ni}, Nt: the RDD (accesses observed at each RD i, and total accesses) • dp: the protecting distance • de: experimentally set to W (W: cache associativity) • The model is used to find the PD that maximizes the hit rate
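As a hedged illustration of how such a model can drive the PD search: assume lines that hit within dp occupy the set for roughly their reuse distance, lines that are never reused occupy it for about dp + de accesses, and pick the dp that maximizes hits per unit occupancy. This sketch only mirrors the shape of the model described on the slide; the exact formula is in the paper:

```python
def best_pd(rdd, n_total, assoc, max_pd=256):
    """Search dp in [1, max_pd] for the value maximizing estimated E.

    rdd: dict mapping reuse distance i -> Ni (sampled access counts)
    n_total: Nt, total sampled accesses
    assoc: W; de is set to W per the slide
    Constant scale factors are dropped since they do not change the argmax.
    """
    de = assoc
    best_dp, best_e = 1, 0.0
    hits = 0       # sum of Ni for i <= dp
    weighted = 0   # sum of i * Ni for i <= dp (occupancy of hitting lines)
    for dp in range(1, max_pd + 1):
        hits += rdd.get(dp, 0)
        weighted += dp * rdd.get(dp, 0)
        # non-reused lines are assumed to occupy the set for dp + de accesses
        occupancy = weighted + (dp + de) * (n_total - hits)
        e = hits / occupancy if occupancy else 0.0
        if e > best_e:
            best_dp, best_e = dp, e
    return best_dp, best_e
```

For an RDD with a large peak at RD = 4 and a small tail at RD = 200, the search settles on dp = 4: protecting out to 200 would capture the tail but pollute the cache far longer than the extra hits are worth.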
PDP Cache Organization • RD Sampler tracks accesses to several cache sets • Sits on the L2 miss/WB stream; the sampling rate can be reduced • Measures the reuse distance of each new access • RD Counter Array counts accesses at each RD value i (Ni) and total accesses (Nt) • To reduce overhead, each counter covers a range of RDs • PD Compute Logic finds the PD that maximizes E • The computed PD is used in the next interval (0.5M LLC accesses) • Reasonable hardware overhead: 2 or 3 bits per tag to store the RPD • (Block diagram: access address → RD Sampler → RD Counter Array → PD Compute Logic → PD, between the higher-level cache, the LLC, and main memory)
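The ranged counter array can be sketched as follows (illustrative only; the range size, counter count, and interface are our assumptions, not hardware parameters from the paper):

```python
class RDCounterArray:
    """RD counters where each counter covers `step` consecutive RDs,
    reducing storage relative to one counter per distance."""
    def __init__(self, max_rd=256, step=16):
        self.step = step
        self.counters = [0] * (max_rd // step)  # Ni, bucketed by range
        self.total = 0                          # Nt: all sampled accesses
    def record(self, rd):
        self.total += 1
        idx = rd // self.step
        if idx < len(self.counters):            # RDs beyond max_rd only count toward Nt
            self.counters[idx] += 1
```

Recording RDs 3 and 17 increments buckets 0 and 1 respectively, while an RD beyond the covered range still contributes to Nt.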
PDP vs. Existing Policies • EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance • However, lines are not always guaranteed to be protected • [1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: Simple and effective adaptive page replacement. In SIGMETRICS '99 • [2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA '07 • [3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA '10 • [4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO '10
Outline • The concept of Protecting Distance • The single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation
PD-based Shared Cache Partitioning • Each thread has its own PD (thread-aware) • Counter array replicated per thread • Sampler and compute logic shared • A thread’s PD determines its cache partition • Its lines occupy cache longer if its PD is large • The cache is implicitly partitioned per needs of each thread using thread PDs • The problem is to find a set of thread PDs that together maximize the hit rate
Shared-Cache Hit Rate Model • Extending the single-core approach • Compute a vector &lt;PD1, …, PDT&gt; (T: number of threads) • Exhaustive search for the vector is not practical • A heuristic search algorithm finds a combination of the threads' RDD peaks that maximizes the hit rate • The single-core model generates the top 3 peaks per thread • The complexity is O(T2) • See the paper for more details
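The slide gives only the outline of the search; as one hedged interpretation (not the paper's exact algorithm or complexity), a greedy coordinate search over each thread's top RDD peaks, driven by a shared-cache hit-rate estimate that is assumed given, could look like:

```python
def search_thread_pds(peaks_per_thread, estimate_hit_rate):
    """Greedy search over per-thread candidate PDs (illustrative sketch).

    peaks_per_thread[t]: top RDD peaks of thread t (e.g. 3 candidates)
    estimate_hit_rate(pds): shared-cache hit-rate model for a PD vector
    Start from each thread's first peak, then repeatedly change one
    thread's PD whenever that strictly improves the estimate; stop at a
    local maximum.  Terminates because the score strictly increases over
    a finite candidate grid.
    """
    pds = [p[0] for p in peaks_per_thread]
    improved = True
    while improved:
        improved = False
        base = estimate_hit_rate(pds)
        for t, candidates in enumerate(peaks_per_thread):
            for c in candidates:
                trial = pds[:t] + [c] + pds[t + 1:]
                score = estimate_hit_rate(trial)
                if score > base:
                    pds, base, improved = trial, score, True
    return pds
```

With a toy model that prefers PD = 8 for thread 0 and PD = 32 for thread 1, the search converges to that vector from any starting peaks.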
Outline • The concept of Protecting Distance • The single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation
Evaluation Methodology • CMP$im simulator • Target: replacement at the LLC
Evaluation Methodology (Cont.) • Benchmarks: SPEC CPU 2006 benchmarks • Excluded those which did not stress the LLC • Single-core: • Compared to EELRU, SDP, DIP, DRRIP • Multi-core • 4- and 16-core configurations, 80 workloads each • The workloads generated by randomly combining benchmarks • Compared to UCP, PIPP, TA-DRRIP • Our policy: PDP-x, where x is the number of bits per cache line
Single-core PDP • PDP-x, where x is the number of bits per cache line • Each benchmark is executed for 1B instructions • Best with 3 bits per line, but still better than prior work at 2 bits
Adaptation to Program Phases • 5 benchmarks which demonstrate significant phase changes • Each benchmark is run for 5B instructions • Change of PD (X-axis: 1M LLC accesses)
Adaptation to Program Phases (Cont.) • IPC improvement over DIP
PD-based Cache Partitioning for 16 cores • Normalized to TA-DRRIP
Other Results • Exploration of PDP cache parameters • Cache bypass fraction • Prefetch-aware PDP • PD-based cache management policy for 4-core
Conclusions • Proposed the concept of Protecting Distance (PD) • Showed that it can be used to better balance reuse and cache pollution • Developed a hit rate model as a function of the PD, program behavior, and cache configuration • Proposed PD-based management policies for both single- and multi-core systems • PD-based policies outperform existing policies
Backup Slides • RDD, E and hit rate of all benchmarks
RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks
RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks (Cont.)
RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks (Cont.)