Reuse-based Online Models for Caches

Rathijit Sen, David A. Wood
ACM SIGMETRICS 2013 @ CMU, Pittsburgh, PA

The Problem
[Figure: cores and LLC banks; an LLC miss is fetched from DRAM]
• Caches: power vs. performance
• Reconfigurable caches
  • e.g., IvyBridge
• The Problem: Which configuration to select?
  • e.g., to get the best energy-efficiency?

Cache Performance Prediction
• We propose a framework: h = (r · B) · φ
  • h: hit ratio
  • r: reuse-distance distribution (novel hardware support)
  • B: stochastic Binomial matrix
  • φ: hit function (LRU, PLRU, RANDOM, NMRU)
• Case study: Energy-Delay Product (EDP) within 7% of minimum

Agenda
• The Problem
• Framework
  • Locality (r)
  • Matrix transformations (B)
  • Hit functions (φ)
  • h = (r · B) · φ
• Hardware support
• Case Study

Cache Overview
[Figure: an address selects a set; tag match? Y: hit, N: miss. Cache dimensions: Sets (S) × Associativity (A)]
• Limited storage
• Sets of (usually 64-byte) blocks
• #blocks/set = associativity (#ways)
• Set index + address tags identify data
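As a minimal sketch of how a set index and tag identify data, the snippet below splits a byte address for a cache with 64-byte blocks. The function name `split_address` and the plain modulo indexing are assumptions made for illustration; they are not taken from the slides.

```python
BLOCK_BYTES = 64  # block size mentioned on the slide ("usually 64-byte")

def split_address(addr, num_sets):
    """Return (set_index, tag) for a byte address (illustrative modulo indexing)."""
    block = addr // BLOCK_BYTES   # drop the block-offset bits
    set_index = block % num_sets  # low-order block-address bits select the set
    tag = block // num_sets       # remaining high-order bits form the tag
    return set_index, tag

if __name__ == "__main__":
    print(split_address(0x12345, 1024))
```

Two addresses refer to the same cached block exactly when both the set index and the tag agree.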

Workload Variation
[Figure: Last-Level Cache (LLC) plot; labeled workloads: swim; mgrid; apache; zeus; oltp; jbb; equake, gafort, wupwise; fma3d; ammp, blackscholes, bodytrack, fluidanimate, freqmine, swaptions]

Bad configurations hurt!
[Figure: EDP (energy-delay product) across configurations; annotations: Maximum, Minimum, 218% worse, 27% worse]
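To make the selection criterion concrete, here is a small sketch that picks the configuration with the lowest energy-delay product. The configuration names and the energy and delay numbers are invented purely for illustration and are not results from the talk.

```python
def edp(energy_j, delay_s):
    """Energy-delay product: lower is better."""
    return energy_j * delay_s

def best_config(configs):
    """Return the configuration dict with the minimum EDP."""
    return min(configs, key=lambda c: edp(c["energy_j"], c["delay_s"]))

# Hypothetical predictions for three cache configurations (made-up values).
CONFIGS = [
    {"name": "16 ways", "energy_j": 2.0, "delay_s": 1.0},
    {"name": "8 ways", "energy_j": 1.5, "delay_s": 1.1},
    {"name": "4 ways", "energy_j": 1.2, "delay_s": 1.6},
]

if __name__ == "__main__":
    print(best_config(CONFIGS)["name"])
```

In an online setting the energy and delay inputs would come from predictions rather than measurements, which is where miss-ratio prediction enters.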

Problem Summary
[Figure: cache with Sets (S) × Associativity (A)]
• Reconfigurable caches
• Multiple replacement policies
• Goal: Online miss-ratio prediction

Indexing Assumption
• Mapping of unique addresses to cache sets
  • Assumption: independent, uniform [Smith, 1978]
  • Unique accesses as Bernoulli trials
• (Partial) Hashing
  • POWER4, POWER5, POWER6, Xeon
  • Simple XOR-based function [similar to Cypher, 2008]
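The following sketch shows one simple XOR-folding index function, to illustrate why hashing helps the uniform-mapping assumption. It is an assumed, generic construction and not necessarily the function used in the paper or in any of the processors named above.

```python
def xor_index(block_addr, index_bits):
    """Fold a block address into index_bits bits by XOR-ing successive chunks."""
    mask = (1 << index_bits) - 1
    idx = 0
    while block_addr:
        idx ^= block_addr & mask   # mix in the next chunk of address bits
        block_addr >>= index_bits
    return idx

if __name__ == "__main__":
    # A stride equal to the number of sets maps every block to set 0 under
    # plain modulo indexing, but spreads over all 16 sets under XOR folding.
    strided = [k * 16 for k in range(16)]
    print(sorted({b % 16 for b in strided}))
    print(sorted({xor_index(b, 4) for b in strided}))
```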


Temporal Locality Metrics
[Figure: histogram r of P(URD=i) vs. i; size?]
• Unique Reuse Distance (URD)
  • #unique intervening addresses
  • x y z z y x : URD(x)=2
  • = Stack Distance [Mattson, 1970] – 1
  • Large cache → large distances to track
• Absolute Reuse Distance (ARD)
  • #intervening addresses
  • x y z z y x : ARD(x)=4
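The two metrics can be checked on the slide's example trace with a short sketch. The function name `reuse_distances` is an assumption, and the quadratic-time scan is chosen for clarity rather than efficiency.

```python
def reuse_distances(trace):
    """Return a list of (address, URD, ARD) tuples, one per reuse in the trace."""
    last_pos = {}
    out = []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            between = trace[last_pos[addr] + 1 : pos]  # intervening accesses
            out.append((addr, len(set(between)), len(between)))
        last_pos[addr] = pos
    return out

if __name__ == "__main__":
    for addr, urd, ard in reuse_distances(list("xyzzyx")):
        print(addr, "URD =", urd, "ARD =", ard)
```

For the trace x y z z y x, the reuse of x has two unique intervening addresses (y, z) and four intervening accesses in total, matching URD(x)=2 and ARD(x)=4.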

Per-set Locality, r(S)
[Figure: the same reuse of x seen with #sets: S and with #sets: S' > S; histogram r of P(URD=i) vs. i]
• r(S) is “compressed” as S (#sets) increases
• Less of the tail is important


Estimating per-set locality
[Figure: row vector r (P(URD=i)) multiplied by a lower-triangular matrix B; entry (i, k) = P(k successes in i trials), i.e., P(k of i to the same set)]
• Generalized stochastic Binomial matrices [Strum, 1977]
• r(S) = r(1) · B(1 – 1/S, 1/S)
• Composition: r(S') = r(S) · B(1 – S/S', S/S')
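A minimal sketch of the transformation, assuming B has entries B[i][k] = C(i, k) · p^k · (1 – p)^(i – k) with p the probability that an intervening unique address falls in the same set. The helper names are hypothetical, and the dense matrix is only suitable for small histograms.

```python
from math import comb

def binomial_matrix(n, p):
    """n x n lower-triangular matrix with B[i][k] = P(k successes in i trials)."""
    return [[comb(i, k) * p**k * (1 - p)**(i - k) if k <= i else 0.0
             for k in range(n)] for i in range(n)]

def per_set_locality(r, p):
    """Row vector r times B: reuse-distance distribution seen by a single set."""
    n = len(r)
    B = binomial_matrix(n, p)
    return [sum(r[i] * B[i][k] for i in range(n)) for k in range(n)]

if __name__ == "__main__":
    r1 = [0.0, 0.0, 1.0]              # every reuse has URD = 2 (toy input)
    print(per_set_locality(r1, 0.5))  # distribution seen with S = 2 sets
```

Because binomial thinning composes, applying the transform with p = 1/2 twice should agree with applying it once with p = 1/4, which mirrors the composition rule for going from S to a larger number of sets.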

Computation reuse & speedup
[Figure: r(1), r(2^10), r(2^11), r(2^12), r(2^13), r(2^14); histogram r of P(URD=i) vs. i; size?]
• Poisson approximation
• Now: compute; later: hardware support → r(2^10)
• “Shorter” tail → smaller matrices

Size of r(2^10)?
[Figure: prediction with r(2^10) limited to URD < n; histogram r of P(URD=i) vs. i]


Hit Function, φ
[Figure: a reuse of x with intervening accesses that are not x]
• φ_k: P(x will hit | URD(x)=k)
• Monotonically decreasing model: φ_0 = 1, φ_k ≤ φ_(k-1), φ_∞ = 0
• Intuition: larger URD → same or larger eviction probability

Hit Function, φ
[Figure: Example: A=8]

Formulating φ
• φ(LRU): step-function
  • (r · B) · φ(LRU) → [Smith, 1978], [Hill & Smith, 1989]
• φ(PLRU):
  • Assumes on average, traffic evenly divided between subtrees
• φ(RANDOM):
  • Estimates #intervening misses using ARD
• φ(NMRU): similar to φ(RANDOM) except φ_1 = 1
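For the LRU case, the step function and the resulting hit ratio can be sketched as below. This assumes r_set is already the per-set reuse-distance distribution (index k = per-set URD) and that a block hits under LRU when fewer than A unique addresses intervened in its set. The function names are hypothetical, and the PLRU, RANDOM, and NMRU hit functions are not reproduced here.

```python
def phi_lru(assoc, n):
    """Step function: phi_k = 1 for k < assoc, else 0."""
    return [1.0 if k < assoc else 0.0 for k in range(n)]

def hit_ratio(r_set, phi):
    """Dot product of the per-set distribution with the hit function."""
    return sum(r_k * phi_k for r_k, phi_k in zip(r_set, phi))

if __name__ == "__main__":
    r_set = [0.5, 0.25, 0.125, 0.125]  # toy per-set distribution
    print(hit_ratio(r_set, phi_lru(2, len(r_set))))
```

Any probability mass of r beyond the tracked histogram length is treated as a miss in this sketch.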


Prediction Accuracy
• LRU, PLRU(A=2), NMRU(A=2): exact per-set model
• Others: approximate per-set model

Overheads
• r' = r · B : 6–80 μsec
  • Binomial → Poisson approximation for each row of B
• h = (r · B) · φ : 20–30 μsec
  • Average over 24 configurations
  • B applied 8 times
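The row-wise approximation can be illustrated by comparing one Binomial(i, p) row of B with a Poisson row of mean λ = i · p. This is a generic sketch of the standard approximation under assumed function names; it does not reproduce the timing numbers above.

```python
from math import comb, exp

def binomial_row(i, p, n):
    """First n entries of row i of B (assumes n <= i + 1 and p < 1)."""
    return [comb(i, k) * p**k * (1 - p)**(i - k) for k in range(n)]

def poisson_row(i, p, n):
    """Poisson(i * p) probabilities, built with one multiply-divide per entry."""
    lam = i * p
    row = [exp(-lam)]
    for k in range(1, n):
        row.append(row[-1] * lam / k)
    return row

if __name__ == "__main__":
    exact = binomial_row(4096, 1 / 1024, 16)
    approx = poisson_row(4096, 1 / 1024, 16)
    print(max(abs(a - b) for a, b in zip(exact, approx)))
```

The Poisson row avoids large binomial coefficients, and the approximation is tight when i is large and p = 1/S is small.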


Computation reuse & speedup
[Figure: r(1), r(2^10), r(2^11), r(2^12), r(2^13), r(2^14); histogram r of P(URD=i) vs. i; Size=512]
• Poisson approximation
• Now: hardware support → r(2^10)
• “Shorter” tail → smaller matrices

Insights
[Figure: histogram r of P(URD=i) vs. i]
• x y z z y x : URD(x)=2
• “Remember” unique addresses
  • Only cardinality, not full addresses
  • Bloom filter for compact (approximate) representation
• r(2^10) is seen by any set of a cache with S=2^10
  • Filter address stream

Hardware Support for estimating r(2^10)
[Figure: block diagram. Components: Set Filter, Control Logic, reference address register, 1024-bit Bloom filter (2 hash fns), 9-bit counter, 512-entry histogram array. Signals: read access, filtered access, reset, load, hit, read, inc, insert]
• Start Sample
• Addr match? Y → inc histogram entry, End Sample
• Addr match? N → Unique? Y (not hit) → Remember
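A software sketch of the sampling flow may help in reading the diagram. It assumes the first access starts a sample, the set filter keeps only accesses that map to the reference block's set, and the counter saturates at the last histogram entry. The class name, the two hash functions, and the integer-as-bit-vector Bloom filter are illustrative assumptions, not the hardware design from the paper.

```python
class UrdSampler:
    def __init__(self, num_sets=1024, filter_bits=1024, hist_entries=512):
        self.num_sets = num_sets
        self.filter_bits = filter_bits
        self.hist = [0] * hist_entries
        self.ref = None                      # reference address register

    def _hashes(self, block):
        tag = block // self.num_sets         # two assumed multiplicative hashes
        return ((tag * 0x9E3779B1) >> 7) % self.filter_bits, \
               ((tag * 0x85EBCA6B) >> 11) % self.filter_bits

    def access(self, block):
        if self.ref is None:                 # start sample: load reference
            self.ref, self.bloom, self.count = block, 0, 0
            return
        if block % self.num_sets != self.ref % self.num_sets:
            return                           # set filter drops other sets
        if block == self.ref:                # address match: record, end sample
            self.hist[self.count] += 1
            self.ref = None
            return
        bits = 0
        for h in self._hashes(block):
            bits |= 1 << h
        if self.bloom & bits != bits:        # not in filter: a new unique address
            self.bloom |= bits               # remember it
            self.count = min(self.count + 1, len(self.hist) - 1)

if __name__ == "__main__":
    s = UrdSampler()
    for b in [0, 1024, 5, 2048, 2048, 1024, 7, 0]:
        s.access(b)
    print(s.hist[:4])
```

Bloom-filter false positives can make an address look already seen, so the recorded distances can only be underestimated; that is the approximate part of the compact representation.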

Agenda: Case Study (+ way counters)

LRU Way Counters [Suh, et al. 2002]
• One counter per logical way (stack position)
• Determining logical position is hard
  • not totally (re-)ordered with every access
  • heuristics, e.g., for PLRU [Kedzierski, et al. 2010]
• Other Limitations
  • Inclusion property
  • Fixed #sets
    • S' = S : special case of reuse framework
    • S' ≠ S ? Use B, provided enough tail of r(S) is available

Min. EDP configuration
• EDP within 7% of minimum
• Reuse models outperform PLRU way counters in most cases

Summary
• The Problem: Online miss-rate estimation for reconfigurable caches
• We propose a framework: h = (r · B) · φ
  • h: hit ratio
  • r: reuse-distance distribution (novel hardware support)
  • B: stochastic Binomial matrix
  • φ: hit function (LRU, PLRU, RANDOM, NMRU)
• Case study: EDP within 7% of minimum
• Future work: More policies, applications/case studies

Also in the paper
• r: lossy summarization of the address trace
• Estimation for ARD
• Optimizations for LRU
• Conditions for PLRU eviction
• More details on models & evaluation

Questions?

Example LLC performance
[Figure: OLTP (TPC-C + IBM DB2)]

Estimating cache performance
[Figure: histogram r of P(URD=i) vs. i, combined with φ, P(hit | URD=i) vs. i]
• Hit ratio = hits/access = ∑_i P(URD=i) · P(hit | URD=i) = r · φ
• Miss ratio = misses/access = 1 – hit ratio
• Miss rate = misses/instruction = miss ratio × accesses/instruction

URD vs ARD
[Figure: between two accesses to x, the stream is z0 {z0}* z1 {z0,z1}* z2 {z0,z1,z2}* z3 … z(k-1) {z0,z1,z2,...,z(k-1)}*; d_k marks the distance]
• Approximation: d_k = d_(k-1) + 1 / ∑_(i=k..∞) r_i
