Reuse-based Online Models for Caches
Rathijit Sen, David A. Wood
ACM SIGMETRICS 2013 @ CMU, Pittsburgh, PA

The Problem
[Figure: multicore chip with cores and LLC banks; an LLC miss is fetched from DRAM]
• Caches: power vs. performance
• Reconfigurable caches, e.g., IvyBridge
• The Problem: which configuration to select, e.g., to get the best energy efficiency?

Cache Performance Prediction
• We propose a framework: h = (r · B) · φ
  • h: hit ratio
  • r: reuse-distance distribution (novel hardware support)
  • B: stochastic Binomial matrix
  • φ: hit function (LRU, PLRU, RANDOM, NMRU)
• Case study: Energy-Delay Product (EDP) within 7% of minimum

Agenda
• The Problem
• Framework
  • Locality (r)
  • Matrix transformations (B)
  • Hit functions (φ)
  • h = (r · B) · φ
• Hardware support
• Case Study

Cache Overview
[Figure: cache with S sets and associativity A; the address selects a set and its tag is compared: match → hit, no match → miss]
• Limited storage
• Sets of (usually 64-byte) blocks
• #blocks/set = associativity (#ways)
• Set index + address tags identify data

Workload Variation
[Figure: Last-Level Cache (LLC) behavior across workloads: swim, mgrid, apache, zeus, oltp, jbb, equake, gafort, wupwise, fma3d, ammp, blackscholes, bodytrack, fluidanimate, freqmine, swaptions]

Bad configurations hurt!
[Figure: EDP (energy-delay product), maximum vs. minimum across configurations; annotations: 218% worse, 27% worse]

Problem Summary
[Figure: cache with associativity (A) and sets (S)]
• Reconfigurable caches
• Multiple replacement policies
• Goal: online miss-ratio prediction

Indexing Assumption
• Mapping of unique addresses to cache sets
  • Assumption: independent, uniform [Smith, 1978]
  • Unique accesses as Bernoulli trials
• (Partial) hashing
  • POWER4, POWER5, POWER6, Xeon
  • Simple XOR-based function [similar to Cypher, 2008]
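
A minimal sketch of what a simple XOR-based set-index function could look like. The slide does not give the actual function, so the XOR-folding scheme, the name `xor_set_index`, and the 64-byte block default are assumptions made for illustration only.

```python
def xor_set_index(address: int, num_sets: int, block_bytes: int = 64) -> int:
    """Fold the block address into log2(num_sets) bits with XOR.

    Hypothetical hash for illustration; not the function from the talk.
    """
    if num_sets < 1 or num_sets & (num_sets - 1):
        raise ValueError("num_sets must be a power of two")
    if num_sets == 1:
        return 0  # a single set needs no index bits
    index_bits = num_sets.bit_length() - 1
    mask = num_sets - 1
    block = address // block_bytes  # drop the block-offset bits
    index = 0
    while block:
        index ^= block & mask       # XOR successive index-sized chunks
        block >>= index_bits
    return index


# Blocks 1 and 1024 both land in set 1 of a 1024-set cache.
print(xor_set_index(64, 1024), xor_set_index(64 * 1024, 1024))
```

Folding the upper address bits into the index is one way to make the "independent, uniform" mapping assumption more plausible for strided access patterns.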

Temporal Locality Metrics
[Figure: histogram r of P(URD = i) over i; open question: what size?]
• Unique Reuse Distance (URD)
  • #unique intervening addresses
  • x y z z y x: URD(x) = 2
  • URD = Stack Distance [Mattson, 1970] − 1
  • Large cache ⇒ large distances to track
• Absolute Reuse Distance (ARD)
  • #intervening addresses
  • x y z z y x: ARD(x) = 4
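
The two metrics can be illustrated with a small offline sketch. It scans a trace and reports (address, URD, ARD) for every reuse. The function name `reuse_distances` is hypothetical, and the quadratic scan is chosen for clarity, not speed.

```python
def reuse_distances(trace):
    """Return (address, URD, ARD) for each access that reuses an address."""
    last_seen = {}  # address -> position of its previous access
    result = []
    for pos, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:pos]
            urd = len(set(between))  # unique intervening addresses
            ard = len(between)       # all intervening addresses
            result.append((addr, urd, ard))
        last_seen[addr] = pos
    return result


# The slide's example trace: x y z z y x
print(reuse_distances(list("xyzzyx")))
# → [('z', 0, 0), ('y', 1, 2), ('x', 2, 4)]
```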

Per-set Locality, r(S)
[Figure: reuse of x in a cache with S sets vs. S′ > S sets; histogram r of P(URD = i) over i]
• r(S) is “compressed” as S (#sets) increases
• Less of the tail is important

Estimating per-set locality
[Figure: row vector r, with entries P(URD = i), multiplied by matrix B, whose entry (i, k) is P(k successes in i trials), i.e., P(k of i map to the same set); entries with k > i are 0]
• Generalized stochastic Binomial matrices [Strum, 1977]
• r(S) = r(1) · B(1 − 1/S, 1/S)
• Composition: r(S′) = r(S) · B(1 − S/S′, S/S′)
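
The transformation can be sketched in a few lines of plain Python. Here r is assumed to be a finite list (no tail beyond its length), and the names `binomial_matrix` and `rescale` are hypothetical. The final check illustrates the composition rule: going from 1 set to 4 sets directly gives the same result as going through 2 sets.

```python
from math import comb


def binomial_matrix(n, p):
    """n x n matrix with B[i][k] = P(k successes in i trials), success prob. p."""
    return [[comb(i, k) * p**k * (1 - p)**(i - k) if k <= i else 0.0
             for k in range(n)]
            for i in range(n)]


def rescale(r, sets_from, sets_to):
    """Per-set URD distribution for sets_to sets, given the one for sets_from."""
    n = len(r)
    B = binomial_matrix(n, sets_from / sets_to)
    # Row vector times matrix: new_r[k] = sum_i r[i] * B[i][k]
    return [sum(r[i] * B[i][k] for i in range(n)) for k in range(n)]


r1 = [0.1, 0.2, 0.3, 0.4]                     # made-up r(1) for illustration
direct = rescale(r1, 1, 4)                    # r(4) = r(1) · B(3/4, 1/4)
composed = rescale(rescale(r1, 1, 2), 2, 4)   # r(1) → r(2) → r(4)
print(direct)
print(composed)
```

With more sets, probability mass moves toward small distances, which is the "compression" of r(S) described earlier in the talk.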

Computation reuse & speedup
[Figure: r(2^10), r(2^11), r(2^12), r(2^13), r(2^14) derived from r(1), with the Poisson approximation; histogram r of P(URD = i) over i, size to be determined]
• r(2^10): compute for now; hardware support later
• “Shorter” tail ⇒ smaller matrices

Size of r(2^10)?
[Figure: prediction with r(2^10) limited to URD < n; histogram r of P(URD = i) over i]

Hit Function, φ
[Figure: two accesses to x separated by accesses that are not x]
• φ_k: P(x will hit | URD(x) = k)
• Monotonically decreasing model: φ_0 = 1, φ_k ≤ φ_{k−1}, φ_∞ = 0
• Intuition: larger URD ⇒ same or larger eviction probability

Hit Function, φ
[Figure: example hit functions for A = 8]

Formulating φ
• φ(LRU): step function
  • (r · B) · φ(LRU) [Smith, 1978], [Hill & Smith, 1989]
• φ(PLRU): assumes that, on average, traffic is evenly divided between subtrees
• φ(RANDOM): estimates #intervening misses using ARD
• φ(NMRU): similar to φ(RANDOM), except φ_1 = 1
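
As a small illustration of the simplest case, the LRU step function can be written out and checked against the hit-function properties (φ_0 = 1, non-increasing in k). The sketch assumes the step falls at the associativity A, i.e., a block hits exactly when its per-set URD is below A; the name `phi_lru` is hypothetical, and the other policies are not modeled here.

```python
def phi_lru(assoc, n):
    """LRU hit function truncated to n entries: hit iff per-set URD < assoc."""
    return [1.0 if k < assoc else 0.0 for k in range(n)]


phi = phi_lru(assoc=8, n=12)
print(phi)
# → [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```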

Prediction Accuracy
• LRU, PLRU (A=2), NMRU (A=2): exact per-set model
• Others: approximate per-set model

Overheads
• r′ = r · B: 6–80 μsec
  • Binomial → Poisson approximation for each row of B
• h = (r · B) · φ: 20–30 μsec
  • Average over 24 configurations
  • B applied 8 times
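
The Binomial → Poisson step can be illustrated as follows. Row i of B holds Binomial(i, p) probabilities; for small p these are close to Poisson probabilities with mean i·p, which can be generated with one exponential and a simple recurrence. The function names and the example values (i = 1000, p = 1/1024, 8 columns) are illustrative assumptions, not figures from the talk.

```python
from math import comb, exp


def binomial_row(i, p, n):
    """First n entries of row i of B: P(k successes in i trials)."""
    return [comb(i, k) * p**k * (1 - p)**(i - k) if k <= i else 0.0
            for k in range(n)]


def poisson_row(i, p, n):
    """Poisson approximation of the same row, with mean lam = i * p."""
    lam = i * p
    row, term = [], exp(-lam)   # term starts at P(k = 0)
    for k in range(n):
        row.append(term)
        term *= lam / (k + 1)   # P(k + 1) = P(k) * lam / (k + 1)
    return row


exact = binomial_row(1000, 1 / 1024, 8)
approx = poisson_row(1000, 1 / 1024, 8)
max_error = max(abs(b - q) for b, q in zip(exact, approx))
print(max_error)
```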

Computation reuse & speedup
[Figure: r(2^10), r(2^11), r(2^12), r(2^13), r(2^14) and r(1), with the Poisson approximation; r(2^10) now comes from hardware support, with size = 512]
• “Shorter” tail ⇒ smaller matrices

Insights
[Figure: histogram r of P(URD = i) over i]
• x y z z y x: URD(x) = 2
• Unique ⇒ “remember” addresses
  • Only cardinality, not full addresses ⇒ Bloom filter for compact (approximate) representation
• r(2^10) is seen by any set of a cache with S = 2^10
  • Filter address stream

Hardware Support for estimating r(2^10)
[Figure: block diagram. Components: Set Filter, Control Logic, 9-bit Counter, 1024-bit Bloom Filter (2 hash fns), Reference address register, 512-entry Histogram array. Signals: access, filtered access, read, reset, load, hit, inc, insert]
[Flowchart: Start Sample → Addr match? Y → inc → End Sample. N → Unique (not a Bloom filter hit)? Y → Remember]
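
A rough software model of the sampling flow may help in reading the diagram. It is a sketch under several assumptions that the slide does not confirm: the two hash functions are placeholders, the counter saturates at the last histogram bin, an address match increments the histogram bin selected by the counter, and the caller has already applied the set filter (only accesses to one chosen set are passed in). The class name `UrdSampler` is hypothetical.

```python
class UrdSampler:
    """Software model of a Bloom-filter-based URD histogram sampler."""

    def __init__(self, filter_bits=1024, bins=512):
        self.filter_bits = filter_bits
        self.histogram = [0] * bins
        self.reference = None   # reference address register
        self.bloom = 0          # Bloom filter bits packed into an int
        self.count = 0          # unique addresses seen in this sample

    def _hashes(self, addr):
        # Two placeholder hash functions (assumption, not from the talk).
        h1 = addr % self.filter_bits
        h2 = ((addr * 2654435761) >> 7) % self.filter_bits
        return h1, h2

    def access(self, addr):
        """Feed one filtered access (an integer block address)."""
        if self.reference is None:          # Start Sample
            self.reference = addr
            self.bloom, self.count = 0, 0
            return
        if addr == self.reference:          # Addr match: record URD
            self.histogram[self.count] += 1
            self.reference = None           # End Sample
            return
        bits = self._hashes(addr)
        if not all((self.bloom >> b) & 1 for b in bits):  # unique (no hit)
            for b in bits:                  # Remember
                self.bloom |= 1 << b
            # Saturate at the last bin (assumed overflow behavior).
            self.count = min(self.count + 1, len(self.histogram) - 1)


sampler = UrdSampler()
for block in [1, 2, 3, 3, 2, 1]:            # x y z z y x with x=1, y=2, z=3
    sampler.access(block)
print(sampler.histogram[:4])
# → [0, 0, 1, 0]
```

Because a Bloom filter can report false hits, a new address may occasionally be treated as already seen, so the recorded distance can only be underestimated, never overestimated.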

Agenda + way counters
• The Problem
• Framework
  • Locality (r)
  • Matrix transformations (B)
  • Hit functions (φ)
  • h = (r · B) · φ
• Hardware support
• Case Study

LRU Way Counters [Suh, et al. 2002]
• One counter per logical way (stack position)
• Determining logical position is hard
  • not totally (re-)ordered with every access
  • heuristics, e.g., for PLRU [Kedzierski, et al. 2010]
• Other limitations
  • Inclusion property
  • Fixed #sets
    • S′ = S: special case of reuse framework
    • S′ ≠ S? Use B, provided enough of the tail of r(S) is available

Min. EDP configuration
• EDP within 7% of minimum
• Reuse models outperform PLRU way counters in most cases

Summary
• The Problem: online miss-rate estimation for reconfigurable caches
• We propose a framework: h = (r · B) · φ
  • h: hit ratio
  • r: reuse-distance distribution (novel hardware support)
  • B: stochastic Binomial matrix
  • φ: hit function (LRU, PLRU, RANDOM, NMRU)
• Case study: EDP within 7% of minimum
• Future work: more policies, applications/case studies

Also in the paper
• r: lossy summarization of the address trace
• Estimation for ARD
• Optimizations for LRU
• Conditions for PLRU eviction
• More details on models & evaluation

Reuse-based Online Models for Caches
Questions?

Example LLC performance
[Figure: LLC performance for OLTP (TPC-C + IBM DB2)]

Estimating cache performance
[Figure: histogram r of P(URD = i) and hit function φ of P(hit | URD = i), both over i]
• Hit ratio = hits/access = ∑_i P(URD = i) · P(hit | URD = i) = r · φ
• Miss ratio = misses/access = 1 − hit ratio
• Miss rate = misses/instruction = miss ratio × accesses/instruction
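
The three definitions chain together directly, as the following sketch shows. The function name `cache_metrics` and the example numbers are made up for illustration; r and φ are assumed to be finite lists of equal length.

```python
def cache_metrics(r, phi, accesses_per_instruction):
    """Return (hit ratio, miss ratio, miss rate) from r, phi and access density."""
    hit_ratio = sum(ri * pi for ri, pi in zip(r, phi))  # r · φ
    miss_ratio = 1.0 - hit_ratio                        # misses per access
    miss_rate = miss_ratio * accesses_per_instruction   # misses per instruction
    return hit_ratio, miss_ratio, miss_rate


# Made-up per-set distribution, a 2-way LRU hit function, 0.02 accesses/instruction.
hit, miss, rate = cache_metrics([0.5, 0.25, 0.25], [1.0, 1.0, 0.0], 0.02)
print(hit, miss, rate)
```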
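
URD vs ARD
[Figure: between two accesses to x, unique addresses z0, z1, z2, z3, …, zk-1 appear in turn, each followed by repeats drawn from the addresses seen so far: {z0}*, {z0,z1}*, {z0,z1,z2}*, …, {z0,z1,z2,...,zk-1}*]
• Approximation: d_k = d_{k−1} + 1 / ∑_{i=k}^{∞} r_i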