SLO-aware Hybrid Store
Priya Sehgal, Kaladhar Voruganti, Rajesh Sundaram
University Day 2013, 8th March 2013
MSST 2012 paper: http://storageconference.org/2012/Papers/21.Short.2.SLOAware.pdf
Introduction
• What is an SLO?
  • Service Level Objective
  • Specification of application requirements
  • Technology-independent
• Examples
  • Performance: average I/O latency, throughput
  • Capacity
  • Reliability
  • Security, etc.
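To make this concrete, here is a small illustrative example (not from the slides; all field names are hypothetical) of how a technology-independent SLO for one volume might be written down:

```python
# Hypothetical SLO specification for one volume; the slides only list the
# requirement categories (performance, capacity, reliability, security).
slo_vol1 = {
    "volume": "vol1",
    "performance": {"avg_io_latency_ms": 5, "throughput_iops": 5000},
    "capacity_gb": 600,
    "reliability": {"max_unavailability_pct": 0.01},
    "security": {"encryption_at_rest": True},
}
```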
Problem and Motivation
• Setup: Hybrid Store (HyS1) with an SSD array as a read cache in front of an HDD array as the permanent store; W1 runs on vol1 with SLO = 5 ms, W2 runs on vol2 with SLO = 10 ms
• Assumptions:
  • SLO = latency (ms)
  • W1 and W2 work on different data sets
  • SSD tier used for read caching only
• Problems:
  • SLO inversion
  • SLO violation
  • Sub-optimal SSD utilization
Need: bring SLO-awareness to SSD caching (read) in HyS
Solution – Trailer
Per-workload cache partitioning and dynamic cache sizing
Architecture
• Per-workload controllers (W1 Controller, W2 Controller, …, Wn Controller), each driving a per-workload Eviction Engine inside the Hybrid Store
• Master Controller: communicates with the upper layer and issues SSD partition size increase/decrease decisions to the per-workload controllers
• Monitoring daemon: collects SLO stats (observed latency, expected latency) and HyS SSD stats (SSD hit %, SSD utilization) and feeds them to the controllers
[Architecture diagram: numbered data flow (steps 1-6) between the Store, monitoring daemon, controllers, and eviction engines; not reproducible in text.]
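The slide only names the components, so the following is a minimal Python sketch, under my own assumptions about class and method names, of how the monitoring daemon's stats, the per-workload controllers, the master controller, and the eviction engines could fit together:

```python
from dataclasses import dataclass

@dataclass
class WorkloadStats:
    """Per-workload stats collected by the monitoring daemon."""
    observed_latency_ms: float    # e.g. 75th-percentile observed latency
    expected_latency_ms: float    # SLO target
    ssd_hit_pct: float
    ssd_util_pct: float

class EvictionEngine:
    """Per-workload eviction engine inside the hybrid store (stub)."""
    def resize(self, new_size_gb: float) -> None:
        print(f"resize SSD partition to {new_size_gb:.1f} GB")  # evicts if shrinking

class WorkloadController:
    """One controller per workload; proposes its SSD partition size."""
    def __init__(self, eviction_engine: EvictionEngine):
        self.eviction_engine = eviction_engine

    def desired_size(self, stats: WorkloadStats, current_gb: float) -> float:
        # Placeholder policy: the EAFC described later refines this step.
        if stats.observed_latency_ms > stats.expected_latency_ms:
            return current_gb * 1.1     # latency too high -> ask for more SSD
        return current_gb

class MasterController:
    """Arbitrates per-workload requests against the total SSD capacity."""
    def __init__(self, controllers: dict, total_ssd_gb: float):
        self.controllers = controllers
        self.total_ssd_gb = total_ssd_gb

    def tick(self, stats: dict, sizes: dict) -> None:
        wanted = {w: c.desired_size(stats[w], sizes[w])
                  for w, c in self.controllers.items()}
        total = sum(wanted.values())
        if total > self.total_ssd_gb:   # scale down proportionally if oversubscribed
            wanted = {w: s * self.total_ssd_gb / total for w, s in wanted.items()}
        for w, size_gb in wanted.items():
            self.controllers[w].eviction_engine.resize(size_gb)
```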
Controller Design Space Dimensions
• Un-partitioned vs. partitioned cache
• Static vs. dynamic partitioning
• Cache size decrement decision: single max threshold vs. min-max range
When to decrease the cache size?
• Option 1: decrease the partition size as soon as the SLO is met
  • Causes oscillations due to unnecessary evictions
• Option 2: decrease only when observed latency goes below a minimum threshold
  • Defines a range of operation (min and max thresholds)
  • Helps stabilize cache size and avoid unnecessary evictions
  • The range varies depending upon the SLO and cache size
  • Percentage conformance (see the sketch below)
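A minimal sketch contrasting the two shrink policies above; the threshold names lmin/lmax are taken from the EAFC slide, everything else is illustrative:

```python
def should_shrink_naive(observed_ms: float, slo_ms: float) -> bool:
    """Shrink as soon as the SLO is met: prone to oscillation, since the
    next eviction can push latency right back above the SLO."""
    return observed_ms <= slo_ms

def should_shrink_ranged(observed_ms: float, lmin_ms: float, lmax_ms: float) -> bool:
    """Shrink only when latency drops below the lower end of the operating
    range [lmin, lmax]; anywhere inside the range the size is left alone."""
    assert lmin_ms < lmax_ms
    return observed_ms < lmin_ms
```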
Decrease Partition Size Problem
[Figure: decrease in partition size leading to SLO violations.]
Insights on range of operation
[Figure, four panels: (a) SLO target = 99th percentile, (b) SLO target = 90th percentile, (c) SLO target = 80th percentile, (d) SLO target = 75th percentile.]
Error-aware Feedback Controller (EAFC)
• Inputs: 75th-percentile observed latency (lobs), SLO range (lmin, lmax), SSD utilization %, SSD hit %
• Error computation:
  • SLO violation (lobs above the range): e = lmax - lobs
  • Observed latency below the range: e = lmin - lobs
  • Within the SLO range: e = 0
• Controller output: Kp * e, with Kp = 0, Kpinc, or Kpdec
• New partition size passed to the Eviction Engine: szold * (1 + Kp * e) when growing, szold * (1 - Kp * e) when shrinking, evaluated once per control window (Cwindow)
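A minimal Python sketch of one EAFC step, following the error definitions on this slide; the exact sign handling and the pairing of Kpinc/Kpdec with the grow/shrink branches are my assumptions, not taken verbatim from the paper:

```python
def eafc_step(sz_old_gb: float, lobs_ms: float,
              lmin_ms: float, lmax_ms: float,
              kp_inc: float = 0.05, kp_dec: float = 0.02) -> float:
    """One controller iteration per control window: compare the observed
    75th-percentile latency against the SLO range [lmin, lmax] and resize
    the SSD partition proportionally to the error."""
    if lobs_ms > lmax_ms:                    # SLO violation -> grow the partition
        e = lmax_ms - lobs_ms                # negative error
        return sz_old_gb * (1 - kp_inc * e)  # (1 - kp*e) > 1, size grows
    if lobs_ms < lmin_ms:                    # below the range -> shrink
        e = lmin_ms - lobs_ms                # positive error
        return sz_old_gb * (1 - kp_dec * e)  # (1 - kp*e) < 1, size shrinks
    return sz_old_gb                         # within range: e = 0, no change

# Example: with an SLO range of [2 ms, 3 ms] and 5 ms observed latency,
# the partition grows by kp_inc * 2 = 10%.
```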
Test Description
• Hybrid Store prototype
  • 1 workload per volume (1 workload = all I/Os coming to a volume)
  • HDD space: 1 TB
  • SSD space: 160 GB
  • RAM size: 16 GB
• Test 1 - SPECsfs 2008-like:
  • No. of threads = 20
  • Load/thread = 250 IOPS (5000 IOPS total)
  • Total dataset size = 600 GB
  • Target latency = 8 ms, 3 ms
SPECsfs 2008 (SLO target = 8 ms)
Amount of SSD read cache required to meet 8 ms latency = 15 GB
SPECsfs 2008 (SLO target = 3 ms)
Amount of SSD read cache required to meet 3 ms latency = 35 GB (2.3x more than for the 8 ms target)
SPECsfs 2008 Tests
• Filer setup same as before
• Test 1:
  • No. of threads = 20
  • Load/thread = 250 IOPS (5000 IOPS total)
  • Total dataset size = 600 GB
  • Target latency = 8 ms, 3 ms
• Test 2:
  • No. of threads = 20
  • Load/thread varying across 100, 200, and 250 IOPS (2000, 4000, and 5000 IOPS total), each run for 5 hrs
  • Target latency = 3 ms
SPECsfs 2008 Test 2 (Varying loads)
Conclusion
• Insights
  • It is not necessary to cache the whole working set (WSS) to meet certain latency targets
  • 75th-percentile SLO conformance yields a close-to-optimal SSD size that meets the average SLO almost all the time
  • The cache sizer needs only a few hundred history points to set the appropriate SSD size, so it is light-weight
• Objectives met
  • SLO met close to 100% of the time
  • With a close-to-optimal amount of SSD, improving SSD utilization
  • Without much computation or memory overhead