Memento: Coordinated In-Memory Caching for Data-Intensive Clusters Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica
Data Intensive Computation • Data analytic clusters are pervasive • Jobs run multiple tasks in parallel • Jobs operate on petabytes of input • Distributed file systems (DFS) store data distributed and replicated • Data reads are either disk-local or remote across the network
Access to disk is slow • Memory is orders of magnitude faster • How do we leverage memory storage for datacenter jobs?
Can we store all data in memory? • Machines have tens of gigabytes of memory • But there is a huge discrepancy between storage and memory capacities • Facebook's cluster has ~200x more data on disk than memory Use Memory as Cache
Will the data fit in cache? • Job input sizes are heavy-tailed • The smallest 96% of jobs can fit their entire input in the memory cache • Over 80% of jobs together account for only 10% of the total input
Elephants and mice • Mix of a few “large” jobs and very many “small” jobs • Large jobs: • Batch operations • Production jobs • Small jobs: • Interactive queries (e.g., Hive, SCOPE) • Experimental analytics
Challenge: Small Parallel Jobs • Job finishes when its last task finishes • Need to cache all-or-nothing
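The all-or-nothing property can be seen in a minimal sketch (my own illustration, not the authors' code): a parallel job finishes when its slowest task finishes, so caching only some of its inputs buys nothing.

```python
def job_completion_time(task_times):
    """A parallel job's completion time is that of its last (slowest) task."""
    return max(task_times)

# Four parallel tasks; assume 10s from disk, 1s from memory (illustrative numbers).
all_disk = [10, 10, 10, 10]
partial  = [1, 1, 1, 10]   # 3 of 4 inputs cached
full     = [1, 1, 1, 1]    # all 4 inputs cached

assert job_completion_time(all_disk) == 10
assert job_completion_time(partial) == 10   # no speedup despite a 75% hit-ratio
assert job_completion_time(full) == 1       # speedup only once everything is cached
```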
In summary… • Only option for memory-locality is caching • 96% of jobs can have their data in memory, if we cache it right
Outline • FATE: Cache Replacement • Memento: System Architecture • Evaluation
We care about jobs finishing faster… • Job j that completes in tn time normally takes tm time with memory caching • %Reductionj = (tn − tm) / tn × 100 • Metric: average % reduction in completion time across jobs
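The metric above is straightforward to compute; a short sketch (names are mine, not from the deck):

```python
def pct_reduction(t_n, t_m):
    """% reduction in completion time for one job:
    t_n = normal completion time, t_m = time with memory caching."""
    return 100.0 * (t_n - t_m) / t_n

def avg_pct_reduction(pairs):
    """Average % reduction over (t_n, t_m) pairs, one per job."""
    return sum(pct_reduction(tn, tm) for tn, tm in pairs) / len(pairs)

# A job that drops from 100s to 10s sees a 90% reduction;
# averaged with a job that saw no benefit, the workload-level number is 45%.
assert pct_reduction(100, 10) == 90.0
assert avg_pct_reduction([(100, 10), (100, 100)]) == 45.0
```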
Traditional Cache Replacement • Traditional cache replacement policies (e.g., LRU, LFU) optimize for hit-ratio • Belady’s MIN: Evict blocks that are to be accessed “farthest in future”
Belady’s MIN Example • Future access sequence (over time): E, F, B, D, C, A • On each miss, evict the cached block accessed farthest in the future • Result on this workload: 50% cache hit-ratio
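Belady's MIN eviction rule can be sketched in a few lines (a simplified illustration under the assumption that the full future access sequence is known; not an implementation from the talk):

```python
def min_evict(cache, future_accesses):
    """Belady's MIN: evict the cached block whose next access is farthest
    in the future (a block never accessed again is the ideal victim)."""
    def next_use(block):
        try:
            return future_accesses.index(block)
        except ValueError:
            return float('inf')  # never used again
    return max(cache, key=next_use)

# Cache holds A..D; upcoming accesses are B, D, C, A, so A is farthest out.
assert min_evict({'A', 'B', 'C', 'D'}, ['B', 'D', 'C', 'A']) == 'A'
```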
MIN: How much do jobs benefit? • Memory-local tasks are 10x (or 90%) faster • 4 computation slots; job J1 reads blocks A, B and job J2 reads blocks C, D • MIN caches one block of each job, so every job still waits on a disk-bound task • Reduction: Average(0 + 0)/2 = 0%
“Whole-job” inputs • Same access sequence and same 50% cache hit-ratio • But cache both blocks of one job (e.g., J1's A and B) instead of one block from each job
“Whole-job” caching: How much do jobs benefit? • Memory-local tasks are 10x (or 90%) faster • 4 computation slots; with both of J1's blocks cached, J1 sees a 90% reduction and J2 sees 0% • Reduction: Average(90 + 0)/2 = 45% (vs. MIN: Average(0 + 0)/2 = 0%) • Cache hit-ratio is not the right metric for parallel jobs
FATE Cache Replacement • Maximize “whole-job” inputs in cache • Needs global coordination: a job's parallel tasks are distributed over different machines • Property: • Small jobs get preference • Large jobs benefit from the remaining cache space
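A toy sketch of the "whole-job" eviction idea (my own interpretation, with hypothetical names; the deck does not give FATE's exact algorithm): prefer evicting from jobs whose cached input is already partial, since partial inputs are worthless to single-wave jobs, and among those pick the largest job so small jobs get preference.

```python
def fate_victim(cache, job_inputs):
    """Pick a cached block to evict.

    cache: set of cached block ids
    job_inputs: dict mapping job id -> set of its input block ids
    """
    # Jobs with some, but not all, input blocks in cache.
    incomplete = [j for j, blocks in job_inputs.items()
                  if (blocks & cache) and not blocks <= cache]
    # Evict from an already-incomplete job if possible (its cached blocks
    # give no all-or-nothing benefit); among candidates, pick the largest job.
    candidates = incomplete or list(job_inputs)
    victim_job = max(candidates, key=lambda j: len(job_inputs[j]))
    return next(iter(job_inputs[victim_job] & cache))

cache = {'A', 'B', 'C'}                     # J1 fully cached, J2 partially
jobs = {'J1': {'A', 'B'}, 'J2': {'C', 'D'}}
assert fate_victim(cache, jobs) == 'C'      # keep J1's whole-job input intact
```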
Waves in the job • Single wave (small jobs): all-or-nothing • Multiple waves (large jobs): linear benefits
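The wave distinction can be made concrete with a toy model (assumptions mine: tasks run in waves of S slots, cached tasks finish 10x faster, longest tasks scheduled first):

```python
import math

def waves(tasks, slots):
    """Number of waves a job of `tasks` tasks needs on `slots` slots."""
    return math.ceil(tasks / slots)

def completion_time(tasks, slots, cached, disk_t=10.0, mem_t=1.0):
    """Toy wave model: each wave runs up to `slots` tasks; a wave takes as
    long as its slowest task; cached tasks take mem_t, the rest disk_t."""
    times = [mem_t] * cached + [disk_t] * (tasks - cached)
    times.sort(reverse=True)  # schedule the longest tasks first
    wave_groups = [times[i:i + slots] for i in range(0, len(times), slots)]
    return sum(max(w) for w in wave_groups)

# Single-wave job (4 tasks, 4 slots): all-or-nothing.
assert completion_time(4, 4, cached=3) == 10.0   # partial caching: no benefit
assert completion_time(4, 4, cached=4) == 1.0    # full caching: 10x faster
# Multi-wave job (8 tasks, 4 slots): each cached wave-worth helps.
assert completion_time(8, 4, cached=0) == 20.0
assert completion_time(8, 4, cached=4) == 11.0   # incremental, roughly linear
```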
Outline • FATE: Cache Replacement • Memento: System Architecture • Evaluation
Global coordination of local caches • Coordinator maintains a global view of the cache
Memento: Salient Features • Metadata communication • External service • Local cache reads
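The coordinator/client split above might look like the following sketch (hypothetical class and method names, mine; the point is that only metadata about which blocks live where flows through the coordinator, keeping it off the data path):

```python
class Coordinator:
    """Tracks which machines cache which blocks; never touches block data."""

    def __init__(self):
        self.block_locations = {}            # block id -> set of machines

    def register(self, machine, block):
        """A client reports a newly cached block."""
        self.block_locations.setdefault(block, set()).add(machine)

    def evicted(self, machine, block):
        """A client reports that it evicted a block."""
        self.block_locations.get(block, set()).discard(machine)

    def locate(self, block):
        """The job manager asks where a block is cached, to place tasks."""
        return self.block_locations.get(block, set())

coord = Coordinator()
coord.register('m1', 'blockA')
coord.register('m2', 'blockA')
coord.evicted('m1', 'blockA')
assert coord.locate('blockA') == {'m2'}
```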
Outline • FATE: Cache Replacement • Memento: System Architecture • Evaluation
Evaluation • HDFS in conjunction with Memento • Microsoft and Facebook traces replayed • Replay jobs with same inter-arrival time • Deployment on EC2 cluster of 100 machines • 20GB memory for Memento • Jobs binned by their size
Jobs are 77% faster on average • Small jobs see an 85% reduction in completion time
Cache hit-ratio matters less • Average job is faster by 77% with FATE (vs.) 49% with MIN
Memento scales sufficiently • Coordinator handles 10,000 simultaneous client communications • Client can handle eight simultaneous local map tasks • Sufficient for current datacenter loads
Simpler Implementation [1] • Ride the OS cache • Estimate where block is cached • Change job manager to track block accesses • No FATE, use default (LRU?) • Initial results show 2.3x improvement in cache hit-rate
Alternate Metrics [2] • We optimize for “average % reduction in completion time” of jobs • Average • Weighted to include job priorities? • Other metrics • Reduction of load on disk subsystem? • Utilization?
Solid State Devices [3] • SSDs, a new layer in the storage hierarchy • Hierarchical Caching • Include SSDs between disk and memory • What’s the best cache replacement policy?
Summary • Memory caching can be surprisingly effective • …despite the discrepancy between disk and memory capacities • Memento: Coordinated cache management • FATE Replacement Policy (“whole-jobs”) • Encouraging results on datacenter workloads