File Caching with SSD Arrays
Wei Yang
US ATLAS Distributed Facility Workshop, University of California, Santa Cruz
Motivation
• We are curious
• No immediate needs, but future needs
• Caching (only) analysis job inputs
• SSD has limited write cycles
• Other goals, see the last slide
• File level caching
• Conventional LFU/LRU algorithms cannot capture the data usage pattern of ATLAS analysis jobs (if there is such a pattern)
• Sub-file level caching would be great! But bookkeeping is hard
• We are searching for a caching algorithm
• Out-Bytes > In-Bytes under ATLAS workload
• Use LRU, but based on days/weeks/months job usage pattern
Setup 1: Caching based on File Access Frequency
[Diagram labels: Analysis jobs visit SSD cache first; Cache miss! forward to HD storage; Xrootd monitoring stream; Fill the cache]
• A table records the access frequency of all files
• Rotate columns to maintain N days of records
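The frequency table with rotating columns could be sketched roughly as follows. This is only an illustration under assumptions: the class name `AccessFrequencyTable`, its methods, and the one-column-per-day layout are hypothetical and not taken from the slides, and feeding it from the Xrootd monitoring stream is left out.

```python
from collections import defaultdict, deque


class AccessFrequencyTable:
    """Per-file access counts, one column per day, newest column last.

    Hypothetical sketch: names and layout are assumptions, not the
    author's implementation.
    """

    def __init__(self, n_days):
        self.n_days = n_days
        # Each file maps to a fixed-length deque of daily counts.
        self.counts = defaultdict(lambda: deque([0] * n_days, maxlen=n_days))

    def record_access(self, path):
        """Count one access to `path` in today's column."""
        self.counts[path][-1] += 1

    def rotate(self):
        """Start a new day: drop the oldest column, append an empty one."""
        for path in list(self.counts):
            column = self.counts[path]
            column.append(0)  # maxlen discards the oldest day automatically
            if not any(column):
                del self.counts[path]  # no accesses left inside the window

    def frequency(self, path):
        """Total accesses to `path` over the last n_days."""
        return sum(self.counts[path]) if path in self.counts else 0
```

Calling `rotate()` once per day keeps exactly N days of records, and files with no accesses left in the window drop out of the table.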
Setup 2: Caching based on Historic File Access Info
[Diagram labels: Analysis jobs visit SSD cache first; Cache miss! forward to HD storage; Xrootd monitoring stream to UCSD collector; Fill the cache]
• Record every file access as event-like info
• Save to ROOT files for later analysis
Hardware of the SSD Box
• Dell 610
  • 8-core 2.4 GHz
  • 24GB
  • Intel dual X520 10Gb NIC
  • LSI SAS 9200-8e (supports TRIM)
  • RHEL 6 x86_64
  • Xrootd
• SSD Array
  • Dell MD1220
  • 12x OCZ Talos 960GB MLC SSDs, total ~11TB
  • Non-RAID, to support TRIM
  • Xrootd takes care of gluing them together as a single space
File Access Freq. Alg.
[Plots: 6-month plot as of 2012-11-12; 3-hour plot from 2012-11-05]
Plot annotations: Sept 1; net data sink, not cache; cache brings in ~200GB/hour; lack of jobs
• The box can deliver. Can the caching algorithm deliver?
• Algorithm: Bytes-read/file size > 110% during the last 5 days, prioritized by this ratio and up to 200GB/hour
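The selection rule on this slide (ratio of bytes read to file size above 110%, highest ratio first, capped at 200GB per hour) might look like the sketch below. The function name, the input layout, the decimal reading of "GB", skipping files that are already cached, and skipping files that do not fit in the remaining budget are all assumptions made for illustration.

```python
def select_files_to_cache(stats, cached=frozenset(), min_ratio=1.10,
                          budget_bytes=200 * 10**9):
    """Pick files to copy into the cache during one hour.

    stats maps path -> (bytes_read_in_last_5_days, file_size_bytes).
    Hypothetical sketch: the input layout and the decimal GB budget
    are assumptions.
    """
    candidates = []
    for path, (bytes_read, size) in stats.items():
        if path in cached or size <= 0:
            continue
        ratio = bytes_read / size
        if ratio > min_ratio:
            candidates.append((ratio, path, size))

    # Highest bytes-read/file-size ratio first.
    candidates.sort(key=lambda c: c[0], reverse=True)

    selected, used = [], 0
    for ratio, path, size in candidates:
        if used + size > budget_bytes:
            continue  # does not fit in this hour's budget; try smaller files
        selected.append(path)
        used += size
    return selected
```

A file read less than 1.1 times its own size is never selected, which matches the goal stated earlier of having Out-Bytes exceed In-Bytes.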
[Plot legend: GB/hour from SSD + HDD; GB/hour from SSD; GB/hour to SSD]
Plot annotations: ceiling of 10Gb NIC; lost monitoring data from HDD; lack of jobs for the last 4 days; UCSD collector dead
Simulate the Cache with Historic Data
For a given caching algorithm, what do we want to learn from the historic data?
For day 0:
• Size of all files read
• Bytes read from SSD+HDD
• Bytes read from SSD
[Diagram: timeline of days -n, -n+1, ..., -1, 0]
(Cache size required for days [-x+1, 0]) - (cache size required for days [-x, -1]) = new data to cache
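One way to compute these quantities from recorded accesses is sketched below. It assumes, hypothetically, that the historic data has been reduced to a list `daily_reads` with one dict per day mapping file path to file size; the function names are illustrative. New data is counted here as the files that enter the window on day 0, which is one reading of the slide's difference of two cache sizes.

```python
def window_files(daily_reads, last_day, x):
    """Files read on days [last_day - x + 1, last_day], as path -> size."""
    files = {}
    first_day = max(0, last_day - x + 1)
    for day in range(first_day, last_day + 1):
        files.update(daily_reads[day])
    return files


def cache_size_required(daily_reads, last_day, x):
    """Bytes needed to hold every file read in the x-day window."""
    return sum(window_files(daily_reads, last_day, x).values())


def new_data_to_cache(daily_reads, day, x):
    """Bytes of files in the window ending on `day` that were not in
    the window ending the day before."""
    today = window_files(daily_reads, day, x)
    yesterday = window_files(daily_reads, day - 1, x) if day > 0 else {}
    return sum(size for path, size in today.items() if path not in yesterday)
```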
Algorithm: every file accessed during the last N days.
Algorithm: every file accessed during the last N days
Cache hit rate = Bytes from SSD / Bytes from SSD+HDD
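The hit rate for this algorithm can be estimated by replaying the recorded accesses, roughly as in the sketch below. The assumptions are loud ones: `daily_bytes_read` is a hypothetical list with one dict per day mapping file path to bytes read that day, the cache is large enough to hold every file from the previous N days, and the cache is refreshed only between days.

```python
def simulate_hit_rate(daily_bytes_read, n_days):
    """Per-day cache hit rate: bytes from SSD / bytes from SSD+HDD.

    On each day the simulated cache holds every file read during the
    previous n_days (hypothetical replay, unlimited cache size).
    Days with no reads yield None.
    """
    rates = []
    for day, reads in enumerate(daily_bytes_read):
        cached = set()
        for past in daily_bytes_read[max(0, day - n_days):day]:
            cached.update(past)  # adds the file paths read on that day

        total = sum(reads.values())
        from_ssd = sum(b for path, b in reads.items() if path in cached)
        rates.append(from_ssd / total if total else None)
    return rates
```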
Algorithm: Bytes-read/file size > 110% during the last 5 days
Analyzing the Historic Data
• Try to find a way to identify data worth caching.
• So far, not much success
[Plot label: Worth caching]
Do the jobs tend to open the same file in a short time window?
• If so, we may not have a chance to cache it
• Files that are worth caching
• Access times (open) scattered over several hours: cacheable
• But "scattered over several hours" doesn't mean the file is worth caching
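The question can be checked against the recorded accesses by measuring, for each file, how far apart its open times are. The sketch below assumes hypothetical `(path, open_time_seconds)` event tuples and an arbitrary 3-hour threshold standing in for "several hours". As noted above, a wide spread only makes a file cacheable; it does not show the file is worth caching.

```python
from collections import defaultdict


def open_time_spread(events):
    """Seconds between the first and last open of each file.

    events is an iterable of (path, open_time_seconds) tuples
    (hypothetical layout).
    """
    times = defaultdict(list)
    for path, opened_at in events:
        times[path].append(opened_at)
    return {path: max(ts) - min(ts) for path, ts in times.items()}


def scattered_files(events, min_spread_seconds=3 * 3600):
    """Files whose opens are spread over at least min_spread_seconds."""
    spread = open_time_spread(events)
    return sorted(p for p, s in spread.items() if s >= min_spread_seconds)
```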
Next Step
• So far focusing on making it a good cache
• More work to be done
• Should also look at
  • Asking Panda for the input file lists of upcoming jobs
  • Possibility of sub-file level caching
• How much can the cache speed up analysis jobs?
  • All files are in the SSD cache
  • Normal caching: some files are in the SSD cache, some are not