Improving Disk Throughput in Data-Intensive Servers Enrique V. Carrera and Ricardo Bianchini Department of Computer Science Rutgers University
Introduction
• Disk drives are often bottlenecks
• Several optimizations have been proposed
  • Disk arrays
  • Fewer disk reads using sophisticated buffer cache management
  • Optimized disk writes using logs
  • Optimized disk scheduling
• Disk throughput is still a problem for data-intensive servers
Modern Disk Drives
• Substantial processing and memory capacity
• Disk controller cache
  • Independent segments = sequential streams
  • If #streams > #segments, the LRU segment is replaced
  • On access, blocks are read ahead to fill the segment
• Disk arrays
  • Array controller may also cache data
  • Striping affects read-ahead
Key Problem
• Controller caches are not designed for servers
  • Sequential access to a small # of large files
  • Read-ahead of consecutive blocks
  • Segment is the unit of allocation and replacement
• Data-intensive servers
  • Small files
  • Large # of concurrent accesses
  • A large # of blocks often miss in the controller cache
This Work
• Goal
  • Controller cache management techniques that are efficient for servers
• Techniques
  • File-Oriented Read-ahead (FOR)
  • Host-guided Device Caching (HDC)
  • Exploit the processing and memory of the drives
File-Oriented Read-ahead
• The disk controller has no notion of file layout
  • Read-ahead can be useless for small files
  • Disk utilization is not amortized
  • Useless blocks pollute the controller cache
• FOR only reads ahead blocks of the same file
File-Oriented Read-ahead
• FOR needs to know the layout of files on disk
  • Bitmap of disk blocks kept by the controller
  • Bit = 1 → the block is the logical continuation of the previous block
  • Initialized at boot, updated on metadata writes
• # of blocks to read ahead = # of consecutive 1's, up to the maximum read-ahead size
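The read-ahead rule above can be sketched in a few lines. This is a hypothetical model, not the controller's actual firmware: the bit convention comes from the slides (bit i = 1 means block i is the logical continuation of block i − 1), while the list representation and the maximum read-ahead of 32 blocks are assumptions.

```python
MAX_READ_AHEAD = 32  # assumed cap on read-ahead, in blocks

def read_ahead_length(bitmap, block):
    """Count consecutive 1-bits after `block`, capped at MAX_READ_AHEAD.

    bitmap[i] == 1 means block i continues the same file as block i - 1,
    so the run of 1's after `block` is exactly the file's remaining
    consecutive blocks worth reading ahead.
    """
    n = 0
    i = block + 1
    while i < len(bitmap) and bitmap[i] == 1 and n < MAX_READ_AHEAD:
        n += 1
        i += 1
    return n

# Example: blocks 5-8 belong to the same file as block 4.
bitmap = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(read_ahead_length(bitmap, 4))  # 4
```

After a read of block 4, the controller would fetch four extra blocks; a read of block 9 (a new file) would trigger no read-ahead, which is exactly how FOR avoids polluting the cache with useless blocks.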
File-Oriented Read-ahead
• FOR could underutilize segments, so allocation and replacement are based on blocks
  • Replacement policy: MRU
• FOR benefits
  • Lower disk utilization
  • Higher controller cache hit rates
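Block-granularity allocation with MRU replacement can be illustrated with a small sketch. The slides specify only the granularity (blocks, not segments) and the policy (MRU); the list-based data structure here is an assumption for clarity, not the controller's implementation.

```python
class MRUBlockCache:
    """Toy block cache: the unit of allocation/replacement is one block,
    and on overflow the *most* recently used block is evicted (MRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []  # ordered oldest ... newest (last element = MRU)

    def access(self, blk):
        if blk in self.blocks:
            self.blocks.remove(blk)   # re-appended below as the new MRU
        elif len(self.blocks) >= self.capacity:
            self.blocks.pop()         # evict the current MRU block
        self.blocks.append(blk)

cache = MRUBlockCache(3)
for blk in [1, 2, 3, 4]:
    cache.access(blk)
print(cache.blocks)  # [1, 2, 4]
```

MRU suits read-ahead data: a just-consumed block of a sequential stream is the least likely to be re-read soon, so evicting the newest block preserves blocks that other concurrent streams may still need.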
Host-guided Device Caching
• Data-intensive servers rely on disk arrays, so a non-trivial amount of controller cache space is available
• Current disk controller caches are just speed-matching and read-ahead buffers
  • More useful if each cache can be managed directly by the host processor
Host-guided Device Caching
• Our evaluation:
  • Disk controllers permanently cache the data with the most misses in the buffer cache
  • Each controller caches data stored on its own disk
  • Assumes a block-based organization
• Support for three simple commands
  • pin_blk()
  • unpin_blk()
  • flush_hdc()
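A host-side model of these three commands might look like the sketch below. Only the command names (pin_blk, unpin_blk, flush_hdc) come from the slides; their signatures, return values, and the dict-based cache are assumptions made for illustration.

```python
class DiskControllerCache:
    """Hypothetical model of the HDC portion of one disk's controller cache."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.pinned = {}  # block number -> block data

    def pin_blk(self, blk, data):
        """Host asks the controller to keep `blk` cached until unpinned."""
        if blk not in self.pinned and len(self.pinned) >= self.capacity:
            return False  # no room; the host must unpin something first
        self.pinned[blk] = data
        return True

    def unpin_blk(self, blk):
        """Release a single pinned block."""
        self.pinned.pop(blk, None)

    def flush_hdc(self):
        """Release all pinned blocks, e.g. when a new period begins."""
        self.pinned.clear()
```

The key design point is that the *host* decides what to pin: it sees buffer cache misses across the whole workload, which the controller alone cannot observe.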
Host-guided Device Caching
• Execution is divided into periods to determine:
  • How many blocks to cache, which blocks those are, and when to cache them
• HDC benefits
  • Higher cache hit rate
  • Lower disk utilization
• Tradeoff: cache space for HDC vs. space for read-aheads
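The per-period decision of *which* blocks to pin can be sketched as follows, assuming the host tracks buffer-cache misses per block and pins the most-missed blocks that fit in the controller cache. The miss-counting mechanism and function names here are assumptions; the slides state only that the most-missed data is cached.

```python
from collections import Counter

def blocks_to_pin(miss_counts, cache_blocks):
    """Pick up to `cache_blocks` block numbers with the most buffer-cache
    misses in the last period; these are the HDC pinning candidates."""
    return [blk for blk, _ in Counter(miss_counts).most_common(cache_blocks)]

# Misses observed per block during one period (hypothetical numbers).
misses = {10: 50, 11: 3, 12: 42, 13: 7}
print(blocks_to_pin(misses, 2))  # [10, 12]
```

At each period boundary the host could flush_hdc(), recompute this set, and pin the new winners, adapting the pinned working set to shifts in the workload.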
Methodology
• Simulation of 8 IBM Ultrastar 36Z15 drives attached to a non-caching Ultra160 SCSI card
• Logical disk blocks striped across the array
• Contention for buses, memories, and other components is simulated in detail
• Synthetic + real traces (Web, proxy, and file servers)
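To make "logical disk blocks striped across the array" concrete, here is an illustrative round-robin mapping from a logical block to a (disk, physical block) pair. The 8-disk array matches the simulated configuration; the mapping formula and the stripe unit expressed in blocks are assumptions, not the simulator's documented layout.

```python
NUM_DISKS = 8  # matches the simulated 8-drive array

def stripe_map(logical_block, stripe_unit_blocks):
    """Map a logical block to (disk index, block offset on that disk)
    under round-robin striping with the given stripe unit."""
    stripe = logical_block // stripe_unit_blocks     # which stripe unit
    disk = stripe % NUM_DISKS                        # round-robin placement
    offset = (stripe // NUM_DISKS) * stripe_unit_blocks \
             + logical_block % stripe_unit_blocks
    return disk, offset

print(stripe_map(0, 4))   # (0, 0)
print(stripe_map(4, 4))   # (1, 0)
print(stripe_map(33, 4))  # (0, 5)
```

This is why striping interacts with read-ahead: consecutive logical blocks of one file land on different disks once a stripe-unit boundary is crossed, so each controller sees only fragments of the sequential stream.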
Real Workloads
Figure: Web — I/O time as a function of the striping unit size (HDC memory = 2 MB)
Real Workloads
Figure: Web — I/O time as a function of the HDC memory size (striping unit = 16 KB)
Real Workloads
• Summary
  • Consistent and significant performance gains
  • The combination of FOR and HDC achieves the best overall performance
Related Work
• Techniques external to disk controllers
  • The controller cache differs from other caches
  • Lack of temporal locality
  • Orders of magnitude smaller than main memory
  • Read-ahead restricted to sequential blocks
• Explicit grouping
  • Groupings need to be found and maintained
  • Segment replacements may eliminate the benefits
Related Work
• Controller read-ahead & caching techniques
  • None considered file system info, host-guided caching, or block-based organizations
• Other disk controller optimizations
  • Scheduling of requests
  • Utilizing free bandwidth
  • Data replication
  • FOR and HDC are orthogonal to these
Conclusions
• Current controller cache management is inappropriate for servers
• FOR and HDC achieve significant and consistent increases in server throughput
• Real workloads show improvements of 47%, 33%, and 21% (Web, proxy, and file server, respectively)
Extensions
• Strategies for servers that use raw I/O
• A better approach than the bitmap
• Array controllers that cache data and hide the individual disks
• Impact of other buffer cache replacement policies and sizes
More Information: http://www.darklab.rutgers.edu
Synthetic Workloads
Figure: I/O time as a function of file size
Synthetic Workloads
Figure: I/O time as a function of the number of simultaneous streams
Synthetic Workloads
Figure: I/O time as a function of access frequency
Synthetic Workloads
• Summary
  • Disabling read-ahead hurts performance for files > 16 KB
  • Simply replacing segments with blocks has no effect
  • FOR gains increase as file size decreases and the # of simultaneous streams increases
  • HDC gains increase as requests shift toward a small # of blocks
  • FOR gains decrease as the % of writes increases
Synthetic Workloads
Figure: I/O time as a function of the percentage of writes