Improving Disk Throughput in Data-Intensive Servers Enrique V. Carrera and Ricardo Bianchini Department of Computer Science Rutgers University
Introduction
• Disk drives are often bottlenecks
• Several optimizations have been proposed
  • Disk arrays
  • Fewer disk reads using sophisticated buffer cache management
  • Optimized disk writes using logs
  • Optimized disk scheduling
• Disk throughput is still a problem for data-intensive servers
Modern Disk Drives
• Substantial processing and memory capacity
• Disk controller cache
  • Independent segments = sequential streams
  • If #streams > #segments, the LRU segment is replaced
  • On access, blocks are read ahead to fill the segment
• Disk arrays
  • Array controller may also cache data
  • Striping affects read-ahead
Key Problem
• Controller caches are not designed for servers
  • Sequential access to a small # of large files
  • Read-ahead of consecutive blocks
  • Segment is the unit of allocation and replacement
• Data-intensive servers
  • Small files
  • Large # of concurrent accesses
  • A large # of blocks often miss in the controller cache
This Work
• Goal
  • Controller cache management techniques that are efficient for servers
• Techniques
  • File-Oriented Read-ahead (FOR)
  • Host-guided Device Caching (HDC)
  • Exploit the processing and memory of the drives
File-Oriented Read-ahead
• The disk controller has no notion of file layout
  • Read-ahead can be useless for small files
  • Disk utilization is not amortized
  • Useless blocks pollute the controller cache
• FOR only reads ahead blocks of the same file
File-Oriented Read-ahead
• FOR needs to know the layout of files on disk
  • Bitmap of disk blocks kept by the controller
  • Bit = 1 → the block is the logical continuation of the previous block
  • Initialized at boot, updated on metadata writes
• # of blocks to read ahead = # of consecutive 1's, up to the maximum read-ahead size
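The read-ahead rule above can be sketched in a few lines. This is a hypothetical model, not the controller's actual firmware: the bit convention comes from the slides (bit i = 1 means block i is the logical continuation of block i − 1), while the list representation and the maximum read-ahead of 32 blocks are assumptions.

```python
MAX_READ_AHEAD = 32  # assumed cap on read-ahead, in blocks

def read_ahead_length(bitmap, block):
    """Count consecutive 1-bits after `block`, capped at MAX_READ_AHEAD.

    bitmap[i] == 1 means block i continues the same file as block i - 1,
    so the run of 1's after `block` is exactly the file's remaining
    consecutive blocks worth reading ahead.
    """
    n = 0
    i = block + 1
    while i < len(bitmap) and bitmap[i] == 1 and n < MAX_READ_AHEAD:
        n += 1
        i += 1
    return n

# Example: blocks 5-8 belong to the same file as block 4.
bitmap = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(read_ahead_length(bitmap, 4))  # 4
```

After a read of block 4, the controller would fetch four extra blocks; a read of block 9 (a new file) would trigger no read-ahead, which is exactly how FOR avoids polluting the cache with useless blocks.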
File-Oriented Read-ahead
• FOR could underutilize segments, so allocation and replacement are based on blocks
  • Replacement policy: MRU
• FOR benefits
  • Lower disk utilization
  • Higher controller cache hit rates
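Block-granularity allocation with MRU replacement can be illustrated with a small sketch. The slides specify only the granularity (blocks, not segments) and the policy (MRU); the list-based data structure here is an assumption for clarity, not the controller's implementation.

```python
class MRUBlockCache:
    """Toy block cache: the unit of allocation/replacement is one block,
    and on overflow the *most* recently used block is evicted (MRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []  # ordered oldest ... newest (last element = MRU)

    def access(self, blk):
        if blk in self.blocks:
            self.blocks.remove(blk)   # re-appended below as the new MRU
        elif len(self.blocks) >= self.capacity:
            self.blocks.pop()         # evict the current MRU block
        self.blocks.append(blk)

cache = MRUBlockCache(3)
for blk in [1, 2, 3, 4]:
    cache.access(blk)
print(cache.blocks)  # [1, 2, 4]
```

MRU suits read-ahead data: a just-consumed block of a sequential stream is the least likely to be re-read soon, so evicting the newest block preserves blocks that other concurrent streams may still need.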
Host-guided Device Caching
• Data-intensive servers rely on disk arrays, so a non-trivial amount of controller cache space is available
• Current disk controller caches are just speed-matching and read-ahead buffers
  • More useful if each cache can be managed directly by the host processor
Host-guided Device Caching
• Our evaluation:
  • Disk controllers permanently cache the data with the most misses in the buffer cache
  • Each controller caches data stored on its own disk
  • Assumes a block-based organization
• Support for three simple commands
  • pin_blk()
  • unpin_blk()
  • flush_hdc()
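A host-side model of these three commands might look like the sketch below. Only the command names (pin_blk, unpin_blk, flush_hdc) come from the slides; their signatures, return values, and the dict-based cache are assumptions made for illustration.

```python
class DiskControllerCache:
    """Hypothetical model of the HDC portion of one disk's controller cache."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.pinned = {}  # block number -> block data

    def pin_blk(self, blk, data):
        """Host asks the controller to keep `blk` cached until unpinned."""
        if blk not in self.pinned and len(self.pinned) >= self.capacity:
            return False  # no room; the host must unpin something first
        self.pinned[blk] = data
        return True

    def unpin_blk(self, blk):
        """Release a single pinned block."""
        self.pinned.pop(blk, None)

    def flush_hdc(self):
        """Release all pinned blocks, e.g. when a new period begins."""
        self.pinned.clear()
```

The key design point is that the *host* decides what to pin: it sees buffer cache misses across the whole workload, which the controller alone cannot observe.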
Host-guided Device Caching
• Execution is divided into periods to determine:
  • How many blocks to cache, which blocks those are, and when to cache them
• HDC benefits
  • Higher cache hit rate
  • Lower disk utilization
• Tradeoff: cache space for HDC vs. space for read-aheads
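The per-period decision of *which* blocks to pin can be sketched as follows, assuming the host tracks buffer-cache misses per block and pins the most-missed blocks that fit in the controller cache. The miss-counting mechanism and function names here are assumptions; the slides state only that the most-missed data is cached.

```python
from collections import Counter

def blocks_to_pin(miss_counts, cache_blocks):
    """Pick up to `cache_blocks` block numbers with the most buffer-cache
    misses in the last period; these are the HDC pinning candidates."""
    return [blk for blk, _ in Counter(miss_counts).most_common(cache_blocks)]

# Misses observed per block during one period (hypothetical numbers).
misses = {10: 50, 11: 3, 12: 42, 13: 7}
print(blocks_to_pin(misses, 2))  # [10, 12]
```

At each period boundary the host could flush_hdc(), recompute this set, and pin the new winners, adapting the pinned working set to shifts in the workload.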
Methodology
• Simulation of 8 IBM Ultrastar 36Z15 drives attached to a non-caching Ultra160 SCSI card
• Logical disk blocks striped across the array
• Contention for buses, memories, and other components is simulated in detail
• Synthetic + real traces (Web, proxy, and file servers)
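To make "logical disk blocks striped across the array" concrete, here is an illustrative round-robin mapping from a logical block to a (disk, physical block) pair. The 8-disk array matches the simulated configuration; the mapping formula and the stripe unit expressed in blocks are assumptions, not the simulator's documented layout.

```python
NUM_DISKS = 8  # matches the simulated 8-drive array

def stripe_map(logical_block, stripe_unit_blocks):
    """Map a logical block to (disk index, block offset on that disk)
    under round-robin striping with the given stripe unit."""
    stripe = logical_block // stripe_unit_blocks     # which stripe unit
    disk = stripe % NUM_DISKS                        # round-robin placement
    offset = (stripe // NUM_DISKS) * stripe_unit_blocks \
             + logical_block % stripe_unit_blocks
    return disk, offset

print(stripe_map(0, 4))   # (0, 0)
print(stripe_map(4, 4))   # (1, 0)
print(stripe_map(33, 4))  # (0, 5)
```

This is why striping interacts with read-ahead: consecutive logical blocks of one file land on different disks once a stripe-unit boundary is crossed, so each controller sees only fragments of the sequential stream.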
Real Workloads
Figure: Web — I/O time as a function of the striping unit size (HDC memory = 2 MB)
Real Workloads
Figure: Web — I/O time as a function of the HDC memory size (striping unit = 16 KB)
Real Workloads
• Summary
  • Consistent and significant performance gains
  • The combination of FOR and HDC achieves the best overall performance
Related Work
• Techniques external to disk controllers
  • The controller cache differs from other caches
  • Lack of temporal locality
  • Orders of magnitude smaller than main memory
  • Read-ahead restricted to sequential blocks
• Explicit grouping
  • Groupings need to be found and maintained
  • Segment replacements may eliminate the benefits
Related Work
• Controller read-ahead & caching techniques
  • None considered file system info, host-guided caching, or block-based organizations
• Other disk controller optimizations
  • Scheduling of requests
  • Utilizing free bandwidth
  • Data replication
  • FOR and HDC are orthogonal to these
Conclusions
• Current controller cache management is inappropriate for servers
• FOR and HDC achieve significant and consistent increases in server throughput
• Real workloads show improvements of 47%, 33%, and 21% (Web, proxy, and file server, respectively)
Extensions
• Strategies for servers that use raw I/O
• A better approach than the bitmap
• Array controllers that cache data and hide the individual disks
• Impact of other buffer cache replacement policies and sizes
More Information: http://www.darklab.rutgers.edu
Synthetic Workloads
Figure: I/O time as a function of file size
Synthetic Workloads
Figure: I/O time as a function of the number of simultaneous streams
Synthetic Workloads
Figure: I/O time as a function of access frequency
Synthetic Workloads
• Summary
  • Disabling read-ahead hurts performance for files > 16 KB
  • Simply replacing segments with blocks has no effect
  • FOR gains increase as file size decreases and the # of simultaneous streams increases
  • HDC gains increase as requests shift toward a small # of blocks
  • FOR gains decrease as the % of writes increases
Synthetic Workloads
Figure: I/O time as a function of the percentage of writes