Gecko Storage System
Tudor Marian, Lakshmi Ganesh, and Hakim Weatherspoon
Cornell University
Gecko
• Save power by spinning/powering down disks
  • E.g., a RAID-1 mirror scheme with 5 primary/mirror pairs
• The file system (FS) access pattern on disk is arbitrary
  • Depends on FS internals, and gets worse as the FS ages
  • When to turn disks off? What if the prediction is wrong?
[Figure: write(fd,…) and read(fd,…) requests arriving at the block device]
Predictable Writes
• Access the same disks predictably for long periods
  • Amortize the cost of spinning disks down and up
• Idea: log-structured storage/file system
  • Writes go to the head of the log until the disk(s) fill up
[Figure: write(fd,…) appended at the log head of the block device; log tail at the other end]
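To make the predictability concrete, here is a minimal sketch assuming a hypothetical layout in which the log is chained across the drives (the constants and names are illustrative, not Gecko's actual code): because writes only move the log head forward, the drive absorbing writes changes only after an entire drive's worth of appends, so spin-down decisions for the remaining drives become trivial.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout: the log is chained across NUM_DISKS drives of       */
/* BLOCKS_PER_DISK 4 KB blocks each; writes always go to the log head.      */
#define NUM_DISKS        5
#define BLOCKS_PER_DISK  (512u * 1024 * 1024)   /* 2 TB drive / 4 KB blocks */

/* The drive that currently hosts the log head: only this drive (and its    */
/* mirror) must be spinning to absorb writes.                               */
static uint32_t active_disk(uint64_t log_head)
{
    return (uint32_t)(log_head / BLOCKS_PER_DISK) % NUM_DISKS;
}

int main(void)
{
    /* consecutive appends hit the same drive for ~2 TB worth of writes */
    uint64_t samples[] = { 0, BLOCKS_PER_DISK - 1, BLOCKS_PER_DISK,
                           2ull * BLOCKS_PER_DISK };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("log head %llu -> disk %u\n",
               (unsigned long long)samples[i], active_disk(samples[i]));
    return 0;
}
```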
Unpredictable Reads
• What about reads? They may access any part of the log!
• Keep only the “primary” disks spinning
  • Trade off read throughput for power savings
  • Can afford to spin up disks on demand as load surges
  • The file/buffer cache absorbs most read traffic anyway
[Figure: read(fd,…) and write(fd,…) requests against the block device, with log head and tail marked]
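A minimal sketch of the kind of read-routing policy this implies, assuming a hypothetical per-pair queue-depth watermark; `route_read`, `struct mirror_pair`, and `SPIN_UP_THRESHOLD` are illustrative names, not dm-gecko's actual logic:

```c
#include <stdbool.h>
#include <stdint.h>

/* Reads normally go to the primary of each RAID-1 pair so the mirror can  */
/* stay powered down; the mirror is spun up lazily only when the read load */
/* on the primary surges past an assumed threshold.                        */
struct mirror_pair {
    uint32_t primary_queue_depth;   /* outstanding reads on the primary    */
    bool     mirror_spinning;       /* is the mirror currently powered up? */
};

#define SPIN_UP_THRESHOLD 32u       /* assumed load watermark              */

static const char *route_read(struct mirror_pair *p)
{
    if (p->primary_queue_depth < SPIN_UP_THRESHOLD)
        return "primary";           /* common case: power savings win      */

    if (!p->mirror_spinning)
        p->mirror_spinning = true;  /* pay the spin-up latency only when a */
                                    /* load surge makes it worthwhile      */
    return "mirror";                /* share the surge across both copies  */
}
```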
Stable Throughput
• Unlike LFS, reads do not interfere with writes
  • Keep data from the head (recently written) disks in the file cache
• Log cleaning is not on the critical path
  • Can afford to incur the penalty of on-demand disk spin-up
  • Serve reads from the primary, clean the log from the mirror
[Figure: read(fd,…) and write(fd,…) requests against the block device, with log head and tail marked]
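The separation boils down to a small dispatch rule; the following is an illustrative sketch (not the actual dm-gecko code) of how client reads and cleaner reads could be kept on different copies of a RAID-1 pair:

```c
#include <stdbool.h>

/* Client reads are served by the primary disk of a RAID-1 pair, while the */
/* cleaner reads old blocks from the mirror, so cleaning does not steal    */
/* read bandwidth.  Writes are appends mirrored to both copies.            */
enum target { PRIMARY, MIRROR, BOTH };

static enum target pick_target(bool is_write, bool from_cleaner)
{
    if (is_write)
        return BOTH;               /* mirrored append hits both disks        */
    return from_cleaner ? MIRROR   /* gc reads use the otherwise idle copy   */
                        : PRIMARY; /* client reads keep their full bandwidth */
}
```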
Design
[Figure: dm-gecko's position in the Linux storage stack – Virtual File System (VFS), file mapping layer, file/buffer cache, generic block layer with the device mapper, I/O scheduling layer (anticipatory, CFQ, deadline, noop), block device drivers, disk filesystems, and block devices]
Design Overview
• Log-structured storage at the block level
  • Akin to SSD wear-leveling
  • Actually supersedes the on-chip wear-leveling of SSDs
• The design works with RAID-1, RAID-5, and RAID-6
  • RAID-5 ≈ RAID-4 due to the append-only nature of the log
  • The parity drive(s) are not a bottleneck since writes are appends
• Prototype as a Linux kernel dm (device-mapper) target
  • Real, high-performance, deployable implementation
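Why the parity drive is not a bottleneck: because the log only appends, parity can be computed over a full stripe of fresh data and written once, with no read-modify-write of old data or old parity. A minimal sketch with illustrative constants, not the actual implementation:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE   4096
#define DATA_DISKS   4

/* Full-stripe append: XOR the new data blocks once to produce the parity  */
/* block, then write the whole stripe; no old blocks need to be read back. */
static void full_stripe_parity(const uint8_t data[DATA_DISKS][BLOCK_SIZE],
                               uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int d = 0; d < DATA_DISKS; d++)
        for (int i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d][i];     /* XOR across the stripe           */
}
```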
Challenges
• dm-gecko
  • All I/O requests at this storage layer are asynchronous
  • SMP-safe: leverages all available CPU cores
• Maintain large in-core (RAM) memory maps
  • Battery-backed NVRAM, and persistently stored on SSD
  • Map: virtual block <-> linear block <-> disk block (8 sectors)
  • To keep the maps manageable: block size = page size (4 KB)
  • The FS layered on top uses block size = page size
• Log cleaning/garbage collection (gc) in the background
  • Efficient cleaning policy: clean when spare write I/O capacity is available
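A minimal sketch of what such a pair of in-core maps could look like; the structure and names (`gecko_map`, `v2l`, `l2v`) are hypothetical, chosen only to illustrate the virtual <-> linear <-> disk-block translation at 4 KB granularity:

```c
#include <stdint.h>
#include <stdlib.h>

#define GECKO_BLOCK_SIZE     4096u      /* block size = page size (4 KB)    */
#define SECTORS_PER_BLOCK    8u         /* 4 KB block = 8 x 512 B sectors   */
#define GECKO_UNMAPPED       UINT32_MAX /* sentinel for a free mapping      */

/* One array translates virtual blocks exposed to the FS into linear (log)  */
/* blocks; a reverse array maps each linear block back to its owner so the  */
/* cleaner can relocate it.                                                 */
struct gecko_map {
    uint32_t *v2l;        /* virtual block -> linear block, or UNMAPPED     */
    uint32_t *l2v;        /* linear block  -> virtual block, or UNMAPPED    */
    uint32_t  nblocks;    /* number of 4 KB blocks in the linear log space  */
    uint32_t  head;       /* next free slot at the log head                 */
    uint32_t  tail;       /* oldest live slot at the log tail               */
};

static int gecko_map_init(struct gecko_map *m, uint32_t nblocks)
{
    m->v2l = malloc((size_t)nblocks * sizeof(uint32_t));
    m->l2v = malloc((size_t)nblocks * sizeof(uint32_t));
    if (!m->v2l || !m->l2v) {
        free(m->v2l);
        free(m->l2v);
        return -1;
    }
    for (uint32_t i = 0; i < nblocks; i++)
        m->v2l[i] = m->l2v[i] = GECKO_UNMAPPED;
    m->nblocks = nblocks;
    m->head = m->tail = 0;
    return 0;
}

/* Linear block -> starting sector on disk (each block covers 8 sectors). */
static inline uint64_t linear_to_sector(uint32_t lblock)
{
    return (uint64_t)lblock * SECTORS_PER_BLOCK;
}
```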
Commodity Architecture
• Dell PowerEdge R710
• Dual-socket multi-core CPUs
• Battery-backed RAM
• OCZ RevoDrive PCIe x4 SSD
• 2 TB Hitachi HDS72202 disks
dm-gecko
• In-memory map (one level of indirection)
  • Virtual block: the conventional block array exposed to the VFS
  • Linear block: the collection of blocks structured as a log
• Circular ring structure
• E.g., READs are simply indirected
[Figure: a read on the virtual block device indirected to the linear block device (log head, log tail, free and used blocks)]
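A read then reduces to one table lookup. The sketch below reuses the hypothetical `gecko_map` layout from the earlier sketch (again illustrative, not the actual dm-gecko code):

```c
#include <stdint.h>

#define GECKO_UNMAPPED UINT32_MAX

struct gecko_map {                /* as in the earlier sketch (hypothetical) */
    uint32_t *v2l, *l2v;          /* virtual <-> linear block maps           */
    uint32_t  nblocks, head, tail;
};

/* READ: a single lookup redirects the virtual block to wherever its latest */
/* copy lives in the log; unmapped blocks need no disk I/O at all.          */
static int64_t gecko_read_indirect(const struct gecko_map *m, uint32_t vblock)
{
    uint32_t lblock = m->v2l[vblock];
    if (lblock == GECKO_UNMAPPED)
        return -1;                /* never written (or trimmed): no I/O      */
    return (int64_t)lblock;       /* dispatch the read I/O to this log block */
}
```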
dm-gecko
• WRITE operations append to the log head
  • Allocate/claim the next free block
  • Schedule log compacting/cleaning (gc) if necessary
  • Dispatch the write I/O on the new block
  • Update the maps & log on I/O completion
[Figure: a write on the virtual block device appended at the log head of the linear block device]
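A sketch of the append path over the same hypothetical `gecko_map` structure (illustrative only; the real code dispatches the I/O asynchronously and updates the persistent maps on completion):

```c
#include <stdint.h>

#define GECKO_UNMAPPED UINT32_MAX

struct gecko_map {                /* as in the earlier sketch (hypothetical) */
    uint32_t *v2l, *l2v;
    uint32_t  nblocks, head, tail;
};

static int log_full(const struct gecko_map *m)
{
    return (m->head + 1) % m->nblocks == m->tail;   /* circular ring         */
}

/* WRITE: claim the block at the log head, record the new mapping, and      */
/* return the target log block for the caller to dispatch the write I/O.    */
static int64_t gecko_write_append(struct gecko_map *m, uint32_t vblock)
{
    if (log_full(m))
        return -1;                       /* must wait for the cleaner (gc)   */

    uint32_t lblock = m->head;
    m->head = (m->head + 1) % m->nblocks;

    uint32_t old = m->v2l[vblock];
    if (old != GECKO_UNMAPPED)
        m->l2v[old] = GECKO_UNMAPPED;    /* the previous copy becomes garbage */

    m->v2l[vblock] = lblock;
    m->l2v[lblock] = vblock;
    return (int64_t)lblock;
}
```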
dm-gecko
• TRIM operations free the block
  • Schedule log compacting/cleaning (gc) if necessary
  • Fast-forward the log tail if the tail block was trimmed
[Figure: a trim on the virtual block device freeing its block in the linear block device]
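A sketch of the trim path over the same hypothetical `gecko_map` structure (illustrative only):

```c
#include <stdint.h>

#define GECKO_UNMAPPED UINT32_MAX

struct gecko_map {                /* as in the earlier sketch (hypothetical) */
    uint32_t *v2l, *l2v;
    uint32_t  nblocks, head, tail;
};

/* TRIM: drop the mapping so the log slot becomes garbage; if the trimmed   */
/* slot sits at the tail, fast-forward the tail past any run of dead blocks. */
static void gecko_trim(struct gecko_map *m, uint32_t vblock)
{
    uint32_t lblock = m->v2l[vblock];
    if (lblock == GECKO_UNMAPPED)
        return;                          /* nothing mapped, nothing to do    */

    m->v2l[vblock] = GECKO_UNMAPPED;
    m->l2v[lblock] = GECKO_UNMAPPED;

    while (m->tail != m->head && m->l2v[m->tail] == GECKO_UNMAPPED)
        m->tail = (m->tail + 1) % m->nblocks;   /* fast-forward the log tail */
}
```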
Log Cleaning
• Garbage collection (gc) compacts blocks
  • Relocate the used block that is closest to the tail
  • Repeat until compact enough (e.g., a watermark is reached) or fully contiguous
• Use spare I/O capacity; do not run when the I/O load is high
  • More than enough CPU cycles to spare (e.g., 2x quad-core)
[Figure: used blocks near the log tail relocated to the log head of the linear block device]
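One cleaning step over the same hypothetical `gecko_map` structure might look as follows (illustrative only; the real cleaner also copies the 4 KB payload, reading it from the mirror so client reads on the primary are undisturbed):

```c
#include <stdint.h>

#define GECKO_UNMAPPED UINT32_MAX

struct gecko_map {                 /* as in the earlier sketch (hypothetical) */
    uint32_t *v2l, *l2v;           /* virtual <-> linear block maps           */
    uint32_t  nblocks, head, tail;
};

static uint32_t free_blocks(const struct gecko_map *m)
{
    return (m->nblocks + m->tail - m->head - 1) % m->nblocks;
}

/* One cleaning step: relocate the live block closest to the tail by        */
/* re-appending it at the head, fix both maps, and slide the tail forward.  */
/* The caller repeats this while free space is below the watermark and the  */
/* disks have spare I/O capacity.  Returns 1 if it made progress.           */
static int gecko_gc_step(struct gecko_map *m, uint32_t watermark)
{
    if (free_blocks(m) >= watermark || free_blocks(m) == 0)
        return 0;                  /* compact enough, or no room to re-append */

    /* skip dead blocks already sitting at the tail */
    while (m->tail != m->head && m->l2v[m->tail] == GECKO_UNMAPPED)
        m->tail = (m->tail + 1) % m->nblocks;

    /* stop if the used region is already fully contiguous (no holes left) */
    uint32_t probe = m->tail;
    while (probe != m->head && m->l2v[probe] != GECKO_UNMAPPED)
        probe = (probe + 1) % m->nblocks;
    if (probe == m->head)
        return 0;

    uint32_t vblock = m->l2v[m->tail];
    uint32_t dst    = m->head;     /* relocate by appending at the log head   */
    m->head = (m->head + 1) % m->nblocks;
    /* (the real cleaner copies the 4 KB payload here, read from the mirror)  */
    m->v2l[vblock]  = dst;
    m->l2v[dst]     = vblock;
    m->l2v[m->tail] = GECKO_UNMAPPED;
    m->tail = (m->tail + 1) % m->nblocks;
    return 1;
}
```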
Gecko I/O Requests
• All I/O requests at the storage layer are asynchronous
  • The storage stack is allowed to reorder requests
  • The VFS, file-system mapping layer, and file/buffer cache play nice
  • Un-cooperating processes may trigger inconsistencies
    • Read/write and write/write conflicts are fair game
• Log cleaning interferes with storage-stack requests
• SMP-safe solution that leverages all available CPU cores
  • Request ordering is enforced as needed, at block granularity
Request Ordering
• Block b has no prior pending requests
  • Allow a read or write request to run; mark the block with ‘pending IO’
  • Allow gc to run; mark the block as ‘being cleaned’
• Block b has prior pending read/write requests
  • Allow read or write requests; track the number of ‘pending IOs’
  • If gc needs to run on block b, defer it until all read/write requests have completed (zero ‘pending IOs’ on block b)
• Block b is being relocated by the gc
  • Discard gc requests on the same block b (doesn’t actually occur)
  • Defer all read/write requests until the gc has completed on block b
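The rules above amount to a small per-block state machine. A minimal sketch, with illustrative names (`block_state`, `on_rw_request`, and so on) rather than dm-gecko's actual interfaces:

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-block ordering state: how many reads/writes are in flight, and       */
/* whether the cleaner currently owns the block.                            */
struct block_state {
    uint32_t pending_io;     /* outstanding read/write requests on the block */
    bool     being_cleaned;  /* the gc is currently relocating this block    */
};

enum verdict { RUN_NOW, DEFER, DROP };

/* A read or write request arriving for block b. */
static enum verdict on_rw_request(struct block_state *b)
{
    if (b->being_cleaned)
        return DEFER;            /* wait until the gc finishes on this block */
    b->pending_io++;             /* track it; the gc must wait for it        */
    return RUN_NOW;
}

/* The cleaner wants to relocate block b. */
static enum verdict on_gc_request(struct block_state *b)
{
    if (b->being_cleaned)
        return DROP;             /* duplicate gc on the same block: discard  */
    if (b->pending_io > 0)
        return DEFER;            /* defer until pending reads/writes drain   */
    b->being_cleaned = true;
    return RUN_NOW;
}

/* Completion paths: clear the marks so deferred requests can proceed. */
static void on_rw_complete(struct block_state *b) { b->pending_io--; }
static void on_gc_complete(struct block_state *b) { b->being_cleaned = false; }
```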
Limitations
• In-core memory map (there are two maps)
  • A simple, direct map requires lots of memory
  • A multi-level map is complex
    • Akin to virtual memory paging, only simpler
    • Fetch large portions of the map on demand from the larger SSD
  • The current prototype uses two direct maps
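To make the memory cost concrete with illustrative numbers (assuming 4-byte map entries, which is not stated in the slides): every terabyte of 4 KB blocks has 256 Mi entries, so a direct map costs about 1 GiB of RAM per terabyte, and the two maps together roughly 2 GiB per terabyte of linear log space; a multi-level map would keep only the hot pages of the map resident and fetch the rest from the SSD on demand.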