G22.3250-001 SGI’s XFS or Cool Pet Tricks with B+ Trees Robert Grimm New York University
Altogether Now: The Three Questions • What is the problem? • What is new or different? • What are the contributions and limitations?
Some Background • Inode • On-disk data structure containing a file’s metadata and pointers to its data • Defines FS-internal namespace (inode numbers) • Originated in Unix FFS • Vnode • Kernel data structure for representing files • Provides standard API for accessing different FS’s • Originated in SunOS
Motivation • On one side: I/O bottleneck • It’s becoming harder to utilize increasing I/O capacity and bandwidth • 112 9-GB disk drives provide 1 TB of storage • High-end drives provide 500 MB/sec sustained disk bandwidth • On the other side: I/O-intensive applications • Editing of uncompressed video • 30 MB/sec per stream, 108 GB for one hour • Streaming compressed video on demand • 2.7 TB for 1,000 movies, 200 movies require 100 MB/sec
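The slide's numbers can be checked with back-of-the-envelope arithmetic (a toy calculation using decimal units, not anything from XFS itself):

```python
# Back-of-the-envelope check of the motivation numbers (decimal units).

# Storage side: 112 drives of 9 GB each is roughly 1 TB.
capacity_gb = 112 * 9                 # 1008 GB, i.e. ~1 TB

# Uncompressed video at 30 MB/sec for one hour.
hour_gb = 30 * 3600 / 1000            # 108 GB for one hour of footage

# Video on demand: 1,000 movies in 2.7 TB is ~2.7 GB per movie;
# 200 concurrent movies at 100 MB/sec aggregate is 0.5 MB/sec each.
per_movie_gb = 2700 / 1000
per_stream_mb = 100 / 200
```

The per-stream rate (0.5 MB/sec) matches typical compressed video of the era, which is why 200 streams saturate 100 MB/sec.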
Scalability Problems of Existing File Systems • Slow crash recovery • fsck needs to scan entire disk • No support for large file systems • 32-bit block pointers address only 4 billion blocks • At 8 KB per block, 32 TB • No support for large, sparse files • 64-bit block pointers require more levels of indirection • Are also quite inefficient • Fixed-size extents are still too limiting
Scalability Problems of Existing File Systems (cont.) • No support for large, contiguous files • Bitmap structures for tracking free and allocated blocks do not scale • Hard to find large regions of contiguous space • But, we need contiguous allocation for good utilization of bandwidth • No support for large directories • Linear layout does not scale • In-memory hashing imposes high memory overheads • No support for large numbers of files • Inodes preallocated during file system creation
XFS Approach in a Nutshell • Use 64-bit block addresses • Support for larger file systems • Use B+ trees and extents • Support for larger numbers of files, larger files, larger directories • Better utilization of I/O bandwidth • Log metadata updates • Faster crash recovery
XFS Architecture • I/O manager: handles I/O requests • Directory manager: implements the file system name space • Space manager: manages free space, inode & file allocation • Transaction manager: provides atomic metadata updates • Unified buffer cache • Volume manager: striping, concatenation, mirroring
Storage Scalability • Allocation groups • Are regions with their own free space maps and inodes • Support AG-relative block and inode pointers • Reduce size of data structures • Improve parallelism of metadata management • Allow concurrent accesses to different allocation groups • Unlike FFS, are mostly motivated by scalability not locality • Free space • Two B+ trees describing extents (what’s a B+ tree?) • One indexed by starting block (used when?) • One indexed by length of extent (used when?)
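The two free-space indexes answer the two questions in the slide: the by-length index serves "give me an extent of at least size N" (best fit), and the by-start index serves "give me free space near block B" (locality). A minimal sketch, using sorted lists in place of the on-disk B+ trees (all names here are illustrative, not XFS's):

```python
import bisect

class FreeSpace:
    """Toy model of XFS's dual free-extent indexes."""

    def __init__(self):
        self.by_start = []   # (start, length), sorted by starting block
        self.by_length = []  # (length, start), sorted by extent length

    def add(self, start, length):
        bisect.insort(self.by_start, (start, length))
        bisect.insort(self.by_length, (length, start))

    def _remove(self, start, length):
        self.by_start.remove((start, length))
        self.by_length.remove((length, start))

    def alloc_by_size(self, want):
        """By-length index: smallest free extent >= want (best fit)."""
        i = bisect.bisect_left(self.by_length, (want, 0))
        if i == len(self.by_length):
            return None
        length, start = self.by_length[i]
        self._remove(start, length)
        if length > want:                    # return the tail to the pool
            self.add(start + want, length - want)
        return start

    def alloc_near(self, target, want):
        """By-start index: first big-enough extent at or after target,
        for allocating near a file's existing blocks."""
        i = bisect.bisect_left(self.by_start, (target, 0))
        for start, length in self.by_start[i:]:
            if length >= want:
                self._remove(start, length)
                if length > want:
                    self.add(start + want, length - want)
                return start
        return None
```

Keeping both indexes consistent under every allocation and free is the price paid for answering both query types in logarithmic time.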
Storage Scalability (cont.) • Large files • File storage tracked by extent map • Each entry: block offset in file, length in blocks, starting block on disk • Small extent map organized as list in inode • Large extent map organized as B+ tree rooted in inode • Indexed by block offset in file • Large number of files • Inodes allocated dynamically • In chunks of 64 • Inode locations tracked by B+ tree • Only points to inode chunks
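An extent-map lookup translates a file block offset to a disk block. A minimal sketch using a flat sorted list in place of the B+ tree rooted in the inode (entry layout follows the slide: offset in file, length in blocks, starting block on disk):

```python
import bisect

def extent_lookup(extent_map, file_block):
    """Map a file block offset to a disk block via an extent map.

    extent_map: list of (offset, length, start_block), sorted by offset.
    Returns None when the offset falls in a hole of a sparse file.
    """
    # Find the last extent whose offset is <= file_block.
    i = bisect.bisect_right(extent_map, (file_block, float("inf"), 0)) - 1
    if i < 0:
        return None
    offset, length, start = extent_map[i]
    if file_block < offset + length:
        return start + (file_block - offset)
    return None  # gap between extents: a hole
```

Because one entry covers an arbitrarily long contiguous run, a multi-gigabyte file can be described by a handful of entries instead of millions of per-block pointers.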
Storage Scalability (cont.) • Large directories • Directories implemented as (surprisingly) B+ trees • Map 4-byte hashes to directory entries (name, inode number) • Fast crash recovery • Enabled by write-ahead log • For all structural updates to metadata • E.g., creating a file: directory block, new inode, inode allocation tree block, allocation group header block, superblock • Independent of actual data structures • However, still need disk scavengers for catastrophic failures
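Keying the directory by a fixed 4-byte hash rather than by variable-length names keeps B+ tree keys uniform, but hash collisions must then be resolved by comparing full names. A toy sketch (using `zlib.crc32` as a stand-in hash; XFS uses its own hash function, and a dict of buckets stands in for the B+ tree):

```python
import zlib

class Directory:
    """Toy directory keyed by a 4-byte name hash."""

    def __init__(self):
        self.buckets = {}   # 32-bit hash -> list of (name, inode number)

    def add(self, name, ino):
        h = zlib.crc32(name.encode())
        self.buckets.setdefault(h, []).append((name, ino))

    def lookup(self, name):
        # Different names can share a hash, so compare full names
        # within the bucket before returning an inode number.
        h = zlib.crc32(name.encode())
        for entry_name, ino in self.buckets.get(h, []):
            if entry_name == name:
                return ino
        return None
```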
Performance Scalability • Allocating files contiguously • On-disk allocation is delayed until flush • Uses (cheap) memory to improve I/O performance • Typically enables allocation in one extent • Even for random writes (think memory-mapped files) • Avoids allocation for short-lived files • Extents have large range: 21 bit length field • Block size can vary by file system • Small blocks for file systems with many small files • Large blocks for file systems with mostly large files • What prevents long-term fragmentation?
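The delayed-allocation idea above can be sketched in a few lines: dirty blocks accumulate in memory with no disk placement, and a single extent is chosen only at flush time, so even out-of-order writes to a range end up contiguous. A toy model (the `allocator` callable and all names are illustrative, not XFS interfaces):

```python
class DelayedAllocFile:
    """Sketch of delayed allocation: buffer now, place blocks at flush."""

    def __init__(self, allocator):
        self.allocator = allocator   # callable: n blocks -> start block
        self.dirty = {}              # file block offset -> data

    def write(self, offset, data):
        self.dirty[offset] = data    # space reserved, no blocks chosen yet

    def flush(self):
        """Allocate one extent covering the dirty run and place blocks."""
        if not self.dirty:
            return None              # short-lived file: nothing allocated
        lo, hi = min(self.dirty), max(self.dirty)
        start = self.allocator(hi - lo + 1)
        placement = {off: start + (off - lo) for off in self.dirty}
        self.dirty.clear()
        return placement
```

A file written and deleted before any flush never triggers an allocation at all, which is the "short-lived files" win on the slide.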
Performance Scalability (cont.) • Performing file I/O • Read requests issued for large I/O buffers • Followed by multiple read ahead requests for sequential reads • Writes are clustered to form larger I/O requests • Delayed allocation helps with buffering writes • Direct I/O lets applications bypass buffer cache and use DMA • Applications have control over I/O, while still accessing file system • But also need to align data on block boundaries and issue requests on multiples of block size • Reader/writer locking supports more concurrency • Direct I/O leaves serialization entirely to applications
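The alignment constraint on direct I/O mentioned above is simple to state in code (a minimal check; the 4 KB block size is an assumption, since XFS lets block size vary per file system):

```python
BLOCK = 4096  # assumed block size for illustration

def direct_io_ok(offset, length):
    """Direct I/O requires block-aligned offsets and whole-block sizes;
    a request failing this check must go through the buffer cache."""
    return offset % BLOCK == 0 and length % BLOCK == 0
```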
Performance Scalability (cont.) • Accessing and updating metadata • Updates performed in asynchronous write-ahead log • Modified data still only flushed after log update has completed • But metadata not locked, multiple updates can be batched • Log may be placed on different device from file system • Including non-volatile memory • Log operation is simple (though, log is centralized) • Provide buffer space, copy, write out, notify • Copying can be done by processors performing transaction
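The write-ahead discipline described above boils down to one invariant: the log record reaches stable storage before the in-place metadata changes, so recovery can replay the log over the last consistent state. A minimal sketch (a Python list stands in for the sequential log device; names are illustrative):

```python
class MetadataLog:
    """Minimal write-ahead log: log first, update in place second."""

    def __init__(self):
        self.log = []                    # stands in for the log device

    def apply(self, metadata, key, value):
        self.log.append((key, value))    # 1. append record to the log
        metadata[key] = value            # 2. then modify the structure

    def recover(self, checkpoint):
        """Replay every logged update on top of the last checkpoint."""
        metadata = dict(checkpoint)
        for key, value in self.log:
            metadata[key] = value
        return metadata
```

Because recovery only reads the log, it runs in time proportional to the log size rather than to the whole disk, which is the fsck win on the earlier slides.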
I/O Throughput • What can we conclude? • Read speed, difference between creates and writes, parallelism
Benchmark Results • Datamation sort • 3.52 seconds (7 seconds previous record) • Indy MinuteSort • 1.6 GB sorted in 56 seconds (1.08 GB previously) • SPEC SFS • 8806 SPECnfs instead of 7023 SPECnfs with EFS • 12% increase with mostly small & synchronous writes on similar hardware
Directory Lookups • Why this noticeable break?
A Final Note • Are these systems/approaches contradictory? • Recoverable virtual memory • Simpler is better, better performing • XFS • Way more complex is better, better performing