Presentation Transcript


  1. G22.3250-001: SGI’s XFS, or Cool Pet Tricks with B+ Trees. Robert Grimm, New York University

  2. Altogether Now: The Three Questions • What is the problem? • What is new or different? • What are the contributions and limitations?

  3. Some Background • Inode: on-disk data structure containing a file’s metadata and pointers to its data • Defines the FS-internal namespace of inode numbers • Originated in Unix FFS • Vnode: kernel data structure for representing files • Provides a standard API for accessing different FS’s • Originated in SunOS
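
As a reference point, a minimal C sketch of the two structures (fields invented for illustration; the real FFS inode and SunOS vnode layouts differ): the inode is per-file, on-disk metadata, while the vnode is the kernel's file-system-independent handle that dispatches through a per-FS operations table.

```c
/* Illustrative sketch only -- not the actual FFS or SunOS definitions. */
#include <stdint.h>
#include <sys/types.h>

struct inode {                 /* on-disk: one per file */
    uint32_t  mode;            /* type and permission bits */
    uint32_t  nlink;           /* number of directory entries pointing here */
    uint64_t  size;            /* file length in bytes */
    uint64_t  blocks[15];      /* data block pointers (direct + indirect) */
};

struct vnode;                  /* forward declaration for the ops table */

struct vnode_ops {             /* per-file-system implementation of the API */
    int (*read)(struct vnode *, void *buf, size_t len, off_t off);
    int (*write)(struct vnode *, const void *buf, size_t len, off_t off);
    int (*lookup)(struct vnode *dir, const char *name, struct vnode **out);
};

struct vnode {                 /* in-kernel: one per active file */
    const struct vnode_ops *ops;     /* dispatch table chosen by the FS type */
    void                   *fs_data; /* FS-private state, e.g. the inode */
    uint32_t                refcount;
};
```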

  4. Motivation • On one side: I/O bottleneck • It’s becoming harder to utilize increasing I/O capacity and bandwidth • 112 9-GB disk drives provide 1 TB of storage • High-end drives provide 500 MB/sec sustained disk bandwidth • On the other side: I/O-intensive applications • Editing of uncompressed video • 30 MB/sec per stream, 108 GB for one hour • Streaming compressed video on demand • 2.7 TB for 1,000 movies, 200 movies require 100 MB/sec

  5. Scalability Problems of Existing File Systems • Slow crash recovery • fsck needs to scan the entire disk • No support for large file systems • 32-bit block pointers address only 4 billion blocks • At 8 KB per block, 32 TB • No support for large, sparse files • 64-bit block pointers require more levels of indirection • Are also quite inefficient • Fixed-size extents are still too limiting
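
As a quick sanity check on that limit, the arithmetic behind the 32 TB figure:

```latex
2^{32}\ \text{blocks} \times 2^{13}\ \tfrac{\text{bytes}}{\text{block}}
  = 2^{45}\ \text{bytes} = 32\ \text{TB}
```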

  6. Scalability Problems of Existing File Systems (cont.) • No support for large, contiguous files • Bitmap structures for tracking free and allocated blocks do not scale • Hard to find large regions of contiguous space • But, we need contiguous allocation for good utilization of bandwidth • No support for large directories • Linear layout does not scale • In-memory hashing imposes high memory overheads • No support for large numbers of files • Inodes preallocated during file system creation

  7. XFS Approach in a Nutshell • Use 64-bit block addresses • Support for larger file systems • Use B+ trees and extents • Support for larger numbers of files, larger files, larger directories • Better utilization of I/O bandwidth • Log metadata updates • Faster crash recovery

  8. XFS Architecture • I/O manager: I/O requests • Directory manager: file system name space • Space manager: free space, inode & file allocation • Transaction manager: atomic metadata updates • Unified buffer cache • Volume manager: striping, concatenation, mirroring

  9. Storage Scalability • Allocation groups • Are regions with their own free space maps and inodes • Support AG-relative block and inode pointers • Reduce size of data structures • Improve parallelism of metadata management • Allow concurrent accesses to different allocation groups • Unlike FFS, are mostly motivated by scalability not locality • Free space • Two B+ trees describing extents (what’s a B+ tree?) • One indexed by starting block (used when?) • One indexed by length of extent (used when?)
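
The slide's two "used when?" questions can be answered with a small sketch: both free-space B+ trees index the same kind of record, just under different keys. Below is a minimal C sketch with invented names (not the on-disk XFS format): the by-start-block key serves freeing and coalescing, the by-size key serves "find at least N contiguous blocks" allocations.

```c
#include <stdint.h>

/* A per-allocation-group free extent record (illustrative, not the real
 * XFS layout). Both free-space B+ trees hold the same records; they
 * differ only in the key used to order them. */
struct free_extent {
    uint32_t start_block;   /* AG-relative block number where the run begins */
    uint32_t block_count;   /* length of the free run in blocks */
};

/* Key for the by-start-block tree: used when freeing an extent, to find
 * adjacent free runs and coalesce them, or to allocate near a target block. */
static int cmp_by_start(const struct free_extent *a, const struct free_extent *b)
{
    return (a->start_block > b->start_block) - (a->start_block < b->start_block);
}

/* Key for the by-length tree: used when allocating, to find the smallest
 * free run that still satisfies a request for N contiguous blocks. */
static int cmp_by_size(const struct free_extent *a, const struct free_extent *b)
{
    if (a->block_count != b->block_count)
        return (a->block_count > b->block_count) - (a->block_count < b->block_count);
    return cmp_by_start(a, b);   /* tie-break so keys stay unique */
}
```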

  10. Storage Scalability (cont.) • Large files • File storage tracked by extent map • Each entry: block offset in file, length in blocks, starting block on disk • Small extent map organized as list in inode • Large extent map organized as B+ tree rooted in inode • Indexed by block offset in file • Large number of files • Inodes allocated dynamically • In chunks of 64 • Inode locations tracked by B+ tree • Only points to inode chunks
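
To make the extent-map idea concrete, here is a hedged sketch of the lookup it supports: translating a file block offset into a disk block. Field and function names are invented; the in-inode list case is modeled as a sorted array, and the B+ tree form answers the same query keyed by file offset, just with the entries spread over tree nodes.

```c
#include <stdint.h>
#include <stddef.h>

/* One extent-map entry, as described on the slide: where the extent starts
 * in the file, how long it is, and where it lives on disk. (Field names are
 * invented for this sketch; the on-disk encoding is different.) */
struct extent {
    uint64_t file_offset;   /* first block of the extent, in file block units */
    uint64_t disk_block;    /* starting block on disk */
    uint32_t length;        /* number of contiguous blocks */
};

/* Map a file block to a disk block via binary search on file_offset.
 * Returns 0 and sets *out on success, -1 if the block falls in a hole. */
static int extent_lookup(const struct extent *map, size_t n,
                         uint64_t file_block, uint64_t *out)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {                       /* find the last entry with
                                               file_offset <= file_block */
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].file_offset <= file_block)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return -1;                          /* before the first extent */
    const struct extent *e = &map[lo - 1];
    if (file_block >= e->file_offset + e->length)
        return -1;                          /* in a hole: the file is sparse */
    *out = e->disk_block + (file_block - e->file_offset);
    return 0;
}
```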

  11. Storage Scalability (cont.) • Large directories • Directories implemented as (surprisingly) B+ trees • Map 4-byte hashes to directory entries (name, inode number) • Fast crash recovery • Enabled by write-ahead log • For all structural updates to metadata • E.g., creating a file → directory block, new inode, inode allocation tree block, allocation group header block, superblock • Independent of actual data structures • However, still need disk scavengers for catastrophic failures
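
A sketch of the hash-keyed directory lookup described above, with an invented hash function (XFS uses its own); since two different names can share a 32-bit hash, the stored name still has to be compared to resolve collisions.

```c
#include <stdint.h>
#include <string.h>

/* A 32-bit name hash to use as the B+ tree key (illustrative only). */
static uint32_t name_hash(const char *name)
{
    uint32_t h = 0;
    while (*name)
        h = h * 131 + (unsigned char)*name++;
    return h;
}

/* A directory entry as the slide describes it: name plus inode number,
 * indexed in the B+ tree by the 4-byte hash of the name. */
struct dirent_rec {
    uint32_t hash;        /* B+ tree key */
    uint64_t inode;       /* inode number of the target file */
    char     name[256];   /* the actual name, kept to resolve hash collisions */
};

/* Lookup: compute the hash, descend the tree to the entries with that hash
 * (here modeled as a small candidate array), then compare names.
 * Returns the inode number, or 0 if the name is not present. */
static uint64_t dir_lookup(const struct dirent_rec *candidates, size_t n,
                           const char *name)
{
    uint32_t h = name_hash(name);
    for (size_t i = 0; i < n; i++)
        if (candidates[i].hash == h && strcmp(candidates[i].name, name) == 0)
            return candidates[i].inode;
    return 0;
}
```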

  12. Performance Scalability • Allocating files contiguously • On-disk allocation is delayed until flush • Uses (cheap) memory to improve I/O performance • Typically enables allocation in one extent • Even for random writes (think memory-mapped files) • Avoids allocation for short-lived files • Extents have large range: 21 bit length field • Block size can vary by file system • Small blocks for file systems with many small files • Large blocks for file systems with mostly large files • What prevents long-term fragmentation?
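
To illustrate what the 21-bit length field buys, a simplified packing sketch; the real XFS on-disk extent record is wider and laid out differently, so the 43-bit start-block field here is purely illustrative. The point is that one extent covers at most 2^21 - 1 blocks, so the maximum contiguous run it can describe scales with the file system's block size.

```c
#include <stdint.h>

/* Illustrative packing only -- not the real XFS on-disk record. */
#define EXTENT_LEN_BITS   21u
#define EXTENT_MAX_BLOCKS ((1u << EXTENT_LEN_BITS) - 1)

struct packed_extent {
    uint64_t start_block : 43;              /* starting block on disk */
    uint64_t length      : EXTENT_LEN_BITS; /* run length in blocks */
};

/* Largest byte range a single extent can map for a given block size:
 * roughly 8 GB at 4 KB blocks, roughly 128 GB at 64 KB blocks. */
static inline uint64_t max_extent_bytes(uint32_t block_size)
{
    return (uint64_t)EXTENT_MAX_BLOCKS * block_size;
}
```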

  13. Performance Scalability (cont.) • Performing file I/O • Read requests issued for large I/O buffers • Followed by multiple read ahead requests for sequential reads • Writes are clustered to form larger I/O requests • Delayed allocation helps with buffering writes • Direct I/O lets applications bypass buffer cache and use DMA • Applications have control over I/O, while still accessing file system • But also need to align data on block boundaries and issue requests on multiples of block size • Reader/writer locking supports more concurrency • Direct I/O leaves serialization entirely to applications
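
A Linux-flavored sketch of the alignment rules that direct I/O imposes on applications: an O_DIRECT file descriptor, a block-aligned buffer from posix_memalign, and transfer sizes and offsets that are multiples of the block size. The file name and block size below are placeholders, not values from the paper.

```c
#define _GNU_SOURCE          /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK 4096           /* placeholder block size */

int main(void)
{
    /* Bypass the buffer cache: the kernel will DMA straight into our buffer. */
    int fd = open("bigfile.dat", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, 64 * BLOCK) != 0) {   /* aligned buffer */
        close(fd);
        return 1;
    }

    /* Offset and length are both multiples of BLOCK, as direct I/O requires. */
    ssize_t n = pread(fd, buf, 64 * BLOCK, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes without going through the buffer cache\n", n);

    free(buf);
    close(fd);
    return 0;
}
```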

  14. Performance Scalability (cont.) • Accessing and updating metadata • Updates performed in asynchronous write-ahead log • Modified data still only flushed after the log update has completed • But metadata not locked, so multiple updates can be batched • Log may be placed on a different device from the file system • Including non-volatile memory • Log operation is simple (though the log is centralized) • Provide buffer space, copy, write out, notify • Copying can be done by the processor performing the transaction
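
A minimal in-memory sketch of the write-ahead ordering rule (invented names, not the XFS log code): a modified metadata buffer may only be written back once the log record describing the change is stable.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct metadata_buf {
    uint64_t block_no;       /* which on-disk metadata block this caches */
    uint64_t last_lsn;       /* log sequence number of its latest change */
    bool     dirty;
};

static uint64_t next_lsn = 1;     /* head of the in-memory log */
static uint64_t stable_lsn = 0;   /* highest LSN known to be on disk */

/* Record a change: copy the changed region into the log buffer (omitted)
 * and stamp the buffer with the new LSN. The log write itself is async. */
static void log_metadata_update(struct metadata_buf *b)
{
    b->last_lsn = next_lsn++;
    b->dirty = true;
}

/* Simulate the asynchronous log write completing up to some LSN. */
static void log_write_completed(uint64_t lsn) { stable_lsn = lsn; }

/* Flush a dirty buffer only if write-ahead ordering allows it. */
static bool try_flush(struct metadata_buf *b)
{
    if (!b->dirty || b->last_lsn > stable_lsn)
        return false;             /* log entry not stable yet: must wait */
    /* ...write b to its on-disk location here... */
    b->dirty = false;
    return true;
}

int main(void)
{
    struct metadata_buf inode_block = { .block_no = 42 };
    log_metadata_update(&inode_block);
    printf("flush before log stable: %s\n", try_flush(&inode_block) ? "ok" : "deferred");
    log_write_completed(1);
    printf("flush after log stable:  %s\n", try_flush(&inode_block) ? "ok" : "deferred");
    return 0;
}
```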

  15. Experiences

  16. I/O Throughput • What can we conclude? • Read speed, difference between creates and writes, parallelism

  17. Benchmark Results • Datamation sort • 3.52 seconds (7 seconds previous record) • Indy MinuteSort • 1.6 GB sorted in 56 seconds (1.08 GB previously) • SPEC SFS • 8806 SPECnfs instead of 7023 SPECnfs with EFS • 12% increase with mostly small & synchronous writes on similar hardware

  18. Directory Lookups • Why this noticeable break?

  19. A Final Note • Are these systems/approaches contradictory? • Recoverable virtual memory • Simpler is better, better performing • XFS • Way more complex is better, better performing

  20. What Do You Think?
