Log-structured File Systems
Myeongcheol Kim (mckim@dcslab.snu.ac.kr)
School of Computer Science and Engineering, Seoul National University

Contents: Motivation · Log-structured File System · Effective Sequential Write · The Inode Map · Garbage Collection · Crash Recovery · Summary
Motivation
• Growing memory sizes
  • Most reads will be serviced from the cache.
  • File system performance is therefore largely determined by write performance.
• Growing performance gap between random and sequential I/O
  • Transfer bandwidth is increasing rapidly.
  • Seek and rotational delay costs are decreasing slowly.
• Workloads
  • FFS spreads the information for a single file around the disk.
  • Creating a file incurs many small writes.
  • Existing file systems were not RAID-aware.
Log-structured File System (LFS)
• A file system developed at Berkeley in the early 90's
  • By a group led by Professor John Ousterhout and graduate student Mendel Rosenblum
• Performance goals
  • High performance for small writes
  • Matching or exceeding existing performance for reads and large writes
• Question: how can we make all writes sequential?
• Idea
  • Buffer all updates in an in-memory segment.
  • Write the segment to disk in one long, sequential transfer.
Writing To Disk Sequentially
[Figure: a data block D written at address A0, with its inode I pointing to it (blk[0]: A0)]
• LFS never overwrites existing data; it always writes segments to free locations.
• Both data blocks (D) and metadata (I: inodes) are written this way.
Writing Sequentially And Effectively
[Figure: timeline — write_at(A) completes at time T; write_at(A+1) is issued at T+δ, but the disk has rotated past the target, so the write waits nearly a full rotation (Trotation − δ) and completes at T+Trotation]
• What if there is a delay between two sequential writes?
  • Simply issuing writes in sequential order is not enough to achieve peak performance.
• Write buffering
  • LFS buffers updates in an in-memory segment.
  • The segment is written to disk all at once.
  • As long as the segment is large enough, writes are efficient.
Writing Sequentially And Effectively (cont.)
[Figure: an in-memory segment holding D[j,0], D[j,1], D[j,2], and Inode[j] for file j, plus D[k,0] and Inode[k] for file k, written to disk contiguously starting at A0]
• Example
  • LFS buffers two sets of updates (for files j and k) into a small segment, then writes the whole segment to disk in one transfer.
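The buffering scheme above can be sketched in a few lines. This is an illustrative toy model, not LFS code: the class name, the list standing in for the disk, and the string block labels are all invented for the example.

```python
class SegmentWriter:
    """Toy sketch of LFS-style write buffering: updates accumulate in an
    in-memory segment and reach the "disk" (a list standing in for the
    log) in one sequential burst per segment."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.buffer = []   # the in-memory segment
        self.log = []      # flushed segments, in write order

    def write(self, block):
        self.buffer.append(block)
        if len(self.buffer) == self.segment_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # One long, sequential transfer of the whole segment.
            self.log.append(list(self.buffer))
            self.buffer.clear()

w = SegmentWriter(segment_size=4)
for b in ["D[j,0]", "D[j,1]", "D[j,2]", "Inode[j]"]:
    w.write(b)
# All four updates reach the log together as a single segment.
```

The key point the sketch captures is that no individual `write` call touches the disk; only the segment-sized flush does.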
How Much To Buffer?
• Positioning overhead vs. transfer rate
  • A fixed positioning overhead is paid for each write.
  • The more data you write, the better you amortize that cost.
• Let D be the size of the data written (MB), Rpeak the disk transfer rate (MB/s), Tposition the positioning overhead, and F the desired fraction of the peak rate (0 < F < 1).
  • Time to write: Twrite = Tposition + D / Rpeak
  • Effective rate of writing: Reffective = D / Twrite = F × Rpeak
  • Solving for the buffer size: D = (F / (1 − F)) × Rpeak × Tposition
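The buffer-size arithmetic can be checked with a small worked example. The function and the sample numbers (10 ms positioning cost, 100 MB/s peak rate) are illustrative choices, not values from the slides.

```python
def buffer_size(t_position, r_peak, f):
    """How much data D (in MB) must be buffered so the effective write
    rate reaches a fraction f of the peak rate r_peak (MB/s), given a
    fixed positioning overhead t_position (seconds) per write.

    From t_write = t_position + D / r_peak and the goal
    D / t_write = f * r_peak, solving for D gives the line below."""
    return (f / (1 - f)) * r_peak * t_position

# e.g. 10 ms positioning cost, 100 MB/s peak rate, 90% of peak:
d = buffer_size(0.010, 100.0, 0.9)  # -> 9.0 MB
```

Note how quickly the required buffer grows as f approaches 1: demanding 99% of peak with the same numbers would require roughly 99 MB per segment.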
Problem: Finding Inodes
• In a typical file system, finding an inode is easy.
  • The location of the inode table is fixed on the disk.
  • Array-based indexing with the given inode number suffices.
• In LFS, it is more difficult.
  • Inodes are scattered throughout the disk.
  • The latest version of an inode keeps moving due to out-of-place writes.
Solution Through Indirection
[Figure: data block D at A0; inode I[k] (blk[0]: A0) next to it; an imap chunk written after them at A1, recording map[k]: A1]
• The inode map (imap)
  • A level of indirection between inode numbers and the inodes themselves
  • Given an inode number, it produces the disk address of the most recent version of that inode.
• Location of the imap
  • Fixed location: performance would suffer due to extra disk seeks.
  • Moving imap: LFS writes a chunk of the inode map right next to all the other new information.
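The indirection can be sketched as a pair of mappings. This is a hypothetical toy model (the dictionaries, `write_inode`, and the addresses are invented for illustration): the disk maps addresses to contents, and the imap maps inode numbers to the address of the newest inode version.

```python
disk = {}   # address -> contents (stand-in for on-disk blocks)
imap = {}   # inode number -> address of the newest inode version

def write_inode(addr, inumber, inode):
    """Out-of-place write: the inode goes to a fresh address, and the
    imap entry is updated to point at this newest version."""
    disk[addr] = inode
    imap[inumber] = addr

write_inode(1, inumber=5, inode={"blk[0]": 0})   # first version at addr 1
write_inode(7, inumber=5, inode={"blk[0]": 4})   # update moves it to addr 7
latest = disk[imap[5]]   # one extra lookup always finds the newest inode
```

The old copy at address 1 still exists on "disk" but is now garbage; nothing points to it through the imap.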
The Checkpoint Region
[Figure: CR at address 0 records imap[k…k+N]: A2; data block D at A0; inode I[k] (blk[0]: A0) at A1; imap chunk (map[k]: A1) at A2]
• Checkpoint region (CR)
  • Resides at a fixed place (address 0) on disk
  • Contains pointers to the latest pieces of the inode map
• Reading a file from disk
  1) Read the CR, read in the entire inode map, and cache it in memory.
  2) Look up the inode-number-to-inode-disk-address mapping in the imap.
  3) Proceed exactly as a typical UNIX file system would.
What About Directories?
[Figure: file data D[k] at A0; inode I[k] at A1; directory data D[dir] containing (foo, k) at A2; inode I[dir] (blk[0]: A2) at A3; imap records map[k]: A1 and map[dir]: A3]
• The directory structure of LFS is identical to that of classic UNIX file systems.
  • A directory is a collection of (name, inode number) mappings.
• Recursive update problem
  • If updating an inode required updating the directory that points to it, updates would propagate all the way up the file system tree.
• LFS avoids this problem with the inode map.
  • Only the imap entry is updated; the directory holds the same name-to-inumber mapping.
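A tiny sketch makes the avoidance of recursive updates concrete. The names (`dir_block`, `update_file_inode`) and addresses are hypothetical, invented for the example: because the directory stores only name-to-inode-number pairs, moving a file's inode changes the imap entry and nothing else.

```python
imap = {2: "A3", 5: "A1"}   # inode number -> current disk address
dir_block = {"foo": 5}      # directory maps name -> inode number only

def update_file_inode(inumber, new_addr):
    # Rewriting a file's inode only moves the imap pointer; the
    # directory block keeps its name-to-inumber mapping unchanged.
    imap[inumber] = new_addr

before = dict(dir_block)
update_file_inode(5, "A9")      # file "foo"'s inode moved on disk
assert dir_block == before      # the directory block is untouched
```

Without the imap, `dir_block` would have to record the inode's address directly, and every inode move would force a directory rewrite, then a rewrite of the directory's parent, and so on up the tree.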
A New Problem: Garbage Collection
[Figure: two cases — overwriting block 0 of file k leaves the old D0 and old I[k] at A0 both garbage, with the new versions at A4; appending a new block D1 leaves only the old I[k] garbage while D0 stays live]
• Garbage collection
  • Freeing dead blocks for use by subsequent writes
• LFS cleaner
  • Works on a segment-by-segment basis to prevent a performance drop
• Workflow
  1) Periodically read in old (partially-used) segments.
  2) Determine the liveness of the blocks within those segments.
  3) Write out a new set of segments containing just the live blocks.
  4) Free up the old segments.
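The four-step workflow can be sketched as a compaction pass. This is an illustrative model, not the real cleaner: `is_live` stands in for the liveness check, the string block labels are invented, and real LFS must also rewrite metadata for the moved blocks.

```python
def clean(old_segments, is_live, segment_size):
    """Sketch of the cleaner loop: read old, partially-used segments,
    keep only the live blocks, and repack them into a smaller number
    of compact new segments. The old segments can then be freed."""
    live = [b for seg in old_segments for b in seg if is_live(b)]
    return [live[i:i + segment_size]
            for i in range(0, len(live), segment_size)]

# Three partially-used segments; blocks named g* are garbage.
old = [["d0", "g1"], ["g2", "d1"], ["d2", "g3"]]
compacted = clean(old, is_live=lambda b: b.startswith("d"), segment_size=2)
# Three old segments become two compact ones, freeing one segment's
# worth of space (plus the holes inside the survivors).
```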
A New Problem: Garbage Collection (cont.)
• Mechanism
  • How can LFS tell which blocks within a segment are live?
• Policy
  • How often should the cleaner run?
  • Which segments should it pick to clean?
Determining Block Liveness
[Figure: segment summary block SS records A0:(k, 0); data block D at A0; inode I[k] (blk[0]: A0) at A1; imap records map[k]: A1]
• For each data block, LFS records the following in the segment summary block:
  • The inode number of the file it belongs to
  • The block's offset within the file
• Liveness checking procedure for the block at address A:

    (N, T) = SegmentSummary[A];
    inode  = Read(imap[N]);
    if (inode[T] == A)
        // block at A is alive
    else
        // block at A is garbage
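A runnable rendering of that check, on a toy on-disk state where inode k's block 0 has moved from A0 to A4 (the addresses, dictionaries, and helper names here are all illustrative):

```python
def block_is_live(addr, segment_summary, imap, read_inode):
    """Liveness check: look up (inode number N, offset T) for the block
    at 'addr' in the segment summary, then ask the newest version of
    the file's inode whether its T-th block still points at 'addr'."""
    n, t = segment_summary[addr]
    inode = read_inode(imap[n])   # read the most recent inode version
    return inode[t] == addr

summary = {"A0": ("k", 0), "A4": ("k", 0)}  # segment summary entries
imap = {"k": "A5"}                          # inode k now lives at A5
inodes = {"A5": {0: "A4"}}                  # ...and points block 0 at A4

def read(addr):
    return inodes[addr]

live_a0 = block_is_live("A0", summary, imap, read)   # old copy: garbage
live_a4 = block_is_live("A4", summary, imap, read)   # new copy: live
```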
Determining Block Liveness (cont.)
• Optimization: version numbers
  • A version number is stored in the inode map for each inode (V1).
  • A version number is also stored in each entry of the segment summary block (V2).
  • The number is incremented when a file is truncated or deleted.
  • If V1 does not match V2, the block can be discarded immediately without examining the file's inode.
Policy: Which Blocks To Clean, And When?
• When to clean?
  • Periodically
  • During idle time
  • When the disk is full
• Which blocks to clean?
  • A challenging question and the subject of many research papers
  • Example: segregating hot and cold segments
    • Hot segments: wait a long time before cleaning them, since more of their blocks will keep dying.
    • Cold segments: clean them sooner, since their contents are relatively stable.
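One simple selection heuristic, sketched here only as an illustration (the function and the utilization numbers are invented, and the original LFS work explores more refined cost-benefit policies than this greedy rule): pick the segment with the lowest fraction of live blocks, since it yields the most free space per block copied.

```python
def pick_segment_to_clean(utilization):
    """Greedy selection sketch: clean the segment whose fraction of
    live blocks is lowest, maximizing space reclaimed per block moved."""
    return min(utilization, key=utilization.get)

# segment id -> fraction of its blocks that are still live
util = {"seg0": 0.9, "seg1": 0.2, "seg2": 0.6}
victim = pick_segment_to_clean(util)   # -> "seg1"
```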
Crash Recovery And The Log
• Checkpoint
  • A position in the log at which all of the file system structures are consistent and complete
• The CR contains
  • The addresses of all the imap blocks
  • A timestamp
  • A pointer to the last segment written
• Segments are written to the log continually; the CR is updated periodically (every 30 seconds or so).
• A crash could happen during
  • A write to a segment
  • A write to the CR
Crash Recovery And The Log (cont.)
[Figure: two checkpoint regions at fixed locations — CR0 (imap: A2, TS: 15) is complete, while CR1 (imap: A9, TS: X) lacks its final timestamp because the crash hit mid-update, so CR1 is discarded]
• Crash during a CR update
  • The timestamp is written last, at the end of the CR update.
  • Two CRs are maintained and updated alternately.
  • The most recent complete CR will always be chosen.
• Crash during a write to a segment
  • Only updates recorded in the CR get recovered.
  • The last many seconds of updates would be lost.
  • Roll forward: try to recover as many valid updates as possible, starting from the last checkpoint.
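The CR-selection step of recovery can be sketched as follows. The record format here is hypothetical: each CR is modeled as a dictionary whose `valid` flag stands in for "the trailing timestamp was written", which a real implementation would detect by comparing header and trailer timestamps.

```python
def choose_checkpoint(cr0, cr1):
    """Recovery sketch: discard any CR whose update was interrupted
    (no final timestamp), then pick the valid CR with the most
    recent timestamp."""
    candidates = [cr for cr in (cr0, cr1) if cr["valid"]]
    return max(candidates, key=lambda cr: cr["timestamp"])

# Crash hit while updating CR1, so its trailing timestamp is missing:
cr0 = {"timestamp": 15, "valid": True}
cr1 = {"timestamp": None, "valid": False}
chosen = choose_checkpoint(cr0, cr1)   # recovery falls back to CR0
```

Because the two CRs are updated alternately and the timestamp is written last, at least one CR is always complete, so this selection never comes up empty after a single crash.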
Summary
• LFS enables highly efficient writing by exploiting sequential bandwidth.
  • It gathers all updates into an in-memory segment and writes them out together sequentially.
• Garbage collection
  • Mechanism and policy
  • Concern over cleaning costs became the focus of much controversy around LFS.
• Fast and efficient crash recovery