Understanding Crash Consistency in File Systems

Crash Consistency: FSCK and Journaling
Deoksang Kim (dskim@dcslab.snu.ac.kr)
School of Computer Science and Engineering, Seoul National University

Introduction
• File system data structures must persist
• Challenges
  • Power loss
  • System crash
• These events can cause the crash-consistency problem: on-disk structures are left only partially updated
• File systems use two main methods to overcome it
  • FSCK
  • Journaling

Example
• Append a single data block to an existing file
• The append is not atomic: it needs three separate writes (bitmap, inode, data block)
• Da is the existing data block (block 4); Db is the appended block (block 5)

Field          I[v1] (before)   I[v2] (after)
owner          remzi            remzi
permissions    read-only        read-only
size           1                2
pointer        4                4
pointer        null             5
pointer        null             null
pointer        null             null

Crash Scenarios (1/2)
[Figure: inodes I[v1] (size 1; pointers 4, null) and I[v2] (size 2; pointers 4, 5); data blocks Da and Db]
• Just the data block (Db) is written to disk
  • It is not a problem from the perspective of file-system crash consistency
  • It might be a problem for users, who lose some data
• Just the updated inode (I[v2]) is written to disk
  • If we read the data block, we will read garbage data
  • File-system inconsistency
    • The bitmap disagrees with the inode information
• Just the updated bitmap is written to disk
  • File-system inconsistency
  • Space leak

Crash Scenarios (2/2)
[Figure: inodes I[v1] (size 1; pointers 4, null) and I[v2] (size 2; pointers 4, 5); data blocks Da and Db]
• The inode (I[v2]) and bitmap are written
  • If we read the data block, we will read garbage data
• The inode (I[v2]) and data block (Db) are written
  • File-system inconsistency
• The bitmap and data block (Db) are written
  • File-system inconsistency
  • Space leak
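
The scenarios above can be enumerated mechanically. The sketch below is only an illustration: the function name classify and its labels are made up here, and each of the three writes is modeled as a boolean saying whether it reached disk before the crash.

```python
from itertools import product

def classify(inode, bitmap, data):
    """Label the on-disk state when only some of the three writes completed."""
    if inode and bitmap and data:
        return ["complete"]
    if not (inode or bitmap or data):
        return ["no change"]
    labels = []
    if inode != bitmap:
        labels.append("inconsistent")   # inode and bitmap disagree about block 5
    if bitmap and not inode:
        labels.append("space leak")     # block marked in use, but nothing points to it
    if inode and not data:
        labels.append("garbage read")   # inode points to a block that was never written
    if data and not inode:
        labels.append("data lost")      # Db is on disk but unreachable
    return labels

# Walk through all eight combinations of the three writes.
for inode, bitmap, data in product([False, True], repeat=3):
    print(f"inode={inode!s:5} bitmap={bitmap!s:5} data={data!s:5} -> {classify(inode, bitmap, data)}")
```

Only the inode and the bitmap take part in the consistency check; the data block matters to the user but not to the file-system metadata.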

File System Checker (1/3)
• fsck is a UNIX tool for finding inconsistencies and repairing them
• Superblock
  • Checking that the file system size is greater than the number of blocks allocated
  • The system may decide to use an alternate copy of the superblock
• Free blocks
  • Scanning the inodes and blocks to understand block allocation information
  • Producing a correct version of the allocation bitmap
  • If there is any inconsistency between the bitmaps and the inodes, fsck trusts the inodes
• Inode state
  • Checking each inode for corruption or other problems (e.g. the inode type field)
  • If there are problems with the inode fields, fsck clears the inode
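
The free-blocks check can be sketched as follows. This is a toy model, not fsck's actual code: inodes is assumed to be a dict from inode number to a list of block pointers (None for null), and the function names are invented for the example.

```python
def rebuild_bitmap(inodes, num_blocks):
    """Derive the allocation bitmap from the inode pointers (the inodes are trusted)."""
    bitmap = [False] * num_blocks
    for pointers in inodes.values():
        for block in pointers:
            if block is not None:
                bitmap[block] = True
    return bitmap

def find_bitmap_errors(on_disk_bitmap, inodes):
    """Return the block numbers where the on-disk bitmap disagrees with the inodes."""
    rebuilt = rebuild_bitmap(inodes, len(on_disk_bitmap))
    return [b for b, (disk, want) in enumerate(zip(on_disk_bitmap, rebuilt)) if disk != want]

# Inode 2 points to blocks 4 and 5, but the on-disk bitmap only marks block 4.
inodes = {2: [4, 5, None, None]}
on_disk = [False, False, False, False, True, False, False, False]
print(find_bitmap_errors(on_disk, inodes))
```

Repair then amounts to writing the rebuilt bitmap back to disk.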

File System Checker (2/3)
• Inode links
  • Verifying the link count of each allocated inode
  • Scanning through the entire directory tree
  • If there is a mismatch, fsck fixes the count within the inode
  • If an allocated inode is referred to by no directory, it is moved to the lost+found directory
• Duplicates
  • Checking for duplicate pointers
  • The pointed-to block can be copied, giving each inode its own copy
• Bad blocks
  • Checking for bad block pointers
  • fsck just removes the pointer
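
A minimal sketch of the link-count check, assuming a simplified layout that is not any real on-disk format: directories maps a directory inode to its (name, inode) entries, and "." and ".." entries are left out for brevity. The function name check_links is hypothetical.

```python
from collections import Counter

def check_links(inode_link_counts, directories):
    """Compare each inode's stored link count with the references found in directories.

    Returns (fixes, orphans): fixes maps inode -> corrected count,
    orphans lists allocated inodes that no directory refers to.
    """
    seen = Counter(ino for entries in directories.values() for _, ino in entries)
    fixes = {ino: seen[ino] for ino, count in inode_link_counts.items()
             if seen[ino] and seen[ino] != count}
    orphans = [ino for ino in inode_link_counts if seen[ino] == 0]
    return fixes, orphans

# Inode 3 has two names but a stored count of 1; inode 5 is allocated but unreferenced.
directories = {2: [("a", 3), ("b", 3), ("c", 4)]}
stored_counts = {3: 1, 4: 1, 5: 1}
print(check_links(stored_counts, directories))
```

In this model, the orphans are the inodes that would be moved to lost+found.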

File System Checker (3/3)
• Directory checks
  • Performing additional integrity checks
    • "." and ".." are the first entries
    • Each inode referred to in a directory is allocated
• Disadvantages
  • Requires intricate knowledge of the file system
  • Too slow, since it scans the entire disk

Journaling
• When updating the disk
  • First, write down a little note in the log describing what you are about to do
  • Then overwrite the structures in place
• Types of journaling
  • Data journaling
  • Metadata journaling

Data Journaling (1/2)
• Journal write
  • Transaction begin (TxB)
    • Transaction identifier
    • Information about the pending update
  • Contents
    • Physical logging
      • Exact physical contents of the update
    • Logical logging
      • Compact logical representation of the update
  • Transaction end (TxE)
    • Transaction identifier
• Checkpoint
  • Write the pending metadata and data updates to their final locations

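Data Journaling (2/2)
[Figure: journal contents after a crash, with TxE on disk but the slot for Db (??) not yet written]
• Problem: a crash occurs during the writes to the journal
  • If the disk wrote TxE before Db, the journal looks like a valid transaction
  • If the system reboots and runs recovery, it will replay this transaction, including the garbage in Db's slot
• Solution: write the transaction in steps
  • Journal write
    • Transaction begin
    • Contents
  • Journal commit
    • Transaction end, written only after the journal write has completed
  • Checkpoint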
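
The write ordering can be illustrated with a toy in-memory journal. The record layout ("TxB", "DATA", "TxE" tuples) and the crash_after parameter are assumptions made for this sketch, not a real on-disk format.

```python
def write_transaction(journal, txid, updates, crash_after=None):
    """Append TxB, the update blocks, then TxE; stop early to simulate a crash.

    journal: list of records; updates: dict of final block address -> contents.
    crash_after: number of records written before the simulated crash (None = no crash).
    Returns True if the whole transaction, including TxE, was written.
    """
    records = [("TxB", txid)]
    records += [("DATA", txid, addr, contents) for addr, contents in updates.items()]
    # Journal commit: TxE is issued only after everything above is in the journal.
    records.append(("TxE", txid))
    for i, record in enumerate(records):
        if crash_after is not None and i >= crash_after:
            return False
        journal.append(record)
    return True

def committed(journal, txid):
    """A transaction counts as committed only if both TxB and TxE are present."""
    return ("TxB", txid) in journal and ("TxE", txid) in journal

journal = []
write_transaction(journal, 1, {2: "I[v2]", 5: "Db"}, crash_after=2)
print(journal, committed(journal, 1))
```

Because records are appended strictly in order, a TxE in this journal implies that every block of the transaction precedes it.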

Recovery
• Crash happens before journal commit
  • The pending update is simply skipped
• Crash happens right after journal commit
  • Redo logging
    • Scan the log and look for transactions that have committed to the disk
    • These transactions are replayed
• Crash happens at any point during checkpointing
  • Redo logging
  • Worst case
    • Some of the updates are performed again during recovery
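
Redo logging can be sketched in a few lines. The tuple-based journal records and the dict standing in for the disk are assumptions of this example only.

```python
def recover(journal, disk):
    """Replay every committed transaction, in log order, onto disk (a dict)."""
    committed = {rec[1] for rec in journal if rec[0] == "TxE"}
    for rec in journal:
        if rec[0] == "DATA" and rec[1] in committed:
            _, _, addr, contents = rec
            disk[addr] = contents   # redo: harmless if this block was already checkpointed
    return disk

journal = [
    ("TxB", 1), ("DATA", 1, 2, "I[v2]"), ("DATA", 1, 5, "Db"), ("TxE", 1),
    ("TxB", 2), ("DATA", 2, 9, "partial"),   # crash before TxE of transaction 2
]
disk = {2: "I[v2]"}   # the crash hit midway through checkpointing transaction 1
recover(journal, disk)
print(disk)
```

Transaction 2 has no TxE, so it is skipped; block 2 is written a second time, which is the redundant but safe worst case.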

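Batching Log Updates (1/2)
[Figure: creating files F1 and F2 in parent directory P; inodes I[F1] and I[F2]; successive versions of the parent inode (I[P], I'[P], I''[P]) and parent data block (P, P', P'')]
• File creation updates
  • inode bitmap
  • newly-created inode of the file
  • data block of the parent directory
  • parent directory inode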

Batching Log Updates (2/2)
• Solution
  • File systems buffer all updates into a global transaction
  • The file system marks the in-memory inode bitmap, inodes of the files, directory data, and directory inode as dirty
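
A rough sketch of the buffering idea, with a hypothetical GlobalTransaction class and block names chosen only for illustration: repeated updates to the same block are absorbed in memory, so each dirty block is journaled once per commit.

```python
class GlobalTransaction:
    def __init__(self):
        self.dirty = {}   # block name -> latest in-memory contents

    def mark_dirty(self, block, contents):
        # A later update to the same block replaces the earlier one.
        self.dirty[block] = contents

    def commit(self, journal):
        """Write all dirty blocks as one transaction; return how many were written."""
        journal.append(list(self.dirty.items()))
        written = len(self.dirty)
        self.dirty = {}
        return written

tx = GlobalTransaction()
journal = []
# Create two files, F1 and F2, in the same parent directory P.
for name in ("F1", "F2"):
    tx.mark_dirty("inode bitmap", f"bit for {name} set")
    tx.mark_dirty(f"I[{name}]", "new inode")
    tx.mark_dirty("P data", f"entry for {name} added")
    tx.mark_dirty("I[P]", f"updated after {name}")
print(tx.commit(journal))
```

Eight block updates are made here, but only five distinct blocks end up in the journal.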

Making The Log Finite
• Log full
  • Recovery takes longer
  • No further transactions can be committed
• Circular log
  • File systems treat the log as a circular data structure, reusing it over and over
  • After a transaction has been checkpointed, the file system should free the space it was occupying
• Steps, in order
  • Journal write
  • Journal commit
  • Checkpoint
  • Free
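
A minimal sketch of a circular log, assuming space is tracked in abstract block units; the class and its fields are illustrative and not taken from any particular file system.

```python
class CircularLog:
    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0   # start of the oldest transaction not yet freed
        self.tail = 0   # next slot to write
        self.used = 0

    def append(self, size):
        """Reserve space for a transaction; return False if the log is full."""
        if self.used + size > self.capacity:
            return False
        self.tail = (self.tail + size) % self.capacity
        self.used += size
        return True

    def free(self, size):
        """Release the space of a checkpointed transaction at the head of the log."""
        size = min(size, self.used)
        self.head = (self.head + size) % self.capacity
        self.used -= size

log = CircularLog(8)
print(log.append(5))   # fits
print(log.append(5))   # log full: the commit must wait
log.free(5)            # the first transaction has been checkpointed
print(log.append(5))   # fits again, wrapping around the end of the log
```

The second append fails until the first transaction has been checkpointed and freed, which is the "no further transactions can be committed" case.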

Metadata Journaling
• Journaling without data blocks
  • Only metadata is written to the journal
  • Data blocks are written to the file system directly
• Unordered metadata journaling
  • Data can be written at any time
  • Steps, in order
    • Journal metadata write
    • Journal commit
    • Checkpoint metadata
    • Free
• Ordered metadata journaling
  • Data is written before the metadata that refers to it is committed
  • Steps, in order
    • Data write
    • Journal metadata write
    • Journal commit
    • Checkpoint metadata
    • Free
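
The ordered protocol can be simulated step by step. This is a sketch under simplifying assumptions: the disk is a dict, the journal a list, block 5 holds Db, block 2 holds the inode, and crash_at is a made-up parameter naming the first step that does not happen.

```python
STEPS = ["data write", "journal metadata write", "journal commit",
         "checkpoint metadata", "free"]

def ordered_update(disk, journal, txid, data, metadata, crash_at=None):
    """Run the ordered-mode steps in sequence, stopping before step crash_at."""
    for step_no, step in enumerate(STEPS):
        if step_no == crash_at:
            return
        if step == "data write":
            disk.update(data)                 # data goes straight to its final location
        elif step == "journal metadata write":
            journal.append(("TxB", txid, dict(metadata)))
        elif step == "journal commit":
            journal.append(("TxE", txid))
        elif step == "checkpoint metadata":
            disk.update(metadata)
        elif step == "free":
            journal.clear()                   # the transaction's log space can be reused

# Crash before every possible step, plus one run with no crash.
for crash_at in range(len(STEPS) + 1):
    disk, journal = {}, []
    ordered_update(disk, journal, 1, {5: "Db"}, {2: "I[v2]"}, crash_at)
    is_committed = ("TxE", 1) in journal
    # A committed or checkpointed inode never refers to an unwritten data block.
    assert not (is_committed or 2 in disk) or 5 in disk
    print(crash_at, is_committed, sorted(disk))
```

Dropping the data-first requirement (the unordered variant) would let the assertion fail for some crash points, which is the garbage-pointer case.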

Block Reuse (1/2)
• A user adds an entry to the directory foo
  • The contents of foo (block 1000) are written to the log
  • Directories are considered metadata
• The user deletes everything in the directory as well as the directory itself, freeing up block 1000 for reuse
• The user creates a new file, foobar, which reuses block 1000
  • The inode of foobar is committed to disk
  • With metadata journaling, the data of foobar in block 1000 is not journaled
• A crash occurs
  • During replay, the recovery process replays everything in the log
  • The old directory contents overwrite the data of foobar in block 1000

Block Reuse (2/2)
• Solutions
  • Never reuse blocks until the delete of those blocks is checkpointed out of the journal
  • Use a revoke record (a new type of journal record)
    • Deleting the directory causes a revoke record to be written to the journal
    • When replaying the journal, the system first scans for revoke records
    • Any revoked data is never replayed
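
A toy replay loop with revoke records might look like the following. The ("WRITE", block, contents) and ("REVOKE", block) record shapes are invented for this sketch, the journal is assumed to contain only committed transactions, and skipping only the writes logged before a revoke is a simplifying assumption of this model.

```python
def replay_with_revokes(journal, disk):
    """Replay journaled writes, skipping any write that a later revoke record cancels."""
    # First pass: find the position of the last revoke record for each block.
    revoked = {}
    for i, rec in enumerate(journal):
        if rec[0] == "REVOKE":
            revoked[rec[1]] = i
    # Second pass: replay only the writes that are not revoked.
    for i, rec in enumerate(journal):
        if rec[0] == "WRITE":
            _, block, contents = rec
            if revoked.get(block, -1) > i:
                continue
            disk[block] = contents
    return disk

journal = [
    ("WRITE", 1000, "directory contents of foo"),   # logged when the entry was added
    ("REVOKE", 1000),                               # logged when foo was deleted
]
disk = {1000: "user data of foobar"}                # block 1000 was reused by foobar
replay_with_revokes(journal, disk)
print(disk[1000])
```

Without the revoke record, the same replay would overwrite the data of foobar with the stale directory block.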

Other Approaches (1/2)
• Soft updates
  • Order all writes to the file system to ensure that the on-disk structures are never left in an inconsistent state
    • Example: write a pointed-to data block to disk before the inode that points to it
    • The inode then never points to garbage
  • Implementation can be a challenge
    • Requires intricate knowledge of the exact file system structures
    • Adds a fair amount of complexity to the system
• Copy-on-write (COW)
  • New updates are placed in previously unused locations on disk
  • After a number of updates are completed, COW file systems flip the root structure of the file system to include pointers to the newly updated structures
  • This makes keeping the file system consistent straightforward
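
The copy-on-write idea can be sketched with a tiny in-memory store. CowStore and its methods are hypothetical names, and the root is simplified to a flat name-to-address map rather than a tree.

```python
class CowStore:
    def __init__(self):
        self.blocks = {}      # address -> contents; never overwritten in place
        self.next_free = 0
        self.root = {}        # committed mapping: name -> block address

    def update(self, changes):
        """Write new versions to unused locations; return a new, not yet installed root."""
        new_root = dict(self.root)
        for name, contents in changes.items():
            addr = self.next_free
            self.next_free += 1
            self.blocks[addr] = contents
            new_root[name] = addr
        return new_root

    def flip_root(self, new_root):
        """Single switch that makes all the new structures visible at once."""
        self.root = new_root

    def read(self, name):
        return self.blocks[self.root[name]]

store = CowStore()
store.flip_root(store.update({"a": "v1"}))
pending = store.update({"a": "v2"})
print(store.read("a"))   # a crash here would still see the old version
store.flip_root(pending)
print(store.read("a"))
```

A crash before flip_root leaves the old root and all old blocks intact, so the store is consistent either way.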

Other Approaches (2/2)
• Backpointer-based consistency
  • No ordering is enforced between writes
  • To achieve consistency, an additional back pointer is added to every block
    • Each data block has a reference to the inode to which it belongs
• Optimistic crash consistency
  • Issue as many writes to disk as possible
  • Use a generalized form of the transaction checksum
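
The back-pointer check can be sketched as below. The dict-based block layout with an "owner" field is an assumption made for illustration, as is the function name valid_blocks.

```python
def valid_blocks(inode_no, pointers, blocks):
    """Keep only the forward pointers confirmed by the block's own back pointer."""
    return [b for b in pointers
            if b in blocks and blocks[b]["owner"] == inode_no]

# Inode 2 points to blocks 4 and 5, but block 5 still carries a stale back pointer
# because its new contents never reached disk before the crash.
blocks = {
    4: {"owner": 2, "data": "Da"},
    5: {"owner": 7, "data": "stale"},
}
print(valid_blocks(2, [4, 5], blocks))
```

A file is treated as consistent only where the forward pointer and the back pointer agree, so no write ordering is needed to detect the bad block.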

Summary
• We have introduced the problem of crash consistency
• Solutions
  • FSCK
  • Journaling
    • Data journaling
    • Metadata journaling

