Learn about crash consistency challenges, FSCK tools, journaling techniques, and scenarios to maintain file system integrity. Get insights on data journaling and recovery processes for robust file system management.
Crash Consistency: FSCK and Journaling
Deoksang Kim (dskim@dcslab.snu.ac.kr)
School of Computer Science and Engineering, Seoul National University
Introduction
• File system data structures must persist
• Challenges
  • Power loss
  • System crash
• These events can cause the crash-consistency problem
• File systems have used the following methods to overcome it
  • FSCK
  • Journaling
Example
• Append a single data block to an existing file
• It is not atomic: the inode, the bitmap, and the new data block (Db) are written separately
[Figure: the inode before (I[v1]) and after (I[v2]) the append, with data blocks Da and Db]
I[v1]: owner: remzi, permissions: read-only, size: 1, pointers: 4, null, null, null
I[v2]: owner: remzi, permissions: read-only, size: 2, pointers: 4, 5, null, null
Crash Scenarios (1/2)
• Just the data block (Db) is written to disk
  • This is not a problem from the perspective of file-system crash consistency
  • It might be a problem for the user, who loses some data
• Just the updated inode (I[v2]) is written to disk
  • If we read the data block, we will read garbage data
  • File-system inconsistency
    • The bitmap disagrees with the inode information
• Just the updated bitmap is written to disk
  • File-system inconsistency
  • Space leak
Crash Scenarios (2/2)
• The inode (I[v2]) and bitmap are written
  • The metadata is consistent, but if we read the data block, we will read garbage data
• The inode (I[v2]) and data block (Db) are written
  • File-system inconsistency: the inode disagrees with the old bitmap
• The bitmap and data block (Db) are written
  • File-system inconsistency
  • Space leak: no inode points to the block
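The scenarios on the two slides above can be enumerated mechanically. The following is a minimal sketch, assuming the append consists of exactly three writes (inode, bitmap, data block); the names `WRITES`, `classify`, and `all_scenarios` are illustrative and not part of any real file system.

```python
from itertools import combinations

WRITES = ("inode", "bitmap", "data")  # the three writes of the append

def classify(done):
    """Describe the on-disk state when only the writes in `done` completed."""
    inode, bitmap, data = (w in done for w in WRITES)
    problems = []
    if inode != bitmap:
        problems.append("inode and bitmap disagree")
    if bitmap and not inode:
        problems.append("space leak")
    if inode and not data:
        problems.append("inode points to garbage")
    if data and not inode and not bitmap:
        problems.append("new data lost (metadata still consistent)")
    return problems or ["consistent"]

def all_scenarios():
    """Map every subset of the three writes to its outcome."""
    return {
        subset: classify(set(subset))
        for r in range(len(WRITES) + 1)
        for subset in combinations(WRITES, r)
    }
```

Iterating over `all_scenarios().items()` lists all eight cases, including the two trivially consistent ones (nothing written, everything written) that the slides skip.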
File System Checker (1/3)
• fsck is a UNIX tool for finding inconsistencies and repairing them
• Superblock
  • Checking that the file system size is greater than the number of blocks allocated
  • The system may decide to use an alternate copy of the superblock
• Free blocks
  • Scanning the inodes and blocks to understand block allocation information
  • Producing a correct version of the allocation bitmap
  • If there is any inconsistency between the bitmaps and the inodes, fsck trusts the inodes
• Inode state
  • Checking each inode for corruption or other problems (e.g., the inode type field)
  • If there are problems with the inode fields, fsck clears the inode
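The free-blocks pass can be sketched as follows. This is a toy model, not fsck's actual code: it assumes inodes are given as a dictionary from inode number to a list of direct block pointers (no indirect blocks), and the function names are hypothetical.

```python
def rebuild_bitmap(inodes, num_blocks):
    """Recompute the allocation bitmap from the inodes' block pointers."""
    bitmap = [False] * num_blocks
    for pointers in inodes.values():
        for block in pointers:
            if block is not None and 0 <= block < num_blocks:
                bitmap[block] = True
    return bitmap

def check_free_blocks(inodes, on_disk_bitmap):
    """Trust the inodes: return the corrected bitmap and the blocks whose bit changed."""
    correct = rebuild_bitmap(inodes, len(on_disk_bitmap))
    fixed = [
        block
        for block, (old, new) in enumerate(zip(on_disk_bitmap, correct))
        if old != new
    ]
    return correct, fixed
```

With the inode from the example slide (pointers 4 and 5) and a bitmap that only marks block 4, the pass would mark block 5 as allocated; a bit set for a block that no inode references would be cleared, which repairs a space leak.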
File System Checker (2/3)
• Inode links
  • Verifying the link count of each allocated inode
  • Scanning through the entire directory tree
  • If there is a mismatch, fsck fixes the count within the inode
  • If an allocated inode that no directory refers to is discovered, it is moved to the lost+found directory
• Duplicates
  • Checking for duplicate pointers
  • The pointed-to block can be copied, giving each inode its own copy
• Bad blocks
  • Checking for bad block pointers
  • fsck just removes the pointer
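The inode-links check amounts to recounting references while walking the directory tree. Below is a simplified sketch under stated assumptions: directories are plain dictionaries, "." and ".." entries are ignored, and `check_links` is a hypothetical helper rather than anything fsck exposes.

```python
from collections import Counter

def check_links(link_counts, directories):
    """link_counts: inode number -> link count stored in the inode.
    directories: directory name -> {entry name: inode number}.
    Returns (corrected link counts, inodes to move to lost+found)."""
    referenced = Counter(
        ino for entries in directories.values() for ino in entries.values()
    )
    corrected = {}
    orphans = []
    for ino, stored in link_counts.items():
        actual = referenced.get(ino, 0)
        if actual == 0:
            orphans.append(ino)        # allocated, but no directory refers to it
            corrected[ino] = stored
        else:
            corrected[ino] = actual    # the count found in the tree wins
    return corrected, sorted(orphans)
```

For example, an inode with a stored count of 1 that appears under two directory entries would have its count corrected to 2, and an allocated inode with no entries at all would be reported for lost+found.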
File System Checker (3/3)
• Directory checks
  • Performing additional integrity checks
  • “.” and “..” are the first entries
  • Each inode referred to in a directory is allocated
• Disadvantages
  • Requires intricate knowledge of the file system
  • Too slow: the whole disk has to be scanned
Journaling
• When updating the disk
  • First write down a little log describing what you are about to do
  • Then overwrite the structures in place
• Types of journaling
  • Data journaling
  • Metadata journaling
Data Journaling (1/2)
• Journal write
  • Transaction begin
    • Transaction identifier
    • Information about the pending update
  • Contents
    • Physical logging
      • Exact physical contents of the update
    • Logical logging
      • Compact logical representation of the update
  • Transaction end
    • Transaction identifier
• Checkpoint
  • Write the pending metadata and data updates to their final locations
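Data Journaling (2/2)
[Figure: journal contents, with labels TxE, Db, ??]
• A crash can occur during the writes to the journal
  • If the transaction-end block reaches the disk before all of the contents, the journal still looks like a valid transaction
  • If the system reboots and runs recovery, it will replay this transaction
• To avoid this, the transaction end is written in a separate step
  • Journal write
    • Transaction begin
    • Contents
  • Journal commit
    • Transaction end
  • Checkpoint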
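The three-step protocol on this slide can be illustrated with a toy in-memory model. This is only a sketch under assumptions: the "disk" is a dictionary, the journal is a Python list, physical logging is used, and the class and record layout (`ToyJournalFS`, tuples tagged `"TxB"`, `"block"`, `"TxE"`) are invented for illustration.

```python
TXB, TXE = "TxB", "TxE"

class ToyJournalFS:
    def __init__(self):
        self.disk = {}      # block address -> contents (final locations)
        self.journal = []   # append-only list of journal records

    def journal_write(self, txid, updates):
        """Step 1: write TxB and the pending block contents (physical logging)."""
        self.journal.append((TXB, txid))
        for addr, contents in updates.items():
            self.journal.append(("block", txid, addr, contents))

    def journal_commit(self, txid):
        """Step 2: write TxE, only after step 1 has completed."""
        self.journal.append((TXE, txid))

    def checkpoint(self, txid):
        """Step 3: copy the logged contents to their final locations."""
        for record in self.journal:
            if record[0] == "block" and record[1] == txid:
                _, _, addr, contents = record
                self.disk[addr] = contents

    def update(self, txid, updates):
        self.journal_write(txid, updates)
        self.journal_commit(txid)
        self.checkpoint(txid)
```

On a real disk, the gap between `journal_write` and `journal_commit` would have to be enforced with a write barrier or by waiting for completion; the sequential calls in `update` only stand in for that ordering.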
Recovery
• Crash happens before journal commit
  • The pending update is simply skipped
• Crash happens right after journal commit
  • Redo logging
    • Scan the log and look for transactions that have committed to the disk
    • These transactions are replayed
• Crash happens at any point during checkpointing
  • Redo logging
  • Worst case
    • Some of the updates are performed again during recovery
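Redo logging can be sketched in a few lines. The record layout here is an assumption made for the example: `("TxB", txid)`, `("block", txid, addr, contents)`, and `("TxE", txid)` tuples in log order; `recover` is a hypothetical name.

```python
def recover(journal, disk):
    """Replay committed transactions in log order; skip those without a TxE."""
    committed = {rec[1] for rec in journal if rec[0] == "TxE"}
    replayed = []
    for rec in journal:
        if rec[0] == "block" and rec[1] in committed:
            _, txid, addr, contents = rec
            disk[addr] = contents          # redo: overwriting again is harmless
            if txid not in replayed:
                replayed.append(txid)
    return replayed
```

Because replay only rewrites blocks with their logged contents, running it twice (or after a partially completed checkpoint) leaves the same result, which is why the worst case is merely some repeated work.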
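Batching Log Updates (1/2)
[Figure: creating files F1 and F2 in the same parent directory P; labels I[P], I’[P], I’’[P], I[F1], I[F2], P, P’, P’’]
• File creation updates
  • The inode bitmap
  • The newly created inode of the file
  • The data block of the parent directory
  • The parent directory inode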
Batching Log Updates (2/2)
• Solution
  • File systems buffer all updates into a global transaction
  • The file system marks the in-memory inode bitmap, the inodes of the files, the directory data, and the directory inode as dirty
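The effect of buffering can be seen in a small sketch. It assumes a transaction is just a dictionary of dirty blocks keyed by name, so that repeated updates to the same block collapse into one logged copy; `GlobalTransaction` and `create_file` are illustrative names, not a real API.

```python
class GlobalTransaction:
    def __init__(self):
        self.dirty = {}      # block name -> latest in-memory contents
        self.updates = 0     # how many individual updates were absorbed

    def mark_dirty(self, block, contents):
        self.dirty[block] = contents
        self.updates += 1

    def commit(self):
        """Return the blocks to log, one copy each, and start a new batch."""
        to_log, self.dirty = self.dirty, {}
        return to_log

def create_file(tx, parent, name, version):
    """Dirty the four structures that a file creation touches."""
    tx.mark_dirty("inode-bitmap", f"bitmap after {name}")
    tx.mark_dirty(f"I[{name}]", "new inode")
    tx.mark_dirty(parent, f"{parent} data v{version}")
    tx.mark_dirty(f"I[{parent}]", f"{parent} inode v{version}")
```

Creating two files in the same directory performs eight updates but leaves only five dirty blocks to log, because the inode bitmap, the directory data, and the directory inode are shared.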
Making The Log Finite
• Log full
  • Recovery takes longer
  • No further transactions can be committed
• Circular log
  • File systems treat the log as a circular data structure, reusing it over and over
  • After a transaction has been checkpointed, the file system should free the space it was occupying
• Journal write
• Journal commit
• Checkpoint
• Free
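A circular log with an explicit free step might look like the sketch below. It is an assumption-laden toy: space is counted in blocks, `head` and `tail` are monotonic counters kept in memory (a real file system would persist them, for example in a journal superblock), and `CircularLog` is a made-up class.

```python
class CircularLog:
    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0        # oldest block not yet freed (monotonic counter)
        self.tail = 0        # next block to write (monotonic counter)
        self.live = []       # (txid, size) of transactions not yet freed

    def free_space(self):
        return self.capacity - (self.tail - self.head)

    def append(self, txid, size):
        """Reserve `size` journal blocks; return the start slot, or None if full."""
        if size > self.free_space():
            return None
        start = self.tail % self.capacity   # physical slot, wraps around
        self.tail += size
        self.live.append((txid, size))
        return start

    def free_oldest(self):
        """Called after the oldest transaction has been checkpointed."""
        txid, size = self.live.pop(0)
        self.head += size
        return txid
```

With a capacity of 8 blocks, a second 5-block transaction is refused until the first one is freed; afterwards it starts at slot 5 and wraps around to the beginning of the log.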
Metadata Journaling
• Unordered metadata journaling
  • Data can be written at any time
  • Journal metadata write
  • Journal commit
  • Checkpoint metadata
  • Free
• Journaling without data blocks
  • Data blocks are written to the file system directly
• Ordered metadata journaling
  • Data write
  • Journal metadata write
  • Journal commit
  • Checkpoint metadata
  • Free
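Why the position of the data write matters can be checked with a small crash simulation. This is a sketch, assuming a crash can happen between any two steps and that recovery makes the new inode visible exactly when the journal commit completed; `LATE_DATA` is one hypothetical schedule that unordered journaling permits.

```python
ORDERED = [
    "data write",
    "journal metadata write",
    "journal commit",
    "checkpoint metadata",
    "free",
]

# One schedule allowed by unordered journaling: the data reaches disk last.
LATE_DATA = [
    "journal metadata write",
    "journal commit",
    "checkpoint metadata",
    "free",
    "data write",
]

def crash_outcome(steps, crash_after):
    """Run the first `crash_after` steps, then recover with redo logging."""
    done = steps[:crash_after]
    data_on_disk = "data write" in done
    committed = "journal commit" in done
    if not committed:
        return "old file contents (update skipped)"
    return "new data visible" if data_on_disk else "inode points to garbage"
```

Under `ORDERED`, no crash point yields an inode that points to garbage, while `crash_outcome(LATE_DATA, 2)` does.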
Block Reuse (1/2)
• A user adds an entry to the directory foo
  • The contents of foo are written to the log
  • Directories are considered metadata
• The user deletes everything in the directory as well as the directory itself, freeing up block 1000 for reuse
• The user creates a new file, foobar, which reuses block 1000
  • The inode of foobar is committed to disk
• A crash occurs
  • During replay, the recovery process replays everything in the log, including the old contents of foo, which overwrite the data of foobar in block 1000
Block Reuse (2/2)
• Solutions
  • Never reuse blocks until the deletion of those blocks is checkpointed out of the journal
  • Use a revoke record (a new type of record)
    • When replaying the journal, the system first scans for revoke records
    • Any revoked data is never replayed
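Replay with revoke records can be sketched as a two-pass scan. The record format is an assumption for this example: committed records are `("write", addr, contents)` or `("revoke", addr)` in log order, and a revoke is taken to cancel only the writes logged before it; `replay_with_revokes` is a hypothetical name.

```python
def replay_with_revokes(journal, disk):
    """Replay committed writes, skipping any write that a later revoke cancels."""
    # Pass 1: record the position of the last revoke for each block.
    last_revoke = {}
    for pos, rec in enumerate(journal):
        if rec[0] == "revoke":
            last_revoke[rec[1]] = pos
    # Pass 2: replay the writes that are not covered by a later revoke.
    for pos, rec in enumerate(journal):
        if rec[0] == "write":
            _, addr, contents = rec
            if last_revoke.get(addr, -1) > pos:
                continue                    # revoked: never replayed
            disk[addr] = contents
    return disk
```

In the foo/foobar scenario, the logged directory contents for block 1000 are followed by a revoke record, so replay leaves the data of foobar in block 1000 untouched while still replaying the inode of foobar.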
Other Approaches (1/2)
• Soft updates
  • Order all writes to the file system to ensure that the on-disk structures are never left in an inconsistent state
  • Example: writing a pointed-to data block to disk before the inode that points to it
    • The inode never points to garbage
  • Implementation can be a challenge
    • Requires intricate knowledge of the exact file system structures
    • Adds a fair amount of complexity to the system
• Copy-on-write (COW)
  • New updates are placed in previously unused locations on disk
  • After a number of updates are completed, COW file systems flip the root structure of the file system to include pointers to the newly updated structures
  • Makes keeping the file system consistent straightforward
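The copy-on-write idea can be shown with a toy model in which the root is a single name-to-block map. This is a sketch under that assumption, not how any particular COW file system lays out its tree; `ToyCowFS` is an invented class.

```python
class ToyCowFS:
    def __init__(self):
        self.blocks = {0: {}}   # block address -> contents; block 0 is the first root
        self.root = 0           # address of the current root structure
        self.next_free = 1

    def _alloc(self, contents):
        addr = self.next_free
        self.next_free += 1
        self.blocks[addr] = contents
        return addr

    def update(self, changes):
        """Write new data and a new root to unused blocks, then flip the root."""
        new_root = dict(self.blocks[self.root])   # copy of name -> block address
        for name, data in changes.items():
            new_root[name] = self._alloc(data)    # never overwrite in place
        new_root_addr = self._alloc(new_root)
        self.root = new_root_addr                 # the single "flip" step
        return new_root_addr

    def read(self, name):
        return self.blocks[self.blocks[self.root][name]]
```

A crash before the flip leaves the old root, and everything it points to, intact; a crash after it exposes the complete new version.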
Other Approaches (2/2)
• Backpointer-based consistency
  • No ordering is enforced between writes
  • To achieve consistency, an additional back pointer is added to every block
    • Each data block has a reference to the inode to which it belongs
• Optimistic crash consistency
  • Issue as many writes to disk as possible
  • Use a generalized form of the transaction checksum
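The backpointer check reduces to comparing forward and back pointers when a file is accessed. A minimal sketch, assuming back pointers are available as a dictionary from block address to owning inode number; `consistent_blocks` is a hypothetical helper.

```python
def consistent_blocks(inode_no, pointers, backpointers):
    """Return the pointed-to blocks whose back pointer names this inode."""
    return [
        block
        for block in pointers
        if block is not None and backpointers.get(block) == inode_no
    ]
```

If inode 2 points to blocks 4 and 5 but block 5 still carries a back pointer to another inode, only block 4 is treated as part of the file.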
Summary
• We have introduced the problem of crash consistency
• Solutions
  • FSCK
  • Journaling
    • Data journaling
    • Metadata journaling