Crash Consistency: FSCK and Journaling

Deoksang Kim (dskim@dcslab.snu.ac.kr), School of Computer Science and Engineering, Seoul National University

Introduction
• File system data structures must persist
• Challenges
  • Power loss
  • System crash
• Either event can interrupt an update partway through, which leads to the crash-consistency problem
• File systems have used the following methods to overcome it
  • FSCK
  • Journaling

Example
• Append a single data block to an existing file
• The update is not atomic: it takes three separate writes (the inode, the bitmap, and the data block Db)
• Inode before the append, I[v1]
  • owner: remzi, permissions: read-only, size: 1
  • pointers: 4, null, null, null
• Inode after the append, I[v2]
  • owner: remzi, permissions: read-only, size: 2
  • pointers: 4, 5, null, null

Crash Scenarios (1/2)
• Just the data block (Db) is written to disk
  • Not a problem from the perspective of file-system crash consistency
  • It might be a problem for the user, who loses some data
• Just the updated inode (I[v2]) is written to disk
  • If we read the data block, we will read garbage data
  • File-system inconsistency: the bitmap disagrees with the inode information
• Just the updated bitmap is written to disk
  • File-system inconsistency
  • Space leak: the block is marked allocated, but no inode uses it

Crash Scenarios (2/2)
• The inode (I[v2]) and bitmap are written
  • If we read the data block, we will read garbage data
• The inode (I[v2]) and data block (Db) are written
  • File-system inconsistency
• The bitmap and data block (Db) are written
  • File-system inconsistency
  • Space leak
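The six partial-write scenarios above can be enumerated mechanically. The following is a minimal sketch, assuming a toy model in which the append consists of exactly three writes; the names `WRITES` and `classify` are made up for illustration and do not come from any real file system.

```python
from itertools import combinations

# Hypothetical model: the append needs these three writes to reach disk.
WRITES = ("inode", "bitmap", "data")

def classify(done):
    """List file-system-level problems when only the writes in `done` landed."""
    done = set(done)
    inode_points = "inode" in done    # I[v2] points to block 5
    bitmap_marks = "bitmap" in done   # bitmap marks block 5 as allocated
    data_valid = "data" in done       # block 5 actually holds Db
    problems = []
    if inode_points != bitmap_marks:
        problems.append("inconsistent metadata")
    if bitmap_marks and not inode_points:
        problems.append("space leak")
    if inode_points and not data_valid:
        problems.append("garbage data")
    return problems

# Print every scenario in which one or two of the three writes completed.
for k in (1, 2):
    for done in combinations(WRITES, k):
        print(done, "->", classify(done) or "consistent")
```

A data-only write reports no file-system problem in this model, even though the user's new data is lost.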

File System Checker (1/3)
• fsck is a UNIX tool for finding inconsistencies and repairing them
• Superblock
  • Sanity checks, e.g. that the file system size is greater than the number of blocks allocated
  • If the superblock looks corrupt, the system may decide to use an alternate copy
• Free blocks
  • Scans the inodes and blocks to understand which blocks are currently allocated
  • Produces a correct version of the allocation bitmap
  • If there is any inconsistency between the bitmaps and the inodes, fsck trusts the inodes
• Inode state
  • Checks each inode for corruption or other problems (e.g. the inode type field)
  • If there are problems with the inode fields that cannot be fixed, fsck clears the inode
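The free-blocks pass can be sketched as follows. This is an illustrative model only, assuming inodes are plain dictionaries with a `pointers` list and the bitmap is a list of booleans; real fsck implementations work on on-disk structures and also follow indirect blocks.

```python
def rebuild_bitmap(inodes, num_blocks):
    """Recompute the allocation bitmap by trusting the inode block pointers."""
    bitmap = [False] * num_blocks
    for inode in inodes:
        for ptr in inode["pointers"]:
            # Skip null pointers and pointers outside the partition.
            if ptr is not None and 0 <= ptr < num_blocks:
                bitmap[ptr] = True
    return bitmap

def repair_bitmap(on_disk_bitmap, inodes):
    """Return the corrected bitmap and the block numbers that had to change."""
    fixed = rebuild_bitmap(inodes, len(on_disk_bitmap))
    changed = [b for b, old in enumerate(on_disk_bitmap) if old != fixed[b]]
    return fixed, changed

# Example: the inode uses blocks 4 and 5, but the bitmap only knows about
# block 4 and wrongly marks block 6 as allocated (a space leak).
inodes = [{"pointers": [4, 5, None, None]}]
on_disk = [False, False, False, False, True, False, True, False]
fixed, changed = repair_bitmap(on_disk, inodes)
print(changed)
```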

File System Checker (2/3)
• Inode links
  • Verifies the link count of each allocated inode
  • Scans through the entire directory tree to build its own link counts
  • If there is a mismatch, fsck fixes the count within the inode
  • If an allocated inode is discovered that no directory refers to, it is moved to the lost+found directory
• Duplicates
  • Checks for duplicate pointers, i.e. two inodes that refer to the same block
  • The pointed-to block is copied, giving each inode its own copy
• Bad blocks
  • Checks for bad block pointers, e.g. pointers that refer to something outside the valid range
  • fsck just removes the pointer
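The inode-links pass might look roughly like this. It is a simplified sketch under stated assumptions: directories are dictionaries mapping names to inode numbers, every entry counts as one link, and the "." and ".." entries are ignored. The function name `check_links` is hypothetical.

```python
from collections import Counter

def check_links(link_counts, directories):
    """Compare stored link counts with references found in the directory tree.

    link_counts: {inode_number: link count stored in the inode}
    directories: {directory_inode_number: {entry_name: inode_number}}
    Returns (corrected counts for mismatched inodes, inodes for lost+found).
    """
    refs = Counter(
        ino for entries in directories.values() for ino in entries.values()
    )
    corrected = {}
    lost_found = []
    for ino, stored in link_counts.items():
        if refs[ino] == 0:
            lost_found.append(ino)      # allocated, but no directory names it
        elif refs[ino] != stored:
            corrected[ino] = refs[ino]  # trust the directory scan
    return corrected, lost_found

# Inode 2 has two names but a stored count of 1; inode 4 has no name at all.
corrected, lost_found = check_links(
    {2: 1, 3: 1, 4: 1},
    {1: {"a": 2, "b": 2, "c": 3}},
)
print(corrected, lost_found)
```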

File System Checker (3/3)
• Directory checks
  • Performs additional integrity checks on directory contents
  • “.” and “..” are the first entries
  • Each inode referred to in a directory is allocated
• Disadvantages
  • Requires intricate knowledge of the file system
  • Too slow: it scans the entire disk even when only a few blocks were being updated

Journaling
• When updating the disk
  • First write down a little log describing what you are about to do
  • Then overwrite the structures in place
• Types of journaling
  • Data journaling
  • Metadata journaling

Data Journaling (1/2)
• Journal write
  • Transaction begin
    • Transaction identifier
    • Information about the pending update
  • Contents
    • Physical logging: the exact physical contents of the update
    • Logical logging: a compact logical representation of the update
  • Transaction end
    • Transaction identifier
• Checkpoint
  • Write the pending metadata and data updates to their final locations

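Data Journaling (2/2)
• [Figure: journal contents after a crash; the transaction end block (TxE) is on disk, but the slot for Db holds unknown contents (??)]
• When a crash occurs during the writes to the journal, and the disk has completed them out of order
  • The journal looks like it holds a valid transaction
  • If the system reboots and runs recovery, it will replay this transaction, garbage block included
• To avoid this, the transaction end block is written in a separate step
  • Journal write
    • Transaction begin
    • Contents
  • Journal commit
    • Transaction end, issued only after the journal write has completed
  • Checkpoint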

Recovery
• Crash happens before journal commit
  • The pending update is simply skipped
• Crash happens right after journal commit
  • Redo logging
    • Scan the log and look for transactions that have committed to the disk
    • These transactions are replayed in order
• Crash happens at any point during checkpointing
  • Redo logging
  • Worst case: some updates are performed again during recovery
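Redo logging can be sketched in a few lines. This is a toy model, not any real journal format: the log is a Python list of tuples, `TxB` and `TxE` stand for the transaction begin and end blocks, and block names such as `"data5"` are invented for the example.

```python
def recover(log, disk):
    """Replay every committed transaction in log order (redo logging).

    log:  list of ("TxB", tid), ("write", tid, block, contents), ("TxE", tid)
    disk: dict mapping block name to contents; updated in place and returned
    """
    # A transaction counts as committed only if its TxE record is in the log.
    committed = {rec[1] for rec in log if rec[0] == "TxE"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, block, contents = rec
            disk[block] = contents  # redo; harmless if already checkpointed
    return disk

log = [
    ("TxB", 1),
    ("write", 1, "inode", "I[v2]"),
    ("write", 1, "bitmap", "B[v2]"),
    ("write", 1, "data5", "Db"),
    ("TxE", 1),
    ("TxB", 2),                      # crash before TxE: transaction 2 is skipped
    ("write", 2, "inode", "I[v3]"),
]
# The crash hit mid-checkpoint: only the bitmap had reached its final location.
disk = {"inode": "I[v1]", "bitmap": "B[v2]", "data5": None}
print(recover(log, disk))
```

Replay is idempotent in this sketch, so running recovery twice, or crashing during recovery and starting over, leaves the same final state.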

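Batching Log Updates (1/2)
• [Figure: journal contents when files F1 and F2 are created in parent directory P; block labels: I[P], I’[P], I’’[P], I[F1], P, P’, P’’, I[F2]]
• File creation updates
  • the inode bitmap
  • the newly-created inode of the file
  • the data block of the parent directory
  • the parent directory inode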

Batching Log Updates (2/2)
• Solution
  • File systems buffer all updates into a global transaction
  • The file system marks the in-memory inode bitmap, inodes of the files, directory data, and directory inode as dirty
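The effect of a global transaction can be illustrated with a small buffer that keeps only the latest version of each dirty block. This is a sketch under assumptions: the class `GlobalTransaction`, the block names, and the premise that the two file inodes live in different blocks are all hypothetical.

```python
class GlobalTransaction:
    """Buffers dirty blocks in memory and logs each block once at commit."""

    def __init__(self):
        self.dirty = {}  # block name -> latest in-memory contents

    def mark_dirty(self, block, contents):
        # A later update to the same block replaces the earlier one.
        self.dirty[block] = contents

    def commit(self, journal):
        """Append one transaction to the journal; return blocks written."""
        journal.append(dict(self.dirty))
        written = len(self.dirty)
        self.dirty.clear()
        return written

journal = []
tx = GlobalTransaction()
# Create F1, then F2, in the same parent directory P.
for version, name in enumerate(("F1", "F2"), start=1):
    tx.mark_dirty("inode bitmap", f"bitmap v{version}")
    tx.mark_dirty(f"I[{name}]", "new inode")
    tx.mark_dirty("P", f"directory data v{version}")
    tx.mark_dirty("I[P]", f"directory inode v{version}")
print(tx.commit(journal))
```

Eight block updates collapse into five journaled blocks here, because the bitmap, the directory data, and the directory inode are each logged once.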

Making The Log Finite
• Problems when the log fills up
  • The larger the log, the longer recovery takes
  • Once the log is full, no further transactions can be committed
• Circular log
  • File systems treat the log as a circular data structure, re-using it over and over
  • After a transaction has been checkpointed, the file system should free the space it was occupying
• Resulting protocol
  • Journal write
  • Journal commit
  • Checkpoint
  • Free
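The space accounting behind a circular log can be sketched as below. This toy `CircularLog` class is an assumption made for illustration: it tracks only how many journal blocks are in use, not the head and tail positions a real journal superblock would record.

```python
from collections import deque

class CircularLog:
    """Tracks journal space: commits consume it, freeing reclaims it."""

    def __init__(self, capacity):
        self.capacity = capacity  # total journal blocks
        self.used = 0
        self.live = deque()       # (tid, size) of non-freed transactions

    def commit(self, tid, size):
        """Return False if the log is too full to hold this transaction."""
        if self.used + size > self.capacity:
            return False
        self.live.append((tid, size))
        self.used += size
        return True

    def free_oldest(self):
        """Reclaim the oldest transaction; call only after its checkpoint."""
        tid, size = self.live.popleft()
        self.used -= size
        return tid

log = CircularLog(capacity=4)
print(log.commit(1, 3))   # fits
print(log.commit(2, 2))   # log full: must wait
print(log.free_oldest())  # transaction 1 was checkpointed, so free it
print(log.commit(2, 2))   # now there is room
```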

Metadata Journaling
• Unordered metadata journaling
  • Data can be written at any time
  • Journal metadata write
  • Journal commit
  • Checkpoint metadata
  • Free
• Journaling without data blocks
  • Data blocks are written to the file system directly
• Ordered metadata journaling
  • Data write
  • Journal metadata write
  • Journal commit
  • Checkpoint metadata
  • Free

Block Reuse (1/2)
• A user adds an entry to the directory foo
  • The contents of foo (block 1000) are written to the log
  • Directories are considered metadata
• The user deletes everything in the directory as well as the directory itself, freeing up block 1000 for reuse
• The user creates a new file, foobar, which reuses block 1000
  • The inode of foobar is committed to disk
  • With metadata journaling, the new data in block 1000 is not journaled
• A crash occurs
  • During replay, the recovery process replays everything in the log
  • The old directory contents overwrite the data of foobar in block 1000

Block Reuse (2/2)
• Solutions
  • Never reuse blocks until the delete of those blocks is checkpointed out of the journal
  • Use a revoke record (a new type of record)
    • When replaying the journal, the system first scans for revoke records
    • Any revoked data is never replayed
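Replay with revoke records can be sketched as a two-pass scan. This is a simplified model, assuming a flat log of tuples and the rule that a revoke cancels every earlier logged write to the same block; the record layout and the function name are hypothetical.

```python
def replay_with_revokes(log, disk):
    """Replay logged writes, skipping any that a later revoke record cancels.

    log:  list of ("write", block, contents) or ("revoke", block)
    disk: dict mapping block number to contents; updated in place and returned
    """
    # Pass 1: remember the position of the latest revoke for each block.
    last_revoke = {}
    for i, rec in enumerate(log):
        if rec[0] == "revoke":
            last_revoke[rec[1]] = i
    # Pass 2: replay writes that are not covered by a later revoke.
    for i, rec in enumerate(log):
        if rec[0] == "write":
            _, block, contents = rec
            if i < last_revoke.get(block, -1):
                continue  # revoked data is never replayed
            disk[block] = contents
    return disk

log = [
    ("write", 1000, "old contents of directory foo"),
    ("revoke", 1000),                 # logged when foo was deleted
    ("write", 7, "inode of foobar"),
]
disk = {1000: "data of foobar", 7: None}
print(replay_with_revokes(log, disk))
```

Block 1000 keeps the data of foobar, because the stale directory contents are skipped during replay.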

Other Approaches (1/2)
• Soft updates
  • Order all writes to the file system to ensure that the on-disk structures are never left in an inconsistent state
  • Example: write a pointed-to data block to disk before the inode that points to it, so the inode never points to garbage
  • Implementation can be a challenge
    • Requires intricate knowledge of the exact file system structures
    • Adds a fair amount of complexity to the system
• Copy-on-write
  • Places new updates in previously unused locations on disk
  • After a number of updates are completed, COW file systems flip the root structure of the file system to include pointers to the newly updated structures
  • Makes keeping the file system consistent straightforward

Other Approaches (2/2)
• Backpointer-based consistency
  • No ordering is enforced between writes
  • To achieve consistency, an additional back pointer is added to every block
  • Each data block has a reference to the inode to which it belongs
• Optimistic crash consistency
  • Issue as many writes to disk as possible
  • Use a generalized form of the transaction checksum
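The back-pointer check can be illustrated as follows. This is only a sketch of the idea, assuming each block is a dictionary with a hypothetical `owner` field holding the inode number it belongs to; it is not the on-disk format of any particular file system.

```python
def consistent_pointers(inode_number, pointers, blocks):
    """Return the forward pointers whose target block points back to the inode.

    pointers: block numbers stored in the inode (None for unused slots)
    blocks:   {block_number: {"owner": inode_number, "data": ...}}
    """
    valid = []
    for ptr in pointers:
        if ptr is None:
            continue
        block = blocks.get(ptr)
        # The block is trusted only if its back pointer names this inode.
        if block is not None and block["owner"] == inode_number:
            valid.append(ptr)
    return valid

# Inode 2 points to blocks 4 and 5, but block 5 still carries the back
# pointer of an older owner, so its new contents never reached the disk.
blocks = {
    4: {"owner": 2, "data": "Da"},
    5: {"owner": 9, "data": "stale"},
}
print(consistent_pointers(2, [4, 5, None, None], blocks))
```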

Summary
• We have introduced the problem of crash consistency
• Solutions
  • FSCK
  • Journaling
    • Data journaling
    • Metadata journaling

Summary Data journaling

Summary Metadata journaling
