Crash Consistency: FSCK and Journaling

Crash Consistency: FSCK and Journaling. Deoksang Kim (dskim@dcslab.snu.ac.kr), School of Computer Science and Engineering, Seoul National University. Introduction: file system data structures must persist across power loss and system crashes, and these events give rise to the crash-consistency problem.


Presentation Transcript


  1. Crash Consistency: FSCK and Journaling • Deoksang Kim (dskim@dcslab.snu.ac.kr) • School of Computer Science and Engineering, Seoul National University

  2. Introduction • File system data structures must persist • Challenges: power loss, system crash • These events can cause the crash-consistency problem • File systems have used two methods to overcome it: FSCK and journaling

  3. Example • Append a single data block to an existing file • The append is not atomic: it needs three separate writes, to the inode, the bitmap, and the new data block • Before, I[v1]: owner: remzi, permissions: read-only, size: 1, pointers: 4, null, null, null (data block Da) • After, I[v2]: owner: remzi, permissions: read-only, size: 2, pointers: 4, 5, null, null (data blocks Da and Db)

  4. Crash Scenarios (1/2) • Setting: I[v1] (size: 1; pointers: 4, null) is being replaced by I[v2] (size: 2; pointers: 4, 5), with data blocks Da and Db • Just the data block (Db) is written to disk: not a problem from the perspective of file-system crash consistency, but it might be a problem for the user, who lost some data • Just the updated inode (I[v2]) is written to disk: reading the data block returns garbage; file-system inconsistency, because the bitmap disagrees with the inode • Just the updated bitmap is written to disk: file-system inconsistency; space leak

  5. Crash Scenarios (2/2) • Setting: I[v1] (size: 1; pointers: 4, null) is being replaced by I[v2] (size: 2; pointers: 4, 5), with data blocks Da and Db • The inode (I[v2]) and bitmap are written: the metadata is consistent, but reading the data block returns garbage • The inode (I[v2]) and data block (Db) are written: file-system inconsistency, because the bitmap disagrees with the inode • The bitmap and data block (Db) are written: file-system inconsistency; space leak, because no inode points to the block
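The crash scenarios on slides 4 and 5 can be enumerated mechanically. The following is a minimal sketch, assuming the append needs exactly three writes (inode, bitmap, data block) and that a crash may leave any subset of them on disk; the names `WRITES` and `classify` are hypothetical and only serve the illustration.

```python
from itertools import combinations

# The three writes needed to append data block Db (block 5) to the file.
WRITES = ("inode", "bitmap", "data")

def classify(done):
    """Describe the on-disk state when only the writes in `done` completed."""
    done = set(done)
    inode_points = "inode" in done    # I[v2] points to block 5
    bitmap_marks = "bitmap" in done   # bitmap says block 5 is allocated
    data_present = "data" in done     # block 5 really holds Db
    problems = []
    if inode_points != bitmap_marks:
        problems.append("inode and bitmap disagree")
    if bitmap_marks and not inode_points:
        problems.append("space leak")
    if inode_points and not data_present:
        problems.append("file reads garbage")
    if data_present and not inode_points and not bitmap_marks:
        problems.append("data lost, metadata consistent")
    return problems or ["consistent"]

# Print every possible crash outcome, from "nothing written" to "all written".
for n in range(len(WRITES) + 1):
    for done in combinations(WRITES, n):
        print(done, "->", classify(done))
```

Only the two extremes (no write or all three writes reached the disk) leave the file system both consistent and free of lost or garbage data, which is why the update needs to be made atomic.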

  6. File System Checker (1/3) • fsck is a UNIX tool for finding inconsistencies and repairing them • Superblock • Sanity checks, e.g. that the file system size is greater than the number of blocks allocated • If the superblock looks suspect, the system may decide to use an alternate copy of the superblock • Free blocks • Scanning the inodes and blocks to understand block allocation information • Producing a correct version of the allocation bitmap • If there is any inconsistency between the bitmaps and the inodes, fsck trusts the inodes • Inode state • Checking each inode for corruption or other problems (e.g. the inode type field) • If there are problems with the inode fields that cannot easily be fixed, fsck clears the inode
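The free-blocks pass can be pictured as recomputing the bitmap from the inodes alone. The sketch below assumes a toy layout in which each inode is a dictionary with a `pointers` list of direct block numbers; `rebuild_bitmap` is a hypothetical helper, not part of any real fsck.

```python
def rebuild_bitmap(inodes, num_blocks):
    """Return an allocation bitmap derived only from the inodes' block pointers."""
    bitmap = [False] * num_blocks
    for inode in inodes:
        for ptr in inode["pointers"]:
            if ptr is not None:
                bitmap[ptr] = True   # the inode is trusted over the old bitmap
    return bitmap

# The crash left I[v2] on disk, but the bitmap still marks only block 4.
inodes = [{"size": 2, "pointers": [4, 5, None, None]}]
stale_bitmap = [False, False, False, False, True, False, False, False]
fixed_bitmap = rebuild_bitmap(inodes, num_blocks=8)
print(fixed_bitmap != stale_bitmap)  # the stale bitmap is replaced
```

The stale bitmap is never consulted, which reflects the rule that fsck trusts the inodes when the two disagree.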

  7. File System Checker (2/3) • Inode links • Verifying the link count of each allocated inode • Scanning through the entire directory tree • If there is a mismatch, fsck fixes the count within the inode • If an allocated inode that no directory refers to is discovered, it is moved to the lost+found directory • Duplicates • Checking for duplicate pointers, i.e. two inodes that refer to the same block • The pointed-to block can be copied, giving each inode its own copy • Bad blocks • Checking for bad block pointers, e.g. pointers outside the valid range • fsck just removes the pointer
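The inode-links pass can be sketched as a comparison between stored link counts and the references found while walking the directory tree. This is a simplified illustration: `check_link_counts` and its input format are assumptions, and the root directory is ignored.

```python
from collections import Counter

def check_link_counts(link_counts, dir_entries):
    """link_counts: {inode number: stored link count} for allocated inodes.
    dir_entries: inode numbers, one per directory entry found in the tree walk.
    Returns (corrected counts, inodes that belong in lost+found)."""
    seen = Counter(dir_entries)
    corrected, orphans = {}, []
    for ino, stored in link_counts.items():
        if seen[ino] == 0:
            orphans.append(ino)         # allocated, but no directory refers to it
            corrected[ino] = stored     # left untouched in this sketch
        else:
            corrected[ino] = seen[ino]  # the directory walk wins over the stored count
    return corrected, orphans

corrected, orphans = check_link_counts({10: 1, 11: 3, 12: 1}, [10, 11, 11])
print(corrected, orphans)
```

Here inode 11 claims three links but only two directory entries refer to it, so its count is fixed, and inode 12 is unreachable and would be moved to lost+found.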

  8. File System Checker (3/3) • Directory checks • Performing additional integrity checks • “.” and “..” are the first entries • Each inode referred to in a directory is allocated • Disadvantages • Requires intricate knowledge of the file system • Too slow, because it scans the entire disk

  9. Journaling • When updating the disk • First write down a little log (the journal) describing what you are about to do • Then overwrite the structures in place • Types of journaling • Data journaling • Metadata journaling

  10. Data Journaling (1/2) • Journal write • Transaction begin: transaction identifier and information about the pending update • Contents: either physical logging (the exact physical contents of the update) or logical logging (a compact logical representation of the update) • Transaction end (TxE): transaction identifier • Checkpoint • Write the pending metadata and data updates to their final locations

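  11. Data Journaling (2/2) • Problem: if the whole transaction is issued at once, the disk may reorder the writes, and a crash during the writes to the journal can leave TxE on disk while Db is still garbage (journal: ... ?? TxE) • It looks like a valid transaction • If the system reboots and runs recovery, it will replay this transaction and copy the garbage to its final location • Solution: split the journal write into two steps • Journal write: transaction begin and contents • Journal commit: transaction end, written only after the journal write has completed • Checkpoint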
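The three-step protocol (journal write, journal commit, checkpoint) can be sketched with an in-memory journal. Everything here is an assumption made for illustration: the record tags `"TxB"`, `"DATA"`, and `"TxE"`, the `crash_after` parameter, and the use of a list and a dictionary in place of real disk regions.

```python
def run_transaction(journal, disk, txid, updates, crash_after=None):
    """updates: {block number: new contents}.
    crash_after names the last step that completes before a simulated crash:
    "journal_write", "journal_commit", or None for no crash."""
    # 1. Journal write: transaction begin plus the contents of every block.
    journal.append(("TxB", txid))
    for block, contents in updates.items():
        journal.append(("DATA", txid, block, contents))
    if crash_after == "journal_write":
        return
    # 2. Journal commit: the end record is issued only after step 1 is complete.
    journal.append(("TxE", txid))
    if crash_after == "journal_commit":
        return
    # 3. Checkpoint: write the updates to their final locations.
    for block, contents in updates.items():
        disk[block] = contents

journal, disk = [], {}
run_transaction(journal, disk, 1, {5: "Db", "inode": "I[v2]", "bitmap": "B[v2]"})
print(disk)
```

Because the end record is appended strictly after the contents, a journal that contains `("TxE", txid)` in this sketch always contains the complete transaction.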

  12. Recovery • Crash happens before journal commit • The pending update is simply skipped • Crash happens right after journal commit • Redo logging • Scan the log and look for transactions that have committed to the disk • These transactions are replayed • Crash happens at any point during checkpointing • Redo logging • Worst case • Some of the updates are performed again during recovery
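Redo logging during recovery can be sketched as a two-step scan: find the committed transactions, then replay their updates in log order. The record layout (`"TxB"`, `"DATA"`, `"TxE"` tuples) and the function name `recover` are hypothetical.

```python
def recover(journal, disk):
    """Replay every committed transaction; skip transactions without an end record."""
    committed = {rec[1] for rec in journal if rec[0] == "TxE"}
    for rec in journal:
        if rec[0] == "DATA" and rec[1] in committed:
            _, _, block, contents = rec
            disk[block] = contents   # redo, even if the checkpoint already did it
    return disk

journal = [
    ("TxB", 1), ("DATA", 1, 5, "Db"), ("DATA", 1, "inode", "I[v2]"), ("TxE", 1),
    ("TxB", 2), ("DATA", 2, 6, "Dc"),   # crash before the journal commit of tx 2
]
disk = {5: "Db"}                        # crash in the middle of checkpointing tx 1
print(recover(journal, disk))
```

Transaction 2 is skipped because it never committed, while block 5 is written a second time, which is the harmless worst case mentioned on the slide.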

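  13. Batching Log Updates (1/2) • File creation updates • inode bitmap • newly created inode of the file • data block of the parent directory • parent directory inode • Problem: creating two files (F1, F2) in the same directory P with one transaction each logs the shared blocks over and over (I[P], I'[P], I''[P], P, P', P'', plus I[F1] and I[F2])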

  14. Batching Log Updates (2/2) • Solution • File systems buffer all updates into a global transaction • The file system marks the in-memory inode bitmap, inodes of the files, directory data, and directory inode as dirty
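The effect of buffering updates into one global transaction can be illustrated by counting blocks. This is a toy model: `GlobalTransaction`, `create_file`, and the block names are assumptions, and the contents are placeholder strings.

```python
class GlobalTransaction:
    """Collects dirty in-memory blocks; each block is logged once at commit time."""
    def __init__(self):
        self.dirty = {}    # block name -> latest in-memory contents
        self.updates = 0   # how many individual updates were absorbed

    def mark_dirty(self, block, contents):
        self.dirty[block] = contents
        self.updates += 1

def create_file(tx, name):
    # A file creation touches the four structures listed on slide 13.
    tx.mark_dirty("inode bitmap", f"bit set for {name}")
    tx.mark_dirty(f"inode of {name}", "new inode")
    tx.mark_dirty("parent directory data", f"entry for {name} added")
    tx.mark_dirty("parent directory inode", f"updated after adding {name}")

tx = GlobalTransaction()
create_file(tx, "file1")
create_file(tx, "file2")
print(tx.updates, len(tx.dirty))
```

Two creations produce eight updates but only five distinct dirty blocks, because the bitmap, the directory data, and the directory inode are shared.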

  15. Making The Log Finite • Problems when the log fills up • The larger the log, the longer recovery takes • Once it is full, no further transactions can be committed • Circular log • File systems treat the log as a circular data structure, re-using it over and over • After a transaction has been checkpointed, the file system frees the space it was occupying • Resulting protocol • Journal write • Journal commit • Checkpoint • Free
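The interplay between a full log and the free step can be sketched with a bounded queue of live transactions. `CircularLog` is a hypothetical class that tracks only space accounting, not the actual block positions within the journal.

```python
from collections import deque

class CircularLog:
    """Tracks how much journal space is held by not-yet-freed transactions."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.live = deque()   # (txid, number of blocks), oldest first

    def commit(self, txid, nblocks):
        if self.used + nblocks > self.capacity:
            return False      # log full: the transaction cannot be committed
        self.live.append((txid, nblocks))
        self.used += nblocks
        return True

    def free_oldest(self):
        """Call once the oldest transaction has been checkpointed."""
        txid, nblocks = self.live.popleft()
        self.used -= nblocks
        return txid

log = CircularLog(capacity=4)
first = log.commit(1, 3)      # fits
blocked = log.commit(2, 2)    # does not fit until tx 1 is freed
freed = log.free_oldest()
retry = log.commit(2, 2)
print(first, blocked, freed, retry)
```

The second commit succeeds only after the first transaction has been checkpointed and its space freed.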

  16. Metadata Journaling • Journaling without data blocks • Only metadata goes to the journal; data blocks are written to the file system directly • Unordered metadata journaling • Data can be written at any time • Steps: journal metadata write, journal commit, checkpoint metadata, free • Ordered metadata journaling • Data is written before the metadata that refers to it is committed • Steps: data write, journal metadata write, journal commit, checkpoint metadata, free
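The ordered variant can be sketched as a fixed sequence in which the data block reaches its final location before any metadata is logged. The function name, the `"META"` record tag, and the returned `trace` list are assumptions made for this illustration.

```python
def ordered_metadata_update(journal, disk, txid, data, metadata):
    """data and metadata map block identifiers to new contents.
    Returns the steps in the order they were performed."""
    trace = []
    for block, contents in data.items():
        disk[block] = contents                    # data goes straight to its final location
    trace.append("data write")
    journal.append(("TxB", txid))
    for block, contents in metadata.items():
        journal.append(("META", txid, block, contents))
    trace.append("journal metadata write")
    journal.append(("TxE", txid))
    trace.append("journal commit")
    for block, contents in metadata.items():
        disk[block] = contents
    trace.append("checkpoint metadata")
    journal[:] = [rec for rec in journal if rec[1] != txid]   # release the journal space
    trace.append("free")
    return trace

journal, disk = [], {}
trace = ordered_metadata_update(
    journal, disk, 1, data={5: "Db"}, metadata={"inode": "I[v2]", "bitmap": "B[v2]"}
)
print(trace)
```

Since the data write precedes the commit, a committed inode in this sketch can never point to a block that has not been written.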

  17. Block Reuse (1/2) • A user adds an entry to directory foo • The contents of foo (block 1000) are written to the log, because directories are considered metadata • The user deletes everything in the directory as well as the directory itself, freeing up block 1000 for reuse • The user creates a new file, foobar, which reuses block 1000 for its data • Only the inode of foobar is journaled and committed to disk; its data is not • A crash occurs • During replay, the recovery process replays everything in the log, including the old directory contents of block 1000, which overwrite foobar's data

  18. Block Reuse (2/2) • Solutions • Never reuse blocks until the delete of those blocks has been checkpointed out of the journal • Use a revoke record (a new type of journal record), written when the block is deleted • When replaying the journal, the system first scans for revoke records • Any revoked data is never replayed
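Replay with revoke records can be sketched as two passes over the journal. The record tags `"WRITE"` and `"REVOKE"`, the tuple layout, and the position-based comparison are assumptions of this sketch; the slide only states that revoked data is never replayed.

```python
def replay_with_revokes(journal, disk):
    committed = {rec[1] for rec in journal if rec[0] == "TxE"}
    # Pass 1: remember where each block was last revoked by a committed transaction.
    revoked_at = {}
    for pos, rec in enumerate(journal):
        if rec[0] == "REVOKE" and rec[1] in committed:
            revoked_at[rec[2]] = pos
    # Pass 2: replay committed writes unless a later revoke covers the block.
    for pos, rec in enumerate(journal):
        if rec[0] == "WRITE" and rec[1] in committed:
            _, _, block, contents = rec
            if revoked_at.get(block, -1) > pos:
                continue   # stale contents of a block that was later deleted
            disk[block] = contents
    return disk

journal = [
    ("TxB", 1), ("WRITE", 1, 1000, "directory foo"), ("TxE", 1),
    ("TxB", 2), ("REVOKE", 2, 1000), ("TxE", 2),          # foo deleted
    ("TxB", 3), ("WRITE", 3, 7, "inode of foobar"), ("TxE", 3),
]
disk = {1000: "foobar user data"}   # written directly, never journaled
print(replay_with_revokes(journal, disk))
```

Block 1000 keeps the data of foobar because the logged directory contents were revoked, while the inode of foobar is replayed as usual.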

  19. Other Approaches (1/2) • Soft updates • Orders all writes to the file system to ensure that the on-disk structures are never left in an inconsistent state • Example: writing a pointed-to data block to disk before the inode that points to it • The inode never points to garbage • Implementation can be a challenge • Requires intricate knowledge of the exact file system structures • Adds a fair amount of complexity to the system • Copy-on-write (COW) • Never overwrites data in place; it places new updates in previously unused locations on disk • After a number of updates are completed, COW file systems flip the root structure of the file system to include pointers to the newly updated structures • Makes keeping the file system consistent straightforward
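The copy-on-write idea can be sketched with an append-only list of blocks and a root index. This is a deliberately tiny model: `cow_update` is a hypothetical helper, and a whole "tree" is collapsed into a single dictionary per block.

```python
def cow_update(blocks, root, key, value):
    """Write an updated copy to an unused location and return its index.
    The caller flips the root only after the new copy is safely written."""
    new_tree = dict(blocks[root])   # copy; the old version is never modified
    new_tree[key] = value
    blocks.append(new_tree)         # previously unused location
    return len(blocks) - 1

blocks = [{"size": 1, "pointers": (4,)}]
root = 0
new_root = cow_update(blocks, root, "size", 2)
# A crash at this point leaves `root` pointing at the old, still consistent version.
old_version = blocks[root]
root = new_root                     # the root flip is the single switching step
print(old_version, blocks[root])
```

Until the root is flipped, readers see only the old structure, so there is no intermediate state that needs repair.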

  20. Other Approaches (2/2) • Backpointer-based consistency • No ordering is enforced between writes • To achieve consistency, an additional back pointer is added to every block • Each data block has a reference to the inode to which it belongs • Optimistic crash consistency • Issues as many writes to disk as possible • Uses a generalized form of the transaction checksum
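A backpointer check can be sketched as follows: a forward pointer is accepted only when the block it names points back to the same inode. The dictionary layout with a `backpointer` field and the name `consistent_pointers` are assumptions for illustration.

```python
def consistent_pointers(inode_no, pointers, blocks):
    """Return the forward pointers whose target block refers back to inode_no."""
    ok = []
    for ptr in pointers:
        if ptr is None:
            continue
        block = blocks.get(ptr)
        if block is not None and block["backpointer"] == inode_no:
            ok.append(ptr)
    return ok

blocks = {
    4: {"backpointer": 1, "data": "Da"},
    5: {"backpointer": 9, "data": "stale"},   # the new Db never reached the disk
}
valid = consistent_pointers(1, [4, 5, None, None], blocks)
print(valid)
```

Block 5 still carries the back pointer of its previous owner, so the mismatch is detected when the file is accessed, with no ordering enforced between the writes.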

  21. Summary • We have introduced the problem of crash consistency • Solutions • FSCK • Journaling • Data journaling • Metadata journaling

  22. Summary Data journaling

  23. Summary Metadata journaling
