Learn about metadata, failure recovery strategies, and transaction approach in file system reliability to maintain consistent state and handle system crashes effectively.
Transactions and Reliability Andy Wang Operating Systems COP 4610 / CGS 5765
Motivation • File systems have lots of metadata: • Free blocks, directories, file headers, indirect blocks • Metadata is heavily cached for performance
Problem • System crashes • OS needs to ensure that the file system does not reach an inconsistent state • Example: move a file between directories • Remove the file from the old directory • Add the file to the new directory • What happens when a crash occurs in the middle?
UNIX File System (Ad Hoc Failure-Recovery) • Metadata handling: • Uses a synchronous write-through caching policy • A call to update metadata does not return until the changes are propagated to disk • Updates are ordered • When crashes occur, run fsck to repair in-progress operations
Some Examples of Metadata Handling • Undo effects not yet visible to users • If a new file is created, but not yet added to the directory • Delete the file • Continue effects that are visible to users • If file blocks are already allocated, but not recorded in the bitmap • Update the bitmap
UFS User Data Handling • Uses a write-back policy • Modified blocks are written to disk at 30-second intervals • Unless a user issues the sync system call • Data updates are not ordered • In many cases, consistent metadata is good enough
Example: Vi • Vi saves changes by doing the following 1. Writes the new version in a temp file • Now we have old_file and new_temp file 2. Moves the old version to a different temp file • Now we have new_temp and old_temp 3. Moves the new version into the real file • Now we have new_file and old_temp 4. Removes the old version • Now we have new_file
Example: Vi • When crashes occur • Looks for the leftover files • Moves forward or backward depending on the integrity of files
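The four save steps and the crash-recovery check can be sketched in Python. This is only an illustration, not Vi's actual code: the `.new_temp` and `.old_temp` suffixes and the function names `safe_save` and `recover` are hypothetical, and the recovery policy shown (roll back when only the old copy survives) is one possible choice.

```python
import os

def safe_save(path: str, new_contents: str) -> None:
    """Replace `path` using the four steps from the slide."""
    new_temp = path + ".new_temp"   # hypothetical temp-file names
    old_temp = path + ".old_temp"

    # 1. Write the new version to a temp file and force it to disk.
    with open(new_temp, "w") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())

    # 2. Move the old version to a different temp file (if it exists).
    if os.path.exists(path):
        os.replace(path, old_temp)

    # 3. Move the new version into the real file name.
    os.replace(new_temp, path)

    # 4. Remove the old version.
    if os.path.exists(old_temp):
        os.remove(old_temp)

def recover(path: str) -> None:
    """After a crash, look for leftover files and move forward or backward."""
    new_temp = path + ".new_temp"
    old_temp = path + ".old_temp"
    if not os.path.exists(path) and os.path.exists(old_temp):
        # Crash between steps 2 and 3: move backward to the old version.
        os.replace(old_temp, path)
    # Whatever temp files remain are leftovers; the real file is intact.
    for leftover in (new_temp, old_temp):
        if os.path.exists(leftover):
            os.remove(leftover)
```

At every point in the sequence at least one complete copy of the file exists on disk, which is what makes recovery possible.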
Transaction Approach • A transaction groups operations as a unit, with the following characteristics: • Atomic: all operations either happen or they do not (no partial operations) • Serializable: transactions appear to happen one after the other • Durable: once a transaction happens, it is recoverable and can survive crashes
More on Transactions • A transaction is not done until it is committed • Once committed, a transaction is durable • If a transaction fails to complete, it must roll back as if it did not happen at all • Critical sections are atomic and serializable, but not durable
Transaction Implementation (One Thread) • Example: money transfer • Begin transaction • x = x - 1; • y = y + 1; • Commit
Transaction Implementation (One Thread) • Common implementations involve the use of a log, a journal that is never erased • A file system uses a write-ahead log to track all transactions
Transaction Implementation (One Thread) • Once accounts of x and y are on a log, the log is committed to disk in a single write • Actual changes to those accounts are done later
Transaction Illustrated • In memory: x = 1; y = 1; • On disk: x = 1; y = 1;
Transaction Illustrated • In memory: x = 0; y = 2; • On disk: x = 1; y = 1;
Transaction Illustrated • Commit the log to disk before updating the actual values on disk • Log: begin transaction; old x: 1, new x: 0; old y: 1, new y: 2; commit • In memory: x = 0; y = 2; • On disk: x = 1; y = 1;
Transaction Steps • Mark the beginning of the transaction • Log the changes in account x • Log the changes in account y • Commit • Modify account x on disk • Modify account y on disk • <delete the transaction log entry>
Scenarios of Crashes • If a crash occurs after the commit • Replays the log to update accounts • If a crash occurs before the commit • Rolls back and discards the transaction • A crash cannot occur during the commit • Commit is built as an atomic operation • e.g. writing a single sector on disk
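The two crash scenarios can be illustrated with a small recovery routine. This is a minimal sketch under stated assumptions: the log is modeled as a Python list of tuples, the disk as a dictionary, and the entry layout (`"begin"`, `"update"`, `"commit"`) and the function name `recover` are hypothetical.

```python
def recover(log, disk):
    """Apply the new values of every committed transaction to `disk`."""
    pending = []                       # updates seen since the last "begin"
    for entry in log:
        kind = entry[0]
        if kind == "begin":
            pending = []
        elif kind == "update":
            _, name, old, new = entry
            pending.append((name, new))
        elif kind == "commit":
            # Replay: writing logged values (not deltas) is safe to repeat.
            for name, new in pending:
                disk[name] = new
            pending = []
    # Updates still in `pending` were never committed, so they are discarded.
    return disk

log = [("begin",), ("update", "x", 1, 0), ("update", "y", 1, 2), ("commit",)]

# Crash after the commit, with only x already modified on disk: replay the log.
after_commit = recover(log, {"x": 0, "y": 1})

# Crash before the commit record reached the log: the disk was never touched.
before_commit = recover(log[:-1], {"x": 1, "y": 1})
```

Because the log stores the new values rather than the operations, replaying it a second time leaves the accounts unchanged.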
Two-Phase Locking (Multiple Threads) • Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable) • Solution: two-phase locking 1. Acquire all locks 2. Perform updates and release all locks • Thread A cannot see thread B’s changes until thread B commits and releases locks
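A minimal sketch of the two phases using Python's `threading.Lock`. The account table, the `transfer` function, and the choice to acquire locks in sorted order (a common way to avoid deadlock, not something the slide specifies) are assumptions for illustration.

```python
import threading

accounts = {"x": 100, "y": 100}
locks = {name: threading.Lock() for name in accounts}

def transfer(src: str, dst: str, amount: int) -> None:
    # Phase 1: acquire all locks the transaction needs.
    # Sorted order is an added assumption that prevents deadlock.
    needed = sorted({src, dst})
    for name in needed:
        locks[name].acquire()
    try:
        # Perform the updates while holding every lock.
        accounts[src] -= amount
        accounts[dst] += amount
    finally:
        # Phase 2: release all locks only after the updates are done.
        for name in reversed(needed):
            locks[name].release()
```

No other thread can observe the state in which `src` has been debited but `dst` has not yet been credited.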
Transactions in File Systems • Almost all file systems built since 1985 use write-ahead logging + Eliminates running fsck after a crash + Write-ahead logging provides reliability - All modifications need to be written twice
Log-Structured File System (LFS) • If logging is so great, why don’t we treat everything as log entries? • Log-structured file system • Everything is a log entry (file headers, directories, data blocks) • Write the log only once • Use version stamps to distinguish between old and new entries
More on LFS • New log entries are always appended to the end of the existing log • All writes are sequential • Seeks only occur during reads • Not so bad due to temporal locality and caching • Problem: • Need to create contiguous space all the time
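The append-only idea, version stamps, and the need to reclaim space can be sketched with an in-memory log. This is a toy model only: the class name `LogFS`, its methods, and the use of a simple counter as the version stamp are hypothetical and do not describe any real LFS implementation.

```python
class LogFS:
    """Toy log-structured store: every write is appended, never overwritten."""

    def __init__(self):
        self.entries = []        # the log: (version, key, value) tuples
        self.next_version = 0    # version stamp; a plain counter is an assumption

    def write(self, key, value):
        # New entries always go at the end of the log (sequential write).
        self.entries.append((self.next_version, key, value))
        self.next_version += 1

    def read(self, key):
        # Scan from the newest entry backward; the highest version wins.
        for version, k, value in reversed(self.entries):
            if k == key:
                return value
        raise KeyError(key)

    def clean(self):
        # Reclaim space by keeping only the newest entry for each key.
        latest = {}
        for version, key, value in self.entries:
            latest[key] = (version, key, value)
        self.entries = sorted(latest.values())

fs = LogFS()
fs.write("a", 1)
fs.write("b", 2)
fs.write("a", 3)          # supersedes the first entry for "a"
```

The `clean` step corresponds to the problem noted on the slide: stale entries must be compacted away to keep contiguous free space available for new appends.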
RAID and Reliability • So far, we assume that we have a single disk • What if we have multiple disks? • The chance of a single-disk failure increases • RAID: redundant array of independent disks • Standard way of organizing disks and classifying the reliability of multi-disk systems • General methods: data duplication, parity, and error-correcting codes (ECC)
RAID 0 • No redundancy • Uses block-level striping across disks • i.e., 1st block stored on disk 1, 2nd block stored on disk 2 • Failure of any one disk causes data loss
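Block-level striping amounts to a simple address mapping. The sketch below uses 0-based numbering (the slide counts from 1), and the function name `locate` is hypothetical.

```python
def locate(block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)."""
    disk = block % num_disks       # consecutive blocks rotate across the disks
    offset = block // num_disks    # position of the block within its disk
    return disk, offset

first = locate(0, 4)    # logical block 0
sixth = locate(5, 4)    # logical block 5
```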
Non-Redundant Disk Array Diagram (RAID Level 0) • Diagram labels: open(foo), read(bar), write(zoo), File System
Mirrored Disks (RAID Level 1) • Each disk has a second disk that mirrors its contents • Writes go to both disks + Reliability is doubled + Read access faster - Write access slower - Expensive and inefficient
Mirrored Disk Diagram (RAID Level 1) • Diagram labels: open(foo), read(bar), write(zoo), File System
Memory-Style ECC (RAID Level 2) • Some disks in array are used to hold ECC + More efficient than mirroring + Can correct, not just detect, errors - Still fairly inefficient • e.g., 4 data disks require 3 ECC disks
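The "4 data disks require 3 ECC disks" figure matches a Hamming-style single-error-correcting code, where r check bits can cover d data bits when 2^r >= d + r + 1. Assuming that is the code in use (the slide does not name it), the count can be computed as follows; the function name is hypothetical.

```python
def ecc_disks_needed(data_disks: int) -> int:
    """Smallest r with 2**r >= data_disks + r + 1 (Hamming bound for SEC codes)."""
    r = 0
    while 2 ** r < data_disks + r + 1:
        r += 1
    return r

for_four = ecc_disks_needed(4)
for_ten = ecc_disks_needed(10)
```

The overhead shrinks as the array grows, but it stays well above the single extra disk that parity-based levels need.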
Memory-Style ECC Diagram (RAID Level 2) • Diagram labels: open(foo), read(bar), write(zoo), File System
Bit-Interleaved Parity (RAID Level 3) • Uses bit-level striping across disks • i.e., 1st bit stored on disk 1, 2nd bit stored on disk 2 • One disk in the array stores parity for the other disks + More efficient than Levels 1 and 2 - Parity disk doesn’t add bandwidth
Parity Method • Disk 1: 1001 • Disk 2: 0101 • Disk 3: 1000 • Parity: 0100 = 1001 xor 0101 xor 1000 • To recover disk 2 • Disk 2: 0101 = 1001 xor 1000 xor 0100
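The same numbers can be checked in Python, treating each disk's contents as a 4-bit integer. The helper name `parity` is hypothetical.

```python
from functools import reduce
from operator import xor

def parity(blocks):
    """Xor all blocks together; 0 is the identity for xor."""
    return reduce(xor, blocks, 0)

disks = [0b1001, 0b0101, 0b1000]
p = parity(disks)

# Disk 2 is lost: xor the surviving disks with the parity to rebuild it.
recovered_disk2 = parity([disks[0], disks[2], p])
```

Recovery works for any single lost disk because xor-ing a value in twice cancels it out.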
Bit-Interleaved RAID Diagram (Level 3) • Diagram labels: open(foo), read(bar), write(zoo), File System
Block-Interleaved Parity (RAID Level 4) • Like bit-interleaved, but data is interleaved in blocks + More efficient data access than level 3 - Parity disk can be a bottleneck for small writes
To update just one block • Do we need to read in the entire stripe? • parity = block1 xor block2 xor block3 • parity’ = block1’ xor block2 xor block3 • Xor the two equations • (anything xor anything = 0) • (anything xor 0 = anything) • parity xor parity’ = block1 xor block1’ • Xor both sides with parity • parity’ = block1’ xor block1 xor parity
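A quick numeric check of the small-write shortcut, using made-up 4-bit block values. The variable names are illustrative only.

```python
block1, block2, block3 = 0b1001, 0b0101, 0b1000
old_parity = block1 ^ block2 ^ block3

new_block1 = 0b1111   # the one block being updated

# Shortcut: only the old block, the new block, and the old parity are needed.
new_parity = new_block1 ^ block1 ^ old_parity

# Full recomputation over the whole stripe, for comparison.
full_parity = new_block1 ^ block2 ^ block3
```

The shortcut never reads block2 or block3, so the cost of a small write does not grow with the number of disks in the stripe.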
Block-Interleaved Parity Diagram (RAID Level 4) • Diagram labels: open(foo), read(bar), write(zoo), File System
Block-Interleaved Distributed-Parity (RAID Level 5) • Arguably the most general-purpose level of RAID • Spreads the parity out over all disks • No parity disk bottleneck • All disks contribute read bandwidth • Requires 4 I/Os for small writes • Read old data, read old parity, write new data, write new parity
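One way to spread parity over all disks is to rotate the parity block by one disk per stripe. The rotation below (parity starts on the last disk and moves left) is an assumed layout for illustration; real arrays use several different rotation schemes, and the function names are hypothetical.

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Disk index holding the parity block for a given stripe (assumed layout)."""
    return (num_disks - 1 - stripe) % num_disks

def data_disks(stripe: int, num_disks: int) -> list[int]:
    """Disk indices holding data blocks for a given stripe."""
    p = parity_disk(stripe, num_disks)
    return [d for d in range(num_disks) if d != p]

# Over num_disks consecutive stripes, every disk holds parity exactly once.
placement = [parity_disk(s, 4) for s in range(4)]
```

Since parity updates land on a different disk for each stripe, small writes to different stripes no longer queue up behind a single parity disk.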
Block-Interleaved Distributed-Parity Diagram (RAID Level 5) • Diagram labels: open(foo), read(bar), write(zoo), File System