More on File Systems
CS-502, Operating Systems, Fall 2007
(Slides include materials from Operating System Concepts, 7th ed., by Silberschatz, Galvin, & Gagne and from Modern Operating Systems, 2nd ed., by Tanenbaum)
Reading Assignments
• Silberschatz, §12.7 & §12.8
  • §12.7 – RAID systems
  • §12.8 – Stable Storage
• Silberschatz, §11.8
  • Log-structured file systems (aka journaling file systems)
• Silberschatz, §21.7
  • Linux file systems, including journaling
Mapping Files to Virtual Memory
• Instead of “reading” from disk into virtual memory, why not simply use the file as the swapping storage for certain VM pages?
• Called mapping
• Page tables in kernel point to disk blocks of the file
Memory-Mapped Files
• Memory-mapped file I/O allows file I/O to be treated as routine memory access by mapping a disk block to a page in memory
• A file is initially “read” using demand paging. A page-sized portion of the file is read from the file system into a physical page. Subsequent reads from and writes to the file are treated as ordinary memory accesses.
• Simplifies file access by allowing the application to simply access memory rather than be forced to use read() & write() calls to the file system.
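As a rough illustration of the idea on this slide, the sketch below patches a file through a memory mapping using Python's standard mmap module. The helper name patch_file_via_mmap and the scratch file are made up for the example; the slides do not prescribe any particular API.

```python
import mmap
import os
import tempfile


def patch_file_via_mmap(path: str, offset: int, data: bytes) -> None:
    """Overwrite bytes in a file through a memory mapping instead of write()."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the OS brings pages in on demand.
        with mmap.mmap(f.fileno(), 0) as view:
            view[offset:offset + len(data)] = data  # an ordinary memory store
            view.flush()                            # push dirty pages back to the file


# Usage: create a small scratch file, patch it in place, read it back normally.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"hello, world")
os.close(fd)
patch_file_via_mmap(demo_path, 7, b"WORLD")
with open(demo_path, "rb") as f:
    demo_contents = f.read()
os.remove(demo_path)
```

The store into the mapped view replaces an explicit seek-and-write pair; the final ordinary read shows that the change reached the file.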
Memory-Mapped Files (continued)
• A tantalizingly attractive notion, but …
  • Cannot use C/C++ pointers within mapped data structure
  • Corrupted data structures likely to persist in file
  • Recovery after a crash is more difficult
• Don’t really save anything in terms of
  • Programming energy
  • Thought processes
  • Storage space & efficiency
Memory-Mapped Files (continued)
Nevertheless, the idea has its uses
• Simpler implementation of file operations
  • read(), write() are memory-to-memory operations
  • seek() is simply changing a pointer, etc.
  • Called memory-mapped I/O
• Shared Virtual Memory among processes
Shared Virtual Memory
Shared Virtual Memory (continued)
• Supported in
  • Apollo DOMAIN
  • Windows XP
  • Linux (shmget, etc.)
• Synchronization is the responsibility of the sharing applications
  • OS retains no knowledge
  • Few (if any) synchronization primitives between processes in separate address spaces
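A minimal sketch of a shared segment, assuming Python 3.8 or later. It uses the standard multiprocessing.shared_memory module, which creates named segments rather than calling shmget, so it illustrates the concept and not the System V interface named on the slide. For brevity both attachments live in one process; a second process would attach by the same name.

```python
from multiprocessing import shared_memory

# Create a named 16-byte segment, then attach to it a second time by name,
# the way a cooperating process would.
segment = shared_memory.SharedMemory(create=True, size=16)
peer = shared_memory.SharedMemory(name=segment.name)

segment.buf[:5] = b"hello"          # store through the first attachment
seen_by_peer = bytes(peer.buf[:5])  # load through the second attachment

# Nothing here synchronizes the two views; as the slide notes, that is
# left to the sharing applications.
peer.close()
segment.close()
segment.unlink()                    # remove the segment once nobody needs it
```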
Questions?
Problem
• Question:
  • If mean time to failure of a disk drive is 100,000 hours,
  • and if your system has 100 identical disks,
  • what is the mean time between drive replacements?
• Answer:
  • 1000 hours (i.e., 41.67 days ≈ 6 weeks)
• I.e.:
  • You lose 1% of your data every 6 weeks!
  • But don’t worry – you can restore most of it from backup!
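The arithmetic behind the answer can be checked in a few lines. The function name is made up for the example, and the calculation assumes the disks fail independently and at the same rate.

```python
HOURS_PER_DAY = 24


def mean_time_between_replacements(disk_mttf_hours: float, n_disks: int) -> float:
    """Expected hours between failures somewhere in a pool of identical disks."""
    return disk_mttf_hours / n_disks


pool_hours = mean_time_between_replacements(100_000, 100)  # 1000 hours
pool_days = pool_hours / HOURS_PER_DAY                     # about 41.67 days
```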
Can we do better?
• Yes, mirrored
  • Write every block twice, on two separate disks
  • Mean time between simultaneous failure of both disks is >57,000 years
• Can we do even better?
  • E.g., use fewer extra disks?
  • E.g., get more performance?
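The slide does not show where the 57,000-year figure comes from. One way to arrive at it, assuming independent failures and a mean time to repair of 10 hours (the repair time is an assumption, not stated on the slide), is the usual approximation MTTF² / (2 × MTTR):

```python
HOURS_PER_YEAR = 24 * 365


def mirrored_mean_time_to_data_loss(disk_mttf_hours: float, repair_hours: float) -> float:
    """Approximate hours until both disks of a mirrored pair are down at once."""
    return disk_mttf_hours ** 2 / (2 * repair_hours)


# 100,000-hour disks as in the earlier problem; the 10-hour repair time is assumed.
loss_hours = mirrored_mean_time_to_data_loss(100_000, 10)
loss_years = loss_hours / HOURS_PER_YEAR
```

With these inputs the result is 500 million hours, a little over 57,000 years.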
RAID – Redundant Array of Inexpensive Disks
• Distribute a file system intelligently across multiple disks to
  • Maintain high reliability and availability
  • Enable fast recovery from failure
  • Increase performance
“Levels” of RAID
• Level 0 – non-redundant striping of blocks across disks
• Level 1 – simple mirroring
• Level 2 – striping of bytes or bits with ECC
• Level 3 – Level 2 with parity, not ECC
• Level 4 – Level 0 with parity block
• Level 5 – Level 4 with distributed parity blocks
RAID Level 0 – Simple Striping
[Figure: stripes 0–11 laid out across the disks of the array]
• Each stripe is one block or a group of contiguous blocks
• Block/group i is on disk (i mod n)
• Advantage
  • Read/write n blocks in parallel; n times bandwidth
• Disadvantage
  • No redundancy at all. System MTBF is 1/n of disk MTBF!
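The placement rule "block i is on disk (i mod n)" can be sketched as follows. The function names are illustrative, and the choice of 4 disks for the 12 stripes is an assumption, since the number of disks in the figure is not recoverable from the transcript.

```python
from typing import List, Tuple


def raid0_locate(block: int, n_disks: int) -> Tuple[int, int]:
    """Return (disk, offset on that disk) for a logical block under striping."""
    return block % n_disks, block // n_disks


def raid0_layout(n_blocks: int, n_disks: int) -> List[List[int]]:
    """List the logical blocks stored on each disk, in on-disk order."""
    disks: List[List[int]] = [[] for _ in range(n_disks)]
    for block in range(n_blocks):
        disk, _ = raid0_locate(block, n_disks)
        disks[disk].append(block)
    return disks


layout = raid0_layout(12, 4)  # 12 stripes as on the slide; 4 disks is assumed
```

Consecutive blocks land on different disks, which is what lets n of them be transferred in parallel.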
RAID Level 1 – Striping and Mirroring
[Figure: stripes 0–11 laid out across one set of disks, with each stripe repeated on a second, identical set]
• Each stripe is written twice
  • Two separate, identical disks
• Block/group i is on disks (i mod 2n) and ((i + n) mod 2n)
• Advantages
  • Read/write n blocks in parallel; n times bandwidth
  • Redundancy: System MTBF = (Disk MTBF)² at twice the cost
  • Failed disk can be replaced by copying
• Disadvantage
  • A lot of extra disks for much more reliability than we need
RAID Levels 2 & 3
• Bit- or byte-level striping
• Requires synchronized disks
  • Highly impractical
• Requires fancy electronics
  • For ECC calculations
• Not used; academic interest only
• See Silberschatz, §12.7.3 (pp. 471-472)
Observation
• When a disk or stripe is read incorrectly, we know which one failed!
• Conclusion:
  • A simple parity disk can provide very high reliability
  • (unlike simple parity in memory)
RAID Level 4 – Parity Disk
[Figure: stripes 0–11 on the data disks, with parity 0-3, parity 4-7, and parity 8-11 on a separate parity disk]
• parity 0-3 = stripe 0 xor stripe 1 xor stripe 2 xor stripe 3
• n stripes plus parity are written/read in parallel
• If any disk/stripe fails, it can be reconstructed from others
  • E.g., stripe 1 = stripe 0 xor stripe 2 xor stripe 3 xor parity 0-3
• Advantages
  • n times read bandwidth
  • System MTBF = (Disk MTBF)² at 1/n additional cost
  • Failed disk can be reconstructed “on-the-fly” (hot swap)
  • Hot expansion: simply add n + 1 disks all initialized to zeros
• However
  • Writing requires read-modify-write of parity stripe ⇒ only 1x write bandwidth
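The parity and reconstruction formulas above can be tried out directly. This is a small sketch with made-up 4-byte stripe contents; xor_blocks is a helper name chosen for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


stripes = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # stripes 0-3, made-up contents
parity = xor_blocks(*stripes)                   # parity 0-3

# Pretend the disk holding stripe 1 failed: rebuild it from the survivors.
rebuilt = xor_blocks(stripes[0], stripes[2], stripes[3], parity)
```

Because XOR is its own inverse, the same routine serves for computing parity and for rebuilding any single missing stripe.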
RAID Level 5 – Distributed Parity
[Figure: stripes 0–15 with parity 0-3, parity 4-7, parity 8-11, and parity 12-15 placed on different disks]
• Parity calculation is same as RAID Level 4
• Advantages & Disadvantages – Mostly same as RAID Level 4
• Additional advantages
  • Avoids beating up on parity disk
  • Some writes in parallel (if no contention for parity drive)
• Writing individual stripes (RAID 4 & 5)
  • Read existing stripe and existing parity
  • Recompute parity
  • Write new stripe and new parity
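The "recompute parity" step for a single-stripe write needs only the old stripe, the old parity, and the new data, because XOR-ing the old stripe out of the parity and the new one in gives the same result as recomputing over all stripes. A small sketch with made-up contents and helper names:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity after replacing one stripe, without reading the other stripes."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


stripes = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # made-up contents
parity = bytes(4)
for stripe in stripes:
    parity = xor_bytes(parity, stripe)

# Overwrite stripe 2 using the read-modify-write sequence from the slide.
new_stripe = b"ZZZZ"
updated_parity = small_write_parity(stripes[2], parity, new_stripe)
stripes[2] = new_stripe

# Cross-check against a full recomputation over all stripes.
full_recompute = bytes(4)
for stripe in stripes:
    full_recompute = xor_bytes(full_recompute, stripe)
```

Each small write therefore costs two reads and two writes, whatever the number of disks in the array.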
RAID 4 & 5
• Very popular in data centers
  • Corporate and academic servers
• Built-in support in Windows XP and Linux
  • Connect a group of disks to fast SCSI port (320 MB/sec bandwidth)
  • OS RAID support does the rest!
• Other RAID variations also available
New Topic
Incomplete Operations
• Problem – how to protect against disk write operations that don’t finish
  • Power or CPU failure in the middle of a block
  • Related series of writes interrupted before all are completed
• Examples:
  • Database update of charge and credit
  • RAID 1, 4, 5 failure between redundant writes
Solution (part 1) – Stable Storage
• Write everything twice to separate disks
  • Be sure 1st write does not invalidate previous 2nd copy
  • RAID 1 is okay; RAID 4/5 not okay!
  • Read blocks back to validate; then report completion
• Reading both copies
  • If 1st copy okay, use it – i.e., newest value
  • If 2nd copy different or bad, update it with 1st copy
  • If 1st copy is bad, update it with 2nd copy – i.e., old value
Stable Storage (continued)
• Crash recovery
  • Scan disks, compare corresponding blocks
  • If one is bad, replace with good one
  • If both good but different, replace 2nd with 1st copy
• Result:
  • If 1st block is good, it contains latest value
  • If not, 2nd block still contains previous value
• An abstraction of an atomic disk write of a single block
  • Uninterruptible by power failure, etc.
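The write and recovery rules for stable storage can be modeled in a few lines. This is a toy sketch, not a real driver: two dictionaries stand in for the two disks, a CRC stands in for whatever mechanism tells a good block from a bad one (the slides do not say how that is detected), and the read-back verification step is omitted. The class and method names are invented for the example.

```python
import zlib


class StableStore:
    """Toy model of stable storage: two dicts stand in for two separate disks."""

    def __init__(self):
        self.copies = [{}, {}]  # block number -> (data, checksum)

    @staticmethod
    def _is_good(record):
        return record is not None and zlib.crc32(record[0]) == record[1]

    def write(self, block, data, crash_after_first=False):
        record = (data, zlib.crc32(data))
        self.copies[0][block] = record  # 1st copy; the old 2nd copy is still intact
        if crash_after_first:
            return                      # simulated power failure between the writes
        self.copies[1][block] = record  # 2nd copy

    def recover(self):
        """Scan both disks and make corresponding blocks agree."""
        for block in set(self.copies[0]) | set(self.copies[1]):
            first = self.copies[0].get(block)
            second = self.copies[1].get(block)
            if self._is_good(first):
                self.copies[1][block] = first   # 1st copy holds the latest value
            elif self._is_good(second):
                self.copies[0][block] = second  # fall back to the previous value

    def read(self, block):
        for copy in self.copies:
            record = copy.get(block)
            if self._is_good(record):
                return record[0]
        raise IOError(f"block {block}: both copies are bad")


store = StableStore()
store.write(7, b"old balance")
store.write(7, b"new balance", crash_after_first=True)  # 2nd copy never written
store.recover()

# Torn write: the 1st copy of block 8 is garbage whose checksum cannot match.
store.write(8, b"old value")
store.copies[0][8] = (b"new va", zlib.crc32(b"new va") ^ 1)
store.recover()
```

After recovery, block 7 holds the new value on both disks and block 8 has reverted to the old value on both, so each write appears to have happened either completely or not at all.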
What about more complex disk operations?
• E.g., file create operation involves
  • Allocating free blocks
  • Constructing and writing i-node
    • Possibly multiple i-node blocks
  • Reading and updating directory
  • Updating free list and storing it back onto disk
• What if system crashes with the sequence only partly completed?
• Answer: inconsistent data structures on disk
Solution (Part 2) – Journaling File System
• Make changes to cached copies in memory
• Collect together all changed blocks
  • Including i-nodes and directory blocks
• Write to log file (aka journal file)
  • A circular buffer on disk
  • Fast, contiguous write
• Update log file pointer in stable storage
• Later: Play back log file to actually update directories, i-nodes, free list, etc.
Journaling File System – Crash Recovery
• If crash occurs before log pointer is updated
  • File system reverts to previous state
  • Contents of log discarded
• If crash occurs after log pointer is updated but before log is replayed
  • Replay log at system restart
  • File system reflects updated contents
• …
Journaling File System – Crash Recovery
• …
• If crash occurs during replay of log
  • Replay log again at system restart
  • Replaying log multiple times does not hurt
• If replay succeeds, update log pointer in stable storage
Journaling File System (continued)
• What if a process wants to use blocks that are currently in the log and not replayed?
  • Log is a cache of disk blocks
  • Must check there first for valid contents
• Further updates are added to the log after current log pointer
  • Just as if they had been in their original places
• Log pointer can be updated in stable storage after each set of updates
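The logging, crash-recovery, and read-through rules from the journaling slides can be put together in a toy model. It is a sketch under simplifying assumptions: a Python list replaces the circular on-disk buffer, a plain integer replaces the log pointer in stable storage, and crashes are simulated with flags. All names are invented for the example.

```python
class JournaledDisk:
    """Toy model: dicts and lists stand in for disk blocks and the on-disk log."""

    def __init__(self):
        self.blocks = {}    # home locations: block number -> contents
        self.log = []       # journal entries: (block number, contents)
        self.committed = 0  # log pointer (assumed to live in stable storage)

    def append(self, updates, commit=True):
        """Write a batch of changed blocks to the log, then advance the log pointer."""
        self.log.extend(updates)
        if commit:          # commit=False models a crash before the pointer update
            self.committed = len(self.log)

    def read(self, block):
        """The log acts as a cache: the newest committed entry wins over the home copy."""
        for logged_block, data in reversed(self.log[:self.committed]):
            if logged_block == block:
                return data
        return self.blocks.get(block)

    def replay(self, crash_after=None):
        """Copy committed entries to their home locations; optionally 'crash' part-way."""
        for i, (block, data) in enumerate(self.log[:self.committed]):
            if crash_after is not None and i == crash_after:
                return False           # crash: log and pointer are left untouched
            self.blocks[block] = data  # idempotent, so repeating it is harmless
        self.log.clear()               # replay finished: reset the log and its pointer
        self.committed = 0
        return True

    def restart(self):
        """Crash recovery: drop entries past the log pointer, replay the rest."""
        del self.log[self.committed:]
        return self.replay()


disk = JournaledDisk()
disk.append([(1, b"i-node"), (2, b"directory")])
pending_read = disk.read(2)                # served from the log, not yet replayed
disk.replay(crash_after=1)                 # crash after only block 1 was updated
disk.restart()                             # replaying everything again is safe
disk.append([(3, b"lost")], commit=False)  # crash before the pointer update
disk.restart()
```

The committed batch survives a crash in the middle of replay, while the batch whose pointer update never happened is discarded and leaves the file system in its previous state.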
Transaction Database Systems
• Similar techniques
  • Every transaction is recorded in log before recording on disk
  • Stable storage techniques for managing log pointers
  • Once the log entry is confirmed, disk can be updated in place
  • After crash, replay log to redo disk operations
Journaling File Systems
• Linux ext3 file system
• Windows NTFS
Berkeley LFS – a slight variation
• Everything is written to log
  • i-nodes point to updated blocks in log
  • i-node cache in memory updated whenever i-node is written
  • Cleaner daemon follows behind to compact log
• Advantages:
  • LFS is always consistent
  • LFS performance
    • Much better than Unix file system for small writes
    • At least as good for reads and large writes
• Tanenbaum, §6.3.8, pp. 428-430
• Rosenblum & Ousterhout, Log-structured File System (pdf)
• Note: not the same as Linux LFS (Large File Support)
Example
[Figure: LFS example, before and after an update. Labels: blocks a, b, c; i-node; modified blocks; old blocks; old i-node; new blocks; new i-node; log]
Questions?