140 likes | 292 Views
UNIX File and Directory Caching. How UNIX Optimizes File System Performance and Presents Data to User Processes Using a Virtual File System. V-NODE Layer. V-node in-memory interface to the disk consist of: File system independent functions dealing with: Hierarchical naming; Locking;
E N D
UNIX File and Directory Caching How UNIX Optimizes File System Performance and Presents Data to User Processes Using a Virtual File System
V-NODE Layer • V-node in-memory interface to the disk consist of: • File system independent functions dealing with: • Hierarchical naming; • Locking; • Quotas; • Attribute management and protection. • Object (file) creation and deletion, read and write, changes in space allocation: • These functions refer to file-store internals specific to the file system: • Physical organization of data on device; • For local data files, these functions refer to v-node refers to UNIX-specific structure called i-node (index node) that has all necessary information to access the actual data store.
Actual File I/O • CPU cannot access the file data directly. • Must be first brought to the main memory. • How to do this efficiently? • Read/Write mapping using in-memory system file/directory buffer cache. • Memory mapped files – INODE lists, Directories, Regular files. • Then fed from memory to data pipeline in CPU.
File Read/Write Memory Mapping • File data is made available to applications via a pre-allocated main memory region - the buffer cache. • The file systems transfers data between the buffer cache and disk in granularity of disk blocks. • The data is explicitly copied from/to buffer cache to/from the user application address space (process). • A file (or a portion thereof) is mapped into a contiguous region of the process virtual memory. • Mapping operation is very efficient: just marking a block. • The access to file is governed by the virtual memory subsystem. • Advantages: • reduce copying • no need for a pre-allocated buffer cache in the main memory • Disadvantages: • less or no control over the actual disk writing: the file data becomes volatile • A mapped area must fit the virtual address space
Buffer Cache management • All disk I/O goes through the buffer cache. Both data and metadata (e.g., i-node, directories) are cached using LRU replacement • Dirty (modified) marker to indicate whether write-back is needed for data blocks. • Advantages: - Hiding disk access the user program. Block size, memory alignment, memory allocation in multiples of the block size, etc… - Disk blocks are cached - Block aggregation for small transfers (locality) - Block re-use across processes - Transient data might be never written to disk • Disadvantages: - Extra copying: Disk->buffer cache->user space - Vulnerability to failures - Does not care about the user data blocks - Control data blocks (metadata) are the real problem • INODES, pointer blocks, directories can be in cache when a failure occurs • As a result the file system internal state might be corrupted • fsck required, resulting in long (re-)boot times
File System Reliability and Recovery • File system data consists of file control data (metadata), user data • Failures can cause data loss and corruption for cached metadata or user data • Power failure during the sector write may corrupt physically the data stored in the sector • Lost or corruption of the metadata might lead to a more massive user data loss. • File systems must care about the metadata more than about the user data • The Operating System cares about the file system data (e.g. metadata) • Users must care about their data themselves (e.g., backups) • Caching affects the WRITE process reliability. • Is it guaranteed that the requested data is indeed written on disk? • What if cache blocks are the metadata blocks versus user data? • Solutions: • write-through: writes bypass cache • write-back: dirty blocks are written asynchronously [bracket processes]
Data Reliability in UNIX • User data writes based on write-back policy: • User data is written back to disk periodically • Program commands like sync and fsync are used for forced write of the dirty blocks. • Metadata writes are based on write-through policy. Updates are written to disk immediately bypassing cache. Problem: - Some data is not written in-place. Can go back to the last consistent version - Some data is replicated like UNIX superblock. - File system goes through consistency check/repair cycle at the boot time as specified in /etc/fstab options (see manpage on fsck, fstab). - Write-through negatively affects performance • Solution: maintain a sequential log of metadata updates, a Journal: e.g. IBM’s Journal File System (JFS) in AIX
Journal File System (JFS) • Metadata operations logged (journaled): • create,link,mkdir,truncate,allocating, write, … • each operation may involve several metadata updates (transaction) • Once operation is logged it returns, write ahead logging • The disk writes are performed asynchronously. Block aggregation possible. • A cursor (pointer) is maintained. The cursor is advanced once the updated blocks associated with the transaction are written to disk (hardened). Hardened transaction records can be deleted from the journal. • Upon recovery: Re-do all the operations starting from the last cursor position. • Advantages: • Asynchronous metadata write • Fast recovery: depends on the Journal size and not on the file-system size • Disadvantages • extra write • space wasted by journal (insignificant)