Filesystem Implementation Tanenbaum Ch. 6 Silberschatz Ch. 11 Bovet Ch. 12, 18
Examples • UNIX: UFS based on FFS • Windows: • Disk: FAT, FAT32 and NTFS • CD, DVD, floppy-disk .. filesystems • Linux (40+): ext2, ext3, .. • Distributed filesystems: NFS A modern OS must concurrently support multiple types of filesystems (fs)!
Layered Approach Content / operation on files Handles metadata and directory structures Handles files (logical blocks → physical blocks) Handles basic reading/writing of physical blocks Interrupt handlers Device drivers Shared HW (Disk)
Virtual File Systems (VFS) • VFS provide an object-oriented way of implementing filesystems • VFS allows the same system call interface (the API) to be used for different types of filesystems • The interface is to the VFS interface, rather than any specific type of filesystem
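The object-oriented idea can be sketched in plain C with a table of function pointers. This is a minimal user-space illustration, not kernel code: `toy_file`, `toy_file_ops`, `toy_vfs_read` and the two toy "filesystems" are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical, simplified stand-in for a per-filesystem operations table. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t count);
};

struct toy_file {
    const struct toy_file_ops *ops; /* chosen when the file is opened */
    const char *data;               /* toy backing store */
    size_t pos;                     /* current offset */
};

/* Toy filesystem A: returns the stored bytes unchanged. */
static long plain_read(struct toy_file *f, char *buf, size_t count)
{
    size_t left = strlen(f->data) - f->pos;
    size_t n = count < left ? count : left;
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

/* Toy filesystem B: same data, but presented upper-cased. */
static long upper_read(struct toy_file *f, char *buf, size_t count)
{
    long n = plain_read(f, buf, count);
    for (long i = 0; i < n; i++)
        buf[i] = (char)toupper((unsigned char)buf[i]);
    return n;
}

static const struct toy_file_ops plain_ops = { .read = plain_read };
static const struct toy_file_ops upper_ops = { .read = upper_read };

/* The single entry point every caller uses, whatever the filesystem. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t count)
{
    assert(f != NULL && f->ops != NULL && f->ops->read != NULL);
    return f->ops->read(f, buf, count);
}
```

The caller only ever sees `toy_vfs_read`; which implementation runs is decided by the `ops` pointer stored in the file object.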
Schematic View of VFS Concurrent support of multiple filesystems
VFS and Linux • VFS introduces a “common file model” • vnode: File representation structure • “implemented” by • FAT32, NTFS, ext2/3, AFS, NFS, ReiserFS … • Linux • i-node object (a file) • file object (an open file) • superblock object (entire file system) • dentry object (directory entry)
The VFS Objects: Common File Model from <fs.h> from <dcache.h> Represents an open file in a process Represents a file in the filesystem Represents a directory entry from <fs.h> Represents a filesystem static ssize_t fifo_read( struct file *file, char *buf, size_t count, loff_t *ppos ) { struct inode *node = file->f_dentry->d_inode; unsigned int minor = iminor(node); … } macro from < /usr/include/linux/fs.h> Process struct file struct dentry struct inode file struct super_block
Outline • Implementing filesystems on disk • implementing files • contiguous allocation • linked-list allocation • file allocation table (FAT) • i-node • implementing directories • trade-offs and performance • Look at some of the VFS objects for Linux • no complete listings • Example of filesystem implementation • ext2, ext3
Filesystem Implementation A possible file system layout
Files Consist of Blocks of Data • Where to store/allocate blocks? • How to find files/blocks? • What is a good block size? [Figure: the logical blocks 1–12 of a file mapped to scattered physical blocks on disk; axes: logical address (block), physical address (block)]
Implementing Files (1) (a) Contiguous allocation of disk space for 7 files (b) State of the disk after files D and E have been removed
Contiguous Allocation • Finding files/blocks is easy • Offset + number of blocks • Excellent read performance • Fragmentation • Compaction • Reuse of holes • Need to know max file size when allocating • Where could this allocation be useful? • What is the standard alternative to static allocation in computer science (think arrays in C)?
Implementing Files (2) - Storing a file as a linked list of disk blocks - Directory contains a pointer to first and last blocks How much data can be stored in 10 blocks?
Linked List Allocation • No holes, no pre-allocation problem • Only address of first block needs to be stored • Finding block n is expensive • Need to read all n-1 blocks prior to block n • Size of data block is not 2^x • The pointer is not data • Both disadvantages can be removed using a new data structure, which?
Implementing Files – FAT A: 4 – 7 – 2 – 10 – 12 Idea: store the pointers in a table • Fast random access • Table can be stored in RAM • Full 2^x block size This method is called FAT (File Allocation Table) Disadvantage: table size. 20 GB disk, block size 1 KB → 20M blocks → 80 MB (4-byte entries) or 60 MB (3-byte entries) What can we do to reduce the storage requirement?
FAT → i-nodes • Do we actually need to have the whole table in memory all the time? • table size proportional to disk size! • Actually, only open files need to be there… • Split the table into per-file tables, called i-nodes (index node)
Implementing Files (4) An example i-node Indirect block to handle large files
Indirect Addressing An i-node with 3 levels of indirect blocks
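How a logical block number selects a level of indirection can be sketched as follows. The number of direct pointers (12) is an assumption for the sketch, as are the function names; with 1 KB blocks and 4-byte block addresses an indirect block would hold 256 pointers.

```c
#include <assert.h>
#include <stdint.h>

#define NDIRECT 12u /* assumed number of direct pointers in the i-node */

/* Number of indirect blocks to traverse (0..3) to reach logical block n,
 * or -1 if n is beyond what triple indirection can address. */
int indirection_level(uint64_t n, uint64_t ptrs_per_block)
{
    uint64_t span = 1;
    assert(ptrs_per_block > 0);
    if (n < NDIRECT)
        return 0;
    n -= NDIRECT;
    for (int level = 1; level <= 3; level++) {
        span *= ptrs_per_block; /* blocks reachable at this level */
        if (n < span)
            return level;
        n -= span;
    }
    return -1;
}

/* Largest file, in blocks, that such an i-node can describe. */
uint64_t max_file_blocks(uint64_t p)
{
    return NDIRECT + p + p * p + p * p * p;
}
```

Small files pay nothing for the scheme (level 0), while each extra level costs one more disk read per access unless the indirect block is cached.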
Directories • Opening a file: • locate root directory, • search for desired directory, • directory contains info to find file blocks on disk. • disk address (contiguous allocation) • number of first block (linked list) • i-node number • Directory system: maps ASCII file name onto the information needed to open it
Implementing Directories (1) Where to store file attributes? (a) A simple directory fixed size entries (1 per file) disk addresses and attributes in directory entry (b) Directory in which each entry just refers to an i-node
Implementing Directories (2) • Directories are files (i-node) with i-node pointers • Directory systems should translate a name to a file (i-node) • dentry keeps this info in VFS
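Name-to-i-node translation can be sketched as a walk over the path, one component at a time. This is a toy in-memory model: `toy_inode`, `toy_dirent` and `path_lookup` are hypothetical names, and each directory is simply a small array of (name, i-node number) pairs.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 4

struct toy_dirent { const char *name; int ino; };  /* name -> i-node number */

struct toy_inode {
    int is_dir;
    struct toy_dirent entries[MAX_ENTRIES];        /* used only if is_dir */
};

/* Look up one name in one directory; returns the i-node number or -1. */
static int dir_lookup(const struct toy_inode *dir, const char *name, size_t len)
{
    if (!dir->is_dir)
        return -1;
    for (int i = 0; i < MAX_ENTRIES; i++) {
        const char *e = dir->entries[i].name;
        if (e && strlen(e) == len && memcmp(e, name, len) == 0)
            return dir->entries[i].ino;
    }
    return -1;
}

/* Resolve an absolute path, starting at the root i-node. */
int path_lookup(const struct toy_inode *itable, int root_ino, const char *path)
{
    int ino = root_ino;
    assert(itable != NULL && path != NULL && path[0] == '/');
    while (*path) {
        while (*path == '/')
            path++;                       /* skip separators */
        size_t len = strcspn(path, "/");  /* length of the next component */
        if (len == 0)
            break;                        /* end of path or trailing slash */
        ino = dir_lookup(&itable[ino], path, len);
        if (ino < 0)
            return -1;
        path += len;
    }
    return ino;
}
```

Every component costs one directory search, which is the work a cache of name-to-i-node translations avoids repeating.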
Shared Files Storing attributes in i-node simplifies sharing File system containing a shared file
Hard/Symbolic Links • Hard links are actually the same file • share the same i-node • will be seen as the same file everywhere • same owner • same contents • same permissions • keeps counter • Symbolic links are dereferenced • a special file • different owners/permissions • can cross filesystem boundaries • short cuts in Windows, alias in Mac
Shared Files (a) Situation prior to linking (b) After the link is created (c) After the original owner removes the file
Check this under Linux.. Execute as u1=user1, u2=user2 (make sure that u2 has write permissions) • u1: echo Hi > file-u1 • u2: ln file-u1 file-u2 • u2: ln -s file-u1 file-u2-s • u2: cat file-u2 • u2: cat file-u2-s • u1: echo again >> file-u1 • u1: rm file-u1 • u2: cat file-u2 • u2: cat file-u2-s What is the output of lines 4, 5 and 8, 9? Why?
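The behaviour behind this experiment can be modelled without touching a real disk. The sketch below is a toy in-memory directory, not the POSIX API: a hard link is a second name for the same i-node and bumps its link counter, while a symbolic link only stores the target's name. All identifiers are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_NAMES 8

struct toy_inode { int nlink; const char *content; };

struct toy_name {
    const char *name;        /* NULL marks a free slot */
    struct toy_inode *inode; /* set for a hard link, NULL for a symlink */
    const char *target;      /* set for a symlink, NULL otherwise */
};

static struct toy_name *find(struct toy_name *dir, const char *name)
{
    for (int i = 0; i < MAX_NAMES; i++)
        if (dir[i].name && strcmp(dir[i].name, name) == 0)
            return &dir[i];
    return NULL;
}

static struct toy_name *free_slot(struct toy_name *dir)
{
    for (int i = 0; i < MAX_NAMES; i++)
        if (!dir[i].name)
            return &dir[i];
    return NULL;
}

/* Hard link: a new name for an existing i-node. */
int toy_link(struct toy_name *dir, struct toy_inode *ino, const char *name)
{
    struct toy_name *s = free_slot(dir);
    assert(ino != NULL);
    if (!s)
        return -1;
    s->name = name; s->inode = ino; s->target = NULL;
    ino->nlink++;
    return 0;
}

/* Symbolic link: a name that stores another name. */
int toy_symlink(struct toy_name *dir, const char *target, const char *name)
{
    struct toy_name *s = free_slot(dir);
    if (!s)
        return -1;
    s->name = name; s->inode = NULL; s->target = target;
    return 0;
}

/* Remove a name; the i-node survives while its counter is above zero. */
int toy_unlink(struct toy_name *dir, const char *name)
{
    struct toy_name *e = find(dir, name);
    if (!e)
        return -1;
    if (e->inode)
        e->inode->nlink--;
    e->name = NULL; e->inode = NULL; e->target = NULL;
    return 0;
}

/* Read through a name, dereferencing symlinks (bounded to avoid loops). */
const char *toy_read(struct toy_name *dir, const char *name)
{
    for (int depth = 0; depth < 8; depth++) {
        struct toy_name *e = find(dir, name);
        if (!e)
            return NULL; /* dangling name */
        if (e->inode)
            return e->inode->content;
        name = e->target;
    }
    return NULL;
}
```

In this model, removing the original name leaves the hard link readable because the i-node still has one name, whereas the symbolic link now points at a name that no longer exists.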
Mounting / • The directory i-node indicates that it is a mount point usr bin tmp windows Windows Temp Documents and Settings
Disk Space Management Store files in fixed-size blocks: how big should the blocks be? - Average file size is important [Figure: plot against block size (bytes), assuming all files are 2 KB]
Keeping Track of Free Blocks (1) (a) Storing the free list on a linked list (32 bits / block) (b) A bit map (1 bit per block, but for all blocks)
Keeping Track of Free Blocks (2) • Bitmap size depends on disk and block size • Linked list size depends on # free blocks • Bitmaps are generally smaller • Linked lists can use free blocks … • Only one block of the linked list needed in main memory • The others are read/written on demand • Problems? What happens if files are deleted?
Keeping Track of Free Blocks (3) (a) Almost-full block of pointers to free disk blocks in RAM - three blocks of pointers on disk (b) Result of freeing a 3-block file (c) Alternative strategy for handling 3 free blocks - shaded entries are pointers to free disk blocks
Quota Quotas for keeping track of each user’s disk use
Backups • Performing filesystem backups is essential for reliable systems • Two types • Full • Incremental • Typically a mixed algorithm is used • How to keep track of which files to save?
Backups • A filesystem to be dumped • squares are directories, circles are files • shaded items, modified since last dump • each directory & file labeled by i-node number File that has not changed
Backups • Commonly all modified files and directories above them are stored • Can restore on another filesystem • Individual files can be restored from incremental backup • Bitmaps are used to find the modified i-nodes
Backups 4 phases of the algorithm • Recursively mark each dir and each modified i-node (a) • Recursively unmark non-modified dirs (b) • Dump all dirs (c) • Dump all modified i-nodes (d)
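The four phases can be sketched on a small in-memory tree. This is one interpretation of the slide, with hypothetical names: a directory stays marked after phase 2 if it was modified itself or if anything below it is still marked.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

struct node {
    int ino;      /* i-node number */
    int is_dir;   /* 1 = directory, 0 = file */
    int modified; /* changed since the last dump */
    int marked;   /* stands in for the bitmap indexed by i-node number */
    struct node *child[MAX_CHILDREN];
};

/* Phase 1: mark every directory and every modified file. */
void mark(struct node *n)
{
    if (!n)
        return;
    n->marked = n->is_dir || n->modified;
    for (int i = 0; i < MAX_CHILDREN; i++)
        mark(n->child[i]);
}

/* Phase 2: unmark directories with nothing modified in or under them.
 * Returns nonzero if n remains marked. */
int prune(struct node *n)
{
    int keep;
    if (!n)
        return 0;
    if (!n->is_dir)
        return n->marked;
    keep = n->modified;
    for (int i = 0; i < MAX_CHILDREN; i++)
        keep |= prune(n->child[i]); /* visit every child, no short-circuit */
    n->marked = keep;
    return keep;
}

/* Phases 3 and 4: collect marked directories (want_dirs = 1) or
 * marked files (want_dirs = 0) into out[]; returns the new count. */
size_t dump(const struct node *n, int want_dirs, int *out, size_t count)
{
    assert(out != NULL);
    if (!n)
        return count;
    if (n->marked && !!n->is_dir == !!want_dirs)
        out[count++] = n->ino;
    for (int i = 0; i < MAX_CHILDREN; i++)
        count = dump(n->child[i], want_dirs, out, count);
    return count;
}
```

Dumping the directories before the files means a restore can recreate the directory skeleton first and then drop each file into place.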
Outline • Implementing filesystems on disk • Implementing files • contiguous allocation • linked-list allocation • file allocation table (FAT) • i-node • Implementing directories • trade-offs and performance • Look at some of the VFS objects for Linux • no complete listings • Example of filesystem implementation • ext2, ext3
The Common File Model from <fs.h> from <dcache.h> Represents an open file in a process Represents a file in the filesystem Represents a directory entry from <fs.h> Represents a filesystem Process struct file struct dentry struct inode file struct super_block
struct task_struct { volatile long state; struct thread_info *thread_info; atomic_t usage; unsigned long flags; unsigned long ptrace; int lock_depth; int prio, static_prio; struct list_head run_list; prio_array_t *array; unsigned long sleep_avg; long interactive_credit; […] /* file system info */ int link_count, total_link_count; struct tty_struct *tty; /* NULL if no tty */ /* ipc stuff */ struct sysv_sem sysvsem; /* CPU-specific state of this task */ struct thread_struct thread; /* filesystem information */ struct fs_struct *fs; /* open file information */ struct files_struct *files; /* namespace */ struct namespace *namespace; /* signal handlers */ struct signal_struct *signal; struct sighand_struct *sighand; […] }; struct files_struct { atomic_t count; spinlock_t file_lock; int max_fds; int max_fdset; int next_fd; struct file ** fd; /* current fd array */ fd_set *close_on_exec; fd_set *open_fds; fd_set close_on_exec_init; fd_set open_fds_init; struct file * fd_array[NR_OPEN_DEFAULT]; }; task_struct (sched.h) Remember: • Each process is represented using a task_struct • Keeps “a list” of open files • files_struct
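The role of the `fd` array can be illustrated with a user-space model: a file descriptor is just an index into a per-process table of pointers to open-file objects. The `toy_` names and the table size are assumptions for the sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_OPEN 8 /* assumed table size for the sketch */

struct toy_file { int f_count; long f_pos; }; /* open-file object */

struct toy_files {
    struct toy_file *fd[TOY_NR_OPEN]; /* index = file descriptor */
};

/* Install f in the lowest free slot; returns the descriptor or -1. */
int toy_fd_install(struct toy_files *files, struct toy_file *f)
{
    assert(files != NULL && f != NULL);
    for (int fd = 0; fd < TOY_NR_OPEN; fd++) {
        if (files->fd[fd] == NULL) {
            files->fd[fd] = f;
            f->f_count++; /* one more reference to the open file */
            return fd;
        }
    }
    return -1;
}

/* Translate a descriptor back to its open-file object, or NULL. */
struct toy_file *toy_fget(const struct toy_files *files, int fd)
{
    assert(files != NULL);
    if (fd < 0 || fd >= TOY_NR_OPEN)
        return NULL;
    return files->fd[fd];
}

/* Close a descriptor: free the slot and drop one reference. */
int toy_close(struct toy_files *files, int fd)
{
    struct toy_file *f = toy_fget(files, fd);
    if (!f)
        return -1;
    files->fd[fd] = NULL;
    f->f_count--;
    return 0;
}
```

Two tables pointing at the same `toy_file` share its offset `f_pos`, which is how several processes can use one open-file object.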
struct file { struct list_head f_list; struct dentry *f_dentry; struct vfsmount *f_vfsmnt; struct file_operations *f_op; atomic_t f_count; unsigned int f_flags; mode_t f_mode; loff_t f_pos; struct fown_struct f_owner; unsigned int f_uid, f_gid; int f_error; struct file_ra_state f_ra; unsigned long f_version; void *f_security; [..] }; The file object: Created by the OS when a file is opened Does not exist on disk! no “dirty” bit is needed Several processes can use the same file object Contains a list of pointers to operations on this file File (fs.h) Directory entry for the file! Set by the OS when file loaded from inode File reference count Current file pointer (offset)
Operations of Files struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *); int (*release) (struct inode *, struct file *); int (*fsync) (struct file *, struct dentry *, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void __user *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); };
Dentry (Directory Entry) /users/aja/crap/exam.tex • Dentry does not represent directories! • i-nodes represent directories • Used in directory related operations • e.g., pathname lookup 1 dentry and 1 i-node for each component
Dentry Cache • Dentry objects are created on the fly • time consuming! • inefficient • dentry objects are often reused soon after creation • Store dentry objects in a SW cache • the dentry cache (remember dcache.h)
Software Caches • The frequently used (created/destroyed) objects are stored/allocated in SW caches • Basically three caches exist in Linux • User mode memory (VM) • Slab allocator (common structures/objects) • Page cache (inodes, disk blocks) • Disk caches (the Page Cache) are used to cache disk accesses (not VM pages!!) • Crucial to system performance! • Must also be part of the page replacement algorithm • Bovet, Ch. 17
Dentry Cache • Unused dentry objects stored in a list • Allows easy LRU replacement • A hash table (name → dentry) • Allows fast lookup • Dentry states: • In use – used, and contains valid info • Unused – not used, but points to valid i-node • Negative – the i-node does not exist, kept to speed up lookups • Free – contains no valid info (stored in the slab cache) Can safely be deleted by the page replacement algorithm
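The hash-table half of the cache, including negative entries, can be sketched as follows. This is a simplified model with hypothetical names: it hashes the name only (a real lookup would also take the parent directory into account) and leaves out the LRU list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBUCKETS 16

struct toy_dentry {
    const char *name;
    int ino;                 /* -1 marks a negative dentry */
    struct toy_dentry *next; /* hash chain */
};

static unsigned hash_name(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Insert a dentry at the head of its hash chain. */
void dcache_add(struct toy_dentry **table, struct toy_dentry *d)
{
    unsigned b;
    assert(table != NULL && d != NULL && d->name != NULL);
    b = hash_name(d->name);
    d->next = table[b];
    table[b] = d;
}

/* Returns the cached dentry, or NULL on a cache miss.
 * A hit with ino == -1 is a cached "this name does not exist". */
struct toy_dentry *dcache_lookup(struct toy_dentry **table, const char *name)
{
    assert(table != NULL && name != NULL);
    for (struct toy_dentry *d = table[hash_name(name)]; d; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

A negative hit answers "no such file" without going to disk, which is the reason such entries are kept at all.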
dentry (dcache.h) dentry: • Associates the components of a pathname to their inodes • Does not exist on disk struct dentry { atomic_t d_count; unsigned long d_vfs_flags; /* moved here to be on same cacheline */ spinlock_t d_lock; /* per dentry lock */ struct inode * d_inode; /* Where the name belongs to - NULL is negative */ struct list_head d_lru; /* LRU list */ struct list_head d_child; /* child of parent list */ struct list_head d_subdirs; /* our children */ struct list_head d_alias; /* inode alias list */ unsigned long d_time; /* used by d_revalidate */ struct dentry_operations *d_op; struct super_block * d_sb; /* The root of the dentry tree */ unsigned int d_flags; int d_mounted; void * d_fsdata; /* fs-specific data */ struct rcu_head d_rcu; struct dcookie_struct * d_cookie; /* cookie, if any */ unsigned long d_move_count;/* to indicated moved dentry while lockless lookup */ struct qstr * d_qstr; /* quick str ptr used in lockless lookup and concurrent d_move */ struct dentry * d_parent; /* parent directory */ struct qstr d_name; struct hlist_node d_hash; /* lookup hash list */ struct hlist_head * d_bucket; /* lookup hash bucket */ unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */ } ____cacheline_aligned;
struct inode { struct hlist_node i_hash; struct list_head i_list; struct list_head i_sb_list; struct list_head i_dentry; unsigned long i_ino; atomic_t i_count; umode_t i_mode; unsigned int i_nlink; uid_t i_uid; gid_t i_gid; dev_t i_rdev; loff_t i_size; struct timespec i_atime; struct timespec i_mtime; struct timespec i_ctime; unsigned int i_blkbits; unsigned long i_blksize; unsigned long i_version; unsigned long i_blocks; unsigned short i_bytes; spinlock_t i_lock; struct semaphore i_sem; struct inode_operations *i_op; struct file_operations *i_fop; struct super_block *i_sb; struct file_lock *i_flock; struct address_space *i_mapping; struct address_space i_data; struct dquot *i_dquot[MAXQUOTAS]; /* These three should probably be a union */ struct list_head i_devices; struct pipe_inode_info *i_pipe; struct block_device *i_bdev; struct cdev *i_cdev; int i_cindex; unsigned long i_dnotify_mask; struct dnotify_struct *i_dnotify; unsigned long i_state; unsigned int i_flags; unsigned char i_sock; atomic_t i_writecount; void *i_security; u32 i_generation; union { void *generic_ip; } u; #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; #endif }; inode (fs.h) Structure with pointers to the page cache List of operations supported on this file(system) There is also an inode cache (inode.c)
inode_operations (fs.h) struct inode_operations { int (*create) (struct inode *,struct dentry *,int, struct nameidata *); struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); int (*mkdir) (struct inode *,struct dentry *,int); int (*rmdir) (struct inode *,struct dentry *); int (*mknod) (struct inode *,struct dentry *,int,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); int (*readlink) (struct dentry *, char __user *,int); int (*follow_link) (struct dentry *, struct nameidata *); void (*truncate) (struct inode *); int (*permission) (struct inode *, int, struct nameidata *); int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); };
struct address_space • Stores pages in the page cache as a radix tree • Remember digital search trees (tries)? • Allows fast lookup and sorting • Retrieve all dirty blocks • Read more on: • http://lwn.net/Articles/175432/ • Bovet, Ch. 15
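A radix tree keyed by page index can be sketched with a fixed height. The sketch is a toy, not the kernel's implementation: it assumes 6 index bits per level and a height of 2 (indices 0 to 4095), and all names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

#define MAP_SHIFT 6                /* assumed: 6 index bits per level */
#define MAP_SIZE (1u << MAP_SHIFT) /* 64 slots per node */
#define MAP_MASK (MAP_SIZE - 1)
#define HEIGHT 2                   /* fixed height for the sketch */
#define MAX_INDEX (1ul << (MAP_SHIFT * HEIGHT))

struct rnode { void *slot[MAP_SIZE]; };

/* Store `page` under `index`, creating interior nodes on demand. */
int radix_insert(struct rnode *root, unsigned long index, void *page)
{
    struct rnode *n = root;
    assert(root != NULL && index < MAX_INDEX);
    for (int level = HEIGHT - 1; level > 0; level--) {
        unsigned i = (unsigned)(index >> (level * MAP_SHIFT)) & MAP_MASK;
        if (!n->slot[i]) {
            n->slot[i] = calloc(1, sizeof(struct rnode));
            if (!n->slot[i])
                return -1;
        }
        n = n->slot[i];
    }
    n->slot[index & MAP_MASK] = page;
    return 0;
}

/* Return the page stored under `index`, or NULL if none is cached. */
void *radix_lookup(const struct rnode *root, unsigned long index)
{
    const struct rnode *n = root;
    assert(root != NULL);
    if (index >= MAX_INDEX)
        return NULL;
    for (int level = HEIGHT - 1; level > 0; level--) {
        n = n->slot[(index >> (level * MAP_SHIFT)) & MAP_MASK];
        if (!n)
            return NULL;
    }
    return n->slot[index & MAP_MASK];
}
```

A lookup touches one node per level regardless of how many pages are cached, and walking the slots in order visits pages sorted by index.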
super_block (fs.h) struct super_block { struct list_head s_list; /* Keep this first */ dev_t s_dev; /* search index; _not_ kdev_t */ unsigned long s_blocksize; unsigned long s_old_blocksize; unsigned char s_blocksize_bits; unsigned char s_dirt; unsigned long long s_maxbytes; /* Max file size */ struct file_system_type * s_type; struct super_operations * s_op; struct dquot_operations * dq_op; struct quotactl_ops * s_qcop; struct export_operations * s_export_op; unsigned long s_flags; unsigned long s_magic; struct dentry * s_root; struct rw_semaphore s_umount; Used to store filesystem specific information This reflects VFS’s view of the fs!