The EXT2 File System

1. The EXT2 File System
Presented by: S. Arun Nair, Abhinav Golas

2. The Second Extended File System
The Second Extended File System was devised by Rémy Card as an extensible and powerful file system for Linux. It is also the most successful file system so far in the Linux community and is the basis for all of the currently shipping distributions. Because of this, it is extremely well integrated into the kernel, with good performance enhancements.

3. Disk Layout
- Partition table
  - Master Boot Record (MBR)
  - List of partitions: primary partitions, extended partitions
- How to specify a partition
  - Block size for the partition table
  - Beginning block
  - Ending block

4. Filesystem
- A way to organize data on one or multiple partitions
- The basic abstraction for file usage

5. Ext2 File System Layout

6. Partition Layout: ext2
- The boot sector block is optional; it is not required if you do not want to make this partition bootable.
- Each block group has the same number of available data blocks and inodes.
- Having multiple block groups helps counter fragmentation, improves reliability (since backups of the superblock are kept there), and even speeds up access: the inode table is near the data blocks, which reduces seek time for data blocks.

7. Partition Layout: ext2
- Each block group has the following structure.
- Again, not all block groups have the superblock. The first block group, however, must have it, and that copy is the one used by the kernel. The others are backups to be used by filesystem checkers for consistency checks.

8. Some definitions
- Boot sector: block which may contain the stage 1 boot loader and which points to the stage 1.5 or stage 2 boot loader.
- Superblock: the filesystem header; it identifies and represents the filesystem and provides relevant information about it. It always starts at byte offset 1024 from the beginning of the partition, leaving the first 1024 bytes free for an optional boot sector. With 1 KiB blocks this is block 1; with larger blocks it lies inside block 0.
- FS/Group descriptor: pointers to the bitmaps and the inode table in the block group.

9. Some definitions
- Block bitmap: block usage information; tells which blocks in the block group are empty (0) or used (1).
- Inode bitmap: inode usage information.
- Inode table: table of the inodes. Each inode provides the necessary and relevant information about one file.
- Data blocks: blocks where the data is stored.
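The bitmap convention above (0 = free, 1 = used) can be sketched in a few lines of C. This is only an illustration: the helper names are hypothetical, and the sketch assumes the bit order ext2 uses on disk, where bit 0 is the least significant bit of byte 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if bit n is set (block or inode in use), 0 if it is free. */
int bitmap_test(const uint8_t *bitmap, size_t n)
{
    return (bitmap[n / 8] >> (n % 8)) & 1;
}

/* Marks bit n as used. */
void bitmap_set(uint8_t *bitmap, size_t n)
{
    bitmap[n / 8] |= (uint8_t)(1u << (n % 8));
}

/* Returns the index of the first free (zero) bit, or nbits if none is free. */
size_t bitmap_find_first_zero(const uint8_t *bitmap, size_t nbits)
{
    assert(bitmap != NULL);
    for (size_t n = 0; n < nbits; n++)
        if (!bitmap_test(bitmap, n))
            return n;
    return nbits;
}
```

An allocator would call bitmap_find_first_zero() and then bitmap_set() on the result; the kernel does the equivalent with atomic bit operations.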

10. The Ext2 Superblock
The superblock contains a description of the basic size and shape of this file system. The system keeps multiple copies of the superblock in many block groups. It holds the following information:
- Magic Number: 0xef53 for the current implementation.
- Revision Level: for checking compatibility.
- Mount Count and Maximum Mount Count: to ensure that the filesystem is periodically checked.
- Block Group Number: the block group that holds this copy of the superblock.
- Block Size: size of a block for the file system in bytes.

11. The Ext2 Superblock
- Blocks per Group: fixed when the file system is created. The block bitmap must fit into one block, hence the number of blocks per group is at most 8 * block size.
- Free Blocks: number of free blocks in the file system. This count includes the blocks reserved for root, whose number is stored in a separate field.
- Free Inodes: number of free inodes in the file system.
- First Inode: the number of the first inode available for ordinary files. Lower-numbered inodes are reserved; the root directory '/' is always inode 2.
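The "8 * block size" limit and the resulting number of block groups can be worked through with a small C sketch. The function names are hypothetical; first_data_block stands for the number of the first block that belongs to group 0 (1 for 1 KiB blocks, 0 for larger ones).

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds 8 bits per byte, so it can describe this many blocks. */
uint32_t max_blocks_per_group(uint32_t block_size)
{
    assert(block_size > 0);
    return 8u * block_size;
}

/* Number of block groups needed to cover the file system, rounded up. */
uint32_t group_count(uint32_t total_blocks, uint32_t first_data_block,
                     uint32_t blocks_per_group)
{
    assert(blocks_per_group > 0 && total_blocks >= first_data_block);
    return (total_blocks - first_data_block + blocks_per_group - 1)
           / blocks_per_group;
}
```

For example, with 1 KiB blocks a group covers at most 8192 blocks, so a 20000-block partition needs three groups, the last one only partly filled.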

12. Superblock Defined in include/linux/fs.h
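As a rough illustration of how the on-disk superblock is recognised, the sketch below checks a raw partition image for the ext2 magic number. It relies on two facts about the on-disk format: the superblock starts at byte offset 1024, and the 16-bit little-endian magic field lies 56 bytes into it. The function name looks_like_ext2 and the in-memory image are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_OFFSET     1024u   /* superblock starts 1 KiB into the partition */
#define MAGIC_OFFSET  56u     /* position of the magic field inside the superblock */
#define EXT2_MAGIC    0xEF53u

/* Hypothetical helper: returns 1 if the image carries the ext2 magic number. */
int looks_like_ext2(const uint8_t *image, size_t len)
{
    assert(image != NULL);
    size_t pos = SB_OFFSET + MAGIC_OFFSET;
    if (len < pos + 2)
        return 0;                       /* image too short to hold the field */
    /* the field is stored little-endian on disk */
    uint16_t magic = (uint16_t)(image[pos] | (image[pos + 1] << 8));
    return magic == EXT2_MAGIC;
}
```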

13. The Ext2 Group Descriptor
All the group descriptors for all of the block groups are duplicated in each block group in case of file system corruption. The group descriptor contains the following:
- Blocks Bitmap: block number of the block allocation bitmap.
- Inode Bitmap: block number of the inode allocation bitmap.
- Inode Table: block number of the starting block of the inode table for this block group.
- Free blocks count: number of data blocks free in the group.
- Free inodes count: number of inodes free in the group.
- Used directory count: number of inodes allocated to directories.

14. The superblock usage sequence
- Mount (through VFS): when the filesystem is mounted read-write, the EXT2_VALID_FS flag in s_state is cleared; it is set again on a clean unmount. A missing flag at mount time therefore means the filesystem was not cleanly unmounted. EXT2_ERROR_FS is set when the kernel detects an error.
- Cached copies of the superblock and the group descriptors are always kept.
- Most VFS superblock operations are inherited for ext2.

15. The Ext2 Inode

16. The Ext2 Inode
- Direct/Indirect Blocks: pointers to the blocks that contain the data that this inode describes.
- Timestamp: the time that the inode was created and the last time that it was modified.
- Size: the size of the file in bytes.
- Owner info: the user and group identifiers of the owners of this file or directory.
- Mode: holds two pieces of information: what this inode describes and the permissions that users have to it.
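To make the direct/indirect pointers concrete, the sketch below works out which level of indirection serves a given logical block of a file. It assumes the usual ext2 inode layout of 12 direct pointers followed by one single, one double, and one triple indirect pointer, with 32-bit block numbers; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12u   /* direct pointers held in the inode itself */

/* Returns 0 for a direct block, 1/2/3 for single/double/triple indirect,
 * and -1 if the logical block is beyond what the inode can address. */
int indirection_level(uint64_t logical_block, uint32_t block_size)
{
    assert(block_size >= 4);
    uint64_t ptrs = block_size / 4;     /* 32-bit block numbers per block */

    if (logical_block < NDIR_BLOCKS)
        return 0;
    logical_block -= NDIR_BLOCKS;
    if (logical_block < ptrs)
        return 1;
    logical_block -= ptrs;
    if (logical_block < ptrs * ptrs)
        return 2;
    logical_block -= ptrs * ptrs;
    if (logical_block < ptrs * ptrs * ptrs)
        return 3;
    return -1;
}
```

With 1 KiB blocks, each indirect block holds 256 pointers, so logical blocks 0 to 11 are direct and blocks 12 to 267 go through the single indirect block.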

17.
    struct inode {
        kdev_t        i_dev;
        unsigned long i_ino;
        umode_t       i_mode;
        nlink_t       i_nlink;
        /* uid, gid, etc. */
    };
- Inodes are managed in doubly linked lists as well as a hash table.
- The iget() function returns the inode identified by a superblock and an inode number. It uses hints to resolve cross-mounted file systems as well.
- Any access to an inode increments a usage counter.

18. Inode Allocation
There are two policies for allocating an inode.
- If the new inode is a directory, a forward search is made for a block group with both free space and a low directory-to-inode ratio (find_group_dir); if that fails, then of the groups with above-average free space, the group that already has the fewest directories is chosen (find_group_orlov).
- For other inodes, the search goes forward from the parent directory's block group to find a free inode (find_group_other).
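The fallback rule for directories ("of the groups with above-average free space, the one with the fewest directories") can be sketched as follows. This is not the kernel's code: struct group_stats and pick_dir_group are hypothetical, "free space" is taken to mean free inodes, and a group exactly at the average is treated as eligible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct group_stats {          /* hypothetical summary of one group descriptor */
    uint32_t free_inodes;
    uint32_t used_dirs;
};

/* Returns the index of the chosen group, or -1 if no group has a free inode. */
int pick_dir_group(const struct group_stats *g, int ngroups)
{
    assert(g != NULL && ngroups > 0);

    uint64_t total_free = 0;
    for (int i = 0; i < ngroups; i++)
        total_free += g[i].free_inodes;
    uint64_t avg_free = total_free / (uint64_t)ngroups;

    int best = -1;
    for (int i = 0; i < ngroups; i++) {
        if (g[i].free_inodes == 0 || g[i].free_inodes < avg_free)
            continue;         /* full, or below-average free space */
        if (best < 0 || g[i].used_dirs < g[best].used_dirs)
            best = i;         /* fewest directories so far */
    }
    return best;
}
```

Spreading directories this way leaves room in each group for the files that will later be created inside them.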

19.
    struct inode *ext2_new_inode(struct inode *dir, int mode)
    {
        ...
        if (S_ISDIR(mode)) {
            if (test_opt(sb, OLDALLOC))
                group = find_group_dir(sb, dir);
            else
                group = find_group_orlov(sb, dir);
        } else
            group = find_group_other(sb, dir);
        ...
        /* Pseudocode from here on; N is the number of block groups. */
        loop (through all block groups, starting with the one computed above)
            find the first zero bit in the group's inode bitmap
            if no bit is zero then
                group = (group + 1) % N; continue;
            else if that bit has meanwhile been set to 1 by someone else then {

20.
                if there are no more zero bits in this bitmap then
                    group = (group + 1) % N;
                else
                    find a new zero bit and try to set it to 1 again
            }
        ...
        /* Set all the inode parameters from the mode information and from the parent directory. */
        if (test_opt(sb, GRPID))
            inode->i_gid = dir->i_gid;
        else if (dir->i_mode & S_ISGID) {
            inode->i_gid = dir->i_gid;
            if (S_ISDIR(mode))
                mode |= S_ISGID;
        } else
            ...

21.
        insert_inode_hash(inode);
        ...
        ext2_preread_inode(inode);
        return inode;
    }
We perform asynchronous prereading of the new inode's inode block when we create the inode, in the expectation that the inode will be written back soon. There are two reasons:
- When creating a large number of files, the async prereads will be nicely merged into large reads.
- When writing out a large number of inodes, we don't need to keep on stalling the writes while we read the inode block.
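The group scan in ext2_new_inode() can be illustrated with a simplified, single-threaded model. The sketch below is an assumption-laden stand-in, not the kernel routine: the bitmaps live in one contiguous in-memory array, alloc_inode is a hypothetical name, and a plain read-modify-write replaces the kernel's atomic test-and-set. It does show the wrap-around with the modulo operator and the usual numbering rule, inode number = group * inodes_per_group + bit + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* bitmaps: ngroups consecutive inode bitmaps of inodes_per_group bits each.
 * Returns the allocated inode number (1-based), or 0 if every group is full. */
uint32_t alloc_inode(uint8_t *bitmaps, uint32_t ngroups,
                     uint32_t inodes_per_group, uint32_t start_group)
{
    assert(bitmaps != NULL && ngroups > 0 && start_group < ngroups);
    assert(inodes_per_group % 8 == 0);

    uint32_t bytes_per_group = inodes_per_group / 8;
    uint32_t group = start_group;

    for (uint32_t tried = 0; tried < ngroups; tried++) {
        uint8_t *bm = bitmaps + (size_t)group * bytes_per_group;
        for (uint32_t bit = 0; bit < inodes_per_group; bit++) {
            if (!(bm[bit / 8] & (1u << (bit % 8)))) {
                bm[bit / 8] |= (uint8_t)(1u << (bit % 8));   /* mark in use */
                return group * inodes_per_group + bit + 1;
            }
        }
        group = (group + 1) % ngroups;   /* wrap around to group 0 */
    }
    return 0;
}
```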

22. Inode De-allocation
When we get the inode, we're the only people that have access to it, and as such there are no race conditions we have to worry about. The inode is not on the hash-lists, and it cannot be reached through the file system because the directory entry has been deleted earlier.

23.
    void ext2_free_inode(struct inode *inode)
    {
        /*
         * We must free any quota before locking the superblock,
         * as writing the quota to disk may need the lock as well.
         */
        ...
        if (!is_bad_inode(inode)) {
            ext2_xattr_delete_inode(inode);
            DQUOT_FREE_INODE(inode);
            DQUOT_DROP(inode);
        }
        ...

24.
        /*
         * We must make sure that we get no aliases, which means that we
         * have to call "clear_inode()" _before_ we mark the inode not in
         * use in the inode bitmaps. Otherwise a newly created file might
         * use the same inode number (not actually the same pointer
         * though), and then we'd have two inodes sharing the same inode
         * number and space on the hard disk.
         */
        ...
        clear_inode(inode);
        if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
                                   bit, (void *) bitmap_bh->b_data))
            ext2_error(sb, "ext2_free_inode",
                       "bit already cleared for inode %lu", ino);
        else
            ext2_release_inode(sb, block_group, is_directory);

25.
        mark_buffer_dirty(bitmap_bh);
        if (sb->s_flags & MS_SYNCHRONOUS)
            sync_dirty_buffer(bitmap_bh);
    error_return:
        brelse(bitmap_bh);
    }

26. Inode Update
First, get a pointer to the buffer head and to the on-disk inode in memory using ext2_get_inode(pointer to the superblock, inode number, pointer to the pointer to the buffer head):
        ...
        struct ext2_inode *raw_inode = ext2_get_inode(sb, ino, &bh);
        ...
Then update the raw inode from the in-memory inode given. The update depends on what kind of file the inode represents:
        raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
        raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
        raw_inode->i_flags = cpu_to_le32(ei->i_flags);
        raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
        raw_inode->i_frag = ei->i_frag_no;
        raw_inode->i_fsize = ei->i_frag_size;

27. If it is a regular file, we copy the attributes as well as the addresses of the blocks containing data into the i_block[] field of the raw_inode.
        ...
        if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
            if (old_valid_dev(inode->i_rdev)) {
                raw_inode->i_block[0] = cpu_to_le32(old_encode_dev(inode->i_rdev));
                raw_inode->i_block[1] = 0;
            } else {
                raw_inode->i_block[0] = 0;
                raw_inode->i_block[1] = cpu_to_le32(new_encode_dev(inode->i_rdev));
                raw_inode->i_block[2] = 0;
            }
        } else
            for (n = 0; n < EXT2_N_BLOCKS; n++)
                raw_inode->i_block[n] = ei->i_data[n];

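28. Else, if it is a special file type (a character or block device), we set its attributes in a different manner: the encoded device number is stored in i_block[] instead of block addresses.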

29. Inode Deletion
Check whether the inode actually exists or not. Record the current time as the deletion time (time keeping is done in this way). Mark this inode dirty and then call the update function.
    void ext2_delete_inode(struct inode *inode)
    {
        if (is_bad_inode(inode))
            goto no_delete;
        EXT2_I(inode)->i_dtime = get_seconds();
        mark_inode_dirty(inode);
        ext2_update_inode(inode, inode_needs_sync(inode));

30. Then free the inode, i.e. we first release the space used by the inode (ext2_truncate()) and then make changes in the inode and block bitmaps (ext2_free_inode()) to reflect these changes in the block group.
        inode->i_size = 0;
        if (inode->i_blocks)
            ext2_truncate(inode);
        ext2_free_inode(inode);
        return;
    no_delete:
        clear_inode(inode);    /* We must guarantee clearing of inode... */
    }

31. The EXT2 Directories

32. Block manipulation
- Aims: avoid fragmentation (block groups) and keep access times low.
- Access: use a logical address inside the block group and translate it using the block group number.
- Allocation: ext2_get_block() -> ext2_alloc_block(), called with an inode pointer and a goal.
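The translation between a block number on the partition and a (block group, offset) pair is plain integer arithmetic. The sketch below uses hypothetical helper names; first_data_block is the number of the first block of group 0 (1 for 1 KiB blocks, 0 otherwise).

```c
#include <assert.h>
#include <stdint.h>

/* Block group that contains the given block. */
uint32_t block_to_group(uint32_t block, uint32_t first_data_block,
                        uint32_t blocks_per_group)
{
    assert(blocks_per_group > 0 && block >= first_data_block);
    return (block - first_data_block) / blocks_per_group;
}

/* Position of the block inside its group (its bit index in the block bitmap). */
uint32_t block_to_offset(uint32_t block, uint32_t first_data_block,
                         uint32_t blocks_per_group)
{
    assert(blocks_per_group > 0 && block >= first_data_block);
    return (block - first_data_block) % blocks_per_group;
}

/* Inverse mapping: from (group, offset) back to the block number. */
uint32_t group_to_block(uint32_t group, uint32_t offset,
                        uint32_t first_data_block, uint32_t blocks_per_group)
{
    assert(offset < blocks_per_group);
    return group * blocks_per_group + offset + first_data_block;
}
```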

33. Block allocation
- The goal is decided by ext2_getblk(). Heuristic:
  1. The block to be allocated is next to the last allocated block.
  2. If not 1, then it is next to some previously allocated block.
  3. If not 2, then it is in the same block group as the inode.
- Search:
  - If the goal is among the preallocated blocks, allocate it.
  - If the goal is free, allocate it, and preallocate up to 8 blocks after it.
  - Else search the next 64 blocks.
  - Then consider all block groups, first for a set of at least 8 blocks, then for single blocks.
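The three-step goal heuristic can be written down as a small decision function. This is a sketch under stated assumptions, not the kernel's code: choose_goal is a hypothetical name, a block number of 0 stands for "not known", and step 3 simply picks the first block of the inode's group.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the block number the allocator should try first. */
uint32_t choose_goal(uint32_t last_alloc_block,   /* 0 if none */
                     uint32_t nearby_prev_block,  /* 0 if none */
                     uint32_t inode_group,
                     uint32_t blocks_per_group,
                     uint32_t first_data_block)
{
    assert(blocks_per_group > 0);

    if (last_alloc_block != 0)
        return last_alloc_block + 1;     /* 1. right after the last allocation */
    if (nearby_prev_block != 0)
        return nearby_prev_block + 1;    /* 2. next to an earlier block of the file */
    /* 3. fall back to the block group that holds the inode */
    return inode_group * blocks_per_group + first_data_block;
}
```

Keeping the goal close to earlier blocks of the same file is what lets sequential reads proceed with few seeks.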

34. Block allocation
- Preallocation: set in the superblock; used to avoid extra disk accesses.
- It is used even if the disk is close to being filled, because it is a big time saver.
- Preallocated blocks are released on truncation, close, or a non-sequential write.
- The corresponding fields in the group descriptor, inode, and block bitmap are updated as well.

35. Filesystem consistency
- Superblock, group descriptors, etc.: all metadata must be consistent with each other.
- e2fsck: the file system consistency checker. It is invoked if the partition was not unmounted before shutdown, or on a timeout: each disk must be checked after a certain number of mounts.
- It will try to ensure that the metadata, from the superblock downwards, is consistent.
- Consistency of data with metadata is not ensured, which is a big problem.
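The conditions under which a check is due can be sketched from the superblock fields mentioned earlier (state, mount count, maximum mount count). The flag values 0x0001 and 0x0002 are the ones ext2 uses for EXT2_VALID_FS and EXT2_ERROR_FS; the function name needs_check and the rule that a non-positive maximum disables the mount-count check are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_VALID_FS 0x0001u   /* cleanly unmounted */
#define EXT2_ERROR_FS 0x0002u   /* errors detected */

/* Returns 1 if the filesystem should be checked before use, 0 otherwise. */
int needs_check(uint16_t s_state, uint16_t mnt_count, int16_t max_mnt_count)
{
    /* this sketch only models the two flags above */
    assert((s_state & ~(EXT2_VALID_FS | EXT2_ERROR_FS)) == 0);

    if (!(s_state & EXT2_VALID_FS))
        return 1;               /* not cleanly unmounted */
    if (s_state & EXT2_ERROR_FS)
        return 1;               /* errors were recorded earlier */
    if (max_mnt_count > 0 && mnt_count >= (uint16_t)max_mnt_count)
        return 1;               /* periodic check is due */
    return 0;
}
```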

36. Possible solutions
- Careful ordering of changes can minimize damage, e.g. increment the link counter of an inode before putting the hard link on the disk.
- This is still not as safe as certain systems require.
- Journaling.

37. Journaling
- A possible solution to the data-metadata consistency problem.
- Log all changes before writing them to disk, keeping the log preferably on a separate partition/disk from the data itself.
- Recovery:
  - System failure before commit to the journal: ignore, as no changes have been made to data or metadata.
  - After commit: make all the changes mentioned in the journal to the filesystem.
- Expensive operation: too many disk writes.
- Another problem: a change may involve many low-level operations, and not all of them may be safeguarded by the journal -> partially copied files etc.
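The recovery rule above (ignore what was not committed, replay what was) can be shown with a toy write-ahead log. Everything here is a simplified model for illustration: the structures and journal_recover are hypothetical, and a single 32-bit word stands in for the contents of a disk block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct log_record {           /* hypothetical: one logged block update */
    uint32_t block;           /* index of the block to overwrite */
    uint32_t value;           /* new contents (one word stands in for a block) */
};

struct log_txn {              /* hypothetical: one group of logged updates */
    const struct log_record *records;
    size_t nrecords;
    int committed;            /* nonzero once the commit reached the journal */
};

/* Replays committed entries in order; returns how many were applied. */
size_t journal_recover(uint32_t *disk, size_t nblocks,
                       const struct log_txn *journal, size_t ntxns)
{
    assert(disk != NULL && journal != NULL);
    size_t applied = 0;
    for (size_t t = 0; t < ntxns; t++) {
        if (!journal[t].committed)
            continue;         /* failure before commit: ignore entirely */
        for (size_t r = 0; r < journal[t].nrecords; r++) {
            const struct log_record *rec = &journal[t].records[r];
            if (rec->block < nblocks)
                disk[rec->block] = rec->value;
        }
        applied++;
    }
    return applied;
}
```

Replaying is safe to repeat: applying the same committed records a second time leaves the disk in the same state.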

38. The ext solution
- Journaling is not a part of ext2, but can be switched on.
- It is a pivotal part of ext3.
- 3 modes to balance performance with safety:
  - Journal: safest; logs all data and metadata before the write.
  - Ordered (default): logs only metadata, but groups metadata with related data and writes data to disk before metadata, because metadata will be restored from the log.
  - Writeback: logs only metadata; similar to the journaling mode found in other filesystems.

39. Ext3 journalling
- Uses the JBD (Journaling Block Device) layer in the kernel.
- Intended as general journaling support; currently used only by ext3.

40. JBD logging
Log structure:
- Log record: describes a single update of a disk block in the filesystem.
- Atomic operation handle: includes the log records relative to a single high-level change of the filesystem.
- Transaction: includes several atomic operations; the basic unit for fsck retrieval.

41. Log record
- Description of a low-level operation to be executed by the system.
- Represented as normal blocks of data, each marked with a journal_block_tag_t tag that saves the logical block number affected and status flags.
- A journal_head is attached to the buffer head of each block the JBD handles, including data blocks whose write order must be maintained in ordered mode.

42. Atomic operations
- A group of log records.
- journal_start() indicates the start, journal_stop() indicates the end.
- Ensures that the intended operations are never executed only in part: either all of them take effect or none.

43. Transaction
- A grouping of consecutive atomic operations.
- MUST be stored in consecutive blocks.
- After creation, a transaction ends if:
  - a fixed timeout expires, typically 5 seconds (set by the fs), or
  - no free blocks are left in the journal for a new atomic operation handle.
- Described by a descriptor of type transaction_t.

44. Transaction
- State is described by t_state.
- Only complete transactions are processed for recovery: all log records included in the transaction have been physically written into the journal, and t_state stores T_FINISHED.
- Incomplete transactions are skipped by fsck. Possible t_state values:
  - T_RUNNING: still accepting atomic operation handles.
  - T_LOCKED: not accepting new operation handles, but some are incomplete.
  - T_FLUSH: all atomic operation handles have finished, but some log records are still being written to the journal.
  - T_COMMIT: all log records written to disk; transaction to be marked complete.
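The state list above boils down to two questions about a transaction: may it still take new handles, and will recovery replay it? The sketch below encodes just that. The enumerator names follow the slide; their numeric values, the enum name, and the two helper functions are assumptions for illustration and are not taken from the JBD headers.

```c
#include <assert.h>

enum txn_state {      /* numeric values are arbitrary in this sketch */
    T_RUNNING,        /* still accepting atomic operation handles */
    T_LOCKED,         /* no new handles, some still incomplete */
    T_FLUSH,          /* handles done, log records still being written */
    T_COMMIT,         /* all log records on disk, about to be marked complete */
    T_FINISHED        /* complete: eligible for recovery */
};

/* Only a running transaction may take new atomic operation handles. */
int accepts_new_handles(enum txn_state s)
{
    assert(s >= T_RUNNING && s <= T_FINISHED);
    return s == T_RUNNING;
}

/* Recovery processes complete transactions only and skips the rest. */
int replayed_on_recovery(enum txn_state s)
{
    assert(s >= T_RUNNING && s <= T_FINISHED);
    return s == T_FINISHED;
}
```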

45. JBD functioning
- At any time there can be several transactions in the journal, but only one may be incomplete.
- A completed transaction is removed from the journal after JBD verifies that all buffers referred to by its log records have been successfully written to disk.
