1. The EXT2 File System Presented by:
S. Arun Nair
Abhinav Golas
2. The Second Extended File System The Second Extended File system was devised (by Rémy Card) as an extensible and powerful file system for Linux.
It is also the most successful file system so far in the Linux community and is the basis for all of the currently shipping distributions.
Due to this, it is extremely well integrated into the kernel, with good performance enhancements.
3. Disk Layout Partition Table
Master Boot Record (MBR)
List of partitions:
Primary Partitions
Extended Partitions
How to specify
Block size for partition table
Beginning block
Ending block
4. Filesystem Way to organize data on 1 or multiple partitions
Basic abstraction for file usage
5. Ext2 File System Layout
6. Partition Layout – ext2 The Boot sector block is optional, not required if you do not want to make this partition bootable
Each Block group has the same number of available data blocks and inodes
Having multiple block groups helps counter fragmentation, improves reliability (since backups of the superblock are there) and even speeds up access as the inode table is near the data blocks – reduced seek time for data blocks
7. Partition layout – ext2 Each block group has the following structure
Again, not all block groups have the superblock. The first block group, however, must have it, and it is the one used by the kernel. The others are backups to be used by filesystem checkers for consistency checks.
8. Some definitions Boot sector – Block which may contain the stage 1 boot loader and which points to the stage 1.5 or stage 2 boot loader
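Superblock – The filesystem header, identifies and represents the filesystem and provides relevant information about the fs. The primary copy always starts at byte offset 1024 of the partition, leaving the first 1024 bytes for an optional boot sector; this is block 1 when the block size is 1 KiB and inside block 0 for larger block sizes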
FS/Group descriptor – Pointers to the bitmaps and table in the block group
9. Some definitions Block bitmap – Block usage information, tells which blocks in the block group are free (0) or used (1)
Inode Bitmap – Inode usage information
Inode table – Table of the inodes. Each inode provides necessary and relevant information about each file.
Data blocks – blocks where the data is stored!
10. The Ext2 Superblock The Superblock contains a description of the basic size and shape of this file system.
System keeps multiple copies of the Superblock in many Block Groups.
It holds the following information :
Magic Number : 0xef53 for the current implementation.
Revision Level : for checking compatibility
Mount Count and Maximum Mount Count : to ensure that the filesystem is periodically checked
Block Group Number : The Block Group that holds this copy of Superblock.
Block Size : size of block for the file system in bytes.
11. The Ext2 Superblock Blocks per Group : fixed when file system is created – the block bitmap must fit into 1 block, hence number of blocks per group = 8*block size
Free Blocks : Number of free blocks in the system – excludes the blocks reserved for root
Free Inodes : Number of free Inodes in the system – again excludes inodes reserved for root
First Inode : The first Inode in an EXT2 root file system would be the directory entry for the '/' directory.
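The "8 * block size" rule above can be checked with a little arithmetic. The following is a minimal sketch; the function names are made up for illustration, and the group count ignores details such as the first data block offset that a real mkfs accounts for.

```c
#include <assert.h>
#include <stdint.h>

/* The block bitmap occupies exactly one block and uses one bit per
 * block, so a group can describe at most 8 * block_size blocks. */
static uint32_t max_blocks_per_group(uint32_t block_size)
{
    return 8u * block_size;
}

/* Number of block groups needed to cover total_blocks, rounded up.
 * Simplification: assumes the groups start at block 0. */
static uint32_t group_count(uint32_t total_blocks, uint32_t blocks_per_group)
{
    assert(blocks_per_group > 0);
    return (total_blocks + blocks_per_group - 1) / blocks_per_group;
}
```

With 4 KiB blocks this gives 32768 blocks per group, i.e. each group spans 128 MiB of the partition.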
12. Superblock Defined in include/linux/fs.h
13. The Ext2 Group Descriptor All the group descriptors for all of the Block Groups are duplicated in each Block Group in case of file system corruption.
The Group Descriptor contains the following:
Blocks Bitmap : block number of block allocation bitmap
Inode Bitmap : block number of Inode allocation bitmap
Inode Table : The block number of the starting block for the Inode table for this Block Group.
Free blocks count : number of data blocks free in the Group
Free Inodes count : number of Inodes free in the Group
Used directory count : number of inodes allocated to directories
14. The superblock usage sequence Mount – the VFS sets the s_state variable to EXT2_ERROR_FS if the filesystem is mounted rw. At all other times it is at EXT2_VALID_FS, so the value found at mount time shows whether the last unmount was clean.
Cached copies of this superblock and the group descriptor are always kept.
Most VFS superblock operations are inherited for ext2
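The clean mount/unmount check can be pictured with a toy model. This is only a sketch under assumptions: the struct and the toy_ functions are hypothetical, and the state is modelled as a "valid" bit that is cleared while the filesystem is mounted read-write, which simplifies the description above. The two flag values are the ones ext2 uses for s_state.

```c
#include <stdint.h>

#define EXT2_VALID_FS 0x0001 /* unmounted cleanly */
#define EXT2_ERROR_FS 0x0002 /* errors detected */

/* Hypothetical stand-in for the on-disk superblock state field. */
struct toy_super {
    uint16_t s_state;
};

/* Returns 1 if a checker should run before trusting the filesystem. */
static int toy_needs_check(const struct toy_super *sb)
{
    return !(sb->s_state & EXT2_VALID_FS) || (sb->s_state & EXT2_ERROR_FS);
}

/* Read-write mount: drop the valid bit until a clean unmount. */
static void toy_mount_rw(struct toy_super *sb)
{
    sb->s_state = (uint16_t)(sb->s_state & ~EXT2_VALID_FS);
}

/* Clean unmount: restore the valid bit. */
static void toy_unmount(struct toy_super *sb)
{
    sb->s_state = (uint16_t)(sb->s_state | EXT2_VALID_FS);
}
```

If the machine crashes between toy_mount_rw() and toy_unmount(), the next mount sees the valid bit missing and knows a consistency check is due.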
15. The Ext2 Inode
16. The Ext2 Inode Direct/Indirect Blocks : Pointers to the blocks that contain the data that this Inode is describing.
Timestamp: The time that the Inode was created and the last time that it was modified.
Size : The size of the file in bytes.
Owner info : This stores user and group identifiers of the owners of this file or directory
Mode : This holds two pieces of information: what this inode describes and the permissions that users have to it.
17. struct inode {
kdev_t i_dev;
unsigned long i_ino;
umode_t i_mode;
nlink_t i_nlink;
/* uid, gid, etc. */
};
Inodes are managed as doubly linked lists as well as a hash table.
The iget() function can be used to get the inode identified by a superblock and an inode number. It uses hints to resolve cross-mounted file systems as well. Any access to an inode increments a usage counter.
18. Inode Allocation There are two policies for allocating an inode. If the new inode is a directory, then a forward search is made for a block group with both free space and a low directory-to-inode ratio (find_group_dir); if that fails, then of the groups with above-average free space, that group with the fewest directories already is chosen (find_group_orlov). For other inodes, search forward from the parent directory's block group to find a free inode (find_group_other).
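The two policies can be sketched in user space as follows. This is a simplification, not the kernel's code: struct group_info and both function names are hypothetical, free inodes stand in for "free space", and only the "above average, fewest directories" step of the directory policy is shown.

```c
#include <stddef.h>

/* Hypothetical per-group summary; ext2 keeps these counts in the
 * group descriptors. */
struct group_info {
    unsigned free_inodes;
    unsigned used_dirs;
};

/* Directory policy (simplified): among the groups with at least the
 * average number of free inodes, pick the one with the fewest
 * directories. Returns -1 if no inode is free anywhere. */
static int pick_group_for_dir(const struct group_info *g, size_t ngroups)
{
    unsigned long total_free = 0;
    for (size_t i = 0; i < ngroups; i++)
        total_free += g[i].free_inodes;
    if (ngroups == 0 || total_free == 0)
        return -1;

    unsigned long avg = total_free / ngroups;
    int best = -1;
    for (size_t i = 0; i < ngroups; i++) {
        if (g[i].free_inodes == 0 || g[i].free_inodes < avg)
            continue;
        if (best < 0 || g[i].used_dirs < g[best].used_dirs)
            best = (int)i;
    }
    return best;
}

/* Policy for other inodes: search forward, with wrap-around, from
 * the parent directory's group until a free inode is found. */
static int pick_group_for_file(const struct group_info *g, size_t ngroups,
                               size_t parent_group)
{
    for (size_t step = 0; step < ngroups; step++) {
        size_t i = (parent_group + step) % ngroups;
        if (g[i].free_inodes > 0)
            return (int)i;
    }
    return -1;
}
```

Spreading directories across groups while keeping files close to their parent directory is what keeps related inodes and data near each other on disk.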
19. struct inode *ext2_new_inode(struct inode *dir, int mode) {
….
if (S_ISDIR(mode)) {
if (test_opt(sb, OLDALLOC))
group = find_group_dir(sb, dir);
else
group = find_group_orlov(sb, dir);
} else
group = find_group_other(sb, dir);
….
loop (through all N block groups, starting with the one computed above)
find the first zero bit in the group's inode bitmap
if no bit is zero then group = (group+1) % N ; continue;
else try to set that bit to 1 atomically; if it has already been set to 1 by another allocator then {
20. if the bitmap has no more zero bits then group = (group+1) % N
else find the next zero bit and try to set it to 1 again
}
. . . . . . .
Set all the inode parameters from the mode information and from the parent directory.
if (test_opt (sb, GRPID))
inode->i_gid = dir->i_gid;
else if (dir->i_mode & S_ISGID) {
inode->i_gid = dir->i_gid;
if (S_ISDIR(mode))
mode |= S_ISGID;
} else
inode->i_gid = current->fsgid;
21. insert_inode_hash(inode);
. . . . . .
ext2_preread_inode(inode);
return inode;
We perform asynchronous prereading of the new inode's inode block when we create the inode, in the expectation that the inode will be written back soon. There are two reasons:
When creating a large number of files, the async prereads will be nicely merged into large reads
When writing out a large number of inodes, we don't need to keep on stalling the writes while we read the inode block.
}
22. Inode De-allocation When we get the inode, we're the only people that have access to it, and as such there are no race conditions we have to worry about. The inode is not on the hash-lists, and it cannot be reached through the file system because the directory entry has been deleted earlier.
23. void ext2_free_inode (struct inode * inode) {
we must free any quota before locking the superblock, as writing the quota to disk may need the lock as well.
. . . . .
if (!is_bad_inode(inode)) {
ext2_xattr_delete_inode(inode);
DQUOT_FREE_INODE(inode);
DQUOT_DROP(inode);
}
. . . . . . .
24. We must make sure that we get no aliases, which means that we have to call "clear_inode()" _before_ we mark the inode not in use in the inode bitmaps. Otherwise a newly created file might use the same inode number (not actually the same pointer though), and then we'd have two inodes sharing the same inode number and space on the hard disk.
. . . .
clear_inode (inode);
if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
bit, (void *) bitmap_bh->b_data))
ext2_error (sb, "ext2_free_inode",
"bit already cleared for inode %lu", ino);
else
ext2_release_inode(sb, block_group, is_directory);
25. mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & MS_SYNCHRONOUS)
sync_dirty_buffer(bitmap_bh);
error_return:
brelse(bitmap_bh);
}
26. Inode Update First, get a pointer to the buffer head and to the in-memory copy of the on-disk inode using ext2_get_inode(pointer to the superblock, inode number, pointer to the buffer head pointer)
. . . . . . .
struct ext2_inode * raw_inode = ext2_get_inode(sb, ino, &bh);
. . . . . . .
Then update the raw inode from the given in-memory inode. The details of this update depend on the type of file the inode represents
raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
raw_inode->i_frag = ei->i_frag_no;
raw_inode->i_fsize = ei->i_frag_size;
27. If it is a regular file, we copy the attributes as well as the addresses of the blocks containing data into the i_block[] array of the raw_inode.
. . . . . .
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
if (old_valid_dev(inode->i_rdev)) {
raw_inode->i_block[0] =
cpu_to_le32(old_encode_dev(inode->i_rdev));
raw_inode->i_block[1] = 0;
} else {
raw_inode->i_block[0] = 0;
raw_inode->i_block[1] =
cpu_to_le32(new_encode_dev(inode->i_rdev));
raw_inode->i_block[2] = 0;
}
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
raw_inode->i_block[n] = ei->i_data[n];
28. Otherwise, if it is a special file (a character or block device), the encoded device number is stored in i_block[] instead of block addresses.
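The device-number handling in the snippet above can be restated in user space. The sketch below takes major and minor numbers directly instead of a dev_t; the encodings follow the kernel helpers of the same purpose as I recall them from include/linux/kdev_t.h, so treat the exact bit layout as an assumption to verify against your kernel tree. The shortened names and store_rdev are made up for illustration, and the little-endian conversion (cpu_to_le32) is left out.

```c
#include <stdint.h>

/* Old format fits only 8-bit major and 8-bit minor numbers. */
static int old_valid(unsigned major, unsigned minor)
{
    return major < 256 && minor < 256;
}

static uint16_t old_encode(unsigned major, unsigned minor)
{
    return (uint16_t)((major << 8) | minor);
}

/* New format: low 8 minor bits, then the major, then the remaining
 * minor bits shifted above the major. */
static uint32_t new_encode(unsigned major, unsigned minor)
{
    return (minor & 0xffu) | (major << 8) | ((minor & ~0xffu) << 12);
}

/* Mirrors the branch structure of the update code: old-style numbers
 * go into slot 0, new-style numbers into slot 1. */
static void store_rdev(uint32_t i_block[3], unsigned major, unsigned minor)
{
    if (old_valid(major, minor)) {
        i_block[0] = old_encode(major, minor);
        i_block[1] = 0;
    } else {
        i_block[0] = 0;
        i_block[1] = new_encode(major, minor);
        i_block[2] = 0;
    }
}
```

Keeping slot 0 for the old encoding means that small device numbers stay readable by older kernels, while a zero in slot 0 signals that the new encoding in slot 1 should be used.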
29. Inode Deletion Check whether the inode is valid. Record the current time as the deletion time (this is how the deletion timestamp is kept).
Mark this inode dirty and then call the update function.
void ext2_delete_inode (struct inode * inode)
{
if (is_bad_inode(inode))
goto no_delete;
EXT2_I(inode)->i_dtime = get_seconds();
mark_inode_dirty(inode);
ext2_update_inode(inode, inode_needs_sync(inode));
30. Then free the inode, i.e. we first release the space used by the file (ext2_truncate()) and then make changes in the inode and block bitmaps (ext2_free_inode()) to reflect these changes in the block group.
inode->i_size = 0;
if (inode->i_blocks)
ext2_truncate (inode);
ext2_free_inode (inode);
return;
no_delete:
clear_inode(inode); /* We must guarantee clearing of inode... */
}
31. The EXT2 Directories
32. Block manipulation Aims:
Avoid fragmentation
Block groups
Low access times
Access : use a logical block address relative to the start of the block group, and translate it to an absolute block number using the block group number
Allocation : ext2_get_block() -> ext2_alloc_block() called with inode pointer and a goal.
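The translation between an absolute block number and a (group, offset) pair is plain integer arithmetic. A minimal sketch; the struct and function names are hypothetical, and first_data_block stands for the block at which group 0 begins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair: which group a block is in, and where inside it. */
struct block_location {
    uint32_t group;
    uint32_t offset;
};

/* Absolute block number -> (group, offset within group). */
static struct block_location locate_block(uint32_t block,
                                          uint32_t first_data_block,
                                          uint32_t blocks_per_group)
{
    assert(blocks_per_group > 0 && block >= first_data_block);
    struct block_location loc;
    loc.group = (block - first_data_block) / blocks_per_group;
    loc.offset = (block - first_data_block) % blocks_per_group;
    return loc;
}

/* (group, offset) -> absolute block number. */
static uint32_t absolute_block(struct block_location loc,
                               uint32_t first_data_block,
                               uint32_t blocks_per_group)
{
    return first_data_block + loc.group * blocks_per_group + loc.offset;
}
```

The offset is also the bit index of the block in that group's block bitmap.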
33. Block allocation Goal decided by ext2_getblk()
Heuristic:
The block to be allocated is next to the last allocated block
If that is not possible, it is next to some previously allocated block
If that is not possible either, it is in the same block group as the inode
Search:
If the goal is among the preallocated blocks, allocate it
If the goal is free, allocate it – and preallocate up to 8 blocks after it
Else search the next 64 blocks
Finally, consider all block groups, first looking for a run of at least 8 free blocks, then for single blocks
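The search order within one group can be sketched against a plain bitmap. This is a simplified illustration, not the kernel's allocator: the function names are made up, preallocation and the cross-group search for runs of 8 blocks are left out, and the final fallback is just a scan of the whole group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int bitmap_test(const uint8_t *bitmap, size_t bit)
{
    return (bitmap[bit / 8] >> (bit % 8)) & 1;
}

static void bitmap_set(uint8_t *bitmap, size_t bit)
{
    bitmap[bit / 8] = (uint8_t)(bitmap[bit / 8] | (1u << (bit % 8)));
}

/* Allocate one block in a group of nbits blocks, preferring the goal.
 * Returns the allocated bit index, or -1 if the group is full. */
static long alloc_near_goal(uint8_t *bitmap, size_t nbits, size_t goal)
{
    assert(bitmap != NULL);
    if (goal >= nbits)
        goal = 0;

    /* The goal itself, then the next 64 blocks after it. */
    for (size_t b = goal; b < nbits && b <= goal + 64; b++) {
        if (!bitmap_test(bitmap, b)) {
            bitmap_set(bitmap, b);
            return (long)b;
        }
    }
    /* Fallback: any free block in the group. */
    for (size_t b = 0; b < nbits; b++) {
        if (!bitmap_test(bitmap, b)) {
            bitmap_set(bitmap, b);
            return (long)b;
        }
    }
    return -1;
}
```

Trying the goal and its near neighbours first is what keeps consecutive blocks of a file physically close; the wider searches only run when that locality cannot be had.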
34. Block allocation Preallocation :
Set in superblock, used to avoid extra disk accesses
Used even if disk is close to being filled – because big time saver
Preallocated blocks are released on truncation, close or a non-sequential write
Also, corresponding fields in group descriptor, inode, block bitmap are updated
35. Filesystem consistency Superblock, group descriptors etc. – all metadata must be consistent with each other
e2fsck – file system consistency checker, invoked if the partition was not cleanly unmounted before shutdown, or when the maximum mount count is reached: each disk must be checked after a certain number of mounts
Will try to ensure that the metadata, superblock downwards, is consistent.
Consistency of data with metadata is not ensured – big problem
36. Possible solutions Careful ordering of changes can minimize damage, e.g. increment the link counter of an inode before putting the hard link on the disk
Still not completely safe, as is required for certain systems
Journaling
37. Journaling Possible solution to data-metadata consistency problem
Log all changes before writing them to disk, preferably keeping the log on a separate partition/disk from the data itself
Recovery:
System failure before commit to journal – ignore, as no changes have been made to data or metadata
After commit – make all changes mentioned to filesystem
Expensive operation – too many disk writes
Another problem – a change may involve many low level ops – all may not be safeguarded by journal -> partially copied files etc.
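The recovery rule above (ignore what was never committed, redo what was) can be shown with a toy in-memory journal. Everything here is hypothetical and only illustrates the idea: the "disk" is an array of integers, each record carries the new contents of one block, and a flag stands in for the commit marker.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_DISK_BLOCKS 16

/* One logged change: the new contents of one block. */
struct toy_record {
    uint32_t block;
    uint32_t value;
};

/* A group of changes that must be applied together. */
struct toy_txn {
    int committed;               /* commit marker reached the journal */
    size_t nrecords;
    struct toy_record records[4];
};

/* Replay after a crash: redo every committed transaction in order and
 * skip any that never committed. Returns the number replayed. */
static size_t toy_replay(const struct toy_txn *journal, size_t ntxns,
                         uint32_t disk[TOY_DISK_BLOCKS])
{
    size_t replayed = 0;
    for (size_t t = 0; t < ntxns; t++) {
        if (!journal[t].committed)
            continue; /* nothing from it was applied, so ignore it */
        for (size_t r = 0; r < journal[t].nrecords; r++) {
            const struct toy_record *rec = &journal[t].records[r];
            if (rec->block < TOY_DISK_BLOCKS)
                disk[rec->block] = rec->value;
        }
        replayed++;
    }
    return replayed;
}
```

Replaying a committed transaction twice leaves the same result, which is why recovery can simply be restarted if the system fails again during replay.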
38. The ext solution Journaling is not a part of ext2 itself, but a journal can be added to an existing ext2 filesystem
A pivotal part of ext3
3 modes to balance performance with safety:
Journal – safest, log all data and metadata before write
Ordered (default) – log only metadata, but group metadata with its related data and write the data to disk before the metadata, since the metadata can then be restored from the log
Writeback – log only metadata, mode similar to journalling mode found in other filesystems
39. Ext3 journalling Uses JBD layer in kernel – Journaling Block Device layer
Intended as general journaling support, currently used only by ext3
40. JBD logging Log structure:
Log record – describes single update of disk block in filesystem
Atomic operation handle – includes log records relative to a single high-level change of filesystem
Transaction – includes several atomic operations, basic unit for fsck retrieval
41. Log record Description of low-level operation to be executed by system
Represented as normal blocks of data, marked with journal_block_tag_t tag – saves logical block number affected, and status flags
A journal_head is attached to the buffer head when write ordering must be maintained, as in ordered mode
42. Atomic operations Group of log records
journal_start() indicates start, journal_stop() indicates end
Ensures that the system never executes only a subset of the intended operations: either all of them are applied or none
43. Transaction A grouping of consecutive atomic operations
MUST be stored in consecutive blocks
After creation, end if:
Fixed timeout, typically 5 seconds (fs set)
No free blocks left in journal for new atomic operation handle
Described by descriptor of type transaction_t
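The two conditions that end a transaction can be written as a single predicate. A minimal sketch with hypothetical names; the real JBD tracks this inside transaction_t and the journal, and the 5 second interval is only the typical value mentioned above.

```c
#include <stdint.h>

#define TOY_COMMIT_INTERVAL_SECS 5 /* typical timeout from the text */

/* Hypothetical summary of a running transaction. */
struct toy_transaction {
    uint64_t started_at;          /* creation time, in seconds */
    uint32_t journal_blocks_free; /* space left in the journal */
};

/* Returns 1 if the transaction should stop accepting new atomic
 * operation handles: it is too old, or the journal cannot hold the
 * blocks the next handle needs. */
static int toy_should_close(const struct toy_transaction *t, uint64_t now,
                            uint32_t blocks_needed)
{
    if (now - t->started_at >= TOY_COMMIT_INTERVAL_SECS)
        return 1;
    if (t->journal_blocks_free < blocks_needed)
        return 1;
    return 0;
}
```

The timeout bounds how much work can be lost in a crash, while the space check keeps a single transaction from outgrowing the journal.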
44. Transaction State described by t_state
Only complete transactions are processed for recovery – all log records included in transaction have been physically written into journal – t_state stores T_FINISHED
Incomplete transactions – skipped by fsck. Possible t_state values
T_RUNNING – still accepting atomic operation handles
T_LOCKED – Not accepting new op handles, but some are incomplete
T_FLUSH - All atomic op handles have finished, but some log records are being written to journal
T_COMMIT – all log records written to disk, transaction to be marked complete
45. JBD functioning At any time, there can be several transactions in journal, but only 1 may be incomplete
A completed transaction is removed from the journal after JBD verifies that all buffers referred to by its log records have been successfully written to disk