Advanced File System Chapter Seven
Topics • Features • Architecture • On Disk Format • BAS • FAS • Logging and Transactions • In-Memory Structures
Features • Flexible Storage Assignment through Domains • Ease of Administration • Extent Based File Storage Allocation • Sequential Access Performance • Logging • Reliable Operation and Fast Reboot • File System Read-Only Clones • On-line Backup • File Striping • Performance • Trashcan Directories • User controlled restore utility
File Domains and Filesets File Domain Volumes (Disk Partitions) Filesets Filesets != Partitions
Volumes • Physical Storage Building Blocks for a File Domain • Any Logical UNIX Block Device • "Real" Disk Partition • Hardware RAID Logical Disk • LSM Volume • Administered from /etc/fdmns
/etc/fdmns/users_dmn:
total 0
lrwxrwxrwx 1 root system 10 Jun 1 21:50 dsk5c -> /dev/disk/dsk5c
lrwxrwxrwx 1 root system 9 Jun 1 21:50 dsk8e -> /dev/disk/dsk8e
Filesets • A file/directory tree mapped to a domain • Created with mkfset or dxadvfs • Mounted like a filesystem • Administered from the /etc/fstab file
users_domain#users_fs /users advfs rq,userquota
Extent Based Storage [Figure: a logical file is divided into extents (extent 1, extent 2); the extent map locates each extent in disk space.]
Extents • Set of Contiguous Pages • AdvFS attempts to write a file with a few large extents
# showfile -x /usr/bin/X11/dxadvfs
ID         Vol  PgSz  Pages  XtntType  Segs  SegSz  Log  Perf  File
4714.8001    1    16    812    simple    **     **  off  100%  dxadvfs
extentMap: 1
pageOff  pageCnt  vol  volBlock  blockCnt
      0      812    1   1560816     12992
extentCnt: 1
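The numbers in the showfile output above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming that the PgSz column counts 512-byte disk blocks per page (16 blocks = 8K) and that blockCnt counts the same 512-byte blocks; the function names are illustrative and not part of AdvFS.

```python
SECTOR_BYTES = 512   # size of one disk block (assumed unit of PgSz and blockCnt)
PAGE_SECTORS = 16    # PgSz column from the showfile output: blocks per page


def extent_block_count(page_cnt, page_sectors=PAGE_SECTORS):
    """Number of 512-byte blocks covered by an extent of page_cnt pages."""
    return page_cnt * page_sectors


def extent_bytes(page_cnt, page_sectors=PAGE_SECTORS):
    """Number of bytes covered by an extent of page_cnt pages."""
    return extent_block_count(page_cnt, page_sectors) * SECTOR_BYTES


# The single extent of dxadvfs: 812 pages starting at volume block 1560816.
print(extent_block_count(812))  # 12992, matching the blockCnt column
print(extent_bytes(812))        # 6651904 bytes of allocated storage
```

One extent covering all 812 pages is the case the slide describes as ideal: the whole file is contiguous on disk.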
Why Logging • Many file system operations involve several widely separated writes to disk • a transaction • A crash between them leaves the on-disk file system inconsistent - see UFS and fsck
Logging a Transaction [Figure: numbered steps 1 to 6 connecting a directory entry ("log" -> tagN), the Tag Directory, the Bitfile Metadata Table, and the Log, ending with the commit.]
AdvFS Logging • AdvFS Transaction • Modifications to its own Metadata (internal structures) • NOT user file data • For each transaction AdvFS • Writes a series of log records describing all changes for an operation to disk and then • Performs changes (writes changed blocks to disk) • In case of crash • On reboot • On-disk log indicates which transactions are complete.
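The log-then-write ordering described above can be illustrated with a toy model. This is a minimal sketch, not AdvFS code: the class, its record format, and the choice to replay committed transactions on recovery are assumptions made for illustration.

```python
class ToyJournal:
    """Toy write-ahead log for metadata changes (hypothetical, not AdvFS)."""

    def __init__(self):
        self.log = []    # stands in for the on-disk log
        self.disk = {}   # stands in for metadata blocks on disk

    def log_transaction(self, txn_id, changes):
        """Write one record per change, then a commit record."""
        for block, value in changes.items():
            self.log.append(("update", txn_id, block, value))
        self.log.append(("commit", txn_id))

    def apply(self, changes, crash_after=None):
        """Write changed blocks; optionally stop early to simulate a crash."""
        for i, (block, value) in enumerate(changes.items()):
            if crash_after is not None and i >= crash_after:
                return False
            self.disk[block] = value
        return True

    def recover(self):
        """Replay updates of transactions whose commit record is in the log."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "update" and rec[1] in committed:
                _, _, block, value = rec
                self.disk[block] = value


journal = ToyJournal()
changes = {
    "tag_dir": "tag 7 -> mcell 3",
    "bmt": "mcell 3 in use",
    "directory": "foo -> tag 7",
}
journal.log_transaction(1, changes)
journal.apply(changes, crash_after=1)                      # crash after one write
journal.log.append(("update", 2, "sbm", "cluster 9 used"))  # txn 2 never commits
journal.recover()
print(journal.disk == changes)  # True: txn 1 is complete, txn 2 is ignored
```

Because only metadata goes through the log, a model like this says nothing about user file data, which matches the "NOT user file data" bullet.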
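Fileset Clones [Figure: an application reads and writes the original fileset in the domain while a backup tool reads the clone. Three stages are shown: (1) after clone is created, before any writes; (2) first write to a block in the original (master) fileset, which triggers copy-on-write (COW); (3) access to COW write blocks in the cloned fileset.]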
Clone Issues • Apps shouldn't be writing to the master when the clone is created. • Fortunately cloning time is < 1 second due to COW • A clone != a backup. • A clone is a tool for minimizing down time for a fileset due to backups • make clone of fileset • backup from clone • delete clone
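The copy-on-write behaviour behind clones can be sketched in a few lines. This is a toy model under stated assumptions: the class names are hypothetical, blocks are dictionary entries, and only overwrites of blocks that existed when the clone was made are modelled.

```python
class ToyFileset:
    """Toy master fileset with at most one copy-on-write clone (hypothetical)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.clone = None

    def make_clone(self):
        # Creating the clone copies no data, which is why it is fast.
        self.clone = ToyClone(self)
        return self.clone

    def write(self, block_no, data):
        # First write to a block after cloning saves the old contents first.
        if self.clone is not None:
            self.clone.preserve(block_no)
        self.blocks[block_no] = data

    def delete_clone(self):
        self.clone = None


class ToyClone:
    """Read-only view of the master as it was when the clone was created."""

    def __init__(self, master):
        self.master = master
        self.saved = {}  # block number -> contents at clone time

    def preserve(self, block_no):
        if block_no not in self.saved and block_no in self.master.blocks:
            self.saved[block_no] = self.master.blocks[block_no]

    def read(self, block_no):
        # COW blocks come from the clone; untouched blocks from the master.
        if block_no in self.saved:
            return self.saved[block_no]
        return self.master.blocks[block_no]


master = ToyFileset({0: "old-0", 1: "old-1"})
clone = master.make_clone()
master.write(0, "new-0")
print(clone.read(0), clone.read(1))  # old-0 old-1
print(master.blocks[0])              # new-0
master.delete_clone()                # after the backup has been taken
```

The clone shares every unmodified block with the master, so it is not an independent copy of the data, which is the point of "a clone != a backup".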
File Striping [Figure: a file's numbered segments (1, 2, 3, 4, 5, ..) distributed across the volumes of a domain.]
Trashcan Directories [Figure: files removed with rm are moved (mv) into trash cans.]
Commands • File Domains: mkfdmn, addvol, rmvol, balance, defragment • Filesets: mkfset, chfsets, clonefset • Files: migrate, stripe, mktrashcan
AdvFS Architecture [Figure: layer stack, top to bottom] • VFS (VFS operations, vnode operations) • File Access Subsystem (FAS) • Bitfile Access Subsystem (BAS): Domains and Volumes, Bitfiles, Transaction Management • Block Device Interface
AdvFS Components (1 of 2) • File Access Subsystem (FAS) - the POSIX file system layer in AdvFS, translates VFS file system requests into BAS requests. Components: • Mount, Unmount, Initialization • Directory operations (lookup, create, delete) • File operations (create, read, write, stat, delete, rename) • Bitfile Access Subsystem (BAS) - the bitfile layer in AdvFS. Components: • Domain ops (create, delete, open, close) • Bitfile set ops (create, delete, clone, open, close) • Bitfile ops (create, delete, open, close, migrate, read, write, add & remove stg) • Transaction management ops (start, stop, fail, pin pg, pin record, locking, recovery)
AdvFS Components (2 of 2) • Buffer cache ops (pin & unpin page, ref & deref page, flush bitfile, flush cache, prefetch pages, I/O queuing) • Volume ops (add, remove) • Hierarchical Storage Management ops (shelve, unshelve)
Acronyms and prefixes in AdvFS code
fs_ - FAS layer routines
bs_ - BAS layer routines
ms_ - BAS layer routines
msfs_ - BAS layer routines (most are VFS vnode or fs ops)
advfs_ - BAS layer routine (replaces msfs_ and ms_, which stood for MegaSafe File System)
ftx_ - Transaction manager routines
rbf_ - Recoverable versions of BAS routines (rbf_pinpg() vs bs_pinpg())
lk_ - Lock manager routines
lgr_ - Log manager routines
mss_ - HSM shelving code (mss == MegaSafe Shelving)
ter_ - Tertiary storage management
imm_ - In-memory extent map management
BAS On Disk Format: Everything is a Bitfile • Logical File: 8K pages on disk, stored as extents (extent 1, extent 2) • (Primary) mcell, 292 bytes: contains variable sized records such as POSIX attributes (owner, group, size, mod bits, ....) and extent map records • Additional mcell(s): optional, can contain more extent map records if needed
Bitfiles • Array of 8k Pages • Stored as extents • groups of on-disk contiguous 8k pages • managed by extent maps • Identified by a tag • tag.sequence, e.g. 4714.8001 • tag number ~inode number • sequence number ~generation number • All sectors are free or in a bitfile • Represented by one or more mcells
mcells • The "inodes" of AdvFS • 28 fixed-size (292 byte) mcells are packed into each 8K page • One or more linked mcells describe a bitfile • First mcell in list is primary mcell • Each mcell contains variably-sized records describing attributes of the bitfile • Contained in the Bitfile Metadata Table (BMT) • However, the mcells that describe reserved bitfiles, such as the BMT itself, are contained in the RBMT (Reserved BMT)
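The packing figures on this slide can be checked directly: 28 mcells of 292 bytes use 8176 bytes, which leaves 16 bytes of an 8K page. The sketch below assumes those 16 bytes are the page header and that it precedes the mcells; the linear mcell index and the function names are hypothetical conveniences, not AdvFS identifiers.

```python
PAGE_BYTES = 8192
MCELL_BYTES = 292
MCELLS_PER_PAGE = 28

# Space left over once 28 mcells are packed into a page
# (assumed here to be the page header).
HEADER_BYTES = PAGE_BYTES - MCELLS_PER_PAGE * MCELL_BYTES


def mcell_location(index):
    """Split a hypothetical linear mcell index into (page, slot within page)."""
    return divmod(index, MCELLS_PER_PAGE)


def mcell_byte_offset(slot):
    """Byte offset of a slot inside its page, assuming the header comes first."""
    if not 0 <= slot < MCELLS_PER_PAGE:
        raise ValueError("slot must be in 0..27")
    return HEADER_BYTES + slot * MCELL_BYTES


print(HEADER_BYTES)           # 16
print(mcell_location(30))     # (1, 2): page 1, third mcell on that page
print(mcell_byte_offset(27))  # 7900, so the last mcell ends at byte 8192
```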
mcell records • Types include • Extent Maps (of various kinds) • Bitfile Attributes (clone, original, etc.) • Domain Attributes • Virtual Disk Attributes (disk ID, disk index, etc.) • Fragment Attributes - more later • POSIX file stats • Symbolic link targets
mcell Page Structure [Figure: a page consists of a page header followed by 28 mcells; each mcell consists of an mcell header followed by variable sized records.]
Extent Maps (1 of 2) • Stored as records in mcells of a bitfile • bitfile page number -> logical block number • For non-reserved files • extent maps stored in the BMT • primary extent map record • Contained within the file’s primary mcell • Can hold two extent records • Points to an extra extent map • extra extent map records • Can hold up to 31 extent records • Allocated and chained as the file grows
Extent Maps (2 of 2) • For reserved bitfiles, e.g. the BMT • extent maps are stored in the RBMT • For striped files • Extent map records are on different disks • A stripe file is a "meta-file" composed of bitfiles on many volumes.
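The mapping "bitfile page number -> logical block number" can be sketched as a lookup over a sorted list of extents. This is an illustration only: the in-memory layout is an assumption, the first extent is taken from the earlier showfile output, and the second extent is invented to show a file that spans two volumes.

```python
from bisect import bisect_right
from typing import NamedTuple

PAGE_SECTORS = 16  # 512-byte blocks per 8K page


class Extent(NamedTuple):
    page_off: int   # first bitfile page covered by this extent
    page_cnt: int   # number of contiguous pages
    vol: int        # volume index
    vol_block: int  # first 512-byte block of the extent on that volume


def lookup(extents, page):
    """Map a bitfile page to (volume, block). extents must be sorted by page_off."""
    starts = [e.page_off for e in extents]
    i = bisect_right(starts, page) - 1
    if i < 0:
        raise LookupError(f"page {page} is not mapped")
    e = extents[i]
    if page >= e.page_off + e.page_cnt:
        raise LookupError(f"page {page} is not mapped")
    return e.vol, e.vol_block + (page - e.page_off) * PAGE_SECTORS


extent_map = [
    Extent(0, 812, 1, 1560816),  # from the showfile example
    Extent(812, 4, 2, 5000),     # hypothetical second extent on volume 2
]
print(lookup(extent_map, 10))   # (1, 1560976)
print(lookup(extent_map, 813))  # (2, 5016)
```

A file with few large extents keeps this list short, which is one reason AdvFS tries to allocate storage that way.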
BAS On-Disk Metadata Bitfiles • Per Volume: Reserved BMT, Bitfile Metadata Table, Storage Bitmap, Misc Bitfile • Per Domain: Root Tag Directory, On-Disk Log • Per Fileset (shown for fileset A): Tag Directory, Fragment File
Bitfile Metadata Table (BMT) Bitfile • Contains all mcells • metadata files • user files • Analogous to UFS inode table • A bitfile • its own first mcell describes itself • as more mcells are needed it grows using extents
BMT Page 0 • Starts at sector 32 • Primary mcells in BMT 0 • mcell 0 Bitfile Metadata Table (BMT) • mcell 1 Storage Bitmap (SBM) • mcell 2 Root Tag Directory - Optional, one per domain • mcell 3 Log - Optional, one per domain • mcell 5 Misc bitfile • All secondary mcells (extent maps ) for the BMT must be in BMT page 0
BMT Page 1 • Starts at Sector 48 • mcell 0 is head of the BMT page free list • Contains mcells for non-reserved bitfiles • i.e. user files and directories • All other BMT pages are found via the BMT extent map
Storage Bitmap (SBM) Bitfile • One per volume • Indicates free disk blocks on volume, available for • extents for BMT growth (more mcells) • extents for fileset bitfiles (Tag Dirs, Fragment Files) • extents for user files • AdvFS Storage is allocated in clusters • 1 cluster == 2 sectors == 1024 bytes • On-disk SBM • Just an array of bits, 1 bit per cluster
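A bitmap with one bit per 1024-byte cluster can be modelled as follows. This is a minimal sketch, not the AdvFS allocator: the class name, the bit order within each byte, and the first-fit search for a contiguous run are all assumptions.

```python
CLUSTER_BYTES = 1024  # 1 cluster == 2 sectors == 1024 bytes


class ToyStorageBitmap:
    """One bit per cluster; a set bit means the cluster is in use."""

    def __init__(self, n_clusters):
        self.n = n_clusters
        self.bits = bytearray((n_clusters + 7) // 8)

    def is_used(self, cluster):
        return bool(self.bits[cluster // 8] & (1 << (cluster % 8)))

    def _set(self, cluster, used):
        mask = 1 << (cluster % 8)
        if used:
            self.bits[cluster // 8] |= mask
        else:
            self.bits[cluster // 8] &= ~mask & 0xFF

    def allocate(self, count):
        """First-fit: mark and return the start of a free run, or None."""
        if count <= 0:
            raise ValueError("count must be positive")
        run = 0
        for c in range(self.n):
            run = 0 if self.is_used(c) else run + 1
            if run == count:
                start = c - count + 1
                for i in range(start, c + 1):
                    self._set(i, True)
                return start
        return None

    def free(self, start, count):
        for i in range(start, start + count):
            self._set(i, False)


sbm = ToyStorageBitmap(32)
page = 8192 // CLUSTER_BYTES  # one 8K page needs 8 clusters
first = sbm.allocate(page)    # clusters 0..7
second = sbm.allocate(page)   # clusters 8..15
sbm.free(first, page)
print(first, second, sbm.allocate(4), sbm.allocate(page))  # 0 8 0 16
```

The last call shows fragmentation in miniature: four free clusters remain at 4..7, but an 8-cluster request has to skip past them.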
Misc Bitfile • One per volume • Bitfile container for volume blocks not associated directly with any other AdvFS or user bitfiles. • Primary and Secondary Boot Block • Partition Table (Disk Label) • Fake UFS Super Block with AdvFS Magic Number
On-Disk Log (1 of 2) • One per domain • Log Sequence Numbers (LSNs) • Identify a record in the log • Incremented with time so that recovery process can identify most recent log records • Log records are written in pages (8k) • The first word of each block has the same LSN if the log page was completely written [Figure: log pages 0 to N, each 8K, laid out as a header (H) followed by records (R R R ...), with an LSN at the start of each block.]
On-Disk Log (2 of 2) • Log records may be "continued" • Long log records are linked lists of short ones • Log records are written in a circular buffer • Log pages will be reused long after the logged changes are made. • Parsing of the log need not be fast - It only gets done on reboots after crashes.
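The rule that a completely written log page carries the same LSN in the first word of every block suggests a simple torn-write check. The sketch below assumes 512-byte blocks, a 4-byte little-endian word, and otherwise empty pages; none of these details are stated in the slides, and the function names are hypothetical.

```python
import struct

PAGE_BYTES = 8192
BLOCK_BYTES = 512
BLOCKS_PER_PAGE = PAGE_BYTES // BLOCK_BYTES  # 16
WORD = "<I"  # assumed: 32-bit little-endian LSN word


def build_log_page(lsn):
    """Build an otherwise empty 8K log page stamped with lsn in every block."""
    page = bytearray(PAGE_BYTES)
    for b in range(BLOCKS_PER_PAGE):
        struct.pack_into(WORD, page, b * BLOCK_BYTES, lsn)
    return bytes(page)


def page_completely_written(page):
    """True if the first word of every block holds the same LSN."""
    lsns = {
        struct.unpack_from(WORD, page, b * BLOCK_BYTES)[0]
        for b in range(BLOCKS_PER_PAGE)
    }
    return len(lsns) == 1


old_page = build_log_page(17)
new_page = build_log_page(42)
# Simulate a crash after only 5 of the 16 blocks of the new page reached disk.
torn_page = new_page[:5 * BLOCK_BYTES] + old_page[5 * BLOCK_BYTES:]

print(page_completely_written(new_page))   # True
print(page_completely_written(torn_page))  # False
```

Because the log is circular, a partially overwritten page still contains blocks from an older pass, and the mismatched LSNs are what expose it.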
Tag Directories • Bitfiles are identified by tags • Tag directories are • Arrays indexed by tag number • Directory entries contain • sequence number • volume index • mcell id within volume • Sequence numbers start at 0x8001 and can be used 4096 times.
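A tag directory lookup can be sketched as an indexed table with a sequence check. The assumptions here are that both halves of an ID such as 4714.8001 are hexadecimal, that a mismatched sequence number means the tag has been reused, and that a dictionary stands in for the on-disk array; the entry values are invented.

```python
from typing import NamedTuple


class TagEntry(NamedTuple):
    seq: int        # sequence number currently valid for this tag
    vol_index: int  # volume holding the primary mcell
    mcell_id: int   # mcell id within that volume


def parse_bitfile_id(text):
    """Split 'tag.sequence' into integers, assuming both parts are hex."""
    tag_text, seq_text = text.split(".")
    return int(tag_text, 16), int(seq_text, 16)


def resolve_tag(tag_dir, tag, seq):
    """Return (vol_index, mcell_id), rejecting unknown or stale IDs."""
    entry = tag_dir.get(tag)
    if entry is None:
        raise KeyError(f"tag {tag:x} is not in use")
    if entry.seq != seq:
        raise KeyError(f"stale sequence {seq:x} for tag {tag:x}")
    return entry.vol_index, entry.mcell_id


# Hypothetical directory with a single entry for the dxadvfs example.
tag_dir = {0x4714: TagEntry(seq=0x8001, vol_index=1, mcell_id=57)}

tag, seq = parse_bitfile_id("4714.8001")
print(resolve_tag(tag_dir, tag, seq))  # (1, 57)
```

The sequence check plays the same role as an inode generation number: an old handle cannot silently refer to a newer file that reuses the tag.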
Root Tag Directory Bitfile • Maps fileset tag directory ids to volume:mcell location
(Fileset) Tag Directory Bitfile (1 of 2) • Each fileset has its own tag directory • Maps fileset bitfiles ids to volume:mcell location • Special fileset tags • 1 Fragment bitfile • 2 Root directory • Also includes mappings to volume and domain metadata bitfiles
(Fileset) Tag Directory Bitfile (2 of 2) • Tags for reserved bitfiles of virtual disk i are: tag = - (reserved-bitfile-primary-mcell_num + (vol_index * 6)) • Examples
Reserved File        Formula              Disk 1   Disk 2
BMT                  - (0 + (vol * 6))    -6       -12
SBM                  - (1 + (vol * 6))    -7       -13
Root Tag Directory   - (2 + (vol * 6))    -8       -14
Log                  - (3 + (vol * 6))    -9       -15
Misc Bitfile         - (5 + (vol * 6))    -11      -17
• May be printed as unsigned hexadecimal tags: fffffffa.0 is the BMT for disk 1, fffffff3.0 is the SBM for disk 2
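The reserved-tag formula and its hexadecimal appearance can be reproduced with a short calculation. This sketch assumes the "weird" printed form is the negative tag shown as a 32-bit unsigned value; the function and dictionary names are illustrative.

```python
RESERVED_PER_VOLUME = 6  # multiplier from the slide's formula

# Primary mcell numbers of reserved bitfiles, as listed for BMT page 0.
PRIMARY_MCELL = {"BMT": 0, "SBM": 1, "Root Tag Directory": 2, "Log": 3}


def reserved_tag(primary_mcell, vol_index):
    """tag = -(primary mcell number + vol_index * 6)."""
    return -(primary_mcell + vol_index * RESERVED_PER_VOLUME)


def as_unsigned_hex(tag):
    """Show a negative tag the way a 32-bit unsigned hex print would."""
    return format(tag & 0xFFFFFFFF, "08x")


for name, mcell in PRIMARY_MCELL.items():
    print(name, reserved_tag(mcell, 1), reserved_tag(mcell, 2))

print(as_unsigned_hex(reserved_tag(PRIMARY_MCELL["BMT"], 1)))  # fffffffa
print(as_unsigned_hex(reserved_tag(PRIMARY_MCELL["SBM"], 2)))  # fffffff3
```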
The .tags Directory • A direct route through a fileset tag directory provided for user-space commands and tools • If /usr is an AdvFS fileset:
/usr/.tags/2      root (/usr)
/usr/.tags/1      fragment bitfile
/usr/.tags/-6     BMT of disk 1
/usr/.tags/-15    log of disk 2
• and
/usr/.tags/M2     tag directory for 2nd fileset
Fragment Bitfile (1 of 2) • One per fileset • Contains small (< 8k) ends of files • Bitfiles are allocated in 8k (16 sector) units • If a file is "small" (< 160k bytes) and its last page has less than 7k bytes, • A fragment is allocated for the file's last cluster within the Fragment bitfile. • The Fragment bitfile is broken into 128 KB fragment groups • Each group consists of fragments of a particular size: free, 1k, 2k, ... 7k
Fragment Bitfile (2 of 2) • Each fragment group has the form Header:Fragment:Fragment:.... • Fragment groups with free frags are linked together using space in the header. • Head and tail pointers for the eight types of fragment groups are kept in the mcell for the fileset tag directory.
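The fragment rule from the previous slide (a file under 160k whose last page holds less than 7k) can be written as a small decision function. This is a sketch of one reading of that rule: rounding the tail up to the next 1k fragment size is an assumption, and the function name is hypothetical.

```python
KB = 1024
PAGE_BYTES = 8 * KB
SMALL_FILE_LIMIT = 160 * KB
MAX_TAIL_BYTES = 7 * KB


def fragment_kb(file_size):
    """Return the fragment size in KB (1..7) for a file, or 0 if none applies."""
    if file_size >= SMALL_FILE_LIMIT:
        return 0                      # not a "small" file
    tail = file_size % PAGE_BYTES     # bytes in the last, partial page
    if tail == 0 or tail >= MAX_TAIL_BYTES:
        return 0                      # ends on a page boundary or needs a full page
    return -(-tail // KB)             # round up to the next 1k fragment size


print(fragment_kb(100))                # 1: the whole file fits in a 1k fragment
print(fragment_kb(PAGE_BYTES + 3000))  # 3: one full page plus a 3k fragment
print(fragment_kb(PAGE_BYTES))         # 0: no partial last page
print(fragment_kb(200 * KB + 100))     # 0: file is too large for a fragment
```

Under this reading, a 100-byte file consumes 1k of the fragment bitfile rather than a full 8k page.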
Fragments and Files • POSIX stats record of mcells contains • Fragment bitfile ID • Page offset within fragment bitfile • If complete, last bytes of the file are found in the fragment bitfile • If needed, fragment allocation and bitfile truncation is done on last close of a file that has changed size.
Lowest Level Disk Image [Figure: disk layout along an axis of disk sectors (512 bytes) with ticks at 0, 16, 32, 48, 64, 96. Labeled regions: Boot Block, disklabel, fake superblock, BMT Page 0, BMT Page 1, future growth of BMT.] • BMT as a bitfile • Page 0: descriptions of itself and other reserved bitfiles • Page 1+: descriptions of user bitfiles
Low Level Disk Image with Metadata Bitfiles [Figure: disk layout along an axis of disk sectors (512 bytes) with ticks at 0, 16, 32, 48, 64, 96. Labeled regions: Misc Page 0, Misc Page 1, Misc Page 2, BMT Page 0, BMT Page 1, Root Tag Dir (optional), Storage Bit Map, Trans Log (optional), followed by more BMT pages, bitfile extents, and free blocks.]
FAS On-Disk Structures FAS on-disk structures consist of: • Directories • File attributes • Quota files
“UNIX” Directories • UNIX Directories are contained in bitfiles • AdvFS format similar to UFS format except • tag numbers replace inode numbers • hidden in the padding at the end of each component entry is the 64-bit tag.sequence id • Why two levels of directories? • to support file migration between disks
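AdvFS Directories and Migration [Figure: a fileset “UNIX” directory entry ("log" or "foo" -> tagN) points into the Tag Directory, which points to the file's mcell in a Bitfile Metadata Table and from there to its extents; a second Bitfile Metadata Table is shown as the destination of a migration.]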
File Attributes • FAS file attributes mainly consist of the POSIX stat structure. • Refer to /usr/include/sys/stat.h • Stored in a record of the bitfile’s mcell • POSIX attribute record
Quota Files (1 of 2) • Two quota files per fileset. • User quota file is used to keep track of file and disk space usage on a per-UID basis. • Group quota file is maintained on a per-GID basis.