Andrew Hanushevsky Stanford Linear Accelerator Center

Disk Cache Management In Large-Scale Object Oriented Databases
http://www.slac.stanford.edu/~abh/CHEP2000/Cache/
Andrew Hanushevsky, Stanford Linear Accelerator Center
Produced under contract DE-AC03-76SF00515 between Stanford University and the Department of Energy

Motivation
• Problem
  • More data (>2 PB) than affordable disk space (<300 TB)
• Realization
  • Only about 10% of the data is used at any one time
• Solution
  • Hierarchical Mass Storage System
  • Most data on tape (cheap); in-use data on disk (expensive)
• Problem (it’s all circular)
  • Effectively manage the disk cache to keep the most useful data
  • Disk cache performance

Basic Disk Caching Architecture
[Diagram. Labels: Control, Data, Database Management, Cache Management, Disk Cache, ams, hpss client. Operations: pre-stage, stage, migrate, purge]

The Direct Solution: One Big Filesystem
• Volume Manager + Journaled File System (e.g., Veritas)
  • Catenates disk devices to form very large capacity logical devices
  • High performance (60+ MB/sec) journaled file system for fast recovery
  • Allows for fast streaming I/O and efficient small block transfers
• Problems
  • Low random access performance
  • Limited to 1 TB of cache/filesystem in most implementations
  • Unpredictable load balancing
[Diagram: one File System on a Volume Manager over several RAID devices]

The Indirect Solution: Multiple Smaller Filesystems
• Still need a Volume Manager + Journaled File System
  • But can spread the load across multiple heads and I/O adapters
  • Virtually unlimited cache size
• Problems
  • Need to manage multiple filesystems
  • Need tools to balance the load
    • If not done automatically
[Diagram: three File Systems on a Volume Manager over several RAID devices]

Supporting Multiple Filesystems
• Index Area
  • Holds the wanted name as a symlink: /databases/mydbfile
  • Diagram labels: Optional data cache, Default data area
  • Naming convention allows for audit and index recovery
• Data Area: Multiple Independent Filesystems (/cache1, /cache2, /cache3)
  • Symlink target: /cache1/databases:mydbfile
  • Any number
  • Any size
  • Chosen based on free space in LRU order
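The naming convention can be illustrated with a minimal sketch. The slide gives only one example (/databases/mydbfile linked to /cache1/databases:mydbfile), so the rule used here, replacing each "/" in the wanted name with ":", is an assumption inferred from that single example, and both function names are hypothetical.

```python
from pathlib import PurePosixPath


def data_file_name(cache_root: str, logical_path: str) -> str:
    """Flatten a logical path into one file name under a cache filesystem.

    Assumed rule: every "/" in the logical path becomes ":".
    """
    flat = logical_path.strip("/").replace("/", ":")
    return str(PurePosixPath(cache_root) / flat)


def logical_path_from(data_file: str) -> str:
    """Recover the logical path from a flattened data file name."""
    flat = PurePosixPath(data_file).name
    return "/" + flat.replace(":", "/")
```

Because the logical name can be rebuilt from the data file name alone, a lost index entry could in principle be recreated by scanning the data filesystems, which is one way to read "allows for audit and index recovery".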

Staging Manager
• Copies files into the cache
• Uses index space to link wanted name to actual file location
• Uses allocation manager to select target filesystem
• Uses lock manager to serialize access to target files & directories
• Uses resource manager to control tape drive usage
[Diagram: Staging Manager with Allocation Manager, Lock Manager, and Resource Manager]
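The interplay of the lock and resource managers can be sketched as follows. This is an illustrative model only, not the actual implementation: the class, the in-memory index dictionary, and the `fetch` callable (standing in for filesystem selection plus the tape copy) are all hypothetical.

```python
import threading
from collections import defaultdict


class StagingManager:
    """Toy staging manager: one lock per file, a semaphore for tape drives."""

    def __init__(self, tape_drives, fetch):
        self._drives = threading.BoundedSemaphore(tape_drives)  # resource manager
        self._locks = defaultdict(threading.Lock)               # lock manager
        self._guard = threading.Lock()                          # protects _locks
        self._index = {}                                        # wanted name -> location
        self._fetch = fetch  # hypothetical: copies the file, returns its location

    def stage(self, name):
        with self._guard:
            lock = self._locks[name]
        with lock:  # serialize access to this file
            if name in self._index:  # already in the cache
                return self._index[name]
            with self._drives:  # wait for a free tape drive
                location = self._fetch(name)
            self._index[name] = location  # link wanted name to actual location
            return location
```

A second request for the same name blocks on the per-file lock while the first copy runs, then finds the index entry and returns without touching a tape drive.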

File Placement (i.e., filesystem selection)
• Round-robin allocation
  • Good for spreading the load
• Maximum fit (fuzz == 0)
  • Filesystem with largest amount of free space
  • Good when size not known
• Maximal fit (0 < fuzz < 1)
  • Filesystem with largest amount of free space within a delta
  • Good when size unknown but want to keep round-robin allocation
• First fit (fuzz == 1)
  • First filesystem that can accommodate the file
  • Good when size known and want to spread the load
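One way to read the fuzz parameter is as a single rule that covers all three fit policies. The sketch below assumes that "within a delta" means free space of at least max_free × (1 − fuzz), and that candidates are scanned in round-robin order from a caller-supplied start position. Both are assumptions, not taken from the slides, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Filesystem:
    path: str
    free: int  # free space in bytes


def select_filesystem(filesystems, size=0, fuzz=0.0, start=0):
    """Return the index of the chosen filesystem, or None if none fits.

    fuzz == 0     -> only the filesystem(s) with maximum free space qualify
    0 < fuzz < 1  -> any filesystem within fuzz of the maximum qualifies
    fuzz == 1     -> first filesystem that can hold `size` bytes
    """
    if not filesystems:
        return None
    max_free = max(fs.free for fs in filesystems)
    threshold = max_free * (1.0 - fuzz)
    n = len(filesystems)
    for step in range(n):  # round-robin scan beginning at `start`
        i = (start + step) % n
        fs = filesystems[i]
        if fs.free >= threshold and fs.free >= size:
            return i
    return None
```

Under this reading, the caller advances `start` after each allocation, so a nonzero fuzz keeps rotating among the filesystems that are nearly as empty as the emptiest one.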

Pre-Staging Manager
• Asks the staging manager to pre-fetch files
• Allows user to transparently map objects to files
• Avoids resource wait time (i.e., files available when job runs)
• Notifies user synchronously or asynchronously when request completes
• Uses client/server model of implementation for isolation
[Diagram: Client, Disk Queue, Server (Pre-Staging Manager), Staging Manager, Disk Cache]

Migration Manager
• Copies modified files from cache to Mass Storage System
• File must not have been changed for x seconds
• Reduces chance of multiple migrations of same file prior to purge
• Specific files can be migrated on a priority basis by request
• Uses client/server model of implementation for isolation
[Diagram: Client, Disk Queue, Server (Migration Manager), Disk Cache, hpss]
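The quiet-period rule can be sketched as a simple filter. The record layout and function name are hypothetical. The sketch also assumes that a priority request only moves a file to the front of the queue and does not waive the quiet period, which the slide does not specify.

```python
from dataclasses import dataclass


@dataclass
class CachedFile:
    path: str
    mtime: float            # last modification time, seconds
    dirty: bool             # modified since last copy to mass storage
    priority: bool = False  # explicitly requested for migration


def migration_candidates(files, min_idle_seconds, now):
    """Return dirty files unchanged for at least min_idle_seconds.

    Priority requests come first; within each group, oldest change first.
    """
    eligible = [
        f for f in files
        if f.dirty and now - f.mtime >= min_idle_seconds
    ]
    return sorted(eligible, key=lambda f: (not f.priority, f.mtime))
```

Skipping files that changed recently is what reduces repeated migrations: a file still being written is copied once after it settles, not after every update.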

Purge Manager
• Removes unused migrated files from the cache
  • Files purged in LRU order across all filesystems
  • File must not have been used for at least x seconds
• Tries to maintain free space in each filesystem at a target amount
  • Purging starts when free space falls below a specified filesystem threshold
  • Targets are specific to a filesystem but may be the same for all
  • Either a space percentage or absolute value, and a global file count
• Specific files can be purged on a priority basis by request
  • Uses client/server model of implementation for isolation
  • Implementation identical to migration priority queue
• Files can also be pinned in the cache (i.e., not removable)
  • For a specific period of time
  • Until a certain date plus optional non-use time
  • Indefinitely
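The threshold and target behavior can be sketched as below. This is an illustrative model under stated assumptions: the record layout and function name are hypothetical, a single threshold and target are shared by all filesystems, space is in absolute bytes, and the global file count and priority queue are left out.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    path: str
    filesystem: str
    size: int
    atime: float                          # last access time, seconds
    migrated: bool                        # a copy exists in mass storage
    pinned_until: Optional[float] = None  # math.inf pins indefinitely


def select_purge(entries, free, threshold, target, min_idle, now):
    """Return paths to purge, in LRU order across all filesystems.

    Purging starts for a filesystem whose free space is below `threshold`
    and stops once its free space reaches `target`.
    """
    free = dict(free)  # work on a copy
    needy = {fs for fs, space in free.items() if space < threshold}
    victims = []
    for e in sorted(entries, key=lambda e: e.atime):  # least recently used first
        if not needy:
            break
        if e.filesystem not in needy:
            continue
        if not e.migrated or now - e.atime < min_idle:
            continue  # unmigrated or recently used files stay
        if e.pinned_until is not None and now < e.pinned_until:
            continue  # pinned files are not removable
        victims.append(e.path)
        free[e.filesystem] += e.size
        if free[e.filesystem] >= target:
            needy.discard(e.filesystem)
    return victims
```

Keeping the threshold below the target gives hysteresis: a purge pass frees a batch of space at once, so it does not restart after every newly staged file.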

Cache Management Utilities
• ooss_Xeq provides a common management interface
  • Audit cache disks (data files must be pointed to from the name space)
    • Optional fix-up allowed
  • Audit name space (name space must point to actual data files)
    • Optional fix-up allowed
  • Copy a file into the cache
    • Arbitrary source
  • Create an empty file in the cache
  • Rename a file in the index
  • Relocate a file to another filesystem
  • Remove a file from the index and cache
    • Optional removal from the Mass Storage System as well
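The two audits check the same relationship from opposite directions, which a small sketch makes concrete. This is not the ooss_Xeq interface: the function and its in-memory inputs (a dictionary standing in for the name space, a set standing in for the files found on the cache disks) are hypothetical.

```python
def audit(index, data_files):
    """Cross-check the name space against the data files.

    index:      dict mapping wanted name -> data file location
    data_files: set of data file locations actually present on cache disks

    Returns (orphans, dangling):
      orphans  - data files that no index entry points to (cache disk audit)
      dangling - index names whose target file is missing (name space audit)
    """
    targets = set(index.values())
    orphans = sorted(set(data_files) - targets)
    dangling = sorted(name for name, loc in index.items() if loc not in data_files)
    return orphans, dangling
```

An optional fix-up would then act on each list, for example by re-linking or removing orphans and by dropping dangling names; which action applies is a policy decision the slides leave open.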

Components For Effective Disk Cache Management
[Diagram. Components: Disk Cache, Staging Manager, Prestage Manager, Migration Manager, Purge Manager, M/P Request Server, Allocation Manager, Lock Manager, Resource Manager, MSS Gateway, Administration Tools]

Conclusion
• Effectively Managing A Large Disk Cache is Complex
• Performance
  • Multiple small (100 GB) caches
  • Allocation Strategy
  • Relocation Strategy
  • External resource management (e.g., MSS tape drives)
• Fault Tolerance
  • Multiple loosely connected components
  • Cache auditing and recovery
• Usability
  • End-user interfaces for staging, migration, and purge
• Administration
  • Extensive tools to safely manipulate cache contents
