Architectural and Design Issues in the General Parallel File System IBM Research Lab in Haifa May 12, 2002 Benny Mandler - mandler@il.ibm.com
Agenda • What is GPFS? • a file system for deep computing • GPFS uses • General architecture • How does GPFS meet its challenges - architectural issues • performance • scalability • high availability • concurrency control
Scalable Parallel Computing • RS/6000 SP Scalable Parallel Computer • 1-512 nodes connected by high-speed switch • 1-16 CPUs per node (Power2 or PowerPC) • >1 TB disk per node • 500 MB/s full duplex per switch port • Scalable parallel computing enables I/O-intensive applications: • Deep computing - simulation, seismic analysis, data mining • Server consolidation - aggregating file, web servers onto a centrally-managed machine • Streaming video and audio for multimedia presentation • Scalable object store for large digital libraries, web servers, databases, ... What is GPFS?
GPFS addresses SP I/O requirements • High Performance - multiple GB/s to/from a single file • concurrent reads and writes, parallel data access - within a file and across files • Support fully parallel access both to file data and metadata • client caching enabled by distributed locking • wide striping, large data blocks, prefetch • Scalability • scales up to 512 nodes (N-way SMP): storage nodes, file system nodes, disks, adapters... • High Availability • fault tolerance via logging, replication, RAID support • survives node and disk failures • Uniform access via shared disks - single-image file system • High capacity - multiple TB per file system, 100s of GB per file • Standards compliant (X/Open 4.0 "POSIX") with minor exceptions What is GPFS?
GPFS vs. local and distributed file systems on the SP2 • Native AIX File System (JFS) • No file sharing - an application can only access files on its own node • Applications must do their own data partitioning • DCE Distributed File System (successor to AFS) • Application nodes (DCE clients) share files on a server node • Switch is used as a fast LAN • Coarse-grained (file or segment level) parallelism • Server node is a performance and capacity bottleneck • GPFS Parallel File System • GPFS file systems are striped across multiple disks on multiple storage nodes • Independent GPFS instances run on each application node • GPFS instances use storage nodes as "block servers" - all instances can access all disks
Tokyo Video on Demand Trial • Video on Demand for new "borough" of Tokyo • Applications: movies, news, karaoke, education ... • Video distribution via hybrid fiber/coax • Trial "live" since June '96 • Currently 500 subscribers • 6 Mbit/sec MPEG video streams • 100 simultaneous viewers (75 MB/sec) • 200 hours of video on line (700 GB) • 12-node SP-2 (7 distribution, 5 storage)
Engineering Design • Major aircraft manufacturer • Using CATIA for large designs, Elfini for structural modeling and analysis • SP used for modeling/analysis • Using GPFS to store CATIA designs and structural modeling data • GPFS allows all nodes to share designs and models GPFS uses
Shared Disks - Virtual Shared Disk architecture • File systems consist of one or more shared disks • An individual disk can contain data, metadata, or both • Disks are assigned to failure groups • Data and metadata are striped to balance load and maximize parallelism • Recoverable Virtual Shared Disk (VSD) for accessing disk storage • Disks are physically attached to SP nodes • VSD allows clients to access disks over the SP switch • The VSD client looks like a disk device driver on the client node • The VSD server executes I/O requests on the storage node • VSD supports JBOD or RAID volumes, fencing, multi-pathing (where physical hardware permits) • GPFS only assumes a conventional block I/O interface General architecture
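The last point is worth emphasizing: GPFS needs nothing from the disk layer beyond reading and writing numbered sectors, plus fencing for recovery. A minimal sketch of that assumed contract, with hypothetical names (this is not the actual VSD API):

```python
# Minimal sketch of the block I/O contract a shared-disk file system
# assumes from its disk layer (hypothetical names, not the VSD API).

class BlockDevice:
    """A disk addressed as fixed-size sectors; here backed by a dict."""

    SECTOR_SIZE = 512

    def __init__(self, disk_id, num_sectors):
        self.disk_id = disk_id
        self.num_sectors = num_sectors
        self._sectors = {}          # sector number -> bytes

    def read(self, sector, count=1):
        """Return `count` sectors starting at `sector`."""
        return b"".join(
            self._sectors.get(s, b"\x00" * self.SECTOR_SIZE)
            for s in range(sector, sector + count)
        )

    def write(self, sector, data):
        """Write whole sectors starting at `sector`."""
        for i in range(0, len(data), self.SECTOR_SIZE):
            self._sectors[sector + i // self.SECTOR_SIZE] = \
                data[i:i + self.SECTOR_SIZE].ljust(self.SECTOR_SIZE, b"\x00")

    def fence(self, node_id):
        """Revoke access for a node (used during recovery); no-op here."""
        pass


if __name__ == "__main__":
    d = BlockDevice(disk_id=0, num_sectors=1 << 20)
    d.write(100, b"metadata block")
    print(d.read(100)[:14])   # b'metadata block'
```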
GPFS Architecture Overview • Implications of Shared Disk Model • All data and metadata on globally accessible disks (VSD) • All access to permanent data through disk I/O interface • Distributed protocols, e.g., distributed locking, coordinate disk access from multiple nodes • Fine-grained locking allows parallel access by multiple clients • Logging and Shadowing restore consistency after node failures • Implications of Large Scale • Support up to 4096 disks of up to 1 TB each (4 Petabytes) • The largest system in production is 75 TB • Failure detection and recovery protocols to handle node failures • Replication and/or RAID protect against disk / storage node failure • On-line dynamic reconfiguration (add, delete, replace disks and nodes; rebalance file system) General architecture
GPFS Architecture - Node Roles • Three types of nodes: file system, storage, and manager • Each node can perform any of these functions • File system nodes • run user programs, read/write data to/from storage nodes • implement virtual file system interface • cooperate with manager nodes to perform metadata operations • Manager nodes (one per “file system”) • global lock manager • recovery manager • global allocation manager • quota manager • file metadata manager • admin services fail over • Storage nodes • implement block I/O interface • shared access from file system and manager nodes • interact with manager nodes for recovery (e.g. fencing) • file data and metadata striped across multiple disks on multiple storage nodes General architecture
GPFS Software Structure General architecture
Disk Data Structures: Files • Large block size allows efficient use of disk bandwidth • Fragments reduce space overhead for small files • No designated "mirror", no fixed placement function: • Flexible replication (e.g., replicate only metadata, or only important files) • Dynamic reconfiguration: data can migrate block-by-block • Multi-level indirect blocks • Each disk address is a list of pointers to replicas • Each pointer is a disk id + sector number General architecture
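A rough sketch of this addressing scheme (field names are illustrative, not the GPFS on-disk format): each file block resolves, directly or through indirect blocks, to a list of replica pointers, each naming a disk and a starting sector.

```python
# Illustrative model of per-block replica pointers (not the GPFS
# on-disk format; names are hypothetical).

from dataclasses import dataclass, field
from typing import List

@dataclass
class DiskPointer:
    disk_id: int      # which shared disk
    sector: int       # starting sector on that disk

@dataclass
class DiskAddress:
    replicas: List[DiskPointer] = field(default_factory=list)

@dataclass
class Inode:
    block_size: int
    # Small files: direct addresses; large files would go through
    # multi-level indirect blocks holding further DiskAddress entries.
    direct: List[DiskAddress] = field(default_factory=list)

    def locate(self, offset: int) -> DiskAddress:
        """Map a byte offset to the disk address of its data block."""
        return self.direct[offset // self.block_size]


ino = Inode(block_size=256 * 1024)
ino.direct.append(DiskAddress([DiskPointer(3, 8192), DiskPointer(7, 4096)]))
print(ino.locate(100_000).replicas)   # two replicas, on disks 3 and 7
```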
Large File Block Size • Conventional file systems store data in small blocks to pack data more densely • GPFS uses large blocks (256KB default) to optimize disk transfer speed Performance
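A small worked example of the trade-off, assuming the 256 KB default block and, purely for illustration, a fragment granularity of 1/32 of a block: large blocks keep sequential transfers efficient, while fragments keep small-file overhead bounded.

```python
# Rough illustration of why fragments matter: space allocated for a
# file's tail with whole blocks vs. with fragments. Sizes assumed:
# 256 KB blocks and, for illustration only, 1/32-block fragments.

BLOCK = 256 * 1024
FRAGMENT = BLOCK // 32

def allocated_bytes(file_size, use_fragments=True):
    full_blocks, tail = divmod(file_size, BLOCK)
    allocated = full_blocks * BLOCK
    if tail:
        if use_fragments:
            # round the tail up to whole fragments instead of a whole block
            allocated += -(-tail // FRAGMENT) * FRAGMENT
        else:
            allocated += BLOCK
    return allocated

for size in (2_000, 300_000, 1_000_000):
    print(size,
          allocated_bytes(size, use_fragments=False),
          allocated_bytes(size, use_fragments=True))
```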
Parallelism and consistency • Distributed locking - acquire the appropriate lock for every operation - used for updates to user data • Centralized management - conflicting operations forwarded to a designated node - used for file metadata • Distributed locking + centralized hints - used for space allocation • Central coordinator - used for configuration changes • Under heavy load, the observed effect is additional I/O activity rather than token-server overload
Parallel File Access From Multiple Nodes • GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a single file with no conflict • Global locking serializes access to overlapping ranges of a file • Global locking based on "tokens" which convey access rights to an object (e.g. a file) or subset of an object (e.g. a byte range) • Tokens can be held across file system operations, enabling coherent data caching in clients • Cached data discarded or written to disk when token is revoked • Performance optimizations: required/desired ranges, metanode, data shipping, special token modes for file size operations Performance
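A much-simplified sketch of byte-range token conflict handling (hypothetical structures; the real token manager also deals with required vs. desired ranges, metanodes, data shipping, and special file-size modes):

```python
# Simplified sketch of byte-range token conflict checking (hypothetical
# data structures, not the GPFS token manager).

from dataclasses import dataclass

@dataclass
class RangeToken:
    node: str
    start: int
    end: int        # exclusive
    mode: str       # "read" or "write"

def conflicts(a: RangeToken, b: RangeToken) -> bool:
    overlap = a.start < b.end and b.start < a.end
    return overlap and (a.mode == "write" or b.mode == "write")

class TokenServer:
    def __init__(self):
        self.granted = []

    def acquire(self, req: RangeToken):
        """Grant the token, revoking conflicting tokens held by other nodes
        (revocation forces holders to flush or discard cached data)."""
        revoked = [t for t in self.granted
                   if t.node != req.node and conflicts(t, req)]
        for t in revoked:
            self.granted.remove(t)
        self.granted.append(req)
        return revoked

srv = TokenServer()
srv.acquire(RangeToken("node1", 0, 1 << 20, "write"))
print(srv.acquire(RangeToken("node2", 1 << 20, 2 << 20, "write")))  # [] - disjoint ranges
print(srv.acquire(RangeToken("node3", 0, 4096, "read")))            # revokes node1's token
```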
Deep Prefetch for High Throughput • GPFS stripes successive blocks across successive disks • Disk I/O for sequential reads and writes is done in parallel • GPFS measures application "think time", disk throughput, and cache state to automatically determine optimal parallelism • Prefetch algorithms now recognize strided and reverse-sequential access • Accepts hints • Write-behind policy • Example: the application reads at 15 MB/s while each disk reads at 5 MB/s, so three I/Os are executed in parallel Performance
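The example's arithmetic as a tiny sketch (illustrative only; the real heuristic also weighs think time and cache state rather than using this bare ratio):

```python
# Back-of-the-envelope version of the prefetch-depth decision in the
# slide's example; the actual GPFS heuristic is richer than this.

import math

def prefetch_depth(app_rate_mb_s, per_disk_rate_mb_s, max_disks):
    """How many block reads to keep in flight to sustain the app rate."""
    needed = math.ceil(app_rate_mb_s / per_disk_rate_mb_s)
    return min(needed, max_disks)

# Application reads at 15 MB/s, each disk delivers 5 MB/s:
print(prefetch_depth(15, 5, max_disks=8))   # 3 I/Os executed in parallel
```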
GPFS Throughput Scaling for Non-cached Files • Hardware: Power2 wide nodes, SSA disks • Experiment: sequential read/write from a large number of GPFS nodes to a varying number of storage nodes • Result: throughput increases nearly linearly with the number of storage nodes • Bottlenecks: • Micro Channel limits node throughput to 50 MB/s • system throughput limited by available storage nodes Scalability
Disk Data Structures: Allocation Map • Segmented block allocation map: • Each segment contains bits representing blocks on all disks • Each segment is a separately lockable unit • Minimizes contention for the allocation map when writing files from multiple nodes • The allocation manager service provides hints as to which segments to try • The inode allocation map is handled similarly Scalability
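A sketch of the idea, with hypothetical structures: each segment covers a slice of every disk and carries its own lock, and a hint service steers different nodes toward different segments so they rarely contend.

```python
# Sketch of a segmented block-allocation map: each segment holds a slice
# of every disk's allocation bits and is locked independently
# (hypothetical structure, not the GPFS on-disk map).

import threading

class AllocationSegment:
    def __init__(self, disks, blocks_per_disk):
        self.lock = threading.Lock()
        # free-block bitmap: one list of bools per disk
        self.free = {d: [True] * blocks_per_disk for d in disks}

    def allocate(self, disk):
        with self.lock:
            bits = self.free[disk]
            for i, is_free in enumerate(bits):
                if is_free:
                    bits[i] = False
                    return i
        return None

class AllocationMap:
    def __init__(self, disks, blocks_per_disk, num_segments):
        self.segments = [AllocationSegment(disks, blocks_per_disk)
                         for _ in range(num_segments)]

    def hint_segment(self, node_id):
        """Stand-in for the allocation manager's hint service: steer
        different nodes toward different segments."""
        return node_id % len(self.segments)

amap = AllocationMap(disks=[0, 1, 2], blocks_per_disk=4, num_segments=8)
seg = amap.segments[amap.hint_segment(node_id=5)]
print(seg.allocate(disk=1))   # 0 - first free block on disk 1 in that segment
```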
High Availability - Logging and Recovery • Problem: detect/fix file system inconsistencies after a failure of one or more nodes • All updates that may leave inconsistencies if uncompleted are logged • Write-ahead logging policy: log record is forced to disk before dirty metadata is written • Redo log: replaying all log records at recovery time restores file system consistency • Logged updates: • I/O to replicated data • directory operations (create, delete, move, ...) • allocation map changes • Other techniques: • ordered writes • shadowing High Availability
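A minimal sketch of the write-ahead/redo discipline described above (the structures are illustrative stand-ins, not GPFS log formats): a record is forced before the corresponding dirty metadata can reach disk, so replaying the log after a crash restores consistency.

```python
# Minimal write-ahead logging / redo sketch (illustrative only).

class RedoLog:
    def __init__(self):
        self.records = []           # stands in for log blocks on disk

    def force(self, record):
        """Append and 'force' a log record to stable storage."""
        self.records.append(record)

class MetadataDisk:
    def __init__(self):
        self.state = {}

    def apply(self, record):
        key, value = record
        self.state[key] = value

log, disk = RedoLog(), MetadataDisk()

# Normal operation: log first, then (maybe) write the metadata.
log.force(("dir:/a", "contains file-1"))
disk.apply(("dir:/a", "contains file-1"))
log.force(("inode:42", "allocated"))     # crash before this hits disk

# Recovery: redo every logged update; re-applying them is safe.
for rec in log.records:
    disk.apply(rec)
print(disk.state)   # both updates present after replay
```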
Node Failure Recovery • Application node failure: • force-on-steal policy ensures that all changes visible to other nodes have been written to disk and will not be lost • all potential inconsistencies are protected by a token and are logged • file system manager runs log recovery on behalf of the failed node • after successful log recovery tokens held by the failed node are released • actions taken: restore metadata being updated by the failed node to a consistent state, release resources held by the failed node • File system manager failure: • new node is appointed to take over • new file system manager restores volatile state by querying other nodes • New file system manager may have to undo or finish a partially completed configuration change (e.g., add/delete disk) • Storage node failure: • Dual-attached disk: use alternate path (VSD) • Single attached disk: treat as disk failure High Availability
Handling Disk Failures • When a disk failure is detected • The node that detects the failure informs the file system manager • File system manager updates the configuration data to mark the failed disk as "down" (quorum algorithm) • While a disk is down • Read one / write all available copies • "Missing update" bit set in the inode of modified files • When/if disk recovers • File system manager searches inode file for missing update bits • All data & metadata of files with missing updates are copied back to the recovering disk (one file at a time, normal locking protocol) • Until missing update recovery is complete, data on the recovering disk is treated as write-only • Unrecoverable disk failure • Failed disk is deleted from configuration or replaced by a new one • New replicas are created on the replacement or on other disks
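A toy sketch of the "read one / write all available" policy with a missing-update flag (structures are hypothetical; GPFS records the flag in the inode and later repairs files one at a time under the normal locking protocol):

```python
# Sketch of replica handling while a disk is down: write all available
# copies, flag the file as missing an update, read any available copy.

class ReplicatedBlock:
    def __init__(self, disks):
        self.copies = {d: None for d in disks}   # disk id -> data

class ReplicaManager:
    def __init__(self):
        self.down = set()                        # disks marked "down"

    def write(self, block, data, inode):
        wrote_all = True
        for d in block.copies:
            if d in self.down:
                wrote_all = False                # this copy is now stale
            else:
                block.copies[d] = data
        if not wrote_all:
            inode["missing_update"] = True       # remembered for later repair
        return wrote_all

    def read(self, block):
        for d, data in block.copies.items():
            if d not in self.down and data is not None:
                return data                      # any available copy will do
        raise IOError("no replica available")

mgr = ReplicaManager()
blk, ino = ReplicatedBlock([0, 1]), {}
mgr.down.add(1)                                  # disk 1 marked "down"
mgr.write(blk, b"new contents", ino)
print(mgr.read(blk), ino)                        # read served from disk 0
```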
Cache Management • Total cache is divided into a general pool (clock list, with merge and re-map) and several block-size pools (clock lists), each tracking sequential/random-optimal and total usage statistics • Pools are balanced dynamically according to usage patterns • Avoid fragmentation - internal and external • Unified steal across pools • Periodic re-balancing
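A minimal second-chance ("clock") list, the per-pool structure named on this slide (greatly simplified; the real code balances multiple pools and unifies stealing across them):

```python
# Minimal "clock" (second-chance) replacement list for one buffer pool
# (simplified sketch of the per-pool structure on the slide).

class ClockPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []            # list of [block_id, referenced_bit]
        self.hand = 0

    def access(self, block_id):
        for entry in self.blocks:
            if entry[0] == block_id:
                entry[1] = True     # hit: give it a second chance
                return
        if len(self.blocks) < self.capacity:
            self.blocks.append([block_id, True])
            return
        # steal: advance the hand, clearing reference bits, until a
        # block with a cleared bit is found and replaced
        while True:
            entry = self.blocks[self.hand]
            if entry[1]:
                entry[1] = False
                self.hand = (self.hand + 1) % self.capacity
            else:
                self.blocks[self.hand] = [block_id, True]
                self.hand = (self.hand + 1) % self.capacity
                return

pool = ClockPool(capacity=3)
for b in ["A", "B", "C", "A", "D"]:
    pool.access(b)
print([e[0] for e in pool.blocks])   # ['D', 'B', 'C'] - A stolen after a full sweep
```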
Epilogue • Used on six of the ten most powerful supercomputers in the world, including the largest (ASCI White) • Installed at several hundred customer sites, on clusters ranging from a few nodes with less than a TB of disk up to 512 nodes with 140 TB of disk in 2 file systems • IP rich - ~20 filed patents • State of the art • TeraSort • world record of 17 minutes • using a 488-node SP: 432 file system nodes and 56 storage nodes (332 MHz 604e) • 6 TB total disk space • References • GPFS home page: http://www.haifa.il.ibm.com/projects/storage/gpfs.html • FAST 2002: http://www.usenix.org/events/fast/schmuck.html • TeraSort: http://www.almaden.ibm.com/cs/gpfs-spsort.html • Tiger Shark: http://www.research.ibm.com/journal/rd/422/haskin.html