GPFS: A Shared-Disk File System for Large Computing Clusters Frank Schmuck & Roger Haskin IBM Almaden Research Center
Introduction • Machines are getting more powerful • But we can always find bigger problems to solve • Faster networks → machines can form clusters • Promising for solving big problems • GPFS (General Parallel File System) • Mimics the semantics of a POSIX file system running on a single machine • Runs on six of the ten most powerful supercomputers
Introduction • Web server workloads • Multiple nodes access multiple files • Supercomputer workloads • A single node can access a file striped across multiple disks • Multiple nodes can access the same file concurrently • Need to access files and metadata in parallel • Need to perform administrative functions in parallel
GPFS Overview • Shared-disk architecture: all nodes access all disks through a switching fabric (e.g., a storage area network, or a software layer over a general-purpose network)
General Large File System Issues • Data striping and allocation, prefetch, and write-behind • Large directory support • Logging and recovery
Data Striping and Prefetch • Striping implemented at the file system level • Better control • Fault tolerance • Load balancing • GPFS recognizes sequential, reverse-sequential, and various strided access patterns • Prefetches data accordingly
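As a rough illustration of the pattern recognition above, the sketch below classifies a short history of block accesses and guesses which blocks to prefetch. The function names, the four-block prefetch depth, and the classification rules are illustrative assumptions, not GPFS internals.

```python
# Illustrative sketch (not GPFS code): classify a stream of block accesses
# so a prefetcher can guess which blocks to read ahead.

def classify_pattern(block_numbers):
    """Return 'sequential', 'reverse', 'strided', or 'random' for a list of
    recently accessed block numbers."""
    if len(block_numbers) < 3:
        return "random"
    deltas = [b - a for a, b in zip(block_numbers, block_numbers[1:])]
    if all(d == 1 for d in deltas):
        return "sequential"
    if all(d == -1 for d in deltas):
        return "reverse"
    if len(set(deltas)) == 1:          # constant non-unit stride
        return "strided"
    return "random"

def prefetch_candidates(block_numbers, depth=4):
    """Blocks to read ahead, based on the detected pattern."""
    if classify_pattern(block_numbers) == "random":
        return []
    last = block_numbers[-1]
    stride = block_numbers[-1] - block_numbers[-2]
    return [last + stride * i for i in range(1, depth + 1)]

print(prefetch_candidates([10, 11, 12, 13]))   # sequential -> [14, 15, 16, 17]
print(prefetch_candidates([0, 8, 16, 24]))     # strided    -> [32, 40, 48, 56]
```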
Allocation • Large files are stored in 256 KB blocks • Small files and file tails are stored in 8 KB subblocks (1/32 of a block) • Need to watch out for disks with different sizes • Maximizing space utilization → larger disks receive proportionally more I/O requests and become a bottleneck • Maximizing parallel performance (even I/O) → space on larger disks is under-utilized
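The small sketch below works through the space accounting implied by 256 KB blocks split into 32 subblocks of 8 KB; the helper name is made up for illustration.

```python
# Rough sketch: the tail of a file (or a small file) is stored in subblocks
# rather than wasting a full 256 KB block. Constants follow the slide above.

BLOCK = 256 * 1024
SUBBLOCK = BLOCK // 32          # 8 KB

def space_used(file_size):
    full_blocks, tail = divmod(file_size, BLOCK)
    tail_subblocks = -(-tail // SUBBLOCK)        # ceiling division
    return full_blocks * BLOCK + tail_subblocks * SUBBLOCK

print(space_used(5 * 1024))      # 5 KB file   -> 8192 bytes (one subblock)
print(space_used(300 * 1024))    # 300 KB file -> one full block + 6 subblocks
```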
Large Directory Support • GPFS uses extensible hashing to support very large directories • [Figure: directory blocks holding hashed entries (file1, file2, dir1, a hard link), split as the directory grows]
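The sketch below shows the general idea of an extensible-hash directory: a table of bucket pointers that doubles when needed, while only the overflowing bucket splits. It is a textbook-style illustration, not GPFS's on-disk directory format, and the bucket capacity of 4 is an arbitrary assumption.

```python
# Minimal extensible-hash directory sketch (illustrative only).

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # number of hash bits this bucket uses
        self.entries = {}

class ExtensibleDir:
    BUCKET_CAP = 4                  # tiny capacity so splits are visible

    def __init__(self):
        self.global_depth = 1
        self.table = [Bucket(1), Bucket(1)]

    def _bucket(self, name):
        h = hash(name) & ((1 << self.global_depth) - 1)
        return self.table[h]

    def insert(self, name, inode):
        b = self._bucket(name)
        if len(b.entries) < self.BUCKET_CAP or name in b.entries:
            b.entries[name] = inode
            return
        self._split(b)
        self.insert(name, inode)

    def _split(self, b):
        if b.depth == self.global_depth:   # pointer table must double first
            self.table = self.table * 2
            self.global_depth += 1
        b.depth += 1
        new = Bucket(b.depth)
        # Half of the slots that pointed at b now point at the new bucket.
        for i, slot in enumerate(self.table):
            if slot is b and (i >> (b.depth - 1)) & 1:
                self.table[i] = new
        # Rehash the old entries between the two buckets.
        old_entries, b.entries = b.entries, {}
        for name, inode in old_entries.items():
            self._bucket(name).entries[name] = inode

    def lookup(self, name):
        return self._bucket(name).entries.get(name)

d = ExtensibleDir()
for i in range(100):
    d.insert(f"file{i}", i)
print(d.lookup("file42"), d.global_depth)
```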
Logging and Recovery • In a large file system, there is no time to run fsck after a crash • Metadata updates are journaled in a write-ahead log • Data are not logged • Each node has a separate log that can be read by all nodes • Any node can perform recovery on behalf of a failed node
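A minimal sketch of the write-ahead idea described above, assuming an in-memory list stands in for a per-node log kept on shared disk; the record format and function names are illustrative.

```python
# Sketch (not GPFS's log format): a node appends a redo record for each
# metadata update before applying it in place; any surviving node can replay
# a failed node's log to restore consistent metadata. User data is not logged.

import json

class MetadataLog:
    def __init__(self):
        self.records = []                        # stands in for a log on shared disk

    def append(self, record):
        self.records.append(json.dumps(record))

def apply_update(metadata, log, key, value):
    log.append({"key": key, "value": value})     # 1. write the redo record first
    metadata[key] = value                        # 2. then update in place

def replay(metadata, log):
    """Run by a surviving node after a failure: redo every logged update."""
    for rec in log.records:
        r = json.loads(rec)
        metadata[r["key"]] = r["value"]

log, metadata = MetadataLog(), {}
apply_update(metadata, log, "inode42.size", 4096)
# If the in-place update were lost in a crash, replaying the log restores it:
recovered = {}
replay(recovered, log)
print(recovered)
```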
Distributed Locking vs. Centralized Management • Goal: read and write in parallel from all nodes in the cluster • Constraint: POSIX semantics require synchronizing access to data and metadata from multiple nodes • If two processes on two nodes access the same file, a read on one node must see either all or none of the data written by a concurrent write
Distributed Locking vs. Centralized Management • Two approaches: • Distributed locking: every node acquires an appropriate lock before reading or updating data • Greater parallelism • Centralized management: conflicting operations are forwarded to a designated node • Better for frequently updated metadata
Lock Granularity • Too small • High overhead • Too large • Many contending lock requests
The GPFS Distributed Lock Manager • Centralized global lock manager on one node • Local lock managers on each node • The global lock manager hands out lock tokens (the right to grant locks locally without further messages) to the local lock managers
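The toy protocol below sketches why tokens reduce message traffic: only the first lock on an object talks to the global token manager, and later locks are granted locally until another node's conflicting request forces a revoke. Class and method names are assumptions for illustration.

```python
# Toy token protocol sketch (illustrative, not the GPFS implementation).

class GlobalTokenManager:
    def __init__(self):
        self.holder = {}                  # object -> node currently holding the token

    def acquire(self, node, obj):
        current = self.holder.get(obj)
        if current is not None and current is not node:
            current.revoke(obj)           # ask the current holder to give it back
        self.holder[obj] = node

class Node:
    def __init__(self, name, gtm):
        self.name, self.gtm = name, gtm
        self.tokens = set()
        self.messages = 0

    def lock(self, obj):
        if obj not in self.tokens:        # only the first lock needs a message
            self.messages += 1
            self.gtm.acquire(self, obj)
            self.tokens.add(obj)

    def revoke(self, obj):
        self.tokens.discard(obj)

gtm = GlobalTokenManager()
a, b = Node("A", gtm), Node("B", gtm)
for _ in range(1000):
    a.lock("inode42")                     # repeated locking stays purely local
print(a.messages)                         # 1: only the initial token request
b.lock("inode42")                         # conflicting request forces a revoke
print("A still holds token:", "inode42" in a.tokens)   # False
```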
Parallel Data Access • How to write to the same file from multiple nodes? • Byte-range locking to synchronize reads and writes • Allows concurrent writes to different parts of the same file
Byte-Range Tokens • The first write request from a node acquires a token for the whole file • Efficient when no other node writes concurrently • A second node's write request to the same file revokes part of the byte-range token held by the first node • Knowing the reference pattern helps predict how to split the byte ranges
Byte-Range Tokens • Byte ranges are rounded to block boundaries, so two nodes cannot modify the same block concurrently • Cost: false sharing, where a shared block is frequently moved between nodes because each is updating different bytes within it
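The sketch below illustrates the byte-range negotiation described in the last two slides, simplified to one block-aligned range per node (the real token protocol is richer). The 256 KB block size matches the allocation slide; everything else is assumed for illustration.

```python
# Simplified byte-range token sketch: the first writer gets the whole file;
# a later writer takes back only the block-aligned range it needs, so no
# block is ever writable from two nodes at once.

BLOCK = 256 * 1024
INF = float("inf")

def round_to_blocks(start, end):
    lo = (start // BLOCK) * BLOCK
    hi = end if end == INF else -(-end // BLOCK) * BLOCK   # round end up
    return lo, hi

class ByteRangeTokens:
    def __init__(self):
        self.tokens = {}            # node -> (start, end), block-aligned

    def request(self, node, start, end):
        start, end = round_to_blocks(start, end)
        # First request: hand out the whole file so later writes stay local.
        if not self.tokens:
            self.tokens[node] = (0, INF)
            return self.tokens[node]
        # Otherwise shrink conflicting holders so ranges no longer overlap.
        for other, (s, e) in list(self.tokens.items()):
            if other != node and s < end and start < e:
                if s < start:
                    self.tokens[other] = (s, start)   # keep the part below
                else:
                    self.tokens[other] = (end, e)     # keep the part above
        self.tokens[node] = (start, end)
        return self.tokens[node]

tm = ByteRangeTokens()
print(tm.request("node1", 0, 100))                  # (0, inf): whole file
print(tm.request("node2", 10 * BLOCK, 11 * BLOCK))  # block-aligned range
print(tm.tokens["node1"])                           # shrunk to (0, 10*BLOCK)
```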
Synchronizing Access to File Metadata • Multiple nodes writing to the same file cause concurrent updates to the inode (file size, time stamps) and indirect blocks • Synchronizing every such update with an exclusive lock would be very expensive
Synchronizing Access to File Metadata • GPFS uses a shared write lock on the inode, so multiple writers can update it concurrently • Conflicting inode updates are merged: keep the largest file size and the latest time stamp • For operations that need exact metadata, such as concurrent appends to the same file, one node (the metanode) is responsible for updating the inode • The metanode is elected dynamically
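A minimal sketch of merging the non-exact inode fields, assuming plain dictionaries stand in for per-node inode copies: the merge keeps the largest size and the latest mtime, as the slide describes.

```python
# Merge the inode fields that do not need to be exact under POSIX.

def merge_inode(copies):
    """copies: list of {'size': int, 'mtime': float} as seen by different nodes."""
    return {
        "size": max(c["size"] for c in copies),
        "mtime": max(c["mtime"] for c in copies),
    }

local_views = [
    {"size": 4096,  "mtime": 1000.0},   # node A wrote near the start of the file
    {"size": 81920, "mtime": 1000.5},   # node B extended the file later
]
print(merge_inode(local_views))         # {'size': 81920, 'mtime': 1000.5}
```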
Allocation Maps • Need 32 bits per block (one per subblock) • The map is divided into n separately lockable regions • Each region describes 1/n of the blocks on every disk • Allocation from a single region can still stripe across all disks • Minimizes lock conflicts • One node (the allocation manager) maintains the free-space statistics • Periodically updated
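The sketch below shows why this layout avoids conflicts: each region covers a slice of every disk, so a node allocating only from its own region can still stripe a file across all disks. The region, disk, and block counts are arbitrary example values.

```python
# Toy allocation-map layout: blocks of every disk are dealt round-robin into
# regions, so each region spans all disks and can be locked independently.

N_REGIONS, N_DISKS, BLOCKS_PER_DISK = 4, 3, 12

def region_of(block):
    return block % N_REGIONS

def blocks_in_region(region):
    return [(d, b) for d in range(N_DISKS)
                   for b in range(BLOCKS_PER_DISK)
                   if region_of(b) == region]

# Region 0 contains blocks from every disk, so striping remains possible:
disks_covered = {d for d, _ in blocks_in_region(0)}
print(sorted(disks_covered))      # [0, 1, 2]
```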
Other File System Metadata • Centralized management to coordinate metadata updates • Quota manager
Token Manager Scaling • File size is unbounded, so the number of byte-range tokens is also unbounded • Tokens can use up the entire memory of the token manager • The token manager needs to monitor and prevent unbounded growth • Revokes tokens as necessary • Reuses tokens freed by deleted files
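One possible bounding policy is sketched below; the slides only say that tokens must be revoked and reused, so the LRU choice and the tiny table limit are assumptions for illustration.

```python
# Assumed policy (illustration only): cap the token table and revoke the
# least recently used token when it is full, reusing its slot.

from collections import OrderedDict

MAX_TOKENS = 3                       # tiny limit to make the example visible

class TokenTable:
    def __init__(self):
        self.tokens = OrderedDict()  # token id -> holder, kept in LRU order

    def grant(self, token_id, holder):
        if token_id in self.tokens:
            self.tokens.move_to_end(token_id)
        elif len(self.tokens) >= MAX_TOKENS:
            victim, old_holder = self.tokens.popitem(last=False)
            print(f"revoking {victim} from {old_holder}")   # reuse its memory
        self.tokens[token_id] = holder

t = TokenTable()
for i in range(5):
    t.grant(f"byte-range-{i}", "nodeA")
print(list(t.tokens))                # only the 3 most recent tokens remain
```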
Fault Tolerance • Node failures • Communication failures • Disk failures
Node Failures • Periodic heartbeat messages detect node failures • A surviving node runs log recovery for the failed node • The token manager releases tokens held by the failed node • Other nodes can then resend committed updates
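A heartbeat-based failure detector in miniature; the timeout value and class names are assumptions, and the print stands in for triggering log replay and token release.

```python
# Sketch: a node is suspected failed when no heartbeat has been seen within
# the timeout, which triggers recovery actions by a surviving node.

import time

HEARTBEAT_TIMEOUT = 3.0              # seconds; an assumed value

class Membership:
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

m = Membership()
m.heartbeat("node1")
m.heartbeat("node2")
m.last_seen["node2"] -= 10           # simulate a missed heartbeat window
for node in m.failed_nodes():
    print(f"{node} failed: replay its log, release its tokens")
```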
Communication Failures • Network partitions: continued operation by both halves could corrupt the file system • The file system remains accessible only to the group containing a majority of the nodes in the cluster
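A minimal quorum test, assuming a fixed cluster size known to every node: only the side of a partition that can reach a strict majority keeps the file system accessible, and the minority side stops to avoid corrupting the shared disks.

```python
# Majority-quorum check for a partitioned cluster.

def has_quorum(reachable_nodes, cluster_size):
    return len(reachable_nodes) > cluster_size // 2

CLUSTER_SIZE = 8
print(has_quorum({"n1", "n2", "n3", "n4", "n5"}, CLUSTER_SIZE))  # True: majority keeps running
print(has_quorum({"n6", "n7", "n8"}, CLUSTER_SIZE))              # False: minority stops
```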
Disk Failures • Dual-attached RAID controllers • Files can also be replicated across disks
Scalable Online System Utilities • Adding, deleting, and replacing disks • Rebalancing file system content • Defragmentation, quota-check, fsck • A file system manager node coordinates administrative activities
Experiences • Workload skew matters: small management overheads can affect parallel applications in significant ways • In a 512-node cluster, one node running 1% slower delays the synchronized application as much as leaving about five nodes (512 × 1% ≈ 5) completely idle • Dedicated administrative nodes are needed
Experiences • Even the rarest failures can happen • Data loss in a RAID • A bad batch of disk drives
Related Work • Storage area network file systems: centralized metadata server • SGI's XFS file system: not a clustered file system • Frangipani, Global File System: do not support multiple accesses to the same file
Summary and Conclusions • GPFS uses distributed locking and logging/recovery • Uses RAID and replication for reliability • Scales up to the largest supercomputers in the world • Provides fault tolerance and system management functions