GPFS: A Shared-Disk File System for Large Computing Clusters Frank Schmuck & Roger Haskin IBM Almaden Research Center
Introduction • Machines are getting more powerful • But we can always find bigger problems to solve • Faster networks → machines can form clusters • Promising for solving big problems • GPFS (General Parallel File System) • Mimics the semantics of a POSIX file system running on a single machine • Runs on six of the ten most powerful supercomputers
Introduction • Web server workloads • Multiple nodes access multiple files • Supercomputer workloads • A single node can access a file striped across multiple disks • Multiple nodes can access the same file concurrently • Need to access files and metadata in parallel • Need to perform administrative functions in parallel
GPFS Overview • Shared-disk architecture: all nodes access all disks through a switching fabric (e.g., a storage area network, or a software layer over a general-purpose network)
General Large File System Issues • Data striping and allocation, prefetch, and write-behind • Large directory support • Logging and recovery
Data Striping and Prefetch • Striping implemented at the file system level • Better control • Fault tolerance • Load balancing • GPFS recognizes sequential, reverse-sequential, and various strided access patterns • Prefetches data accordingly
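As a rough illustration of the pattern recognition above, the sketch below classifies a short history of block accesses and guesses which blocks to prefetch. The function names, the four-block prefetch depth, and the classification rules are illustrative assumptions, not GPFS internals.

```python
# Illustrative sketch (not GPFS code): classify a stream of block accesses
# so a prefetcher can guess which blocks to read ahead.

def classify_pattern(block_numbers):
    """Return 'sequential', 'reverse', 'strided', or 'random' for a list of
    recently accessed block numbers."""
    if len(block_numbers) < 3:
        return "random"
    deltas = [b - a for a, b in zip(block_numbers, block_numbers[1:])]
    if all(d == 1 for d in deltas):
        return "sequential"
    if all(d == -1 for d in deltas):
        return "reverse"
    if len(set(deltas)) == 1:          # constant non-unit stride
        return "strided"
    return "random"

def prefetch_candidates(block_numbers, depth=4):
    """Blocks to read ahead, based on the detected pattern."""
    if classify_pattern(block_numbers) == "random":
        return []
    last = block_numbers[-1]
    stride = block_numbers[-1] - block_numbers[-2]
    return [last + stride * i for i in range(1, depth + 1)]

print(prefetch_candidates([10, 11, 12, 13]))   # sequential -> [14, 15, 16, 17]
print(prefetch_candidates([0, 8, 16, 24]))     # strided    -> [32, 40, 48, 56]
```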
Allocation • Large files are stored in 256 KB blocks • Small files and file tails are stored in 8 KB subblocks (1/32 of a block) • Need to watch out for disks with different sizes • Maximizing space utilization → larger disks receive proportionally more I/O requests and become a bottleneck • Maximizing parallel performance (even I/O) → space on larger disks is under-utilized
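The small sketch below works through the space accounting implied by 256 KB blocks split into 32 subblocks of 8 KB; the helper name is made up for illustration.

```python
# Rough sketch: the tail of a file (or a small file) is stored in subblocks
# rather than wasting a full 256 KB block. Constants follow the slide above.

BLOCK = 256 * 1024
SUBBLOCK = BLOCK // 32          # 8 KB

def space_used(file_size):
    full_blocks, tail = divmod(file_size, BLOCK)
    tail_subblocks = -(-tail // SUBBLOCK)        # ceiling division
    return full_blocks * BLOCK + tail_subblocks * SUBBLOCK

print(space_used(5 * 1024))      # 5 KB file   -> 8192 bytes (one subblock)
print(space_used(300 * 1024))    # 300 KB file -> one full block + 6 subblocks
```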
Large Directory Support • GPFS uses extensible hashing to support very large directories • [Figure: directory blocks holding hashed entries (file1, file2, dir1, a hard link), split as the directory grows]
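The sketch below shows the general idea of an extensible-hash directory: a table of bucket pointers that doubles when needed, while only the overflowing bucket splits. It is a textbook-style illustration, not GPFS's on-disk directory format, and the bucket capacity of 4 is an arbitrary assumption.

```python
# Minimal extensible-hash directory sketch (illustrative only).

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # number of hash bits this bucket uses
        self.entries = {}

class ExtensibleDir:
    BUCKET_CAP = 4                  # tiny capacity so splits are visible

    def __init__(self):
        self.global_depth = 1
        self.table = [Bucket(1), Bucket(1)]

    def _bucket(self, name):
        h = hash(name) & ((1 << self.global_depth) - 1)
        return self.table[h]

    def insert(self, name, inode):
        b = self._bucket(name)
        if len(b.entries) < self.BUCKET_CAP or name in b.entries:
            b.entries[name] = inode
            return
        self._split(b)
        self.insert(name, inode)

    def _split(self, b):
        if b.depth == self.global_depth:   # pointer table must double first
            self.table = self.table * 2
            self.global_depth += 1
        b.depth += 1
        new = Bucket(b.depth)
        # Half of the slots that pointed at b now point at the new bucket.
        for i, slot in enumerate(self.table):
            if slot is b and (i >> (b.depth - 1)) & 1:
                self.table[i] = new
        # Rehash the old entries between the two buckets.
        old_entries, b.entries = b.entries, {}
        for name, inode in old_entries.items():
            self._bucket(name).entries[name] = inode

    def lookup(self, name):
        return self._bucket(name).entries.get(name)

d = ExtensibleDir()
for i in range(100):
    d.insert(f"file{i}", i)
print(d.lookup("file42"), d.global_depth)
```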
Logging and Recovery • In a large file system, there is no time to run fsck after a crash • Metadata updates are journaled in a write-ahead log • Data are not logged • Each node has a separate log that can be read by all nodes • Any node can perform recovery on behalf of a failed node
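A minimal sketch of the write-ahead idea described above, assuming an in-memory list stands in for a per-node log kept on shared disk; the record format and function names are illustrative.

```python
# Sketch (not GPFS's log format): a node appends a redo record for each
# metadata update before applying it in place; any surviving node can replay
# a failed node's log to restore consistent metadata. User data is not logged.

import json

class MetadataLog:
    def __init__(self):
        self.records = []                        # stands in for a log on shared disk

    def append(self, record):
        self.records.append(json.dumps(record))

def apply_update(metadata, log, key, value):
    log.append({"key": key, "value": value})     # 1. write the redo record first
    metadata[key] = value                        # 2. then update in place

def replay(metadata, log):
    """Run by a surviving node after a failure: redo every logged update."""
    for rec in log.records:
        r = json.loads(rec)
        metadata[r["key"]] = r["value"]

log, metadata = MetadataLog(), {}
apply_update(metadata, log, "inode42.size", 4096)
# If the in-place update were lost in a crash, replaying the log restores it:
recovered = {}
replay(recovered, log)
print(recovered)
```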
Distributed Locking vs. Centralized Management • Goal: read and write in parallel from all nodes in the cluster • Constraint: POSIX semantics require synchronizing access to data and metadata from multiple nodes • If two processes on two nodes access the same file, a read on one node must see either all or none of the data written by a concurrent write
Distributed Locking vs. Centralized Management • Two approaches: • Distributed locking: every node acquires an appropriate lock before reading or updating data • Greater parallelism • Centralized management: conflicting operations are forwarded to a designated node • Better for frequently updated metadata
Lock Granularity • Too small • High overhead • Too large • Many contending lock requests
The GPFS Distributed Lock Manager • Centralized global lock manager on one node • Local lock managers on each node • The global lock manager hands out lock tokens (the right to grant locks locally without further messages) to the local lock managers
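The toy protocol below sketches why tokens reduce message traffic: only the first lock on an object talks to the global token manager, and later locks are granted locally until another node's conflicting request forces a revoke. Class and method names are assumptions for illustration.

```python
# Toy token protocol sketch (illustrative, not the GPFS implementation).

class GlobalTokenManager:
    def __init__(self):
        self.holder = {}                  # object -> node currently holding the token

    def acquire(self, node, obj):
        current = self.holder.get(obj)
        if current is not None and current is not node:
            current.revoke(obj)           # ask the current holder to give it back
        self.holder[obj] = node

class Node:
    def __init__(self, name, gtm):
        self.name, self.gtm = name, gtm
        self.tokens = set()
        self.messages = 0

    def lock(self, obj):
        if obj not in self.tokens:        # only the first lock needs a message
            self.messages += 1
            self.gtm.acquire(self, obj)
            self.tokens.add(obj)

    def revoke(self, obj):
        self.tokens.discard(obj)

gtm = GlobalTokenManager()
a, b = Node("A", gtm), Node("B", gtm)
for _ in range(1000):
    a.lock("inode42")                     # repeated locking stays purely local
print(a.messages)                         # 1: only the initial token request
b.lock("inode42")                         # conflicting request forces a revoke
print("A still holds token:", "inode42" in a.tokens)   # False
```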
Parallel Data Access • How to write to the same file from multiple nodes? • Byte-range locking to synchronize reads and writes • Allows concurrent writes to different parts of the same file
Byte-Range Tokens • The first write request from a node acquires a token for the whole file • Efficient when no other node writes concurrently • A second node's write request to the same file revokes part of the byte-range token held by the first node • Knowing the reference pattern helps predict how to split the byte ranges
Byte-Range Tokens • Byte ranges are rounded to block boundaries, so two nodes cannot modify the same block concurrently • Cost: false sharing, where a shared block is frequently moved between nodes because each is updating different bytes within it
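The sketch below illustrates the byte-range negotiation described in the last two slides, simplified to one block-aligned range per node (the real token protocol is richer). The 256 KB block size matches the allocation slide; everything else is assumed for illustration.

```python
# Simplified byte-range token sketch: the first writer gets the whole file;
# a later writer takes back only the block-aligned range it needs, so no
# block is ever writable from two nodes at once.

BLOCK = 256 * 1024
INF = float("inf")

def round_to_blocks(start, end):
    lo = (start // BLOCK) * BLOCK
    hi = end if end == INF else -(-end // BLOCK) * BLOCK   # round end up
    return lo, hi

class ByteRangeTokens:
    def __init__(self):
        self.tokens = {}            # node -> (start, end), block-aligned

    def request(self, node, start, end):
        start, end = round_to_blocks(start, end)
        # First request: hand out the whole file so later writes stay local.
        if not self.tokens:
            self.tokens[node] = (0, INF)
            return self.tokens[node]
        # Otherwise shrink conflicting holders so ranges no longer overlap.
        for other, (s, e) in list(self.tokens.items()):
            if other != node and s < end and start < e:
                if s < start:
                    self.tokens[other] = (s, start)   # keep the part below
                else:
                    self.tokens[other] = (end, e)     # keep the part above
        self.tokens[node] = (start, end)
        return self.tokens[node]

tm = ByteRangeTokens()
print(tm.request("node1", 0, 100))                  # (0, inf): whole file
print(tm.request("node2", 10 * BLOCK, 11 * BLOCK))  # block-aligned range
print(tm.tokens["node1"])                           # shrunk to (0, 10*BLOCK)
```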
Synchronizing Access to File Metadata • Multiple nodes writing to the same file cause concurrent updates to the inode (file size, time stamps) and indirect blocks • Synchronizing every such update with an exclusive lock would be very expensive
Synchronizing Access to File Metadata • GPFS uses a shared write lock on the inode, so multiple writers can update it concurrently • Conflicting inode updates are merged: keep the largest file size and the latest time stamp • For operations that need exact metadata, such as concurrent appends to the same file, one node (the metanode) is responsible for updating the inode • The metanode is elected dynamically
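A minimal sketch of merging the non-exact inode fields, assuming plain dictionaries stand in for per-node inode copies: the merge keeps the largest size and the latest mtime, as the slide describes.

```python
# Merge the inode fields that do not need to be exact under POSIX.

def merge_inode(copies):
    """copies: list of {'size': int, 'mtime': float} as seen by different nodes."""
    return {
        "size": max(c["size"] for c in copies),
        "mtime": max(c["mtime"] for c in copies),
    }

local_views = [
    {"size": 4096,  "mtime": 1000.0},   # node A wrote near the start of the file
    {"size": 81920, "mtime": 1000.5},   # node B extended the file later
]
print(merge_inode(local_views))         # {'size': 81920, 'mtime': 1000.5}
```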
Allocation Maps • Need 32 bits per block (one per subblock) • The map is divided into n separately lockable regions • Each region describes 1/n of the blocks on every disk • Allocation from a single region can still stripe across all disks • Minimizes lock conflicts • One node (the allocation manager) maintains the free-space statistics • Periodically updated
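The sketch below shows why this layout avoids conflicts: each region covers a slice of every disk, so a node allocating only from its own region can still stripe a file across all disks. The region, disk, and block counts are arbitrary example values.

```python
# Toy allocation-map layout: blocks of every disk are dealt round-robin into
# regions, so each region spans all disks and can be locked independently.

N_REGIONS, N_DISKS, BLOCKS_PER_DISK = 4, 3, 12

def region_of(block):
    return block % N_REGIONS

def blocks_in_region(region):
    return [(d, b) for d in range(N_DISKS)
                   for b in range(BLOCKS_PER_DISK)
                   if region_of(b) == region]

# Region 0 contains blocks from every disk, so striping remains possible:
disks_covered = {d for d, _ in blocks_in_region(0)}
print(sorted(disks_covered))      # [0, 1, 2]
```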
Other File System Metadata • Centralized management to coordinate metadata updates • Quota manager
Token Manager Scaling • File size is unbounded, so the number of byte-range tokens is also unbounded • Tokens can use up the entire memory of the token manager • The token manager needs to monitor and prevent unbounded growth • Revokes tokens as necessary • Reuses tokens freed by deleted files
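One possible bounding policy is sketched below; the slides only say that tokens must be revoked and reused, so the LRU choice and the tiny table limit are assumptions for illustration.

```python
# Assumed policy (illustration only): cap the token table and revoke the
# least recently used token when it is full, reusing its slot.

from collections import OrderedDict

MAX_TOKENS = 3                       # tiny limit to make the example visible

class TokenTable:
    def __init__(self):
        self.tokens = OrderedDict()  # token id -> holder, kept in LRU order

    def grant(self, token_id, holder):
        if token_id in self.tokens:
            self.tokens.move_to_end(token_id)
        elif len(self.tokens) >= MAX_TOKENS:
            victim, old_holder = self.tokens.popitem(last=False)
            print(f"revoking {victim} from {old_holder}")   # reuse its memory
        self.tokens[token_id] = holder

t = TokenTable()
for i in range(5):
    t.grant(f"byte-range-{i}", "nodeA")
print(list(t.tokens))                # only the 3 most recent tokens remain
```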
Fault Tolerance • Node failures • Communication failures • Disk failures
Node Failures • Periodic heartbeat messages detect node failures • A surviving node runs log recovery for the failed node • The token manager releases tokens held by the failed node • Other nodes can then resend committed updates
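A heartbeat-based failure detector in miniature; the timeout value and class names are assumptions, and the print stands in for triggering log replay and token release.

```python
# Sketch: a node is suspected failed when no heartbeat has been seen within
# the timeout, which triggers recovery actions by a surviving node.

import time

HEARTBEAT_TIMEOUT = 3.0              # seconds; an assumed value

class Membership:
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

m = Membership()
m.heartbeat("node1")
m.heartbeat("node2")
m.last_seen["node2"] -= 10           # simulate a missed heartbeat window
for node in m.failed_nodes():
    print(f"{node} failed: replay its log, release its tokens")
```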
Communication Failures • Network partitions: continued operation by both halves could corrupt the file system • The file system remains accessible only to the group containing a majority of the nodes in the cluster
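A minimal quorum test, assuming a fixed cluster size known to every node: only the side of a partition that can reach a strict majority keeps the file system accessible, and the minority side stops to avoid corrupting the shared disks.

```python
# Majority-quorum check for a partitioned cluster.

def has_quorum(reachable_nodes, cluster_size):
    return len(reachable_nodes) > cluster_size // 2

CLUSTER_SIZE = 8
print(has_quorum({"n1", "n2", "n3", "n4", "n5"}, CLUSTER_SIZE))  # True: majority keeps running
print(has_quorum({"n6", "n7", "n8"}, CLUSTER_SIZE))              # False: minority stops
```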
Disk Failures • Dual-attached RAID controllers • Files can also be replicated across disks
Scalable Online System Utilities • Adding, deleting, and replacing disks • Rebalancing file system content • Defragmentation, quota-check, fsck • A file system manager node coordinates administrative activities
Experiences • Workload skew matters: small management overheads can affect parallel applications in significant ways • In a 512-node cluster, one node running 1% slower delays the synchronized application as much as leaving about five nodes (512 × 1% ≈ 5) completely idle • Dedicated administrative nodes are needed
Experiences • Even the rarest failures can happen • Data loss in a RAID • A bad batch of disk drives
Related Work • Storage area network file systems: centralized metadata server • SGI's XFS file system: not a clustered file system • Frangipani, Global File System: do not support multiple accesses to the same file
Summary and Conclusions • GPFS uses distributed locking and logging/recovery • Uses RAID and replication for reliability • Scales up to the largest supercomputers in the world • Provides fault tolerance and system management functions