
GPFS: A Shared-Disk File System for Large Computing Clusters


Presentation Transcript


  1. GPFS: A Shared-Disk File System for Large Computing Clusters Frank Schmuck & Roger Haskin IBM Almaden Research Center

  2. Introduction • Machines are getting more powerful • But, we can always find bigger problems to solve • Faster networks → machines can form clusters • Promising to solve big problems • GPFS (General Parallel File System) • Mimics the semantics of a POSIX file system running on a single machine • Running on 6 of the 10 top supercomputers

  3. Introduction • Web server workloads • Multiple nodes access multiple files • Supercomputer workloads • Single node can access a file stored on multiple nodes • Multiple nodes can access the same file stored on multiple nodes • Need to access files and metadata in parallel • Need to perform administrative functions in parallel

  4. GPFS Overview • Uses shared disks accessed by all nodes over a switching fabric (e.g., a storage area network)

  5. General Large File System Issues • Data striping and allocation, prefetch, and write-behind • Large directory support • Logging and recovery

  6. Data Striping and Prefetch • Striping implemented at the file system level • Better control • Fault tolerance • Load balancing • GPFS recognizes sequential, reverse sequential, and various strided access patterns • Prefetches data accordingly
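A minimal sketch of the access-pattern recognition idea, assuming a simple per-file history of block numbers; the function names and the four-block prefetch depth are illustrative, not GPFS internals:

```python
# Classify a stream of block accesses and pick blocks to prefetch.
# Purely illustrative; not GPFS code.

def classify_access_pattern(block_history):
    """Return ('sequential'|'reverse'|'strided'|'random', stride)."""
    if len(block_history) < 3:
        return "random", None
    deltas = [b - a for a, b in zip(block_history, block_history[1:])]
    if all(d == deltas[0] for d in deltas) and deltas[0] != 0:
        if deltas[0] == 1:
            return "sequential", 1
        if deltas[0] == -1:
            return "reverse", -1
        return "strided", deltas[0]
    return "random", None

def blocks_to_prefetch(block_history, depth=4):
    """Predict the next `depth` blocks for a recognized pattern."""
    pattern, stride = classify_access_pattern(block_history)
    if pattern == "random":
        return []
    last = block_history[-1]
    return [last + stride * i for i in range(1, depth + 1)]

print(blocks_to_prefetch([10, 12, 14, 16]))   # strided -> [18, 20, 22, 24]
print(blocks_to_prefetch([7, 6, 5]))          # reverse -> [4, 3, 2, 1]
```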

  7. Allocation • Large files are stored in 256KB blocks • Small files are stored in 8KB subblocks (1/32 of a block) • Need to watch out for disks with different sizes • Maximizing space utilization → larger disks receive more I/O requests and become a bottleneck • Maximizing parallel performance (equal striping) → larger disks are underutilized
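A small sketch of the space accounting this block/subblock split implies, assuming the tail of a file can be stored in 8 KB subblocks; it illustrates the arithmetic only, not GPFS allocation code:

```python
# Full 256 KB blocks for the body of a file, 8 KB subblocks for the tail.

BLOCK = 256 * 1024          # full block size
SUBBLOCK = BLOCK // 32      # 8 KB subblock

def allocated_bytes(file_size):
    """Space consumed if full blocks hold the body and subblocks hold the tail."""
    full_blocks, tail = divmod(file_size, BLOCK)
    tail_subblocks = -(-tail // SUBBLOCK)        # ceiling division
    return full_blocks * BLOCK + tail_subblocks * SUBBLOCK

print(allocated_bytes(1024))        # 1 KB file -> one 8 KB subblock
print(allocated_bytes(300 * 1024))  # 300 KB -> one full block + 6 subblocks
```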

  8. Large Directory Support • GPFS uses extensible hashing to support very large directories • [Figure: directory blocks of hashed entries (file_1, file_2, dir1, file2_hardlink) before and after a block split]
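A toy version of extensible hashing for directory lookups, roughly in the spirit of the figure above; the hash function, block capacity, and class names are invented for the example, and a real implementation would split only the overflowing block rather than doubling the whole table:

```python
import hashlib

def name_hash(name):
    # Stable 32-bit hash of the file name (illustrative choice of hash).
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:4], "little")

class ExtensibleDirectory:
    def __init__(self, block_capacity=2):
        self.depth = 0                  # number of low-order hash bits in use
        self.blocks = [{}]              # 2**depth directory blocks
        self.capacity = block_capacity  # entries per directory block

    def _block_index(self, name):
        return name_hash(name) & ((1 << self.depth) - 1)

    def insert(self, name, inode):
        # A real extensible hash splits only the full block; doubling the
        # whole table here keeps the sketch short.
        while len(self.blocks[self._block_index(name)]) >= self.capacity:
            self._split()
        self.blocks[self._block_index(name)][name] = inode

    def _split(self):
        # Use one more hash bit and redistribute the existing entries.
        self.depth += 1
        old_blocks = self.blocks
        self.blocks = [{} for _ in range(1 << self.depth)]
        for block in old_blocks:
            for name, inode in block.items():
                self.blocks[self._block_index(name)][name] = inode

    def lookup(self, name):
        # Hash the name once and read a single directory block.
        return self.blocks[self._block_index(name)].get(name)

d = ExtensibleDirectory()
for i, name in enumerate(["file_1", "file_2", "dir1", "notes.txt"]):
    d.insert(name, 100 + i)
print(d.lookup("dir1"), "| directory now uses", d.depth, "hash bit(s)")
```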

  9. Logging and Recovery • In a large file system, there is no time to run fsck • Uses journaling (a write-ahead log) for metadata • Data are not logged • Each node has a separate log • Can be read by all nodes • Any node can perform recovery on behalf of a failed node
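A rough sketch of per-node write-ahead logging of metadata, assuming simple redo records that can be replayed idempotently; the record format and method names are made up for the illustration:

```python
import json

class MetadataLog:
    """Each node appends redo records before updating metadata in place."""

    def __init__(self):
        self.records = []          # stands in for the node's log on shared disk
        self.metadata = {}         # stands in for on-disk metadata

    def update(self, key, value):
        # 1. Force the redo record to the log first (write-ahead).
        self.records.append(json.dumps({"key": key, "value": value}))
        # 2. Only then update the metadata block itself.
        self.metadata[key] = value

    def recover_into(self, metadata):
        # Any node can replay this log on behalf of a failed node:
        # reapplying redo records is idempotent.
        for rec in self.records:
            r = json.loads(rec)
            metadata[r["key"]] = r["value"]
        return metadata

node_log = MetadataLog()
node_log.update("inode:42:size", 4096)
print(node_log.recover_into({}))   # another node replays the failed node's log
```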

  10. Managing Parallelism and Consistency in a Cluster

  11. Distributed Locking vs. Centralized Management • Goal: reading and writing in parallel from all nodes in the cluster • Constraint: • POSIX semantics • Synchronizing access to data and metadata from multiple nodes • If two processes on two nodes access the same file • A read on one node will see either all or none of the data written by a concurrent write

  12. Distributed Locking vs. Centralized Management • Two approaches to locking: • Distributed • Consult with all other nodes before acquiring locks • Greater parallelism • Centralized • Consult with a designated node • Better for frequently updated metadata

  13. Lock Granularity • Too small • High overhead • Too large • Many contending lock requests

  14. The GPFS Distributed Lock Manager • Centralized global lock manager on one node • A local lock manager on each node • Global lock manager • Hands out lock tokens (the right to grant locks locally) to the local lock managers
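A simplified model of the token idea: a central manager grants a node the right to lock an object locally and revokes conflicting tokens first. The modes and method names are illustrative, not GPFS's actual token protocol:

```python
# Only shared/shared is compatible; every other combination conflicts.
COMPATIBLE = {("shared", "shared")}

class TokenManager:
    def __init__(self):
        self.holders = {}   # object -> {node: mode}

    def acquire(self, node, obj, mode):
        holders = self.holders.setdefault(obj, {})
        conflicting = [n for n, m in holders.items()
                       if n != node and (m, mode) not in COMPATIBLE]
        for n in conflicting:
            self.revoke(n, obj)          # holder gives the token back first
        holders[node] = mode
        return f"token({obj}, {mode}) granted to {node}"

    def revoke(self, node, obj):
        # In GPFS the holder flushes dirty data/metadata before releasing.
        self.holders[obj].pop(node, None)

tm = TokenManager()
print(tm.acquire("node1", "inode:7", "exclusive"))
print(tm.acquire("node2", "inode:7", "shared"))   # forces a revoke from node1
```

Once a node holds the token, it can grant and release locks locally as often as it likes without further messages, which is what makes this cheaper than asking a central lock server on every operation.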

  15. Parallel Data Access • How to write to the same file from multiple nodes? • Byte-range locking to synchronize reads and writes • Allows concurrent writes to different parts of the same file

  16. Byte-Range Tokens • First write request from one node • Acquires a token for the whole file • Efficient for non-concurrent writes • Second write request to the same file from a second node • Revoke part of the byte-range token held by the first node • Knowing the reference pattern helps to predict how to break up the byte-ranges
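A sketch of the byte-range negotiation described above, treating ranges as half-open intervals: the first writer holds a token for the whole file, and a later conflicting request only takes away the overlap:

```python
INF = float("inf")

def shrink(held, requested):
    """Return the pieces of `held` (start, end) that survive after giving
    up the overlap with `requested`."""
    hs, he = held
    rs, re = requested
    if re <= hs or rs >= he:
        return [held]                       # no overlap, keep everything
    keep = []
    if hs < rs:
        keep.append((hs, rs))               # piece before the requested range
    if re < he:
        keep.append((re, he))               # piece after the requested range
    return keep

# Node 1's first write acquired a token for the whole file.
node1_token = (0, INF)
# Node 2 now wants to write bytes [1 MB, 2 MB).
print(shrink(node1_token, (1 << 20, 2 << 20)))
# -> [(0, 1048576), (2097152, inf)]: node 1 keeps everything it was not asked for.
```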

  17. Byte-Range Tokens • Byte-ranges are rounded to block boundaries • So two nodes cannot modify the same block • False sharing: a shared block being moved back and forth between nodes because each keeps updating it
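A minimal illustration of rounding a requested byte range outward to 256 KB block boundaries, so a write token always covers whole blocks and two nodes never hold write tokens for the same block:

```python
BLOCK = 256 * 1024   # GPFS-style 256 KB blocks

def round_to_blocks(start, end):
    """Expand [start, end) to the enclosing block-aligned range."""
    aligned_start = (start // BLOCK) * BLOCK
    aligned_end = -(-end // BLOCK) * BLOCK       # ceiling to the next boundary
    return aligned_start, aligned_end

# A write of bytes [100_000, 300_000) ends up owning blocks 0 and 1 entirely.
print(round_to_blocks(100_000, 300_000))   # -> (0, 524288)
```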

  18. Synchronizing Access to File Metadata • Multiple nodes writing to the same file • Concurrent updates to the inode and indirect blocks • Synchronizing updates is very expensive

  19. Synchronizing Access to File Metadata • GPFS • Uses a shared write lock on the inode • Conflicting updates are merged: the largest file size and the latest timestamp win • How do multiple nodes append to the same file concurrently? • One node (the metanode) is responsible for updating the inode • Elected dynamically
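A tiny sketch of why the shared write lock works: size and modification time only move forward, so concurrent inode updates can be merged by taking the maximum of each field. The field names are illustrative:

```python
def merge_inode_updates(*updates):
    """Each update is a dict with 'size' and 'mtime' as seen by one node."""
    return {
        "size": max(u["size"] for u in updates),    # largest file size wins
        "mtime": max(u["mtime"] for u in updates),  # latest timestamp wins
    }

# Two nodes appended to the same file and each cached its own view:
print(merge_inode_updates({"size": 4096, "mtime": 1700000010},
                          {"size": 8192, "mtime": 1700000007}))
# -> {'size': 8192, 'mtime': 1700000010}
```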

  20. Allocation Maps • Need 32 bits per block due to subblocks • Divided into n separate lockable regions • Each region tracks 1/nth of the blocks on every disk • A node allocating from a single region can still stripe data across all disks • Minimizes lock conflicts • One node maintains the free-space statistics • Periodically updated
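A toy model of the segmented allocation map, with made-up disk and region counts: every region covers a slice of every disk, so a node allocating only from its own region can still stripe a file across all disks:

```python
NUM_DISKS = 4
BLOCKS_PER_DISK = 12
NUM_REGIONS = 3

def region_of(disk, block):
    # Interleave the blocks of each disk across the regions.
    return block % NUM_REGIONS

def blocks_in_region(region):
    """All (disk, block) pairs a node holding this region may allocate from."""
    return [(d, b) for d in range(NUM_DISKS)
                   for b in range(BLOCKS_PER_DISK)
                   if region_of(d, b) == region]

free = blocks_in_region(1)
print(len(free), free[:6])
# Every disk appears in the region, so striping needs no other region's lock.
print(sorted({d for d, _ in free}) == list(range(NUM_DISKS)))   # True
```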

  21. Other File System Metadata • Centralized management to coordinate metadata updates • Quota manager

  22. Token Manager Scaling • File size is unbounded • Number of byte-range tokens is also unbounded • Can use up the entire memory • Token manager needs to monitor and prevent unbounded growth • Revoke tokens as necessary • Reuse tokens freed by deleted files

  23. Fault Tolerance • Node failures • Communication failures • Disk failures

  24. Node Failures • Periodic heartbeat messages to detect node failures • Run log recovery from surviving nodes • Token manager releases tokens held by the failed node • Other nodes can resend committed updates
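A simplified version of the heartbeat bookkeeping the slide describes; the timeout value and the API are assumptions made for the example:

```python
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before declaring a node failed

class Membership:
    def __init__(self, nodes):
        now = time.monotonic()
        self.last_seen = {n: now for n in nodes}

    def heartbeat(self, node):
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

m = Membership(["node1", "node2", "node3"])
m.heartbeat("node1")
m.last_seen["node3"] -= 60            # simulate node3 falling silent for a minute
print(m.failed_nodes())               # -> ['node3']: replay its log, release its tokens
```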

  25. Communication Failures • Network partition • Continued operation can result in corrupted file system • File system is accessible only by the group containing a majority of the nodes in the cluster
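A one-line version of the majority rule: a partition may continue only if it contains more than half of the cluster's nodes (exactly half is not enough):

```python
def has_quorum(group_size, cluster_size):
    # Strictly more than half of the nodes must be reachable.
    return group_size > cluster_size // 2

CLUSTER = 8
print(has_quorum(5, CLUSTER))   # True  -> this partition keeps the file system mounted
print(has_quorum(4, CLUSTER))   # False -> exactly half is not a majority
print(has_quorum(3, CLUSTER))   # False -> minority partition stops to avoid corruption
```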

  26. Disk Failures • Dual-attached RAID controllers • Files can be replicated

  27. Scalable Online System Utilities • Adding, deleting, replacing disks • Rebalancing the file system content • Defragmentation, quota-check, fsck • File system manager • Coordinate administrative activities

  28. Experiences • Skewing of workloads • Small management overhead can affect parallel applications in significant ways • If one node slows down by 1%, the whole parallel job slows by 1%; for a 512-node cluster that is the same as leaving about 5 nodes completely idle • Need dedicated administrative nodes

  29. Experiences • Even the rarest failures can happen • Data loss in a RAID • A bad batch of disk drives

  30. Related Work • Storage area network file systems • Centralized metadata server • SGI's XFS file system • Not a clustered file system • Frangipani, Global File System • Do not support fine-grained concurrent access to the same file from multiple nodes

  31. Summary and Conclusions • GPFS • Uses distributed locking and recovery • Uses RAID and replication for reliability • Can scale up to the largest supercomputers in the world • Provides fault tolerance and system management functions
