
The Google File System


Presentation Transcript


  1. The Google File System
  S. Ghemawat, H. Gobioff and S.-T. Leung, "The Google File System," in Proc. of the 19th ACM Symposium on Operating Systems Principles, Oct. 2003.
  Presenter: John Otto

  2. Outline
  • Overview
  • Motivation
  • Assumptions and Optimizations
  • Design Considerations
  • Structure
    • Physical
    • Data
  • File System Operations
  • Application Requirements
  • Write Procedure
  • Master Operations
  • Related Work
  • Discussion

  3. Overview
  • Distributed, tiered system
  • Terabytes of data, thousands of machines
  • Handles component failures
  • Optimized for small random reads, large sequential reads, and record append operations
  • Manages multiple concurrent clients by implementing atomic operations

  4. Motivation
  • Need a robust storage mechanism for very large files
  • Manage large volumes of data being read and written
  • Transparently provide replication to prevent data loss and handle component failure

  5. Assumptions and Optimizations
  • Assume that components will fail
  • Optimize for large files; support small ones
  • Optimize for long sequential reads and small random reads
  • Optimize for long sequential writes, possibly from multiple clients
  • Optimize for high throughput, not low latency

  6. Design Considerations
  • More important to implement these optimizations than the POSIX API
  • Flexibility to implement custom operations
    • e.g. snapshot, record append

  7. Physical Structure

  8. Data Structure
  • Chunks
    • 64 MB, uniquely identified by a chunk handle
  • Single Master
    • maintains file system metadata
    • logs all operations, committing the log to disk on itself and on replicas before reporting changes to clients
    • caches current chunk locations in memory
    • versions chunks
  • "Shadow" Master replicas
    • maintain logs of master operations
    • bear read-only load from clients
  • Many Chunkservers
    • maintain the local, authoritative chunk list
    • interact with clients for read/write data operations
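To make the slide's data model concrete, here is a minimal Python sketch (purely illustrative, not GFS source) of the kind of metadata the master keeps: fixed 64 MB chunks named by handles, a file-to-chunk-handle mapping, per-chunk version numbers, and cached (not persisted) chunkserver locations. All class and field names are assumptions for exposition.

    from dataclasses import dataclass, field
    from typing import Dict, List

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

    @dataclass
    class ChunkInfo:
        handle: int     # globally unique chunk handle
        version: int    # bumped whenever a new lease is granted
        locations: List[str] = field(default_factory=list)  # chunkserver addresses (cached, not persisted)

    @dataclass
    class MasterMetadata:
        namespace: Dict[str, List[int]] = field(default_factory=dict)  # file path -> ordered chunk handles
        chunks: Dict[int, ChunkInfo] = field(default_factory=dict)     # chunk handle -> chunk info

        def chunk_for_offset(self, path: str, offset: int) -> ChunkInfo:
            """Translate (file path, byte offset) into the chunk that holds it."""
            handle = self.namespace[path][offset // CHUNK_SIZE]
            return self.chunks[handle]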

  9. File System Operations
  • Read
  • Mutation
    • Write
    • Record Append
  • Delete
  • Rename
  • Snapshot
    • lease revocation; "copy on write"
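As a rough illustration of how a read proceeds under this design, the sketch below (hypothetical; master_rpc and chunkserver_rpc are stand-in helpers, not real GFS APIs) shows the client translating a byte offset into a chunk index, asking the master for the chunk handle and replica locations, and then fetching the bytes directly from a chunkserver.

    CHUNK_SIZE = 64 * 1024 * 1024

    def read(master_rpc, chunkserver_rpc, path: str, offset: int, length: int) -> bytes:
        # For simplicity, assume the requested range does not span a chunk boundary.
        chunk_index = offset // CHUNK_SIZE
        # The master returns the chunk handle and replica locations; the client
        # caches the reply so later reads of this chunk skip the master entirely.
        handle, replicas = master_rpc("lookup", path, chunk_index)
        # Data flows directly between client and chunkserver, never through the master.
        return chunkserver_rpc(replicas[0], "read", handle, offset % CHUNK_SIZE, length)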

  10. Application Requirements
  • Prefer appending to files rather than overwriting data
  • Must be able to handle duplicate records and padding
  • Must be able to handle stale or undefined data (regions of the file written concurrently by multiple clients)
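Because applications must tolerate padding and duplicate records, they typically frame their own records so a reader can resynchronize and deduplicate. The sketch below shows one possible "self-contained, self-verifying" record format (length, unique record id, CRC); the exact framing is an assumption for illustration, not something GFS prescribes.

    import struct, zlib

    MAGIC = 0xFEEDFACE
    HEADER = struct.Struct(">IIQI")  # magic, payload length, record id, crc32 of payload

    def encode_record(record_id: int, payload: bytes) -> bytes:
        return HEADER.pack(MAGIC, len(payload), record_id, zlib.crc32(payload)) + payload

    def scan_records(blob: bytes):
        """Yield (record_id, payload), skipping padding/garbage and dropping duplicate ids."""
        seen, pos = set(), 0
        while pos + HEADER.size <= len(blob):
            magic, length, record_id, crc = HEADER.unpack_from(blob, pos)
            payload = blob[pos + HEADER.size : pos + HEADER.size + length]
            if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
                pos += 1  # not a valid record here: resynchronize byte by byte
                continue
            pos += HEADER.size + length
            if record_id not in seen:  # duplicates come from retried record appends
                seen.add(record_id)
                yield record_id, payload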

  11. Write Procedure
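The slide itself is a diagram; as a textual stand-in, here is a hedged sketch of the write flow the paper describes: the client learns the lease-holding primary from the master, pushes data to all replicas, then asks the primary to apply the mutation, which assigns it a serial order and forwards that order to the secondaries. The helper names (master_rpc, push_data, rpc) are placeholders, not real GFS APIs.

    def write(master_rpc, push_data, rpc, handle: int, offset: int, data: bytes) -> bool:
        # 1. Ask the master which replica currently holds the lease (the primary).
        primary, secondaries = master_rpc("get_lease_holder", handle)
        # 2. Push the data to every replica; it is buffered there, not yet applied.
        for replica in [primary] + list(secondaries):
            push_data(replica, handle, data)
        # 3. Send the write request to the primary, which assigns the mutation a
        #    serial order, applies it locally, and forwards the order to the
        #    secondaries; it replies only after all secondaries acknowledge.
        return rpc(primary, "write", handle, offset, secondaries)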

  12. Master Operations
  • Locking
    • read/write locks over pathnames
  • Creation
    • a directory does not maintain a list of its files
  • Replica placement / modification
  • Garbage collection / deletion
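For the locking bullet, the sketch below illustrates the lock set the master takes over the namespace, following the paper's scheme: read locks on every ancestor pathname and a read or write lock on the full path. Because a directory keeps no list of its files, creating a file needs only read locks on its ancestors plus a write lock on the new pathname. The function merely computes the lock set; actual lock objects and acquisition order are omitted for brevity.

    def locks_for(path: str, leaf_mode: str):
        """Return [(pathname, mode), ...] needed for an operation on `path`.

        leaf_mode is "read" (e.g. lookup) or "write" (e.g. create, delete, snapshot target).
        Example: locks_for("/home/user/foo", "write")
                 -> [("/home", "read"), ("/home/user", "read"), ("/home/user/foo", "write")]
        """
        parts = path.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        return [(a, "read") for a in ancestors] + [(path, leaf_mode)]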

  13. Fault Tolerance
  • Chunkservers come back up within seconds
  • Master functions within 30-60 seconds
    • must get current chunk locations from the chunkservers
  • Replication
  • Checksums
  • Logging for diagnostics
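To illustrate the checksum bullet: each chunkserver checksums its chunks in small blocks (64 KB in the paper) and verifies a block before returning it, reporting mismatches so the data can be re-replicated from a good copy. The sketch below uses CRC32 as a stand-in for the actual checksum function; the helper names are illustrative.

    import zlib

    BLOCK_SIZE = 64 * 1024  # 64 KB checksum granularity

    def block_checksums(chunk_data: bytes):
        """Compute one checksum per 64 KB block of a chunk."""
        return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
                for i in range(0, len(chunk_data), BLOCK_SIZE)]

    def verified_read(chunk_data: bytes, checksums, block_index: int) -> bytes:
        """Return one block, refusing to serve it if its checksum does not match."""
        block = chunk_data[block_index * BLOCK_SIZE:(block_index + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[block_index]:
            raise IOError("checksum mismatch: report to master and re-replicate")
        return block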

  14. Evaluation – Example Clusters

  15. Evaluation – Real Clusters

  16. Related Work
  • AFS doesn't spread files across multiple servers
  • xFS and Swift use RAID-style striping, which uses disk more efficiently than whole-file replication
  • Frangipani... no centralized server
  • NASD has variable-sized objects on the server vs. fixed-size chunks

  17. Discussion / Questions
  • How much responsibility is, or should be, pushed to the application? Is this a good thing?
  • Should there be better write/record-append monitoring to keep track of consistency and versioning?
  • What would a "self-contained, self-verifying record" look like?
  • Why aren't files treated as abstractions, with records being explicitly tracked, essentially making a big "database table" or set of rows?
  • Who maintains the list of record offsets and locations? The client or the application?
