The Google File System • Omid Khalili (okhalili@cs.ucsd.edu)
The Google File System (GFS) • A scalable distributed file system for large, distributed, data-intensive applications • Multiple GFS clusters are currently deployed • The largest ones have: • 1000+ storage nodes • 300+ terabytes of disk storage • and are heavily accessed by hundreds of clients on distinct machines
Introduction • Shares many of the same goals as previous distributed file systems • performance, scalability, reliability, etc. • The GFS design has been driven by four key observations of Google's application workloads and technological environment
Intro: Observations 1 • 1. Component failures are the norm • constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system • 2. Huge files (by traditional standards) • Multi-GB files are common • I/O operations and block sizes must be revisited
Intro: Observations 2 • 3. Most files are mutated by appending new data • This is the focus of performance optimization and atomicity guarantees • 4. Co-designing the applications and the API benefits the overall system by increasing flexibility
The Design • A cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients
The Master • Maintains all file system metadata • namespace, access control info, file-to-chunk mappings, chunk (and replica) locations, etc. • Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state
The Master • Makes sophisticated chunk placement and replication decisions, using global knowledge • For reading and writing, the client contacts the Master only to get chunk locations, then deals directly with chunkservers • The Master is therefore not a bottleneck for reads/writes
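To make the division of labor concrete, here is a minimal, hypothetical sketch (plain Python, not the real GFS client library) of the read path: the client asks the master only for the chunk handle and replica locations, then fetches the data directly from a chunkserver. All class, method, and field names are invented for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

class ToyChunkserver:
    """Holds chunk data in memory, keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytearray

    def read(self, handle, offset, length):
        return bytes(self.chunks[handle][offset:offset + length])

class ToyMaster:
    """Holds only metadata: file-to-chunk mappings and replica locations."""
    def __init__(self):
        self.file_chunks = {}      # filename -> [chunk handle, ...]
        self.chunk_locations = {}  # chunk handle -> [chunkserver id, ...]

    def lookup(self, filename, chunk_index):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]

class ToyClient:
    """Metadata from the master, data straight from a chunkserver."""
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # chunkserver id -> ToyChunkserver

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE          # which chunk holds this offset
        handle, replicas = self.master.lookup(filename, chunk_index)
        server = self.chunkservers[replicas[0]]     # any replica will do (e.g. the closest)
        return server.read(handle, offset % CHUNK_SIZE, length)
```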
Chunkservers • Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle • The handle is assigned by the master at chunk creation • Chunk size is 64 MB • Each chunk is replicated on 3 servers by default
Clients • Linked into applications; implement the file system API • Communicate with the master and chunkservers for reading and writing • Master interactions only for metadata • Chunkserver interactions for data • Cache only metadata • data is too large to cache
Chunk Locations • The Master does not keep a persistent record of chunk and replica locations • Instead, it polls chunkservers for this information at startup and when chunkservers join or leave the cluster • Stays up to date by controlling the placement of new chunks and through HeartBeat messages (while monitoring chunkservers)
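A rough sketch of this idea, with invented names and treating each chunkserver report as a full inventory for simplicity: the master's location table is rebuilt entirely from what chunkservers say they hold, never from persistent state of its own.

```python
class LocationTable:
    """Hypothetical master-side table of chunk locations, built only from reports."""
    def __init__(self):
        self.locations = {}  # chunk handle -> set of chunkserver ids

    def apply_report(self, chunkserver_id, reported_handles):
        # Drop whatever this server was previously believed to hold ...
        for handle, servers in self.locations.items():
            servers.discard(chunkserver_id)
        # ... then record exactly what it reports now.
        for handle in reported_handles:
            self.locations.setdefault(handle, set()).add(chunkserver_id)

table = LocationTable()
table.apply_report("cs1", [0xA1, 0xB2])   # report at startup / when cs1 joins
table.apply_report("cs1", [0xA1])         # later report: cs1 no longer holds 0xB2
print(table.locations)                    # {0xA1: {'cs1'}, 0xB2: set()}
```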
Operation Log • A record of all critical metadata changes • Stored on the Master and replicated on other machines • Defines the order of concurrent operations • Changes are not visible to clients until they have propagated to all log replicas • Also used to recover the file system state
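A minimal sketch of the log-then-apply discipline, assuming an in-memory list stands in for the replicated on-disk log and the operation names are invented: every metadata change is appended to the log before the in-memory state is touched, so replaying the log recovers that state.

```python
import json

class OperationLog:
    """Hypothetical master metadata store driven by an operation log."""
    def __init__(self):
        self.records = []    # stands in for the on-disk, replicated log
        self.namespace = {}  # in-memory metadata: filename -> [chunk handles]

    def mutate(self, op, filename, handle=None):
        record = {"op": op, "file": filename, "handle": handle}
        self.records.append(json.dumps(record))  # 1. log (and replicate) first
        self._apply(record)                       # 2. only then apply in memory

    def _apply(self, record):
        if record["op"] == "create":
            self.namespace[record["file"]] = []
        elif record["op"] == "add_chunk":
            self.namespace[record["file"]].append(record["handle"])

    def recover(self):
        """Rebuild the in-memory metadata by replaying the log in order."""
        self.namespace = {}
        for line in self.records:
            self._apply(json.loads(line))
```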
Consistency Model 1 • File namespace mutations are handled by the master and are atomic • After a sequence of successful data mutations, the mutated file region is consistent and contains the data written by the last mutation • Mutations are applied to all replicas in the same order • Chunk version numbers are used to detect stale replicas • Never apply mutations to a stale replica, never give a client the location of a stale replica, and garbage collect stale replicas at the next opportunity
Consistency Model 2 • What if the chunk metadata cached on the client goes stale? • cache entries have a timeout • the next open() of the file purges all cached information for its chunks
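A small sketch of how such a client-side cache of chunk locations could behave, with an invented class name and a made-up timeout value:

```python
import time

CACHE_TTL_SECONDS = 60.0   # hypothetical timeout; the real value is a tuning knob

class ClientMetadataCache:
    """Sketch of a client's cache of chunk locations: entries expire after a
    timeout, and open() drops everything cached for that file."""
    def __init__(self):
        self.entries = {}  # (filename, chunk index) -> (locations, expiry time)

    def put(self, filename, chunk_index, locations):
        self.entries[(filename, chunk_index)] = (locations, time.time() + CACHE_TTL_SECONDS)

    def get(self, filename, chunk_index):
        locations, expires = self.entries.get((filename, chunk_index), (None, 0))
        return locations if time.time() < expires else None  # expired -> ask the master again

    def on_open(self, filename):
        for key in [k for k in self.entries if k[0] == filename]:
            del self.entries[key]  # purge all cached info for this file's chunks
```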
System Interactions: Leases and Mutation Order • Leases maintain a consistent mutation order across all chunk replicas • The Master grants a chunk lease to one replica, called the primary • The primary chooses the serial mutation order, and all replicas follow this order • Minimizes management overhead for the Master
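A rough sketch of the core idea (plain Python objects, no networking or lease expiry; all names invented): the primary assigns serial numbers to mutations, and every replica applies them in exactly that order.

```python
class Replica:
    """Applies mutations strictly in the serial order chosen by the primary."""
    def __init__(self):
        self.applied = []

    def apply(self, serial_no, mutation):
        assert serial_no == len(self.applied), "out-of-order mutation"
        self.applied.append(mutation)

class Primary(Replica):
    """Holds the chunk lease; assigns serial numbers to incoming mutations."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial_no = self.next_serial           # the primary chooses the order
        self.next_serial += 1
        self.apply(serial_no, mutation)        # apply locally ...
        for replica in self.secondaries:       # ... then have every secondary follow it
            replica.apply(serial_no, mutation)
        return serial_no
```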
Atomic Record Append • The client specifies only the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns that offset to the client • Heavily used by Google's distributed applications • No need for a distributed lock manager • GFS chooses the offset, not the client
Atomic Record Append: How? • Follows a control flow similar to other mutations • The primary tells the secondary replicas to append at the same offset as the primary • If the append fails at any replica, the client retries it • So replicas of the same chunk may contain different data, including duplicates (whole or in part) of the same record
Atomic Record Append: How? • GFS does not guarantee that all replicas are bitwise identical. • Only guarantees that data is written at least once in an atomic unit. • Data must be written at the same offset for all chunk replicas for success to be reported.
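The consequence for applications is that readers must tolerate duplicates and padding. One common approach (a convention, not something GFS itself mandates) is to tag each record with a unique ID and de-duplicate on read. A small self-contained sketch, with a plain list standing in for a chunk:

```python
def append_record(log, record_id, payload):
    """At-least-once append: a retry may write the same record more than once."""
    log.append((record_id, payload))

def read_records(log):
    """Reader-side de-duplication: keep only the first copy of each record ID."""
    seen = set()
    for record_id, payload in log:
        if record_id in seen:
            continue          # duplicate left behind by a retried append
        seen.add(record_id)
        yield record_id, payload

log = []
append_record(log, "r1", b"alpha")
append_record(log, "r1", b"alpha")   # a retry after a partial failure
append_record(log, "r2", b"beta")
assert list(read_records(log)) == [("r1", b"alpha"), ("r2", b"beta")]
```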
Replica Placement • The placement policy maximizes data reliability and network bandwidth utilization • Spread replicas not only across machines, but also across racks • Guards against machine failures and against racks getting damaged or going offline • Reads for a chunk exploit the aggregate bandwidth of multiple racks • Writes have to flow through multiple racks • a tradeoff made willingly
Chunk creation • Chunks are created and placed by the Master • placed on chunkservers with below-average disk utilization • limit the number of recent "creations" on any one chunkserver • chunk creation is usually followed by heavy writes
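A hypothetical placement function combining the criteria from the last two slides (spare disk space, a cap on recent creations, and rack diversity); the field names and thresholds are invented, and the real master's policy is certainly more nuanced.

```python
def place_new_chunk(servers, num_replicas=3, max_recent_creations=5):
    """Pick chunkservers for a new chunk's replicas.

    Each server is a dict like:
      {"id": "cs1", "rack": "r1", "disk_util": 0.4, "recent_creations": 2}
    """
    avg_util = sum(s["disk_util"] for s in servers) / len(servers)
    candidates = [
        s for s in servers
        if s["disk_util"] <= avg_util                      # below-average disk utilization
        and s["recent_creations"] < max_recent_creations   # avoid imminent write hotspots
    ]
    candidates.sort(key=lambda s: s["disk_util"])

    chosen, racks_used = [], set()
    for s in candidates:                                   # first pass: one replica per rack
        if len(chosen) == num_replicas:
            break
        if s["rack"] not in racks_used:
            chosen.append(s)
            racks_used.add(s["rack"])
    for s in candidates:                                   # relax if there aren't enough racks
        if len(chosen) == num_replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["id"] for s in chosen]
```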
Chunk Re-replication • Done when the number of replicas falls below a user-defined goal • Re-replication priority is based on: • how far the chunk is from its replication goal • preferring chunks of live files over chunks of recently deleted files • boosting the priority of any chunk that is blocking a client's progress
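One way to picture these three factors is as a priority score; the weights below are invented, not the master's actual formula.

```python
def rereplication_priority(chunk, replication_goal=3):
    """Hypothetical priority score; `chunk` is a dict like
    {"live_replicas": 1, "file_deleted": False, "blocks_client": True}."""
    missing = max(0, replication_goal - chunk["live_replicas"])
    score = missing * 10                 # farther from the goal -> more urgent
    if chunk["file_deleted"]:
        score -= 5                       # chunks of recently deleted files matter less
    if chunk["blocks_client"]:
        score += 20                      # a blocked client gets a big boost
    return score

queue = [
    {"id": "A", "live_replicas": 1, "file_deleted": False, "blocks_client": False},
    {"id": "B", "live_replicas": 2, "file_deleted": False, "blocks_client": True},
    {"id": "C", "live_replicas": 1, "file_deleted": True,  "blocks_client": False},
]
queue.sort(key=rereplication_priority, reverse=True)
print([c["id"] for c in queue])   # ['B', 'A', 'C']: the blocked client comes first
```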
Chunk rebalancing • The Master periodically examines the chunk distribution and moves replicas around for better disk space usage and load balancing
Detecting Stale Replicas • The Master keeps a chunk version number to distinguish up-to-date replicas from stale ones • The version is increased whenever a new lease is granted • If a replica is unavailable during a mutation, its version is not increased • The Master detects stale replicas when chunkservers report their chunks and versions • Stale replicas are removed during garbage collection
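A minimal sketch of that version check, with invented names and no persistence: bump the version on each lease grant, and flag any replica that reports an older version.

```python
class ChunkVersions:
    """Hypothetical master-side stale-replica detection via version numbers."""
    def __init__(self):
        self.current = {}   # chunk handle -> latest version known to the master
        self.stale = set()  # (chunkserver id, chunk handle) pairs to garbage collect

    def grant_lease(self, handle):
        # Bump the version on each new lease; replicas that miss the mutation
        # (e.g. while down) keep the old version and thus become stale.
        self.current[handle] = self.current.get(handle, 0) + 1
        return self.current[handle]

    def on_chunk_report(self, chunkserver_id, handle, reported_version):
        if reported_version < self.current.get(handle, 0):
            self.stale.add((chunkserver_id, handle))   # removed at the next GC pass

versions = ChunkVersions()
versions.grant_lease("chunkX")                 # version becomes 1
versions.grant_lease("chunkX")                 # version becomes 2
versions.on_chunk_report("cs3", "chunkX", 1)   # cs3 missed a mutation -> stale
print(versions.stale)                          # {('cs3', 'chunkX')}
```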
Garbage collection • When a client deletes a file, the Master logs the deletion like any other change and renames the file to a hidden name • During its regular scan of the file system namespace, the Master removes files that have been hidden for longer than 3 days • the in-memory metadata is erased as well • In HeartBeat messages, a chunkserver reports a subset of its chunks, and the Master replies with the chunks that no longer appear in its metadata • The chunkserver is then free to delete those chunks on its own
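A toy sketch of the hidden-rename-then-sweep idea, assuming an invented hidden-name convention (a timestamped prefix) and a plain dict as the namespace:

```python
import time

HIDDEN_PREFIX = ".deleted."
RETENTION_SECONDS = 3 * 24 * 3600   # files hidden longer than 3 days are reclaimed

def delete_file(namespace, filename, now=None):
    """Deletion just renames the file to a hidden, timestamped name."""
    now = time.time() if now is None else now
    namespace[f"{HIDDEN_PREFIX}{int(now)}.{filename}"] = namespace.pop(filename)

def scan_namespace(namespace, now=None):
    """Periodic scan: drop hidden files older than the retention window."""
    now = time.time() if now is None else now
    for name in list(namespace):
        if not name.startswith(HIDDEN_PREFIX):
            continue
        hidden_at = int(name[len(HIDDEN_PREFIX):].split(".", 1)[0])
        if now - hidden_at > RETENTION_SECONDS:
            del namespace[name]   # metadata gone; orphaned chunks are reclaimed later

ns = {"logs/web.0001": ["chunkA", "chunkB"]}
delete_file(ns, "logs/web.0001", now=0)
scan_namespace(ns, now=4 * 24 * 3600)   # four "days" later
print(ns)                               # {}
```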
Fault Tolerance: High Availability • Fast recovery • the Master and chunkservers can restart in seconds • Chunk replication • Master replication • "shadow" masters provide read-only access when the primary master is down • mutations are not considered committed until recorded on all master replicas
Fault Tolerance: Data Integrity • Chunkservers use checksums to detect corrupt data • Since replicas are not guaranteed to be bitwise identical, each chunkserver maintains its own checksums • On a read, the chunkserver verifies the checksums of the requested range before returning data • Checksums are updated during writes
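A minimal sketch of per-block checksum verification on the read path, assuming a 64 KB block size and using CRC32 as a stand-in for whatever checksum the chunkserver actually computes:

```python
import zlib

BLOCK_SIZE = 64 * 1024   # checksums kept per block of a chunk (size assumed here)

def build_checksums(chunk_data):
    """Compute one CRC32 per block of the chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verified_read(chunk_data, checksums, offset, length):
    """Verify every block overlapping the requested range before returning data."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for block in range(first, last + 1):
        block_data = chunk_data[block * BLOCK_SIZE:(block + 1) * BLOCK_SIZE]
        if zlib.crc32(block_data) != checksums[block]:
            # In GFS the chunkserver reports the mismatch and the client reads another replica.
            raise IOError(f"corruption detected in block {block}")
    return chunk_data[offset:offset + length]

data = bytes(200 * 1024)   # a 200 KB chunk of zeros
sums = build_checksums(data)
assert verified_read(data, sums, 70_000, 10_000) == bytes(10_000)
```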
Performance • Actual network load for writes is 3x the client write rate, since writes propagate to 3 replicas • The network configuration can support 750 MB/s