The Google File System • Omid Khalili (okhalili@cs.ucsd.edu)
The Google File System (GFS) • A scalable distributed file system for large, distributed, data-intensive applications • Multiple GFS clusters are currently deployed • The largest ones have: • 1000+ storage nodes • 300+ terabytes of disk storage • and are heavily accessed by hundreds of clients on distinct machines
Introduction • Shares many of the same goals as previous distributed file systems • performance, scalability, reliability, etc. • The GFS design has been driven by four key observations of Google's application workloads and technological environment
Intro: Observations 1 • 1. Component failures are the norm • constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system • 2. Huge files (by traditional standards) • Multi-GB files are common • I/O operations and block sizes must be revisited
Intro: Observations 2 • 3. Most files are mutated by appending new data • This is the focus of performance optimization and atomicity guarantees • 4. Co-designing the applications and the API benefits the overall system by increasing flexibility
The Design • A cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients
The Master • Maintains all file system metadata • namespace, access control info, file-to-chunk mappings, chunk (and replica) locations, etc. • Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state
The Master • Makes sophisticated chunk placement and replication decisions, using global knowledge • For reading and writing, the client contacts the Master only to get chunk locations, then deals directly with chunkservers • The Master is therefore not a bottleneck for reads/writes
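To make the division of labor concrete, here is a minimal, hypothetical sketch (plain Python, not the real GFS client library) of the read path: the client asks the master only for the chunk handle and replica locations, then fetches the data directly from a chunkserver. All class, method, and field names are invented for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

class ToyChunkserver:
    """Holds chunk data in memory, keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytearray

    def read(self, handle, offset, length):
        return bytes(self.chunks[handle][offset:offset + length])

class ToyMaster:
    """Holds only metadata: file-to-chunk mappings and replica locations."""
    def __init__(self):
        self.file_chunks = {}      # filename -> [chunk handle, ...]
        self.chunk_locations = {}  # chunk handle -> [chunkserver id, ...]

    def lookup(self, filename, chunk_index):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]

class ToyClient:
    """Metadata from the master, data straight from a chunkserver."""
    def __init__(self, master, chunkservers):
        self.master = master
        self.chunkservers = chunkservers  # chunkserver id -> ToyChunkserver

    def read(self, filename, offset, length):
        chunk_index = offset // CHUNK_SIZE          # which chunk holds this offset
        handle, replicas = self.master.lookup(filename, chunk_index)
        server = self.chunkservers[replicas[0]]     # any replica will do (e.g. the closest)
        return server.read(handle, offset % CHUNK_SIZE, length)
```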
Chunkservers • Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle • The handle is assigned by the master at chunk creation • Chunk size is 64 MB • Each chunk is replicated on 3 servers by default
Clients • Linked into applications; implement the file system API • Communicate with the master and chunkservers for reading and writing • Master interactions only for metadata • Chunkserver interactions for data • Cache only metadata • data is too large to cache
Chunk Locations • The Master does not keep a persistent record of chunk and replica locations • Instead, it polls chunkservers for this information at startup and when chunkservers join or leave the cluster • Stays up to date by controlling the placement of new chunks and through HeartBeat messages (while monitoring chunkservers)
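A rough sketch of this idea, with invented names and treating each chunkserver report as a full inventory for simplicity: the master's location table is rebuilt entirely from what chunkservers say they hold, never from persistent state of its own.

```python
class LocationTable:
    """Hypothetical master-side table of chunk locations, built only from reports."""
    def __init__(self):
        self.locations = {}  # chunk handle -> set of chunkserver ids

    def apply_report(self, chunkserver_id, reported_handles):
        # Drop whatever this server was previously believed to hold ...
        for handle, servers in self.locations.items():
            servers.discard(chunkserver_id)
        # ... then record exactly what it reports now.
        for handle in reported_handles:
            self.locations.setdefault(handle, set()).add(chunkserver_id)

table = LocationTable()
table.apply_report("cs1", [0xA1, 0xB2])   # report at startup / when cs1 joins
table.apply_report("cs1", [0xA1])         # later report: cs1 no longer holds 0xB2
print(table.locations)                    # {0xA1: {'cs1'}, 0xB2: set()}
```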
Operation Log • A record of all critical metadata changes • Stored on the Master and replicated on other machines • Defines the order of concurrent operations • Changes are not visible to clients until they have propagated to all log replicas • Also used to recover the file system state
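A minimal sketch of the log-then-apply discipline, assuming an in-memory list stands in for the replicated on-disk log and the operation names are invented: every metadata change is appended to the log before the in-memory state is touched, so replaying the log recovers that state.

```python
import json

class OperationLog:
    """Hypothetical master metadata store driven by an operation log."""
    def __init__(self):
        self.records = []    # stands in for the on-disk, replicated log
        self.namespace = {}  # in-memory metadata: filename -> [chunk handles]

    def mutate(self, op, filename, handle=None):
        record = {"op": op, "file": filename, "handle": handle}
        self.records.append(json.dumps(record))  # 1. log (and replicate) first
        self._apply(record)                       # 2. only then apply in memory

    def _apply(self, record):
        if record["op"] == "create":
            self.namespace[record["file"]] = []
        elif record["op"] == "add_chunk":
            self.namespace[record["file"]].append(record["handle"])

    def recover(self):
        """Rebuild the in-memory metadata by replaying the log in order."""
        self.namespace = {}
        for line in self.records:
            self._apply(json.loads(line))
```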
Consistency Model 1 • File namespace mutations are handled by the master and are atomic • After a sequence of successful data mutations, the mutated file region is consistent and contains the data written by the last mutation • Mutations are applied to all replicas in the same order • Chunk version numbers are used to detect stale replicas • Never apply mutations to a stale replica, never give a client the location of a stale replica, and garbage collect stale replicas at the next opportunity
Consistency Model 2 • What if the chunk metadata cached on the client goes stale? • cache entries have a timeout • the next open() of the file purges all cached information for its chunks
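A small sketch of how such a client-side cache of chunk locations could behave, with an invented class name and a made-up timeout value:

```python
import time

CACHE_TTL_SECONDS = 60.0   # hypothetical timeout; the real value is a tuning knob

class ClientMetadataCache:
    """Sketch of a client's cache of chunk locations: entries expire after a
    timeout, and open() drops everything cached for that file."""
    def __init__(self):
        self.entries = {}  # (filename, chunk index) -> (locations, expiry time)

    def put(self, filename, chunk_index, locations):
        self.entries[(filename, chunk_index)] = (locations, time.time() + CACHE_TTL_SECONDS)

    def get(self, filename, chunk_index):
        locations, expires = self.entries.get((filename, chunk_index), (None, 0))
        return locations if time.time() < expires else None  # expired -> ask the master again

    def on_open(self, filename):
        for key in [k for k in self.entries if k[0] == filename]:
            del self.entries[key]  # purge all cached info for this file's chunks
```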
System Interactions: Leases and Mutation Order • Leases maintain a consistent mutation order across all chunk replicas • The Master grants a chunk lease to one replica, called the primary • The primary chooses the serial mutation order, and all replicas follow this order • Minimizes management overhead for the Master
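A rough sketch of the core idea (plain Python objects, no networking or lease expiry; all names invented): the primary assigns serial numbers to mutations, and every replica applies them in exactly that order.

```python
class Replica:
    """Applies mutations strictly in the serial order chosen by the primary."""
    def __init__(self):
        self.applied = []

    def apply(self, serial_no, mutation):
        assert serial_no == len(self.applied), "out-of-order mutation"
        self.applied.append(mutation)

class Primary(Replica):
    """Holds the chunk lease; assigns serial numbers to incoming mutations."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial_no = self.next_serial           # the primary chooses the order
        self.next_serial += 1
        self.apply(serial_no, mutation)        # apply locally ...
        for replica in self.secondaries:       # ... then have every secondary follow it
            replica.apply(serial_no, mutation)
        return serial_no
```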
Atomic Record Append • The client specifies only the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns that offset to the client • Heavily used by Google's distributed applications • No need for a distributed lock manager • GFS chooses the offset, not the client
Atomic Record Append: How? • Follows a control flow similar to other mutations • The primary tells the secondary replicas to append at the same offset as the primary • If the append fails at any replica, the client retries it • So replicas of the same chunk may contain different data, including duplicates (whole or in part) of the same record
Atomic Record Append: How? • GFS does not guarantee that all replicas are bitwise identical. • Only guarantees that data is written at least once in an atomic unit. • Data must be written at the same offset for all chunk replicas for success to be reported.
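The consequence for applications is that readers must tolerate duplicates and padding. One common approach (a convention, not something GFS itself mandates) is to tag each record with a unique ID and de-duplicate on read. A small self-contained sketch, with a plain list standing in for a chunk:

```python
def append_record(log, record_id, payload):
    """At-least-once append: a retry may write the same record more than once."""
    log.append((record_id, payload))

def read_records(log):
    """Reader-side de-duplication: keep only the first copy of each record ID."""
    seen = set()
    for record_id, payload in log:
        if record_id in seen:
            continue          # duplicate left behind by a retried append
        seen.add(record_id)
        yield record_id, payload

log = []
append_record(log, "r1", b"alpha")
append_record(log, "r1", b"alpha")   # a retry after a partial failure
append_record(log, "r2", b"beta")
assert list(read_records(log)) == [("r1", b"alpha"), ("r2", b"beta")]
```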
Replica Placement • The placement policy maximizes data reliability and network bandwidth utilization • Spread replicas not only across machines, but also across racks • Guards against machine failures and against racks getting damaged or going offline • Reads for a chunk exploit the aggregate bandwidth of multiple racks • Writes have to flow through multiple racks • a tradeoff made willingly
Chunk creation • Chunks are created and placed by the Master • placed on chunkservers with below-average disk utilization • limit the number of recent "creations" on any one chunkserver • chunk creation is usually followed by heavy writes
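A hypothetical placement function combining the criteria from the last two slides (spare disk space, a cap on recent creations, and rack diversity); the field names and thresholds are invented, and the real master's policy is certainly more nuanced.

```python
def place_new_chunk(servers, num_replicas=3, max_recent_creations=5):
    """Pick chunkservers for a new chunk's replicas.

    Each server is a dict like:
      {"id": "cs1", "rack": "r1", "disk_util": 0.4, "recent_creations": 2}
    """
    avg_util = sum(s["disk_util"] for s in servers) / len(servers)
    candidates = [
        s for s in servers
        if s["disk_util"] <= avg_util                      # below-average disk utilization
        and s["recent_creations"] < max_recent_creations   # avoid imminent write hotspots
    ]
    candidates.sort(key=lambda s: s["disk_util"])

    chosen, racks_used = [], set()
    for s in candidates:                                   # first pass: one replica per rack
        if len(chosen) == num_replicas:
            break
        if s["rack"] not in racks_used:
            chosen.append(s)
            racks_used.add(s["rack"])
    for s in candidates:                                   # relax if there aren't enough racks
        if len(chosen) == num_replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["id"] for s in chosen]
```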
Chunk Re-replication • Done when the number of replicas falls below a user-defined goal • Re-replication priority is based on: • how far the chunk is from its replication goal • preferring chunks of live files over chunks of recently deleted files • boosting the priority of any chunk that is blocking a client's progress
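One way to picture these three factors is as a priority score; the weights below are invented, not the master's actual formula.

```python
def rereplication_priority(chunk, replication_goal=3):
    """Hypothetical priority score; `chunk` is a dict like
    {"live_replicas": 1, "file_deleted": False, "blocks_client": True}."""
    missing = max(0, replication_goal - chunk["live_replicas"])
    score = missing * 10                 # farther from the goal -> more urgent
    if chunk["file_deleted"]:
        score -= 5                       # chunks of recently deleted files matter less
    if chunk["blocks_client"]:
        score += 20                      # a blocked client gets a big boost
    return score

queue = [
    {"id": "A", "live_replicas": 1, "file_deleted": False, "blocks_client": False},
    {"id": "B", "live_replicas": 2, "file_deleted": False, "blocks_client": True},
    {"id": "C", "live_replicas": 1, "file_deleted": True,  "blocks_client": False},
]
queue.sort(key=rereplication_priority, reverse=True)
print([c["id"] for c in queue])   # ['B', 'A', 'C']: the blocked client comes first
```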
Chunk rebalancing • The Master periodically examines the chunk distribution and moves replicas around for better disk space usage and load balancing
Detecting Stale Replicas • The Master keeps a chunk version number to distinguish up-to-date replicas from stale ones • The version is increased whenever a new lease is granted • If a replica is unavailable during a mutation, its version is not increased • The Master detects stale replicas when chunkservers report their chunks and versions • Stale replicas are removed during garbage collection
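A minimal sketch of that version check, with invented names and no persistence: bump the version on each lease grant, and flag any replica that reports an older version.

```python
class ChunkVersions:
    """Hypothetical master-side stale-replica detection via version numbers."""
    def __init__(self):
        self.current = {}   # chunk handle -> latest version known to the master
        self.stale = set()  # (chunkserver id, chunk handle) pairs to garbage collect

    def grant_lease(self, handle):
        # Bump the version on each new lease; replicas that miss the mutation
        # (e.g. while down) keep the old version and thus become stale.
        self.current[handle] = self.current.get(handle, 0) + 1
        return self.current[handle]

    def on_chunk_report(self, chunkserver_id, handle, reported_version):
        if reported_version < self.current.get(handle, 0):
            self.stale.add((chunkserver_id, handle))   # removed at the next GC pass

versions = ChunkVersions()
versions.grant_lease("chunkX")                 # version becomes 1
versions.grant_lease("chunkX")                 # version becomes 2
versions.on_chunk_report("cs3", "chunkX", 1)   # cs3 missed a mutation -> stale
print(versions.stale)                          # {('cs3', 'chunkX')}
```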
Garbage collection • When a client deletes a file, the Master logs the deletion like any other change and renames the file to a hidden name • During its regular scan of the file system namespace, the Master removes files that have been hidden for longer than 3 days • the in-memory metadata is erased as well • In HeartBeat messages, a chunkserver reports a subset of its chunks, and the Master replies with the chunks that no longer appear in its metadata • The chunkserver is then free to delete those chunks on its own
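A toy sketch of the hidden-rename-then-sweep idea, assuming an invented hidden-name convention (a timestamped prefix) and a plain dict as the namespace:

```python
import time

HIDDEN_PREFIX = ".deleted."
RETENTION_SECONDS = 3 * 24 * 3600   # files hidden longer than 3 days are reclaimed

def delete_file(namespace, filename, now=None):
    """Deletion just renames the file to a hidden, timestamped name."""
    now = time.time() if now is None else now
    namespace[f"{HIDDEN_PREFIX}{int(now)}.{filename}"] = namespace.pop(filename)

def scan_namespace(namespace, now=None):
    """Periodic scan: drop hidden files older than the retention window."""
    now = time.time() if now is None else now
    for name in list(namespace):
        if not name.startswith(HIDDEN_PREFIX):
            continue
        hidden_at = int(name[len(HIDDEN_PREFIX):].split(".", 1)[0])
        if now - hidden_at > RETENTION_SECONDS:
            del namespace[name]   # metadata gone; orphaned chunks are reclaimed later

ns = {"logs/web.0001": ["chunkA", "chunkB"]}
delete_file(ns, "logs/web.0001", now=0)
scan_namespace(ns, now=4 * 24 * 3600)   # four "days" later
print(ns)                               # {}
```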
Fault Tolerance: High Availability • Fast recovery • the Master and chunkservers can restart in seconds • Chunk replication • Master replication • "shadow" masters provide read-only access when the primary master is down • mutations are not considered committed until recorded on all master replicas
Fault Tolerance: Data Integrity • Chunkservers use checksums to detect corrupt data • Since replicas are not guaranteed to be bitwise identical, each chunkserver maintains its own checksums • On a read, the chunkserver verifies the checksums of the requested range before returning data • Checksums are updated during writes
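A minimal sketch of per-block checksum verification on the read path, assuming a 64 KB block size and using CRC32 as a stand-in for whatever checksum the chunkserver actually computes:

```python
import zlib

BLOCK_SIZE = 64 * 1024   # checksums kept per block of a chunk (size assumed here)

def build_checksums(chunk_data):
    """Compute one CRC32 per block of the chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verified_read(chunk_data, checksums, offset, length):
    """Verify every block overlapping the requested range before returning data."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for block in range(first, last + 1):
        block_data = chunk_data[block * BLOCK_SIZE:(block + 1) * BLOCK_SIZE]
        if zlib.crc32(block_data) != checksums[block]:
            # In GFS the chunkserver reports the mismatch and the client reads another replica.
            raise IOError(f"corruption detected in block {block}")
    return chunk_data[offset:offset + length]

data = bytes(200 * 1024)   # a 200 KB chunk of zeros
sums = build_checksums(data)
assert verified_read(data, sums, 70_000, 10_000) == bytes(10_000)
```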
Performance • Actual network load for writes is 3x the client write rate, since writes propagate to 3 replicas • The network configuration can support 750 MB/s