The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google* 정학수, 최주영
Outline • Introduction • Design Overview • System Interactions • Master Operation • Fault Tolerance and Diagnosis • Conclusions
Introduction • GFS was designed to meet the demands of Google’s data-processing workloads • The design was driven by key observations: • Component failures are the norm rather than the exception • Files are huge by traditional standards • Most files are mutated by appending new data rather than overwriting
Assumptions • Built from many inexpensive commodity components that often fail • Stores a modest number of large files, typically 100 MB or larger • Workloads consist of large streaming reads and small random reads • Many large, sequential writes that append data to files • Atomicity with minimal synchronization overhead is essential for concurrent appends • High sustained bandwidth is more important than low latency
Interface • Files are organized hierarchically in directories and identified by pathnames • Supports the usual operations (create, delete, open, close, read, write) plus two extras: snapshot and record append
Architecture • A single master and multiple chunkservers, accessed by multiple clients • Designed for system-to-system interaction, not for user-to-system interaction
Chunk Size • Large chunk size – 64 MB • Advantages • Reduces client-master interaction • Reduces network overhead • Reduces the size of metadata held by the master • Disadvantage • Hot spots – many clients accessing the same small (single-chunk) file at once
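The large chunk size means a client can compute locally which chunk holds a given byte offset and so contact the master far less often. A minimal sketch (the `locate` helper is illustrative, not GFS's actual API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def locate(offset: int) -> tuple[int, int]:
    """Translate a file byte offset into (chunk index, offset within chunk).
    The client sends the chunk index to the master; with 64 MB chunks even
    multi-gigabyte streaming reads need only a handful of master lookups."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

print(locate(200 * 1024 * 1024))  # byte 200 MB falls in chunk 3, 8 MB in
```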
Metadata • All metadata is kept in the master’s memory • Less than 64 bytes of metadata per 64 MB chunk • Types • File and chunk namespaces • File-to-chunk mappings • Locations of each chunk’s replicas
Metadata (Cont’d) • In-memory data structures • Master operations are fast • Periodic full scans are easy and efficient • Operation log • Contains a historical record of critical metadata changes • Replicated on multiple remote machines • The master responds to a client only after the log record has been flushed • Recovery is done by replaying the operation log
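The three metadata types and the write-ahead operation log can be pictured with a toy master. This is a sketch: the `Master` class, `create_file`, and the tuple log format are invented for illustration.

```python
class Master:
    def __init__(self):
        self.namespace = {}        # file and chunk namespace (persisted via log)
        self.file_chunks = {}      # file -> ordered chunk handles (persisted via log)
        self.chunk_locations = {}  # handle -> chunkservers (polled, not persisted)
        self.op_log = []           # stand-in for the replicated operation log

    def create_file(self, path: str) -> None:
        # Log the critical metadata change first: the real master replies to
        # the client only after the record is flushed locally and remotely.
        self.op_log.append(("create", path))
        self.namespace[path] = {}
        self.file_chunks[path] = []
```

Chunk locations are deliberately absent from the log: the master simply asks chunkservers for them at startup, which is why only the namespace and mappings need durable records.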
Consistency Model • Consistent • All clients always see the same data, regardless of which replica they read from • Defined • Consistent, and clients see what the mutation wrote in its entirety • Inconsistent • Different clients may see different data at different times
Leases and Mutation Order • Leases • Maintain a consistent mutation order across replicas while minimizing management overhead • The master grants a chunk lease to one of the replicas, making it the primary • The primary picks a serial order for all mutations to the chunk • All replicas apply mutations in that order
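A toy primary showing how a single lease holder imposes one mutation order that every replica then applies (class and method names are hypothetical):

```python
class PrimaryReplica:
    """Holds the chunk lease; all mutations to the chunk funnel through it."""
    def __init__(self):
        self.serial = 0
        self.applied = []

    def order_and_apply(self, mutation: str) -> int:
        # The primary assigns consecutive serial numbers; secondaries apply
        # mutations strictly in this order, keeping all replicas consistent.
        self.serial += 1
        self.applied.append((self.serial, mutation))
        return self.serial
```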
Data Flow • Fully utilizes network bandwidth • Control flow and data flow are decoupled • Avoids network bottlenecks and high-latency links • Each machine forwards the data to the closest machine that has not yet received it • Latency is minimized by pipelining the data transfer
Atomic Record Appends • Record append: an atomic append operation • The client specifies only the data • GFS appends the data at an offset of its own choosing and returns that offset to the client • Many clients can append to the same file concurrently • Such files often serve as multiple-producer/single-consumer queues • Or contain merged results from many clients
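Record append in one function: GFS, not the client, chooses the offset, and if the record does not fit in the current chunk, the chunk is padded and the client retries on the next one. A simplified sketch that ignores replication; the function and status strings are invented:

```python
CHUNK_SIZE = 64 * 1024 * 1024

def record_append(chunk_used: int, record: bytes):
    """Return ('ok', offset) if the record fits at the chunk's current end,
    else ('retry', None): the chunk is padded and the client moves on."""
    if chunk_used + len(record) <= CHUNK_SIZE:
        return "ok", chunk_used   # GFS picks this offset atomically
    return "retry", None          # pad the chunk; client retries on the next

status, offset = record_append(1024, b"producer-record")
```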
Snapshot • Makes a copy of a file or a directory tree • Implemented with standard copy-on-write techniques
Namespace Management and Locking • Namespace • A lookup table mapping full pathnames to metadata • Locking • Many operations can be active at once; locks over regions of the namespace ensure proper serialization • Concurrent mutations in the same directory are allowed • Deadlock is prevented by acquiring locks in a consistent total order
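The lookup-table namespace makes locking simple: an operation takes read locks on every ancestor path and a read or write lock on the leaf, and acquiring them in a fixed top-down order gives the consistent total order that prevents deadlock. A sketch with an invented helper:

```python
def locks_for(path: str, write: bool):
    """Read locks on each ancestor directory; read/write lock on the leaf.
    Acquiring locks in this fixed top-down order on every operation yields
    a consistent total order across operations, so deadlock is impossible."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]
    return prefixes[:-1], (prefixes[-1], "w" if write else "r")

ancestors, leaf = locks_for("/home/user/file", write=True)
```

Note that only the leaf needs a write lock, which is why concurrent file creations in the same directory can proceed in parallel.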
Replica Placement • Maximize data reliability and availability • Maximize network bandwidth utilization • Spread replicas across machines • Spread chunk replicas across racks
Creation, Re-replication, Rebalancing • Creation • On demand from writers • Re-replication • When the number of available replicas falls below a user-specified goal • Rebalancing • Periodically, for better disk space usage and load balancing
Garbage Collection • Lazy reclamation • Deletion is logged immediately • The file is renamed to a hidden name carrying a deletion timestamp • Removed during a later scan, 3 days afterward • Undelete by renaming back to normal before then • Regular scan • Heartbeat messages are exchanged with each chunkserver • Orphaned chunks are identified and their metadata erased
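The lazy-reclamation path in miniature (the helper names and the hidden-name format are invented for illustration):

```python
GRACE_SECONDS = 3 * 24 * 3600  # the 3-day window from the slides

def lazy_delete(namespace: dict, path: str, now: float) -> str:
    """Deletion just renames the file to a hidden name carrying the
    timestamp; undelete is a rename back, until the grace period expires."""
    hidden = f"{path}#deleted#{int(now)}"
    namespace[hidden] = namespace.pop(path)
    return hidden

def gc_scan(namespace: dict, now: float) -> None:
    """Regular namespace scan: drop hidden files past the grace period."""
    for name in list(namespace):
        if "#deleted#" in name and now - int(name.rsplit("#", 1)[1]) > GRACE_SECONDS:
            del namespace[name]
```

Reclaiming storage in a background scan rather than eagerly is what makes deletion cheap and merges the cost into the master's regular bookkeeping.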
Stale Replica Detection • The master maintains a chunk version number for each chunk, increased whenever it grants a new lease • A replica that misses mutations while its chunkserver is down keeps an old version and becomes stale • Stale replicas are removed in the regular garbage collection
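With version numbers, staleness detection reduces to an integer comparison (function names are invented):

```python
def grant_lease(master_version: int) -> int:
    """The master bumps the chunk version when granting a new lease;
    up-to-date replicas record the new number before any mutation."""
    return master_version + 1

def is_stale(replica_version: int, master_version: int) -> bool:
    # A replica that missed mutations while its server was down still
    # carries the old version, so the master can detect and reclaim it.
    return replica_version < master_version
```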
High Availability • Fast recovery • Servers restore their state and start in seconds • Chunk replication • Different replication levels for different parts of the file namespace • The master clones existing replicas as chunkservers go offline or report corrupted replicas detected through checksum verification
High Availability (Cont’d) • Master replication • The operation log and checkpoints are replicated on multiple machines • If the master’s machine or disk fails, monitoring infrastructure outside GFS starts a new master process • Shadow masters • Provide read-only access when the primary master is down
Data Integrity • Checksums • Used to detect corruption • One for every 64 KB block in each chunk • Kept in memory and stored persistently with logging • Read • The chunkserver verifies checksums before returning data • Write • Append • Incrementally updates the checksum for the last partial block • Computes new checksums for any new blocks
Data Integrity (Cont’d) • Write • Overwrite • Reads and verifies the first and last blocks of the overwritten range, then writes • Computes and records the new checksums • During idle periods • Chunkservers scan and verify inactive chunks
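The 64 KB checksum scheme can be sketched with CRC32 (the paper's checksums are 32-bit; the helper names here are invented):

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity within a chunk

def block_checksums(chunk: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i : i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, sums: list[int]) -> bytes:
    """The chunkserver verifies checksums before returning data, so
    corruption never propagates: on mismatch it reports to the master
    and the client reads from another replica."""
    if block_checksums(chunk) != sums:
        raise IOError("checksum mismatch: serve from another replica")
    return chunk
```

Because each replica verifies its own blocks independently, a chunkserver can detect corruption without comparing replicas across machines.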
Micro-benchmarks • GFS cluster • 1 master • 2 master replicas • 16 chunkservers • 16 clients • Server machines are connected to one switch, client machines to the other • The two switches are connected by a 1 Gbps link
Micro-benchmarks Figure 3: Aggregate Throughputs. Top curves show theoretical limits imposed by our network topology. Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.
Real World Clusters Table 2: Characteristics of Two GFS Clusters
Real World Clusters Table 3: Performance Metrics for Two GFS Clusters
Real World Clusters • In cluster B • Killed a single chunkserver holding 15,000 chunks (600 GB of data) • All chunks were restored in 23.2 minutes • An effective replication rate of 440 MB/s • Killed two chunkservers, each with 16,000 chunks (660 GB of data) • 266 chunks were left with only a single replica • These were restored at higher priority, within 2 minutes
Conclusions • GFS demonstrates the qualities essential for supporting large-scale processing workloads • Treats component failure as the norm • Optimizes for huge files • Extends and relaxes the standard file system interface • Fault tolerance is provided by • Constant monitoring • Replicating crucial data • Fast and automatic recovery • Checksums are used to detect data corruption • Delivers high aggregate throughput