The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, October 19-22, 2003, Bolton Landing, NY, USA
Introduction and Motivation • Background: • Goals of a distributed file system: • Performance, scalability, reliability, and availability. • Motivation: • GFS departs from some earlier file system design assumptions, based on observations of Google's application workloads and technological environment. • These departures are outlined in the next few slides.
Key difference (1/3) • Component failures are the norm rather than the exception. • The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts. • Things to do: • Constant monitoring • Error detection • Fault tolerance • Automatic recovery
Key difference (2/3) • Files are huge by traditional standards (multi-GB files are common): • Each file typically contains many application objects, such as web documents. • Google regularly works with fast-growing data sets of many TBs comprising billions of objects. • It is unwieldy to manage billions of traditionally KB-sized files. • Things to do: • Revisit I/O operations and block sizes.
Key difference (3/3) • Most files are mutated by appending new data rather than overwriting existing data. • Random writes within a file are practically non-existent. • Once written, the files are only read, and often only sequentially. • E.g. data streams continuously generated by running applications • E.g. intermediate results produced on one machine and processed on another
Six Design Assumptions • The system is built from many inexpensive commodity components that often fail. • The system stores a modest number of large files. • Small files must be supported, but we need not optimize for them. • The workloads primarily consist of two kinds of reads: • Large streaming reads and small random reads. • Performance-conscious applications often batch and sort their small reads.
Six Design Assumption(con’t) • The workloads also have many large, sequential writes that append data to files. • Once written, files are seldom modified again. • The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. • High sustained bandwidth is more important than low latency. • Most of our target applications place a premium on processing data in bulk at a high rate.
GFS Architecture • Roles: • A GFS cluster consists of a single master and multiple chunk servers, and is accessed by multiple clients. • Files are divided into fixed-size chunks. • Each chunk is identified by a globally unique 64-bit chunk handle. • The handle is assigned by the master at the time of chunk creation.
Overview • (Figure: GFS architecture, showing the client, single master, and chunk servers, with their interactions numbered 1-3.)
Illustration of files and chunks • A large user file is divided into chunks (Chunk #1, Chunk #2, Chunk #3, ...); a small user file may occupy a single, mostly empty chunk. • Each chunk is 64 MB in size and identified by a 64-bit chunk handle. • Each chunk is hosted by 3 (default) different chunk servers. • What the client needs to do (sketched in code below): • Translate the file name and byte offset into a file name and chunk index. • Exchange the file name and chunk index for a chunk handle and chunk locations from the master. • Read the data from a chunk server using the chunk handle and byte range.
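A minimal sketch of this client-side translation, assuming a fixed 64 MB chunk size and toy in-memory stand-ins for the master and a chunk server (the class and method names are illustrative, not the paper's API):

# Sketch of the GFS client read path: translate (file, byte offset) into a
# chunk index, ask the master for the chunk handle and replica locations,
# then read the byte range from one chunk server.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

class FakeMaster:
    """Stand-in for the master: maps (file name, chunk index) -> (handle, locations)."""
    def __init__(self, table):
        self.table = table
    def lookup(self, file_name, chunk_index):
        return self.table[(file_name, chunk_index)]

class FakeChunkServer:
    """Stand-in for a chunk server holding chunk data keyed by handle."""
    def __init__(self, chunks):
        self.chunks = chunks
    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def gfs_read(master, file_name, offset, length):
    chunk_index = offset // CHUNK_SIZE   # which chunk of the file
    chunk_offset = offset % CHUNK_SIZE   # offset within that chunk
    handle, replicas = master.lookup(file_name, chunk_index)
    return replicas[0].read_chunk(handle, chunk_offset, length)

# Usage: a small file stored entirely in chunk 0 on one replica.
server = FakeChunkServer({0xABC: b"hello, gfs"})
master = FakeMaster({("/logs/day1", 0): (0xABC, [server])})
print(gfs_read(master, "/logs/day1", 7, 3))  # b"gfs"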
Single Master • Advantage: • Global knowledge to decide the placement of chunks. • Possible drawback: • The master may become a bottleneck. • Solutions: • Cache chunk locations and chunk handles on the client. • Use a larger chunk size. • A single chunk then covers a larger region of a file, so clients contact the master less often.
Chunk Size • Larger size: • 64 MB (much larger than typical file system block sizes). • Each chunk is stored on a chunk server as a plain Linux file. • Benefits: • Reduces the master's overhead when clients read or write. • Reduces the size of the metadata stored on the master. • Metadata (kept in the master's memory): • The file and chunk namespaces (described later) • The mapping from files to chunks • The version of each chunk
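For illustration, the three kinds of master metadata might be represented with simple Python structures (a simplification; the real structures are more compact, and only the first two kinds plus chunk versions are persisted via the operation log, while replica locations are learned from chunk servers):

# Illustrative in-memory metadata kept by the master (simplified).
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int = 1                                 # chunk version, persisted in the log
    locations: list = field(default_factory=list)    # learned from chunk servers, not persisted

@dataclass
class MasterMetadata:
    namespace: set = field(default_factory=set)          # full pathnames
    file_to_chunks: dict = field(default_factory=dict)   # path -> [chunk handles]
    chunks: dict = field(default_factory=dict)            # handle -> ChunkInfo

meta = MasterMetadata()
meta.namespace.add("/logs/day1")
meta.file_to_chunks["/logs/day1"] = [0xABC, 0xABD]
meta.chunks[0xABC] = ChunkInfo(version=3, locations=["cs1", "cs7", "cs9"])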
HeartBeat messages and Operation Log • HeartBeat messages (periodic): • Let the master control all chunk placement and monitor chunk server status. • Operation log: • Contains a historical record of critical metadata changes. • Serves as a logical timeline that defines the order of concurrent operations. • Failover: • Checkpoint: serialize the whole in-memory metadata to disk. • Store checkpoint data and the log both locally and remotely. • On failure: reload the latest checkpoint and replay the operation log from that point.
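A minimal sketch of this recovery path, assuming a toy JSON checkpoint and a simplified operation-log record format (both illustrative, not GFS's actual formats):

# Sketch of master failover: reload the most recent checkpoint, then replay
# the operation log entries recorded after that checkpoint.
import json

def save_checkpoint(path, metadata):
    # Serialize the whole in-memory metadata snapshot to disk.
    with open(path, "w") as f:
        json.dump(metadata, f)

def recover(checkpoint_path, log_entries_after_checkpoint):
    # 1. Load the last complete checkpoint.
    with open(checkpoint_path) as f:
        metadata = json.load(f)
    # 2. Replay only the (few) log records written after that checkpoint.
    for op in log_entries_after_checkpoint:
        if op["type"] == "create_file":
            metadata["files"][op["path"]] = []
        elif op["type"] == "add_chunk":
            metadata["files"][op["path"]].append(op["handle"])
    return metadata

# Usage: checkpoint with one file, then one logged chunk addition to replay.
save_checkpoint("/tmp/gfs_ckpt.json", {"files": {"/logs/day1": [1]}})
state = recover("/tmp/gfs_ckpt.json",
                [{"type": "add_chunk", "path": "/logs/day1", "handle": 2}])
print(state["files"]["/logs/day1"])  # [1, 2]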
Lease and mutation order • A mutation is an operation that changes the contents or metadata of a chunk. • E.g. a write or an append operation. • In the normal case, each mutation is performed at all of the chunk's replicas. • Lease: • Maintains a consistent mutation order across replicas. • The master grants a chunk lease to one of the replicas, which we call the primary.
Lease and mutation order(con’t) • Lease(con’t): • The primary picks a serial order for all mutations to the chunk. • All replicas follow this order when applying mutations. • Minimize management overhead at the master • Timeout and extension: • A lease has an initial timeout of 60 seconds. • If a chunk is being mutated, the primary can request extension. • Piggybacked on the HeartBeatmessages.
Write control and data flow • Step 1: The client asks the master which chunk server holds the current lease for the chunk and the locations of the other replicas. • Step 2: The master replies with the identity of the primary and the locations of the other (secondary) replicas; the client caches this data. • Step 3: The client pushes the data to all the replicas, in any order; each chunk server temporarily stores it in an internal buffer. • Step 4: The client sends a write request, identifying the data pushed earlier, to the primary. • Step 5: The primary decides the mutation order and forwards the write request to all secondary replicas. • Step 6: Each secondary replica applies mutations in the same serial-number order and then replies with a success message. • Step 7: The primary replies to the client (reporting success or errors). • A rough sketch of this flow follows.
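The sketch below uses toy in-memory replica objects in place of real chunk servers; steps 1-2 (the lease lookup at the master) are omitted, and all names are illustrative:

# Rough sketch of the GFS write control flow from the client's point of view.
class Replica:
    def __init__(self):
        self.buffered, self.applied = {}, []
    def buffer(self, data_id, data):            # step 3: stage pushed data
        self.buffered[data_id] = data
    def apply(self, serial, data_id):            # steps 5-6: apply in serial order
        self.applied.append((serial, self.buffered.pop(data_id)))

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries, self.next_serial = secondaries, 0
    def write(self, data_id):                    # steps 4-7
        serial = self.next_serial                 # pick a serial order
        self.next_serial += 1
        self.apply(serial, data_id)
        for s in self.secondaries:                # forward to every secondary
            s.apply(serial, data_id)
        return "ok"                                # step 7: reply to the client

def gfs_write(primary, secondaries, data):
    data_id = id(data)                             # identifier for the pushed data
    for r in [primary] + secondaries:              # step 3: push data to all replicas
        r.buffer(data_id, data)
    return primary.write(data_id)                  # step 4: send the write request

secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
print(gfs_write(primary, secondaries, b"record-1"))  # "ok"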
Data flow • We decouple the flow of data from the flow of control to use the network efficiently. • Control flow: from the client to the primary and then to all secondaries. • Data flow: data is pushed linearly along a carefully picked chain of chunk servers in a pipelined fashion. • Each machine forwards the data to the "closest" machine in the network topology that has not yet received it. • Ideal time for transferring B bytes to R replicas: B/T + RL, where T is the network throughput and L is the latency to transfer bytes between two machines.
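As a quick numeric check, assuming the example parameters the paper mentions (roughly 100 Mbps links and about 1 ms latency per hop), pushing 1 MB to 3 replicas ideally takes on the order of 80 ms:

# Ideal pipelined transfer time: B / T + R * L
B = 1 * 10**6 * 8     # 1 MB expressed in bits
T = 100 * 10**6       # 100 Mbps network throughput
L = 0.001             # ~1 ms latency between two machines
R = 3                 # number of replicas
print(B / T + R * L)  # ~0.083 s, i.e. roughly 80 ms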
Record Appends • Record append supports concurrent writes from multiple clients. • The client specifies only the data (no offset). • GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS's choosing and returns that offset to the client. • The primary appends the data to its replica. • It tells the secondaries to write the data at the exact offset where it wrote it. • It replies with success and the offset to the client.
Record Appends(con’t) • At least once concept: • If a record append fails at any replica, the client retries the operation. • Replicas of the same chunk may contain different data possibly including duplicates and record fragment. • Clients can use checksum containing in each record to filter record fragment. • Checksum and record functionality are in library code shared by Google applications
Namespace Management and Locking • Multiple operations can be active in the master at once; locking ensures proper serialization. • Recall that GFS does not have a per-directory data structure. • It only stores the mapping from files to chunks. • So GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. • Read/write locks over this namespace table ensure serialization (see the sketch below).
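A small sketch of pathname locking over that flat table; real GFS uses read-write locks per path (read locks on ancestors, a read or write lock on the leaf), for which a plain lock stands in here:

# Sketch of namespace locking over the flat pathname table.
import threading
from collections import defaultdict

path_locks = defaultdict(threading.Lock)   # one lock per full pathname
namespace = set()                           # flat table of full pathnames

def ancestors(path):
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def create_file(path):
    # Acquire locks in a consistent (sorted) order so operations never deadlock.
    needed = sorted(ancestors(path) + [path])
    for p in needed:
        path_locks[p].acquire()
    try:
        namespace.add(path)                 # the actual metadata mutation
    finally:
        for p in reversed(needed):
            path_locks[p].release()

create_file("/home/user/foo")
print(namespace)   # {'/home/user/foo'}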
Replica Placement • The chunk replica placement policy serves two purposes: • Maximize data reliability and availability. • Maximize network bandwidth utilization. • The policy: • It is not enough to spread replicas across machines. • That alone does not guard against rack failures or exploit network bandwidth across racks. • We must also spread chunk replicas across racks.
Creation, Re-replication, Rebalancing • Creation policy: • Place new replicas on chunk servers with below-average disk space utilization. • For load balancing. • Limit the number of "recent" creations on each chunk server. • A creation implies imminent heavy write traffic. • Re-replication policy: • Re-replicate a chunk when the number of available replicas falls below a user-specified goal. • Placement follows the creation policy above. • Cloning bandwidth is limited by a threshold. • This keeps cloning traffic from overwhelming client traffic.
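A minimal sketch of the creation policy; the RECENT_CREATION_LIMIT threshold and the server records are illustrative, and the rack-spreading constraint is omitted for brevity:

# Sketch of choosing chunk servers for new replicas: prefer below-average
# disk utilization, and cap the number of "recent" creations per server.
RECENT_CREATION_LIMIT = 5   # illustrative threshold, not from the paper

def pick_servers(servers, replicas_needed=3):
    avg_util = sum(s["disk_util"] for s in servers) / len(servers)
    candidates = [
        s for s in servers
        if s["disk_util"] <= avg_util and s["recent_creations"] < RECENT_CREATION_LIMIT
    ]
    candidates.sort(key=lambda s: s["disk_util"])   # least-utilized first
    chosen = candidates[:replicas_needed]
    for s in chosen:
        s["recent_creations"] += 1   # a creation implies imminent write traffic
    return [s["name"] for s in chosen]

servers = [
    {"name": "cs1", "disk_util": 0.40, "recent_creations": 1},
    {"name": "cs2", "disk_util": 0.90, "recent_creations": 0},
    {"name": "cs3", "disk_util": 0.55, "recent_creations": 2},
    {"name": "cs4", "disk_util": 0.60, "recent_creations": 7},
]
print(pick_servers(servers))   # ['cs1', 'cs3'] (cs2 too full, cs4 too many recent creations)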
Creation, Re-replication, Rebalancing(con’t) • The master rebalances replicas periodically: • Examines the current replica distribution. • Moves replicas for better disk space and load balancing. • Note: through this process, the master gradually fills up a new chunk server rather than instantly swamps it with new chunks.
Garbage collection • Sources of garbage: • After a file is deleted, GFS does not immediately reclaim the available physical storage. • The master only logs the deletion operation. • Chunk creation may succeed on some chunk servers but not others. • Garbage collection mechanism: • Executed periodically. • Merged with regular scans of namespaces and handshakes with chunk servers. • Any replica not known to the master is "garbage." • This includes replicas holding stale chunk versions.
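A rough sketch of how the master could flag garbage during a handshake, comparing a chunk server's reported replicas against its own metadata (a simplification of the actual mechanism):

# Sketch of lazy garbage collection during a master / chunk-server handshake.
# master_chunks: handle -> current version known to the master.
def identify_garbage(master_chunks, reported_replicas):
    """reported_replicas: list of (handle, version) pairs held by one chunk server."""
    garbage = []
    for handle, version in reported_replicas:
        known_version = master_chunks.get(handle)
        if known_version is None or version < known_version:
            garbage.append(handle)   # orphaned chunk or stale replica version
    return garbage                    # the chunk server is free to delete these

master_chunks = {0xA1: 3, 0xA2: 5}
report = [(0xA1, 3), (0xA2, 4), (0xB7, 1)]       # 0xA2 is stale, 0xB7 is unknown
print(identify_garbage(master_chunks, report))   # [162, 183], i.e. 0xA2 and 0xB7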
Master Availability • The master state is replicated for reliability. • The operation log and checkpoints are replicated on multiple machines. • Moreover, "shadow" masters provide read-only access to the file system. • They may lag the primary slightly (they are not exact mirrors). • A shadow master polls chunk servers at startup (and infrequently thereafter).
Data Integrity • A chunk is broken up into 64 KB blocks, and each block has a corresponding 32-bit checksum. • Checksums are kept in the chunk server's memory. • For reads, the chunk server verifies the checksums of the data blocks that overlap the read range before returning any data. • Low overhead: • Checksum calculation can often be overlapped with I/Os. • During idle periods, chunk servers can scan and verify the contents of inactive chunks.
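A small sketch of the read-path verification, using CRC32 as an illustrative 32-bit checksum (the paper does not specify the exact checksum function):

# Sketch of chunk-server data integrity: one 32-bit checksum per 64 KB block,
# verified for every block that overlaps a read before data is returned.
import zlib

BLOCK_SIZE = 64 * 1024   # 64 KB blocks within a chunk

def compute_checksums(chunk_data):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verified_read(chunk_data, checksums, offset, length):
    first_block = offset // BLOCK_SIZE
    last_block = (offset + length - 1) // BLOCK_SIZE
    for b in range(first_block, last_block + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}")  # report corruption
    return chunk_data[offset:offset + length]

chunk = bytes(range(256)) * 1024   # 256 KB of sample chunk data (4 blocks)
sums = compute_checksums(chunk)
print(len(verified_read(chunk, sums, 70_000, 10)))   # 10 bytes, read spans block 1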
Clusters and measurement • Cluster A is used regularly for research and development by over a hundred engineers. • Cluster B is primarily used for production data processing.
Read and Write Rate • The total workload consists of more reads than writes as we have assumed.
Fast Recovery • Experiment 1: • Killed a single chunk server in cluster B. • It held about 15,000 chunks containing 600 GB of data. • All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s. • Experiment 2: • Killed two chunk servers in cluster A. • Each held roughly 16,000 chunks and 660 GB of data. • This double failure reduced 266 chunks to having a single replica. • All were restored to at least 2x replication within 2 minutes.
Conclusion • GFS demonstrates the qualities essential for supporting large-scale data processing workloads on commodity hardware. • We treat component failures as the norm rather than the exception. • We optimize for huge files that are mostly appended to (perhaps concurrently). • Our system provides fault tolerance through constant monitoring, replication of crucial data, and fast, automatic recovery. • It delivers high aggregate throughput to many concurrent readers and writers performing a variety of tasks.
Comment • The master/chunk server design could be used in a cloud storage service. • It might also host a database, though there could be technical issues. • Some design decisions depend on the workload type, such as large-file reads and writes. • They may not fit the general case in the cloud. • If the file system were hosted on better hardware, how could we improve performance by modifying the design?