GFS - Slides by Jatin
Ideology
• Huge amounts of data
• Ability to access data efficiently
• Large quantities of cheap machines
• Component failures are the norm rather than the exception
• Atomic append operation so that multiple clients can append concurrently
Files in GFS
• Files are huge by traditional standards
• Most files are mutated by appending new data rather than overwriting existing data
• Once written, files are only read, and often only sequentially
• Appending therefore becomes the focus of performance optimization and atomicity guarantees
Architecture
• A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients
• Each of these is typically a commodity Linux machine running a user-level server process
• Files are divided into fixed-size chunks, each identified by an immutable and globally unique 64-bit chunk handle
• For reliability, each chunk is replicated on multiple chunkservers
• The master maintains all file system metadata
• The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state
• Neither the client nor the chunkserver caches file data, which eliminates cache coherence issues
• Clients do, however, cache metadata
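As a rough illustration of the split described above, here is a minimal sketch (in Go) of the kind of in-memory state the master keeps; the type and field names are hypothetical and not taken from the GFS implementation.

    package main

    import "fmt"

    // ChunkHandle is the immutable, globally unique 64-bit chunk identifier.
    type ChunkHandle uint64

    // ChunkInfo is what the master tracks per chunk: its replicas and version.
    type ChunkInfo struct {
        Version  uint64   // used to detect stale replicas
        Replicas []string // addresses of chunkservers holding a copy
    }

    // MasterState holds all file system metadata in memory.
    type MasterState struct {
        Namespace map[string][]ChunkHandle  // full pathname -> ordered chunk handles
        Chunks    map[ChunkHandle]ChunkInfo // chunk handle -> replica locations
    }

    func main() {
        m := &MasterState{
            Namespace: map[string][]ChunkHandle{
                "/logs/web-00": {0x1001, 0x1002}, // a file made of two chunks
            },
            Chunks: map[ChunkHandle]ChunkInfo{
                0x1001: {Version: 3, Replicas: []string{"cs1:7000", "cs2:7000", "cs3:7000"}},
                0x1002: {Version: 1, Replicas: []string{"cs2:7000", "cs4:7000", "cs5:7000"}},
            },
        }
        fmt.Println(m.Chunks[m.Namespace["/logs/web-00"][0]].Replicas)
    }

The chunk data itself never appears in these structures; it lives only on the chunkservers.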
Read Process
• A single master vastly simplifies the design
• Clients never read and write file data through the master; instead, a client asks the master which chunkservers it should contact
• Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file
• It sends the master a request containing the file name and chunk index; the master replies with the corresponding chunk handle and the locations of the replicas
• The client caches this information using the file name and chunk index as the key
• The client then sends a request to one of the replicas, most likely the closest one; the request specifies the chunk handle and a byte range within that chunk
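A sketch of this client-side read path, assuming hypothetical RPC stubs; the offset-to-chunk-index arithmetic and the (file name, chunk index) cache key follow the slide.

    package main

    import "fmt"

    const chunkSize = 64 << 20 // 64 MB fixed chunk size

    // cacheKey is (file name, chunk index), as used by the client cache.
    type cacheKey struct {
        name  string
        index int64
    }

    type chunkLocation struct {
        handle   uint64
        replicas []string
    }

    var locationCache = map[cacheKey]chunkLocation{}

    // askMaster stands in for the "which chunkservers should I contact?" RPC.
    func askMaster(name string, index int64) chunkLocation {
        return chunkLocation{handle: 0x1001, replicas: []string{"cs1:7000", "cs2:7000"}}
    }

    // readAt resolves (file name, byte offset) to a chunk and a byte range inside it.
    func readAt(name string, offset, length int64) {
        index := offset / chunkSize       // chunk index within the file
        chunkOffset := offset % chunkSize // byte range inside that chunk

        key := cacheKey{name, index}
        loc, ok := locationCache[key]
        if !ok {
            loc = askMaster(name, index) // one round trip to the master
            locationCache[key] = loc
        }
        // The actual data request goes to a replica, likely the closest one.
        fmt.Printf("read chunk %#x [%d, %d) from %s\n",
            loc.handle, chunkOffset, chunkOffset+length, loc.replicas[0])
    }

    func main() {
        readAt("/logs/web-00", 150<<20, 4096) // 150 MB into the file -> chunk index 2
    }

Because the location is cached, further reads of the same chunk need no master interaction at all.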
Specifications
• Chunk size = 64 MB
• Chunks are stored as plain Linux files on the chunkservers
• Clients keep a persistent TCP connection to a chunkserver over an extended period of time (reduces network overhead)
• Clients cache all the chunk location information to facilitate small random reads
• The master keeps the metadata in memory
• Disadvantage: small files can become hot spots
• Solution: higher replication for such files
Metadata
• The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas
• Namespaces and the file-to-chunk mapping are kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines
• The master does not store chunk location information persistently; instead it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster
In-Memory Data Structures
• Keeping metadata in memory lets the master periodically scan its entire state for:
• Chunk garbage collection
• Re-replication in the presence of chunkserver failures
• Chunk migration to balance load and disk space usage across chunkservers
• The master maintains less than 64 bytes of metadata for each 64 MB chunk
• The file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression
• Thus the master's memory is not a serious limitation (see the sketch after this list)
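A back-of-the-envelope check of why master memory is not a serious constraint, assuming a hypothetical 1 PB of stored file data and the figures above (under 64 bytes of metadata per 64 MB chunk):

    package main

    import "fmt"

    func main() {
        const (
            totalData     = int64(1) << 50 // 1 PB of stored file data (assumed)
            chunkSize     = int64(64) << 20
            bytesPerChunk = int64(64) // < 64 B of master metadata per chunk
        )
        chunks := totalData / chunkSize
        metadata := chunks * bytesPerChunk
        fmt.Printf("%d chunks -> about %d MB of chunk metadata\n", chunks, metadata>>20)
        // ~16.8 million chunks -> roughly 1 GB of metadata, which fits in RAM.
    }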
Chunk Locations
• The master simply polls chunkservers for that information at startup and periodically thereafter
• This eliminates the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on
• A chunkserver has the final word over what chunks it does or does not have on its own disks
• There is no point in the master maintaining a persistent, authoritative view of this information
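A minimal sketch of the polling idea: the master derives its chunk-location view entirely from what each chunkserver reports, so the chunkserver remains the authority on what it holds. The report format and names are illustrative.

    package main

    import "fmt"

    // ChunkReport is what a chunkserver sends at startup and in later polls.
    type ChunkReport struct {
        Server string
        Chunks []uint64 // handles of chunks found on its local disks
    }

    // rebuildLocations derives handle -> servers purely from reports,
    // so the master never has to keep this mapping persistently.
    func rebuildLocations(reports []ChunkReport) map[uint64][]string {
        locations := make(map[uint64][]string)
        for _, r := range reports {
            for _, h := range r.Chunks {
                locations[h] = append(locations[h], r.Server)
            }
        }
        return locations
    }

    func main() {
        reports := []ChunkReport{
            {Server: "cs1:7000", Chunks: []uint64{0x1001, 0x1002}},
            {Server: "cs2:7000", Chunks: []uint64{0x1001}},
        }
        fmt.Println(rebuildLocations(reports)[0x1001]) // [cs1:7000 cs2:7000]
    }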
Operation Log
• Contains a historical record of critical metadata changes
• Central to GFS
• Serves as a logical time line that defines the order of concurrent operations
• The master must store it reliably and not make changes visible to clients until the metadata changes are made persistent
• The master responds to a client operation only after flushing the corresponding log record to disk, both locally and remotely
• The master checkpoints its state whenever the log grows beyond a certain size
• The checkpoint is in a compact B-tree-like form
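A minimal sketch, under assumed names and a made-up checkpoint threshold, of the two rules above: flush the log record locally and remotely before replying, and checkpoint once the log grows past a certain size.

    package main

    import (
        "fmt"
        "os"
    )

    const checkpointThreshold = 64 << 20 // checkpoint past 64 MB of log (assumed)

    type opLog struct {
        file *os.File
        size int64
    }

    // append writes a metadata mutation, flushes it locally and remotely,
    // and only then may the master reply to the client.
    func (l *opLog) append(record []byte) error {
        n, err := l.file.Write(record)
        if err != nil {
            return err
        }
        if err := l.file.Sync(); err != nil { // flush to local disk
            return err
        }
        replicateToRemoteLogs(record) // stub: ship the record to replica masters
        l.size += int64(n)
        if l.size > checkpointThreshold {
            checkpoint() // dump in-memory state in a compact, B-tree-like form
            l.size = 0
        }
        return nil
    }

    func replicateToRemoteLogs(record []byte) {}
    func checkpoint()                         {}

    func main() {
        f, err := os.CreateTemp("", "oplog")
        if err != nil {
            panic(err)
        }
        defer os.Remove(f.Name())
        l := &opLog{file: f}
        if err := l.append([]byte("CREATE /logs/web-00\n")); err != nil {
            panic(err)
        }
        fmt.Println("acknowledged after log record is durable")
    }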
Consistency Model
• GFS has a relaxed consistency model
• Depending on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations, a file region may end up defined, consistent but undefined, or inconsistent
GFS achieves "defined" status for a region after successful mutations by
• (a) applying mutations to a chunk in the same order on all its replicas
• (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down
• Problem: since clients cache chunk locations, they may read from a stale replica before the cache entry expires
• Solution: the window is limited by the cache timeout, and because most files are append-only a stale replica usually returns a premature end of chunk rather than outdated data
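A sketch of stale-replica detection via version numbers, assuming the master bumps a chunk's version whenever it grants a new lease; any replica still reporting an older version missed a mutation while its chunkserver was down. Names are illustrative.

    package main

    import "fmt"

    // replicaVersions maps chunkserver address -> version it reports for one chunk.
    func findStaleReplicas(masterVersion uint64, replicaVersions map[string]uint64) []string {
        var stale []string
        for server, v := range replicaVersions {
            if v < masterVersion { // missed mutations while its chunkserver was down
                stale = append(stale, server)
            }
        }
        return stale
    }

    func main() {
        // The master bumped the version to 4 when granting the latest lease.
        stale := findStaleReplicas(4, map[string]uint64{
            "cs1:7000": 4,
            "cs2:7000": 4,
            "cs3:7000": 3, // was down during a mutation -> stale, to be garbage collected
        })
        fmt.Println(stale)
    }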
Failure Identification
• Regular handshakes between the master and chunkservers
• Checksumming of data
• Versioning of chunks
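A sketch of data checksumming on a chunkserver: CRC32 over fixed-size blocks, with corrupted blocks flagged so a good replica can be fetched. The 64 KB block size follows the GFS paper; the rest is an assumption.

    package main

    import (
        "fmt"
        "hash/crc32"
    )

    const blockSize = 64 << 10 // 64 KB checksum blocks

    // verify recomputes the checksum of each block and reports corrupted ones,
    // which the chunkserver would then restore from another replica.
    func verify(chunk []byte, checksums []uint32) []int {
        var bad []int
        for i := 0; i*blockSize < len(chunk); i++ {
            end := (i + 1) * blockSize
            if end > len(chunk) {
                end = len(chunk)
            }
            if crc32.ChecksumIEEE(chunk[i*blockSize:end]) != checksums[i] {
                bad = append(bad, i)
            }
        }
        return bad
    }

    func main() {
        chunk := make([]byte, 3*blockSize)
        checksums := []uint32{
            crc32.ChecksumIEEE(chunk[0*blockSize : 1*blockSize]),
            crc32.ChecksumIEEE(chunk[1*blockSize : 2*blockSize]),
            crc32.ChecksumIEEE(chunk[2*blockSize : 3*blockSize]),
        }
        chunk[blockSize+5] ^= 0xFF // simulate a disk corruption in block 1
        fmt.Println("corrupted blocks:", verify(chunk, checksums))
    }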
Applications
• Applications are written to mutate files by appending rather than overwriting
• The writer checksums each record it prepares
• The reader uses the checksums to identify and discard extra padding and record fragments
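A sketch of self-validating records: the writer frames each record with its length and checksum, and the reader skips anything (padding, partial records from failed appends) that does not validate. This particular framing format is an assumption, not the one used by Google's applications.

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "hash/crc32"
    )

    // writeRecord frames a record as [length][crc32][payload].
    func writeRecord(buf *bytes.Buffer, payload []byte) {
        binary.Write(buf, binary.LittleEndian, uint32(len(payload)))
        binary.Write(buf, binary.LittleEndian, crc32.ChecksumIEEE(payload))
        buf.Write(payload)
    }

    // readRecords returns only records whose checksum validates; padding and
    // garbage produced by failed appends are skipped one byte at a time.
    func readRecords(data []byte) [][]byte {
        var out [][]byte
        for len(data) >= 8 {
            n := int(binary.LittleEndian.Uint32(data[0:4]))
            sum := binary.LittleEndian.Uint32(data[4:8])
            if n > 0 && n <= len(data)-8 && crc32.ChecksumIEEE(data[8:8+n]) == sum {
                out = append(out, data[8:8+n])
                data = data[8+n:]
                continue
            }
            data = data[1:] // not a valid record here; resynchronize
        }
        return out
    }

    func main() {
        var buf bytes.Buffer
        writeRecord(&buf, []byte("event-1"))
        buf.Write(make([]byte, 16)) // padding left behind by a failed append
        writeRecord(&buf, []byte("event-2"))
        for _, r := range readRecords(buf.Bytes()) {
            fmt.Println(string(r))
        }
    }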
Leases and Mutation Order
• The master grants a chunk lease to one of the replicas, which we call the primary
• The primary picks a serial order for all mutations to the chunk
• If a client request fails at any of the replicas, the modified region is left in an inconsistent state
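A sketch of the lease idea: the primary assigns a serial number to each mutation and every replica applies mutations in that order, so concurrent client requests end up identically ordered on all replicas of the chunk. The RPC layer is omitted and the names are hypothetical.

    package main

    import "fmt"

    type mutation struct {
        serial int    // order picked by the primary
        data   string // the bytes to apply at some offset (simplified)
    }

    // primary holds the chunk lease and picks a serial order for all mutations.
    type primary struct {
        next        int
        secondaries []*replica
    }

    type replica struct {
        name    string
        applied []mutation
    }

    func (p *primary) apply(data string) {
        p.next++
        m := mutation{serial: p.next, data: data}
        for _, s := range p.secondaries {
            s.applied = append(s.applied, m) // all replicas apply in the same order
        }
    }

    func main() {
        r1, r2 := &replica{name: "cs2"}, &replica{name: "cs3"}
        p := &primary{secondaries: []*replica{r1, r2}}
        p.apply("append A") // concurrent client requests get serialized here
        p.apply("append B")
        fmt.Println(r1.applied)
        fmt.Println(r2.applied)
    }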
Namespace Management and Locking
• GFS does not have a per-directory data structure that lists all the files in that directory
• GFS logically represents its namespace as a lookup table mapping full pathnames to metadata
• Whenever a file is to be modified, the master acquires read locks on every directory name along the path and a write lock on the full pathname itself (see the sketch below)
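A sketch of that locking scheme: read locks on every ancestor prefix of the pathname and a write lock on the full pathname. The lock table here is a simple map of RWMutexes; the real master also acquires locks in a consistent total order to avoid deadlock.

    package main

    import (
        "fmt"
        "strings"
        "sync"
    )

    var lockTable = struct {
        sync.Mutex
        locks map[string]*sync.RWMutex
    }{locks: map[string]*sync.RWMutex{}}

    func lockFor(path string) *sync.RWMutex {
        lockTable.Lock()
        defer lockTable.Unlock()
        if _, ok := lockTable.locks[path]; !ok {
            lockTable.locks[path] = &sync.RWMutex{}
        }
        return lockTable.locks[path]
    }

    // lockForMutation takes read locks on /a, /a/b, ... and a write lock on the
    // full pathname, then returns a function that releases them in reverse order.
    func lockForMutation(path string) (unlock func()) {
        parts := strings.Split(strings.TrimPrefix(path, "/"), "/")
        var release []func()
        prefix := ""
        for i, p := range parts {
            prefix += "/" + p
            l := lockFor(prefix)
            if i == len(parts)-1 {
                l.Lock() // write lock on the full path being mutated
                release = append(release, l.Unlock)
            } else {
                l.RLock() // read lock on each ancestor directory name
                release = append(release, l.RUnlock)
            }
        }
        return func() {
            for i := len(release) - 1; i >= 0; i-- {
                release[i]()
            }
        }
    }

    func main() {
        unlock := lockForMutation("/home/user/file")
        fmt.Println("read locks on /home and /home/user, write lock on /home/user/file")
        unlock()
    }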
FAULT TOLERANCE
Achieved by:
• Fast recovery
• Chunk replication
• Master replication
• "Shadow" masters (read-only access to the file system)
• If the master's machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log
• Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine
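A small client-side sketch of why the DNS alias matters: by re-resolving the canonical name on every attempt, a client picks up a relocated master automatically. The retry policy and port are assumptions; only the alias name comes from the slide.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    const masterName = "gfs-test" // canonical DNS alias for the master

    // dialMaster re-resolves the alias on every attempt, so a failover to a new
    // machine behind the same name is picked up automatically.
    func dialMaster() (net.Conn, error) {
        var lastErr error
        for attempt := 0; attempt < 3; attempt++ {
            addrs, err := net.LookupHost(masterName)
            if err == nil && len(addrs) > 0 {
                conn, err := net.DialTimeout("tcp", net.JoinHostPort(addrs[0], "7000"), 2*time.Second)
                if err == nil {
                    return conn, nil
                }
                lastErr = err
            } else if err != nil {
                lastErr = err
            }
            time.Sleep(time.Second) // back off; monitoring may be starting a new master
        }
        return nil, fmt.Errorf("master unreachable via %q: %w", masterName, lastErr)
    }

    func main() {
        if _, err := dialMaster(); err != nil {
            fmt.Println(err)
        }
    }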