GFS: Google File System Ömer Faruk İnce Fatih University - Computer Engineering Cloud Computing 25.03.2014
Overview • What is the GFS? • What is the GFS designed for? • Design Overview • Assumptions • Interfaces • Architecture • Master Server • Chunk Server • Metadata • System Interactions • Master Operations • Garbage Collection • Fault Tolerance • Data Integrity • Performance Testing • Conclusions
What is the GFS? • The GFS is designed to meet the rapidly growing demands of Google's data processing needs, and GFS shares many of the same aims as previous distributed file systems, such as performance, scalability, reliability and availability.
What is the GFS designed for? • Google needed a good distributed file system. • Redundant storage of massive amounts of data on cheap and unreliable computers. • Why doesn't Google use an existing file system? Google's problems are different from anyone else's. • Different workload and design priorities. • GFS is designed for Google apps and workloads. • Google apps are designed for GFS.
Assumptions • System built from many inexpensive commodity components. • System stores a modest number of large files. • Few million files, each typically 100 MB or larger in size. Multi-GB files are common. • Small files must be supported, but need not be optimized for. • Workload is primarily: • Large streaming reads • Small random reads • Many large sequential appends. • Must efficiently implement concurrent, atomic appends. • Producer-consumer queues. • Many-way merging.
Interface • GFS provides a familiar file system interface, though it does not implement a standard API. Files are organized hierarchically in directories and identified by pathnames. It supports the usual operations to create, delete, open, close, read, and write files. • Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking.
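As a rough illustration of the operations listed above, a GFS-style client interface might look like the sketch below (open and close are omitted for brevity). The class and method names are hypothetical; GFS's actual client library is not a public API.

```python
from abc import ABC, abstractmethod

class GFSClient(ABC):
    """Illustrative GFS-style client interface; method names are invented."""

    @abstractmethod
    def create(self, path: str) -> None:
        """Create an empty file identified by a hierarchical pathname."""

    @abstractmethod
    def delete(self, path: str) -> None:
        """Delete a file (storage is reclaimed later by garbage collection)."""

    @abstractmethod
    def read(self, path: str, offset: int, length: int) -> bytes:
        """Read length bytes starting at a client-chosen offset."""

    @abstractmethod
    def write(self, path: str, offset: int, data: bytes) -> None:
        """Write data at a client-chosen offset."""

    @abstractmethod
    def record_append(self, path: str, data: bytes) -> int:
        """Append data atomically at least once; GFS chooses and returns the offset."""

    @abstractmethod
    def snapshot(self, source: str, target: str) -> None:
        """Create a low-cost copy of a file or directory tree."""
```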
Architecture • A GFS cluster consists of: • A single master, • Multiple chunkservers, • Multiple clients, as shown in Figure 1.
Chunk Size • Files are stored as fixed-size chunks (64 MB). • Chunk size is one of the key design parameters.
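Because the chunk size is fixed, a client can translate a file byte offset into a chunk index with simple integer arithmetic, roughly as in this minimal sketch (the helper name is invented):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def locate(byte_offset: int) -> tuple[int, int]:
    """Return (chunk_index, offset_within_chunk) for a byte offset in a file."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

print(locate(200_000_000))  # (2, 65782272): byte 200,000,000 lies in the third chunk
```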
Metadata • The master stores three major types of metadata: • File and chunk namespaces • Mapping from files to chunks • Locations of each chunk's replicas • All metadata is kept in the master's memory. • The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines.
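A minimal sketch of how these three metadata types might be held in the master's memory; all structure and method names are invented. Only namespace and file-to-chunk mutations are appended to the (here simulated) operation log, while replica locations are learned from chunkservers at runtime.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    version: int = 1
    replicas: list[str] = field(default_factory=list)  # chunkserver addresses, not persisted

@dataclass
class MasterMetadata:
    namespace: dict[str, list[int]] = field(default_factory=dict)  # path -> ordered chunk handles
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)     # chunk handle -> chunk info
    op_log: list[str] = field(default_factory=list)                # stand-in for the persistent operation log

    def create_file(self, path: str) -> None:
        self.namespace[path] = []
        self.op_log.append(f"CREATE {path}")              # namespace mutation is logged

    def add_chunk(self, path: str, handle: int) -> None:
        self.namespace[path].append(handle)
        self.chunks[handle] = ChunkInfo()
        self.op_log.append(f"ADD_CHUNK {path} {handle}")  # file-to-chunk mapping is logged
```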
Master's Responsibilities • Metadata storage • Namespace management/locking • Periodic communication with chunkservers • give instructions, collect state, track cluster health • Chunk creation, re-replication, rebalancing • balance space utilization and access speed • spread replicas across racks to reduce correlated failures • re-replicate data if redundancy falls below threshold • rebalance data to smooth out storage and request load
Master's Responsibilities (2) • Garbage Collection • simpler, more reliable than traditional file delete • master logs the deletion, renames the file to a hidden name • lazily garbage collects hidden files • Stale replica deletion • detect “stale” replicas using chunk version numbers
System Interactions • The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. • The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this information. • Client pre-pushes data to all replicas. • After all replicas acknowledge, the client sends the write request to the primary. • Primary forwards the write request to all replicas. • The secondaries all reply to the primary indicating that they have completed the operation. • Primary replies to the client. Errors are handled by retrying.
Read Algorithm 1. Application originates the read request 2. GFS client translates the request and sends it to the master 3. Master responds with the chunk handle and replica locations
Read Algorithm 4. Client picks a location and sends the request 5. Chunkserver sends the requested data to the client 6. Client forwards the data to the application
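The six read steps above could be sketched roughly as follows; the master and chunkserver objects and their lookup/read methods are placeholders, not real GFS RPCs.

```python
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, path: str, byte_offset: int, length: int) -> bytes:
    # Steps 1-2: the client library translates (path, byte offset) into a chunk index
    # and sends the lookup to the master.
    chunk_index = byte_offset // CHUNK_SIZE
    # Step 3: the master responds with the chunk handle and replica locations
    # (which the client caches).
    handle, replicas = master.lookup(path, chunk_index)
    # Step 4: the client picks one location (a real client prefers the closest replica).
    chunkserver = replicas[0]
    # Steps 5-6: the chunkserver returns the data, and the library hands it to the application.
    return chunkserver.read(handle, byte_offset % CHUNK_SIZE, length)
```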
Write Algorithm 1. Application originates the request 2. GFS client translates the request and sends it to the master 3. Master responds with the chunk handle and replica locations
Write Algorithm 4. Client pushes write data to all locations. Data is stored in the chunkservers' internal buffers
Write Algorithm 5. Client sends the write command to the primary 6. Primary determines the serial order for the data instances in its buffer and writes the instances in that order to the chunk 7. Primary sends the serial order to the secondaries and tells them to perform the write
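The write path could be sketched like this, again with invented object and method names; note that in the real system the data push of step 4 is pipelined along a chain of chunkservers rather than sent separately to each replica as simplified here.

```python
CHUNK_SIZE = 64 * 1024 * 1024

def gfs_write(master, path: str, byte_offset: int, data: bytes) -> None:
    # Steps 1-3: translate the request and obtain the chunk handle, the primary
    # (current lease holder) and the secondary replicas from the master.
    chunk_index = byte_offset // CHUNK_SIZE
    handle, primary, secondaries = master.lookup_for_write(path, chunk_index)
    # Step 4: push the data to all replicas, which hold it in internal buffers.
    for replica in [primary, *secondaries]:
        replica.push_data(handle, data)
    # Step 5: send the write command to the primary.
    # Steps 6-7: the primary picks a serial order, applies it locally, and tells the
    # secondaries to apply the same order; failures are handled by the client retrying.
    primary.apply_write(handle, byte_offset % CHUNK_SIZE, secondaries)
```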
Atomic Record Append • Client pushes data to all replicas. • Sends the request to the primary. The primary checks whether the record fits within the maximum chunk size. • If it fits: • Append the data to its replica • If it exceeds the maximum size: • Send an error to the client • The append is retried on the next chunk • Replicas of the same chunk may contain different data, possibly including duplicates of the same record in whole or in part. These are handled by the client. • It only guarantees that the data is written at least once as an atomic unit.
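A minimal, self-contained sketch of the primary-side decision described above; the Chunk class and in-memory byte buffers are stand-ins for real chunk replicas.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class Chunk:
    def __init__(self):
        self.data = bytearray()

def record_append(primary: Chunk, secondaries: list[Chunk], record: bytes):
    """Primary checks whether the record fits in the current chunk."""
    if len(primary.data) + len(record) > CHUNK_SIZE:
        # The record does not fit: the client gets an error and retries on the next chunk.
        return "RETRY_ON_NEXT_CHUNK"
    offset = len(primary.data)              # the primary chooses the offset...
    for replica in [primary, *secondaries]:
        replica.data += record              # ...and every replica appends at that offset
    return offset  # the record is written at least once as an atomic unit
```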
Snapshot • A snapshot is a copy of a system at a moment in time. • To quickly create branch copies of huge data sets. • To checkpoint the current state before experimenting with changes that can later be committed or rolled back easily. • When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot.
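Following the copy-on-write approach described in the original GFS paper, the master-side snapshot steps might be sketched as below; all object and method names are invented.

```python
def snapshot(master, source: str, target: str) -> None:
    # 1. Revoke outstanding leases on the chunks of the files being snapshotted,
    #    so that any later write must first contact the master.
    for handle in master.chunks_of(source):
        master.revoke_lease(handle)
    # 2. Log the operation and duplicate the metadata: the new file tree initially
    #    points at the same chunks as the source.
    master.log_operation(f"SNAPSHOT {source} {target}")
    master.copy_metadata(source, target)
    # 3. Chunk data is copied lazily, only when a shared chunk is next written
    #    (copy-on-write).
```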
Master Operations • Namespace Management and Locking: GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. • Need locking to prevent: • Two clients from trying to create the same file at the same time. • Changes to a directory tree during snapshotting. • Solution: • Lock intervening directories in read mode. • Lock the final file or directory in write mode. • For a snapshot, lock the source and target in write mode.
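A small sketch of which locks such a scheme would take for a file creation (path handling simplified; the function is illustrative only):

```python
def locks_for_create(path: str) -> tuple[list[str], str]:
    """Read-lock every intervening directory, write-lock the leaf itself."""
    parts = path.strip("/").split("/")
    read_locks = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    write_lock = "/" + "/".join(parts)
    return read_locks, write_lock

# Two clients creating /home/user/foo both need the write lock on /home/user/foo,
# so the two creations are serialized.
print(locks_for_create("/home/user/foo"))  # (['/home', '/home/user'], '/home/user/foo')
```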
Replica Management • Maximize data reliability and availability • Maximize bandwidth utilization • Need to spread chunk replicas across machines and racks
Creation, Re-replication and Rebalancing • Replicas are created for three reasons: • Chunk creation • Re-replication • Load balancing • Creation • place new replicas on chunkservers with below-average disk space utilization. • Spread replicas across racks. • Re-replication • re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. • Rebalancing • Periodically examines the distribution and moves replicas for better disk space and load balancing.
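A simplified sketch of the creation-time placement policy above; the real policy considers additional factors (such as recent creations per chunkserver), and the data structures here are invented.

```python
def place_new_replicas(chunkservers: list[dict], n: int = 3) -> list[str]:
    """Prefer chunkservers with below-average disk utilization, spread across racks."""
    avg = sum(cs["disk_used"] for cs in chunkservers) / len(chunkservers)
    candidates = sorted(
        (cs for cs in chunkservers if cs["disk_used"] <= avg),
        key=lambda cs: cs["disk_used"],
    )
    chosen, racks = [], set()
    for cs in candidates:
        if cs["rack"] not in racks:      # spread across racks to limit correlated failures
            chosen.append(cs["name"])
            racks.add(cs["rack"])
        if len(chosen) == n:
            break
    return chosen

servers = [
    {"name": "cs1", "rack": "A", "disk_used": 0.20},
    {"name": "cs2", "rack": "B", "disk_used": 0.25},
    {"name": "cs3", "rack": "C", "disk_used": 0.30},
    {"name": "cs4", "rack": "A", "disk_used": 0.60},
]
print(place_new_replicas(servers))  # ['cs1', 'cs2', 'cs3']
```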
Garbage Collection • Storage is reclaimed lazily by GC. • The file is first renamed to a hidden name. • Hidden files are removed if more than three days old. • When a hidden file is removed, its in-memory metadata is removed. • The master regularly scans the chunk namespace, identifying orphaned chunks. These are removed. • Chunkservers periodically report the chunks they have, and the master replies with the identity of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.
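A rough sketch of the two sides of this lazy reclamation, with the master state and helper methods invented for illustration:

```python
import time

HIDDEN_RETENTION = 3 * 24 * 3600  # hidden (deleted) files are kept for three days

def delete_file(master, path: str) -> None:
    """Deletion only logs and renames to a hidden name; nothing is reclaimed yet."""
    hidden = path + ".deleted"                 # hypothetical hidden-name scheme
    master.rename(path, hidden)
    master.deleted_at[hidden] = time.time()

def garbage_collect(master) -> None:
    now = time.time()
    # Remove hidden files older than three days, dropping their in-memory metadata.
    for hidden, when in list(master.deleted_at.items()):
        if now - when > HIDDEN_RETENTION:
            master.remove_metadata(hidden)
            del master.deleted_at[hidden]
    # Chunks no longer reachable from any file are orphaned; chunkservers learn this
    # from the master's reply to their periodic chunk report and delete the replicas.
    master.orphaned_chunks = set(master.all_chunks()) - set(master.referenced_chunks())
```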
Fault Tolerance • High availability • Fast recovery • Master and chunkservers are restartable in a few seconds • Chunk replication • Each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace. • default: 3 replicas. • Shadow masters • Data integrity • Checksum every 64 KB block in each chunk
Data Integrity • Each chunkserver uses checksumming to detect corruption of stored data. • Checksums are kept in memory. • Separate from data. • On a read error, the error is reported to the master. • The master will re-replicate the chunk. • The requestor reads from other replicas.
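A minimal sketch of per-block checksumming on the chunkserver; CRC32 is used here purely for illustration, not as the actual GFS checksum function.

```python
import zlib

BLOCK = 64 * 1024  # each 64 KB block of a chunk has its own checksum

def checksum_blocks(chunk_data: bytes) -> list[int]:
    """Compute one checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK]) for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data: bytes, checksums: list[int], block_index: int) -> bytes:
    """Verify a block before returning it; corruption is reported, not silently served."""
    block = chunk_data[block_index * BLOCK:(block_index + 1) * BLOCK]
    if zlib.crc32(block) != checksums[block_index]:
        # Report to the master (which re-replicates the chunk from a good replica)
        # and let the requestor read from another replica.
        raise IOError("checksum mismatch: this chunk replica is corrupt")
    return block
```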
Performance Testing • GFS cluster consisting of: • One master • Two master replicas • 16 chunkservers • 16 clients • Machines were: • Dual 1.4 GHz PIII • 2 GB of RAM • Two 80 GB 5400 RPM disks • 100 Mbps full-duplex Ethernet to a switch • Servers connected to one switch, clients to another. Switches connected via gigabit Ethernet.
Reads • N clients reading a 4 MB region from a 320 GB file set simultaneously. • Read rate drops slightly as the number of clients grows, from about 80% of the per-client network limit to about 75% of the aggregate limit, due to the increasing probability of multiple clients reading from the same chunkserver.
Writes • N clients writing to N files simultaneously. • The write rate is low, about 50% of the network limit, due to delays in propagating data among replicas. • Slow writes are not a major problem for the aggregate write bandwidth delivered to a large number of clients.
Record Appends • N clients appending to a single file simultaneously. • Append rate drops slightly as the number of clients grows, due to network congestion caused by the different clients. • Chunkserver network congestion is not a major issue with a large number of clients appending to large shared files.
Conclusions • GFS demonstrates how to support large-scale processing workloads on commodity hardware • design to tolerate frequent component failures • optimize for huge files that are mostly appended to and then read • feel free to relax and extend the FS interface as required • go for simple solutions (e.g., single master) • GFS has met Google's storage needs and is therefore good enough for them.