The Google File System • Presenter: Kim, Youngjin
Introduction • Component failures are the norm • Multi-GB files are common • Most files are mutated by appending new data rather than overwriting existing data
Interface • Create, delete, open, close, read and write • snapshot • record append
Single Master • Simplifies the design • Enables the master to make chunk placement and replication decisions using global knowledge • Potential bottleneck -> minimize its involvement in reads and writes
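Chunk size • One of the key design parameters • 64MB • Advantages: fewer client-master interactions, less metadata on the master • Disadvantage: chunks of small files can become hot spots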
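With a fixed chunk size, a client can translate a file byte offset into a chunk index by itself before asking the master for that chunk. The following is a minimal sketch of that arithmetic; the function names `locate` and `chunks_spanned` are hypothetical and only the 64MB constant comes from the slide.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide


def locate(byte_offset: int) -> tuple[int, int]:
    """Return (chunk index, offset inside that chunk) for a file byte offset."""
    if byte_offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(byte_offset, CHUNK_SIZE)


def chunks_spanned(byte_offset: int, length: int) -> range:
    """Chunk indices touched by reading `length` bytes from `byte_offset`."""
    if length <= 0:
        return range(0)
    first = byte_offset // CHUNK_SIZE
    last = (byte_offset + length - 1) // CHUNK_SIZE
    return range(first, last + 1)
```

A read that straddles a chunk boundary touches two chunks, so the client would need the locations of both.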
metadata • Three major types of metadata • the file and chunk namespaces • the mapping from files to chunks • the locations of each chunk’s replicas
metadata(cont’d) • In-memory Data structure • Chunk Location • Operation Log
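The three kinds of metadata can be pictured as a few in-memory maps plus an append-only log. This is a rough sketch only: the class and method names are hypothetical, and it assumes (as the GFS paper describes) that namespace and file-to-chunk changes are logged, while chunk locations are not persisted and are instead learned from chunkserver reports.

```python
from dataclasses import dataclass, field


@dataclass
class MasterMetadata:
    # Namespace plus file-to-chunk mapping: full pathname -> ordered chunk handles.
    files: dict[str, list[int]] = field(default_factory=dict)
    # Chunk handle -> chunkservers holding a replica (rebuilt from reports).
    locations: dict[int, set[str]] = field(default_factory=dict)
    # Operation log: every namespace or mapping change is recorded before it is applied.
    log: list[tuple] = field(default_factory=list)
    next_handle: int = 0

    def create(self, path: str) -> None:
        if path in self.files:
            raise FileExistsError(path)
        self.log.append(("create", path))
        self.files[path] = []

    def add_chunk(self, path: str) -> int:
        handle = self.next_handle
        self.next_handle += 1
        self.log.append(("add_chunk", path, handle))
        self.files[path].append(handle)
        return handle

    def report(self, server: str, handles: list[int]) -> None:
        """A chunkserver tells the master which chunks it holds (not logged)."""
        for handle in handles:
            self.locations.setdefault(handle, set()).add(server)

    def lookup(self, path: str, chunk_index: int) -> tuple[int, set[str]]:
        handle = self.files[path][chunk_index]
        return handle, self.locations.get(handle, set())
```

Keeping everything in plain dictionaries mirrors the "in-memory data structure" bullet: lookups are fast, and the log is what would allow the first two maps to be rebuilt after a restart.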
Consistency model • GFS has a relaxed consistency model • Write: data is written at an application-specified file offset • Record append: data is appended atomically at least once, at an offset that GFS chooses
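System Interaction • Minimize the master’s involvement • Leases and Mutation Order • The master grants a chunk lease to one replica, the primary, which picks a serial order for all mutations to that chunk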
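The idea behind the lease is that one replica decides the order of mutations and every other replica follows it. Below is a simplified sketch with hypothetical class names; buffering out-of-order deliveries in `pending` is an assumption made for the illustration, not something the slides specify.

```python
class Replica:
    """Applies mutations strictly in serial-number order."""

    def __init__(self):
        self.applied = []
        self.pending = {}     # serial -> mutation that arrived early
        self.next_serial = 0

    def deliver(self, serial, mutation):
        self.pending[serial] = mutation
        # Apply every mutation whose turn has come.
        while self.next_serial in self.pending:
            self.applied.append(self.pending.pop(self.next_serial))
            self.next_serial += 1


class Primary(Replica):
    """The lease holder: assigns consecutive serial numbers to mutations."""

    def __init__(self):
        super().__init__()
        self.counter = 0

    def mutate(self, mutation):
        serial = self.counter
        self.counter += 1
        self.deliver(serial, mutation)   # the primary applies it locally too
        return serial                    # forwarded to secondaries with the mutation
```

Because every replica applies the same serial order, replicas that apply all mutations end up with identical contents even when clients write concurrently.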
Data flow • Goal: fully utilize each machine’s network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency of pushing all the data through
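The GFS paper gives an ideal elapsed time for pushing B bytes through a pipelined chain of R replicas as B/T + R·L, where T is the network throughput and L is the per-hop latency. The helper below simply evaluates that expression; the function and parameter names are hypothetical.

```python
def ideal_transfer_seconds(num_bytes: int, replicas: int,
                           throughput_bps: float, hop_latency_s: float) -> float:
    """Ideal pipelined transfer time: B/T + R*L, with B converted to bits."""
    return num_bytes * 8 / throughput_bps + replicas * hop_latency_s


# 1 MB to 3 replicas over 100 Mbps links with 1 ms per hop: roughly 83 ms,
# dominated by the 80 ms needed to send the bytes once.
estimate = ideal_transfer_seconds(1_000_000, 3, 100e6, 0.001)
```

The latency term grows with the number of replicas, but the bandwidth term does not, which is the point of forwarding along a chain instead of sending from one machine to all replicas.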
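Atomic Record Appends • Traditional write: the client specifies the offset • Record append: the client specifies only the data, and GFS picks the offset and returns it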
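"At least once" means a retried append can leave the same record in the file more than once, so readers that care need a way to drop duplicates. The sketch below simulates this with a plain list; the lost-acknowledgement model and the function names are simplifying assumptions, and records are assumed to carry a unique ID chosen by the writer.

```python
def append_at_least_once(log, record, acks):
    """Append `record`, retrying until an attempt is acknowledged.

    `acks` is an iterator of booleans simulating whether each attempt
    succeeded from the client's point of view. Returns the record's offset.
    """
    for acked in acks:
        log.append(record)        # the data lands even if the attempt "fails"
        if acked:
            return len(log) - 1
    raise RuntimeError("no acknowledgement received")


def read_unique(log):
    """Yield each (record_id, payload) once, skipping duplicates."""
    seen = set()
    for record_id, payload in log:
        if record_id not in seen:
            seen.add(record_id)
            yield record_id, payload
```

A writer whose first attempt fails ends up with two copies of the record in the log, and `read_unique` returns only the first.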
Snapshot • Makes a copy of a file or a directory tree • Used to checkpoint the current state before making changes that can later be committed or rolled back
Master Operation • Goals • Keeping chunks fully replicated • Balancing load across all the chunkservers • Reclaiming unused storage
Namespace Management and Locking • Uses locks over regions of the namespace to ensure proper serialization • One read-write lock per namespace node • Allows concurrent mutations in the same directory
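One way to see why concurrent mutations in the same directory are allowed: following the GFS paper, an operation takes read locks on every ancestor path and a read or write lock on the full pathname. The sketch below computes that lock set and checks two sets for conflicts; the function names are hypothetical.

```python
def locks_needed(path: str, write: bool = True) -> list[tuple[str, str]]:
    """Read locks on each ancestor, then a read or write lock on `path` itself."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("path must name a node below the root")
    locks = []
    prefix = ""
    for name in parts[:-1]:
        prefix += "/" + name
        locks.append((prefix, "read"))
    locks.append((prefix + "/" + parts[-1], "write" if write else "read"))
    return locks


def conflicts(a, b) -> bool:
    """Two lock sets conflict if they share a path and either side wants to write."""
    held = dict(a)
    return any(path in held and "write" in (mode, held[path]) for path, mode in b)
```

Creating `/home/user/a` and `/home/user/b` at the same time only shares read locks on `/home` and `/home/user`, so the two do not conflict. An operation that write-locks `/home/user` itself does conflict with creating a file under it.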
Replica Placement • Maximize data reliability and availability, and maximize network bandwidth utilization • Spread chunk replicas across racks
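Spreading replicas across racks can be illustrated with a simple round-robin over racks. This is a deliberately reduced sketch: the real policy weighs further factors such as disk utilization, and the function name and the `servers` mapping (chunkserver name to rack ID) are hypothetical.

```python
def place_replicas(servers: dict[str, str], count: int = 3) -> list[str]:
    """Pick `count` chunkservers, taking one per rack before reusing any rack."""
    by_rack: dict[str, list[str]] = {}
    for server, rack in sorted(servers.items()):
        by_rack.setdefault(rack, []).append(server)
    queues = [by_rack[rack] for rack in sorted(by_rack)]
    chosen: list[str] = []
    # Each pass takes at most one server from every rack that still has one.
    while len(chosen) < count and any(queues):
        for queue in queues:
            if queue and len(chosen) < count:
                chosen.append(queue.pop(0))
    return chosen
```

With servers on three racks, the three replicas land on three different racks, so losing a whole rack still leaves two copies reachable.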
Creation, Re-replication, Rebalancing • Chunk replicas are created for three reasons • Creation • Re-replication • Rebalancing
Garbage Collection • A deleted file is renamed to a hidden name that includes the deletion timestamp • The hidden file is kept for 3 days before it is removed • Orphaned chunk (not reachable from any file) -> garbage • advantage vs disadvantage
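The lazy deletion scheme can be sketched as a namespace with a second table for hidden files. The class below is an illustration under assumptions: the hidden-name format and all method names are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
GRACE_SECONDS = 3 * 24 * 3600  # the 3-day window from the slide


class Namespace:
    def __init__(self):
        self.files = {}    # visible name -> list of chunk handles
        self.hidden = {}   # hidden name -> (deletion time, chunk handles)

    def delete(self, name, now):
        """Deletion only renames the file; its chunks stay where they are."""
        chunks = self.files.pop(name)
        hidden_name = f"{name}.deleted.{int(now)}"   # hypothetical naming scheme
        self.hidden[hidden_name] = (now, chunks)
        return hidden_name

    def undelete(self, hidden_name, name):
        """Within the grace period, renaming back restores the file."""
        _, chunks = self.hidden.pop(hidden_name)
        self.files[name] = chunks

    def scan(self, now):
        """Regular scan: drop hidden files older than the grace period."""
        expired = [h for h, (t, _) in self.hidden.items() if now - t > GRACE_SECONDS]
        for h in expired:
            del self.hidden[h]
        return expired

    def orphans(self, reported):
        """Chunks a chunkserver reports that no file (visible or hidden) references."""
        live = {c for chunks in self.files.values() for c in chunks}
        live |= {c for _, chunks in self.hidden.values() for c in chunks}
        return set(reported) - live
```

Until the scan removes the hidden entry, the chunks of a deleted file still count as live; afterwards they show up as orphans and can be reclaimed.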
Fault Tolerance and diagnosis • High Availability • Fast Recovery • Replication • chunk Replication • Master Replication
Data integrity • Each chunkserver uses checksums • To detect corruption of stored data • Kept in memory -> fast lookup / comparison • Optimized for record append
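The GFS paper describes splitting each chunk into 64 KB blocks with a 32-bit checksum per block, and updating only the last partial block's checksum on append. The sketch below follows that layout; using CRC-32 from `zlib` is an assumption (the checksum function is not named in the slides), and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks


def block_checksums(data: bytes) -> list[int]:
    """One 32-bit checksum per 64 KB block of `data`."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


def verify(data: bytes, checksums: list[int]) -> list[int]:
    """Return the indices of blocks whose checksum no longer matches."""
    current = block_checksums(data)
    return [i for i, (a, b) in enumerate(zip(current, checksums)) if a != b]


def append(data: bytearray, checksums: list[int], new: bytes) -> None:
    """Append `new`, extending the last partial block's checksum incrementally."""
    tail = len(data) % BLOCK_SIZE
    if tail and new:
        room = BLOCK_SIZE - tail
        # zlib.crc32 accepts a running value, so the old bytes are not re-read.
        checksums[-1] = zlib.crc32(new[:room], checksums[-1])
        data += new[:room]
        new = new[room:]
    data += new
    checksums.extend(block_checksums(new))
```

The incremental update is what makes appends cheap: only the new bytes are hashed, while a read would still verify whole blocks with `verify` before returning data.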