
The Google File System



  1. The Google File System • Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, Google • SOSP 2003 (19th ACM Symposium on Operating Systems Principles)

  2. Contents • Introduction • GFS Design • Measurements • Conclusion

  3. Introduction (1/2) • What is a File System? • A method of storing and organizing computer files and their data. • Used on data storage devices such as hard disks or CD-ROMs to maintain the physical location of files. • What is a Distributed File System? • Makes it possible for multiple users on multiple machines to share files and storage resources via a computer network. • Transparency in Distributed Systems • Makes a distributed system as easy to use and manage as a centralized system • Gives a single-system image • Typically realized as network software operating as a client-server system

  4. Introduction (2/2) • What is the Google File System? • A scalable distributed file system for large distributed data-intensive applications. • Shares many of the same goals as previous distributed file systems • performance, scalability, reliability, availability • Design points where GFS differs from traditional choices • Component failures are the norm rather than the exception • Files are huge by traditional standards • Multi-GB files are common • Most files are mutated by appending new data rather than overwriting existing data • Co-designing the applications and the file system API benefits the overall system by increasing flexibility

  5. Contents • Introduction • GFS Design • Measurements • Conclusion 1. Design Assumption 2. Architecture 3. Features 4. System Interactions 5. Master Operation 6. Fault Tolerance

  6. GFS Design 1. Design Assumption • Component failures are the norm • GFS runs on large numbers of inexpensive commodity machines that are individually unreliable and fail often • Scale out (many cheap machines) rather than scale up (fewer expensive ones) • Problems: application bugs, operating system bugs, human errors, and failures of disks, memory, connectors, networking, and power supplies • Solutions: constant monitoring, error detection, fault tolerance, and automatic recovery • [Image: Google server computer]

  7. GFS Design 1. Design Assumption • Files are HUGE • Multi-GB file sizes are the norm • Parameters for I/O operations and block sizes have to be revisited • File access model: read / append only (not overwriting) • Most reads are sequential: large streaming reads plus some small random reads • Data streams are generated continuously by running applications • Large, sequential writes that append data to files are also common in the workloads • Appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal

  8. GFS Design 1. Design Assumption • Multiple clients concurrently append to the same file • Atomicity with minimal synchronization overhead is essential • High sustained bandwidth is more important than low latency • Co-designing the applications and the file system API benefits the overall system by increasing flexibility

  9. GFS Design 2. Architecture • GFS Cluster Components • 1. A single master • 2. Multiple chunkservers • 3. Multiple clients

  10. GFS Design 2. Architecture • GFS Master • Maintains all file system metadata: namespace, access control info, file-to-chunk mapping, chunk (including replica) locations, etc. • Global knowledge enables the master to make sophisticated chunk placement and replication decisions • Periodically communicates with chunkservers via HeartBeat messages to give instructions and collect state • Must keep its involvement in common operations minimal so it does not become a bottleneck
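
As a rough illustration of the HeartBeat exchange described above, the sketch below shows how a master might track chunkserver liveness and the chunks each server reports. The class name and the 60-second staleness window are assumptions of this sketch, not details from the paper.

```python
import time

class HeartbeatTracker:
    """Illustrative master-side bookkeeping for HeartBeat messages.
    Names and the staleness window are assumptions, not GFS internals."""

    def __init__(self, stale_after_s=60.0):
        self.stale_after_s = stale_after_s   # assumed liveness window
        self.last_seen = {}                  # chunkserver id -> last HeartBeat time
        self.reported_chunks = {}            # chunkserver id -> set of chunk handles

    def on_heartbeat(self, server_id, chunk_handles):
        """A chunkserver reports (a subset of) the chunks it holds."""
        self.last_seen[server_id] = time.time()
        self.reported_chunks.setdefault(server_id, set()).update(chunk_handles)

    def dead_servers(self):
        """Chunkservers that have missed the assumed liveness window."""
        now = time.time()
        return [s for s, t in self.last_seen.items() if now - t > self.stale_after_s]

tracker = HeartbeatTracker()
tracker.on_heartbeat("cs-01", ["chunk-a", "chunk-b"])
print(tracker.dead_servers())   # [] right after a fresh HeartBeat
```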

  11. GFS Design 2. Architecture • GFS Chunkserver • Files are broken into chunks • Each chunk has an immutable, globally unique 64-bit chunk handle assigned by the master at chunk creation • Chunk size is a fixed 64 MB • Pros • fewer interactions between client and master • less network overhead between client and chunkserver • smaller metadata footprint on the master • Cons • a small file stored in a single chunk can become a hot spot • Each chunk is replicated on 3 (default) chunkservers
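
Because chunks are a fixed 64 MB, a client can compute which chunk a byte offset falls into without contacting the master; a minimal sketch of that translation (function names are illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunk size

def chunk_index(byte_offset):
    """Index of the chunk that holds the given byte offset within a file."""
    return byte_offset // CHUNK_SIZE

def chunks_touched(start, length):
    """All chunk indexes covered by an operation of `length` bytes at `start`."""
    first = chunk_index(start)
    last = chunk_index(start + length - 1)
    return list(range(first, last + 1))

# A 200 MB streaming read starting at byte 0 spans only chunks 0..3,
# so the client needs very few master lookups for it.
print(chunks_touched(0, 200 * 1024 * 1024))   # [0, 1, 2, 3]
```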

  12. GFS Design 2. Architecture • GFS Client • Linked into applications as a library implementing the file system API • Communicates with the master and chunkservers for reading and writing • Master interactions only for metadata • Chunkserver interactions for data • Caches only metadata, never file data • Data is too large to cache
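
A minimal sketch of the client read path implied by this slide: translate the offset to a chunk index, consult the metadata cache (falling back to one master lookup), then fetch the data from a chunkserver replica. The `FakeMaster` / `FakeChunkserver` stand-ins and all method names are assumptions, not the real GFS RPC interfaces.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

class FakeChunkserver:
    """Stand-in for a chunkserver holding raw chunk data (illustration only)."""
    def __init__(self, data_by_handle):
        self.data = data_by_handle

    def read(self, handle, offset_in_chunk, length):
        return self.data[handle][offset_in_chunk:offset_in_chunk + length]

class FakeMaster:
    """Stand-in for the master's metadata lookup (illustration only)."""
    def __init__(self, table):
        self.table = table   # (filename, chunk index) -> (handle, [replica servers])

    def lookup(self, filename, index):
        return self.table[(filename, index)]

class ClientSketch:
    """Client library sketch: caches metadata only, never file data."""
    def __init__(self, master):
        self.master = master
        self.cache = {}   # (filename, chunk index) -> (handle, replicas)

    def read(self, filename, offset, length):
        index = offset // CHUNK_SIZE
        key = (filename, index)
        if key not in self.cache:               # metadata miss: one master lookup
            self.cache[key] = self.master.lookup(filename, index)
        handle, replicas = self.cache[key]
        return replicas[0].read(handle, offset % CHUNK_SIZE, length)

cs = FakeChunkserver({"h1": b"hello, gfs"})
client = ClientSketch(FakeMaster({("/logs/a", 0): ("h1", [cs])}))
print(client.read("/logs/a", 0, 5))   # b'hello'
```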

  13. GFS Design 3. Features • Metadata • The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas • All metadata is kept in the master's memory (less than 64 bytes per 64 MB chunk) • For recovery, the first two types are kept persistent by logging mutations to an operation log that is replicated on remote machines • The master periodically scans its entire metadata state in the background for chunk garbage collection, re-replication after failures, and chunk migration for load balancing
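
The three metadata types might be laid out as in the sketch below; the dataclass and field names are assumptions, and the key point is that chunk locations are not logged because they can be rebuilt from chunkserver reports.

```python
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    """Illustrative layout of the master's in-memory metadata (names assumed)."""
    # file and chunk namespaces -- persisted via the operation log
    namespace: set = field(default_factory=set)
    # filename -> list of chunk handles -- persisted via the operation log
    file_to_chunks: dict = field(default_factory=dict)
    # chunk handle -> chunkserver ids -- NOT logged; rebuilt at startup from
    # what chunkservers report in HeartBeat messages
    chunk_locations: dict = field(default_factory=dict)

meta = MasterMetadata()
meta.namespace.add("/logs/web-00")
meta.file_to_chunks["/logs/web-00"] = ["h1", "h2"]
meta.chunk_locations["h1"] = ["cs-01", "cs-05", "cs-09"]
```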

  14. GFS Design 3. Features • Operation log • A historical record of critical metadata changes • Defines the order of concurrent operations (serving as a logical timeline) • Critical! • Replicated on multiple remote machines • The master responds to a client operation only after flushing the corresponding log record to disk both locally and remotely • Checkpoints • File system state is recovered by replaying the operation log from the latest checkpoint • The master checkpoints whenever the log grows beyond a certain size • A few older checkpoints and log files are kept to guard against catastrophes
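
The discipline of "flush locally and remotely before acknowledging" and "checkpoint when the log grows too large" could look roughly like the sketch below. The file layout, JSON encoding, replication callables, and the threshold are all assumptions of this sketch.

```python
import json, os

class OperationLogSketch:
    """Illustration of the operation-log discipline: a mutation is acknowledged
    only after the log record is durable locally and on remote replicas."""

    def __init__(self, path, remote_replicas, checkpoint_every_bytes=64 * 1024 * 1024):
        self.log = open(path, "ab")
        self.remotes = remote_replicas    # callables that persist a record remotely
        self.threshold = checkpoint_every_bytes   # assumed checkpoint threshold

    def append(self, record):
        line = (json.dumps(record) + "\n").encode()
        self.log.write(line)
        self.log.flush()
        os.fsync(self.log.fileno())       # durable locally...
        for replicate in self.remotes:
            replicate(line)               # ...and on remote machines
        # only now may the master respond to the client
        if self.log.tell() > self.threshold:
            self.checkpoint()

    def checkpoint(self):
        # In GFS the master serializes its in-memory state so that recovery only
        # needs the latest checkpoint plus the log records written after it.
        pass  # omitted in this sketch
```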

  15. GFS Design 4. System Interactions • Mutation • Changes the contents or metadata of a chunk • A write or an append operation • Performed at all of the chunk's replicas • Leases are used to maintain a consistent mutation order across the replicas while minimizing management overhead • Lease • Granted by the master to one of the replicas, which becomes the primary • The primary picks a serial order for mutations and all replicas follow it • Global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary • 60-second timeout, can be extended • Can be revoked
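
A rough sketch of master-side lease bookkeeping, using the 60-second timeout mentioned above; the class and method names are illustrative, not the actual GFS interface.

```python
import time

LEASE_SECONDS = 60.0   # initial lease timeout from the paper

class LeaseSketch:
    """Illustrative master-side lease bookkeeping per chunk (not the real API)."""

    def __init__(self):
        self.leases = {}   # chunk handle -> (primary replica, expiry time)

    def grant(self, handle, primary):
        self.leases[handle] = (primary, time.time() + LEASE_SECONDS)
        return primary

    def extend(self, handle):
        # extension requests are piggybacked on HeartBeat messages in GFS
        primary, _ = self.leases[handle]
        self.leases[handle] = (primary, time.time() + LEASE_SECONDS)

    def revoke(self, handle):
        # e.g. before a snapshot, so no mutation can slip past the master
        self.leases.pop(handle, None)

    def primary_for(self, handle):
        entry = self.leases.get(handle)
        if entry and entry[1] > time.time():
            return entry[0]
        return None   # expired or never granted: the master must grant a new lease
```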

  16. GFS Design 4. System Interactions • Client • Requests a new file to write (1) • Master • Adds the file to the namespace • Selects 3 chunkservers • Designates the primary replica and grants a lease • Replies to the client (2) • Client • Sends data to all replicas (3) • Notifies the primary when the data is sent (4) • Primary • Writes data in order • Increments the chunk version • Sequences the secondary writes (5) • Secondaries • Write data in the assigned order • Increment the chunk version • Notify the primary when the write is finished (6) • Primary • Notifies the client when the write is finished (7) ※ Data write flow
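
The serial-ordering role of the primary (steps 5 to 7 above) can be sketched as follows; the classes are illustrative stand-ins, and the separate data-push phase (step 3) is omitted.

```python
class SecondaryReplicaSketch:
    """Secondary replica: applies mutations in the order the primary assigned."""
    def __init__(self):
        self.applied = []

    def apply(self, serial, mutation):
        self.applied.append((serial, mutation))   # step 6: apply in primary order
        return True

class PrimaryReplicaSketch:
    """Primary replica: picks a serial order for concurrent mutations and
    forwards that order to the secondaries (illustration only)."""
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.next_serial = 0
        self.applied = []

    def write(self, mutation):
        serial = self.next_serial          # step 5: primary assigns the order
        self.next_serial += 1
        self.applied.append((serial, mutation))
        acks = [s.apply(serial, mutation) for s in self.secondaries]
        return all(acks)                   # step 7: reply only if every replica succeeded

secondaries = [SecondaryReplicaSketch(), SecondaryReplicaSketch()]
primary = PrimaryReplicaSketch(secondaries)
print(primary.write(b"record-1"))   # True when all replicas applied the mutation
```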

  17. GFS Design 4. System Interactions • Atomic record append operation • The client specifies only the data; in a traditional write, the client specifies the offset at which data is to be written • GFS appends the data atomically at least once, as one continuous sequence of bytes • Returns an offset of GFS's choosing to the client • Snapshot • Makes a copy of a file or a directory tree almost instantaneously, while minimizing any interruptions • Steps • Revoke outstanding leases • Duplicate the metadata, pointing at the same chunks • When a client first writes to a chunk after the snapshot, the chunkserver creates a real local copy of the chunk (copy-on-write)
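
A toy sketch of the record-append contract: the client hands over only the data and gets back the offset GFS chose. The padding/retry path for a full chunk is only signalled, not implemented, and all names are assumptions.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class ChunkSketch:
    """Illustration of record append: the client supplies only data and the
    system returns the offset it chose (simplified, single-replica view)."""

    def __init__(self):
        self.data = bytearray()

    def record_append(self, record):
        if len(self.data) + len(record) > CHUNK_SIZE:
            # real GFS pads the rest of the chunk and has the client retry on
            # a new chunk; here we just signal that case
            return None
        offset = len(self.data)     # GFS, not the client, picks the offset
        self.data.extend(record)
        return offset

chunk = ChunkSketch()
print(chunk.record_append(b"event-1\n"))   # 0
print(chunk.record_append(b"event-2\n"))   # 8
```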

  18. GFS Design 5. Master Operation • New chunk creation policy • Chunks are created and placed by the master • New replicas go to chunkservers with below-average disk utilization • Limit the number of "recent" creations on each chunkserver • Spread replicas of a chunk across racks • Re-replication • A chunk is re-replicated as soon as the number of available replicas falls below a user-specified goal, for example when a chunkserver becomes unavailable • Rebalancing • Replicas are periodically rebalanced for better disk space usage and load balancing • A new chunkserver is gradually filled up
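
Two of the placement criteria above (below-average disk utilization and rack spreading) could be combined roughly as in this sketch; the server-record fields and the two-pass strategy are assumptions, and the "recent creations" limit is not modelled.

```python
from statistics import mean

def place_new_replicas(servers, want=3):
    """Illustrative placement policy: prefer chunkservers with below-average
    disk utilization and spread the chosen replicas across racks.
    `servers` is a list of dicts like {"id": ..., "rack": ..., "disk_util": ...}."""
    avg = mean(s["disk_util"] for s in servers)
    candidates = sorted(servers, key=lambda s: s["disk_util"])
    chosen, racks_used = [], set()
    for s in candidates:                  # first pass: one replica per rack,
        if len(chosen) == want:           # favouring emptier-than-average disks
            break
        if s["rack"] not in racks_used and s["disk_util"] <= avg:
            chosen.append(s)
            racks_used.add(s["rack"])
    for s in candidates:                  # second pass: fill any remainder
        if len(chosen) == want:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["id"] for s in chosen]

servers = [
    {"id": "cs-1", "rack": "r1", "disk_util": 0.30},
    {"id": "cs-2", "rack": "r1", "disk_util": 0.40},
    {"id": "cs-3", "rack": "r2", "disk_util": 0.55},
    {"id": "cs-4", "rack": "r3", "disk_util": 0.70},
]
print(place_new_replicas(servers))   # ['cs-1', 'cs-2', 'cs-3']
```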

  19. GFS Design 5. Master Operation • Garbage collection • When a client deletes a file, the master logs the deletion like any other change and renames the file to a hidden name • During its regular namespace scan, the master removes hidden files that have existed for more than 3 days; their metadata is erased at that point • In HeartBeat messages, each chunkserver reports a subset of the chunks it holds, and the master replies with the chunks that no longer appear in its metadata • The chunkserver is then free to delete those chunks on its own
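
Lazy deletion with a hidden rename plus a periodic scan might look like the sketch below; the hidden-name format is an assumption, and the 3-day grace period comes from the slide.

```python
import time

GRACE_PERIOD_S = 3 * 24 * 3600    # hidden for more than 3 days -> reclaimed

class NamespaceGCSketch:
    """Illustration of lazy deletion: a delete renames the file to a hidden,
    timestamped name; a periodic namespace scan reclaims old hidden files."""

    def __init__(self):
        self.files = {}   # filename -> list of chunk handles

    def delete(self, name):
        # logged like any other mutation; chunk data is not touched yet
        hidden = f".deleted.{int(time.time())}.{name}"
        self.files[hidden] = self.files.pop(name)

    def scan(self, now=None):
        """Namespace scan: erase metadata of hidden files past the grace period.
        Chunkservers later learn via HeartBeat replies that these chunk handles
        are orphaned and delete the chunks on their own."""
        now = now or time.time()
        reclaimed = []
        for name in list(self.files):
            if name.startswith(".deleted."):
                hidden_at = int(name.split(".")[2])
                if now - hidden_at > GRACE_PERIOD_S:
                    reclaimed += self.files.pop(name)
        return reclaimed
```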

  20. GFS Design 6. Fault Tolerance • High Availability • Fast recovery • Master and chunkservers can restart in seconds • Chunk Replication • Multiple chunkservers on different racks • Master Replication • Log and checkpoints are replicated • Master failures? • Monitoring infrastructure outside GFS starts a new master process, reachable through a DNS alias

  21. GFS Design 6. Fault Tolerance • Data Integrity • Checksums are used to detect data corruption • A chunk is broken up into 64 KB blocks, each with a 32-bit checksum • No error propagation • For reads, the chunkserver verifies the checksum before returning data • If a block does not match the recorded checksum, the chunkserver returns an error and the requestor reads from another replica • Record append • The checksum for the last partial block is updated incrementally • Corruption is detected when the block is later read • During idle periods, chunkservers scan and verify the contents of inactive chunks
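
The block-level checksum check on the read path can be sketched as below; CRC32 is used as a stand-in 32-bit checksum, since the slide only specifies the checksum width, not the algorithm.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # each chunk is divided into 64 KB blocks

def checksum_blocks(chunk_data):
    """Compute a 32-bit checksum (CRC32 here, as an assumption) per 64 KB block."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verified_read(chunk_data, checksums, offset, length):
    """Chunkserver-side read: verify the checksum of every block the read
    touches before returning data; a mismatch surfaces as an error so the
    client can fall back to another replica."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}; read another replica")
    return chunk_data[offset:offset + length]

data = b"x" * (200 * 1024)                # a 200 KB chunk fragment
sums = checksum_blocks(data)
print(len(verified_read(data, sums, 70 * 1024, 10 * 1024)))   # 10240
```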

  22. GFS Design 6. Fault Tolerance • Master Failure • Operation log • Persistent record of changes to master metadata • Used to replay events on failure • Replicated to multiple machines for recovery • Flushed to disk before responding to the client • Master state is checkpointed at intervals to keep the operation log small • Master recovery requires • The latest checkpoint file • The subsequent operation log • Master recovery was initially a manual operation • Then automated outside of GFS to within 2 minutes • Now down to tens of seconds
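
Recovery from "latest checkpoint plus subsequent log" reduces to a replay loop like the one below; the record format (JSON with an `op` field) and the operation names are assumptions of this sketch.

```python
import json

def recover_master_state(checkpoint_records, log_lines):
    """Illustration of master recovery: start from the latest checkpoint and
    replay the operation-log records written after it."""
    files = dict(checkpoint_records)        # filename -> list of chunk handles
    for line in log_lines:                  # replay in log order
        rec = json.loads(line)
        if rec["op"] == "create":
            files[rec["name"]] = []
        elif rec["op"] == "add_chunk":
            files[rec["name"]].append(rec["handle"])
        elif rec["op"] == "delete":
            files.pop(rec["name"], None)
    return files

checkpoint = [("/logs/a", ["h1"])]
log = ['{"op": "add_chunk", "name": "/logs/a", "handle": "h2"}',
       '{"op": "create", "name": "/logs/b"}']
print(recover_master_state(checkpoint, log))
# {'/logs/a': ['h1', 'h2'], '/logs/b': []}
```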

  23. GFS Design 6. Fault Tolerance • Chunkserver Failure • Heartbeats are sent from chunkservers to the master, which detects chunkserver failure • If a chunkserver goes down: • The chunk replica count is decremented on the master • The master re-replicates missing chunks as needed • Three chunk replicas by default (may vary) • Priority for chunks with lower replica counts • Priority for chunks that are blocking clients • Throttling per cluster and per chunkserver • No distinction between normal and abnormal termination • Chunkservers are routinely killed for maintenance
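
The prioritization rules above (fewest live replicas first, then chunks blocking clients) can be expressed as a simple sort, as in this sketch; the per-chunk record structure is an assumption and the throttling limits are not modelled.

```python
def rereplication_order(chunks, goal=3):
    """Illustrative re-replication priority: chunks furthest below the
    replication goal first, and chunks currently blocking clients ahead of
    equally degraded ones. `chunks` maps handle ->
    {"live_replicas": int, "blocking_client": bool}."""
    degraded = {h: c for h, c in chunks.items() if c["live_replicas"] < goal}
    return sorted(degraded,
                  key=lambda h: (degraded[h]["live_replicas"],        # fewest replicas first
                                 not degraded[h]["blocking_client"])) # then blocked clients

chunks = {
    "h1": {"live_replicas": 2, "blocking_client": False},
    "h2": {"live_replicas": 1, "blocking_client": False},
    "h3": {"live_replicas": 2, "blocking_client": True},
}
print(rereplication_order(chunks))   # ['h2', 'h3', 'h1']
```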

  24. Contents • Introduction • GFS Design • Measurements • Conclusion

  25. Measurements (1/5) • Micro-benchmarks • The GFS cluster consists of • 1 master, 2 master replicas • 16 chunkservers • 16 clients • Machines are configured with • Dual 1.4 GHz PIII processors • 2 GB of RAM • Two 80 GB 5400 rpm disks • 100 Mbps full-duplex Ethernet connection to one of two HP 2524 switches • The two switches are connected with a 1 Gbps link

  26. Measurements (2/5) • Real-world clusters • Cluster A: • Used by over a hundred engineers • A typical task is initiated by a user and runs for a few hours • Tasks read MBs to TBs of data, transform/analyze it, and write the results back • Cluster B: • Used for production data processing • A typical task runs much longer than a Cluster A task • Continuously generates and processes multi-TB data sets • Human users are rarely involved • The clusters had been running for about a week when the measurements were taken

  27. Measurements (3/5) • Real-world clusters • Hundreds of chunkservers in each cluster • On average, file size in cluster B is roughly triple that in cluster A • Metadata at the chunkservers: chunk checksums and chunk version numbers • Metadata at the master is small (48 MB and 60 MB for clusters A and B, respectively), so the master recovers from a crash within seconds

  28. Measurements (4/5) • Real-world clusters (performance metrics for two GFS clusters) • Far more reads than writes • Both clusters were in the middle of heavy read activity • Cluster B was in the middle of a burst of write activity • In both clusters, the master was receiving 200-500 operations per second, so the master is not a bottleneck • Recovery experiment: a single chunkserver in B was killed • It held 15,000 chunks containing 600 GB of data • All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s

  29. Measurements (5/5) • Workload breakdown • Chunkserver workload • Bimodal distribution of small and large operations • Ratio of write to append operations: 3:1 to 8:1 • Virtually no overwrites • Master workload • Most requests are for chunk locations and file opens • Reads achieve 75% of the network limit • Writes achieve 50% of the network limit

  30. Contents • Introduction • GFS Design • Measurements • Conclusion

  31. Conclusion • GFS demonstrates how to support large-scale processing workloads on commodity hardware • GFS occupies a different point in the design space • Component failures are treated as the norm • Optimized for huge files • Most files are mutated by appending new data • GFS provides fault tolerance • Constant monitoring • Replicating data • Fast and automatic recovery • Checksumming • High aggregate throughput
