Google File System (GFS)


Presentation Transcript


  1. Fault Tolerance and Recovery Google File System (GFS)

  2. Agenda • Introduction • Design • Fault Tolerance & Recovery • Issues • Future

  3. Cloud Architecture

  4. Cloud Applications – App Engine

  5. Built on Commodity Hardware Cells • Component failures are the norm • Guaranteed some will never recover

  6. Original Use Case: Google Search • 24 MapReduce operations in the indexing pipeline • Multi-GB files are common • Writes are append-only with no overwriting • Reads are either large and sequential or small and random

  7. Design Assumptions • Built on commodity hardware that will fail regularly • A few million large files per cluster • Large sequential reads, small random reads • Large sequential writes - append only • Efficient concurrent access - producer/consumer • High sustained bandwidth over low latency • No POSIX support – this is a cloud file system

  8. Interface • Separate GFS client library (front-end) • No POSIX support, no v-node integration • Supports: create, delete, open, close, read, write • recordAppend() • Supports efficient multi-way merge • E.g. 1,000 MapReduce producers appending to the same file • Atomically appends each record at least once • Loose consistency across replicas • Idempotent record processing required • snapshot() – an efficient file clone()
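
Because recordAppend() only guarantees at-least-once delivery, a reader may see the same record more than once after a retry. The sketch below shows one way a consumer could make its processing idempotent. It assumes each writer embeds a unique record ID in every record; that ID scheme and the `deduplicate` helper are illustrative and not part of the GFS client library.

```python
from typing import Iterable, Iterator, Tuple

# (record_id, payload); the record ID scheme is a hypothetical writer convention
Record = Tuple[str, bytes]


def deduplicate(records: Iterable[Record]) -> Iterator[Record]:
    """Yield each record once, skipping repeats left behind by retried appends."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # duplicate produced by a retried append
        seen.add(record_id)
        yield record_id, payload


# A file as a reader might see it after the append of "r2" was retried.
appended = [("r1", b"alpha"), ("r2", b"beta"), ("r2", b"beta"), ("r3", b"gamma")]
unique = list(deduplicate(appended))
```

Here `unique` keeps the first copy of each record in file order, so downstream processing sees every record exactly once.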

  9. Architecture

  10. Single Master Design • Simplifies the overall design problem • Central place to control replication, GC, etc. • Able to roll out GFS in 1 year with 3 engineers – time to market • Master stores metadata in main memory • File namespaces • File to chunk mappings • Chunk replica locations • Chunk servers provide authoritative list of chunk versions • Discovery simplifies membership changes, failures, etc. • Metadata • Only a few million files • 64 bytes per 64MB chunk – so it fits in master’s main memory • Checkpointed to disk at intervals in B-tree format for fast startup
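
The claim that metadata fits in the master's memory can be checked with simple arithmetic. The sketch below uses the slide's figures (64 bytes of metadata per 64MB chunk) and assumes binary units and completely full chunks; the function name is illustrative.

```python
CHUNK_SIZE = 64 * 1024**2   # 64 MB per chunk (from the slide)
METADATA_PER_CHUNK = 64     # bytes of master metadata per chunk (from the slide)


def master_metadata_bytes(stored_bytes: int) -> int:
    """Estimate chunk metadata held in master memory, assuming full chunks."""
    chunks = -(-stored_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


one_pb = 1024**5                                  # 1 PB in binary units (assumption)
metadata_for_1pb = master_metadata_bytes(one_pb)  # 1024**3 bytes, i.e. 1 GiB
```

Under these assumptions, 1 PB of full chunks needs about 1 GiB of chunk metadata. Many small files break the full-chunk assumption, since every file occupies at least one chunk regardless of its size.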

  11. Use Case: Write to a new file • Client • Requests new file (1) • Master • Adds file to namespace • Selects 3 chunk servers • Designates chunk primary and grants lease • Replies to client (2) • Client • Sends data to all replicas (3) • Notifies primary when sent (4) • Primary • Writes data in order • Increments chunk version • Sequences secondary writes (5) • Secondary • Writes data in sequence order • Increments chunk version • Notifies primary when write finished (6) • Primary • Notifies client when write finished (7)
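
Steps 3 to 7 of the write flow above can be sketched as a small in-memory simulation. This is not GFS code: the `Replica` and `Primary` classes, the data IDs, and the server names are hypothetical, and chunk versions, leases, and failures are left out. It only illustrates that data is pushed to every replica first and that the primary then fixes a single mutation order for all of them.

```python
class Replica:
    """In-memory stand-in for one chunk replica (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.buffer = {}   # data pushed by the client, keyed by data ID (step 3)
        self.chunk = []    # mutations applied so far, in order

    def push(self, data_id, data):
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        # A replica applies mutations strictly in the primary's serial order.
        assert serial == len(self.chunk), "mutation applied out of order"
        self.chunk.append(self.buffer.pop(data_id))


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries

    def write(self, data_id):
        serial = len(self.chunk)            # primary picks the order (step 5)
        self.apply(serial, data_id)
        for secondary in self.secondaries:  # secondaries follow it (steps 5-6)
            secondary.apply(serial, data_id)
        return serial                       # acknowledge the client (step 7)


secondaries = [Replica("cs2"), Replica("cs3")]
primary = Primary("cs1", secondaries)
for data_id, data in [("a", b"first"), ("b", b"second")]:
    for replica in [primary, *secondaries]:
        replica.push(data_id, data)         # step 3: data goes to every replica
    primary.write(data_id)                  # step 4: client notifies the primary
```

After the loop, all three replicas hold the same mutations in the same order.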

  12. Chunk Server Failure • Heartbeats sent from chunk server to master • Master detects chunk server failure • If chunk server goes down: • Chunk replica count is decremented on master • Master re-replicates missing chunks as needed • 3 chunk replicas is default (may vary) • Priority for chunks with lower replica counts • Priority for blocked clients • Throttling per cluster and chunk server • No difference in normal/abnormal termination • Chunk servers are routinely killed for maintenance
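
The re-replication priorities above can be sketched with a heap. The slide names two criteria (lower replica count, blocked clients) but does not say how they combine; the ordering below, replica count first and blocked clients as a tie-breaker, is an assumption, as are the function name and input format.

```python
import heapq

TARGET_REPLICAS = 3  # default replica count from the slide


def replication_order(chunks):
    """Return IDs of under-replicated chunks, most urgent first.

    `chunks` maps chunk ID -> (live_replicas, client_blocked).
    """
    heap = []
    for chunk_id, (live, blocked) in chunks.items():
        if live < TARGET_REPLICAS:
            # Fewer live replicas sorts first; a blocked client breaks ties.
            heapq.heappush(heap, (live, not blocked, chunk_id))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


chunks = {
    "c1": (2, False),
    "c2": (1, False),
    "c3": (2, True),
    "c4": (3, False),  # fully replicated, needs no repair
}
order = replication_order(chunks)
```

With this input, `order` is `["c2", "c3", "c1"]`: the chunk with a single live replica comes first, then the chunk a client is waiting on.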

  13. Chunk Corruption • 32-bit checksums • 64MB chunks split into 64KB blocks • Each 64KB block has a 32-bit checksum • Chunk server maintains checksums • Checksums are optimized for recordAppend() • Verified for all reads and overwrites • Not verified during recordAppend() – only on next read • Chunk servers verify checksums when idle • If a corrupt chunk is detected: • Chunk server returns an error to the client • Master notified, replica count decremented • Master initiates new replica creation • Master tells chunk server to delete corrupted chunk
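
The per-block checksum scheme above can be sketched as follows. The slide only says the checksums are 32 bits wide; using CRC-32 from Python's `zlib` module is an assumption made for illustration, and the helper names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as on the slide


def block_checksums(chunk: bytes):
    """Compute one 32-bit checksum per 64 KB block (CRC-32 is an assumption)."""
    return [
        zlib.crc32(chunk[i:i + BLOCK_SIZE])
        for i in range(0, len(chunk), BLOCK_SIZE)
    ]


def corrupt_blocks(chunk: bytes, checksums):
    """Return indexes of blocks whose data no longer matches its checksum."""
    current = block_checksums(chunk)
    return [i for i, (got, want) in enumerate(zip(current, checksums)) if got != want]


chunk = bytes(200 * 1024)       # 200 KB of zeros, which spans 4 blocks
sums = block_checksums(chunk)
damaged = bytearray(chunk)
damaged[70 * 1024] ^= 0xFF      # flip bits at offset 70 KB, inside block 1
bad = corrupt_blocks(bytes(damaged), sums)
```

Only block 1 is reported, so a read that touches other blocks of the same chunk would still pass verification. Keeping checksums per block rather than per chunk is what lets a chunk server verify just the blocks a read covers.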

  14. Master Failure • Operations Log • Persistent record of changes to master metadata • Used to replay events on failure • Replicated to multiple machines for recovery • Flushed to disk before responding to client • Checkpoint of master state at intervals to keep ops log file small • Master recovery requires • Latest checkpoint file • Subsequent operations log file • Master recovery was initially a manual operation • Then automated outside of GFS to within 2 minutes • Now down to tens of seconds
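
Master recovery from a checkpoint plus the subsequent operations log can be sketched as a replay loop. The record layout, operation names, and sequence numbers below are hypothetical; the slide only states that recovery needs the latest checkpoint and the log entries written after it.

```python
def recover(checkpoint, log):
    """Rebuild the namespace from a checkpoint plus later log entries.

    `checkpoint` is (last_applied_seq, {path: [chunk_ids]}).
    `log` is a list of (seq, op, path, arg) tuples in sequence order.
    """
    last_seq, snapshot = checkpoint
    state = {path: list(chunks) for path, chunks in snapshot.items()}
    for seq, op, path, arg in log:
        if seq <= last_seq:
            continue  # already reflected in the checkpoint
        if op == "create":
            state[path] = []
        elif op == "add_chunk":
            state[path].append(arg)
        elif op == "delete":
            state.pop(path, None)
        last_seq = seq
    return last_seq, state


checkpoint = (2, {"/logs/a": ["c1"]})
log = [
    (1, "create", "/logs/a", None),
    (2, "add_chunk", "/logs/a", "c1"),
    (3, "add_chunk", "/logs/a", "c2"),
    (4, "create", "/logs/b", None),
    (5, "delete", "/logs/b", None),
]
seq, namespace = recover(checkpoint, log)
```

Entries 1 and 2 are skipped because the checkpoint already covers them, which is why frequent checkpoints keep both the log and the recovery time short.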

  15. High Availability • Chunk replication / active failover • Master replication / passive failover • Shadow masters - read only availability • Master performs ongoing chunk rebalancing • Chunk server load balancing • Optimize read / write traffic • Chunk server disk utilization • Ensures uniform free space distribution for new writes • Enables optimal new chunk allocation by master • Hotspots handled by higher replication policy • E.g. replica count = 10 • Can set on file or namespace

  16. Growth Over Past Decade • Application Mix: • R&D, MapReduce, BigTable • Gmail, Docs, YouTube, Wave • App Engine • Scale: • 36 Data Centers • 300+ GFSII Clusters • Upwards of 800K machines • 2-figure PB data per cluster • 1000TB = 1 PB

  17. Issue: Single Master Bottleneck • Storage size increase: • 3-figure TB to 2-figure PB • File count increase: • E.g. Gmail files < 64MB == more master metadata • Master metadata increase (64 bytes / chunk) • Metadata still had to fit into main memory • Master scaled linearly with metadata size • Master metadata increased by 100-1,000x • Stampede of 1,000 MapReduce clients to the master • Master became a bottleneck • Latency, latency, latency … • GFS was originally designed for batch processing MapReduce jobs • 64MB chunk size favored throughput over latency • Interactive user-facing apps need low latency against smaller files

  18. Workarounds • Multi-cell approach • Partition n masters over a set of chunk servers • Namespaces to wrap partitions • Tuning the GFS binary • Not typical at Google • Usually target decent performance, then scale out • Tune Google applications to better use GFS • Compact data into larger files • BigTable – distributed DB on top of GFS • Try to hide latency of GFS by being clever

  19. Future: GFSII “Colossus” • Distributed multi-master model • 1MB average file size • 100M files per master ~100PB master • Hundreds of masters per cluster ~2-figure Exabyte • Designed to take full advantage of BigTable • Using BigTable for GFSII metadata storage? • Helps address several issues including latency • Part of the new 2010 “Caffeine” infrastructure

  20. References • http://labs.google.com/papers/gfs.html • http://queue.acm.org/detail.cfm?id=1594206 • http://www.slideshare.net/hasanveldstra/the-anatomy-of-the-google-architecture-fina-lv11 • http://videos.webpronews.com/2009/08/11/breaking-news-matt-cutts-explains-caffeine-update/ • http://labs.google.com/papers/bigtable.html • http://labs.google.com/papers/mapreduce.html • http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006 • http://www.byteonic.com/2009/why-java-is-a-better-choice-than-using-python-on-google-app-engine/ • http://infolab.stanford.edu/~backrub/google.html • http://code.google.com/apis/protocolbuffers/docs/overview.html

  21. Backup

  22. Protocol Buffers • Data description language • Language/platform neutral • Java, C++, Python • Used for serializing structured data for • Communications protocols, data storage, etc. • Two basic record types commonly used for GFS • logs - for mutable data as it’s being recorded • Sorted String Tables (SSTables) – immutable, indexed • BigTable implemented using logs and SSTables

  23. History • GFS based on BigFiles from original Larry/Sergey Stanford paper [9] • GFS implemented during 2002-2003 • GFSII implemented during 2007-2009
