Fault Tolerance and Recovery Google File System (GFS)
Agenda • Introduction • Design • Fault Tolerance & Recovery • Issues • Future
Built on Commodity Hardware Cells • Component failures are the norm • Some components are guaranteed never to recover
Original Use Case: Google Search • 24 MapReduce stages in the indexing pipeline • Multi-GB files are common • Writes are append-only, no overwriting; large sequential reads, small random reads
Design Assumptions • Built on commodity hardware that will fail regularly • A few million large files per cluster • Large sequential reads, small random reads • Large sequential writes - append only • Efficient concurrent access - producer/consumer • High sustained bandwidth favored over low latency • No POSIX support – this is a cloud file system
Interface • Separate GFS client library (front-end) • No POSIX support, no v-node integration • Supports: create, delete, open, close, read, write • recordAppend() • Enables efficient multi-way merges • E.g. 1,000 MapReduce producers appending to the same file • Atomically appends a record at least once • Loose consistency across replicas • Idempotent record processing required • snapshot() – an efficient file clone
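To illustrate the recordAppend() contract described above, here is a toy sketch: the append is atomic but may be applied more than once, so consumers stay idempotent by deduplicating records. All class and method names are invented for illustration, not part of the real GFS client library.

```python
# Toy illustration of the recordAppend() contract: the append may be applied
# more than once (at-least-once), so readers deduplicate by record id to
# remain idempotent. All names here are invented.
import random

class ToyChunk:
    def __init__(self):
        self.records = []

    def record_append(self, record_id, payload):
        self.records.append((record_id, payload))      # atomic append...
        if random.random() < 0.3:                      # ...but a retry can
            self.records.append((record_id, payload))  # duplicate the record

def read_all_idempotent(chunk):
    """Consumers skip duplicates, so reprocessing a record is harmless."""
    seen, out = set(), []
    for record_id, payload in chunk.records:
        if record_id not in seen:
            seen.add(record_id)
            out.append(payload)
    return out

if __name__ == "__main__":
    chunk = ToyChunk()
    for i in range(5):
        chunk.record_append(f"rec-{i}", f"payload-{i}".encode())
    print(read_all_idempotent(chunk))   # five unique payloads, duplicates dropped
```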
Single Master Design • Simplifies the overall design problem • Central place to control replication, GC, etc. • Able to roll out GFS in 1 year with 3 engineers – time to market • Master stores metadata in main memory • File namespaces • File-to-chunk mappings • Chunk replica locations • Chunk servers provide the authoritative list of chunk versions • Discovery simplifies membership changes, failures, etc. • Metadata • Only a few million files • 64 bytes per 64MB chunk – so it fits in the master's main memory • Checkpointed to disk at intervals in B-tree format for fast startup
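To make the in-memory metadata claim concrete, a rough back-of-the-envelope sketch: the 64-bytes-per-64MB-chunk figure is from the slide above, while the cluster sizes are illustrative assumptions.

```python
# Rough estimate of master metadata footprint, assuming ~64 bytes of master
# metadata per 64 MB chunk (figures from the slide); the cluster sizes below
# are illustrative, not quoted numbers.
CHUNK_SIZE = 64 * 2**20          # 64 MB per chunk
METADATA_PER_CHUNK = 64          # ~64 bytes of master metadata per chunk

def master_metadata_bytes(total_stored_bytes: int) -> int:
    """Approximate master RAM needed to track every chunk."""
    num_chunks = total_stored_bytes // CHUNK_SIZE
    return num_chunks * METADATA_PER_CHUNK

if __name__ == "__main__":
    one_pb = 2**50
    for petabytes in (1, 10, 100):
        ram = master_metadata_bytes(petabytes * one_pb)
        print(f"{petabytes:>3} PB stored -> ~{ram / 2**30:.1f} GiB of chunk metadata")
```

At roughly 1 GiB of chunk metadata per PB stored, a few-PB cluster fits comfortably in one machine's memory, which is why the single-master design worked initially.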
Use Case: Write to a new file • Client • Requests new file (1) • Master • Adds file to namespace • Selects 3 chunk servers • Designates a chunk primary and grants it a lease • Replies to client (2) • Client • Sends data to all replicas (3) • Notifies primary when sent (4) • Primary • Writes data in order • Increments chunk version • Sequences secondary writes (5) • Secondary • Writes data in sequence order • Increments chunk version • Notifies primary when write finished (6) • Primary • Notifies client when write finished (7)
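A minimal sketch of the same seven-step flow, with invented class and method names; it compresses the real protocol (data pipelining, lease expiry, retries, checksums) into plain method calls to show who does what in which order.

```python
# Hypothetical sketch of the new-file write flow above; all names are
# invented and most of the real protocol is omitted for brevity.
class ChunkServer:
    def __init__(self, name):
        self.name, self.chunk, self.version, self.pending = name, b"", 0, None

    def push(self, data):                 # (3) client pushes data to every replica
        self.pending = data

    def apply(self):                      # write buffered data, bump chunk version
        self.chunk += self.pending
        self.version += 1
        self.pending = None

class Master:
    def __init__(self, chunkservers):
        self.chunkservers, self.namespace = chunkservers, {}

    def create(self, path):               # (1)-(2) add to namespace, pick replicas,
        replicas = self.chunkservers[:3]  # designate a primary and grant it a lease
        self.namespace[path] = replicas
        return replicas[0], replicas

def client_write(master, path, data):
    primary, replicas = master.create(path)      # (1)-(2)
    for cs in replicas:
        cs.push(data)                            # (3)-(4) data sent to all replicas
    primary.apply()                              # (5) primary writes, sequences rest
    for secondary in replicas[1:]:
        secondary.apply()                        # (5)-(6) secondaries write in order
    return "write finished"                      # (7) primary notifies the client

if __name__ == "__main__":
    m = Master([ChunkServer(f"cs{i}") for i in range(3)])
    print(client_write(m, "/logs/crawl-0001", b"record"))
```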
Chunk Server Failure • Heartbeats sent from chunk server to master • Master detects chunk server failure • If a chunk server goes down: • Chunk replica count is decremented on the master • Master re-replicates missing chunks as needed • 3 chunk replicas is the default (may vary) • Priority for chunks with lower replica counts • Priority for chunks that are blocking clients • Re-replication throttled per cluster and per chunk server • No difference between normal and abnormal termination • Chunk servers are routinely killed for maintenance
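A toy sketch of how re-replication might be prioritized, based only on the rules listed above (bigger replica deficit first, then chunks that block clients); the Chunk record and scoring are invented for illustration.

```python
# Toy prioritization of chunks awaiting re-replication, following the rules
# above: fewer surviving replicas first, then chunks blocking clients.
# The Chunk record and the ordering rule are invented for illustration.
from dataclasses import dataclass

REPLICATION_TARGET = 3   # default replica count (may vary per file/namespace)

@dataclass
class Chunk:
    chunk_id: str
    live_replicas: int
    blocking_clients: bool

def rereplication_order(chunks):
    """Sort under-replicated chunks so the most urgent copies happen first."""
    def priority(c):
        deficit = REPLICATION_TARGET - c.live_replicas
        return (-deficit, not c.blocking_clients)   # bigger deficit, then blockers
    return sorted((c for c in chunks if c.live_replicas < REPLICATION_TARGET),
                  key=priority)

if __name__ == "__main__":
    queue = [Chunk("a", 2, False), Chunk("b", 1, False), Chunk("c", 2, True)]
    print([c.chunk_id for c in rereplication_order(queue)])   # ['b', 'c', 'a']
```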
Chunk Corruption • 32-bit checksums • 64MB chunks split into 64KB blocks • Each 64KB block has a 32-bit checksum • Chunk server maintains the checksums • Checksums are optimized for recordAppend() • Verified for all reads and overwrites • Not verified during recordAppend() – only on the next read • Chunk servers verify checksums when idle • If a corrupt chunk is detected: • Chunk server returns an error to the client • Master is notified, replica count decremented • Master initiates new replica creation • Master tells the chunk server to delete the corrupted chunk
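A minimal sketch of block-level checksumming as described above: 64KB blocks, each with a 32-bit checksum verified on read. zlib's CRC-32 stands in here for whatever checksum the real chunk server uses.

```python
# Minimal sketch of per-block checksumming: a chunk is split into 64 KB
# blocks, each with a 32-bit checksum verified on every read.
# zlib.crc32 is a stand-in for the real checksum function.
import zlib

BLOCK_SIZE = 64 * 1024    # 64 KB blocks inside a 64 MB chunk

def checksum_blocks(chunk_bytes):
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify_read(chunk_bytes, checksums, offset, length):
    """Verify every block overlapping the read; raise if any is corrupt."""
    first, last = offset // BLOCK_SIZE, (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_bytes[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            # Real GFS: return an error to the client, notify the master so it
            # can re-replicate, then delete this corrupted copy.
            raise IOError(f"corrupt block {b} in chunk")
    return chunk_bytes[offset:offset + length]

if __name__ == "__main__":
    data = b"x" * (3 * BLOCK_SIZE)
    sums = checksum_blocks(data)
    print(len(verify_read(data, sums, 70_000, 10_000)))   # read spanning block 1
```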
Master Failure • Operations Log • Persistent record of changes to master metadata • Used to replay events on failure • Replicated to multiple machines for recovery • Flushed to disk before responding to the client • Checkpoint of master state at intervals keeps the ops log file small • Master recovery requires • The latest checkpoint file • The subsequent operations log • Master recovery was initially a manual operation • Then automated outside of GFS to within 2 minutes • Now down to tens of seconds
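A sketch of checkpoint-plus-log recovery as described above; the on-disk formats and the operation names ("create", "delete") are invented stand-ins for the real metadata mutations.

```python
# Sketch of checkpoint + operations-log recovery; file formats and
# operation types are invented stand-ins for the real metadata mutations.
import json, os

def save_checkpoint(path, metadata):
    """Periodic snapshot of master state; keeps the operations log short.
    (In the real system a fresh log file is started at each checkpoint.)"""
    with open(path, "w") as f:
        json.dump(metadata, f)

def append_op(log_path, op):
    """Append a metadata mutation; persisted before replying to the client."""
    with open(log_path, "a") as f:
        f.write(json.dumps(op) + "\n")
        f.flush()
        os.fsync(f.fileno())

def recover(checkpoint_path, log_path):
    """Rebuild master state = latest checkpoint + replay of later log entries."""
    with open(checkpoint_path) as f:
        state = json.load(f)
    with open(log_path) as f:
        for line in f:
            op = json.loads(line)
            if op["type"] == "create":
                state[op["file"]] = op["chunks"]
            elif op["type"] == "delete":
                state.pop(op["file"], None)
    return state

if __name__ == "__main__":
    save_checkpoint("ckpt.json", {"/a": ["chunk-1"]})
    append_op("ops.log", {"type": "create", "file": "/b", "chunks": ["chunk-2"]})
    print(recover("ckpt.json", "ops.log"))
```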
High Availability • Chunk replication / active failover • Master replication / passive failover • Shadow masters - read-only availability • Master performs ongoing chunk rebalancing • Chunk server load balancing • Optimizes read/write traffic • Chunk server disk utilization • Ensures uniform free-space distribution for new writes • Enables optimal new chunk allocation by the master • Hotspots handled by a higher-replication policy • E.g. replica count = 10 • Can be set per file or per namespace
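A toy sketch of placement by utilization as described above: new chunks go to the least-full, least-busy chunk servers, which keeps free space spread evenly. The stats record and the exact selection rule are invented for illustration.

```python
# Toy sketch of new-chunk placement by disk utilization: prefer chunk servers
# with the most free space and the fewest recent creations, so free space and
# new write traffic stay evenly spread. The record and rule are invented.
from dataclasses import dataclass

@dataclass
class ChunkServerStats:
    name: str
    disk_used: float     # fraction of disk in use, 0.0 - 1.0
    recent_creates: int  # recent chunk creations, to avoid piling traffic on one box

def place_new_chunk(servers, replica_count=3):
    """Pick the least-utilized, least-busy servers for a new chunk's replicas."""
    ranked = sorted(servers, key=lambda s: (s.disk_used, s.recent_creates))
    return [s.name for s in ranked[:replica_count]]

if __name__ == "__main__":
    fleet = [ChunkServerStats("cs1", 0.80, 2),
             ChunkServerStats("cs2", 0.40, 9),
             ChunkServerStats("cs3", 0.40, 1),
             ChunkServerStats("cs4", 0.65, 0)]
    print(place_new_chunk(fleet))   # ['cs3', 'cs2', 'cs4']
```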
Growth Over the Past Decade • Application Mix: • R&D, MapReduce, BigTable • Gmail, Docs, YouTube, Wave • App Engine • Scale: • 36 data centers • 300+ GFSII clusters • Upwards of 800K machines • Tens of PB of data per cluster (1,000 TB = 1 PB)
Issue: Single Master Bottleneck • Storage size increase: • Hundreds of TB to tens of PB • File count increase: • E.g. Gmail's many files under 64MB mean more master metadata • Master metadata increase (64 bytes / chunk) • Metadata still had to fit into main memory • Master memory scaled linearly with metadata size • Master metadata increased by 100-1,000x • Stampede of 1,000 MapReduce clients to the master • Master became a bottleneck • Latency, latency, latency … • GFS was originally designed for batch-processing MapReduce jobs • 64MB chunk size favored throughput over latency • Interactive, user-facing apps need low latency against smaller files
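To see why small files hurt, a rough comparison under the 64-bytes-per-chunk assumption from earlier: a file smaller than a chunk still costs a full chunk entry of metadata, so metadata grows with file count rather than data volume. The average file sizes below are chosen for illustration.

```python
# Rough illustration of why many small files inflate master metadata:
# every file needs at least one chunk entry, so a 1 MB file costs as much
# metadata as a 64 MB file. Figures are illustrative.
CHUNK_SIZE_MB = 64
METADATA_PER_CHUNK = 64   # bytes of master metadata per chunk (from earlier slides)

def metadata_for(total_data_mb, avg_file_mb):
    files = total_data_mb // avg_file_mb
    chunks_per_file = max(1, -(-avg_file_mb // CHUNK_SIZE_MB))  # ceil, at least 1
    return files * chunks_per_file * METADATA_PER_CHUNK

if __name__ == "__main__":
    one_pb_mb = 1024 * 1024 * 1024          # 1 PB expressed in MB
    for avg in (1024, 64, 1):               # 1 GB, 64 MB, 1 MB average file sizes
        gib = metadata_for(one_pb_mb, avg) / 2**30
        print(f"avg file {avg:>4} MB -> ~{gib:.0f} GiB of metadata per PB")
```

With 64MB or larger files, a PB costs about 1 GiB of metadata; at 1MB average file size the same PB costs 64 times more, which is the pressure that pushed the master past its memory limit.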
Workarounds • Multi-cell approach • Partition n masters over a set of chunk servers • Namespaces to wrap partitions • Tuning the GFS binary • Not typical at Google • Usually target decent performance, then scale out • Tune Google applications to better use GFS • Compact data into larger files • BigTable – distributed DB on top of GFS • Try to hide latency of GFS by being clever
Future: GFSII “Colossus” • Distributed multi-master model • 1MB average file size • 100M files per master, ~100PB per master • 100s of masters per cluster, ~tens of exabytes per cluster • Designed to take full advantage of BigTable • Using BigTable for GFSII metadata storage? • Helps address several issues, including latency • Part of the new 2010 “Caffeine” infrastructure
References • http://labs.google.com/papers/gfs.html • http://queue.acm.org/detail.cfm?id=1594206 • http://www.slideshare.net/hasanveldstra/the-anatomy-of-the-google-architecture-fina-lv11 • http://videos.webpronews.com/2009/08/11/breaking-news-matt-cutts-explains-caffeine-update/ • http://labs.google.com/papers/bigtable.html • http://labs.google.com/papers/mapreduce.html • http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006 • http://www.byteonic.com/2009/why-java-is-a-better-choice-than-using-python-on-google-app-engine/ • http://infolab.stanford.edu/~backrub/google.html • http://code.google.com/apis/protocolbuffers/docs/overview.html
Protocol Buffers • Data description language • Language/platform neutral • Java, C++, Python • Used for serializing structured data for • Communications protocols, data storage, etc. • Two basic record formats commonly used with GFS • Logs – for mutable data as it is being recorded • Sorted String Tables (SSTables) – immutable, indexed • BigTable is implemented using logs and SSTables
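A toy sketch of the SSTable idea mentioned above (immutable, sorted, indexed key/value data); this is not Google's format, just an illustration of why a sorted, write-once file allows cheap indexed lookups.

```python
# Toy illustration of the SSTable concept: keys are sorted once at build
# time, an index supports binary-search lookups, and the table is never
# mutated afterwards. This is not Google's on-disk format.
import bisect

class ToySSTable:
    def __init__(self, items):
        # Sort once at build time; the table is immutable from here on.
        self._keys, self._values = zip(*sorted(items.items()))

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)          # binary search the index
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)

if __name__ == "__main__":
    table = ToySSTable({"row:banana": b"1", "row:apple": b"2", "row:cherry": b"3"})
    print(table.get("row:apple"))   # b'2'
```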
History • GFS is based on BigFiles from the original Larry Page / Sergey Brin Stanford paper (reference 9) • GFS implemented during 2002-2003 • GFSII implemented during 2007-2009