Fast Crash Recovery in RAMCloud

Fast Crash Recovery in RAMCloud

Motivation • The role of DRAM has been increasing • Facebook used 150TB of DRAM • For 200TB of disk storage • However, there are limitations • DRAM is typically used as cache • Need to worry about consistency and cache misses

RAMCloud • Keeps all data in RAM at all times • Designed to scale to thousands of servers • To host terabytes of data • Provides low-latency (5-10 µs) for small reads • Design goals • High durability and availability • Without compromising performance

Alternatives • 3x RAM replications • 3x cost and energy • Power failure • RAMCloud keeps one copy in RAM • Two copies on disks • To achieve good availability • Fast crash recovery (64GB in 1-2 seconds)

RAMCloud Basics • Thousands of off-the-shelf servers • Each with 64GB of RAM • With InfinibandNICs • Remote access below 10 µs

Data Model • Key-value store • Tables of objects • Object • 64-bit ID + 1MB array + 64-bit version number • No atomic updates to multiple objects

System Structure • A large number of storage servers • Each server hosts • A master, which manages local DRAM objects and service requests • A backup, which stores copies of objects from other masters on storage • A coordinator • Manages config info and object locations • Not involved in most requests

RAMCloud Cluster Architecture client coordinator master backup

More on the Coordinator • Maps objects to servers in units of tablets • Hold consecutive key ranges within a single table • For locality reasons • Small tables are stored on a single server • Large tables are split across servers • Clients can cache tablets to access servers directly

Log-structured Storage • Logging approach • Each master logs data in memory • Log entries are forwarded to backup servers • Backup servers buffer log entries • Battery-backed • Writes complete once all backup servers acknowledge • A backup server flushes its buffer when full • 8MB segment for logging, buffering, and IOs • Each server can handle 300K 100-byte writes/sec

Recovery • When a server crashes, its DRAM content must be reconstructed • 1-2 second recovery time is good enough

Using Scale • Simple 3 replica approach • Recovery based on the speed of three disks • 3.5 minutes to read 64GB of data • Scattered over 1,000 disks • Takes 0.6 seconds to read 64GB • Centralized recovery master becomes a bottleneck • 10 Gbps network means 1 min to transfer 64GB of data to the centralized master

RAMCloud • Uses 100 recovery masters • Cuts the time down to 1 second

Scattering Log Segments • Ideally uniform, but with more details • Need to avoid correlated failures • Need to account for heterogeneity of hardware • Need to coordinate machines not to overflow buffers on individual machines • Need to account for changing memberships of servers due to failures

Failure Detection • Periodic pings to random servers • With 99% chance to detect failed servers within 5 rounds • Recovery • Setup • Replay • Cleanup

Setup • Coordinator finds log segment replicas • By querying all backup servers • Detecting incomplete logs • Logs are self describing • Starting Partition Recoveries • Each master uploads a will periodically to the coordinator in the event of its demise • Coordinator carries out the will accordingly

Replay • Parallel recovery • Six stages of pipelining • At segment granularity • Same ordering of operations on segments to avoid pipeline stalls • Only the primary replicas is involved in recovery

Cleanup • Get master online • Free up segments from the previous crash

Consistency • Exactly-once semantics • Implementation not yet complete • ZooKeeper handles coordinator failures • Distributed configuration service • With its own replication

Additional Failure Modes • Current focus • Recover DRAM content for a single master failure • Failed backup server • Need to know what segments are lost from the server • Rereplicate those lost segments across remaining disks

Multiple Failures • Multiple servers fail simultaneously • Recover each failure independently • Some will involve secondary replicas • Based on projection • With 5,000 servers, recovering 40 masters within a rack will take about 2 seconds • Can’t do much when many racks are blacked out

Cold Start • Complete power outage • Backups will contact the coordinate as they reboot • Need to quorum of backups before starting reconstructing masters • Current implementation does not perform cold starts

Evaluation • 60-node cluster • Each node • 16GB RAM, 1 disk • Infiniband (25 Gbps) • User level apps can talk to NICs bypassing the kernel

Results • Can recover lost data at 22 GB/s • A crashed server with 35 GB storage • Can be recovered in 1.6 seconds • Recovery time stays nearly flat from 1 to 20 recovery masters, each talks to 6 disks • 60 recovery masters adds only 10 ms recovery time

Results • Fast recovery significantly reduces the risk of data loss • Assume recovery time of 1 sec • The risk of data loss for 100K node is 10-5 in one year • 10x improvement in recovery time, improves reliability by 1,000x • Assumes independent failures

Theoretical Recovery Speed Limit • Harder to be faster than a few hundred msec • 150 msec to detect failure • 100 msec to contact every backup • 100 msec to read a single segment from disk

Risks • Scalability study based on a small cluster • Can treat performance glitches as failures • Trigger unnecessary recovery • Access patterns can change dynamically • May lead to unbalanced load

Fast Crash Recovery in RAMCloud

Fast Crash Recovery in RAMCloud

Presentation Transcript

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash Recovery

Crash recovery

Crash Recovery

Crash recovery