DStore: An Easy-to-Manage Persistent State Store
Andy Huang and Armando Fox, Stanford University
Outline
• Project overview
• Consistency guarantees
• Failure detection
• Benchmarks
• Next steps and bigger picture
Background: Scalable CHTs
[Diagram: frontends and app servers connected over a LAN to the storage tier]
Cluster hash tables (CHTs) store single-key-lookup data:
• Yahoo! user profiles
• Amazon catalog metadata
CHTs also serve as an underlying storage layer:
• Inktomi: wordID → docID list, docID → document metadata
• DDS/Ninja: atomic compare-and-swap
(A small sketch of the key-to-brick-group mapping follows below.)
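To make the CHT data model concrete, here is a minimal sketch of how a single-key lookup can be routed to one replica group of bricks; the class and names are hypothetical, not DStore's actual API.

```python
# Minimal sketch: a cluster hash table maps each key to one replica group of
# bricks (hypothetical structure; not DStore's actual code).

import hashlib

class ClusterHashTable:
    def __init__(self, replica_groups):
        self.groups = replica_groups              # each entry is a list of bricks

    def group_for(self, key):
        # Hash the key to pick the replica group responsible for it.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.groups[int(digest, 16) % len(self.groups)]

# Example: a user-profile lookup is served entirely by one group of bricks.
# bricks = cht.group_for("user:alice")
```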
DStore: An easy-to-manage CHT
Challenges:
• Failure detection: fast detection is at odds with accurate detection
• Capacity planning: high scaling costs necessitate accurate load prediction
Benefits of cheap recovery (predictably fast, with a predictably small impact on availability/performance):
• Lowers the cost of acting on false positives
• Effective failure detection is not contingent on accuracy
• Our online repartitioning algorithm lowers the scaling cost
• Reactive scaling adjusts capacity to match current load
Bottom line: manage DStore like stateless frontends
Cheap recovery: Principles and costs
Techniques:
• Quorums: writes are sent to all bricks and wait for a majority; reads go to a majority. No recovery code is needed to freeze writes and copy missed updates.
• Single-phase writes: no locking and no transactional logging.
Costs:
• Higher replication factor: 2N+1 bricks to tolerate N failures (vs. N+1 in ROWA)
• Sacrifice some consistency: well-defined guarantees that provide consistent ordering
Trade storage and consistency for cheap recovery. (A quorum read/write sketch follows below.)
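As a rough illustration of the quorum technique, here is a minimal sketch of a Dlib-style client library that sends writes to all bricks and waits for a majority, and reads from a majority; the brick interface (get/put with a timestamp) and error handling are assumptions for the sketch, not DStore's actual API.

```python
# Minimal sketch of quorum reads and single-phase quorum writes (hypothetical
# API; the real Dlib differs). Assumes 2N+1 bricks, each exposing
# put(key, value, timestamp) and get(key) -> (value, timestamp).

import time

class Dlib:
    def __init__(self, bricks):
        self.bricks = bricks                      # replica group of 2N+1 bricks
        self.majority = len(bricks) // 2 + 1

    def write(self, key, value):
        # Single-phase write: send to all, succeed once a majority has acked.
        ts = time.time()                          # writer-assigned timestamp
        acks = 0
        for b in self.bricks:
            try:
                b.put(key, value, ts)
                acks += 1
            except IOError:
                pass                              # failed/slow brick: skip it
        return acks >= self.majority              # SUCCESS iff a majority holds the value

    def read(self, key):
        # Read from a majority and return the value with the highest timestamp.
        replies = []
        for b in self.bricks:
            try:
                replies.append(b.get(key))        # (value, timestamp) pairs
            except IOError:
                pass
            if len(replies) >= self.majority:
                break
        if len(replies) < self.majority:
            raise RuntimeError("no majority available")
        return max(replies, key=lambda vt: vt[1])[0]
```

With no locking or logging, a brick that crashes and restarts simply rejoins; updates it missed are caught by quorum reads and fixed by read-repair, discussed on the following slides.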
Nothing new under the sun, but…
Technique | Prior work | DStore
CHT | Scalable performance | Ease of management
Quorums | Availability during network partitions and Byzantine faults | Availability during failures and recovery
Relaxed consistency | Availability and performance while nodes are unavailable |
Result | High availability and performance (end goal) | Cheap recovery (but that's just the start…)
Cheap recovery simplifies state management
Challenge | Prior work | DStore
Failure detection | Difficult to make fast and accurate | Effective even if it is not highly accurate
Online repartitioning | Relatively new area [Aqueduct] | Duration and impact are predictably small
Capacity planning | Predict future load | Scale reactively based on current load
Data reconstruction | [RAID] | [Future work]
Result | State management is costly (administration- and availability-wise) | Manage state with techniques used for stateless frontends
Outline
• Project overview
• Consistency guarantees
• Failure detection
• Benchmarks
• Next steps and bigger picture
Consistency guarantees
Usage model:
• A client issues a request
• The request is forwarded to a random Dlib
• The Dlib issues quorum reads/writes on the bricks
• Assumption: clients share data, but otherwise act independently
Guarantee: for a key k, DStore enforces a global order of operations that is consistent with the order seen by individual clients.
Example: C1 issues w1(k, vnew) to replace the current hash table entry (k, vold)
• w1 returns SUCCESS: subsequent reads return vnew
• w1 returns FAIL: subsequent reads return vold
• w1 returns UNKNOWN (due to Dlib failure): two cases
Case 1: Another user U2 performs a read
[Timeline diagram: U1's partial write w1(k1, vnew) over bricks B1-B3, followed by reads r(k1) from U2 and a delayed commit of vnew]
A Dlib failure can cause a partial write, violating the quorum property.
U2's read r(k1) returns either:
• vold: no user has yet read vnew
• vnew: no user will later read vold
If the timestamps returned by the bricks differ, read-repair restores the majority invariant (a sketch follows below).
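Continuing the quorum sketch above, here is a minimal illustration of read-repair; the helper and its error handling are assumptions for the sketch, not DStore's exact protocol.

```python
# Minimal sketch of read-repair (hypothetical helper; not DStore's exact
# protocol). A partial write leaves bricks disagreeing on a key's timestamp;
# the next reader pushes the highest-timestamped value back to the stragglers,
# restoring the majority invariant before returning.

def read_with_repair(bricks, key):
    replies = []
    for b in bricks:
        try:
            replies.append((b, b.get(key)))       # (brick, (value, timestamp))
        except IOError:
            pass                                  # skip unreachable bricks
    if len(replies) < len(bricks) // 2 + 1:
        raise RuntimeError("no majority available")
    value, ts = max((vt for _, vt in replies), key=lambda vt: vt[1])
    for b, (_, t) in replies:
        if t != ts:                               # timestamps differ: partial write
            b.put(key, value, ts)                 # repair this straggler
    return value
```

Whichever value the repair propagates, later readers see a single consistent outcome, matching the ordering guarantee stated on the previous slide.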
Case 2: U1 performs a read
[Timeline diagram: U1's partial write w1(k1, vnew) over bricks B1-B3, followed by U1's own read r(k1)]
On U1's read r(k1), the write is immediately committed or aborted, so all future readers see either vold or vnew.
A write-in-progress cookie can be used to detect partial writes and commit/abort them on the next read (see the sketch below).
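For Case 2, here is a hypothetical sketch of how a write-in-progress cookie could be carried by the client and resolved on its next read; this is an illustrative design built on the earlier sketches, not DStore's exact mechanism.

```python
# Hypothetical sketch of a write-in-progress cookie (illustrative; not DStore's
# exact mechanism). The Dlib records a cookie with the client before issuing a
# write; if the Dlib dies mid-write, the cookie survives at the client, and the
# client's next read resolves the partial write before returning a value.

def cookie_write(dlib, client, key, value):
    client["cookie"] = key                        # an outstanding write for this key
    ok = dlib.write(key, value)                   # single-phase quorum write (see above)
    client["cookie"] = None                       # cleared only if the Dlib survives
    return ok

def cookie_read(dlib, client, key):
    if client.get("cookie") == key:
        # A write to this key may be partial: repair it first, so the write is
        # committed or aborted once and for all before this reader sees a value.
        value = read_with_repair(dlib.bricks, key)
        client["cookie"] = None
        return value
    return dlib.read(key)
```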
Consistency guarantees
• C1 issues w1(k, vnew) to replace the current hash table entry (k, vold)
• w1 returns SUCCESS: subsequent reads return vnew
• w1 returns FAIL: subsequent reads return vold
• w1 returns UNKNOWN (due to Dlib failure):
  • If U1 reads, w1 is immediately committed or aborted
  • If U2 reads: if vold is returned, no user has read vnew; if vnew is returned, no user will later read vold
Two-phase commit vs. single-phase writes
Property | 2-phase commit | Single-phase writes
Consistency | Sequential consistency | Consistent ordering
Recovery | Read log to complete in-progress transactions | No special-case recovery
Availability | Locking may cause requests to block during failures | No locking
Performance | 2 synchronous log writes, 2 roundtrips | 1 synchronous update, 1 roundtrip
Other costs | None | Read-repair (spreads out the cost of 2PC to make the common case faster); write-in-progress cookie (spreads out the responsibility of 2PC)
Recovery behavior
[Benchmark graphs: throughput before, during, and after a brick failure, run at 100% capacity and at the typical 60-70% of maximum utilization]
Recovery has a predictably fast and predictably small impact.
Application-generic failure detection
[Diagram: operating statistics (CPU load, requests processed, etc.) feed detection techniques (beacon listener, median absolute deviation, Tarzan algorithm); anomalies above a threshold trigger a reboot]
Simple detection techniques "work" because the resolution mechanism is cheap. (A small anomaly-detection sketch follows below.)
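As an illustration of how simple such a detector can be, here is a minimal sketch of median-absolute-deviation (MAD) anomaly detection over per-brick operating statistics; the statistic choice, threshold, and example data are illustrative assumptions, not DStore's actual parameters.

```python
# Minimal sketch of anomaly detection with the median absolute deviation (MAD),
# one of the simple techniques named on the slide. Threshold and inputs are
# illustrative, not DStore's actual configuration.

import statistics

def mad_anomalies(samples, threshold=3.0):
    """Return indices of samples deviating from the median by more than
    `threshold` times the median absolute deviation."""
    med = statistics.median(samples)
    mad = statistics.median(abs(x - med) for x in samples) or 1e-9
    return [i for i, x in enumerate(samples)
            if abs(x - med) / mad > threshold]

# Example: requests processed per brick in the last interval; brick 3 stutters
# well below its peers and becomes a reboot candidate.
requests = [980, 1010, 995, 120, 1003, 990, 1001]
print(mad_anomalies(requests))   # -> [3]
```

Because a false positive only costs a cheap reboot, such coarse detectors are acceptable even when they are not highly accurate.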
Failure detection and repartitioning behavior
[Benchmark graphs: online repartitioning, and aggressive failure detection under a fail-stutter fault]
• Online repartitioning: low scaling cost
• Aggressive failure detection: low cost of acting on false positives
Bigger picture: What is "self-managing"?
• Indicator: brick performance is a sign of system health
• Monitoring: tests for potential problems
• Treatment: a low-impact resolution mechanism (reboot)
Bigger picture: What is "self-managing"?
• Indicators: brick performance, system load, disk failures
• Treatments: reboot, repartition, reconstruction
• Simple detection mechanisms and policies
• Key: low-cost mechanisms enable constant "recovery"