DStore: An Easy-to-Manage Persistent State Store
Andy Huang and Armando Fox, Stanford University
Outline
• Project overview
• Consistency guarantees
• Failure detection
• Benchmarks
• Next steps and bigger picture
Background: Scalable CHTs
[Diagram: frontends and app servers connected over a LAN to the storage tier]
Cluster hash tables (CHTs) store single-key-lookup data:
• Yahoo! user profiles
• Amazon catalog metadata
CHTs also serve as an underlying storage layer:
• Inktomi: wordID → docID list, docID → document metadata
• DDS/Ninja: atomic compare-and-swap
(A small sketch of the key-to-brick-group mapping follows below.)
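To make the CHT data model concrete, here is a minimal sketch of how a single-key lookup can be routed to one replica group of bricks; the class and names are hypothetical, not DStore's actual API.

```python
# Minimal sketch: a cluster hash table maps each key to one replica group of
# bricks (hypothetical structure; not DStore's actual code).

import hashlib

class ClusterHashTable:
    def __init__(self, replica_groups):
        self.groups = replica_groups              # each entry is a list of bricks

    def group_for(self, key):
        # Hash the key to pick the replica group responsible for it.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.groups[int(digest, 16) % len(self.groups)]

# Example: a user-profile lookup is served entirely by one group of bricks.
# bricks = cht.group_for("user:alice")
```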
DStore: An easy-to-manage CHT
Challenges:
• Failure detection: fast detection is at odds with accurate detection
• Capacity planning: high scaling costs necessitate accurate load prediction
Benefits of cheap recovery (predictably fast, with a predictably small impact on availability/performance):
• Lowers the cost of acting on false positives
• Effective failure detection is not contingent on accuracy
• Our online repartitioning algorithm lowers the scaling cost
• Reactive scaling adjusts capacity to match current load
Bottom line: manage DStore like stateless frontends
Cheap recovery: Principles and costs
Techniques:
• Quorums: writes are sent to all bricks and wait for a majority; reads go to a majority. No recovery code is needed to freeze writes and copy missed updates.
• Single-phase writes: no locking and no transactional logging.
Costs:
• Higher replication factor: 2N+1 bricks to tolerate N failures (vs. N+1 in ROWA)
• Sacrifice some consistency: well-defined guarantees that provide consistent ordering
Trade storage and consistency for cheap recovery. (A quorum read/write sketch follows below.)
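As a rough illustration of the quorum technique, here is a minimal sketch of a Dlib-style client library that sends writes to all bricks and waits for a majority, and reads from a majority; the brick interface (get/put with a timestamp) and error handling are assumptions for the sketch, not DStore's actual API.

```python
# Minimal sketch of quorum reads and single-phase quorum writes (hypothetical
# API; the real Dlib differs). Assumes 2N+1 bricks, each exposing
# put(key, value, timestamp) and get(key) -> (value, timestamp).

import time

class Dlib:
    def __init__(self, bricks):
        self.bricks = bricks                      # replica group of 2N+1 bricks
        self.majority = len(bricks) // 2 + 1

    def write(self, key, value):
        # Single-phase write: send to all, succeed once a majority has acked.
        ts = time.time()                          # writer-assigned timestamp
        acks = 0
        for b in self.bricks:
            try:
                b.put(key, value, ts)
                acks += 1
            except IOError:
                pass                              # failed/slow brick: skip it
        return acks >= self.majority              # SUCCESS iff a majority holds the value

    def read(self, key):
        # Read from a majority and return the value with the highest timestamp.
        replies = []
        for b in self.bricks:
            try:
                replies.append(b.get(key))        # (value, timestamp) pairs
            except IOError:
                pass
            if len(replies) >= self.majority:
                break
        if len(replies) < self.majority:
            raise RuntimeError("no majority available")
        return max(replies, key=lambda vt: vt[1])[0]
```

With no locking or logging, a brick that crashes and restarts simply rejoins; updates it missed are caught by quorum reads and fixed by read-repair, discussed on the following slides.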
Nothing new under the sun, but…
Technique | Prior work | DStore
CHT | Scalable performance | Ease of management
Quorums | Availability during network partitions and Byzantine faults | Availability during failures and recovery
Relaxed consistency | Availability and performance while nodes are unavailable |
Result | High availability and performance (end goal) | Cheap recovery (but that's just the start…)
Cheap recovery simplifies state management
Challenge | Prior work | DStore
Failure detection | Difficult to make fast and accurate | Effective even if it is not highly accurate
Online repartitioning | Relatively new area [Aqueduct] | Duration and impact are predictably small
Capacity planning | Predict future load | Scale reactively based on current load
Data reconstruction | [RAID] | [Future work]
Result | State management is costly (administration- and availability-wise) | Manage state with techniques used for stateless frontends
Outline
• Project overview
• Consistency guarantees
• Failure detection
• Benchmarks
• Next steps and bigger picture
Consistency guarantees
Usage model:
• A client issues a request
• The request is forwarded to a random Dlib
• The Dlib issues quorum reads/writes on the bricks
• Assumption: clients share data, but otherwise act independently
Guarantee: for a key k, DStore enforces a global order of operations that is consistent with the order seen by individual clients.
Example: C1 issues w1(k, vnew) to replace the current hash table entry (k, vold)
• w1 returns SUCCESS: subsequent reads return vnew
• w1 returns FAIL: subsequent reads return vold
• w1 returns UNKNOWN (due to Dlib failure): two cases
Case 1: Another user U2 performs a read
[Timeline diagram: U1's partial write w1(k1, vnew) over bricks B1-B3, followed by reads r(k1) from U2 and a delayed commit of vnew]
A Dlib failure can cause a partial write, violating the quorum property.
U2's read r(k1) returns either:
• vold: no user has yet read vnew
• vnew: no user will later read vold
If the timestamps returned by the bricks differ, read-repair restores the majority invariant (a sketch follows below).
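Continuing the quorum sketch above, here is a minimal illustration of read-repair; the helper and its error handling are assumptions for the sketch, not DStore's exact protocol.

```python
# Minimal sketch of read-repair (hypothetical helper; not DStore's exact
# protocol). A partial write leaves bricks disagreeing on a key's timestamp;
# the next reader pushes the highest-timestamped value back to the stragglers,
# restoring the majority invariant before returning.

def read_with_repair(bricks, key):
    replies = []
    for b in bricks:
        try:
            replies.append((b, b.get(key)))       # (brick, (value, timestamp))
        except IOError:
            pass                                  # skip unreachable bricks
    if len(replies) < len(bricks) // 2 + 1:
        raise RuntimeError("no majority available")
    value, ts = max((vt for _, vt in replies), key=lambda vt: vt[1])
    for b, (_, t) in replies:
        if t != ts:                               # timestamps differ: partial write
            b.put(key, value, ts)                 # repair this straggler
    return value
```

Whichever value the repair propagates, later readers see a single consistent outcome, matching the ordering guarantee stated on the previous slide.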
Case 2: U1 performs a read
[Timeline diagram: U1's partial write w1(k1, vnew) over bricks B1-B3, followed by U1's own read r(k1)]
On U1's read r(k1), the write is immediately committed or aborted, so all future readers see either vold or vnew.
A write-in-progress cookie can be used to detect partial writes and commit/abort them on the next read (see the sketch below).
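For Case 2, here is a hypothetical sketch of how a write-in-progress cookie could be carried by the client and resolved on its next read; this is an illustrative design built on the earlier sketches, not DStore's exact mechanism.

```python
# Hypothetical sketch of a write-in-progress cookie (illustrative; not DStore's
# exact mechanism). The Dlib records a cookie with the client before issuing a
# write; if the Dlib dies mid-write, the cookie survives at the client, and the
# client's next read resolves the partial write before returning a value.

def cookie_write(dlib, client, key, value):
    client["cookie"] = key                        # an outstanding write for this key
    ok = dlib.write(key, value)                   # single-phase quorum write (see above)
    client["cookie"] = None                       # cleared only if the Dlib survives
    return ok

def cookie_read(dlib, client, key):
    if client.get("cookie") == key:
        # A write to this key may be partial: repair it first, so the write is
        # committed or aborted once and for all before this reader sees a value.
        value = read_with_repair(dlib.bricks, key)
        client["cookie"] = None
        return value
    return dlib.read(key)
```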
Consistency guarantees
• C1 issues w1(k, vnew) to replace the current hash table entry (k, vold)
• w1 returns SUCCESS: subsequent reads return vnew
• w1 returns FAIL: subsequent reads return vold
• w1 returns UNKNOWN (due to Dlib failure):
  • If U1 reads, w1 is immediately committed or aborted
  • If U2 reads: if vold is returned, no user has read vnew; if vnew is returned, no user will later read vold
Two-phase commit vs. single-phase writes
Property | 2-phase commit | Single-phase writes
Consistency | Sequential consistency | Consistent ordering
Recovery | Read log to complete in-progress transactions | No special-case recovery
Availability | Locking may cause requests to block during failures | No locking
Performance | 2 synchronous log writes, 2 roundtrips | 1 synchronous update, 1 roundtrip
Other costs | None | Read-repair (spreads out the cost of 2PC to make the common case faster); write-in-progress cookie (spreads out the responsibility of 2PC)
Recovery behavior
[Benchmark graphs: throughput before, during, and after a brick failure, run at 100% capacity and at the typical 60-70% of maximum utilization]
Recovery has a predictably fast and predictably small impact.
Application-generic failure detection
[Diagram: operating statistics (CPU load, requests processed, etc.) feed detection techniques (beacon listener, median absolute deviation, Tarzan algorithm); anomalies above a threshold trigger a reboot]
Simple detection techniques "work" because the resolution mechanism is cheap. (A small anomaly-detection sketch follows below.)
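As an illustration of how simple such a detector can be, here is a minimal sketch of median-absolute-deviation (MAD) anomaly detection over per-brick operating statistics; the statistic choice, threshold, and example data are illustrative assumptions, not DStore's actual parameters.

```python
# Minimal sketch of anomaly detection with the median absolute deviation (MAD),
# one of the simple techniques named on the slide. Threshold and inputs are
# illustrative, not DStore's actual configuration.

import statistics

def mad_anomalies(samples, threshold=3.0):
    """Return indices of samples deviating from the median by more than
    `threshold` times the median absolute deviation."""
    med = statistics.median(samples)
    mad = statistics.median(abs(x - med) for x in samples) or 1e-9
    return [i for i, x in enumerate(samples)
            if abs(x - med) / mad > threshold]

# Example: requests processed per brick in the last interval; brick 3 stutters
# well below its peers and becomes a reboot candidate.
requests = [980, 1010, 995, 120, 1003, 990, 1001]
print(mad_anomalies(requests))   # -> [3]
```

Because a false positive only costs a cheap reboot, such coarse detectors are acceptable even when they are not highly accurate.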
Failure detection and repartitioning behavior
[Benchmark graphs: online repartitioning, and aggressive failure detection under a fail-stutter fault]
• Online repartitioning: low scaling cost
• Aggressive failure detection: low cost of acting on false positives
Bigger picture: What is "self-managing"?
• Indicator: brick performance is a sign of system health
• Monitoring: tests for potential problems
• Treatment: a low-impact resolution mechanism (reboot)
Bigger picture: What is "self-managing"?
• Indicators: brick performance, system load, disk failures
• Treatments: reboot, repartition, reconstruction
• Simple detection mechanisms and policies
• Key: low-cost mechanisms enable constant "recovery"