Free Recovery: A Step Towards Self-Managing State
Andy Huang and Armando Fox, Stanford University
Persistent hash tables
[Architecture diagram: frontends connected over a LAN to app servers, which reach the persistent hash table and the DB over a second LAN]
Two state management challenges

Failure handling
• Consistency requirements
• Node recovery is costly
• Traditionally relies on reliable failure detection
→ Relax internal consistency; make recovery fast and non-intrusive ("free")

System evolution
• Large data sets
• Repartitioning is costly
• Requires good resource provisioning
→ Free recovery enables automatic, online repartitioning

DStore: an easy-to-manage cluster-based persistent hash table for Internet services
DStore architecture
[Diagram: each app server links in a Dlib, which communicates over the LAN with a cluster of Bricks]
• Dlib: exposes the hash table API and acts as the "coordinator" for distributed operations
• Brick: stores data by writing synchronously to disk
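To make the Brick role concrete, here is a minimal sketch, assuming an append-only log plus an in-memory index. The slides only say that a Brick writes synchronously to disk; the file format, the index, and the method names below are assumptions for illustration.

```python
import os

class Brick:
    """Illustrative brick: durable write before acknowledging (assumed layout)."""

    def __init__(self, path):
        self.log = open(path, "ab")   # append-only log file
        self.index = {}               # key -> (timestamp, value)

    def write(self, key, value, ts):
        record = repr((key, value, ts)).encode() + b"\n"
        self.log.write(record)
        self.log.flush()
        os.fsync(self.log.fileno())   # synchronous: on disk before the ack
        self.index[key] = (ts, value)
        return "ack"

    def read(self, key):
        return self.index.get(key)    # (timestamp, value) or None
```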
Focusing on recovery

Technique 1: Quorums (tolerant to brick inconsistency)
• Write: send to all, wait for a majority
• Read: read from a majority
• OK if some bricks' data differs; failure = missing some writes

Technique 2: Single-phase writes (no request relies on specific bricks)
• 2PC: a failure between phases complicates the protocol; the 2nd phase depends on a particular set of bricks; relies on reliable failure detection
• Single-phase quorum writes: can be completed by any majority of bricks

Result: simple, non-intrusive recovery; any brick can fail at any time
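A minimal sketch of single-phase quorum writes and majority reads in the Dlib, reusing the hypothetical Brick above. Sending to bricks one at a time, the timestamp scheme, and the method names are assumptions; a real Dlib would issue requests in parallel with timeouts and retries.

```python
import time

class Dlib:
    """Illustrative coordinator: quorum writes/reads over a set of bricks."""

    def __init__(self, bricks):
        self.bricks = bricks
        self.majority = len(bricks) // 2 + 1

    def put(self, key, value):
        ts = time.time()                         # per-write timestamp (assumed)
        # Send to all bricks; succeed once a majority has acknowledged.
        acks = sum(1 for b in self.bricks if b.write(key, value, ts) == "ack")
        return acks >= self.majority

    def get(self, key):
        # Read from a majority of bricks; the newest timestamp wins.
        replies = [b.read(key) for b in self.bricks[: self.majority]]
        replies = [r for r in replies if r is not None]
        if not replies:
            return None
        ts, value = max(replies, key=lambda r: r[0])
        return value
```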
Considering consistency
[Timeline diagram: x = 0 initially; Dl1's write(1) to bricks B1, B2, B3 is interrupted, so Dl2 reads 0 on one read and 1 on a later read]
• A Dlib failure can cause a partial write, violating the quorum property (a "delayed commit")
• If timestamps differ on a read, read-repair restores the majority invariant
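A sketch of how read-repair might restore the majority invariant, assuming the Dlib and Brick sketches above. The slide only states that differing timestamps trigger repair; the exact repair step below is an assumption.

```python
def get_with_repair(dlib, key):
    """Read from a majority; if timestamps disagree, write back the newest value."""
    replies = [b.read(key) for b in dlib.bricks[: dlib.majority]]
    replies = [r for r in replies if r is not None]
    if not replies:
        return None
    ts, value = max(replies, key=lambda r: r[0])
    if any(r[0] != ts for r in replies):
        # Partial write detected: reinstate the newest value on all bricks so a
        # majority agrees on it again.
        for b in dlib.bricks:
            b.write(key, value, ts)
    return value
```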
Considering consistency (continued)
[Timeline diagram: x = 0 initially; Dl1's write(1) is interrupted, and Dl2's next read detects it and writes 1 back to bricks B1, B2, B3]
• A write-in-progress cookie can be used to detect partial writes and commit/abort them on the next read
• Result: an individual client's view of DStore is consistent with that of a single centralized server (Bayou)
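One possible encoding of the write-in-progress cookie, layered on the sketches above. The slide does not specify how the cookie is stored or how commit versus abort is decided; the key prefix, the three-step write, and the resolve-by-repair rule here are all assumptions.

```python
COOKIE = "__wip__:"   # hypothetical key prefix for cookies

def put_with_cookie(dlib, key, value):
    dlib.put(COOKIE + key, "in-progress")   # 1. announce that a write is starting
    dlib.put(key, value)                    # 2. single-phase quorum write
    dlib.put(COOKIE + key, None)            # 3. clear the cookie on success

def get_with_cookie(dlib, key):
    if dlib.get(COOKIE + key) == "in-progress":
        # A write was interrupted: read-repair commits it (an abort would instead
        # restore the old value), then the cookie is cleared.
        value = get_with_repair(dlib, key)
        dlib.put(COOKIE + key, None)
        return value
    return dlib.get(key)
```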
Benchmark: Free recovery
[Throughput graphs: worst-case behavior (100% cache hit rate) and expected behavior (85% cache hit rate), annotated where a brick is killed and where it recovers]
→ Recovery is fast and non-intrusive
Benchmark: Automatic failure detection
[Graphs: modest policy (anomaly threshold = 8) vs. aggressive policy (anomaly threshold = 5), each under a fail-stutter fault]
• Fail-stutter: detected by Pinpoint
• False positives: low cost
Online repartitioning
[Diagram: keys labeled by partition prefixes 0 and 1; one partition's data is split between the old brick and a new brick]
• Take the brick offline
• Copy data to the new brick
• Bring both bricks online
→ Appears as if the brick just failed and recovered
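A sketch of the three steps, assuming keys map to bricks by a binary prefix of a stable hash (suggested by the 0/1 partition labels on the slide). The routing table, the helper names, and the choice to copy only the half that moves are assumptions, and this sketch uses per-partition routing rather than the all-bricks quorum of the earlier Dlib sketch.

```python
import hashlib

def key_bits(key):
    # Hypothetical helper: fixed-width binary prefix of a stable hash of the key.
    digest = hashlib.md5(str(key).encode()).digest()
    return format(digest[0], "08b")

def split_partition(routing, prefix, old_brick, new_brick):
    """Split partition `prefix`: keys under prefix+'1' move to new_brick."""
    old_brick.online = False                      # 1. take the brick offline
    for key, (ts, value) in list(old_brick.index.items()):
        if key_bits(key).startswith(prefix + "1"):
            new_brick.write(key, value, ts)       # 2. copy data to the new brick
    routing[prefix + "0"] = old_brick             # 3. bring both bricks online
    routing[prefix + "1"] = new_brick             #    under the updated routing
    old_brick.online = True
    new_brick.online = True
```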
Benchmark: Automatic online repartitioning
[Graphs: evenly-distributed load (scaling from 3 to 6 bricks) and a hotspot in the 01 partition (scaling from 6 to 12 bricks), each compared against a naive brick-selection policy]
• Repartitioning: non-intrusive
• Brick selection: effective
Next up for free recovery
• Perform online checkpoints
  • Take the checkpointing brick offline
  • Just like failure + recovery
• See if free recovery can simplify online data reconstruction after hard failures
• Any other state management challenges you can think of?
Summary

DStore = Decoupled Storage
• Quorums [spatial decoupling] + single-phase ops [temporal decoupling] → free recovery

Failure handling: fast, non-intrusive
• Mechanism: simple reboot
• Policy: aggressively reboot anomalous bricks
• Gain: fast, non-intrusive recovery
• Cost: extra overprovisioning

System evolution: "plug-and-play"
• Mechanism: automatic, online repartitioning
• Policy: dynamically add and remove nodes based on predicted load
• Gain: any brick can fail at any time
• Cost: temporarily violates the "majority" invariant

Managed like a stateless Web farm
DStore: an easy-to-manage cluster-based persistent hash table for Internet services
andy.huang@stanford.edu