Decoupled Storage: “Free the Replicas!” Andy Huang and Armando Fox Stanford University
What is decoupled storage (DeStor)? • Goal: application-level persistent storage system for Internet services • Good recovery behavior • Predictable performance • Related projects • Decoupled version of DDS (Gribble) • Federated Array of Bricks (HP Labs) at the application level • Session State Server (Ling), but for persistent state
Outline • Dangers of coupling • Techniques for decoupling • Consequences
ROWA – coupling and recovery don’t mix • Read One (i.e., any) • All copies must be consistent • Availability coupling: data locked during recovery to bring a replica up to date • Write All • Writes proceed at the rate of the slowest replica • Performance coupling: system can grind to a halt if one replica degrades • Possible causes of degradation: cache warming and garbage collection
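To make the performance coupling concrete, here is a minimal sketch (not from the talk) of a ROWA write path; the replica objects and their write(key, value) method are hypothetical.

```python
# Minimal sketch (assumption, not the authors' code) of a ROWA write path.
# `replicas` is a hypothetical list of objects exposing write(key, value).
from concurrent.futures import ThreadPoolExecutor

def rowa_write(replicas, key, value):
    """Write-all: succeeds only when every replica acknowledges."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r.write, key, value) for r in replicas]
        # Waiting on *all* futures couples the write to the slowest replica:
        # one node warming its cache or stuck in GC stalls the whole write.
        acks = [f.result() for f in futures]
    return all(acks)
```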
Decoupled ROWA – allow replicas to say “No” • Write All (but replicas can say “No”) • Performance coupling: write can complete without waiting for a degraded replica • Availability coupling: allowing stale values eliminates the need for locked data during recovery • Issue: read may return a stale value • Read One (but read all timestamps) • Replicas can say “No” to a read_timestamp request • Use quorums to make sure enough replicas say “Yes”
Quorums – use up-to-date information • Quorums 101 • Perform reads and writes on a majority of the replicas • Use timestamps to determine the correct value of a read • Performance coupling • Problem: requests distributed using static information • Consequence: one degraded node can slow down over 50% of writes • Load-balanced quorums • Use current load information to select quorum participants
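A minimal sketch of the load-balanced quorum idea, assuming each replica exposes a hypothetical current_load() estimate; with static quorums the participants would be fixed in advance instead.

```python
def pick_quorum(replicas, quorum_size):
    """Choose the least-loaded majority of replicas as quorum participants.

    Using current load (rather than static assignment) keeps a single
    degraded node from sitting on the critical path of most writes.
    """
    # current_load() is an assumed per-replica load estimate (queue length,
    # recent latency, etc.); any cheap, recent signal would do.
    by_load = sorted(replicas, key=lambda r: r.current_load())
    return by_load[:quorum_size]
```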
DeStor – two ways to look at it • Decoupled ROWA • “Write all” is best-effort, but write to at least a majority • Read majority of timestamps to check staleness • Load-balanced quorums (w/ read optimization) • Use dynamic load information • Read one value and majority of timestamps
DeStor write • Issue write(key,val) to N replicas • Wait for a majority to ack before returning success • Else, timeout and retry or return fail • [Figure: client C issues write v.7 to replicas R1–R4; success is returned once a majority holds v.7, even though one replica still holds v.6]
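The write path above, sketched in Python under the same hypothetical replica interface; the threading and timeout details are assumptions for illustration, not the authors' implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def destor_write(replicas, key, value, timestamp, timeout=1.0):
    """Best-effort write-all: return success as soon as a majority acks."""
    majority = len(replicas) // 2 + 1
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    pending = {pool.submit(r.write, key, value, timestamp) for r in replicas}
    acks, deadline = 0, time.time() + timeout
    while pending and acks < majority and time.time() < deadline:
        done, pending = wait(pending, timeout=deadline - time.time(),
                             return_when=FIRST_COMPLETED)
        # A degraded replica can say "No" (or simply be slow) without
        # stalling the write; only a majority has to say "Yes".
        acks += sum(1 for f in done if not f.exception() and f.result())
    pool.shutdown(wait=False)  # don't block on stragglers
    return acks >= majority    # otherwise the caller retries or reports failure
```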
DeStor read • Issue {v,tv}=read(key) to a random replica • Issue get_timestamp(key) to N replicas • Find the most recent timestamp t* in T={t1,t2,…} • If tv=t*, return v • Else, issue read(key) to a replica with tn=t* • [Figure: client C reads the value from one replica and timestamps from the rest; the stale replica (v.6) is detected and the up-to-date value v.7 is returned]
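And the read path as a hedged sketch; read() and get_timestamp() mirror the slide's operations, but the replica interface is assumed, and for brevity this queries a fixed majority for timestamps rather than all N with a majority of answers.

```python
import random

def destor_read(replicas, key):
    """Read one value plus a majority of timestamps; re-read if stale."""
    majority = len(replicas) // 2 + 1
    # 1. Read the value (and its timestamp) from one randomly chosen replica.
    first = random.choice(replicas)
    value, t_value = first.read(key)
    # 2. Collect timestamps from a majority of replicas.
    stamps = {r: r.get_timestamp(key) for r in replicas[:majority]}
    t_star = max(stamps.values())
    # 3. If the value we already have is the freshest, we are done.
    if t_value == t_star:
        return value
    # 4. Otherwise fetch the value from a replica that holds t*.
    fresh = next(r for r, t in stamps.items() if t == t_star)
    value, _ = fresh.read(key)
    return value
```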
Decoupling further – unlock the data • Client-generated physical timestamps • API: single-operation transactions with no partial updates • Assumption: clients operate independently • 2-phase commit ensures atomicity among replicas, but couples replicas between phases • Locking complicates the implementation and recovery • 2PC not needed for DeStor? • [Figure: clients C1 and C2 issue independent writes (x=1, x=2) to replicas R1–R4 without locks]
Client failure – what can happen w/o locks • Issue: a failed client may have written fewer than a majority of replicas • A later read may then return v.6 (e.g., timestamps from R2 and R3) or v.7 (e.g., R1 plus R2 or R3) • Serializability: once v.7 is read, make sure it becomes the majority value • Idea: the write of v.7 “didn’t happen” until it was read • [Figure: C1 fails after writing v.7 to only one replica; when C2 later reads v.7, it is propagated to a majority]
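One way to realize the "didn't happen until it was read" idea is a read-time write-back; the sketch below is an assumption about how that could look, not the authors' design.

```python
def promote_on_read(replicas, key, value, timestamp):
    """After reading a value held by fewer than a majority of replicas,
    write it back until a majority holds it, so that later reads cannot
    go backwards to the older version."""
    majority = len(replicas) // 2 + 1
    holders = [r for r in replicas if r.get_timestamp(key) >= timestamp]
    for r in replicas:
        if len(holders) >= majority:
            break
        if r not in holders and r.write(key, value, timestamp):
            holders.append(r)
    return len(holders) >= majority
```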
Timestamps – loose synchronization is sufficient • Unsynchronized clocks • Issue: client’s writes are “lost” because other writers’ timestamps are always more recent • Why that’s okay: clients are independent, so they can’t differentiate a “lost write” from an overwritten value • Caveat: a user is often behind the client requests • User sees inter-request causality • NTP synchronizes clocks within milliseconds, which is sufficient for human-speed interactions
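A minimal sketch of client-generated physical timestamps; the (time, client_id) tiebreak is an assumption added to make timestamps from independent clients unique and comparable.

```python
import time

CLIENT_ID = 42  # hypothetical unique id assigned to this client

def make_timestamp():
    """Client-generated physical timestamp: NTP-synchronized wall-clock
    time (millisecond precision is plenty for human-speed interactions),
    with the client id as a tiebreaker so two clients never collide."""
    return (int(time.time() * 1000), CLIENT_ID)
```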
Consequence – behavior is more restricted • Good recovery behavior • Data available throughout crash and recovery • Performance degradation during cache warming doesn’t affect other replicas • Predictable performance • DeStor vs. ROWA: DeStor has better write throughput and latency at the cost of read throughput and latency • Key: better degradation characteristics → more predictable performance
Performance: predictable write throughput (Twrite) • T1 = throughput of a single replica • D1 = % degradation of one replica • D = % system degradation = [−slope/Tmax]·D1 • ROWA: Tmax = T1, slope = −T1, so D = D1 • DeStor: Tmax = (N/Q)·T1 with T1 ≤ Tmax ≤ 2T1, slope = −T1/Q = −2T1/(N+1) (Q = majority quorum size = (N+1)/2), so D = D1/N • [Plot: write throughput vs. D1 for N = 3, 5, 7]
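As a worked check of the formulas above (my arithmetic, not a number from the talk), take N = 3 replicas with a majority quorum Q = 2:

```latex
\[
N = 3,\qquad Q = \tfrac{N+1}{2} = 2
\]
\[
\text{ROWA:}\quad T_{\max} = T_1,\qquad D = D_1
\]
\[
\text{DeStor:}\quad T_{\max} = \tfrac{N}{Q}\,T_1 = 1.5\,T_1,\qquad
\text{slope} = -\tfrac{T_1}{Q} = -0.5\,T_1,\qquad
D = \frac{-\text{slope}}{T_{\max}}\,D_1 = \frac{0.5\,T_1}{1.5\,T_1}\,D_1 = \frac{D_1}{3}
\]
```

So with three replicas, one replica degrading by D1 costs the system only D1/3 of its write throughput, versus the full D1 under ROWA.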
Performance: slightly degraded read throughput (Tread) • ROWA: Tmax = N·T1, slope = −T1, so D = D1/N • DeStor: depends on the overhead of the read_timestamp request • Tmax = N·T1 − (N/Q)·[overhead], slope = −T1 − (T1/Q)·[overhead], so D ≈ D1/N • [Plot: read throughput vs. D1]
Research issues – once replicas are free… • Next step: simulate ROWA and DeStor • Measure: read and write throughput/latency • Factors: object size, working set size, read-write mix • Opens up new options for system administration • Online repartitioning, scaling, and replica replacement • Raises new issues for performance optimizations • When is in-memory replication persistent enough? (non-write-through replicas)
Summary • Application-level persistent storage system • Replication scheme • Write all, wait for majority • Read any, read majority of timestamps • Consequences • Data availability throughout recovery • Predictable performance when replicas degrade