
Decoupled Storage: “Free the Replicas!”

Andy Huang and Armando Fox, Stanford University


Presentation Transcript


  1. Decoupled Storage: “Free the Replicas!” Andy Huang and Armando Fox Stanford University

  2. What is decoupled storage (DeStor)? • Goal: application-level persistent storage system for Internet services • Good recovery behavior • Predictable performance • Related projects • Decoupled version of DDS (Gribble) • Federated Array of Bricks (HP Labs) at the application level • Session State Server (Ling), but for persistent state

  3. Outline • Dangers of coupling • Techniques for decoupling • Consequences

  4. ROWA – coupling and recovery don’t mix • Read One (i.e., any) • All copies must be consistent • Availability coupling: data is locked during recovery to bring the replica up to date • Write All • Writes proceed at the rate of the slowest replica • Performance coupling: the system can grind to a halt if one replica degrades • Possible causes of degradation: cache warming and garbage collection

  5. Decoupled ROWA – allow replicas to say “No” • Write All (but replicas can say “No”) • Performance coupling: write can complete without waiting for a degraded replica • Availability coupling: allowing stale values eliminates the need for locked data during recovery • Issue: read may return a stale value • Read One (but read all timestamps) • Replicas can say “No” to a read_timestamp request • Use quorums to make sure enough replicas say “Yes”

  6. Quorums – use up-to-date information • Quorums 101 • Perform reads and writes on a majority of the replicas • Use timestamps to determine the correct value of a read • Performance coupling • Problem: requests distributed using static information • Consequence: one degraded node can slow down over 50% of writes • Load-balanced quorums • Use current load information to select quorum participants
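
The load-balanced quorum idea on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `pick_quorum` helper, the replica names, and the load metric (lower means less loaded) are all assumptions made for the example.

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def pick_quorum(load_by_replica: dict[str, float]) -> list[str]:
    """Pick the least-loaded majority of replicas.

    `load_by_replica` maps a replica name to its current load; the metric
    is hypothetical (for example, queue length), lower meaning less loaded.
    """
    q = majority(len(load_by_replica))
    # Sort by current load, breaking ties by name so the result is stable.
    ranked = sorted(load_by_replica, key=lambda name: (load_by_replica[name], name))
    return ranked[:q]


# R2 is degraded (for example, warming its cache), so it reports a high load.
loads = {"R1": 0.2, "R2": 0.9, "R3": 0.1, "R4": 0.3, "R5": 0.4}
quorum = pick_quorum(loads)
print(quorum)
```

With static assignment, a degraded replica such as R2 would sit in more than half of all possible majorities; choosing participants from current load keeps it out of the quorum while it is slow.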

  7. DeStor – two ways to look at it • Decoupled ROWA • “Write all” is best-effort, but write to at least a majority • Read majority of timestamps to check staleness • Load-balanced quorums (w/ read optimization) • Use dynamic load information • Read one value and majority of timestamps

  8. DeStor write • Issue write(key,val) to N replicas • Wait for a majority to ack before returning success • Else, time out and retry or return fail • [Figure: client C issues write v.7 to replicas R1–R4, which hold older versions (v.5, v.6); C reports success once three replicas hold v.7, while one still holds v.6]
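
The write path on this slide can be sketched as below. This is an illustrative model under stated assumptions: the `Replica` class and `destor_write` function are hypothetical names, replicas are in-memory objects, and the calls are sequential, whereas a real client would issue them in parallel and apply a timeout.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """In-memory stand-in for a storage replica (hypothetical)."""
    name: str
    available: bool = True  # False models a replica that says "No"
    store: dict = field(default_factory=dict)

    def write(self, key, value, ts) -> bool:
        if not self.available:
            return False
        current = self.store.get(key)
        # Keep only the newest version; an older write is acked but ignored.
        if current is None or ts > current[1]:
            self.store[key] = (value, ts)
        return True


def destor_write(replicas, key, value, ts) -> bool:
    """Send the write to all N replicas; succeed if a majority acks."""
    acks = sum(1 for r in replicas if r.write(key, value, ts))
    return acks >= len(replicas) // 2 + 1


replicas = [Replica("R1"), Replica("R2"), Replica("R3", available=False), Replica("R4")]
ok = destor_write(replicas, "x", "v7", ts=7)
print(ok)  # three of four replicas acked
```

The degraded replica R3 never blocks the write; it simply keeps its old state until a later write or repair reaches it.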

  9. DeStor read • Issue {v,tv}=read(key) to a random replica • Issue get_timestamp(key) to N replicas • Find the most recent timestamp t* ∈ T={t1,t2,…} • If tv=t*, return v • Else, issue read(key) to a replica with tn=t* • [Figure: client C issues a read to one replica and timestamp reads to R1–R4 (holding v.7, v.7, v.6, v.7); the replicas reply with a value and their version timestamps]
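
The read steps above can be sketched as follows. The sketch makes several assumptions that are not in the slides: each replica is a plain dict mapping key to (value, timestamp), `None` stands for a replica that says "No", a replica missing the key is treated as not answering, and `destor_read` is a hypothetical name.

```python
import random


def majority(n: int) -> int:
    return n // 2 + 1


def destor_read(replicas, key, rng=random):
    """Read one value plus a majority of timestamps; re-read if the value is stale."""
    n = len(replicas)
    # get_timestamp(key) to all N replicas; keep the answers that arrive.
    stamps = {
        i: r[key][1]
        for i, r in enumerate(replicas)
        if r is not None and key in r
    }
    if len(stamps) < majority(n):
        raise TimeoutError("fewer than a majority of replicas answered")
    # read(key) from one random replica among those that answered.
    first = rng.choice(sorted(stamps))
    value, tv = replicas[first][key]
    newest = max(stamps.values())
    if tv == newest:
        return value
    # Stale value: re-read from a replica that reported the newest timestamp.
    fresh = next(i for i, t in stamps.items() if t == newest)
    return replicas[fresh][key][0]


replicas = [{"x": ("v7", 7)}, {"x": ("v7", 7)}, {"x": ("v6", 6)}, {"x": ("v7", 7)}]
print(destor_read(replicas, "x"))
```

Even when the random first read lands on the stale replica (the third one here), the timestamp comparison detects it and the client fetches the newer value.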

  10. Decoupling further – unlock the data • [Figure: clients C1 and C2 write concurrently (x=1, x=2) through DeStor to replicas R1–R4 without locks] • Client-generated physical timestamps • API: single-operation transactions with no partial updates • Assumption: clients operate independently • 2-phase commit – ensures atomicity among replicas • Couples replicas between phases • Locking complicates the implementation and recovery • 2PC not needed for DeStor?

  11. Client failure – what can happen w/o locks • [Figure: after C1 fails mid-write, R1 holds v.7 and R2, R3, R4 hold v.6; after C2, R1, R2, and R4 hold v.7 and R3 holds v.6] • Issue • Fewer than a majority are written • R2 and R3 → v.6 • R1 and R2/R3 → v.7 • Serializability • Once v.7 is read, make sure it is the majority • Idea: write v.7 didn’t happen until it was read
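
One way to read "once v.7 is read, make sure it is the majority" is a repair step on the read path. The sketch below is an interpretation, not a mechanism confirmed by the slides: `read_with_repair` is a hypothetical name, replicas are plain dicts mapping key to (value, timestamp), and the key is assumed to exist on at least one replica.

```python
def read_with_repair(replicas, key):
    """Return the newest value, first copying it to stale replicas until a majority hold it."""
    quorum = len(replicas) // 2 + 1
    newest_value, newest_ts = max(
        (r[key] for r in replicas if key in r), key=lambda pair: pair[1]
    )
    holders = sum(1 for r in replicas if r.get(key, (None, None))[1] == newest_ts)
    for r in replicas:
        if holders >= quorum:
            break
        if r.get(key, (None, None))[1] != newest_ts:
            # Push the newest version to a stale replica.
            r[key] = (newest_value, newest_ts)
            holders += 1
    return newest_value


# State left behind by a client that failed after writing only R1.
replicas = [{"x": ("v7", 7)}, {"x": ("v6", 6)}, {"x": ("v6", 6)}, {"x": ("v6", 6)}]
value = read_with_repair(replicas, "x")
holders_after = sum(1 for r in replicas if r["x"][1] == 7)
print(value, holders_after)
```

After the repair, any later majority read must see v.7, which matches the slide's framing that the partial write takes effect at the moment it is first read.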

  12. Timestamps – loose synchronization is sufficient • Unsynchronized clocks • Issue: client’s writes are “lost” because other writers’ timestamps are always more recent • Why that’s okay: clients are independent, so they can’t differentiate a “lost write” from an overwritten value • Caveat: a user is often behind the client requests • User sees inter-request causality • NTP synchronizes clocks within milliseconds, which is sufficient for human-speed interactions
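
The "lost write" effect of unsynchronized clocks can be illustrated with a last-writer-wins store. This is a hedged sketch: `apply_write` and `client_clock` are hypothetical helpers, and the skew values are invented for the example.

```python
def apply_write(store, key, value, ts):
    """Last-writer-wins: keep the write only if its timestamp is newer."""
    current = store.get(key)
    if current is not None and ts <= current[1]:
        return False  # to the writer this looks the same as being overwritten
    store[key] = (value, ts)
    return True


def client_clock(true_time, skew):
    """Client-generated physical timestamp: true time plus this client's clock error."""
    return true_time + skew


store = {}
# Client B's clock is accurate; client A's clock runs 5 seconds slow.
b_applied = apply_write(store, "x", "from B", client_clock(100.0, 0.0))
a_applied = apply_write(store, "x", "from A", client_clock(101.0, -5.0))
print(b_applied, a_applied, store["x"])

# With millisecond-level skew, a write issued one second later still wins.
ntp_applied = apply_write(store, "x", "from A", client_clock(101.0, -0.005))
print(ntp_applied, store["x"])
```

Client A's first write arrives later in real time but carries an older timestamp, so it is dropped; with clocks synchronized to within milliseconds, writes separated by human-scale delays are ordered as the user expects.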

  13. Consequence – behavior is more restricted • Good recovery behavior • Data available throughout crash and recovery • Performance degradation during cache warming doesn’t affect other replicas • Predictable performance • DeStor vs. ROWA: DeStor has better write throughput and latency at the cost of read throughput and latency • Key: better degradation characteristics → more predictable performance

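  14. Performance: predictable $T_{write}$ • $T_1$ = throughput of a single replica • $D_1$ = % degradation of one replica • $D$ = % system degradation $= [-\text{slope}/T_{max}]\,D_1$ • ROWA: $T_{max} = T_1$, $\text{slope} = -T_1$, $D = D_1$ • DeStor: $T_{max} = (N/Q)\,T_1$ with $T_1 \le T_{max} \le 2T_1$, $\text{slope} = -T_1/Q = -2T_1/(N+1)$, $D = D_1/N$ • [Plot: write throughput $T$ versus $D_1$ from 0 to 1, with lines for N=3, N=5, and N=7]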
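
The write-throughput formulas on this slide can be checked numerically. The sketch assumes throughput is linear in $D_1$, that is $T = T_{max} + \text{slope} \cdot D_1$, which the slide's plot suggests but does not state, and it takes $Q = (N+1)/2$ for odd $N$ as implied by the slope formula. The function names are made up for the example.

```python
def rowa_write_throughput(d1, t1=1.0):
    """ROWA: T_max = T1 and slope = -T1, so T = T1 * (1 - D1)."""
    return t1 - t1 * d1


def destor_write_params(n, t1=1.0):
    """DeStor: T_max = (N/Q) * T1 and slope = -T1/Q, with Q = (N+1)/2."""
    q = (n + 1) / 2
    return (n / q) * t1, -t1 / q


def system_degradation(t_max, slope, d1):
    """D = [-slope / T_max] * D1."""
    return (-slope / t_max) * d1


d1 = 0.5  # one replica runs at half speed
for n in (3, 5, 7):
    t_max, slope = destor_write_params(n)
    t = t_max + slope * d1
    d = system_degradation(t_max, slope, d1)
    print(f"N={n}: T_max={t_max:.3f} T={t:.3f} D={d:.3f} (D1/N={d1 / n:.3f})")

print(f"ROWA: T={rowa_write_throughput(d1):.3f} D={system_degradation(1.0, -1.0, d1):.3f}")
```

For every odd N the computed system degradation equals $D_1/N$, while ROWA passes the full $D_1$ through to the system.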

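  15. Performance: slightly degraded $T_{read}$ • ROWA: $T_{max} = N T_1$, $\text{slope} = -T_1$, $D = D_1/N$ • DeStor: depends on the overhead of the read_timestamp request; $T_{max} = N T_1 - (N/Q)[\text{overhead}]$, $\text{slope} = -T_1 - (T_1/Q)[\text{overhead}]$, $D \approx D_1/N$ • [Plot: read throughput $T$ versus $D_1$ from 0 to 1, with y-axis marks at $T_1$, $T_2$, $(N-1)T_1$, and $N T_1$]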

  16. Research issues – once replicas are free… • Next step: simulate ROWA and DeStor • Measure: read and write throughput/latency • Factors: object size, working set size, read-write mix • Opens up new options for system administration • Online repartitioning, scaling, and replica replacement • Raises new issues for performance optimizations • When in-memory replication is persistent enough (non-write-through replicas)

  17. Summary • Application-level persistent storage system • Replication scheme • Write all, wait for majority • Read any, read majority of timestamps • Consequences • Data availability throughout recovery • Predictable performance when replicas degrade
