
Robustness in the Salus scalable block store


Presentation Transcript


  1. Robustness in the Salus scalable block store. Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin (University of Texas at Austin)

  2. Storage: do not lose data
     • Past: fsck, IRON, ZFS, …
     • Now & future: are they robust enough?

  3. Salus: providing robustness to scalable systems
     • Strong protections handle arbitrary failures (BFT, end-to-end verification, …).
     • Scalable systems handle crash failures plus certain disk corruptions (GFS/Bigtable, HDFS/HBase, WAS, FDS, …).
     • Salus sits between the two, bringing strong protections to scalable systems.

  4. Start from "scalable": begin with a scalable system and make it robust.

  5. Scalable architecture
     • Clients contact the metadata server infrequently (metadata transfer) and move data to the data servers in parallel (data transfer).
     • Computation nodes are stateless and handle data forwarding, garbage collection, etc.; examples: tablet server (Bigtable), region server (HBase), stream manager (WAS), …
     • Data is replicated for durability and availability.
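  The sketch below illustrates the data path this slide describes: the client contacts the metadata server only to locate a block's replicas, then transfers the data to the data servers in parallel. It is a simplified, single-process illustration; the class names (MetadataServer, DataServer) and the three-way placement in the example are assumptions, not details from the talk.

```python
# Hypothetical sketch of the scalable architecture's data path: infrequent metadata
# transfer, parallel data transfer, and replication for durability/availability.
from concurrent.futures import ThreadPoolExecutor


class MetadataServer:
    """Maps each block to the data servers that replicate it."""
    def __init__(self, placement):
        self.placement = placement            # block_id -> list of data servers

    def locate(self, block_id):
        return self.placement[block_id]


class DataServer:
    """Stores replicated block data."""
    def __init__(self):
        self.store = {}

    def put(self, block_id, data):
        self.store[block_id] = data


def client_write(metadata, block_id, data):
    replicas = metadata.locate(block_id)      # infrequent metadata transfer
    # Parallel data transfer to every replica.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda ds: ds.put(block_id, data), replicas))


# Example: three-way replication of one block.
servers = [DataServer() for _ in range(3)]
meta = MetadataServer({7: servers})
client_write(meta, 7, b"payload")
```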

  6. Challenges
     • Parallelism vs. consistency
     • Prevent computation nodes from corrupting data
     • Read safely from one node
     • Do NOT increase replication cost

  7. Parallelism vs. consistency
     [Diagram: clients, metadata server, data servers; the client issues Write 1, a barrier, and Write 2.]
     Disk-like systems require barrier semantics.

  8. Prevent computation nodes from corrupting data
     [Diagram: clients, metadata server, data servers; the computation node is a single point of failure.]

  9. Read safely from one node
     [Diagram: clients, metadata server, data servers; the node serving a read is a single point of failure.]
     Does a simple checksum work? It cannot solve all the problems.

  10. Outline • Challenges • Salus • Pipelined commit • Active storage • Scalable end-to-end checks • Evaluation

  11. Salus’ overview
     • Disk-like interface (as in Amazon EBS):
       • Volume: a fixed number of fixed-size blocks
       • Single client per volume, with barrier semantics
     • Failure model: arbitrary failures, but no forged signatures
     • Key ideas:
       • Pipelined commit: write in parallel and in order
       • Active storage: safe writes
       • Scalable end-to-end checks: safe reads
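  As a rough illustration of the disk-like interface on this slide, the snippet below models a single-writer volume of fixed-size blocks whose writes can be flagged as barriers. The class name, constants, and buffering scheme are hypothetical; Salus' actual interface (modeled on Amazon EBS) is not spelled out in the talk.

```python
# Minimal sketch of a single-writer, fixed-size-block volume with barrier semantics.
# All names and sizes here are illustrative assumptions.

BLOCK_SIZE = 4096      # fixed-size blocks
NUM_BLOCKS = 1024      # a volume is a fixed number of blocks


class Volume:
    """Single-writer volume: a barrier write is applied only after all earlier writes."""

    def __init__(self):
        self.durable = [bytes(BLOCK_SIZE)] * NUM_BLOCKS
        self.pending = []                      # issued but not yet durable writes

    def write(self, block_id, data, barrier=False):
        assert len(data) == BLOCK_SIZE
        if barrier:
            self.flush()                       # all earlier writes become durable first
        self.pending.append((block_id, data))

    def read(self, block_id):
        # Return the most recent value, pending or durable.
        for bid, data in reversed(self.pending):
            if bid == block_id:
                return data
        return self.durable[block_id]

    def flush(self):
        # Stand-in for making pending writes durable, in issue order.
        for block_id, data in self.pending:
            self.durable[block_id] = data
        self.pending = []
```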

  12. Salus’ key ideas
     [Diagram: clients, metadata server, data servers; end-to-end verification at the client, pipelined commit along the write path, active storage at the servers.]

  13. Outline • Challenges • Salus • Pipelined commit • Active storage • Scalable end-to-end checks • Evaluation

  14. Salus: pipelined commit
     • Goal: barrier semantics
       • A request can be marked as a barrier.
       • All previous requests must be executed before it.
     • Naïve solution: the client blocks at each barrier, losing parallelism.
     • Salus uses a weaker version of distributed transactions, based on the well-known two-phase commit (2PC) protocol.

  15. Pipelined commit – 2PC: prepare phase
     [Diagram: the client sends batch i (writes 1-3) and batch i+1 (writes 4-5) to their leaders; each batch's servers report "prepared" while batch i-1 has already committed.]

  16. Pipelined commit – 2PC: commit phase
     [Diagram: after batch i-1 commits, the leader of batch i commits writes 1-3, then the leader of batch i+1 commits writes 4-5; data transfer is parallel, but commits are sequential.]

  17. Pipelined commit: challenges
     • Key challenge: recovery. Ordering across transactions is easy when failure-free but complicates recovery.
     • Is 2PC slow?
       • Locking? Single writer, so no locking.
       • Phase 2 network overhead? Small compared to phase 1.
       • Phase 2 disk overhead? Eliminated; the complexity is pushed to recovery.
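  The sketch below summarizes the pipelined-commit flow of slides 14-17: batches are prepared on their servers in parallel, but commit strictly in batch order, preserving barrier semantics without blocking the client. It is a single-process toy; the Server class, batch representation, and leader role are simplifications, and recovery (the part slide 17 calls the key challenge) is omitted.

```python
# Toy pipelined commit: parallel prepares, sequential commits in batch order.

class Server:
    def __init__(self):
        self.prepared, self.committed = {}, {}

    def prepare(self, batch_id, writes):
        # Phase 1: record the writes durably without exposing them.
        self.prepared[batch_id] = writes
        return True                            # "prepared" vote

    def commit(self, batch_id):
        # Phase 2: expose the previously prepared writes.
        self.committed[batch_id] = self.prepared.pop(batch_id)


def pipelined_commit(batches):
    """batches: list of (servers, writes) in issue order."""
    # Phase 1 can run for all batches in parallel (done sequentially here for clarity).
    for batch_id, (servers, writes) in enumerate(batches):
        assert all(s.prepare(batch_id, writes) for s in servers)
    # Phase 2: batch i commits only after batch i-1 has committed, so committed
    # state never reorders across a barrier.
    for batch_id, (servers, _) in enumerate(batches):
        for s in servers:
            s.commit(batch_id)


# Example: two batches over the same pair of servers.
servers = [Server(), Server()]
pipelined_commit([(servers, [("w1", b"a")]), (servers, [("w2", b"b")])])
```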

  18. Outline • Challenges • Salus • Pipelined commit • Active storage • Scalable end-to-end checks • Evaluation

  19. Salus: active storage
     • Goal: a single node cannot corrupt data
     • Well-known solution: BFT replication; problem: higher replication cost
     • Salus’ solution: use f+1 computation nodes and require unanimous consent
     • What about availability/liveness?

  20. Active storage: unanimous consent
     [Diagram: computation nodes and storage nodes.]

  21. Active storage: unanimous consent
     • Unanimous consent: all updates must be agreed to by the f+1 computation nodes.
     • Additional benefit: collocating computation and storage saves network bandwidth.

  22. Active storage: restore availability
     • What if something goes wrong? We may not know which computation node is faulty.
     • Solution: replace all computation nodes.

  23. Active storage: restore availability (continued)
     • We can replace all computation nodes because they are stateless.
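  A toy version of the unanimous-consent rule from slides 19-23: an update reaches the storage nodes only if all f+1 computation nodes produce identical results, and any disagreement triggers replacement of the whole (stateless) computation-node set, simulated here by raising an error. Function and variable names are illustrative, not from the Salus implementation.

```python
# Toy unanimous consent with f+1 computation nodes.

F = 1                                    # tolerate up to f faulty computation nodes


def unanimous_update(computation_nodes, storage_nodes, request):
    assert len(computation_nodes) == F + 1
    results = [node(request) for node in computation_nodes]
    if len(set(results)) != 1:
        # We cannot tell which node is faulty, so the whole set must be replaced.
        raise RuntimeError("disagreement: replace all computation nodes")
    for store in storage_nodes:
        store.append(results[0])         # apply the agreed-upon update to every replica
    return results[0]


# Example: two computation nodes that agree, three storage replicas.
ok_nodes = [lambda req: ("put", req), lambda req: ("put", req)]
replicas = [[], [], []]
unanimous_update(ok_nodes, replicas, ("block7", b"data"))
```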

  24. Discussion
     • Does Salus achieve BFT with f+1 replication? No: a fork can happen under a combination of failures.
     • Is it worth adding f replicas to eliminate this case? No: a fork can still happen as a result of geo-replication.
     • How to support multiple writers? We do not know a perfect solution; one must give up either linearizability or scalability.

  25. Outline • Challenges • Salus • Pipelined commit • Active storage • Scalable end-to-end checks • Evaluation

  26-28. Evaluation
     [Evaluation figures from slides 26-28 are not reproduced in the transcript.]

  29. Outline (recap)
     • Salus’ overview
     • Pipelined commit: write in order and in parallel
     • Active storage: prevent a single node from corrupting data
     • Scalable end-to-end verification: read safely from one node
     • Evaluation

  30. Salus: scalable end-to-end verification
     • Goal: read safely from one node
       • The client should be able to verify the reply.
       • If the reply is corrupted, the client retries on another node.
     • Well-known solution: a Merkle tree; problem: scalability
     • Salus’ solution: exploit the single writer and distribute the tree among the servers

  31. Scalable end-to-end verification
     [Diagram: servers 1-4 each hold part of the tree; the client maintains the top tree.]
     • The client does not need to store anything persistently: it can rebuild the top tree from the servers.
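  The sketch below captures the idea of slides 30-31 for a single server's Merkle subtree: the client stores only the subtree root (its entry in the top tree), and a read is verified by recomputing that root from the returned block plus the sibling hashes the server supplies. The tree layout and helper names are assumptions, and for brevity it assumes a power-of-two number of blocks per server.

```python
# Toy Merkle-tree verification for one server's subtree; the client keeps only `root`.
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Levels of a Merkle tree over a power-of-two list of blocks (server side)."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels                     # levels[-1][0] is the subtree root


def proof(levels, index):
    """Sibling hashes from leaf `index` up to the subtree root (returned by the server)."""
    siblings = []
    for level in levels[:-1]:
        siblings.append(level[index ^ 1])   # sibling position at this level
        index //= 2
    return siblings


def verify(block, index, siblings, root):
    """Client-side check: recompute the subtree root from the block and the proof."""
    acc = h(block)
    for sib in siblings:
        acc = h(acc + sib) if index % 2 == 0 else h(sib + acc)
        index //= 2
    return acc == root


# One server's subtree over four blocks; the client stores only `root` in its top tree.
blocks = [b"b0", b"b1", b"b2", b"b3"]
levels = build_tree(blocks)
root = levels[-1][0]
assert verify(blocks[2], 2, proof(levels, 2), root)
```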

  32. Scalable and robust storage: more hardware, more failures, more complex software.
