Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin
Storage: Do not lose data
• Past: Fsck, IRON, ZFS, …
• Now & Future: scalable systems
Salus: Provide robustness to scalable systems
• Strong protections tolerate arbitrary failures (BFT, end-to-end verification, …)
• Scalable systems tolerate only crash failures + certain disk corruption (GFS/Bigtable, HDFS/HBase, WAS, …)
• Salus aims to combine the two
Start from “Scalable”
[Figure: file systems, key-value stores, block stores, and databases plotted on robust vs. scalable axes]
• Borrow scalable designs and add robustness
Scalable architecture
[Figure: clients, a metadata server, and data servers (computing nodes in front of storage nodes)]
• Computing nodes: no persistent storage
• Parallel data transfer between clients and data servers
• Data is replicated across storage nodes for durability and availability
Problems
• No ordering guarantees for writes
• Single points of failure:
  • Computing nodes can corrupt data
  • Clients can accept corrupted data
1. No ordering guarantees for writes
[Figure: writes 1, 2, 3 issued by the client may reach the data servers out of order]
• A block store requires barrier semantics
2. Computing nodes can corrupt data
[Figure: a faulty computing node corrupts data on its way from the client to the data servers]
• The computing node is a single point of failure
3. Clients can accept corrupted data
[Figure: a corrupted reply is returned to the client]
• Does a simple disk checksum work? No: it cannot prevent errors in memory, etc.
• The serving node is a single point of failure
Salus: Overview
[Figure: block drivers (single driver per virtual disk), a metadata server, and replicated storage servers]
• Pipelined commit
• Active storage
• End-to-end verification
• Failure model: almost arbitrary, but no impersonation
Pipelined commit
• Goal: barrier semantics
  • A request can be marked as a barrier
  • All previous requests must be executed before it
• Naïve solution: wait at each barrier, which loses parallelism
• This looks similar to a distributed transaction
• Well-known solution: two-phase commit (2PC)
Problem of 2PC
[Figure: the driver issues requests 1–5, split at a barrier into batch i (requests 1–3) and batch i+1 (requests 4–5); both batches are prepared at their leaders in parallel]
Problem of 2PC
[Figure: batch i+1 commits while batch i is still committing]
• The problem can still happen: we need an ordering guarantee between different batches
Pipelined commit
[Figure: data transfer is parallel, but the leader of batch i commits only after receiving “batch i-1 committed”, and then notifies the leader of batch i+1]
• Parallel data transfer, sequential commit
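A minimal sketch of the pipelined-commit idea, in Python with hypothetical helpers (prepare_batch, commit_batch): every batch's data is prepared in parallel, but the leader of batch i commits only after it learns that batch i-1 has committed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def prepare_batch(batch):
    # Phase 1: transfer the batch's writes to its servers.
    # Runs in parallel for all batches (placeholder for real network/disk work).
    pass

def commit_batch(batch):
    # Phase 2: the batch's leader marks it committed.
    print(f"batch {batch['id']} committed")

def pipelined_commit(batches):
    # committed[i] is set once batch i-1 has committed; committed[0] stands in
    # for a virtual batch before the first one.
    committed = [threading.Event() for _ in range(len(batches) + 1)]
    committed[0].set()

    def run(i, batch):
        prepare_batch(batch)        # parallel data transfer
        committed[i].wait()         # sequential commit: wait for batch i-1
        commit_batch(batch)
        committed[i + 1].set()      # unblock batch i+1

    with ThreadPoolExecutor(max_workers=len(batches) or 1) as pool:
        for i, batch in enumerate(batches):
            pool.submit(run, i, batch)

pipelined_commit([{"id": i} for i in range(3)])
```

The point of the sketch is the dependency structure: only the small commit step is serialized, while the heavyweight data transfer overlaps across batches.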
Outline
• Challenges
• Salus
  • Pipelined commit
  • Active storage
  • Scalable end-to-end checks
• Evaluation
Active storage
• Goal: a single node cannot corrupt data
• Well-known solution: BFT replication
  • Problem: requires at least 2f+1 replicas
• Salus' goal: use f+1 computing nodes
  • f+1 is enough to guarantee safety
  • How about availability/liveness?
Active storage: unanimous consent
[Figure: a single computing node in front of the replicated storage nodes]
Active storage: unanimous consent
[Figure: f+1 computing nodes, each colocated with a storage node]
• Unanimous consent: all updates must be agreed on by the f+1 computing nodes
• Additional benefit: colocating computing and storage saves network bandwidth
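A minimal sketch of the unanimous-consent rule, assuming a hypothetical ComputingNode.certify method: an update is applied only if all f+1 computing nodes produce matching certificates for it; any disagreement blocks the write (and would trigger replacement of the computing nodes).

```python
from dataclasses import dataclass
import hashlib

@dataclass
class ComputingNode:
    node_id: int

    def certify(self, update: bytes) -> str:
        # Each replica processes the update independently and certifies the
        # result (a real system would sign this certificate).
        return hashlib.sha256(update).hexdigest()

def unanimous_write(nodes, update: bytes) -> bool:
    # Storage nodes apply an update only if all f+1 computing nodes agree on it.
    certs = {node.certify(update) for node in nodes}
    if len(certs) != 1:
        # Disagreement: some computing node is faulty. The write is rejected
        # and the whole computing-node set gets replaced (they are stateless).
        return False
    return True

f = 1
replicas = [ComputingNode(i) for i in range(f + 1)]
print(unanimous_write(replicas, b"write block 42"))   # True: all certificates match
```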
Active storage: restore availability
• What if something goes wrong?
  • Problem: we may not know which computing node is faulty
  • Solution: replace all computing nodes
  • This is safe because computing nodes have no persistent state
Discussion
• Does Salus achieve BFT with f+1 replication?
  • No: a fork can happen under certain combinations of failures
• Is it worth using 2f+1 replicas to eliminate this case?
  • No: a fork can still happen as a result of other events (e.g. geo-replication)
Outline
• Challenges
• Salus
  • Pipelined commit
  • Active storage
  • Scalable end-to-end checks
• Evaluation
Evaluation
• Is Salus robust against failures?
• Do the additional mechanisms hurt scalability?
Is Salus robust against failures?
• It always ensures barrier semantics
  • Experiment: crash the block driver
  • Result: requests after “holes” are discarded
• A single computing node cannot corrupt data
  • Experiment: inject errors into a computing node
  • Result: the storage nodes reject it and trigger replacement
• Block drivers never accept corrupted data
  • Experiment: inject errors into a computing/storage node
  • Result: the block driver detects the errors and retries other nodes
Conclusion
• Pipelined commit
  • Writes in parallel
  • Provides ordering guarantees
• Active storage
  • Eliminates single points of failure
  • Consumes less network bandwidth
• High robustness ≠ low performance
Salus: scalable end-to-end verification
• Goal: read safely from one node
  • The client should be able to verify the reply
  • If the reply is corrupted, the client retries another node
• Well-known solution: Merkle tree
  • Problem: scalability
• Salus' solution:
  • Single writer
  • Distribute the tree among the servers
Scalable end-to-end verification
[Figure: the Merkle tree is split between the client and servers 1–4; the client maintains the top of the tree, and each server maintains the subtree for its blocks]
• The client does not need to store anything persistently: it can rebuild the top tree from the servers
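A minimal sketch of the split Merkle tree under simplifying assumptions (the Server, Client, and subtree_root names are hypothetical, and each server's “subtree” is flattened to a running hash over its leaf hashes rather than a real binary tree): each server stores the hashes for its own blocks, the client keeps only one subtree root per server, and every read is checked against that top-level state, which can be rebuilt from the servers at any time.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class Server:
    """Holds one range of blocks plus the hashes the client needs for checks."""
    def __init__(self, blocks):
        self.blocks = list(blocks)

    def subtree_root(self) -> bytes:
        # Simplified "subtree": a running hash over the leaf hashes
        # (a real Merkle subtree would allow logarithmic-size proofs).
        root = b""
        for b in self.blocks:
            root = h(root + h(b))
        return root

    def read(self, i):
        # Return the block together with all leaf hashes as a proof.
        return self.blocks[i], [h(b) for b in self.blocks]

class Client:
    """Keeps only the top of the tree: one subtree root per server."""
    def __init__(self, servers):
        self.servers = servers
        # Nothing here needs to be persistent: the top can be rebuilt on demand.
        self.top = [s.subtree_root() for s in servers]

    def verified_read(self, server_idx, block_idx):
        block, leaf_hashes = self.servers[server_idx].read(block_idx)
        # Recompute the subtree root from the returned proof...
        root = b""
        for lh in leaf_hashes:
            root = h(root + lh)
        # ...and check the block against its leaf hash and the stored top.
        if h(block) != leaf_hashes[block_idx] or root != self.top[server_idx]:
            raise ValueError("corrupted reply; retry another replica")
        return block

servers = [Server([b"a", b"b"]), Server([b"c", b"d"])]
client = Client(servers)
print(client.verified_read(1, 0))   # b'c'
```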
Scalable and robust storage
• More hardware, more failures
• More complex software
Start from “Scalable”
[Figure: robust vs. scalable axes]
Pipelined commit: challenge
• Key challenge: recovery
  • Ordering across transactions is easy when failure-free, but complicates recovery
• Is 2PC slow?
  • Locking? There is a single writer, so no locking
  • Phase-2 network overhead? Small compared to phase 1
  • Phase-2 disk overhead? Eliminated, pushing the complexity to recovery
Scalable architecture
• Computation node:
  • No persistent storage
  • Data forwarding, garbage collection, etc.
  • Examples: tablet server (Bigtable), region server (HBase), stream manager (WAS), …
• Storage nodes
Salus: Overview
• Block store interface (as in Amazon EBS):
  • Volume: a fixed number of fixed-size blocks
  • Single block driver per volume: barrier semantics
• Failure model: arbitrary failures, but no identity forging
• Key ideas:
  • Pipelined commit: write in parallel and in order
  • Active storage: replicate computation nodes
  • Scalable end-to-end checks: the client verifies replies
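A minimal sketch of the block-store interface described above, with hypothetical names (Volume, BlockDriver): a volume is a fixed number of fixed-size blocks, and the single driver attached to it can tag a write as a barrier so that no later write becomes visible before everything issued up to the barrier.

```python
BLOCK_SIZE = 4096

class Volume:
    """A fixed number of fixed-size blocks, as in the Amazon EBS model."""
    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks

class BlockDriver:
    """The single driver attached to a volume; writes may carry a barrier."""
    def __init__(self, volume: Volume):
        self.volume = volume
        self.pending = []                       # issued but not yet applied

    def write(self, block_no: int, data: bytes, barrier: bool = False):
        assert len(data) == BLOCK_SIZE
        self.pending.append((block_no, data))
        if barrier:
            # Barrier: every write issued so far must be applied before any
            # write issued after this point may become visible.
            self.apply_pending()

    def apply_pending(self):
        for block_no, data in self.pending:
            self.volume.blocks[block_no] = data
        self.pending.clear()

    def read(self, block_no: int) -> bytes:
        self.apply_pending()
        return self.volume.blocks[block_no]

vol = Volume(num_blocks=8)
drv = BlockDriver(vol)
drv.write(3, b"x" * BLOCK_SIZE)                 # ordinary write
drv.write(0, b"p" * BLOCK_SIZE, barrier=True)   # barrier write
print(drv.read(3)[:1])                          # b'x'
```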
Complex recovery
• Still need to guarantee barrier semantics
• What if servers fail concurrently?
• What if different replicas diverge?
Support concurrent users
• No perfect solution yet:
  • Give up linearizability: causal or fork-join-causal consistency (Depot)
  • Give up scalability: SUNDR
How to handle storage node failures?
• Computing nodes generate certificates when writing to storage nodes
• If a storage node fails, the computing nodes agree on the current state
  • Approach: look for the longest valid prefix
  • After that, re-replicate the data
• As long as one correct storage node is available, there is no data loss
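A minimal sketch of the recovery rule, with a hypothetical certificate format (cert_for is just a hash stand-in for the signed certificate the computing nodes attach to each write): each surviving replica's log is trimmed to its longest gap-free, correctly certified prefix, and the replicas adopt the longest such prefix before re-replicating.

```python
import hashlib

def cert_for(seq_no: int, data: bytes) -> str:
    # Stand-in for the certificate the f+1 computing nodes attach to each write.
    return hashlib.sha256(str(seq_no).encode() + data).hexdigest()

def valid_prefix_length(log) -> int:
    """Length of the longest gap-free prefix of (seq_no, data, cert) entries
    whose certificates verify."""
    n = 0
    for expected, (seq_no, data, cert) in enumerate(log):
        if seq_no != expected or cert != cert_for(seq_no, data):
            break
        n += 1
    return n

def agree_on_state(replica_logs):
    # Surviving replicas adopt the longest valid prefix any of them holds;
    # data up to that point is then re-replicated onto fresh storage nodes.
    best = max(replica_logs, key=valid_prefix_length)
    return best[:valid_prefix_length(best)]

log_a = [(0, b"w0", cert_for(0, b"w0")), (1, b"w1", cert_for(1, b"w1"))]
log_b = [(0, b"w0", cert_for(0, b"w0")), (1, b"junk", "bad-cert")]
print(len(agree_on_state([log_a, log_b])))   # 2: log_a has the longer valid prefix
```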
Why barrier semantics?
• Goal: provide atomicity
• Example (file system): appending to a file modifies a data block and the corresponding pointer to that block
  • It is bad if the pointer is modified but the data is not
  • Solution: modify the data first and then the pointer, with a barrier attached to the pointer write
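A small worked example of why the barrier matters, using a hypothetical crash-recovery routine: writes are replayed in sequence order and everything after the first hole is discarded, so the barrier-tagged pointer write can never survive a crash that lost the data write before it.

```python
def recover(log):
    """Replay a write log after a crash.

    Each entry is (seq_no, block_no, data, barrier).  Because of barrier
    semantics, once a sequence-number hole is found every later entry is
    discarded, so a barrier-tagged write (the pointer update) can never
    survive without the data writes that preceded it.
    """
    applied = []
    expected = 0
    for seq_no, block_no, data, barrier in sorted(log):
        if seq_no != expected:        # hole: an earlier write was lost
            break
        applied.append((block_no, data))
        expected += 1
    return applied

# Append = write the data block (seq 0), then the pointer with a barrier (seq 1).
# Here the data write was lost in the crash, so the pointer write is dropped too.
crash_log = [(1, 7, b"pointer -> data block", True)]
print(recover(crash_log))             # []
```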
Cost of robustness
[Figure: cost vs. robustness; crash tolerance is the cheapest, Salus adds robustness at low cost, eliminating fork requires 2f+1 replication, and fully BFT systems require signatures and N-way programming]