Towards Reliable Cloud Storage. Haryadi Gunawi, UC Berkeley
Outline • Research background • FATE and DESTINI
Yesterday’s storage Storage Servers Local storage
Tomorrow’s storage • Local devices (e.g. a laptop) backed by cloud storage (e.g. Google)
Tomorrow’s storage • Internet-service FS: GoogleFS, HadoopFS, CloudStore, ... • Key-Value Store: Cassandra, Voldemort, ... • Structured Storage: Yahoo! PNUTS, Google BigTable, HBase, ... • Custom Storage: Facebook Haystack Photo Store, Microsoft StarTrack (map apps), Amazon S3, EBS, ... • + Replication, + Scale-up, + Migration, + ... Reliable? Highly available? • “This is not just data. It’s my life. And I would be sick if I lost it” [CNN ’10]
Headlines Cloudy with a chance of failure
Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Evaluation Joint work with: Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, Dhruba Borthakur
Failure recovery ... • Cloud • Thousands of commodity machines • “Rare (HW) failures become frequent” [Hamilton] • Failure recovery • “… has to come from the software” [Dean] • “… must be a first-class op” [Ramakrishnan et al.]
... hard to get right • Google Chubby (lock service) • Four occasions of data loss • Whole-system down • More problems after injecting multiple failures (randomly) • Google BigTable (key-value store) • Chubby bug affects BigTable availability • More details? • How often? Other problems and implications?
... hard to get right • HDFS JIRA study • 1300 issues, 4 years (April 2006 to July 2010) • Select recovery problems due to hardware failures • 91 recovery bugs/issues • Open-source cloud projects • Very valuable bug/issue repositories! • HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+)
Why? • Testing is not advanced enough [Google] • Failure model: multiple, diverse failures • Recovery is under-specified [Hamilton] • Lots of custom recovery • Implementation is complex • Need two advancements: • Exercise complex failure modes • Write specifications and test the implementation
Cloud testing • FATE (Failure Testing Service): injects failures (X1, X2) into the cloud software • DESTINI (Declarative Testing Specifications): checks whether the resulting behavior violates the specs
Outline • Research background • FATE and DESTINI • Motivation • FATE • Architecture • Failure exploration • DESTINI • Evaluation
HadoopFS (HDFS) write protocol: setup stage and data transfer • No failures: client C gets a pipeline from master M (Alloc Req), then streams data through Nodes 1, 2, 3 (Data Transfer) • Crash during data transfer (X1): recovery continues on the surviving nodes (1, 2) • Crash during setup (X2): recovery recreates a fresh pipeline (1, 2, 4)
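To make the two recovery behaviors on this slide concrete, here is a minimal, hedged Java sketch of the pipeline policies; it is illustrative only (class and method names such as PipelineRecovery and recover are invented, and the real HDFS code is far more involved).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two HDFS write-pipeline recovery behaviors
// described on the slide; not the actual HDFS implementation.
public class PipelineRecovery {
    enum Stage { SETUP, DATA_TRANSFER }

    static List<String> recover(Stage stage, List<String> pipeline,
                                String crashedNode, String freshNode) {
        List<String> recovered = new ArrayList<>(pipeline);
        recovered.remove(crashedNode);
        if (stage == Stage.SETUP) {
            recovered.add(freshNode);   // setup recovery: recreate a fresh pipeline
        }                               // data-transfer recovery: continue on survivors
        return recovered;
    }

    public static void main(String[] args) {
        List<String> pipe = List.of("Node1", "Node2", "Node3");
        System.out.println(recover(Stage.DATA_TRANSFER, pipe, "Node3", "Node4")); // [Node1, Node2]
        System.out.println(recover(Stage.SETUP, pipe, "Node3", "Node4"));         // [Node1, Node2, Node4]
    }
}
```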
Failures and FATE • Failures • Anytime: different stages, different recovery • Anywhere: N2 crash, and then N3 • Any type: bad disks, partitioned nodes/racks • FATE • Systematically exercise multiple, diverse failures • How? Need to “remember” failures – via failure IDs
Failure IDs • Abstraction of I/O failures • Building failure IDs • Intercept every I/O • Inject possible failures • Ex: crash, network partition, disk failure (LSE/corruption) • Note: FIDs are labeled A, B, C, ...
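A failure ID can be pictured as a small value object: which node, which I/O point, which injected failure. The following Java sketch is only an assumption of how such an abstraction might look (FailureId and its fields are invented names, not FATE's API); FATE itself interposes I/O via AspectJ.

```java
import java.util.Objects;

// Hypothetical sketch of a failure ID: an abstraction of "which I/O point
// failed, on which node, and how". Names are illustrative, not FATE's API.
public final class FailureId {
    public enum FailureType { CRASH, NETWORK_PARTITION, DISK_ERROR, CORRUPTION }

    private final String node;        // e.g. "Node2"
    private final String ioPoint;     // e.g. "dataTransfer.write"
    private final FailureType type;   // the injected failure

    public FailureId(String node, String ioPoint, FailureType type) {
        this.node = node;
        this.ioPoint = ioPoint;
        this.type = type;
    }

    // Value equality lets the framework "remember" which failures (and
    // combinations) have already been exercised.
    @Override public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId f = (FailureId) o;
        return node.equals(f.node) && ioPoint.equals(f.ioPoint) && type == f.type;
    }
    @Override public int hashCode() { return Objects.hash(node, ioPoint, type); }
    @Override public String toString() { return node + ":" + ioPoint + ":" + type; }
}
```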
FATE Architecture • Workload driver: while (new FIDs) { hdfs.write() } • Target system (e.g. Hadoop FS) with a failure surface woven in via AspectJ (Java SDK) • I/O info flows to the failure server, which decides: fail or no fail?
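The driver loop on the slide ("while (new FIDs) { hdfs.write() }") reads as: keep re-running the workload while the failure server still has unexercised failure IDs. A hedged Java sketch of that loop, with hypothetical FailureServer and Workload interfaces standing in for FATE's real components:

```java
import java.util.Set;

// Illustrative sketch of FATE's workload-driver loop; interfaces are hypothetical.
interface FailureServer {
    boolean hasNewFailureIds();          // any combination not yet exercised?
    Set<String> planNextExperiment();    // failure IDs to inject in this run
    void reportExercised(Set<String> fids);
}

interface Workload {
    void run() throws Exception;         // e.g. one HDFS write: hdfs.write(...)
}

public class WorkloadDriver {
    public static void drive(FailureServer server, Workload workload) {
        // Re-run the same workload until every planned failure ID
        // (or combination) has been exercised at least once.
        while (server.hasNewFailureIds()) {
            Set<String> fids = server.planNextExperiment();
            try {
                workload.run();          // interposed I/O asks the server: fail or not?
            } catch (Exception expected) {
                // Injected failures may surface as exceptions; recovery is what we test.
            }
            server.reportExercised(fids);
        }
    }
}
```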
Brute-force exploration • 1 failure / run: Exp #1: A, Exp #2: B, Exp #3: C • 2 failures / run: Exp #1: AB, Exp #2: AC, Exp #3: BC
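A small Java sketch of brute-force enumeration over known failure IDs, matching the slide's example (singles, then unordered pairs). This is only an illustration of the combinatorics, not FATE's planner; the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of brute-force failure exploration: every single failure,
// then every pair of distinct failures. Purely illustrative.
public class BruteForce {
    public static List<List<String>> plans(List<String> fids, int failuresPerRun) {
        List<List<String>> plans = new ArrayList<>();
        if (failuresPerRun == 1) {
            for (String a : fids) plans.add(List.of(a));
        } else if (failuresPerRun == 2) {
            for (int i = 0; i < fids.size(); i++)
                for (int j = i + 1; j < fids.size(); j++)
                    plans.add(List.of(fids.get(i), fids.get(j)));
        }
        return plans;
    }

    public static void main(String[] args) {
        System.out.println(plans(List.of("A", "B", "C"), 2)); // [A,B], [A,C], [B,C]
    }
}
```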
Outline • Research background • FATE and DESTINI • Motivation • FATE • Architecture • Failure exploration challenge and solution • DESTINI • Evaluation
Combinatorial explosion • Exercised over 40,000 unique combinations of 1, 2, and 3 failures per run • 80 hours of testing time! • New challenge: combinatorial explosion • Need smarter exploration strategies • Ex: 2 failures / run over failure points A1, B1, A2, B2, A3, B3 on nodes 1–3: A1A2, A1B2, B1A2, B1B2, ...
Pruning multiple failures • Properties of multiple failures • Pairwise dependent failures • Pairwise independent failures • Goal: exercise distinct recovery behaviors • Key: some failures result in similar recovery • Result: > 10x faster, and found the same bugs
Dependent failures • Failure dependency graph • Inject single failures first • Record subsequent dependent IDs • Ex: X depends on A • Dependency table (FID → subsequent FIDs): A → X, B → X, C → X, Y, D → X, Y • Brute-force: AX, BX, CX, DX, CY, DY • Recovery clustering • Two clusters: {X} and {X, Y} • Only exercise distinct clusters • Pick a failure ID that triggers the recovery cluster • Result: AX, CX, CY
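The clustering idea can be read procedurally: inject each failure alone, record which subsequent failure IDs it exposes, group first failures with identical subsequent sets, and keep one representative per distinct cluster. A hedged Java sketch using the slide's example table (names are illustrative):

```java
import java.util.*;

// Sketch of recovery clustering for dependent failures.
// Input: for each first failure, the set of subsequent failure IDs it exposes.
// Output: representative (first, subsequent) pairs, one per distinct cluster.
public class DependentPruning {
    public static List<String> prune(Map<String, Set<String>> subsequent) {
        List<String> experiments = new ArrayList<>();
        Set<Set<String>> seenClusters = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : subsequent.entrySet()) {
            if (seenClusters.add(e.getValue())) {          // a new recovery cluster
                for (String sub : e.getValue())
                    experiments.add(e.getKey() + sub);     // one representative first FID
            }
        }
        return experiments;
    }

    public static void main(String[] args) {
        // The slide's example: A -> {X}, B -> {X}, C -> {X,Y}, D -> {X,Y}
        Map<String, Set<String>> dep = new LinkedHashMap<>();
        dep.put("A", Set.of("X"));
        dep.put("B", Set.of("X"));
        dep.put("C", new LinkedHashSet<>(List.of("X", "Y")));
        dep.put("D", new LinkedHashSet<>(List.of("X", "Y")));
        System.out.println(prune(dep));   // [AX, CX, CY] instead of 6 experiments
    }
}
```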
Independent failures (1) • Independent combinations: FP² × N(N − 1) • Ex: FP = 2, N = 3 → total = 24 • Symmetric code: just pick two nodes • N(N − 1) reduces to 2, leaving FP² × 2
Independent failures (2) • FP² bottleneck • Ex: FP = 4 → total = 16 • Real example: FP = 15 • Recovery clustering • Cluster A and B if fail(A) == fail(B) • Reduce FP² to FP²(clustered) • E.g. 15 FPs to 8 clustered FPs
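A back-of-the-envelope Java sketch of how the counts shrink under each strategy, using the FP and N values quoted on these two slides. The formulas follow the reading of the previous slide (brute force FP² × N(N − 1), symmetric nodes FP² × 2, then clustering FPs); treat the exact factors as an assumption of this rewrite rather than the paper's precise numbers.

```java
// Illustrative arithmetic for the independent-failure strategies.
// FP = failure points per node, N = nodes; values are from the slides.
public class IndependentCounts {
    static long bruteForce(int fp, int n)   { return (long) fp * fp * n * (n - 1); }
    static long symmetricNodes(int fp)      { return (long) fp * fp * 2; }  // just pick two nodes
    static long clustered(int fpClustered)  { return (long) fpClustered * fpClustered * 2; }

    public static void main(String[] args) {
        System.out.println(bruteForce(2, 3));     // 24, as on the previous slide
        System.out.println(symmetricNodes(15));   // 450 with the real example's 15 FPs
        System.out.println(clustered(8));         // 128 after clustering 15 FPs into 8
    }
}
```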
FATE Summary • Exercise multiple, diverse failures • Via failure IDs abstraction • Challenge: combinatorial explosion of multiple failures • Smart failure exploration strategies • > 10x improvement • Built on top of failure IDs abstraction • Current limitations • No I/O reordering • Manual workload setup
Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Overview • Building specifications • Evaluation
DESTINI: declarative specs • Is the system correct under failures? • Need to write specifications • Developer quote: “[It is] great to document (in a spec) the HDFS write protocol ..., but we shouldn’t spend too much time on it, ... a formal spec may be overkill for a protocol we plan to deprecate imminently.”
Declarative specs • How to write specifications? • Developer friendly (clear, concise, easy) • Existing approaches: • Unit test (ugly, bloated, not formal) • Others too verbose and long • Declarative relational logic language (Datalog) • Key: easy to express logical relations
Specs in DESTINI • How to write specs? • Violations • Expectations • Facts • How to write recovery specs? • “... recovery is under specified” [Hamilton] • Precise failure events • Precise check timings • How to test the implementation? • Interpose I/O calls (lightweight) • Deduce expectations and facts from I/O events
Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Overview • Building specifications • Evaluation
Specification template • “Throw a violation if an expectation is different from the actual behavior” • violationTable(…) :- expectationTable(…), NOT-IN actualTable(…) • Datalog syntax: head() :- predicates(), … where “:-” reads as derivation and “,” reads as AND
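Operationally, the template is a set difference: anything that is expected but has no matching actual fact is a violation. DESTINI evaluates this as Datalog; the Java sketch below is only an illustration of that semantics (class and tuple names are invented).

```java
import java.util.HashSet;
import java.util.Set;

// Illustration of the spec template's semantics as a set difference:
// violations = expectations with no matching actual fact.
public class ViolationCheck {
    public static <T> Set<T> violations(Set<T> expected, Set<T> actual) {
        Set<T> v = new HashSet<>(expected);
        v.removeAll(actual);                 // NOT-IN actualTable(...)
        return v;
    }

    public static void main(String[] args) {
        Set<String> expectedNodes = Set.of("B@Node1", "B@Node2");
        Set<String> actualNodes   = Set.of("B@Node1");
        System.out.println(violations(expectedNodes, actualNodes)); // [B@Node2] -> violation
    }
}
```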
Data transfer recovery • “Replicas should exist in surviving nodes” • incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
Recovery bug • Under an injected crash, the rule fires: a surviving node that should hold block B does not • incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
Building expectations • Ex: which nodes should have the blocks? • Deduce expectations from I/O events • Ex: client asks the master getBlockPipe(…): “Give me 3 nodes for B” → [Node1, Node2, Node3] • expectedNodes(B, N) :- getBlockPipe(B, N); • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);
Updating expectations • When FATE crashes a node, remove it from the expectation • DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N); • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N); • #2: expectedNodes(B, N) :- getBlockPipe(B, N);
Precise failure events • Different stages, different recovery behaviors • Data-transfer recovery vs. setup-stage recovery • Precise failure events: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “DataTransfer”; • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N) • #2: expectedNodes(B, N) :- getBlockPipe(B, N) • #3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “DataTransfer” • #4: writeStage(B, “DataTransfer”) :- writeStage(B, “Setup”), nodesCnt(Nc), acksCnt(Ac), Nc == Ac • #5: nodesCnt(B, CNT<N>) :- pipeNodes(B, N); • #6: pipeNodes(B, N) :- getBlockPipe(B, N); • #7: acksCnt(B, CNT<A>) :- setupAcks(B, P, “OK”); • #8: setupAcks(B, P, A) :- setupAck(B, P, A);
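Rules #2 and #3 can be read as plain table maintenance: insert expected nodes when the client obtains a pipeline, and delete a node's expectation when FATE crashes it during the data-transfer stage. A hedged Java sketch of that bookkeeping, with invented event-handler names (onGetBlockPipe, onFateCrashNode, etc.) standing in for the interposed I/O events:

```java
import java.util.*;

// Illustrative sketch of how interposed I/O events could maintain the
// expectedNodes table (rules #2-#4, simplified). Names are hypothetical.
public class ExpectationTable {
    private final Map<String, Set<String>> expectedNodes = new HashMap<>(); // block -> nodes
    private final Map<String, String> writeStage = new HashMap<>();         // block -> stage

    // Rule #2: expectedNodes(B,N) :- getBlockPipe(B,N)
    void onGetBlockPipe(String block, List<String> pipeline) {
        expectedNodes.computeIfAbsent(block, b -> new HashSet<>()).addAll(pipeline);
        writeStage.put(block, "Setup");
    }

    // Rule #4 (simplified): all setup acks received -> data-transfer stage
    void onSetupAcked(String block) { writeStage.put(block, "DataTransfer"); }

    // Rule #3: DEL expectedNodes(B,N) :- fateCrashNode(N), writeStage(B,"DataTransfer")
    void onFateCrashNode(String node) {
        for (Map.Entry<String, Set<String>> e : expectedNodes.entrySet())
            if ("DataTransfer".equals(writeStage.get(e.getKey())))
                e.getValue().remove(node);   // only surviving nodes remain expected
    }

    Set<String> expectedFor(String block) {
        return expectedNodes.getOrDefault(block, Set.of());
    }
}
```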
Facts • Ex: which nodes actually store the valid blocks? • Deduced from disk I/O events in 8 Datalog rules • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N)
Violation and check-timing • Recovery ≠ invariant • If recovery is ongoing, invariants are violated • Don’t want false alarms • Need precise check timings • Ex: upon block completion #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);
Data-transfer recovery spec (figure): the full rule set annotated by role – failure event (crash), expectation, actual facts, and I/O events
More detailed specs • The spec only says: “something is wrong” • Why? Where is the bug? • Let’s write more detailed specs
Adding specs • First analysis • Client’s pipeline excludes Node2, why? • Maybe the client gets a bad ack for Node2: errBadAck(N) :- dataAck(N, “Error”), liveNodes(N) • Second analysis • Client gets a bad ack for Node2, why? • Maybe Node1 could not communicate with Node2: errBadConnect(N, TgtN) :- dataTransfer(N, TgtN, “Terminated”), liveNodes(TgtN) • We catch the bug! • Node2 cannot talk to Node3 (crash) • Node2 terminates all connections (including Node1!) • Node1 thinks Node2 is dead
Catch bugs with more specs • More detailed specs catch bugs closer to the source and earlier in time • Views: datanode’s view, client’s view, global view
More ... • Design patterns • Add detailed specs • Refine existing specs • Write specs from different views (global, client, dn) • Incorporate diverse failures (crashes, net partitions) • Express different violations (data-loss, unavailability)
Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Evaluation
Evaluation • Implementation complexity • ~6000 LOC in Java • Target 3 popular cloud systems • HadoopFS (primary target) • Underlying storage for Hadoop/MapReduce • ZooKeeper • Distributed synchronization service • Cassandra • Distributed key-value store • Recovery bugs • Found 22 new HDFS bugs (confirmed) • Data loss, unavailability bugs • Reproduced 51 old bugs
Availability bug • “If multiple racks are available (reachable), a block should be stored in a minimum of two racks” • “Throw a violation if a block is only stored in one rack, but the rack is connected to another rack” • errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor(_); • Scenario: FATE injects a rack partition between Rack #1 and Rack #2 while the client writes B • Derived facts: rackCnt(B, 1), blkRacks(B, R1), connected(R1, R2) → errorSingleRack(B) • Availability bug! The replication monitor sees #replicas = 3, locations are not checked, and B is not migrated to R2
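Read procedurally, the rule says: once the replication monitor has run, a block held in exactly one rack is a violation if that rack can still reach another rack. A hedged Java sketch of that check (inputs and names are illustrative, not HDFS's or DESTINI's API):

```java
import java.util.Map;
import java.util.Set;

// Sketch of the single-rack availability check (errorSingleRack), assuming
// illustrative inputs: block -> racks holding it, and rack connectivity.
public class RackCheck {
    static boolean singleRackViolation(String block,
                                       Map<String, Set<String>> blockRacks,
                                       Map<String, Set<String>> connectedRacks) {
        Set<String> racks = blockRacks.getOrDefault(block, Set.of());
        if (racks.size() != 1) return false;               // rackCnt(B, Cnt), Cnt == 1
        String onlyRack = racks.iterator().next();
        // connected(R, Rb): another rack is reachable, so replicas could have spread.
        return !connectedRacks.getOrDefault(onlyRack, Set.of()).isEmpty();
    }

    public static void main(String[] args) {
        // After the partition heals: B sits only in Rack1, which can reach Rack2.
        Map<String, Set<String>> blockRacks = Map.of("B", Set.of("Rack1"));
        Map<String, Set<String>> connected  = Map.of("Rack1", Set.of("Rack2"));
        System.out.println(singleRackViolation("B", blockRacks, connected)); // true -> availability bug
    }
}
```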
Pruning efficiency • Reduce #experiments by an order of magnitude (chart: brute-force counts in the thousands, e.g. 5000 and 7720, vs. hundreds after pruning, e.g. 618) • Workloads: write + 2 crashes, append + 2 crashes, write + 3 crashes, append + 3 crashes • Each experiment = 4–9 seconds • Found the same number of bugs (by experience)
Specification simplicity • Compared to other related work
Summary • Cloud systems • All good, but must manage failures • Performance, reliability, availability depend on failure recovery • FATE and DESTINI • Explore multiple, diverse failures systematically • Facilitate declarative recovery specifications • A unified framework • Real-world adoption in progress
END • Thanks! Questions?