Data-Centric Reconfiguration with Network-Attached Disks
Alex Shraer (Technion)
Joint work with: J.P. Martin, D. Malkhi, M. K. Aguilera (MSR), I. Keidar (Technion)
Preview
• The setting: data-centric replicated storage over simple network-attached storage nodes
• Our contributions:
  • The first distributed reconfigurable R/W storage: storage nodes can be added and removed dynamically
  • Asynchronous vs. consensus-based reconfiguration
Enterprise Storage Systems
• Highly reliable, customized hardware
• Controllers and I/O ports may become a bottleneck
• Expensive
• Usually not extensible
• Different solutions for different scales
  • Example (HP): high end XP (1152 disks), mid range EVA (324 disks)
Alternative – Distributed Storage
• Made up of many storage nodes
  • Unreliable, cheap hardware
  • Failures are the norm, not the exception
• Challenges:
  • Achieving reliability and consistency
  • Supporting reconfigurations
Distributed Storage Architecture
[Diagram: dynamic, fault-prone storage clients issue read/write requests over a LAN/WAN to fault-prone storage nodes ("cloud storage")]
• Unpredictable network delays (asynchrony)
A Case for Data-Centric Replication
• Client-side code runs the replication logic (the "not-so-thin" client; see the sketch below)
  • Communicates with multiple storage nodes
• Simple ("thin") storage nodes (servers)
  • Can be network-attached disks, not necessarily PCs with disks
  • Do not run application-specific code: fewer fault-prone components
  • Simply respond to client requests: high throughput
  • Do not communicate with each other
    • If storage nodes communicate, their failures are likely to be correlated!
  • Oblivious to where other replicas of each object are stored
    • Scalable: the same storage node can be used for many replication sets
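To make this division of labor concrete, here is a minimal, illustrative sketch of an ABD-style client-side write in Python; all class and function names are assumptions for illustration, not DynaDisk's actual code. The storage nodes only store timestamped values and answer requests, while the client runs the quorum logic.

```python
# Illustrative sketch of data-centric replication (ABD-style), not actual
# DynaDisk code: "thin" storage nodes only store and answer requests;
# the client runs the replication logic against a majority.

class StorageNode:
    def __init__(self):
        self.store = {}                                 # key -> (timestamp, value)

    def read(self, key):
        return self.store.get(key, ((0, 0), None))

    def write(self, key, ts, value):
        if ts > self.store.get(key, ((0, 0), None))[0]:
            self.store[key] = (ts, value)
        return "OK"

def majority(nodes):
    return len(nodes) // 2 + 1

def abd_write(nodes, key, value, client_id):
    # Phase 1: learn the highest timestamp from a majority (shown
    # sequentially; a real client contacts nodes in parallel and proceeds
    # once any majority has answered).
    replies = [n.read(key) for n in nodes[:majority(nodes)]]
    max_counter = max(ts[0] for ts, _ in replies)
    new_ts = (max_counter + 1, client_id)               # client id breaks ties
    # Phase 2: store the new (timestamp, value) at a majority
    for n in nodes[:majority(nodes)]:
        n.write(key, new_ts, value)

# Usage: three storage nodes, one client write
nodes = [StorageNode() for _ in range(3)]
abd_write(nodes, "x", "Italy", client_id=1)
```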
Real Systems Are Dynamic
[Diagram: storage nodes A through I connected over a LAN/WAN; reconfig{–A, –B} removes nodes, reconfig{–C, +F, …, +I} removes and adds nodes]
• The challenge: maintain consistency, reliability, and availability while the system changes
Pitfall of Naïve Reconfiguration
[Animation: reconfig{+E} and reconfig{–D} execute concurrently with delayed messages; afterwards some of the servers A through E believe the configuration is {A, B, C, D, E} while others believe it is {A, B, C}]
Pitfall of Naïve Reconfiguration (cont.)
[Animation: a client using configuration {A, B, C, D, E} writes x = "Spain" (timestamp 2) to the majority C, D, E; a client still using {A, B, C} reads from the majority A, B, which still hold x = "Italy" (timestamp 1), and returns "Italy". Split brain!]
• The two majorities do not intersect, so the read misses the latest write
Reconfiguration Option 1: Centralized
• "Tomorrow Technion servers will be down for maintenance from 5:30am to 6:45am. Virtually Yours, Moshe Barak"
• Can be automatic, e.g., Ursa Minor [Abd-El-Malek et al., FAST 05]
• Downtime: most solutions stop R/W while reconfiguring
• Single point of failure: what if the manager crashes while changing the system?
Reconfiguration Option 2: Distributed Agreement
• Servers agree on the next configuration
  • Previous solutions are not data-centric
• No downtime
• In theory, might never terminate [FLP85]
  • In practice, we have partial synchrony, so it usually works
Reconfiguration Option 3: DynaStore [Aguilera, Keidar, Malkhi, S., PODC 09]
• Distributed & completely asynchronous
• No downtime
• Always terminates
• Not data-centric
In this work: DynaDisk, a dynamic data-centric R/W storage
• First distributed data-centric solution
• No downtime
• Tunable reconfiguration method
  • Modular design: coordination is separate from data
  • Allows easily setting and comparing the coordination method: consensus-based vs. asynchronous reconfiguration
• Many shared objects
  • Running a protocol instance per object is too costly
  • Transferring all state at once might be infeasible
  • Our solution: incremental state transfer
• Built with an external (weak) location service
  • We formally state the requirements on such a service
Location Service
• Used in practice, ignored in theory
• We formalize the weak external service as an oracle (sketched below):
  • Not enough to solve reconfiguration on its own
  • oracle.query() returns some "legal" configuration
  • If reconfigurations stop and oracle.query() is invoked infinitely many times, it eventually returns the last system configuration
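As a rough sketch (class and method names are assumptions, not an actual DynaDisk interface), the oracle's weak guarantee can be written down as:

```python
# Sketch of the weak location-service oracle; all names are assumptions.
# query() may return a stale configuration, but if reconfigurations stop
# and query() keeps being invoked, it eventually returns the last one.

class LocationOracle:
    def __init__(self, initial_config):
        self._config = frozenset(initial_config)        # possibly stale view

    def announce(self, new_config):
        # Best-effort notification after a reconfiguration; may lag behind.
        self._config = frozenset(new_config)

    def query(self):
        # Always some "legal" configuration, not necessarily the current one.
        return self._config
```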
The Coordination Module in DynaDisk
[Diagram: storage devices A, B, C in configuration conf = {+A, +B, +C}; each device holds the R/W objects x, y, z and a "next config" field]
• Distributed R/W objects, updated similarly to ABD
• Distributed "weak snapshot" object per configuration
  • API: update(set of changes) → OK, scan() → set of updates
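The weak snapshot API above can be captured as a small interface; this is only a sketch, and everything beyond the update/scan names shown on the slide is an assumption:

```python
# Interface sketch for the per-configuration weak snapshot object.
# Only the update/scan API comes from the slide; the rest is illustrative.

from abc import ABC, abstractmethod

class WeakSnapshot(ABC):
    @abstractmethod
    def update(self, changes):
        """Propose a set of configuration changes, e.g. {'+D'} or {'-C'}."""

    @abstractmethod
    def scan(self):
        """Return a set of proposed updates. Key property (next slides):
        any two non-empty sets returned by scan() intersect."""
```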
Coordination with Consensus
• update: racing proposals (e.g., reconfig({+D}) and reconfig({–C})) go through consensus, which picks a single next configuration
• scan: read & write back the next config from a majority
• Every non-empty scan returns the same chosen update (e.g., +D)
[Diagram: devices A, B, C each store the objects x, y, z and a "next config" field that ends up holding +D]
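A sketch of the consensus-based variant; the consensus object and its propose/read_decision methods are assumptions standing in for, e.g., an Active Disk Paxos instance over the configuration's devices:

```python
# Sketch of a consensus-based weak snapshot: all proposals for a
# configuration go through one consensus instance, so every non-empty
# scan() returns the same single update. The consensus primitive is assumed.

class ConsensusWeakSnapshot:
    def __init__(self, consensus):
        self.consensus = consensus                      # one instance per configuration

    def update(self, changes):
        # Propose; consensus picks exactly one winning proposal.
        self.consensus.propose(frozenset(changes))

    def scan(self):
        # Read the decision from a majority (and write it back).
        decided = self.consensus.read_decision()        # None if nothing decided yet
        return set() if decided is None else {decided}
```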
Weak Snapshot – Weaker than Consensus
• No need to agree on the next configuration, as long as each process has a set of possible next configurations and all such sets intersect
  • Intersection allows clients to converge and again use a single config
• Non-empty intersection property of weak snapshot: every two non-empty sets returned by scan() intersect
• Example:
  • Client 1 scans {+D}, Client 2 scans {+D}: allowed (this is what consensus would give)
  • Client 1 scans {–C}, Client 2 scans {+D, –C}: allowed, the sets intersect
  • Client 1 scans {+D}, Client 2 scans {–C}: not allowed, the sets do not intersect
Coordination Without Consensus
• update: register the proposal at a majority of the devices using compare-and-swap into numbered slots, e.g., CAS({–C}, ⊥, 0), and if slot 0 is already taken (say by +D), CAS({–C}, ⊥, 1)
• scan: read & write back the proposals from a majority (twice)
[Diagram: reconfig({+D}) and reconfig({–C}) race at devices A, B, C; both proposals end up registered, possibly in different slots at different devices]
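A rough sketch of this asynchronous variant; the slot layout and the exact write-back rule are simplifications for illustration, not the precise DynaDisk protocol:

```python
# Rough sketch of an asynchronous (consensus-free) weak snapshot: proposals
# are CAS'ed into numbered slots at each device, and scan() reads, writes
# back, and reads again. Details are assumptions, simplified for illustration.

class SlottedDevice:
    """Thin storage device: numbered update slots with compare-and-swap."""
    def __init__(self):
        self.slots = {}                                 # slot index -> changes

    def cas(self, slot, expected, changes):
        if self.slots.get(slot) == expected:            # atomic on a real device
            self.slots[slot] = changes
            return True
        return False

    def read(self):
        return dict(self.slots)

def async_update(devices, changes):
    # Register the proposal at each device in the first free slot
    # (CAS into slot 0, then slot 1 if slot 0 is already taken, and so on).
    for d in devices:
        slot = 0
        while not d.cas(slot, None, changes):
            slot += 1

def async_scan(devices):
    # Read the proposals, write them back, then read again; in the real
    # protocol each round only needs responses from a majority.
    seen = {}
    for d in devices:
        seen.update(d.read())
    for d in devices:                                   # write-back round
        for slot, changes in seen.items():
            d.cas(slot, None, changes)
    collected = set()
    for d in devices:                                   # second read round
        collected |= set(d.read().values())
    return collected

# Usage: two racing reconfigurations over three devices
devices = [SlottedDevice() for _ in range(3)]
async_update(devices, '+D')
async_update(devices, '-C')
print(async_scan(devices))                              # {'+D', '-C'}
```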
Tracking Evolving Configurations
• With consensus: processes agree on the next configuration, so configurations form a chain
• Without consensus: usually a chain, sometimes a DAG
  • Example: starting from {A, B, C}, one client's scan() returns {+D} (leading to {A, B, C, D}) while another's returns {+D, –C} (leading to {A, B, D})
  • Inconsistent updates are found and merged: since all non-empty scans intersect, the branches reconverge (see the toy example below)
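A toy example (not DynaDisk code; names are made up) of how clients merge divergent branches by applying the union of the updates they discover:

```python
# Toy illustration of merging configuration branches: clients whose scans
# intersect keep discovering each other's updates and converge by applying
# the union of all changes they have seen.

def apply_changes(config, changes):
    # config like {'A', 'B', 'C'}; changes like {'+D', '-C'}
    new = set(config)
    for ch in changes:
        op, member = ch[0], ch[1:]
        if op == '+':
            new.add(member)
        elif op == '-':
            new.discard(member)
    return new

initial = {'A', 'B', 'C'}
scan1 = {'+D'}                                   # one client's scan
scan2 = {'+D', '-C'}                             # another client's scan (intersects scan1)

print(apply_changes(initial, scan1))             # {'A', 'B', 'C', 'D'}
print(apply_changes(initial, scan2))             # {'A', 'B', 'D'}
print(apply_changes(initial, scan1 | scan2))     # merged: {'A', 'B', 'D'}
```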
Consensus-Based vs. Asynchronous Coordination
• Two implementations of weak snapshots:
  • Asynchronous
  • Partially synchronous (consensus-based)
    • Active Disk Paxos [Chockler, Malkhi, 2005]
    • Exponential backoff for leader election
• Unlike asynchronous coordination, consensus-based might not terminate [FLP85]
• Storage overhead (per storage device and configuration):
  • Asynchronous: a vector of updates; vector size ≤ min(#reconfigs, #members in config)
  • Consensus-based: 4 integers and the chosen update
Strong Progress Guarantees Are Not for Free
[Experimental comparison of consensus-based vs. asynchronous (no consensus) coordination]
• Asynchronous: slightly better and much more predictable reconfig latency when many reconfigs execute simultaneously
• Asynchronous: significant negative effect on R/W latency during reconfiguration
• The two variants perform the same when there are no reconfigurations
Future & Ongoing Work • Combine asynch. and partially-synch. coordination • Consider other weak snapshot implementations • E.g., using randomized consensus • Use weak snapshots to reconfigure other services • Not just for R/W
Summary
• DynaDisk: dynamic data-centric R/W storage
  • First decentralized solution
  • No downtime
  • Supports many objects, provides incremental reconfiguration
    • Uses one coordination object per configuration (not per object)
  • Tunable reconfiguration method
    • We implemented asynchronous and consensus-based coordination
    • Many other implementations of weak snapshots are possible
• Asynchronous coordination in practice:
  • Works in more circumstances → more robust
  • But at a cost: it significantly affects ongoing R/W ops