OceanStore: An Infrastructure for Global-Scale Persistent Storage

John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao. A few slides are taken from John Kubiatowicz’s presentation.

Vision • What is OceanStore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” (Source: Berkeley OceanStore website)

Where does the data come from • All kinds of information • Desktop, laptop, palmtop • Cell phones, embedded devices

Persistence of data • Data should be unaffected even if a device is lost or replaced • reliable, durable data • “deep archival” (will last forever) • Automatic maintenance

Continuous access to data • Connectivity • even to the tiniest devices, possibly intermittent • variable bandwidth, latency • Availability • Highly available, replicated on a global scale • comparable to LAN-based networked storage • fault-tolerant, DoS-tolerant

How much data? • scale • geographically distributed • Assume 10^10 users • Each user has 10,000 files or objects • 10^14 files / objects • Each file = 1 MB • 10^20 bytes = 100 exabytes
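The scale estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are illustrative, and it assumes decimal units (1 MB = 10^6 bytes, 1 exabyte = 10^18 bytes).

```python
# Back-of-the-envelope check of the slide's scale estimate.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 EB = 10**18 bytes).
users = 10**10
files_per_user = 10_000
bytes_per_file = 10**6

total_files = users * files_per_user        # 10**14 files / objects
total_bytes = total_files * bytes_per_file  # 10**20 bytes
exabytes = total_bytes // 10**18            # 100 exabytes

print(total_files, total_bytes, exabytes)
```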

Data service • Economics • Pay monthly fee to ISPs / data service providers • Essentially, you use others’ storage capacity via subscription • Cloud computing was not introduced then …

Assumptions • Untrusted infrastructure • Servers may crash or leak information, but • most of the servers function correctly • “Financially responsible” servers ensure integrity, but only clients are trusted with cleartext • Nomadic data • Data divorced from location, but flows freely within the storage infrastructure • Promiscuous caching: “anywhere, anytime” • Proximity of location important for performance

System overview • Persistent object • Globally Unique Identifier (GUID) for each object: 160-bit SHA-1 hash • secure identification: globally unique and unforgeable • about 2^80 unique objects before collisions are expected (birthday paradox) • Encrypted data, unless data is public • Read • try fast probabilistic replica search (Bloom filter) • fall back to slower deterministic search (Tapestry) • [Tapestry is a practical P2P network that uses Plaxton routing]
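The 2^80 figure follows from the birthday bound: with a 160-bit hash, a collision becomes likely after roughly 2^(160/2) objects. The sketch below uses Python's standard `hashlib`; the helper name `guid` and the sample input are illustrative, not taken from OceanStore.

```python
import hashlib

GUID_BITS = 160  # SHA-1 digest size in bits

def guid(data: bytes) -> str:
    """Hypothetical helper: a 160-bit identifier as 40 hex digits."""
    return hashlib.sha1(data).hexdigest()

# Birthday bound: collisions are expected after about 2**(bits / 2) items.
birthday_bound = 2 ** (GUID_BITS // 2)

example = guid(b"some object")
print(len(example) * 4, "bits;", "birthday bound = 2**80:", birthday_bound == 2**80)
```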

System overview • Write • Update with predicates [as in Bayou] – [what is Bayou?] • Update creates new version, in principle • need to group updates, retire objects [Think of a shared calendar]

What is Bayou? The Bayou System is a platform of replicated, highly available databases on which to build collaborative applications. It supports weak consistency models.

System overview • application interface • sessions: sequence of read/writes • session guarantees [Bayou] • weak consistency levels, ACID • Unix File Sharing semantics • active and archival forms • active: latest version, with update handle • archive: “erasure coded” read-only version

Comparison with Bayou • Similarities • update with predicates [Example: compare-version: check if this version number is greater than X] • creates new version, in principle • Uses replication to improve availability at the expense of consistency • Differences • anti-entropy among designated servers (Bayou) vs. promiscuous caching anytime, anywhere (OceanStore) • Encryption (unlike Bayou)

naming • self-certifying path names (Mazières) (a mechanism that avoids central naming, certification, and key distribution) • object GUID = hash of owner key and readable name • other objects • server GUID = hash of public key • archival GUID = hash of data • read restriction (through client encryption of data) • write restriction (associate ACLs with the object, respected by servers)
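The three GUID rules above can be sketched with SHA-1 from Python's standard library. The function names, the use of plain concatenation to combine key and name, and the placeholder key bytes are all assumptions made for illustration; the slides do not specify the encoding.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> bytes:
    # Assumption: key and readable name are simply concatenated before hashing.
    return hashlib.sha1(owner_public_key + name.encode("utf-8")).digest()

def server_guid(public_key: bytes) -> bytes:
    return hashlib.sha1(public_key).digest()

def archival_guid(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def verify_archival(guid: bytes, data: bytes) -> bool:
    """Self-certifying check: anyone can recompute the hash of the data."""
    return archival_guid(data) == guid

key = b"placeholder-public-key-bytes"  # hypothetical key material
g = archival_guid(b"read-only fragment")
print(verify_archival(g, b"read-only fragment"), verify_archival(g, b"tampered"))
```

Because the name is derived from the key or from the content itself, a reader can check what a server returns without consulting a central naming or certification authority.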

addressing and routing • Address an object by its GUID • message: GUID, small predicate • route to closest GUID replica matching predicate • combines data location and routing: • no central name service to attack

addressing and routing • fast, probabilistic search algorithm • Bloom filter • probabilistic set membership test using a bit vector • m-bit vector, with bits set by k hashes of each set element • filter is union (OR) of all bit vectors • attenuated Bloom filter • array of d Bloom filters • i-th Bloom filter is the union of all <i-hop nodes • slow, deterministic algorithm • Tapestry

Bloom Filter • Quickly tests whether an element w belongs to a set S • For each element of the set S, compute K hash functions and set the corresponding slots of the bit array • There is a small risk of a false positive, but never a false negative [Figure (taken from Wikipedia): a Bloom filter with K = 3, in which w does not belong to the set S]
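A minimal Bloom filter sketch, assuming K = 3 hash functions derived by salting SHA-1 and a 64-bit array stored in a Python integer. The class name, the sizes, and the salting scheme are illustrative choices, not OceanStore's.

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 64, k: int = 3):
        self.m = m      # number of bits in the array
        self.k = k      # number of hash functions
        self.bits = 0   # bit array stored as an integer

    def _positions(self, item: str):
        # Assumption: k hash functions obtained by salting SHA-1 with an index.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # The filter of a union of sets is the bitwise OR of the filters.
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged

s = BloomFilter()
for element in ("x", "y", "z"):
    s.add(element)
print("x" in s, "y" in s, "z" in s)  # members are always reported as present
```

A query for a non-member usually returns False, but it can return True when other elements happen to have set all K of its slots; that is the false positive mentioned on the slide.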

Attenuated Bloom Filter • Quickly tests whether an object w is stored at a site S and, if not, in which direction to route the query for w • It gives a hint of how far away a match can possibly be found
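The routing hint can be sketched as follows. The sketch assumes that each node keeps, per neighbor, an array of d filters in which level i summarizes the objects reachable i hops away through that neighbor; this reading, the integer-bitmask representation, and all names are assumptions for illustration.

```python
import hashlib

M, K = 256, 3  # assumed filter width (bits) and number of hash functions

def mask(item: str) -> int:
    """Bit pattern that `item` sets in a Bloom filter."""
    bits = 0
    for i in range(K):
        digest = hashlib.sha1(f"{i}:{item}".encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest, "big") % M)
    return bits

def bloom(items) -> int:
    """Bloom filter (as an integer bitmask) for a collection of items."""
    out = 0
    for item in items:
        out |= mask(item)
    return out

def next_hop(attenuated, guid):
    """Return (neighbor, distance) for the shallowest matching level, or None.

    `attenuated` maps a neighbor name to its list of d filters;
    index 0 of the list is taken to mean one hop away.
    """
    want = mask(guid)
    best = None
    for neighbor, levels in attenuated.items():
        for depth, level_filter in enumerate(levels, start=1):
            if level_filter & want == want:
                if best is None or depth < best[1]:
                    best = (neighbor, depth)
                break  # deeper levels of this neighbor cannot be closer
    return best

filters = {"A": [0, bloom(["w"])], "B": [0, 0]}
print(next_hop(filters, "w"))
```

When no level matches, the function returns None, which corresponds to falling back to the slower deterministic search.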

addressing and routing [Figure: probabilistic search and deterministic search]

Attenuated Bloom Filter

Tapestry uses Plaxton Routing • Using the routing tables, the query is forwarded towards the root node of the object. • The root points to the server storing the object. • Extra copies may be discovered earlier as the query moves towards the root. [Figure: servers S and S’ each store a copy of object O; pointers (O,S), (O,S,1), (O,S,2) and (O,S’,1), (O,S’,2), (O,S’,3) are left along the paths from each server to the root of O]
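The idea of forwarding a query towards the root, with pointers left along the publish path, can be simulated in a few lines. This is a toy sketch, not Tapestry itself: it assumes global knowledge of all node IDs instead of per-node routing tables, matches IDs by prefix, and uses made-up hex IDs and function names.

```python
def shared_prefix(a: str, b: str) -> int:
    """Number of leading digits two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(nodes, start, target):
    """Hop to nodes that match ever more digits of `target`; the last hop acts as root."""
    path = [start]
    current = start
    while True:
        level = shared_prefix(current, target)
        better = [n for n in nodes if shared_prefix(n, target) > level]
        if not better:
            return path
        # Resolve as few extra digits as possible per hop; break ties by ID.
        current = min(better, key=lambda n: (shared_prefix(n, target), n))
        path.append(current)

def publish(nodes, pointers, server, obj_id):
    """Leave a pointer to `server` on every node from the server to the root."""
    for hop in route(nodes, server, obj_id):
        pointers.setdefault(hop, {})[obj_id] = server

def locate(nodes, pointers, start, obj_id):
    """Walk towards the root and stop at the first node holding a pointer."""
    for hop in route(nodes, start, obj_id):
        if obj_id in pointers.get(hop, {}):
            return pointers[hop][obj_id], hop
    return None

nodes = ["0325", "4598", "4A2C", "4A61", "B4F8"]  # hypothetical node IDs
pointers = {}
publish(nodes, pointers, "0325", "4A6F")           # server 0325 stores object 4A6F
print(route(nodes, "0325", "4A6F"))
print(locate(nodes, pointers, "B4F8", "4A6F"))
```

In this example the query from B4F8 meets a pointer at node 4598, before it reaches the root 4A61, which mirrors the remark that copies may be discovered early.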

updates • based on versioning and conflict resolution • i.e. no locking • update: actions with predicates • commit – apply action of first true predicate • abort – no true predicates • conflict resolution on encrypted data • possible predicates: • compare-version, compare-size, compare-block, search • possible actions: • replace-block, insert-block, delete-block, append
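The commit and abort rule can be sketched as a list of (predicate, action) pairs evaluated in order. The class and function names are illustrative; blocks are treated as opaque byte strings, standing in for ciphertext blocks, and the exact semantics of each predicate and action are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class VersionedObject:
    version: int = 0
    blocks: list = field(default_factory=list)  # opaque (e.g. encrypted) blocks

def compare_version(expected: int):
    return lambda obj: obj.version == expected

def compare_size(expected: int):
    return lambda obj: len(obj.blocks) == expected

def append(block: bytes):
    return lambda blocks: blocks + [block]

def replace_block(index: int, block: bytes):
    return lambda blocks: blocks[:index] + [block] + blocks[index + 1:]

def apply_update(obj, update):
    """Apply the action of the first true predicate; abort if none is true.

    A commit yields a new version and leaves the old one untouched (no locking).
    """
    for predicate, action in update:
        if predicate(obj):
            return VersionedObject(obj.version + 1, action(list(obj.blocks))), "commit"
    return obj, "abort"

v0 = VersionedObject()
v1, status1 = apply_update(v0, [(compare_version(0), append(b"blk-a"))])
v2, status2 = apply_update(v1, [(compare_version(0), append(b"stale"))])
print(status1, v1.version, status2, v2.version)
```

The second update was prepared against version 0, so its predicate fails on version 1 and it aborts instead of overwriting the newer data.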

Update on ciphertext

updates • serializing updates • will not trust any single server • Byzantine agreement among primary-tier servers • secondary tier gossips tentative data during commit • multicast dissemination of commit to secondary tier [Figure: primary tier and secondary tier]

archival • produced when objects are idle • use erasure codes (redundant fragmentation) • simplest example: parity bit • need any (n-1) out of n fragments • Reed-Solomon codes • fragmentation improves reliability [Figure: a file split into fragments; any 4 can reconstruct the block]
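The parity example, the simplest erasure code on this slide, can be sketched with XOR: n-1 data fragments plus one parity fragment, any n-1 of which rebuild the missing one. Function names and sample data are illustrative, and fragments are assumed to have equal length; Reed-Solomon codes generalize this to tolerate more than one lost fragment.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(fragments):
    """Append one parity fragment: the XOR of all data fragments."""
    return list(fragments) + [reduce(xor_bytes, fragments)]

def recover(coded, missing_index):
    """Rebuild the fragment at `missing_index` from the other n-1 fragments."""
    survivors = [f for i, f in enumerate(coded) if i != missing_index]
    return reduce(xor_bytes, survivors)

data = [b"ocea", b"nsto", b"re!!"]  # n-1 = 3 equal-length data fragments
coded = encode(data)                # n = 4 fragments in total
print(recover(coded, 1))            # rebuilds the lost second fragment
```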

dynamic optimization (introspection) • observation modules • collect and summarize information • incrementally update system database • and optimization modules • periodically process the observation database • replica management: maintain replica count and location • periodic migration: work-home-work-home… • etc
