260 likes | 274 Views
Learn about OceanStore, a utility infrastructure for continuous access to persistent information worldwide. Explore data persistence, continuous access, system overview, and more. Discover the Bayou system for collaborative applications.
E N D
OceanStore: An Infrastructure for Global-Scale Persistent Storage John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao A few slides are taken from John Kubiatowicz’s presentation
Vision • what is Oceanstore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” Source: Berkeley OceanStore Website
Where does the data come from • All kinds of information • Desktop, laptop, palmtop • Cell phones, embedded devices
Persistence of data • Data should be affected even if device is lost or replaced • reliable, durable data • “deep archival” (will last forever) • Automatic maintenance
Continuous access to data • Connectivity • even to tiniest devices, possibly intermittent • variable bandwidth, latency • Availability • Highly available, replicated in a global scale • comparable to LAN-based networked storage • fault-tolerant, DoS-tolerant
How much data? • scale • geographically distributed • Assume 1010 users • Each use has 10,000 files or objects • 1014 files / objects • Each file = 1 MB • 1020 bytes = 100 Exabytes
Data service • Economics • Pay monthly fee to ISPs / data service providers • Essentially, you use others’ storage capacity via subscription • Cloud computing was not introduced then …
Assumptions • Untrusted infrastructure • Servers may crash or leak information, but • Most of the servers functioning correctly • “Financially responsible” servers ensure integrity but only clients are trusted with cleartext • Nomadic data • Data divorced from location, but flows freely within the storage infrastructure • Promiscuouscaching: “anywhere, anytime” • Proximity of location important for performance
System overview • persistent object • Globally Unique Identifier GUID for each object: 160-bit SHA-1 hash • secure identification – globally unique and unforgeable • 280 unique objects before collisions (birthday paradox) • Encrypted data, unless data is public • read • try fast probabilistic replica search (Bloom filter) • fallback to slower deterministic search (Tapestry) • [Tapestry is a practical P2P network that uses Plaxton routing]
System overview • Write • Update with predicates [as in Bayou] – [what is Bayou?] • Update creates new version, in principle • need to group updates, retire objects [Think of a shared calendar]
What is Bayou The Bayou System is a platform of replicated, highly-available, databases on which to build collaborative applications. Supports weak consistency models
System overview • application interface • sessions: sequence of read/writes • session guarantees [Bayou] • weak consistency levels, ACID • Unix File Sharing semantics • active and archival forms • active: latest version, with update handle • archive: “erasure coded” read-only version
Comparison with Bayou • Similarities • update with predicates [Example: compare version: Check if this version number is greater than X] • creates new version, in principle • Uses replication to improve availability at the expense of consistency • Differences • anti-entropy vs. promiscuous caching • Encryption (unlike Bayou) Designated servers Anytime, anywhere
naming • self-certifying path names (Mazières) (a mechanism that avoids central naming & certification & key distribution) • object GUID = hash of owner key and readable name • other objects • server GUID = hash of public key • archival GUID = hash of data • read restriction (through client encryption of data) • write restriction (associate ACL lists with object, respected by servers
addressing and routing • Address an object by its GUID • message: GUID, small predicate • route to closest GUID replica matching predicate • combines data location and routing: • no central name service to attack
addressing and routing • fast, probabilistic search algorithm • Bloom filter • probabilistic set membership test using bit vector • n-bit vector generated from n hashes of each set element • filter is union (OR) of all bit vectors • attenuated Bloom filter • array of d Bloom filters • i th Bloom filter is union of all <i -hop nodes • slow, deterministic algorithm • Tapestry
Bloom Filter • Quickly tests if an element w belongs to a set S K=3 (Taken from Wikipedia) For each element of the set S, compute Khash functions, and enter them into the corresponding slots of the array. There is a very small risk of false positive. Note that w does not belong to the set S
Attenuated Bloom Filter Quickly tests if an object w is stored in a site S, and if not, then in which direction to route the query W It gives a hint of how far away a match can possibly be found
addressing and routing deterministic probabilistic
Tapestry uses Plaxton Routing Using the routing tables, the query is forwarded towards the root node of the object. The root points to the server storing the object. Extra copies may be discovered earlier as the query moves towards the root. Root of O (O,S’,3) (O,S’,1) (O,S’,2) Server S’ (O,S,2) (O,S,1) (O,S) Server S
updates • based on versioning and conflict resolution • i.e. no locking • update: actions with predicates • commit – apply action of first true predicate • abort – no true predicates • conflict resolution on encrypted data • possible predicates: • compare-version, compare-size, compare-block, search • possible actions: • replace-block, insert-block, delete-block, append
updates • serializing updates • will not trust any single server • Byzantine agreement among primary tierservers • secondary tier gossips tentative data during commit • multicast dissemination of commit to secondary tier primary secondary
archival • produced when objects idle • use erasure codes (redundant fragmentation) • simplest example: parity bit • need any (n-1) out of n fragments • Reed-Solomon codes • fragmentation improves reliability File Any 4 can Reconstruct the block Any 4 can Reconstruct the block
dynamic optimization (introspection) • observation modules • collect and summarize information • incrementally update system database • and optimization modules • periodically process the observation database • replica management: maintain replica count and location • periodic migration: work-home-work-home… • etc