1 / 26

OceanStore: An Infrastructure for Global-Scale Persistent Storage

OceanStore: An Infrastructure for Global-Scale Persistent Storage. John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao.

cphillip
Download Presentation

OceanStore: An Infrastructure for Global-Scale Persistent Storage

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. OceanStore: An Infrastructure for Global-Scale Persistent Storage John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao A few slides are taken from John Kubiatowicz’s presentation

  2. Vision • what is Oceanstore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” Source: Berkeley OceanStore Website

  3. Where does the data come from • All kinds of information • Desktop, laptop, palmtop • Cell phones, embedded devices

  4. Persistence of data • Data should be affected even if device is lost or replaced • reliable, durable data • “deep archival” (will last forever) • Automatic maintenance

  5. Continuous access to data • Connectivity • even to tiniest devices, possibly intermittent • variable bandwidth, latency • Availability • Highly available, replicated in a global scale • comparable to LAN-based networked storage • fault-tolerant, DoS-tolerant

  6. How much data? • scale • geographically distributed • Assume 1010 users • Each use has 10,000 files or objects • 1014 files / objects • Each file = 1 MB • 1020 bytes = 100 Exabytes

  7. Data service • Economics • Pay monthly fee to ISPs / data service providers • Essentially, you use others’ storage capacity via subscription • Cloud computing was not introduced then …

  8. Assumptions • Untrusted infrastructure • Servers may crash or leak information, but • Most of the servers functioning correctly • “Financially responsible” servers ensure integrity but only clients are trusted with cleartext • Nomadic data • Data divorced from location, but flows freely within the storage infrastructure • Promiscuouscaching: “anywhere, anytime” • Proximity of location important for performance

  9. System overview • persistent object • Globally Unique Identifier GUID for each object: 160-bit SHA-1 hash • secure identification – globally unique and unforgeable • 280 unique objects before collisions (birthday paradox) • Encrypted data, unless data is public • read • try fast probabilistic replica search (Bloom filter) • fallback to slower deterministic search (Tapestry) • [Tapestry is a practical P2P network that uses Plaxton routing]

  10. System overview • Write • Update with predicates [as in Bayou] – [what is Bayou?] • Update creates new version, in principle • need to group updates, retire objects [Think of a shared calendar]

  11. What is Bayou The Bayou System is a platform of replicated, highly-available, databases on which to build collaborative applications. Supports weak consistency models

  12. System overview • application interface • sessions: sequence of read/writes • session guarantees [Bayou] • weak consistency levels, ACID • Unix File Sharing semantics • active and archival forms • active: latest version, with update handle • archive: “erasure coded” read-only version

  13. Comparison with Bayou • Similarities • update with predicates [Example: compare version: Check if this version number is greater than X] • creates new version, in principle • Uses replication to improve availability at the expense of consistency • Differences • anti-entropy vs. promiscuous caching • Encryption (unlike Bayou) Designated servers Anytime, anywhere

  14. naming • self-certifying path names (Mazières) (a mechanism that avoids central naming & certification & key distribution) • object GUID = hash of owner key and readable name • other objects • server GUID = hash of public key • archival GUID = hash of data • read restriction (through client encryption of data) • write restriction (associate ACL lists with object, respected by servers

  15. addressing and routing • Address an object by its GUID • message: GUID, small predicate • route to closest GUID replica matching predicate • combines data location and routing: • no central name service to attack

  16. addressing and routing • fast, probabilistic search algorithm • Bloom filter • probabilistic set membership test using bit vector • n-bit vector generated from n hashes of each set element • filter is union (OR) of all bit vectors • attenuated Bloom filter • array of d Bloom filters • i th Bloom filter is union of all <i -hop nodes • slow, deterministic algorithm • Tapestry

  17. Bloom Filter • Quickly tests if an element w belongs to a set S K=3 (Taken from Wikipedia) For each element of the set S, compute Khash functions, and enter them into the corresponding slots of the array. There is a very small risk of false positive. Note that w does not belong to the set S

  18. Attenuated Bloom Filter Quickly tests if an object w is stored in a site S, and if not, then in which direction to route the query W It gives a hint of how far away a match can possibly be found

  19. addressing and routing deterministic probabilistic

  20. Attenuated Bloom Filter

  21. Tapestry uses Plaxton Routing Using the routing tables, the query is forwarded towards the root node of the object. The root points to the server storing the object. Extra copies may be discovered earlier as the query moves towards the root. Root of O (O,S’,3) (O,S’,1) (O,S’,2) Server S’ (O,S,2) (O,S,1) (O,S) Server S

  22. updates • based on versioning and conflict resolution • i.e. no locking • update: actions with predicates • commit – apply action of first true predicate • abort – no true predicates • conflict resolution on encrypted data • possible predicates: • compare-version, compare-size, compare-block, search • possible actions: • replace-block, insert-block, delete-block, append

  23. Update on ciphertext

  24. updates • serializing updates • will not trust any single server • Byzantine agreement among primary tierservers • secondary tier gossips tentative data during commit • multicast dissemination of commit to secondary tier primary secondary

  25. archival • produced when objects idle • use erasure codes (redundant fragmentation) • simplest example: parity bit • need any (n-1) out of n fragments • Reed-Solomon codes • fragmentation improves reliability File Any 4 can Reconstruct the block Any 4 can Reconstruct the block

  26. dynamic optimization (introspection) • observation modules • collect and summarize information • incrementally update system database • and optimization modules • periodically process the observation database • replica management: maintain replica count and location • periodic migration: work-home-work-home… • etc

More Related