
OceanStore: An Architecture for Global-Scale Persistent Storage


Presentation Transcript


  1. OceanStore: An Architecture for Global-Scale Persistent Storage John Kubiatowicz et al., ASPLOS 2000

  2. OceanStore • Target scale: 10^10 users, each with 10,000 files • Global-scale information storage • Mobile access to information in a uniform and highly available way • Servers are untrusted • Caches data anywhere, anytime • Monitors usage patterns

  3. OceanStore [Rhea et al. 2003]

  4. Main Goals • Untrusted infrastructure • Nomadic data

  5. Example Applications • Groupware and PIM (personal information management tools: calendars, email, contact lists) • Email • Digital libraries • Scientific data repository

  6. Example Applications (cont’d) • Challenges: scaling, consistency, migration, network failures • Groupware and PIM • Email • Digital libraries • Scientific data repository

  7. Storage Organization • OceanStore data object ~= file • Ordered sequence of read-only versions • Every version of every object kept forever • Can be used as backup • An object contains metadata, data, and references to previous versions

  8. Storage Organization • Each object is a stream of versions identified by an AGUID • Active globally-unique identifier • Cryptographically-secure hash of an application-specific name and the owner’s public key • Prevents namespace collisions

  9. Storage Organization • Each version of data object stored in a B-tree like data structure • Each block has a BGUID • Cryptographically-secure hash of the block content • Each version has a VGUID • Two versions may share blocks
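A minimal sketch of how these identifiers could be derived; the class and method names are hypothetical, and SHA-1 is used here only as an example of a cryptographically-secure hash:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    // Hypothetical sketch of GUID derivation.
    class Guids {
        // BGUID: secure hash of the block's content, so identical blocks in
        // different versions share a single name (and a single stored copy).
        static byte[] bguid(byte[] blockContent) throws Exception {
            return MessageDigest.getInstance("SHA-1").digest(blockContent);
        }

        // AGUID: secure hash of an application-specific name plus the owner's
        // public key, so two owners cannot collide on the same name.
        static byte[] aguid(String appName, byte[] ownerPublicKey) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(appName.getBytes(StandardCharsets.UTF_8));
            md.update(ownerPublicKey);
            return md.digest();
        }
    }

Because a BGUID depends only on block content, two versions that differ in a few blocks automatically share the unchanged ones.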

  10. Storage Organization [Rhea et al. 2003]

  11. Access Control • Restricting readers: • Symmetric encryption key distributed to allowed readers. • Restricting writers: • ACL. • Signed writes. • The ACL for an object is chosen with a signed certificate.
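A rough sketch of the writer-side check, assuming RSA signatures and an in-memory ACL (both hypothetical simplifications; reader restriction via symmetric-key encryption of the data is omitted):

    import java.security.PublicKey;
    import java.security.Signature;
    import java.util.Set;

    // Hypothetical sketch: a write is accepted only if it is signed by a key
    // that appears on the object's ACL.
    class WriteCheck {
        static boolean acceptWrite(byte[] updateBytes, byte[] signature,
                                   PublicKey writerKey, Set<PublicKey> acl) throws Exception {
            if (!acl.contains(writerKey)) return false;    // writer not on the ACL
            Signature verifier = Signature.getInstance("SHA1withRSA");
            verifier.initVerify(writerKey);
            verifier.update(updateBytes);
            return verifier.verify(signature);             // reject forged writes
        }
    }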

  12. Location and Routing: Attenuated Bloom Filters • Figure: routing a query for an object whose filter bit pattern is 11010

  13. Location and Routing: Plaxton-like trees

  14. Updating data • All data is encrypted. • A set of predicates is evaluated in order. • The actions of the earliest true predicate are applied. • The update is logged whether it commits or aborts. • Predicates: • compare-version, compare-block, compare-size, search • Actions: • replace-block, insert-block, delete-block, append
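A minimal sketch of this update format, assuming hypothetical Update, Action, and ObjectVersion types; OceanStore's actual wire format and predicate set are richer than this:

    import java.util.List;
    import java.util.function.Predicate;

    // Hypothetical sketch: an update is a list of (predicate, actions) pairs.
    // Replicas evaluate the predicates in order and apply the actions of the
    // first predicate that holds; if none holds, the update aborts.
    class Update {
        record Guarded(Predicate<ObjectVersion> predicate, List<Action> actions) {}

        final List<Guarded> clauses;
        Update(List<Guarded> clauses) { this.clauses = clauses; }

        boolean applyTo(ObjectVersion latest) {
            for (Guarded g : clauses) {
                if (g.predicate().test(latest)) {
                    g.actions().forEach(a -> a.apply(latest));
                    return true;   // commit
                }
            }
            return false;          // abort
        }
    }

    interface Action { void apply(ObjectVersion version); }
    interface ObjectVersion { long versionNumber(); /* blocks, size, ... */ }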

  15. Application-Specific Consistency • An update is the operation of adding a new version to the head of a version stream • Updates are applied atomically • Represented as an array of potential actions • Each guarded by a predicate

  16. Application-Specific Consistency • Example actions • Replacing some bytes • Appending new data to an object • Truncating an object • Example predicates • Check for the latest version number • Compare bytes

  17. Application-Specific Consistency • To implement ACID semantic • Check for readers • If none, update • Append to a mailbox • No checking • No explicit locks or leases
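Building on the hypothetical Update type sketched under slide 14, the two consistency styles above might look like this (checkAndSet and append are illustrative names, not OceanStore API):

    import java.util.List;

    class UpdateExamples {
        // ACID-style check-and-update: commits only while the object is still at
        // the version the client read; otherwise the update aborts and the
        // client retries.
        static Update checkAndSet(long expectedVersion, Action replaceBlock) {
            return new Update(List.of(new Update.Guarded(
                    v -> v.versionNumber() == expectedVersion, List.of(replaceBlock))));
        }

        // Mailbox append: a trivially-true predicate, so no explicit locks or
        // leases are needed; the inner ring still serializes concurrent appends.
        static Update append(Action appendBlock) {
            return new Update(List.of(new Update.Guarded(
                    v -> true, List.of(appendBlock))));
        }
    }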

  18. Application-Specific Consistency • Predicates for reads • Examples: • Can’t read anything older than 30 seconds • Can only read data from a specific time frame

  19. Replication and Consistency • A data object is a sequence of read-only versions, consisting of read-only blocks named by BGUIDs • Read-only blocks can be replicated freely with no consistency issues • The mapping from AGUID to the latest VGUID may change • Use primary-copy replication

  20. Serializing updates • A small primary tier of replicas run a Byzantine agreement protocol. • A secondary tier of replicas optimistically propagate the update using an epidemic protocol. • Ordering from primary tier is multicasted to secondary replicas.
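A toy sketch of the second tier's dissemination step (names hypothetical): ordering is decided by the inner ring, and secondary replicas only spread already-committed updates epidemically.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Toy epidemic propagation: each secondary replica remembers the committed
    // updates it has seen and pushes new ones to a few random peers, so the
    // inner ring's ordering eventually reaches every replica.
    class SecondaryReplica {
        final Set<String> seenUpdates = new HashSet<>();
        final List<SecondaryReplica> peers = new ArrayList<>();

        void receive(String committedUpdate) {
            if (seenUpdates.add(committedUpdate)) {   // ignore duplicates
                gossip(committedUpdate, 3);           // small random fan-out
            }
        }

        private void gossip(String update, int fanOut) {
            List<SecondaryReplica> shuffled = new ArrayList<>(peers);
            Collections.shuffle(shuffled);
            shuffled.stream().limit(fanOut).forEach(p -> p.receive(update));
        }
    }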

  21. The Full Update Path

  22. Deep Archival Storage • Data is fragmented. • Each fragment is an object. • Erasure coding is used to increase reliability.

  23. Introspection • A cycle of computation, observation, and optimization • Uses: • Cluster recognition • Replica management • Other uses

  24. Software Architecture • Java atop the Staged Event Driven Architecture (SEDA) • Each subsystem is implemented as a stage • Each with its own state and thread pool • Stages communicate through events • 50,000 semicolons written by five graduate students and many undergrad interns
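A bare-bones sketch of a SEDA-style stage with hypothetical names; the real SEDA framework adds admission control and dynamic thread-pool sizing on top of this shape:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    // Each stage owns its state, an event queue, and a thread pool; stages
    // interact only by enqueueing events on one another.
    class Stage<E> {
        private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
        private final ExecutorService pool;

        Stage(int threads, Consumer<E> handler) {
            pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        try {
                            handler.accept(events.take());   // process one event
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
            }
        }

        // Other stages call enqueue() rather than invoking this stage directly.
        void enqueue(E event) { events.offer(event); }
    }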

  25. Software Architecture

  26. Language Choice • Java: speed of development • Strongly typed • Garbage collected • Reduced debugging time • Support for events • Easy to port multithreaded code in Java • Ported to Windows 2000 in one week

  27. Language Choice • Problems with Java: • Unpredictability introduced by garbage collection • Every thread in the system is halted while the garbage collector runs • Any ongoing process stalls for ~100 milliseconds • May add several seconds to requests that travel across machines

  28. Experimental Setup • Two test beds • Local cluster of 42 machines at Berkeley • Each with two 1.0 GHz Pentium IIIs • 1.5 GB PC133 SDRAM • Two 36 GB hard drives, RAID 0 • Gigabit Ethernet adaptor • Linux 2.4.18 SMP

  29. Experimental Setup • PlanetLab, ~100 nodes across ~40 sites • 1.2 GHz Pentium III, 1GB RAM • ~1000 virtual nodes

  30. Storage Overhead • For 32 choose 16 erasure encoding • 2.7x for data > 8KB • For 64 choose 16 erasure encoding • 4.8x for data > 8KB
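As a rough check on these numbers: a rate-16/32 code stores 32 fragments for every 16 fragments of data, a raw 2x expansion, and a rate-16/64 code gives a raw 4x expansion; the reported 2.7x and 4.8x are plausibly that expansion plus per-fragment naming and verification metadata, which also explains why objects under 8 KB fare worse.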

  31. The Latency Benchmark • A single client submits updates of various sizes to a four-node inner ring • Metric: time from just before the request is signed until the signature over the result is verified • Update 40 MB of data over 1000 updates, with 100 ms between updates

  32. The Latency Benchmark

  33. The Throughput Microbenchmark • A number of clients submit updates of various sizes to disjoint objects on a four-node inner ring • The clients • Create their objects • Synchronize themselves • Update the objects as many times as possible for 100 seconds

  34. The Throughput Microbenchmark

  35. Archive Retrieval Performance • Populate the archive by submitting updates of various sizes to a four-node inner ring • Delete all copies of the data in its reconstructed form • A single client submits reads

  36. Archive Retrieval Performance • Throughput: • 1.19 MB/s (PlanetLab) • 2.59 MB/s (local cluster) • Latency: • ~30-70 milliseconds

  37. The Stream Benchmark • Ran 500 virtual nodes on PlanetLab • Inner ring in the SF Bay Area • Replicas clustered in the 7 largest PlanetLab sites • Streams updates to all replicas • One writer (the content creator) repeatedly appends to the data object • Others read new versions as they arrive • Measure network resource consumption

  38. The Stream Benchmark

  39. The Tag Benchmark • Measures the latency of token passing • OceanStore 2.2 times slower than TCP/IP

  40. The Andrew Benchmark • File system benchmark • 4.6x slower than NFS in read-intensive phases • 7.3x slower in write-intensive phases

  41. Bloom Filters • Compact data structures for a probabilistic representation of a set • Appropriate to answer membership queries [Koloniari and Pitoura]

  42. Bloom Filters (cont’d) • Query for b: check the bits at positions H1(b), H2(b), ..., H4(b).
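A minimal, non-attenuated Bloom filter in the spirit of slides 41-42; the k positions are derived by double hashing, which is one common choice rather than anything specific to OceanStore:

    import java.util.BitSet;

    // k hash functions set/check k bit positions; lookups can return false
    // positives but never false negatives.
    class BloomFilter {
        private final BitSet bits;
        private final int m;   // number of bits in the filter
        private final int k;   // number of hash functions

        BloomFilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

        // Derive position Hi(x) from two base hashes (double hashing).
        private int position(Object x, int i) {
            int h1 = x.hashCode();
            int h2 = Integer.reverse(h1) ^ 0x5bd1e995;
            return Math.floorMod(h1 + i * h2, m);
        }

        void add(Object x) {
            for (int i = 0; i < k; i++) bits.set(position(x, i));
        }

        // Query for x: check the bits at positions H1(x) .. Hk(x).
        boolean mightContain(Object x) {
            for (int i = 0; i < k; i++) if (!bits.get(position(x, i))) return false;
            return true;
        }
    }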

  43. Pair-Wise Reconciliation • Figure: sites A, B, and C each start from version V0 (x0); independent writes to x produce versions V1 (x1) through V5 (x5), which the sites must reconcile pair-wise [Kang et al. 2003]

  44. Hash History Reconciliation • Figure: each version Vi is named by its hash Hi = hash(Vi); sites A, B, and C compare their hash histories (H0 through H5) to find common ancestors when reconciling

  45. Erasure Codes • Figure: an n-block message is run through the encoding algorithm to produce cn encoded blocks; after transmission, any n received blocks let the decoding algorithm reconstruct the original n-block message [Mitzenmacher]
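To make the figure concrete, here is the simplest possible erasure code, a toy (n+1, n) XOR parity scheme that survives one lost block; OceanStore's actual codes (e.g. 16-of-32) tolerate far more erasures, but the encode/transmit/decode shape is the same:

    // Toy erasure code: one parity block (the XOR of all data blocks) lets any
    // single missing block be rebuilt from the remaining ones.
    class XorParity {
        // Encode n equal-length data blocks into n + 1 blocks.
        static byte[][] encode(byte[][] data) {
            int len = data[0].length;
            byte[][] out = new byte[data.length + 1][];
            byte[] parity = new byte[len];
            for (int i = 0; i < data.length; i++) {
                out[i] = data[i];
                for (int j = 0; j < len; j++) parity[j] ^= data[i][j];
            }
            out[data.length] = parity;
            return out;
        }

        // Rebuild the one missing block (the null entry at missingIndex) by
        // XOR-ing every block that did arrive.
        static byte[] recover(byte[][] received, int missingIndex) {
            int len = received[(missingIndex + 1) % received.length].length;
            byte[] rebuilt = new byte[len];
            for (int i = 0; i < received.length; i++) {
                if (i == missingIndex) continue;
                for (int j = 0; j < len; j++) rebuilt[j] ^= received[i][j];
            }
            return rebuilt;
        }
    }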
