460 likes | 477 Views
OceanStore: An Architecture for Global-Scale Persistent Storage. John Kubiatowicz, et al ASPLOS 2000. 10 10 users, each with 10,000 files. OceanStore. Global scale information storage. Mobile access to information in a uniform and highly available way. Servers are untrusted.
E N D
OceanStore: An Architecture for Global-Scale Persistent Storage John Kubiatowicz, et al ASPLOS 2000
1010 users, each with 10,000 files OceanStore • Global scale information storage. • Mobile access to information in a uniform and highly available way. • Servers are untrusted. • Caches data anywhere, anytime. • Monitors usage patterns.
OceanStore [Rhea et al. 2003]
Main Goals • Untrusted infrastructure • Nomadic data
Personal information management tools: calendars, emails, contact lists Example applications • Groupware and PIM • Email • Digital libraries • Scientific data repository
Challenges: scaling, consistency, migration, network failures Example applications • Groupware and PIM • Email • Digital libraries • Scientific data repository
Storage Organization • OceanStore data object ~= file • Ordered sequence of read-only versions • Every version of every object kept forever • Can be used as backup • An object contains metadata, data, and references to previous versions
Storage Organization • A stream of objects identified by AGUID • Active globally-unique identifier • Cryptographically-secure hash of an application-specific name and the owner’s public key • Prevents namespace collisions
Storage Organization • Each version of data object stored in a B-tree like data structure • Each block has a BGUID • Cryptographically-secure hash of the block content • Each version has a VGUID • Two versions may share blocks
Storage Organization [Rhea et al. 2003]
Access Control • Restricting readers: • Symmetric encryption key distributed to allowed readers. • Restricting writers: • ACL. • Signed writes. • ACL for object chosen with signed certificate.
Find 11010 Location and RoutingAttenuated Bloom Filters
Updating data • All data is encrypted. • A set of predicates is evaluated in order. • The actions of the earliest true predicate are applied. • Update is logged if it commits or aborts. • Predicates: • compare-version, compare-block, compare-size, search • Actions • replace-block, insert-block, delete-block, append
Application-Specific Consistency • An update is the operation of adding a new version to the head of a version stream • Updates are applied atomically • Represented as an array of potential actions • Each guarded by a predicate
Application-Specific Consistency • Example actions • Replacing some bytes • Appending new data to an object • Truncating an object • Example predicates • Check for the latest version number • Compare bytes
Application-Specific Consistency • To implement ACID semantic • Check for readers • If none, update • Append to a mailbox • No checking • No explicit locks or leases
Application-Specific Consistency • Predicate for reads • Examples • Can’t read something older than 30 seconds • Only can read data from a specific time frame
Replication and Consistency • A data object is a sequence of read-only versions, consisting of read-only blocks, named by BGUIDs • No issues for replication • The mapping from AGUID to the latest VGUID may change • Use primary-copy replication
Serializing updates • A small primary tier of replicas run a Byzantine agreement protocol. • A secondary tier of replicas optimistically propagate the update using an epidemic protocol. • Ordering from primary tier is multicasted to secondary replicas.
Deep Archival Storage • Data is fragmented. • Each fragment is an object. • Erasure coding is used to increase reliability.
Introspection computation observation optimization • Uses: • Cluster recognition • Replica management • Other uses
Software Architecture • Java atop the Staged Event Driven Architecture (SEDA) • Each subsystem is implemented as a stage • With each own state and thread pool • Stages communicate through events • 50,000 semicolons by five graduate students and many undergrad interns
Language Choice • Java: speed of development • Strongly typed • Garbage collected • Reduced debugging time • Support for events • Easy to port multithreaded code in Java • Ported to Windows 2000 in one week
Language Choice • Problems with Java: • Unpredictability introduced by garbage collection • Every thread in the system is halted while the garbage collector runs • Any on-going process stalls for ~100 milliseconds • May add several seconds to requests travel cross machines
Experimental Setup • Two test beds • Local cluster of 42 machines at Berkeley • Each with 2 1.0 GHz Pentium III • 1.5GB PC133 SDRAM • 2 36GB hard drives, RAID 0 • Gigabit Ethernet adaptor • Linux 2.4.18 SMP
Experimental Setup • PlanetLab, ~100 nodes across ~40 sites • 1.2 GHz Pentium III, 1GB RAM • ~1000 virtual nodes
Storage Overhead • For 32 choose 16 erasure encoding • 2.7x for data > 8KB • For 64 choose 16 erasure encoding • 4.8x for data > 8KB
The Latency Benchmark • A single client submits updates of various sizes to a four-node inner ring • Metric: Time from before the request is signed to the signature over the result is checked • Update 40 MB of data over 1000 updates, with 100ms between updates
The Throughput Microbenchmark • A number of clients submit updates of various sizes to disjoint objects, to a four-node inner ring • The clients • Create their objects • Synchronize themselves • Update the object as many time as possible for 100 seconds
Archive Retrieval Performance • Populate the archive by submitting updates of various sizes to a four-node inner ring • Delete all copies of the data in its reconstructed form • A single client submits reads
Archive Retrieval Performance • Throughput: • 1.19 MB/s (Planetlab) • 2.59 MB/s (local cluster) • Latency • ~30-70 milliseconds
The Stream Benchmark • Ran 500 virtual nodes on PlanetLab • Inner Ring in SF Bay Area • Replicas clustered in 7 largest P-Lab sites • Streams updates to all replicas • One writer - content creator – repeatedly appends to data object • Others read new versions as they arrive • Measure network resource consumption
The Tag Benchmark • Measures the latency of token passing • OceanStore 2.2 times slower than TCP/IP
The Andrew Benchmark • File system benchmark • 4.6x than NFS in read-intensive phases • 7.3x slower in write-intensive phases
Bloom Filters • Compact data structures for a probabilistic representation of a set • Appropriate to answer membership queries [Koloniari and Pitoura]
Bloom Filters (cont’d) Query for b: check the bits at positions H1(b), H2(b), ..., H4(b). back
V0: (x0) V0 : (x0) write x write x write x V1 : (x1) V2: (x2) V3 : (x3) V4 : (x4) V5: (x5) Pair-Wise Reconciliation Site A Site B Site C V0 : (x0) [Kang et al. 2003]
H0 H0 H0 H1 H3 H2 H0 H0 H1 H2 H3 H2 H1 H4 H4 H5 Hash History Reconciliation back Site A Site B Site C V0 H0 V2 V3 V1 V4 V5 Hi = hash (Vi)
n Erasure Codes n Message Encoding Algorithm cn Encoding Transmission Received Decoding Algorithm n Message back [Mitzenmacher]