Pond: The OceanStore Prototype
Introduction • Problem: Rising cost of storage management • Observations: • Universal connectivity via the Internet • A terabyte of storage for ~$100 within three years • Solution: OceanStore
OceanStore • Internet-scale • Cooperative file system • High durability • Universal availability • Two-tier storage system • Upper tier: powerful servers • Lower tier: less powerful hosts
More on OceanStore • Unit of storage: data object • Applications: email, UNIX file system • Requirements for the object interface • Information universally accessible • Balance between privacy and sharing • Simple and usable consistency model • Data integrity
OceanStore Assumptions • Infrastructure is untrusted except in aggregate • Most nodes are neither faulty nor malicious • Infrastructure is constantly changing • Resources enter and exit the network without prior warning • Self-organizing, self-repairing, self-tuning
OceanStore Challenges • Expressive storage interface • High durability on an untrusted and constantly changing base
Data Model • The view of the system that is presented to client applications
Storage Organization • OceanStore data object ~= file • Ordered sequence of read-only versions • Every version of every object kept forever • Can be used as backup • An object contains metadata, data, and references to previous versions
Storage Organization • An object's stream of versions is identified by its AGUID • Active globally-unique identifier • Cryptographically-secure hash of an application-specific name and the owner’s public key • Prevents namespace collisions
Storage Organization • Each version of a data object is stored in a B-tree-like data structure • Each block has a BGUID • Cryptographically-secure hash of the block content • Each version has a VGUID • Two versions may share blocks
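A minimal sketch of how such self-verifying names can be derived (SHA-1 is assumed as the secure hash; the class and method names below are illustrative, not Pond's actual API):

```java
import java.security.MessageDigest;
import java.security.PublicKey;

// Illustrative sketch: GUIDs as cryptographically-secure hashes.
class Guids {
    // BGUID: hash of the block's contents, making every block self-verifying.
    static byte[] bguid(byte[] blockContents) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(blockContents);
    }

    // AGUID: hash of an application-specific name plus the owner's public key,
    // which prevents namespace collisions between owners.
    static byte[] aguid(String appName, PublicKey ownerKey) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(appName.getBytes("UTF-8"));
        md.update(ownerKey.getEncoded());
        return md.digest();
    }
}
```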
Application-Specific Consistency • An update is the operation of adding a new version to the head of a version stream • Updates are applied atomically • Represented as an array of potential actions • Each guarded by a predicate
Application-Specific Consistency • Example actions • Replacing some bytes • Appending new data to an object • Truncating an object • Example predicates • Check for the latest version number • Compare bytes
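The guarded-update model above can be sketched roughly as follows (Version, Predicate, and Action are hypothetical stand-ins for Pond's actual interfaces):

```java
import java.util.List;

// Hypothetical sketch of an update: a list of (predicate, action) pairs
// evaluated atomically against the latest version of the object.
interface Version {}                                    // placeholder for an object version
interface Predicate { boolean holds(Version latest); }
interface Action    { Version apply(Version latest); }

record Guarded(Predicate when, Action then) {}

class Update {
    private final List<Guarded> clauses;
    Update(List<Guarded> clauses) { this.clauses = clauses; }

    // The action of the first clause whose predicate holds is applied;
    // if no predicate holds, the update aborts and the stream is unchanged.
    Version applyTo(Version latest) {
        for (Guarded g : clauses) {
            if (g.when().holds(latest)) return g.then().apply(latest);
        }
        return latest; // aborted
    }
}
```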
Application-Specific Consistency • To implement ACID semantics: check for readers; if none, apply the update • To append to a mailbox: no checking needed • No explicit locks or leases
Application-Specific Consistency • Predicates for reads • Examples • Can’t read anything older than 30 seconds • Can only read data from a specific time frame
System Architecture • Unit of synchronization: data object • Changes to different objects are independent
Virtualization through Tapestry • Resources are virtual and not tied to particular hardware • A virtual resource has a GUID, a globally unique identifier • Uses Tapestry, a decentralized object location and routing system • Scalable overlay network, built on TCP/IP
Virtualization through Tapestry • Use GUIDs to address hosts and resources • Hosts publish the GUIDs of their resources in Tapestry • Hosts also can unpublish GUIDs and leave the network
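A hypothetical sketch of the location interface such a layer exposes (method names are illustrative, not Tapestry's real API):

```java
// Illustrative interface for a Tapestry-like decentralized object
// location and routing layer; all names here are hypothetical.
interface ObjectLocationLayer {
    // Advertise that this host serves the resource named by guid.
    void publish(byte[] guid);

    // Withdraw the advertisement, e.g., before leaving the network.
    void unpublish(byte[] guid);

    // Route a message to some (preferably nearby) host that has published
    // guid, without knowing that host's IP address in advance.
    void routeToObject(byte[] guid, byte[] message);
}
```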
Replication and Consistency • A data object is a sequence of read-only versions, consisting of read-only blocks named by BGUIDs • Read-only blocks can be replicated freely, with no consistency issues • The mapping from AGUID to the latest VGUID may change, however • Use primary-copy replication for it
Replication and Consistency • The primary copy • Enforces access control • Serializes concurrent updates
Archival Storage • Replication: 2x storage to tolerate one failure • Erasure coding is much more space-efficient • A block is divided into m fragments • The m fragments are encoded into n > m fragments • Any m fragments can restore the original block
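The shape of such a code can be sketched as below (a conceptual interface only, not a working implementation):

```java
// Conceptual shape of an (m, n) erasure code; this is not a real
// implementation, only the interface the archival layer relies on.
interface ErasureCode {
    // Encode a block into n fragments, any m of which suffice to rebuild it,
    // so up to n - m fragments may be lost without losing the block.
    byte[][] encode(byte[] block, int m, int n);

    // Reconstruct the original block from any m surviving fragments.
    byte[] decode(byte[][] anyMFragments, int m, int n);
}
```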
Caching of Data Objects • Reconstructing a block from its erasure-coded fragments is expensive • Requires locating and fetching m fragments from m different machines • Use whole-block caching for frequently-read objects
Caching of Data Objects • To read a block, look for the block first • If not available • Find block fragments • Decode fragments • Publish that the host now caches the block • Amortize the cost of erasure encoding/decoding
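A rough sketch of that read path (the helper methods are invented stand-ins for the cache, the Tapestry lookup, and the erasure decoder):

```java
import java.util.Optional;

// Illustrative read path for a single block; the stubs below stand in
// for the real cache, location, and decoding subsystems.
class BlockReader {
    private final int m = 16; // fragments needed to rebuild a block

    byte[] readBlock(byte[] bguid) {
        Optional<byte[]> cached = findCachedBlock(bguid); // whole-block replica available?
        if (cached.isPresent()) return cached.get();

        byte[][] fragments = locateFragments(bguid, m);   // fetch any m fragments
        byte[] block = erasureDecode(fragments);          // expensive reconstruction
        cacheAndPublish(bguid, block);                    // amortize the cost for later readers
        return block;
    }

    // Hypothetical stubs.
    Optional<byte[]> findCachedBlock(byte[] bguid)   { return Optional.empty(); }
    byte[][] locateFragments(byte[] bguid, int m)    { throw new UnsupportedOperationException(); }
    byte[] erasureDecode(byte[][] fragments)         { throw new UnsupportedOperationException(); }
    void cacheAndPublish(byte[] bguid, byte[] block) { }
}
```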
Caching of Data Objects • Updates are pushed to secondary replicas via application-level multicast tree
The Full Update Path • Serialized updates are disseminated via the multicast tree for an object • At the same time, updates are encoded and fragmented for long-term storage
The Primary Replica • Primary servers run a Byzantine agreement protocol • Requires more than 2/3 of the participants to be nonfaulty • The number of messages required grows quadratically with the number of participants
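Those two constraints can be made concrete with a small example (f = 1 is chosen only for illustration):

```java
// A Byzantine agreement protocol tolerating f faulty servers needs at
// least 3f + 1 participants, and its all-to-all rounds cost on the
// order of n^2 messages, which is why the inner ring is kept small.
class BftSizing {
    public static void main(String[] args) {
        int f = 1;                    // faulty servers to tolerate
        int n = 3 * f + 1;            // minimum ring size: 4
        int messagesPerRound = n * n; // quadratic growth with ring size
        System.out.println("ring size " + n + ", ~" + messagesPerRound + " messages per round");
    }
}
```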
Public-Key Cryptography • Too expensive • Use symmetric-key message authentication codes (MACs) instead • Two to three orders of magnitude faster • Downside: can’t prove the authenticity of a message to a third party • MACs are used only within the inner ring • Public-key cryptography for the outer ring
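For reference, computing a symmetric-key MAC in standard Java looks like this (HmacSHA1 is chosen here only as an example algorithm):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Symmetric-key message authentication: cheap to compute and verify,
// but only holders of the shared key can check it, so it cannot prove
// a message's authenticity to a third party the way a signature can.
class MacExample {
    static byte[] authenticate(byte[] sharedKey, byte[] message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(sharedKey, "HmacSHA1"));
        return mac.doFinal(message);
    }
}
```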
Proactive Threshold Signatures • Byzantine agreement guarantees correctness only if no more than 1/3 of the servers fail during the life of the system • Not practical for a long-lived system • Servers need to be rebooted at regular intervals • But the set of key holders is fixed
Proactive Threshold Signatures • Proactive threshold signatures allow more flexibility in choosing the membership of the inner ring • A public key is paired with a number of private key shares • Each server uses its share to generate a signature share
Proactive Threshold Signatures • Any k shares may be combined to produce a full signature • To change membership of an inner ring • Regenerate signature shares • No need to change the public key • Transparent to secondary hosts
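A conceptual sketch of that split (all types here are hypothetical; a real proactive threshold signature scheme involves far more machinery):

```java
import java.util.List;

// Hypothetical interfaces for a k-of-n threshold signature scheme.
interface ShareHolder {
    // Each inner-ring server signs with its own private key share.
    byte[] signShare(byte[] message);
}

interface ShareCombiner {
    // Any k valid shares combine into one signature that verifies under
    // the single, unchanged public key known to secondary hosts.
    byte[] combine(byte[] message, List<byte[]> anyKShares);
}
```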
The Responsible Party • Who chooses the inner ring? • Responsible party: • A server that publishes sets of failure-independent nodes • Through offline measurement and analysis
Software Architecture • Java atop the Staged Event-Driven Architecture (SEDA) • Each subsystem is implemented as a stage • Each with its own state and thread pool • Stages communicate through events • ~50,000 semicolons of code, written by five graduate students and many undergraduate interns
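A minimal sketch of the stage pattern (an event queue plus a private thread pool); the class below is illustrative, not Pond's actual code:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative SEDA-style stage: events arrive on a queue and are handled
// by the stage's own thread pool; stages talk to each other only via queues.
abstract class Stage<E> {
    private final BlockingQueue<E> events = new LinkedBlockingQueue<>();
    private final ExecutorService workers;

    Stage(int threads) {
        workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) workers.submit(this::loop);
    }

    void enqueue(E event) { events.add(event); }  // called by upstream stages

    private void loop() {
        try {
            while (true) handle(events.take());   // pull events and dispatch them
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    protected abstract void handle(E event);      // stage-specific logic
}
```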
Language Choice • Java: speed of development • Strongly typed • Garbage collected • Reduced debugging time • Support for events • Multithreaded Java code is easy to port • Ported to Windows 2000 in one week
Language Choice • Problems with Java: • Unpredictability introduced by garbage collection • Every thread in the system is halted while the garbage collector runs • Any ongoing operation stalls for ~100 milliseconds • Can add several seconds to requests that travel across machines
Experimental Setup • Two test beds • Local cluster of 42 machines at Berkeley • Each with two 1.0 GHz Pentium III CPUs • 1.5 GB PC133 SDRAM • Two 36 GB hard drives in RAID 0 • Gigabit Ethernet adapter • Linux 2.4.18 SMP
Experimental Setup • PlanetLab, ~100 nodes across ~40 sites • 1.2 GHz Pentium III, 1GB RAM • ~1000 virtual nodes
Storage Overhead • For a rate 16/32 erasure code (m = 16, n = 32): 2.7x for data > 8 KB • For a rate 16/64 erasure code (m = 16, n = 64): 4.8x for data > 8 KB
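For context, the raw expansion of a rate m/n code is just n/m; the measured figures above are somewhat higher, presumably because of per-fragment naming and verification metadata (that explanation is an assumption, not stated on this slide):

```java
// Raw expansion of the two encodings versus the measured overheads above.
class ArchiveExpansion {
    public static void main(String[] args) {
        System.out.println("rate 16/32: raw " + (32.0 / 16) + "x, measured 2.7x");
        System.out.println("rate 16/64: raw " + (64.0 / 16) + "x, measured 4.8x");
    }
}
```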
The Latency Benchmark • A single client submits updates of various sizes to a four-node inner ring • Metric: time from just before the request is signed until the signature over the result is verified • 40 MB of data written over 1000 updates, with 100 ms between updates
The Throughput Microbenchmark • A number of clients submit updates of various sizes, to disjoint objects, to a four-node inner ring • The clients • Create their objects • Synchronize themselves • Update the objects as many times as possible for 100 seconds
Archive Retrieval Performance • Populate the archive by submitting updates of various sizes to a four-node inner ring • Delete all copies of the data in its reconstructed form • A single client submits reads
Archive Retrieval Performance • Throughput: • 1.19 MB/s (PlanetLab) • 2.59 MB/s (local cluster) • Latency: • ~30-70 milliseconds
The Stream Benchmark • Ran 500 virtual nodes on PlanetLab • Inner ring in the SF Bay Area • Replicas clustered in the 7 largest PlanetLab sites • Streams updates to all replicas • One writer (the content creator) repeatedly appends to the data object • Others read new versions as they arrive • Measures network resource consumption
The Tag Benchmark • Measures the latency of token passing • OceanStore is 2.2 times slower than TCP/IP
The Andrew Benchmark • File system benchmark • 4.6x slower than NFS in read-intensive phases • 7.3x slower in write-intensive phases