GLACIER: HIGHLY DURABLE, DECENTRALIZED STORAGE DESPITE MASSIVE CORRELATED FAILURES. PRESENTED BY ANILA JAGANNATHAM.
INTRODUCTION
• GLACIER IS A DISTRIBUTED STORAGE SYSTEM THAT RELIES ON MASSIVE REDUNDANCY TO MASK THE EFFECT OF LARGE-SCALE CORRELATED FAILURES.
• AIM: TO PROVIDE HIGHLY DURABLE STORAGE DESPITE CORRELATED BYZANTINE FAILURES OF A MAJORITY OF THE PARTICIPATING NODES. EXAMPLE: INTERNET WORM ATTACKS.
WHY IS IT DIFFERENT?
• OceanStore and Phoenix rely on introspection, which assumes an accurate failure model. Problem: observation does not reveal low-incidence failures, and humans cannot predict all sources of correlated failures.
• Glacier differs from OceanStore and Phoenix in that it makes no assumptions about the nature of the failure. It uses the abundant but unreliable storage space on the nodes to provide durable storage for critical data.
REQUIREMENTS
• Nodes form an overlay network.
• A directory service maps each key to the address of a live node that is currently responsible for the key.
• Keys form a circular space.
• Each node is responsible for a uniformly sized segment of the key space.
• Node identifiers are assigned pseudo-randomly to prevent Sybil attacks.
• Glacier has to reliably identify, authenticate and communicate with the node that is currently responsible for a given key.
ARCHITECTURE
• Glacier operates alongside a primary store.
• Primary store: provides read/write access and short-term availability by masking individual failures.
• Glacier: acts as archival storage.
• Aggregation layer: aggregates small objects prior to insertion into Glacier.
INTERFACE TO APPLICATION
• Lease: controls the lifetime of stored objects.
• When the lease expires, the object is removed from storage.
• The lease period is chosen to exceed the assumed maximal duration of a large-scale failure (several weeks or months).
• Applications interact with Glacier using the following methods:
  • put(i, v, o, l): store an object o under identifier i, version v and lease period l
  • get(i, v): retrieve a stored object
  • refresh(i, v, l): extend the lease of an existing object
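The lease semantics of put, get and refresh can be illustrated with a small in-memory sketch. This is not Glacier's implementation: the class name `ToyArchive`, the explicit `now` argument and the choice to represent a lease as an absolute expiry time are assumptions made only so the example is self-contained and deterministic.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class StoredObject:
    data: bytes
    expires_at: float  # time at which the lease runs out


class ToyArchive:
    """In-memory stand-in for the put/get/refresh interface (hypothetical)."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, int], StoredObject] = {}

    def put(self, i: str, v: int, o: bytes, l: float, now: float) -> None:
        # Store object o under identifier i and version v with lease period l.
        self._objects[(i, v)] = StoredObject(o, now + l)

    def get(self, i: str, v: int, now: float) -> Optional[bytes]:
        entry = self._objects.get((i, v))
        if entry is None or entry.expires_at <= now:
            # An expired lease means the object is removed from storage.
            self._objects.pop((i, v), None)
            return None
        return entry.data

    def refresh(self, i: str, v: int, l: float, now: float) -> bool:
        entry = self._objects.get((i, v))
        if entry is None or entry.expires_at <= now:
            return False
        # Extend the lease; never shorten an existing one.
        entry.expires_at = max(entry.expires_at, now + l)
        return True
```

In this sketch an object that is never refreshed silently disappears once its lease has passed, which is why the lease period has to exceed the longest failure the application expects to sit through.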
FRAGMENTS AND MANIFESTS
• Glacier uses erasure codes to reduce storage overhead.
• An object O of size |O| is encoded into n fragments F1, F2, …, Fn of size |O|/r, any r of which contain sufficient information to restore the entire object.
• If the object is stored under key k, each fragment is stored under (k, i, v), where i is the fragment index and v is the version.
• Authenticator: A_O = (H(O), H(F1), H(F2), …, H(Fn), v, l), where H(f) denotes a secure hash (e.g., SHA-1). It is used to detect and remove corrupted fragments during recovery.
• Manifest: M_O = authenticator + a cryptographic signature, used to authenticate the object and each of its fragments.
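A minimal sketch of how an authenticator could be built and used to check a fragment, assuming SHA-1 as the secure hash named above. The fragments are arbitrary byte strings standing in for erasure-coded output; no erasure coding or signing is performed here, and the names `Authenticator`, `make_authenticator` and `fragment_is_valid` are hypothetical.

```python
import hashlib
from typing import List, NamedTuple, Tuple


class Authenticator(NamedTuple):
    object_hash: bytes                 # H(O)
    fragment_hashes: Tuple[bytes, ...]  # H(F1) .. H(Fn)
    version: int                       # v
    lease: float                       # l


def H(data: bytes) -> bytes:
    """Secure hash; SHA-1 is used because the slides give it as the example."""
    return hashlib.sha1(data).digest()


def make_authenticator(obj: bytes, fragments: List[bytes],
                       version: int, lease: float) -> Authenticator:
    return Authenticator(H(obj), tuple(H(f) for f in fragments), version, lease)


def fragment_is_valid(auth: Authenticator, index: int, fragment: bytes) -> bool:
    """Check fragment F_index (1-based) against the hash in the authenticator."""
    if not 1 <= index <= len(auth.fragment_hashes):
        return False
    return auth.fragment_hashes[index - 1] == H(fragment)
```

A fragment holder that receives a fragment failing this check can discard it and request another one, since any r valid fragments are enough to restore the object.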
FRAGMENT PLACEMENT
• Glacier uses a placement function P to determine the node that stores a particular fragment (k, i, v).
• Requirements for the placement function:
  • Fragments of the same object should be placed on different, pseudo-randomly chosen nodes.
  • A fragment must be locatable after a failure using only the object key.
  • Fragments with similar keys should be grouped together to allow aggregation.
  • The placement should be stable, i.e., the node responsible for a fragment should change rarely.
• Glacier uses P(k, i, v) = k + i/(n+1) + H(v), which maps the primary replica at position k and its n fragments to n+1 equidistant points in the circular id space.
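The placement formula can be sketched over an integer identifier space. The 160-bit ring size, the use of SHA-1 for H(v), and the function names below are assumptions for illustration; the slides only state the formula itself.

```python
import hashlib

ID_BITS = 160              # assumption: identifiers are 160-bit integers
ID_SPACE = 1 << ID_BITS    # size of the circular id space


def hash_version(v: int) -> int:
    """H(v): map a version number to a point in the id space (SHA-1 assumed)."""
    digest = hashlib.sha1(str(v).encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def placement(k: int, i: int, v: int, n: int) -> int:
    """P(k, i, v) = k + i/(n+1) + H(v), computed modulo the ring size.

    i/(n+1) is a fraction of the whole ring, so it is scaled by ID_SPACE.
    """
    offset = (i * ID_SPACE) // (n + 1)
    return (k + offset + hash_version(v)) % ID_SPACE
```

For a fixed key and version, the positions for i = 0..n are spread evenly around the ring, so the fragments of one object land on different nodes, while objects with nearby keys map to nearby positions at every index.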
FRAGMENT PLACEMENT
• Inserting a new object:
  • Glacier sends a probe message to each location P(k, i, v), where i = 1..n.
  • If the owner of P(k, i, v) is currently online, it responds to the probe and Glacier sends the fragment directly to that node. Otherwise the fragment is discarded and restored later by the maintenance mechanism.
  • If fewer than r nodes are online, temporary fragment holders are used.
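The insertion steps can be summarized as a short decision routine. This is only a sketch: `is_online` is a hypothetical callback standing in for "the owner of P(k, i, v) answered the probe", and no network messages are actually sent.

```python
from typing import Callable, Dict, List, Tuple


def insert_fragments(
    fragments: Dict[int, bytes],
    is_online: Callable[[int], bool],
    r: int,
) -> Tuple[List[int], List[int], bool]:
    """Return (sent, skipped, needs_temporary_holders) for fragment indices."""
    sent: List[int] = []
    skipped: List[int] = []
    for index in sorted(fragments):
        if is_online(index):
            # Probe answered: the fragment would be sent directly to the owner.
            sent.append(index)
        else:
            # Owner offline: the fragment is discarded now and restored later
            # by the maintenance mechanism.
            skipped.append(index)
    # With fewer than r fragments placed, the object could not be recovered,
    # so temporary fragment holders are needed.
    return sent, skipped, len(sent) < r
```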
FRAGMENT MAINTENANCE
• A maintenance mechanism is needed because nodes may miss fragment insertions due to short-term churn.
• Maintenance exploits the fact that fragments with similar keys are assigned to a similar set of nodes.
• Each fragment holder has n-1 peers that store fragments of exactly the same objects as itself.
• Protocol:
  • A node compiles a list of all keys (k, v) in its local store and sends it to some of its peers.
  • Each peer checks the list against its own store and replies with a list of manifests, one for each object missing from the list.
  • For each such object, the node requests r fragments from its peers, validates each fragment against the manifest, and computes the fragment that has to be stored locally.
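The first two protocol steps amount to a set difference over (k, v) keys. The sketch below assumes manifests are plain dictionaries and that each store is a Python dict; both representations, and the function names, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[str, int]         # (k, v): object key and version
Manifest = Dict[str, object]  # hypothetical manifest representation


def build_key_list(local_store: Dict[Key, bytes]) -> Set[Key]:
    """Step 1: the node lists every (k, v) it holds a fragment for."""
    return set(local_store)


def missing_manifests(sender_keys: Set[Key],
                      peer_manifests: Dict[Key, Manifest]) -> List[Manifest]:
    """Step 2: the peer replies with one manifest per object the sender lacks."""
    return [peer_manifests[key]
            for key in sorted(peer_manifests)
            if key not in sender_keys]
```

The third step, fetching r fragments and recomputing the local one, would use the manifest's fragment hashes to reject corrupted fragments before decoding; it is omitted here because it depends on the erasure code.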
RECOVERY
• The maintenance mechanism has to restore full redundancy.
• If a compromised node fails permanently, other nodes take over its key segment.
• If a compromised node recovers and rejoins the system, its fragments have to be restored.
• To prevent congestive collapse during recovery, Glacier limits the number of simultaneous fragment reconstructions to Rmax.
CONFIGURATION
• Durability: if a failure affects a fraction f <= fmax of the storage nodes, each object survives with probability P >= Pmin.
• The survival of each of the n fragments is treated as a Bernoulli trial with success probability 1 - fmax. An object O can be reconstructed if at least r trials have a positive outcome, which happens with probability $D = \sum_{j=r}^{n} \binom{n}{j} (1 - f_{max})^{j} f_{max}^{n-j}$.
• The parameters n and r have to be chosen such that D meets the desired level of durability, i.e., D >= Pmin.
• The probability that a collection of N objects survives the failure unscathed is $P_D(N) = D^N$.
• If the value of fmax is accidentally chosen too low, Glacier still offers protection: the survival probability degrades gracefully as the magnitude of the actual failure increases.
• Example: with fmax = 0.6 and Pmin = 0.999999, a failure of f = 0.7 gives P = 0.9997 and a failure of f = 0.8 gives P = 0.975.
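The durability figures above can be checked numerically. The sketch below assumes that each fragment survives independently with probability 1 - f and that an object survives when at least r of its n fragments do; the values n = 48 and r = 5 are the ones reported for the ePOST deployment later in these slides. The function names are hypothetical.

```python
from math import comb


def survival_probability(n: int, r: int, f: float) -> float:
    """Probability that at least r of n fragments survive a failure of fraction f."""
    p_alive = 1.0 - f
    # Sum the probability of the losing cases (fewer than r survivors);
    # this is a short sum and avoids adding many near-zero terms.
    p_lost = sum(comb(n, j) * p_alive**j * f**(n - j) for j in range(r))
    return 1.0 - p_lost


def collection_survival(d: float, num_objects: int) -> float:
    """Probability that all objects in a collection survive, assuming independence."""
    return d ** num_objects


if __name__ == "__main__":
    for f in (0.6, 0.7, 0.8):
        print(f, survival_probability(48, 5, f))
```

With n = 48 and r = 5 the per-object probability stays above 0.999999 at f = 0.6 and falls to roughly 0.9997 and 0.975 at f = 0.7 and f = 0.8, which matches the graceful degradation described in the example.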
OBJECT AGGREGATION
• A user accesses the system through one node at a time, called the user's proxy; it is the only node trusted by the user.
• When the user inserts objects into Glacier, they are buffered at the user's proxy node and inserted immediately into the primary store.
• After enough objects have been gathered or enough time has passed, the buffered objects are inserted as a single object into Glacier under an aggregate key.
• If objects have to be stored in Glacier immediately, the flush method is used.
• The proxy maintains a local aggregate directory, which maps each object key to the aggregate that contains the object.
• To ensure recovery, the owner's aggregates form a linked list. The head of the list is stored in an application-specific object with a well-known key.
OBJECT AGGREGATION
• An aggregate contains references to multiple other aggregates, to prevent the list from becoming disconnected if aggregates expire in an order other than insertion order.
• The aggregates therefore form a DAG.
• The indegree of every aggregate is kept above dmin.
• An aggregate consists of tuples (oi, ki, vi): object, key and version.
RECOVERY
• After a failure, information that is not stored in Glacier is lost and has to be restored: the contents of the primary store and the aggregate directories.
• An aggregate directory can be recovered by walking the DAG.
• First, the key of the most recently inserted aggregate is retrieved using a well-known key in Glacier.
• Then the aggregates are retrieved in sequence, and the objects they contain are added to the aggregate directory.
• The primary store can be repopulated lazily, on demand by applications, or eagerly while walking the aggregate DAG.
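Rebuilding the aggregate directory is a graph traversal starting from the most recent aggregate. The sketch below uses a breadth-first walk; the `Aggregate` layout, the `fetch` callback (standing in for a Glacier retrieval) and the rule that the first aggregate visited wins for a given object key are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, List, NamedTuple, Optional


class Aggregate(NamedTuple):
    object_keys: List[str]  # keys of the objects packed into this aggregate
    references: List[str]   # keys of the other aggregates it points to


def rebuild_directory(
    head_key: str,
    fetch: Callable[[str], Optional[Aggregate]],
) -> Dict[str, str]:
    """Walk the aggregate DAG and map each object key to its aggregate key."""
    directory: Dict[str, str] = {}
    seen = {head_key}
    queue = deque([head_key])
    while queue:
        agg_key = queue.popleft()
        aggregate = fetch(agg_key)
        if aggregate is None:
            # Expired or unavailable aggregate: other references still
            # keep the rest of the DAG reachable.
            continue
        for obj_key in aggregate.object_keys:
            directory.setdefault(obj_key, agg_key)
        for ref in aggregate.references:
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return directory
```

Because each aggregate references several others, the walk still reaches older aggregates when one link in the chain has expired.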
CONSOLIDATION
• Glacier periodically checks the aggregate directory for aggregates whose leases will expire soon and decides whether to renew them.
• If an aggregate is small, or the majority of its object leases have expired, its lease is not renewed.
• Instead, the non-expired objects are consolidated with new objects, either from the local buffer or from other aggregates, and a new aggregate is created.
• Consolidation keeps the storage overhead low and is particularly effective when leases are bimodal.
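The renewal decision can be written as a small predicate. The slides do not say what counts as a small aggregate, so the `min_size` threshold below is a hypothetical parameter, as is the function name.

```python
from typing import List


def should_renew(object_expiries: List[float], total_size: int,
                 now: float, min_size: int) -> bool:
    """Renew an aggregate's lease unless it is small or mostly expired."""
    expired = sum(1 for t in object_expiries if t <= now)
    majority_expired = expired * 2 > len(object_expiries)
    too_small = total_size < min_size
    return not (too_small or majority_expired)
```

When this returns False, the still-live objects would be repacked into a new aggregate together with buffered objects, so expired data stops occupying storage.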
SECURITY
• ATTACKS ON INTEGRITY: a malicious attacker can overwrite the fragments on the nodes under his control. Fragment holders use the authenticator to validate fragments and replace corrupted ones.
• ATTACKS ON DURABILITY: if an attacker can successfully delete all replicas and more than n-r fragments of an object, the object is lost. This is unlikely because the nodes are selected pseudo-randomly.
• ATTACKS ON TIME SOURCE: avoided, because the timestamps on the storage nodes are maintained as relative values.
• SPACE-FILLING ATTACKS: an attacker can consume all the available storage space. This does not affect existing data, and the storage can be reclaimed gradually as data expires. Incentive mechanisms can be added to prevent this.
• ATTACKS ON GLACIER: unlikely, because the code for deleting fragments, handoff and expiration is very simple.
• HAYSTACK-NEEDLE ATTACKS: an attacker can compromise the user's personal node itself and insert a large number of decoy objects, making recovery infeasible. This can be overcome by periodically inserting reference objects with well-known version numbers, such as the current timestamp.
EXPERIMENTAL EVALUATION
• Glacier was evaluated in two ways.
• First: as the storage layer for ePOST (a cooperative, serverless email system), over 140 days.
• Glacier maintains n = 48 fragments using an erasure code with r = 5, for fmax = 60% and Pmin = 0.999999.
• ePOST has 35 nodes, which are desktop PCs running Linux, OS X and Windows.
• Glacier was able to handle all types of failures that occurred, including kernel panics, JVM crashes and a configuration error that caused 16 nodes to be disconnected.
• Fig 7 shows the cumulative size of all objects inserted over time, as well as the size of the objects that have not yet expired. The initial lease is 1 month.
• Fig 8 shows a high number of small objects in the 1-10 KB range; fewer than 1% of objects are larger than 600 KB. Plain emails are typically small objects, while emails with attachments are larger objects.
• Fig 9 shows the growth in storage as new email enters the system, and the increase in trash as mails are deleted.
ePOST RECOVERY
• 13 nodes were selected at random and their local fragments were copied to 13 fresh nodes.
• A new overlay network was started with only these 13 nodes.
• The resulting situation corresponds to a 58% failure, which is close to fmax = 60%.
• ePOST was then completely reinstalled on a 14th node, which was allowed to join the ring.
• One of the users entered his email address and the approximate date when he had last used the system.
• The retrieval process took 1 hour, after which ePOST was ready to use.
SIMULATIONS
• Trace-driven simulations were used, corresponding to 147 users, approximately 10,000 nodes and a wide range of failures.
• One experiment explores the impact of diurnal short-term churn.
• It models a ring of 250 nodes, of which M% are unavailable between 5 pm and 7 am, and 2M% on weekends.
• Fig 14 shows the decrease in insertion messages and the increase in maintenance traffic.
• The experiments show that Glacier is able to manage this large amount of data with surprisingly low maintenance overhead, and that it is scalable with respect to both load and system size.
CONCLUSION
• Glacier ensures the durability of unrecoverable data in a cooperative, decentralized storage system, despite large-scale correlated Byzantine failures.
• It does not rely on introspection, which is inherently limited in its ability to capture all sources of correlated failures.
• Glacier uses raw, unreliable storage available at the nodes to provide hard durability guarantees.