Pastiche offers an innovative solution to backup challenges by leveraging excess disk capacity, redundancy, and content-based indexing for affordable and efficient backup services. Learn about Pastiche's use of Pastry, content-based indexing, and convergent encryption.
Introduction • Backup is cumbersome and expensive • ~$4/GB/Month • Small-scale solutions dominated by administrative efforts • Large-scale solutions require centralized management
Pastiche • Observation 1: disk is no longer full • Can use excess capacity for efficient, effective, and administration-free backup • Use untrusted machines to perform backup services • Need replication for reliability • Need to balance locality and reliability
Pastiche • Observation 2: Much of the data on a given machine is not unique • Office 2000: 217 MB footprint • Different installations are largely the same • Exploiting this redundancy can achieve storage savings
Pastiche • Built on three pieces of research • Pastry: Peer-to-peer, self-administering, scalable routing • Content-based indexing: easy discovery of redundant data • Convergent encryption: use the same encrypted representation without sharing keys
Challenges • How to discover backup buddies without a centralized directory? • How can nodes reuse their own state to back up others? • How can nodes restore files/machines without requiring administrative intervention? • How can nodes detect unfaithful buddies?
Basic Idea • Summarize storage content with abstracts • Use abstracts to locate buddies • A skeleton tree is used to represent and restore an entire file system • Periodic queries of buddies for stored data
Enabling Technologies • Peer-to-peer routing • Content-based indexing • Convergent encryption
Peer-to-Peer Routing • Pastry: scalable, self-organizing, routing and object location infrastructure • Each node has a nodeID • IDs are uniformly distributed in the ID space • A proximity metric to measure the distance between two IDs
More on Pastry - I • Each node maintains three sets of state • Leaf set • Closest nodes in terms of nodeIDs • Neighborhood set • Closest nodes in terms of the proximity metric • Routing table • Prefix routing
More on Pastry - II • Pastry is self organizing • Nodes come and go • seed discovery protocol
Prefix Routing • In each step, a node forwards the message to a node whose nodeID shares a prefix with the destination that is at least one digit longer than the prefix shared by the current nodeID • Destination: 1230 • Current NodeID: 1023 • Next Hop: 12--
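The hop selection above can be sketched as follows. This is a minimal illustration, not Pastry's actual routing-table lookup: `shared_prefix_len` and `next_hop` are hypothetical helper names, nodeIDs are plain digit strings, and the node is assumed to know a flat list of other nodes.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known_nodes: List[str]) -> Optional[str]:
    """Pick a known node that shares a strictly longer prefix with dest
    than the current node does (assumption: flat list, not a real table)."""
    have = shared_prefix_len(current, dest)
    better = [n for n in known_nodes if shared_prefix_len(n, dest) > have]
    if not better:
        return None  # no progress possible with the nodes we know
    # Among the candidates, prefer the longest shared prefix.
    return max(better, key=lambda n: shared_prefix_len(n, dest))


# Example from the slide: destination 1230, current node 1023.
hop = next_hop("1023", "1230", ["1201", "2310", "1032"])
```

Here node 1023 shares only the digit "1" with the destination, so any node whose ID starts with "12" qualifies as the next hop.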
Pastiche’s Use of Pastry • Uses two separate Pastry overlay networks during buddy discovery • Once a node is discovered, traffic is sent directly via IP • Pastiche adds two mechanisms • Lighthouse sweep to discover distinct Pastry nodes • Distance metric based on the FS contents
Content-Based Indexing • Goal: identify file regions for sharing • Use Rabin fingerprints • A fingerprint is generated for each overlapping k-byte substring in a file • If the lower-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor • Anchors divide files into chunks; each chunk is associated with a secure hash value
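The anchor-based chunking described above can be sketched as follows. As an assumption for brevity, a simple polynomial rolling hash stands in for a true Rabin fingerprint, and `WINDOW`, `MASK`, `MAGIC`, `anchor_offsets`, and `chunk_file` are illustrative names and values, not Pastiche's own.

```python
import hashlib
from typing import List, Tuple

WINDOW = 16            # k: bytes per fingerprint window (illustrative value)
MASK = (1 << 6) - 1    # compare the low 6 bits, so chunks average ~64 bytes
MAGIC = 0              # the "predetermined value" an anchor must match
BASE, MOD = 257, (1 << 31) - 1   # parameters of the stand-in rolling hash


def anchor_offsets(data: bytes) -> List[int]:
    """Return chunk end offsets; the last one is always len(data)."""
    if not data:
        return []
    cuts: List[int] = []
    if len(data) >= WINDOW:
        top = pow(BASE, WINDOW - 1, MOD)
        h = 0
        for b in data[:WINDOW]:
            h = (h * BASE + b) % MOD
        end = WINDOW  # h fingerprints data[end - WINDOW:end]
        while True:
            if (h & MASK) == MAGIC:
                cuts.append(end)  # low-order bits match: mark an anchor
            if end == len(data):
                break
            # Slide the window one byte: drop the oldest byte, add the newest.
            h = ((h - data[end - WINDOW] * top) * BASE + data[end]) % MOD
            end += 1
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))
    return cuts


def chunk_file(data: bytes) -> List[Tuple[str, bytes]]:
    """Split data at anchors and pair each chunk with its SHA-256 hash."""
    chunks, start = [], 0
    for end in anchor_offsets(data):
        piece = data[start:end]
        chunks.append((hashlib.sha256(piece).hexdigest(), piece))
        start = end
    return chunks
```

Because each anchor depends only on the bytes inside its window, inserting data near the start of a file shifts later anchors without changing the chunks between them, which is what lets two slightly different files share most of their chunks.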
Sharing with Confidentiality • Sharing encrypted data without sharing keys • Need to have a single encrypted representation • Use convergent encryption
Convergent Encryption • So…say…how do you share a door without sharing its corresponding keys?
Convergent Encryption • How about using different safes to store those keys?
Convergent Encryption • And use different keys to access those keys
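In code, the idea behind the analogy can be sketched as follows: the chunk key is derived from the chunk's own content, so every owner of the same data produces the same ciphertext, and each owner then locks the chunk key under its own secret. This is a toy sketch under stated assumptions: a SHA-256 based XOR keystream stands in for a real cipher and is not secure for production use, and `convergent_encrypt`, `wrap_key`, and `unwrap_key` are hypothetical names.

```python
import hashlib
from typing import Tuple


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode (stand-in for a real cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def _xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def convergent_encrypt(chunk: bytes) -> Tuple[bytes, bytes]:
    """Derive the key from the chunk itself, then encrypt with that key."""
    key = hashlib.sha256(chunk).digest()
    return key, _xor(chunk, key)


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    return _xor(ciphertext, key)


def wrap_key(chunk_key: bytes, owner_secret: bytes) -> bytes:
    """Each owner locks the shared chunk key in its own 'safe'."""
    return _xor(chunk_key, owner_secret)


def unwrap_key(wrapped: bytes, owner_secret: bytes) -> bytes:
    return _xor(wrapped, owner_secret)
```

Two hosts that encrypt the same chunk get identical ciphertext and can therefore share one stored copy, while the wrapped keys they keep differ and are useless to anyone without the owner's secret.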
Implications of the Use of Convergent Encryption • If a backup node does not itself hold the data • It cannot decrypt the data • If it does hold the data, the backup node learns that the client also stores that data • Information leak vs. storage efficiency
Design • Pastiche data is stored in chunks • Chunk boundaries determined by content-based indexing • Encrypted with convergent encryption • Chunks carry owner lists
Design • When a newly written file is closed, it is scheduled for chunking • If a chunk already exists, the local host is added to the owner list • If not, encrypt the chunk and write it out • Chunking and writing deferred to avoid short-lived files
Design • Chunks are immutable • When a file is written, its set of chunks may change • A chunk is not deleted until the last reference to it is removed
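The owner-list behavior in the Design slides can be sketched as a small in-memory store. This is an illustration only: `ChunkStore` and its methods are hypothetical names, chunks are identified by their SHA-256 hash as an assumption, and encryption is left out to keep the focus on reference handling.

```python
import hashlib
from typing import Dict, Set


class ChunkStore:
    """In-memory sketch: immutable chunks, each with a list of owners."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._owners: Dict[str, Set[str]] = {}

    def add(self, chunk: bytes, owner: str) -> str:
        chunk_id = hashlib.sha256(chunk).hexdigest()
        if chunk_id not in self._data:
            # New chunk: write it out (a real store would encrypt it first).
            self._data[chunk_id] = chunk
            self._owners[chunk_id] = set()
        # Existing or new, the host is recorded on the owner list.
        self._owners[chunk_id].add(owner)
        return chunk_id

    def release(self, chunk_id: str, owner: str) -> None:
        owners = self._owners.get(chunk_id)
        if owners is None:
            return
        owners.discard(owner)
        if not owners:
            # Last reference removed: only now is the chunk deleted.
            del self._owners[chunk_id]
            del self._data[chunk_id]

    def __contains__(self, chunk_id: str) -> bool:
        return chunk_id in self._data

    def __len__(self) -> int:
        return len(self._data)
```

Adding the same chunk from two hosts stores it once, and the chunk survives until both hosts have released it.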
Abstracts: Finding Redundancy • An ideal backup buddy is one that holds a superset of the new machine’s data • To find it, send the full signature (hashes) of the new node to candidate buddies • However, we need to transfer 1.3MB per GB of stored data • Solution: Abstracts—transfer only a random subset of signatures
Compare one disk to another • [Figure: Node1 signature and Node2 signature compared side by side, with the Node1 abstract]
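The comparison can be sketched as follows, assuming a signature is a set of chunk hashes and an abstract is a uniform random sample of it. `make_abstract` and `estimate_coverage` are hypothetical names, and the sample size is the caller's choice rather than a value from the source.

```python
import random
from typing import Optional, Set


def make_abstract(signature: Set[str], size: int,
                  seed: Optional[int] = None) -> Set[str]:
    """Pick a random subset of the signature to send instead of all of it."""
    rng = random.Random(seed)
    items = sorted(signature)  # fixed order so a seed gives a repeatable sample
    return set(rng.sample(items, min(size, len(items))))


def estimate_coverage(abstract: Set[str], candidate_signature: Set[str]) -> float:
    """Fraction of the abstract's hashes that the candidate already stores."""
    if not abstract:
        return 0.0
    return len(abstract & candidate_signature) / len(abstract)
```

A candidate that holds a superset of the new node's data scores 1.0 on any abstract, while a candidate with nothing in common scores 0.0; values in between estimate the overlap from the sample alone.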
Overlays: Finding a Set of Buddies • A desirable buddy should have • A substantial overlap • Physically nearby (with at least one far away to survive geographically correlated failures)
Applied Use of Pastry • Pastiche uses two Pastry overlays to facilitate buddy discovery • One for network proximity • One for file system overlap • Coverage—the fraction of overlapping chunks stored on a site
Security Problems • A malicious node can • Under-report coverage to avoid being chosen as a buddy • Over-report coverage to attract clients just to discard their chunks
Backup Protocol • A Pastiche node has full control over the backup schedule • A snapshot consists of three things • Chunks to be added • Chunks to be removed • Metadata of those chunks
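The three parts of a snapshot can be sketched as a set difference between two file-system states. As assumptions for illustration, each state maps a file path to its ordered list of chunk IDs, and the metadata is simply that mapping for files that changed; `Snapshot` and `build_snapshot` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Snapshot:
    add: Set[str]                   # chunks to be added
    remove: Set[str]                # chunks to be removed
    metadata: Dict[str, List[str]]  # assumed form: path -> chunk IDs


def build_snapshot(previous: Dict[str, List[str]],
                   current: Dict[str, List[str]]) -> Snapshot:
    old = {c for ids in previous.values() for c in ids}
    new = {c for ids in current.values() for c in ids}
    # Only files whose chunk lists changed need new metadata.
    changed = {p: ids for p, ids in current.items() if previous.get(p) != ids}
    return Snapshot(add=new - old, remove=old - new, metadata=changed)
```

Chunks still referenced by any file in the current state never appear in the removal set, which matches the rule that a chunk lives until its last reference is gone.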
Restoration • A Pastiche node retains its archive skeleton, so performing partial restores is easy • To recover the whole machine, a node has to obtain its root node from one of the backup machines first…
Detecting Failure and Malice • A node randomly requests data from its buddies • Can bound probability of having failures and malicious nodes undetected
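The bound can be illustrated with a standard sampling argument, which may differ from the analysis in the original work. Assume a buddy has silently lost a fraction f of the chunks and each query picks a chunk independently and uniformly at random; then k queries all miss the loss with probability (1 - f)^k. The function names below are hypothetical.

```python
import math


def undetected_probability(missing_fraction: float, queries: int) -> float:
    """Chance that every one of `queries` random checks hits a stored chunk."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be in [0, 1]")
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, max_undetected: float) -> int:
    """Smallest number of queries that pushes the miss chance below the bound."""
    if not (0.0 < missing_fraction < 1.0 and 0.0 < max_undetected < 1.0):
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(max_undetected) / math.log(1.0 - missing_fraction))
```

The miss probability falls geometrically with the number of queries, so a buddy that discards a large share of its chunks is caught after only a handful of checks, while small losses take many more queries to expose.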
Preventing Greed • Someone can store things everywhere • Need to institute distributed quota • Very difficult • Some proposed solutions • Each node monitors the overall storage costs imposed by its backup clients • Problem: Sybil attacks (forge many entities that each consume little storage)
Preventing Greed • Force each node to solve puzzles proportional to storage consumption • Problems: • Needlessly expensive • Storage is traded against something other than storage • Heterogeneous computing power
Preventing Greed • Electronic currency • Problems: • Need to add atomic currency transactions • Complicated
Implementation • Chunkstore file system • Backup daemon