Peer-to-Peer Backup Presented by Yingwu Zhu
Overview • A short introduction to backup • Peer-to-Peer backup
Why Do We Need Backup? • User errors, e.g., accidental deletion or overwriting • Hardware failures, e.g., disk failures • Software errors, e.g., file-system corruption • Natural disasters, e.g., earthquakes Backup lets us recover system or user data
What Makes a Good Backup? • Two metrics • Performance (speed): complete the backup as quickly as possible • Correctness: data integrity and recoverability • Data in the backup is guaranteed not to be modified • Data can be recovered from the backup
Recovery • The counterpart of backup • Why do we need recovery? • Disaster recovery, e.g., recover the whole file system • Stupidity recovery, e.g., recover a small set of accidentally lost files
Primary Backup/Restore Approaches • Logical backup, e.g., dump • Interpret metadata • Identify which files need backup • Physical backup • Ignore file structure • Block-based
Logical Backup • Benefits from using the underlying FS • Flexible: can back up/restore a subset of files • Fast individual file recovery • Drawback from using the underlying FS • Slow • Must traverse the file/directory hierarchy • Writes each file contiguously to the backup media
Physical Backup • Benefits • Quick: avoids costly seek operations • Simple: ignores file structure • Drawbacks • Non-portable: dependent on the disk layout • Difficult to recover a subset of the FS: a full restore is required
On-line Backup • Unlike many backup schemes that require the FS to remain quiescent during backup • Allows users to continue accessing files during backup • Synchronous mirroring (remote) • Expensive in performance and network bandwidth • Strong consistency • Asynchronous mirroring (remote) • Uses the copy-on-write technique • Periodically transfers self-consistent snapshots • Possible data loss, but better performance and bandwidth consumption
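To make the copy-on-write idea concrete, here is a minimal Python sketch (the class and method names are invented for illustration, not taken from any real system): taking a snapshot freezes the current block map in O(1), and later writes replace block pointers rather than blocks, so a snapshot keeps seeing the old data and can be shipped asynchronously as a self-consistent image.

```python
# Minimal copy-on-write snapshot sketch; illustrative names only.
class CowVolume:
    def __init__(self, blocks):
        self.blocks = list(blocks)   # live block map (pointers to block contents)
        self.snapshots = []          # frozen block maps, one per snapshot

    def snapshot(self):
        snap = tuple(self.blocks)    # O(1) in data copied: freeze the map only
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Copy-on-write: install a new block; the old block object is left
        # intact and stays reachable from any snapshot that references it.
        self.blocks[index] = data

vol = CowVolume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")
assert snap[1] == "b" and vol.blocks[1] == "B"   # snapshot unaffected by the write
```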
Pastiche: Making Backup Cheap and Easy • OSDI’02 Paper: P2P Backup • Introduction • System Design • Evaluation • Summary
Introduction • Traditional backup • Highly reliable storage devices • Administrative effort • Cost of the storage media, plus managing the media and transferring it off-site • Internet backup • Very costly: services charge a high fee (e.g., $15 per month for 4 GB) and typically cover user data only, not applications or the OS
Introduction • Several facts • Large, cheap disks: approaching tape in cost, but with better access and restore times • Low write traffic: newly written data is a small fraction of stored data • Excess storage capacity, e.g., disks only 53% full in a study of 5,000 machines P2P backup takes advantage of the low write traffic and the excess storage capacity
Introduction • Peer-to-peer backup • Exploits slack resources at participating machines • Administration-free • Backs data up on multiple machines, most of them nearby (for performance), but at least one far away to guard against disaster • Consolidates similar files for storage efficiency, e.g., machines with similar OS installs such as Windows 2000/98 • Untrusted machines require guarantees of data privacy and integrity
Introduction • Enabling technologies for P2P backup • Pastry (P2P location and routing infrastructure) • Content-based indexing • Convergent encryption
Introduction • Pastry • Peer-to-peer routing • Locality-aware (e.g., uses a network proximity metric) • Content-based indexing • Finds similarity across versions of a file and across different files • Anchors chosen with Rabin fingerprints divide files into chunks • Editing a file only changes the chunks it touches • Each chunk is named by the SHA-1 hash of its contents • Identical chunks are coalesced across files
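A minimal Python sketch of content-defined chunking follows. It uses a simple additive rolling hash over a fixed window as a stand-in for the Rabin fingerprints Pastiche actually uses; the window size and boundary mask are assumed values, not taken from the paper. Because a boundary depends only on the bytes inside the window, an edit perturbs only the chunks it touches, and SHA-1 chunk names let identical chunks coalesce.

```python
# Content-defined chunking sketch; a toy rolling hash stands in for Rabin fingerprints.
import hashlib

WINDOW = 48       # rolling-hash window in bytes (assumed value)
MASK = 0x1FFF     # boundary when the low 13 bits are zero -> ~8 KB average chunks

def chunks(data: bytes):
    start, h = 0, 0
    for i, b in enumerate(data):
        h += b
        if i >= WINDOW:
            h -= data[i - WINDOW]          # slide the window: drop the oldest byte
        if (h & MASK) == 0 and i + 1 - start >= WINDOW:
            yield data[start:i + 1]        # content-defined boundary
            start = i + 1
    if start < len(data):
        yield data[start:]                 # trailing chunk

def chunk_ids(data: bytes):
    # Name each chunk by the SHA-1 hash of its contents; identical chunks
    # across files or versions get identical ids and are stored only once.
    return [hashlib.sha1(c).hexdigest() for c in chunks(data)]
```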
Introduction • Convergent encryption • Originally proposed for Farsite • Each file is encrypted with a key derived from the file's contents by hashing • Provides data privacy and integrity • Still allows data sharing: identical plaintexts produce identical ciphertexts, so duplicates can be coalesced
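A minimal convergent-encryption sketch, assuming the third-party pyca/cryptography package and AES-GCM as the cipher (the slides do not name the actual primitives): the key is the hash of the chunk itself, and the nonce is deliberately fixed so that identical chunks encrypt to identical ciphertexts and can still be deduplicated.

```python
# Convergent encryption sketch; cipher choice (AES-GCM) is an assumption.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE = b"\x00" * 12   # fixed on purpose: same chunk -> same ciphertext (enables dedup)

def convergent_encrypt(chunk: bytes):
    key = hashlib.sha256(chunk).digest()          # key derived from the content itself
    ct = AESGCM(key).encrypt(NONCE, chunk, None)
    return ct, key    # the key is kept in the owner's (separately encrypted) metadata

def convergent_decrypt(ct: bytes, key: bytes) -> bytes:
    return AESGCM(key).decrypt(NONCE, ct, None)
```

Only someone who already knows the plaintext (or holds the metadata) can derive the key, which gives privacy against the storage host while still letting identical chunks from different owners share one stored copy.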
System Design • Data is stored as chunks produced by content-based indexing • Chunks carry owner lists (a set of nodes) • Naming and storing chunks, see Figure 1 • Chunks are immutable • Writing an already-present chunk just increments its reference count • Meta-data chunk for each file • A list of handles for its chunks, e.g., <handle, chunkId> • Ownership, permissions, create/modification times, etc. • Encrypted • Mutable, to avoid cascading writes from the file up to the root
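Here is an illustrative in-memory chunk store with reference counting (class and method names are invented for the sketch): a duplicate write bumps a count instead of storing a second copy, and a chunk disappears only when no file or snapshot references it.

```python
# Reference-counted chunk store sketch; names invented for illustration.
import hashlib

class ChunkStore:
    def __init__(self):
        self.data = {}   # chunkId -> bytes, immutable once written
        self.refs = {}   # chunkId -> reference count across files/snapshots

    def put(self, chunk: bytes) -> str:
        cid = hashlib.sha1(chunk).hexdigest()
        if cid in self.data:
            self.refs[cid] += 1          # duplicate write: just bump the count
        else:
            self.data[cid], self.refs[cid] = chunk, 1
        return cid                       # handle used by the file's meta-data chunk

    def release(self, cid: str):
        self.refs[cid] -= 1
        if self.refs[cid] == 0:          # no remaining snapshot references it
            del self.data[cid], self.refs[cid]
```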
System Design • Using abstracts to find data redundancy • Signature: the list of chunkIds describing a node's FS • Fact: a node's signature doesn't change much over time, since only a small amount of data is updated • The initial backup of a node is expensive: all data must be sent to a backup site/node • Goal: find an ideal backup buddy, one that holds a superset of the data of the node needing backup
System Design • Using abstracts to find data redundancy • How do we find such a good buddy, the ideal case or at least high overlap in signatures? • Naive approach: compare two full signatures; impractical • Signatures are large • A node's buddy set can change over time • Abstract: a random subset of a signature • An abstract of a few tens of chunkIds works well
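The overlap estimate behind abstracts is a simple sampling argument, sketched below (function names invented): each sampled chunkId is an independent yes/no trial against the candidate buddy's holdings, so a few tens of samples already give a usable estimate of the true overlap.

```python
# Abstract-based overlap estimation sketch; names invented for illustration.
import random

def make_abstract(signature, k=30):
    # signature: the full set of chunkIds describing this node's FS
    return random.sample(sorted(signature), min(k, len(signature)))

def estimated_overlap(abstract, buddy_signature):
    hits = sum(1 for cid in abstract if cid in buddy_signature)
    return hits / len(abstract)   # unbiased estimate of the true coverage fraction
```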
System Design • How do we find a set of buddies? • Requirements for a set of buddies • Substantial overlap in signatures, to reduce storage overhead • Most buddies should be nearby, to reduce network load and improve restore performance • At least one buddy should be far away, to provide geographic diversity
System Design • Using two Pastry overlays to facilitate buddy discovery • A standard Pastry overlay using the network proximity metric • A second overlay using an FS-overlap metric • Lighthouse sweep: the discovery request carries an abstract • Subsequent probes are generated by varying the first digit of the original nodeId
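The probe pattern of a lighthouse sweep can be sketched as below (nodeIds shown as hex strings; the helper name is invented for illustration): cycling the first digit of the requester's nodeId spreads the probes once around the id space, and each probe routes the abstract toward a different region of the second overlay.

```python
# Lighthouse-sweep probe generation sketch; helper name invented.
def sweep_targets(node_id: str, base: int = 16):
    first = int(node_id[0], base)
    for d in range(1, base):                       # try every other first digit
        probe = format((first + d) % base, "x") + node_id[1:]
        yield probe                                # route the abstract toward this id

print(list(sweep_targets("a93c2f")))   # b93c2f, c93c2f, ..., 993c2f
```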
Backup • A backup node has full control over what, when, and how often to back up • Each discrete backup is a single snapshot • The skeleton of a snapshot is stored as a collection of persistent, per-file logs, see Figure 2 • The skeleton and retained snapshots are stored at the local node and at its backup buddies
Backup • Backup procedure (asynchronous, using copy-on-write) • The chunks to be added at the backup buddies • Buddies first check which chunkIds they lack, then fetch only those • The list of chunks to be removed • A chunk is deleted only when it belongs to no retained snapshot • Requests are validated with the owner's public key • Deletions are deferred to the end of the snapshot process • The meta-data chunks in the skeleton that change as a result • Old meta-data chunks are overwritten
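The exchange with one buddy might look like the sketch below (all class and method names are invented, and a real implementation would sign each request with the owner's key): check-before-fetch avoids shipping chunks the buddy already holds, and deletions run only after the new snapshot is in place.

```python
# One snapshot exchange with a backup buddy; illustrative names only.
class Buddy:
    def __init__(self):
        self.chunks, self.meta = {}, {}

    def check(self, ids):                         # step 1: which chunks are missing?
        return [cid for cid in ids if cid not in self.chunks]

    def store(self, cid, data):
        self.chunks[cid] = data

def snapshot(buddy, new_chunks, dead_ids, meta_chunks):
    missing = buddy.check(list(new_chunks))       # first check ...
    for cid in missing:
        buddy.store(cid, new_chunks[cid])         # ... then fetch only what is needed
    buddy.meta.update(meta_chunks)                # overwrite changed skeleton meta-data
    for cid in dead_ids:                          # deletes deferred to the very end
        buddy.chunks.pop(cid, None)
```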
Restoration • Partial restore is straightforward, since a node retains its archive skeleton • Try to restore from the nearest buddy • How is an entire machine recovered? • Each node keeps a copy of its root meta-data object on each member of its Pastry leaf set • The root block names the set of buddies that back up its FS
Detecting Failure and Malice • Untrusted buddies • Come and go at will • May claim to store chunks without actually doing so • A probabilistic mechanism deals with this • Periodically probe a buddy for a random subset of the chunks it should store • If it passes the check, carry on • Otherwise, replace it with another buddy candidate
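Such a spot-check can be sketched as follows (names invented; the buddy object is assumed to expose a fetch call): because chunk ids are content hashes, the prober can verify the returned bytes without storing its own copy, and a buddy that discards data is caught with probability growing in the sample size.

```python
# Probabilistic spot-check sketch; buddy.fetch is an assumed interface.
import hashlib, random

def audit(buddy, stored_ids, sample_size=10):
    sample = random.sample(sorted(stored_ids), min(sample_size, len(stored_ids)))
    for cid in sample:
        data = buddy.fetch(cid)                   # buddy must produce the bytes
        if data is None or hashlib.sha1(data).hexdigest() != cid:
            return False    # failed: demote this buddy, pick a replacement
    return True             # passed this round's probabilistic check
```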
Preventing Greed • Problem: a greedy node consumes too much storage • Proposed solutions • Contribution = consumption • Solve cryptographic puzzles in proportion to consumption of storage • Electronic currency
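The slides do not fix a puzzle construction; a hashcash-style proof of work is one plausible instantiation, sketched here: the requester must find a nonce whose hash clears a difficulty target that scales with the storage it wants to consume.

```python
# Hashcash-style puzzle sketch; one possible instantiation, not the paper's.
import hashlib, itertools

def solve(challenge: bytes, difficulty_bits: int) -> int:
    target = 1 << (160 - difficulty_bits)         # SHA-1 output is 160 bits
    for nonce in itertools.count():
        h = hashlib.sha1(challenge + str(nonce).encode()).digest()
        if int.from_bytes(h, "big") < target:
            return nonce      # expected work ~ 2**difficulty_bits hash evaluations
```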
An Alternative Design • Distribute chunks across a generic peer-to-peer storage system • Advantages • k backup copies of each chunk exist somewhere in the overlay • Pastry takes care of failed nodes • Disadvantages • Ignores network proximity, increasing network load and restore latency • Difficult to deal with malicious nodes
Evaluation • Prototype: the chunkstore FS + a backup daemon • Evaluate • Performance of the chunkstore FS • Performance of backup / restore • How large must an abstract be? Is the lighthouse sweep able to find buddies?
Performance of the Chunkstore FS • Benchmark: MAB (Modified Andrew Benchmark) • Baseline: ext2fs • Slight overhead due to Rabin fingerprints
Performance of Backup/Restore • Workload: 13.4MB tree of 1641 files and 109 directories, total 4004 chunks
Buddy Discovery • Abstract size vs. signature overlap • A Windows 98 machine with Office 2000 Professional, 90,000 chunks • A Linux machine running the Debian unstable release, 270,000 chunks • Results: • Estimates are independent of sample size • Small abstracts are effective if good buddies exist
Buddy Discovery • How effectively do lighthouse sweeps find good buddies? • 50,000 nodes drawn from a distribution of 11 machine types • 30% (type 1), 20% each (types 2-3), 10% each (types 4-5), 5% (type 6), 1% each (types 7-11)
Summary • Automatic backup with no administrative cost • Exploits slack resources and coalesces duplicate chunks across files • Backs up to a set of buddies chosen for good backup/restore performance • Handles security issues with untrusted buddies