Peer-to-Peer Backup

Explore the concept of peer-to-peer backup and its benefits in ensuring data integrity and recoverability. Learn about different backup approaches and how to leverage P2P networks for affordable and easy backup solutions.

Presentation Transcript


  1. Peer-to-Peer Backup Presented by Yingwu Zhu

  2. Overview • A short introduction to backup • Peer-to-Peer backup

  3. Why Do We Need Backup? • User errors, e.g., accidental deletion or overwriting • Hardware failures, e.g., disk failures • Software errors, e.g., file-system corruption • Natural disasters, e.g., earthquakes • Backups are used to recover system or user data

  4. What Is a Good Backup? • Two metrics • Performance (speed): back up as quickly as possible • Correctness: data integrity and recoverability • Data in the backup is guaranteed not to be modified • Data can be recovered from the backup

  5. Recovery • The counterpart of backup • Why do we need recovery? • Disaster recovery, e.g., recover the whole file system • Stupidity recovery, e.g., recover a small set of files

  6. Primary Backup/Restore Approaches • Logical backup, e.g., dump • Interpret metadata • Identify which files need backup • Physical backup • Ignore file structure • Block-based

  7. Logical Backup • Benefits (due to the use of the underlying FS) • Flexible: back up or restore a subset of files • Fast individual file recovery • Drawback (also due to the use of the underlying FS) • Slow • Must traverse the file/directory hierarchy • Must write each file contiguously to the backup media

  8. Physical Backup • Benefits • Quick: avoids costly seek operations • Simple: ignores file structure • Drawbacks • Non-portable: dependent on disk layout • Difficult to recover a subset of the FS: requires a full restore

  9. On-line Backup • Unlike many backup schemes, which require the FS to remain quiescent during backup • Allows users to continue accessing files during backup • Synchronous mirroring (remote) • Expensive in performance and network bandwidth • Strong consistency • Asynchronous mirroring (remote) • Uses the copy-on-write technique • Periodically transfers self-consistent snapshots • Some data loss, but better performance and lower bandwidth consumption

  10. Pastiche: Making Backup Cheap and Easy • OSDI’02 Paper: P2P Backup • Introduction • System Design • Evaluation • Summary

  11. Introduction • Traditional backup • Requires highly reliable storage devices • Requires administrative effort • Cost of the storage media, plus managing the media and transferring it off-site • Internet backup • Very costly: services charge high fees (e.g., $15 per month for 4 GB of data, covering neither applications nor the OS)

  12. Introduction • Several facts • Large, cheap disks: cost approaching that of tape, but with better access and restore times • Low write traffic: newly written data is a small portion of the total • Excess storage capacity, e.g., 53% full on 5,000 machines • Take advantage of low write traffic • Take advantage of excess storage capacity

  13. Introduction • Peer-to-peer backup • Exploits slack resources at participating machines • Administration-free • Backs up data on multiple machines: most of them nearby (for performance), but at least one far away to guard against disaster • Consolidates similar files for efficient storage, e.g., machines running a similar OS such as Windows 2000/98 • Untrusted machines require data privacy and integrity

  14. Introduction • Enabling technologies for P2P backup • Pastry (P2P location and routing infrastructure) • Content-based indexing • Convergent encryption

  15. Introduction • Pastry • Peer-to-peer routing • Locality-aware (e.g., network proximity metric) • Content-based indexing • Finds similarity across versions of a file and across different files • Anchors found using Rabin fingerprints divide files into chunks • Editing a file changes only the chunks it touches • Each chunk is named by its SHA-1 content hash • Identical chunks are coalesced across files
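
The content-based indexing idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the Pastiche implementation: it substitutes a simple polynomial rolling hash for true Rabin fingerprints, and the names `split_into_chunks` and `chunk_ids` and the constants `WINDOW`, `MASK`, `BASE`, and `MOD` are assumptions chosen for the example.

```python
import hashlib

WINDOW = 16            # bytes in the rolling window (illustrative value)
MASK = (1 << 6) - 1    # anchor when the low 6 bits of the hash are zero
BASE = 257             # polynomial base (stand-in for a Rabin fingerprint)
MOD = (1 << 31) - 1


def split_into_chunks(data: bytes) -> list[bytes]:
    """Cut data after every position whose trailing WINDOW bytes hash to an anchor."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # shift in the new byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks


def chunk_ids(data: bytes) -> list[str]:
    """Name each chunk by the SHA-1 hash of its content."""
    return [hashlib.sha1(c).hexdigest() for c in split_into_chunks(data)]
```

Because each boundary depends only on the bytes inside the window, inserting a byte at the front of a file shifts the later boundaries along with the content, so the chunks after the first boundary keep their SHA-1 names and can be coalesced with the copies already stored.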

  16. Introduction • Convergent encryption • Originally proposed by Farsite • Each file is encrypted by a key derived from the file’s contents by hashing • Data privacy and integrity • Data sharing
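
A rough sketch of convergent encryption, under loud assumptions: the key is the SHA-256 hash of the plaintext, and the cipher is a toy XOR keystream built from SHA-256, used here only so the example runs with the standard library. A real system would use a vetted symmetric cipher, and the function names are hypothetical.

```python
import hashlib


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256(key || counter). Illustration only, not a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    """Derive the key from the content itself, then encrypt with that key."""
    key = hashlib.sha256(plaintext).digest()
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, stream))
    return key, ciphertext


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Decrypt, then check integrity by re-deriving the key from the result."""
    stream = _keystream(key, len(ciphertext))
    plaintext = bytes(c ^ k for c, k in zip(ciphertext, stream))
    if hashlib.sha256(plaintext).digest() != key:
        raise ValueError("integrity check failed")
    return plaintext
```

Two machines holding the same data derive the same key and therefore produce the same ciphertext, which is what allows sharing between them. A host that does not already know the plaintext cannot derive the key, and any modification to the ciphertext fails the hash check on decryption.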

  17. System Design • Data is stored as chunks by content-based indexing • Chunks carry owner lists (a set of nodes) • Naming and storing chunks, see Figure 1 • Chunks are immutable • Writing a chunk increments its reference count (+1) • Meta-data chunk for a file • A list of handles for its chunks, e.g., <handle, chunkId> • Ownership, permissions, creation/modification times, etc. • Encrypted • Mutable, to avoid cascading writes from the file to the root
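
The immutable, reference-counted chunks described on this slide might be modeled as below. This is a simplified sketch: it keeps a single local reference count per chunk and ignores owner lists, encryption, and meta-data chunks. The class name `ChunkStore` and its methods are assumptions for illustration.

```python
import hashlib


class ChunkStore:
    """In-memory store of immutable chunks named by their SHA-1 content hash."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
        self._refs: dict[str, int] = {}

    def put(self, content: bytes) -> str:
        """Store a chunk once; every write adds one reference."""
        chunk_id = hashlib.sha1(content).hexdigest()
        if chunk_id not in self._data:
            self._data[chunk_id] = content
        self._refs[chunk_id] = self._refs.get(chunk_id, 0) + 1
        return chunk_id

    def release(self, chunk_id: str) -> None:
        """Drop one reference; reclaim the chunk when none remain."""
        self._refs[chunk_id] -= 1
        if self._refs[chunk_id] == 0:
            del self._refs[chunk_id]
            del self._data[chunk_id]

    def get(self, chunk_id: str) -> bytes:
        return self._data[chunk_id]

    def __len__(self) -> int:
        return len(self._data)
```

Writing the same content twice stores one copy with a count of two, so two files that share a chunk consume the space of one.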

  18. System Design

  19. System Design • Using abstracts to find data redundancy • Signature: the list of chunkIds describing a node’s FS • Fact: the signature of a node doesn’t change much over time (only a small amount of data is updated) • The initial backup of a node is expensive: all data must be backed up to a backup site/node • Find an ideal backup buddy: one that holds a superset of the data of the node that needs backup

  20. System Design • Using abstracts to find data redundancy • How to find such a good buddy: the ideal case, or one with more overlap in signatures? • Naive: compare two full signatures, which is impractical • Signatures are large • A node’s buddy set can change over time • Abstract: a random subset of a signature • Tens of chunkIds per abstract can work well
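
As a rough illustration of why a small abstract can stand in for a full signature, the sketch below samples a few chunkIds and measures what fraction of them a candidate buddy already holds. The function names, the use of Python sets for signatures, and the sample size are assumptions for the example.

```python
import random


def make_abstract(signature: set[str], size: int, rng: random.Random) -> list[str]:
    """Pick a random subset of the node's chunkIds to send in a discovery request."""
    return rng.sample(sorted(signature), min(size, len(signature)))


def estimated_overlap(abstract: list[str], candidate_signature: set[str]) -> float:
    """Fraction of the abstract's chunks the candidate already stores."""
    if not abstract:
        return 0.0
    hits = sum(1 for chunk_id in abstract if chunk_id in candidate_signature)
    return hits / len(abstract)
```

A candidate that holds a superset of the node's data scores 1.0 on any abstract, while a candidate with unrelated data scores close to 0. The candidate only needs to check a few dozen chunkIds instead of receiving the whole signature.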

  21. System Design • How to find a set of buddies? • Requirements for a set of buddies • Substantial overlap in signatures to reduce storage overhead • Most buddies should be nearby to reduce network load and improve restore performance • At least one buddy should be far away to provide geographic diversity

  22. System Design • Using two Pastry overlays to facilitate buddy discovery • Standard Pastry overlay with network proximity • Second overlay with FS overlap metric • Lighthouse sweep: discovery request contains an abstract • Subsequent probes are generated by varying the first digit of the original nodeId

  23. Backup • A backup node has full control over what, when, and how often to back up • Each discrete backup is a single snapshot • The skeleton for a snapshot is stored as a collection of persistent, per-file logs, see Figure 2 • The skeleton and retained snapshots are stored at the local node and its backup nodes

  24. Backup

  25. Backup • Backup procedure (asynchronous, using copy-on-write) • The chunks to be added to the backup buddies • First check, then fetch if needed (by buddies) • The list of chunks to be removed • Delete any chunk that doesn’t belong to any snapshot • Public key to ensure correctness of requests • Deferred to the end of the snapshot process • The meta-data chunks in the skeleton that change as a result • Overwrite old meta-data chunks
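
The add and remove lists in this procedure can be viewed as set differences over chunkIds. The sketch below assumes each snapshot is represented simply as a set of chunkIds; `plan_backup` and its parameters are hypothetical names, and request signing and meta-data chunks are left out.

```python
def plan_backup(
    retained: list[set[str]],
    dropped: list[set[str]],
    new_snapshot: set[str],
) -> tuple[set[str], set[str]]:
    """Return (chunks to add, chunks to remove) for one backup run."""
    # Everything the buddy holds before this run.
    already_stored = set().union(*retained, *dropped)
    # Everything that some snapshot will still reference after this run.
    still_needed = set().union(*retained, new_snapshot)
    to_add = new_snapshot - already_stored
    # Only chunks that belong to no remaining snapshot may be deleted.
    to_remove = set().union(*dropped) - still_needed
    return to_add, to_remove
```

For example, with `retained=[{"a", "b"}]`, `dropped=[{"b", "x"}]`, and `new_snapshot={"a", "d"}`, only `d` has to be sent and only `x` can be deleted, since `b` is still referenced by a retained snapshot.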

  26. Restoration • Partial restore is straightforward because the node retains its archive skeleton • Try to restore from the nearest buddy • How to recover the entire machine? • Each node keeps a copy of its root meta-data object on each member of its leaf set • The root block contains the set of buddies that back up its FS

  27. Detecting Failure and Malice • Untrusted buddy • Come and go at will • Claim to store chunks without actually doing so • A probabilistic mechanism to deal with it • Probe a buddy for a random subset of chunks it should store • If it passes the check, go on • Otherwise, replace it with another buddy candidate
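
The probabilistic check can be sketched as below, assuming a chunk's name is the SHA-1 of the bytes the buddy stores and that `fetch` is a hypothetical remote call returning the chunk's bytes or `None`. The function names are illustrative. `detection_probability` uses the simplifying assumption that probes are independent, which approximates sampling from a large chunk set.

```python
import hashlib
import random
from typing import Callable, Optional


def probe_buddy(
    expected_ids: list[str],
    fetch: Callable[[str], Optional[bytes]],
    sample_size: int,
    rng: random.Random,
) -> bool:
    """Return True if the buddy returns valid content for every sampled chunk."""
    sample = rng.sample(expected_ids, min(sample_size, len(expected_ids)))
    for chunk_id in sample:
        content = fetch(chunk_id)  # hypothetical remote call; None means missing
        if content is None or hashlib.sha1(content).hexdigest() != chunk_id:
            return False           # failed the check: replace this buddy
    return True


def detection_probability(missing_fraction: float, probes: int) -> float:
    """Chance that at least one of `probes` independent probes hits a missing chunk."""
    return 1.0 - (1.0 - missing_fraction) ** probes
```

Under that assumption, a buddy that silently discards half of its chunks survives a two-chunk probe only a quarter of the time, so repeated small probes expose cheating quickly without checking every chunk.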

  28. Preventing Greed • Problem: a greedy node consumes too much storage • Proposed solutions • Contribution = consumption • Solve cryptographic puzzles in proportion to consumption of storage • Electronic currency

  29. An Alternative Design • Distribute chunks to a peer-to-peer storage system • Advantages • K backup copies of a chunk exist anywhere • Pastry takes care of failed nodes • Disadvantages • Does not consider network proximity, which increases network load and restore latency • Difficult to deal with malicious nodes

  30. Evaluation • Prototype: the chunkstore FS + a backup daemon • Evaluate • Performance of the chunkstore FS • Performance of backup / restore • How large must an abstract be? Is the lighthouse sweep able to find buddies?

  31. Performance of the Chunkstore FS • Benchmark: MAB (Modified Andrew Benchmark) • Baseline: ext2fs • Slight overhead due to Rabin fingerprints

  32. Performance of Backup/Restore • Workload: 13.4MB tree of 1641 files and 109 directories, total 4004 chunks

  33. Buddy Discovery • Abstract size vs. signature overlap • A Windows 98 machine with Office 2000 Professional, 90,000 chunks • A Linux machine running the Debian unstable release, 270,000 chunks • Result: • Estimates are independent of sample size • Small abstracts are effective if good buddies exist

  34. Abstract Size

  35. Buddy Discovery • How effectively do lighthouse sweeps find good buddies? • 50,000 nodes, a distribution of 11 types • 30% (type 1), 20% (2, 3), 10% (4, 5), 5% (6), 1% (7-11)

  36. Summary • Automatic backup with no administrative costs • Exploits slack resources and coalesces duplicate copies of chunks across files • Backs up to a set of buddies for good backup/restore performance • Handles security issues arising from untrusted buddies
