Pastiche: Making Backup Cheap and Easy


Presentation Transcript


  1. Pastiche: Making Backup Cheap and Easy Presented by: Boon Thau Loo CS294-4

  2. Outline • Motivation and Goals • Enabling Technologies • System Design • Implementation and Evaluation • Conclusion

  3. Motivation • The majority of users do not back up their data. Those who do: • Don’t back up very often. • Don’t back up everything. • Backup is a significant cost in large organizations. • Why not use excess disk space for backups? • File systems are only half-full on average. • Disks are cheap.

  4. Pastiche Goals • P2P backup system • Target environment • Cooperation among untrusted machines. • End-user machines • Leverage common data when possible for space efficiency (backup “buddies”). • Preserve privacy. • Efficient, cost-free, administration-free

  5. Enabling Technologies • Pastry for self-organizing routing and object location. • Content-based Indexing (Manber94, LBFS) • Identify boundary regions (anchors) that divide file into chunks • Rabin fingerprinting • Isolate changes in each chunk. • SHA-1 hash of each chunk • Convergent encryption (used by FARSITE) • Encrypt file using key derived from file’s contents. • Further encrypt using client’s key. • Encrypted key is stored with file in FARSITE
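
A minimal sketch of content-based indexing, to make the "anchors divide the file into chunks" idea concrete. It uses a simple polynomial rolling hash as a stand-in for Rabin fingerprinting; the window size, mask, and the names `split_chunks` and `chunk_hashes` are illustrative assumptions, not values from the talk.

```python
import hashlib

BASE = 257
MOD = (1 << 61) - 1  # large prime modulus for the rolling hash (assumed)


def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Cut data after every position whose window hash has its low bits zero."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # shift in the new byte
        if i + 1 >= window and (h & mask) == 0:      # anchor: end a chunk here
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks


def chunk_hashes(chunks: list[bytes]) -> list[str]:
    """SHA-1 hex digest of each chunk."""
    return [hashlib.sha1(c).hexdigest() for c in chunks]
```

Because an anchor depends only on the bytes inside the window, inserting data near the start of a file changes the first chunk but leaves later chunks, and their SHA-1 hashes, unchanged. A practical chunker would probably also enforce minimum and maximum chunk sizes, which this sketch omits.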

  6. System Design • Data Chunks • File meta-data • Abstracts • Joining Pastry • Finding Backup buddies • Backup Protocol • Restoration • Failures and Malicious Nodes • Greed Prevention

  7. Data Chunks • Data is stored on disk as immutable chunks. • Content-based indexing + convergent encryption • Chunks are stored for the local host and/or for backup clients. • Each chunk carries an owner list and maintains a reference count. • When a newly written file is closed, it is scheduled for chunking. • Each chunk c has: Hc (handle), Ic (chunk ID), Kc (encryption key). • The list of chunk IDs forms the file signature.
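
The following sketch illustrates why convergent encryption lets two hosts share one stored copy of a chunk: every name and key is derived from the chunk's own contents. The exact derivations of Hc, Ic and Kc are not given on the slide, so the ones below are assumptions, and the SHA-256 keystream is a toy stand-in for a real cipher that should not be used for actual encryption.

```python
import hashlib


def derive_names(plaintext: bytes):
    """Return (Hc, Ic, Kc) for a chunk; all three derivations are assumed."""
    handle = hashlib.sha1(plaintext).digest()        # Hc: hash of the contents
    chunk_id = hashlib.sha1(handle).digest()         # Ic: hash of the handle
    key = hashlib.sha256(b"key" + handle).digest()   # Kc: derived from the handle
    return handle, chunk_id, key


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a SHA-256 based keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def seal_chunk(plaintext: bytes):
    """Return (chunk ID, ciphertext, handle) for storage on any host."""
    handle, chunk_id, key = derive_names(plaintext)
    return chunk_id, keystream_xor(key, plaintext), handle
```

Two hosts that seal the same plaintext obtain the same chunk ID and the same ciphertext, so a buddy can store one copy for both without being able to read it unless it already holds the plaintext.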

  8. Data Chunks (Cont…) • Backup Request: • Remote hosts must supply a public key with each backup request. • If the chunk exists, add the requesting host to its owner list. • The local reference count is incremented. • Delete Request: • Requests from remote hosts must be signed with the host's secret key. • The signature is checked against the public key (cached from an earlier backup request). • When the reference count reaches 0, the chunk is removed.
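
The request handling above can be sketched as a small in-memory store. The `ChunkStore` class and the `verify(public_key, message, signature)` callback are hypothetical: the slide does not say which signature scheme is used, so verification is passed in rather than implemented.

```python
from collections import Counter


class ChunkStore:
    def __init__(self, verify):
        self.verify = verify     # hypothetical: verify(public_key, message, signature) -> bool
        self.data = {}           # chunk_id -> chunk bytes
        self.owners = {}         # chunk_id -> Counter of owning hosts
        self.public_keys = {}    # host -> public key cached from backup requests

    def ref_count(self, chunk_id):
        return sum(self.owners.get(chunk_id, Counter()).values())

    def backup(self, host, public_key, chunk_id, chunk=None):
        """Return True if stored; False means the caller must send the bytes."""
        self.public_keys[host] = public_key
        if chunk_id not in self.data:
            if chunk is None:
                return False
            self.data[chunk_id] = chunk
            self.owners[chunk_id] = Counter()
        self.owners[chunk_id][host] += 1   # add owner, bump reference count
        return True

    def delete(self, host, chunk_id, signature):
        """Drop one reference held by host; remove the chunk at zero references."""
        key = self.public_keys.get(host)
        owners = self.owners.get(chunk_id)
        if key is None or owners is None or owners[host] == 0:
            return False
        if not self.verify(key, chunk_id, signature):
            return False
        owners[host] -= 1
        if owners[host] == 0:
            del owners[host]
        if not owners:                      # reference count reached 0
            del self.data[chunk_id]
            del self.owners[chunk_id]
        return True
```

Here the reference count is simply the sum of the per-owner counts, which keeps the owner list and the count from drifting apart; a real store might track them separately.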

  9. File Meta-data • File meta-data • List of handles Hc for chunks comprising the file. • Ownership, permissions, creation and modification times. • Mutable with fixed Hc, Kc and Ic • File system root meta-data: Hc generated based on host-specific passphrase.

  10. Abstracts • Initial backup of a freshly installed machine is most expensive. • Goal: Find a good buddy that owns all or most of your data chunks. • Naïve solution: Ship full signature of new node around. • Expensive: 20 bytes per chunk for a 16KB chunk. • Solution: Send a random subset of signatures called an abstract.
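
To put the cost in numbers: at 20 bytes of SHA-1 per 16 KB chunk, 1 GiB of data is 65,536 chunks and therefore 1,310,720 bytes (1.25 MiB) of signature. The sketch below reproduces that arithmetic and draws a random abstract; the function names, the abstract size and the seeding are illustrative assumptions.

```python
import random

CHUNK_BYTES = 16 * 1024   # chunk size quoted on the slide
HASH_BYTES = 20           # one SHA-1 digest per chunk


def signature_bytes(data_bytes: int) -> int:
    """Size of the full signature for data_bytes of stored data."""
    chunks = -(-data_bytes // CHUNK_BYTES)   # ceiling division
    return chunks * HASH_BYTES


def make_abstract(signature, size, seed=None):
    """Pick a random subset of the distinct chunk IDs in a signature."""
    unique = sorted(set(signature))
    rng = random.Random(seed)
    return rng.sample(unique, min(size, len(unique)))
```

For example, `signature_bytes(2**30)` evaluates to 1310720, while an abstract of a few dozen IDs stays under a kilobyte or two.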

  11. Joining Pastry • Pastry: • Self-organizing, p2p overlay • Each node maintains • Leaf set: L/2 closest smaller (larger) nodeIDs • Neighborhood set: Closest nodes according to proximity metric • Routing table: Prefix routing • Join Pastry overlay with nodeID set to Hash(hostname) • Find backup buddies…
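
A rough sketch of the join-time bookkeeping, assuming SHA-1 truncated to Pastry's 128-bit identifier space as the hash (the slide only says Hash(hostname)). The leaf set here ignores ring wraparound, and the helper names are hypothetical.

```python
import hashlib

ID_BITS = 128  # Pastry nodeIDs are 128 bits wide


def node_id(hostname: str) -> int:
    """nodeID = Hash(hostname); SHA-1 truncated to 128 bits is an assumption."""
    digest = hashlib.sha1(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:ID_BITS // 8], "big")


def leaf_set(my_id: int, known_ids, size: int):
    """Return the size/2 closest smaller and size/2 closest larger nodeIDs."""
    half = size // 2
    ring = sorted(set(known_ids) - {my_id})
    smaller = [n for n in ring if n < my_id]
    larger = [n for n in ring if n > my_id]
    return smaller[max(0, len(smaller) - half):], larger[:half]


def shared_prefix_len(a: int, b: int) -> int:
    """Number of leading hex digits two nodeIDs share (used by prefix routing)."""
    digits_a, digits_b = f"{a:032x}", f"{b:032x}"
    count = 0
    for x, y in zip(digits_a, digits_b):
        if x != y:
            break
        count += 1
    return count
```

Deriving the nodeID from the hostname is what later lets a wiped machine rejoin at the same position in the overlay.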

  12. Finding Backup Buddies • After joining the network, route a Pastry message carrying the abstract to a random nodeID. • Each node along the route returns its coverage of the abstract (the fraction of the abstract's chunks that it stores locally). • Lighthouse sweep: if the candidate set is insufficient, the probe is repeated, rotating through the nodeID space by varying the first digit of the original nodeID.
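
A minimal sketch of the two pieces of this step: how a node on the route might compute its coverage, and how the sweep might generate its next probe targets. It assumes nodeIDs written as lowercase hexadecimal strings (base-16 digits); both function names are illustrative.

```python
HEX_DIGITS = "0123456789abcdef"


def coverage(abstract, stored_ids) -> float:
    """Fraction of the abstract's chunk IDs that this node already stores."""
    if not abstract:
        return 0.0
    stored = set(stored_ids)
    return sum(1 for cid in abstract if cid in stored) / len(abstract)


def sweep_targets(first_target: str):
    """Yield probe targets that differ from the first one only in the first digit."""
    start = HEX_DIGITS.index(first_target[0])
    for step in range(1, len(HEX_DIGITS)):
        yield HEX_DIGITS[(start + step) % len(HEX_DIGITS)] + first_target[1:]
```

Changing the first digit sends each probe along a route through a different part of the nodeID space, so each sweep step meets a fresh set of candidate buddies.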

  13. Not Enough Buddies? • Each node tries to find 5 buddies. • What if you can’t find enough buddies? • Real possibility for rare installations • Create a coverage-rate Pastry overlay • Replace the network proximity distance metric with coverage rate. • Pastry neighborhood set: the set of nodes encountered during join with the best coverage available. • Find buddies in the neighborhood set • A may be a buddy for B, but not necessarily vice versa (no symmetry) • Malicious nodes may misreport their coverage.

  14. Backup Protocol • Each Pastiche node controls its own archival plan. • Snapshot: a discrete backup event. • Meta-data skeleton for each snapshot is stored in per-file logs. • State necessary for a new snapshot: add set, delete set, meta-data list

  15. Backup Protocol (Cont…) • Snapshot process (A stores a snapshot on B): • A sends its public key to B (for future validation). • A forwards the chunk IDs of the add set to B. • B fetches the chunks not already stored locally. • A sends the delete list (signed with A’s private key). • A sends updated meta-data. • A sends a commit request; B responds when all changes are persistent.
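
The message order above can be sketched as a single in-process function, with B modelled as a plain object. This is an illustration under assumptions: the `Buddy` class, the `verify` callback and the argument names are hypothetical, owner lists and reference counts are omitted, and "persistent" is reduced to applying the staged changes in memory.

```python
class Buddy:
    def __init__(self):
        self.chunks = {}     # chunk_id -> chunk bytes
        self.metadata = {}   # owner -> latest meta-data blob
        self.keys = {}       # owner -> public key


def snapshot(owner, public_key, add_set, delete_ids, delete_sig, metadata, buddy, verify):
    """Store one snapshot of owner on buddy; return the chunk IDs actually sent."""
    buddy.keys[owner] = public_key                       # A sends its public key
    missing = [cid for cid in add_set if cid not in buddy.chunks]
    staged = {cid: add_set[cid] for cid in missing}      # B fetches only what it lacks
    if not verify(public_key, tuple(delete_ids), delete_sig):
        raise ValueError("delete list signature rejected")
    # Commit: nothing above has touched the buddy's chunks or meta-data yet.
    buddy.chunks.update(staged)
    for cid in delete_ids:
        buddy.chunks.pop(cid, None)
    buddy.metadata[owner] = metadata
    return missing
```

Staging the fetched chunks and applying everything at the commit step means a rejected delete list leaves the buddy's stored state unchanged.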

  16. Restoration • Partial restores are straightforward: obtain the chunks from a buddy. • Recovering an entire machine: • Keep a copy of the root meta-data object in each member of the leaf set. • Rejoin with the same nodeID (based on hostname). • Retrieve the root meta-data object from any node in the leaf set. • The root block contains the list of buddies.

  17. Detecting Failure and Malice • Failures: • A buddy can drop chunks if it runs out of disk space. • A buddy may crash or leave the network. • A malicious buddy may pretend to store your chunks. • Solutions: • Before taking a new snapshot, query buddies for a random subset of chunks. Provides instantaneous assurance. • Periodic probing of buddies: analysis shows that checking 0.1% of all chunks is enough. • Sybil attack? A malicious party occupies a substantial fraction of the nodeID space.
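
A sketch of random probing and of why a small sample can suffice. If a buddy has silently dropped a fraction f of the chunks and k chunks are checked independently, the loss goes unnoticed with probability (1 - f)^k. This does not reproduce the analysis behind the 0.1% figure; the `probe` function, the `fetch` callback and the locally kept SHA-1 digests are assumptions for illustration.

```python
import hashlib
import random


def detection_probability(dropped_fraction: float, samples: int) -> float:
    """Chance that at least one of `samples` independent checks hits a dropped chunk."""
    return 1.0 - (1.0 - dropped_fraction) ** samples


def probe(fetch, digests, samples, seed=None) -> bool:
    """Ask a buddy for a random subset of chunks and verify each against its digest.

    fetch(chunk_id) returns the stored bytes or None (hypothetical interface);
    digests maps chunk_id -> SHA-1 hex digest kept by the owner.
    """
    rng = random.Random(seed)
    ids = rng.sample(sorted(digests), min(samples, len(digests)))
    for cid in ids:
        data = fetch(cid)
        if data is None or hashlib.sha1(data).hexdigest() != digests[cid]:
            return False
    return True
```

Note that `probe` samples without replacement, so for small stores it detects losses slightly more often than the independent-draw formula suggests.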

  18. Greed Prevention • A greedy host can consume storage. • Three solutions: • Group backup clients based on resources consumed. • Cryptographic puzzles scaled to the storage consumed. • Electronic currency. • Currency accounting requires atomicity between the exchange of currency and the backup.
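
One way to picture the puzzle option is a hashcash-style proof of work whose difficulty grows with the storage a client already consumes. The slide does not specify the puzzle or the scaling rule, so everything below (the SHA-256 puzzle, `difficulty_for`, one extra bit per GiB) is an assumption.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the front of a digest."""
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def puzzle_digest(challenge: bytes, nonce: int) -> bytes:
    return hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()


def solve(challenge: bytes, bits: int) -> int:
    """Search for a nonce whose digest starts with `bits` zero bits."""
    for nonce in count():
        if leading_zero_bits(puzzle_digest(challenge, nonce)) >= bits:
            return nonce


def check(challenge: bytes, nonce: int, bits: int) -> bool:
    """Verification costs one hash, however hard the puzzle was to solve."""
    return leading_zero_bits(puzzle_digest(challenge, nonce)) >= bits


def difficulty_for(stored_bytes: int, base_bits: int = 8, bytes_per_bit: int = 2**30) -> int:
    """Assumed rule: one extra difficulty bit per GiB already stored."""
    return base_bits + stored_bytes // bytes_per_bit
```

Each extra bit doubles the expected work for the requester, so heavier consumers pay more per request while the storing host's verification cost stays constant.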

  19. Implementation • Chunkstore file system • Container files – LRU cache of decrypted, recently used files for performance. • Chunks increase internal fragmentation. • Backup daemon • Server: Manages remote requests for storage and restoration. • Client: Supervises selection of buddies and snapshots.

  20. Evaluation • Compare ext2fs with chunkstore on a modified Andrew benchmark: • Total overhead of 7.4% is reasonable. • Overheads are due to meta-data management and Rabin fingerprint computation (for finding anchors). • Backup and restore compare favorably to an NFS cross-machine copy. • Conclusion: the service does not penalize file system performance unduly.

  21. Evaluation (Cont…) • Question: How large must the abstract be? • Compare machines with a freshly installed machine. • Abstract size does not seem to matter much.

  22. Evaluation (Cont…) • Question: How effective is the lighthouse sweep in discovering buddies? • Simulation: 50000 Pastiche nodes with 11 types of nodes. • Lighthouse is good enough for common nodes (>=10%). • Rare nodes would require coverage-rate overlay.

  23. Evaluation (Cont…) • Question: How effective is the coverage-rate overlay in discovering buddies? • 10000 nodes • 3 types of nodes • One of a thousand species (same species share 70% of content) • One of a hundred genera (30%) • One of ten orders (20%) • Only nodes of the same species can back each other up. • For a neighborhood size of 256, 85% were able to find at least one buddy. • 72% found at least 5. • Neighborhood size matters!

  24. Conclusion • Pastiche: P2P backup mechanism. • What is Pastiche engineered mostly for? • What do end-users back up? • Data files (overlap is minimal) • Applications (lots of overlap, but would you back up your apps?) • Privacy? • Closely coupled with Pastry • Lighthouse sweep. • Needs a large neighborhood set.
