Pastiche: Making Backup Cheap and Easy Presented by: Boon Thau Loo CS294-4
Outline • Motivation and Goals • Enabling Technologies • System Design • Implementation and Evaluation • Conclusion
Motivation • The majority of users do not back up their data. Those who do • Don’t back up very often. • Don’t back up everything. • Backup is a significant cost in large organizations. • Why not use excess disk space for backups? • File systems are only half-full on average. • Disks are cheap.
Pastiche Goals • P2P backup system • Target environment • Cooperation among untrusted machines. • End-user machines. • Leverage common data when possible for space efficiency (backup “buddies”). • Preserve privacy. • Efficient, cost-free, administration-free.
Enabling Technologies • Pastry for self-organizing routing and object location. • Content-based indexing (Manber94, LBFS) • Identify boundary regions (anchors) that divide a file into chunks, using Rabin fingerprinting. • Changes stay isolated to the chunks they touch. • SHA-1 hash of each chunk serves as its identifier (sketch below). • Convergent encryption (used by FARSITE) • Encrypt each file with a key derived from the file’s contents. • Further encrypt that key with the client’s key. • The encrypted key is stored with the file in FARSITE.
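Below is a minimal sketch of content-based chunking in Python. The rolling hash, window size, boundary mask, and chunk-size limits are illustrative stand-ins for the Rabin-fingerprint parameters Pastiche actually uses.

    import hashlib

    WINDOW = 48                       # rolling-window size in bytes (illustrative)
    BOUNDARY_MASK = 0x1FFF            # ~8 KB expected chunk size (illustrative)
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
    MOD = 1 << 32
    POW_OUT = pow(31, WINDOW, MOD)    # weight of the byte sliding out of the window

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of content-defined chunks.

        A position is an anchor when the rolling hash over the last WINDOW bytes
        matches a fixed bit pattern, so boundaries depend only on local content
        and survive insertions or deletions elsewhere in the file."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = (h * 31 + b) % MOD
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * POW_OUT) % MOD
            size = i - start + 1
            if size < MIN_CHUNK:
                continue
            if (h & BOUNDARY_MASK) == BOUNDARY_MASK or size >= MAX_CHUNK:
                yield start, i + 1
                start = i + 1
        if start < len(data):
            yield start, len(data)

    def file_signature(data: bytes):
        """SHA-1 of each chunk; the ordered list of chunk IDs forms the file's signature."""
        return [hashlib.sha1(data[s:e]).digest() for s, e in chunk_boundaries(data)]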
System Design • Data Chunks • File meta-data • Abstracts • Joining Pastry • Finding Backup buddies • Backup Protocol • Restoration • Failures and Malicious Nodes • Greed Prevention
Data Chunks • Data is stored on disk as immutable chunks. • Content-based indexing + convergent encryption. • Chunks are stored for the local host and/or on behalf of backup clients. • Each chunk carries an owner list and maintains a reference count. • When a newly written file is closed, it is scheduled for chunking. Each chunk has: Hc – handle, Ic – chunk ID, Kc – encryption key. The list of chunk IDs forms the file signature (sketch below).
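A minimal sketch of convergent encryption for a single chunk. The paper's exact derivations of Hc, Ic, and Kc may differ; here I assume Kc is the SHA-1 of the plaintext, Ic the SHA-1 of Kc, and Hc the SHA-1 of the ciphertext, and a toy SHA-1 counter-mode keystream stands in for a real cipher such as AES.

    import hashlib

    def _keystream(key: bytes, length: int) -> bytes:
        """Toy SHA-1 counter-mode keystream; a stand-in for a real block cipher."""
        out = bytearray()
        counter = 0
        while len(out) < length:
            out += hashlib.sha1(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(out[:length])

    def encrypt_chunk(plaintext: bytes):
        """Convergent encryption of one chunk: the key depends only on the chunk's
        contents, so identical chunks on different hosts produce identical
        ciphertext and can be shared between backup buddies."""
        kc = hashlib.sha1(plaintext).digest()                  # encryption key (convergent)
        stream = _keystream(kc, len(plaintext))
        ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
        ic = hashlib.sha1(kc).digest()                         # chunk ID (assumed derivation)
        hc = hashlib.sha1(ciphertext).digest()                 # handle (assumed derivation)
        return hc, ic, kc, ciphertext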
Data Chunks (Cont…) • Backup request: • Remote hosts must supply a public key with the request. • If the chunk already exists, the requesting host is added to the owner list. • The local reference count is incremented. • Delete request: • Requests from remote hosts must be signed with the host’s private key. • The signature is checked against the public key cached from the earlier backup request. • When the reference count reaches 0, the chunk is removed. (Sketch below.)
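A sketch of the chunk store's request handling under the assumptions above; the class layout, method names, and the `fetch`/`verify` callbacks are hypothetical, and real signature verification is elided.

    class ChunkStore:
        """Per-node chunk store: ciphertext, owner lists, and reference counts."""

        def __init__(self):
            self.chunks = {}       # chunk ID -> ciphertext
            self.owners = {}       # chunk ID -> {host ID: public key}
            self.refcount = {}     # chunk ID -> reference count

        def backup_request(self, host_id, public_key, chunk_id, fetch):
            """Store a chunk on behalf of a remote host; `fetch` pulls the
            ciphertext only if the chunk is not already held locally."""
            if chunk_id not in self.chunks:
                self.chunks[chunk_id] = fetch(chunk_id)
            self.owners.setdefault(chunk_id, {})[host_id] = public_key   # cache key for deletes
            self.refcount[chunk_id] = self.refcount.get(chunk_id, 0) + 1

        def delete_request(self, host_id, chunk_id, signature, verify):
            """Drop one reference; the chunk is reclaimed when the count hits 0."""
            key = self.owners.get(chunk_id, {}).get(host_id)
            if key is None or not verify(key, signature, chunk_id):
                return False                                   # unsigned or unknown host
            self.owners[chunk_id].pop(host_id, None)
            self.refcount[chunk_id] -= 1
            if self.refcount[chunk_id] == 0:
                for table in (self.chunks, self.owners, self.refcount):
                    table.pop(chunk_id, None)
            return True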
File Meta-data • Per-file meta-data: • List of handles Hc for the chunks comprising the file. • Ownership, permissions, creation and modification times. • Mutable, with fixed Hc, Kc and Ic. • File-system root meta-data: its Hc is generated from a host-specific passphrase (sketch below).
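A compact sketch of what such a meta-data object might hold; the field names and the passphrase-to-handle derivation are assumptions for illustration.

    import hashlib
    from dataclasses import dataclass

    @dataclass
    class FileMeta:
        """Per-file meta-data: chunk handles plus ordinary file attributes."""
        handles: list        # ordered list of Hc handles for the file's chunks
        owner: str
        permissions: int
        ctime: float
        mtime: float

    def root_handle(passphrase: str) -> bytes:
        """The root meta-data object's handle comes from a host-specific
        passphrase rather than from file contents (assumed derivation)."""
        return hashlib.sha1(passphrase.encode()).digest()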
Abstracts • The initial backup of a freshly installed machine is the most expensive. • Goal: find a good buddy that owns all or most of your data chunks. • Naïve solution: ship the new node’s full signature around. • Expensive: 20 bytes of signature per 16 KB chunk, which adds up to megabytes across a whole disk. • Solution: send a random subset of the signatures, called an abstract (sketch below).
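A minimal sketch of building an abstract and computing a node's coverage of it; the abstract size and sampling scheme are illustrative.

    import random

    def make_abstract(signature, size=100, seed=None):
        """An abstract is a small random sample of the full chunk-ID signature."""
        rng = random.Random(seed)
        return rng.sample(list(signature), min(size, len(signature)))

    def coverage(abstract, local_chunk_ids):
        """Fraction of the abstract's chunk IDs already stored on this node."""
        if not abstract:
            return 0.0
        local = set(local_chunk_ids)
        return sum(cid in local for cid in abstract) / len(abstract)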
Joining Pastry • Pastry: • Self-organizing, peer-to-peer overlay. • Each node maintains: • Leaf set: the L/2 numerically closest smaller and L/2 closest larger nodeIDs. • Neighborhood set: the closest nodes according to a proximity metric. • Routing table: used for prefix-based routing. • Join the Pastry overlay with nodeID set to Hash(hostname). • Then find backup buddies…
Finding Backup Buddies • After joining the network, route a Pastry message carrying the abstract to a random nodeID. • Each node along the route returns its coverage (the fraction of the abstract’s chunks stored locally) along with the abstract. • Lighthouse sweep: if the candidate set is insufficient, the probe is repeated with the first digit of the original nodeID varied, rotating the probe around the ID space (sketch below).
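A sketch of generating the nodeID and the lighthouse-sweep probe targets; the hex digit alphabet and the commented-out routing calls are assumptions, not the paper's API.

    import hashlib

    DIGITS = "0123456789abcdef"       # Pastry digits, assuming base-16 nodeIDs

    def node_id(hostname: str) -> str:
        """nodeID = hash of the hostname, so a rebuilt machine can rejoin under the same ID."""
        return hashlib.sha1(hostname.encode()).hexdigest()

    def lighthouse_targets(seed_id: str):
        """Probe a first ID, then rotate the leading digit through all other values."""
        yield seed_id
        for d in DIGITS:
            if d != seed_id[0]:
                yield d + seed_id[1:]

    # Hypothetical usage, assuming route_with_abstract() returns (node, coverage) pairs:
    # buddies = []
    # seed = hashlib.sha1(os.urandom(20)).hexdigest()        # random probe target (needs `import os`)
    # for target in lighthouse_targets(seed):
    #     buddies += [n for n, c in route_with_abstract(target, abstract) if c > 0.9]
    #     if len(buddies) >= 5:
    #         break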
Not Enough Buddies? • Each node tries to find 5 buddies. • What if you can’t find enough? • A real possibility for rare installations. • Create a coverage-rate Pastry overlay: • Replace the network-proximity distance metric with coverage rate. • Pastry neighborhood set: the nodes with the best coverage encountered during the join. • Find buddies in the neighborhood set. • A may be a buddy for B but not vice versa (no symmetry). • Malicious nodes could misreport their coverage.
Backup Protocol • Each Pastiche node controls its own archival plan. • Snapshot: a discrete backup event. • The meta-data skeleton for each snapshot is stored in per-file logs. • State needed for a new snapshot: add set, delete set, meta-data list.
Backup Protocol (Cont…) • Snapshot process (A stores a snapshot on B), sketched below: • A sends its public key to B (for future validation). • A forwards the chunk IDs of the add set to B. • B fetches the chunks it does not already store locally. • A sends the delete list (signed with A’s private key). • A sends the updated meta-data. • A sends a commit request; B responds when all changes are persistent.
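A sketch of the snapshot exchange from A's side under the assumptions above. The `buddy` RPC interface and the `read_chunk`/`sign` callbacks are hypothetical; here A pushes the chunks B reports missing, whereas the slide has B fetch them, with the same net effect.

    def take_snapshot(buddy, public_key, add_set, delete_set, metadata, read_chunk, sign):
        """One snapshot from node A onto buddy B, following the slide's steps."""
        buddy.send_public_key(public_key)                  # kept by B to validate later deletes
        missing = buddy.offer_chunks(sorted(add_set))      # B reports which add-set chunks it lacks
        for chunk_id in missing:
            buddy.put_chunk(chunk_id, read_chunk(chunk_id))            # ship only missing ciphertext
        delete_list = sorted(delete_set)
        buddy.delete_chunks(delete_list, sign(b"".join(delete_list)))  # signed with A's private key
        buddy.put_metadata(metadata)                       # updated meta-data skeleton
        return buddy.commit()                              # B acks once all changes are persistent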
Restoration • Partial restores are straightforward: obtain the chunks from a buddy. • Recovering an entire machine (sketch below): • Keep a copy of the root meta-data object on each member of the leaf set. • Rejoin with the same nodeID (based on the hostname). • Retrieve the root meta-data object from any node in the leaf set. • The root block contains the list of buddies.
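A sketch of whole-machine recovery under those steps; the `overlay` object, its methods, the `restore_files` callback, and the passphrase-derived root handle are hypothetical placeholders.

    import hashlib

    def restore_machine(hostname, passphrase, overlay, restore_files):
        """Rejoin under the old nodeID, pull the root meta-data from a leaf-set
        member, then restore files from the buddies listed in the root block."""
        nid = hashlib.sha1(hostname.encode()).hexdigest()       # same nodeID as before the failure
        leaf_set = overlay.join(nid)
        root = None
        for neighbor in leaf_set:                               # every leaf-set member holds a copy
            root = neighbor.get_root_metadata(hashlib.sha1(passphrase.encode()).digest())
            if root is not None:
                break
        if root is None:
            raise RuntimeError("no leaf-set member had our root meta-data")
        for buddy in root["buddies"]:                           # root block lists the backup buddies
            restore_files(buddy, root)                          # walk meta-data, fetch and decrypt chunks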
Detecting Failure and Malice • Failures: • A buddy can drop chunks if it runs out of disk space. • A buddy may crash or leave the network. • A malicious buddy may pretend to store your chunks. • Solutions: • Before taking a new snapshot, query buddies for a random subset of chunks; this provides immediate assurance. • Periodic probing of each buddy: analysis shows that checking 0.1% of all chunks is enough (sketch below). • Sybil attack? A malicious party could occupy a substantial fraction of the nodeID space.
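A sketch of the periodic probe, under the earlier assumption that a chunk's handle is the SHA-1 of its ciphertext; the `buddy.get_chunk` call and the 0.1% sampling knob are illustrative.

    import hashlib
    import random

    def probe_buddy(buddy, my_handles, fraction=0.001, rng=random):
        """Ask the buddy for a small random sample of our chunks and verify each
        returned ciphertext against its handle."""
        handles = list(my_handles)
        if not handles:
            return True
        sample = rng.sample(handles, max(1, int(len(handles) * fraction)))
        for handle in sample:
            ciphertext = buddy.get_chunk(handle)
            if ciphertext is None or hashlib.sha1(ciphertext).digest() != handle:
                return False           # buddy lost or faked this chunk; look for a replacement
        return True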
Greed Prevention • A greedy host can consume an unfair share of storage. • Three solutions: • Group backup clients based on the resources they consume. • Cryptographic puzzles whose cost scales with the storage consumed (sketch below). • Electronic currency. • Currency accounting requires atomicity between the exchange of currency and the backup.
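A minimal sketch of a storage-scaled hash puzzle; the difficulty schedule and the SHA-1 leading-zero-bits construction are my own illustration, not a scheme from the paper.

    import hashlib

    def puzzle_difficulty(bytes_consumed, base_bits=10, per_gb_bits=1):
        """Illustrative policy: one extra leading zero bit per GB already consumed."""
        return base_bits + per_gb_bits * (bytes_consumed // (1 << 30))

    def solve_puzzle(challenge: bytes, bits: int) -> int:
        """Find a nonce so that SHA-1(challenge || nonce) has `bits` leading zero bits."""
        target = 1 << (160 - bits)
        nonce = 0
        while int.from_bytes(hashlib.sha1(challenge + nonce.to_bytes(8, "big")).digest(), "big") >= target:
            nonce += 1
        return nonce

    # The server issues a fresh random challenge, scales `bits` with the host's
    # current consumption, and verifies the answer with a single hash.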
Implementation • Chunkstore file system • Container files – LRU cache of decrypted, recently used files for performance. • Chunks increase internal fragmentation. • Backup daemon • Server: Manages remote requests for storage and restoration. • Client: Supervises selection of buddies and snapshots.
Evaluation • Compare ext2fs with chunkstore on the modified Andrew benchmark: • Total overhead of 7.4% is reasonable. • Overhead comes from meta-data management and Rabin fingerprint computation (for finding anchors). • Backup and restore compare favorably to an NFS cross-machine copy. • Conclusion: the service does not penalize file system performance unduly.
Evaluation (Cont…) • Question: How large must the abstract be? • Compare existing machines against a freshly installed machine. • Abstract size does not seem to matter much.
Evaluation (Cont…) • Question: How effective is the lighthouse sweep at discovering buddies? • Simulation: 50,000 Pastiche nodes of 11 node types. • The lighthouse sweep is good enough for common node types (>= 10% of the population). • Rare node types would require the coverage-rate overlay.
Evaluation (Cont…) • Question: How effective is the coverage-rate overlay at discovering buddies? • 10,000 nodes, 3 levels of classification: • One of a thousand species (nodes of the same species share 70% of content). • One of a hundred genera (30%). • One of ten orders (20%). • Only nodes of the same species can back each other up. • For a neighborhood size of 256, 85% of nodes found at least one buddy and 72% found at least 5. • Neighborhood size matters!
Conclusion • Pastiche: a P2P backup mechanism. • What is Pastiche engineered mostly for? • What do end-users back up? • Data files (overlap is minimal). • Applications (lots of overlap, but would you back up your apps?) • Privacy? • Closely coupled with Pastry: • the lighthouse sweep • needs a large neighborhood set.