Iris: A Scalable Cloud File System with Efficient Integrity Checks
Cloud Storage • Users and enterprises store data with providers such as Dropbox, Amazon S3/EBS, Windows Azure Storage, SkyDrive, EMC Atmos, Mozy, iCloud, and Google Storage. • Can you trust the cloud? Threats include: infrastructure bugs, malware, and disgruntled employees.
Iris File System • Integrity verification (on the fly) • value read == value written (integrity) • value read == last value written (freshness) • data & metadata • Proof of Retrievability (PoR/PDP) • Verifies: ALL of the data is on the cloud or recoverable • More on this later • High performance (low overhead) • Hundreds of MB/s data rates • Designed for enterprises
Iris Deployment Scenario • Enterprise clients connect through one to five lightweight portal appliances (possibly distributed), which in turn talk to the heavyweight cloud side storing TBs to PBs of data.
Overview: File System Tree • Most file systems have a file-system tree. • Contains: • Directory structure • File names • Timestamps • Permissions • Other attributes • Efficiently laid out on disk (e.g., using a B-tree)
Overview: Merkle Trees • Parents contain the hash of their children. • To verify that an element (e.g., "y") is in the tree, only the nodes on the path from that leaf to the root (plus their siblings) are accessed.
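For concreteness, a minimal sketch of such a path check, assuming SHA-256 and an authentication path given as (sibling_hash, side) pairs; the hash function and path encoding are assumptions, not the exact layout Iris uses:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_merkle_path(leaf: bytes, path, root_hash: bytes) -> bool:
    """Recompute the root from a leaf and its authentication path.

    `path` is a list of (sibling_hash, side) pairs from leaf to root,
    where side is 'L' if the sibling is the left child.
    """
    node = h(leaf)
    for sibling, side in path:
        if side == 'L':
            node = h(sibling + node)   # sibling is the left child
        else:
            node = h(node + sibling)   # sibling is the right child
    return node == root_hash
```

Only one leaf-to-root path (plus sibling hashes) is touched, so the cost is logarithmic in the tree size.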
Iris: Unified File System + Merkle Tree • The file-system tree is also a Merkle tree: a binary tree with extra balancing nodes. • Directory tree: the root node of each directory holds the directory attributes; its leaves are subdirectories and files. • File version tree: the root node holds the file attributes; its leaves are file-block version numbers. • Free list: stores deleted subtrees. • (Diagram: the directory tree sits on top, per-file version trees hang below it, and the file blocks hang below those.)
File Version Tree • Each file has a version tree. • Version numbers increase when blocks are modified. • Version numbers propagate upward to the version-tree root. • (Diagram: a binary tree over block ranges 0:7; writing a block bumps its leaf from v0 to v1, and v1 propagates up through the covering ranges to the root.)
File Version Tree • The process repeats for every write. • Version numbers are unique after each write. • This helps ensure freshness. • (Diagram: a second write bumps the affected leaves to v2, and v2 propagates up to the root.)
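A minimal sketch of this propagation, assuming a simple array-backed binary tree in which every internal node stores the maximum version of its children (a hypothetical in-memory layout, not Iris's actual representation):

```python
class VersionTree:
    """Complete binary tree over n leaf blocks, stored in an array.

    Node i has children 2*i+1 and 2*i+2; leaves sit in the last n slots.
    Each node stores a version number; an internal node holds the max of
    its children, so fresh writes are reflected all the way up to the root.
    """
    def __init__(self, n_blocks: int):
        self.n = n_blocks
        self.versions = [0] * (2 * n_blocks - 1)

    def write_block(self, block_index: int) -> None:
        i = (self.n - 1) + block_index         # leaf position
        self.versions[i] += 1                   # bump the block's version
        while i > 0:                            # propagate upward to the root
            i = (i - 1) // 2
            self.versions[i] = max(self.versions[2 * i + 1],
                                   self.versions[2 * i + 2])
```

For example, `VersionTree(8).write_block(4)` bumps leaf 4 to version 1, and the nodes covering ranges 4:5, 4:7, and 0:7 along with it.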
Integrity Verification: MACs • For each file, Iris generates a MAC file, later used to verify the integrity of the file's data blocks. • Data blocks are 4 KB; each MAC is 20 bytes. • Each MAC is computed over the file id, block index, version number, and block data: m_i = MAC(fid, i, v_i, b_i).
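A minimal sketch of the per-block MAC, using HMAC-SHA-1 to match the 20-byte tag size; the key handling, tuple encoding, and choice of MAC are assumptions, not Iris's exact construction:

```python
import hmac, hashlib, struct

BLOCK_SIZE = 4096      # 4 KB data blocks
MAC_SIZE = 20          # HMAC-SHA-1 tags are 20 bytes

def block_mac(k: bytes, fid: int, index: int, version: int, block: bytes) -> bytes:
    """MAC over (file id, block index, version number, block data)."""
    assert len(block) <= BLOCK_SIZE
    header = struct.pack(">QQQ", fid, index, version)   # fixed-width encoding
    return hmac.new(k, header + block, hashlib.sha1).digest()

# The MAC file for file `fid` is simply the concatenation m_1 || m_2 || ...
```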
Merkle Tree Efficiency • Many FS operations access paths in the tree • Inefficient to access one path at a time • Paths share ancestor nodes • Accessing same nodes over and over • Unnecessary I/O • Redundant Merkle tree crypto • Latency bound • Accessing paths in parallel? • Naïve techniques can lead to corruption • Same ancestor node accessed in separate threads • Need a Merkle tree cache • Very important part of our system
Merkle Tree Cache Challenges • Nodes depend on each other: parents contain hashes of children, so a parent cannot be evicted before its child. • Asynchronous operation: one thread per node or path is inefficient. • Avoid unnecessary hashing: nodes near the root of the tree are reused often. • Efficient sequential file operations: accessing a full path per block adds a logarithmic overhead, so adjacent nodes must stay "long enough" in the cache.
Merkle Tree Cache • Nodes are read into the cache in parallel. • (State diagram: cached nodes move between the states To Verify, Pinned, Unpinned, Compacting, Updating Hash, and Ready to Write via reading, verifying, and writing transitions; the next slides step through this diagram.)
Reading a Path • Example: path "/u/v/b". • Reading it touches the directory-tree nodes along the path, the file's version tree, the file's data blocks, and its MAC file.
Merkle Tree Cache • When both siblings arrive, they are verified. • Verification is top-down: a parent is verified before its children.
Verification • (Diagram: node A is verified first; its children B and C are then checked against the hashes stored in A, and D and E against the hashes stored in their parent.)
Merkle Tree Cache • Verified nodes enter the "pinned" state. • Pinned nodes cannot be evicted; they are used by asynchronous file system operations. • While it is used by at least one operation, a node remains pinned.
Merkle Tree Cache • When a node is no longer used, it becomes "unpinned". • Unpinned nodes are eligible for eviction. • When the cache is 75% full, eviction begins.
Merkle Tree Cache • Eviction step #1: Adjacent nodes with identical version numbers are compacted.
Compacting • A node is kept only if its version differs from its parent's version, or if it is needed for balancing; redundant information is stripped out. • Files are often written sequentially, so their version trees compact to a single node. • (Diagram: a version tree over blocks 0:15 that is mostly at v2 collapses to the root plus the few subtrees still at v1.)
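A minimal sketch of the compaction rule on a pointer-based version tree (hypothetical representation; the balancing-node exception mentioned above is omitted):

```python
class VNode:
    """Version-tree node covering a block range [lo, hi] (hypothetical form)."""
    def __init__(self, lo, hi, version, left=None, right=None):
        self.lo, self.hi = lo, hi
        self.version = version
        self.left, self.right = left, right

def compact(node: VNode) -> VNode:
    """Drop leaf children whose version equals their parent's version.

    Such children carry no information: every block under them is already
    known to be at the parent's version.
    """
    if node is None:
        return None
    node.left = compact(node.left)
    node.right = compact(node.right)
    for side in ("left", "right"):
        child = getattr(node, side)
        if (child is not None and child.version == node.version
                and child.left is None and child.right is None):
            setattr(node, side, None)        # redundant child removed
    return node
```

Applied bottom-up, this collapses a fully sequentially written file, where every node carries the same version, down to its root node alone.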
Merkle Tree Cache • Eviction step #2: Hashes are then updated in bottom-up order.
Merkle Tree Cache • Eviction step #3: Nodes are written to cloud storage.
Merkle Tree Cache • Note: a node can be pinned again at any time during eviction; the path to that node then becomes pinned as well.
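Putting the preceding slides together, a minimal sketch of the per-node states and allowed transitions (names are hypothetical; the real cache processes these transitions asynchronously and in batches):

```python
from enum import Enum, auto

class NodeState(Enum):
    TO_VERIFY = auto()       # read from the cloud, awaiting its sibling
    PINNED = auto()          # verified and in use by at least one operation
    UNPINNED = auto()        # verified, idle, eligible for eviction
    COMPACTING = auto()      # eviction step 1: merge identical-version nodes
    UPDATING_HASH = auto()   # eviction step 2: recompute hashes bottom-up
    READY_TO_WRITE = auto()  # eviction step 3: write back to cloud storage

EVICTION_THRESHOLD = 0.75    # eviction begins when the cache is 75% full

# Allowed transitions; a node may return to PINNED from any eviction stage
# if an operation touches it again.
TRANSITIONS = {
    NodeState.TO_VERIFY:      {NodeState.PINNED},
    NodeState.PINNED:         {NodeState.UNPINNED},
    NodeState.UNPINNED:       {NodeState.COMPACTING, NodeState.PINNED},
    NodeState.COMPACTING:     {NodeState.UPDATING_HASH, NodeState.PINNED},
    NodeState.UPDATING_HASH:  {NodeState.READY_TO_WRITE, NodeState.PINNED},
    NodeState.READY_TO_WRITE: {NodeState.PINNED},  # or evicted after writing
}
```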
Merkle Tree Cache: Crucial for Real-World Workloads • Iris benefits from locality • Very small cache required to achieve high throughput • Cache size: 5 MB to 10 MB
Sequential Workloads • Results • 250 to 300 MB/s • 100+ clients • Cache • Minimal cache size ( < 1 MB ) to achieve high throughput • Reason: Nodes get compacted • Usually network bound
Random Workloads • Results • Bound by disk seeks • Cache • Minimal cache size ( < 1 MB ) to achieve seek-bound throughput • Cache only used to achieve parallelism to combat latency. • Reason: Very little locality.
Other Workloads • Performance is highly workload dependent; specifically, it depends on the number of disk seeks. • Iris is designed to reduce Merkle-tree seek overhead via: • Compacting • The Merkle tree cache
Proofs of Retrievability • How can we be sure our data is still there? • Iris continuously verifies that the cloud possesses all data. • First sublinear solution to the open problem of dynamic Proofs of Retrievability.
Proofs of Retrievability • Iris verifies that the cloud possesses 99.9% of the data (with high probability). • The remaining 0.1% can be recovered using the Iris parity data structure. • Custom-designed error-correcting code (ECC) and parity data structure. • High throughput (300-550 MB/s).
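A minimal sketch of the sampling idea behind such a check — not Iris's actual challenge-response protocol, and `read_and_verify_block` (fetch a block, check its MAC and Merkle path) is a hypothetical helper:

```python
import random

def audit(n_blocks: int, read_and_verify_block, samples: int = 1000) -> bool:
    """Spot-check randomly chosen blocks.

    If the cloud has silently dropped or corrupted more than ~0.1% of the
    blocks, at least one of 1000 uniform samples fails with probability
    about 1 - 0.999**1000 ≈ 63%; repeated audits push this much higher.
    """
    for _ in range(samples):
        i = random.randrange(n_blocks)
        if not read_and_verify_block(i):   # MAC / Merkle check fails
            return False
    return True
```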
ECC Challenges • Update efficiency • Want high-throughput file system • On-the-fly • ECC should not be a bottleneck • Reed–Solomon codes are too slow. • Hiding code structure • Adversary should not know which blocks to corrupt to make ECC fail. • Adversarially-secure ECC • Variable-length encoding • Handles: blocks, file attributes, Merkle tree nodes, etc
Iris Error-Correcting Code • Each block in the file system is mapped pseudo-randomly, under a secret key, to the (stripe, offset) positions of its corresponding parities in the ECC parity stripes. • The cloud does not know the key, so it cannot determine which 0.1% subset of the data to corrupt to make the ECC fail.
Iris Error-Correcting Code • Cost summary (amortized): memory, update time, and verification time.
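A minimal sketch of the keyed pseudo-random placement, using HMAC-SHA-256 as the PRF; `num_stripes` and `stripe_len` are hypothetical parameters, and the actual code structure and parity-update logic are omitted:

```python
import hmac, hashlib

def parity_position(key: bytes, fid: int, block_index: int,
                    num_stripes: int, stripe_len: int) -> tuple[int, int]:
    """Map a file-system block to a (stripe, offset) in the parity structure.

    The mapping is a PRF of the block's identity under a secret key, so the
    cloud (which does not hold the key) cannot tell which blocks share a
    parity stripe, and thus cannot pick a small subset to corrupt that
    defeats the code.
    """
    msg = f"{fid}:{block_index}".encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    stripe = int.from_bytes(digest[:8], "big") % num_stripes
    offset = int.from_bytes(digest[8:16], "big") % stripe_len
    return stripe, offset
```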
ECC Update Efficiency • Very fast • 300-550 MB/s • Not a bottleneck in Iris
Conclusion • Presented the Iris file system • Integrity • Proofs of retrievability / data possession • On the fly • Very practical • Overall system throughput: 250-300 MB/s per portal • Scales to enterprises