
Wide-Area Cooperative Storage with CFS



  1. Wide-Area Cooperative Storage with CFS Morris et al. Presented by Milo Martin UPenn Feb 17, 2004 (some slides based on slides by authors)

  2. Overview • Problem: content distribution • Solution: distributed read-only file system • Implementation • Provide file system interface • Using a Distributed Hash Table (DHT) • Extend Chord with redundancy and caching

  3. Problem: Content Distribution [Figure: several nodes scattered across the Internet] • Serving static content with inexpensive hosts • open-source distributions • off-site backups • tech report archive

  4. Example: mirror open-source distributions • Multiple independent distributions • Each has high peak load, low average • Individual servers are wasteful • Solution: • Option 1: single powerful server • Option 2: distributed service • But how do you find the data?

  5. Assumptions • Storage is cheap and plentiful • Many participants • Many reads, few updates • Heterogeneous nodes • Storage • Bandwidth • Physical locality • Exploit for performance • Avoid for resilience

  6. Goals • Avoid hot spots due to popular content • Distribute the load (traffic and storage) • “Many hands make light work” • High availability • Using replication • Data integrity • Using secure hashes • Limited support for updates • Self managing/repairing

  7. CFS Approach • Break content into immutable “blocks” • Use blocks to build a file system • Identify blocks via secure hash • Unambiguous name for a block • Self-verifying • Distributed Hash Table (DHT) of blocks • Distribute blocks • Replicate and cache blocks • Algorithm for finding blocks (e.g., Chord) • Content search beyond the scope

  8. Outline • File system structure • Review: Chord distributed hashing • DHash block management • CFS evaluation • Brief overview of Oceanstore • Discussion

  9. Hash-based Read-only File System • Assume retrieval substrate • put(data): inserts data with key Hash(data) • get(h) -> data • Assume “root” identifier known • Build a read-only file system using this interface • Based on SFSRO [OSDI 2000]
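A minimal sketch (Python) of the put/get substrate this slide assumes: a content-addressed store whose keys are the SHA-1 of the data (CFS uses SHA-1), with a plain dict standing in for the distributed layer. The class and method names are illustrative, not the CFS API.

    import hashlib

    class BlockStore:
        """Toy content-addressed store: a block's key is the SHA-1 of its data."""
        def __init__(self):
            self.blocks = {}                       # stand-in for the distributed hash table

        def put(self, data: bytes) -> bytes:
            key = hashlib.sha1(data).digest()
            self.blocks[key] = data
            return key                             # callers embed this key in parent blocks

        def get(self, key: bytes) -> bytes:
            data = self.blocks[key]
            # Self-verifying: recompute the hash before trusting what came back.
            assert hashlib.sha1(data).digest() == key
            return data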

  10. Client-server interface • Files have unique names • Files are read-only (single writer, many readers) • Publishers split files into blocks • Clients check files for authenticity [SFSRO] [Figure: an FS client inserts and looks up file f; server nodes insert and look up the underlying blocks]

  11. File System Structure • Build file system bottom up • Break files into blocks • Hash each block • Create a “file block” that points to hashes • Create “directory blocks” that point to files • Ultimately arrive at “root” hash • SFSRO: blocks can be encrypted
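A sketch of the bottom-up construction, building on the toy BlockStore above; publish_file and publish_directory are illustrative names, and the block layout here (a bare concatenation of 20-byte hashes) is a stand-in for the real SFSRO block formats.

    def publish_file(store, content: bytes, block_size=8192):
        """Split a file into blocks, insert each one, then insert a "file block"
        listing their hashes. Returns the file block's key."""
        keys = [store.put(content[off:off + block_size])
                for off in range(0, len(content), block_size)]
        return store.put(b"".join(keys))           # file block = list of data-block hashes

    def publish_directory(store, entries):
        """entries maps file names to file-block keys; returns the directory
        block's key. Publishing the root directory last yields the root hash."""
        dir_block = b"".join(name.encode() + b"\0" + key
                             for name, key in sorted(entries.items()))
        return store.put(dir_block)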

  12. File System Example [Figure: a root block points to directory blocks, directory blocks point to file blocks, and file blocks point to data blocks]

  13. File System Lookup Example [Figure: a lookup walks from the root block through directory and file blocks down to data blocks] • Root hash verifies data integrity • Recursive verification • If the root hash is correct, the data it reaches can be verified • Prefetch data blocks
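A sketch of the read path under the same toy format: because get() re-hashes every block it returns, a correct root (or file-block) key recursively verifies everything reachable from it. fetch_file is an illustrative name, and a real client would issue the data-block fetches in parallel (prefetching).

    def fetch_file(store, file_key: bytes) -> bytes:
        """Reassemble a file published by publish_file() above."""
        file_block = store.get(file_key)                     # verified against file_key
        keys = [file_block[i:i + 20]                         # SHA-1 digests are 20 bytes
                for i in range(0, len(file_block), 20)]
        # Prefetching would overlap these fetches; done serially here for clarity.
        return b"".join(store.get(k) for k in keys)          # each block verified against its key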

  14. File System Updates • Updates supported, but kludgey • Single updater • All or nothing updates • Export a new version of the filesystem • Requires a new root hash • Conceptually rebuild file system • Insert modified blocks with new hashes • Introduce signed “root block” • Key is hash of public key (not the hash of contents) • Allow updates if they are signed with private key • Sequence number prevents replay
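A sketch of the signed root block, assuming the toy BlockStore above: the block's key is the hash of the public key (so it survives updates), and a stored sequence number rejects replays. The signing function is left abstract, and a real server would also verify the signature against the public key before accepting the update.

    import hashlib

    def put_root(store, public_key: bytes, root_hash: bytes, seq: int, sign):
        """Insert or update a signed root block; 'sign' is the publisher's
        private-key signing function (abstract here)."""
        key = hashlib.sha1(public_key).digest()    # key = Hash(public key), not Hash(contents)
        payload = root_hash + seq.to_bytes(8, "big")
        old = store.blocks.get(key)
        if old is not None and int.from_bytes(old[20:28], "big") >= seq:
            raise ValueError("stale sequence number (replay rejected)")
        store.blocks[key] = payload + sign(payload)
        return key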

  15. Update Example [Figure: a new root block, keyed by Hash(Public Key), points to updated directory and file blocks; unmodified data blocks are shared with the previous version, while new data gets new blocks]

  16. File System Summary • Simple interface • put(data): inserts data with key Hash(data) • get(h) -> data • put_root(sign_private(hash, seq#), public key) • refresh(key): allows discarding of stale data • Recursive integrity verification

  17. Review of Goals and Assumptions (File System Layer) • Data integrity • Using secure hashes • Limited support for updates • Distribute load; avoid hot spots • High availability • Self managing/repairing • Exploit locality • Assumptions • Many participants • Heterogeneous nodes

  18. Outline • File system structure • Review: Chord distributed hashing • DHash block management • CFS evaluation • Brief overview of Oceanstore • Discussion

  19. Server Structure [Figure: each CFS node runs a DHash layer on top of a Chord layer] • DHash stores, balances, replicates, caches blocks • DHash uses Chord [SIGCOMM 2001] to locate blocks

  20. Chord Hashes a Block ID to its Successor [Figure: circular ID space with nodes N10, N32, N60, N80, N100; N10 stores B112, B120, …, B10; N32 stores B11, B30; N60 stores B33, B40, B52; N80 stores B65, B70; N100 stores B99] • Nodes and blocks have randomly distributed IDs • Successor: node with next highest ID
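A sketch of successor placement on the circular ID space, assuming a global view of the sorted node IDs; successor() is an illustrative name, not Chord's RPC.

    import bisect

    def successor(node_ids, block_id):
        """Return the node responsible for block_id: the first node whose ID is
        >= block_id, wrapping around to the lowest ID at the top of the ring."""
        ids = sorted(node_ids)
        i = bisect.bisect_left(ids, block_id)
        return ids[i % len(ids)]

    # Matches the figure: block 70 is stored at node N80.
    assert successor([10, 32, 60, 80, 100], 70) == 80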

  21. Basic Lookup - Linear Time [Figure: a query “Where is block 70?” starts at one node and is forwarded around the ring node by node until the answer “N80” comes back] • Lookups find the ID’s predecessor • Correct if successors are correct
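A minimal sketch of the linear-time lookup, assuming each node knows only its successor; Node, between, and linear_lookup are illustrative names.

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.successor = None                  # filled in when the ring is built

    def between(x, a, b):
        """True if x lies in the half-open interval (a, b] on the circular ID space."""
        return (a < x <= b) if a < b else (x > a or x <= b)

    def linear_lookup(start, block_id):
        """Walk the ring one successor at a time: O(N) hops, but correct
        whenever the successor pointers are correct."""
        n = start
        while not between(block_id, n.id, n.successor.id):
            n = n.successor                        # block lies beyond our successor
        return n.successor                         # n is the predecessor; its successor stores the block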

  22. Successor Lists Ensure Robust Lookup [Figure: each node on the ring keeps a list of its next few successors, e.g., N5 keeps 10, 20, 32 and N99 keeps 110, 5, 10] • Each node stores r successors, r = 2 log N • Lookup can skip over dead nodes to find blocks

  23. Chord Finger Table Allows O(log N) Lookups [Figure: node N80’s fingers point ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring] • See [SIGCOMM 2001] for table maintenance • Reasonable lookup latency
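A sketch of the O(log N) lookup using fingers, reusing Node and between() from the linear-lookup sketch above (each node now also holds a fingers list); table maintenance is omitted, as the slide defers it to [SIGCOMM 2001].

    def strictly_between(x, a, b):
        """True if x lies in the open interval (a, b) on the circular ID space."""
        return (a < x < b) if a < b else (x > a or x < b)

    def closest_preceding_finger(node, block_id):
        """Pick the finger that most closely precedes block_id; fingers point
        roughly ½, ¼, 1/8, ... of the way around the ring, nearest first."""
        for f in reversed(node.fingers):           # scan from the farthest finger back
            if strictly_between(f.id, node.id, block_id):
                return f
        return node.successor

    def chord_lookup(start, block_id):
        """Each hop at least halves the remaining ID-space distance,
        so a lookup takes O(log N) hops."""
        n = start
        while not between(block_id, n.id, n.successor.id):
            n = closest_preceding_finger(n, block_id)
        return n.successor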

  24. DHash/Chord Interface • lookup(blockID) returns a list of <node-ID, IP address> pairs with node IDs closer in ID space to the block ID • Sorted, closest first [Figure: the DHash server issues Lookup(blockID) to Chord, which answers from its finger table of <node ID, IP address> entries]

  25. DHash Uses Other Nodes to Locate Blocks [Figure: Lookup(BlockID=45) hops across the ring (steps 1, 2, 3) to reach the node responsible for block 45]

  26. Availability and Resilience: Replicate blocks at r successors [Figure: the original of block 17 is held by its successor; replicas are held by the next several successors on the ring] • Hashed IP addresses ensure independent replica failure • High storage cost for replication; does this matter?
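A sketch of replica placement at the block's successor and the next r successors, assuming a view of the ring as a list of node objects sorted by ID; place_block, fetch_block, and the node attributes are illustrative, not the DHash interface.

    import bisect

    def place_block(ring, block_id, data, r=6):
        """Store the block at its successor plus the next r successors
        ('ring' is sorted by node ID; each node has a .store dict)."""
        ids = [n.id for n in ring]
        i = bisect.bisect_left(ids, block_id) % len(ring)
        for j in range(r + 1):                     # original + r replicas
            ring[(i + j) % len(ring)].store[block_id] = data

    def fetch_block(ring, block_id, r=6):
        """Any live node in the successor list can answer the fetch."""
        ids = [n.id for n in ring]
        i = bisect.bisect_left(ids, block_id) % len(ring)
        for j in range(r + 1):
            node = ring[(i + j) % len(ring)]
            if node.alive and block_id in node.store:
                return node.store[block_id]
        raise KeyError("all replicas unreachable")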

  27. Lookups find replicas [Figure: Lookup(BlockID=17); RPCs: 1. Lookup step, 2. Get successor list, 3. Failed block fetch, 4. Block fetch from a replica]

  28. First Live Successor Manages Replicas [Figure: when the node holding block 17 fails, the next node on the ring already holds a copy of 17] • Node can locally determine that it is the first live successor; automatically repair and re-replicate

  29. Reduce Overheads: Caches Along Lookup Path [Figure: Lookup(BlockID=45); RPCs: 1. Chord lookup, 2. Chord lookup, 3. Block fetch, 4. Send to cache] • The fetched block is sent only to the second-to-last hop in the routing path
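A sketch of the caching step, assuming the lookup records the ordered list of nodes it visited and that each node has per-node block storage with a cache area (one possible shape, put_cached(), is sketched after the Block Storage slide below); names are illustrative.

    def fetch_with_path_caching(path, block_id, block_data):
        """Push the fetched block only to the second-to-last hop of the lookup
        path: the node that future lookups for this block are most likely to
        pass through, which spreads load away from the block's home server."""
        if len(path) >= 2:
            path[-2].storage.put_cached(block_id, block_data)   # RPC 4: send to cache
        return block_data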

  30. Caching at Fingers Limits Load [Figure: fingers from around the ring converge on N32] • Only O(log N) nodes have fingers pointing to N32 • This limits the single-block load on N32

  31. Virtual Nodes Allow Heterogeneity [Figure: two hosts, Node A and Node B, each appear on the ring as multiple virtual nodes (N5, N10, N60, N101)] • Hosts may differ in disk/net capacity • Hosts may advertise multiple IDs • Chosen as SHA-1(IP Address, index) • Each ID represents a “virtual node” • Host load is proportional to its number of virtual nodes • Manually controlled; automatic adaptation possible
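A sketch of virtual-node ID derivation as SHA-1(IP address, index); the exact encoding of the (address, index) pair is an assumption here.

    import hashlib

    def virtual_node_ids(ip_address: str, count: int):
        """A host with more disk or bandwidth advertises more virtual nodes,
        so its share of the ID space (and hence its load) grows proportionally."""
        return [int.from_bytes(hashlib.sha1(f"{ip_address},{i}".encode()).digest(), "big")
                for i in range(count)]

    # Example: a well-provisioned host joins the ring as four virtual nodes.
    ids = virtual_node_ids("192.0.2.7", 4)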

  32. Physical Locality-aware Path Choice [Figure: Lookup(47); the candidate next hops have RTTs of 10ms, 12ms, 50ms, and 100ms] • Each node monitors RTTs to its own fingers • Pick smallest RTT (a greedy choice) • Tradeoff: ID-space progress vs delay
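A sketch of the greedy, RTT-aware next-hop choice, reusing strictly_between() from the finger-table sketch above; node.rtt is assumed to hold the RTTs this node has measured to its own fingers.

    def pick_next_hop(node, block_id):
        """Among fingers that still make progress in ID space toward block_id,
        greedily take the one with the smallest measured RTT (trading some
        ID-space progress per hop for lower per-hop delay)."""
        candidates = [f for f in node.fingers
                      if strictly_between(f.id, node.id, block_id)]
        if not candidates:
            return node.successor
        return min(candidates, key=lambda f: node.rtt[f.id])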

  33. Why Blocks Instead of Files? • Cost: one lookup per block • Can tailor cost by choosing good block size • Need prefetching for high throughput • What is a good block size? • Higher latency? • Benefit: load balance is simple • For large files • Storage cost of large files is spread out • Popular files are served in parallel

  34. Block Storage [Figure: a node’s disk is split between an LRU cache and long-term block storage] • Long-term blocks are stored for a fixed time • Publishers need to refresh periodically • Cache uses LRU replacement
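A sketch of a node's disk split into an expiring long-term store and an LRU cache; the lifetime value and all names are illustrative.

    import time
    from collections import OrderedDict

    class BlockStorage:
        """Long-term blocks expire after a fixed time unless the publisher
        refreshes them; cached blocks are evicted least-recently-used."""
        def __init__(self, cache_capacity, lifetime_secs=7 * 24 * 3600):
            self.lifetime = lifetime_secs
            self.long_term = {}                    # block_id -> (data, expiry time)
            self.cache = OrderedDict()             # block_id -> data, in LRU order
            self.cache_capacity = cache_capacity

        def put_long_term(self, block_id, data):
            self.long_term[block_id] = (data, time.time() + self.lifetime)

        def refresh(self, block_id):
            """Publishers call this periodically to keep their blocks alive."""
            if block_id in self.long_term:
                data, _ = self.long_term[block_id]
                self.long_term[block_id] = (data, time.time() + self.lifetime)

        def put_cached(self, block_id, data):
            self.cache[block_id] = data
            self.cache.move_to_end(block_id)
            if len(self.cache) > self.cache_capacity:
                self.cache.popitem(last=False)     # evict the least recently used block

        def get(self, block_id):
            if block_id in self.long_term:
                data, expiry = self.long_term[block_id]
                if time.time() < expiry:
                    return data
                del self.long_term[block_id]       # expired: publisher never refreshed it
            if block_id in self.cache:
                self.cache.move_to_end(block_id)
                return self.cache[block_id]
            return None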

  35. Preventing Flooding • What prevents a malicious host from inserting junk data? • Answer: not much • Kludge: capacity limit (e.g., 0.1%) • Per source/destination pair • How real is this problem? • Refresh requirement allows recovery

  36. Review of Goals and Assumptions (File System and DHash/Chord Layers) • Data integrity • Using secure hashes • Limited support for updates • Distribute load; avoid hot spots • High availability • Self managing/repairing • Exploit locality • Assumptions • Many participants • Heterogeneous nodes

  37. Outline • File system structure • Review: Chord distributed hashing • DHash block management • CFS evaluation • Brief overview of Oceanstore • Discussion

  38. CFS Project Status • Working prototype software • Some abuse prevention mechanisms • SFSRO file system client • Guarantees authenticity of files, updates, etc. • Some measurements on RON testbed • Simulation results to test scalability

  39. Experimental Setup (12 nodes) [Figure: testbed map with long-haul links to vu.nl, lulea.se, ucl.uk, kaist.kr, and a host in .ve] • One virtual node per host • 8 KByte blocks • RPCs use UDP • Caching turned off • Proximity routing turned off

  40. CFS Fetch Time for 1MB File [Figure: fetch time (seconds) vs. prefetch window (KBytes)] • Average over the 12 hosts • No replication, no caching; 8 KByte blocks

  41. Distribution of Fetch Times for 1MB [Figure: fraction of fetches vs. time (seconds), for 8, 24, and 40 KByte prefetch windows]

  42. CFS Fetch Time vs. Whole File TCP [Figure: fraction of fetches vs. time (seconds), comparing 40 KByte prefetch with whole-file TCP]

  43. Robustness vs. Failures [Figure: failed lookups (fraction) vs. failed nodes (fraction)] • Six replicas per block; (1/2)^6 ≈ 0.016

  44. Much Related Work • SFSRO (Secure file system, read only) • Freenet • Napster • Gnutella • PAST • CAN • …

  45. Later Work: Read/Write Filesystems • Ivy and Eliot • Extend the idea to read/write file systems • Full NFS or AFS-like semantics • All the nasty issues of distributed file systems • Consistency • Partitioning • Conflicts • Oceanstore (earlier work, actually)

  46. Outline • File system structure • Review: Chord distributed hashing • DHash block management • CFS evaluation • Brief overview of Oceanstore • Discussion

  47. CFS Summary (from their talk) • CFS provides peer-to-peer r/o storage • Structure: DHash and Chord • It is efficient, robust, and load-balanced • It uses block-level distribution • The prototype is as fast as whole-file TCP

  48. OceanStore (Similarities to CFS) • Ambitious global-scale storage “utility” built from an overlay network • Similar goals: • Distributed, replicated, highly available • Explicit caching • Similar solutions: • Hash-based identifiers • Multi-hop overlay-network routing

  49. OceanStore (Differences) • “Utility” model • Implies some managed layers • Handles explicit motivation of participants • Many applications/interfaces • Filesystem, web content, PDA sync., e-mail

  50. OceanStore (More Differences) • Explicit support for updates • Built into system at almost every layer • Uses Byzantine commit • Requires “maintained” inner ring • Updates on encrypted data • Plaxton tree based routing • Fast, probabilistic method • Slower, reliable method • Erasure codes limit replication overhead
