Taming Aggressive Replication in the Pangaea Wide-area File System

Explore the design, benefits, and structure of the Pangaea Wide-area File System, utilizing graph-based replica management and optimistic coordination for improved availability and network efficiency.

Presentation Transcript


  1. Taming Aggressive Replication in the Pangaea Wide-area File System. Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam. Presented by Jason Waddle

  2. Pangaea: Wide-area File System • Support the daily storage needs of distributed users. • Enable ad-hoc data sharing.

  3. Pangaea Design Goals • Speed: hide wide-area latency; file access time ~ local file system • Availability & autonomy: avoid single point of failure; adapt to churn • Network economy: minimize use of wide-area network; exploit physical locality

  4. Pangaea Assumptions (Non-goals) • Servers are trusted • Weak data consistency is sufficient (consistency in seconds)

  5. Symbiotic Design

  6. Symbiotic Design • Autonomous: each server operates when disconnected from the network.

  7. Symbiotic Design • Autonomous: each server operates when disconnected from the network. • Cooperative: when connected, servers cooperate to enhance overall performance and availability.

  8. Pervasive Replication • Replicate at file/directory level • Aggressively create replicas: whenever a file or directory is accessed • No single “master” replica • A replica may be read / written at any time • Replicas exchange updates in a peer-to-peer fashion

  9. Graph-based Replica Management • Replicas connected in a sparse, strongly-connected, random graph • Updates propagate along edges • Edges used for discovery and removal

  10. Benefits of Graph-based Approach • Inexpensive • Graph is sparse, adding/removing replicas O(1) • Available update distribution • As long as graph is connected, updates reach every replica • Network economy • High connectivity for close replicas, build spanning tree along fast edges

  11. Optimistic Replica Coordination • Aim for maximum availability over strong data consistency • Any node issues updates at any time • Update transmission and conflict resolution in background

  12. Optimistic Replica Coordination • “Eventual consistency” (~ 5s in tests) • No strong consistency guarantees: no support for locks, lock-files, etc.

  13. Pangaea Structure (diagram) • Region (< 5 ms RTT) • Server or node

  14. Server Structure (diagram labels) • I/O request (application) • NFS client • Kernel space • User space • Pangaea server: NFS protocol handler, replication engine, log, membership • Inter-node communication

  15. Server Modules • NFS protocol handler • Receives requests from apps, updates local replicas, generates requests to

  16. Server Modules • NFS protocol handler • Receives requests from apps, updates local replicas, generates requests to • Replication engine • Accepts local and remote requests • Modifies replicas • Forwards requests to other nodes

  17. Server Modules • NFS protocol handler • Receives requests from apps, updates local replicas, generates requests to • Replication engine • Accepts local and remote requests • Modifies replicas • Forwards requests to other nodes • Log module • Transaction-like semantics for local updates

  18. Server Modules • Membership module maintains: • List of regions, their members, estimated RTT between regions • Location of root directory replicas • Information coordinated by gossiping • “Landmark” nodes bootstrap newly joining nodes • Maintaining RTT information: main scalability bottleneck

  19. File System Structure • Gold replicas • Listed in directory entries • Form clique in replica graph • Fixed number (e.g., 3) • All replicas (gold and bronze) • Unidirectional edges to all gold replicas • Bidirectional peer-edges • Backpointer to parent directory

  20. File System Structure /joe /joe/foo

  21. File System Structure
      struct Replica
        fid: FileID
        ts: TimeStamp
        vv: VersionVector
        goldPeers: Set(NodeID)
        peers: Set(NodeID)
        backptrs: Set(FileID, String)
      struct DirEntry
        fname: String
        fid: FileID
        downlinks: Set(NodeID)
        ts: TimeStamp
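
The two records on this slide can be written down as Python dataclasses. This is only a sketch: the slide does not give concrete types, so modeling `FileID` and `NodeID` as strings, timestamps as floats, and the version vector as a per-node counter map are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# Assumption: identifiers are opaque strings; the slide does not specify them.
FileID = str
NodeID = str


@dataclass
class Replica:
    fid: FileID
    ts: float = 0.0                                        # time of the latest update
    vv: Dict[NodeID, int] = field(default_factory=dict)    # version vector
    gold_peers: Set[NodeID] = field(default_factory=set)   # edges to all gold replicas
    peers: Set[NodeID] = field(default_factory=set)        # bidirectional peer-edges
    # Backpointers: (parent directory FileID, name of this file in that directory)
    backptrs: Set[Tuple[FileID, str]] = field(default_factory=set)


@dataclass
class DirEntry:
    fname: str
    fid: FileID
    downlinks: Set[NodeID] = field(default_factory=set)    # nodes holding gold replicas
    ts: float = 0.0
```

Each instance gets its own sets and dictionary through `default_factory`, so two replicas never share peer lists by accident.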

  22. File Creation • Select locations for g gold replicas (e.g., g=3) • One on current server • Others on random servers from different regions • Create entry in parent directory • Flood updates • Update to parent directory • File contents (empty) to gold replicas
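
A minimal sketch of the gold-replica placement step, assuming the membership information is available as a dictionary from region name to node list. The function name `choose_gold_nodes` is hypothetical, and the sketch assumes that "different regions" means the remote gold replicas land in mutually distinct regions, which the slide does not spell out.

```python
import random


def choose_gold_nodes(current, regions, g=3, rng=None):
    """Pick nodes for g gold replicas.

    current: (region_name, node_id) of the server creating the file.
    regions: dict mapping region_name -> list of node ids (hypothetical layout).
    Returns the current node first, then up to g-1 nodes from other regions.
    """
    rng = rng or random.Random()
    cur_region, cur_node = current
    chosen = [cur_node]                       # one replica on the current server
    others = [r for r in sorted(regions) if r != cur_region and regions[r]]
    rng.shuffle(others)
    for region in others[: g - 1]:            # one random server per remote region
        chosen.append(rng.choice(regions[region]))
    return chosen
```

If fewer than g-1 other regions exist, the sketch simply returns fewer gold locations; how the real system handles that case is not stated on the slide.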

  23. Replica Creation • Recursively get replicas for ancestor directories • Find a close replica (shortcutting) • Send request to the closest gold replica • Gold replica forwards request to its neighbor closest to requester, who then sends

  24. Replica Creation • Select m peer-edges (e.g., m=4) • Include a gold replica (for future shortcutting) • Include closest neighbor from a random gold replica • Get remaining nodes from random walks starting at a random gold replica • Create m bidirectional peer-edges
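
The peer-edge selection above can be sketched as follows. The graph is modeled as a dictionary from node to neighbor set, and `rtt` is a caller-supplied function estimating the round-trip time from the requester to a node. The walk length, the retry cap, and all function names are assumptions for illustration, not values from the talk.

```python
import random


def random_walk(graph, start, steps, rng):
    """Follow `steps` random edges from `start` and return the node reached."""
    node = start
    for _ in range(steps):
        neighbors = sorted(graph[node])
        if not neighbors:
            break
        node = rng.choice(neighbors)
    return node


def select_peers(graph, gold, rtt, m=4, walk_len=3, max_tries=100, rng=None):
    """Choose up to m peers for a new replica."""
    rng = rng or random.Random()
    golds = sorted(gold)
    peers = {rng.choice(golds)}               # a gold replica, for future shortcutting
    neighbors = sorted(graph[rng.choice(golds)])
    if neighbors:
        peers.add(min(neighbors, key=rtt))    # closest neighbor of a random gold replica
    tries = 0
    while len(peers) < m and tries < max_tries:
        # Remaining peers come from random walks starting at a random gold replica.
        peers.add(random_walk(graph, rng.choice(golds), walk_len, rng))
        tries += 1
    return peers


def connect(graph, new_node, peers):
    """Create bidirectional peer-edges between new_node and each peer."""
    graph.setdefault(new_node, set())
    for p in peers:
        graph[new_node].add(p)
        graph[p].add(new_node)
```

Because the walks start at gold replicas and follow existing edges, the new replica attaches to nodes that are already part of the connected graph.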

  25. Bronze Replica Removal • To recover disk space • Using GD-Size algorithm, throw out largest, least-accessed replica • Drop useless replicas • Too many updates before an access (e.g., 4) • Must notify peer-edges of removal; peers use random walk to choose new edge
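
For readers unfamiliar with GD-Size (GreedyDual-Size), here is a small sketch of the eviction policy on its own, outside Pangaea. Each entry carries a value H = L + cost/size, where L is an inflation value that rises to the H of the last victim; the entry with the smallest H is evicted first. Using a uniform cost of 1 is an assumption here, and it makes large, rarely accessed entries the first to go, which matches the slide's description. The class name `GDSizeCache` is hypothetical.

```python
class GDSizeCache:
    def __init__(self, capacity):
        self.capacity = capacity   # total bytes available for replicas
        self.used = 0
        self.inflation = 0.0       # "L" in the usual description of the algorithm
        self.h = {}                # key -> H value
        self.size = {}             # key -> size in bytes

    def access(self, key, size, cost=1.0):
        """Record an access; return the list of keys evicted to make room."""
        evicted = []
        if key not in self.size:
            while self.size and self.used + size > self.capacity:
                victim = min(self.h, key=self.h.get)
                self.inflation = self.h.pop(victim)   # L rises to the victim's H
                self.used -= self.size.pop(victim)
                evicted.append(victim)
            self.size[key] = size
            self.used += size
        # Every access refreshes H, so recently used entries survive longer.
        self.h[key] = self.inflation + cost / self.size[key]
        return evicted
```

In this sketch an 80-byte entry is evicted before a 10-byte one with the same access history, because its H value (1/80) is smaller.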

  26. Replica Updates • Flood entire file to replica graph neighbors • Updates reach all replicas as long as the graph is strongly connected • Optional: user can block on update until all neighbors reply (red-button mode) • Network economy???

  27. Optimized Replica Updates • Send only differences (deltas) • Include old timestamp, new timestamp • Only apply delta to replica if old timestamp matches • Revert to full-content transfer if necessary • Merge deltas when possible
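
The timestamp check on deltas can be illustrated with a short sketch. The delta format used here (overwrite a byte range at an offset) is an assumption made to keep the example small; the slide only says that a delta carries the old and new timestamps and is applied when the old timestamp matches.

```python
from dataclasses import dataclass


@dataclass
class Delta:
    old_ts: int    # timestamp the sender's replica had before the update
    new_ts: int    # timestamp after the update
    offset: int    # hypothetical payload: overwrite bytes starting here
    data: bytes


@dataclass
class LocalReplica:
    ts: int
    content: bytes


def apply_delta(replica, delta):
    """Apply delta if it is based on the replica's current version.

    Returns False when the timestamps do not match, in which case the
    caller should fall back to a full-content transfer.
    """
    if replica.ts != delta.old_ts:
        return False
    end = delta.offset + len(delta.data)
    replica.content = (
        replica.content[: delta.offset] + delta.data + replica.content[end:]
    )
    replica.ts = delta.new_ts
    return True
```

A replica that missed an intermediate update fails the check, leaves its content untouched, and signals that the whole file is needed.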

  28. Optimized Replica Updates • Don’t send large (e.g., > 1KB) updates to each of m neighbors • Instead, use harbingers to dynamically build a spanning-tree update graph • Harbinger: small message with update’s timestamps • Send updates along spanning-tree edges • Happens in two phases

  29. Optimized Replica Updates • Exploit Physical Topology • Before pushing a harbinger to a neighbor, add a random delay ~ RTT (e.g., 10*RTT) • Harbingers propagate down fastest links first • Dynamically builds an update spanning-tree with fast edges
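
A small event-driven simulation shows why RTT-proportional delays produce a spanning tree over fast edges. It is a sketch under several assumptions: the delay is drawn uniformly between 0.5 and 1.5 times `factor * RTT`, one-way travel time is RTT/2, and the first harbinger to reach a node determines that node's parent. The function name `build_update_tree` is hypothetical.

```python
import heapq
import random


def build_update_tree(rtt, origin, factor=10.0, rng=None):
    """Simulate harbinger flooding; return {node: parent} for the update tree.

    rtt: dict mapping node -> {neighbor: round-trip time}.
    """
    rng = rng or random.Random()
    parent = {origin: None}
    events = []          # heap of (arrival_time, seq, sender, receiver)
    seq = 0              # tie-breaker so heap entries always compare

    def forward(node, now):
        nonlocal seq
        for nbr, r in rtt[node].items():
            if nbr == parent[node]:
                continue                     # never send back to the parent
            delay = factor * r * rng.uniform(0.5, 1.5)
            heapq.heappush(events, (now + delay + r / 2, seq, node, nbr))
            seq += 1

    forward(origin, 0.0)
    while events:
        now, _, sender, receiver = heapq.heappop(events)
        if receiver in parent:
            continue                         # already saw a harbinger; ignore
        parent[receiver] = sender            # first harbinger wins the edge
        forward(receiver, now)
    return parent
```

With nodes A, B, C where A-B and B-C are fast (RTT 1) and A-C is slow (RTT 100), the harbinger reaches C through B long before the direct A-C copy is even sent, so the update body travels A to B to C.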

  30. Update Example (Phase 1) B F A C D E

  36. Update Example (Phase 2) B F A C D E

  39. Conflict Resolution • Use a combination of version vectors and last-writer wins to resolve • If timestamps mismatch, full-content is transferred • Missing update: just overwrite replica
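
The combination of version vectors and last-writer-wins can be sketched as below. Version vectors detect whether one replica strictly supersedes the other; only when neither does (a true conflict) is the timestamp consulted. Representing a replica as a `(vv, ts, content)` tuple and breaking timestamp ties in favor of the first argument are assumptions for illustration.

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts of node -> counter)."""
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


def resolve(a, b):
    """a, b: (version_vector, timestamp, content). Return the surviving state."""
    order = compare(a[0], b[0])
    if order in ("a_newer", "equal"):
        return a                  # b only missed updates: overwrite it with a
    if order == "b_newer":
        return b
    # Concurrent updates: last writer wins, and the vectors are merged so the
    # result dominates both inputs.
    winner = a if a[1] >= b[1] else b
    merged = {n: max(a[0].get(n, 0), b[0].get(n, 0)) for n in set(a[0]) | set(b[0])}
    return (merged, winner[1], winner[2])
```

Last-writer-wins is only meaningful if server clocks are roughly synchronized, since the timestamps being compared come from different nodes.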

  40. Regular File Conflict (Three Solutions) • Last-writer-wins, using update timestamps • Requires server clock synchronization • Concatenate both updates • Make the user fix it • Possibly application-specific resolution

  41. Directory Conflict alice$ mv /foo /alice/foo bob$ mv /foo /bob/foo

  42. Directory Conflict alice$ mv /foo /alice/foo bob$ mv /foo /bob/foo /bob replica set /alice replica set

  43. Directory Conflict alice$ mv /foo /alice/foo bob$ mv /foo /bob/foo Let the child (foo) decide! • Implement mv as a change to the file’s backpointer • Single file resolves conflicting updates • File then updates affected directories
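
The "let the child decide" idea can be sketched by treating each `mv` as a timestamped proposal for the file's backpointer. The slide does not say which rule the file uses to pick among conflicting moves; choosing the one with the latest timestamp is an assumption here, as are the function names and the dictionary layout of directories.

```python
def resolve_moves(current, moves):
    """Pick the winning backpointer.

    current and each move are (timestamp, parent_dir, name) tuples.
    Assumption: the proposal with the latest timestamp wins.
    """
    return max([current] + list(moves), key=lambda bp: bp[0])


def update_directories(dirs, fid, candidates, winner):
    """Make the directories agree with the file's chosen backpointer.

    dirs: dict mapping directory path -> {name: file id}.
    The file is removed from every candidate location, then added to the winner.
    """
    for _, parent, name in candidates:
        if dirs.get(parent, {}).get(name) == fid:
            del dirs[parent][name]
    _, parent, name = winner
    dirs.setdefault(parent, {})[name] = fid
```

Because the decision is made once, at the file, the `/alice` and `/bob` replica sets never need to negotiate with each other directly; each is simply told the outcome.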

  44. Temporary Failure Recovery • Log outstanding remote operations • Update, random walk, edge addition, etc. • Retry logged updates • On reboot • On recovery of another node • Can create superfluous edges • Retains m-connectedness

  45. Permanent Failures • A garbage collector (GC) scans for failed nodes • Bronze replica on failed node • GC causes replica’s neighbors to replace link with a new peer using random walk

  46. Permanent Failure • Gold replica on failed node • Discovered by another gold (clique) • Chooses new gold by random walk • Flood choice to all replicas • Update parent directory to contain new gold replica nodes • Resolve conflicts with last-writer-wins • Expensive!

  47. Performance – LAN Andrew-Tcl benchmarks, time in seconds

  48. Performance – Slow Link The importance of local replicas

  49. Performance – Roaming Compile on C1 then time compile on C2. Pangaea utilizes fast links to a peer’s replicas.

  50. Performance: Non-uniform Net A model of HP’s corporate network.
