www.inter-mezzo.org

A new Distributed File System
Peter J. Braam, braam@cs.cmu.edu
Carnegie Mellon University & Stelias Computing
Overview
• Joint work with Michael Callahan & Phil Schwan
• Distributed file systems: protocols, semantics, usage patterns
• InterMezzo: purpose, design, implementation
• Project plans
Distributed File Systems
Distributed File Systems
• Purpose: make remote files behave as if local
• Clients: receivers of files, suppliers of updates
• Servers: suppliers of files, receivers of updates
• Challenges:
  • semantics and protocols of sharing
  • performance
  • implementation and correctness
• Newer features: disconnection, reconnection, server replication, validation and conflict resolution
Semantics
• Unix I/O model:
  • shared memory model
  • writes visible to readers immediately
  • last write wins
• Network file systems:
  • weak semantics: "aging", "timeout" (NFS, SMB)
  • Unix semantics: Sprite, DCE/DFS, XFS
  • new semantics: Coda/InterMezzo/AFS
Network Semantics
• Propagate writes upon close: last close wins
• Callbacks: guarantee currency
  • client continues to use files until notified by the server
  • no connected client ever sees stale data
  • server maintains state
• Permits/tokens: guarantee exclusivity
  • client propagates updates lazily until notified
  • a major performance gain
  • server maintains state
• Validation after reconnecting: version stamps (sketch below)
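The validation step can be pictured with a minimal Perl sketch. The names and data layout are hypothetical; validate_rpc stands in for the protocol's Validate call, not InterMezzo's actual wire format:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-in for the Validate RPC: return the server's current version
    # stamp for an object, or undef if the object no longer exists.
    sub validate_rpc {
        my ($server, $id) = @_;
        return $server->{versions}{$id};
    }

    # After reconnecting, keep cached objects whose version stamps still
    # match the server's; drop the rest so they are refetched on demand.
    sub revalidate_cache {
        my ($server, $cache) = @_;
        for my $id (keys %$cache) {
            my $stamp = validate_rpc($server, $id);
            delete $cache->{$id}
                if !defined($stamp) || $stamp != $cache->{$id}{version};
        }
    }

    my $server = { versions => { foo => 2, bar => 1 } };
    my $cache  = { foo => { version => 1 }, bar => { version => 1 } };
    revalidate_cache($server, $cache);
    print join(", ", sort keys %$cache), "\n";    # prints "bar"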
Tradeoffs
• No semantics:
  • works amazingly well (so does C++ & the US Government)
• Unix semantics:
  • well defined, but must propagate writes
  • not suitable over modest-bandwidth networks
  • suitable for SAN file systems
• Network semantics:
  • optimal for lower-bandwidth situations, scales well
  • fails under heavy write/write sharing
Our inspiration: Coda
Features:
• disconnected operation
• server replication
• reintegration, resolution
• bandwidth adaptation
• good security model
• write-back caching
Performance
• Synchronous = BAD
  • RPCs are slow
  • context switches to the cache manager are slow
  • disk writes are slow
• InterMezzo
  • exploits good disk file systems
  • normal case: the speed of the local disk file system
  • gives the kernel autonomy
  • does write-back caching at the kernel level
InterMezzo
InterMezzo Strategy
• Protocol:
  • retain much of Coda's protocols and semantics
• Performance & scalability:
  • leverage disk file systems for the cache: filter driver
  • more kernel autonomy: kernel write-back cache
• Implementation:
  • make it SIMPLE
  • leverage existing code: TCP, diskfs, rsync
  • avoid threads: use async I/O with completions
InterMezzo overview
[Architecture diagram] Lento (Perl), the user-level cache manager and server, handles update propagation and fetching with the InterMezzo server. At the kernel level, application system calls pass through Presto, a VFS filter above the local file system, which asks "data fresh?"; if not, it issues upcalls (mkdir, create, rmdir, unlink, link, ...) to Lento. Modifications are recorded in a kernel update journal, the kernel modification log.
Example of kernel code

    int presto_file_open(struct dentry *de)
    {
            int rc;

            /* Cache management: opens by Lento itself go straight to the
             * bottom file system and mark the dentry as having data. */
            if (IAMLENTO) {
                    rc = bottom_fops->open(de);
                    mark_dentry(de, HAVE_DATA);
                    return rc;
            }

            /* Access filter: upcall to Lento to fetch the file if the
             * cached data is not fresh. */
            if (!check_dentry(de, HAVE_DATA))
                    lento_open_file(de);

            rc = bottom_fops->open(de);

            /* Write-back caching: journal the operation in the kernel
             * modification log for later reintegration. */
            journal("open", de->d_name);

            return rc;
    }
Overview of functionality
• Keep replicas of folder collections in sync
• Disconnected operation & reintegration
[Diagram: update propagation among replicators]
1. Client 1 modifies a folder collection.
2. Reintegrate: the client sends its modification log (mkdir, create, rmdir, store, ...) to the server (replay sketch below).
3. Forward: the server relays the log (mkdir, create, rmdir, store, ...) to the other replicators (Client 2, Client 3).
4. The replicators are synchronized.
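As a rough illustration, reintegration amounts to replaying the client's modification log against the server's replica. The record format below is invented for the example; InterMezzo's actual journal layout differs:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Path qw(make_path remove_tree);
    use File::Copy qw(copy);

    # Replay a modification log against a replica root. Each record is
    # [op, path, optional source file] in a made-up format.
    sub reintegrate {
        my ($root, @log) = @_;
        for my $rec (@log) {
            my ($op, $path, $src) = @$rec;
            my $target = "$root/$path";
            if    ($op eq 'mkdir')  { make_path($target) }
            elsif ($op eq 'create') { open(my $fh, '>', $target) or die $! }
            elsif ($op eq 'rmdir')  { remove_tree($target) }
            elsif ($op eq 'unlink') { unlink($target) or die $! }
            elsif ($op eq 'store')  { copy($src, $target) or die $! }
        }
    }

    reintegrate('replica',
        [ 'mkdir',  'docs'          ],
        [ 'create', 'docs/todo.txt' ],
    );

The server then forwards the same log to the other replicators, which apply it the same way.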
[Diagram: disconnected operation and reconnection]
1. While disconnected, the client journals its modifications (store, create, ...); the server retains journals for disconnected replicators.
2. Reconnect:
  a. the server forwards its modification journals
  b. conflicts are handled
  c. the client's journals are reintegrated
3. Client and server are synchronized.
File Service Protocol
Client Server Protocol
• File service:
  • FetchDir
  • FetchFile
• Modification service:
  • Reintegrate
• Consistency:
  • GetPermit/BreakPermit
  • Validate/BreakCallback (see the dispatch sketch below)
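To make the message set concrete, here is a small dispatch-table sketch in Perl. The operation names come from the slide; the handler bodies are placeholders, not the real protocol implementation:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Route an incoming protocol request to its handler by operation name.
    my %handlers = (
        FetchDir      => sub { print "send directory contents\n" },
        FetchFile     => sub { print "send file data\n" },
        Reintegrate   => sub { print "replay the client's modification log\n" },
        GetPermit     => sub { print "grant a write permit\n" },
        BreakPermit   => sub { print "revoke a write permit\n" },
        Validate      => sub { print "compare version stamps\n" },
        BreakCallback => sub { print "invalidate a client cache entry\n" },
    );

    sub dispatch {
        my ($req) = @_;
        my $handler = $handlers{ $req->{op} }
            or die "unknown operation: $req->{op}\n";
        $handler->($req);
    }

    dispatch({ op => 'FetchFile' });    # prints "send file data"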
Client/Server Symmetry
• Typical use:
  • client A fetches files
  • server fetches modified files from client A
  • client A sends its modification log to the server
  • server sends the modification log to replicators
• Code reuse: both sides need a modification log & file service
• Policy differs between client and server
InterMezzo implementation
• Coda & XFS experience:
  • threading is complicated
  • distributed state & locking: hard to track
  • don't implement your own cache
  • don't accumulate 500,000 lines of C
• Lento: learn from other efforts
  • Ericsson, Teapot, XFS, ACE: async request processing
  • completion routines & a state machine
  • verify protocol correctness with Murphi
  • a high-level language or framework
Blocking operations
• Disk & network I/O block
• Proactive reactor:
  • start an asynchronous operation
  • give a continuation & context to the reactor
  • the reactor activates the completion routine
• Advantages:
  • avoids threading and locking
  • very concise code describing protocols
  • state stays localized (see the toy reactor sketch below)
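A toy reactor in plain Perl shows the shape of the pattern. This simulates completions synchronously; the real Lento uses genuinely asynchronous I/O:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @completions;    # queue of [result, continuation] pairs

    # "Start" an asynchronous operation: run the work, then queue its
    # continuation instead of calling it inline.
    sub start_async {
        my ($work, $continuation) = @_;
        my $result = $work->();
        push @completions, [ $result, $continuation ];
    }

    # The reactor loop drains completions; each one may start new operations.
    sub run_reactor {
        while (my $c = shift @completions) {
            my ($result, $continuation) = @$c;
            $continuation->($result);
        }
    }

    start_async(
        sub { "file data" },                 # the (simulated) blocking work
        sub { print "fetched: $_[0]\n" },    # the completion routine
    );
    run_reactor();

Because every handler runs to completion before the next event fires, no locks are needed and each protocol step sits next to the state it manipulates.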
Perl for our prototype
State Machine Approach
• Introduce POE: the Perl Object Environment
  • can dynamically create sessions
  • hand blocking operations to the POE kernel
• Sessions have:
  • parents
  • state on a heap (or inline, in an object or class)
• Sessions do:
  • post events to other sessions
  • handle events posted to them
Example session: fetchfile

    # POE-style pseudocode (not runnable as-is): each state handler posts
    # events; the session destroys itself on completion or error.
    Fetchfile = new session( {
        init => {
            if (!have_attr) { post(conn, fetch_attr, have_attr); }
            else            { post(conn, fetch_data, complete);  }
        },
        have_attr => {
            if (status == success) { post(conn, fetch_data, complete); }
            else                   { destruct_session(error); }
        },
        new_filefetch => { queue_event(this); },    # queue concurrent fetches
        complete => {
            reply_to_caller;
            handle_queue;           # start the next queued fetch, if any
            destruct_session;
        },
        ...
    } );
Wheels, drivers, filters
• Wheels modify sessions
• Wheels exploit asynchronous drivers, e.g.:
  • read/write
  • socketfactory (accepts clients)
• Filters deliver "whole" packets, e.g.:
  • full request or data packets
  • unpacked kernel requests
• When I/O completes:
  • post to static sessions, or
  • create a dynamic session as wheel output (see the POE sketch below)
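For instance, a minimal POE server along these lines accepts connections with a SocketFactory wheel and hands each socket to a ReadWrite wheel. The port and event names are invented, and POE::Filter::Line stands in for the deck's packet/XDR filters:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POE qw(Wheel::SocketFactory Wheel::ReadWrite Filter::Line);

    POE::Session->create(
        inline_states => {
            _start => sub {
                # The SocketFactory wheel accepts clients asynchronously.
                $_[HEAP]{factory} = POE::Wheel::SocketFactory->new(
                    BindPort     => 12345,
                    SuccessEvent => 'got_client',
                    FailureEvent => 'got_error',
                );
            },
            got_client => sub {
                my $socket = $_[ARG0];
                # The ReadWrite wheel does async I/O; its filter delivers
                # whole packets (here: lines) to the input event.
                $_[HEAP]{client} = POE::Wheel::ReadWrite->new(
                    Handle     => $socket,
                    Filter     => POE::Filter::Line->new,
                    InputEvent => 'got_packet',
                    ErrorEvent => 'got_error',
                );
            },
            got_packet => sub { print "request: $_[ARG0]\n" },
            got_error  => sub { delete $_[HEAP]{client} },
        },
    );
    POE::Kernel->run;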
Our wheels...
• Wheels:
  • Upcall: kernel requests (unpack filter)
  • Packets: network RPC/data traffic (XDR filter)
  • SocketFactory: accepts new connections
• Instantiate request handlers for:
  • net requests
  • kernel upcall requests
InterMezzo Wheels
[Architecture diagram] Upcall, Packet, and SocketFactory wheels, plus timers, feed upcalls (from the Presto kernel module), packets, and connects (from sockets) into the ReqDispatcher session, which dispatches to static sessions and dynamically created upcall/netreq sessions.
Net Request Processing
[Flow diagram] A SocketFactory accepts connections for the acceptor(port) session, which keeps a list of client sessions (peer, port, etc.). Each Connection session starts (_start) and receives a PacketWheel (got_wheel). Incoming req events flow to the reqdispatcher, which spawns request sessions; these emit reply, data, endreq, and enddata events, with got_error raised on failure at each stage.
Upcall Request Processing
[Flow diagram] The UpcallWheel posts got_upcall events to the ReqDispatcher, which starts upcall sessions that resolve paths to volumes & servers. An UpcallProcessing session issues requests (req, reply, data, endreq, enddata) and obtains a connection via get_connection/got_connection from a Server object, which holds a connector session and the volumes hosted there. The connector(host, port) keeps a list of client sessions (peer, port, etc.) and, like the acceptor, starts Connection sessions that receive a PacketWheel (got_wheel) from a SocketFactory, with got_error on failure.
Project
See: www.inter-mezzo.org
What we have done
• So far mostly Linux
• 2,500 lines of C: Linux kernel code
• 3,800 lines of Perl
  • went through 4 total rewrites!
• Connected & disconnected operation: solid
• Reintegration: mostly working
• Usable, but not many features yet
Principal targets
• Focus on replication, not general caching:
  • scalable server replication
  • laptop/desktop home directory synchronization
• Clusters:
  • install & administer one machine
  • use InterMezzo to manage all of them
Forthcoming features
• Security
• Conflict handling
• Better admin tools
• Cache manager in C
• Variants with different semantics (locking, write sharing)
• Windows clients (?)
• ...