
Coda Server Internals




  1. Coda Server Internals Peter J Braam

  2. Contents • Data structure overview • Volumes • Vnodes • Inodes

  3. Data Structure Overview (object: purpose; resides where) • Inodes: file contents; /vicep* partitions • Volumes, vnodes, directory contents, ACLs, reslogs: metadata and directory contents; RVM • Volinfo records: volume location; VLDB, VRDB (RW db files) • VSGDB, pdb records, tokens: security; VSGDB, .pdb, .tk files (dynamic RO db files) • Servers/SCM, partitions, startup flags, skipvolumes, LOG & DATA & DB locators: configuration data; static data

  4. RVM layout (coda_globals.h) • Already_initialized (int) • struct VolHead[MAXVOLS] • struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE] • short SmallVnodeIndex • …. Same for large … • MaxVolId (unsigned long) • Remainder is dynamically allocated

  5. Volume zoo (volume.h, camprivate.h) • RVM: structures • VolumeData • VolHead • VolumeHeader • VolumeDiskData • VM: structures • Volume • VolumeInfo ……..

  6. VolHead • VolumeHeader: stamp, id, parentid, type • VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists (same for big) • A volume in RVM contains pointers to RVM-malloced data
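
The VolHead layout on this slide can be sketched in C. Field names follow the slide; the field types, the stub `VolumeDiskData` and `rec_smolist` structs, and the `volhead_init` helper are assumptions made for illustration, not the definitions in volume.h or camprivate.h. Plain `calloc` stands in for the RVM heap allocator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-ins for the real RVM-resident types (assumed, heavily simplified). */
struct VolumeDiskData { char name[32]; };
struct rec_smolist { void *head; };

struct VolumeHeader {
    uint32_t stamp;      /* version stamp (type assumed) */
    uint32_t id;
    uint32_t parentid;
    int      type;
};

struct VolumeData {
    struct VolumeDiskData *volumeDiskData;   /* separately allocated record */
    struct rec_smolist    *smallVnodeLists;  /* array of small-vnode lists */
    uint32_t nsmallVnodes;
    uint32_t nsmallLists;
    /* ... the same fields again for large vnodes ... */
};

struct VolHead {
    struct VolumeHeader header;
    struct VolumeData   data;
};

/* Hypothetical helper: fill in a VolHead and allocate the data it points to.
 * Returns 0 on success, -1 if allocation fails. */
int volhead_init(struct VolHead *vh, uint32_t id, uint32_t nlists)
{
    vh->header.stamp = 1;
    vh->header.id = id;
    vh->header.parentid = id;
    vh->header.type = 0;
    vh->data.volumeDiskData = calloc(1, sizeof *vh->data.volumeDiskData);
    vh->data.smallVnodeLists = calloc(nlists, sizeof *vh->data.smallVnodeLists);
    if (vh->data.volumeDiskData == NULL || vh->data.smallVnodeLists == NULL) {
        free(vh->data.volumeDiskData);
        free(vh->data.smallVnodeLists);
        return -1;
    }
    vh->data.nsmallVnodes = 0;
    vh->data.nsmallLists = nlists;
    return 0;
}
```

The point of the sketch is the indirection: the fixed-size VolHead array entry holds only pointers and counts, while the bulky per-volume data lives in separately allocated memory.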

  7. VolumeDiskData (rvm) • Lots of stuff: • Identity & location: partition, name, • runtime info: use, inService, blessed, salvaged • Vnode related: next uniquefier • Versionvector • Resolution flags, pointer to recov_vol_log • Quota • Resource usage: filecount, diskused etc

  8. Volumes in VM • struct Volumes sit in VolHash with copies of RVM data structures • Salvage before “attaching” to VolHash • Model of operation (FS): • GetVolume copies the volume out of RVM • Do your mods in VM • PutVolume does the RVM transaction • Model of operation (Volutil): • operate on RVM
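
The file-server model on this slide (copy out, modify in VM, write back in one transaction) can be illustrated with a small mock. Everything here is hypothetical: `rvm_copy` is an ordinary global standing in for recoverable memory, and `get_volume` / `put_volume` are simplified stand-ins for GetVolume and PutVolume. The real RVM transaction calls are only marked by comments.

```c
#include <string.h>

/* Two of the resource-usage fields named on the VolumeDiskData slide. */
struct VolumeDiskData { unsigned filecount; unsigned diskused; };

/* Mock "recoverable" copy of one volume, plus a commit counter. */
struct VolumeDiskData rvm_copy = { 10, 4096 };
int transactions_committed = 0;

/* GetVolume step: copy the record out of RVM into a VM working copy. */
void get_volume(struct VolumeDiskData *vm)
{
    memcpy(vm, &rvm_copy, sizeof *vm);
}

/* PutVolume step: write the VM copy back inside one (mock) transaction. */
void put_volume(const struct VolumeDiskData *vm)
{
    /* the server would begin an RVM transaction here */
    memcpy(&rvm_copy, vm, sizeof rvm_copy);
    /* ... and end (commit) it here */
    transactions_committed++;
}
```

Between `get_volume` and `put_volume` the recoverable copy is untouched, so an operation that fails halfway leaves no partial update behind.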

  9. Volumes in Venus RPC’s • One RPC: GetVolInfo • used for mount point traversal • Only relates to • volume location database • volume replication database • VSGDB • Could sit in separate Volume Location Server

  10. Vnodes (cvnode.h) • Small & large: large for directories • difference is ACL at back of large vnodes • Inode field: • small vnodes: points to diskfile inode number • large vnodes: is RVM address of dir inode • Contain important small structure: vv_t • Pointers to reslog entries • VM: cvnode’s with hash table, freelists etc

  11. Vnodes in RVM • RVM: VnodeDiskinfo (rvm_malloced) • vnodes sit on rec_smolists • each link points to a DiskVnode • lists link vnodes with identical vnodenumbers but different uniquefiers • new vnodes grabbed from FreeLists (index.cc, recov{a,b,c}.cc) • volumes have arrays of rec_smolists which grow when they are full
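
A minimal sketch of the lookup this arrangement implies: pick the list for the vnode number, then walk it comparing uniquefiers. The struct and function names are hypothetical, and using one array slot per vnode number is a simplifying assumption about how the server indexes its rec_smolists.

```c
#include <stddef.h>
#include <stdint.h>

struct DiskVnode {
    uint32_t vnodenumber;
    uint32_t uniquefier;
    struct DiskVnode *next;   /* next vnode with the same vnode number */
};

struct smolist { struct DiskVnode *head; };

/* Return the vnode matching (vnodenumber, uniquefier), or NULL if absent.
 * Assumes lists[] has one slot per vnode number. */
struct DiskVnode *find_vnode(struct smolist *lists, size_t nlists,
                             uint32_t vnodenumber, uint32_t uniquefier)
{
    if (vnodenumber >= nlists)
        return NULL;
    for (struct DiskVnode *v = lists[vnodenumber].head; v != NULL; v = v->next)
        if (v->uniquefier == uniquefier)
            return v;
    return NULL;
}
```

Each lookup follows the list array, then a list link, then the vnode itself, which is the kind of pointer chasing the talk later criticizes.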

  12. Vnodes in action • Model: • GetFSObj calls GetVnode • work is done • PutFS Objects calls • rvm_begin_transaction • ReplaceVnode - copies data from VM to RVM • rvm_end_transaction • Getting a vnode takes 3 pointer derefs, possibly 3 page faults vs. 1 for local file systems. • Is this necessary? Probably not. Cure it: yes!

  13. Directories (rvm) • DirInode • page table and “copy on write” refcount • DirPages 2048 bytes each • build up the directory • divided into 64 32byte blobs • Hash table for fast name lookups • Blob Freelist • Array of free blobs per page
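
The page arithmetic follows directly from the sizes on the slide: 2048 / 32 = 64 blobs per page. The sketch below maps a blob number to a page index and byte offset. The names are hypothetical, and it ignores any per-page header the real directory format may reserve.

```c
#include <stddef.h>

enum {
    DIR_PAGE_SIZE  = 2048,                          /* bytes per DirPage */
    DIR_BLOB_SIZE  = 32,                            /* bytes per blob */
    BLOBS_PER_PAGE = DIR_PAGE_SIZE / DIR_BLOB_SIZE  /* 64 */
};

struct blob_addr {
    size_t page;     /* index into the DirInode page table */
    size_t offset;   /* byte offset of the blob within that page */
};

/* Translate a directory-wide blob number into (page, offset). */
struct blob_addr locate_blob(size_t blobno)
{
    struct blob_addr a;
    a.page = blobno / BLOBS_PER_PAGE;
    a.offset = (blobno % BLOBS_PER_PAGE) * DIR_BLOB_SIZE;
    return a;
}
```

For example, blob 70 falls on page 1 at byte offset 192, since 70 = 64 + 6 and 6 × 32 = 192.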

  14. Directories • More than one vnode can point to directory (copy on write) • VM: hash table of DirHandles • point to VM contiguous copy of dir • point to DirInode • have a lock etc • Model: as for volumes & vnodes • Critique: too baroque

  15. Files • Vnode references file by InodeNumber • Files are copy on write • There are “FileInodes” like dir inodes, but they are held in external DB or in inode itself • Server always reads/writes whole files (could be exploited)

  16. Volinit and salvage • Set up volume hash table, serverlist, DiskPartitionList • Cycle through partitions, check each for • list of inodes • every inode has a vnode • every vnode has a directory name • every directory name has a vnode • Put volume in a VM hash table

  17. Server connection info • Array of HostEntry (a “venus”) • Contains a linked list of connections • Contains a callback connection id • Connection setup • first binding creates a host & callback conn • new binding creates a new connection and verifies callback • in RPC2_NewBinding & ViceNewConnectFS

  18. Callbacks • Hashtable of FileEntries: • each contains Fid • number of users • linked list of callbacks • Callbacks: point to HostEntry • Ops: • RPC: BreakCallBack • Local: placing, delete, deleteVenus
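
The callback structures can be sketched as below. This is an illustration under assumptions: the field types and the helper names `add_callback` and `break_callbacks` are hypothetical, the hash table is omitted, and a per-host counter stands in for the BreakCallBack RPC. Sparing the host that made the update is also an assumption, not something the slide states.

```c
#include <stdlib.h>

struct HostEntry { int id; int callbacks_broken; };

struct CallBack {
    struct HostEntry *host;     /* callbacks point to a HostEntry */
    struct CallBack  *next;
};

struct Fid { unsigned volume, vnode, unique; };

struct FileEntry {
    struct Fid fid;
    int users;                  /* number of callbacks currently held */
    struct CallBack *callbacks; /* linked list of callbacks */
};

/* Place a callback for host on fe. Returns 0, or -1 if out of memory. */
int add_callback(struct FileEntry *fe, struct HostEntry *host)
{
    struct CallBack *cb = malloc(sizeof *cb);
    if (cb == NULL)
        return -1;
    cb->host = host;
    cb->next = fe->callbacks;
    fe->callbacks = cb;
    fe->users++;
    return 0;
}

/* Break every callback on fe except those of `except`; return the count. */
int break_callbacks(struct FileEntry *fe, const struct HostEntry *except)
{
    int broken = 0;
    struct CallBack **pp = &fe->callbacks;
    while (*pp != NULL) {
        struct CallBack *cb = *pp;
        if (cb->host == except) {   /* keep this one, move on */
            pp = &cb->next;
            continue;
        }
        cb->host->callbacks_broken++;   /* stands in for the RPC */
        *pp = cb->next;                 /* unlink and free */
        free(cb);
        fe->users--;
        broken++;
    }
    return broken;
}
```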

  19. Callbacks • Connection is non-authenticated. Should be fixed. Session key for CB connection should not expire. • Side effect of callback connection is used for BackFetch bulk transfer of files during reintegration.

  20. RPC processing • Venus RPC’s: • srvproc.cc - standard file ops • srvproc2.cc - standard volume ops • codaproc.cc - repair stuff • codaproc2.cc - reintegration stuff • Volutil RPC’s: • vol-your-rpc.cc (in coda-src/volutil) • Resolution: below

  21. RPC processing • RPC structure: • ValidateParms: validate, hand off COP2, cid • GetObject: vm copy, lock objects • CheckSemantics: • Concurrency, Integrity, Permissions • Perform operations: • BulkTransfer, UpdateObjects, OutParms • PutObject: rvm transactions, inode deletions

  22. vlists • GetFSObjects: instantiate a vlist • RPC needs list of objects copied from RVM • Modification status is held there (did CopyOnWrite kick in etc) • PutObjects • rvm_begin_transaction • walk through the list, copy, rvm_set_range, unlock • rvm_end_transaction
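
The PutObjects walk can be sketched as a single pass over the vlist. The types and the `put_objects` name are hypothetical; the recoverable copy is ordinary memory here, and the RVM calls named on the slide appear only as comments marking where they would go.

```c
#include <stddef.h>
#include <string.h>

struct vnode_rec { int length; int version; };

struct vle {                    /* one vlist entry */
    struct vnode_rec *rvm;      /* recoverable copy of the object */
    struct vnode_rec  vm;       /* working copy the RPC operated on */
    int modified;               /* did the RPC change this object? */
    int locked;
    struct vle *next;
};

/* Copy modified entries back and unlock every entry.
 * Returns how many entries were written to the recoverable copy. */
int put_objects(struct vle *vlist)
{
    int written = 0;
    /* rvm_begin_transaction would go here */
    for (struct vle *e = vlist; e != NULL; e = e->next) {
        if (e->modified) {
            /* rvm_set_range on *e->rvm would go with this write */
            memcpy(e->rvm, &e->vm, sizeof *e->rvm);
            written++;
        }
        e->locked = 0;
    }
    /* rvm_end_transaction would go here */
    return written;
}
```

Keeping the modification status on the list entry means unmodified objects cost nothing at commit time beyond the unlock.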

  23. COP2 handling • In COP2, Venus gives the final VV to the server • COP2 messages are sent out by Venus (with some delay), often piggybacked in bulk • The server tracks pending COP2 entries in a hash table (coppend.cc) • Manager thread CopPendingManager • Runs every minute • Removes entries more than 900 secs old
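
The manager's expiry rule can be sketched as one sweep over the pending entries. The 900-second limit is from the slide; the flat array (instead of the hash table in coppend.cc), the struct fields, and the function name are assumptions for illustration.

```c
#include <stddef.h>
#include <time.h>

enum { COP_PENDING_MAX_AGE = 900 };   /* seconds, per the slide */

struct cop_pending {
    unsigned storeid;   /* identifies the update awaiting its COP2 */
    time_t   created;   /* when the entry was added */
    int      in_use;
};

/* Drop entries older than COP_PENDING_MAX_AGE; return how many were dropped.
 * A manager thread would call this roughly once a minute. */
int expire_cop_pending(struct cop_pending *tab, size_t n, time_t now)
{
    int dropped = 0;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].in_use &&
            difftime(now, tab[i].created) > COP_PENDING_MAX_AGE) {
            tab[i].in_use = 0;
            dropped++;
        }
    }
    return dropped;
}
```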

  24. Cop2 to RVM • Data can be • PiggyBacked on another rpc • sent in ViceCop2 rpc. • Both cases call InternalCop2 (srvproc.cc) • InternalCop2 (codaproc.cc) • notifies the manager to dequeue • gets the FS objects listed for the COP2 • installs final VV’s into RVM (rvm transaction!)

  25. COP2 Problems • Easy cause of conflicts in replicated volumes when clients access objects in rapid succession. (Can be fixed easily during the writeback caching operation) • Not optimized for singly replicated volume.

  26. Resolution • Initiated by client with RPC to coordinator • ViceResolve (codaproc.cc) • coordinator • sets up connections in VSG (unauthenticated) • LockAndFetch (res/reslock, resutil): • lock volumes, • collect “closure”

  27. Resolution - special cases • RegResDirRequired (rvmres/rvmrescoord.cc) • check for • unresolved ancestors • already inconsistent • runts (missing objects) • weak equality (identical storeid)
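
The weak-equality case can be made concrete with a version-vector comparison. This is a sketch under assumptions: `VSG_MEMBERS` is an assumed replica count, the store id is reduced to a single integer, and the function names are hypothetical. It treats two replicas as weakly equal when their version vectors differ but they carry the same store id.

```c
#include <stddef.h>

enum { VSG_MEMBERS = 8 };   /* assumed maximum number of replicas */

struct vv {
    unsigned versions[VSG_MEMBERS];  /* one update counter per replica */
    unsigned storeid;                /* simplified id of the last update */
};

enum vv_order { VV_EQUAL, VV_DOMINATES, VV_SUBMISSIVE, VV_INCONSISTENT };

/* Compare two version vectors slot by slot. */
enum vv_order vv_compare(const struct vv *a, const struct vv *b)
{
    int a_ahead = 0, b_ahead = 0;
    for (size_t i = 0; i < VSG_MEMBERS; i++) {
        if (a->versions[i] > b->versions[i]) a_ahead = 1;
        if (a->versions[i] < b->versions[i]) b_ahead = 1;
    }
    if (a_ahead && b_ahead) return VV_INCONSISTENT;
    if (a_ahead) return VV_DOMINATES;
    if (b_ahead) return VV_SUBMISSIVE;
    return VV_EQUAL;
}

/* Weak equality: the vectors differ, but both record the same store id. */
int weakly_equal(const struct vv *a, const struct vv *b)
{
    return vv_compare(a, b) != VV_EQUAL && a->storeid == b->storeid;
}
```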

  28. RecovDirResolve • Phase II: (rvmres/{rescoord,subphase?}.cc) • coordinator requests logs from other servers • subordinates lock affected dirs, marshal logs • coordinator merges logs • Phase III: • ship merged log to subordinates • perform operations on VM copies • return results to coordinator

  29. Resolution • Phase IV: (is old Phase 3 …) • collect results, compute new VV’s, ship to subordinates • commit results

  30. Comments on resolution • Old versions of resolution: • OldDirResolve: resolve only runts and weak equality • DirResolve: resolve only in VM • Remove these • The resolve directory has nothing to do with resolution and should be called librepair: srv uses only one function in it, repair uses the rest

  31. Volume Log • During FS operations, log entries are created for use during resolution • Different format per operation (rvmres/recov_vollog.cc) • Added to the vlist by SpoolVMLogRecord • Put in RVM at commit time

  32. Repair • Venus makes ViceRepair RPC • File and symlink repair: BulkTransfer the object • Directory repair: BulkTransfer the repair file and replay operations • Venus follows this with a COP2 multi rpc • For directory repair Venus invokes asynchronous resolve

  33. Future • Good: • Design is simple and efficient • There is little C++: should eliminate • easy to multi-thread • Bad: • Scalability ~8GB in practice, ~40GB in theory • Data handling is bad: tricky to fix • Volume code was & is worst: rewrite
