Coda Server Internals Peter J Braam
Contents • Data structure overview • Volumes • Vnodes • Inodes
Data Structure Overview (object: purpose; resides where) • Inodes: file contents; /vicep* partitions • Volumes, vnodes, directory contents, ACLs, reslogs: metadata and directory contents; RVM • Volinfo records: volume location; VLDB, VRDB (RW db files) • VSGDB, pdb records, tokens: security; VSGDB, .pdb, .tk files (dynamic RO db files) • Servers/SCM, partitions, startup flags, skipvolumes, LOG & DATA & DB locators: configuration data; static data
RVM layout (coda_globals.h) • Already_initialized (int) • struct VolHead[MAXVOLS] • struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE] • short SmallVnodeIndex • …. Same for large … • MaxVolId (unsigned long) • Remainder is dynamically allocated
Volume zoo (volume.h, camprivate.h) • RVM: structures • VolumeData • VolHead • VolumeHeader • VolumeDiskData • VM: structures • Volume • VolumeInfo ……..
VolHead • VolumeHeader: stamp, id, parentid, type • VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists (same for big) • A volume in RVM contains pointers to RVM-malloced data
VolumeDiskData (rvm) • Lots of stuff: • Identity & location: partition, name, • runtime info: use, inService, blessed, salvaged • Vnode related: next uniquefier • Versionvector • Resolution flags, pointer to recov_vol_log • Quota • Resource usage: filecount, diskused etc
Volumes in VM • struct Volumes sit in VolHash with copies of RVM data structures • Salvage before “attaching” to VolHash • Model of operation (FS): • GetVolume copies the volume out from RVM • Do your mods in VM • PutVolume does the RVM transaction • Model of operation (Volutil): • operate directly on RVM
Volumes in Venus RPCs • One RPC: GetVolInfo • used for mount point traversal • Only relates to • volume location database • volume replication database • VSGDB • Could sit in a separate Volume Location Server
Vnodes (cvnode.h) • Small & large: large for directories • difference is the ACL at the back of large vnodes • Inode field: • small vnodes: points to disk file inode number • large vnodes: is RVM address of dir inode • Contain important small structure: vv_t • Pointers to reslog entries • VM: cvnodes with hash table, freelists etc
Vnodes in RVM • RVM: VnodeDiskinfo (rvm_malloced) • vnodes sit on rec_smolists • each link points to a DiskVnode • lists link vnodes with identical vnodenumbers but different uniquefiers • new vnodes grabbed from FreeLists (index.cc, recov{a,b,c}.cc) • volumes have arrays of rec_smolists which grow when they are full
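The list layout described on this slide can be sketched roughly as follows. This is a minimal Python model, not the server's code: `DiskVnode`, `VnodeIndex`, and the rule that the slot index equals the vnode number are illustrative assumptions, as is doubling the array when it is full.

```python
class DiskVnode:
    """Hypothetical stand-in for a vnode record held in RVM."""
    def __init__(self, vnode, unique, data=""):
        self.vnode = vnode
        self.unique = unique
        self.data = data


class VnodeIndex:
    """Array of lists; each list holds vnodes that share a vnode number."""
    def __init__(self, initial_slots=4):
        self.lists = [[] for _ in range(initial_slots)]

    def insert(self, v):
        # Assumption: the slot index is the vnode number itself.
        while v.vnode >= len(self.lists):
            grow = len(self.lists)
            self.lists.extend([] for _ in range(grow))  # double the array
        self.lists[v.vnode].append(v)

    def lookup(self, vnode, unique):
        if vnode >= len(self.lists):
            return None
        # Walk the list: same vnode number, different uniquefiers.
        for v in self.lists[vnode]:
            if v.unique == unique:
                return v
        return None


idx = VnodeIndex()
idx.insert(DiskVnode(5, 1, "old incarnation"))
idx.insert(DiskVnode(5, 2, "new incarnation"))
```

A lookup therefore needs both the vnode number (to pick the list) and the uniquefier (to pick the entry within it).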
Vnodes in action • Model: • GetFSObj calls GetVnode • work is done • PutObjects calls • rvm_begin_transaction • ReplaceVnode - copies data from VM to RVM • rvm_end_transaction • Getting a vnode takes three pointer dereferences, possibly three page faults, vs. one for local file systems. • Is this necessary? Probably not. Cure it: yes!
Directories (rvm) • DirInode • page table and “copy on write” refcount • DirPages 2048 bytes each • build up the directory • divided into 64 blobs of 32 bytes each • Hash table for fast name lookups • Blob Freelist • Array of free blobs per page
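The page arithmetic on this slide (2048-byte pages, 64 blobs of 32 bytes) can be illustrated with a short sketch. The helper names `locate`, `blobs_needed`, and `find_free_run` are hypothetical, and the assumption that an entry occupies a contiguous run of blobs within one page is ours, not stated in the slides.

```python
PAGE_SIZE = 2048
BLOB_SIZE = 32
BLOBS_PER_PAGE = PAGE_SIZE // BLOB_SIZE  # 64 blobs per page


def locate(blob_no):
    """Map a global blob number to (page index, byte offset in that page)."""
    page, slot = divmod(blob_no, BLOBS_PER_PAGE)
    return page, slot * BLOB_SIZE


def blobs_needed(entry_bytes):
    """Number of 32-byte blobs needed to hold an entry (ceiling division)."""
    return -(-entry_bytes // BLOB_SIZE)


def find_free_run(free, count):
    """Return the first slot of a run of `count` free blobs, or None.

    `free` is a per-page list of booleans (True means the blob is free).
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == count:
            return i - count + 1
    return None
```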
Directories • More than one vnode can point to directory (copy on write) • VM: hash table of DirHandles • point to VM contiguous copy of dir • point to DirInode • have a lock etc • Model: as for volumes & vnodes • Critique: too baroque
Files • Vnode references file by InodeNumber • Files are copy on write • There are “FileInodes” like dir inodes, but they are held in external DB or in inode itself • Server always reads/writes whole files (could be exploited)
Volinit and salvage • Set up volume hash table, serverlist, DiskPartitionList • Cycle through partitions, check each for • list of inodes • every inode has a vnode • every vnode has a directory name • every directory name has a vnode • Put volume in a VM hash table
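The three consistency checks listed above can be expressed as set operations. The sketch below is only a model: `salvage_check` and its argument shapes are hypothetical, and it ignores special cases such as a root directory that has no name in a parent.

```python
def salvage_check(inodes, vnode_inodes, dir_entries):
    """Report violations of the three salvage invariants.

    inodes:       set of inode numbers found on the partition
    vnode_inodes: dict mapping vnode id -> inode number (None if no inode)
    dir_entries:  dict mapping directory entry name -> vnode id
    """
    referenced = {ino for ino in vnode_inodes.values() if ino is not None}
    named = set(dir_entries.values())
    return {
        # every inode has a vnode
        "orphan_inodes": sorted(inodes - referenced),
        # every vnode has a directory name
        "unnamed_vnodes": sorted(set(vnode_inodes) - named),
        # every directory name has a vnode
        "dangling_names": sorted(
            name for name, vn in dir_entries.items() if vn not in vnode_inodes
        ),
    }


report = salvage_check(
    inodes={100, 101, 102},
    vnode_inodes={1: 100, 2: 101, 3: None},
    dir_entries={"a.txt": 1, "b.txt": 2, "ghost": 9},
)
```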
Server connection info • Array of HostEntry (a “venus”) • Contains a linked list of connections • Contains a callback connection id • Connection setup • first binding creates a host & callback conn • new binding creates a new connection and verifies callback • in RPC2_NewBinding & ViceNewConnectFS
Callbacks • Hashtable of FileEntries: • each contains Fid • number of users • linked list of callbacks • Callbacks: point to HostEntry • Ops: • RPC: BreakCallBack • Local: placing, delete, deleteVenus
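As a rough illustration of the table described above, the sketch below keys file entries by Fid and stores the hosts holding a callback. `CallbackTable` and its method names are hypothetical; a set replaces the linked list for brevity, and letting the writing host keep its callback after a break is an assumption.

```python
from collections import defaultdict


class CallbackTable:
    def __init__(self):
        # fid -> set of host names; len(set) plays the role of "number of users"
        self.files = defaultdict(set)

    def add(self, fid, host):
        """Place a callback for `host` on `fid`."""
        self.files[fid].add(host)

    def delete(self, fid, host):
        hosts = self.files.get(fid)
        if hosts is None:
            return
        hosts.discard(host)
        if not hosts:
            del self.files[fid]

    def delete_venus(self, host):
        """Drop every callback held by one client."""
        for fid in list(self.files):
            self.delete(fid, host)

    def break_callbacks(self, fid, writer):
        """Return the hosts to notify; only the writer keeps its callback."""
        hosts = self.files.pop(fid, set())
        notified = sorted(hosts - {writer})
        if writer in hosts:
            self.files[fid] = {writer}
        return notified


table = CallbackTable()
fid = (7, 5, 2)  # (volume, vnode, uniquefier), illustrative values
table.add(fid, "venus-a")
table.add(fid, "venus-b")
notified = table.break_callbacks(fid, writer="venus-a")
```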
Callbacks • The callback connection is not authenticated. This should be fixed. The session key for the CB connection should not expire. • As a side effect, the callback connection is also used for BackFetch bulk transfer of files during reintegration.
RPC processing • Venus RPCs: • srvproc.cc - standard file ops • srvproc2.cc - standard volume ops • codaproc.cc - repair stuff • codaproc2.cc - reintegration stuff • Volutil RPCs: • vol-your-rpc.cc (in coda-src/volutil) • Resolution: below
RPC processing • RPC structure: • ValidateParms: validate, hand off COP2, cid • GetObject: vm copy, lock objects • CheckSemantics: • Concurrency, Integrity, Permissions • Perform operations: • BulkTransfer, UpdateObjects, OutParms • PutObject: rvm transactions, inode deletions
vlists • GetFSObjects: instantiate a vlist • RPC needs list of objects copied from RVM • Modification status is held there (did CopyOnWrite kick in etc) • PutObjects • rvm_begin_transaction • walk through the list, copy, rvm_set_range, unlock • rvm_end_transaction
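The vlist pattern above (copy objects out, modify them in VM, then write everything back in one transaction) can be modelled as follows. This is a toy sketch: `Rvm`, `VlistEntry`, `get_objects`, and `put_objects` are hypothetical names, and the dictionary-backed store only imitates the begin / set_range / end sequence named on the slide.

```python
class Rvm:
    """Toy recoverable store that counts committed transactions."""
    def __init__(self):
        self.store = {}
        self.in_transaction = False
        self.commits = 0

    def begin(self):
        self.in_transaction = True

    def set_range(self, key, value):
        if not self.in_transaction:
            raise RuntimeError("set_range outside a transaction")
        self.store[key] = dict(value)

    def end(self):
        self.in_transaction = False
        self.commits += 1


class VlistEntry:
    def __init__(self, fid, data):
        self.fid = fid
        self.data = data      # VM copy of the object
        self.dirty = False    # modification status is held on the vlist
        self.locked = True


def get_objects(rvm, fids):
    """Instantiate a vlist: copy each object out of the store and lock it."""
    return [VlistEntry(fid, dict(rvm.store[fid])) for fid in fids]


def put_objects(rvm, vlist):
    """Write back all modified objects in a single transaction, then unlock."""
    rvm.begin()
    for entry in vlist:
        if entry.dirty:
            rvm.set_range(entry.fid, entry.data)
        entry.locked = False
    rvm.end()


rvm = Rvm()
rvm.store = {"a": {"length": 1}, "b": {"length": 2}}
vlist = get_objects(rvm, ["a", "b"])
vlist[0].data["length"] = 10
vlist[0].dirty = True
before = rvm.store["a"]["length"]  # still the old value until put_objects
put_objects(rvm, vlist)
```

Grouping the writes into one transaction means either all of the RPC's changes reach recoverable memory or none do.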
COP2 handling • In COP2, Venus gives the final VV to the server • COP2 messages are sent out by Venus (with some delay), often piggybacked in bulk • server tracks pending COP2 entries in a hash table (coppend.cc) • Manager thread CopPendingManager • Runs every minute. • Removes entries more than 900 secs old
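The expiry rule on this slide (entries older than 900 seconds are dropped by a periodic manager) might be sketched like this. `CopPendingTable` and keying the table by a store id are illustrative assumptions; the caller passes the clock value so the sketch stays deterministic.

```python
MAX_AGE_SECS = 900  # entries older than this are removed by the manager


class CopPendingTable:
    def __init__(self):
        self.entries = {}  # store id -> time the entry was added

    def add(self, storeid, now):
        self.entries[storeid] = now

    def remove(self, storeid):
        """Dequeue an entry when its COP2 arrives; True if it was pending."""
        return self.entries.pop(storeid, None) is not None

    def expire(self, now):
        """One manager pass: drop and return entries that are too old."""
        stale = [s for s, added in self.entries.items()
                 if now - added > MAX_AGE_SECS]
        for s in stale:
            del self.entries[s]
        return stale


pending = CopPendingTable()
pending.add("store-1", now=0)
pending.add("store-2", now=500)
expired = pending.expire(now=901)
```

In a server the `expire` pass would run about once a minute, per the slide.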
Cop2 to RVM • Data can be • PiggyBacked on another rpc • sent in ViceCop2 rpc. • Both cases call InternalCop2 (srvproc.cc) • InternalCop2 (codaproc.cc) • notifies the manager to dequeue • gets the FS objects listed for the COP2 • installs final VV’s into RVM (rvm transaction!)
COP2 Problems • An easy source of conflicts in replicated volumes when clients access objects in rapid succession. (Can be fixed easily during the writeback caching operation) • Not optimized for singly replicated volumes.
Resolution • Initiated by client with RPC to coordinator • ViceResolve (codaproc.cc) • coordinator • sets up connections in VSG (unauthenticated) • LockAndFetch (res/reslock, resutil): • lock volumes, • collect “closure”
Resolution - special cases • RegResDirRequired (rvmres/rvmrescoord.cc) • check for • unresolved ancestors • already inconsistent • runts (missing objects) • weak equality (identical storeid)
RecovDirResolve • Phase II: (rvmres/{rescoord,subphase?}.cc) • coordinator requests logs from other servers • subordinates lock affected dirs, marshal logs • coordinator merges logs • Phase III: • ship merged log to subordinates • perform operations on VM copies • Return results to coordinator
Resolution • Phase IV: (is old Phase 3 …) • collect results, compute new VVs, ship to subordinates • commit results
Comments on resolution • Old versions of resolution: • OldDirResolve: resolve only runts and weak equality • DirResolve: resolve only in VM • Remove these • The resolve directory has nothing to do with resolution: it should be called librepair. Srv uses only one function in it; repair uses the rest
Volume Log • During FS operations, log entries are created for use during resolution • Different format per operation (rvmres/recov_vollog.cc) • Added to the vlist by SpoolVMLogRecord • Put in RVM at commit time
Repair • Venus makes a ViceRepair RPC. • File and symlink repair: BulkTransfer the object • Directory repair: BulkTransfer the repair file and replay operations • Venus follows this with a COP2 multi-RPC • For directory repair, Venus invokes an asynchronous resolve
Future • Good: • Design is simple and efficient • There is little C++: should eliminate • easy to multi-thread • Bad: • Scalability ~8GB in practice, ~40GB in theory • Data handling is bad: tricky to fix • Volume code was & is worst: rewrite