Recovery Management in Quicksilver
Haskin, Malachi, Sawdon, Chan
IBM Almaden
ACM TOCS (6:1) February 1988

Introduction
• Distributed, extensible system
• Partition computation and data
• “Lean” kernel
• System services are processes
• Message-oriented IPC
• How to deal with more complicated failure modes?
• Provide atomic transactions as a system service

Recovery Techniques
• Timeouts
  • How to distinguish slow from dead?
• Connectionless protocols / stateless servers
  • Some actions can’t be made idempotent
  • Retries can cause problems

Recovery Techniques
• Virtual circuits
  • Can’t handle multiple servers
• Replication
  • Too expensive for some uses
  • How to detect failures?
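
The point about retries and idempotency can be made concrete with a small sketch. Everything here (the `Account` class, `send_with_retry`, the request IDs) is hypothetical and not taken from the paper; it only illustrates why a stateless server cannot safely absorb a resent request for a non-idempotent action.

```python
class Account:
    """Hypothetical server-side state, used only for illustration."""

    def __init__(self, balance):
        self.balance = balance
        self.seen = set()  # request IDs already applied

    def set_balance(self, value):
        # Idempotent: applying the request twice leaves the same state.
        self.balance = value

    def deposit(self, amount):
        # Not idempotent: a resent request is applied a second time.
        self.balance += amount

    def deposit_once(self, request_id, amount):
        # Deduplication makes the retry safe, but the server now has to
        # remember request IDs, so it is no longer stateless.
        if request_id not in self.seen:
            self.seen.add(request_id)
            self.balance += amount


def send_with_retry(operation, attempts=2):
    # Models a client whose reply was lost, so it resends the same request.
    for _ in range(attempts):
        operation()


acct = Account(100)
send_with_retry(lambda: acct.deposit(10))
print(acct.balance)  # 120: the single deposit was applied twice
```

The deduplicating variant shows the trade-off: the retry becomes safe only once the server keeps per-request state, which is what the stateless approach was trying to avoid.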

Transactions
• Basic idea: use transactions as a single, system-wide recovery paradigm
• Transactions are heavyweight
  • Not every server needs them
• Different server classes
  • Volatile (window manager)
  • Replicated + volatile (name server, uses transactions for commit atomicity)
  • Recoverable (file server)
• Long-running transactions need log support

Structure of Transactions
• Everything belongs to a transaction
  • Default transaction ID for processes
• Globally unique transaction identifiers
• Each transaction has an owner and multiple participants
  • Owner can commit or abort
  • Participants can only abort
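
The owner/participant rule can be sketched as a small permission check. This is a minimal illustration, not QuickSilver's interface: the `Transaction` class, its method names, and the use of `uuid4` as a stand-in for a globally unique transaction identifier are all assumptions.

```python
import uuid


class Transaction:
    """Hypothetical model of one transaction's membership and outcome."""

    def __init__(self, owner):
        self.tid = uuid.uuid4().hex  # stand-in for a globally unique ID
        self.owner = owner
        self.participants = set()
        self.state = "active"

    def join(self, process):
        if self.state != "active":
            raise RuntimeError("transaction already finished")
        self.participants.add(process)

    def commit(self, caller):
        # Only the owner may ask for commit.
        if caller != self.owner:
            raise PermissionError("only the owner can commit")
        self._finish("committed")

    def abort(self, caller):
        # The owner or any participant may abort.
        if caller != self.owner and caller not in self.participants:
            raise PermissionError("caller is not part of this transaction")
        self._finish("aborted")

    def _finish(self, outcome):
        if self.state != "active":
            raise RuntimeError("transaction already finished")
        self.state = outcome
```

For example, after `t = Transaction("client")` and `t.join("file-server")`, a call to `t.commit("file-server")` raises `PermissionError`, while `t.abort("file-server")` succeeds.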

Recovery Manager
• One transaction-based recovery manager per host
• Three components
  • Transaction Manager
  • Log Manager
  • Deadlock Detector

Transaction Manager
• Tracks transactions for processes on host
• Manages distributed commit protocol
• Distributed transaction is a tree
  • Only need to know your superior and your immediate subordinates
• Failure vs. termination
  • Termination causes commit/abort to proceed immediately
  • Failure is remembered, and the transaction is aborted when it finally terminates

Transaction Manager
• Participants can say whether their failure causes transaction failure or termination
• Subordinates can reclaim resources early
• Several alternative commit protocols available to servers
  • 1-phase: used by volatile servers
  • 2-phase: used by recoverable servers
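
The tree-structured commit described on the Transaction Manager slides can be sketched as follows. This is a simplified illustration under stated assumptions: `TMNode`, `run_commit`, and the boolean `local_vote` are hypothetical names, and the sketch ignores failures, logging, and the 1-phase variant. What it does show is that each node talks only to its superior and its immediate subordinates.

```python
class TMNode:
    """Hypothetical transaction manager at one host in the commit tree."""

    def __init__(self, host, local_vote=True):
        self.host = host
        self.local_vote = local_vote  # True means "willing to commit"
        self.superior = None
        self.subordinates = []
        self.outcome = None

    def add_subordinate(self, node):
        node.superior = self
        self.subordinates.append(node)
        return node

    def prepare(self):
        # Phase 1: ask only the immediate subordinates; each of them
        # recurses into its own subtree. Every subordinate is polled
        # (no short-circuit) so that all of them see the prepare request.
        votes = [sub.prepare() for sub in self.subordinates]
        return self.local_vote and all(votes)

    def finish(self, outcome):
        # Phase 2: record the outcome and pass it down one level.
        self.outcome = outcome
        for sub in self.subordinates:
            sub.finish(outcome)


def run_commit(root):
    # The root acts as coordinator for the whole tree.
    outcome = "commit" if root.prepare() else "abort"
    root.finish(outcome)
    return outcome
```

A single "no" vote anywhere in the tree is folded into the answer its superior returns, so the root never needs to know the full membership of the transaction.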

2-phase Commit
• Different voting options
  • abort: undo my action, announce abort to others in 2nd phase
  • commit-read-only: no recoverable resources modified, don’t include me in 2nd phase
  • commit-volatile: same as read-only, but notify me of results of 2nd phase
  • commit-recoverable: recoverable state modified, notify me of results of 2nd phase
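
The four votes above can be sketched as an enumeration plus a function that derives the outcome and the set of participants to contact in the second phase. The names `Vote` and `decide` are hypothetical, and the rule that an abort voter is not contacted again (it has already undone its work) is an assumption read off the slide rather than a detail confirmed by the paper.

```python
from enum import Enum


class Vote(Enum):
    ABORT = "abort"
    COMMIT_READ_ONLY = "commit-read-only"
    COMMIT_VOLATILE = "commit-volatile"
    COMMIT_RECOVERABLE = "commit-recoverable"


# Votes whose senders asked to hear the result of the second phase.
WANTS_RESULT = {Vote.COMMIT_VOLATILE, Vote.COMMIT_RECOVERABLE}


def decide(votes):
    """Return (outcome, second_phase) for a dict of participant -> Vote.

    Any abort vote aborts the transaction. Read-only voters drop out after
    phase 1; volatile and recoverable voters are told the outcome.
    """
    outcome = "abort" if Vote.ABORT in votes.values() else "commit"
    second_phase = [p for p, v in votes.items() if v in WANTS_RESULT]
    return outcome, second_phase


votes = {
    "file-server": Vote.COMMIT_RECOVERABLE,
    "window-manager": Vote.COMMIT_VOLATILE,
    "name-server": Vote.COMMIT_READ_ONLY,
}
print(decide(votes))  # ('commit', ['file-server', 'window-manager'])
```

The read-only vote is what lets a participant leave the protocol early: it takes part in phase 1 but costs the coordinator no message in phase 2.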

Commit Processing
• Special rules to handle special cases
  • Commit before participate (late joining)
  • Cycles in transaction graph
  • New requests after being prepared to commit
• Rules
  • TM must accept new participants and let them vote until commit
  • All requests that could force an abort must complete before commit
  • 1-phase-commit servers cannot commit before making requests that might force an abort

Commit Processing
• Transaction coordinator at transaction birth-site
  • Usually a user workstation, likely to fail
• Migrate or replicate coordinator for reliability

Log Manager
• Log manager provides optional services
  • Backpointers for log replay
  • Block I/O access
  • Log replication
  • Log archival
• Servers tell LM what they need
  • Not penalized for services they don’t use
• LM does not interpret data – servers determine recovery strategy
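
The "pay only for what you ask for" idea can be sketched with one optional service, backpointers. This is a hypothetical toy, not the QuickSilver log interface: `LogManager`, `register`, `append`, and `replay_backwards` are invented names, the log is an in-memory list, and payloads are opaque bytes that the log manager never inspects.

```python
class LogManager:
    """Hypothetical in-memory log shared by several servers."""

    def __init__(self):
        self.records = []  # each entry: (server, payload, prev_lsn)
        self.options = {}  # server -> services it asked for
        self.last = {}     # server -> LSN of its latest record

    def register(self, server, backpointers=False):
        # Servers declare up front which optional services they need.
        self.options[server] = {"backpointers": backpointers}

    def append(self, server, payload):
        # The payload is stored as-is; the log manager never interprets it.
        lsn = len(self.records)
        prev = None
        if self.options[server]["backpointers"]:
            # Only servers that asked for backpointers pay for the chain.
            prev = self.last.get(server)
            self.last[server] = lsn
        self.records.append((server, payload, prev))
        return lsn

    def replay_backwards(self, server):
        # Walk one server's records newest-first via its backpointers.
        if not self.options[server]["backpointers"]:
            raise ValueError("server did not request backpointers")
        payloads = []
        lsn = self.last.get(server)
        while lsn is not None:
            _, payload, lsn = self.records[lsn]
            payloads.append(payload)
        return payloads
```

What the payloads mean, and whether replay is used for undo or redo, is left entirely to the server, matching the slide's point that servers determine their own recovery strategy.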

Deadlock Detector
• Distributed deadlock detection is hard!
• So, they didn’t do it.
Criticisms • ???

Criticisms
• IPC is responsible for a lot
  • Guaranteed delivery
  • Message ordering
  • Security constraints
  • Keeping transaction graphs together
• For a system that claims not to make you pay for services you don’t use…

Why Do We Care?
• Transactions as a core OS mechanism
• Mechanism, not policy
• Customize effort to need
• Optional cost for optional services