Recovery Management in QuickSilver. Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan IBM Almaden Research Center. Introduction: Problem Domain. Recovery management in distributed OSs Trends in contemporary research: Extensibility and Distribution.
Recovery Management in QuickSilver Roger Haskin, Yoni Malachi, Wayne Sawdon, and Gregory Chan IBM Almaden Research Center
Introduction: Problem Domain • Recovery management in distributed OSs • Trends in contemporary research: • Extensibility and Distribution
Contemporary Recovery Techniques • timeouts • how to distinguish slow from dead? • connectionless protocols / stateless servers • some actions can’t be made idempotent • retries can cause problems • virtual circuits • can’t handle multiple servers • replication • too expensive for some uses • how to detect failures?
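A minimal sketch of why retries are a problem for non-idempotent actions. The `Counter` class and `call_with_lost_reply` helper are hypothetical names made up for illustration; they simulate a client that times out after the server has already done the work and then retries.

```python
class Counter:
    """Hypothetical server state with one idempotent and one non-idempotent action."""

    def __init__(self):
        self.value = 0

    def set(self, v):
        # Idempotent: repeating the request leaves the same state.
        self.value = v

    def increment(self):
        # Not idempotent: every repetition changes the state again.
        self.value += 1


def call_with_lost_reply(action):
    """Simulate a timeout: the request runs, the reply is lost, the client retries once."""
    action()  # executed by the server, but the reply never arrives
    action()  # client-side retry after the timeout


c = Counter()
call_with_lost_reply(lambda: c.set(5))   # value is 5, as intended
call_with_lost_reply(c.increment)        # value is 7, one more than intended
```

The client cannot tell a slow server from a dead one, so it cannot know whether the retry is safe.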
QuickSilver: what’s so special? • Fundamental Trade-Off: • Generality & efficiency (QuickSilver) vs. ease of use (Camelot, Argus, etc.) • Transparency isn’t always best!
QuickSilver: specs and features • Client-server model • System services are processes • IPC message-passing • More complicated set of failure modes (to handle more specific cases) • Atomic transactions
Server Classes Common server classes: • Volatile (window manager) • Replicated + volatile (name server) • Recoverable (file server) • Long running transactions need log support
Design Goals • Programs should be resilient to external process and machine failure • Server processes should contain their own recovery code • Uniform system-wide architecture for recovery management • Logically related activities must execute atomically
Transaction Structure • Everything belongs to a transaction • Globally unique transaction identifiers (tid) • Each transaction has one owner and multiple participants • Owner can commit or abort • Participants can only abort
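The ownership rules above can be sketched roughly as follows. This is an illustration only, not QuickSilver's interface: the `Transaction` class, the `new_tid` helper, and the tid format (birth node plus a local sequence number) are assumptions.

```python
import itertools
from dataclasses import dataclass, field

_sequence = itertools.count(1)


def new_tid(node: str) -> str:
    """Hypothetical tid format: birth node name plus a local sequence number."""
    return f"{node}:{next(_sequence)}"


@dataclass
class Transaction:
    tid: str
    owner: str
    participants: set = field(default_factory=set)
    state: str = "active"

    def join(self, process: str) -> None:
        self.participants.add(process)

    def commit(self, process: str) -> None:
        # Only the owner may ask for commit.
        if process != self.owner:
            raise PermissionError("only the owner may commit")
        if self.state != "active":
            raise RuntimeError(f"transaction is already {self.state}")
        self.state = "committed"

    def abort(self, process: str) -> None:
        # The owner and any participant may abort.
        if process != self.owner and process not in self.participants:
            raise PermissionError("not a member of this transaction")
        if self.state == "committed":
            raise RuntimeError("transaction is already committed")
        self.state = "aborted"
```

For example, a file server that joined a transaction owned by a shell can call `abort`, but a `commit` call from the file server raises `PermissionError`.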
Recovery Manager: Components • Transaction Manager: manages commit coordination by communicating with servers at its own node and with transaction managers at other nodes • Log Manager: serves as a common recovery log both for the TM’s commit log and the server’s recovery data • Deadlock Detector: detects and resolves global deadlocks (not implemented)
Transaction Manager • Tracks transactions for processes on host • Manages distributed commit protocol • Distributed transaction is a tree • Only need to know your superior and your immediate subordinates • Several alternative commit protocols available to servers • 1-phase – used by volatile servers • 2-phase – used by recoverable servers
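A rough sketch of commit over a transaction tree, assuming a simplified yes/no vote per node. `TMNode`, `local_vote`, and `run_commit` are hypothetical names; the point is only that each node talks to its superior and its immediate subordinates, never to the whole tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TMNode:
    name: str
    local_vote: bool = True  # assumed: combined vote of the servers at this node
    superior: Optional["TMNode"] = None
    subordinates: List["TMNode"] = field(default_factory=list)
    outcome: Optional[str] = None

    def add_subordinate(self, child: "TMNode") -> "TMNode":
        child.superior = self
        self.subordinates.append(child)
        return child

    def prepare(self) -> bool:
        # Ask each immediate subordinate; each one recurses into its own subtree.
        votes = [child.prepare() for child in self.subordinates]
        return self.local_vote and all(votes)

    def deliver(self, outcome: str) -> None:
        # Record the outcome locally, then pass it one level down.
        self.outcome = outcome
        for child in self.subordinates:
            child.deliver(outcome)


def run_commit(root: TMNode) -> str:
    """Votes flow up the tree to the root; the outcome flows back down."""
    outcome = "commit" if root.prepare() else "abort"
    root.deliver(outcome)
    return outcome
```

A single "no" vote anywhere in the tree reaches the root through the chain of superiors and turns the outcome into an abort for every node.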
2-Phase Commit • Voting options • abort: undo my action, announce abort to others in 2nd phase • commit-read-only: no recoverable resources modified, don’t include me in 2nd phase • commit-volatile: same as read-only, but notify me of results of 2nd phase • commit-recoverable: recoverable state modified, notify me of results of 2nd phase
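The voting options can be illustrated with a small sketch. The `Vote` enum mirrors the four options listed above; `second_phase` and the server names are hypothetical, and the sketch assumes that only commit-volatile and commit-recoverable voters are contacted in the second phase.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Vote(Enum):
    ABORT = "abort"
    COMMIT_READ_ONLY = "commit-read-only"
    COMMIT_VOLATILE = "commit-volatile"
    COMMIT_RECOVERABLE = "commit-recoverable"


# Voters that asked to be told the result of the second phase.
WANTS_OUTCOME = {Vote.COMMIT_VOLATILE, Vote.COMMIT_RECOVERABLE}


def second_phase(votes: Dict[str, Vote]) -> Tuple[str, List[str]]:
    """Return the outcome and the (sorted) participants to notify in phase 2."""
    outcome = "abort" if Vote.ABORT in votes.values() else "commit"
    notify = sorted(name for name, vote in votes.items() if vote in WANTS_OUTCOME)
    return outcome, notify


votes = {
    "file_server": Vote.COMMIT_RECOVERABLE,
    "window_manager": Vote.COMMIT_VOLATILE,
    "name_server": Vote.COMMIT_READ_ONLY,
}
outcome, notify = second_phase(votes)
# outcome is "commit"; the read-only voter drops out of the second phase
```

Letting read-only voters drop out early saves a round of messages for servers that have nothing to undo or make durable.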
Transaction Coordination • Transaction coordinator at transaction birth-site • Usually a user workstation, likely to fail • Migrate or replicate coordinator for reliability
Log Manager • Log manager provides optional services • Backpointers for log replay • Block I/O access • Log replication • Log archival • Servers tell LM what they need • Not penalized for services they don’t use • LM does not interpret data – servers determine recovery strategy
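One way to picture "servers tell the LM what they need" is the sketch below. It is an assumption-laden illustration, not the QuickSilver log interface: `LogManager`, `register`, `append`, and `replay_backward` are hypothetical names, and only the backpointer service is modeled. Record contents are opaque bytes that the log manager never interprets.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    server: str
    data: bytes           # opaque to the log manager
    prev: Optional[int]   # backpointer to the server's previous record, if requested


class LogManager:
    def __init__(self):
        self._log: List[Record] = []
        self._wants_backpointers: Dict[str, bool] = {}
        self._last: Dict[str, int] = {}

    def register(self, server: str, backpointers: bool = False) -> None:
        # Each server opts in to the services it needs.
        self._wants_backpointers[server] = backpointers

    def append(self, server: str, data: bytes) -> int:
        track = self._wants_backpointers[server]
        prev = self._last.get(server) if track else None
        self._log.append(Record(server, data, prev))
        lsn = len(self._log) - 1
        if track:
            # Bookkeeping is done only for servers that asked for it.
            self._last[server] = lsn
        return lsn

    def replay_backward(self, server: str) -> List[bytes]:
        """Follow backpointers from the server's newest record to its oldest."""
        if not self._wants_backpointers[server]:
            raise ValueError("server did not request backpointers")
        out = []
        lsn = self._last.get(server)
        while lsn is not None:
            record = self._log[lsn]
            out.append(record.data)
            lsn = record.prev
        return out
```

A server that never registers for backpointers shares the same log but pays no bookkeeping cost for them; what the returned bytes mean is left entirely to the server's own recovery code.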
Open questions - ??? • Efficiency vs. Transparency? • Still relevant for today’s hardware? • …