Operating System Reliability

Operating System Reliability. Andy Wang, COP 5611 Advanced Operating Systems. Some Axioms: Some simple systems, designed from scratch, sometimes work. A complex system that works is invariably found to have evolved from a simple system that works.


  1. Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems

  2. Some Axioms • Some simple systems, designed from scratch, sometimes work • A complex system that works is invariably found to have evolved from a simple system that works • A complex system, designed from scratch, never works

  3. Failure-Mode Theorems • Complex systems usually operate in failure mode • A system should have safe behaviors when encountering failures • When a “fail-safe” system fails, it fails by failing to fail safe

  4. Some Definitions • A failure occurs when the system does not perform its services in the manner specified • Failures can be subtle (e.g., performance fault) • A fault is an anomalous physical condition • Includes system specification/implementation mistakes • An error is the part of the system state that differs from its intended value

  5. Classification of Failures • Process failures • System failures • Secondary storage failures • Communication medium failures

  6. Process Failures • Examples • Computation results in incorrect outcome • System state deviates from specification • Process fails to progress • Errors leading to failure • Deadlock, timeout, protection violation • Bad input, consistency violation • Ignoring malicious behavior

  7. System Failures • Processor fails to execute • Software error, hardware error (CPU, bus, etc.) • Fail-stop behavior assumed • Failure types • Amnesia • Partial-amnesia • Pause • Halting

  8. Secondary Storage Failures • Stored data inaccessible • Parity error • Head crash • Contaminated medium • Reconstructable from archive + log, maybe • Mirrored disks (independent failure mode)

  9. Communication Medium Failures • Site can’t communicate with another site • Causes • Switching node failure • Hardware failure • Software failure • Congestion • Link failure • Hardware • Implementation failure • Network partitions can result

  10. Recovery • Restart process/processor • Reclaim resources • Undo/finish incomplete transactions • Concurrency makes things harder

  11. Forward Error Recovery • Goal: To restore system from erroneous state to error-free state • If nature of error is completely known • Remove error from state • Proceed with execution from error-free state • Rarely possible to do

  12. Backward Error Recovery • Used when the error source is unknown • Restore state to a previous error-free state; restart • Independent of the fault and of the errors it causes • Problems • Performance penalty • No guarantee the fault will not recur • Possible unrecoverable component of state • Recovery point: the saved state that replaces the erroneous state

  13. Backward Error Recovery • Basic approaches • Operation-based • Logs • Update-in-place • Write-ahead-log • State-based

  14. Update-in-Place • Every update to an object also writes a log record • Name of object • Old and new states of object • A recoverable update operation is implemented as • Do, undo, redo operations

  15. Write-Ahead Log • Update-in-place has a problem if a crash occurs after the update but before the log record reaches stable storage • Update the object only after the undo log is recorded • Before committing updates, record both redo and undo logs • Expensive to write log to stable storage
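
The write-ahead rule from slides 14 and 15 can be sketched as a toy key-value store. This is a minimal illustration under stated assumptions, not the algorithm from the course: the class `WriteAheadStore`, its method names, and the in-memory list standing in for stable storage are all hypothetical, and it assumes no two uncommitted transactions touch the same object.

```python
class WriteAheadStore:
    """Toy store with an undo/redo log (hypothetical names, in-memory only)."""

    def __init__(self):
        self.objects = {}      # data that is updated in place
        self.stable_log = []   # stands in for a log on stable storage

    def update(self, txn, name, new):
        old = self.objects.get(name)  # None means "object did not exist"
        # Write-ahead rule: the log record (old and new state) is recorded
        # before the object itself is changed.
        self.stable_log.append((txn, name, old, new))
        self.objects[name] = new

    def commit(self, txn):
        self.stable_log.append((txn, "COMMIT"))

    def _restore(self, name, value):
        if value is None:
            self.objects.pop(name, None)
        else:
            self.objects[name] = value

    def recover(self):
        committed = {rec[0] for rec in self.stable_log if len(rec) == 2}
        # Undo: walk the log backwards, restoring old states of
        # transactions that never committed.
        for rec in reversed(self.stable_log):
            if len(rec) == 4 and rec[0] not in committed:
                self._restore(rec[1], rec[2])
        # Redo: walk the log forwards, reapplying new states of
        # committed transactions.
        for rec in self.stable_log:
            if len(rec) == 4 and rec[0] in committed:
                self._restore(rec[1], rec[3])


store = WriteAheadStore()
store.update("T1", "a", 1)
store.commit("T1")
store.update("T2", "a", 2)   # T2 never commits
store.update("T2", "b", 5)
store.objects = {"a": 2}     # simulate a crash that lost the write to "b"
store.recover()
print(store.objects)         # {'a': 1}
```

Because every record carries both the old and the new state, recovery does not need to know which in-place updates actually reached the objects before the crash.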

  16. State-Based Recovery • Save entire process state at recovery point • Recovery point called checkpoint • Rolling back process: restoring to checkpoint • Tradeoff: frequent checkpoints vs. completion delay • Shadow pages • Save unmodified page copy on stable storage • Update only volatile copy; discard on rollback
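
The shadow-page idea on slide 16 can be sketched as follows. This is an illustrative toy, assuming pages are plain Python values and that a dictionary can stand in for stable storage; `ShadowPagedMemory` and its methods are hypothetical names. A real implementation would also have to make the checkpoint step atomic, which this sketch does not attempt.

```python
class ShadowPagedMemory:
    """Toy shadow paging: stable copies stay untouched until a checkpoint."""

    def __init__(self, pages):
        self.stable = dict(pages)  # unmodified page copies (stable storage)
        self.volatile = {}         # pages modified since the last checkpoint

    def read(self, page_no):
        # Prefer the volatile copy if the page was modified.
        if page_no in self.volatile:
            return self.volatile[page_no]
        return self.stable[page_no]

    def write(self, page_no, data):
        # Update only the volatile copy.
        self.volatile[page_no] = data

    def checkpoint(self):
        # Establish a new recovery point from the current contents.
        self.stable.update(self.volatile)
        self.volatile.clear()

    def rollback(self):
        # Discard volatile copies; the stable pages are the checkpoint.
        self.volatile.clear()


mem = ShadowPagedMemory({0: "a", 1: "b"})
mem.write(0, "A")
mem.checkpoint()
mem.write(1, "B")
mem.rollback()
print(mem.read(0), mem.read(1))  # A b
```

Rollback here costs nothing beyond dropping the volatile copies, which is the appeal of the scheme; the cost is paid at checkpoint time instead.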

  17. Concurrent Systems Recovery • Rollback issues • Orphan messages • Domino effect • Lost messages • Livelocks

  18. Orphan Messages (a message prior to a checkpoint is sent to the future) [Figure: timelines for processes X, Y, and Z with checkpoints x1, x2, y1, y2, z1, z2, a message m, and the recovery point marked]

  19. Messages from Future Sent to the Past [Figure: timelines for processes X, Y, and Z with checkpoints x1, x2, y1, y2, z1, z2, a message m, and the recovery point marked]

  20. Messages from Future Sent to the Past [Figure: same diagram as slide 19]

  21. Domino Effect Completed [Figure: timelines for processes X, Y, and Z with checkpoints x1, x2, y1, y2, z1, z2, a message m, and the recovery point marked]

  22. Lost Messages [Figure: timelines for processes X and Z with checkpoints x1 and z1, a message m between them, a failure, and the recovery point marked]

  23. Livelocks [Figure: timelines for processes X and Z with checkpoints x1 and z1, a repeated failure, and the recovery point marked]

  24. Concurrent Recovery • Coordination required at either • Time of establishing checkpoints • Beginning of recovery

  25. Checkpoint Assumptions • Communication via messages • Reliable FIFO channels • Higher-level end-to-end protocols assumed • Subsumes rollback-caused message loss • No network partitions from communication failures

  26. Checkpoint Algorithm Concepts • Permanent and tentative checkpoints • Saved on stable storage • Permanent: part of known consistent global checkpoint • Tentative: until successful termination of checkpoint algorithm • Rolls back only to permanent checkpoints

  27. Synchronous Checkpoint Algorithms • Two-phase commit • Problems: • Message overhead for synchronizations • Synchronization delays • Costly when failures are frequent
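
The tentative and permanent checkpoints of slide 26 fit naturally into a two-phase structure. The sketch below is a simplified, single-threaded illustration under assumptions: `Process`, `can_checkpoint`, and `coordinated_checkpoint` are hypothetical names, messages between coordinator and processes are replaced by direct method calls, and failures during the protocol itself are not modeled.

```python
class Process:
    """Toy participant holding one permanent and one tentative checkpoint."""

    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.state = 0
        self.permanent = 0      # last checkpoint known to be globally consistent
        self.tentative = None   # checkpoint awaiting the coordinator's decision
        self.can_checkpoint = can_checkpoint

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def make_permanent(self):
        self.permanent = self.tentative
        self.tentative = None

    def discard_tentative(self):
        self.tentative = None


def coordinated_checkpoint(processes):
    # Phase 1: every process takes a tentative checkpoint and votes.
    votes = [p.take_tentative() for p in processes]
    decision = all(votes)
    # Phase 2: commit all tentative checkpoints, or discard all of them.
    for p in processes:
        if decision:
            p.make_permanent()
        else:
            p.discard_tentative()
    return decision


a, b = Process("A"), Process("B")
a.state, b.state = 3, 4
print(coordinated_checkpoint([a, b]))  # True
b.can_checkpoint = False
a.state = 10
print(coordinated_checkpoint([a, b]))  # False
print(a.permanent)                     # 3
```

The two loops over all processes are where the message overhead and synchronization delay listed on slide 27 come from: every checkpoint costs at least two rounds of communication.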

  28. Asynchronous Checkpointing • Local checkpoints taken independently • Log all incoming messages on stable storage • Minimizes undone computation • Allows reprocessing of messages after rollback
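
A minimal sketch of logging incoming messages and reprocessing them after a rollback, assuming a deterministic process whose state is just a running sum. `LoggingProcess` and its methods are hypothetical names, and a Python list stands in for the stable-storage log.

```python
class LoggingProcess:
    """Toy process that logs every incoming message before handling it."""

    def __init__(self):
        self.state = 0
        self.message_log = []     # stands in for a log on stable storage
        self.checkpoint = (0, 0)  # (saved state, messages reflected in it)

    def receive(self, value):
        self.message_log.append(value)  # log first, then process
        self._handle(value)

    def _handle(self, value):
        self.state += value

    def take_checkpoint(self):
        # A local checkpoint, taken without coordinating with anyone.
        self.checkpoint = (self.state, len(self.message_log))

    def recover(self):
        # Roll back to the checkpoint, then replay messages logged after it.
        self.state, seen = self.checkpoint
        for value in self.message_log[seen:]:
            self._handle(value)


proc = LoggingProcess()
proc.receive(1)
proc.take_checkpoint()
proc.receive(2)
proc.receive(3)
proc.state = -999   # simulate a crash that corrupts volatile state
proc.recover()
print(proc.state)   # 6
```

Replay only reproduces the pre-failure state if message handling is deterministic, which is why the event-driven model matters for this approach.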

  29. Asynchronous Checkpointing Assumptions • Reliable FIFO communication channels • Infinite buffers • Event-driven computation • A process is idle until a message is received • Processes the message and changes state • Sends zero or more messages • Can identify each event with a monotonically increasing counter

  30. Event-Driven Computation [Figure: timelines for processes X, Y, and Z with events x1, x2, y1, y2, z1, z2]

  31. Asynchronous Checkpointing • Basic idea • Save states, messages sent at each event • Volatile logging • Each processor notes number of messages sent to others, and received from others • Use counters to determine orphan messages
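
One way the sent/received counters can expose orphan messages is sketched below. This is an illustration in the spirit of the slide, not a reproduction of any specific published algorithm. The data layout is an assumption: each process has a list of per-event snapshots, snapshot 0 is the initial state with all counters at zero, and `snap` and `recovery_line` are hypothetical helpers. A receiver has orphan messages from a sender when it has received more messages than the sender's restored state has sent, so it must roll back further.

```python
def snap(sent=None, rcvd=None):
    """Counters saved at one event: messages sent to / received from peers."""
    return {"sent": sent or {}, "rcvd": rcvd or {}}


def recovery_line(histories, start):
    """Roll processes back until no state has received an unsent message.

    histories: process name -> list of snapshots (index 0 = initial state)
    start:     process name -> event index each process would restart from
    """
    line = dict(start)
    changed = True
    while changed:
        changed = False
        for receiver, events in histories.items():
            for sender in histories:
                if sender == receiver:
                    continue
                sent = histories[sender][line[sender]]["sent"].get(receiver, 0)
                # More received than sent means orphan messages: roll back.
                while events[line[receiver]]["rcvd"].get(sender, 0) > sent:
                    line[receiver] -= 1
                    changed = True
    return line


histories = {
    "X": [snap(), snap(sent={"Y": 1})],
    "Y": [snap(), snap(rcvd={"X": 1}), snap(rcvd={"X": 1}, sent={"Z": 1})],
    "Z": [snap(), snap(rcvd={"Y": 1})],
}
# X fails and can only restart from its initial state (event 0).
print(recovery_line(histories, {"X": 0, "Y": 2, "Z": 1}))
# {'X': 0, 'Y': 0, 'Z': 0}
```

In this example X's rollback undoes its send to Y, which forces Y back, which in turn undoes Y's send to Z: a small domino effect. The sketch only detects orphans; lost messages would need the message log to be replayed separately.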

  32. Summary • Failures caused by errors • Can remove errors by forward/backward error recovery • Backward error-recovery more costly, more general • Synchronous checkpoints helpful, costly • Asynchronous checkpoints messier, domino effects
