MURI: Autonomic Recovery of Enterprise-wide Systems After Attack or Failure with Forward Correction: System-Level Design & Implementation
Peng Liu
Pennsylvania State University, University Park, PA 16802
July 20, 2007
Outline • Recovery angle of enterprise health care • The recovery problem • The state-of-the-art • Our goal • Year-by-year plan: overview • Year one plan: zoomed-in
The recovery problem: (1) Patient systems
[Figure: a human patient shown side by side with a “patient” system]
• A “patient” system: app processes running on an OS
• Components: code, stack, heap, (VM) pages, files, sockets, PCB, page tables, registers, sys. calls, drivers, …
• Threat: virus infection
(2) Why a compromised system could be called a patient
A “patient” system | A human patient
Process | Organ
Text, stack, heap, pages, files | Tissues
Memory unit, register, disk block, … | Cells
OS | Neuro + blood
PCB, page tables, drivers, sys. calls, scheduler, sockets, interceptions, … | Neuro sub-systems, blood sub-systems
Memory unit, register, disk block, … | Cells
The recovery problem: (3) System state transition
• A system’s state is determined by the state values of its components: stack, heap, files, registers, …
[Timeline: State at 8am (checkpoint C-8am), State at 9am (C-9am), State at 10am (C-10am), …, State at 5pm (C-5pm); component x is poisoned by attack at 9:30am]
• Fact: “infection” can propagate
(4) Simplest full-system recovery: before
[Timeline, human analogy: Body at week 1, Body at week 2, Body at week 3, …, Body at week 10; labels: lymph, gallbladder, etc.]
[Timeline, system: State at 8am (checkpoint C-8am), State at 9am (C-9am), State at 10am (C-10am), …, State at 5pm (C-5pm); component x (liver) is poisoned by attack at 9:30am]
(5) Simplest full-system recovery: after
[Timeline, human analogy: Body at week 1, Body at week 2]
• Bob’s memory after week 2 is lost: very painful for Bob
[Timeline, system: State at 8am (checkpoint C-8am), State at 9am (C-9am); component x (liver) is poisoned by attack at 9:30am]
• The work after 9am is lost
• Memory-less recovery
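The rollback on this slide can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the checkpoint times are the ones from the timeline (expressed as hours since midnight), and the function names are hypothetical.

```python
from bisect import bisect_left

# Checkpoint times from the timeline, as hours since midnight (8am, 9am, 10am, 5pm).
CHECKPOINT_HOURS = [8.0, 9.0, 10.0, 17.0]

def last_clean_checkpoint(checkpoints, attack_time):
    """Return the latest checkpoint taken strictly before the attack, or None.

    `checkpoints` must be sorted in ascending order.
    """
    idx = bisect_left(checkpoints, attack_time)
    return checkpoints[idx - 1] if idx > 0 else None

def lost_work_hours(checkpoints, attack_time, now):
    """Hours of work discarded when the whole system is rolled back."""
    restore = last_clean_checkpoint(checkpoints, attack_time)
    if restore is None:
        raise ValueError("no checkpoint precedes the attack")
    return now - restore

# Attack at 9:30am, recovery performed at 5pm.
restore_point = last_clean_checkpoint(CHECKPOINT_HOURS, 9.5)
lost = lost_work_hours(CHECKPOINT_HOURS, 9.5, 17.0)
print(restore_point, lost)
```

With the attack at 9:30am, the restore point is C-9am, and every change made between 9am and 5pm is discarded, including changes made by components that were never infected.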
(6) Memory-preserving, full-system recovery • What is memory-preserving recovery? • When we perform surgeries on the liver, do not roll back the state of the brain • When we repair an infected process, do not roll back any uninfected process • Memory-preserving recovery requires fine-grained process-level (i.e., organ-level) and operation-level forward correction surgeries • Memory-preserving recovery is challenging • Due to infection propagation, it is hard to know which (part of an) organ should be cut off and which should be kept
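As a rough illustration of the idea, the sketch below rolls back only the components marked as tainted and leaves everything else at its current value. It is deliberately naive: the component names and the dictionary-based state model are assumptions made for this example, the tainted set is taken as given, and nothing here checks that the repaired state is mutually consistent.

```python
def memory_preserving_restore(current, checkpoint, tainted):
    """Roll back only the tainted components; keep all other components as they are.

    current, checkpoint: dicts mapping component name -> state value
    tainted: set of component names judged to be infected
    """
    repaired = dict(current)
    for name in tainted:
        if name in checkpoint:
            # Restore the clean value recorded before the attack.
            repaired[name] = checkpoint[name]
        else:
            # The component did not exist at checkpoint time, so drop it.
            repaired.pop(name, None)
    return repaired

# Hypothetical state values for two processes and one file.
checkpoint_9am = {"proc_A.heap": "a0", "proc_B.heap": "b0", "file_f": "f0"}
state_5pm = {
    "proc_A.heap": "a-bad",
    "proc_B.heap": "b7",      # clean work done after 9am
    "file_f": "f-bad",
    "tmp_x": "created by the attack",
}
tainted = {"proc_A.heap", "file_f", "tmp_x"}

repaired = memory_preserving_restore(state_5pm, checkpoint_9am, tainted)
print(repaired)
```

Process B keeps the work it did after 9am (`b7`), which a full-system rollback to C-9am would have discarded.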
(7) Full-body anesthesia vs. local anesthesia • There are two ways to perform surgeries: • Full-body “anesthesia”: The machine is halted during recovery • Local “anesthesia”: The uninfected processes can still be executed as usual • For non-stop enterprise computing, local “anesthesia” is required
The recovery problem: (8) Infection quarantine • Why quarantine? • The under-repair components are infectious → prevent them from infecting clean processes • Execution of the uninfected processes may interfere with the surgeries → guarantee correctness • Quarantine = “disinfection” + local “anesthesia” • Quarantine strategies • Two-way quarantine • One-way quarantine
The state-of-the-art • Memory-less recovery • Re-playable systems • Process checkpointing • Process migration • Memory-preserving subsystem recovery with full-body anesthesia
The state-of-the-art: (1) Memory-less recovery • One-button recovery • A standard feature in laptops (HP, Dell, etc.) • The OS will lose all “memory” • Simplest full-system recovery • Checkpoint-based • E.g., the whole state of a VM at time t can be copied to disk (State Procurement) • Will lose “memory” after the moment of attack
The state-of-the-art: (2) Re-playable systems • E.g., ReVirt can log and replay all operations of a virtual machine • Re-playable ≠ recoverable • ReVirt cannot detangle bad operations from good ones • ReVirt cannot replay only the unaffected good operations • ReVirt cannot do forward correction • ReVirt cannot do local anesthesia • ReVirt cannot quarantine infection
The state-of-the-art: (3) Process checkpointing • Per-process checkpointing: Flashback (and Rx) can checkpoint the whole state of a process at time t in RAM • Checkpoint-able ≠ recoverable • Flashback cannot detangle bad operations from good ones within the same process • Flashback cannot track taint-propagation channels • Flashback cannot do forward correction • Flashback cannot quarantine infection
The state-of-the-art: (4) Process migration • Process migration: • A Pod is a group of processes “tangled” with each other • Zap can migrate a Pod from machine A to B • Migrate-able ≠ recoverable • Zap cannot detangle bad operations from good ones • A partially infected Pod has to be totally “discarded” • Zap cannot track taint-propagation • Zap cannot do forward correction
The state-of-the-art: (5) Memory-preserving subsystem recovery with full-body anesthesia • Taser can do memory-preserving recovery; however: • It is not full-system recovery: it can only repair file systems • Taser requires full-body anesthesia • Taser cannot quarantine infection • Taser cannot do on-the-fly surgeries • Compared with our blueprint, Taser cannot do: • Remote surgeries • Nested recovery • Replicated recovery • Non-stop recovery
Our goal • Do memory-preserving, self-recoverable, non-stop enterprise computing: • Fine-grained recovery surgeries • Forward correction • Keep good “memory” in a consistent way • Remove bad “memory” • Local “anesthesia” • Quarantine infection during recovery • Transparent to uninfected processes
Challenges: Multi-Granularity Recovery • Machine-level recovery • Processes are usually “tangled” with each other • It is not hard to checkpoint a VM, but • It is hard to detangle bad operations from good ones • Pod-level recovery • Zap can checkpoint and migrate a Pod, but • It is hard to do detangling • A partially infected Pod has to be totally “discarded” • Process-level recovery • A partially infected process has to be totally “discarded” • Need to track taint-propagation channels • Operation-level recovery: the desired granularity • Need fine-grained surgeries inside the “body” of a VM • Very hard to do selective replay or migration • Tough tradeoffs between recoverability and consistency
Outline • Recovery angle of enterprise health care • The recovery problem • The state-of-the-art • Our goal • Year-by-year plan: overview • Year one plan: zoomed-in
Recovery Services: Roadmap
Initial Capability
• Focus: processes, files
• Logger: VMM based
• Atomicity: per-process
• Dependency analysis
• Quarantine via VMM
• Roll-forward correction
Silver Capability
• Holistic recovery
- Sockets, shared memory, DBMS, attributes, …
- Control dependencies: process forking, workflows
• Remote “surgeries”
- EHCC sends instructions to remote surgery agents
Gold Capability
• Nested recovery
- Intra-process checkpointing
- Nested transactions
Platinum Capability
• Replicated recovery
- Heterogeneous VM replica
- Standby VM
Diamond Capability
• Non-stop recovery
- Transparent switching
- Stateful migration
New recovery capabilities: advanced ones • New capabilities can • Provide transactional atomicity & consistency • Do non-stop warm-start or hot-start recovery • Perform remote surgeries • Do intra-process checkpointing • Do nested recovery within a process • Do heterogeneous VM replication • Construct standby VM • Do stateful recovery-driven process migration • Side benefits: • Improved observation/inspection capability • Improved diagnosis/forensics capability • Improved detection capability
Outline • Recovery angle of enterprise health care • The recovery problem • The state-of-the-art • Our goal • Year-by-year plan: overview • Year one plan: zoomed-in
Year one: Initial Capability • Scope: local health care • Focus: app processes, files • Logging: VMM based • Atomicity: per-process • Dependency analysis based detangling • Local anesthesia via host kernel • Quarantine via VMM • On-the-fly, roll-forward correction
System architecture
[Figure: system architecture diagram. Labels, in extraction order: App A, App B, Display, process, Stack, Timer, Log, Heap, Dependency Analyzer, Keyboard, Task structure, Guest OS, Guest OS, Ports, CPU, VMM auditor, Quarantine, Task structure, Roll-Forward Correction Instruction Generator, Disks, Hook, Cache, Surgery Agent, Host Kernel, Drivers]
Why run “patient” systems in a VM? • Enhanced security • App processes are isolated in a separate VM • The host kernel does not directly interact with the app processes • Although any component of a “patient” system may be compromised, the host kernel is quite safe • The audit and recovery code is well protected • Enhanced observation/inspection capability • Much easier to do local anesthesia • Much easier to quarantine • Much easier to perform repair surgeries • Downside: performance degradation
Year one work plan • Team 1: QEMU-based implementation • Team 2: UML-based implementation • Each team has two graduate students • Goal of each implementation: • Phase I: be able to do incremental logging • Phase II: be able to do damage assessment and detangling • Phase III: be able to perform on-the-fly repair “surgeries”
Phase I: do incremental logging • VM state-procurement techniques have recently been proposed, but • If checkpoints are taken frequently → too much overhead • If checkpoints are taken infrequently → “memory” loss • A better idea is to log only the changes • Any operation could change the state • If we log every state change → too much • So which changes do not need to be logged? • Are we able to log all changes? • A QEMU-based CPU emulator can log every change • A UML-based logger can log every trap to the OS
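The trade-off on this slide (full checkpoints versus logging only the changes) can be illustrated with a toy change log. This is a sketch under simplifying assumptions, not the QEMU or UML logger: state is a dictionary of component values, entries are recorded in time order, and the `ChangeLog` class and its method names are hypothetical.

```python
class ChangeLog:
    """Keep one base checkpoint plus a log of the changes made after it."""

    def __init__(self, base_state):
        self.base = dict(base_state)      # full checkpoint, taken once
        self.current = dict(base_state)   # latest known value of each component
        self.entries = []                 # (time, component, new_value), in time order

    def record(self, time, component, value):
        """Append a log entry only if the component's value actually changed."""
        if component in self.current and self.current[component] == value:
            return False                  # no state change, nothing to log
        self.current[component] = value
        self.entries.append((time, component, value))
        return True

    def state_at(self, time):
        """Rebuild the state as of `time` by replaying the log onto the base."""
        state = dict(self.base)
        for t, component, value in self.entries:
            if t > time:
                break
            state[component] = value
        return state

# Hypothetical components; times are hours since midnight.
log = ChangeLog({"heap": 0, "file_a": "v1"})
log.record(9.0, "heap", 1)
log.record(9.2, "heap", 1)        # same value again: skipped
log.record(9.5, "file_a", "bad")  # the 9:30am poisoning
log.record(10.0, "heap", 2)

print(log.state_at(9.25))
print(len(log.entries))
```

Because any point in time can be rebuilt from the base checkpoint and the log, the state just before 9:30am is recoverable without having taken a full checkpoint at that moment.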
Phase II: do damage assessment • The goal is to detangle tainted operations from untainted operations • Dependency analysis is required in order to do detangling • We have built various kinds of dependency graphs for data processing systems • We will extend these graphs to capture the taint-propagation channels in a VM • Fine-grained VM information flow analysis techniques have recently been proposed • Although their purpose is intrusion detection, they may be applied to serve our recovery purposes
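A minimal sketch of dependency-based detangling follows. It assumes a much simpler model than a real VM: each logged operation is a tuple of an identifier, the set of components it read, and the set it wrote, and taint flows from any tainted input to every output. The operation names and the function `assess_damage` are hypothetical, and the sketch ignores cases such as a clean overwrite of a tainted component.

```python
def assess_damage(operations, initial_taint):
    """Propagate taint through a log of operations in execution order.

    operations: iterable of (op_id, reads, writes), where reads and writes
                are sets of component names
    initial_taint: components known to be poisoned by the attack
    Returns (tainted_components, tainted_operation_ids).
    """
    tainted = set(initial_taint)
    bad_ops = []
    for op_id, reads, writes in operations:
        if tainted.intersection(reads):
            # The operation consumed tainted data, so its outputs are tainted too.
            bad_ops.append(op_id)
            tainted.update(writes)
    return tainted, bad_ops

# Hypothetical log of operations executed after component x was poisoned.
ops = [
    ("A.read_x",  {"x"},                {"heap_A"}),
    ("B.compute", {"file_b"},           {"heap_B"}),
    ("A.write_f", {"heap_A"},           {"file_f"}),
    ("C.read_f",  {"file_f", "heap_B"}, {"heap_C"}),
]

tainted, bad_ops = assess_damage(ops, {"x"})
print(sorted(tainted))
print(bad_ops)
```

In this example the taint reaches process C through `file_f`, while `B.compute` and `heap_B` stay clean and would not need to be rolled back.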
Phase III: perform on-the-fly repair surgeries • How to do local anesthesia? • Let the host kernel not schedule any tainted process that is under surgery • How to quarantine? • Let the VMM enforce the quarantine policy • Two-way quarantine: the tainted components are totally contained • One-way quarantine: a tainted process may access a version of an untainted component, but not vice versa • How to do forward correction surgeries? • Naïve idea: replace the state value of every tainted component with a clean version • Real challenge: how to keep consistency among the clean versions of, e.g., 50 components
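The two quarantine strategies can be read as an access-control decision, sketched below. This is one possible interpretation rather than the VMM's actual policy engine: the function name and the policy strings are hypothetical, and the sketch assumes that accesses entirely inside or entirely outside the quarantine are always permitted.

```python
def access_allowed(policy, subject_tainted, object_tainted):
    """Decide whether a process (subject) may access a component (object).

    policy: "two-way" or "one-way"
    subject_tainted / object_tainted: whether each side is inside the quarantine
    """
    if policy not in ("two-way", "one-way"):
        raise ValueError(f"unknown quarantine policy: {policy!r}")
    if subject_tainted == object_tainted:
        # Assumption: both sides are on the same side of the quarantine boundary.
        return True
    if policy == "two-way":
        # Tainted components are totally contained: nothing crosses the boundary.
        return False
    # One-way: a tainted process may access (a version of) an untainted component,
    # but an untainted process may not access a tainted component.
    return subject_tainted

for policy in ("two-way", "one-way"):
    print(policy,
          "tainted->clean:", access_allowed(policy, True, False),
          "clean->tainted:", access_allowed(policy, False, True))
```

Under one-way quarantine a tainted process can keep running against versions of clean data, while clean processes are still shielded from tainted components.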