Authors: George W. Dunlap Samuel T. King Sukru Cinar Murtaza A. Basrai Peter M. Chen

ReVirt: Enabling Intrusion Analysis through Virtual Machine Logging And Replay Authors: George W. Dunlap Samuel T. King Sukru Cinar Murtaza A. Basrai Peter M. Chen Presentation by: Will Hrudey

Introduction • ReVirt is an intrusion analysis solution that facilitates post-attack analysis • ReVirt applies VM and fault-tolerance techniques to enable the Administrator to replay long-term, instruction-by-instruction execution of a computer system • ReVirt runs the target operating system (OS) and applications in a VM whose monitor (VMM) runs as a kernel module in a host OS, allowing: • Migration of logging from the target OS to the host OS below the VM • Playback of the target system’s execution before, during, and after an intruder compromises the system

Motivation • Improving the security of today’s computer systems is an urgent and difficult problem • The complexity and rapid change of software systems prevent developers from verifying their code to eliminate all vulnerabilities • Administrators have to routinely cope with computer break-ins • CERT Coordination Center reports a steady increase in the number of incidents handled and vulnerabilities reported over the past 4 years

Goals Solve two problems with current audit logging: • Improve the integrity of the logger because: • Existing loggers depend on the integrity of the OS • Attackers can disable, modify or delete system logs • Kernels are large and complex, so they tend to contain many bugs Solution: Encapsulate the target system within a VM and place logging below the VM • Improve the completeness of the logger because: • Existing loggers don’t save enough data to replay and analyze attacks, so the Administrator still has to guess what happened • Existing loggers can’t account for non-determinism Solution: Utilize checkpointing, logging and roll-forward recovery

Virtual Machines • A virtual-machine monitor (VMM) is a software layer that emulates the hardware of a complete computer system • The VMM creates an abstraction called a virtual machine (VM) • The host platform that the VMM runs on can be another OS (the host OS) or the bare hardware • So the VMM runs in a separate domain from the guest OS and applications • Although the VMM can still be compromised, it makes a better trusted computing base (TCB) than the guest OS due to its narrow interface and small size

Virtual Machines • The VMM interface is similar to the physical hardware, whereas the interface provided by a typical OS is much richer • The narrower interface restricts the actions available to an attacker, and the VMM's smaller code base is easier to verify • VMs can be classified by how similar they are to the host hardware • On one end, VMMs such as IBM VM/370 export an interface that is backward compatible with the host hardware. OSes and applications intended to run on the host platform can run on these VMMs without change • On the other end, language-level VMs like the Java VM export an interface completely different from the host hardware. These VMMs can run only OSes and applications written specifically for them

Virtual Machines • Different VM configurations (figure): Direct-on-Host (DoH) and OS-on-OS (OoO)

UMLinux • ReVirt uses UMLinux as the virtual machine • The VMM in UMLinux exports an interface similar but not identical to the host hardware • Custom optimizations for the VMM in the underlying OS increase speed • The virtual machine in UMLinux runs as a user process on the host • The guest OS and guest applications run inside this host user process • The guest OS uses host services (system calls and signals) as the interface to peripheral devices, hence the OS-on-OS architecture • The normal structure, in which target applications run directly on the host OS, is the Direct-on-Host architecture

UMLinux • VMM in UMLinux is a loadable kernel module in the host OS • Module is called before/after each signal and system call to/from the VM process • Most instructions executed within the VM execute directly on host CPU • Memory accesses are translated by the host’s MMU based on translations that are set up via the host OS’s memory system calls • A host X application displays console output and reads keyboard input • The VMM module maintains a virtual privilege level (VPL) • Set to kernel when transferring control to the guest kernel • Set to user when transferring control to a guest application

UMLinux • If the current VPL is kernel, the VMM knows the guest OS made the system call; it checks that it is a call the guest OS should be making, then passes it on to the host OS • If the current VPL is user, the VMM knows a guest application made the system call, and it sends a SIGUSR1 to the guest OS to notify it • The SIGUSR1 signal handler in the guest kernel is the equivalent of the system-call trap handler in a normal OS • SIGALRM, SIGIO, and SIGSEGV signals are used to emulate the hardware timer, I/O device interrupts, and memory exceptions • UMLinux emulates the enabling/disabling of interrupts by masking signals • The TCB comprises the VMM kernel module and the host OS
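The system-call routing described on this slide can be sketched as a small Python model. It is a toy illustration under stated assumptions, not UMLinux code: the class `ToyVMM`, its method names, and the contents of `ALLOWED_GUEST_KERNEL_CALLS` are hypothetical, since the slides only say that the VMM checks whether the call is one the guest OS should be making.

```python
from dataclasses import dataclass, field

KERNEL, USER = "kernel", "user"

# Hypothetical allowlist: the slides do not enumerate the permitted calls.
ALLOWED_GUEST_KERNEL_CALLS = {"read", "write", "mmap", "munmap"}


@dataclass
class ToyVMM:
    """Toy model of the VMM module's virtual privilege level (VPL) check."""
    vpl: str = KERNEL
    host_calls: list = field(default_factory=list)     # calls passed to the host OS
    guest_signals: list = field(default_factory=list)  # signals sent to the guest kernel

    def on_syscall(self, name: str) -> str:
        if self.vpl == KERNEL:
            # The guest OS made the call: validate it, then pass it on to the host.
            if name not in ALLOWED_GUEST_KERNEL_CALLS:
                return "denied"
            self.host_calls.append(name)
            return "forwarded"
        # A guest application made the call: notify the guest kernel instead.
        self.guest_signals.append("SIGUSR1")
        self.vpl = KERNEL  # control transfers to the guest kernel's handler
        return "redirected"

    def return_to_user(self) -> None:
        # Control transfers back to a guest application.
        self.vpl = USER
```

In this model a guest application's system call never reaches the host directly; it only causes a SIGUSR1 to be queued for the guest kernel, which plays the role of the trap handler.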

UMLinux

UMLinux Attacker strategies: • From above: DoH: The attacker can cause application processes to exploit any or all host OS functionality in dangerous ways. OoO: The attacker can take similar avenues to attack the guest OS; however, the VMM restricts the guest to fewer than 7% of the host's system calls, and the guest OS can access only a limited number of host files and devices • From below: DoH: The attacker can send dangerous network packets to the host to compromise lower levels of the protocol stack. OoO: Less of the host OS network stack is exposed to the same dangerous packets

Logging And Replaying • Logging is used to recover state • Start from a checkpoint of a prior state, then roll forward using the log • Most events are deterministic and needn’t be logged; however, any host system calls that can yield non-deterministic results must be logged • Non-deterministic events are categorized as either time or external input • Time refers to the point in the execution stream at which an event takes place • External input is data received from a non-logged entity (keyboard, mouse, etc.) • Output to peripherals does not affect the replay process • Log records are added and saved to disk in a manner similar to the Linux syslogd daemon • The program counter (PC) and the number of branches executed since the last interrupt are logged • New asynchronous virtual interrupts do not perturb VM process playback
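The checkpoint-plus-log idea can be illustrated with a minimal Python sketch. It is a toy model, not ReVirt's implementation: the guest is reduced to a single integer, and the functions `step`, `run_and_log`, and `replay` are hypothetical names. Only the non-deterministic part (when an external input arrived and what it contained) is written to the log; everything else is recomputed during roll-forward.

```python
import random


def step(state, external_input=None):
    """Deterministic transition: same state and input always give the same result."""
    state = (state * 31 + 7) % 1_000_003
    if external_input is not None:
        state ^= external_input
    return state


def run_and_log(checkpoint, n_steps, rng):
    """Original run: external inputs arrive at unpredictable points and are logged."""
    state, log = checkpoint, []
    for i in range(n_steps):
        value = None
        if rng.random() < 0.2:       # an external input happens to arrive now
            value = rng.randrange(256)
            log.append((i, value))   # record the time (step index) and the data
        state = step(state, value)
    return state, log


def replay(checkpoint, n_steps, log):
    """Roll forward from the checkpoint, injecting logged inputs at the logged points."""
    pending = dict(log)
    state = checkpoint
    for i in range(n_steps):
        state = step(state, pending.get(i))
    return state
```

Usage: `final, log = run_and_log(42, 1000, random.Random(1))` followed by `replay(42, 1000, log)` reproduces `final`, even though the log holds far fewer entries than the number of steps executed.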

Logging And Replaying • ReVirt goes through two phases to find the right instruction at which to deliver the original asynchronous virtual interrupt • In the 1st phase, the branch_retired counter is set to generate an interrupt after most of the logged branches have executed • The 2nd phase is needed to stop at exactly the right instruction • Replay can occur on any host whose processor type is similar to that of the original host • Most non-deterministic sources generate small amounts of log data • Received network messages can generate massive logs • Cooperative logging can reduce the amount of logged network data: the receiver doesn’t need to log message data because the sender can recreate it via replay • This requires cooperating computers to trust each other to regenerate the same message data during replay
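The two-phase search for the delivery point can be sketched in Python over a recorded instruction trace. This is a simplified simulation under stated assumptions: the function `find_delivery_index`, the trace representation, and the `margin` parameter (how many branches before the target the coarse phase stops) are hypothetical, and a real system would use a hardware counter and breakpoints rather than a list.

```python
def find_delivery_index(trace, target_pc, target_branches, margin=128):
    """Locate the instruction before which a logged interrupt must be delivered.

    trace: list of (pc, is_branch) pairs, with is_branch equal to 0 or 1.
    The logged point is identified by target_pc together with target_branches,
    the number of branches executed before reaching it. Returns an index into
    trace, or None if the point is never reached.
    """
    branches = 0
    i = 0

    # Phase 1 (coarse): run freely until only `margin` branches remain.
    coarse_stop = max(target_branches - margin, 0)
    while i < len(trace) and branches < coarse_stop:
        branches += trace[i][1]
        i += 1

    # Phase 2 (exact): examine each instruction until both the PC and the
    # branch count match the logged values.
    while i < len(trace):
        pc, is_branch = trace[i]
        if pc == target_pc and branches == target_branches:
            return i
        branches += is_branch
        i += 1
    return None
```

The PC alone is not enough because a loop revisits the same address many times; the branch count disambiguates which visit the interrupt belongs to. For a loop body `[(0x10, 0), (0x14, 0), (0x18, 1)]` repeated 1000 times, the point (PC 0x14, 500 branches) resolves to index 1501.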

Logging And Replaying Administrator tools used in understanding the attack: • Tools that run inside the guest VM to probe the VM state • edit files • list current processes, etc. • Tools that run outside the guest VM to analyze the state of a VM • X server • Debuggers • Disk analyzer, etc.

Experiments: Testbed • VM is configured to use 192 MB of physical memory • Virtual hard disk is stored on a raw disk partition

Experiments: Objective • Measure Virtualization Overhead: • Application runtimes within UMLinux vs. runtimes on the host OS • Evaluates 5 workloads with a warm cache averaged over 3 runs • Validate Correctness: • Micro-benchmarks run in the VM to verify virtual interrupts are being replayed at the same point at which they occurred during logging • Macro-benchmark verifies ReVirt faithfully plays back input from external systems • Measure Logging And Replaying Overhead • Quantify the time and space overhead of logging • Checkpoint overhead is not included • Attack Analysis • Exploit the ptrace race condition and verify replay

Experiments: Virtualization Overhead

Experiments: Logging / Replaying

Future Work • Make checkpointing faster and more convenient • Accelerate disk copy done during checkpointing • Enable the VMM to checkpoint a running VM • Reduce host OS size used to support UMLinux • Build higher level analysis tools to leverage ability to replay detailed, long-term executions • Move the X server into another VM • Use ReVirt as a building block for new security services • Cooperative logging in ReVirt?

Conclusion • ReVirt adopts VM and fault-tolerance techniques to enable replay of long-term, instruction-by-instruction execution to facilitate attack analysis • Target OS and applications run within the VM • ReVirt can replay execution before, during and after an intrusion • ReVirt logs all non-deterministic events so it can replay non-deterministic attacks and executions • ReVirt provides arbitrarily detailed observations about what transpired • ReVirt is implemented as a set of modifications to the host OS • ReVirt adds “reasonable?” time and space overhead

Observations • Total overhead for kernel-intensive workloads: up to 66% • Is this overhead justifiable? • Should have reported total overhead in tables for increased clarity • Checkpoint time and space overhead not characterized • Host OS can still be compromised • No quantitative data to support the claim that a narrower interface is more secure • Tests seem to focus on overhead rather than the ability to enable analysis • There are no specific tools to analyze potentially large ReVirt logs • Log growth could be much larger, since the SPECWeb99 benchmark was based on only 15 simultaneous connections • Replay must start from a powered-off VM state; is this practical? • How portable is ReVirt to other guest/host OSes? • “No perceptible time overhead” is a weak measurement. Better metric? • No multiprocessor support as of publication in late 2002

Discussion • The authors state that they “believe that even an overhead of 58% is not prohibitive for sites that value security.” (p11) I believe that an overhead of 58% is pretty big, especially for busy systems. How much of a concern is this really? • They show the average space/day logging takes. But does this include the daily snapshot as well? If you're running a lot of guest OSes concurrently, couldn't this become a bottleneck (or does ReVirt only run one guest OS at a time)? They give results for both virtualization overhead and logging overhead, but not both at the same time (which is the real-world scenario). Is there any indication as to how much the total overhead is?

Discussion • The authors talk about checkpointing in a few areas of the paper. They claim it will be a rare event and so do not test the time and space overhead to run one. They then say that their future work is to “make checkpointing faster and more convenient.” I wonder how slow and inconvenient checkpointing is at this point for them to avoid testing it (or releasing the test results)? I think this should have been included in the paper as, even though checkpointing may not happen often, it is still part of the system overhead.

Discussion • If ReVirt detects the non-deterministic events that occurred during the attack, what can it do to prevent further attacks? Is it possible to isolate them? • Is UMLinux the only guest OS that can be used in ReVirt? Have any other OSes been ported to ReVirt? And what is the state of development of ReVirt or systems like it?

Discussion • The authors introduce ReVirt to address two shortcomings of current systems: integrity and completeness. They state that the "current system loggers lack integrity because they assume the operating system kernel is trustworthy." However, they also indicate that "even the VMM may be subject to security breaches," but that the VMM is more trustworthy than the operating system because the interface is narrower. Does a narrower interface really make that much of a difference in securing the system? Can't attackers still do a lot of damage?

Discussion • They talk about how this approach is useful in analyzing an attack, and in section 5.4 give an example of this. But to do so they introduced a vulnerability and then used the logging method to analyze an attack that they themselves initiated. While the example may have some validity, it would have been nice to see something that they didn't set up themselves.

Discussion • Cooperative logging is cited as being capable of significantly reducing storage, as no LAN data needs to be logged (it can just be regenerated); however, you lose the ability to replay independent machines without replaying the whole network (or so it seems). Are there any schemes that let you do both? • They use a modified version of Linux 2.4.18 as the host OS. I’m wondering how modified it is? They claim that the host OS is safe from attack, but because it is still just an ordinary OS, I’m not sure about this. What do you think?

Discussion • ReVirt logs all input from external devices. Could these logs be used to pick up passwords from keyboard input or other security input (e.g. fingerprint readers and files from memory sticks)? • "ReVirt log all input from external entities. These include most virtual devices: keyboard, mouse, network interface card, ..." When we want to analyze the intrusion of a heavily used web server, logging all input from the network device seems quite expensive (I believe it would be much more than the 1.4 GB/day shown in the experiment). Is there any solution for that?

Discussion • So does it make more sense to add this VM layer just so we can track, or is it just easier? (i.e. what are the arguments for not having a VM layer?) • When they used ReVirt to analyze an attack, they only tested it with one attack. I think a broader range of attacks should have been tested to get an accurate account of what ReVirt can do. What do you think about this? • What kind of analysis tools do the authors suggest or provide? They were able to find an error, but only when they themselves knew exactly what they were looking for.

Discussion • In section 4.4, the paper mentions alternative architectures for logging and replay. Basically, they compare the OS-on-OS structure with the direct-on-host structure. How about a direct-on-VMM structure? Does removing the host OS improve the performance and stability of ReVirt? • In section 6, the paper compares hypervisors with ReVirt and argues that they target different goals. However, since hypervisors already have similar logging functionality, why not design ReVirt as a plugin (i.e. a special VM) for some hypervisor?

Discussion • Is there some other way to improve security that does not involve loading the VMM as a kernel module? • The guest doesn't run X itself, but rather connects to a remote X server (say on the host). Doesn't this introduce a hook that a malicious user could use to gain access to (or at least destabilize) the host?

Discussion • Why does ReVirt have only a single disk checkpoint which is the virtual machine being powered off? Why did they not think to add in other checkpoints? Why did they "envision checkpointing being a rare event?" Is this because they don't see their system being attacked more frequently than that?
