Autonomic Recovery of Enterprise-wide Systems After Attack or Failure with Forward Correction
By Anup K. Ghosh, George Mason University
With Sushil Jajodia, GMU; Angelos Stavrou, GMU; Angelos Keromytis, Columbia University; Jason Nieh, Columbia University; Sal Stolfo, Columbia University; Peng Liu, Penn State University
July 10, 2008
Problem • The complexity of most enterprise-scale software systems precludes perfect reliability and invulnerability • Enterprise computing supports mission-critical functions including logistics, transportation, intelligence, & command and control -- the failure of which can have severe consequences • Enterprise-wide solutions must be engineered to account for failure and attacks against network servers and workstation clients
Objective Develop self-regenerative enterprise networks that recover and re-constitute themselves after attacks and failures • Develop a transaction-based model for commodity operating systems to determine where an attack occurred and what data or programs were altered, and to back out all such changes without affecting unrelated data/activities. • Automatically generate patches to make systems more robust after attack.
Approach • Develop an enterprise-wide approach to self-regenerative systems including: • Application-level resilience using error virtualization and rescue points (Columbia U) • Non-stop server resilience using virtualization and automatic feedback control to continuously provide servers with known high integrity, even after compromise (GMU) • Self-healing database to track damage, quarantine tainted records, and repair damage (PSU/GMU) • Journaling computer system for workstations to determine malicious actions and effects including tainted documents, programs, and network events (GMU) • System restore with correction to back out malicious changes (GMU) • Dynamic patching of applications to improve resiliency after attack (Columbia U)
Overview of Presentations • Application-level error virtualization & dynamic patching (Columbia U) • Self-healing database (Penn State) • Non-stop server system (GMU) • Journaling Computer System for desktop applications (GMU)
A Non-Stop Server using Automatic Feedback Control • Goal • Develop a Virtual Machine-based system for enterprise servers that provides non-stop computing services by compensating for faults and attacks against critical network servers • Feedback control system coupled with VM-based redundancy provides very high availability for imperfect software systems
State of Server Systems • Servers are very complex pieces of software • Complex -> buggy -> software failures & attacks • Servers provide mission-critical services for enterprise networks • Server failures result in substantial loss of revenue and productivity • Internet-facing servers tend to bear the brunt of attacks against enterprises (govt & commercial) • Current strategies to mitigate the risk of failing servers • Demand better software from vendors -> perfection is not realizable • The best current option is auto-patching, which introduces downtime • Provide hardware back-up redundancy • Subject to common-mode failures from incidental bugs & attacks • Large, emerging market for server consolidation via virtualization • Reduces TCO for maintaining many server boxes • Virtualized server farms are becoming the norm
Solution for Non-Stop Computer Servers • Diversify and replicate servers in virtual machines • Create a trustworthy controller (TC) that uses automatic feedback control to manage the state of servers • Hide details of server replication from clients • Revert servers to pristine condition on attack or corruption while continuing to provide service [Diagram: the TC receives sensor reports from virtual servers (VS) via their virtual server hosts (VSH), and issues action decisions and recommendations; a load balancer fronts the replicated servers]
Trustworthiness Controller (TC) • Sensors: intrusion sensors, anomaly detectors, integrity monitors, performance monitors, exposure time • Actuators: service restoration, terminate unauthorized processes, VM revert, client throttling/blocking • Control models: rules-based engine, learning-based state estimator [Diagram: sensors feed the TC's state estimator; its response selector drives the actuators]
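The sensor/actuator loop above can be sketched as a simple rules-based response selector. This is a minimal illustration, not the project's implementation: the report fields, thresholds, and action names are all assumptions chosen to mirror the sensor and actuator lists on this slide.

```python
from dataclasses import dataclass

@dataclass
class SensorReport:
    """One report from a virtual server's sensors (field names are illustrative)."""
    vs_id: str
    integrity_ok: bool       # integrity-monitor result
    anomaly_score: float     # 0.0 (normal) .. 1.0 (highly anomalous)
    uptime_s: float          # exposure time since the last revert

class TrustworthinessController:
    """Rules-based sketch of the TC loop: sensor reports in, actuator actions out."""
    ANOMALY_THRESHOLD = 0.8      # assumed cutoff for the anomaly detector
    MAX_EXPOSURE_S = 3600.0      # revert periodically even without alarms

    def select_action(self, r: SensorReport) -> str:
        if not r.integrity_ok:
            return "vm_revert"           # restore the pristine VM image
        if r.anomaly_score >= self.ANOMALY_THRESHOLD:
            return "throttle_clients"    # slow suspicious clients first
        if r.uptime_s >= self.MAX_EXPOSURE_S:
            return "vm_revert"           # cap exposure time
        return "none"
```

Reverting on exposure time alone is what lets the loop handle false negatives: even an undetected compromise is bounded by the revert interval.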
TC Testbed Setup [Diagram: a TC control station and TC GUI station manage three Apache servers (Apache00, Apache01, Apache02) behind a load balancer; the client connects on the 192.168.0/24 network, the servers sit on 10.0.0.0/16]
TC GUI • Visualize system state and dynamics • Passive: it receives state information from the TC controller and displays it; the GUI does not direct TC control • System View: • Shows the state of the server machine and the summarized state of VMs • VM View: • Shows the state of individual VMs
Withstanding Persistent DoS Attacks • At 1 attack per second: 92% of normal throughput • At 8 attacks per second: 60% of normal throughput
Revert Overhead • 8 measurements took one minute • Worst-case revert overhead = 12% (when reversion starts) • Throughput returns to 99% of normal in 30 sec (measurement 5)
Conclusion • TC is a closed-loop control architecture for intrusion detection and server defense • Servers are virtualized so that they can be reverted to a pristine state at low cost • The control loop triggers actuators in response to sensor inputs • Handles "false negatives," including zero-day exploits and ingenious stealthy attacks that evade detection • Handles false alarms automatically, without a human in the control loop • Addresses the problem of overwhelming "false positives"
Autonomic Recovery & Regeneration using Lightweight Virtualization • Objective: Develop self-regenerative enterprise networks that recover and re-constitute themselves after attacks and failures • Recover: bring the system back to an operational state • Regenerate: roll forward with correction to quarantine tainted processes and files & back out corrupted changes
Traditional Logging for Recovery • To be comprehensive, all system objects and activities need to be monitored, including processes, threads, inter-process communications, file system activities, signals, memory, and network and local sockets • The challenges lie in the number of activities to monitor and the amount of resulting information to log
Virtualization Technologies • Full and para-virtualization • A virtual machine (VM) acts like a complete system, equipped with its own OS and (virtual) hardware management • Lightweight virtualization • A VE, aka container, has its own file system space, process space, socket space, and network identity, but no guest OS and its ensuing overhead
Journaling Computing System • JCS executes applications in lightweight VEs, created on demand and started in pristine state • The host monitors; the VEs do the work • Monitoring focuses on the interactions among VEs, not VE-internal activities • A novel VE construction method allows VE integrity to be monitored with minimal effort (discussed later) • This drastically reduces the amount of activity of interest • Inter-VE interactions are abstracted as transactions; the high-level semantics of transactions further reduce the information that must be kept
JCS Host Diagram [Diagram: VEs 1 through N run on an OpenVZ kernel; the JCS kernel monitor captures syscalls as atomic transactions, which the transaction summarization engine records, via the VE manager, in the system journal]
The JCS Transactions • File transactions: information exchange through shared files • Socket transactions: limited to Inet sockets • Memory transactions • Transaction channels are tightly controlled • File-sharing setup cannot be changed from within VEs • Firewalls can be used to block illegal TCP connections
JCS Transactions Continued • Atomic transactions: lowest-level system events of interest (presently syscalls) • Summarized transactions: combine multiple atomic transactions into one (can be lossless or lossy) • Application-defined transactions: bring in application semantics • Ongoing research: causality-based summarization • Summarize the combined effects of multiple system calls as one transaction, without losing information needed for causality analysis
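Summarization as described above can be sketched by collapsing repeated atomic transactions on the same (VE, operation, object) triple into one record. The dictionary layout of a transaction here is an assumption for illustration; the slides do not specify the journal's record format.

```python
def summarize(atomic_txns):
    """Collapse atomic transactions that share a (VE, op, object) triple
    into one summarized transaction carrying a byte total and a count.
    Each atomic txn is assumed to look like:
        {"ve": id, "op": "write", "obj": path, "bytes": n}
    Byte totals are preserved, so this particular summary is lossless
    with respect to how much data moved, though not when it moved.
    """
    summary = {}
    order = []                       # keep first-seen order for readability
    for t in atomic_txns:
        key = (t["ve"], t["op"], t["obj"])
        if key not in summary:
            summary[key] = {"ve": t["ve"], "op": t["op"], "obj": t["obj"],
                            "bytes": 0, "count": 0}
            order.append(key)
        summary[key]["bytes"] += t.get("bytes", 0)
        summary[key]["count"] += 1
    return [summary[k] for k in order]
```

For example, three 4 KB writes by VE 101 to one shared file would become a single summarized transaction with count 3, shrinking the journal while keeping enough to answer "which VE touched which object."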
JCS Desktop • Clicking on icons or menu items creates a new VE to run the designated application • Each VE has its own file system, process space, local socket space, and network identity • It is like running applications in their own VMs, but without the overhead of full virtualization • Application windows are seamlessly integrated with the desktop • Directory sharing is set up for seamless work flow: see the example on the next slide
An Example of File Transactions through a Shared Directory • An application in a VE cannot see other apps/VEs • VEs are created on demand • A shared directory is used for file transactions [Diagram: the email VE saves the attachment Foo.doc into the shared directory; clicking on Foo.doc opens it in the office VE]
Analysis & Recovery Actions • An interface to allow the user to identify corrupted files, virus infections, bad URLs, ... • Intrusion/corruption detectors can be integrated to automate the above • Corruption propagation analysis button • Analysis of sensitive data leakage • Corruption source discovery button (bad applications, URLs, etc.) • Application self-healing button
Name Space Unification [Diagram: VE 101's file system is a union mount of branches under /vz: JCS (RO), Ubuntu (RO), the FFX-PS Firefox profile (RW, with .mozilla and download), and 101-dirties (RW); files such as bin/ls come from the read-only branches]
JCS VE • When a Firefox VE is created at /vz/101, the /vz/101 subtree becomes its entire file system (the /) • Applications in VE 101 see only the unified namespace; they cannot see individual branches • /vz/101-dirties serves as a "honey branch" [Diagram: the file system seen within the VE, with .mozilla, download, and bin/ls under /]
101-dirties: The "Honey Branch" • When a Trojan-horse version of the ls command is installed, it reveals itself in 101-dirties, the only branch that needs to be monitored [Diagram: the same union mount, with the trojaned bin/ls appearing in the 101-dirties branch while the pristine bin/ls remains in the read-only branch]
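The honey-branch property can be illustrated with a toy model of union-mount semantics. This is not unionfs itself, just a sketch of the lookup and copy-up rules the slides rely on: reads fall through read-only template branches, every write lands in the single read-write branch, so listing that branch alone reveals all modifications.

```python
class UnionModel:
    """Toy model of JCS name-space unification: read-only template branches
    unified under one read-write 'dirties' branch (the honey branch)."""

    def __init__(self, ro_branches):
        self.ro = ro_branches   # list of dicts {path: content}, highest priority first
        self.rw = {}            # the honey branch: starts empty in a pristine VE

    def read(self, path):
        if path in self.rw:                 # RW branch shadows the templates
            return self.rw[path]
        for branch in self.ro:
            if path in branch:
                return branch[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        self.rw[path] = content             # copy-up: RO branches are never touched

    def modified_files(self):
        return sorted(self.rw)              # everything changed inside the VE
```

A trojaned `ls` written inside the VE shadows the genuine one on lookup, yet the pristine copy survives in the read-only branch and `modified_files()` exposes the tampering, which is exactly why only the dirties branch needs monitoring.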
Disk Space Usage by Firefox VEs [Chart]
Memory Usage by Firefox VEs [Chart: memory in kilobytes; each VE is configured with 128MB of memory]
First Generation Prototype • Dell 2900 server: 8 cores, 16GB memory, 15,000 RPM SCSI hard drive • OS: 64-bit CentOS 5.1 (free version of Red Hat Enterprise Linux 5.1) • Kernel: 2.6.24-4 with OpenVZ and unionfs patches • 32 kprobes created to monitor file system and AF_INET socket activities • A relay channel sends probe reports to a user-space program (jcs-relay)
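A user-space consumer like jcs-relay would read fixed-size probe records from the relay channel and decode them. The record layout below is entirely hypothetical; the slides do not describe the actual wire format, so the field set (VE id, syscall number, return value, path) and sizes are assumptions for illustration.

```python
import struct

# Hypothetical probe record: VE id (u32), syscall nr (u32), return value (i64),
# then a 64-byte NUL-padded path. Little-endian, no padding.
RECORD = struct.Struct("<IIq64s")

def parse_records(buf: bytes):
    """Split a relay-channel buffer into a list of probe-report dicts."""
    records = []
    for off in range(0, len(buf) - RECORD.size + 1, RECORD.size):
        ve, nr, ret, path = RECORD.unpack_from(buf, off)
        records.append({
            "ve": ve,
            "syscall": nr,
            "ret": ret,
            "path": path.rstrip(b"\0").decode(),  # strip the NUL padding
        })
    return records
```

In the real prototype the buffer would arrive from the kernel's relay files under debugfs; here any `bytes` object with the assumed layout parses the same way.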
Demo • Scenario • Open the browser • Go to bad websites • Download some files, one of them malware (installer.exe) • Run installer.exe • Corrupt some files • Send sensitive data out (browsing history) • Demo • Show all corrupted files and give recovery instructions • Show all data leaked from corrupted processes • List possible IP sources of the malware
Movie 1: JCS System Startup • Show the startup of the VE manager • JCS kernel module installed • Relay channel opened • jcs-relay program running • System ready
Movie 2: Starting an Application • Click on a PDF file • VE manager (at the lower right of the screen) shows: Launching PDF Reader Container 105 • PDF file displayed from container 105
Movie 3: Online Browsing • Click on the Firefox icon on the desktop • VE manager shows: Launching Firefox Container 106 • Firefox window appears • Visit kernel.org and download the change log of recent kernel revisions • Visit a bad site (192.168.0.12, the server in the TC testbed) and download Installer.exe, the malware • Visit cnn.com and leave the browser there
Movie 4: Running Malicious Software • Click on the Terminal icon • VE manager shows: Launching Terminal Container 107 • Execute Installer.exe in the terminal • The user finds files on the desktop corrupted • The user sees a message demanding $100 for the decryption key
Movie 5: Analysis Engine • Run the analysis program with the name and directory of the malware as inputs • Wait until the program ends
Movie 6: Analysis Results • Gives the container (107) that executed the malware • Displays files that might have been leaked, and to what IP addresses, by VE 107 • Displays files written by VE 107, and recommends which files the user should recover to versions before what times • Shows that the malware was created by container 106, a Firefox container • Shows files in the shared directory (Desktop) that VE 106 had "touched" • Shows the suspected (IP) sources of the malware
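The analysis steps above amount to a taint walk over the transaction journal: start from the VE known to have run the malware, mark every object it wrote, then mark any VE that later read a tainted object, and so on. This is a minimal sketch of that propagation idea, with an assumed journal record layout, not the project's analysis engine.

```python
def taint_analysis(journal, bad_ve):
    """Forward taint walk over a time-ordered transaction journal.
    journal: list of {"ve": id, "op": "read"/"write", "obj": path}.
    A file becomes tainted when a tainted VE writes it; a VE becomes
    tainted when it reads a tainted file. A single forward pass suffices
    because the journal is in time order, so causality is respected.
    Returns (tainted VEs, tainted files)."""
    tainted_ves = {bad_ve}
    tainted_files = set()
    for t in journal:
        if t["op"] == "write" and t["ve"] in tainted_ves:
            tainted_files.add(t["obj"])
        elif t["op"] == "read" and t["obj"] in tainted_files:
            tainted_ves.add(t["ve"])
    return tainted_ves, tainted_files
```

Running the same walk backward from the malware file instead of forward would answer the source-discovery question (which container created it, from which IPs); the forward walk shown here answers the propagation and leakage questions.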
Ongoing Research • Applying lightweight virtualization to TC • More virtual web servers rotating • Shorter exposure time for each • Causality-based summarization • Capturing application semantics in transactions • Mission-specific summarization: for data recovery, intrusion analysis, ... • Mechanisms to capture memory transactions
Discussion aghosh1@gmu.edu