RAMSES ( Regeneration And iMmunity SErviceS ): A Cognitive Immune System

RAMSES (Regeneration And iMmunity SErviceS): A Cognitive Immune System • Mark Cornwell, James Just, Nathan Li, Robert Schrag (Global Infotek) • R. Sekar (Stony Brook University)

Outline • Goals • Approach Overview • Status and Next Steps • Memory Errors • Taint Recognition • Questions

Project Goals • Prevent most attacks from causing damage • Cover a wide range of attacks • Work on blackbox COTS applications with modest performance overheads • Refine response to preserve availability • Reduce performance impact of unsuccessful attacks by filtering out future attack instances • Filters (“signatures”) should be deployable across different instances of an application • Input filters (network layer or close) • Output filters (at well-known APIs) • Don’t require “deep” instrumentation

Attack Coverage • [Chart: CVE Vulnerabilities (Ver. 20040901); labeled categories: (Stack-smashing, heap overflow, integer overflow, data attacks), Generalized Injection Attacks]

RAMSES Project Schedule • Baseline Tasks • 1. Refine RAMSES Requirements • 2. Design RAMSES • 3. Develop Components • 4. Integrate System • 5. Analyze & Test RAMSES • 6. Coordinate & Rept • Prototypes • Optional Tasks • O.3 Cross-Area Exper • [Gantt chart: timeline from Q3 CY06 to Q3 CY09, with markers 1, 2, 3; dated 21 June 2007]

RAMSES Overview • RAMSES Components • Event Collector • parse/decode/normalize HTTP requests, parameters, cookies, … • Attack Detector • Address-space randomization • Taint-based policies, anomalies • Filter Generator • Output filter • Input filter • Key research problems • Learn taint propagation • Identify tainted components in output, generate filtering criteria • Learn input/output transformation • Use transformation to project output filters to input • [Diagram: Internet; Network/App Firewall (e.g. mod_security); Protected System with Web Server (IIS/Apache), Web App (PHP/ASP), SQL Database (MySQL); RAMSES Interceptors at Network DLLs, OS DLLs, Application DLLs]

Instrumentation • Instrument important APIs • Uses the Microsoft Detours framework to intercept DLL calls • No need for source code or semantics of application-specific data structures • No need for complex analyses or transformations on binaries • Instrumentation will support • Logging of relevant operations • Including calling context, parameters and return values • Interposition of filter functions • Including injection of failure returns at appropriate points to ensure error recovery

Event Collection • Apply further processing for widely used, standardized APIs (e.g., HTTP) • Parse into components • Request type, URL, form parameters, cookies, … • Exposes more of protocol semantics to learning and filtering algorithms • Normalize formats to avoid effect of various encoding schemes • To cope with evasion techniques • To ensure accuracy of taint-learning
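The parsing and normalization steps on this slide can be sketched as follows. This is a minimal illustration using Python's standard library, not the RAMSES event collector; the helper name `normalize_request` and the sample request are assumptions made for the example.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qsl, unquote, urlsplit


def normalize_request(request_line, headers):
    """Split one HTTP request into normalized components (hypothetical helper)."""
    method, target, _version = request_line.split(" ", 2)
    parts = urlsplit(target)
    cookies = SimpleCookie()
    cookies.load(headers.get("Cookie", ""))
    return {
        "method": method.upper(),
        # Percent-decode so that "%69" and "i" look the same to later stages.
        "path": unquote(parts.path),
        # parse_qsl decodes both percent-escapes and "+" in form parameters.
        "params": parse_qsl(parts.query, keep_blank_values=True),
        "cookies": {name: morsel.value for name, morsel in cookies.items()},
    }


event = normalize_request(
    "get /vote.php?name=al%69ce&vote=345 HTTP/1.1",
    {"Cookie": "testcookie=1"},
)
print(event["params"])
```

Decoding before learning means an encoded variant of a value and its plain form map to the same parameter string, which is the property the slide asks for to cope with evasion and to keep taint-learning accurate.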

Memory Error Detection • Base technology (ASR) developed in DAWSON project (SRS Phase I) for Windows XP • Enhanced with: • Multiple sensors for earlier detection • Analysis of memory for vulnerability characterization • Integrated with signature generation

Taint-based Detection • Taint-tracking: • Identify parts of output “directly controlled” by untrusted input • Taint-enhanced policy enforcement: • Policies based on output value as well as taint, e.g., “No tainted semicolons in SQL query argument” • Taint-enhanced anomaly detection: • Detect anomalous structure/content of output, e.g., “tainted component too long, contains too many non-alphabetic chars” • [Diagram: Input Interface: $name=$_GET['name'] (attacker injects malicious input data); Program: $query = "SELECT price FROM products WHERE name='" . $name . "'" (attacker-provided data propagated in program); Security-Sensitive Operations: sql_query($query) (attacker-provided data used as argument to corrupt system/data)]
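The policy “No tainted semicolons in SQL query argument” can be sketched as a check over the tainted spans of a query. This is a minimal illustration that assumes the tainted spans are already known; the function name and the span representation (start, end offsets) are hypothetical.

```python
def violates_policy(query, tainted_spans, forbidden=";"):
    """Return True if any forbidden character falls inside a tainted span.

    tainted_spans is a list of (start, end) offsets into query, end exclusive.
    """
    return any(
        ch in forbidden
        for start, end in tainted_spans
        for ch in query[start:end]
    )


def build(name):
    # Mirrors the slide's query construction; returns the query and taint span.
    prefix = "SELECT price FROM products WHERE name='"
    query = prefix + name + "'"
    return query, [(len(prefix), len(prefix) + len(name))]


benign_query, benign_spans = build("abcd")
attack_query, attack_spans = build("z'; UPDATE products SET price=0 WHERE name='abcd")

print(violates_policy(benign_query, benign_spans))  # benign input
print(violates_policy(attack_query, attack_spans))  # injected statement
```

The check looks only at characters the attacker controls, so a semicolon written by the application itself would not trigger it.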

Learning-based Taint Inference • Off-line learning from logs • Identify dataflow from input to output • Compare each output with recent inputs • Search can be narrowed if related operations are performed by the same thread • Learn an FSA for taint-marking • Given an output, can quickly identify which portions are tainted • Online • Match outputs using learned FSA • Double-check with input to verify (optional)
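The off-line comparison of each output with recent inputs from the same thread can be sketched like this. It is a simplified illustration: exact substring search stands in for the approximate matching discussed on later slides, and the event tuple layout and window size are assumptions.

```python
from collections import defaultdict, deque


def infer_taint(events, window=8):
    """events: iterable of (kind, thread_id, data), kind is "input" or "output".

    Returns (output, start, end) for each recent same-thread input value
    found inside an output.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    findings = []
    for kind, thread_id, data in events:
        if kind == "input":
            recent[thread_id].append(data)
            continue
        # Only inputs seen on the same thread are candidates for this output.
        for value in recent[thread_id]:
            at = data.find(value) if value else -1
            if at != -1:
                findings.append((data, at, at + len(value)))
    return findings


log = [
    ("input", 1, "alice"),
    ("input", 2, "bob"),
    ("output", 1, "INSERT INTO namevote VALUES('alice',345);"),
]
findings = infer_taint(log)
print(findings)
```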

Identifying Dataflow • In most applications, parts of input directly copied into output, with some minor modifications • deletion, e.g., spaces • modification, e.g., upper to lower case • insertion, e.g., escape special chars • more complex transformations aren’t easily handled by a learning technique • Given strings I and O, we need to answer one of: • Problem P0: Does I equal O, possibly with some changes? • Problem P1: Does I occur as a substring of O, possibly with minor modifications? • Problem P2: Do substrings of I occur within O, possibly with minor modifications?

Algorithms for P0 to P2 • P0 is the approximate string matching problem • Dynamic programming technique yields O(|I|*|O|) algorithm for weighted approximate matching • Closely related problems: • Longest common subsequence • Global alignment • Weighted edit distance
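For reference, the dynamic-programming approach to P0 can be sketched as a weighted edit distance. This is the textbook O(|I|*|O|) recurrence, kept to two rows of the table; the weight parameters and their defaults are assumptions for illustration.

```python
def weighted_edit_distance(i_str, o_str, w_ins=1.0, w_del=1.0, w_sub=1.0):
    """Minimum total weight of edits turning i_str into o_str."""
    # prev[b] = cost of turning the empty prefix of I into O[:b] (insertions).
    prev = [b * w_ins for b in range(len(o_str) + 1)]
    for a in range(1, len(i_str) + 1):
        cur = [a * w_del] + [0.0] * len(o_str)
        for b in range(1, len(o_str) + 1):
            sub = 0.0 if i_str[a - 1] == o_str[b - 1] else w_sub
            cur[b] = min(
                prev[b] + w_del,     # drop a character of I
                cur[b - 1] + w_ins,  # insert a character of O
                prev[b - 1] + sub,   # match or substitute
            )
        prev = cur
    return prev[-1]


print(weighted_edit_distance("Alice", "alice"))
```

Lowering `w_sub` for case changes, or `w_del` for spaces, is one way to encode the "minor modifications" listed on the previous slide.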

Algorithms for P0 to P2 • P1 is the approximate substring matching problem • A very simple change to the algorithm for P0 yields a solution for P1 • So, the problem still has O(|I|*|O|) runtime • P2 is the common approximate substring problem • Again, solvable with a minor tweak on the algorithm for P1 • Local alignment [Smith-Waterman '81] • Difficulty: results very sensitive to weights • So, we focus on P1, since we have already parsed inputs into parameters, cookies, etc.
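The "very simple change" for P1 can be illustrated with the standard approximate substring matching recurrence: the first row of the table is set to zero, so a match of I may start at any position in O for free. The sketch below uses unit weights; the function name is hypothetical.

```python
def best_substring_match(i_str, o_str):
    """Return (cost, end) for the cheapest approximate occurrence of I in O.

    end is the offset in O just past the matched substring.
    """
    # Row 0 is all zeros: the empty prefix of I matches anywhere in O at no cost.
    prev = [0] * (len(o_str) + 1)
    for a in range(1, len(i_str) + 1):
        cur = [a] + [0] * len(o_str)
        for b in range(1, len(o_str) + 1):
            sub = 0 if i_str[a - 1] == o_str[b - 1] else 1
            cur[b] = min(prev[b] + 1, cur[b - 1] + 1, prev[b - 1] + sub)
        prev = cur
    # The best match may end anywhere in O, so take the minimum of the last row.
    end = min(range(len(o_str) + 1), key=prev.__getitem__)
    return prev[end], end


query = "SELECT price FROM products WHERE name='abcd'"
cost, end = best_substring_match("abcd", query)
print(cost, query[end - 4:end])
```

The table is the same size as for P0, which is why the runtime stays at O(|I|*|O|).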

Practical Issues with Algorithms for P1 • Some outputs can be very long • E.g., HTML outputs of server • Need faster algorithms than O(|I|*|O|) • Idea: use O(|O|) algorithm to find likely places for start of matching substring • Some inputs can be too short • Leads to too many matches • Need to compute some measures of statistical significance of a match
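One possible way to find likely start positions in a single pass over O is q-gram voting, sketched below. The slide does not say which O(|O|) algorithm is intended, so this technique, the function name, and the parameters `q` and `min_hits` are all assumptions. The pass is linear in |O| only when q-grams rarely repeat within I.

```python
from collections import Counter


def candidate_starts(i_str, o_str, q=3, min_hits=2):
    """Offsets in O where an approximate occurrence of I plausibly begins."""
    # Index every q-gram of I by its offset(s) within I.
    grams = {}
    for k in range(len(i_str) - q + 1):
        grams.setdefault(i_str[k:k + q], []).append(k)
    votes = Counter()
    # One pass over O: each shared q-gram votes for the start it implies.
    for pos in range(len(o_str) - q + 1):
        for k in grams.get(o_str[pos:pos + q], ()):
            votes[max(0, pos - k)] += 1
    return sorted(start for start, hits in votes.items() if hits >= min_hits)


command = "echo 'hi' | gpg -r x@yz.com -h sekar"
starts = candidate_starts("x@yz.com", command)
print(starts)
```

Only the returned offsets would then be handed to the quadratic matcher. Inputs shorter than `q` produce no candidates at all, which is one blunt way to sidestep the "inputs too short" issue; it is not a substitute for the significance measure the slide calls for.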

Taint-marking FSA • Given a set O_C of possible outputs in a context C, an FSA F_C that • accepts all strings in O_C • has edges annotated to indicate if the corresponding input symbol is tainted • Limit size of O_C using as much execution context as possible • capture calling context of output function (includes all activation records on stack) • simplifies learning: often there is just one element in O_C for a given set of inputs

Learning Taint-marking FSA • Suitable algorithms depend on lexical structure of output language • SQL, shell commands, … • This info could be specified externally • Our initial design based on general characteristics of command languages • Character classes: alphabetic, numeric, upper/lower case, separators, special chars,… • Leverage properties of tainted/untainted parts of output • Tainted components are from input, hence highly variable: generalize quickly from examples • Untainted components are from web app, likely static: generalize only if necessary

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … ', with edges labeled abcd and xyz filling the two gaps]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … ', with edges labeled abcd and xyz filling the two gaps]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … '; name gap labeled abcd; brand gap branches on x or y, followed by yz]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • SELECT price FROM products WHERE name='abcd' AND brand='uvw' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … '; name gap labeled abcd; brand gap branches on x or y, followed by yz]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • SELECT price FROM products WHERE name='abcd' AND brand='uvw' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … '; name gap labeled abcd; brand gap generalized to edges labeled a…z]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • SELECT price FROM products WHERE name='abcd' AND brand='uvw' • SELECT price FROM products WHERE name='defg' AND brand='uvw' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … '; name gap labeled abcd; brand gap generalized to edges labeled a…z]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • SELECT price FROM products WHERE name='abcd' AND brand='uvw' • SELECT price FROM products WHERE name='defg' AND brand='uvw' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … '; both the name gap and the brand gap generalized to edges labeled a…z]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • SELECT price FROM products WHERE name='abcd' AND brand='uvw' • SELECT price FROM products WHERE name='defg' AND brand='uvw' • SELECT price FROM products WHERE name='ijkl' • [FSA diagram: literal path SELECT price FROM products WHERE name=' … ' AND brand=' … '; both the name gap and the brand gap generalized to edges labeled a…z]

Learning Taint-marking FSA • SELECT price FROM products WHERE name='abcd' AND brand='xyz' • SELECT price FROM products WHERE name='abcd' AND brand='yyz' • SELECT price FROM products WHERE name='abcd' AND brand='uvw' • SELECT price FROM products WHERE name='defg' AND brand='uvw' • SELECT price FROM products WHERE name='ijkl' • [FSA diagram: SELECT price FROM products WHERE name=' a…z ' now either ends the query or continues with AND brand=' a…z ']
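The generalization walked through on these slides can be approximated with a small learner that keeps untainted text literal and widens tainted gaps to a character class. This is a rough sketch, not the RAMSES algorithm: it emits regular expressions rather than an explicit FSA, it generalizes a gap after a single example (the slides wait for a second value), and the class ladder in `CLASSES` is an assumption.

```python
import re
import string

# Character classes tried from narrowest to widest (an assumed ladder).
CLASSES = [
    (string.digits, "[0-9]"),
    (string.ascii_lowercase, "[a-z]"),
    (string.ascii_letters, "[A-Za-z]"),
    (string.ascii_letters + string.digits, "[A-Za-z0-9]"),
]


def generalize(values):
    """Smallest class from CLASSES covering every observed tainted value."""
    chars = set("".join(values))
    for alphabet, pattern in CLASSES:
        if chars <= set(alphabet):
            return pattern + "*"
    return "[" + re.escape("".join(sorted(chars))) + "]*"


def learn_templates(samples):
    """samples: list of (output, tainted values in order of appearance)."""
    groups = {}
    for output, tainted in samples:
        skeleton, pos = [], 0
        for value in tainted:
            at = output.index(value, pos)
            skeleton.append(output[pos:at])  # untainted text stays literal
            pos = at + len(value)
        skeleton.append(output[pos:])
        groups.setdefault(tuple(skeleton), []).append(tainted)
    templates = []
    for skeleton, rows in groups.items():
        parts = [re.escape(skeleton[0])]
        for k, literal in enumerate(skeleton[1:]):
            parts.append("(" + generalize([row[k] for row in rows]) + ")")
            parts.append(re.escape(literal))
        templates.append(re.compile("".join(parts)))
    return templates


two = "SELECT price FROM products WHERE name='{}' AND brand='{}'"
one = "SELECT price FROM products WHERE name='{}'"
samples = [
    (two.format("abcd", "xyz"), ["abcd", "xyz"]),
    (two.format("abcd", "yyz"), ["abcd", "yyz"]),
    (two.format("abcd", "uvw"), ["abcd", "uvw"]),
    (two.format("defg", "uvw"), ["defg", "uvw"]),
    (one.format("ijkl"), ["ijkl"]),
]
templates = learn_templates(samples)
print([t.pattern for t in templates])
```

The two resulting templates correspond to the two paths in the final diagram: one ending after the name clause and one continuing with the brand clause.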

Taint Inference Vs Taint-tracking • Disadvantages of learning • False negatives if inputs transformed before use • Low likelihood for most web apps • False positives due to coincidence • Mitigated using statistical information • Plan to evaluate these experimentally • Benefits of learning • Low performance overhead • Some significant implicit flows handled without incurring high false positives • Can address multi-step attacks where tainted data is first stored in a file/database before use • More generally, in dealing with information flow that crosses module boundaries

Filter Types • Filter Criteria • Correlative filters • Equality-based filter • Structure-based filter • Statistical filter • Causal filters • Filtering criteria derived from attack detection criteria (policy or anomaly) • Filter Location • Input filter • Easier to deploy but harder to synthesize • Output filter (precedes sensitive operation) • Easier to synthesize than input filter, but deployment needs deeper instrumentation • May be too late for some attacks (memory corruption) • Note: All filters evaluated using large number of benign samples and 1 attack sample

Output Filters • Use taint-marking FSA to identify tainted components of output • Attack-independent signature component • Structural filter: FSA match failure • Due to structural changes associated with SQL and command injection • Example: taint-marking regular expression SELECT price FROM products WHERE name=\'[a-z]*\' doesn’t match attack output SELECT … WHERE name=\'z\'; UPDATE products SET price=0 WHERE name=abcd\'
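The structural filter in the example can be tried directly with Python's `re` module. The template is the regular expression from the slide; the attack string spells out the elided part of the slide's query with the same table and column names, which is an assumption.

```python
import re

# Taint-marking template from the slide; the group marks the tainted component.
TEMPLATE = re.compile(r"SELECT price FROM products WHERE name='([a-z]*)'")


def structural_filter_blocks(output):
    """Block any output that the learned template fails to match in full."""
    return TEMPLATE.fullmatch(output) is None


benign = "SELECT price FROM products WHERE name='abcd'"
attack = ("SELECT price FROM products WHERE name='z'; "
          "UPDATE products SET price=0 WHERE name='abcd'")

print(structural_filter_blocks(benign), structural_filter_blocks(attack))
```

Using `fullmatch` rather than `search` matters here: the attack output begins with a string the template accepts, and only the trailing injected statement breaks the structure.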

Output Filters (continued …) • Equality-based filter • tainted parts same as attack-causing output • Statistical • Filter: statistics matching those of the attack output but not benign outputs • length of tainted data > threshold • tainted data contains “<script>” • tainted data contains too many non-alpha characters • Causal: Just apply attack detection criteria • Note: filter independent of attack sample
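The statistical criteria listed above can be sketched as follows. This is an illustration only: the slide derives statistics from the attack output as contrasted with benign outputs, while this sketch makes the simpler assumption that thresholds are the maxima seen in benign samples. All names are hypothetical.

```python
def tainted_stats(tainted):
    """Length, non-alphabetic ratio, and script-tag presence of tainted data."""
    non_alpha = sum(1 for ch in tainted if not ch.isalpha())
    return {
        "length": len(tainted),
        "non_alpha_ratio": non_alpha / len(tainted) if tainted else 0.0,
        "has_script_tag": "<script>" in tainted.lower(),
    }


def learn_thresholds(benign_values):
    """Assumed rule: anything beyond the benign maxima is suspicious."""
    stats = [tainted_stats(value) for value in benign_values]
    return {
        "max_length": max(s["length"] for s in stats),
        "max_non_alpha_ratio": max(s["non_alpha_ratio"] for s in stats),
    }


def statistical_filter_blocks(tainted, thresholds):
    s = tainted_stats(tainted)
    return (
        s["has_script_tag"]
        or s["length"] > thresholds["max_length"]
        or s["non_alpha_ratio"] > thresholds["max_non_alpha_ratio"]
    )


thresholds = learn_thresholds(["alice", "bob", "carol"])
print(statistical_filter_blocks("dave", thresholds))
print(statistical_filter_blocks("Alice<script>alert(1)</script>", thresholds))
```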

Input Filters • Taint-marking FSA only indicates which parts of output are tainted • Don’t know which parts relate to which inputs • Need this info to generate input filters • We need to capture the relationship between inputs and outputs • We will represent this relation using an FSA as well, but its transitions will be on pairs (input char/output char) • Note: i may be from one of n input parameters; the transition will specify which. • The term “finite-state transducer” (FST) is commonly used to refer to such FSA

$4=/  /’| gpg -h $2@/$2 $3/$3 /.g -r $1/$1 /echo ' / -r $4/$4 ' ' $2/ I/O Transformation FST Example from SquirrelMail

Generating Causal Filters from FST • Consider a command injection on SquirrelMail • Use I/O FST to compute output • Violates policy: no “;” (or other command separators) in tainted parts • Can be projected into a corresponding condition on input: no command separator in input $3 • Can be generalized to state that if $1 has no unmatched quotes, then neither $3 nor $4 can contain unescaped “;” • [Diagram: FST output for the attack, labels as extracted: x@yz.com;touch /tmp/t /’| gpg -h sekar /.g -r ab /echo ' ' ']
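Once the output policy has been projected onto the inputs, the resulting input filter is a simple check on the named parameters. The sketch below assumes the projection has already been done and that $3 carries the address string from the diagram; the separator set and function names are illustrative assumptions, and the "$1 has no unmatched quotes" precondition is not modeled.

```python
# Shell command separators considered by this sketch (an assumed set).
SEPARATORS = set(";|&\n")


def has_unescaped_separator(value):
    """True if a separator appears without a preceding backslash."""
    escaped = False
    for ch in value:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch in SEPARATORS:
            return True
    return False


def input_filter_blocks(params, guarded=("$3", "$4")):
    """Projected condition: guarded inputs must not carry command separators."""
    return any(has_unescaped_separator(params.get(name, "")) for name in guarded)


benign = {"$1": "ab", "$2": "sekar", "$3": "x@yz.com", "$4": ""}
attack = dict(benign, **{"$3": "x@yz.com;touch /tmp/t"})

print(input_filter_blocks(benign), input_filter_blocks(attack))
```

Because the check runs on the request parameters, it could sit at the network layer or close to it, without instrumenting the point where the shell command is issued.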

RAMSES Functional Architecture (Spiral 1) • Sense, Monitor, Respond • Interceptors observe and control inputs to applications at the function call level • New information from sensors is analyzed in context of retained history • Responses in the form of learned attack signatures and specific interventions (block, filter) are fed to interceptors to provide an immune response • Dataflow / user-taint identification rules learned offline from logs • Rule-based monitoring of inputs performed by Interceptors (also alerting and response) • [Diagram components: User Inputs; Application(s); Function Interceptors (shown at several layers); Application DLL; Win32 API; Windows SysCall; Windows Kernel; Function Interceptor Manager; Signature & Rule Manager; Memory Error Attack Detector; Crash Dump; Signature Generator; Dataflow Anomaly Detector; RAMSES Initiator; Sensor Config; LOG (Parameters, Input, Output, Context); Offline Load; Offline Dataflow Learning; Taint Mark Rules; Alerts, Buffered Inputs]

RAMSES Implementation Status • Architecture supports generalized injection attack defenses • Vulnerability-based signatures and policy-based regular expression signatures • Works on multiple applications with application-specific configuration • Memory error detector integrated • Zero-day attack detector for unknown vulnerabilities • DAWSON enhanced and integrated on Windows XP • Function interceptors have been tested and stabilized over numerous applications on Windows XP/Vista/Linux • IIS v4 - v7, SQL Server, MySQL • IE browser, Windows Explorer • Also implemented for Linux Apache and PHP • Test beds and synthetic applications developed • Performed first set of end-to-end experiments on attacks against synthetic applications • Memory error • SQL injection • XSS attacks • Completed Spiral 1 Functional Architecture

Function Interceptor • The Infrastructure • Based on Detours package from Microsoft Research • Same component used by RAMSES in online and offline modes • Works the same way for all generalized injection attack defenses • Function Interceptors • To intercept/monitor/alter application behavior • Can intercept any exported function • Internal functions can be intercepted with Microsoft online debug information • Function name, parameters, return result, calling context are logged as name:value pairs • Data buffer content is logged as a printable ASCII string • Current Implementation • Currently supports the following types of APIs: • Windows Socket APIs: 40+ functions • Two sets of socket APIs on Windows, both are instrumented • Process/Thread APIs: 10+ functions • Windows COM APIs: 60+ functions • File IO APIs: 40+ functions • HTTP APIs: 10+ functions (on Vista)

Function Interceptor Sample Log Function Name Timestamp Parameters Return result Calling context • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, send(SOCKET:274|Buf:1edc08|Len:413|Flags:0|RETURN:19d|DUMPBUFFER:1) • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, DUMPBUFFER:BEGIN • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, GET / HTTP/1.1\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, Accept: */*\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, Accept-Language: en-us\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, UA-CPU: x86\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, Accept-Encoding: gzip, deflate\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506 • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, Host: www.google.com\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, Connection: Keep-Alive\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, Cookie: PREF=ID=c559b127b50b7436:TM=1176843792:LM=1180046298:L=0jXnI1ToXFxWWo5LbIcrLh7t8ID0Fd1HW3eXHVTozZwA:S=clcSnwJfHl • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, rjWQTm; testcookie=\r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, \r\n • 20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, DUMPBUFFER:END
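Log lines in the format shown above can be split into fields with two regular expressions. This sketch is based only on the sample lines on this slide; the field names in the returned dictionary are assumptions, and a buffer line that happens to look like `name(...)` would be misread as a call record.

```python
import re

LINE_RE = re.compile(
    r"(?P<ts>\d+): FuncIndex=(?P<func>0x[0-9a-f]+),PID=(?P<pid>0x[0-9a-f]+),"
    r"ThreadId=(?P<tid>0x[0-9a-f]+),SPOffset=(?P<sp>0x[0-9a-f]+), (?P<body>.*)",
    re.S,
)
CALL_RE = re.compile(r"(?P<name>\w+)\((?P<args>.*)\)$")


def parse_log_line(line):
    """Parse one interceptor log line; returns None if the prefix is missing."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    record = {
        "timestamp": m["ts"],
        "func_index": int(m["func"], 16),
        "pid": int(m["pid"], 16),
        "thread_id": int(m["tid"], 16),
        "sp_offset": int(m["sp"], 16),
    }
    call = CALL_RE.match(m["body"])
    if call:
        # Call record: name:value pairs separated by "|".
        record["function"] = call["name"]
        record["fields"] = {
            key: value
            for key, _, value in (p.partition(":") for p in call["args"].split("|"))
        }
    else:
        # Buffer dump record: keep the raw content line.
        record["buffer_line"] = m["body"]
    return record


PREFIX = "20070613175843825: FuncIndex=0x8e,PID=0xd58,ThreadId=0xeb0,SPOffset=0x290, "
call_rec = parse_log_line(
    PREFIX + "send(SOCKET:274|Buf:1edc08|Len:413|Flags:0|RETURN:19d|DUMPBUFFER:1)"
)
data_rec = parse_log_line(PREFIX + r"Host: www.google.com\r\n")
print(call_rec["function"], call_rec["fields"]["Len"])
```

Grouping records by `thread_id` and `func_index` is what lets the buffer lines between DUMPBUFFER:BEGIN and DUMPBUFFER:END be reattached to the call that produced them.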

Key Steps • Attack detection • Process crash • Useful information can be extracted from a failed attack • Usually in the form of access violation exception • Full crash dump can be analyzed offline • Policy violation • Trace back to input • Correlate attack effect to some part of input • Vulnerability/exploit analyzer • Analyze attack in the context of recent input history • Correlate the exploit/vulnerability to input without fine grained execution trace • Signature generator • Generate generic signature and response for the underlying vulnerability

Memory Error Example • Multi-threaded vulnerable server • Server receives input from port V • Server handles socket error • Server handles thread level exception • [Diagram: Attacker; Port V; RAMSES Protected Server with Listening Thread, three WorkerThreads, Exception Handler; RAMSES Attack Detector; ASR; Function Interceptor; Recent Input History; Vulnerability Exploit Analyzer; Signature Generator]

Memory Error Attack • A traditional stack buffer overflow exploit using jmp esp at 0x77fb59cc from NTDLL • Brute-force attack enumerates all possibilities, 0x000059cc to 0xffff59cc • Without protection, the attack succeeds or service is denied • [Diagram: Attacker sends 0x77fb59cc, 0xABCD59cc, 0xFFFF59cc to Port V; RAMSES Protected Server with Listening Thread, WorkerThreads, Exception Handler; RAMSES Attack Detector; ASR; Function Interceptor; Recent Input History; Vulnerability Exploit Analyzer, fed with faulting address, instruction, stack content; Signature Generator]

Next Steps • Attack Detection • Enhance detection so that it triggers closer to, or even before, the point where the vulnerability is exploited • Save the attack dump for offline analysis to increase confidence and enhance signatures • Signature Generation • Create new probes to accurately characterize the underlying vulnerability (for example, the minimum size required to overflow a buffer and overwrite the return address) by customizing/randomizing certain parts of the payload such as target address and buffer length • Measure and reduce false positive/false negative rates

Taint Recognition Implementation Overview • Script Vulnerability Scenario • Taint Flow ~ Script Vulnerability • Exploits: SQL, Plant XSS attack • Taint Recognition and Policy • Example of a policy signature • Algorithm for generating a simple recognizer • Limits of effectiveness • Solutions are not unique • Multiple tainted segments cause problems • Better signatures • Character distribution invariants • More things we can learn • In which calls/contexts tainted inputs appear • Measure effectiveness of taint templates

Script Vulnerability Scenario • [Figure: page annotated with “Values from the database” and “User supplied inputs”]

Taint Flow ~ Script Vulnerability • $v = $_REQUEST['vote']; $n = $_REQUEST['name']; $sql = "INSERT INTO namevote VALUES('$n', $v);"; multi_query($conn,$sql) • Resulting query: INSERT INTO namevote VALUES('alice',345); • [Diagram: Web Server (IIS), Web App (PHP/ASP), SQL Database (MySQL)]

Exploit: SQL Capture • $v = $_REQUEST['vote']; $n = $_REQUEST['name']; $sql = "INSERT INTO namevote VALUES('$n', $v);"; multi_query($conn,$sql) • Attacker can execute arbitrary SQL commands via crafted strings intended for data values. • Inputs (as shown, truncated): Trudy; 445); DROP TABLE FOO; INSE • Resulting query: INSERT INTO namevote VALUES('trudy',445); DROP TABLE FOO; INSERT INTO namevote VALUES('bob',345); • [Diagram: Web Server (IIS), Web App (PHP/ASP), SQL Database (MySQL)]
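The effect of the crafted vote value can be reproduced in a few lines. The sketch mirrors the slide's PHP string interpolation in Python; the full payload is reconstructed from the resulting query shown on the slide, and the statement counter is a deliberately naive helper for illustration.

```python
def build_query(name, vote):
    # Mirrors the slide's PHP: values are pasted into the SQL text unchecked.
    return f"INSERT INTO namevote VALUES('{name}',{vote});"


def statement_count(sql):
    # Naive split on ";" (ignores quoting); adequate for this illustration only.
    return sum(1 for part in sql.split(";") if part.strip())


benign = build_query("alice", "345")
attack = build_query(
    "trudy",
    "445); DROP TABLE FOO; INSERT INTO namevote VALUES('bob',345",
)

print(statement_count(benign), statement_count(attack))
print(attack)
```

A value meant to be one number turns a single INSERT into three statements, which is the structural change that the taint-based policies and filters in this deck look for.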

Exploit: Plant XSS Attack • $v = $_REQUEST['vote']; $n = $_REQUEST['name']; $sql = "INSERT INTO namevote VALUES('$n', $v);"; multi_query($conn,$sql) • Imperfect filtering offers the possibility of planting XSS attacks even without subverting the SQL syntax. • Inputs (as shown, truncated): Alice<script>alert(String.fr; 345 • Resulting query: INSERT INTO namevote VALUES('Alice<script>alert(String.fromCharCode(88,83,83))</script>',345); • Tainted text later executes in browsers of future visitors to site. • Observe: No variation in server control flow! • [Diagram: Web Server (IIS), Web App (PHP/ASP), SQL Database (MySQL)]

Observations on Exploits • Can be launched from an ordinary web browser. • SQL capture can inject persistent data into the database, which may propagate to third-party victims. • Text the application programmer intended for limited use as data is crafted to influence the command interpretation stream in unintended ways. • Command shell, filename, and format string attacks all share a similar structure with this canonical example. • These “capture” vulnerabilities stem from flaws in application programs. Perfect filtering by the application could have eliminated the flaw. Alas, perfection is not always achieved. Can we craft our application to learn from experience?
