RAMSES (Regeneration And iMmunity SErviceS): A Cognitive Immune System • Self Regenerative Systems • 18 December 2007 • Mark Cornwell, James Just, Nathan Li, Robert Schrag (Global InfoTek, Inc.) • R. Sekar (Stony Brook University)
Outline • Overview • Efficient content-based taint identification • Syntax and taint-aware policies • Memory attack detection and response • Testing • Red Team suggestions • Questions • Demo
RAMSES Attack Context [Diagram: Incoming requests (untrusted input) → Program → Outgoing requests (security-sensitive operations)] • Attack target: “program” mediating access to protected resources/services • Attack approach: use maliciously crafted input to exert unintended control over protected resource operations • Resource or service uses: • Well-defined APIs to access • OS resources • Command interpreters • Database servers • Transaction servers • … • Internal interfaces • Data structures and functions within program • Used by program components to talk to each other
Example 1: SquirrelMail Command Injection • Input interface: sendto="nobody; rm -rf *" • Program: $send_to_list = $_GET['sendto']; $command = "gpg -r $send_to_list 2>&1", which evaluates to "gpg -r nobody; rm -rf * 2>&1" • “Output” interface: popen($command) • Attack: removes all removable files in web server document tree
Example 2: phpBB SQL Injection • Input interface: topic="-1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3" • Program: $topic_id = $_GET['topic']; $sql = "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = $topic_id", which evaluates to "SELECT p.post_id FROM POSTS_TABLE WHERE p.topic_id = -1 UNION SELECT ord(substring(user_password,1,1)) FROM phpbb_users WHERE user_id = 3" • “Output” interface: sql_query($sql) • Attack: steal another user’s password
Attack Space of Interest (CVE 2006) Generalized Injection Attacks
Detection Approach [Diagram: Input Interface (untrusted input) → Program → “Output” Interface (security-sensitive operations)] • Attack: use maliciously crafted input to exert unintended control over output operations • Detect “exertion of control” • Based on “taint”: degree to which output depends on input • Detect if control is intended: • Requires policies (or training) • Application-independent policies are preferable
RAMSES Goals and Approach [Diagram: Input Interface (untrusted input) → Program → “Output” Interface] • Taint analysis: develop efficient and non-invasive alternatives • Analyze observed inputs and outputs • Needs no modifications to program • Language-neutral • Leverage learning to speed up analysis • Attack detection: develop framework to detect a wide range of attacks, while minimizing policy development effort and FP/FNs • “Structure-aware policies”: leverage interplay between taint and structural changes to output requests • Use Address-Space Randomization (ASR) for memory corruption • ASR: efficient, in-band, “positive” tainting for pointer-valued data • Immunization: filter out future attack instances • Output filters: drop output requests that violate taint-based policies • Input filters: “Project” policies on outputs to those on inputs • Relies on learning relationships between input and output fields • Network-deployable
Steps • Develop efficient algorithms for inferring flow of input data into outputs • Compare input and output values • Allow for parts of input to flow into parts of output • Tolerate some changes to input • Changes such as space removal, quoting, escaping, case-folding are common in string-based interfaces • Based on approximate substring matching • Leverage learning to speed up taint inference • Even the “efficient” content-matching algorithms are too expensive to run on every input/output • Same learning techniques can be used for detecting attacks using anomaly detection
Weighted Substring Edit Distance Algorithm • Maintain a matrix D[i][j] of minimum edit distance between p[1..i] and s[1..j] • D[i][j] = min{D[i-1][j-1]+ SubstCost(p[i],s[j]), D[i-1][j] + DeleteCost(p[i]), D[i][j-1] + InsertCost(s[j])} • D[0][j] = 0 (No cost for omitting any prefix of s) • D[i][0] = DeleteCost(p[1])+…+DeleteCost(p[i]) • Matches can be reconstructed from the D matrix • Quadratic time and space complexity • Uses O(|p|*|s|) memory and time
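The recurrence on this slide can be sketched in Python as follows. This is a minimal illustration, not the RAMSES implementation: the function name, the unit default costs, and the traceback used to recover the matched span are assumptions made for the example. Only the D matrix definition, including D[0][j] = 0 and D[i][0] as a sum of deletion costs, comes from the slide.

```python
def substring_edit_distance(p, s, subst_cost=None, delete_cost=None, insert_cost=None):
    """Return (distance, start, end) where s[start:end] best approximates p."""
    # Unit costs are an assumption for this sketch; the slide leaves the weights open.
    subst_cost = subst_cost or (lambda a, b: 0 if a == b else 1)
    delete_cost = delete_cost or (lambda a: 1)
    insert_cost = insert_cost or (lambda b: 1)

    m, n = len(p), len(s)
    # D[0][j] = 0: skipping any prefix of s is free.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # D[i][0] = DeleteCost(p[1]) + ... + DeleteCost(p[i]).
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + delete_cost(p[i - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]),
                D[i - 1][j] + delete_cost(p[i - 1]),
                D[i][j - 1] + insert_cost(s[j - 1]),
            )

    # The best match ends at the cheapest cell of the last row.
    end = min(range(n + 1), key=lambda j: D[m][j])
    # Walk back through D to find where the match starts in s.
    i, j = m, end
    while i > 0:
        if j > 0 and D[i][j] == D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]):
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + delete_cost(p[i - 1]):
            i -= 1
        else:
            j -= 1
    return D[m][end], j, end


# Input parameter from the SquirrelMail example located inside the output command.
result = substring_edit_distance("nobody; rm -rf *", "gpg -r nobody; rm -rf * 2>&1")
```

Passing a custom `subst_cost` (for example one that charges nothing for a case change) is one way to tolerate the input transformations mentioned earlier, such as case-folding.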
Improving performance • Quadratic complexity algorithms can be too expensive for large s, e.g., HTML outputs • Storage requirements are even more problematic • Solution: Use linear-time coarse filtering algorithm • Approximate D by FD, defined on substrings of s of length |p| • Let P (and S) denote a multiset of characters in p (resp., s) • FD(p, s) = min(|P-S|, |S-P|) • Slide a window of size |p| over s, compute FD incrementally • Prove: D(p, r) < t ⇒ FD(p, r) < t for all substrings r of s • Result: O(|p|^2) space and time complexity in practice • Implementation results • Typically 30x improvement in speed • 200x to 1000x reduction in space • Preliminary performance measurements: ~40MB/sec
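A rough sketch of the sliding-window coarse filter follows. The function name and return format are hypothetical. Because every window has the same length as p, |P-R| equals |R-P|, so the sketch tracks only |P-R| and updates it in constant time per step. Windows that pass the filter would then be handed to the quadratic edit-distance computation.

```python
from collections import Counter


def coarse_filter(p, s, t):
    """Return start offsets of length-|p| windows r of s with FD(p, r) < t."""
    m = len(p)
    if m == 0 or len(s) < m:
        return []

    # need[c] = (count of c in p) - (count of c in the current window)
    need = Counter(p)
    for c in s[:m]:
        need[c] -= 1
    # FD = |P - R| = sum of the positive entries of need
    fd = sum(v for v in need.values() if v > 0)

    hits = [0] if fd < t else []
    for start in range(1, len(s) - m + 1):
        out_c, in_c = s[start - 1], s[start + m - 1]
        # Character leaving the window: p may now be missing one more of it.
        need[out_c] += 1
        if need[out_c] > 0:
            fd += 1
        # Character entering the window: it may cover a missing character of p.
        if need[in_c] > 0:
            fd -= 1
        need[in_c] -= 1
        if fd < t:
            hits.append(start)
    return hits


candidates = coarse_filter("abc", "xxabcxx", 2)
```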
Efficient online operation • Weighted edit-distance algorithms are still too expensive if applied to every input/output • Need to run for every input parameter and output • Key idea: • Use learning to construct a classifier for outputs • Each class consists of similarly tainted outputs • taint identified quickly, once the class is known • Classifying strings is difficult • Our technique operates on parse trees of output • For ease of development, generality, and tolerance to syntax errors, we use a “rough” parser • Classifier is a decision tree that inspects parse tree nodes in an order that leads to good decisions
Decision Tree Construction • Examines the nodes of syntax tree in some order • The order of examination is a function of the set of syntax trees • Chooses nodes that are present in all candidate syntax trees • Avoids tests on tainted data, as they can vary • Avoids tests that don’t provide significant degree of discrimination • “similar-valued” fields will be collected together and generalized, instead of storing individual values • Incorporates a notion of “suitability” for each field or subtree in the syntax tree • Takes into account approximations made in parsing
Example of a Decision Tree 1. SELECT * FROM phpbb_config 2. SELECT u.*,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id 3. SELECT * FROM phpbb_themes WHERE themes_id=1 4. SELECT c.cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order 5. SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order • Resulting decision tree: switch (1) { case ROOT : switch (1.1) { case CMD : switch (1.1.2) { case c FINAL {@1.1.1:SELECT @1.1.3:. cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order } case u FINAL {@1.1.1:SELECT @1.1.3:. *,s.* FROM phpbb_sessions s,phpbb_users u WHERE s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id } case * FINAL {@1.1.1:SELECT @1.1.3:FROM phpbb_?????? } } } }
Implementation Status and Next Steps • “Rough” parsers implemented for • HTML/XML • Shell-like languages (including Perl/PHP) • SQL • Preliminary performance measurements • Construction of decision trees: ~3MB/sec • Classification only: ~15MB/sec • Significant improvements expected with some performance tuning • Next steps • Develop better clustering/classification algorithms based on tree edit-distance • Current algorithm is based entirely on a top-down traversal, and fails to exploit similarities among subtrees
Overview of Policies • Leverage structure+taint to simplify/generalize policy • Policy structure mirrors that of parse trees • And-Or “trees” with cycles • Can specify constraints on values (using regular expressions) and taint associated with a parse tree node • Most attacks detected using one basic policy • Controlling “commands” vs command parameters • Controlling pointers vs data
Controlling “commands” Vs “parameters” • Observation: parameters don’t alter syntactic structure of victim’s requests • Policy: Structure of parse tree for victim’s request should not be controlled by untrusted input (“tainted data”) • Alternate formulation: tainted data shouldn’t span multiple “fields” or “tokens” in victim’s request
Policy prohibiting structure changes • Define “structure change” without using a reference • Avoids need for training and associated FP issues • Policy 1 • Tainted data cannot span multiple nodes • for binary data, it should not span multiple fields • Policy 2 • Tainted data cannot straddle multiple subtrees • Tainted data spans two adjacent subtrees, and at least one of them is not fully tainted • Tainted data “overflowed” beyond the end of one subtree and resulted in a second subtree • Both policies can be further refined to constrain the node types and children subtrees of the nodes
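Policy 1 can be illustrated with a small Python check on a shell command string. This is only a sketch: the regular-expression tokenizer stands in for the “rough” parser and is an assumption, as are the function names and the convention that taint is given as a half-open character range in the output.

```python
import re

# Hypothetical rough tokenizer: quoted strings, operator runs, and plain words.
TOKEN_RE = re.compile(r"""'[^']*'|"[^"]*"|[;|&<>]+|[^\s;|&<>'"]+""")


def tokens_touched(output, taint_start, taint_end):
    """List the tokens of output that overlap the tainted range [start, end)."""
    return [
        m.group()
        for m in TOKEN_RE.finditer(output)
        if m.start() < taint_end and m.end() > taint_start
    ]


def violates_policy1(output, taint_start, taint_end):
    """Policy 1: tainted data must not span more than one token."""
    return len(tokens_touched(output, taint_start, taint_end)) > 1


# Benign request: the tainted input "nobody" stays within a single token.
benign = violates_policy1("gpg -r nobody 2>&1", 7, 13)
# SquirrelMail attack: the tainted input spans the tokens nobody ; rm -rf *
attack = violates_policy1("gpg -r nobody; rm -rf * 2>&1", 7, 23)
```

The check needs no reference command for comparison, which is the point of defining “structure change” without training data.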
Commands Vs parameters: Example 2 [Diagram: stack frames 1 and 2, each ending in a return address] • Memory corruption attack overflowing stack buffer • For binary data, we talk about message fields rather than parse trees • … • Violation: tainted data spans multiple stack “fields” • Heap overflows involve tainted data spanning across multiple heap blocks
Attacks Detected by “No structure change” Policy • Various forms of script or command injection • SQL injection • XPath injection • Format string attacks • HTTP response splitting • Log injection • Stack overflow and heap overflow
Application-specific policies • Not all attacks have the flavor of “command injection” • Develop application-specific policies to detect such attacks • Policy 3: Cross-site scripting: no tainted scripts in HTML data • Policy 4: Path traversal: tainted file names cannot access data outside of a certain document tree • … • Other examples • Policy 5: No tainted CMD_NAME or CMD_SEPARATOR nodes in shell or SQL commands
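Policy 4 could be checked along the following lines. The sketch assumes POSIX-style paths, a hypothetical document root, and a purely lexical check (symbolic links are not resolved); none of these details come from the slides.

```python
import posixpath


def violates_policy4(doc_root, tainted_name):
    """Policy 4: a tainted file name must not resolve outside the document tree."""
    root = posixpath.normpath(doc_root)
    # join() discards root when tainted_name is absolute, which is also a violation.
    target = posixpath.normpath(posixpath.join(root, tainted_name))
    return posixpath.commonpath([root, target]) != root


inside = violates_policy4("/var/www/docs", "img/logo.png")
outside = violates_policy4("/var/www/docs", "../../etc/passwd")
```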
Implementation status • Four test applications • phpBB • SquirrelMail • PHP/XMLRPC • WebGoat (J2EE) • Detects following attacks without FPs • Command injection (Policies 1, 2, 5) • SQL injection (1, 2, 5) • XSS (3) • HTTP Response splitting (2) • Path traversal (4) • Memory corruption detected using ASR • Should be able to detect many other attacks easily • XPATH injection (1,2), Format-string (1, 2), Log injection (1,2)
Memory Error Based Remote Attack • Attacker’s goal: overwrite target of interest to take over instruction execution • Attacker’s approach: • Propagate attacker-controlled input to target of interest • Violate certain structural constraints in the propagation process
Stack Frame Structural Violation [Diagram: call stack from high to low addresses • A’s stack frame: function arguments, return address, previous stack frame, exception registration record, local variables • B’s stack frame: function arguments, return address (to A), previous stack frame, local variables • C’s stack frame: function arguments, return address (to B), previous stack frame (EBP), exception registration record (FS:0), local variables (ESP)]
Heap Block Structural Violation • Happens when removing a free block from a doubly linked list: • Ability to write 4 bytes to any address, usually a well-known address such as a function pointer, return address, SEH, etc. [Diagram: Windows free heap block header structure with fields Size, Previous Size, Segment Index, Flags, Unused, Tag Index, FLink, BLink]
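The 4-byte write can be seen in a toy model of the unlink step. In this sketch memory is a Python dict from address to 32-bit value, FLink and BLink are assumed to sit at offsets 0 and 4 of the list entry, and all addresses are made up for illustration.

```python
def unlink(memory, entry):
    """Remove a free-list entry the way a doubly linked list unlink does."""
    flink = memory[entry]       # entry->FLink
    blink = memory[entry + 4]   # entry->BLink
    memory[blink] = flink       # BLink->FLink = FLink
    memory[flink + 4] = blink   # FLink->BLink = BLink


# An overflow has replaced the links of the free block at 0x1000 (hypothetical values):
# FLink holds the value to write, BLink holds the address to write it to.
memory = {0x1000: 0x00410000, 0x1004: 0x0012FF7C}
unlink(memory, 0x1000)
written = memory[0x0012FF7C]
```

After the call, the attacker-chosen value sits at the attacker-chosen address, which is the (Address To Write, Value To Write) pair that heap overflow analysis looks for.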
ASLR and Crash Analysis • ASLR randomizes the addresses of targets of interest • Memory attack using the original address will miss and cause a crash (exception) • Crash analysis tracks back to the vulnerability, which enables accurate signature generation • Structural information is usually retrievable at runtime, thanks to enhanced debugging technology • Crash analysis aided with JIT (Just-In-Time Tracing) • JIT triggered at certain events: • “Suspicious” network inputs, e.g. sensitive JMP address • Attach/detach JIT monitor at event of interest • Memory can be dumped at the right granularity, with log info ranging from a few KB to 2GB
Crash Root Cause Analysis [Diagram: Root Cause Analysis takes as input Exception Record/Context, Faulting thread/Instructions/Registers, and Stack trace/Heap/Module/Symbols] • Stack Corruption • Read Access Violation, Bad EIP (Corrupted Return Address or SEH) • Read Access Violation, Bad Dereference (Corrupted Local Variables/passing parameters) • Heap Corruption • Write Access Violation (Address to write, Value to write)
Stack-based Overflow Analysis • “Target” driven analysis • The goal of attack string is to overwrite target of interest on stack, e.g., return address, SEH handler. • Start matching target values from crash dump to input, like EIP, EBP and SEH handler • More efficient than pattern match in the whole address space • If any targets are matched in input, expand in both directions to find LCS • A match usually indicates the input size needed to overflow certain targets
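The target-driven matching step might look like the sketch below. The function name, the dict it returns, and the sample bytes are hypothetical; the sketch assumes a 32-bit little-endian target value and that the caller knows the target's offset inside the dumped stack region.

```python
import struct


def correlate_target(target_value, dump, target_off, data):
    """Find a crashed target value in the input and grow the match both ways."""
    needle = struct.pack("<I", target_value)  # 32-bit little-endian, as on x86
    hit = data.find(needle)
    if hit < 0:
        return None

    # Expand left while the dump and the input keep agreeing.
    lo_d, lo_i = target_off, hit
    while lo_d > 0 and lo_i > 0 and dump[lo_d - 1] == data[lo_i - 1]:
        lo_d, lo_i = lo_d - 1, lo_i - 1
    # Expand right in the same way.
    hi_d, hi_i = target_off + 4, hit + 4
    while hi_d < len(dump) and hi_i < len(data) and dump[hi_d] == data[hi_i]:
        hi_d, hi_i = hi_d + 1, hi_i + 1

    return {"bytes_to_overwrite": hit + 4, "matched_input": (lo_i, hi_i)}


# Synthetic example: 124 filler bytes, a fake return address, then 8 more bytes.
data = b"A" * 124 + struct.pack("<I", 0xDEADBEEF) + b"B" * 8
dump = b"\x00" * 16 + data[100:] + b"\x00" * 4
report = correlate_target(0xDEADBEEF, dump, 40, data)
```

Searching only for the few target values is cheaper than pattern matching over the whole address space, and the offset of the hit gives the input size needed to reach the target.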
SEH Overflow and Analysis • A unique approach for Windows exploits • SEH stands for Structured Exception Handler • Windows puts the EXCEPTION_REGISTRATION_RECORD chain on the stack, with the SEH in the record • More reliable and powerful than overwriting the return address • More JMP addresses to use (pop/pop/ret) • An exception (accidental/intentional) is desired • Can bypass the /GS buffer check • SEH crash analysis: • Catch the first exception as well as the second one (caused by ASR) • Locate the SEH chain head from the first dump, usually overwritten by input • Usually the first exception is enough; the second exception can be used for confirmation
Heap Overflow Analysis • How to analyze a heap overflow attack? • Exploit happens when free blocks are unlinked • Multiple ways to trigger • Write Access Violation with ASR • when overwriting at an invalid address • Overwrites a 4-byte value at an arbitrary address • Targets of interest include return address, SEH, PEB and UEF • Exploit contains the pair: (Address To Write, Value To Write) • Appears in the overflowed heap blocks • Usually contained in registers • Should be provided from input by attacker • Match found in synthetic heap exploits • The value pair needs to be at a fixed offset • For a given heap overflow vulnerability • To overwrite the right address with the desired value
Case Study: RPC DCOM • Step 1: Exception Analysis FAULTING_IP: +18759f ExceptionCode: c0000005 (Access violation) Attempt to read from address 0018759f PROCESS_NAME: svchost.exe FAULTING_THREAD: 00000290 PRIMARY_PROBLEM_CLASS: STACK_CORRUPTION • Step 2: Target – Input correlation: StackBase: 0x6c0000, StackLimit: 0x6bc000,Size =0x4000 Begin analyze on Target Overwrite and Input Correlation: Analyze crash EIP: Find EIP pattern at socket input: Bytes size to overwrite EIP= 128 Analyze crash EIP done! Analyze SEH: Find SEH byte at socket input: Bytes size to overwrite SEH handler= 1588 Analyze SEH done!
Signature Generation • Signature generation: • Signature captures the vulnerability characteristics • Minimum size to overwrite certain target(s) • Use contexts to reduce false positives: • Using the calling stack of the incoming input • Stack offset can uniquely identify the context • Using the semantic context of the incoming input: • Message format like HTTP URL/parameter • Binary message field
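One possible shape for such a signature is sketched below: a context label plus the minimum input length that reaches the target. The class name, its fields, and the sample context string are illustrative assumptions and not the RAMSES signature format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OverflowSignature:
    context: str      # hypothetical label, e.g. a message field or call-stack id
    min_length: int   # minimum input size needed to overwrite the target

    def matches(self, context, value):
        """Flag inputs in the same context that are long enough to overflow."""
        return context == self.context and len(value) >= self.min_length


sig = OverflowSignature(context="rpc:hostname", min_length=128)
blocked = sig.matches("rpc:hostname", b"A" * 200)
allowed = sig.matches("rpc:hostname", b"server01")
```

Tying the length check to a context is what keeps long but legitimate inputs in other fields from being flagged.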
Components & Implementation [Diagram: Protected Application, RAMSES Crash Monitor, RAMSES Crash Analyzer, and shared infrastructure; numbered steps: 1 Crash (Exception), 2 Generate Crash Dump*, 3 Provide Input History, 4 Analyze, 5 Signature] • RAMSES Crash Monitor: • Catches only exceptions of interest • Snapshots for a given period • Self healer • RAMSES Crash Analyzer: • Fault type detection • Security-oriented analysis • Feedback • Infrastructure (uses Windows Debug Engine): Save Crash Dump, Extract Relevant Info, Search/Match, Disassemble • * Crash Dump provides the same interface as a live process, so the Crash Analyzer does not actually have to work on a saved crash dump file
Test Attacks & Applications • Baseline Applications • phpBB (PHP) • SquirrelMail (PHP) • WebGoat (Java) • hMailServer (C++) • Many “sublanguages”: SQL, XML, JavaScript, HTML, HTTP, JSON, shell, cmd, path
Traffic Generation • Purpose • Coverage of legitimate structural variation in monitored structures • SQL, command strings, call parameters • Stress of log complexity for practicality • Multiple users, multiple sessions • Performance measurements • Program performance metrics • Quantify performance impact
Traffic Generation to Web Sites • Approaches • Simple Record/Playback (basic) • with minor substitutions (cookies, IPs) • shell scripts, netcat, MaxQ (Jython-based) • Custom DOM/Ajax scripting (learning) • Can access dynamically generated browser content after (or during) client-side script evaluation • Automated site crawls of URLs • Automated form contents (site-specific metadata) • COTS tools • Load testing and metrics
Suggested Red Team ROEs • Initial telecons held in Fall • Claim: RAMSES will defeat most generalized injection attacks on protected applications • Red Team should target our current and planned applications rather than new ones (unless the new application, sample attacks and a complete traffic generator can be provided to RAMSES far enough in advance for learning and testing) • Remote network access to the targeted application • Attack designated application suite • Required instrumentation yet to be determined • Red Team exercise to start 15 April or later • …
RAMSES Project Schedule • Baseline Tasks: 1. Refine RAMSES Requirements 2. Design RAMSES 3. Develop Components 4. Integrate System 5. Analyze & Test RAMSES 6. Coordinate & Rept • Optional Tasks: O.3 Cross-Area Exper • [Gantt chart: CY06 Q3 through CY09 Q3, with milestones for Prototypes 1, 2, 3 and the Red Team Exercise] • Today: 11 September 2007
Plans • Develop input filters from output policies • Extend memory error analyzer • Demonstrate RAMSES on more applications and attack types • Native C/C++ app (most likely app is hMail server) • Java • Integrate components • Performance and false positive testing • Red Team exercise