The New Dyninst Code Parser: Binary Code Isn't as Simple as it Used to Be
Nathan Rosenblum
University of Wisconsin
nater@cs.wisc.edu
Unconventional Code Constructs
Binary Analysis
• Processing of the binary code to extract syntactic and symbolic information from many sources:
• Symbol tables (if present)
• Decode (disassemble) instructions
• Control-flow information: basic blocks, loops, functions
• Data-flow information: from basic register information to highly sophisticated (and expensive) analyses
Products of Binary Analysis
• High-level organization and characteristics
• Function entry/exit points
• Inter-procedural call graph
• Intra-procedural control-flow graph
• Exception handlers
• Jump tables
• Virtual function tables
• Abstract assembly representation
• Data-flow characteristics
• Register liveness (for instrumentation, modification)
Uses of Binary Analysis
• Debugging
• Testing
• Performance profiling
• Performance modeling
• Behavior modeling
• Dynamic modification
• Binary rewriting
• Reverse engineering
Binary Analysis Tool Goals
Why is Binary Analysis Hard?
Source Code:
    Func foo() { … switch(a) { … } … }
The Compiler
Binary:
    push %ebp
    mov %esp, %ebp
    …
    mov [0x1d], %eax
    jmp *%eax
    …
Current Approaches
• Linear disassembly of binaries is insufficient
• Symbol tables often lie, or are absent
• Functions are not address ranges, may be non-contiguous
• Parsing based on program control flow
• Commonly used approach:
• Must contend with gaps in known code regions after parsing
Dyninst Control Flow Parsing
• Opportunistic parsing:
• Utilizes symbol table and other information when available (and sensible)
• Provides more accurate view of the binary than linear disassembly
• Addresses problem of gaps in the binary through speculative parsing
• Heuristics to identify function preambles
Control Flow Traversal Illustrated
Parsing follows control flow
Control transfers are edges in the CFG
Target blocks can be parsed in any order
    <func foo>:
    00: mov [a8], r1
    04: mov [ac], r2
    08: add r1, r2, r3
    0c: cmp r3, 0
    10: bne 24
    14: call <bar>
    18: add r3, 8, r3
    1c: call <baz>
    20: jmp 28
    24: mul r2, 2, r3
    28: sub r1, r3, r1
    . . .
CFG blocks: 00, 24, 14, 28
Control Flow Traversal Illustrated
Call sites determine location of functions
Targets of calls are added to the function parsing work list
Known Functions: foo, quux, quuux, bar, baz
    <func foo>:
    00: mov [a8], r1
    04: mov [ac], r2
    08: add r1, r2, r3
    0c: cmp r3, 0
    10: bne 24
    14: call <bar>
    18: add r3, 8, r3
    1c: call <baz>
    20: jmp 28
    24: mul r2, 2, r3
    28: sub r1, r3, r1
    . . .
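The traversal illustrated on these two slides can be sketched in a few lines. This is a minimal illustration, not Dyninst's parser: the `FOO` table, the fixed 4-byte instruction size, and the name `parse_function` are assumptions made to mirror the toy listing for `foo`.

```python
from collections import deque

# Toy model of <func foo> from the slide: address -> (mnemonic, operand).
# Branch operands are target addresses; call operands are function names.
FOO = {
    0x00: ("mov", None), 0x04: ("mov", None), 0x08: ("add", None),
    0x0c: ("cmp", None), 0x10: ("bne", 0x24), 0x14: ("call", "bar"),
    0x18: ("add", None), 0x1c: ("call", "baz"), 0x20: ("jmp", 0x28),
    0x24: ("mul", None), 0x28: ("sub", None),
}
INSN_SIZE = 4  # assumption: fixed-width instructions, as in the toy listing


def parse_function(code, entry):
    """Follow control flow from entry; return (block starts, call targets)."""
    block_starts = {entry}
    callees = []          # would be added to the function parsing work list
    visited = set()
    work = deque([entry])
    while work:
        addr = work.popleft()
        while addr in code and addr not in visited:
            visited.add(addr)
            op, target = code[addr]
            nxt = addr + INSN_SIZE
            if op == "call":
                callees.append(target)
            elif op == "bne":
                # Conditional branch: both the target and the fall-through
                # start new blocks; keep walking the fall-through here.
                block_starts.update((target, nxt))
                work.append(target)
            elif op == "jmp":
                # Unconditional branch: only the target is followed.
                block_starts.add(target)
                work.append(target)
                break
            addr = nxt
    return sorted(block_starts), callees


blocks, callees = parse_function(FOO, 0x00)
print([hex(b) for b in blocks], callees)
```

The block starts found for `foo` are 00, 14, 24 and 28, matching the CFG nodes on the slide, and `bar` and `baz` are the call targets a real parser would queue for parsing.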
Binary Parsing Challenges
• Pointer-based control transfer
• Non-returning calls
• Non-contiguous code sections
• Tail calls
• Gaps in the binary
• Exception handlers
• Shared code and multiple entry representation
Non-returning Call Sites
• Some functions will not return
• Examples: abort, exit
• Code following call site may not be valid
• Even if names are available, calls may be hard to detect:
    dfaerror -> fatal -> exit
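One simple way to see why named non-returning functions are not enough: non-returning status propagates up the call chain. The sketch below is a name-seeded fixed-point computation, not a first-principles detector. The `PATH_ENDS` data is hypothetical and assumes the chain on the slide means that `dfaerror` ends in a call to `fatal`, which ends in a call to `exit`.

```python
def find_non_returning(path_ends, seeds):
    """Propagate non-returning status from seed functions to their callers.

    path_ends maps a function name to how each of its paths ends: either
    "ret" or the name of the function called last on that path.
    """
    non_returning = set(seeds)
    changed = True
    while changed:
        changed = False
        for func, ends in path_ends.items():
            if func in non_returning:
                continue
            # A function cannot return if every path ends in a call
            # to a function already known not to return.
            if ends and all(end in non_returning for end in ends):
                non_returning.add(func)
                changed = True
    return non_returning


# Hypothetical summary of three functions (illustrative only).
PATH_ENDS = {
    "dfaerror": ["fatal"],
    "fatal": ["exit"],
    "main": ["ret", "dfaerror"],
}

result = find_non_returning(PATH_ENDS, {"abort", "exit"})
print(sorted(result))
```

Here `fatal` and `dfaerror` are marked non-returning although neither is in the seed set, while `main` is not, because one of its paths ends in a return.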
Detecting Non-Returning Functions
• Goal: detect non-returning functions from first principles
• Identify distinguishing features of non-returning functions
• Wide variety of behavior in non-returning functions makes this difficult
Example: operations in abort
    abort() -> sigaction()
               IO_flush_all()
               raise(SIGABRT) -> kill(getpid(),sig)
               hlt [privileged instruction]
Non-returning Call Sites
Example: GNU libc library routines
    000214d0 <__assert_fail>:
    . . .
    2160f: e8 cc db 0a 00    call cf1e0 <__libc_write>
    21614: e8 07 7f 00 00    call 29520 <abort>
    21619: 90                nop
    2161a: 90                nop
    2161b: 90                nop
    2161c: 90                nop
    2161d: 90                nop
    2161e: 90                nop
    2161f: 90                nop
    00021620 <__assert_perror_fail>:
    21620: 55                push %ebp
    21621: 89 e5             mov %esp,%ebp
    . . .
Call to abort does not return
Parser will naively follow control into the following region
Bytes following call site may not be code (e.g., jump tables, other functions, string data)
Non-contiguous Code
• Functions are not address ranges
• Symbol table representation fails
• Many sources of non-contiguous layout:
• Jump tables
• Data (strings, etc.)
• Unparsed code
• Exception handlers
• Padding or junk bytes
[Figure: Func Foo]
Non-contiguous Code
Example: Microsoft Word
    . . .
    77e7b1cb: 83 41 04 04          addl $0x4,0x4(%ecx)
    77e7b1cf: 5d                   pop %ebp
    77e7b1d0: c2 0c 00             ret $0xc
    77e7b1d3: 68 f5 06 00 00       push $0x6f5
    77e7b1d8: eb 05                jmp 0x77e7b1df
    77e7b1da: 68 e6 06 00 00       push $0x6e6
    77e7b1df: e8 bb 86 02 00       call 0x77ea389f
    77e7b1e4: 4c ba e7 77
    77e7b1e8: 34 b2 e7 77
    77e7b1ec: b5 b1 e7 77
    77e7b1f0: 0c 9f e8 77
    77e7b1f4: 96 37 e8 77
    77e7b1f8: cf b1 e7 77
    77e7b1fc: 00 00 00 00 01 01 01 02 02 02 03 03 04 02 05
    77e7b20c: 3c 10                cmp $0x10,%al
    77e7b20e: 0f 85 a6 3b 02 00    jne 0x77e9edba
    . . .
Jump table separates valid instruction sequences
Control following call site is invalid
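The six 4-byte words at 77e7b1e4 through 77e7b1f8 read naturally as code addresses. As a rough sketch, assuming they are 32-bit little-endian absolute addresses and that the table holds exactly these six entries (both assumptions, as is the helper name `decode_jump_table`), they can be decoded like this:

```python
import struct

TABLE_ADDR = 0x77e7b1e4  # address of the first table word in the listing
TABLE_BYTES = bytes.fromhex(
    "4cbae777" "34b2e777" "b5b1e777" "0c9fe877" "9637e877" "cfb1e777"
)


def decode_jump_table(raw, entry_size=4):
    """Interpret raw bytes as consecutive little-endian 32-bit addresses."""
    count = len(raw) // entry_size
    return list(struct.unpack(f"<{count}I", raw[:count * entry_size]))


targets = decode_jump_table(TABLE_BYTES)
for index, target in enumerate(targets):
    print(f"{TABLE_ADDR + 4 * index:08x}: entry {index} -> {target:08x}")
```

Under this reading the last entry decodes to 77e7b1cf, the address of the `pop %ebp` instruction earlier in the same listing, which is consistent with these bytes being jump targets and not instructions.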
Named Non-contiguous Sections
Example: GNU libc library routines
    00021060 <__duplocale>:
    ....
    210f0: lock cmpxchg %ecx,0x2968(%ebx)
    210f8: jne 2118e
    210fe: xor %esi,%esi
    21100: cmp $0x6,%esi
    ...
    0002118e <_L_mutex_lock_78>:
    2118e: lea 0x2968(%ebx),%ecx
    21194: call ea0f0
    21199: jmp 210fe
Looks like shared code
Fragment is not a real function
Named Non-contiguous Sections
• Recognizing function fragments
• Have a symbol table entry
• Reached by branches from one function
• Branch back to one function
• Use combination of CFG and symbol table clues
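The three clues above can be combined into a simple predicate. This is only a sketch of the idea: the function name `looks_like_fragment`, its inputs, and the reading that the branch back must go to the same function that branched in are assumptions, not the Dyninst implementation.

```python
def looks_like_fragment(has_symbol, branches_in_from, branches_out_to):
    """Heuristic: is a named region a fragment of another function?

    branches_in_from: names of functions with a branch (not a call) into
                      the region.
    branches_out_to:  names of functions the region branches back into.
    """
    return (
        has_symbol
        and len(branches_in_from) == 1          # reached from one function
        and branches_out_to == branches_in_from  # and returns to that one
    )


# _L_mutex_lock_78 in the libc example: entered by `jne 2118e` from
# __duplocale and left by `jmp 210fe` back into __duplocale.
fragment = looks_like_fragment(True, {"__duplocale"}, {"__duplocale"})

# A region branched to by two different functions is more likely to be
# genuinely shared code than a fragment.
shared = looks_like_fragment(True, {"f", "g"}, {"f"})
print(fragment, shared)
```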
Tail Calls
Compiler has joined two functions into one
Looks like non-contiguous shared code
[Figure: Func Foo, Func Bar, Func Quux]
    . . .
    call <bar>
    . . .
    jmp <quux>
    . . .
    ret
Gap Parsing
• Gaps between known code regions may contain undiscovered functions
• Targets of indirect calls
[Figure: Func Foo, unidentified section of code, Func Bar]
Speculative parsing: pattern-based heuristics to recognize function prologues in gaps
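A pattern-based prologue search can be as simple as a byte scan. The sketch below assumes a single x86 pattern, `55 89 e5` (`push %ebp; mov %esp,%ebp`, the encoding shown at the start of `__assert_perror_fail` in the libc listing). The function name, the `image`/`base` layout, and the single-pattern choice are illustrative assumptions, and a real parser would go on to validate each candidate by parsing from it.

```python
PROLOGUE = bytes.fromhex("5589e5")  # push %ebp; mov %esp,%ebp


def scan_gap(image, base, gap_start, gap_end, pattern=PROLOGUE):
    """Return candidate function entry addresses inside [gap_start, gap_end).

    image holds the raw bytes of a code region loaded at address base.
    """
    data = image[gap_start - base:gap_end - base]
    hits = []
    pos = data.find(pattern)
    while pos != -1:
        hits.append(gap_start + pos)
        pos = data.find(pattern, pos + 1)
    return hits


# Bytes from the libc example: seven nops of padding after the call to
# abort, followed by the first two instructions of the next function.
image = bytes.fromhex("90" * 7 + "5589e5")
candidates = scan_gap(image, base=0x21619, gap_start=0x21619, gap_end=0x21623)
print([hex(addr) for addr in candidates])
```

The only candidate is 0x21620, the start of `__assert_perror_fail`.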
Exceptions
• Exception handling code is normally unreachable
• Use information in the binary where available
• Example: Linux ELF exception tables
    push %ebp
    mov %esp,%ebp
    push %ebx
    sub $0x24,%esp
    movl $0x6,0xfffffff8(%ebp)
    mov 0x8(%ebp),%eax
    mov %eax,(%esp)
    call 804aafa
    jmp 804abe9
    mov %eax,0xfffffff4(%ebp)
    cmp $0x2,%edx
    je 804ab58
    . . .
    mov 0xfffffff4(%ebp),%eax
    mov %eax,(%esp)
    call 804a388
    add $0x24,%esp
    pop %ebx
    pop %ebp
    ret
C++ style exception catch block
Shared Code Models
• Code may be shared between functions
• Multiple entry functions
• Compiler optimizations
• Analysis tools must be able to recognize and handle overlapping control flow
[Figure: Func A, Func B, Shared Code]
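Overlapping control flow can be recognized by asking which function entries reach each block. The following is a minimal sketch on a hypothetical four-block CFG shaped like the figure. The block names and helper functions are assumptions for illustration only.

```python
def reachable(cfg, entry):
    """Blocks reachable from entry along intra-procedural edges."""
    seen, stack = set(), [entry]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(cfg.get(block, ()))
    return seen


def shared_blocks(cfg, entries):
    """Map each block reached from more than one entry to its owners."""
    owners = {}
    for func, entry in entries.items():
        for block in reachable(cfg, entry):
            owners.setdefault(block, set()).add(func)
    return {block: funcs for block, funcs in owners.items() if len(funcs) > 1}


# Hypothetical CFG: Func A and Func B each have a private entry block
# and then merge into common code.
CFG = {"A0": ["S"], "B0": ["S"], "S": ["S_exit"], "S_exit": []}
ENTRIES = {"A": "A0", "B": "B0"}

shared = shared_blocks(CFG, ENTRIES)
print(shared)
```

Blocks `S` and `S_exit` are owned by both functions, while `A0` and `B0` each belong to one.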
Summary of Binary Analysis Techniques
• Control flow traversal is a powerful tool for addressing the challenges of modern binaries
• Lying/missing symbol tables
• Data/code disambiguation
• Jump tables
• Speculative parsing techniques can be useful for expanding code coverage
• Gaps in code
• Indirect calls and branches
Incidence of Shared Code in Binaries
• Parsed 828 Linux/x86 binaries
• 238 contained shared code
• Most binaries contain only a few code-sharing functions
• Some code sharing may be due to non-returning call sites
Where Do We Go From Here?
• Are there good solutions from first principles?
• Almost certainly.
• We are just starting to explore the limits of such techniques.
• Are special case solutions necessary?
• Again, almost certainly.
• We will try to use these as sparingly as possible.
Future Directions in Binary Analysis
• Problem: code exists but is unreachable through standard control-flow traversal parsing
• Heuristics are a moving target
• Existing opportunistic parsing techniques can help, but only to an extent
• Exception handlers, virtual function tables may be recoverable from the binary
• Given the information we can recover from traditional techniques, can we synthesize additional information that will increase coverage of the binary?
Statistical Binary Parsing
• Can we utilize known code to find unknown code?
• We have a partial parse of the binary
• Code in unknown regions of the binary will likely share characteristics with previously identified code
• Identify code in unknown regions:
• Create a probabilistic model of valid code
• Identify sections of unknown regions in the binary that are similar to valid code
Binary Modeling Techniques
• Code idioms are one possibility for validating potential code
• Function preambles, jump table bounds tests, system call stubs, case statements
• Idioms can be identified manually
• Model can be trained to identify new idioms with machine learning techniques
• n-gram models, long-distance interaction
• Unparsed code can be scored to indicate its statistical similarity to known code
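To make the scoring idea concrete, here is a small sketch of a bigram (2-gram) model over instruction mnemonics with add-one smoothing. It is one possible instantiation, not the model used in Dyninst: the training sequences are invented, and the function names and the choice of mnemonics as tokens are assumptions.

```python
import math
from collections import Counter


def train_bigrams(sequences):
    """Count mnemonic bigrams in instruction sequences from known code."""
    contexts, bigrams = Counter(), Counter()
    for seq in sequences:
        contexts.update(seq[:-1])
        bigrams.update(zip(seq, seq[1:]))
    vocab = {op for seq in sequences for op in seq}
    return contexts, bigrams, vocab


def score(seq, model):
    """Average log-probability per bigram; higher means more code-like."""
    contexts, bigrams, vocab = model
    size = len(vocab) + 1  # one extra slot for unseen mnemonics
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # Add-one smoothing keeps unseen bigrams from scoring -infinity.
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + size))
    return total / max(len(seq) - 1, 1)


# Invented training data standing in for already-parsed functions.
known_code = [
    ["push", "mov", "sub", "call", "ret"],
    ["push", "mov", "call", "ret"],
]
model = train_bigrams(known_code)

plausible = score(["push", "mov", "call", "ret"], model)
implausible = score(["ret", "ret", "push", "sub"], model)
print(plausible > implausible)
```

A candidate region whose decoded instructions score close to known code would be a better target for speculative parsing than one that scores far below it.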
Open Questions in Binary Analysis
• What learning techniques will yield the best results?
• How can we overcome the relative dearth of information in binaries with very little code reachable through control flow analysis?
• Incorporate information from analysis of other binaries
• What techniques will allow us to accurately identify the range of recognizable code?
Questions?
Backup Slides
Shared Code Models
[Figure: two models side by side. Shared Code: Func A, Func B. Multiple Entry: Entry A, Entry B.]
What is the difference from the perspective of the parser?
A Choice of Abstraction
• Shared code and multiple entry models are similar
• Represent independent flows of control merging together
• Shared model is a better fit for Dyninst
• Preserves semantic guarantees of function independence
Shared Code
Example: GNU libc library routines
    000a94c0 <__waitpid>:
    a94c0: cmpl $0x0,%gs:0xc
    a94c8: jne a94e7
    000a94ca <__waitpid_nocancel>:
    a94ca: push %ebx
    a94cb: mov 0x10(%esp,1),%edx
    a94cf: mov 0xc(%esp,1),%ecx
    a94d3: mov 0x8(%esp,1),%ebx
    a94d7: mov $0x7,%eax
    a94dc: int $0x80
    a94de: pop %ebx
    a94df: cmp $0xfffff001,%eax
    a94e4: jae a9513
    . . .
Code common to the two functions is marked as shared.