The New Dyninst Code Parser: Binary Code Isn't as Simple as it Used to Be

Presentation Transcript


  1. The New Dyninst Code Parser: Binary Code Isn't as Simple as it Used to Be Nathan Rosenblum University of Wisconsin nater@cs.wisc.edu Unconventional Code Constructs

  2. Binary Analysis • Processing of the binary code to extract syntactic and symbolic information from many sources: • Symbol tables (if present) • Decode (disassemble) instructions • Control-flow information: basic blocks, loops, functions • Data-flow information: from basic register information to highly sophisticated (and expensive) analyses.

  3. Products of Binary Analysis • High-level organization and characteristics • Function entry/exit points • Inter-procedural call graph • Intra-procedural control-flow graph • Exception handlers • Jump tables • Virtual function tables • Abstract assembly representation • Data-flow characteristics • Register liveness (for instrumentation, modification)

  4. Uses of Binary Analysis • Debugging • Testing • Performance profiling • Performance modeling • Behavior Modeling • Dynamic Modification • Binary Rewriting • Reverse engineering

  5. Binary Analysis Tool Goals

  6. Why is Binary Analysis Hard?
     Source Code:
         Func foo() { … switch(a) { … } … }
     The Compiler
     Binary:
         push %ebp
         mov %esp, %ebp
         …
         mov [0x1d], %eax
         jmp *%eax
         …

  7. Current Approaches • Linear disassembly of binaries is insufficient • Symbol tables often lie, or are absent • Functions are not address ranges, may be non-contiguous • Parsing based on program control flow • Commonly used approach: • Must contend with gaps in known code regions after parsing

  8. Dyninst Control Flow Parsing • Opportunistic parsing: • Utilizes symbol table and other information when available (and sensible) • Provides more accurate view of the binary than linear disassembly • Addresses problem of gaps in the binary through speculative parsing • Heuristics to identify function preambles

  9. Control Flow Traversal Illustrated
     Parsing follows control flow
     Control transfers are edges in the CFG
     Target blocks can be parsed in any order
         <func foo>:
         00: mov [a8], r1
         04: mov [ac], r2
         08: add r1, r2, r3
         0c: cmp r3, 0
         10: bne 24
         14: call <bar>
         18: add r3, 8, r3
         1c: call <baz>
         20: jmp 28
         24: mul r2, 2, r3
         28: sub r1, r3, r1
         . . .
     [Figure: CFG with blocks 00, 24, 14, 28]

  10. Control Flow Traversal Illustrated
      Call sites determine location of functions
      Targets of calls are added to the function parsing work list
      Known Functions: foo, quux, quuux, bar, baz
          <func foo>:
          00: mov [a8], r1
          04: mov [ac], r2
          08: add r1, r2, r3
          0c: cmp r3, 0
          10: bne 24
          14: call <bar>
          18: add r3, 8, r3
          1c: call <baz>
          20: jmp 28
          24: mul r2, 2, r3
          28: sub r1, r3, r1
          . . .
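The traversal illustrated on slides 9 and 10 can be sketched as a small work-list algorithm. This is only an illustration under stated assumptions, not Dyninst's implementation: the dictionary encoding of `<func foo>`, the fixed 4-byte instruction size, and the names `FOO` and `traverse` are hypothetical, and every call is assumed to return.

```python
from collections import deque

INSN_SIZE = 4  # assumption: addresses in the toy listing are 4 bytes apart

# Hypothetical encoding of <func foo>: address -> (mnemonic, target or None)
FOO = {
    0x00: ("mov", None), 0x04: ("mov", None), 0x08: ("add", None),
    0x0c: ("cmp", None), 0x10: ("bne", 0x24), 0x14: ("call", "bar"),
    0x18: ("add", None), 0x1c: ("call", "baz"), 0x20: ("jmp", 0x28),
    0x24: ("mul", None), 0x28: ("sub", None),
}

def traverse(code, entry):
    """Follow control flow from entry.

    Returns the set of block start addresses and the list of call targets
    (candidates for the function parsing work list).
    """
    block_starts = set()
    callees = []
    work = deque([entry])
    while work:
        addr = work.popleft()
        if addr in block_starts or addr not in code:
            continue
        block_starts.add(addr)
        while addr in code:
            op, target = code[addr]
            fallthrough = addr + INSN_SIZE
            if op == "call":
                callees.append(target)   # assumed to return, so keep going
            elif op == "jmp":
                work.append(target)      # unconditional: one successor
                break
            elif op == "bne":
                work.extend([target, fallthrough])  # taken and not-taken edges
                break
            addr = fallthrough
            if addr in block_starts:     # ran into an already known block
                break
    return block_starts, callees

starts, callees = traverse(FOO, 0x00)
print(sorted(hex(a) for a in starts), callees)
```

The block starts found this way match the four CFG nodes in the slide 9 figure, and the call targets are what slide 10 adds to the list of known functions.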

  11. Binary Parsing Challenges • Pointer-based control transfer • Non-returning calls • Non-contiguous code sections • Tail calls • Gaps in the binary • Exception handlers • Shared code and multiple entry representation

  12. Non-returning Call Sites • Some functions will not return • Examples: abort, exit • Code following call site may not be valid • Even if names are available, calls may be hard to detect: dfaerror -> fatal -> exit

  13. Detecting Non-Returning Functions
      • Goal: detect non-returning functions from first principles
      • Identify distinguishing features of non-returning functions
      • Wide variety of behavior in non-returning functions makes this difficult
      Example: operations in abort
          abort() -> sigaction()
                     IO_flush_all()
                     raise(SIGABRT) -> kill(getpid(),sig)
                     hlt [privileged instruction]

  14. Non-returning Call Sites
      Example: GNU libc library routines
          000214d0 <__assert_fail>:
          . . .
             2160f: e8 cc db 0a 00    call cf1e0 <__libc_write>
             21614: e8 07 7f 00 00    call 29520 <abort>
             21619: 90                nop
             2161a: 90                nop
             2161b: 90                nop
             2161c: 90                nop
             2161d: 90                nop
             2161e: 90                nop
             2161f: 90                nop
          00021620 <__assert_perror_fail>:
             21620: 55                push %ebp
             21621: 89 e5             mov %esp,%ebp
          . . .
      Call to abort does not return
      Parser will naively follow control into the following region
      Bytes following call site may not be code (e.g., jump tables, other functions, string data)
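A minimal sketch of how a parser might stop at a non-returning call site, using the tail of `__assert_fail` from the listing above. The tuple layout and the names `NON_RETURNING`, `TAIL`, and `parse_linear` are hypothetical, and the set holds only the two examples the slides name (abort, exit); how Dyninst actually decides that a callee does not return is not shown here.

```python
# Assumption: a fixed set of names stands in for real non-return analysis.
NON_RETURNING = {"abort", "exit"}

# Tail of __assert_fail from the slide: (address, size, mnemonic, callee)
TAIL = [
    (0x2160f, 5, "call", "__libc_write"),
    (0x21614, 5, "call", "abort"),
    (0x21619, 1, "nop", None),
    (0x2161a, 1, "nop", None),
]

def parse_linear(insns, non_returning):
    """Return addresses accepted as code, stopping after a non-returning call."""
    parsed = []
    for addr, _size, op, callee in insns:
        parsed.append(addr)
        if op == "call" and callee in non_returning:
            break  # no fall-through edge: the following bytes may not be code
    return parsed

print([hex(a) for a in parse_linear(TAIL, NON_RETURNING)])
```

With the set emptied, the same loop walks into the padding bytes, which is the naive behavior the slide warns about.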

  15. Non-contiguous Code • Functions are not address ranges • Symbol table representation fails • Many sources of non-contiguous layout: • Jump tables • Data (strings, etc.) • Unparsed code • Exception handlers • Padding or junk bytes [Figure: Func Foo]

  16. Non-contiguous Code
      Example: Microsoft Word
          . . .
          77e7b1cb: 83 41 04 04          addl $0x4,0x4(%ecx)
          77e7b1cf: 5d                   pop %ebp
          77e7b1d0: c2 0c 00             ret $0xc
          77e7b1d3: 68 f5 06 00 00       push $0x6f5
          77e7b1d8: eb 05                jmp 0x77e7b1df
          77e7b1da: 68 e6 06 00 00       push $0x6e6
          77e7b1df: e8 bb 86 02 00       call 0x77ea389f
          77e7b1e4: 4c ba e7 77
          77e7b1e8: 34 b2 e7 77
          77e7b1ec: b5 b1 e7 77
          77e7b1f0: 0c 9f e8 77
          77e7b1f4: 96 37 e8 77
          77e7b1f8: cf b1 e7 77
          77e7b1fc: 00 00 00 00 01 01 01 02 02 02 03 03 04 02 05
          77e7b20c: 3c 10                cmp $0x10,%al
          77e7b20e: 0f 85 a6 3b 02 00    jne 0x77e9edba
          . . .
      Jump table separates valid instruction sequences
      Control following call site is invalid
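To see why the bytes at 77e7b1e4 read as a table rather than as instructions, they can be decoded as 32-bit little-endian words. The sketch below does that for the bytes shown on the slide; the code-region bounds `lo` and `hi` are hypothetical values chosen for illustration, as is the rule of stopping at the first word outside them.

```python
import struct

# Bytes from 77e7b1e4 through the first zero word at 77e7b1fc, as on the slide.
RAW = bytes.fromhex(
    "4cbae777" "34b2e777" "b5b1e777" "0c9fe877"
    "9637e877" "cfb1e777" "00000000"
)

def decode_table(data, lo, hi):
    """Read 32-bit little-endian words; stop at the first one outside [lo, hi)."""
    targets = []
    for (word,) in struct.iter_unpack("<I", data):
        if not lo <= word < hi:
            break
        targets.append(word)
    return targets

# Assumption: the module's code lies somewhere in [0x77e70000, 0x77eb0000).
for target in decode_table(RAW, 0x77e70000, 0x77eb0000):
    print(hex(target))
```

One of the decoded words, 0x77e7b1cf, is the address of the `pop %ebp` instruction a few lines earlier in the same listing, which is consistent with these words being code pointers.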

  17. Named Non-contiguous Sections
      Example: GNU libc library routines
          00021060 <__duplocale>:
          . . .
             210f0: lock cmpxchg %ecx,0x2968(%ebx)
             210f8: jne    2118e
             210fe: xor    %esi,%esi
             21100: cmp    $0x6,%esi
          . . .
          0002118e <_L_mutex_lock_78>:
             2118e: lea    0x2968(%ebx),%ecx
             21194: call   ea0f0
             21199: jmp    210fe
      Looks like shared code
      Fragment is not a real function

  18. Named Non-contiguous Sections • Recognizing function fragments • Have a symbol table entry • Reached by branches from one function • Branch back to one function • Use combination of CFG and symbol table clues
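The three clues listed above can be combined into a simple predicate. This is a sketch only: the function name `looks_like_fragment` and its parameters are hypothetical, and requiring the source and destination to be the same function is an assumption drawn from the `_L_mutex_lock_78` example rather than something the slide states.

```python
def looks_like_fragment(has_symbol, branched_from, branches_to):
    """Apply the slide's clues to a named code region.

    has_symbol    -- the region has its own symbol table entry
    branched_from -- set of functions with branches into the region
    branches_to   -- set of functions the region branches back into
    """
    return (
        has_symbol
        and len(branched_from) == 1
        and len(branches_to) == 1
        and branched_from == branches_to  # assumption: same function both ways
    )

# _L_mutex_lock_78 is reached by `jne 2118e` in __duplocale and returns
# with `jmp 210fe`, which lands back inside __duplocale.
print(looks_like_fragment(True, {"__duplocale"}, {"__duplocale"}))
```

A region entered from two different functions fails the test and would be treated as ordinary shared code instead.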

  19. Tail Calls
      Compiler has joined two functions into one
      Looks like non-contiguous shared code
      [Figure: Func Foo, Func Bar, Func Quux; instruction sequence . . . call <bar> . . . jmp <quux> . . . ret]

  20. Gap Parsing • Gaps between known code regions may contain undiscovered functions • Targets of indirect calls. Speculative parsing: pattern-based heuristics to recognize function prologues in gaps. [Figure: Func Foo, unidentified section of code, Func Bar]
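A pattern-based gap scan can be sketched as a byte search. The pattern used here, `55 89 e5` (`push %ebp; mov %esp,%ebp`), is the frame setup visible at the start of `__assert_perror_fail` on slide 14; treating it as the only prologue pattern is an assumption of this sketch, and the names `PROLOGUE` and `find_prologues` are hypothetical.

```python
PROLOGUE = bytes.fromhex("5589e5")  # push %ebp; mov %esp,%ebp

def find_prologues(image, base, gaps):
    """Return addresses inside each [start, end) gap where PROLOGUE begins.

    image -- raw bytes of the code section
    base  -- address of image[0]
    gaps  -- address ranges not covered by control-flow traversal
    """
    hits = []
    for start, end in gaps:
        chunk = image[start - base:end - base]
        pos = chunk.find(PROLOGUE)
        while pos != -1:
            hits.append(start + pos)
            pos = chunk.find(PROLOGUE, pos + 1)
    return hits

# Seven nop padding bytes at 0x21619, then the prologue at 0x21620.
image = b"\x90" * 7 + PROLOGUE
print([hex(a) for a in find_prologues(image, 0x21619, [(0x21619, 0x21623)])])
```

Each hit is only a candidate function entry. A byte search can match inside data or in the middle of another instruction, so candidates would still need to be validated by parsing from them.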

  21. Exceptions
      • Exception handling code is normally unreachable
      • Use information in the binary where available
      • Example: Linux ELF exception tables
          push %ebp
          mov %esp,%ebp
          push %ebx
          sub $0x24,%esp
          movl $0x6,0xfffffff8(%ebp)
          mov 0x8(%ebp),%eax
          mov %eax,(%esp)
          call 804aafa
          jmp 804abe9
          mov %eax,0xfffffff4(%ebp)
          cmp $0x2,%edx
          je 804ab58
          . . .
          mov 0xfffffff4(%ebp),%eax
          mov %eax,(%esp)
          call 804a388
          add $0x24,%esp
          pop %ebx
          pop %ebp
          ret
      C++ style exception catch block

  22. Shared Code Models • Code may be shared between functions • Multiple entry functions • Compiler optimizations • Analysis tools must be able to recognize and handle overlapping control flow [Figure: Func A, Func B, Shared Code]

  23. Summary of Binary Analysis Techniques • Control flow traversal is a powerful tool for addressing the challenges of modern binaries • Lying/missing symbol tables • Data/code disambiguation • Jump tables • Speculative parsing techniques can be useful for expanding code coverage • Gaps in code • Indirect calls and branches

  24. Incidence of Shared Code in Binaries • Parsed 828 Linux/x86 binaries • 238 (about 29%) contained shared code • Most binaries contain only a few code-sharing functions • Some code sharing may be due to non-returning call sites

  25. Where Do We Go From Here? • Are there good solutions from first principles? • Almost certainly. • We are just starting to explore the limits of such techniques. • Are special case solutions necessary? • Again, almost certainly. • We will try to use these as sparingly as possible.

  26. Future Directions in Binary Analysis • Problem: code exists but is unreachable through standard control-flow traversal parsing • Heuristics are a moving target • Existing opportunistic parsing techniques can help, but only to an extent • Exception handlers, virtual function tables may be recoverable from the binary • Given the information we can recover from traditional techniques, can we synthesize additional information that will increase coverage of the binary?

  27. Statistical Binary Parsing • Can we utilize known code to find unknown code? • We have a partial parse of the binary • Code in unknown regions of the binary will likely share characteristics with previously identified code • Identify code in unknown regions: • Create a probabilistic model of valid code • Identify sections of unknown regions in the binary that are similar to valid code

  28. Binary Modeling Techniques • Code idioms are one possibility for validating potential code • Function preambles, jump table bounds tests, system call stubs, case statements • Idioms can be identified manually • Model can be trained to identify new idioms with machine learning techniques • n-gram models, long-distance interaction • Unparsed code can be scored to indicate its statistical similarity to known code
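As one concrete reading of "n-gram models", the sketch below trains a bigram model over instruction mnemonics from already parsed code and scores a candidate sequence by its average log-probability. The choice of mnemonics as tokens, add-one smoothing, the training sequences, and the names `train_bigrams` and `score` are all assumptions made for illustration, not the model described in the talk.

```python
import math
from collections import Counter

def train_bigrams(sequences):
    """Count mnemonic bigrams in known code; '<s>' marks a sequence start."""
    contexts, bigrams = Counter(), Counter()
    for seq in sequences:
        toks = ["<s>"] + list(seq)
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    vocab_size = len({t for seq in sequences for t in seq}) + 1  # +1 for unseen
    return contexts, bigrams, vocab_size

def score(seq, model):
    """Average log-probability per token, with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    toks = ["<s>"] + list(seq)
    logp = 0.0
    for prev, cur in zip(toks, toks[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return logp / max(len(seq), 1)

# Hypothetical training data standing in for the partial parse of a binary.
KNOWN = [
    ["push", "mov", "sub", "call", "ret"],
    ["push", "mov", "mov", "call", "ret"],
]
model = train_bigrams(KNOWN)
plausible = score(["push", "mov", "call", "ret"], model)
implausible = score(["ret", "ret", "push", "push"], model)
print(plausible > implausible)
```

A higher score means the candidate decodes into transitions that are common in known code. Where to set the acceptance threshold is left open here, as it is on the slide.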

  29. Open Questions in Binary Analysis • What learning techniques will yield the best results? • How can we overcome the relative dearth of information in binaries with very little code reachable through control flow analysis? • Incorporate information from analysis of other binaries • What techniques will allow us to accurately identify the range of recognizable code?

  30. Questions?

  31. Backup Slides

  32. Shared Code Models • What is the difference from the perspective of the parser? [Figure: two panels, "Shared Code" (Func A, Func B) and "Multiple Entry" (Entry A, Entry B)]

  33. A Choice of Abstraction • Shared code and multiple entry models are similar • Represent independent flows of control merging together • Shared model is a better fit for Dyninst • Preserves semantic guarantees of function independence

  34. Shared Code
      Example: GNU libc library routines
          000a94c0 <__waitpid>:
             a94c0: cmpl $0x0,%gs:0xc
             a94c8: jne a94e7
          000a94ca <__waitpid_nocancel>:
             a94ca: push %ebx
             a94cb: mov 0x10(%esp,1),%edx
             a94cf: mov 0xc(%esp,1),%ecx
             a94d3: mov 0x8(%esp,1),%ebx
             a94d7: mov $0x7,%eax
             a94dc: int $0x80
             a94de: pop %ebx
             a94df: cmp $0xfffff001,%eax
             a94e4: jae a9513
          . . .
      Code common to the two functions is marked as shared.
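Under the shared code model, blocks reachable from more than one function entry are the ones marked as shared. The sketch below computes that as a set intersection over a block graph built from the `__waitpid` listing. Only edges visible on the slide are included (the fall-through of `jae` lies in the elided part), and the names `CFG`, `reachable`, and `shared_blocks` are hypothetical.

```python
# Block start -> successor block starts, taken from the visible listing only.
CFG = {
    0xa94c0: [0xa94ca, 0xa94e7],  # jne a94e7, or fall through into a94ca
    0xa94ca: [0xa9513],           # jae a9513 (fall-through elided on the slide)
}

def reachable(cfg, entry):
    """Blocks reachable from entry by following CFG edges (iterative DFS)."""
    seen, stack = set(), [entry]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(cfg.get(block, ()))
    return seen

def shared_blocks(cfg, entry_a, entry_b):
    """Blocks that belong to both functions."""
    return reachable(cfg, entry_a) & reachable(cfg, entry_b)

# __waitpid enters at a94c0, __waitpid_nocancel at a94ca.
print(sorted(hex(b) for b in shared_blocks(CFG, 0xa94c0, 0xa94ca)))
```

Here every block of `__waitpid_nocancel` is also part of `__waitpid`, while the entry block at a94c0 and the branch target a94e7 belong to `__waitpid` alone.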
