Machine-Learning Assisted Binary Code Analysis

N. Rosenblum, X. Zhu, B. Miller, Computer Sciences Department, University of Wisconsin-Madison, {nater,jerryzhu,bart}@cs.wisc.edu. K. Hunt, National Security Agency, huntkc@gmail.com.

Presentation Transcript


  1. Machine-Learning Assisted Binary Code Analysis
     N. Rosenblum, X. Zhu, B. Miller, Computer Sciences Department, University of Wisconsin-Madison, {nater,jerryzhu,bart}@cs.wisc.edu
     K. Hunt, National Security Agency, huntkc@gmail.com

  2. Supporting Static Binary Analysis
     Binary analysis is a foundational technique for many areas.
     Example uses:
     • Malware detection
     • Vulnerability analysis
     • Static and dynamic instrumentation
     • Formal verification
     Why analyze binaries?
     • Source code is unavailable (e.g., malware)
     • Source code is inaccurate, because the compiler transforms program structure
     • The binary provides the most accurate representation
     Code is found through symbol information and parsing, which is much harder without symbols.
     Rosenblum, Zhu, Miller, Hunt

  3. Many Binaries are Stripped
     Stripped binaries lack symbol and debug information.
     Examples:
     • Malicious programs
     • Operating system distributions
     • Commercial software packages
     • Legacy codes
     [Diagram: a binary laid out as headers, code segment (functions?), and data segment]
     Standard approach: parse from the entry point.
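The standard approach named on this slide, parsing from the entry point, can be sketched as a recursive traversal that follows direct calls and branches. The Python below is a minimal illustration on a toy instruction table. The `program` encoding, the instruction kinds, and the name `recursive_traversal` are assumptions made for this example and are not Dyninst's API.

```python
def recursive_traversal(program, entry):
    """Follow direct control flow from `entry`.

    `program` maps address -> (length, kind, target), where kind is one of
    'seq', 'jmp', 'jcc', 'call', 'ret' and target is None for indirect
    transfers (hypothetical toy encoding).
    Returns (function entry addresses found, instruction addresses visited).
    """
    seen, functions, worklist = set(), {entry}, [entry]
    while worklist:
        addr = worklist.pop()
        while addr in program and addr not in seen:
            seen.add(addr)
            length, kind, target = program[addr]
            if kind == "call" and target is not None:
                functions.add(target)      # a direct call target is a function entry
                worklist.append(target)
            elif kind in ("jmp", "jcc") and target is not None:
                worklist.append(target)    # a branch target is code, not a new function
            if kind in ("jmp", "ret"):
                break                      # no fall-through after these
            addr += length
    return functions, seen

toy_program = {
    0x100: (2, "call", 0x200),  # direct call: target is discovered
    0x102: (2, "call", None),   # indirect call: target unknown statically
    0x104: (1, "ret", None),
    0x200: (1, "seq", None),
    0x201: (1, "ret", None),
    0x300: (1, "seq", None),    # only reachable through the pointer
    0x301: (1, "ret", None),
}
functions, visited = recursive_traversal(toy_program, 0x100)
```

In this toy program the function at 0x300 is reachable only through the indirect call, so the traversal never visits it. That is the kind of code that ends up in a gap.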

  4. Stripped Binaries Exhibit Gaps
     After static parsing, gap regions remain in the code segment.
     • Indirect (pointer-based) control flow is ambiguous
     • Calls and branches may be deliberately obfuscated
     • Gaps in the code segment may not contain code
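To make the notion of a gap concrete, here is a minimal Python sketch. It assumes the parser reports the byte ranges it recovered as half-open (start, end) intervals inside the code segment. The name `find_gaps` and the example addresses are hypothetical.

```python
def find_gaps(segment_start, segment_end, parsed_ranges):
    """Return the half-open byte ranges of the code segment that no parsed
    function covers. Inputs are illustrative, not taken from a real binary."""
    gaps = []
    cursor = segment_start
    for start, end in sorted(parsed_ranges):
        if start > cursor:
            gaps.append((cursor, start))   # uncovered bytes before this range
        cursor = max(cursor, end)          # tolerate overlapping ranges
    if cursor < segment_end:
        gaps.append((cursor, segment_end))
    return gaps

gaps = find_gaps(0x1000, 0x1100, [(0x1000, 0x1040), (0x1060, 0x10a0)])
```

Here two regions are left over, 0x1040 to 0x1060 and 0x10a0 to 0x1100. Nothing in this computation says whether those bytes are code, strings, or address tables.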

  5. Stripped Binaries Exhibit Gaps
     Gap contents may vary.
     String data:
     • Dialog constants
     • Import names
     • Other strings
     Example: .__gmon_start__.libc.so.6.stpcpy.strcpy.__divdi3.printf.stdout.strerror.memmove.getopt_long.re_syntax_options.__ctype_b.getenv.__strtol_internal.getpagesize.re_search_2.memcpy.puts.feof.malloc.optarg.btowc._obstack_newchunk.re_match.__ctype_toupper.__xstat64.abort.strrchr._obstack_begin.calloc.re_set_registers.fprintf.

  6. Stripped Binaries Exhibit Gaps
     Gap contents may vary.
     Tables or lists of addresses:
     • Jump tables
     • Virtual function tables
     • Data objects
     Example: 0x8022346 0x802434b 0x80243ad 0x80403d0 0x80503d0 0x8052140 0x8053142 0x806000b 0x802321a 0x8023332 0x804132a 0x8050ca0

  7. Stripped Binaries Exhibit Gaps
     Gap contents may vary.
     Code unreachable through standard static parsing:
     • Function pointers
     • Virtual methods
     • Obfuscated calls
     Example: gap_funcA { . . . } gap_funcB { . . . } gap_funcC { . . . }

  8. Stripped Binaries Exhibit Gaps
     Gap contents may vary.
     7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65 20 69 73 20 62 6f 67 75 73 2e 2e 2e
     7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65 20 69 73 20 62 6f 67 75 73 2e 2e 2e
     7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65 20 69 73 20 62 6f
     But all of these just look like bytes, and every byte in a gap may be the start of a function.
     How can we find code in gaps?
     • Our approach: use information in known code to model code in gaps
     • Previous work (Vigna et al., 2007) augments parsing with simple instruction frequency information

  9. Modeling Binary Code
     The problem reduces to finding function entry points (FEPs).
     Task: classify every byte in a gap as an entry point or a non-entry point.
     Two types of features:
     • Content: idiom features of function entry points
       • Based on instruction sequences
     • Structure: control flow and conflict features
       • Capture the relationships among candidate function entry points
       • Require joint assignment over all function entry point candidates

  10. Content-based Features
      Entry idioms are common patterns at function entry points.
      Idioms are preceding (PRE) and succeeding instruction sequences with wildcards (*).
      [Diagram: candidate C1, with a feature defined for each idiom u; the formula is not preserved in this transcript]
      Entry idioms:
      • push ebp
      • push ebp|mov esp,ebp
      • push ebp|*|sub esp
      • push ebp|*|mov esp,ebp
      • *|mov esp,ebp
      • *|sub 0x8,esp
      • *|mov 0x8(ebp),eax
      • PRE nop
      • PRE ret|nop
      • PRE pop ebp|*|nop
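The idiom features on this slide can be illustrated with a small matcher. This is a sketch under stated assumptions, not the authors' implementation. It assumes the bytes have already been decoded into a list of instruction strings, that `*` matches exactly one instruction, and that a non-wildcard token must equal the instruction text exactly. The function names are hypothetical.

```python
WILDCARD = "*"

def matches_entry_idiom(instructions, index, idiom):
    """True if `idiom` matches the instructions starting at `index`."""
    window = instructions[index:index + len(idiom)]
    if len(window) < len(idiom):
        return False
    return all(tok == WILDCARD or tok == insn for tok, insn in zip(idiom, window))

def matches_prefix_idiom(instructions, index, idiom):
    """True if `idiom` matches the instructions immediately before `index`."""
    start = index - len(idiom)
    if start < 0:
        return False
    return matches_entry_idiom(instructions, start, idiom)

def idiom_features(instructions, index, entry_idioms, prefix_idioms):
    """One 0/1 feature per idiom for the candidate at `index`."""
    feats = [int(matches_entry_idiom(instructions, index, u)) for u in entry_idioms]
    feats += [int(matches_prefix_idiom(instructions, index, u)) for u in prefix_idioms]
    return feats

# Toy decoded sequence; the candidate entry point is the instruction at index 2.
insns = ["ret", "nop", "push ebp", "mov esp,ebp", "sub 0x8,esp"]
entry_idioms = [
    ("push ebp",),
    ("push ebp", "mov esp,ebp"),
    ("push ebp", "*", "sub 0x8,esp"),
    ("*", "mov esp,ebp"),
]
prefix_idioms = [("nop",), ("ret", "nop")]

at_candidate = idiom_features(insns, 2, entry_idioms, prefix_idioms)
at_start = idiom_features(insns, 0, entry_idioms, prefix_idioms)
```

Every idiom fires for the candidate at index 2, and none fires at index 0. A feature vector like this is what a per-candidate classifier would consume.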

  11. Call Consistency & Overlap
      Call and conflict features relate candidate FEPs over the entire gap.
      [Diagram: candidates C1, C2, C3, C4 with labels y1 = 1, y2 = 1, y3 = -1, y4 = 1]

  12. Markov Random Field Formalization
      • Joint assignment of yi ∈ {1, -1} for each FEP xi in binary P
      • Unary idiom features fu
        • Weights trained through logistic regression
      • Binary features fo (overlap) and fc (call consistency)
        • Weights are large and negative
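The joint assignment can be illustrated on four candidates. The sketch below assumes a log-linear score: each accepted candidate contributes its unary idiom score, and each violated pairwise constraint contributes a large negative weight. The weight values, the direction of the call-consistency penalty (accepting a caller while rejecting its call target), and the brute-force search are all assumptions for illustration. The slides do not state the inference procedure, and exhaustive search is only feasible for tiny examples.

```python
from itertools import product

def joint_score(labels, unary, overlaps, calls, w_overlap=-10.0, w_call=-10.0):
    """Unnormalized log-score of one joint labeling.

    labels:   tuple of +1 / -1, one per candidate
    unary:    per-candidate idiom score (positive favors 'entry point')
    overlaps: pairs (i, j) whose decoded code conflicts
    calls:    pairs (i, j) where candidate i's code calls candidate j's address
    """
    score = sum(u for y, u in zip(labels, unary) if y == 1)
    # Overlap: penalize labeling two conflicting candidates both as entry points.
    score += sum(w_overlap for i, j in overlaps if labels[i] == 1 and labels[j] == 1)
    # Call consistency (assumed reading): penalize accepting a caller
    # while rejecting the address it calls.
    score += sum(w_call for i, j in calls if labels[i] == 1 and labels[j] == -1)
    return score

def best_assignment(unary, overlaps, calls):
    """Exhaustive search over all labelings; exponential, for illustration only."""
    return max(product((1, -1), repeat=len(unary)),
               key=lambda ys: joint_score(ys, unary, overlaps, calls))

# Hypothetical scores for candidates C1..C4 (indices 0..3).
unary = [2.0, 1.5, 1.0, -0.5]
overlaps = [(1, 2)]   # C2 and C3 decode to conflicting instructions
calls = [(0, 3)]      # C1's code calls C4's address

independent = tuple(1 if u > 0 else -1 for u in unary)
joint = best_assignment(unary, overlaps, calls)
```

Deciding each candidate on its own accepts C3 and rejects C4. The joint labeling rejects C3 because it conflicts with the stronger C2, and accepts C4 because C1 calls it. This reproduces the labeling y1 = 1, y2 = 1, y3 = -1, y4 = 1 shown on the Call Consistency & Overlap slide.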

  13. Experimental Setup
      Data:
      • Large set (hundreds) of binaries from department Linux servers and Windows workstations
      • Additional binaries compiled with the Intel compiler
      • Binaries have full symbol information
      Model implemented as extensions to the Dyninst instrumentation library.
      Procedure:
      1. Strip copies of the binaries and parse them to obtain the training set
      2. Select the top idiom features by forward feature selection
      3. Perform logistic regression to build the idiom model
      4. Evaluate the model on test data from the gap regions found in step 1
         • Unstripped copies of the binaries provide the reference set

  14. Idiom Feature Selection & Training
      1. Obtain training data (statically reachable functions) from a traditional parse; the corpus is hundreds of stripped binaries
      2. Use Condor HTC to drive forward feature selection on idioms (features Feat1, Feat2, Feat3, ..., Featk)
      3. Perform logistic regression on the selected idiom features to obtain model parameters t
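Steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline. It uses 0/1 labels rather than the slides' 1/-1, a hand-written gradient-ascent logistic regression, and the training log-likelihood as the selection criterion, where a real setup would more likely score on held-out data. All function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    # w holds one weight per feature followed by a bias term.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])

def train_logreg(rows, labels, steps=500, lr=0.5):
    """Fit logistic regression by batch gradient ascent on the log-likelihood."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - predict(w, x)
            for k, xk in enumerate(x):
                grad[k] += err * xk
            grad[-1] += err
        w = [wi + lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

def log_likelihood(w, rows, labels):
    total = 0.0
    for x, y in zip(rows, labels):
        p = min(max(predict(w, x), 1e-12), 1.0 - 1e-12)
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def forward_select(rows, labels, max_feats, min_gain=1e-3):
    """Greedily add the feature column that most improves the fit."""
    chosen, best_ll = [], float("-inf")
    remaining = list(range(len(rows[0])))
    while remaining and len(chosen) < max_feats:
        scored = []
        for c in remaining:  # each candidate evaluation is independent
            sub = [[x[k] for k in chosen + [c]] for x in rows]
            w = train_logreg(sub, labels)
            scored.append((log_likelihood(w, sub, labels), c))
        ll, c = max(scored)
        if ll - best_ll < min_gain:
            break
        chosen.append(c)
        remaining.remove(c)
        best_ll = ll
    return chosen

# Toy data: column 0 tracks the label, column 1 is noise, column 2 is constant.
rows = [[1, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1],
        [0, 0, 1], [0, 1, 1], [1, 1, 1], [1, 0, 1]]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
selected = forward_select(rows, labels, max_feats=1)
```

Each round retrains one model per remaining feature, so the cost grows quickly with the number of idioms. The candidate evaluations inside a round do not depend on one another, which is presumably what makes the search a good fit for a high-throughput system such as Condor.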

  15. Evaluation Data Sets
      • GNU C Compiler
        • Simple, regular function preamble
      • MS Visual Studio
        • High variation in function entry points
      • Intel C Compiler
        • Most variation in entry points; highly optimized

  16. Preliminary Results
      Comparison of three binary analysis tools:
      • Original Dyninst
        • Scans for common entry preamble
      • IDA Pro Disassembler
        • Scans for common entry preamble
        • List of library fingerprints (Windows)
      • Dyninst with model
        • Model replaces entry preamble heuristic

  17. Classifier Comparisons
      [Charts: GCC, MSVS, ICC]
      Model-based Dyninst extensions outperform vanilla Dyninst and IDA Pro.

  18. Model Component Contributions
      [Chart: ICC test set]
      • Structural information improves classifier accuracy
      • Conflict resolution contributes the most

  19. So Far We’ve…
      • Framed stripped binary parsing as a machine learning problem
      • Combined idiom and structural information to consider gap regions as a whole
      • Extended Dyninst with a classifier of function entry points in gaps
      • Obtained significant improvement in parsing stripped binaries over existing tools
      • Shown how the HTC approach makes expensive ML techniques tractable for large-scale systems

  20. Future Work: Extensions
      • We’d like the precision-recall AUC to approach 1. How?
        • More detailed instruction sequence models (e.g., Hidden Markov Model)
        • Additional information sources (e.g., pointer tables)
          • Caveat: this is where IDA Pro often goes wrong
      • Code provenance
        • First task: identify the source compiler (needed to choose the appropriate model)

  21. Future Work: Targets
      • Malicious code
        • Lots of hand-coded assembly
        • Usually packed (see Kevin Roundy’s talk)
      • Obfuscated code
        • Obfuscation/deobfuscation arms race
        • Signal-based obfuscation is the latest salvo
        • Cannot trust control flow (e.g., non-returning calls, branch functions, opaque branches)
        • Maybe model block-level structural properties?

  22. Backup Slides

  23. Tool Performance Comparison
      • Classifier maintains high precision with good recall
      • Model performance is highly system-dependent
      • MS Visual Studio and Intel C Compiler FEPs are highly variable
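Precision and recall here compare the entry points a tool reports against the reference set taken from the unstripped binaries. A minimal sketch of that computation, with hypothetical addresses:

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted entry-point addresses against a
    reference set. Empty sets are treated as perfect by convention here."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(reference) if reference else 1.0
    return precision, recall

# Hypothetical addresses: three of four predictions are real entry points,
# and two real entry points are missed.
predicted = {0x1040, 0x1080, 0x10c0, 0x10f0}
reference = {0x1040, 0x1080, 0x10c0, 0x1100, 0x1140}
precision, recall = precision_recall(predicted, reference)
```

High precision means few reported entry points are spurious. Good recall means few real ones are missed.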