Extracting File Formats from Executables

How to automatically extract output data-format specifications from executables without source code, and how to recover, analyze, and validate the resulting file formats.

Presentation Transcript


  1. Extracting File Formats from Executables
     Junghee Lim, Thomas Reps, and Ben Liblit
     University of Wisconsin-Madison
     13th Working Conference on Reverse Engineering, Oct. 26, 2006
     http://www.cs.wisc.edu/~junghee/WCRE2006.ppt

  2. Data Format (File Format)
     • Goal: automatically extract a specification of a program’s output format
         • E.g., something similar to the file-format specification for gzip
     • FFE (File Format Extractor)
         • Input: an executable without source code or documentation
         • Output: a representation of the output data format (e.g., a regular expression)
     [Figure: recovered gzip structure, drawn as nodes annotated with size and value. Annotations shown: size: 1 value: 0x1F; size: 1 value: 0x8B; size: 1 value: 0x08; size: 1 value: Top; size: 4 value: Top; further nodes with size 1, 4, or Top and value Top; two * markers.]

  3. Gzip specification vs. our structure
     [Figure: the gzip file-format specification shown next to the recovered structure, whose nodes carry size/value annotations such as size: 1 value: 0x1F, size: 1 value: 0x8B, size: 1 value: 0x08, and size: 4 value: Top.]

  4. Usage Scenarios
     • Reuse components of a tool chain
         • COTS (Commercial Off-The-Shelf) products
     • Detect malware
         • Recover output format (= network-communication pattern) from captured malware
         • Detect variants in the wild by detecting network traffic with that pattern
     • Characterize what a program computes/creates
     • Find inconsistencies between specifications and implementations

  5. Programming Styles
     • 1: Individual writes, e.g. gzip, compress95, png2ico
     • 2: Bulk writes, e.g. tar, cpio

  6. What are the Steps?
     • Disassemble executable
     • Recover
         • Interprocedural CFG
         • Variables (and their sizes)
         • Possible values of variables
     • Construct Hierarchical Finite-State Machine (HFSM)
     • Annotate HFSM with size/value information
     • [Construct regular expression]
         • Perform in-line expansion
     • [Validation]
         • Regular exp. → flex spec. → recognizer
         • Examples → recognizer → success/failure
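     The validation step can be sketched in a few lines. The slides generate a flex recognizer; the sketch below swaps in Python's `re` module instead, purely for illustration. The names `TOP`, `GZIP_PREFIX`, `field_to_regex`, `build_recognizer`, and `check` are hypothetical, and the field order (0x1F, 0x8B, 0x08, one unknown byte, four unknown bytes) is an assumption based on the constants shown for gzip on slide 2 and on the gzip header layout.

```python
import gzip
import re

TOP = None  # hypothetical marker for an unknown ("Top") value

# (size in bytes, value) symbols; the field order is an assumption.
GZIP_PREFIX = [(1, 0x1F), (1, 0x8B), (1, 0x08), (1, TOP), (4, TOP)]


def field_to_regex(size, value):
    """Translate one size/value symbol into a bytes-regex fragment."""
    if value is TOP:
        # Unknown value: accept any `size` bytes.
        return b"[\\x00-\\xff]{%d}" % size
    # Known value: match its little-endian encoding exactly.
    return re.escape(value.to_bytes(size, "little"))


def build_recognizer(fields):
    """Concatenate the fragments into one compiled recognizer."""
    return re.compile(b"".join(field_to_regex(s, v) for s, v in fields))


recognizer = build_recognizer(GZIP_PREFIX)


def check(example):
    """Run one example through the recognizer: success or failure."""
    return "success" if recognizer.match(example) else "failure"


print(check(gzip.compress(b"hello")))
print(check(b"PK\x03\x04 not gzip"))
```

     Because each symbol carries its own size, one "letter" of the alphabet may cover one byte or four, which is the varying-sized-alphabet idea the deck mentions. This sketch only checks a prefix of the output, not the whole file.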

  7. What are the Steps? (outline as on slide 6)

  8. What are the Steps? (outline as on slide 6)
     [Figure: FSM vs. HFSM. The HFSM consists of three FSMs, foo, bar, and baz, connected by "call bar" and "call baz" edges over nodes 1 to 10. The flat FSM shows the same nodes with the callee nodes duplicated (5’ to 10’).]

  9. What are the Steps? (outline as on slide 6)

  10. What are the Steps? (outline as on slide 6)

  11. What are the Steps? (outline as on slide 6)
     Callout: well-known concepts from formal-language theory, but we use varying-sized alphabet symbols.

  12. What are the Steps? (outline as on slide 6)

  13. Example code (programming style 1: individual writes)
     [Figure: the example source code is compiled into an executable (shown as a bit string), which is then disassembled.]

  14. The disassembled code for our example

     401120 sub_401120 proc near ; type
     401120   push ebp
     401121   mov ebp, esp
     401123   sub esp, 0Ch
     401126   mov eax, [ebp-4]
     401129   mov [ebp-8], eax
     40112C   cmp [ebp-8], 0
     401130   jz short loc_40113A
     401132   cmp [ebp-8], 1
     401136   jz short loc_401147
     401138   jmp short loc_401152
     40113A loc_40113A:
     40113A   mov eax, [ebp-4]
     40113D   mov [esp], eax
     401140   call sub_401050
     401145   jmp short loc_401152
     401147 loc_401147:
     401147   mov eax, [ebp-4]
     40114A   mov [esp], eax
     40114D   call sub_401050
     401152 loc_401152:
     401152   leave
     401153   retn

     401154 sub_401154 proc near ; chksum
     401154   push ebp
     401155   mov ebp, esp
     401157   sub esp, 8
     40115A   mov eax, [ebp-4]
     40115D   mov [esp], eax
     401160   call sub_401075
     401165   leave
     401166   retn

     401167 sub_401167 proc near ; fill_data
     401167   push ebp
     401168   mov ebp, esp
     40116A   sub esp, 8
     40116D loc_40116D:
     40116D   cmp [ebp-1], 0
     401171   jz short loc_401181
     401173   movsx eax, [ebp-1]
     401177   mov [esp], eax
     40117A   call sub_401050
     40117F   jmp short loc_40116D
     401181 loc_401181:
     401181   leave
     401182   retn

     401183 sub_401183 proc near ; main
     401183   push ebp
     401184   mov ebp, esp
     401186   sub esp, 28h
     401189   and esp, 0FFFFFFF0h
     40118C   mov eax, 0
     401191   add eax, 0Fh
     401194   add eax, 0Fh
     401197   shr eax, 4
     40119A   shl eax, 4
     40119D   mov [ebp-14h], eax
     4011A0   mov eax, [ebp-14h]
     4011A3   call sub_401200
     4011A8   call __main
     4011AD   mov eax, [ebp-10h]
     4011B0   mov [esp], eax
     4011B3   call sub_401075
     4011B8   mov eax, [ebp-0Ch]
     4011BB   mov [esp], eax
     4011BE   call sub_401075
     4011C3   mov [esp+4], 4
     4011CB   mov eax, [ebp-8]
     4011CE   mov [esp], eax
     4011D1   call sub_4010E4
     4011D6   call sub_401120
     4011DB   call sub_401167
     4011E0   mov eax, [ebp-4]
     4011E3   mov [esp], eax
     4011E6   call sub_401075
     4011EB   call sub_401154
     4011F0   mov eax, 0
     4011F5   leave
     4011F6   retn

     User-supplied information: the output functions, each of which is
     • a library function, or
     • a wrapped library function

     Output functions
     sub_401050 (put_byte) : void put_byte(char c);
     sub_401075 (put_long) : void put_long(int n);
     sub_4010E4 (writes) : void writes(char* str, int size);

     Output operations
     401140, 40114D, 401160, 40117A, 4011B3, 4011BE, 4011D1, 4011E6

  15. HFSM for our example
     main:
         4011B3 call sub_401075 (put_long)
         4011BE call sub_401075 (put_long)
         4011D1 call sub_4010E4 (writes)
         4011D6 call sub_401120 (type)
         4011DB call sub_401167 (fill_data)
         4011E6 call sub_401075 (put_long)
         4011EB call sub_401154 (chksum)
     type:
         401140 call sub_401050 (put_byte)
         40114D call sub_401050 (put_byte)
     fill_data:
         40117A call sub_401050 (put_byte)
     chksum:
         401160 call sub_401075 (put_long)

  16. HFSM for gzip
     • 12 FSMs
     • 64 nodes
     • 36 call-sites
     [Figure: the HFSM for gzip. Nodes include FSM entries (403d20_ENTRY, 404145_ENTRY, 404366_ENTRY, 404f0e_ENTRY, 40510c_ENTRY, 4051b4_ENTRY, 4057a5_ENTRY, 4059c8_ENTRY, 408281_ENTRY), intermediate nodes (e.g. 403d62, 403d6e, 403e50, 4057be), and call sites (call 4056df, call 4051b4, call 4054e6, call 40510c, call 404f0e, call 404145, call 404366, call 4057a5, call 4057f2).]

  17. A fragment of the call graph of gzip

  18. HFSM for gzip (same figure as slide 16: 12 FSMs, 64 nodes, 36 call-sites)

  19. Regular Expression for gzip
     If the HFSM is too complicated and there is no recursion, in-line expand it to create a regular expression.
     [Figure: the gzip regular expression as a chain of nodes annotated with size and value, including size: 1 value: 0x1F, size: 1 value: 0x8B, size: 1 value: 0x08, size: 1 value: Top, and size: 4 value: Top, with two * markers.]
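     In-line expansion of a non-recursive HFSM can be sketched as follows. The tuple encoding (`out`, `call`, `seq`, `alt`, `star`), the `HFSM` dictionary, and the `expand` function are hypothetical. The four FSMs are read off the example disassembly on slide 14: `type` has two branches that each write a byte plus a fall-through branch that writes nothing, and `fill_data` writes a byte inside a loop.

```python
# Hypothetical encoding of the example HFSM as small expression trees:
#   ("out", name)   an output operation
#   ("call", fsm)   a call to another FSM
#   ("seq", [...])  sequence
#   ("alt", [...])  alternatives
#   ("star", x)     zero or more repetitions
HFSM = {
    "main": ("seq", [("out", "put_long"), ("out", "put_long"),
                     ("out", "writes"), ("call", "type"),
                     ("call", "fill_data"), ("out", "put_long"),
                     ("call", "chksum")]),
    "type": ("alt", [("out", "put_byte"), ("out", "put_byte"), ("seq", [])]),
    "fill_data": ("star", ("out", "put_byte")),
    "chksum": ("out", "put_long"),
}


def expand(node, hfsm, active=()):
    """Replace every call by the callee's body; refuse recursive HFSMs."""
    kind = node[0]
    if kind == "out":
        return node[1]
    if kind == "call":
        callee = node[1]
        if callee in active:
            raise ValueError("recursive HFSM, cannot in-line: " + callee)
        return expand(hfsm[callee], hfsm, active + (callee,))
    if kind == "seq":
        parts = [expand(child, hfsm, active) for child in node[1]]
        return " ".join(p for p in parts if p)
    if kind == "alt":
        parts = [expand(child, hfsm, active) or "empty" for child in node[1]]
        unique = list(dict.fromkeys(parts))  # merge identical alternatives
        return "(" + " | ".join(unique) + ")"
    if kind == "star":
        return "(" + expand(node[1], hfsm, active) + ")*"
    raise ValueError("unknown node kind: " + kind)


result = expand(HFSM["main"], HFSM, ("main",))
print(result)
```

     The `active` tuple tracks the FSMs currently being expanded, so a call back into one of them is reported instead of looping forever. That matches the slide's condition that expansion applies only when there is no recursion.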

  20. Organization of CodeSurfer/x86 and the File Format Extractor (FFE/x86)
     [Figure: Executable → IDA Pro (disassemble executable, build CFGs) → Connector (VSA*, ASI*) → CodeSurfer back-end. FFE/x86 augments an HFSM with VSA and ASI information.]
     * VSA (Value Set Analysis): a combined numeric-analysis and pointer-analysis algorithm that determines an over-approximation of the set of numeric values and addresses that each abstract memory location holds at each program point. (G. Balakrishnan and T. Reps, “Analyzing memory accesses in x86 executables”, CC 2004)
     * ASI (Aggregate Structure Identification): a unification-based, flow-insensitive algorithm to identify a program’s arrays and structs. (G. Ramalingam et al., “Aggregate structure identification and its application to program analysis”, POPL 1999) (G. Balakrishnan and T. Reps, “Recovery of variables and heap structure in x86 executables”, TR-1533, Comp. Sci. Dept., UW-Madison, 2005)

  21. Value Set Analysis (VSA)
     Output function:
         void put_long(int n) {
             put_short(n&0xffff);
             put_short((ulong)n >> 16);
         }
     Output operation:
         push 12345678h
         call put_long
     Output function:
         void writes(char* c, uint len) {
             for(int i=0; i<len; i++) {
                 outbuf[outcnt++]=(uchar)(c[i]);
                 if(outcnt==OUTBUFSIZE) flush_outbuf();
             }
         }
     [Figure: stack diagram with esp marking the top of the stack.]

  22. Value Set Analysis (VSA)
     put_long (code as on slide 21), output operation: push 12345678h; call put_long
     [Figure: the stack at esp holds the bytes 12h, 34h, 56h, 78h, annotated size:4.]
     LookupVSA(esp-4x8, 4)=12345678h
     writes (code as on slide 21), output operation: mov ebx, 1000 ... push 4; push ebx; call writes
     [Figure: memory cells labeled 1000 (...), 1001 a, 1002 b, 1003 c, 1004 d, and the stack with esp.]

  23. Value Set Analysis (VSA)
     put_long (code as on slide 21), output operation: push 12345678h; call put_long
     [Figure: the stack at esp holds the bytes 12h, 34h, 56h, 78h, annotated size:4.]
     LookupVSA(esp-4x8, 4)=12345678h
     writes (code as on slide 21), output operation: mov ebx, 1000 ... push 4; push ebx; call writes
     [Figure: memory cells 1000 (...), 1001 a, 1002 b, 1003 c, 1004 d; the stack now shows the pushed value 4, annotated size:4.]

  24. Value Set Analysis (VSA)
     put_long (code as on slide 21), output operation: push 12345678h; call put_long
     [Figure: the stack at esp holds the bytes 12h, 34h, 56h, 78h, annotated size:4.]
     LookupVSA(esp-4x8, 4)=12345678h
     writes (code as on slide 21), output operation: mov ebx, 1000 ... push 4; push ebx; call writes
     [Figure: memory cells 1000 (...), 1001 a, 1002 b, 1003 c, 1004 d; the stack shows the pushed values 4 and 1000, annotated size:4.]

  25. Value Set Analysis (VSA)
     put_long (code as on slide 21), output operation: push 12345678h; call put_long
     [Figure: the stack at esp holds the bytes 12h, 34h, 56h, 78h, annotated size:4.]
     LookupVSA(esp-4x8, 4)=12345678h
     writes (code as on slide 21), output operation: mov ebx, 1000 ... push 4; push ebx; call writes
     [Figure: memory cells 1000 (...), 1001 a, 1002 b, 1003 c, 1004 d; the stack shows the pushed values 4 and 1000, annotated size:4.]
     LookupVSA(*(esp-4*8))=“abcd”
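     To make the size/value annotation concrete, here is a small simulation of the `put_long` output operation from these slides. It is a sketch, not the analysis itself: VSA works statically on the executable, whereas this code simply runs the function. The byte order inside `put_short` (low byte first) and the `out` buffer are assumptions, since the slides show only `put_long`.

```python
import struct

out = bytearray()  # stands in for the program's output buffer (assumption)


def put_byte(b):
    """Append one byte to the output."""
    out.append(b & 0xFF)


def put_short(w):
    # Assumption: low byte first, then high byte.
    put_byte(w & 0xFF)
    put_byte((w >> 8) & 0xFF)


def put_long(n):
    # Mirrors the slide: low 16 bits first, then high 16 bits.
    put_short(n & 0xFFFF)
    put_short((n & 0xFFFFFFFF) >> 16)


put_long(0x12345678)   # corresponds to: push 12345678h / call put_long
symbol = {"size": len(out), "value": int.from_bytes(out, "little")}
print(out.hex(), symbol)
```

     Under these assumptions the call emits four bytes, so the HFSM edge for this output operation would be annotated with size 4 and value 12345678h, the same value that `LookupVSA` reports for the pushed argument.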

  26. Before / After
     [Figure: the same output structure shown twice. Before: each node carries a size (1, 2, 4, or Top) and an unknown value (?). After: each ? is replaced by a concrete value (e.g. 0, 1, 40) or by Top. Several edges are marked *.]

  27. Aggregate Structure Identification (ASI)
     [Figure: ASI output for a code fragment ending in "... [14] call sendto".]

  28. Experiments
     • gzip
         • GNU data-compression program
     • png2ico
         • converts PNG files to Windows icon-resource files
     • ping
         • sends ICMP ECHO_REQUEST packets to a host to see if the host is reachable via the network

  29. gzip
     [Figure: recovered gzip structure with size/value annotations, including size: 1 value: 0x1F, size: 1 value: 0x8B, size: 1 value: 0x08, size: 1 value: Top, and size: 4 value: Top, with two * markers.]

  30. png2ico (1)
     • Usage scenario
         • Find inconsistencies between specifications and implementations

  31. png2ico (2)
     [Figure: recovered png2ico structure. Node annotations include size: 2 value: 0; size: 2 value: 1; size: 2 value: Top; size: 1 value: Top; size: 1 value: 0; size: 4 value: Top; size: 4 value: 40; size: 4 value: 0; size: Top value: Top. Several edges are marked *.]

  32. png2ico (2)
     [Figure: the recovered png2ico structure from slide 31, with one node flagged "bug?".]

  33. png2ico (3)
     • We found an inconsistency between the file-format specification for Windows icons and the converter png2ico
         • png2ico regular exp. → flex spec. → recognizer
         • Windows icon files → recognizer → failure!

  34. png2ico (4)
     writeWord(outfile,0); // wPlanes
     [Figure: the recovered png2ico structure from slide 31, shown alongside this source line.]

  35. ping (1)
     The HFSM gives a hint about the behavior of ping.
     [Figure: HFSM for ping. The FSMs for main and catcher each have entry and exit nodes and contain repeated calls to pinger; some edges are marked * or ?.]

  36. ping (2)
     typedef struct icmp {
         uint8 icmp_type; /* type of message, see below */
         uint8 icmp_code; /* type sub code */
         uint16 icmp_checksum; /* ones complement cksum of struct */
     #define icmp_cksum icmp_checksum
         union {
             uint8 ih_pptr; /* ICMP_PARAMPROB */
             struct in_addr ih_gwaddr; /* ICMP_REDIRECT */
             struct ih_idseq {
                 uint16 icd_id;
                 uint16 icd_seq;
             } ih_idseq;
             int ih_void;
             /* ICMP_UNREACH_NEEDFRAG – Path MTU Discovery (RFC1191) */
             struct ih_pmtu {
                 uint16 ipm_void;
                 uint16 ipm_nextmtu;
             } ih_pmtu;
             struct ih_rtradv {
                 uint8 irt_num_addrs;
                 uint8 irt_wpa;
                 uint16 irt_lifetime;
             } ih_rtradv;
         } icmp_hun;
     #define icmp_pptr icmp_hun.ih_pptr
         ...
         union {
             struct id_ts {
                 uint32 its_otime;
                 uint32 its_rtime;
                 uint32 its_ttime;
             } id_ts;
             struct id_ip {
                 struct ip idi_ip;
                 /* options and then 64 bits of data */
             } id_ip;
             struct icmp_ra_addr id_radv;
             uint32 id_mask;
             char id_data[1];
         } icmp_dun;
     #define icmp_otime icmp_dun.id_ts.its_otime
         ...
     } icmp_t;
     Recovered symbols: size: 1 value: Top; size: 1 value: Top; size: 2 value: Top; size: 2 value: Top; size: 2 value: Top
     [Figure: the pinger part of the ping HFSM, with edges marked * and ?.]
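     The five recovered symbols (sizes 1, 1, 2, 2, 2) line up with the leading fields of the struct when the `ih_idseq` member of the union is in use. The sketch below checks that correspondence on a hand-built header. Pairing each symbol with a field name is an assumption made for illustration, and `RECOVERED`, `ECHO_HEADER`, and `split_fields` are hypothetical names.

```python
import struct

# Field names come from the struct on the slide; sizes from the recovered
# symbols. Pairing them in this order is an assumption.
RECOVERED = [("icmp_type", 1), ("icmp_code", 1), ("icmp_checksum", 2),
             ("icd_id", 2), ("icd_seq", 2)]

# Echo-request header in network byte order: two bytes, then three 16-bit words.
ECHO_HEADER = struct.Struct("!BBHHH")


def split_fields(packet):
    """Cut a packet into consecutive fields using the recovered sizes."""
    fields, offset = {}, 0
    for name, size in RECOVERED:
        fields[name] = packet[offset:offset + size]
        offset += size
    return fields


# Type 8 is ICMP echo request; the checksum is left as 0 in this sketch.
header = ECHO_HEADER.pack(8, 0, 0, 0x1234, 1)
fields = split_fields(header)
print(sum(size for _, size in RECOVERED), ECHO_HEADER.size)
print(fields)
```

     All five values are Top in the recovered format, so only the sizes are constrained. Any 8-byte prefix would be accepted, which is consistent with the format being an over-approximation.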

  37. Conclusion
     • A technique for extracting an over-approximation of a program’s output data format, including
         • a way to extract a preliminary structure for the output data format
         • a way to elaborate the structure by annotating it with information about possible output values and sizes

  38. Over-Approximation?
     • Yes, modulo . . .
         • All operations must append to the output
         • No tracking of file-pointer rewind, seek, . . .
         • Multiple different formats in a program
         • Signals and exceptions ignored
         • In principle, could use the same technique used in the MOPS tool

  39. Possible Future Work
     • Automatic detection of output functions
     • Other operation sequences → other formats
         • Input operations
         • Network-communication operations
     • Adoption of a learning technique for refining output formats

  40. Thank you! Clarifications?

  42. Identifying Output Operations
     • IDA Pro disassembler identifies library output procedures
     • Typically, inspect the call graph to choose which application procedures should be considered output wrappers