SRE Basics SRE Basics 1
In this Section… • We briefly cover the following topics • Assembly code • Virtual machine/Java bytecode • Windows PE file format SRE Basics 2
Assembly Code SRE Basics 3
High Level Languages • First, high level languages… • Ancient high level languages • Basic --- little structure • FORTRAN --- limited structure • C --- “structured” language • C was designed to deal with complexity • OO languages take this one step further • Above languages considered primitive today SRE Basics 4
High Level Languages • Object oriented (OO) languages • “Object” groups code and data together • Considered the best way to handle complexity (at least for now…) • Important OO ideas include • Encapsulation, inheritance, polymorphism SRE Basics 5
High Level Languages • Program must deal with code and data • Data • Variables, data structures, files, etc. • Code • Reverser must study control flow • Conditionals, switches, loops, etc. SRE Basics 6
High Level Languages • High level languages --- different users want different things • Goes back (at least) to C vs FORTRAN • Today, major tradeoff is between simplicity and flexibility • Simplicity --- easy to write short program to do exactly what you want (e.g., C) • Flexibility --- language has it all (e.g., Java) SRE Basics 7
High Level Languages • Some languages compiled into native code • exe is specific to the hardware • C, C++, FORTRAN, etc. • Other languages “compiled” into “code”, which is interpreted by a virtual machine • Java, C# • Often possible to make compiled version • For reverser, this distinction is far more important than OO or not SRE Basics 8
Intro to Assembly • At the lowest level, machine binary • Assembly code lives between binary and high level languages • When reversing native code, we must deal with assembly code • Why assembly code? • Why not “reverse” binary to, say, C? SRE Basics 9
Intro to Assembly • Reverser would like to deal with high level, but is stuck with low level • Ideally, want to create mental “link” from low level to high level • Easier for code written in C • Harder for OO code, such as C++ • Why? SRE Basics 10
Intro to Assembly • Perhaps biggest difference at assembly level is dealing with data • High level languages hide lots and lots of details on data manipulations • For example, loading and storing • Also, low level instructions are primitive • Each instruction does not do very much SRE Basics 11
Intro to Assembly • Consider the following simple C program • Simple, but far higher level than assembly code
    int multiply(int x, int y)
    {
        int z;
        z = x * y;
        return z;
    }
SRE Basics 12
Intro to Assembly
    int multiply(int x, int y)
    {
        int z;
        z = x * y;
        return z;
    }
• In assembly code…
1. Store state before entering function
2. Allocate memory for z
3. Load x and y into registers
4. Multiply x by y and store result in register
5. Copy result back to memory for z (optional)
6. Restore state that was stored in 1.
7. Return z
SRE Basics 13
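The seven steps above can be traced with a toy model. This is only a sketch in Python, not real IA-32 semantics: the `regs` dictionary, the `stack` list, the placeholder value `0xDEAD` for the saved state, and the function name `multiply_low_level` are all illustrative assumptions.

```python
def multiply_low_level(x, y):
    """Toy trace of the steps a compiled multiply(x, y) might perform."""
    # Hypothetical register file; 0xDEAD stands in for the caller's saved state.
    regs = {"eax": 0, "ebx": 0, "ebp": 0xDEAD}
    stack = []

    stack.append(regs["ebp"])       # 1. store state before entering the function
    stack.append(0)                 # 2. allocate memory for z on the stack
    regs["eax"] = x                 # 3. load x and y into registers
    regs["ebx"] = y
    regs["eax"] = (regs["eax"] * regs["ebx"]) & 0xFFFFFFFF  # 4. multiply, result stays in a register
    stack[-1] = regs["eax"]         # 5. copy result back to memory for z (optional)
    stack.pop()                     #    release the storage for z
    regs["ebp"] = stack.pop()       # 6. restore the state stored in step 1
    return regs["eax"]              # 7. return z (the result is left in EAX)
```

Note how one line of C (`z = x * y;`) expands into separate load, multiply, and store steps, which is the point the slide makes about low-level data handling.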
Intro to Assembly • Why are things so complicated at low level? • It’s all about efficiency! • Reading memory and storing are slow • No single asm instruction to read memory, operate on it, and store result • But this is common in high level languages SRE Basics 14
Intro to Assembly • Registers --- “local” processor memory • So don’t have to read and write RAM • Stack --- “scratch paper” (in RAM) • Holds register values, local variables, function parameters and return values • E.g., storage for “z” in multiply example • Heap --- dynamic, variable-sized data • Data section --- e.g., string constants • Control flow --- high level “if” or “while” are much more complex at low level SRE Basics 15
Registers • Registers used in most instructions • Specifics here deal with “IA-32” • Intel Architecture, 32-bit • Used in “Wintel” machines • We use Intel notation • AT&T notation also exists • Eight 32-bit registers (next slide) • All 8 start with “E” • Also several system registers SRE Basics 16
Registers • EAX, EBX, EDX --- generic, used for int, Boolean, …, memory operations • ECX --- generic, used as counter • ESI/EDI --- generic, source/destination pointers when copying memory • SI == source index, DI == destination index • EBP --- generic, stack “base” pointer • Usually, stack position after return address • ESP --- stack pointer • Current stack frame lies between ESP and EBP SRE Basics 17
Flags • EFLAGS --- special register • Status flags updated by various operations to “record” outcomes • System flags too, but we don’t care about them • Flags are basic tool for conditionals • For example, a TEST followed by a jump instruction • TEST sets various flags, jump determines action to take, based on those flags SRE Basics 18
Instruction Format • Most instructions consist of… • Opcode --- the “instruction” • One or two operands --- “parameter(s)” • Operands (parameters) are data • Operands come in 3 flavors • Register name --- for example, EAX • Immediate --- e.g., hard-coded constant • Memory address --- enclosed in [brackets] SRE Basics 19
Operand Examples • EAX • Read from (or write to) EAX register, depending on opcode • 0x30004040 • Immediate --- number is embedded in code • Usually a constant in high-level code • [0x4000349e] • This is a memory address • Could be a global variable in high level code SRE Basics 20
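The three operand flavors can be told apart mechanically. The sketch below is a rough Python classifier for the simple cases shown on the slide; the function name `classify_operand` is hypothetical, and it only knows the eight general-purpose register names.

```python
import re

# The eight 32-bit general-purpose registers named in these slides.
REGISTERS = {"EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP"}


def classify_operand(text):
    """Return 'register', 'immediate', or 'memory' for a simple Intel-syntax operand."""
    op = text.strip()
    if op.startswith("[") and op.endswith("]"):
        return "memory"                      # anything in [brackets] is a memory reference
    if op.upper() in REGISTERS:
        return "register"
    if re.fullmatch(r"0x[0-9a-fA-F]+|\d+", op):
        return "immediate"                   # hex or decimal constant embedded in the code
    raise ValueError(f"unrecognized operand: {text!r}")
```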
Basic Instructions • We cover a few common instructions • First we give general format • Later, we give a few simple examples • There are lots of assembly instructions • But, most assembly code uses only a few • About 14 assembly instructions account for more than 90% of all code SRE Basics 21
Opcode Counts • Typical opcode counts, “normal” code SRE Basics 22
Opcode Counts • Opcode counts, typical virus code SRE Basics 23
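Opcode counts like those discussed above are easy to compute from a disassembly listing. The following Python sketch assumes one instruction per line with the opcode as the first token; `opcode_histogram` and `coverage` are illustrative names, not part of any tool.

```python
from collections import Counter


def opcode_histogram(listing):
    """Count opcodes in a disassembly listing with one instruction per line."""
    opcodes = [line.split()[0].lower() for line in listing.splitlines() if line.strip()]
    return Counter(opcodes)


def coverage(hist, top_n):
    """Fraction of all instructions accounted for by the top_n most common opcodes."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in hist.most_common(top_n)) / total
```

On a large listing, `coverage(hist, 14)` gives the share of code covered by the 14 most frequent opcodes, which can be compared against the "more than 90%" figure quoted earlier.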
Instructions • We consider following operations • Moving data • Arithmetic • Comparisons • Conditional branches • Function calls SRE Basics 24
Moving Data • MOV is the most popular opcode • 2 operands, destination and source: • MOV DestOperand, SourceOperand • Note the order • Destination first, source second SRE Basics 25
Arithmetic • Six integer arithmetic operations • ADD, SUB, MUL, DIV, IMUL, IDIV • Many variations based on operands • ADD Op1, Op2 ; add, store result in Op1 • SUB Op1, Op2 ; sub Op2 from Op1 ---> Op1 • MUL Op ; mul Op by EAX ---> EDX:EAX • DIV Op ; div EDX:EAX by Op, quotient ---> EAX, remainder ---> EDX • IMUL, IDIV --- like MUL and DIV, but signed SRE Basics 26
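The EDX:EAX register pair used by MUL and DIV is the least obvious part of this slide. Here is a small Python sketch of the unsigned 32-bit forms; the function names `mul32` and `div32` are assumptions for illustration, and flags are not modeled.

```python
MASK32 = 0xFFFFFFFF


def mul32(eax, op):
    """Unsigned MUL: EDX:EAX = EAX * op. Returns (edx, eax)."""
    product = (eax & MASK32) * (op & MASK32)
    return (product >> 32) & MASK32, product & MASK32   # high half -> EDX, low half -> EAX


def div32(edx, eax, op):
    """Unsigned DIV: divide EDX:EAX by op. Returns (quotient for EAX, remainder for EDX)."""
    op &= MASK32
    if op == 0:
        raise ZeroDivisionError("DIV by zero raises a divide error on real hardware")
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    quotient, remainder = divmod(dividend, op)
    if quotient > MASK32:
        raise OverflowError("quotient does not fit in EAX")
    return quotient, remainder
```

For example, multiplying 0xFFFFFFFF by 2 overflows 32 bits, so the high bit lands in EDX; dividing that EDX:EAX pair by 2 recovers the original value.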
Comparisons • CMP opcode has 2 operands • CMP Operand1, Operand2 • Subtracts Operand2 from Operand1 • Result “stored” in flag bits • If 0 then ZF flag is set • Other flags can be used to tell which is greater, depending on signed or unsigned SRE Basics 27
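How the "other flags" distinguish signed from unsigned comparisons can be sketched in Python. This is a simplified model of a 32-bit CMP covering only ZF, CF, SF, and OF; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def _sign(value):
    """Top bit of a 32-bit value."""
    return (value >> 31) & 1


def cmp_flags(a, b):
    """Model CMP a, b: compute a - b, keep only the flags."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "ZF": int(result == 0),                                      # operands equal
        "CF": int(a < b),                                            # borrow: a below b as unsigned
        "SF": _sign(result),                                         # sign of the result
        "OF": int(_sign(a) != _sign(b) and _sign(result) != _sign(a)),  # signed overflow
    }


def unsigned_below(flags):
    """Condition tested by JB: CF set."""
    return flags["CF"] == 1


def signed_less(flags):
    """Condition tested by JL: SF differs from OF."""
    return flags["SF"] != flags["OF"]
```

With a = 0xFFFFFFFF and b = 1, the unsigned view says a is larger, while the signed view (a is -1) says a is smaller; the two helper functions give those two different answers from the same flags.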
Conditional Branches • Conditional branches use “Jcc” family of instructions (je, jne, jz, jnz, etc.) • Format is • Jcc TargetAddress • If Jcc true, goto TargetAddress • Otherwise, what happens? SRE Basics 28
Function Calls • Use CALL and RET • CALL FunctionAddress …… • RET ; pops return address • RET can be told to increment ESP • Need to reset stack pointer • Why? SRE Basics 29
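One way to see why RET may need to adjust ESP is a toy stack model. The Python sketch below is an illustration under assumptions: the class `Stack32`, the starting ESP of 0x1000, and the addresses used are all made up, and only the case where the called function removes its own arguments is shown.

```python
class Stack32:
    """Toy downward-growing stack: ESP decreases on push and increases on pop."""

    def __init__(self, esp=0x1000):
        self.esp = esp
        self.mem = {}

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value

    def pop(self):
        value = self.mem[self.esp]
        self.esp += 4
        return value


def call(stack, return_address):
    """CALL pushes the address of the instruction that follows it."""
    stack.push(return_address)


def ret(stack, extra_bytes=0):
    """RET pops the return address; RET n also adds n to ESP to discard arguments."""
    eip = stack.pop()
    stack.esp += extra_bytes
    return eip
```

If two 4-byte arguments were pushed before the CALL, `ret(stack, 8)` leaves ESP where it was before the pushes. Without that adjustment the arguments would stay on the stack, so some code must reset the stack pointer.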
Examples
    cmp ebx,0xf020
    jnz 10026509
• What does this do? • Compares value in EBX with constant • Jumps to specified address if operands are not the same • Note: JNE and JNZ are the same instruction SRE Basics 30
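The control flow of this two-instruction example can be written as a single Python function. This is a sketch only: `next_eip` is a hypothetical name, and `fallthrough` stands for the address of whatever instruction follows the `jnz`, which the slide does not show.

```python
def next_eip(ebx, fallthrough, target=0x10026509):
    """Model `cmp ebx,0xf020` then `jnz target`: return the address executed next."""
    zf = ((ebx - 0xF020) & 0xFFFFFFFF) == 0    # CMP sets ZF when the operands are equal
    return fallthrough if zf else target        # JNZ jumps only when ZF is clear
```

When the jump condition is false, execution simply continues with the next instruction in memory, which is what the `fallthrough` argument represents.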
Examples
    mov edi,[ecx+0x5b0]
    mov ebx,[ecx+0x5b4]
    imul edi,ebx
• What does this do? • First, add 0x5b0 to ECX register, get value at that memory and put in EDI • Next, add 0x5b4 to ECX, get value at that memory and put in EBX • Note that ECX points to some data structure • Finally, EDI = EDI * EBX • Note there are different forms of IMUL SRE Basics 31
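The same three instructions can be mimicked in Python against a byte buffer. This sketch assumes the two fields are little-endian signed 32-bit integers and that `memory` is a flat buffer indexed by address; `field_product` is an illustrative name.

```python
import struct


def field_product(memory, ecx):
    """Model mov edi,[ecx+0x5b0] / mov ebx,[ecx+0x5b4] / imul edi,ebx."""
    edi, = struct.unpack_from("<i", memory, ecx + 0x5B0)   # load first field
    ebx, = struct.unpack_from("<i", memory, ecx + 0x5B4)   # load adjacent field
    product = (edi * ebx) & 0xFFFFFFFF                      # two-operand IMUL keeps the low 32 bits
    # Reinterpret the 32-bit pattern as a signed value.
    return product - (1 << 32) if product & 0x80000000 else product
```

In high-level terms this looks like `p->a * p->b` for two neighboring integer members of a structure, which matches the slide's note that ECX points to some data structure.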
Examples
    push eax
    push edi
    push ebx
    push esi
    push dword ptr [esp+0x24]
    call 0x10026eeb
• What does this do? • PUSH four register values • PUSH something related to stack ptr • Probably, parameter or local variable • Would need to look at more code to decide • Note “dword ptr” is effectively a cast • CALL a function SRE Basics 32
Examples
    mov eax, dword ptr [ebp - 0x20]
    shl eax, 4
    mov ecx, dword ptr [ebp - 0x24]
    cmp dword ptr [eax+ecx+4], 0
    call 0x10026eeb
• What does this do? • Maybe “data structure in an array” • In the CMP line • ECX --- gets base pointer • EAX --- current offset into the array • Add 4 to get specific member of structure SRE Basics 33
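The address arithmetic in this example can be spelled out in Python. The sketch assumes the "array of structures" reading from the slide: the local at `[ebp - 0x20]` is taken to be an index, the local at `[ebp - 0x24]` the array base, and `member_address` is a hypothetical name.

```python
def member_address(base, index, field_offset=4):
    """Address compared by `cmp dword ptr [eax+ecx+4], 0` after `shl eax, 4`."""
    eax = (index << 4) & 0xFFFFFFFF     # shl eax, 4: index * 16, so each element is 16 bytes
    ecx = base & 0xFFFFFFFF             # base pointer of the array
    return (eax + ecx + field_offset) & 0xFFFFFFFF
```

In C-like terms this would correspond to something like `array[index].member == 0`, where the structure is 16 bytes and the member sits at offset 4.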
Examples • AT&T syntax
    pushl $14
    pushl $helloWorld
    pushl $1
    movl $4, %eax
    pushl %eax
    int $0x80
    addl $16, %esp
    pushl $0
    movl $1, %eax
    pushl %eax
    int $0x80
SRE Basics 34
Compilation • Converts high level representation of code to binary • Front end --- lexical analysis • Verify syntax, etc. • Intermediate representation • Optimization • Improve structure, eliminate redundancy, … SRE Basics 35
Compilation • Back end --- generates the actual code • Instruction selection • Register allocation • Instruction scheduling --- pipelining, parallelism • Back end process might make disassembly hard to read • Optimization too • Each compiler has its own quirks • Can you automatically determine compiler? SRE Basics 36
Virtual Machines & Bytecode SRE Basics 37
Virtual Machines • Some languages instead generate intermediate bytecode • Bytecode runs in a virtual machine • Virtual machine is a program that (historically) interprets bytecode • Translates bytecode for the hardware • Bytecode analogous to assembly code SRE Basics 38
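What "a program that interprets bytecode" means can be shown with a very small stack machine. This Python sketch uses an invented three-instruction set (`PUSH`, `ADD`, `MUL`); it is not Java bytecode, only an illustration of the interpret loop.

```python
def run(bytecode):
    """Interpret a list of (opcode, argument) pairs on an operand stack."""
    stack = []
    for op, arg in bytecode:
        if op == "PUSH":
            stack.append(arg)               # push a constant
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()                      # result is whatever is left on top
```

The same bytecode list runs unchanged wherever this interpreter runs, which is the hardware independence the next slide lists as the main advantage.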
Virtual Machines • Advantages? • Hardware independent • Disadvantages? • Slow • Today, usually just-in-time compilers instead of interpreters • Compile snippets of bytecode into native code as needed SRE Basics 39
Reversing Bytecode • Reversing bytecode is easy • Unless special precautions are taken • Even then, easier than native code • Bytecode usually contains lots of metadata • Possible to reconstruct highly accurate high level language • Bytecode can be obfuscated • In worst case, reverser must learn bytecode • But bytecode is easier than native code SRE Basics 40
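The point about metadata can be seen in any bytecode-based language. The sketch below uses Python's own compiled code objects as an analogy, not Java: the names of the function, its parameters, and its locals all survive compilation. `recovered_names` is an illustrative helper.

```python
def multiply(x, y):
    z = x * y
    return z


def recovered_names(func):
    """Read names straight out of the compiled code object, without the source."""
    code = func.__code__
    return {
        "function": code.co_name,        # function name
        "argcount": code.co_argcount,    # number of positional parameters
        "locals": code.co_varnames,      # parameter names first, then local variables
    }
```

Java class files similarly keep class, method, and field names, which is a large part of why decompiled bytecode reads so much like the original source compared with native code.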
Windows PE Files SRE Basics 41
Windows PE File Format • Designed to be standard executable file format for all versions of OS… • …on all supported processors • Only small changes since PE format was introduced • E.g., support for 64-bit Windows SRE Basics 42
Windows PE Files • Trivia • Q: What’s the difference between exe and dll? • A: Not much --- one bit differs in PE files • Q: What is size of smallest possible PE file? • A: 133 bytes • PE file on disk is a file • Once loaded into memory, it’s a module • File is mapped to module • Address where module begins is HMODULE • PE file may not all be mapped to module SRE Basics 43
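The "one bit" in the trivia question is the `IMAGE_FILE_DLL` flag (0x2000) in the `Characteristics` field of the file header. The Python sketch below checks it on raw bytes; `is_dll` is an illustrative name, and the function does only minimal validation.

```python
import struct

IMAGE_FILE_DLL = 0x2000   # Characteristics flag defined in WINNT.H


def is_dll(pe_bytes):
    """Return True if the PE file header has the IMAGE_FILE_DLL bit set."""
    if pe_bytes[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    e_lfanew, = struct.unpack_from("<I", pe_bytes, 0x3C)    # file offset of the PE header
    if pe_bytes[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # IMAGE_FILE_HEADER follows the 4-byte signature; Characteristics is at offset 18 in it.
    characteristics, = struct.unpack_from("<H", pe_bytes, e_lfanew + 4 + 18)
    return bool(characteristics & IMAGE_FILE_DLL)
```

Reading a real file with `open(path, "rb").read()` and passing the bytes to `is_dll` would work the same way.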
Windows PE Files • WINNT.H is final word on what PE file looks like • Tools to examine PE files • Dumpbin (Visual Studio) • Depends • PEBrowse Professional • In spite of its name, it’s free • PEDUMP (by author of article) SRE Basics 44
PE File Sections • Each section is “chunk of code or data that logically belongs together” • For example, all import tables in one section • Code is in .text section • Code is code, but many types of data • Data examples • Program data (e.g., .rdata for read-only) • API import/export tables • Resources, relocation info, etc. • Can specify section names in C++ source SRE Basics 45
PE File Sections • When mapped, module starts on a page boundary • Linker can be told to merge sections • E.g., to merge .text and .rdata: • /MERGE:.rdata=.text • Some sections commonly merged • Some sections cannot be merged SRE Basics 46
Relative Virtual Addresses • Exe file specifies in-memory addresses • PE file specifies preferred load location • But DLL can actually load just about anywhere • So, PE specifies addresses in a way that is independent of where it loads • No hardcoded addresses in PE • Instead, Relative Virtual Addresses (RVAs) • RVA is an offset relative to where PE is loaded SRE Basics 47
Relative Virtual Addresses • To find actual memory location, add RVA to the actual load address • For example, suppose • Exe file is loaded at 0x400000 • And RVA is 0x1000 • Then code (.text) starts at 0x401000 • In Windows terminology, actual address is known as Virtual Address (VA) SRE Basics 48
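The RVA arithmetic from the example above is a single addition. A minimal Python sketch, with `rva_to_va` and `va_to_rva` as illustrative names:

```python
def rva_to_va(load_address, rva):
    """Virtual Address = actual load address + Relative Virtual Address."""
    return load_address + rva


def va_to_rva(load_address, va):
    """Inverse mapping: recover the RVA from an in-memory address."""
    if va < load_address:
        raise ValueError("address lies below the module's load address")
    return va - load_address
```

Because only the load address changes when a module is relocated, the same RVA stays valid wherever the module ends up in memory.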
Data Directory • There are many data structures within exe • For efficiency, must be loaded quickly • E.g., imports, exports, resources, base relocations, etc. • DataDirectory • Array of 16 data structures • #define IMAGE_DIRECTORY_ENTRY_xxx defines array indexes (0 to 15) SRE Basics 49
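Each of the 16 DataDirectory entries is a pair of 32-bit values, an RVA and a size. The Python sketch below decodes them from 128 raw bytes; the short names follow the `IMAGE_DIRECTORY_ENTRY_xxx` suffixes in WINNT.H, except `RESERVED`, which is a label chosen here for the unused index 15. Locating the array inside the optional header is left out.

```python
import struct

# Suffixes of the IMAGE_DIRECTORY_ENTRY_xxx constants, in index order 0..15.
# "RESERVED" is this sketch's own label for index 15.
DIRECTORY_NAMES = [
    "EXPORT", "IMPORT", "RESOURCE", "EXCEPTION", "SECURITY", "BASERELOC",
    "DEBUG", "ARCHITECTURE", "GLOBALPTR", "TLS", "LOAD_CONFIG",
    "BOUND_IMPORT", "IAT", "DELAY_IMPORT", "COM_DESCRIPTOR", "RESERVED",
]


def parse_data_directory(raw):
    """Decode 16 (RVA, size) entries from the raw bytes of the DataDirectory array."""
    if len(raw) < len(DIRECTORY_NAMES) * 8:
        raise ValueError("need at least 128 bytes")
    entries = {}
    for index, name in enumerate(DIRECTORY_NAMES):
        rva, size = struct.unpack_from("<II", raw, index * 8)   # two little-endian DWORDs
        entries[name] = (rva, size)
    return entries
```

An entry of `(0, 0)` means the corresponding structure is absent from the file.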
Importing Functions • To use code or data from another DLL, must import it • When PE file loads, Windows loader locates imported functions/data • Usually automatic, when program first starts • Imported DLLs may import others • For example, any program created with Visual C++ imports KERNEL32.DLL… • …and KERNEL32.DLL imports from NTDLL.DLL SRE Basics 50
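The chain of imports can be followed as a graph walk. In the Python sketch below, `import_table` is a hand-written dictionary standing in for what a tool would read from each file's import table, and `app.exe` is a hypothetical program name.

```python
def all_imports(module, import_table):
    """Return every DLL reachable from module, in breadth-first order."""
    seen = []
    pending = list(import_table.get(module, []))
    while pending:
        dll = pending.pop(0)
        if dll not in seen:                              # guard against cycles and repeats
            seen.append(dll)
            pending.extend(import_table.get(dll, []))    # follow that DLL's own imports
    return seen
```

For instance, with `{"app.exe": ["KERNEL32.DLL"], "KERNEL32.DLL": ["NTDLL.DLL"]}`, the walk from `app.exe` reaches NTDLL.DLL even though the program never imports it directly.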