1 / 47

Reversing, better

William Whistler REcon 2010. Reversing, better. Introduction. Who am I Long-time reverser Studied Computer Science at Oxford Enjoys reversing challenges T2 2007 winner, etc . Purpose of this presentation To describe the concepts behind a new application for reversing called REvealer.

warren
Download Presentation

Reversing, better

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. William Whistler REcon 2010 Reversing, better

  2. Introduction • Who am I • Long-time reverser • Studied Computer Science at Oxford • Enjoys reversing challenges • T2 2007 winner, etc. • Purpose of this presentation • To describe the concepts behind a new application for reversing called REvealer

  3. Introduction • Let’s model things like ‘real’ engineers • Allows us to ask the model questions • Include everything we discover • Both through manual and automated techniques • Gradually move to higher-level concepts • But still allow drilling down to the exact detail

  4. Dynamic Taint Analysis • {define eax as tainted} • movebx, eax • {tainted: eax, ebx} • add edx, ecx • {tainted: eax, ebx} • moveax, edx • {tainted: ebx} • add edx, ebx • {tainted: ebx, edx} • Step through, instruction by instruction • If the input to something is tainted, the output is tainted

  5. Dynamic Taint Analysis • {define eax as tainted} • movebx, eax • {tainted: eax, ebx} • add edx, ecx • {tainted: eax, ebx} • moveax, edx • {tainted: ebx} • add edx, ebx • {tainted: ebx, edx} • Used by vulnerability researchers to find where data from the network is used • But useful in other reversing contexts too

  6. Dynamic Taint Analysis • Dynamic, so deals with one execution path • No problems with loops/etc • Exact memory addresses are available at use • moveax, dwordptr [esi+ecx] • Just look up esi and ecx to get the exact memory address being read • Some complications, but still very useful

  7. Multiple Taint Sources • {eax tainted red} {ecx tainted green} • movebx, eax • {red: eax, ebx} {green: ecx} • add edx, ecx • {red: eax, ebx} {green: ecx, edx} • moveax, edx • {red: ebx} {green: eax, ecx, edx} • add edx, ebx • {red: ebx, edx} {green: eax, ecx, edx} • More than one taint source • Analysis done simultaneously • Locations can be tainted by more than one source

  8. Multiple Taint Sources • Let’s define every external input to our code as a taint source • CPU timestamp, file reads, network buffers, API results… • Useful to see where values have come from when looking at code for the first time

  9. Static Taint Analysis • We want to deal with more than one particular run through • Ideally all possible! • So we go static instead • But now we have some problems…

  10. Static Taint Analysis • What does it access? • Indirect memory reads/writes • moveax, dwordptr [esi+ecx] • Where does it go? • Conditional jumps • jz • Indirect jumps/calls • jmpeax • Self-modifying code • Loops • Exceptions

  11. Path Explosion Problem 100: test eax, eax 101: jz 103 102: … 103: test ecx, ecx 104: jz 106 105: … 106: test ebx, ebx 107: jz 109 108: … 109: … 100 101 102 103 104 103 104 105 106 107 105 106 107 106 107 106 107

  12. Paths • Recombining paths • “OR” taint status of each location? • Store different possibilities somehow? • Let’s go with the former for now. • Which ones to recombine? • Obvious choice: by EIP?

  13. Paths 100: test eax, eax 101: jz 103 102: … 103: test ecx, ecx 104: jz 106 105: … 106: test ebx, ebx 107: jz 109 108: … 109: … 100 101 102 105 103 104 106 107

  14. Paths • But why be so strict? • Be user-guided, not fully automated • When is keeping paths separate useful? • Self-modifying code • Loop unrolling • Obfuscations • Intrainstruction jumps

  15. Storing results • We want to ask what’s tainted at any point the user chooses, not just termination • Repeating the whole analysis is slow • Let’s store what we learn along the way • Store all tainted sets at each point? • Store differences? • Store effects?

  16. Storing by difference • {eax tainted red} {ecx tainted green} • movebx, eax • {red: +ebx} {green: no change} • add edx, ecx • {red: no change} {green: +edx} • moveax, edx • {red: -eax} {green: +eax} • add edx, ebx • {red: +edx} {green: no change} • Store what changes for each taint source after each instruction • Query at any point by working back up • Can either query for the full set or a subset of locations

  17. Storing by effects movebx, eax • If eax was tainted, ebx is now tainted • If eax was untainted, ebx is now untainted add ebx, eax • If eax was tainted, ebx is now tainted • If eax was untainted, ebxremains as it was • Flags updated in a similar way

  18. Storing by effects • Uses the principle of “lazy evaluation” • If we only do specific queries, this is much more efficient • Allows us to consider intermediary locations as ‘internal’ taint sources • e.g. Ask what’s tainted by a function parameter

  19. Slicing • See what’s involved in calculating something, not just the taint sources used • Plenty of practical uses • Where was this buffer encrypted? • Where was this checksum calculated? • Can even hide the instructions we’re not interested in as a nice visualization

  20. Slicing moveax, 100 movebx, 200 add eax, ebx moveax, 100 movebx, 200 movecx, 300 movedx, 400 add eax, ebx add ecx, edx add eax, ecx eax ecx movecx, 300 movedx, 400 add ecx, edx

  21. Data flow add eax, ecx add eax, ebx mov eax,100 add ecx, edx mov ecx,300 mov ebx,200 mov edx,400 eax ebx ecx edx moveax, 100 movebx, 200 movecx, 300 movedx, 400 add eax, ebx add ecx, edx add eax, ecx ecx eax

  22. Symbolic Execution add eax, ecx add eax, ebx mov eax,100 add ecx, edx mov ecx,300 mov ebx,200 mov edx,400 eax=100 ebx=200 ecx=300 edx=400 moveax, 100 movebx, 200 movecx, 300 movedx, 400 add eax, ebx add ecx, edx add eax, ecx ecx=700 eax=300 eax=1000

  23. Symbolic Execution • Why restrict ourselves to the “route” taken by information? • Emulate instructions when values are known • As well as storing the taint effects of each instruction, store an evaluation function too

  24. Symbolic Execution • Allows us to deal with some (but not all) branches and indirect jumps/calls • Opaque predicates • Similar to certain compiler optimizations (constant folding) • For merged paths, similar choice to before • If the incoming values are different, just forget it • Find something more complicated to keep all possibilities

  25. Abstract Interpretation • Why restrict ourselves to exact values? • More limited information can still be useful • Many possibilities • Even/odd • Or other modular congruencies • Positive/negative • Alphanumeric or not • Ranges

  26. Abstract Interpretation • Instructions can pass non-exact value information • Can be specialized for specific cases (as with the mul here) • {eax totally unknown} • muleax, 2 • {eax is even} • inceax • {eax is odd}

  27. Backwards Abstract Interpretation • Requires “inverse” evaluation functions for instructions • Many will not be possible or only provide limited information • Allows the user to ask “what restrictions must hold in order for a value to match” • Meeting the restrictions does not guarantee that the value matches – not weakest preconditions

  28. Path constraints • Every branch choice depends on values • e.g. a jz will branch one way if the z flag is set, the other if not • We can spread this back up in the same way • And use this information to improve the analysis • Allows path-specific “input crafting” • “What must be true for this code to be reached” • Useful against control-flow obfuscation • Further links with compiler optimizations

  29. More complex solving • So far, no use of constraint solvers or computer algebra systems • Every input was considered “independent” of other calculations • But that’s not always the case • e.g. polynomial equations • So we can use the external systems to help • But we can reduce the complexity of queries by using our analysis information

  30. Abstraction • So far, no functions – entirely interprocedural • No assumptions about code structure used • e.g. stack frames, functions always return, etc. • Haven’t been able to deal with APIs or loops • Simple solution: think of them as instructions

  31. Abstraction: Functions • Bunch existing instructions together • Not just for labelling and visualizations; allows function specific improvements • Perform costly optimizations only once • Allows custom inverse implementations • e.g. MD5 cracking

  32. Abstraction: APIs • Recognize the API based on call address • Can use a symbolic base for modules • Define what we know about behaviour • e.g. Possible return values • External modelling • e.g. For writing/reading pipes or atoms • Can check for anti-debug/emulation too • Overwritten API code • Using undefined OS behaviour

  33. Abstraction: Loops • The loop becomes an instruction with all potentially read locations as inputs and all potentially written locations as outputs • Can be refined further by checking internal constraints • A few cases have easy models • e.g. repeated adds = multiplication • Otherwise, do what we can • Anything too complicated is just defined as tainted by all potential inputs with no evaluation • Again, the user can improve the analysis manually • Unroll iterations, add evaluators, etc.

  34. Abstraction • Gradually forming more expressive units • Becoming increasingly higher-level • More flexible than decompilation • Can always zoom back in to the details • Reflects what’s happening in the analyst’s mind • Figuring out components and how they fit together • Easier to manage and navigate than just by function

  35. Memory • Automated static analysis tools can’t miss bugs • False positives are a necessary evil for them • Whenever there’s ambiguity, they have to assume the worst • We can allow the user to focus on the cases they consider interesting • The ‘normal’ cases for some RE tasks • The ‘edge’ cases for others (e.g. vuln research)

  36. Memory • Considering when accesses overlap • A form of alias analysis mov [eax], 123 mov [ebx], 456 movecx, [eax] ? ? eax ebx

  37. Memory • Pointers to objects don’t just magically appear • Based on results from allocation functions • So compare the taint sources! • Not immune to deliberate obfuscation • e.g. a loop searching memory, like egg hunters in shellcode • … but still useful in many cases

  38. Memory • Sharing a taint source – prioritize equal case esi mov [eax], 123 mov [ebx], 456 movecx, [eax] ? ? eax ebx

  39. Memory • No shared taint source – prioritize unequal case esi edi mov [eax], 123 mov [ebx], 456 movecx, [eax] ? ? eax ebx

  40. Dynamic • Perform runs to collect example values • Can be guided by our analysis • Input values • Which values to collect • Where to intercept • Useful for still-unresolved indirect calls/jumps • Also useful to just display for the user! • Improve memory heuristics • Timing and hit count information also useful • Much like profiling

  41. Virtualizing Packers • Opcode handlers become instructions • Suddenly we have a disassembler and debugger with no additional effort • Can compare routines between VM and native implementation • Could be useful for Java or Flash VMs too • Needs further investigation

  42. No Intermediate Language? • Haven’t mentioned any sort of Intermediate Language • The whole thing can be thought of as an extensible IL • Share defined abstractions between projects • For library code, etc.

  43. Principles • Integrate techniques • Share information as much as possible • Stay interactive • Both for analysis and visualization • Allow flexible abstraction • Reflect what happens in the analyst’s mind

  44. Theoretical extensions • Type recovery • C++ classes • Dealing with new obfuscation methods • White-box cryptography • Functional programming?

  45. Implementation - REvealer • Under heavy development • Release likely in early 2011 • GUI • Useful visualization of all this • Dynamic extensions • Hypervisor based? • Other information sources • Debugging symbols • Extensible… somehow! • Python? • Lua?

  46. Special Thanks • Rolf Rolles • Sean Heelan • Cyril Paciullo • Shawn Zivontsis • Jonathan Kay • SérgioSilva • TareqSaade • Daniel Hauenstein • Jason Geffner

  47. Thank you for listening! • Mailing list: http://revealer.co.uk • E-mail: recon@wtbw.co.uk • Twitter: http://twitter.com/WillWhistler • Other useful links: • http://reddit.com/r/ReverseEngineering • http://ref.x86asm.net

More Related