Simple Substitution Distance and Metamorphic Detection



Presentation Transcript


  1. Simple Substitution Distance and Metamorphic Detection • Gayathri Shanmugam, Richard M. Low, Mark Stamp

  2. The Idea • Metamorphic malware “mutates” with each infection • Measuring software similarity is a possible means of detection • But, how to measure similarity? • Much relevant previous work • Here, a novel distance measure is considered

  3. Simple Substitution Distance • We treat each metamorphic copy as if it is an “encrypted” version of a “base” virus • Where the “cipher” is a simple substitution • Why simple substitution? • Easy to work with, fast algorithm to solve • Why might this work? • Simple substitution “cryptanalysis” tends to yield results that match family statistics • Accounts for file modifications similar to those made by some common metamorphic techniques

  4. Motivation • Given a simple substitution ciphertext where plaintext is English… • If we cryptanalyze using English language statistics, we expect a good score • If we cryptanalyze using, say, French language statistics, we expect a not-so-good score • We can obtain opcode statistics for a metamorphic family • Using simple substitution cryptanalysis, a virus of the same family should score well… • …but, a benign exe should not score as well • Assuming statistics of these families differ

  5. Metamorphic Techniques • Many possible morphing strategies • Here, briefly consider • Register swapping • Garbage code insertion • Equivalent substitution • Transposition • Formal grammar mutation • At a high level: substitution, transposition, insertion, and deletion

  6. Register Swap • Register swapping • E.g., replace EBX register with EAX, provided EAX not in use • Very simple and used in some of the first metamorphic malware • Not very effective • Why not?

  7. Garbage Insertion • Garbage code insertion • Two cases: • Dead code: inserted, but not executed • We can simply JMP over dead code • Do-nothing instructions: executed, but have no effect on program • Like NOP or ADD EAX,0 • Relatively easy to implement • Effective at breaking signature detection

  8. Code Substitution • Equivalent instruction substitution • For example, can replace SUB EAX,EAX with XOR EAX,EAX • Does not need to be a one-for-one substitution • That is, can include insertion/deletion • Unlimited number of substitutions • Very effective • Somewhat difficult to implement

  9. Transposition • Transposition • Reorder instructions that have no dependency • For example, the pair MOV R1,R2 / ADD R3,R4 can be reordered as ADD R3,R4 / MOV R1,R2 • Can be highly effective • But, can be difficult to implement • Sometimes applied only to subroutines

  10. Formal Grammar Mutation • Formal grammar mutation • View morphing engine as a non-deterministic automaton • Allow transitions between any symbols • Apply formal grammar rules • Obtain many variants, high variation • Really just a formalization of the other approaches, not a separate technique

  11. Previous Work • Easy to prove that “good” metamorphic code is immune to signature detection • Why? • But, many successes detecting hacker-produced metamorphic malware… • HMM/PHMM/machine learning • Graph-based techniques • Statistics (chi-squared, naïve Bayes) • Structural entropy • Linear algebraic techniques

  12. This Research • Measure similarity using “simple substitution distance” • We “decrypt” suspect file using statistics from a metamorphic family • If decryption is good, we classify it as a member of the same metamorphic family • If decryption is poor, we classify it as NOT a member of the given metamorphic family

  13. Simple Substitution Cipher • Simple substitution is one of the oldest and simplest means of encryption • A fixed key is used to substitute letters • For example, Caesar’s cipher: substitute the letter 3 positions ahead in the alphabet • In general, any permutation can be the key • Simple substitution cryptanalysis? • Statistical analysis of ciphertext
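
As a small illustration of the cipher described on this slide, the sketch below encrypts and decrypts with a 26-letter key. The function names (`caesar_key`, `encrypt`, `decrypt`) are illustrative choices and do not come from the presentation.

```python
import string

ALPHABET = string.ascii_uppercase

def caesar_key(shift=3):
    """Key for a Caesar cipher: the alphabet rotated by `shift` positions."""
    return ALPHABET[shift:] + ALPHABET[:shift]

def encrypt(plaintext, key):
    """Replace each plaintext letter with the key letter at the same position."""
    return plaintext.translate(str.maketrans(ALPHABET, key))

def decrypt(ciphertext, key):
    """Invert the substitution defined by `key`."""
    return ciphertext.translate(str.maketrans(key, ALPHABET))

key = caesar_key(3)
ciphertext = encrypt("ATTACKATDAWN", key)
print(ciphertext)                # DWWDFNDWGDZQ
print(decrypt(ciphertext, key))  # ATTACKATDAWN
```

Any permutation of the alphabet can be passed as `key`; the Caesar shift is just one of the 26! possible keys.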

  14. Simple Substitution Cryptanalysis • Suppose you observe the ciphertext PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWLXTOXBTFXQWAXBVCXQWAXFQJVWLEQNTOZQGGQLFXQWAKVWLXQWAEBIPBFXFQVXGTVJVWLBTPQWAEBFPBFHCVLXBQUFEVWLXGDPEQVPQGVPPBFTIXPFHXZHVFAGFOTHFEFBQUFTDHZBQPOTHXTYFTODXQHFTDPTOGHFQPBQWAQJJTODXQHFOQPWTBDHHIXQVAPBFZQHCFWPFHPBFIPBQWKFABVYYDZBOTHPBQPQJTQOTOGHFQAPBFEQJHDXXQVAVXEBQPEFZBVFOJIWFFACFCCFHQWAUVWFLQHGFXVAFXQHFUFHILTTAVWAFFAWTEVOITDHFHFQAITIXPFHXAFQHEFZQWGFLVWPTOFFA • Analyze frequency counts… • Likely that ciphertext “F” represents “E” • And so on, at least for common letters
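
The frequency analysis mentioned on this slide can be sketched with a simple letter count. The helper name `letter_frequencies` is an assumption made for illustration, and the sample is only the opening characters of the ciphertext, so its counts will differ from those of the full message.

```python
from collections import Counter

def letter_frequencies(ciphertext):
    """Return (letter, count) pairs, most frequent first."""
    return Counter(ciphertext).most_common()

# Opening characters of the ciphertext on this slide
sample = "PBFPVYFBQXZTYFPBFEQJHDXXQVAPTPQJKTOYQWIPBVWL"
for letter, count in letter_frequencies(sample)[:5]:
    print(letter, count)
```

The most frequent ciphertext letters are then matched against the most frequent letters of the assumed plaintext language to form an initial guess for the key.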

  15. Simple Substitution Cryptanalysis • Can even automate attack: 1. Make initial guess for key using frequency counts 2. Compute oldScore 3. Modify key by swapping adjacent elements 4. Compute newScore 5. If newScore > oldScore then oldScore = newScore 6. Else unswap elements 7. Goto 3 • How to compute score? • Number of dictionary words in putative plaintext? • Much better to use English digraph statistics
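
The loop on this slide can be sketched as a plain hill climb. Several details here are assumptions rather than anything stated in the presentation: the scorer (`make_digraph_scorer`, which sums log digraph counts taken from a reference text), the `max_passes` cap that replaces the open-ended "Goto 3", and the identity alphabet used as the starting key where the slide uses frequency counts.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_uppercase

def decrypt(ciphertext, key):
    """key[i] is the ciphertext letter that stands for ALPHABET[i]."""
    return ciphertext.translate(str.maketrans("".join(key), ALPHABET))

def make_digraph_scorer(reference_text):
    """Hypothetical scorer: higher when the text's digraphs are common in the reference."""
    counts = Counter(zip(reference_text, reference_text[1:]))
    def score(text):
        return sum(math.log(counts[pair] + 1) for pair in zip(text, text[1:]))
    return score

def hill_climb(ciphertext, initial_key, score, max_passes=50):
    key = list(initial_key)
    old_score = score(decrypt(ciphertext, key))
    for _ in range(max_passes):
        improved = False
        for i in range(len(key) - 1):
            key[i], key[i + 1] = key[i + 1], key[i]      # swap adjacent elements
            new_score = score(decrypt(ciphertext, key))  # full decryption every time
            if new_score > old_score:
                old_score = new_score
                improved = True
            else:
                key[i], key[i + 1] = key[i + 1], key[i]  # unswap
        if not improved:
            break
    return "".join(key), old_score

reference = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG" * 3
scorer = make_digraph_scorer(reference)
secret_key = ALPHABET[3:] + ALPHABET[:3]
ciphertext = reference.translate(str.maketrans(ALPHABET, secret_key))
found_key, found_score = hill_climb(ciphertext, ALPHABET, scorer)
print(found_key, round(found_score, 2))
```

Note that every trial swap decrypts the whole ciphertext again, so the cost of each step grows with message length.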

  16. Jakobsen’s Algorithm • Method on previous slide can be slow • Why? • Jakobsen’s algorithm uses a similar idea, but is fast and efficient • Ciphertext is only decrypted once • So algorithm is (essentially) independent of length of message • Then, only matrix manipulations required

  17. Jakobsen’s Algorithm: Swapping • Assume plaintext is English, 26 letters • Let K = k1,k2,k3,…,k26 be putative key • And let “|” represent “swap” • Then we swap elements as follows • Also, we restart this swapping schedule from the beginning whenever score improves
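
The swap schedule itself was a figure and is not in this transcript. A minimal sketch, assuming the ordering from Jakobsen's published description (adjacent pairs k1|k2, k2|k3, … first, then pairs two apart, and so on up to k1|k26), could look like this; `swap_schedule` is a hypothetical name.

```python
def swap_schedule(n=26):
    """Yield 0-based index pairs (i, j): nearest neighbours first, then wider gaps."""
    for gap in range(1, n):
        for i in range(n - gap):
            yield i, i + gap

pairs = list(swap_schedule(26))
print(pairs[:3], pairs[-1], len(pairs))  # [(0, 1), (1, 2), (2, 3)] (0, 25) 325
```

One full pass visits each of the 26-choose-2 = 325 pairs exactly once. Because the schedule restarts whenever the score improves, a run can take more swaps than that.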

  18. Jakobsen’s Algorithm: Swapping • Minimum number of swaps is 26 choose 2, or 325 • Maximum is unbounded • Each swap requires a score computation • Average number of swaps? Experimentally • Ciphertext of length 500, average 1050 swaps • Ciphertext of length 8000, average just 630 swaps • So, work depends on length of ciphertext • More ciphertext, better scores, fewer swaps

  19. Jakobsen’s Algorithm: Scoring • Let D = {dij} be digraph distribution corresponding to putative key K • Let E = {eij} be digraph distribution of English language • These matrices are 26 x 26 • Compute score as
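
The formula on this slide was an image and is not in the transcript. In Jakobsen's published algorithm the score is the sum of absolute differences between the two digraph matrices, which in the notation of this slide would read as follows (a reconstruction, not a copy of the slide):

```latex
\mathrm{score}(K) \;=\; d(D, E) \;=\; \sum_{i=1}^{26} \sum_{j=1}^{26} \left| d_{ij} - e_{ij} \right|
```

Under this definition the score is a distance, so a smaller value means a better match between the putative plaintext and English. That is the opposite direction from the newScore > oldScore test in the generic hill climb on slide 15.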

  20. Jakobsen’s Algorithm • So far, nothing fancy here • Could see all of this in a CS 265 assignment • Jakobsen’s trick: Determine new D matrix from old D without decrypting • How to do so? • It turns out that swapping elements of K swaps corresponding rows and columns of D • See example on next slides…

  21. Swapping Example • To simplify, suppose 10 letter alphabet E, T, A, O, I, N, S, R, H, D • Suppose you are given the ciphertext TNDEODRHISOADDRTEDOAHENSINEOAR DTTDTINDDRNEDNTTTDDISRETEEEEEAA • Frequency counts given by

  22. Swapping Example • We choose the putative key K given here • The corresponding putative plaintext is AOETRENDSHRIEENATE RIDTOHSOTRINEAAEAS OEENOTEOAAAEESHNA TTTTTII • Corresponding digraph distribution D is

  23. Swapping Example • Suppose we swap first 2 elements of K • Then decrypt using new K • And compute digraph matrix for new K • (Figure labels: previous key K, new key K)

  24. Swapping Example • Old D matrix vs new D matrix • What do you notice? • So what’s the point here? • This is good!
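
The point of the swapping example can be checked directly. The sketch below uses the first 30 ciphertext characters from slide 21. The key `DETNARIOSH` is not copied from the slides (the key was shown as an image); it is inferred by lining up the ciphertext with the putative plaintext on slide 22. The helper names are illustrative.

```python
ALPHABET = "ETAOINSRHD"                          # 10-letter plaintext alphabet from slide 21
CIPHERTEXT = "TNDEODRHISOADDRTEDOAHENSINEOAR"    # first 30 ciphertext characters
KEY = "DETNARIOSH"                               # inferred: KEY[i] decrypts to ALPHABET[i]

def decrypt(ciphertext, key):
    return ciphertext.translate(str.maketrans(key, ALPHABET))

def digraph_matrix(text):
    """Raw digraph counts, rows and columns indexed by position in ALPHABET."""
    index = {c: i for i, c in enumerate(ALPHABET)}
    d = [[0] * len(ALPHABET) for _ in ALPHABET]
    for a, b in zip(text, text[1:]):
        d[index[a]][index[b]] += 1
    return d

def swap_key(key, i, j):
    k = list(key)
    k[i], k[j] = k[j], k[i]
    return "".join(k)

def swap_rows_and_columns(d, i, j):
    d = [row[:] for row in d]        # copy so the old matrix is left intact
    d[i], d[j] = d[j], d[i]          # swap rows i and j
    for row in d:
        row[i], row[j] = row[j], row[i]  # swap columns i and j
    return d

old_d = digraph_matrix(decrypt(CIPHERTEXT, KEY))
by_decrypting = digraph_matrix(decrypt(CIPHERTEXT, swap_key(KEY, 0, 1)))
by_matrix_swap = swap_rows_and_columns(old_d, 0, 1)
print(by_decrypting == by_matrix_swap)  # True
```

The matrix route never touches the ciphertext, so its cost depends on the alphabet size and not on the message length.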

  25. Jakobsen’s Algorithm

  26. Proposed Similarity Score • Extract opcode sequences from a collection of viruses • All viruses from same metamorphic family • Determine n most common opcodes • Symbol n+1 used for all “other” opcodes • Use resulting digraph statistics to form matrix E = {eij} • Note that matrix is (n+1) x (n+1)

  27. Scoring a File • Given an executable we want to score • Extract its opcode sequence • Use opcode digraph stats to get D = {dij} • This matrix is also (n+1) x (n+1) • Initial “key” K chosen to match monograph stats of virus family • Most frequent opcode in exe maps to most frequent opcode in virus family, etc. • Score based on distance between D and E • “Decrypt” D and score how closely it matches E • Jakobsen’s algorithm used for “decryption”
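
The setup on the last two slides can be sketched as follows. This covers only the initial key and the initial score, before any of the swaps from Jakobsen's algorithm. The function names, the normalization of the matrices to relative frequencies, and the toy opcode sequences are all assumptions made for illustration. Symbols are 0-based here, so the "other" symbol is index n.

```python
from collections import Counter

def top_opcodes(sequences, n):
    """The n most frequent opcodes across all sequences, most frequent first."""
    counts = Counter(op for seq in sequences for op in seq)
    return [op for op, _ in counts.most_common(n)]

def to_symbols(sequence, ranked_opcodes):
    """Map ranked opcodes to 0..n-1 and every other opcode to symbol n."""
    index = {op: i for i, op in enumerate(ranked_opcodes)}
    other = len(ranked_opcodes)
    return [index.get(op, other) for op in sequence]

def digraph_matrix(symbol_sequences, size):
    """Digraph relative frequencies; pairs never span two sequences."""
    matrix = [[0.0] * size for _ in range(size)]
    total = 0
    for symbols in symbol_sequences:
        for a, b in zip(symbols, symbols[1:]):
            matrix[a][b] += 1
            total += 1
    return [[v / total for v in row] for row in matrix] if total else matrix

def distance(d, e):
    """Sum of absolute element-wise differences between two matrices."""
    return sum(abs(x - y) for row_d, row_e in zip(d, e) for x, y in zip(row_d, row_e))

def initial_score(family_sequences, file_sequence, n):
    family_ranked = top_opcodes(family_sequences, n)
    e = digraph_matrix([to_symbols(s, family_ranked) for s in family_sequences], n + 1)
    # Initial key: the file's k-th most frequent opcode plays the role of the
    # family's k-th most frequent opcode, so both are ranked by frequency.
    file_ranked = top_opcodes([file_sequence], n)
    d = digraph_matrix([to_symbols(file_sequence, file_ranked)], n + 1)
    return distance(d, e)

family = [["mov", "mov", "add", "mov", "push", "mov", "add"]]
renamed = ["xor", "xor", "sub", "xor", "pop", "xor", "sub"]  # same structure, different opcodes
unrelated = ["nop"] * 7
print(initial_score(family, renamed, 2), initial_score(family, unrelated, 2))
```

A file whose opcodes are a consistent renaming of the family's gets a distance of zero, which is the kind of modification the substitution view is meant to absorb.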

  28. Example • Suppose only 5 common opcodes in family viruses (in descending frequency) • Extract following sequence from an exe • Initial “key” is • And “decrypt” is

  29. Example • Given “decrypt” • Form D matrix • After swap… • And so on…

  30. Scoring Algorithm

  31. Quantifying Success • Consider these 2 scatterplots of scores • Which is better (and why)?

  32. ROC Curves • Plot true positive rate vs false positive rate • As “threshold” varies • Curve nearer 45-degree line is bad • Curve nearer upper-left is good

  33. ROC Curves • Use ROC curves to quantify success • Area under the ROC curve (AUC) • Probability that a randomly chosen positive instance scores higher than a randomly chosen negative instance • AUC of 1.0 implies ideal detection • AUC of 0.5 means classification is no better than flipping a coin
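
The probabilistic reading of AUC on this slide can be computed directly by comparing every positive score with every negative score. This is a minimal sketch, not the evaluation code used in the experiments. The name `auc` and the half credit for ties are conventions assumed here.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))

print(auc([3, 4], [1, 2]))  # 1.0, perfect separation
print(auc([1, 2], [1, 2]))  # 0.5, no better than a coin flip
```

If the score is a distance, where family members are expected to score lower than benign files, negate the scores (or swap the two arguments) before calling `auc`.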

  34. Parameter Selection • Tested the following parameters • Opcode matrix size • Scoring function • Normalization • Swapping strategy • None significant, except matrix size • So we only give results for matrix size here

  35. Opcode Matrix Size • Obtained following results • So, ironically, we use 26 x 26 matrix

  36. Test Data • Tested the following metamorphic families • G2: known to be weak • NGVCK: highly metamorphic • MWOR: highly metamorphic and stealthy • MWOR “padding ratios” of 0.5 to 4.0 • For G2 and NGVCK: 50 files tested, Cygwin utilities for benign files • For each MWOR padding ratio: 100 files tested, Linux utilities for benign files • 5-fold cross validation in each experiment

  37. NGVCK and G2 Graphs

  38. MWOR Score Graphs

  39. MWOR ROC Curves

  40. MWOR AUC Statistics

  41. Efficiency

  42. Conclusions • Simple substitution score gives good results for challenging metamorphic viruses • Scoring is fast and efficient • Applicable to other types of malware • Requires opcodes

  43. References • G. Shanmugam, R.M. Low, and M. Stamp, Simple substitution distance and metamorphic detection, Journal of Computer Virology and Hacking Techniques, 9(3):159-170, 2013
