Self-Learning Anti-Virus Scanner
Arun Lakhotia, Professor
Andrew Walenstein, Assistant Professor
University of Louisiana at Lafayette
www.cacs.louisiana.edu/labs/SRL
2008 AVAR (New Delhi)

Introduction
• Alumni in AV Industry
  • Prabhat Singh, Nitin Jyoti, Aditya Kapoor, Rachit Kumar: McAfee AVERT
  • Erik Uday Kumar, Authentium
  • Moinuddin Mohammed, Microsoft
  • Prashant Pathak, Ex-Symantec
• Funded by: Louisiana Governor’s IT Initiative
• Director, Software Research Lab
  • Lab’s focus: Malware Analysis
  • Graduate level course on Malware Analysis
  • Six years of AV related research
• Issues investigated:
  • Metamorphism
  • Obfuscation

Outline
• Attack of Variants
  • AV vulnerability: Exact match
• Information Retrieval Techniques
  • Inexact match
• Adapting IR to AV
  • Account for code permutation
• Vilo: System using IR for AV
• Integrating Vilo into AV Infrastructure
• Self-Learning AV using Vilo

ATTACK OF VARIANTS

Variants vs Family
Source: Symantec Internet Threat Report, XI

Analysis of attacker strategy
• Purpose of attack of variants
  • Denial of Service on AV infrastructure
  • Increase odds of passing through
• Weakness exploited
  • AV systems use: Exact match over extract
• Attack strategy
  • Generate just enough variation to beat exact match
• Attacker cost
  • Cost of generating and distributing variants

Analyzing attacker cost
• Payload creation is expensive
  • Must reuse payload
• Need thousands of variants
  • Must be automated
• “General” transformers are expensive
  • Specialized, limited transformers
  • Hence packers/unpackers

Attacker vulnerability
• Automated transformers
  • Limited capability
  • Machine generated, must have regular pattern
• Exploiting attacker vulnerability
  • Detect patterns of similarities
• Approach
  • Information Retrieval (this presentation)
  • Markov Analysis (other work)

Information Retrieval

IR Basics
• Basis of Google, Bioinformatics
• Organizing very large corpus of data
• Key idea
  • Inexact match over whole
• Contrast with AV
  • Exact match over extract

IR Problem
• Inputs: Document Collection; Query: Keywords or Document
• IR returns: Related documents

IR Steps
Step 1: Convert documents to vectors
  1a. Define a method to identify “features”
      Example: k-consecutive words
  1b. Extract all features from all documents
      Example document: Have you wondered When is a rose a rose?
  1c. Count features, make feature vector
      Have you wondered     1
      You wondered when     1
      Wondered when rose    1
      When rose rose        1
      How about onions      0
      Onion smell stinks    0
      Feature vector: [1, 1, 1, 1, 0, 0]

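Step 1 can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function names `word_kgrams` and `count_vector` are hypothetical, and the stop-word list is an assumption inferred from the slide's example, where "is" and "a" do not appear in any feature.

```python
import re
from collections import Counter

# Assumption: small words are dropped before features are formed, which is
# what the slide's example appears to do ("is" and "a" vanish).
STOP_WORDS = {"is", "a", "the"}

def word_kgrams(text, k=3):
    """Return the k-consecutive-word features of a text, in order."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

def count_vector(text, vocabulary, k=3):
    """Count how often each vocabulary feature occurs in the text."""
    counts = Counter(word_kgrams(text, k))
    return [counts[feature] for feature in vocabulary]

corpus = [
    "Have you wondered when is a rose a rose?",
    "How about onions",
    "Onion smell stinks",
]

# Step 1b: collect every feature seen anywhere in the corpus.
vocabulary = []
for document in corpus:
    for feature in word_kgrams(document):
        if feature not in vocabulary:
            vocabulary.append(feature)

# Step 1c: the feature vector of the first document.
vector = count_vector(corpus[0], vocabulary)
print(vector)  # [1, 1, 1, 1, 0, 0]
```

The vocabulary is built from the whole corpus so that every document is described in the same coordinate system, which is what makes the later vector comparison meaningful.
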
IR Steps
• Step 2: Compute feature vectors
  • Take into account features in entire corpus
  • Classical method: W = TF x IDF
    TF = Term Frequency
    DF = # documents containing the feature
    IDF = Inverse of DF

Feature               TF(v1)   DF   IDF   w1 = TF x IDF(v1)
You wondered when     1        5    1/5   1/5
Wondered when rose    2        7    1/7   2/7
When rose rose        5        8    1/8   5/8
How about onions      3        6    1/6   3/6
Onion smell stinks    0        3    1/3   0/3

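The weighting in Step 2 is simple enough to check by hand. The sketch below uses the slide's own definition, IDF = 1/DF, and the TF and DF values from its example rows; the function name `tf_idf` is hypothetical. Many IR systems use a logarithmic IDF instead, so treat the plain reciprocal as this presentation's simplification.

```python
from fractions import Fraction

def tf_idf(tf, df):
    """Weight each feature by TF x IDF, with IDF = 1/DF as on the slide."""
    return [Fraction(t, d) if d else Fraction(0) for t, d in zip(tf, df)]

# TF and DF for the five example features, in the slide's row order.
tf = [1, 2, 5, 3, 0]
df = [5, 7, 8, 6, 3]

weights = tf_idf(tf, df)
print([str(w) for w in weights])  # ['1/5', '2/7', '5/8', '1/2', '0']
```

Exact fractions are used only so the result can be compared with the slide; `Fraction` reduces 3/6 to 1/2 and 0/3 to 0.
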
IR Steps
• Step 3: Compare vectors
  • Cosine similarity
  w1 = [0.33, 0.25, 0.66, 0.50]

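Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction and not on magnitude. A minimal sketch follows; `w1` uses the slide's values (reading the second entry as 0.25), while `w2` is a hypothetical second vector invented here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

w1 = [0.33, 0.25, 0.66, 0.50]
w2 = [0.66, 0.50, 1.32, 1.00]  # hypothetical: w1 scaled by 2

print(round(cosine_similarity(w1, w2), 6))          # 1.0
print(cosine_similarity([1, 0, 0, 0], [0, 1, 0, 0]))  # 0.0
```

Because scaling a vector does not change its direction, a document repeated twice scores as identical to the original, which is one reason the measure suits inexact matching.
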
IR Steps
• Step 4: Document Ranking
  • Using similarity measure
• Inputs: New Document, Document Collection
• Output: Matching documents
• Similarity scores shown for the collection: 0.30, 0.82, 0.90, 0.76

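Ranking then amounts to scoring every document in the collection against the new one and sorting by score. The sketch below assumes the documents are already feature vectors; the collection contents and the names `rank`, `doc_a`, `doc_b`, and `doc_c` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def rank(query, collection):
    """Return (score, name) pairs, most similar first."""
    scored = [(cosine(query, vector), name)
              for name, vector in collection.items()]
    return sorted(scored, reverse=True)

# Hypothetical collection of already-vectorized documents.
collection = {
    "doc_a": [1, 1, 0, 0],
    "doc_b": [0, 0, 1, 1],
    "doc_c": [1, 1, 1, 0],
}
ranking = rank([1, 1, 0, 0], collection)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

With this data the order is `doc_a`, `doc_c`, `doc_b`: an identical vector, a partial overlap, and no overlap at all.
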
Adapting IR for AV

Adapting IR for AV
Step 0: Mapping program to document
Extract sequence of operations

Listing 1:
    l2D2: push ecx
          push 4
          pop ecx
          push ecx
    l2D7: rol edx, 8
          mov dl, al
          and dl, 3Fh
          shr eax, 6
          loop l2D7
          pop ecx
          call s319
          xchg eax, edx
          stosd
          xchg eax, edx
          inc [ebp+v4]
          cmp [ebp+v4], 12h
          jnz short l305
Operation sequence: push push pop push rol mov and shr loop pop call xchg stosd xchg inc cmp jnz

Listing 2:
    l144: push ecx
          push 4
          pop ecx
          push ecx
    l149: mov dl, al
          and dl, 3Fh
          rol edx, 8
          shr ebx, 6
          loop l149
          pop ecx
          call s52F
          xchg ebx, edx
          stosd
          xchg ebx, edx
          inc [ebp+v4]
          cmp [ebp+v4], 12h
          jnz short l18
Operation sequence: push push pop push mov and rol shr loop pop call xchg stosd xchg inc cmp jnz

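Step 0 keeps only the operation of each instruction and discards labels and operands. A minimal sketch, assuming the disassembly is available as text with one instruction per line; the function name `operation_sequence` is hypothetical and the parsing covers only the simple layout shown on the slide.

```python
def operation_sequence(listing):
    """Return the mnemonics of a disassembly listing, in program order."""
    operations = []
    for line in listing.splitlines():
        tokens = line.split()
        # Drop a leading label such as "l2D7:".
        if tokens and tokens[0].endswith(":"):
            tokens = tokens[1:]
        if tokens:
            operations.append(tokens[0])
    return operations

listing = """
l2D2: push ecx
      push 4
      pop ecx
      push ecx
l2D7: rol edx, 8
      mov dl, al
      and dl, 3Fh
      shr eax, 6
      loop l2D7
      pop ecx
      call s319
      xchg eax, edx
      stosd
      xchg eax, edx
      inc [ebp+v4]
      cmp [ebp+v4], 12h
      jnz short l305
"""

ops = operation_sequence(listing)
print(" ".join(ops))
```

Dropping operands is what makes register renaming (eax versus ebx in the two listings) invisible to the later comparison.
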
Adapting IR for AV
Step 1a: Defining features (k-perm)
Virus 1: P P O P R M A S L O C X S X I C J
Virus 2: P P O P M A R S L O C X S X I C J
Feature = Permutation of k operations
The two sequences differ only in the window R M A versus M A R.

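One way to realize a k-perm is to take every window of k consecutive operations and sort it, so that any reordering inside the window yields the same feature. This is a sketch of that reading, with the hypothetical names `k_grams` and `k_perms`; it is not claimed to be Vilo's implementation.

```python
def k_grams(ops, k):
    """Ordered windows of k consecutive operations."""
    return {tuple(ops[i:i + k]) for i in range(len(ops) - k + 1)}

def k_perms(ops, k):
    """Windows of k consecutive operations with the order inside ignored."""
    return {tuple(sorted(ops[i:i + k])) for i in range(len(ops) - k + 1)}

virus1 = "P P O P R M A S L O C X S X I C J".split()
virus2 = "P P O P M A R S L O C X S X I C J".split()

# As ordered 3-grams, R M A and M A R are different features.
print(("R", "M", "A") in k_grams(virus2, 3))   # False

# As 3-perms, both windows collapse to the same feature.
shared = k_perms(virus1, 3) & k_perms(virus2, 3)
print(("A", "M", "R") in shared)               # True
```

The trade-off is that sorting also merges windows that were distinct in the original program, such as P P O and P O P, so k-perms are slightly coarser than ordered k-grams.
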
Adapting IR for AV
Step 1: Example of 3-perm
Virus 1: P P O P R M A S L O C X S X I C J
Virus 2: P P O P M A R S L O C X S X I C J
Virus 3: P P O P M A R S L O C X S X I C J
Example 3-perm: P O P

Adapting IR for AV
Step 2: Construct feature vectors (4-perms)
Sample 1: P O P R M A S L
Sample 2: P O P M A R S L
Sample 3: M A R S L P O P
Example features: MARS, PMAR
Vector entries shown: 0 0 1 0 0 1 1 0 0

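Feature vectors over 4-perms can be built for the three samples as follows. The sketch treats a 4-perm as a sorted window of four operations, which is an assumption about the definition; the numbers it produces follow from that assumption and are not claimed to reproduce the slide's table.

```python
from collections import Counter

def k_perm_counts(ops, k):
    """Count each k-perm (sorted window of k operations) in a sequence."""
    return Counter(tuple(sorted(ops[i:i + k]))
                   for i in range(len(ops) - k + 1))

samples = {
    1: "P O P R M A S L".split(),
    2: "P O P M A R S L".split(),
    3: "M A R S L P O P".split(),
}
counts = {name: k_perm_counts(ops, 4) for name, ops in samples.items()}

# One shared vocabulary so all vectors have the same coordinates.
vocabulary = sorted(set().union(*counts.values()))
vectors = {name: [c[f] for f in vocabulary] for name, c in counts.items()}

mars = tuple(sorted("MARS"))
pmar = tuple(sorted("PMAR"))
for name in samples:
    print(name, "MARS:", counts[name][mars], "PMAR:", counts[name][pmar])
```

Under this definition all three samples contain the MARS feature, while PMAR appears in samples 1 and 2 only, because sample 3 moved the P O P block to the end.
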
Adapting IR for AV
• Step 3: Compare vectors
  • Cosine similarity (as before)
• Step 4: Match new sample

Vilo: System using IR for AV

Vilo Functional View
• Inputs: New Sample, Malware Collection
• Output: Malware Match, with similarity scores (0.90, 0.82, 0.76, 0.30)

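The functional view can be approximated end to end by combining k-perm features with cosine similarity. The following is a toy sketch under stated assumptions, not Vilo itself: it uses raw counts instead of TF x IDF weights, and the collection entries `family_a_v1` and `unrelated` are hypothetical.

```python
import math
from collections import Counter

def features(ops, k=3):
    """k-perm counts: sorted windows of k consecutive operations."""
    return Counter(tuple(sorted(ops[i:i + k]))
                   for i in range(len(ops) - k + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match(sample_ops, collection, k=3):
    """Rank every known sample by similarity to the new sample."""
    query = features(sample_ops, k)
    scored = [(cosine(query, features(ops, k)), name)
              for name, ops in collection.items()]
    return sorted(scored, reverse=True)

collection = {
    "family_a_v1": "P P O P R M A S L O C X S X I C J".split(),
    "unrelated": "X X J J C C I I L L".split(),  # hypothetical sample
}
new_sample = "P P O P M A R S L O C X S X I C J".split()

ranking = match(new_sample, collection)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

The variant scores about 0.76 against the known family member and 0.00 against the unrelated sample, even though an exact match on the operation sequence would fail for both.
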
Vilo in Action: Query Match

Vilo: Performance
• Response time vs Database size
• Search on generic desktop: In Seconds
• Contrast with
  • Behavior match: In Minutes
  • Graph match: In Minutes

Vilo Match Accuracy
ROC Curve: True Positive vs False Positive

Vilo in AV Product

Vilo in AV Product
• AV Systems: Composed of classifiers
• Introduce Vilo as a Classifier
Diagram labels: AV Scanner; Classifier (x5); Vilo

Self-Learning AV Product
• How to get malware collection?
• Solution 1: Collect malware detected by the Product.
Diagram labels: Vilo; Classifier (x2)

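Solution 1 describes a feedback loop: whatever any classifier in the product detects is added to the collection that the similarity classifier searches. The sketch below illustrates that loop with two hypothetical classes, `ExactMatchClassifier` and `SimilarityClassifier`; the 0.7 threshold and the k-perm representation are assumptions, and none of this is the product architecture itself.

```python
import math
from collections import Counter

def features(ops, k=3):
    return Counter(tuple(sorted(ops[i:i + k]))
                   for i in range(len(ops) - k + 1))

def cosine(a, b):
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norms = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return dot / norms if norms else 0.0

class ExactMatchClassifier:
    """Stand-in for a conventional signature-based classifier."""
    def __init__(self, signatures):
        self.signatures = {tuple(s) for s in signatures}

    def is_malware(self, ops):
        return tuple(ops) in self.signatures

class SimilarityClassifier:
    """Hypothetical Vilo-like classifier that learns from detections."""
    def __init__(self, threshold=0.7):  # threshold is an assumed value
        self.threshold = threshold
        self.collection = []

    def learn(self, ops):
        self.collection.append(features(ops))

    def is_malware(self, ops):
        query = features(ops)
        return any(cosine(query, known) >= self.threshold
                   for known in self.collection)

def scan(ops, exact, similar):
    """Flag a sample if any classifier fires, then feed it back."""
    detected = exact.is_malware(ops) or similar.is_malware(ops)
    if detected:
        similar.learn(ops)  # the self-learning step
    return detected

virus1 = "P P O P R M A S L O C X S X I C J".split()
virus2 = "P P O P M A R S L O C X S X I C J".split()
exact = ExactMatchClassifier([virus1])
similar = SimilarityClassifier()

print(scan(virus2, exact, similar))  # False: variant, nothing learned yet
print(scan(virus1, exact, similar))  # True: exact match, sample is learned
print(scan(virus2, exact, similar))  # True: variant now matches by similarity
```

The third scan succeeds only because the second one populated the collection, which is the sense in which the product teaches itself.
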
Self-Learning AV Product
• How to get malware collection?
• Solution 2: Collect and learn in the cloud
Diagram labels: Internet Cloud; Vilo (in the cloud); Vilo; Classifier (x2)

Learning in the Cloud
• How to get malware collection?
• Solution 2: Collect and learn in the cloud
Diagram labels: Internet Cloud; Vilo Learner; Vilo; Classifier (x3)

Experience with Vilo-Learning
• Vilo-in-the-cloud holds promise
  • Can utilize cluster of workstations
    • Like Google
  • Take advantage of increasing bandwidth and compute power
• Engineering issues to address
  • Control growth of database
    • Forget samples
    • Use “signature” feature vector(s) for family
  • Be “selective” about features to use

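The two database-growth ideas can be illustrated briefly. The slide does not say how samples are forgotten or how a family "signature" vector is formed, so both choices below are assumptions: the oldest samples are dropped once a size limit is reached, and the signature is the average (centroid) of the members' feature vectors. The names `BoundedCollection` and `family_signature` are hypothetical.

```python
from collections import Counter, deque

def features(ops, k=3):
    return Counter(tuple(sorted(ops[i:i + k]))
                   for i in range(len(ops) - k + 1))

class BoundedCollection:
    """Keeps at most max_samples entries, forgetting the oldest first."""
    def __init__(self, max_samples):
        self.samples = deque(maxlen=max_samples)

    def add(self, vector):
        self.samples.append(vector)  # deque discards the oldest when full

def family_signature(vectors):
    """Average the member vectors into a single family vector."""
    if not vectors:
        return {}
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {feature: count / len(vectors) for feature, count in total.items()}

virus1 = "P P O P R M A S L O C X S X I C J".split()
virus2 = "P P O P M A R S L O C X S X I C J".split()
signature = family_signature([features(virus1), features(virus2)])

print(signature[("A", "M", "R")])  # 1.0: present in both variants
print(signature[("O", "P", "R")])  # 0.5: present in only one variant

store = BoundedCollection(max_samples=2)
for name in ["old", "newer", "newest"]:
    store.add(name)
print(list(store.samples))  # ['newer', 'newest']
```

Storing one signature per family instead of every variant keeps the search cost tied to the number of families, at the price of blurring differences between variants.
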
Summary
• Weakness of current AV system
  • Exact match over extract
  • Exploited by creating large number of variants
• Information Retrieval research strengths
  • Inexact match over whole
  • VILO demonstrates IR techniques have promise
• Architecture of Self-Learning AV System
  • Integrate VILO into existing AV systems
  • Create feedback mechanism to drive learning