AccessMiner: Using System-Centric Models for Malware Protection • Andrea Lanzi, Davide Balzarotti, Christopher Kruegel, Mihai Christodorescu, and Engin Kirda • ACM CCS, October 2010
OUTLINE • Malware Detection • System Call Data Collection • Program-Centric Models and Detection • System-Centric Models and Detection • Discussion and Conclusion
Malware Detection • Signature-based • Static content • Byte strings, instruction sequences => evaded by code obfuscation • Behavior-based • Dynamic actions • Sequences of system calls, API functions • A program-centric approach • …good results?
Malware Detection Problem • Test cases • Small scale • About 10 benign applications • Limited execution • A few minutes, in a sandbox • Synthetic inputs • Single machine
Malware Detection Problem (cont.) • Program-centric model • Narrow view of a single program • Diversity of system call information • How do benign programs interact with their environment? • Such models may be specific to a small set of benign applications only
System Call Data Collection • A Microsoft Windows kernel module • Collects, anonymizes, and uploads system call logs • Hooks the System Service Descriptor Table (SSDT) • Mindful of system resources
Kernel collector • 79 different system calls • Related to files, registry, processes and threads, networking, memory • Same subset as in Anubis • Log record: <timestamp, program, pid, ppid, system call, args, result>
System Call Data • Sensitive data are replaced • Non-system paths, user-root registry key, IP addresses
System Call Data Collection • Large and diverse set of system call traces • Ten different machines, different users • Several weeks • 114.5 GB of data • 1.556 billion system calls • 362,600 processes • 242 applications
Data set • 2~4 days with 2~12 hours • Production systems, development systems
Data Normalization • Raw data (system call logs) => accessed resources and access types • Tracking the access operations • The set of resources open at any given time • OS handles • Until the resource is released (NtClose) • Execution path and file name: • NtOpenFile, NtCreateSection, NtCreateThread
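The handle tracking described on this slide can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: the five-field event tuple `(program, pid, syscall, handle, path)` and the `OPS` mapping are assumptions made for the example, and only a few real native call names (`NtOpenFile`, `NtReadFile`, `NtWriteFile`, `NtClose`) are handled.

```python
# Assumed mapping from handle-based system calls to an access type.
OPS = {"NtReadFile": "read", "NtWriteFile": "write"}


def normalize(events):
    """Turn raw events into (program, path, op) accesses.

    Each event is a hypothetical tuple (program, pid, syscall, handle, path);
    `path` is only meaningful on the call that opens the resource.
    """
    open_handles = {}  # (pid, handle) -> path of the resource currently open
    accesses = []
    for program, pid, syscall, handle, path in events:
        key = (pid, handle)
        if syscall == "NtOpenFile":
            # Remember which resource this handle refers to.
            open_handles[key] = path
        elif syscall == "NtClose":
            # The resource is released; the handle value may be reused later.
            open_handles.pop(key, None)
        elif syscall in OPS and key in open_handles:
            # Resolve the handle back to the resource it names.
            accesses.append((program, open_handles[key], OPS[syscall]))
    return accesses


events = [
    ("a.exe", 1, "NtOpenFile", 4, "C:\\foo\\a.txt"),
    ("a.exe", 1, "NtWriteFile", 4, None),
    ("a.exe", 1, "NtClose", 4, None),
    ("a.exe", 1, "NtWriteFile", 4, None),  # stale handle, ignored
]
print(normalize(events))
```

Keying the table on `(pid, handle)` rather than on the handle alone reflects that handle values are only meaningful within one process.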
Analysis of System Call Data • How diverse is the collected system call data? • Focus on system call types • Long tradition in the security community • Most models rely on characteristic patterns of calls • Argument values are ignored
Creating n-gram Models • Follow a “standard” approach • 1. Extract n-gram models for a set of malware programs and a set of benign programs • 2. Find all n-grams that appear in malware programs but not in benign programs • 3. Hope those n-grams are characteristic of malware programs
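The three steps on the n-gram slide can be sketched as below. This is an illustrative reading, not the authors' implementation: traces are assumed to be plain lists of system call names, and the detection rule in `is_flagged` (flag a trace if it contains at least one malware-only n-gram) is an assumption, since the slides do not state the exact threshold.

```python
def ngrams(trace, n):
    """Return the set of length-n windows over a list of system call names."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def train(malware_traces, benign_traces, n):
    """Steps 1 and 2: keep n-grams seen in malware but never in benign traces."""
    malware = set().union(*(ngrams(t, n) for t in malware_traces))
    benign = set().union(*(ngrams(t, n) for t in benign_traces))
    return malware - benign


def is_flagged(trace, signatures, n):
    """Step 3 (assumed rule): flag a trace containing any malware-only n-gram."""
    return not ngrams(trace, n).isdisjoint(signatures)


benign = [["NtOpenFile", "NtReadFile", "NtClose"]]
malware = [["NtOpenFile", "NtWriteFile", "NtClose"]]
signatures = train(malware, benign, n=2)
print(is_flagged(malware[0], signatures, n=2))
print(is_flagged(benign[0], signatures, n=2))
```

The more diverse the benign traces are, the more n-grams get subtracted in `train`, which is the effect the following slides describe.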
n-gram Models • 10,838 malware samples from Anubis • Ten experiments (one per machine) • Train an n-gram model on system call traces from 9 machines and 2/3 of the malware set • Perform detection on the remaining system call traces and the remaining 1/3 of the malware set
Program-Centric Models and Detection • System call sequences invoked by benign applications are diverse • Program-centric models therefore have difficulty distinguishing normal from malicious behavior • A large amount of data is needed
System-Centric Models and Detection • Generalize how benign programs interact with the operating system • Record accesses to files and registry entries • Read, write, execute • The model shows “convergence”
Access Activity Model • A set of labels for operating system resources • A label “L” is a set of access tokens • {t0, t1, …, tn} • A token “t” is a pair <a,op> • a => application • op => type of access • e.g. <firefox,write>, <*,execute>
Initial Access Activity Model (1) • Use system call traces of all benign processes • Build a virtual file system tree • Example: application “a” writes C:\foo\a.txt, application “b” executes C:\foo\bar\b.rar • Resulting tree: C: > foo <a,write> > bar <b,exec>
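A minimal sketch of how such a virtual tree could be built from the slide's example. The `Node` class and `add_access` helper are hypothetical names introduced here; following the example, the token is attached to the folder that contains the accessed resource.

```python
class Node:
    """One folder in the virtual file system tree."""

    def __init__(self):
        self.children = {}  # folder name -> Node
        self.label = set()  # access tokens, each a pair (application, op)


def add_access(root, app, path, op):
    """Record that `app` performed `op` on the resource at `path`."""
    folders = path.split("\\")[:-1]  # drop the file name itself
    node = root
    for name in folders:
        node = node.children.setdefault(name, Node())
    # The token labels the folder containing the resource.
    node.label.add((app, op))


root = Node()
add_access(root, "a", "C:\\foo\\a.txt", "write")
add_access(root, "b", "C:\\foo\\bar\\b.rar", "exec")

foo = root.children["C:"].children["foo"]
print(foo.label)                  # label of C:\foo
print(foo.children["bar"].label)  # label of C:\foo\bar
```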
Model Pre-processing (2) • Remove some elements in the tree • Microsoft Windows services • Desktop indexing programs • Anti-virus software • Identify applications that start processes with different names • C:\Windows\system32 => win_core
Model Generalization (3) • Propagated • Container • All children are private (without *) • C:\Program Files • Merged • <x,write> => <x,read>
System-Centric Model Detection • For any op • Find the longest prefix P shared between the path to the resource and the folders in the virtual tree stored by our model • Ten experiments • File system access activity model: about 100 labels • Registry access activity model: about 3000 labels • Full access activity model
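The longest-prefix lookup can be sketched as follows. The slides only describe finding the prefix P, so everything after that is an assumption made for illustration: the model is stored as a flat dictionary from labelled folder paths to token sets, an access is allowed when the label at P contains `(app, op)` or `("*", op)`, and a path with no labelled prefix at all is treated as a violation.

```python
def longest_prefix(model, path):
    """Return the longest folder prefix of `path` that has a label in `model`."""
    folders = tuple(path.split("\\"))[:-1]  # folders only, without the file name
    for end in range(len(folders), 0, -1):
        if folders[:end] in model:
            return folders[:end]
    return None


def is_violation(model, app, path, op):
    """Check one access against the label found at the longest prefix."""
    prefix = longest_prefix(model, path)
    if prefix is None:
        return True  # assumption: locations unknown to the model are violations
    label = model[prefix]
    return (app, op) not in label and ("*", op) not in label


# Hypothetical model: folder path -> label (set of access tokens).
model = {
    ("C:", "foo"): {("a", "write")},
    ("C:", "foo", "bar"): {("b", "exec")},
    ("C:", "Windows"): {("*", "read")},
}

print(is_violation(model, "a", "C:\\foo\\new.txt", "write"))
print(is_violation(model, "evil", "C:\\foo\\bar\\x.dll", "write"))
print(is_violation(model, "evil", "C:\\Windows\\x.ini", "read"))
```

Because the check depends only on where an access lands and which application makes it, the same model applies to every program on the system, which is what makes the approach system-centric.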
Detection Results (Files) • Looks sobering • Many malware samples don’t work (!) • 10,838 -> 7,847 • Use only write operations • Our own logging component • Software updates
Detection Results (Registry) • HKEY_USER\Software\Microsoft • Need a larger training set
Discussion and Conclusion • Full access activity model • 91% detection / 0% false positives • System-centric approach • Policy violations occurred only for a few specific classes of programs • Network limitation • MAC policy • SELinux