Application Recognition

Sam Larsen, Determina

Process Control • One method to improve computer security is through process control • Whitelist: user specifies what is allowed to run • Blacklist: user specifies what is not allowed to run • Strong customer interest • Disadvantages: • Difficult to administer • Hackers are learning to circumvent it

The Pesky Gray • Many applications won’t be black or white • Whitelist: a lot of work for the administrator • Currently, we identify applications via a checksum • New software introduces a new checksum • Every new upgrade/patch requires intervention • Blacklist: circumvention is getting common • Bad guys now create custom binaries just for you! • Small modifications defeat checksums • Many malware payloads are encrypted
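The checksum problem can be illustrated with a minimal sketch. SHA-256 and the names `checksum`, `is_allowed`, and `whitelist` are assumptions made for illustration; the slides only say "checksum" and do not name an algorithm or tool.

```python
import hashlib


def checksum(binary: bytes) -> str:
    """Return a hex digest identifying a binary image (SHA-256 is an assumed choice)."""
    return hashlib.sha256(binary).hexdigest()


def is_allowed(binary: bytes, allowed: set) -> bool:
    """Whitelist check: the binary may run only if its checksum is listed."""
    return checksum(binary) in allowed


# Hypothetical program image, and a copy with a single bit flipped (a tiny patch).
original = b"MZ" + b"example program image " * 4
patched = bytearray(original)
patched[10] ^= 0x01
patched = bytes(patched)

# The administrator approved only the original build.
whitelist = {checksum(original)}

print(is_allowed(original, whitelist))
print(is_allowed(patched, whitelist))
```

A one-bit change yields an unrelated digest, so every upgrade needs a new whitelist entry, and a blacklist entry for one malware build says nothing about a slightly modified one.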

Application Recognition • Can we automatically recognize a different version of a known application? • Migrate to blacklist/whitelist with little or no user intervention • Malware identification • Hackers are lazy: families of malware derived from the same code base

Approach • Observe runtime program behavior • Indirect branches pose no problem to analysis • Focus on the code that actually executes • Handle self-unpacking binaries • Potentially, observe runtime data • Apps derived from the same codebase should have similar runtime behavior • Different apps should have different behavior • First attempt: characterize an application by the stream of system calls it generates

Rationale for System Calls • System calls are the important events • Nearly identical binaries should generate nearly identical traces • Factor out small code changes • Low runtime overhead • Only take action at system calls

Application Communities • Application identification is most useful in an application community • Community data can be aggregated to form more complete application signatures • Once an application is recognized, it can be approved or disapproved for everyone • Prevent harm for most community members • Eliminate most of the overhead of recognition
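One way to picture community aggregation, as a hedged sketch: treat each member's signature as a set of short call sequences and merge them by set union. Both the set representation and the union rule are assumptions here; the slides do not say how signatures are combined. The function name `aggregate_signatures` is hypothetical.

```python
def aggregate_signatures(member_signatures):
    """Merge per-member signatures (sets of call sequences) into one community signature.

    Assumption: aggregation is a plain set union, so a sequence seen by any
    member becomes part of the shared signature.
    """
    combined = set()
    for signature in member_signatures:
        combined |= signature
    return combined


# Hypothetical observations of the same application on three machines.
member_a = {("open", "read"), ("read", "close")}
member_b = {("open", "read"), ("read", "write")}
member_c = {("write", "close")}

community = aggregate_signatures([member_a, member_b, member_c])
print(len(community))
```

Each member exercises only part of the application, so the merged signature covers more behavior than any single machine observes.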

Initial Experimental Results • Use DR to capture system call traces • Build database of all sequences of N calls • Example: For N=2 and sequence ABCD → AB, BC, CD • Measure of similarity between two apps: (T - d) / T, where T = # unique sequences across both apps and d = # sequences in one and not the other
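The sequence database and the similarity measure can be sketched as follows. This assumes the score is (T - d) / T, with T the number of unique sequences across both apps and d the number found in only one of them. The function names and the single-letter traces are illustrative, and returning 1.0 for two empty traces is an added assumption.

```python
def ngrams(trace, n):
    """Return the set of all length-n windows of consecutive calls in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def similarity(trace_a, trace_b, n):
    """Compute (T - d) / T for two traces.

    T = number of unique sequences across both apps (set union)
    d = number of sequences in one app and not the other (symmetric difference)
    """
    a = ngrams(trace_a, n)
    b = ngrams(trace_b, n)
    total = len(a | b)
    differing = len(a ^ b)
    if total == 0:
        return 1.0  # assumption: two empty traces count as identical
    return (total - differing) / total


# The slide's example: N=2 and sequence ABCD gives AB, BC, CD.
print(sorted(ngrams("ABCD", 2)))

# Two hypothetical traces that differ only in the last call.
print(similarity("ABCD", "ABCE", 2))
```

With these definitions the score is 1 when both apps produce exactly the same set of sequences and 0 when they share none.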

[Figures: results for system call traces with N = 2, N = 3, and N = 4; one slide each for Firefox 1.0.1, Firefox 1.5.0.1, Apache 1.3.17, Apache 2.0.47, Gaim 1.0.0, and Gaim 1.3.0]

Traces of API calls • Windows API is the primary system interface for Windows apps • More sensible to track sequences of API calls • At system call, examine the call stack to find the outermost API call • If not possible, default to system call
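A minimal sketch of the labeling step, under stated assumptions: the call stack is given as a list of (module, function) pairs with the innermost frame first, `API_MODULES` and `STUB_MODULES` are hypothetical lists, and "outermost API call" is read as the last API frame reached before application code when walking outward from the system call. The slides do not specify any of these details.

```python
# Hypothetical module lists; a real tool would need a complete, verified set.
API_MODULES = {"kernel32.dll", "user32.dll", "advapi32.dll"}
STUB_MODULES = {"ntdll.dll"}  # low-level system call stubs below the API layer


def label_event(syscall_name, stack):
    """Label a system call event with the outermost API call on its stack.

    stack: list of (module, function) pairs, innermost frame first.
    Falls back to the system call name when no API frame is found.
    """
    label = None
    for module, function in stack:
        name = module.lower()
        if name in API_MODULES:
            label = f"{module}!{function}"  # keep moving outward through API frames
        elif name in STUB_MODULES and label is None:
            continue  # skip stub frames sitting between the syscall and the API
        else:
            break  # reached application code: stop walking
    return label if label is not None else syscall_name


# Hypothetical stack captured at a file-open system call.
stack = [
    ("ntdll.dll", "NtCreateFile"),
    ("kernel32.dll", "CreateFileW"),
    ("app.exe", "load_config"),
    ("app.exe", "main"),
]
print(label_event("NtCreateFile", stack))

# No API frame on the stack: default to the system call itself.
print(label_event("NtClose", [("ntdll.dll", "NtClose"), ("app.exe", "main")]))
```

Stopping at the first application frame keeps API calls further out on the stack (for example, a dispatcher that invoked an application callback) from labeling every event with the same name.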

[Figures: results for API call traces with N = 2, N = 3, and N = 4; one slide each for Firefox 1.0.1, Firefox 1.5.0.1, Apache 1.3.17, Apache 2.0.47, Gaim 1.0.0, and Gaim 1.3.0]

Comparison with Traditional HIPS • Syscall-based intrusion detection/prevention • [Forrest & Hofmeyr] • Record traces during training, then monitor and compare in deployment • Problem with false positives • Syscall-based application recognition • Looking at general trends, thus some noise can be tolerated → false positives not an issue • More practical use of system call traces

Next Steps • Gather data for more applications • How can we match applications that make few system calls (e.g., calc)? • Compare families of malware • Build a sandbox? • Malicious code may be recognized too late
