272: Software Engineering Fall 2012

272: Software Engineering Fall 2012 Instructor: Tevfik Bultan Lecture 17: Code Mining

Code Mining • There is a lot of code that is available for everyone to access • Can we learn from it? • One of the active research directions in software engineering is to mine existing code for various purposes, such as • To discover common behaviors • which can then be used to extract specifications such as interfaces, usage patterns, etc. • To discover anomalies • which can then be used to find bugs or problematic behaviors

We will discuss two papers that do this • Today we will discuss two papers that use code mining for different purposes: • "Graph-based Mining of Multiple Object Usage Patterns" Tung Nguyen, Hoan Nguyen, Nam Pham, Jafar Al-Kofahi, and Tien Nguyen. 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2009). • "Mining Specifications of Malicious Behavior" Mihai Christodorescu, Somesh Jha, Christopher Kruegel. Joint meeting of European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2007).

Mining for Object Usage Patterns • We discussed papers that automatically extract behavioral interfaces for classes • The papers we discussed earlier focus on usage of a single class and try to identify the ordering of method calls on a single object • However, there may be usage patterns that involve multiple objects • Moreover, usage patterns may involve • control flow structures • such as calling a method within a loop • and data dependencies • such as one argument in a method call being dependent on an argument in another method call

GrouMiner • GrouMiner is a tool that extracts usage patterns for objects, taking into account both • temporal usage orders (as in the interface extraction papers we discussed already), and • data dependencies • It defines a graph-based object usage model (groum) and extracts these models from existing code

Object Usage Model • A groum is a directed acyclic graph • nodes are labeled, edges are not labeled • Nodes correspond to • actions: method calls, access to object fields • control flow structures: conditions, branches or loops such as if, while, for statements • Edges represent • temporal ordering: If A is used before (or always generated before) B then there is an edge from A to B • data dependency: There is an edge from A to B if there is a data dependency between A and B • A groum can represent multiple objects
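The model on this slide can be sketched as a small data structure. This is only an illustrative sketch, not GrouMiner's implementation: the class name `Groum`, its methods, and the example labels are assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass
class Groum:
    """Toy groum: labeled nodes, unlabeled directed edges (names are hypothetical)."""
    labels: dict = field(default_factory=dict)  # node id -> action or control label
    edges: set = field(default_factory=set)     # (source id, target id)

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, src, dst):
        self.edges.add((src, dst))

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
        indegree = {n: 0 for n in self.labels}
        for _, dst in self.edges:
            indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            n = ready.pop()
            removed += 1
            for src, dst in self.edges:
                if src == n:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        return removed == len(self.labels)


# Hypothetical usage involving two objects and one control structure.
g = Groum()
g.add_node(0, "StringBuffer.<init>")
g.add_node(1, "BufferedReader.<init>")
g.add_node(2, "WHILE")
g.add_node(3, "BufferedReader.readLine")
g.add_node(4, "StringBuffer.append")
for src, dst in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4), (1, 3)]:
    g.add_edge(src, dst)
```

Because edges are unlabeled, this sketch does not record whether an edge came from temporal ordering or from a data dependency.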

How to Extract the Object Usage Model? • The temporal ordering of the nodes in a groum is extracted from the AST by adding edges between nodes that are sequentially ordered • The data dependency edges are extracted using an intra-procedural dependency analysis • Identify the variables involved in each action to determine the dependencies and add edges to represent the dependencies
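A rough sketch of the two kinds of edges, under strong simplifying assumptions: the input is straight-line code already reduced to a list of actions, and "two actions share a variable" stands in for the real intra-procedural dependency analysis. The function name `extract_edges` and the example actions are hypothetical.

```python
def extract_edges(actions):
    """actions: list of (label, set of variable names) in source order.

    Returns (temporal_edges, data_edges) as sets of (earlier index, later index).
    """
    # Temporal ordering: connect each action to the one that follows it.
    temporal = {(i, i + 1) for i in range(len(actions) - 1)}
    # Data dependency (assumed approximation): the two actions share a variable.
    data = set()
    for i, (_, vars_i) in enumerate(actions):
        for j in range(i + 1, len(actions)):
            if vars_i & actions[j][1]:
                data.add((i, j))
    return temporal, data


actions = [
    ("FileReader.<init>", {"reader"}),
    ("StringBuffer.<init>", {"buf"}),
    ("FileReader.read", {"reader", "ch"}),
    ("StringBuffer.append", {"buf", "ch"}),
    ("FileReader.close", {"reader"}),
]
temporal, data = extract_edges(actions)
```

Branches and loops, which the tool reads from the AST, are left out of this sketch.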

Extracting Code Skeletons • A groum model can be un-parsed and converted to a code skeleton • This code skeleton will demonstrate the usage pattern as code rather than a directed acyclic graph • This approach can be used as a reverse engineering approach to discover different usage patterns in the code • However, there will be many groums in a given code base and not all of them should be reported as usage patterns • They should be filtered somehow

Usage Pattern Mining • GrouMiner uses graph mining techniques to identify the common usage patterns in the code • They determine the frequency of a pattern by computing the number of independent occurrences • If the frequency of a pattern is higher than a threshold, then it is reported • The graph mining algorithm determines the common patterns efficiently by • identifying common graph patterns incrementally, starting with graphs with a small number of nodes and then finding other patterns based on the sub-graph relationship • checking equivalence of patterns approximately using a vector representation that summarizes the features of a pattern, rather than doing an exact matching
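The thresholding and the approximate, vector-based comparison can be illustrated as follows. The features used here (counts of node labels and of label pairs along edges) are an assumption chosen for illustration, not the features from the paper, and the occurrences passed in are assumed to be independent already. Incremental pattern growth is not shown.

```python
from collections import Counter


def feature_vector(labels, edges):
    """Summarize a graph by counts of node labels and of (source, target) label pairs."""
    features = Counter(("node", label) for label in labels.values())
    features.update(("edge", labels[s], labels[d]) for s, d in edges)
    return frozenset(features.items())


def frequent_patterns(occurrences, threshold):
    """Group occurrences with equal feature vectors; keep groups seen >= threshold times."""
    counts = Counter(feature_vector(labels, edges) for labels, edges in occurrences)
    return {vec: n for vec, n in counts.items() if n >= threshold}


# Hypothetical occurrences: the first two differ only in node numbering.
open_close = ({0: "File.open", 1: "File.close"}, {(0, 1)})
same_renumbered = ({7: "File.open", 9: "File.close"}, {(7, 9)})
open_only = ({0: "File.open"}, set())

patterns = frequent_patterns([open_close, same_renumbered, open_only], threshold=2)
```

Comparing vectors is cheaper than exact graph matching, at the cost that two different graphs with the same feature counts would be treated as the same pattern.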

Anomaly Detection • Using the graph mining algorithm they can identify anomalous usages • They identify an anomalous usage as a sub-graph of an identified pattern that is not extensible to that pattern • This is considered a violation of the pattern • A violation is considered an anomaly when it is too rare • i.e., common violations are not reported as anomalies • They discuss two types of anomaly detection: 1) anomaly detection in a given project, 2) anomaly detection when a project changes • Anomaly detection can be used to identify errors • An anomalous usage may correspond to a violation of an interface and may point to a bug • However, when anomaly detection is used as a bug finding approach it generates a lot of false positives (87.8% in one case) • i.e., many identified anomalies do not correspond to errors
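The "too rare" filter can be sketched as a ratio test. The ratio form, the cutoff `max_ratio`, and the example counts are all assumptions made for illustration; the slide only says that common violations are not reported.

```python
def rare_violations(pattern_count, violation_counts, max_ratio=0.1):
    """Return violations whose count is small relative to the pattern's count.

    pattern_count must be positive; max_ratio is a hypothetical cutoff.
    """
    return sorted(
        name
        for name, count in violation_counts.items()
        if count / pattern_count <= max_ratio
    )


# Hypothetical counts: the pattern occurs 100 times in the code base.
violations = {"open without close": 2, "open without read": 30}
anomalies = rare_violations(pattern_count=100, violation_counts=violations)
```

Here the violation seen 30 times is treated as an alternative usage rather than an anomaly, while the one seen twice is reported.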

Mining Specifications of Malicious Behavior • In the second paper we are discussing, code mining is used to find specifications of malicious behavior • Computer security applications rely on manually written specifications to identify malicious code automatically • However, the manual specification task is hard and time consuming • This paper tries to automate the specification of malicious behavior

The approach • The presented approach works in three steps • Collect execution traces from malware and benign programs • Construct the corresponding dependence graphs • Compute specification of malicious behavior as difference of dependence graphs • Note that in this approach mining is done on the execution traces • In the paper we discussed earlier, mining was done on the source code

How to represent behavior? • They identify some requirements for representation of behaviors: • A specification must not contain independent operations • A specification must relate the dependent operations • A specification should capture only security relevant operations • To meet these requirements they focus only on system calls and represent malicious behavior as a dependence graph of system calls • This representation satisfies their requirements • Independent calls will not be connected in this representation • Dependent calls will be connected • Only the system calls will be tracked since they correspond to the security relevant operations

How to represent the behavior? • The behavior is represented as a special type of dependence graph • Since they are interested in system security, they decide to model execution behavior as a sequence of system calls • Each node of the dependence graph they construct corresponds to a system call • The edges of the dependence graph correspond to constraints that represent the dependences between two system calls • Such as argument 1 of call1 being equal to argument 2 of call2

More on dependence graphs • The dependence graphs they construct are directed acyclic graphs • Each node corresponds to a system call • They define a simple type system for the arguments of the system calls • Edges represent dependencies which are characterized as logic formulas • A logic system that allows constraints with modular and bit-vector arithmetic, arrays, and existential and universal quantifiers is sufficient
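A minimal sketch of building such a graph from a trace, assuming a much simpler constraint language than the one on the slide: the only dependence recorded is "the return value of an earlier call equals an argument of a later call". The call names, trace format, and function name are hypothetical.

```python
def build_dependence_graph(trace):
    """trace: list of (call name, argument tuple, return value or None).

    Returns edges (i, j, constraint) where the return value of call i
    reappears as an argument of a later call j.
    """
    edges = []
    for i, (_, _, result) in enumerate(trace):
        if result is None:
            continue
        for j in range(i + 1, len(trace)):
            for pos, arg in enumerate(trace[j][1]):
                if arg == result:
                    edges.append((i, j, f"ret(call{i}) == arg{pos}(call{j})"))
    return edges


# Hypothetical, simplified system calls; handles are plain integers.
trace = [
    ("open", ("self.exe",), 0x44),
    ("create", ("copy.exe",), 0x48),
    ("read", (0x44,), None),
    ("write", (0x48,), None),
    ("close", (0x44,), None),
]
edges = build_dependence_graph(trace)
```

Edges always point from an earlier call to a later one, so the result is acyclic, and calls that share no values stay unconnected.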

Comparing Benign Programs and Malware • The presented approach first constructs the dependence graphs for the execution traces of the benign program and the malicious programs • Then they construct the minimal contrast subgraph of a malware dependence graph and the benign dependence graph • The smallest subgraph of the first graph that does not appear in the second
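To give a feel for the contrast step, the sketch below makes a strong assumption that is not in the paper: every node label is unique within a graph, so a graph is just a set of (source label, target label) edges. Under that assumption the search collapses to a set difference, and each returned edge is by itself a smallest subgraph of the malware graph that is absent from the benign graph. The general problem requires subgraph matching. All labels are hypothetical.

```python
def contrast_edges(malware_edges, benign_edges):
    """Edges of the malware graph that are absent from the benign graph."""
    return sorted(set(malware_edges) - set(benign_edges))


malware = {
    ("open_self", "read"),
    ("read", "write_copy"),
    ("write_copy", "set_autostart"),
}
benign = {
    ("open_self", "read"),
    ("read", "display"),
}
contrast = contrast_edges(malware, benign)
```

The shared edge (opening and reading a file) is dropped because benign programs do it too; what remains is the behavior that distinguishes the malware.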

Empirical evaluation • The presented approach is applied to 16 well-known malware examples • For these 16 examples, the algorithm successfully discovers the same behavioral features as those independently provided by human experts
