
Intelligent Detection of Malicious Script Code





  1. Intelligent Detection of Malicious Script Code CS194, 2007-08 Benson Luk Eyal Reuveni Kamron Farrokh Advisor: Adnan Darwiche Sponsored by Symantec

  2. Outline for Project
  Phase I: Setup
  • Set up a machine for the testing environment
  • Ensure that the “whitelist” is clean
  Phase II: Crawling
  • Modify the crawler to output only necessary data. This means:
  • Grab only the necessary information from webcrawling results
  • Listen in on Internet Explorer’s Javascript interpreter and output relevant behavior
  Phase III: Database
  • Research and develop an effective structure for storing data, and link it to the webcrawler
  Phase IV: Analysis
  • Research trends for normalcy and investigate possible heuristics

  3. Approach to Project First Quarter : Infrastructure Second Quarter : Data Gathering Third Quarter : Data Analysis (Note: some overlap between quarters)

  4. Infrastructure
  Internet Explorer 7, Windows XP SP2 Professional
  • Main testing environment
  Norton Antivirus
  • Protects against malicious files and scripts
  • Can access logs to determine which sites launched attacks
  • Integrated into automated site visiting

  5. Infrastructure
  CanaryCallback.dll
  • Plugin for Internet Explorer
  • Able to access most data received by the low-level Javascript interpreter:
  • The function being called (DISPID)
  • The class the function belongs to (GUID)
  • The list of types and values of parameters passed into the function. Examples:
  • VT_I4: 4-byte integer
  • VT_BSTR: Byte string
  • VT_DISPATCH: Object
  • A large part of the first and second quarters was spent programming, debugging, and maintaining the functions that handle this data:
  • Functions to grab the data type
  • Functions to parse data values (some stored in bitstreams)
  • Functions to output data to file
  • If a type had no obvious output format (e.g. VT_DISPATCH), we had to create one that would accurately represent as many components of the data as possible
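The type-handling functions described on this slide might look like the following sketch. The function names and the exact log format here are illustrative, not the actual CanaryCallback code:

```python
# Sketch of a type dispatcher for logging intercepted arguments.
# Type tags follow the VT_* names on the slide; the output format
# is an invented example, not the project's real one.

def format_argument(vt_type, value):
    """Render one intercepted argument as 'TYPE:value' for the log file."""
    if vt_type == "VT_I4":          # 4-byte integer
        return "I4:%d" % value
    if vt_type == "VT_BSTR":        # string argument: record its length too
        return "BSTR(len=%d):%s" % (len(value), value)
    if vt_type == "VT_DISPATCH":    # object: no obvious textual form,
        return "DISPATCH:<object>"  # so use a fixed placeholder
    return "UNKNOWN"

def format_call(guid, dispid, args):
    """One log line per intercepted function call."""
    rendered = ",".join(format_argument(t, v) for t, v in args)
    return "%s %d [%s]" % (guid, dispid, rendered)
```

Each call then produces one line pairing the GUID/DISPID with its rendered argument list.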

  6. Infrastructure
  Python
  • Scripting language designed to handle parsing with ease
  • The infrastructure script performs three tasks:
  • Launch Internet Explorer (via the cPAMIE engine), load a website, then close Internet Explorer
  • Access and parse Norton’s web-attack logs for any attacks launched by the website
  • Sort script data from the CanaryCallback DLL based on the DLL data and attack logs (Was there an attack? Did any scripts run? Etc.)
  Heritrix
  • Open-source webcrawler with high customizability
  • Can run specific crawls that target a set of domains and output minimal information
  • Uses HTTP requests; does not render crawled sites
  • The purpose is to gather as many URLs with scripts as possible for a large sample base

  7. Infrastructure: Crawler
  (Components: URL queue, Heritrix crawler, Python parser, WWW, raw and parsed data files)
  Step 0: URL queue is “seeded” with the domain list
  Step 1: Crawler grabs a URL from the queue
  Step 2: Crawler grabs the page source from the URL
  Step 3: URLs are appended to the log data and URL queue iff they satisfy our set of rules
  Step 4: Excess data is stripped, leaving only URL information for each site, which is output to a new file
  Repeat steps 1-4 until the crawl limit is reached.
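The crawl loop in steps 0-4 can be sketched as a breadth-first queue. Here `fetch_links` and `accept` are caller-supplied stand-ins for Heritrix's page fetching and URL rules; all names are illustrative:

```python
from collections import deque

def crawl(seeds, fetch_links, accept, limit):
    """Breadth-first crawl sketch: seed the queue (step 0), pop a URL
    (step 1), fetch the links on its page (steps 2-3), and append only
    URLs that satisfy our rules (step 3). The parsed output keeps only
    URL information per site (step 4)."""
    queue = deque(seeds)
    seen = set(seeds)
    log = []
    while queue and len(log) < limit:
        url = queue.popleft()
        log.append(url)                      # parsed output: URL only
        for link in fetch_links(url):
            if link not in seen and accept(link):
                seen.add(link)
                queue.append(link)
    return log
```

A tiny in-memory link graph is enough to exercise the loop.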

  8. Infrastructure: Gatherer
  (Components: Python controller, Internet Explorer 7, Norton Antivirus logs, CanaryCallback data, Heritrix parsed data)
  Step 1: Python script grabs a site from the crawl data
  Step 2: The cPAMIE component loads IE and sends it to the specified site
  Step 3: IE7’s Javascript interpreter outputs to a file containing all DLL data
  Step 4: IE7 informs PAMIE that it is finished; Python kills IE7
  Step 5: Python analyzes the callback data and logs to decide whether a site is clean, dirty, or has no scripts
  Step 6: Python outputs sorted and formatted data to the relevant files for future analysis
  Repeat steps 1-6 until the URL list is exhausted.
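Step 5's clean/dirty/no-scripts decision can be sketched roughly like this. The data structures (a set of attacking URLs, a map from URL to intercepted calls) are assumed for illustration, not the project's actual formats:

```python
def classify_site(url, attack_log, callback_data):
    """Decide whether a visited site is dirty (Norton logged an attack),
    clean (scripts ran but no attack), or script-free.
    attack_log: set of URLs that launched attacks.
    callback_data: dict mapping URL -> list of intercepted script calls."""
    if url in attack_log:
        return "dirty"
    if callback_data.get(url):
        return "clean"
    return "no scripts"
```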

  9. Data gathering
  Heritrix crawls
  • First crawl: 5 seeds, depth 5 (5 million sites found)
  • Second crawl: 10 seeds, depth 3 (3 million sites found)
  • Third crawl: 200 seeds, depth 1 (18,500 sites found)
  • Fourth crawl: 200 seeds, depth 2 (3 million sites found)
  • The first two crawls produced data biased toward large, interlinked sites; the last two broad crawls were run to remedy this.
  CanaryCallback gathering
  • For the first and second crawls, a chosen set of roughly 1,000 sites was run through the gatherer component.
  • For the third crawl, all 18,500 sites were processed by the gatherer.
  • For the fourth crawl, two tasks were performed: 20,000 sites were processed by the gatherer, and the same 1,000 sites were processed 28 times (about 4 times per day) from May 7 to May 13.

  10. Data analysis setup
  CanaryCallback data analysis
  • Main choice for parsing the data was the Python scripting language
  • Too much data for MS Access or even MySQL
  • Python scripts were developed to facilitate analysis in a manner similar to SQL:
  • Scripts to aggregate data sets and frequencies
  • Scripts to calculate various metrics of a data set: smallest data point, largest data point, average, variance, total number of data points, and sum
  • Scripts to output to file in CSV format for deeper analysis in Excel
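The SQL-style metric scripts might be sketched as below; the function names and CSV layout are illustrative, not the project's actual scripts:

```python
def metrics(points):
    """Compute the six aggregate statistics listed above for one data set:
    min, max, mean, (population) variance, count, and sum."""
    n = len(points)
    total = sum(points)
    mean = total / float(n)
    var = sum((x - mean) ** 2 for x in points) / float(n)
    return {"min": min(points), "max": max(points), "mean": mean,
            "variance": var, "count": n, "sum": total}

def to_csv_row(name, m):
    """One CSV line per data set, ready to open in Excel."""
    keys = ["min", "max", "mean", "variance", "count", "sum"]
    return ",".join([name] + [str(m[k]) for k in keys])
```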

  11. Individual data analysis
  • The third quarter and the last half of the second quarter were spent covering as wide a range of the data as possible
  • To accomplish this, our group split up, with each member pursuing a different line of research
  • Individual presentations follow:
  • Eyal: Activity categorization
  • Benson: Integer argument trend analysis
  • Kamron: Byte string argument trend analysis

  12. Activity Categorization

  13. Activity Analysis
  • There is an obvious connection between a function and the site using it
  • Is it possible to quantify this relationship and establish whether certain functions are used in a specific kind of site?
  • Characterize a site based on how active it is, i.e., how many function calls are made while the site is loaded
  • Does there exist a pattern in the data that can distinguish abnormal usage of a function based on the characteristics of the site?

  14. Site Function Usage Statistics
  Overall:
  • Total number of sites: 14,848
  • Average function calls per site: 5,777
  • Average function calls per function: 1,984
  • Standard deviation of function calls per function: 25,493
  • Standard deviation of function calls per site: 14,181
  Box-and-whisker statistics:
  • Median: 1,456
  • First quartile: 438
  • Third quartile: 4,029
  • Interquartile range: 3,591 (minus outliers: none)
  • Lower whisker starts at: 0
  • Upper whisker ends at: 9,365
  • “Box and whisker” outliers: 2,048 (minus outliers: none)
  Normal distribution:
  • Three standard deviations below: 0
  • Two standard deviations below: 0
  • One standard deviation below: 12,086
  • One standard deviation above: 1,633
  • Two standard deviations above: 510
  • Three standard deviations above: 296
  • Normal-distribution outliers: 323
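The whisker and outlier figures above could be computed along these lines. This sketch uses the median-of-halves quartile convention and 1.5 × IQR fences, which the slides do not spell out, so treat it as one plausible reconstruction:

```python
def box_whisker(points):
    """Box-and-whisker sketch: quartiles via the median-of-halves rule,
    whiskers at the last data points inside 1.5 * IQR of the box, and
    everything beyond the whiskers counted as outliers."""
    s = sorted(points)

    def median(xs):
        n = len(xs)
        mid = n // 2
        return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2.0

    half = len(s) // 2
    q1, q3 = median(s[:half]), median(s[-half:])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in s if lo_fence <= x <= hi_fence]
    outliers = [x for x in s if x < lo_fence or x > hi_fence]
    return inside[0], inside[-1], outliers   # whisker ends and outliers
```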

  15. Correlation analysis
  • Related each function to the site calling it, using the number of function calls on that site
  • Each tuple consisted of the number of times a function was called at a particular site and the total number of function calls made at that site
  • The correlation between the two variables in the tuple was computed for each individual function
  • Many functions were uncommon, so not enough data was available to draw a conclusion about them
  • For the functions called by enough (over 100) sites, the correlation values were between .004 and -.01, showing no correlation between a function and the script activity of the site calling it
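The per-function correlation can be sketched as a standard Pearson computation over the (calls-to-function-at-site, total-calls-at-site) tuples described above; this is an assumed implementation, not the project's script:

```python
def correlation(pairs):
    """Pearson correlation coefficient over a list of (x, y) tuples,
    e.g. x = calls to one function at a site, y = total calls there."""
    n = float(len(pairs))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

Values near zero, like the .004 to -.01 range reported on the slide, indicate no linear relationship.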

  16. Function Usage Amount
  • An interesting trend arose when analyzing the correlation data
  • There are functions that are called hundreds or thousands of times in total
  • Despite this, sites seem to call a specific function only a couple of times each
  • Example: GUID 3050f3fd-98b5-11cf-bb82-00aa00bdec0b, DISPID 1 is called 346 times; in only 11 sites (3.2%) is it called more than 3 times

  17. Categorization Approach
  • Since no correlation was found, another approach was taken
  • According to trends in the script activity data, divide the sites into distinct categories
  • Examine the function behavior in each category, as opposed to in individual sites
  • Three categories were chosen, split roughly at the median and at the end of the third quartile
  • This gave one category 50% of the data and the other two 25% each
  • This was an attempt to avoid bias toward the extremely script-heavy sites

  18. Categorization Heuristic
  A heuristic was developed to determine whether a function would be more likely to appear in a certain category:
  F = ((avgl - avgsite) * (L - avgfunc) + (avgm - avgsite) * (M - avgfunc) + (avgh - avgsite) * (H - avgfunc)) / 3
  • avgl, avgm, and avgh are the average numbers of function calls per category (542, 2,882, and 22,745 respectively)
  • avgsite is the overall average number of function calls per site (5,777)
  • avgfunc is the average number of function calls per function (1,984)
  • L, M, and H are the numbers of times the function was called in the low, medium, and high categories
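The heuristic can be written out directly; the category and overall averages are the values quoted on the slide, and the function name is illustrative:

```python
# Averages quoted on the slide.
AVG_L, AVG_M, AVG_H = 542, 2882, 22745   # avg function calls per category
AVG_SITE = 5777                          # avg function calls per site
AVG_FUNC = 1984                          # avg function calls per function

def heuristic(low, mid, high):
    """F from the slide: a covariance-like score between a function's
    per-category call counts (L, M, H) and the categories' activity.
    Positive when a function's calls concentrate in the high-activity
    category, negative when they concentrate in the low one."""
    return ((AVG_L - AVG_SITE) * (low - AVG_FUNC) +
            (AVG_M - AVG_SITE) * (mid - AVG_FUNC) +
            (AVG_H - AVG_SITE) * (high - AVG_FUNC)) / 3.0
```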

  19. Statistical Variation Among Categories
  • The heuristic separated the functions into three distinct sections
  • Along the higher values were mostly functions that had few arguments supplied
  • In the middle were whole objects (a GUID and all of its related function calls)
  • At the lowest negative values were functions that were commonly called with arguments

  20. Argument Distributions
  • A further analysis examined whether a function behaves differently in the separate categories
  • The distributions of BSTR (byte string) lengths and I4 (4-byte integer) values were considered
  • Several functions were examined, but this specific one (referred to as “Second”, as it had the second-highest heuristic value) is exemplary of the trends noticed
  The argument type frequency of “Second”:
  LOW:  0 arguments: 20,713 | I4: 0 | BSTR: 2,634 | DISPATCH: 14 | NULL: 0 | BOOL: 0
  MID:  0 arguments: 170,861 | I4: 0 | BSTR: 9,888 | DISPATCH: 1 | NULL: 0 | BOOL: 0
  HIGH: 0 arguments: 1,215,964 | I4: 0 | BSTR: 9,447 | DISPATCH: 19 | NULL: 0 | BOOL: 0

  21. Conclusions of Approach
  • There is no major statistical difference in the argument value distributions among the categories, but there are distinct characteristic differences
  • Functions that appear more commonly in less-active sites tend to have arguments supplied to them
  • No general correlation exists between a function and how active the site calling it is
  • There may, however, exist a correlation with some other characteristic

  22. Integer analysis

  23. Functions through Three Sets
  Looked through 3 of the runs:
  • 5 seeds, depth 5: 1,324 sites
  • 10 seeds, depth 3: 1,184 sites
  • 200 seeds, depth 1: 15,790 sites
  Picked the three most common functions with integer arguments from the first run to analyze
  Goal: look for consistency in function behavior across differing sets of sites

  24. Functions through Three Sets
  • In all three data sets, the values of the argument had a very large range, from 0 into the millions or billions
  • Distributions did not stay consistent across the sets; each had different commonly occurring values

  25. Functions through Three Sets
  • Similar pattern in all 3 sets
  • Low values were used
  • Numbers near 0 were most common; occurrences drop off as values get larger

  26. Functions through Three Sets
  • Values range from 0 into the hundreds
  • The second data set did not have enough data
  • Similar common numbers in the other two sets: 3, 300, and 728

  27. Patterns in DISPID Usage
  • Looked at which DISPIDs were used, without regard to the GUIDs of the calling classes
  • DISPIDs had a large range, from lows of less than -2 billion to highs of over 3 million
  • Out of 743,270 functions analyzed, the vast majority had DISPIDs within 4 distinct ranges
  • 205 of the functions did not fall within these groups, and instead had one of 6 other numbers
  • Within each of the four ranges, occurrences at specific numbers formed patterns
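Bucketing DISPIDs into the four observed ranges can be sketched as follows; the range bounds are taken from the slides, and the names are illustrative:

```python
# The four DISPID ranges observed in the data (bounds from the slides).
RANGES = [
    ("first",  3000000, 3001286),
    ("second", 0, 2313),
    ("third",  -2147417109, -2147411105),
    ("fourth", 10001, 10087),
]

def bucket(dispid):
    """Assign a DISPID to one of the four observed ranges, or 'other'."""
    for name, lo, hi in RANGES:
        if lo <= dispid <= hi:
            return name
    return "other"
```

A frequency count per bucket over the 743,270 functions would then reproduce the percentages on the following slides.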

  28. DISPID Usage – First Range
  The most common range for DISPIDs: 3,000,000-3,001,286
  • 490,201 functions, about 66%
  • 1,067 out of 1,286 different numbers used
  • Numbers nearer to 3 million are most common; higher numbers were used less

  29. DISPID Usage – Second Range
  The second most common range for DISPIDs: 0-2,313
  • 164,224 functions, about 22%
  • 39 numbers in this range were used
  • 0 and 1,103 were the most common
  • Numbers clumped around 5 groups: 0-9, 127-154, 1002-1168, 1500-1504, and 2001-2015, with 2,313 being an exception

  30. DISPID Usage – Third Range
  The third range for DISPIDs: -2,147,417,109 to -2,147,411,105
  • 50,541 functions, about 7%
  • 55 numbers in this range were used
  • Most occurrences were at numbers ending in round thousands

  31. DISPID Usage – Fourth Range
  The fourth range for DISPIDs: 10,001-10,087
  • 38,099 functions, about 5%
  • 75 of the 87 numbers in the range were used
  • Uniquely used by GUID 3050f55d-98b5-11cf-bb82-00aa00bdce0b
  • DISPIDs 10,001-10,007 are the most common

  32. Patterns in DISPID Usage
  • Looked at which DISPIDs were used, without regard to the GUIDs of the calling classes
  • DISPIDs had a large range, from lows of less than -2 billion to highs of over 3 million
  • Out of 743,270 functions analyzed, the vast majority had DISPIDs within 4 distinct ranges
  • Within each of the four ranges, occurrences at specific numbers formed patterns

  33. Functions with Multiple Integers
  Looked for patterns in the relations among the integer arguments of functions taking multiple arguments
  • Not very many functions fall in this category
  • One took two arguments; the first was always 0
  • One took two arguments that were always the same. Arguments were all from (1,1) to (31,31) and (1908,1908) to (1908)
  • All came from 2 signup sites on a particular website
  • Two took two differing arguments; no relation between the arguments could be found
  • Other functions did not have a large enough sample size

  34. Functions with Multiple Integers
  • The function itself had consistent patterns in the values it took: 95% of arguments were (1,1) or (3,2)
  • No consistent relations between the arguments

  35. Function Pairs
  Examined GUID 3050f55d-98b5-11cf-bb82-00aa00bdce0b, DISPIDs 10001-10062
  • Out of 38,099 occurrences, 3,595 were followed by GUID c59c6b12-f6c1-11cf-8835-00a0c911e8b2, DISPID 0
  • The second function had no independent occurrences
  • Similar arguments: the first function took a variety of numbers and types of arguments
  • The second function always took a DISPATCH argument, followed by the same arguments as the first function
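The pairing check can be sketched over an ordered call trace; this is an assumed reconstruction, not the actual analysis script:

```python
def pair_stats(call_sequence, first, second):
    """Over an ordered call trace, count how often `second` appears
    immediately after `first` (followed) versus without `first` right
    before it (independent). A dependent function, like the one on the
    slide, shows zero independent occurrences."""
    followed = 0
    independent = 0
    for i, call in enumerate(call_sequence):
        if call == second:
            if i > 0 and call_sequence[i - 1] == first:
                followed += 1
            else:
                independent += 1
    return followed, independent
```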

  36. Conclusions of Approach
  Function arguments across sets:
  • There seem to be consistent patterns in certain functions: range, values taken, common values, value distribution
  DISPID usage:
  • 4 ranges with very few exceptions
  • Common subranges or distribution patterns within each range
  Multiple arguments:
  • An uncommon type of function
  • No noticeable relations between arguments
  Function pairs:
  • Dependent functions have clear patterns: function position, argument types and values
  • Only one example found; do more exist?

  37. Byte string analysis

  38. Byte String Analysis
  • Buffer overflows are a common method of exploiting a targeted system
  • One method: create a very long string to overrun boundary checking, with shellcode appended at the end to be injected into the running code
  • We are interested in the lengths of the BSTR objects fed into given functions
  • For any given API, what is considered a normal string length?
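A per-API length profile of the kind described might be sketched like this; the three-sigma threshold is an illustrative choice, not the project's rule:

```python
def length_profile(lengths):
    """Mean and (population) standard deviation of the observed
    BSTR lengths for one API."""
    n = float(len(lengths))
    mean = sum(lengths) / n
    sd = (sum((x - mean) ** 2 for x in lengths) / n) ** 0.5
    return mean, sd

def is_suspicious(length, mean, sd, k=3):
    """Flag a string far longer than this API normally sees: a crude
    buffer-overflow signal. The cutoff mean + k*sd is an assumption."""
    return length > mean + k * sd
```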

  39. Class-based analysis
  • Initial analyses were done on a class-by-class basis
  • Samples were grouped together and analyzed according to GUID
  • Byte strings are typically very small
  • More than 70% of the commonly called Javascript classes typically received byte strings of length less than 20 (39 out of 55 from this crawl)
  • Less than 10% of these ever received a string greater than 5,000 characters in length (4 out of 55 from this crawl)

  40. Class-based analysis
  • Analysis of individual classes shows the same trend toward smaller strings
  • However, analyzing based on classes groups the byte strings of all of a class’s functions together, which results in inaccuracy and lost information

  41. Parameter-based analysis
  • The second analysis split samples into the individual arguments of each unique function of each class
  • Given a sample set with values in the interval (a, b), average μ, and standard deviation σ, we expect values to largely lie within the interval (μ - σ, μ + σ)
  • We also expect (μ - σ, μ + σ) to be smaller than (a, b)
  • The smaller (μ - σ, μ + σ) is in proportion to (a, b), the more well-defined our sample set becomes

  42. Parameter-based analysis
  • Length of expected interval: 2σ
  • Length of entire interval: n = b - a + 1
  • 2σ/n represents the ratio of the expected interval to the entire interval
  • Since 2σ < n, 0 < 2σ/n < 1
  • When 2σ/n = 0, σ = 0 and all values in the data set are equal
  • When 2σ/n = 1, σ = n/2 and all values in the data set equal either a or b
  • As 2σ/n goes from 0 to 1, the shape of the graph shifts
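The 2σ/n ratio can be computed directly from a sample, with n following the slide's definition n = b - a + 1; the function name is illustrative:

```python
def spread_ratio(values):
    """2*sigma / n from the slide: the one-standard-deviation interval's
    length relative to the full observed interval, where
    n = max - min + 1. Near 0 means a tightly defined sample set;
    near 1 means values pile up at the extremes."""
    m = float(len(values))
    mean = sum(values) / m
    sigma = (sum((x - mean) ** 2 for x in values) / m) ** 0.5
    n = max(values) - min(values) + 1
    return 2.0 * sigma / n
```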
