Intelligent Detection of Malicious Script Code
CS194, 2007-08
Benson Luk, Eyal Reuveni, Kamron Farrokh
Advisor: Adnan Darwiche
Sponsored by Symantec
Outline for Project
• Phase I: Setup
  • Set up machine for testing environment
  • Ensure that “whitelist” is clean
• Phase II: Crawling
  • Modify crawler to output only necessary data. This means:
    • Grab only necessary information from web-crawling results
    • Listen in on Internet Explorer’s JavaScript interpreter and output relevant behavior
• Phase III: Database
  • Research and develop an effective structure for storing data and link it to the web crawler
• Phase IV: Analysis
  • Research and develop an effective algorithm for learning from massive amounts of data
Completed Tasks – First Quarter
• Phase I
  • Configured machine with Norton Antivirus and the Heritrix web crawler
    • The web crawler is used to gather additional URLs, and Norton Antivirus is used to verify that a URL has not launched an attack
  • Created a Python script to ensure that visited sites are clean
    • Captures Norton’s web attack logs before and after loading a site in Internet Explorer, compares the logs for new entries, and signals whether the site’s data should be discarded
• Phase II
  • Configured Heritrix to run specific crawls that target a set of domains and output minimal information
    • The purpose is to gather as many URLs with scripts as possible for a large sample base
  • Created a parser for Heritrix logs to filter out irrelevant websites
    • For example, we omit URLs that point to images, since they will not contain scripts
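The two Phase I/II scripts described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the representation of a log as a list of text lines, and the list of image extensions are all assumptions, since the slides do not show the Norton log format or the Heritrix parser.

```python
from urllib.parse import urlparse

# Assumed (hypothetical) set of extensions treated as image resources.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico")


def new_log_entries(before_lines, after_lines):
    """Return log lines present after the page load but not before it."""
    seen = set(before_lines)
    return [line for line in after_lines if line not in seen]


def site_is_clean(before_lines, after_lines):
    """Treat a site as clean when loading it added no attack-log entries."""
    return not new_log_entries(before_lines, after_lines)


def may_contain_script(url):
    """Reject URLs whose path ends in a known image extension."""
    path = urlparse(url).path.lower()
    return not path.endswith(IMAGE_EXTENSIONS)
```

Comparing whole lines assumes each log entry is unique (for example, because it carries a timestamp); a repeated identical entry would go unnoticed under this sketch.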
Completed Tasks – Second Quarter
Phase I
• Whitelist: integrated a Symantec component that checks whether a visited site is malicious, so all of the data we gather is from clean sources
• Hard drive: installed a 750 GB hard drive
Completed Tasks – Second Quarter
Phase II
• Crawling: We ran a shallow crawl seeded with 200 domains; it is the current base of our data. The crawl yielded 18,500 URLs, which we run through our Script Listening component
Completed Tasks – Second Quarter
Phase II
• Script Listening: received a customizable tool from Symantec that listens to the JavaScript interpreter in Internet Explorer
• We modified it to output the information we need: GUID -> DISPID -> ArgType -> ArgVal
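To make the GUID -> DISPID -> ArgType -> ArgVal record concrete, here is a small parsing sketch. The on-disk layout is an assumption: the slides do not show the listener's real output, so the tab separator, the one-record-per-line layout, and the `CallRecord` and `parse_record` names are hypothetical.

```python
from typing import NamedTuple, Optional


class CallRecord(NamedTuple):
    guid: str                 # interface GUID of the object being called
    dispid: int               # dispatch ID of the member being invoked
    arg_type: Optional[str]   # None when the call carries no argument
    arg_val: Optional[str]    # raw argument text, left unconverted


def parse_record(line, sep="\t"):
    """Parse one line of the assumed 'GUID, DispID, ArgType, ArgVal' layout."""
    parts = line.rstrip("\n").split(sep, 3)
    guid, dispid = parts[0].lower(), int(parts[1])
    arg_type = parts[2] if len(parts) > 2 else None
    arg_val = parts[3] if len(parts) > 3 else None
    return CallRecord(guid, dispid, arg_type, arg_val)
```

Keeping `arg_val` as raw text defers any type conversion to the analysis step, which can decide per `arg_type` how to interpret it.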
Completed Tasks – Second Quarter
Example of data:
Completed Tasks – Second Quarter
Phase III
• The data set is too large for a database on our hardware: the plain-text file is 4 GB (~50 million function calls), and querying a database of that size is too slow on the computer we have.
• Instead, we store the data as a text file and operate on it with Python scripts.
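One way to operate on a multi-gigabyte text file from Python is to stream it line by line rather than load it into memory. The sketch below assumes tab-separated lines whose first two fields are the GUID and DispID; that layout and both function names are assumptions, not the project's actual scripts.

```python
from collections import Counter


def count_calls(lines, sep="\t"):
    """Count calls per (GUID, DispID) pair in a single streaming pass."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[(fields[0], fields[1])] += 1
    return counts


def count_calls_in_file(path):
    """Run count_calls over a file without reading it all at once."""
    # Iterating over the file object yields one line at a time.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_calls(handle)
```

Memory use then grows with the number of distinct GUID-DispID pairs rather than with the number of calls.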
Results and Findings – Second Quarter
Phase IV
• We have analyzed data from our first two result sets
  • Crawl with 5 initial seeds
    • 3,476,348 function calls
    • 109 distinct GUIDs, 7,364 GUID-DispID pairs
  • Crawl with 15 initial seeds
    • 3,706,454 function calls
    • 95 distinct GUIDs, 5,575 GUID-DispID pairs
• Looked at the most common functions, the most common int-argument functions, and the distribution of argument values for these functions
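The summary figures above (call totals, distinct GUIDs, distinct GUID-DispID pairs, top int-argument functions, and argument-value distributions) could be computed along these lines. This is a sketch under assumptions: records are taken to be `(guid, dispid, arg_type, arg_val)` tuples with one record per call, and `"int"` is a hypothetical type label.

```python
from collections import Counter


def summarize(records):
    """Summarize an iterable of (guid, dispid, arg_type, arg_val) tuples."""
    pair_counts = Counter()      # calls per (GUID, DispID) pair
    int_pair_counts = Counter()  # same, restricted to int-argument calls
    int_values = {}              # pair -> Counter of observed int values
    total = 0
    for guid, dispid, arg_type, arg_val in records:
        total += 1
        pair = (guid, dispid)
        pair_counts[pair] += 1
        if arg_type == "int":
            int_pair_counts[pair] += 1
            int_values.setdefault(pair, Counter())[int(arg_val)] += 1
    return {
        "calls": total,
        "distinct_guids": len({guid for guid, _ in pair_counts}),
        "distinct_pairs": len(pair_counts),
        "top_int_functions": int_pair_counts.most_common(3),
        "int_values": int_values,
    }
```

The per-pair `Counter` in `int_values` is the raw material for a histogram of argument values for any one function.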
Results and Findings – Second Quarter
• Function 1:
  • GUID: 3050f55d-98b5-11cf-bb82-00aa00bdce0b
  • GUID object name: DispHTMLWindow2
  • DispID: 1103
• Most popular int-argument function in both result sets
• Mostly random distribution, but with signs of regularity
• Results from the two sets show significant differences
Results and Findings – Second Quarter
• Function 2:
  • GUID: 3050f55f-98b5-11cf-bb82-00aa00bdce0b
  • GUID object name: DispHTMLDocument
  • DispID: 1013
• Second most popular int-argument function in both result sets
• Shows a regular distribution with distinct characteristics
• Results from the two sets show significant differences
Results and Findings – Second Quarter
• Function 3:
  • GUID: 3050f51b-98b5-11cf-bb82-00aa00bdce0b
  • GUID object name: DispHTMLIFrame
  • DispID: -2147418107
• Third most popular int-argument function in the 1st result set, 95th most popular in the 2nd result set
• Shows a random distribution with distinct characteristics
• Results are dramatically different between data sets
• All arguments in the 2nd result set are 0
Results and Findings – Second Quarter
• Found significant differences between the data sets in both the frequencies of specific functions and the arguments passed to specific functions
• Suspect that the differences result from biases due to the small number of initial seeds (5 and 15)
• Ran a much broader crawl (200 seeds) in hopes of getting more general, unbiased results
• Just from partial results of this crawl (roughly 8,000 websites), we have so far found:
  • A much higher average number of calls to our listener per website
  • A large percentage of function calls that take 0 arguments
• Will post complete results once the crawl is finished
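A simple way to quantify how far two result sets differ in function frequencies is to compare popularity ranks and a distance between the relative-frequency distributions. The sketch below uses total variation distance, which is our choice for illustration; the slides do not say which measure, if any, the project used. Both inputs are assumed to be non-empty `Counter` objects keyed by function.

```python
from collections import Counter


def rank_of(counts, key):
    """1-based popularity rank of key, or None if it never occurs."""
    if key not in counts:
        return None
    ordered = [k for k, _ in counts.most_common()]
    return ordered.index(key) + 1


def total_variation(counts_a, counts_b):
    """Half the L1 distance between the two relative-frequency distributions."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    keys = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(k, 0) / total_a - counts_b.get(k, 0) / total_b)
        for k in keys
    )
```

A value of 0 means the two sets call functions in identical proportions and 1 means they share no functions at all, so the measure is comparable across crawls of different sizes.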
Direction for Next Quarter
• Further analyze the gathered data for patterns
• Compare trends in “normal” data to what occurs in malicious scripts