
Research of Alan Sprague: Using Data Mining to Combat Spam, Phishing, and Malware

Department of Computer and Information Sciences, University of Alabama at Birmingham. Computer Forensics at UAB.


Presentation Transcript


  1. Research of Alan Sprague: Using Data Mining to Combat Spam, Phishing, and Malware Department of Computer and Information Sciences University of Alabama at Birmingham Univ. of Alabama @ Birmingham

  2. Computer Forensics at UAB
     • We offer BS and MS degrees with an emphasis on forensics; the Criminal Justice Department participates in these programs.
     • Research center: CIA/JFR: http://thecenter.uab.edu
     • Gary Warner
       • Blog “Cyber Crime and Doing Time”
       • http://garwarner.blogspot.com
     • My research
       • Spam
       • Phishing
       • Malware

  3. Outline
     • This presentation describes my research interests in spam and malware.
     • The next 9 slides: spam.
     • Subsequent slides: malware.

  4. Spam and the criminal web
     70-80% of all email in the world is spam. Spam enables various classes of antisocial activity:
     • Spam advertises opportunities to buy counterfeit goods, for example pills (possibly adulterated pills).
     • Spam delivers phish, which are commonly intended to steal credentials for banks and other financial institutions.
     • Spam delivers malware.

  5. Spam: Clustering, not Classification
     • People commonly expect our research to be classification of emails as ham or spam: desired or undesired. They then expect us to help filter email, so that spam will not be delivered.
     • That is not our research. Instead, we start with a data file that we expect is entirely spam, and our goal is to cluster it into spam campaigns.
     • This is an important goal: once we understand the various spam campaigns, we know which are the largest, and we know what type of criminal activity each campaign enables. This enables law enforcement to focus attention on the most harmful campaigns.

  6. Background on Data Mining
     • Data Mining studies the challenges and opportunities offered by huge data files.
     • Three methods are central to Data Mining.
       • Clustering: group together records in the data file if they resemble each other (without knowing the “meaning” of any resulting group, called a cluster).
       • Classification: assign each record to one of several “classes”, each of which corresponds to a known type of data.
       • Frequent sets and association rules

  7. Our spam data
     • Each day: 1 million spam messages
     • Stored into UAB Spam Data Mine

  8. Preprocessing of spam data
     • Parsing
       • Subject
       • Sender IP
       • Sender name
       • If body contains a URL:
         • Its domain, and IP
       • Word count of body
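The parsing step on slide 8 can be sketched with Python's standard `email` module. This is only an illustration under stated assumptions, not the UAB Spam Data Mine code: the function name `parse_spam`, the output field names, the regular expressions, and the heuristic of taking the sender IP from the oldest `Received` header are all hypothetical. Resolving the URL domain to an IP address is left out.

```python
import re
from email import message_from_string
from email.utils import parseaddr
from urllib.parse import urlparse

# Assumption: URLs in the body start with http:// or https://.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")
# Assumption: the relaying host's IP appears in square brackets in a Received header.
IP_RE = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def parse_spam(raw: str) -> dict:
    """Extract the slide-8 fields from one raw, single-part email message."""
    msg = message_from_string(raw)
    sender_name, sender_addr = parseaddr(msg.get("From", ""))

    # Heuristic: the last Received header is the oldest hop, closest to the sender.
    received = msg.get_all("Received") or []
    ip_match = IP_RE.search(received[-1]) if received else None

    # Multipart messages are ignored in this sketch.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""

    hosts = {urlparse(u).hostname for u in URL_RE.findall(body)}
    return {
        "subject": msg.get("Subject", ""),
        "sender_name": sender_name,
        "sender_username": sender_addr.partition("@")[0],
        "sender_ip": ip_match.group(1) if ip_match else None,
        "url_domains": sorted(h for h in hosts if h),
        "word_count": len(body.split()),
    }
```

Each message becomes one flat record, which is the shape a clustering step would consume.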

  9. Some spams, parsed
     Subject   Sender   Sender Name   Username
     Order HCG online   y5fh6   EfrenGriffith   artq.com
     Order HCG online   vfe3ih   Victor   musicradio.com
     Pfizer Inc Discount 43681   lefley   uab.edu
     Buy Cialis Online   Tam Smith   adeptis.com
     Your LinkedIn blocked   John Fial   irs.gov

  10. Goal, for the Spam Data Mine
     • Cluster each day’s emails to find the largest spam campaigns, and then look for clues about where they are coming from.
     • Relate each day’s clusters to the previous day’s clusters. Any new types of spam are considered “emerging threats”.
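The two goals on slide 10 can be illustrated with a deliberately simple stand-in. The slides do not say which clustering algorithm is used, so this sketch groups messages that share exactly the same subject and URL domain, then reports groups that appear today but not yesterday. The function names and the `subject` / `url_domain` field names are assumptions made for the example.

```python
from collections import Counter


def cluster_by_key(messages, key_fields=("subject", "url_domain")):
    """Group parsed messages by identical values on key_fields.

    Returns a Counter mapping each key tuple to its cluster size.
    Exact-match grouping is a stand-in for a real clustering method.
    """
    return Counter(tuple(m.get(f) for f in key_fields) for m in messages)


def emerging(today, yesterday):
    """Cluster keys seen today but not yesterday, largest cluster first."""
    new_keys = [k for k in today if k not in yesterday]
    return sorted(new_keys, key=lambda k: -today[k])


yesterday = cluster_by_key([{"subject": "A", "url_domain": "x.com"}] * 3)
today = cluster_by_key(
    [{"subject": "A", "url_domain": "x.com"}] * 2
    + [{"subject": "B", "url_domain": "y.com"}] * 5
)
largest_key, largest_size = today.most_common(1)[0]  # the day's largest campaign
new_campaigns = emerging(today, yesterday)           # candidate "emerging threats"
```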

  11. Largest Cluster on a particular day

  12. Why Is This Work Useful?
     • Leading spammers use a large number of domains to counter domain blacklisting.
     • Shutting down those domains and their hosting servers can greatly cripple spammers’ ability to conduct spam-related cyber crimes.
     • Further investigation of domains and IP addresses may lead to the identities of spammers.

  13. Transition
     • Spam clustering is an ongoing project. A different thrust is the study of malware. I describe two methods of static analysis of malware: using blocks and jumps (slide 16), and using strings (slides 17-23).

  14. Malware
     • What is malware?
       • A program that performs actions that the user does not want
       • Executable file, i.e., machine code
     • Each day, we add 5000 new malwares to our database
     • Two types of analysis:
       • Static analysis
       • Dynamic analysis

  15. Goals
     • Malwares belong to families, such as Zeus, Reveton, Perfect keylogger
     • Eventual goal: Put each malware into its family.
     • Current goal: Cluster malwares, based on their strings.

  16. Static Analysis, using Blocks and Jumps
     • Method to encode malwares:
       • Jumps (e.g. subroutines, and subroutine calls)
       • Disassemble each malware, split it into “blocks”, compute a hash value for each block. Also find each jump, and write which block it is from and which it is to.
     • Result: each malware is a directed graph.
     • When malwares are encoded this way, malwares will be clustered together if their graphs are similar.
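The encoding on slide 16 can be sketched as follows, assuming the disassembly has already produced a list of blocks and a list of jumps between block indices. The slide names neither a hash function nor a graph similarity measure, so SHA-256 and an average of node and edge Jaccard scores are assumptions chosen for illustration. The function names are hypothetical.

```python
import hashlib


def encode(blocks, jumps):
    """Turn one disassembled sample into a directed graph of block hashes.

    blocks: list of strings, one per block (assumed output of a disassembler).
    jumps:  list of (from_index, to_index) pairs into blocks.
    """
    hashes = [hashlib.sha256(b.encode()).hexdigest()[:16] for b in blocks]
    nodes = set(hashes)
    edges = {(hashes[src], hashes[dst]) for src, dst in jumps}
    return nodes, edges


def jaccard(a, b):
    """Size of the intersection over size of the union; 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def graph_similarity(g1, g2):
    """Assumed measure: mean of node-set and edge-set Jaccard similarity."""
    return 0.5 * (jaccard(g1[0], g2[0]) + jaccard(g1[1], g2[1]))


m1 = encode(["push ebp; mov ebp, esp", "call foo", "ret"], [(0, 1), (1, 2)])
m2 = encode(["push ebp; mov ebp, esp", "call foo", "xor eax, eax"], [(0, 1), (1, 2)])
score = graph_similarity(m1, m2)  # between 0 and 1; higher means more shared structure
```

Hashing blocks makes two samples comparable without aligning their code: identical blocks map to identical nodes wherever they sit in the file.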

  17. Static Analysis, using strings of printable characters
     • at least 4 characters long, ending with \0
     Example strings:
       cxczxczxczxcc
       Enter
       %d-%02d-%02d_%02d-%02d-%02d-%d
       JPEG Image saved successfully!^
       Screenshot saving cancelled because of logging disabled.^
       COXJPEGFile::fill_input_buffer : Catching CFileException^
       %d-%d-%d_%d-%d-%d
       _controlfp
       1.12782
       @.rsrc
       Password:
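The rule on slide 17 (runs of printable characters, at least 4 long, terminated by `\0`) can be expressed as a regular expression over the raw bytes of an executable. This is a minimal sketch: treating "printable" as ASCII 0x20 to 0x7E is an assumption, and the function name is hypothetical.

```python
import re


def printable_strings(data: bytes, min_len: int = 4):
    """Return printable ASCII runs of at least min_len bytes that end with a NUL byte."""
    # Assumption: "printable" means bytes 0x20 (space) through 0x7E (~).
    pattern = re.compile(rb"([\x20-\x7e]{%d,})\x00" % min_len)
    return [m.group(1).decode("ascii") for m in pattern.finditer(data)]


sample = b"\x01\x02Enter\x00ab\x00_controlfp\x00\xffPassword:\x00xyz"
found = printable_strings(sample)  # "ab" is too short; "xyz" has no terminating NUL
```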

  18. Data File for 1 Day
     Each row is the list of strings in one malware. A sample file of 5000 malwares looks like:
       m1: cxczxczxczxcc,Enter, _controlfp, ….
       m2: …………….
       m3: …………….
       m4: …………….
       . . .
       m5000: ………….

  19. Frequent sets
     • A typical application is retail data.
       • Data File: Purchases at a large store.
       • Each record: List of purchases of one customer.
       • Question: Which items are often bought together?
     • Our application: malware.
       • Our data file: Strings in malwares.
       • Each record: List of strings of one malware.
       • Question: Which strings are often found together?
       • Dual Question: which malwares have many common strings?

  20. Frequent sets: Tiny example
     • 6 malwares (so 6 records), 4 strings.
     • The malwares:
       • a, b, c, d
       • b, c, d
       • a, c, d
       • a, b
       • c, d
       • b, d
     • Incidence matrix
          a  b  c  d
          1  1  1  1
          0  1  1  1
          1  0  1  1
          1  1  0  0
          0  0  1  1
          0  1  0  1
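The tiny example on slide 20 is small enough to check by brute force. The sketch below counts, for a set of strings, how many records contain all of them (its support) and lists every set whose support meets a threshold. Enumerating all subsets is exponential in the number of strings, so this is for illustration only; the function names are hypothetical.

```python
from itertools import combinations

# The six malwares from the slide, each as the set of strings it contains.
records = [
    {"a", "b", "c", "d"},
    {"b", "c", "d"},
    {"a", "c", "d"},
    {"a", "b"},
    {"c", "d"},
    {"b", "d"},
]


def support(itemset, records):
    """Number of records that contain every string in itemset."""
    return sum(1 for r in records if itemset <= r)


def frequent_sets(records, min_support):
    """Map each itemset with support >= min_support to its support (brute force)."""
    items = sorted(set().union(*records))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(set(combo), records)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


support({"a", "c"}, records)  # 2: only the first and third records contain both
support({"c", "d"}, records)  # 4
```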

  21. Frequent sets: Tiny example
     • Strings a,c are a frequent set (records r1 and r3 contain both)
     • But a,c is not maximal, because d is in both records
     • Incidence matrix
               a   b   c   d
          r1  *1   1  *1  *1
          r2   0   1   1   1
          r3  *1   0  *1  *1
          r4   1   1   0   0
          r5   0   0   1   1
          r6   0   1   0   1

  22. Closed frequent sets
     • A frequent set is closed if it equals the intersection of the records containing it.
     • Alternate definition: a closed set is a maximal all-ones submatrix.
     • Since rows and columns play the same role in this, one can let malwares and strings exchange roles.
     • Example: Incidence matrix
               a   b   c   d
          r1  *1   1  *1  *1
          r2   0   1   1   1
          r3  *1   0  *1  *1
          r4   1   1   0   0
          r5   0   0   1   1
          r6   0   1   0   1
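The first definition on slide 22 translates directly into code: intersect all records that contain a set, and the set is closed exactly when that intersection gives the set back. The sketch below reuses the six records of the tiny example; `closure` and `is_closed` are hypothetical helper names.

```python
# The six records r1..r6 from the incidence matrix.
records = [
    {"a", "b", "c", "d"},  # r1
    {"b", "c", "d"},       # r2
    {"a", "c", "d"},       # r3
    {"a", "b"},            # r4
    {"c", "d"},            # r5
    {"b", "d"},            # r6
]


def closure(itemset, records):
    """Intersection of all records containing itemset, or None if no record does."""
    containing = [r for r in records if itemset <= r]
    if not containing:
        return None
    return set.intersection(*containing)


def is_closed(itemset, records):
    """True when itemset equals the intersection of the records containing it."""
    return closure(itemset, records) == set(itemset)


closure({"a", "c"}, records)         # {"a", "c", "d"}: r1 and r3 both also contain d
is_closed({"a", "c"}, records)       # False
is_closed({"a", "c", "d"}, records)  # True: the starred submatrix on the slide
```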

  23. Closed Frequent Sets for Malware Analysis
     • We wanted closed frequent sets, with threshold 30.
     • The lowest threshold the state-of-the-art algorithm could handle was 1000.
     • By being willing to discard strings that appear more than 10 times, we recently managed threshold 20.
     • This work is ongoing.

  24. The end.
