Detecting malware with graph-based methods: traffic classification, botnets, and Facebook scams

Michalis Faloutsos, U. New Mexico (moved from U.C. Riverside!) http://mypagekeeper.org

Key Thesis of this Talk: Graph mining and network science enable revolutionary security techniques • We develop methods for network malware detection • We use them to detect malware on social networks • New frontier: many new problems have emerged

This talk • Part I: Graph-based techniques for network security • Part II: Detecting malware in Social Networks • Part III: Some new projects

Who is using or attacking my network? Profiling traffic is not a solved problem • What we get with existing tools: much traffic is labelled "Unknown!" [Figure: traffic profiling results using deep packet inspection; data are from a peering link between two ISPs in the US] • What we want: ideally, a method that profiles ALL the traffic (recall) and has high profiling accuracy (precision)

GraphWare: A graph-based approach to network monitoring • We monitor traffic as a network-wide phenomenon • Beyond packet and flow statistics and host profiling • Based on the PhD work of Marios Iliofotou at UC Riverside • Papers: IMC07, CONEXT09, GI09, INFOCOM10, CONEXT10 • Collaborators: Prashanth Pappu, Sumeet Singh (Cisco), M. Mitzenmacher (Harvard), G. Varghese (UCSD), T. Eliassi-Rad (LLNL/Rutgers), B. Gallagher (LLNL) • Traffic Dispersion Graphs

Capturing Network Context: Traffic Dispersion Graphs • Traffic Dispersion Graphs: Who talks to whom • Deceptively simple definition • Defining what constitutes an edge allows for focused “slices” • Enables powerful visualization and novel algorithms [Figure: virus traffic in blue]
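
The "who talks to whom" definition and the idea of edge-defined slices can be illustrated with a minimal sketch. The `Flow` record, its field names, and the `build_tdg` helper are assumptions made for illustration; they are not taken from GraphWare.

```python
from collections import namedtuple

# Hypothetical flow record; the field names are assumptions for illustration.
Flow = namedtuple("Flow", "src dst proto dport packets")

def build_tdg(flows, edge_filter=lambda f: True):
    """Return a TDG as a set of directed (src, dst) edges.

    The edge_filter decides what constitutes an edge, which gives
    a focused slice of the traffic.
    """
    return {(f.src, f.dst) for f in flows if edge_filter(f)}

flows = [
    Flow("10.0.0.1", "10.0.0.2", "UDP", 53, 2),
    Flow("10.0.0.1", "10.0.0.3", "TCP", 80, 40),
    Flow("10.0.0.4", "10.0.0.3", "TCP", 80, 12),
    Flow("10.0.0.4", "10.0.0.3", "TCP", 80, 15),
]

# Two slices of the same traffic: all UDP flows, and flows with >10 packets.
udp_tdg = build_tdg(flows, lambda f: f.proto == "UDP")
heavy_tdg = build_tdg(flows, lambda f: f.packets > 10)
```

Repeated flows between the same pair collapse into one edge here; a variant could keep edge weights instead.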

Roadmap of this part • Previous work and background • Overview of our graph-based solutions • Developing graph-based methods • Traffic classification: Profiling By Association • Botnet detection

I. Why do previous methods fail? Packet- and flow-level methods are not the answer • What existing profilers use, and how apps evade: • Port numbers (simplest approach): apps use random ports or legacy ports • Payload signatures (packet level): encryption • Flow statistics (flow level): flow padding

Deploying our graph-based approach (IMC’07) • Step 1: Determine the question of interest • Step 2: Define the appropriate graphs (TDGs): decide what to monitor, e.g., track all UDP flows, all flows at ports 10-300, or flows with >10 packets • Step 3: Use the right metrics to capture the right properties • Step 4: Visualize results or take action • GraphWare
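
Step 3 can be sketched with two simple graph metrics: node degree and the share of nodes in the largest connected component. The choice of these two metrics and the function names are assumptions for illustration; the slides do not say which metrics GraphWare computes.

```python
from collections import defaultdict

def _neighbors(edges):
    """Build an undirected adjacency map from (src, dst) edges."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def degrees(edges):
    """Map each node to its number of distinct neighbors (direction ignored)."""
    return {n: len(s) for n, s in _neighbors(edges).items()}

def largest_component_fraction(edges):
    """Fraction of nodes that sit in the largest weakly connected component."""
    nbrs = _neighbors(edges)
    seen, best = set(), 0
    for start in nbrs:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first walk over one component
            node = stack.pop()
            size += 1
            for n in nbrs[node]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        best = max(best, size)
    return best / len(nbrs) if nbrs else 0.0

edges = {("a", "b"), ("a", "c"), ("a", "d"), ("x", "y")}
deg = degrees(edges)
gcc = largest_component_fraction(edges)
```

In this toy graph, node `a` has degree 3 and four of the six nodes fall in the largest component.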

Using graphs enables many novel solutions! We addressed many different problems • Detect P2P traffic at the Internet core • Extract information from graph evolution • Classify obfuscated traffic • Exploit community structure using clustering • Profiling By Association (PBA) • Detect botnets: ENTELECHEIA • Detect malicious website scanners • Venues: CoNEXT’10, INFOCOM’10, GI’09, ASIACCS’14, COMNET, CoNEXT’09, Networking’13

Profiling By Association: The key insights • Insight 1: Some traffic is easy to profile • Insight 2: Traffic of apps exhibits homophily [Figure labels: P2P, online game, SMTP (email), Defaults]

The Profiling-By-Association Framework (PBA) • Input: network traffic and initial seed information • Generate graph • Phase A: Seeding • Phase B: Inference • Output: profiled network traffic • Uses only connectivity

Approach 1: Profiling By Association, the Neighboring Link Classifier (NLC) • Uses local structure of the graph • Classify an edge based on its neighbors [Figure: nodes V1, V2, and u; neighboring edges labelled web, each weighted x 0.5]
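
The idea of classifying an edge from its neighbors can be sketched as a majority vote over labelled edges that share an endpoint. This is a simplified illustration under that assumption, not the NLC as published; the function name and the unweighted vote are hypothetical.

```python
from collections import Counter

def classify_edge(edge, labelled):
    """Label `edge` by majority vote over labelled edges sharing an endpoint.

    `labelled` maps (src, dst) edges to application labels.
    Returns None when no labelled neighbor exists. Ties go to the
    label seen first.
    """
    u, v = edge
    votes = Counter(
        label
        for (a, b), label in labelled.items()
        if (a, b) != edge and {a, b} & {u, v}
    )
    return votes.most_common(1)[0][0] if votes else None

labelled = {
    ("v1", "u"): "web",
    ("v2", "u"): "web",
    ("w", "z"): "p2p",
}
guess = classify_edge(("u", "v3"), labelled)
isolated = classify_edge(("q", "r"), labelled)
```

The unknown edge at `u` inherits the `web` label from its two labelled neighbors, while an edge with no labelled neighbors stays unclassified.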

Approach 2: Profiling By Association (HYP), the HYP algorithm using clusters • Uses global structure • Two main steps: • Identify clusters • Exploit seeds to profile clusters • [HYP from hyper-graph] • Clustering: the Louvain method by Blondel et al. outperformed other methods [Figure labels: P2P, SMTP (email), online game; known email servers, known P2P, known gamers]
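
The two steps (identify clusters, then exploit seeds to profile them) can be sketched as follows. To keep the example dependency-free, connected components stand in for the Louvain clustering named on the slide; that swap, the function names, and the majority-vote rule are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def connected_clusters(edges):
    """Stand-in clustering: connected components of the host graph.

    Note: the slide reports using the Louvain method; components are
    used here only to keep the sketch self-contained.
    """
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    clusters, seen = [], set()
    for start in nbrs:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            members.add(node)
            for n in nbrs[node]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        clusters.append(members)
    return clusters

def label_clusters(clusters, seeds):
    """Give every host in a cluster the majority label of that cluster's seeds."""
    labels = {}
    for cluster in clusters:
        votes = Counter(seeds[h] for h in cluster if h in seeds)
        if votes:  # clusters without any seed stay unlabelled
            top = votes.most_common(1)[0][0]
            for h in cluster:
                labels[h] = top
    return labels

edges = [("a", "b"), ("b", "c"), ("x", "y")]
seeds = {"a": "smtp", "x": "p2p"}
labels = label_clusters(connected_clusters(edges), seeds)
```

With one seed per cluster, all five hosts receive a label even though only two were known in advance.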

Evaluation on four backbone traces • Seeding configurations • Randomly selected X% of IPs • Intentionally causing errors • Seeding using existing profilers • BLINC, CoralReef (in the paper) • Evaluation • Using 3-, 5-, and 10-minute intervals • Averaged over 20 runs • Small standard deviation

Accuracy • 1% of hosts as seeds • Accuracy = correctly labelled / all labelled • Here we label all flows • Both our algorithms do pretty well! • HYP is more robust to the specifics of a trace [Figure note: this trace has more hosts with multiple applications (high NAT usage)]

How much seeding info do we need? >1% • 0.1% seeding info is workable (~85% accuracy) • 1% of seeding info is sufficient (>90% accuracy) • 10% is great! (>95%) • Baseline: we just know the seeding information

What if we fake the connectivity? • Imagine: a hacker adds fake edges by a factor of k [Figure labels: P2P, SMTP (email), online game]

HYP is robust to edge obfuscation (20-200x) • BRAZ trace • Add links from P2P hosts towards other apps • Fake = k * Existing [Figure: results for k from 20x to 200x]
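
The obfuscation experiment can be sketched as adding random fake edges from P2P hosts to hosts of other apps. Reading "Existing" as the number of edges in the original graph is an assumption, as are the function name and the cap at the number of distinct host pairs available.

```python
import random

def add_fake_edges(edges, p2p_hosts, other_hosts, k, seed=0):
    """Return (obfuscated_edges, n_added) with about k * len(edges) fake edges.

    Fake edges run from P2P hosts towards hosts of other apps. The count
    is capped by the number of distinct candidate pairs (an assumption
    of this sketch, since edges are stored as a set).
    """
    rng = random.Random(seed)
    existing = set(edges)
    target = int(k * len(existing))
    candidates = [
        (s, d)
        for s in sorted(p2p_hosts)
        for d in sorted(other_hosts)
        if (s, d) not in existing
    ]
    fake = rng.sample(candidates, min(target, len(candidates)))
    return existing | set(fake), len(fake)

edges = {("p1", "p2")}
obfuscated, added = add_fake_edges(
    edges, p2p_hosts=["p1", "p2"], other_hosts=["w1", "w2", "w3"], k=2
)
```

Running a classifier on `obfuscated` instead of `edges` would then show how accuracy degrades as k grows.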

Specific Application: Can you find botnets? Botnets: groups of compromised end-user machines that communicate with each other and launch attacks together (DDoS, Email Spam)

Detecting bots using graphs • Problem: detect bots within an enterprise • Challenging requirements: 1. in their waiting stage (dormant, more difficult) 2. in the absence of payload signatures 3. peer-to-peer: decentralized without a botmaster • Previous efforts fail in at least one of the requirements • With Huy Hang (UCR) and T. Eliassi-Rad (Rutgers)

ENTELECHEIA: detecting bots in a network • ENtrap Treacherous ELEments through Clustering Hosts Exhibiting Irregular Activities • Defn: the state of a thing when its essence is fully realized (Aristotle) • Insight: botnet flows should be long-lived and low-intensity • Question: Is this enough to detect them? • Answer: Not as is. • Solution: we need to redefine flows (SuperFlows) [Figure: volume vs. duration of flow for Nugache botnet, Storm botnet, and regular traffic; botnet flows are different]

Key novelty: Introduce Superflows instead of flows • Superflows are groups of common flows • Consider any packet for the same pair of nodes • Irrespective of port number or protocol • Flows that are close in time
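
The Superflow definition can be sketched as grouping packets by host pair, ignoring port and protocol, and splitting a group when the time gap grows too large. The packet tuple layout, the `max_gap` threshold of 60 seconds, and the function name are assumptions for illustration; the slides do not give the actual merging rule.

```python
from collections import defaultdict

def build_superflows(packets, max_gap=60.0):
    """Group packets into superflows.

    packets: iterable of (timestamp, src, dst, proto, dport, nbytes).
    Packets between the same pair of hosts are merged regardless of
    port, protocol, or direction; a gap above max_gap seconds starts
    a new superflow. Returns (pair, start, end, volume) tuples.
    """
    by_pair = defaultdict(list)
    for ts, src, dst, _proto, _dport, nbytes in packets:
        by_pair[tuple(sorted((src, dst)))].append((ts, nbytes))

    superflows = []
    for pair, pkts in by_pair.items():
        pkts.sort()
        start = last = pkts[0][0]
        volume = 0
        for ts, nbytes in pkts:
            if ts - last > max_gap:  # too far apart: close the current superflow
                superflows.append((pair, start, last, volume))
                start, volume = ts, 0
            last = ts
            volume += nbytes
        superflows.append((pair, start, last, volume))
    return superflows

packets = [
    (0, "a", "b", "TCP", 80, 100),
    (10, "b", "a", "UDP", 5000, 50),
    (500, "a", "b", "TCP", 443, 70),
    (5, "a", "c", "TCP", 22, 10),
]
superflows = build_superflows(packets)
```

Duration (end minus start) and volume per superflow are the two quantities the volume vs. duration comparison would then use.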

Initial results are very promising: >96% F1-score • ENTELECHEIA: F1-score higher than 96% • Real traces injected with real Storm and Nugache traffic • Just using thresholds for volume and duration fails (5-tuple, 2-tuple) • Reference solutions: Flows (5-tuple), Thresh (2-tuple) [Figure: Flows vs. Thresh vs. ENTELECHEIA]
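
For reference, the F1-score is the harmonic mean of precision and recall. A minimal sketch of how it could be computed over sets of hosts follows; the host names and the set-based interface are hypothetical and do not reflect the evaluation data.

```python
def f1_score(true_bots, flagged):
    """F1 = 2 * precision * recall / (precision + recall) over host sets."""
    true_positives = len(true_bots & flagged)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_bots) if true_bots else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 4 real bots, 3 of them flagged, plus 1 false alarm.
true_bots = {"b1", "b2", "b3", "b4"}
flagged = {"b1", "b2", "b3", "h9"}
score = f1_score(true_bots, flagged)
```

Here precision and recall are both 3/4, so the F1-score is also 0.75.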

We released GraphWare v.1.0: www.cs.ucr.edu/~hangh/graphware.html • Based on Python and the GUESS framework • Supports: graph metrics, comparisons, clustering

Related Publications • “Network monitoring using Traffic Dispersion Graphs (TDGs).” In ACM IMC, 2007. (AR 21%). • “Graph-based P2P traffic classification at the Internet backbone.” In IEEE Global Internet, 2009. (AR 34%). • “Exploiting dynamicity in graph-based traffic analysis: Techniques and applications.” In ACM CoNEXT, 2009. (AR 17%). • “Homophily in application-layer and its usage in traffic classification.” Brian Gallagher, M. Iliofotou, T. Eliassi-Rad, M. Faloutsos. IEEE INFOCOM mini, 2010. (AR 24%). • “Profiling-by-association: A resilient traffic profiling solution for the Internet backbone.” To appear in ACM CoNEXT 2010. (AR 19%). • “Graption: A Graph-based P2P traffic classification framework for the Internet backbone.” Computer Networks by Elsevier 2011. • “Entelecheia: Detecting P2P Botnets in their Waiting Stage.” Huy Hang, Xuetao Wei, Michalis Faloutsos, Tina Eliassi-Rad. IFIP Networking 2013, May 2013. • “Scanner Hunter: Understanding HTTP scanning traffic.” Guowu Xie, Huy Hang, Michalis Faloutsos. Accepted to ASIACCS 2014, Kyoto, Japan.

Part II: Detecting Malware on Facebook • Collaborators: Sazzadur Rahman, Ting-Kai Huang, Harsha V. Madhyastha (UC Riverside), Bruno Ribeiro (UMass/CMU) • Venues: USENIX Security’12, CoNEXT’12, WWW’13

Introducing Socware • Socware = SOCial malWARE • Malicious, annoying, parasitic activities [Figure: the “Get a free subway” scam on my wall]

The Dark Side of Facebook • We need new solutions for socware • Our prediction: it is going to get worse [Figure labels: malicious post, malicious apps]

Should we care? Yes. “Facebook is the new web” • bmw.com vs. facebook.com/bmw

MyPageKeeper: Our Facebook App • apps.facebook.com/mypagekeeper

MyPageKeeper does the job • apps.facebook.com/mypagekeeper • MyPageKeeper: 20K installs, monitors 3M walls • It is an efficient, scalable socware detection method • 0.005% false positives, 3% false negatives • Monitors every 2 hours from the cloud • Some key observations • 49% of users were exposed to a malicious post in 4 months • We are big in Japan

Existing malware solutions: not enough • URL blacklists detect only 3.5% of bad posts • Remaining 96% caught by our ML-based logic • 26% of malicious URLs point to facebook.com [Figure: performance comparison with blacklist]

The rise of the AppNet • Socware is enabled by Facebook apps! • 44% of campaigns are enabled by Facebook applications

Apps cross-promote directly • An App1 post points to App2 • Facebook terms forbid this!

Apps cross-promote indirectly: Highly sophisticated “fast-flux” • An App1 post points to an external website with redirector JavaScript, which leads to App2, App3, or App4 • We identified 103 URLs doing redirections!

Our Solution FRAppE: Facebook’s Rigorous App Evaluator • Input: App ID; output: Malicious or Benign • FRAppE Lite, user-side • Use features crawled on-demand • No. of permissions required by an app • Domain reputation of redirect URI • Uses Support Vector Machines • FRAppE, OSN-centric • Addition of aggregation-based features: • Similarity of app names • Whether posted links are external • FRAppE has 99% detection accuracy

Some scary, interesting results • 13% of apps in our dataset of 111K distinct apps are malicious • 60% of malicious apps endanger more than 100K users each (click on link) • 40% of malicious apps have over 1,000 monthly active users each • We found 800 malicious apps that Facebook missed!

AppNets: large collaborative groups • App Collaboration graph • 44 connected components • Largest connected component: 3,484 apps • High connectivity • 70% of apps collude with more than 10 other apps • High density • 25% of apps have local clustering coefficient more than 0.74 [Figure: real snapshot of 770 highly collaborating apps; promoter and promotee]
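
The local clustering coefficient quoted above measures how many of a node's neighbors are also linked to each other. A minimal sketch on an undirected collaboration graph follows; the app names and the function name are hypothetical, and treating promoter-promotee links as undirected is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def local_clustering(edges):
    """Map each node to 2 * links_among_neighbors / (k * (k - 1)).

    k is the node's degree; nodes with fewer than two neighbors get 0.0.
    """
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            nbrs[u].add(v)
            nbrs[v].add(u)
    coeff = {}
    for node, ns in nbrs.items():
        k = len(ns)
        if k < 2:
            coeff[node] = 0.0
            continue
        links = sum(1 for a, b in combinations(ns, 2) if b in nbrs[a])
        coeff[node] = 2 * links / (k * (k - 1))
    return coeff

# Hypothetical collaboration graph: a triangle of apps plus one pendant app.
edges = [("app1", "app2"), ("app2", "app3"), ("app1", "app3"), ("app3", "app4")]
cc = local_clustering(edges)
```

Here `app1` scores 1.0 because its two neighbors collaborate with each other, while `app3` scores 1/3 because only one of its three neighbor pairs is linked.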

Our anti-Socware work • “An Analysis of Socware Cascades in Online Social Networks”, Ting-Kai Huang, Md Sazzadur Rahman, Harsha Madhyastha and Michalis Faloutsos, World-Wide Web Conference (WWW’13), 2013. • “FRAppE: Detecting Malicious Facebook Applications”, Md Sazzadur Rahman, Ting-Kai Huang, Harsha Madhyastha and Michalis Faloutsos, ACM CoNEXT’12, Nice, France, December 2012. • “Efficient and Scalable Socware Detection in Online Social Networks”, Md Sazzadur Rahman, Ting-Kai Huang, Harsha Madhyastha and Michalis Faloutsos, USENIX Security, 2012.

Conclusion • We need analysis of large, evolving graphs • Network security needs to see the big picture • Graph-mining needed: fast, accurate, customizable • We need new solutions to detect socware • Existing blacklists and anti-spam won’t work • Malicious apps form colluding networks (AppNets)

Key Research Areas at UNM CS • Human-Centric Security • Adaptive Biological Systems • Data Science and Visualization • Computing In the Large

My Research Directions • Social media analytics and security • Securing smartphones and embedded devices • Web-based malware

Human-Centric Security Center: “Technologies for securing people”™ • Mission: • Secure personal information • Protect privacy and digital freedom • Empower people through awareness, control and choice • Existing projects • Internet censorship: choke points (Prof. Crandall) • Social network tools to protect people • Privacy awareness and tools (Prof. Kelley) • Provably robust and privacy-preserving distributed systems (Prof. Saia)

New projects • PeerApp: Early warning for risky behavior • Behavior: depression, suicide, addiction, bullying • Using OSN data to detect • Privacy-Panel: know what you share • Inform users what they share and with whom • PrivateNet: private and anonymized OSN • No one can reverse engineer contacts or content • CommentDigest: info from user comments • Securing embedded devices

PeerApp: Detecting risk behavior • Problem: detect risk behavior early • Solution: Leverage modern technology • Harness: social net information • Collect smartphone information: passive/active • Predict and prevent problems: • Notify appropriate person

PeerApp: An ambitious agenda • Collect and mine social data • Collect and mine bio data • Develop theories of risk behavior • Provide open platform for social studies

PeerApp: Key Novelties • Completeness: • From genes to brain to behavior to policy • Privacy sensitive: • Provide warnings, not incriminating evidence • In-vivo pseudo-real-time information: • Collect and act Just-In-Time • Provide blueprint for Social Studies: • Methods and tools to harness new tech

PeerApp: The team • Multidisciplinary team • Psychiatry: S. Feldstein-Ewing (UNM), T. Chung (U. Pitt) • Bioinformatics: V. Calhoun (MIND-UNM) • Data mining: C. Faloutsos (CMU), M. Abdullah (UNM) • Network Science: K. Pelechrinis (U. Pitt), M. Faloutsos (UNM) • Proposal and papers in progress

PrivateNet: Free Digital Speech • Goal: fully anonymous and private communications • Key: no one can reverse engineer • Content and pattern of communication • How: combination of • End to end encryption • Obfuscation • Collaborators: • S. Krishnamurthy, H. Madhyastha UCR
