Security Analytics Thrust. Anthony D. Joseph (UCB), Rachel Greenstadt (Drexel), Ling Huang (Intel), Dawn Song (UCB), Doug Tygar (UCB). Outline: Our view of Security Analytics; Adversaries, Humans, and Machine Learning; Joint research with McAfee; Our proposed malware analysis pipeline
Security Analytics Thrust Anthony D. Joseph (UCB) Rachel Greenstadt (Drexel), Ling Huang (Intel), Dawn Song (UCB), Doug Tygar (UCB)
Outline • Our view of Security Analytics • Adversaries, Humans, and Machine Learning • Joint research with McAfee • Our proposed malware analysis pipeline • Today’s Security Analytics talks
Our View of Security Analytics • Using robust ML for adversary-resistant security metrics and analytics • Pattern mining and prediction at scale on big data • Detecting malware, spam, and malicious sites/URLs • Identifying authors of User Generated Content (UGC) and malware • Also, Sybil detection in crowds and obfuscating authors of UGC • Detecting human biosignals – EEG, vision tracking, SAFE continuous authentication • Helping the humans-in-the-loop (situational awareness) • End-users of systems • Crowds and human reviewers • Domain experts
Adversarial Exploitation of ML • Traditional approach – Evading Adversary • Attacker determines decision boundary • Crafts (positive instance) content that is classified as negative • Newer approach – Influencing Adversary • Patient attacker operates during periodic retraining stage by injecting “tricky” positive instances • Shifts decision boundary over time during retraining such that (positive instance) content is eventually classified as negative • Need novel adaptive, robust ML techniques to defend against Influencing Adversaries
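To make the Influencing Adversary concrete, here is a minimal toy sketch in Python. It is not the authors' method: the 1-D midpoint classifier, the sample scores, and the assumption that retraining trusts the current model's own labels for newly seen data are all illustrative choices. It shows how instances injected just below the boundary can drag the boundary upward across retraining rounds until a target positive instance is classified as negative.

```python
from statistics import mean


def fit_threshold(benign, malicious):
    """Toy 1-D classifier: the boundary is halfway between the class means."""
    return (mean(benign) + mean(malicious)) / 2.0


def is_flagged(score, threshold):
    """A sample is classified positive (malicious) at or above the boundary."""
    return score >= threshold


def simulate_influence_attack(target=7.0, per_round=5, max_rounds=100):
    # Hypothetical training data: one suspiciousness score per sample.
    benign = [1.0, 2.0, 3.0]
    malicious = [8.0, 9.0, 10.0]

    threshold = fit_threshold(benign, malicious)
    flagged_before = is_flagged(target, threshold)

    rounds = 0
    while is_flagged(target, threshold) and rounds < max_rounds:
        # The attacker injects samples just below the current boundary.
        # Assumption: retraining labels them with the current model,
        # so they enter the benign training set.
        injected = [threshold - 0.1] * per_round
        benign = benign + injected
        threshold = fit_threshold(benign, malicious)  # periodic retraining
        rounds += 1

    flagged_after = is_flagged(target, threshold)
    return flagged_before, flagged_after, rounds


if __name__ == "__main__":
    before, after, rounds = simulate_influence_attack()
    print(f"flagged before: {before}, flagged after: {after}, rounds: {rounds}")
```

An Evading Adversary, by contrast, would leave the training data alone and only modify the target until its score fell below the fixed boundary.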
Synergy between Humans and ML • Users – providing clear answers and usable security • Is this content spam or malicious? • What is the reasoning behind a security decision? • Can my UGC be identified as being mine? • Also, understanding how users reason about security • Crowds – augmenting ML with human capabilities • Leveraging humans to disambiguate borderline instances (e.g., is this a malicious or benign application or website) • Domain Experts – prioritizing a limited resource • Identifying when to rely on experts to evaluate model changes • Helping determine authorship identification for malware
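One way to picture "leveraging humans to disambiguate borderline instances" is a simple routing rule on model confidence. The sketch below is an assumption for illustration only: the function name, the labels, and the 0.3/0.7 cut-offs are hypothetical, not values from the talk.

```python
def route_instance(p_malicious, low=0.3, high=0.7):
    """Decide automatically when the model is confident; otherwise ask humans.

    p_malicious is the model's estimated probability that the instance is
    malicious. The low/high cut-offs are hypothetical tuning parameters.
    """
    if not 0.0 <= p_malicious <= 1.0:
        raise ValueError("p_malicious must be a probability in [0, 1]")
    if p_malicious >= high:
        return "malicious"
    if p_malicious <= low:
        return "benign"
    # Borderline: defer to crowd reviewers rather than guess.
    return "ask_crowd"


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        print(p, route_instance(p))
```

Widening the band between the two cut-offs sends more instances to reviewers, which is the knob that trades human effort against automated error.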
Collaboration with McAfee • Special academic-industry collaboration • Unique opportunity for academic access to massive scale real-world adversarial data • Pathway for research to yield real-world impact • Two Robust ML research efforts • Current: Active protection • Future: Malicious URL/site detection (Site Advisor) • Update: • Signed University-level NDAs with UC Berkeley and Drexel • Had meetings at Intel and UC Berkeley • Delivered prototype ML-based malware classification system that supports large-scale classification of polymorphic threats • Ongoing: Refining research focus and exploring Artemis sample dataset
Artemis and GTI • Artemis and GTI collect voluminous “suspicious events and metadata” from millions of end hosts • McAfee needs to: • Classify events into clean/dirty labels • Cluster events into groups • Rank groups according to their suspiciousness level • Help identify malware families (authorship classification) • Our planned efforts • Build a large-scale, online, adaptive ML system for automated malware classification with humans in the loop • Apply stylometry for forensic analysis and malware classification
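The classify, cluster, and rank steps listed above can be sketched on toy data as follows. Everything here is assumed for illustration: the event fields ("signature", "score"), the 0.5 labeling cut-off, and exact-signature grouping as a stand-in for real clustering. None of it reflects the actual Artemis or GTI data format.

```python
from collections import defaultdict
from statistics import mean


def triage(events, dirty_threshold=0.5):
    """Label events, group them, and rank groups by mean suspiciousness."""
    # 1. Classify: attach a clean/dirty label from a hypothetical score.
    labeled = [
        dict(e, label="dirty" if e["score"] >= dirty_threshold else "clean")
        for e in events
    ]

    # 2. Cluster: group by an exact signature (a stand-in for real clustering).
    groups = defaultdict(list)
    for event in labeled:
        groups[event["signature"]].append(event)

    # 3. Rank: most suspicious groups first, by mean score.
    ranked = sorted(
        groups.items(),
        key=lambda item: mean(e["score"] for e in item[1]),
        reverse=True,
    )
    return labeled, ranked


if __name__ == "__main__":
    sample = [
        {"signature": "packer-A", "score": 0.9},
        {"signature": "packer-A", "score": 0.8},
        {"signature": "installer-B", "score": 0.2},
        {"signature": "script-C", "score": 0.6},
    ]
    _, ranked = triage(sample)
    for signature, members in ranked:
        print(signature, len(members))
```

Ranking whole groups instead of single events is what lets a limited pool of human reviewers start with the most suspicious families.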
Proposed Malware Analysis Pipeline • Data from McAfee’s GTI and Google’s VirusTotal • [Pipeline diagram; recoverable labels: Program code (Mobile Apps, Executables); Program Analysis (Static/Dynamic/Human Analysis); Program Features; Feature Encoding; Machine Learning; Malware Classification Models; Further analysis; Feedback; Human: Domain Experts] • Categorization and Prioritization are critical!
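The Feature Encoding stage of the pipeline could, for example, map a variable-length list of program features to a fixed-length vector that a learner can consume. The sketch below uses the hashing trick as one possible encoding. The slide does not say which encoding is used, so the technique, the function name, the vector size, and the example feature names are all assumptions.

```python
import zlib


def encode_features(features, dim=16):
    """Hash each feature name into one of `dim` buckets and count occurrences.

    zlib.crc32 is used instead of the built-in hash() so that the encoding
    is stable across Python processes.
    """
    vector = [0] * dim
    for name in features:
        index = zlib.crc32(name.encode("utf-8")) % dim
        vector[index] += 1
    return vector


if __name__ == "__main__":
    # Hypothetical features extracted by static or dynamic program analysis.
    extracted = ["CreateRemoteThread", "WriteProcessMemory", "CreateRemoteThread"]
    print(encode_features(extracted))
```

A fixed output size keeps the model input stable as new feature names appear, at the cost of occasional collisions between unrelated features.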
Security Analytics Talks (Session 1) • Big data for security analytics • Using adaptive, large-scale ML to identify and classify malware families using code features • Learning as an “attack”: De-anonymization • Automated analysis of encrypted traffic – Identifying the URLs/topics of SSL-encrypted web pages • Learning for web-based malware detection • Not code features, but rather: where scripts and objects come from, who makes the requests, and how the user gets to the site
Security Analytics Talks (Session 2) • Using Network Science to detect Sybils in social networks • Leveraging social structure to detect fake accounts and improve user authentication • Learning as an “attack”: De-anonymization • Automated analysis and identification of underground forum users • Understanding how End Users reason about Risk • Security, privacy, and a 9-dimensional model for users
Security Analytics Goals • Developing tools combining machine learning and analysis to automatically extract features and build models • Improving users’ experiences by translating the reasoning behind security decisions into human-understandable concepts • Designing robust algorithms for large-scale machine learning in the presence of adversarial manipulation