Combatting Advanced Cybersecurity Threats with AI and Machine Learning

Combatting Advanced Cybersecurity Threats with AI and Machine Learning (SPO1-T11) Andrew B. Gardner, Ph.D., Senior Technical Director, Symantec Corporation, @andywocky

The AI and ML Revolution is Here: Self-driving cars

The AI and ML Revolution is Here: AI-generated art (L. Gatys, A.S. Ecker and M. Bethge, “A Neural Algorithm of Artistic Style,” https://arxiv.org/pdf/1508.06576v1.pdf)

The AI and ML Revolution is Here: Computational perception – face recognition (and speech, text, social, video, etc.) (Y. Taigman et al., “DeepFace: Closing the Gap to Human-Level Performance in Face Verification,” https://research.fb.com/wp-content/uploads/2016/11/deepface-closing-the-gap-to-human-level-performance-in-face-verification.pdf)

Key Points to Cover in This Talk • What are AI and ML? • Why are AI and ML important for cybersecurity? • AI / ML at Symantec • AI / ML and the future of cybersecurity

AI & ML Overview

What are ML and AI? • MACHINE LEARNING: The capability of a machine to learn without being explicitly programmed. • ARTIFICIAL INTELLIGENCE: The capability of a machine to imitate intelligent human behavior. [Diagram labels: learning, perception, decisions, autonomy]

An Example: AI vs. ML • AI: self-driving car • ML: pedestrian detection

In Cybersecurity We Focus More on Learning Reasons for ML focus: • Complex sequential data • Not human-intuitive • What should a program trace or log file “look like”? • Scarce or expensive labels • Closed research models → slower to advance AI/ML

Is This New? [Timeline figure spanning the 1950s to 2016, with year markers 1950s – 2000s, 2006, 2008, 2009, 2012, 2014, 2015, 2016. Labels: AI & ML core developments (multi-industry), decision trees, ensembling, Netflix Prize, ImageNet, deep learning (before / after), CAML, deep learning (mobile), AI services]

How is AI/ML Used in Security Today? Yet Another Threat Detector (YATD) • Straightforward recipe • Data with labels • Build / update classifiers • Debate about techniques • Rely on data scientists • Feature engineering • Updates & tweaks [Diagram labels: Collect Datasets, Training Algorithm, Trained Model, Researcher/Scientist, Updated Classifiers]

How is AI/ML Used in Security Today? Hidden (Automated) Systems • Primarily for automation • Not user-facing • Services and applications • Data + software engineering + ML • Examples: • Continual detector retraining • Smart data collection and labeling • Anomaly detection for IDS

Why are AI/ML Important for Cybersecurity?

AI/ML Adoption
Drivers • Complex threats • Advanced persistent threats • New malware vectors • non-PEs • ransomware • Social engineering • … plus many others • Humans are slow • Humans are expensive
Benefits • Automation • Scaling and velocity • Faster response and protection • Personalization • Learn to adapt to me, unobtrusively • Usability • Cross-domain protection • Firewalls talking to email servers and endpoints

Are There Downsides to Using AI/ML? Poor architecture & unintended side effects • Detectors A & B independent • New system introduced • creates feedback between A/B • inadvertent, unknown? • New sample arrives: • A → 2/10, B → 1/10 • … but B sees ΔA, B → 3/10 • … but A sees ΔB, A → 4/10 • … and so on [Diagram labels: data, A, B]
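The feedback effect on this slide can be sketched as a small simulation. This is only an illustration: the update rule (each detector adds the other's most recent score change to its own score, capped at 10) is an assumption chosen because it reproduces the slide's numbers, and `simulate_feedback` is a hypothetical name, not anything from the talk.

```python
def simulate_feedback(a_score, b_score, max_score=10, rounds=4):
    """Toy model of two detectors that each react to the other's score change.

    Assumed rule (not from the talk): a detector adds the other detector's
    most recent score change (delta) to its own score, capped at max_score.
    Returns the sequence of (detector, score) updates.
    """
    history = [("A", a_score), ("B", b_score)]
    delta_a = a_score  # A moved from 0 to its initial score
    for _ in range(rounds):
        # B observes A's latest change and raises its own score.
        new_b = min(max_score, b_score + delta_a)
        delta_b = new_b - b_score
        b_score = new_b
        history.append(("B", b_score))
        # A observes B's latest change and raises its own score.
        new_a = min(max_score, a_score + delta_b)
        delta_a = new_a - a_score
        a_score = new_a
        history.append(("A", a_score))
    return history


# Start from the slide's initial scores: A = 2/10, B = 1/10.
history = simulate_feedback(2, 1)
for detector, score in history:
    print(f"{detector} -> {score}/10")
```

Under this assumed rule neither detector ever sees new evidence, yet both scores climb to the cap, which is the unintended side effect the slide warns about.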

Are There Downsides to Using AI/ML? ML Technical Debt • Traditional software • Source code → program • Machine learning software • Source code + data → program • Data are embedded, opaque • Reconstruction is hard or impossible • ML data versioning is hard • Introduces data and system dependencies [Diagram labels: source code, data, ML program; stateful, complex system]
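One way to make the "source code + data → program" dependency visible is to record a fingerprint of both inputs alongside every trained model. The sketch below is a minimal illustration under stated assumptions: `model_fingerprint` is a hypothetical helper, the training data is assumed to fit in memory as a list of text rows, and row order is assumed not to matter.

```python
import hashlib


def model_fingerprint(source_code, training_rows):
    """Return a hex digest that changes if either the code or the data changes.

    source_code: bytes of the training program (hypothetical input).
    training_rows: iterable of str rows; sorted first, so row order is
    assumed to be irrelevant to the trained model.
    """
    data_hash = hashlib.sha256()
    for row in sorted(training_rows):
        data_hash.update(row.encode("utf-8"))
        data_hash.update(b"\n")  # row separator, avoids ambiguous joins

    combined = hashlib.sha256()
    combined.update(hashlib.sha256(source_code).digest())
    combined.update(data_hash.digest())
    return combined.hexdigest()


code = b"def score(x): return x > 0.5"
v1 = model_fingerprint(code, ["good.exe,0", "bad.exe,1"])
v2 = model_fingerprint(code, ["good.exe,0", "bad.exe,1", "new.exe,1"])
print(v1 != v2)  # same code, different data: a different "program"
```

A fingerprint only detects change; it does not solve reconstruction, which still requires keeping the data itself.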

Adversaries Have AI/ML, Too!! Adversarial Machine Learning • Model extraction • Adversary learns an approximate model using fewest possible queries • Poisoning • Adversary biases machine learning model through interaction • Adversarial examples • Crafting inputs to defeat ML [Figure labels: data, model; panda + perturbation → gibbon] I. J. Goodfellow, J. Shlens, C. Szegedy, “Explaining and Harnessing Adversarial Examples,” ICLR 2015.
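The adversarial-example bullet can be illustrated with the fast gradient sign method from the cited Goodfellow et al. paper, applied here to a toy logistic-regression detector rather than an image network. The weights, input, and epsilon are made-up values for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability that x is malicious under a linear logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(weights, bias, x, label, epsilon):
    """Fast gradient sign method: x_adv = x + epsilon * sign(dLoss/dx).

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input is (p - label) * weights.
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical detector and a sample it correctly flags as malicious (label 1).
weights, bias = [2.0, -1.0, 0.5], 0.0
x = [0.3, 0.1, 0.2]
x_adv = fgsm(weights, bias, x, label=1, epsilon=0.25)

print("original score:   ", round(predict(weights, bias, x), 3))
print("adversarial score:", round(predict(weights, bias, x_adv), 3))
```

Each feature moves by at most epsilon, yet the score crosses the 0.5 decision boundary, which is the same mechanism behind the panda-to-gibbon figure.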

Advanced Behavioral Attacks Microsoft Real-Time Translation (2012) https://www.youtube.com/watch?v=Nu-nlQqFCKg • Imagine a business email compromise attack • You get an email from the CFO to wire payment for an invoice • The email is written like your CFO • natural language processing from emails • You’re suspicious and call the CFO • But your phone is compromised • You’re connected to an adversary who has a speechbot with your CFO’s voice • Science fiction or possible today?

AI/ML at Symantec

The Symantec AI/ML Story • Define the goal: • Perfect, ubiquitous, unobtrusive protection for every context of our customers’ digital lives. • Invest in AI/ML resources • Center for Advanced Machine Learning (CAML), ~20 PhDs • Capitalize on unique telemetry assets (data!) • Automate and scale → ML everywhere • Improve protection using AI/ML

Doing AI & ML (Correctly) is Hard!
BOUNTIFUL DATA • 9 trillion rows of security data • 4.5B queries processed daily from 175M endpoint devices • 2B emails scanned daily • 1B previously unseen web requests scanned daily • Outputs from other systems & products
DIMENSIONS • Static attributes • Dynamic behaviors • Reputation • Relations • Sequential state
ADVANCED TECHNIQUES • Ensembling • Boosting • Sequential Learning • Deep Learning • Automation at Scale
LEADING EXPERTS • Dedicated org of recognized machine learning experts • World-renowned attack investigation team • Centuries of combined ML experience

Optimize for Outcomes [Figure: ROC curves plotting True Positives (TP) against False Positives (FP), both axes 0 to 100%; curves labeled AUC = 0.83 and AUC = 0.75 (0 ≤ AUC ≤ 1.0); the region below 2% FP is marked “useful”] • This is an ROC curve • Shows the TP/FP tradeoff • Create one for any classifier • How good is this detector? • Textbook: AUC, area under the curve = 0.83 • 0.83 is good, but… • Only the region with low FPs is useful to customers • The blue ROC is better in the real world!
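The point about the low-FP region can be made concrete by comparing full AUC with the area under only the low-false-positive part of the curve. The sketch below uses eight made-up samples and a hypothetical 25% FP cutoff (the slide's figure marks 2%; a toy dataset this small cannot resolve that). Function names are illustrative, not from the talk.

```python
def roc_points(scores, labels):
    """Return ROC points (fpr, tpr), sweeping the threshold from high to low."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def partial_auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for fpr in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: 4 malicious (1) and 4 benign (0) samples scored by two detectors.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
detector_1 = [0.8, 0.7, 0.6, 0.5, 0.9, 0.4, 0.3, 0.2]  # top-ranked sample is benign
detector_2 = [0.9, 0.8, 0.4, 0.3, 0.7, 0.6, 0.5, 0.2]  # top two samples are malicious

for name, scores in [("detector_1", detector_1), ("detector_2", detector_2)]:
    pts = roc_points(scores, labels)
    print(name, "AUC:", partial_auc(pts), "area below 25% FP:", partial_auc(pts, 0.25))
```

In this toy data detector_1 has the higher full AUC, but detector_2 has more area in the low-FP region, so it catches more threats before raising its first false alarm.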

Tackle Fundamental Problems. As Services. Charlatan – String Scoring Service • One HTTP RESTful ML service • containerized for deployment • available for products • available for experimentation • Deep learning on sequences • String scoring is really useful! • Malicious package names & filenames • DGA identification • Phishing domains • Adult content URLs • … and more

Services Everywhere! • Shift: sequential changepoint detection • Forester: automagic, optimal decision tree ensembles • Sifter: robust, automated feature selection • Dolphin: computer vision phishing website detection • Murdoch: behavioral & sequential anomaly detection • Multirater: unsupervised ground truth labeling • Lyrebird: active learning for detecting targeted spearphishing

Keep the Customer in Mind Endpoint Static Protection • Code name “Sapient” • Goal: upgrade efficacy • better detections • fewer false positives • How? • Not techniques: we still use lots of tree ensembles → Better data, experiment design, optimization [Chart: TYPICAL vs. WORLD-CLASS. True positives: 93% vs. 99%+ (better detection). False positives: 1% vs. 0.1% (fewer false positives)]

Practice Good Practices Pro vs. Joe: Best Practices Practiced Well • Scholarship – literature surveys, measurement, build on other work • Exploratory data analysis • Strong imputing • Sampling for class imbalance and experiment design • Hyperparameter optimization • Encoding & embedding for feature engineering • Feature selection • Model selection • Ensembling techniques • Thresholdout for reusable holdouts All of this is table stakes … before we do “hard” ML … or fancy “new” stuff. This can be difficult for engineers and data scientists.
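Of the practices listed, Thresholdout is probably the least familiar, so here is a simplified sketch of the idea from Dwork et al.'s reusable-holdout work: answer a query from the training set unless it disagrees with the holdout by more than a noisy threshold, and only then reveal a noised holdout value. The noise scales and default parameters are assumptions for illustration, and the budget accounting of the full algorithm is omitted.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def thresholdout(train_value, holdout_value, threshold=0.04, sigma=0.01, rng=None):
    """Simplified Thresholdout for a single query (e.g. a model's accuracy).

    train_value / holdout_value: the same statistic measured on the training
    set and on the holdout set. The holdout is consulted only through a noisy
    comparison, which limits how much repeated queries can overfit to it.
    """
    rng = rng or random.Random()
    eta = laplace(2 * sigma, rng)  # noise on the threshold (assumed scale)
    if abs(train_value - holdout_value) > threshold + eta:
        # Training estimate looks overfit: return a noised holdout estimate.
        return holdout_value + laplace(sigma, rng)
    # Estimates agree: return the training value and reveal nothing new.
    return train_value


rng = random.Random(0)
print(thresholdout(0.80, 0.81, rng=rng))  # estimates agree
print(thresholdout(0.95, 0.70, rng=rng))  # training estimate is overfit
```

The design choice worth noting is that the analyst never sees the raw holdout value, which is what allows the same holdout set to be reused across many rounds of model selection.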

Practice Good Practices Pro vs. Joe: Best Practices Practiced Well (highlights) • Scholarship – literature, measurements, past work • Sampling and experiment design • Ensembling techniques – combining models

Closing Time

The Future of AI & ML in Cybersecurity (timeline from now to future) • Superpowers for analysts • hunting for targeted spearphishing attacks 100x faster • Threat detection systems that self-evolve • Programs that understand other program binaries • Real-time conversation monitoring for social engineering, cyberbullying, fake news, help, etc. • Predictive protection • AI for fuzzing, bugs, exploits & zero days

Key Takeaways • AI/ML is real, it’s here and it’s disruptive! • Lots of opportunities and challenges • The “terminator wars” of the future will play out at a scale, speed and cost that humans cannot match. Deep resources (cash, expertise, data, systems) will be required table stakes. • AI/ML benefits bad actors, too • The right expertise and experience are essential • AI/ML researchers + data scientists + security professionals • Systems, data and integration are differentiators • Symantec has the right pieces to play in this area

Where Do You Go from Here? • Level up your organization’s AI/ML skills • Recruit, hire, train, borrow ASAP: you’re already behind! • Treat your ML systems and features as attack surfaces • Raise the bar on best practices when using ML • Work hard to reduce technical debt and unintended side effects! • Buy/use integrated solutions vs. point products • Preferably those which provably and usably leverage AI/ML

Thank you! For follow-ups: Andrew B. Gardner, Ph.D., Senior Technical Director, Symantec Corporation • 470-330-2435 • andrew_gardner@symantec.com • @andywocky
