ENEE 759D | ENEE 459D | CMSC 858Z Security Data Science (SDS) Prof. Tudor Dumitraș Assistant Professor, ECE, University of Maryland, College Park http://ter.ps/759d https://www.facebook.com/SDSAtUMD
Introducing Your Instructor Tudor Dumitraș Office: AVW 3425 Email: tdumitra@umiacs.umd.edu Course Website: http://ter.ps/759d Office Hours: Mon 2-3 pm
My Background • Ph.D. at Carnegie Mellon University • Research in distributed systems and fault-tolerant middleware • Worked at Symantec Research Labs • Built WINE platform for Big Data experiments in security • WINE currently used by academic researchers and Symantec engineers • Joined UMD faculty • Research and teaching on applied security and systems • Focus on solving security problems with data analysis techniques
SDS In A Nutshell • Course objectives • Ability to understand and interpret scholarly publications, to explain their key ideas, and to provide constructive feedback • Ability to apply some of these ideas in practice • Topics • Grading • 50% paper reviews and class participation • 50% projects
We Are Swimming in Data • Data created/reproduced in 2010: 1,200 exabytes • Data collected to find the Higgs boson: 1 gigabyte / s • Yahoo: 200 petabytes across 20 clusters • Security: • Global spam in 2011: 62 billion / day • Malware variants created in 2011: 403 million
Why So Much Data? • We can store it • 6¢ / GB • 29¢ / GB (SAS HDD) • We can generate it • Most data is machine-generated • Most malware samples are variants of other malware, generated automatically (repacking, obfuscation) What to do with all this data?
One. What Questions to Ask on a First Date? The Power of Big Data
If You Want to Know … Do my date and I have long-term potential? … ask: 275,000 user-submitted questions 34,260 real-world couples • Do you like horror movies? • Have you ever traveled around another country alone? • Wouldn't it be fun to chuck it all and go live on a sailboat? 3.7× Top 3 user-rated questions, about: • God • Sex • Smoking Likelihood of coincidence Psychology Data
Online Dating and Big Data • eHarmony • Analyzes hundreds of behavioral variables, most collected automatically • CTO: former search engineer at Yahoo! • OkCupid: "We do math to get you dates" • Founded by Harvard math & CS majors • PlentyOfFish: "Building this matching system was harder than [being] cited in the paper that won the Fields Medal" Source: CNN Money
Early 1900s: Most Factories Had Private Generators Source: Nicholas Carr Electricity was critical for business, but not widely available
Data analytics provide remarkable insight Applications in many disciplines Is he an engineer? Does she date engineers? Source: OkCupid
What Is Data Science? • Also known as … • Big Data analytics • Machine intelligence • Data-intensive computing • Data wrangling • Data munging • Data jujitsu Source: Drew Conway
Two. Improving Machine Translation: The Unreasonable Effectiveness of Data
2005 NIST Machine Translation Competition English-Arabic competition • Google’s first entry • None of the engineers spoke Arabic • Simple statistical approach • Trained using United Nations documents • 200 million translated words • 1 trillion monolingual words
For many hard problems there appears to be a threshold of sufficient data A. Halevy, et al., CACM 2009.
What is Security Data Science? • Also known as … • Security analytics • Surveillance analytics • Applying data science methods to security problems
Security Principles in 60 Seconds [J. Saltzer & M. Schroeder, SOSP 1973] • Economy of mechanism: Keep the protection mechanism as simple and small as possible • Fail-safe defaults: Base access decisions on permission rather than exclusion • Complete mediation: Check every access to every object • Open design: Do not keep the design secret • Separation of privilege: Require two keys to unlock, not one • Least privilege: Grant every program/user the least set of privileges necessary to complete the job • Least common mechanism: Minimize the amount of mechanism common to more than one user and depended on by all users • Psychological acceptability: Design interfaces for ease of use
Security in Practice (Source: C. Nachenberg, Symantec) • 1986: Simple computer viruses • Defense: anti-virus • 1990: Polymorphic viruses (decryption logic + encrypted malicious code) • Defense: “universal” decoder, emulation • 1995: Macro viruses • Defense: AV vendor cooperation, digital signatures for macros • 1999: Worms • Defense: Vulnerability-specific signatures • 2004: Web-based malware • Defense: behavior blocking • 2006: Auto-generated malware • Defense: reputation-based security • 2010 (but probably earlier): Targeted attacks (physical infrastructure, 0-day, etc.) • Defense: ??
Three. Understanding Zero-Day Attacks: The Need for Security Data Science
Zero-Day Attacks: Recent Examples Zero-day attack = cyber attack exploiting a software vulnerability before the public disclosure of the vulnerability • 2011: Attack against RSA • 2010: Stuxnet • 2009: Operation Aurora against Google
Price of Zero-Day Exploits on the Black Market The Economist, March 2013
The Elderwood Project Group with “seemingly unlimited” supply of zero-day exploits (Source: Symantec)
Zero-Day Attacks: Open Questions Decade-long open questions • How common are zero-day attacks? • How long can they remain undiscovered? • What happens after disclosure? Prior work: [Arbaugh 2000, Frei 2008, McQueen 2009, Shahzad 2012] [Figure: vulnerability timeline. Creation → exploit used in attacks (zero-day attack) → vulnerability disclosed (“day zero”) → security patch released → all hosts patched]
Zero-Day Attacks: Open Questions (cont’d) Decade-long questions: Why still open? • Rare events, hard to observe in small data sets • Need data analysis at scale [Figure: vulnerability timeline (creation, exploit used in attacks, vulnerability disclosed (“day zero”), security patch released, all hosts patched), time axis in weeks. Before disclosure: targeted attacks (rare events). After disclosure: large-scale attacks]
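The definition of a zero-day attack and the question "How long can they remain undiscovered?" can be made concrete with a small sketch. It assumes that, for each vulnerability, we know the date an exploit was first observed in the field and the public disclosure date. The function names and the two records are hypothetical and for illustration only; they are not taken from any real data set or from the course materials.

```python
from datetime import date


def zero_day_window(first_exploit_seen: date, disclosed: date) -> int:
    """Days between the first observed exploit and public disclosure.

    Positive: the exploit was in use before "day zero" (a zero-day attack).
    Zero or negative: exploitation was only observed on or after disclosure.
    """
    return (disclosed - first_exploit_seen).days


def is_zero_day(first_exploit_seen: date, disclosed: date) -> bool:
    """True if the exploit was observed strictly before public disclosure."""
    return zero_day_window(first_exploit_seen, disclosed) > 0


# Hypothetical records: (label, first exploit observed, disclosure date).
records = [
    ("VULN-A", date(2012, 1, 10), date(2012, 6, 1)),
    ("VULN-B", date(2012, 7, 15), date(2012, 7, 1)),
]

for label, seen, disclosed in records:
    days = zero_day_window(seen, disclosed)
    print(label, "zero-day" if is_zero_day(seen, disclosed) else "post-disclosure", days)
```

In practice the hard part is not this subtraction but obtaining a trustworthy "first observed" date, which is why the slides argue for data analysis at scale.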
Research in Security Data Science Challenge 1: Find the needle in the haystack • Example: Identify and measure zero-day attacks Challenge 2: Ensure generally applicable and repeatable results • The threat landscape changes frequently Challenge 3: Deal with new and advanced threats • Skilled and persistent hackers can bypass firewalls, anti-virus, password-protected systems, two-factor authentication, physical isolation […] 403 million new malware variants created in 2011 [Figure: variants (log scale, 10 to 10^5) over time in weeks relative to disclosure T0 (-100 to 150); targeted attacks before disclosure are rare events] Your thesis topic goes here
What is Security Data Science? (re-visited) • Systems knowledge: develop technologies needed to store and process massive data sets • Statistics & machine learning knowledge: analyze the data and extract information • Security knowledge: ask the right questions about cyber attacks • Data scientists are in high demand in the cybersecurity industry "Booz Allen may be recruiting more [data scientists] than Google or Facebook" (The Economist, June 2013)
Course Content • Introduction to Security Data Science • Hands-on emphasis – this is largely an unexplored research area • Team-based projects • Reviews of scholarly publications • No textbook • Specific things you can expect to learn • Selected topics in security • System skills: Experiment design, data analysis, scalability • Team skills: Cooperating to achieve your team goals • Speaking/writing skills: Presenting paper/project findings, providing constructive feedback
This is an Advanced Course • You are responsible for holding up your end of the educational bargain • I expect you to attend classes and to complete reading assignments • I expect you to learn how to analyze data and to try things out for yourself • I expect you to know how to find research literature on security topics • The required readings provide starting points • I expect you to manage your time • In general there will be one written assignment due before each lecture • Learning material in this course requires participation • This is not a sit-back-and-listen kind of course; class participation is required for understanding the material and makes up a part of your grade! • Different grading criteria for graduate and undergraduate students
Reading Assignments • Readings: 1-2 papers before each lecture • Not light reading – some papers require several readings to understand • For next time: C. Kanich et al., 'Spamalytics: An Empirical Analysis of Spam Marketing Conversion,' ACM CCS, 2008. • Check course web page (still in flux) for next readings and links to papers • Homework: review the papers you read using a defined template • Submit homework by email to tdumitra@umiacs.umd.edu • We might switch to a web-based submission system in the future • Due at 6 pm the evening before class • BibTeX template: Summary, Contributions, Weaknesses, Opinion (optional) • I will provide feedback on some of your written critiques; no email means your writeup is satisfactory • In-class discussion: stand up and talk about the papers • Volunteers are preferred • Students randomly selected if no volunteers
Discuss … Do my date and I have long-term potential? … ask: 275,000 user-submitted questions 34,260 real-world couples • Do you like horror movies? • Have you ever traveled around another country alone? • Wouldn't it be fun to chuck it all and go live on a sailboat? 3.7× Top 3 user-rated questions, about: • God • Sex • Smoking Likelihood of coincidence Psychology Data
Course Projects • Pilot project: two-week individual project • Propose a security problem and a data set that you could analyze to solve it • Some ideas are available on the web page • Conduct preliminary data analysis and write a report • Propose projects by September 9th (soft deadline) • Submit report by September 18th • Group project: ten-week group project • Deeper investigation of promising approaches • Submit written report and present findings during last week of class • 2 checkpoints along the way (schedule on the course web page) • Form teams and propose projects by September 30th • Peer reviews: review at least 2 project reports from other students • Use skills learned from paper reviews • Post project proposals, reports and reviews on Piazza
Pre-Requisite Knowledge • Good programming skills • Knowledge of languages commonly used in data analysis, like Matlab or R, is a plus • To brush up: ‘Data Analysis and Visualization with MATLAB for Beginners’ seminar, on September 12 at 5pm, Room 1110 Kim Engineering Building • Ability to come up to speed on advanced security topics • Covered in the paper readings • Basic knowledge of security (CMSC 414, ENEE 459C or equivalent) is a plus • Ability to come up to speed on data analytics • Lectures provide light-duty tutorials, but you will need to pick up the details as you go along
Policies • “Showing up is 80% of life” – Woody Allen • Participation in in-class discussions is required for full credit • You can get an “A” with a few missed assignments, but reserve these for emergencies (conference trips, waking up sick, etc.) • Notify the instructor if you need to miss a class, and submit your homework on time • UMD’s Code of Academic Integrity applies, modified as follows: • Complete your homework entirely on your own. After you hand in your homework, you are welcome (and encouraged) to discuss it with others • Discuss the problems and concepts involved in the project, but produce your own project implementation, report and presentation • Group projects are the result of team work • See class web site for the official version
Classroom Protocol • Please arrive on time; lecture begins promptly • I also promise to end on time • Handouts, readings and homework templates are posted on the class web page • Questions are encouraged • If you don’t understand, ask; probably other students are struggling too • Explain the content of your reading assignment, and the underlying reasoning, to the rest of the class • Your reasons don't have to be “right” – you just have to be able to explain them • There is no way to cover everything • If there is an interesting aspect that we do not cover in class, feel free to incorporate that in your projects
Grading Criteria • Straight scale: A≥90; B≥80; C≥70; D<70 • 50% Written paper critique and class discussion • 24 assignments x 2 points each + 2 points for this lecture • 50% Projects • 30 points for group project, 10 points for pilot project, 10 points for project reviews • 10% Subjective evaluation • Expectations • Graduate students: you can explain the contributions and weaknesses of the papers you read • Undergraduates: you demonstrate a general understanding of the papers • Unsatisfactory participation means: • You did not read the papers • You did not produce a working implementation for your project, or you do not understand how the implementation works
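The point arithmetic above can be checked with a short sketch. It covers only the two 50-point components and the straight scale. The slide does not say how the 10% subjective evaluation is combined with them, so the sketch leaves it out. The function names are hypothetical, and the all-or-nothing 2 points per assignment is an assumption made to keep the example small.

```python
def review_points(assignments_credited: int, first_lecture_credit: bool) -> int:
    """Paper critique and discussion points: 24 assignments x 2, plus 2 for this lecture."""
    if not 0 <= assignments_credited <= 24:
        raise ValueError("assignments_credited must be between 0 and 24")
    # Assumption: each credited assignment earns the full 2 points.
    return 2 * assignments_credited + (2 if first_lecture_credit else 0)


def project_points(group: float, pilot: float, peer_reviews: float) -> float:
    """Project points: up to 30 (group) + 10 (pilot) + 10 (project reviews)."""
    caps = ((group, 30), (pilot, 10), (peer_reviews, 10))
    for value, cap in caps:
        if not 0 <= value <= cap:
            raise ValueError(f"score {value} is outside the range 0..{cap}")
    return group + pilot + peer_reviews


def letter_grade(total: float) -> str:
    """Straight scale from the slide: A>=90; B>=80; C>=70; D<70."""
    if total >= 90:
        return "A"
    if total >= 80:
        return "B"
    if total >= 70:
        return "C"
    return "D"


# Example: three missed assignments and a slightly weaker group project.
total = review_points(21, True) + project_points(27, 10, 9)
print(total, letter_grade(total))
```

This matches the policy that a few missed assignments need not cost an "A", provided the project scores stay high.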
Review of Lecture • What did we learn? • Data analytics provide real benefits • Analyzing large data sets allows tackling long-standing hard problems • Difference between security principles and security in practice • Examples of security problems that require insights from large data sets • I want to emphasize • This is a systems course, not a pen-and-paper course • You will be expected to build a real, working data analysis tool • What’s next? • Basic statistics and experimental design • Pilot project: proposal, approach, expectations • Deadline reminder • Post pilot project proposal on Piazza by Monday (soft deadline) • First homework due on Sunday at 6 pm