ENEE 759D | ENEE 459D | CMSC 858Z
2. Big Data and Basic Statistics
Prof. Tudor Dumitraș, Assistant Professor, ECE, University of Maryland, College Park
http://ter.ps/759d | https://www.facebook.com/SDSAtUMD
Today’s Lecture • Where we’ve been • Security principles vs. security in practice • Introduction to security data science • Where we’re going today • Big Data and basic statistics • Pilot project: proposal, approach, expectations • Where we’re going next • Basic statistics (part 2)
The Trouble With Big Data • Last time we talked about the data deluge • Data created/reproduced in 2010: 1,200 exabytes • Data collected to find the Higgs boson: 1 gigabyte / s • Yahoo: 200 petabytes across 20 clusters • Security: • Global spam in 2011: 62 billion messages / day • Malware variants created in 2011: 403 million • We can store all this data (storage costs ~6¢ per GB) • Analyzing the data can produce remarkable insights How to analyze multi-TB data sets?
Challenges for Dealing with Big Data • Big Data is hard to move around • Engineers must grasp parallel processing techniques • To access 1 TB in 1 min, must distribute data over 20 disks • MapReduce? Parallel DB? Dryad? Pregel? OpenMPI? PRAM? • Engineers must understand how to interpret data correctly
Processing Data in Parallel • How big is ‘Big Data’? (data volume) • Real answer: it depends • When your manager asks, typical answers: 10-20 TB, 5-8 TB, 1 TB • Parallelism does not reduce asymptotic complexity • An O(N log N) algorithm is still O(N log N) when run in parallel on K machines • But the constants are divided by K (and K can exceed 1000)
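The constant-factor speedup can be sketched in a few lines (an illustrative Python example, not from the slides, which reference Matlab and R): splitting a sort across K workers leaves the total work at O(N log N), but the per-chunk sorts can run concurrently on K machines.

```python
import heapq

def parallel_sort_sketch(data, k=4):
    """Split data into k chunks, sort each independently (in a real
    deployment each chunk would go to a separate machine), then merge.
    Total work is still O(N log N); only the constant shrinks."""
    n = len(data)
    chunk_size = (n + k - 1) // k
    chunks = [sorted(data[i:i + chunk_size]) for i in range(0, n, chunk_size)]
    return list(heapq.merge(*chunks))

print(parallel_sort_sketch([5, 3, 8, 1, 9, 2, 7], k=3))  # -> [1, 2, 3, 5, 7, 8, 9]
```

The final k-way merge is sequential here; a real system (e.g. the MapReduce shuffle) distributes that step as well.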
Data Collection Rate • Sometimes the data collection rate is too high (data velocity) • It may be too expensive to store all the data • The latency of data processing may not support interactivity • Example: There are 600 million collisions per second in the Large Hadron Collider at CERN • This would amount to collecting ~1 PB/s (David Foster, CERN) • They only record one in 10^13 (ten trillion) collisions (~100 MB/s - 1 GB/s) • Techniques for dealing with data velocity • Sampling (as in the LHC) • Stream processing • Compression (e.g. Snappy, QuickLZ, RLE) • In some cases operating on lightly compressed data reduces latency!
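One standard way to sample a high-velocity stream with fixed memory is reservoir sampling (Algorithm R). This Python sketch illustrates the sampling idea listed above; it is not the LHC's actual trigger mechanism, which selects "interesting" collisions rather than uniform ones.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown length, using O(k) memory (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k/(i+1)
    return sample

random.seed(0)
print(reservoir_sample(range(1_000_000), 5))
```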
The Curse of Many Data Formats • Data comes from many sources and in many formats, often not standardized or even documented (data variety) • This is also known as the ‘data integration problem’ • Example: It is difficult for security products to analyze all the relevant data sources • A good approach: schema-on-read • The DB way: data loaded must have a schema (columns, data types, constraints) • In practice, enforcing a schema on load means that some data is discarded • The MapReduce way: store raw data, parse when analyzing In 84% of [targeted attacks between 2004-2012] clear evidence of the breach was present in local log files. DARPA ICAS/CAT BAA, 2013
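The schema-on-read idea can be sketched as follows (a hypothetical Python example: raw lines are stored untouched, and an assumed schema of host name plus byte count is applied only at query time, so junk lines are skipped per query instead of being discarded at load).

```python
import json

raw_log = [
    '{"host": "a", "bytes": 512}',
    'not json at all',                                # junk line survives in raw storage
    '{"host": "b", "bytes": 2048, "extra": "field"}', # extra fields are harmless
]

def parse_on_read(lines):
    """Schema-on-read: keep raw data as-is; apply the schema only when
    analyzing, so malformed records are handled by each query."""
    for line in lines:
        try:
            rec = json.loads(line)
            yield rec["host"], int(rec["bytes"])
        except (ValueError, KeyError):
            continue  # skip for this query; the raw line is not lost

print(list(parse_on_read(raw_log)))  # -> [('a', 512), ('b', 2048)]
```

A schema-on-write system would have rejected or silently dropped the junk line at load time; here a later analysis with a different schema can still revisit it.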
Junk Data is a Reality • Data Quality (also called information quality, data veracity) • Can the data be trusted? • Example: information on vulnerabilities and attacks from Twitter • Is there inherent uncertainty in the values recorded? • Example: anti-virus (AV) detections are often heuristic, not black-and-white • Does the data collection procedure introduce noise or biases? • Example: data collected using an AV product is from security-minded people
Attributes of Big Data • The 3 Vs of Big Data (source: ‘Challenges & Opportunities with Big Data,’ SIGMOD Community Whitepaper, 2012) • Data Volume: the size of the data • Data Velocity: the data collection rate • Data Variety: the diversity of sources and formats • One more important attribute • Data Quality: • Are any data items corrupted or lost (e.g. owing to errors while loading)? • Is the data uncertain or unreliable? • What is the statistical profile of the data set? (e.g. distribution, outliers) • You must understand how to interpret data correctly
Exploratory Data Analysis What does the data look like? (basic inspection) • It is often useful to inspect the data visually (e.g. open it in Excel) • Example: road test data from the 1974 Motor Trend magazine
Basic Data Visualization What does the data look like? (basic inspection) • Scatter plot (Matlab: plot, scatter; R: ggplot(…) + geom_point()) • Determine a relationship between two variables • Reveal trends, clustering of data • Do not rely on colors alone to distinguish multiple data groups (e.g. cylinders) • Watch out for overplotting Source: Kanich et al., Spamalytics
Stem Plots and Histograms What are the most frequent values in the data? • Stem-and-leaf plot (R: stem) • Shows the bins with the most values and the values themselves • Less popular since the dawn of the computer age, but effective • Histogram (R: ggplot(…) + geom_histogram()) • Pay attention to the bin width
Example stem-and-leaf plot:
10 | 44
12 | 3
14 | 3702258
16 | 438
18 | 17227
20 | 00445
22 | 88
24 | 4
26 | 03
28 |
30 | 44
32 | 49
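The effect of bin width can be seen in a small sketch (Python for illustration; the values below are a hypothetical fuel-economy sample, not the actual Motor Trend data). The same data looks quite different binned at width 5 versus width 10.

```python
def histogram(values, bin_width):
    """Count values per bin; the bin key is the bin's lower edge.
    The choice of bin_width changes the apparent shape of the data."""
    counts = {}
    for v in values:
        b = int(v // bin_width) * bin_width
        counts[b] = counts.get(b, 0) + 1
    return dict(sorted(counts.items()))

# hypothetical fuel-economy sample (mpg)
data = [10.4, 10.4, 13.3, 14.3, 15.2, 16.4, 18.1, 19.2, 21.0, 21.4, 22.8, 30.4]
print(histogram(data, bin_width=5))   # -> {10: 4, 15: 4, 20: 3, 30: 1}
print(histogram(data, bin_width=10))  # -> {10: 8, 20: 3, 30: 1}
```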
Statistical Distributions What does the data look like? (more rigorous) • Probability density function (PDF) of the values you measure • PDF(x) gives the relative likelihood (density) that the metric takes the value x • Estimation from empirical data (Matlab: ksdensity; R: density) • Cumulative distribution function (CDF) • CDF(x) is the probability that the metric takes a value less than x • Estimation (R: ecdf) [Figure: an example PDF with the tail of the distribution marked, and the corresponding CDF annotated with the 25th percentile, median, and 75th percentile]
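A minimal Python sketch of the empirical-CDF estimate (the analogue of R's ecdf): sort the sample once, then answer CDF queries by counting how many values fall at or below x.

```python
import bisect

def ecdf(sample):
    """Empirical CDF: returns F where F(x) = fraction of sample values <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n

F = ecdf([1, 2, 2, 3, 5])
print(F(2))  # -> 0.6  (3 of the 5 values are <= 2)
print(F(4))  # -> 0.8
```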
Summary Statistics What does the data look like? (in summary) • Measures of centrality • Mean = sum / length (mean) • Median = half the measured values are below this point (median) • Mode = measurement that appears most often in the dataset • Measures of spread • Range = maximum – minimum (range) • Standard deviation (σ) (Matlab: std; R: sd) • Coefficient of variation = σ / mean • Independent of the measurement units [Figure: a skewed distribution with the mode, median, and mean marked] ‘80% of analytics is sums and averages.’ (Aaron Kimball, WibiData)
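The measures above can all be computed from the Python standard library (an illustrative sketch; the slides give the Matlab and R equivalents):

```python
from collections import Counter
from statistics import mean, median, stdev

def summarize(values):
    """Centrality (mean, median, mode) and spread (range, stdev,
    coefficient of variation) for a numeric sample."""
    mode, _ = Counter(values).most_common(1)[0]
    sd = stdev(values)  # sample standard deviation
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode,
        "range": max(values) - min(values),
        "stdev": sd,
        "coeff_of_variation": sd / mean(values),  # unit-free
    }

print(summarize([2, 4, 4, 5, 7, 9, 11]))
```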
Percentiles and Outliers What does the data look like? (in summary) • Percentiles • Nth percentile: X such that N% of the measured samples are less than X • The median is the 50th percentile • The 25th and 75th percentiles are also called the 1st and 3rd quartiles (Q1 and Q3), respectively • Matlab: prctile; R: quantile • The “five number” summary of a data set: <min, Q1, median, Q3, max> • Outliers • “Unusual” values, significantly higher/lower than the other measurements • Must reason about them: Measurement error? Heavy-tailed distribution? An interesting (unexplained) phenomenon? • Simple detection tests: • 3σ test • 1.5 * IQR • R package outliers • The median is more robust to outliers than the mean
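The 1.5 * IQR test can be sketched in Python (an illustration; note that the quartile convention below, median-of-halves, is one of several, and R's quantile() defaults to a different interpolation, so results can differ slightly):

```python
def quartiles(values):
    """Q1, median, Q3 via the median-of-halves convention."""
    xs = sorted(values)
    n = len(xs)
    def med(seq):
        m = len(seq)
        mid = m // 2
        return seq[mid] if m % 2 else (seq[mid - 1] + seq[mid]) / 2
    lower = xs[: n // 2]        # below the median
    upper = xs[(n + 1) // 2 :]  # above the median
    return med(lower), med(xs), med(upper)

def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([3, 4, 5, 5, 6, 6, 7, 40]))  # -> [40]
```

Unlike the 3σ test, this rule does not assume the data is roughly normal, which matters for the heavy-tailed distributions common in security data.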
Boxplots What does the data look like? (comparisons) • Box-and-whisker plots are useful for comparing probability distributions • The box represents the inter-quartile range (IQR) • In the simplest convention the whiskers indicate the maximum and minimum values (many tools instead cap them at 1.5 * IQR beyond the box) • The median is also shown • Matlab: boxplot; R: ggplot(..) + geom_boxplot() • In 1970, the US Congress instituted a random selection process for the military draft • All 366 possible birth dates were placed in a rotating drum and selected one by one • The order in which the dates were drawn defined the priority for drafting • Boxplots show that men born later in the year were more likely to be drafted From http://lib.stat.cmu.edu/DASL/Stories/DraftLottery.html
Pilot Project • Propose a security problem and a data set • Some ideas available on the web page • 1–2 page report due on September 18th (hard deadline) • Hypothesis • Volume: how much data (e.g. number of rows, columns, bytes) • Velocity: how fast is the data updated • Variety: how to access/analyze the data programmatically • JSON/CSV/DB dump, screen scrape, etc. • What language / library to use to read the data • Data quality • Statistical distribution? Outliers? Missing fields? Junk data? etc.
A Note On Homework Submissions • Submit BibTeX files in plain text • No Word DOC, no RTF, no HTML! • Do not remove BibTeX syntax (e.g. the @ sign before entries) • This confuses my parser and I may think that you did not submit the homework if I don’t catch the error! • Make your contribution statements more summarizing than descriptive • Remembering "introduced a methodology" isn't as useful to you later on as "the key idea is XYZ” • Most of you wrote precise statements for weaknesses – do the same for contributions!
Discussion Outside the Classroom • Piazza: https://piazza.com/umd/fall2013/enee759d/home • Post project proposals, reports, project reviews • Post questions and general discussion topics • DO NOT post paper reviews • DO NOT discuss your paper review with your classmates before the submission deadline, on Piazza or anywhere • You are welcome (and encouraged) to discuss the papers after the deadline • Facebook: https://www.facebook.com/SDSAtUMD • For posting interesting articles, videos, etc. related to security data science • For spreading the word
Review of Lecture • What did we learn? • Attributes of Big Data • Probability distributions • What’s next? • Paper discussion: ‘Spamalytics: An Empirical Analysis of Spam Marketing Conversion’ • Basic statistics (part 2) • Deadline reminder • Post pilot project proposal on Piazza by the end of the day (soft deadline) • Office hours after class • Second homework due on Tuesday at 6 pm