
2. Big Data and Basic Statistics



  1. ENEE 759D | ENEE 459D | CMSC 858Z 2. Big Data and Basic Statistics Prof. Tudor Dumitraș, Assistant Professor, ECE, University of Maryland, College Park http://ter.ps/759d https://www.facebook.com/SDSAtUMD

  2. Today’s Lecture • Where we’ve been • Security principles vs. security in practice • Introduction to security data science • Where we’re going today • Big Data and basic statistics • Pilot project: proposal, approach, expectations • Where we’re going next • Basic statistics (part 2)

  3. The Trouble With Big Data • Last time we talked about the data deluge • Data created/reproduced in 2010: 1,200 exabytes • Data collected to find the Higgs boson: 1 gigabyte / s • Yahoo: 200 petabytes across 20 clusters • Security: • Global spam in 2011: 62 billion messages / day • Malware variants created in 2011: 403 million • We can store all this data (1 GB costs ~6¢) • Analyzing the data can produce remarkable insights How to analyze multi-TB data sets?

  4. Challenges for Dealing with Big Data • Big Data is hard to move around • Engineers must grasp parallel processing techniques • To access 1 TB in 1 min, must distribute data over 20 disks • MapReduce? Parallel DB? Dryad? Pregel? OpenMPI? PRAM? • Engineers must understand how to interpret data correctly
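
As a toy single-machine illustration of the map/reduce pattern behind several of these frameworks (a minimal sketch in R, not an implementation of any of them):

    # Map/reduce word count on one machine; real frameworks shard both
    # steps across a cluster. mc.cores > 1 requires a Unix-like OS.
    library(parallel)
    lines <- c("spam spam ham", "ham eggs spam")
    # Map: split each line into words, one task per input line
    words <- mclapply(lines, function(l) strsplit(l, " ")[[1]], mc.cores = 2)
    # Reduce: group identical words and count them
    counts <- table(unlist(words))
    counts   # eggs: 1, ham: 2, spam: 3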

  5. Processing Data in Parallel • How big is ‘Big Data’? (data volume) • Real answer: it depends • When your manager asks, commonly cited answers range from 1 TB to 5-8 TB to 10-20 TB • Parallelism does not reduce asymptotic complexity • An O(N log N) algorithm is still O(N log N) when run in parallel on K machines • But the constants are divided by K (and K can exceed 1000): the total work is unchanged, while the wall-clock time drops roughly K-fold

  6. Data Collection Rate • Sometimes the data collection rate is too high (data velocity) • It may be too expensive to store all the data • The latency of data processing may not support interactivity • Example: there are 600 million collisions per second in the Large Hadron Collider at CERN • This would amount to collecting ~1 PB/s (David Foster, CERN) • They only record one in 10^13 (ten trillion) collisions (~100 MB/s – 1 GB/s) • Techniques for dealing with data velocity (see the sampling sketch below) • Sampling (as in the LHC) • Stream processing • Compression (e.g. Snappy, QuickLZ, RLE) • In some cases operating on lightly compressed data reduces latency!
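
A minimal sketch of one such sampling technique, reservoir sampling, in R (an illustrative example, not CERN's actual trigger system): it maintains a uniform random sample of k items from a stream of unknown length using O(k) memory.

    # Keep a uniform sample of k items from a stream: one pass, O(k) memory
    reservoir_sample <- function(stream, k) {
      n <- length(stream)
      if (n <= k) return(stream)
      reservoir <- stream[1:k]                  # fill the reservoir first
      for (i in (k + 1):n) {
        j <- sample.int(i, 1)                   # uniform position in 1..i
        if (j <= k) reservoir[j] <- stream[i]   # keep item i w.p. k/i
      }
      reservoir
    }
    reservoir_sample(1:1000000, 10)   # 10 items, each equally likely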

  7. The Curse of Many Data Formats • Data comes from many sources and in many formats, often not standardized or even documented (data variety) • This is also known as the ‘data integration problem’ • Example: it is difficult for security products to analyze all the relevant data sources • A good approach: schema-on-read (see the sketch below) • The DB way: data loaded must have a schema (columns, data types, constraints) • In practice, enforcing a schema on load means that some data is discarded • The MapReduce way: store raw data, parse when analyzing • “In 84% of [targeted attacks between 2004-2012] clear evidence of the breach was present in local log files.” (DARPA ICAS/CAT BAA, 2013)
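
A minimal schema-on-read sketch in R, using a hypothetical four-field log format (made up for illustration): the raw lines are stored untouched, and structure is imposed only when the analysis runs.

    # Raw log lines stored as-is; nothing is rejected at load time
    raw <- c("2013-09-10 10.0.0.5 LOGIN ok",
             "2013-09-10 10.0.0.9 LOGIN fail",
             "a malformed line a load-time schema would have rejected")
    fields <- strsplit(raw, " ")
    # Parse on read: lines that fit this schema become rows; the rest
    # remain in 'raw', available to a later analysis with another parser
    ok <- sapply(fields, length) == 4
    events <- do.call(rbind, lapply(fields[ok], function(f)
      data.frame(date = f[1], host = f[2], action = f[3], status = f[4])))
    events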

  8. Junk Data is a Reality • Data Quality (also called information quality, data veracity) • Can the data be trusted? • Example: information on vulnerabilities and attacks from Twitter • Is there inherent uncertainty in the values recorded? • Example: anti-virus (AV) detections are often heuristic, not black-and-white • Does the data collection procedure introduce noise or biases? • Example: data collected using an AV product is from security-minded people

  9. Attributes of Big Data • The 3 Vs of Big Data (source: ‘Challenges & Opportunities with Big Data,’ SIGMOD Community Whitepaper, 2012) • Data Volume: the size of the data • Data Velocity: the data collection rate • Data Variety: the diversity of sources and formats • One more important attribute • Data Quality: • Are any data items corrupted or lost (e.g. owing to errors while loading)? • Is the data uncertain or unreliable? • What is the statistical profile of the data set? (e.g. distribution, outliers) • You must understand how to interpret data correctly

  10. Exploratory Data Analysis What does the data look like? (basic inspection) • It is often useful to inspect the data visually (e.g. open it in Excel) • Example: road test data from the 1974 Motor Trend magazine (see the inspection commands below)
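
This road-test data ships with R as the built-in mtcars data frame, so a first inspection takes one command per question:

    head(mtcars)      # first six rows: mpg, cyl, disp, hp, ...
    str(mtcars)       # 32 observations of 11 numeric variables
    summary(mtcars)   # per-column min, quartiles, mean, max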

  11. Basic Data Visualization What does the data look like? (basic inspection) • Scatter plot (Matlab: plot, scatter; R: ggplot(…) + geom_point()) • Determines the relationship between two variables • Reveals trends and clustering in the data • Do not rely on colors alone to distinguish multiple data groups (e.g. cylinders) • Watch out for overplotting • [Figure: scatter plot example from Kanich et al., Spamalytics]
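
A minimal version of such a scatter plot, using the mtcars data from the previous slide (not the Spamalytics figure itself); cylinder count is mapped to shape as well as color so the groups remain distinguishable without color:

    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg,
                       colour = factor(cyl), shape = factor(cyl))) +
      geom_point(size = 3) +
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
           colour = "Cylinders", shape = "Cylinders")
    # Against overplotting: add alpha = 0.3, or use geom_jitter()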

  12. Stem Plots and Histograms What are the most frequent values in the data? • Stem-and-leaf plot (R: stem) • Shows the bins with the most values, along with the values themselves • Less popular since the dawn of the computer age, but effective:

    10 | 44
    12 | 3
    14 | 3702258
    16 | 438
    18 | 17227
    20 | 00445
    22 | 88
    24 | 4
    26 | 03
    28 |
    30 | 44
    32 | 49

• Histogram (R: ggplot(…) + geom_histogram()) • Pay attention to the bin width (see the examples below)
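
The stem plot above is exactly what R's stem produces for the mpg column of mtcars; both plots take one command each:

    stem(mtcars$mpg)                 # reproduces the display above
    library(ggplot2)
    ggplot(mtcars, aes(x = mpg)) +
      geom_histogram(binwidth = 2)   # try other binwidths; the apparent
                                     # shape can change dramatically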

  13. Statistical Distributions What does the data look like? (more rigorous) • Probability density function (PDF) of the values you measure • PDF(x) measures how likely the metric is to take values near x (for continuous data this is a density, not a probability) • Estimation from empirical data (Matlab: ksdensity; R: density) • Cumulative distribution function (CDF) • CDF(x) is the probability that the metric takes a value less than x • Estimation (R: ecdf) • [Figure: an estimated PDF with the tail of the distribution marked, and a CDF with the 25th percentile, median, and 75th percentile marked]
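
Both estimates in R, again on mtcars$mpg:

    d <- density(mtcars$mpg)   # kernel density estimate of the PDF
    plot(d)
    F <- ecdf(mtcars$mpg)      # empirical CDF, returned as a function
    plot(F)
    F(20)                      # fraction of cars with mpg below 20 (0.5625)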

  14. Summary Statistics What does the data look like? (in summary) • Measures of centrality • Mean = sum / length (mean) • Median = half the measured values are below this point (median) • Mode = measurement that appears most often in the dataset • [Figure: a skewed distribution with the mode, median, and mean marked] • “80% of analytics is sums and averages.” (Aaron Kimball, WibiData) • Measures of spread • Range = maximum – minimum (range) • Standard deviation (σ) (Matlab: std; R: sd) • Coefficient of variation = σ / mean • Independent of the measurement units
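
Computing these on mtcars$mpg; base R has no built-in statistical mode, so a small helper (stat_mode, a name chosen here for illustration) is defined:

    x <- mtcars$mpg
    mean(x)                       # 20.09
    median(x)                     # 19.2
    stat_mode <- function(v) as.numeric(names(which.max(table(v))))
    stat_mode(x)                  # most frequent value (ties: first wins)
    diff(range(x))                # range = max - min = 23.5
    sd(x)                         # standard deviation, ~6.03
    sd(x) / mean(x)               # coefficient of variation, unit-free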

  15. Percentiles and Outliers What does the data look like? (in summary) • Percentiles • Nth percentile: the value X such that N% of the measured samples are less than X • The median is the 50th percentile • The 25th and 75th percentiles are also called the 1st and 3rd quartiles (Q1 and Q3), respectively • Matlab: prctile; R: quantile • The “five number” summary of a data set: <min, Q1, median, Q3, max> • Outliers • “Unusual” values, significantly higher/lower than the other measurements • Must reason about them: measurement error? Heavy-tailed distribution? An interesting (unexplained) phenomenon? • Simple detection tests (see the sketch below): • 3σ test • 1.5 * IQR test • R package outliers • The median is more robust to outliers than the mean
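
Percentiles and both simple outlier tests in base R, on mtcars$mpg:

    x <- mtcars$mpg
    quantile(x, probs = c(0.25, 0.5, 0.75))   # Q1, median, Q3
    fivenum(x)                                # the five number summary
    # 3-sigma test: flag values over 3 standard deviations from the mean
    x[abs(x - mean(x)) > 3 * sd(x)]
    # 1.5 * IQR test (the rule boxplot whiskers use)
    q <- quantile(x, c(0.25, 0.75))
    x[x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)]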

  16. Boxplots What does the data look like? (comparisons) • Box-and-whisker plots are useful for comparing probability distributions • The box represents the inter-quartile range (IQR) • The whiskers indicate the minimum and maximum values (in the default R and Matlab conventions, the most extreme values within 1.5 × IQR of the box, with points beyond drawn individually) • The median is also shown • Matlab: boxplot; R: ggplot(…) + geom_boxplot() • In 1970, the US Congress instituted a random selection process for the military draft • All 366 possible birth dates were placed in a rotating drum and selected one by one • The order in which the dates were drawn defined the priority for drafting • Boxplots show that men born later in the year were more likely to be drafted (from http://lib.stat.cmu.edu/DASL/Stories/DraftLottery.html)
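
A minimal comparison on mtcars (the draft-lottery analysis is analogous, with draft number grouped by birth month):

    library(ggplot2)
    ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
      geom_boxplot() +          # one box per group: median, IQR box,
                                # whiskers, and individual outlier points
      labs(x = "Cylinders", y = "Miles per gallon")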

  17. Pilot Project • Propose a security problem and a data set • Some ideas available on the web page • 1–2 page report due on September 18th (hard deadline) • Hypothesis • Volume: how much data (e.g. number of rows, columns, bytes) • Velocity: how fast is the data updated • Variety: how to access/analyze the data programmatically • JSON/CSV/DB dump, screen scrape, etc. • What language / library to use to read the data • Data quality • Statistical distribution? Outliers? Missing fields? Junk data? etc.

  18. A Note On Homework Submissions • Submit BibTeX files in plain text • No Word DOC, no RTF, no HTML! • Do not remove BibTeX syntax (e.g. the @ sign before entries) • This confuses my parser, and I may think that you did not submit the homework if I don’t catch the error! • Make your contribution statements summarize the key idea rather than describe the paper • Remembering “introduced a methodology” isn’t as useful to you later on as “the key idea is XYZ” • Most of you wrote precise statements for weaknesses – do the same for contributions!

  19. Discussion Outside the Classroom • Piazza: https://piazza.com/umd/fall2013/enee759d/home • Post project proposals, reports, and project reviews • Post questions and general discussion topics • DO NOT post paper reviews • DO NOT discuss your paper review with your classmates before the submission deadline, on Piazza or anywhere else • You are welcome (and encouraged) to discuss the papers after the deadline • Facebook: https://www.facebook.com/SDSAtUMD • For posting interesting articles, videos, etc. related to security data science • For spreading the word

  20. Review of Lecture • What did we learn? • Attributes of Big Data • Probability distributions • What’s next? • Paper discussion: ‘Spamalytics: An Empirical Analysis of Spam Marketing Conversion’ • Basic statistics (part 2) • Deadline reminder • Post pilot project proposal on Piazza by the end of the day (soft deadline) • Office hours after class • Second homework due on Tuesday at 6 pm
