
Looking at Data



  1. Looking at Data Dror Feitelson Hebrew University

  2. Disclaimer • No connection to www.lookingatdata.com • They have neat stuff – recommended • But we’ll just use very simple graphics

  3. The Agenda To promote the collection, sharing, and use of real data about computer systems, in order to ensure that our research is relevant to real-life situations (as opposed to doing research based on assumptions)

  4. Computer “Science” • Mathematics = abstract thought • Engineering = building things • Science = learning about the world • Observation • Measurement • Experimentation • The scientific method is also required for the study of complex computer systems (including complexity arising from humans)

  5. Example 1: The Top500 list

  6. The Top500 List • List of the 500 most powerful supercomputers in the world • As measured by Linpack • Started in 1993 by Dongarra, Meuer, Simon, and Strohmaier • Updated twice a year at www.top500.org • Contains data about vendors, countries, and machine types • Egos and politics in the top spots

  7. November 2002 list

  8. Top500 Evolution: Scalar vs. Vector 1993 – 1998: Number of vector machines plummets: MPPs instead of Crays

  9. Top500 Evolution: Scalar vs. Vector 1998 – 2003: Vector machines stabilize • Earth simulator • Cray X1

  10. Top500 Evolution: Scalar vs. Vector 2003 – 2007: Vectors all but disappear What happened?

  11. Top500 Evolution: Parallelism Most attention typically given to largest machines

  12. Top500 Evolution: Parallelism But let’s focus on the smallest ones: We need more and more proc’s to stay on the list

  13. Top500 Evolution: Parallelism The number of vector proc’s needed to stay on the list doubles every 18 months The number of microproc’s needed doubles every 2-3 years So microproc’s are improving faster Implication: in 2008 microprocessors finally closed the performance gap
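
A back-of-the-envelope check of this arithmetic, as a minimal sketch; the 16x initial per-processor gap and the exact doubling times are assumed values chosen for illustration, not taken from the Top500 data:

```python
import math

d_vector = 1.5  # years for the needed vector-proc count to double
d_micro = 2.5   # years for the needed microproc count to double
gap = 16.0      # hypothetical per-processor lead of vector proc's in 1993

# The per-processor gap shrinks by a factor of 2**(t/d_vector - t/d_micro)
# over t years, so it closes when that factor reaches the initial gap.
t = math.log2(gap) / (1 / d_vector - 1 / d_micro)
print(f"gap closes after ~{t:.0f} years, i.e. around {1993 + round(t)}")
```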

  14. Historical Perspective Figure from a 1994 report

  15. Top500 Evolution: Parallelism Need more proc’s to stay on list Implication: performance grows faster than Moore’s law

  16. Top500 Evolution: Parallelism Need more proc’s to stay on list = performance grows faster than Moore’s law Since 2003 slope increased due to slowing of micro improvements

  17. Top500 Evolution: Parallelism BTW: largest machines stayed flat for 7 years Everything else grew exponentially Implication: indicates difficulty in usage and control

  18. Example 1: The Top500 list • Example 2: Parallel workload patterns

  19. Parallel Workloads Archive • All large scale supercomputers maintain accounting logs • Data includes job arrival, queue time, runtime, processors, user, and more • Many are willing to share them (and shame on those who are not) • Collection at www.cs.huji.ac.il/labs/parallel/workload/ • Uses standard format to ease use
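
A minimal reader for such a log might look as follows, assuming the archive's Standard Workload Format: one job per line, 18 whitespace-separated fields, with ';' starting header comments. The field positions and the file name used here should be checked against the format definition on the archive site.

```python
def read_swf(path):
    """Parse a Standard Workload Format (SWF) log into a list of dicts."""
    jobs = []
    with open(path) as f:
        for line in f:
            if line.startswith(';') or not line.strip():
                continue  # skip header comments and blank lines
            fields = line.split()
            jobs.append({
                'submit': int(fields[1]),   # submit time (seconds)
                'wait': int(fields[2]),     # time spent in the queue
                'runtime': int(fields[3]),  # actual run time
                'procs': int(fields[4]),    # allocated processors
                'user': int(fields[11]),    # user ID
            })
    return jobs

jobs = read_swf('NASA-iPSC-1993-3.swf')  # hypothetical file name
```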

  20. NASA iPSC/860 trace

  21. Parallelism Assumptions • Large machines have thousands of processors • Cost many millions of dollars • So expected to be used for large-scale parallel jobs (Ok, maybe also a few smaller debug runs)

  22. Parallelism Data

  23. Parallelism Data On all machines 15-50% of jobs are serial Also very many small jobs Implication: bad news: small jobs may block out large jobs Implication: good news: small jobs are easy to pack

  24. Parallelism Data On all machines 15-50% of jobs are serial Also very many small jobs Majority of jobs use power of 2 nodes • No real application requirements • Hypercube tradition • We think in binary Implication: regardless of reason, reduces fragmentation
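
Both observations are easy to tally from a log; a sketch with toy processor counts standing in for real trace data:

```python
sizes = [1, 1, 4, 8, 1, 32, 2, 100, 64, 1]  # toy processor counts per job

def is_power_of_two(n):
    # n & (n - 1) clears the lowest set bit; 0 means only one bit was set
    return n > 0 and n & (n - 1) == 0

serial = sum(s == 1 for s in sizes) / len(sizes)
pow2 = sum(is_power_of_two(s) for s in sizes) / len(sizes)
print(f"serial jobs: {serial:.0%}, power-of-2 jobs: {pow2:.0%}")
```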

  25. Size-Runtime Correlation • Parallel jobs require resources in two dimensions: • A number of processors • For a duration of time • Assuming the parallelism is used for speedup, we can expect large jobs to run for less time • Important for scheduling, because job size is known in advance Potential implication: scheduling large jobs first also schedules short jobs first!
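
The expected correlation can be checked directly; a sketch using Pearson's r on toy data (statistics.correlation needs Python 3.10 or later):

```python
import statistics

sizes = [1, 2, 4, 8, 16, 32]               # toy job sizes (processors)
runtimes = [3600, 400, 900, 120, 600, 60]  # toy runtimes (seconds)

r = statistics.correlation(sizes, runtimes)  # Pearson's r, Python 3.10+
print(f"size-runtime correlation: r = {r:.2f}")
```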

  26. Size-Runtime Correlation Data

  27. “Distributional” Correlation • Partition jobs into two groups based on size • Small jobs (less than median) • Large jobs (more than median) • Find distribution of runtimes for each group • Measure fraction of support where one distribution dominates the other
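
A sketch of this measure, checking dominance only at the observed runtimes; this illustrates the idea and is not necessarily the exact procedure behind the figures:

```python
import statistics

jobs = [(1, 3600), (2, 500), (4, 900), (8, 1200), (16, 60), (32, 7200)]
median_size = statistics.median(size for size, _ in jobs)
small = [rt for size, rt in jobs if size <= median_size]
large = [rt for size, rt in jobs if size > median_size]

def cdf(sample, x):
    # empirical CDF: fraction of the sample at or below x
    return sum(v <= x for v in sample) / len(sample)

support = sorted(set(small + large))
dominated = sum(cdf(small, x) > cdf(large, x) for x in support)
print(f"small-job CDF dominates at {dominated}/{len(support)} support points")
```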

  28. “Distributional” Correlation Implication: large jobs first ≠ short jobs first (maybe even long first)

  29. Example 1: The Top500 list • Example 2: Parallel workload patterns • Example 3: “Dirty” data

  30. Beware Dirty Data • Looking at data is important • But is all data worth looking at? • Errors in data recording • Evolution and non-stationarity • Diversity between different sources • Multi-class mixtures • Abnormal activity • Need to select relevant data source • Need to clean dirty data

  31. Abnormality Example Some users are much more active than others So much so that they single-handedly affect workload statistics • Job arrivals (more) • Job sizes (modal?) Probably not generally representative Implication: we may be optimizing for user 2
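
Flagging such users is straightforward once jobs are attributed; a sketch with a toy job-to-user mapping and an arbitrary activity cutoff:

```python
from collections import Counter

job_users = [2, 2, 2, 2, 5, 7, 2, 9, 2, 2]  # toy: user ID of each job
for user, n in Counter(job_users).most_common():
    share = n / len(job_users)
    if share > 0.25:  # cutoff is arbitrary, for illustration only
        print(f"user {user} submitted {share:.0%} of all jobs")
```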

  32. Workload Flurries • Bursts of activity by a single user • Lots of jobs • All these jobs are small • All of them have similar characteristics • Limited duration (day to weeks) • Flurry jobs may be affected as a group, leading to potential instability (butterfly effect) • This is a problem with evaluation methodology more than with real systems

  33. Workload Flurries

  34. Instability Example Simulate scheduling of parallel jobs with EASY scheduler Use CTC SP2 trace as input workload Change load by systematically modifying inter-arrival times Leads to erratic behavior

  35. Instability Example Simulate scheduling of parallel jobs with EASY scheduler Use CTC SP2 trace as input workload Change load by systematically modifying inter-arrival times Leads to erratic behavior Removing a flurry by user 135 solves the problem Implication: using dirty data may lead to erroneous evaluation results
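
The load modification these simulations rely on is simple to state in code: scale every inter-arrival gap by a factor f, which changes the offered load by roughly 1/f. A minimal sketch:

```python
def rescale_arrivals(submit_times, f):
    """Return new submit times with every inter-arrival gap scaled by f."""
    out = [submit_times[0]]
    for prev, cur in zip(submit_times, submit_times[1:]):
        out.append(out[-1] + f * (cur - prev))
    return out

arrivals = [0, 10, 25, 100, 130]        # toy submit times (seconds)
print(rescale_arrivals(arrivals, 0.8))  # compress gaps for higher load
```

It is exactly this uniform rescaling that a flurry can destabilize, since all of the flurry's jobs shift together.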

  36. Example 1: The Top500 list • Example 2: Parallel workload patterns • Example 3: “Dirty” data • Example 4: User behavior

  37. Independence vs. Feedback • Modifying the offered load by changing inter-arrival times assumes an open system model • Large user population insensitive to system performance • Jobs are independent of each other • But real systems are often closed • Limited user population • New jobs submitted after previous ones terminate • This leads to feedback from system performance to workload generation

  38. Evidence for Feedback Implication: jobs are not independent, so modifying inter-arrivals is problematic

  39. The Mechanics of Feedback • If users perceive the system as loaded, they will submit fewer jobs • But what exactly do users care about? • Response time: how long they wait for results • Slowdown: how much longer than expected • Answer needed to create a user model that will react correctly to load conditions

  40. Data Mining • Available data: system accounting log • Need to assess user reaction to momentary condition • The idea: associate the user’s think time with the performance of the previous job • Good performance → satisfied user → continue work session → short think time • Bad performance → dissatisfied user → go home → long think time • “performance” = response time or slowdown
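
A sketch of this association, pairing each job's response time with the same user's gap until the next submittal; toy records stand in for an accounting log:

```python
from collections import defaultdict

# toy (user, submit, wait, runtime) records, times in seconds
jobs = [(7, 0, 5, 100), (7, 150, 2, 30), (7, 200, 0, 10), (9, 50, 60, 600)]

by_user = defaultdict(list)
for user, submit, wait, run in sorted(jobs, key=lambda j: j[1]):
    by_user[user].append((submit, wait, run))

pairs = []  # (response time of job i, think time before the next job)
for runs in by_user.values():
    for (s1, w1, r1), (s2, _, _) in zip(runs, runs[1:]):
        think = s2 - (s1 + w1 + r1)  # gap after the previous job ended
        if think >= 0:  # skip jobs submitted before the previous one ended
            pairs.append((w1 + r1, think))
print(pairs)
```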

  41. The Data Implication: response time is a much better predictor of user behavior

  42. Predictability = Locality • Predicting the future is good • Avoid constraints of on-line algorithms • Approximate performance of off-line algorithms • Ability to plan ahead • Implies a correlation between events • Application behavior characterized by locality of reference • User behavior characterized by locality of sampling

  43. Locality of Sampling Workload attributes are modeled by a marginal distribution But at different times the distributions may be quite distinct Implication: the notion that more data is better is problematic

  44. Locality of Sampling Workload attributes are modeled by a marginal distribution But at different times the distributions may be quite distinct Implication: the assumption of stationarity is problematic

  45. Locality of Sampling Workload attributes are modeled by a marginal distribution But at different times the distributions may be quite distinct Thus the situation changes with time Implication: locality is required to evaluate adaptive systems
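
Locality of sampling can be made visible by comparing time windows against the marginal distribution; a sketch with a toy trace that has three distinct phases:

```python
import statistics

runtimes = [10, 12, 11, 300, 320, 280, 15, 9, 14]  # toy trace, 3 phases
window = 3
overall = statistics.median(runtimes)
for i in range(0, len(runtimes), window):
    w = runtimes[i:i + window]
    print(f"window {i // window}: median {statistics.median(w)},"
          f" marginal median {overall}")
```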

  46. Example 1: The Top500 list • Example 2: Parallel workload patterns • Example 3: “Dirty” data • Example 4: User behavior • Example 5: Mass-count disparity

  47. Variability in Workloads • Changing conditions: locality of sampling, variability between different periods • Heavy-tailed distributions: unique “high weight” samples, samples may be so big that they dominate the workload

  48. File Sizes Example USENET survey by Gordon Irlam in 1993 Distribution of file sizes is concentrated around several KB

  49. File Sizes Example USENET survey by Gordon Irlam in 1993 Distribution of file sizes is concentrated around several KB Distribution of disk space spread over many MB This is mass-count disparity

  50. File Sizes Example Joint ratio of 11/89: 89% of files have 11% of bytes, while the other 11% of files have 89% of bytes (a generalization of the 20/80 and 10/90 principles)
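
The joint ratio can be computed by walking up the sorted sizes until the count fraction below a point plus the mass fraction below it reaches 1; on a tiny toy sample the crossover is coarse, but on real file-size data like the survey above it lands near the quoted 11/89:

```python
sizes = sorted([1, 2, 2, 3, 4, 5, 8, 1000, 4000, 20000])  # toy sizes (KB)
total = float(sum(sizes))
mass = 0.0
for i, s in enumerate(sizes, start=1):
    mass += s
    count_frac = i / len(sizes)
    if count_frac + mass / total >= 1:  # p% of items, ~(100-p)% of mass
        print(f"joint ratio ~ {count_frac:.0%}/{1 - count_frac:.0%}")
        break
```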
