The Cloud Workloads Archive: A Status Report

The Cloud Workloads Archive: A Status Report. Special thanks to Ion for this opportunity! Alexandru Iosup, Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica. Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands. RADLab, University of California, Berkeley, USA.

Presentation Transcript


  1. The Cloud Workloads Archive: A Status Report • Special thanks to Ion for this opportunity! • Alexandru Iosup, Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica • Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands • RADLab, University of California, Berkeley, CA, USA

  2. About the Team • Recent work in performance • The Grid Workloads Archive (Nov 2006) • The Failure Trace Archive (Nov 2009) • Analysis of Facebook, Yahoo, and Google data center workloads (2009-2010) • The Peer-to-Peer Trace Archive (Apr 2010) • Tools: GrenchMark (workload-based grid benchmarking), RAIN • Speaker: Alexandru Iosup • Systems work: Tribler (P2P file sharing), Koala (grid scheduling), POGGI and CAMEO (massively multiplayer online gaming) • Performance evaluation of clouds for scientific computing: EC2 and three others • Team of 15+ active collaborators in NL, AT, RO, US • Happy to be in Berkeley until September

  3. Traces: Sine Qua Non in Computer Systems Research • “My system/method/algorithm is better than yours (on my carefully crafted workload)” • Unrealistic (trivial): prove that ‘prioritize jobs from users whose name starts with A’ is a good scheduling policy • Realistic? 85% of jobs are short, 15% are long • A major problem in computer systems research • Workload trace = a recording of real activity from a (real) system, often as a sequence of jobs/requests submitted by users for execution • Main use: compare and cross-validate new job and resource management techniques and algorithms • Major problem: obtaining and using real workload traces
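As a minimal illustration of the trace definition above, a workload trace can be held as a time-ordered list of job records and queried for simple properties such as the share of short jobs. This is a sketch under stated assumptions: the `Job` fields (`job_id`, `user`, `submit_time`, `run_time`) and the 1-hour short/long cutoff are hypothetical and are not taken from the CWA format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    """One trace record. Field names are hypothetical, not the CWA schema."""
    job_id: int
    user: str
    submit_time: float  # seconds since the start of the trace
    run_time: float     # seconds


def short_job_fraction(trace: List[Job], cutoff_s: float = 3600.0) -> float:
    """Fraction of jobs whose run time is below cutoff_s (an assumed threshold)."""
    if not trace:
        return 0.0
    short = sum(1 for job in trace if job.run_time < cutoff_s)
    return short / len(trace)


if __name__ == "__main__":
    # Toy trace: 17 one-minute jobs followed by 3 two-hour jobs.
    trace = [
        Job(i, f"user{i % 3}", 10.0 * i, 60.0 if i < 17 else 7200.0)
        for i in range(20)
    ]
    print(f"{short_job_fraction(trace):.0%} of jobs are short")  # 85% of jobs are short
```

Checking a claim such as "85% of jobs are short" against a real trace, rather than a hand-crafted one, is exactly the use case the slide argues for.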

  4. Previous Data Sharing Efforts • Critical datasets in computer science • Grid Workloads Archive • Failure Trace Archive • Peer-to-Peer Trace Archive • Game Trace Archive (soon) • … PWA, ITA, CRAWDAD, … • 1,000s of scientists • From theory to practice • Research question: are data center workloads unique (vs. GWA, PWA, …)? • [Figure: Dataset Size (1 GB, 10 GB, 100 GB, 1 TB, 1 TB/yr) by Year ('06, '09, '10, '11); labels include P2PTA and GamTA]

  5. Agenda • Introduction & Motivation • The Cloud Workloads Archive: What’s in a Name? • Format and Tools • Contents • Analysis & Modeling • Applications • Take Home Message

  6. The Cloud Workloads Archive (CWA): What’s in a Name? • CWA = a public collection of cloud/data center workload traces and of tools to process these traces; it allows us to: • Compare and cross-validate new job and resource management techniques and algorithms across various workload traces • Determine which (part of a) trace is most interesting for a specific job and resource management technique or algorithm • Design a general model for data center workloads and validate it with various real workload traces • Evaluate the generality of a particular workload trace, to determine whether results are biased towards that trace • Analyze the evolution of workload characteristics across long timescales, both intra- and inter-trace

  7. One Format Fits Them All • Flat format • Jobs and tasks • Summary (20 unique data fields) and Detail (60 fields) • Categories of information • Shared with GWA, PWA: time, disk, memory, network • Jobs/tasks that change their resource consumption profile • MapReduce-specific (two-thirds of the data fields) • [Figure: record types CWJ, CWJD, CWT, CWTD] • Reference: A. Iosup, R. Griffith, A. Konwinski, M. Zaharia, A. Ghodsi, I. Stoica, Data Format for the Cloud Workloads Archive, v.3, 13/07/10
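A flat format usually means one record per line with whitespace-separated fields, which keeps conversion and analysis tools simple. The sketch below parses such lines into dictionaries. It is only an assumption-laden illustration: the five field names in `FIELDS`, their order, and the use of `-1` as the "unknown value" marker are hypothetical placeholders, not the actual 20-field CWA summary layout from the v.3 format document.

```python
from typing import Dict, List, Optional

# Hypothetical subset of a flat job-summary record; the real CWJ layout differs.
FIELDS: List[str] = ["job_id", "submit_time", "wait_time", "run_time", "num_tasks"]
MISSING = "-1"  # assumed marker for an unknown value

Record = Dict[str, Optional[float]]


def parse_record(line: str) -> Optional[Record]:
    """Parse one whitespace-separated record; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    # Every field is treated as numeric here; unknown values become None.
    return {name: (None if raw == MISSING else float(raw))
            for name, raw in zip(FIELDS, values)}


def parse_trace(text: str) -> List[Record]:
    """Parse a whole trace file's contents, skipping non-record lines."""
    records = (parse_record(line) for line in text.splitlines())
    return [r for r in records if r is not None]


if __name__ == "__main__":
    sample = "# hypothetical header\n1 0 5 120 8\n2 30 -1 600 64\n"
    for record in parse_trace(sample):
        print(record)
```

Keeping the summary and detail records as separate flat files, as the slide describes, lets simple tools like this one read only the level of detail they need.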

  8. CWA Contents: Large-Scale Workloads • Table columns: Trace ID, System, Size (J/T/Obs), Period, Notes • CWA-03, CWA-07, CWA-05, CWA-01 (Facebook, Facebook 2, Facebook 4, eBay); remaining cells as extracted: 1.1M/-/-, 61K/10M/-, ?/?/-; 10d/2009, 5m/2009, 3m/02+2010, 23 Sep 2010; Full detail, Time & IO, Full detail • CWA-04: Facebook 3, ?/?/-, 10d/01-2010, Full detail • CWA-06, CWA-08, CWA-02 (Yahoo M, Twitter, Google 2); remaining cells as extracted: 28K/28M/-, 20d/2009, ~Full detail, Need help!, 25 Aug 2010 • CWA-09?: Google, 9K/177K/4M, 7h/2009, Coarse, Period • Tools • Convert to CWA format • Analyze and model automatically → Report

  9. Agenda • Introduction & Motivation • The Cloud Workloads Archive: What’s in a Name? • Format and Tools • Contents • Analysis & Modeling • Applications • Take Home Message

  10. Types of Analysis • Analysis type: basic statistics; evolution over time; correlations • Data breakdown: overall; by task type (M/R); by application type (ID); by user (ID); by duration (short) • Analysis focus: • Time-related: run, wait, and response time; bounded slowdown • Structure-related: number of tasks • IO-related: IO sizes and ratios • Status-related • System utilization-related • Counts/ratios
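The time-related metrics on this slide can be computed per job and then broken down by any key, such as user or task type. The sketch below uses the commonly cited bounded-slowdown definition, max((wait + run) / max(run, τ), 1); the slides do not state which threshold τ was used, so the 10-second default is an assumption, as are the dictionary keys `user`, `wait`, and `run`.

```python
from collections import defaultdict
from statistics import median
from typing import Callable, Dict, Hashable, List


def response_time(wait_s: float, run_s: float) -> float:
    """Response time = wait time + run time."""
    return wait_s + run_s


def bounded_slowdown(wait_s: float, run_s: float, tau_s: float = 10.0) -> float:
    """max(response / max(run, tau), 1); tau_s (assumed 10 s) stops tiny jobs from dominating."""
    return max(response_time(wait_s, run_s) / max(run_s, tau_s), 1.0)


def median_slowdown_by(jobs: List[dict], key: Callable[[dict], Hashable]) -> Dict[Hashable, float]:
    """Break the median bounded slowdown down by an arbitrary key (e.g. user ID)."""
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for job in jobs:
        groups[key(job)].append(bounded_slowdown(job["wait"], job["run"]))
    return {k: median(v) for k, v in groups.items()}


if __name__ == "__main__":
    # Hypothetical jobs: one waits 300 s to run for 1 s, which tau bounds to a slowdown of 30.1.
    jobs = [
        {"user": "a", "wait": 0.0, "run": 100.0},
        {"user": "a", "wait": 50.0, "run": 50.0},
        {"user": "b", "wait": 300.0, "run": 1.0},
    ]
    print(median_slowdown_by(jobs, key=lambda j: j["user"]))
```

Swapping the `key` function gives the other breakdowns on the slide, for example `lambda j: j["task_type"]` if records carry a (hypothetical) map/reduce label.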

  11. Types of Analysis: System Utilization, Over Time, By Run Time • Also 1 h, 10 min, … counting intervals • Study short- and long-range dependence (self-similarity) • Also job count, running/waiting counts, … • Study system utilization behavior
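Counting events per fixed interval turns a trace into a time series, which can then be tested for long-range dependence. The sketch below bins submission times (for example into 10-minute intervals) and estimates the Hurst exponent H with the aggregated-variance method, where Var(X^(m)) scales as m^(2H-2); H near 0.5 suggests no long-range dependence and H clearly above 0.5 suggests self-similarity. The slides do not say which estimator was used, so this choice of method and the aggregation levels are assumptions.

```python
import math
import random
from statistics import fmean, pvariance
from typing import List, Sequence


def counts_per_interval(times_s: Sequence[float], interval_s: float) -> List[int]:
    """Count events (e.g. job submissions) in consecutive intervals of interval_s seconds."""
    if not times_s:
        return []
    start = min(times_s)
    n_bins = int((max(times_s) - start) // interval_s) + 1
    counts = [0] * n_bins
    for t in times_s:
        counts[int((t - start) // interval_s)] += 1
    return counts


def hurst_aggregated_variance(series: Sequence[float], levels: Sequence[int]) -> float:
    """Estimate H from Var(X^(m)) ~ m^(2H - 2): H = 1 + slope / 2 of a log-log fit."""
    log_m, log_var = [], []
    for m in levels:
        n_blocks = len(series) // m
        if n_blocks < 2:
            continue  # too few blocks to estimate a variance at this level
        block_means = [fmean(series[i * m:(i + 1) * m]) for i in range(n_blocks)]
        var = pvariance(block_means)
        if var > 0:
            log_m.append(math.log(m))
            log_var.append(math.log(var))
    if len(log_m) < 2:
        raise ValueError("need at least two usable aggregation levels")
    # Ordinary least-squares slope of log-variance against log-m.
    mx, my = fmean(log_m), fmean(log_var)
    slope = sum((x - mx) * (y - my) for x, y in zip(log_m, log_var)) / sum(
        (x - mx) ** 2 for x in log_m
    )
    return 1.0 + slope / 2.0


if __name__ == "__main__":
    random.seed(0)
    # Synthetic independent arrivals over 30 days: H should come out close to 0.5.
    arrivals = [random.uniform(0, 30 * 86_400) for _ in range(50_000)]
    counts = counts_per_interval(arrivals, interval_s=600)  # 10-minute bins
    print(len(counts), "bins; H =", round(hurst_aggregated_variance(counts, [1, 2, 4, 8, 16, 32]), 2))
```

Running the same code with `interval_s=3600` gives the 1-hour view mentioned on the slide.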

  12. Modeling Process • Well-known probability distributions: Normal, Exponential, Log-normal, Gamma, Weibull, Generalized Pareto • Maximum likelihood estimation (MLE) to fit • Fit each known distribution to the empirical distribution → parameters • Goodness-of-fit • Assess how good the fit is; select the best-fitting distribution • Kolmogorov-Smirnov: sensitive to the body of the distribution; uses the D statistic • Anderson-Darling: sensitive to the tails of the distribution • Hybrid method*: works for very large populations *Kondo et al., Failure Trace Archive, CCGrid’10, Best Paper Award.
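The fit-then-select loop above can be sketched in pure Python for the three candidate distributions whose MLE has a closed form (Normal, Exponential, Log-normal), using the Kolmogorov-Smirnov D statistic to pick the winner. This is a reduced illustration, not the authors' tooling: Gamma, Weibull, and Generalized Pareto need numerical MLE (for example via the `fit` method of `scipy.stats` distributions), and the Anderson-Darling test and the hybrid method are omitted.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev
from typing import Callable, Dict, Sequence, Tuple

Cdf = Callable[[float], float]


def fit_candidates(data: Sequence[float]) -> Dict[str, Cdf]:
    """Closed-form MLE fits; data must be strictly positive (e.g. run times)."""
    mean = fmean(data)
    normal = NormalDist(mean, pstdev(data))             # MLE: sample mean, population std
    logs = [math.log(x) for x in data]
    log_normal = NormalDist(fmean(logs), pstdev(logs))  # MLE on the log-values
    return {
        "normal": normal.cdf,
        "exponential": lambda x: 1.0 - math.exp(-x / mean) if x > 0 else 0.0,  # MLE scale = mean
        "lognormal": lambda x: log_normal.cdf(math.log(x)) if x > 0 else 0.0,
    }


def ks_statistic(data: Sequence[float], cdf: Cdf) -> float:
    """Kolmogorov-Smirnov D: largest gap between the empirical and fitted CDFs."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def best_fit(data: Sequence[float]) -> Tuple[str, float, Dict[str, float]]:
    """Fit every candidate, then keep the one with the smallest D statistic."""
    scores = {name: ks_statistic(data, cdf) for name, cdf in fit_candidates(data).items()}
    best = min(scores, key=scores.__getitem__)
    return best, scores[best], scores


if __name__ == "__main__":
    random.seed(42)
    run_times = [random.lognormvariate(3.0, 1.0) for _ in range(2000)]  # synthetic, not trace data
    name, d, scores = best_fit(run_times)
    print(name, round(d, 3), {k: round(v, 3) for k, v in scores.items()})
```

Because KS is most sensitive in the body of the distribution, a tail-sensitive test such as Anderson-Darling is still worth running before trusting a model for extreme values.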

  13. Main Results: Basic Stats • Table columns: Trace ID, TRunTime [s], #Tasks/Job, Pk.Arr.Rate/D, # users • GWA-T3, CWA-03, GWA-T10, CWA-01; cells as extracted: 433/86med, 165J, 31,964, 89,274, n/a, 5-20, 5-20, 153/143Map, -/8KT, -/1.6KTph, 8KJ/2MT, 21KJ/-T, 387, n/a, 216, 18 • CWA-02: 512/80med, 901/712Map, 6KJ/3.2MT, n/a • GWA-T1: 370, 5-20, -/20KT, 332 • GWA-T6: 14,599, 5-20, -/22.5KT, 206 • GWA-T11: 8,971, 5-20, -/22KTph, 412 • MapReduce vs. grid workloads [vs. parallel production environments] • Massive numbers of short tasks vs. many long tasks vs. few very long tasks • Fewer users for MapReduce environments? • TODO: Analyze amounts per core
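The basic statistics in this table can be reproduced from a trace with a few helpers. This sketch rests on assumptions about the notation: it reads entries like "433/86med" as mean/median run time, and "Pk.Arr.Rate/D" as the largest number of arrivals in any one day, with days counted from the first submission. Neither reading is confirmed by the slide.

```python
from collections import Counter
from statistics import fmean, median
from typing import Iterable, Sequence, Tuple

DAY_S = 86_400


def runtime_mean_median(run_times_s: Sequence[float]) -> Tuple[float, float]:
    """Mean and median run time (assumed meaning of entries like '433/86med')."""
    return fmean(run_times_s), median(run_times_s)


def peak_daily_arrivals(submit_times_s: Sequence[float]) -> int:
    """Most submissions in any one day, with days counted from the first submission."""
    if not submit_times_s:
        return 0
    start = min(submit_times_s)
    per_day = Counter(int((t - start) // DAY_S) for t in submit_times_s)
    return max(per_day.values())


def distinct_users(users: Iterable[str]) -> int:
    """Number of unique user IDs in the trace."""
    return len(set(users))


if __name__ == "__main__":
    # Hypothetical values: a heavy-tailed mix, where the mean far exceeds the median.
    print(runtime_mean_median([10, 20, 30, 1000]))
    print(peak_daily_arrivals([0, 100, 90_000, 90_001, 90_002, 200_000]))
    print(distinct_users(["a", "b", "a"]))
```

Reporting the median next to the mean matters for these workloads: with many short tasks and a few long ones, the mean alone overstates the typical run time.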

  14. Agenda • Introduction & Motivation • The Cloud Workloads Archive: What’s in a Name? • Format and Tools • Contents • Analysis & Modeling • Applications • Take Home Message

  15. Applications • Mesos running mixtures of workloads • Workloads: MPI, MapReduce, grid, … • Find bottlenecks • Find workloads that are particularly difficult to run • Improve the system! • Status: in progress, using a cluster in Finland (Petri Savolainen) • All the applications typical of trace-based work: design, validation, and comparison of algorithms, methods, and systems.

  16. Agenda • Introduction & Motivation • The Cloud Workloads Archive: What’s in a Name? • Format and Tools • Contents • Analysis & Modeling • Applications • Take Home Message

  17. Take Home Message • Cloud Workloads Archive • Datasets • Tools to convert, analyze, and model the datasets • Need your help to collect more traces • Converted and analyzed three MapReduce workloads • Different from grid and parallel production environment workloads (ask about additional proof and let me show a couple more slides) • Invariants? • Applications • 1: Model of Cloud/MapReduce workloads • 2: Test and improve Mesos

  18. Continuing Our Collaboration • Scheduling mixtures of grid/HPC/cloud workloads • Scheduling and resource management in practice • Modeling aspects of cloud infrastructure and workloads • Condor on top of Mesos • Massively Social Gaming and Mesos • Step 1: Game analytics and social network analysis in Mesos • …

  19. Thank you! Questions? Observations? Alex Iosup, Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica • email: A.Iosup@tudelft.nl • Thanks to all: AliG, Andrew, AndyK, Ari, Beth, Blaine, David, Ion, Justin, Lucian, Matei, Petri, Rean, Tim, … • More Information: • The Grid Workloads Archive: gwa.ewi.tudelft.nl • The Failure Trace Archive: fta.inria.fr • The GrenchMark perf. eval. tool: grenchmark.st.ewi.tudelft.nl • Cloud research: www.st.ewi.tudelft.nl/~iosup/research_cloud.html • See the PDS publication database at: www.pds.twi.tudelft.nl/ • Big thanks to our collaborators: U. Wisc.-Madison, U Chicago, U Dortmund, U Innsbruck, LRI/INRIA Paris, INRIA Grenoble, U Leiden, Politehnica University of Bucharest, Technion, …

  20. Additional Slides

  21. Main Results: Basic Stats • Table columns: Trace ID, Total IO [MB], Rd. [MB], Wr [%], HDFS Wr [MB] • GWA12.2, CWA-03, GWA12.4, CWA-01; cells as extracted: -, 10,934, 389, 144, 6,805, 114, 33, -, 21%, 92%, -, 38%, n/a, 1,538, n/a, - • CWA-02: 75,546, 47,539, 37%, 8,563 • GWA12.1: 469, 174, 63%, n/a • GWA12.3: 161, 130, 19%, n/a • GWA12.5: 330, 31, 91%, n/a • MapReduce vs. grid workloads • IO-intensive vs. compute-intensive • Constant Wr [%] of ~40% of IO for MapReduce traces? • TODO: More MapReduce traces to validate findings
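The fully readable rows of this table are consistent with Wr [%] = (Total IO - Rd.) / Total IO, i.e. the write share of all IO. That relationship is inferred from the numbers themselves rather than stated on the slide; the sketch below recomputes it for those rows so the reading can be checked.

```python
def write_share_pct(total_io_mb: float, read_mb: float) -> float:
    """Percentage of total IO that is written, assuming Total IO = Read + Write."""
    if total_io_mb <= 0:
        raise ValueError("total IO must be positive")
    return 100.0 * (total_io_mb - read_mb) / total_io_mb


# Clean rows from the slide: (Total IO [MB], Rd. [MB], reported Wr [%]).
ROWS = {
    "CWA-02": (75_546, 47_539, 37),
    "GWA12.1": (469, 174, 63),
    "GWA12.3": (161, 130, 19),
    "GWA12.5": (330, 31, 91),
}

if __name__ == "__main__":
    for trace_id, (total, read, reported) in ROWS.items():
        computed = write_share_pct(total, read)
        print(f"{trace_id}: computed {computed:.1f}%, reported {reported}%")
```

Under this reading, CWA-02's 37% is one data point toward the "Wr [%] ~40% for MapReduce" question on the slide, while the grid traces range from 19% to 91%.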

  22. Main Results • Two-mode trace → do NOT analyze as a whole
