The Failure Trace Archive
Grid Computing: From Old Traces to New Applications
Alexandru Iosup, Ozan Sonmez, Nezih Yigitbasi, Hashim Mohamed, Catalin Dumitrescu, Mathieu Jan, Dick Epema
Parallel and Distributed Systems Group, TU Delft
Many thanks to our collaborators: U Wisc./Madison, U Chicago, U Dortmund, U Innsbruck, LRI/INRIA Paris, INRIA Grenoble, U Leiden, Politehnica University of Bucharest, Technion, …
Fribourg, Switzerland
Alexandru Iosup • http://pds.twi.tudelft.nl/~iosup/ • Systems • The Koala grid scheduler • The Tribler BitTorrent-compatible P2P file-sharing system • The POGGI and CAMEO gaming platforms • Performance • The Grid Workloads Archive (Nov 2006) • The Failure Trace Archive (Nov 2009) • The Peer-to-Peer Trace Archive (Apr 2010) • Tools: DGSim trace-based grid simulator, GrenchMark workload-based grid benchmarking • Team of 15+ active collaborators in NL, AT, RO, US • Happy to be in Berkeley until September
The Grid: a ubiquitous, always-on computational and data storage platform on which users can seamlessly run their (large-scale) applications • Shared capacity & costs, economies of scale
The Dutch Grid: DAS System and Extensions
• DAS-3: a 5-cluster grid • 272 AMD Opteron nodes: 792 cores, 1 TB memory • Heterogeneous: 2.2-2.6 GHz single/dual-core nodes • Myrinet-10G (excl. Delft) and Gigabit Ethernet; sites connected by SURFnet6 10 Gb/s lambdas
• Sites: VU (85 nodes), TU Delft (68), UvA/MultimediaN (46), UvA/VL-e (41), Leiden (32)
• Clouds: Amazon EC2+S3, Mosso, …
• DAS-4 (upcoming) • Multi-cores: general purpose, GPU, Cell, …
Many Grids Built DAS, Grid’5000, OSG, NGS, CERN, … Why grids and not The Grid?
Agenda • Introduction • Was it the System? • Was it the Workload? • Was it the System Designer? • New Application Types • Suggestions for Collaboration • Conclusion
The Failure Trace Archive: Failure and Recovery Events • http://fta.inria.fr • 20+ traces online • D. Kondo, B. Javadi, A. Iosup, D. Epema, The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, CCGrid 2010 (Best Paper Award)
Was it the System? • No • System can grow fast • Good data and models to support system designers • Yes • Grid middleware unscalable [CCGrid06,Grid09,HPDC09] • Grid middleware failure-prone [CCGrid07,Grid07] • Grid resources unavailable [CCGrid10] • Inability to load balance well [SC|07] • Poor online information about resource availability
Agenda • Introduction • Was it the System? • Was it the Workload? • Was it the System Designer? • New Application Types • Suggestions for Collaboration • Conclusion
The Grid Workloads Archive: Per-Job Arrival, Start, Stop, Structure, etc. • http://gwa.ewi.tudelft.nl • 6 traces online (1.5 yrs, >750K, >250) • A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D. Epema, The Grid Workloads Archive, FGCS 24, 672—686, 2008.
How Are Real Grids Used? Data Analysis and Modeling • Grids vs. parallel production environments such as clusters and (small) supercomputers: • Bags of single-processor tasks vs. single parallel jobs • Bigger bursts of job arrivals • More jobs
Grid Workloads Analysis: Grid Workload Components • Bags-of-Tasks (BoTs): BoT size = 2-70 tasks, most 5-20; task runtime highly variable, from minutes to tens of hours • Workflows (WFs): WF size = 2-1k tasks, most 30-40; task runtime of minutes
Was it the Workload? • No • Similar workload characteristics across grids • High utilization possible due to single-node jobs • High load imbalance • Good data and models to support system designers [Grid06,EuroPar08,HPDC08-10,FGCS08] • Yes • Too many tasks (system limitation) • Poor online information about job characteristics, plus high variability of job resource requirements • How to schedule BoTs, WFs, and mixtures in grids?
Agenda • Introduction • Was it the System? • Was it the Workload? • Was it the System Designer? • New Application Types • Suggestions for Collaboration • Conclusion
Problems in Grid Scheduling and Resource Management: The System
• Grid schedulers do not own resources themselves: they have to negotiate with autonomous local schedulers; authentication/multi-organizational issues
• Grid schedulers interface to local schedulers: some may have support for reservations, others are queuing-based
• Grid resources are heterogeneous and dynamic: hardware (processor architecture, disk space, network), basic software (OS, libraries), grid software (middleware)
• Resources may fail
• Lack of complete and accurate resource information
Problems in Grid Scheduling and Resource Management: The Workloads
• Workloads are heterogeneous and dynamic
• Grid schedulers may not have control over the full workload (multiple submission points)
• Jobs may have performance requirements
• Lack of complete and accurate job information
• Application structure is heterogeneous: single sequential jobs; bags-of-tasks, parameter sweeps (Monte Carlo), pilot jobs; workflows, pipelines, chains-of-tasks; parallel jobs (MPI), malleable, co-allocated
The Koala Grid Scheduler • Developed in the DAS system: deployed on DAS-2 since September 2005, ported to DAS-3 in April 2007 • Independent of grid middleware such as Globus; runs on top of local schedulers • Objectives: data and processor co-allocation in grids; supporting different application types; specialized application-oriented scheduling policies • Koala homepage: http://www.st.ewi.tudelft.nl/koala/
Koala in a Nutshell: a bridge between theory and practice • Parallel applications: MPI, Ibis, …; co-allocation; malleability • Parameter sweep applications: cycle scavenging, run as low-priority jobs • Workflows
Inter-Operating Grids Through Delegated MatchMaking: Inter-Operation Architectures • Independent • Centralized • Hierarchical • Decentralized • Delegated MatchMaking: hybrid hierarchical/decentralized
Inter-Operating Grids Through Delegated MatchMaking: The Delegated MatchMaking Mechanism • Deal with local load locally (if possible) • When local load is too high, temporarily bind resources from remote sites to the local environment; may build delegation chains • Delegate resource usage rights, do not migrate jobs • Handle delegations once per delegation cycle (delegated matchmaking) • The Delegated MatchMaking Mechanism = delegate resource usage rights, do not delegate jobs
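The delegation idea on this slide can be sketched in a few lines of Python. This is a toy illustration, not Koala or DMM code: the `Site` class, its fields, and the `acquire` method are all hypothetical names, and the real mechanism operates within a hierarchical/decentralized overlay with periodic delegation cycles.

```python
class Site:
    """Toy grid site that delegates resource usage rights when overloaded."""
    def __init__(self, name, nodes, neighbors=()):
        self.name = name
        self.free_nodes = nodes
        self.neighbors = list(neighbors)

    def acquire(self, needed, chain=()):
        """Satisfy a request locally if possible; otherwise delegate to a
        neighbor, possibly extending a delegation chain. Returns a list of
        (site, count) usage-right grants, or [] on failure. Jobs are never
        migrated; only usage rights move along the chain."""
        if self.free_nodes >= needed:
            self.free_nodes -= needed
            return [(self.name, needed)]
        for peer in self.neighbors:
            if peer.name in chain:               # avoid delegation cycles
                continue
            grants = peer.acquire(needed, chain + (self.name,))
            if grants:
                return grants
        return []

a = Site("A", nodes=2)
b = Site("B", nodes=64, neighbors=[a])
a.neighbors.append(b)
print(a.acquire(8))   # A is overloaded; B grants usage rights: [('B', 8)]
```

The key design point survives even in the toy: the job stays at site A, while B's resources are temporarily bound to A's environment.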
What is the Potential Gain of Grid Inter-Operation? Delegated MatchMaking vs. Alternatives (Independent, Centralized, Decentralized; higher is better) • DMM: high goodput, low wait time, finishes all jobs • Even better under load imbalance between grids • Reasonable overhead [see thesis] • Grid inter-operation (through DMM) delivers good performance
4.2. Studies on Grid Scheduling [5/5]: Scheduling under Cycle Stealing
• Requirements: 1. Unobtrusiveness (minimal delay for higher-priority local and grid jobs) 2. Fairness 3. Dynamic resource allocation 4. Efficiency 5. Robustness and fault tolerance
• CS policies: Equi-All (grid-wide basis), Equi-PerSite (per cluster)
• Architecture: the KCM on the head node monitors idle/demanded resources and informs the CS-Runner via grow/shrink messages; launchers deploy, monitor, and preempt tasks on the clusters; users submit PSA(s) via a JDF
• Application-level scheduling: pull-based approach, shrinkage policy
• Deployed as a Koala runner
• O. Sonmez, B. Grundeken, H. Mohamed, A. Iosup, D. Epema: Scheduling Strategies for Cycle Scavenging in Multicluster Grid Systems. CCGRID 2009: 12-19
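The two cycle-scavenging policies named on this slide differ only in where the equipartitioning happens. A minimal sketch under stated assumptions: the function names, the dict-based inputs, and the round-robin handling of remainders are ours, not the paper's; the policy names Equi-All and Equi-PerSite are from the slide.

```python
def equi_all(idle_per_cluster, psas):
    """Equi-All: divide the grid-wide pool of idle nodes equally
    over the parameter-sweep applications (PSAs)."""
    total = sum(idle_per_cluster.values())
    share, rest = divmod(total, len(psas))
    return {p: share + (1 if i < rest else 0) for i, p in enumerate(psas)}

def equi_per_site(idle_per_cluster, psas):
    """Equi-PerSite: divide each cluster's idle nodes equally, so every
    PSA receives a share on every cluster."""
    alloc = {p: {} for p in psas}
    for cluster, idle in idle_per_cluster.items():
        share, rest = divmod(idle, len(psas))
        for i, p in enumerate(psas):
            alloc[p][cluster] = share + (1 if i < rest else 0)
    return alloc

idle = {"delft": 10, "leiden": 7}
print(equi_all(idle, ["psa1", "psa2"]))       # {'psa1': 9, 'psa2': 8}
print(equi_per_site(idle, ["psa1", "psa2"]))
```

Equi-All can concentrate one PSA on one cluster, while Equi-PerSite spreads every PSA across all clusters; the real policies additionally shrink allocations when local or grid jobs arrive.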
Was it the System Designer? • No • Mechanisms to inter-operate grids: DMM [SC|07], … • Mechanisms to run many grid application types: WFs, BoTs, parameter sweeps, cycle scavenging, … • Scheduling algorithms with inaccurate information [HPDC ‘08, ‘09, ‘10] • Tools for empirical and trace-based experimentation • Yes • Still too many tasks • What about new application types?
Agenda • Introduction • Was it the System? • Was it the Workload? • Was it the System Designer? • New Application Types • Suggestions for Collaboration • Conclusion
MSGs are a Popular, Growing Market • 25,000,000 subscribed players (from 150,000,000+ active) • Over 10,000 MSGs in operation • Market size: $7,500,000,000/year • Sources: MMOGChart, own research; ESA, MPAA, RIAA.
Massively Social Gaming as a New Grid/Cloud Application • Massively Social Gaming: (online) games with massive numbers of players (100K+), for which social interaction helps the gaming experience [SC|08, TPDS'10] • Virtual world: explore, do, learn, socialize, compete [ROIA09] • Content: graphics, maps, puzzles, quests, culture • Game analytics: player stats and relationships [EuroPar09 Best Paper Award, CPE10]
Suggestions for Collaboration • Scheduling mixtures of grid/HPC/cloud workloads • Scheduling and resource management in practice • Modeling aspects of cloud infrastructure and workloads • Condor on top of Mesos • Massively Social Gaming and Mesos • Step 1: Game analytics and social network analysis in Mesos • The Grid Research Toolbox • Using and sharing traces: The Grid Workloads Archive and The Failure Trace Archive • GrenchMark: testing large-scale distributed systems • DGSim: simulating multi-cluster grids
Thank you! Questions? Observations? Alex Iosup, Ozan Sonmez, Nezih Yigitbasi, Hashim Mohamed, Dick Epema email: A.Iosup@tudelft.nl • More Information: • The Koala Grid Scheduler: www.st.ewi.tudelft.nl/koala • The Grid Workloads Archive: gwa.ewi.tudelft.nl • The Failure Trace Archive: fta.inria.fr • The DGSim simulator: www.pds.ewi.tudelft.nl/~iosup/dgsim.php • The GrenchMark perf. eval. tool: grenchmark.st.ewi.tudelft.nl • Cloud research: www.st.ewi.tudelft.nl/~iosup/research_cloud.html • Gaming research: www.st.ewi.tudelft.nl/~iosup/research_gaming.html • see PDS publication database at: www.pds.twi.tudelft.nl/ DGSim Many thanks to our collaborators: U. Wisc.-Madison, U Chicago, U Dortmund, U Innsbruck, LRI/INRIA Paris, INRIA Grenoble, U Leiden, Politehnica University of Bucharest, Technion, …
The 1M-CPU Machine with Shared Resource Ownership • The 1M-CPU machine • eScience (high-energy physics, earth sciences, financial services, bioinformatics, etc.) • Shared resource ownership • Shared resource acquisition • Shared maintenance and operation • Summed capacity is higher (and more efficiently used) than the sum of individually owned capacities
How to Build the 1M-CPU Machine with Shared Resource Ownership? • Clusters of resources are ever more present • Top500 SuperComputers: cluster systems from 0% to 75% share in 10 years (also from 0% to 50% performance) • CERN WLCG: from 100 to 300 clusters in 2½ years Source: http://goc.grid.sinica.edu.tw/gstat//table.html Source: http://www.top500.org/overtime/list/29/archtype/
How to Build the 1M-CPU Machine with Shared Resource Ownership? • Growth of the largest Top500 systems (data source: http://www.top500.org): last 10 years: average 20x, median 10x, max 100x; last 4 years: now 0.5x/yr • To build the 1M-CPU cluster: at the last-10-years rate, another 10 years; at the current rate, another 200 years
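The "another 10 years vs. another 200 years" contrast is a compound-growth extrapolation. A sketch of the arithmetic, with illustrative numbers only (the slide's exact starting system size and rates are not given, so the figures below are assumptions, not the slide's data):

```python
import math

def years_to_target(current, target, annual_factor):
    """Years until `current` reaches `target`, growing by a constant
    `annual_factor` per year (compound growth)."""
    return math.log(target / current) / math.log(annual_factor)

# Illustrative: 20x per decade ~= a 1.35x annual factor gets a
# hypothetical 100K-CPU system to 1M CPUs in under a decade,
# while a sluggish 1.05x/yr would need roughly half a century.
print(round(years_to_target(100_000, 1_000_000, 20 ** 0.1), 1))  # 7.7
print(round(years_to_target(100_000, 1_000_000, 1.05), 1))       # 47.2
```

The point of the slide survives any choice of starting size: the waiting time scales inversely with the logarithm of the growth factor, so a stalled growth rate stretches the timeline by orders of magnitude.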
How to Build the 1M-CPU Machine with Shared Resource Ownership? • Cluster-based computing grids: CERN's WLCG cluster size over time (data source: http://goc.grid.sinica.edu.tw/gstat/) • Growth: max 2x/yr, average +15 procs/yr, median +5 procs/yr • Shared clusters grow on average slower than Top500 cluster systems!
How to Build the 1M-CPU Machine with Shared Resource Ownership? • Why doesn't CERN WLCG use larger clusters? • Physics: dissipating heat from large clusters • Market: pay industrial power-consumer rates, pay special system-building rates • Collaboration: who pays for the largest cluster? • Why doesn't CERN WLCG opt for multi-cores? • We don't know how to exploit multi-cores yet • Executing large batches of independent jobs
4.1. Grid Workloads [2/5]: BoTs are Predominant in Grids • Selected findings: • Batches predominant in grid workloads, up to 96% of CPU time • Average batch size (Δ≤120s) is 15-30 (500 max) • 75% of the batches are sized 20 jobs or less • A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, The Characteristics and Performance of Groups of Jobs in Grids, Euro-Par, LNCS, vol. 4641, pp. 382-393, 2007.
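The Δ≤120s criterion groups a user's jobs into one batch when consecutive arrivals are at most Δ seconds apart. A minimal sketch of that grouping (the function name and input format are ours, not from the paper):

```python
def group_batches(arrivals, delta=120):
    """Group one user's job arrival times (in seconds) into batches:
    consecutive jobs at most `delta` seconds apart form one batch."""
    batches = []
    for t in sorted(arrivals):
        if batches and t - batches[-1][-1] <= delta:
            batches[-1].append(t)   # continues the current batch
        else:
            batches.append([t])     # gap too large: start a new batch
    return batches

print(group_batches([0, 30, 100, 500, 550]))  # [[0, 30, 100], [500, 550]]
```

Batch size statistics like those on the slide then follow directly from the lengths of the resulting groups.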
System Availability Characteristics: Resource Evolution (Grids Grow by Cluster)
System Availability Characteristics: Grid Dynamics (Grids Shrink Temporarily) • Grid-level view • Average availability: 69%
Resource Availability Model • Assume no correlation of failure occurrence between clusters • Which site/cluster? fs, the fraction of failures at cluster s • Per-cluster characteristics: MTBF, MTTR, correlation • Weibull distribution for failure inter-arrival times (IAT): the longer a node is online, the higher the chance that it will fail
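The Weibull property the slide mentions (failure probability rising with uptime) corresponds to a shape parameter greater than 1. A sketch of sampling a synthetic failure trace from such a model; the shape and scale values here are illustrative placeholders, not the fitted parameters from the FTA analysis:

```python
import random

def sample_failure_trace(n_events, shape=1.5, scale=3600.0, seed=42):
    """Sample cumulative failure times with Weibull inter-arrival times.
    shape > 1 gives an increasing hazard rate: the longer a node has
    been online, the more likely it is to fail, as the slide states.
    The shape/scale values are assumptions for illustration only."""
    rng = random.Random(seed)
    t, events = 0.0, []
    for _ in range(n_events):
        t += rng.weibullvariate(scale, shape)   # (scale alpha, shape beta)
        events.append(t)
    return events

print([round(t) for t in sample_failure_trace(5)])
```

Note the argument order of `random.weibullvariate`: scale first, then shape. With shape < 1 the same sampler would instead model infant mortality (failures clustered early).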
Grid Workloads: Load Imbalance Across Sites and Grids • Overall workload imbalance: normalized daily load ratio of 5:1 • Temporary workload imbalance: hourly load ratio of up to 1000:1
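The 5:1 and 1000:1 figures are max-to-min ratios of per-site load over a time window. A sketch of the metric, assuming the load values have already been normalized by site capacity (the function name and the sample numbers are hypothetical):

```python
def imbalance_ratio(load_per_site):
    """Max-to-min ratio of normalized load across sites.
    The slide reports about 5:1 for daily load and up to 1000:1
    for hourly load; the inputs below are made-up illustrations."""
    loads = [l for l in load_per_site.values() if l > 0]
    return max(loads) / min(loads)

# Hypothetical normalized CPU-hours consumed per site in one day:
daily = {"delft": 5000, "leiden": 1000, "vu": 2500}
print(imbalance_ratio(daily))  # 5.0
```

Shrinking the window from a day to an hour exposes bursts at single sites, which is why the temporary ratio can be orders of magnitude larger than the overall one.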
4.1. Grid Workloads [4/5]: Modeling Grid Workloads (Feitelson adapted) • Adapted to grids: percentage of parallel jobs, other values • Validated with 4 grid and 7 parallel production environment traces • A. Iosup, T. Tannenbaum, M. Farrellee, D. Epema, M. Livny: Inter-operating Grids through Delegated MatchMaking. SC|07 (nominated for Best Paper Award)
Grid Workloads: Modeling Grid Workloads (adding users, BoTs) • Single arrival process for both BoTs and parallel jobs • Reduce over-fitting and complexity of "Feitelson adapted" by removing the correlated RunTime-Parallelism model • Validated with 7 grid workloads • A. Iosup, O. Sonmez, S. Anoep, and D.H.J. Epema, The Performance of Bags-of-Tasks in Large-Scale Distributed Systems, HPDC, pp. 97-108, 2008.
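The "single arrival process for both BoTs and parallel jobs" idea can be sketched as a tiny workload generator. All distributions and parameters below (exponential inter-arrivals, the BoT probability, the size ranges) are placeholders for illustration, not the fitted model from the HPDC paper; only the most-common BoT size range of 5-20 is taken from the slides.

```python
import random

def generate_workload(n_arrivals, p_bot=0.7, mean_iat=60.0, seed=1):
    """One shared arrival process; each arrival becomes a bag-of-tasks
    with probability p_bot, else a single parallel job. Distribution
    choices and parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(n_arrivals):
        t += rng.expovariate(1.0 / mean_iat)    # exponential inter-arrivals
        if rng.random() < p_bot:
            size = rng.randint(5, 20)           # typical BoT size (slides: most 5-20)
            jobs.append(("BoT", t, size))
        else:
            cpus = 2 ** rng.randint(1, 6)       # power-of-two parallel job
            jobs.append(("parallel", t, cpus))
    return jobs

for job in generate_workload(4):
    print(job)
```

Modeling one arrival stream and then classifying each arrival is what lets the model drop the separate correlated RunTime-Parallelism component without losing the BoT/parallel mix.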
How To Compare Existing and New Grid Systems? The Delft Grid Simulator (DGSim) • Generate realistic workloads • Automate the simulation process (10,000s of tasks) • Discrete-event generator • …tudelft.nl/~iosup/dgsim.php
How to Inter-Operate Grids? Existing (Working?) Alternatives • Architectures: independent, centralized, hierarchical, decentralized • Systems: OAR, Condor, Koala, Globus GRAM, Alien, Condor flocking, OAR2, OurGrid, Moab/Torque, NWIRE, CCS • Open questions: load imbalance? resource selection? scale? root ownership? node failures? accounting? trust?
Inter-Operating Grids Through Delegated MatchMaking [1/3]: The Delegated MatchMaking Architecture • Start from a hierarchical architecture • Let roots exchange load • Let siblings exchange load • Delegated MatchMaking architecture = hybrid hierarchical/decentralized architecture for grid inter-operation
Massively Social Gaming on Clouds • MSGs: million-user, multi-billion market; content, world simulation, analytics • Current technology: upfront payment; cost and scalability problems; makes players unhappy • Our vision: scalability & automation; economies of scale with clouds • Ongoing work: content (POGGI framework), platform (edutain@grid), analytics (CAMEO framework) • The future: happy players, happy cloud operators • Publications, gaming and clouds: 2008: ACM SC, TR Perf; 2009: ROIA, CCGrid, NetGames, EuroPar (Best Paper Award), CloudComp, TR variability; 2010: IEEE TPDS, Elsevier CCPE; 2011: book chapter • Graduation forecast 2010/2011: 1 PhD, 2 MSc, 4 BSc
Problems in Grid Scheduling and Resource Management: New Hypes, New Focus for Designers • Clouds • Large-scale, loosely coupled infrastructure and/or platform • Computation and storage have fixed costs (?) • Guaranteed good performance, e.g., no wait time (?) • Easy to port grid applications to clouds (?) • Multi-cores • Small- and mid-scale, tightly coupled infrastructure • Computation and storage have lower cost than grids (?) • Good performance (?) • Easy to port grid applications to multi-cores (?)