SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascales

SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascales. Ke Wang, Data-Intensive Distributed Systems Laboratory, Computer Science Department, Illinois Institute of Technology. February 14th, 2012. Acknowledgements: DataSys Laboratory, Dr. Ioan Raicu

Presentation Transcript


  1. SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascales Ke Wang Data-Intensive Distributed Systems Laboratory Computer Science Department Illinois Institute of Technology February 14th, 2012

  2. Acknowledgements • DataSys Laboratory • Dr. Ioan Raicu • Juan Carlos Hernández Munuera, MS 2011 • Hui Jin, Tonglin Li • Paper submission: • Ke Wang, Ioan Raicu. “SimMatrix: Exploring Many-Task Computing through Simulations at Exascales”, under review at ACM HPDC 2012

  3. Outline • Introduction & Motivation • Long-Term Aims and Contributions • SimMatrix Architecture • Implementation • Evaluation • Related Work • Contributions • Future Work & Conclusion

  5. Distributed Systems

  6. Manycore Computing • Today (2011): Multicore Computing • O(10) cores in commodity architectures • O(100) cores in proprietary architectures • O(1000) GPU hardware threads • Near future (~2018): Manycore Computing • ~1000 cores/threads in commodity architectures • Source: Pat Helland, Microsoft, “The Irresistible Forces Meet the Movable Objects”, November 9th, 2007

  7. Exascale Computing • Today (2012): 10 Petaflop Computing • O(100K) nodes (100X in the last 10 years) • O(1M) cores (1000X in the last 10 years) • Near future (~2018): Exaflop Computing • ~1M nodes (10X) • ~1B processor-cores/threads (1000X) • Source: Top500 Performance Development, http://top500.org/static/lists/2011/11/TOP500_201111_Poster.pdf

  8. Major Challenges of Exascale Computing • Concurrency: parallel programmability • Resilience: MTTF decreases, MPI suffers • I/O and Memory: minimizing data movement • Heterogeneity: accelerators, GPUs, MIC • Energy: 20MW limitation

  9. MTC: Many-Task Computing • Bridge the gap between HPC and HTC • Applied in clusters, grids, and supercomputers • Loosely coupled apps with HPC orientations • Many activities coupled by file system ops • Many resources over short time periods

  10. MTC Middleware • Falkon • Fast and Lightweight Task Execution Framework • http://datasys.cs.iit.edu/projects/Falkon/index.html • Swift • Parallel Programming System • http://www.ci.uchicago.edu/swift/index.php

  12. Long-Term Aims • Address major exascale computing challenges: • Concurrency • Resilience • I/O and Memory • Heterogeneity • Explore techniques to enable MTC at exascales • Design, analyze, and implement a distributed data-aware execution fabric (MATRIX) supporting HPC/MTC workloads • Integrate MATRIX with parallel programming systems (e.g. Swift, Charm++, MapReduce) and with the FusionFS distributed file system • Prove that MTC applications can scale to exascales

  13. This Work’s Contributions • Explore techniques to enable MTC to scale to exascales • Design, analyze, and implement a discrete-event simulator (SimMatrix) enabling the study of MATRIX at extremely large scales (e.g. exascales) • Identify work stealing as a viable technique to achieve load balance at exascales • Provide evidence that work stealing is scalable by identifying optimal values for the parameters that affect its performance

  15. Overview: Job Scheduling Systems • Efficiently manage the distributed computing power of workstations, servers, and supercomputers in order to maximize job throughput and system utilization • Load balancing is critical • Different scheduling strategies • Centralized scheduling hinders scalability • Hierarchical scheduling has long job turnaround times • Distributed scheduling is a promising approach at exascales • Work stealing: a distributed scheduling strategy • Starved processors steal tasks from overloaded ones • Various parameters affect performance: • Number of tasks to steal • Number of neighbors • Static or dynamic random neighbors
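
The work-stealing idea on this slide can be illustrated with a minimal Java sketch. It is not SimMatrix's code: the class names `StealNode` and `WorkStealingSketch`, the choice of the most loaded neighbor as the victim, and the rule of skipping victims with fewer than two tasks are all assumptions made for illustration. The only detail taken from the talk is that a starved node steals from a neighbor, with "half of the victim's tasks" being the amount the evaluation slides report as best.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical simulated node: just a queue of pending task ids.
final class StealNode {
    final Deque<Integer> tasks = new ArrayDeque<>();

    int load() {
        return tasks.size();
    }
}

final class WorkStealingSketch {
    // A starved node (thief) picks its most loaded neighbor (assumed policy)
    // and takes half of that neighbor's queued tasks.
    // Returns the number of tasks moved.
    static int stealHalf(StealNode thief, List<StealNode> neighbors) {
        StealNode victim = null;
        for (StealNode n : neighbors) {
            if (n != thief && (victim == null || n.load() > victim.load())) {
                victim = n;
            }
        }
        if (victim == null || victim.load() < 2) {
            return 0; // nothing worth stealing; the thief would retry later
        }
        int toSteal = victim.load() / 2;
        for (int i = 0; i < toSteal; i++) {
            // Take from the tail so the victim keeps the tasks it would run next.
            thief.tasks.addLast(victim.tasks.pollLast());
        }
        return toSteal;
    }
}
```

In a simulator, a failed steal (return value 0) would typically schedule another attempt after some poll interval rather than retrying immediately.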

  16. SimMatrix Architecture • Figure 1: Simulation architectures; the left part is the centralized one, with a single dispatcher connecting all nodes; the right part is the homogeneous distributed topology, with each node having the same number of cores and neighbors • Diagram labels: Client, Submit tasks, Dispatcher (centralized), Arbitrary Node (distributed)

  17. Simulations • Continuous time simulations • Abandoned the idea of creating a separate thread per simulated node: we found that on our 48-core system with 256GB of memory, we were limited to 32K threads • Discrete event simulations • The only viable approach (today) to explore scheduling techniques at exascales (millions of nodes and billions of cores) • Created a unique object per simulated node, and converted any behavior to an event

  19. At the Heart of SimMatrix: Global Event Queue • All events are inserted into the queue, sorted by occurrence time in ascending order • Handle the first event, advance the simulation time, and update the event queue • Implemented with Java's red-black-tree-based “TreeSet”, which ensures Θ(log n) time for insert and remove • Figure 2: Event State Transition Diagram
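
A global event queue of the kind this slide describes can be sketched in Java around `java.util.TreeSet`. This is an illustration under stated assumptions, not the simulator's source: the names `SimEvent` and `GlobalEventQueue`, the event fields, and the sequence-number tie-breaker are hypothetical. The tie-breaker is needed because a `TreeSet` treats elements that compare equal as duplicates, so two events at the same simulated time would otherwise collapse into one. The sketch uses lambdas and therefore needs Java 8 or later, whereas the talk reports JDK 1.6.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical event record; field names are illustrative only.
final class SimEvent {
    final double time;  // simulated time at which the event occurs
    final long seq;     // insertion order, used only to break ties on time
    final int nodeId;   // simulated node that handles the event
    final String type;  // e.g. "TASK_END" or "STEAL"

    SimEvent(double time, long seq, int nodeId, String type) {
        this.time = time;
        this.seq = seq;
        this.nodeId = nodeId;
        this.type = type;
    }
}

final class GlobalEventQueue {
    // Ordered by occurrence time ascending, then by insertion order.
    private final TreeSet<SimEvent> queue = new TreeSet<>(
            Comparator.comparingDouble((SimEvent e) -> e.time)
                      .thenComparingLong(e -> e.seq));
    private long nextSeq = 0;
    private double now = 0.0;

    // Insert an event that fires `delay` simulated seconds from now.
    void schedule(double delay, int nodeId, String type) {
        queue.add(new SimEvent(now + delay, nextSeq++, nodeId, type));
    }

    // Remove the earliest event and advance the simulation clock to it.
    SimEvent next() {
        SimEvent e = queue.pollFirst();
        if (e != null) {
            now = e.time;
        }
        return e;
    }

    double now() {
        return now;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A simulation main loop would then repeatedly call `next()`, dispatch on the event type to the object representing `nodeId`, and let that handler call `schedule(...)` for any follow-up events.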

  20. Simulator Features • Node load information • Nested hash maps provide extremely fast performance at large scales • Dynamic task submission • Aims to reduce the memory footprint • Dynamic poll interval • Exponential backoff to reduce the number of messages and increase the speed of the simulation
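
The dynamic poll interval can be illustrated with a small exponential-backoff sketch in Java. The slide does not give the backoff parameters, so the doubling factor, the upper bound, the reset on success, and the class name `PollBackoff` are all assumptions chosen for illustration.

```java
// Hypothetical helper tracking how long an idle node waits before polling again.
final class PollBackoff {
    private final double initialSeconds;
    private final double maxSeconds;
    private double currentSeconds;

    PollBackoff(double initialSeconds, double maxSeconds) {
        this.initialSeconds = initialSeconds;
        this.maxSeconds = maxSeconds;
        this.currentSeconds = initialSeconds;
    }

    // Called when a steal attempt finds no work: returns the wait to use now
    // and doubles the next wait, capped at maxSeconds (assumed policy).
    double onStealFailed() {
        double wait = currentSeconds;
        currentSeconds = Math.min(currentSeconds * 2.0, maxSeconds);
        return wait;
    }

    // Called when a steal attempt succeeds: go back to the short interval.
    void onStealSucceeded() {
        currentSeconds = initialSeconds;
    }
}
```

Each failed attempt lengthens the gap before the next one, so idle nodes generate fewer messages (and the simulator fewer events) when the system is nearly out of work.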

  21. Implementation • SimMatrix is developed in Java • Sun 64-bit JDK version 1.6.0_22 • 1500 lines of code • Code accessible at: • http://datasys.cs.iit.edu/projects/SimMatrix/index.html • SimMatrix has no other dependencies

  23. Experiment Environment • Fusion system: • fusion.cs.iit.edu • 48 AMD Opteron cores at 1.93GHz • 256GB RAM • 64-bit Linux kernel 2.6.31.5 • Sun 64-bit JDK version 1.6.0_22

  24. Metrics • Throughput • Number of tasks finished per second, calculated as total-number-of-tasks / simulation-time • Efficiency • The ratio of the ideal simulation time for completing a given workload to the real simulation time. The ideal simulation time is the average task execution time multiplied by the number of tasks per core • Load Balancing • We adopted the coefficient of variation of the number of tasks finished by each node as the measure of load balancing: standard-deviation / average of the number of tasks finished by each node. The smaller the coefficient of variation, the better the load balancing • Scalability • Total number of tasks, number of nodes, and number of cores supported
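
The first three metrics on this slide translate directly into code. The Java sketch below follows the slide's definitions; the class and method names are hypothetical, and the use of the population standard deviation (dividing by the number of nodes rather than by one less) is an assumption, since the slide does not say which variant was used.

```java
// Hypothetical helper implementing the metric definitions from the slide.
final class SimMetrics {
    // Tasks finished per second: total tasks / simulation time.
    static double throughput(long totalTasks, double simulationSeconds) {
        return totalTasks / simulationSeconds;
    }

    // Ideal time (average task length x tasks per core) over real time.
    static double efficiency(double avgTaskSeconds, double tasksPerCore,
                             double realSimulationSeconds) {
        return (avgTaskSeconds * tasksPerCore) / realSimulationSeconds;
    }

    // Standard deviation / average of the per-node finished-task counts.
    // Uses the population standard deviation (an assumption).
    static double loadImbalance(long[] tasksFinishedPerNode) {
        int n = tasksFinishedPerNode.length;
        double mean = 0.0;
        for (long t : tasksFinishedPerNode) {
            mean += t;
        }
        mean /= n;
        double variance = 0.0;
        for (long t : tasksFinishedPerNode) {
            variance += (t - mean) * (t - mean);
        }
        variance /= n;
        return Math.sqrt(variance) / mean;
    }
}
```

For example, two nodes that finish 8 and 12 tasks have an average of 10 and a standard deviation of 2, giving a value of 0.2; perfectly balanced nodes give 0.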

  25. Workloads • Synthetic workloads: • Uniform distributions with different average task lengths, such as 10s (ave_10), 100s (ave_100), 1000s (ave_1000), 5000s (ave_5000), 10000s (ave_10000), and 100000s (ave_100000); also all tasks of 1 sec each (all_1) • Realistic application workloads: • General MTC workload from 2008-2009 trace of 173M tasks; average task length 64±486s (mtc_64), using Gamma Distribution
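
As a worked example for the mtc_64 workload, the Gamma parameters can be recovered by moment matching. This assumes that "64±486s" means a mean of 64 s and a standard deviation of 486 s, and that the shape-scale parameterization is used; the slide states neither, so the values below are illustrative rather than the ones used in the talk.

```latex
% Gamma(k, \theta): mean \mu = k\theta, variance \sigma^2 = k\theta^2
\begin{align*}
k      &= \frac{\mu^2}{\sigma^2} = \frac{64^2}{486^2} = \frac{4096}{236196} \approx 0.0173 \\
\theta &= \frac{\sigma^2}{\mu}   = \frac{236196}{64} \approx 3690.6\ \text{s}
\end{align*}
```

A shape parameter this far below 1 describes a heavily skewed workload: most tasks are very short, and a small number of very long tasks account for the large standard deviation.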

  26. Validation • Validate SimMatrix against state-of-the-art MTC systems (e.g. Falkon) to ensure that the simulator can accurately predict the performance of current petascale systems

  27. Comparing Work Stealing to Falkon’s Naïve Distributed Scheduler • Fine-grained workloads: efficiency increases from 2% to 99.3% • Coarse-grained workloads: efficiency increases from 99% to 99.999%

  28. Scalability: 1M Nodes and 10B Tasks • Memory consumption • <13 KB/task • <200 GB • CPU time • <90 µs/task • <260 hours

  29. Scalability: 1M Nodes and 10B Tasks • Efficiency • 90%+ • Coefficient of variation • <0.06 • Load imbalance of <600 tasks out of 10K tasks per node

  30. Work Stealing Parameters: Number of Tasks to Steal • Stealing half of a neighbor’s work is the best strategy!

  31. Work Stealing Parameters: Number of Neighbors (Static) • Requires a linear number of neighbors for good performance!

  32. Work Stealing Parameters: Number of Neighbors (Dynamic Random) • An increasing number of neighbors is needed for 90%+ efficiency, with the largest scales requiring a square-root number of neighbors (e.g. 1K neighbors for 1M nodes)!
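
The dynamic random neighbor strategy with a square-root neighbor count can be sketched in Java as follows. The slide only states the count and that neighbors are chosen randomly and dynamically; the class name `DynamicNeighbors`, the uniform sampling without replacement, and the idea of redrawing the set on each steal attempt are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper that draws a fresh neighbor set for one steal attempt.
final class DynamicNeighbors {
    // Picks about sqrt(numNodes) distinct node ids, uniformly at random,
    // excluding the calling node itself.
    static Set<Integer> pick(int self, int numNodes, Random rng) {
        int wanted = (int) Math.round(Math.sqrt(numNodes));
        int k = Math.min(wanted, numNodes - 1); // cannot exceed the other nodes
        Set<Integer> chosen = new HashSet<>();
        while (chosen.size() < k) {
            int candidate = rng.nextInt(numNodes);
            if (candidate != self) {
                chosen.add(candidate); // duplicates are ignored by the set
            }
        }
        return chosen;
    }
}
```

With 1,048,576 nodes this yields 1,024 neighbors per draw, matching the slide's "1K neighbors for 1M nodes" example. A static strategy would call `pick` once per node at start-up and reuse the result, whereas the dynamic strategy redraws it each time.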

  33. Work Stealing Parameters: Optimal Parameters Generality • The same optimal parameters achieve 90%+ efficiency across many different workloads!

  34. Work Stealing: Throughput • Centralized scheduling has a severe bottleneck, especially for fine-grained workloads • Distributed scheduling scales well; for coarse-grained workloads, there is no obvious upper bound

  35. Load Balancing Visualization: 1024 Nodes and ave_5000 Workload • Panels: Quarter Static Neighbors; Square Root Dynamic Neighbors; Square Root Static Neighbors; 2 Static Neighbors • Panel annotations: Starvation (two panels), Good Load Balancing (two panels)

  36. Summary Plot for Distributed Scheduling • Steady state utilization is ~100% at exascales

  38. Related Work • Real Job Scheduling Systems: • Condor (University of Wisconsin), Bradley et al, 2012 • PBS (NASA Ames), Corbatto et al, 2012 • LSF Batch (Platform Computing of Toronto), 2011 • Falkon (University of Chicago), Raicu et al, SC07 • Job Scheduling System Simulators: • simJava (University of Edinburgh), Wheeler et al, 2004 • GridSim (University of Melbourne, Australia), Buyya et al, 2010 • Load Balancing: • Neighborhood averaging scheme, Sinha et al, 1993 • Charm++ (UIUC), Zheng et al, 2011 • Scalable Work Stealing: • Dinan et al, SC09 • Blumofe et al, Scheduling multithreaded computations by work stealing, 1994

  40. Contributions • Designed, analyzed, and implemented a discrete-event simulator (SimMatrix) enabling the study of MTC workloads at exascales • Identified work stealing as a viable technique to achieve load balance at exascales • Provided evidence that work stealing is scalable by finding optimal values for the parameters that affect its performance: • Steal half of the neighbor's tasks • Use the dynamic random neighbors strategy • Use a square-root number of neighbors

  42. Future Work • Explore work stealing for manycore processors with 1000 cores • Enhance the network topology model to allow complex networks • Insight from SimMatrix will be used to develop MATRIX, a distributed task execution fabric • MATRIX will employ work stealing for distributed load balancing • MATRIX will be integrated with other projects, such as Swift (a data-flow parallel programming system) and FusionFS (a distributed file system)

  43. Conclusion • Exascale systems bring great opportunities for unraveling significant scientific mysteries • There are significant challenges to achieving exascales, such as concurrency, resilience, I/O and memory, heterogeneity, and energy • MTC requires a highly scalable and distributed task/job management system at large scales • Distributed scheduling is likely an efficient way to achieve load balancing, leading to high job throughput and system utilization • Work stealing is a scalable method to achieve load balance at exascales, given the optimal parameters

  44. More Information • More information: • http://datasys.cs.iit.edu/~kewang/ • http://datasys.cs.iit.edu/projects/SimMatrix/ • Contact: • kwang22@hawk.iit.edu • Questions?
