Next Generation Job Management System for Extreme Scales Speaker: Ke Wang Advisor: Ioan Raicu Datasys Lab, Illinois Institute of Technology November 25th, 2013 at Datasys Mini-Workshop, Fall 2013
Contents • SLURM++: A distributed job launch prototype for extreme-scale ensemble computing (IPDPS14 submission) • MATRIX: A distributed Many-Task Computing execution fabric designed for exascale (CCGRID14 submission)
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Exascale Computing • Today (June 2013): 34 petaflops • O(100K) nodes • O(1M) cores • Near future (~2020): exaflop computing • ~1M nodes • ~1B processor cores/threads
Job Management Systems for Exascale Computing • Ensemble computing • Over-decomposition • Many-Task Computing • Jobs/tasks are finer-grained • Requirements • High availability • Extremely high throughput (1M tasks/sec) • Low latency
Current Job Management Systems • Batch-scheduled HPC workloads • Lack support for ensemble workloads • Centralized design • Poor scalability • Single point of failure • SLURM maximum throughput of 500 jobs/sec • A decentralized design is needed
Goal • Architect and design job management systems for exascale ensemble computing • Identify the challenges and solutions towards supporting job management systems at extreme scales • Evaluate and compare different design choices at large scale
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Contributions • Proposed a distributed architecture for job management systems, and identified the challenges and solutions towards supporting job management systems at extreme scales • Designed and developed a novel distributed resource-stealing algorithm for efficient HPC job launch • Designed and implemented a distributed job launch prototype, SLURM++, for extreme scales by leveraging SLURM and ZHT • Evaluated SLURM and SLURM++ up to 500 nodes with various micro-benchmarks of different job sizes, with excellent results: up to 10X higher throughput
Architecture • Controllers are fully connected • Data servers are also fully connected • The controller-to-node ratio and the partition size are configurable for HPC and MTC
Job and Resource Metadata
Resource Stealing • The procedure of stealing resources from other partitions • Why steal resources? • When to steal resources? • Where and how to steal resources? • What if there are no resources to be stolen?
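The questions above can be sketched as a simple stealing loop: a controller first allocates from its own partition, and steals the remainder from randomly chosen remote partitions. This is an illustrative Python sketch, not the SLURM++ implementation; the `free_nodes` table, the random victim selection, and the retry bound are assumptions.

```python
import random

# Hypothetical free-node table: partition id -> list of free node ids.
free_nodes = {
    "p0": ["n0", "n1"],                  # local partition, nearly exhausted
    "p1": ["n2", "n3", "n4"],
    "p2": ["n5", "n6", "n7", "n8"],
}

def allocate(job_size, local="p0", max_attempts=10):
    """Allocate job_size nodes, stealing from remote partitions when the
    local partition cannot satisfy the request on its own."""
    allocated = []
    # Take whatever the local partition can offer first.
    take = min(job_size, len(free_nodes[local]))
    allocated += [free_nodes[local].pop() for _ in range(take)]
    attempts = 0
    while len(allocated) < job_size and attempts < max_attempts:
        victim = random.choice([p for p in free_nodes if p != local])
        need = job_size - len(allocated)
        take = min(need, len(free_nodes[victim]))  # steal only what is needed
        allocated += [free_nodes[victim].pop() for _ in range(take)]
        attempts += 1
    if len(allocated) < job_size:
        # Simplification: on failure, release everything back to the
        # local partition instead of returning stolen nodes to their owners.
        free_nodes[local].extend(allocated)
        return None
    return allocated
```

If the retry bound is exhausted with no resources found anywhere, the request fails and the nodes are released, which answers the last question on the slide in the simplest possible way.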
Resource Contention • Occurs when different controllers try to allocate the same resources • A naive solution is to add a global lock for each queried key in the DKVS • Instead, an atomic compare-and-swap operation in the DKVS tells a controller whether its resource allocation succeeded
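The compare-and-swap approach can be sketched as follows: each controller re-reads the free-resource record and retries until its swap commits, with no global lock. The `KVStore` class below is a minimal stand-in for the DKVS (the ZHT API and the key naming are assumptions for illustration).

```python
import threading

class KVStore:
    """Minimal key-value store with an atomic compare-and-swap,
    standing in for the DKVS (e.g. ZHT) used by the controllers."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        # Succeeds only if no other controller changed the value in between.
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

def try_allocate(store, key, n):
    """Lock-free allocation: re-read and retry until the CAS succeeds."""
    while True:
        free = store.get(key)
        if free < n:
            return False        # not enough resources; caller may steal
        if store.compare_and_swap(key, free, free - n):
            return True         # allocation committed atomically
```

If two controllers race on the same key, exactly one CAS succeeds; the loser simply observes the new value on its next read and retries or moves on to stealing.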
Blocking State Change Callback • A controller needs to wait on a specific state change before moving on • Having the client keep polling the server is inefficient • Instead, the server provides a blocking state-change callback operation
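The difference between polling and a blocking callback can be sketched with a condition variable: the waiting controller blocks until the server changes the state and notifies it. The class and method names below are illustrative, not the ZHT API.

```python
import threading

class StateServer:
    """Server-side state record that supports a blocking wait,
    so clients do not have to busy-poll."""
    def __init__(self, state="PENDING"):
        self._state = state
        self._cond = threading.Condition()

    def set_state(self, new_state):
        with self._cond:
            self._state = new_state
            self._cond.notify_all()     # wake every blocked controller

    def wait_for(self, target, timeout=5.0):
        # Block until the state changes to `target` (no busy polling).
        with self._cond:
            return self._cond.wait_for(lambda: self._state == target,
                                       timeout=timeout)

server = StateServer()
# Simulate another controller completing the state change shortly after.
t = threading.Timer(0.1, server.set_state, args=("COMPLETE",))
t.start()
print(server.wait_for("COMPLETE"))      # blocks briefly, then prints True
t.join()
```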
SLURM++ Design and Implementation • Builds on SLURM • Light-weight controller acts as a ZHT client • Job launching runs as a separate thread • Implements the resource-stealing algorithm • Developed in C • 3K lines of code, plus SLURM (50K lines) and ZHT (8K lines)
Evaluation • LANL Kodiak machine, up to 500 nodes • Each node has two 64-bit AMD Opteron processors at 2.6GHz and 8GB of memory • SLURM version 2.5.3 • Partition size of 50 • Metrics: • Efficiency • Number of ZHT messages
Small-Job Workload • Conclusions: • SLURM throughput remains almost constant, while SLURM++ throughput increases with scale • The MTC configuration is preferable to the HPC configuration for MTC workloads
Medium-Job Workload • Conclusions: • SLURM throughput remains constant, while SLURM++ throughput increases linearly with scale • The medium-job workload introduces a small amount of resource-stealing overhead
Big-Job Workload • Conclusions: • SLURM begins to saturate at large scales, while SLURM++ throughput keeps increasing with scale • The more partitions there are, the better the chance that a controller can steal resources
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Contributions • Designed and implemented a distributed MTC task execution fabric (MATRIX) that uses an adaptive work-stealing technique for distributed load balancing and employs a DKVS to store task metadata • Explored the parameter space of work stealing as applied to exascale-class systems, through a light-weight job scheduling system simulator, SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks • Evaluated and compared MATRIX with other task schedulers (Falkon, Sparrow, and CloudKon) on an IBM Blue Gene/P machine and the Amazon cloud, using both micro-benchmarks and real workload traces
Work Stealing • Distributed load balancing technique: an idle scheduler steals tasks from overloaded ones • Tunable parameters: • Number of tasks to steal • Number of static neighbors • Number of dynamic neighbors • Poll interval
Work Stealing • Select Neighbors
Work Stealing • Dynamic Poll Interval
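One stealing round with a dynamic poll interval can be sketched as exponential back-off: an idle scheduler picks a random set of dynamic neighbors, steals half the tasks from the most loaded one, and doubles its wait after each failed attempt. This is an illustrative Python sketch; the back-off constants and the neighbor-selection policy are assumptions, not the MATRIX implementation.

```python
import random

def work_steal(my_queue, neighbors, num_dynamic, base_poll=1.0, max_poll=64.0):
    """One idle-scheduler stealing loop with a dynamic poll interval.
    `neighbors` maps scheduler id -> that scheduler's task queue (a list)."""
    poll = base_poll
    waits = []                                   # recorded instead of sleeping
    while not my_queue:
        # Re-select a random subset of dynamic neighbors each round.
        chosen = random.sample(list(neighbors),
                               min(num_dynamic, len(neighbors)))
        victim = max(chosen, key=lambda s: len(neighbors[s]))
        load = len(neighbors[victim])
        if load > 1:
            steal = load // 2                    # steal half of the tasks
            for _ in range(steal):
                my_queue.append(neighbors[victim].pop())
            poll = base_poll                     # reset after a success
        else:
            waits.append(poll)                   # back off: double the interval
            poll = min(poll * 2, max_poll)
            if poll >= max_poll:
                break                            # give up: everyone seems idle
    return my_queue, waits
```

Doubling the interval keeps idle schedulers from flooding the network with steal requests when the whole system is lightly loaded, while resetting after a success keeps latency low when work is available.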
MATRIX Architecture
Task Submission • Worst case: all tasks are submitted to one scheduler • Best case: tasks are evenly distributed across all schedulers
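The two cases above can be sketched directly; a round-robin spread is one simple way to realize the best case (a minimal sketch, not the MATRIX submission path):

```python
def submit_worst(tasks, num_schedulers):
    """Worst case: every task lands on scheduler 0."""
    queues = [[] for _ in range(num_schedulers)]
    queues[0].extend(tasks)
    return queues

def submit_best(tasks, num_schedulers):
    """Best case: tasks are spread round-robin across all schedulers."""
    queues = [[] for _ in range(num_schedulers)]
    for i, task in enumerate(tasks):
        queues[i % num_schedulers].append(task)
    return queues
```

In the worst case, work stealing must redistribute everything from scheduler 0; in the best case there is nothing to rebalance, which is why submission strategy matters for measured load-balancing overhead.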
Execute Unit
Client Monitoring • The client does not have to stay alive • A monitoring program polls the task execution progress periodically • Logs about the system state are recorded for visualization
Evaluation • Focus on the Amazon cloud environment • The "m1.medium" instance: 1 virtual CPU, 2 compute units, 3.7GB memory, 410GB hard disk • Ran MATRIX, Sparrow, and CloudKon up to 64 instances • The number of executing threads is 2 for all systems
Throughput Comparison • Reasons: • MATRIX has near-perfect load balancing when submitting tasks • Sparrow needs to send probe messages to push tasks • The clients of CloudKon need to push tasks to and pull tasks from SQS
Efficiency Comparison • Conclusions: • SQS and DynamoDB are designed specifically for the cloud • The probing and pushing method in Sparrow scales poorly for heterogeneous workloads • The network communication layer of MATRIX needs significant improvement
Exploration of the Work Stealing Parameter Space • Through SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks • Number of tasks to steal • Number of static neighbors • Number of dynamic neighbors
Number of Tasks to Steal • Conclusions: • Steal half of the victim's tasks • A small number of static neighbors is not sufficient • The more tasks (up to half) stolen, the better the performance
Number of Static Neighbors • Conclusions: • The optimal number of static neighbors is an eighth of all nodes • This is impractical at exascale, where it means 128K neighbors • Dynamic neighbors are needed to reduce the neighbor count
Number of Dynamic Neighbors • Conclusions: • A square-root number of dynamic neighbors is optimal • This is reasonable at exascale (about 1K neighbors)
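Taken together, the simulation conclusions reduce to simple closed forms for the two key parameters at a given scale (this is just arithmetic over the stated results, wrapped in a hypothetical helper):

```python
import math

def optimal_parameters(num_nodes, victim_queue_length):
    """Work-stealing parameters suggested by the SimMatrix results:
    steal half the victim's tasks, use ~sqrt(N) dynamic neighbors."""
    return {
        "tasks_to_steal": victim_queue_length // 2,     # steal half
        "dynamic_neighbors": int(math.sqrt(num_nodes)), # sqrt(N) neighbors
    }

# At ~1M nodes, sqrt(N) gives about 1K dynamic neighbors, versus
# N/8 = 128K static neighbors, which is impractical to maintain.
print(optimal_parameters(1_000_000, 200))
```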
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Conclusions • Applications for exascale computing are becoming ensemble-based and finer-grained • Task schedulers for exascale computing need to be distributed and scalable • SLURM++ should be integrated with MATRIX
Future Work • Re-implement MATRIX • Large-scale evaluation • Integration of SLURM++ and MATRIX • Workflow integration • Data-aware scheduling • Distributed MapReduce framework support
Acknowledgements • Xiaobing Zhou • Hao Chen • Kiran Ramamurthy • Iman Sadooghi • Michael Lang • Ioan Raicu
More Information • http://datasys.cs.iit.edu/~kewang/ • Contact: kwang22@hawk.iit.edu • Questions?