Next Generation Job Management System for Extreme Scales Speaker: Ke Wang Advisor: Ioan Raicu Datasys Lab, Illinois Institute of Technology November 25th, 2013 at Datasys Mini-Workshop, Fall 2013
Contents • SLURM++: A distributed job launch prototype for extreme-scale ensemble computing (IPDPS14 submission) • MATRIX: A distributed Many-Task Computing execution fabric designed for exascale (CCGRID14 submission)
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Exascale Computing • Today (June 2013): 34 petaflops • O(100K) nodes • O(1M) cores • Near future (~2020): exaflop computing • ~1M nodes • ~1B processor cores/threads
Job Management Systems for Exascale Computing • Ensemble computing • Over-decomposition • Many-Task Computing • Jobs/tasks are finer-grained • Requirements • High availability • Extremely high throughput (1M tasks/sec) • Low latency
Current Job Management Systems • Batch-scheduled HPC workloads • Lack support for ensemble workloads • Centralized design • Poor scalability • Single point of failure • SLURM maximum throughput of 500 jobs/sec • A decentralized design is needed
Goal • Architect and design job management systems for exascale ensemble computing • Identify the challenges and solutions towards supporting job management systems at extreme scales • Evaluate and compare different design choices at large scale
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Contributions • Proposed a distributed architecture for job management systems, and identified the challenges and solutions towards supporting job management systems at extreme scales • Designed and developed a novel distributed resource-stealing algorithm for efficient HPC job launch • Designed and implemented a distributed job launch prototype, SLURM++, for extreme scales by leveraging SLURM and ZHT • Evaluated SLURM and SLURM++ up to 500 nodes with various micro-benchmarks of different job sizes, with excellent results: up to 10X higher throughput
Architecture • Controllers are fully connected • Data servers are also fully connected • The controller-to-node ratio and the partition size are configurable for HPC and MTC
Job and Resource Metadata
Resource Stealing • The procedure of stealing resources from other partitions • Why steal resources? • When to steal resources? • Where and how to steal resources? • What if there are no resources to be stolen?
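The questions above can be sketched as a simple stealing loop: a controller first allocates from its own partition, and steals the remainder from randomly chosen remote partitions. This is an illustrative Python sketch, not the SLURM++ implementation; the `free_nodes` table, the random victim selection, and the retry bound are assumptions.

```python
import random

# Hypothetical free-node table: partition id -> list of free node ids.
free_nodes = {
    "p0": ["n0", "n1"],                  # local partition, nearly exhausted
    "p1": ["n2", "n3", "n4"],
    "p2": ["n5", "n6", "n7", "n8"],
}

def allocate(job_size, local="p0", max_attempts=10):
    """Allocate job_size nodes, stealing from remote partitions when the
    local partition cannot satisfy the request on its own."""
    allocated = []
    # Take whatever the local partition can offer first.
    take = min(job_size, len(free_nodes[local]))
    allocated += [free_nodes[local].pop() for _ in range(take)]
    attempts = 0
    while len(allocated) < job_size and attempts < max_attempts:
        victim = random.choice([p for p in free_nodes if p != local])
        need = job_size - len(allocated)
        take = min(need, len(free_nodes[victim]))  # steal only what is needed
        allocated += [free_nodes[victim].pop() for _ in range(take)]
        attempts += 1
    if len(allocated) < job_size:
        # Simplification: on failure, release everything back to the
        # local partition instead of returning stolen nodes to their owners.
        free_nodes[local].extend(allocated)
        return None
    return allocated
```

If the retry bound is exhausted with no resources found anywhere, the request fails and the nodes are released, which answers the last question on the slide in the simplest possible way.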
Resource Contention • Occurs when different controllers try to allocate the same resources • A naive solution is to add a global lock for each queried key in the DKVS • Instead, an atomic compare-and-swap operation in the DKVS tells a controller whether its resource allocation succeeded
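The compare-and-swap approach can be sketched as follows: each controller re-reads the free-resource record and retries until its swap commits, with no global lock. The `KVStore` class below is a minimal stand-in for the DKVS (the ZHT API and the key naming are assumptions for illustration).

```python
import threading

class KVStore:
    """Minimal key-value store with an atomic compare-and-swap,
    standing in for the DKVS (e.g. ZHT) used by the controllers."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        # Succeeds only if no other controller changed the value in between.
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

def try_allocate(store, key, n):
    """Lock-free allocation: re-read and retry until the CAS succeeds."""
    while True:
        free = store.get(key)
        if free < n:
            return False        # not enough resources; caller may steal
        if store.compare_and_swap(key, free, free - n):
            return True         # allocation committed atomically
```

If two controllers race on the same key, exactly one CAS succeeds; the loser simply observes the new value on its next read and retries or moves on to stealing.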
Blocking State Change Callback • A controller needs to wait on a specific state change before moving on • Having the client keep polling the server is inefficient • Instead, the server provides a blocking state-change callback operation
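The difference between polling and a blocking callback can be sketched with a condition variable: the waiting controller blocks until the server changes the state and notifies it. The class and method names below are illustrative, not the ZHT API.

```python
import threading

class StateServer:
    """Server-side state record that supports a blocking wait,
    so clients do not have to busy-poll."""
    def __init__(self, state="PENDING"):
        self._state = state
        self._cond = threading.Condition()

    def set_state(self, new_state):
        with self._cond:
            self._state = new_state
            self._cond.notify_all()     # wake every blocked controller

    def wait_for(self, target, timeout=5.0):
        # Block until the state changes to `target` (no busy polling).
        with self._cond:
            return self._cond.wait_for(lambda: self._state == target,
                                       timeout=timeout)

server = StateServer()
# Simulate another controller completing the state change shortly after.
t = threading.Timer(0.1, server.set_state, args=("COMPLETE",))
t.start()
print(server.wait_for("COMPLETE"))      # blocks briefly, then prints True
t.join()
```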
SLURM++ Design and Implementation • Builds on SLURM • Light-weight controller acts as a ZHT client • Job launching runs as a separate thread • Implements the resource-stealing algorithm • Developed in C • 3K lines of code, plus SLURM (50K lines) and ZHT (8K lines)
Evaluation • LANL Kodiak machine, up to 500 nodes • Each node has two 64-bit AMD Opteron processors at 2.6GHz and 8GB of memory • SLURM version 2.5.3 • Partition size of 50 • Metrics: • Efficiency • Number of ZHT messages
Small-Job Workload • Conclusions: • SLURM throughput remains almost constant, while SLURM++ throughput increases with scale • The MTC configuration is preferable to the HPC configuration for MTC workloads
Medium-Job Workload • Conclusions: • SLURM throughput remains constant, while SLURM++ throughput increases linearly with scale • The medium-job workload introduces a small amount of resource-stealing overhead
Big-Job Workload • Conclusions: • SLURM begins to saturate at large scales, while SLURM++ throughput keeps increasing with scale • The more partitions there are, the better the chance that a controller can steal resources
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Contributions • Designed and implemented a distributed MTC task execution fabric (MATRIX) that uses an adaptive work-stealing technique for distributed load balancing and employs a DKVS to store task metadata • Explored the parameter space of work stealing as applied to exascale-class systems, through a light-weight job scheduling system simulator, SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks • Evaluated and compared MATRIX with other task schedulers (Falkon, Sparrow, and CloudKon) on an IBM Blue Gene/P machine and the Amazon cloud, using both micro-benchmarks and real workload traces
Work Stealing • Distributed load balancing technique: an idle scheduler steals tasks from overloaded ones • Tunable parameters: • Number of tasks to steal • Number of static neighbors • Number of dynamic neighbors • Poll interval
Work Stealing • Select Neighbors
Work Stealing • Dynamic Poll Interval
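One stealing round with a dynamic poll interval can be sketched as exponential back-off: an idle scheduler picks a random set of dynamic neighbors, steals half the tasks from the most loaded one, and doubles its wait after each failed attempt. This is an illustrative Python sketch; the back-off constants and the neighbor-selection policy are assumptions, not the MATRIX implementation.

```python
import random

def work_steal(my_queue, neighbors, num_dynamic, base_poll=1.0, max_poll=64.0):
    """One idle-scheduler stealing loop with a dynamic poll interval.
    `neighbors` maps scheduler id -> that scheduler's task queue (a list)."""
    poll = base_poll
    waits = []                                   # recorded instead of sleeping
    while not my_queue:
        # Re-select a random subset of dynamic neighbors each round.
        chosen = random.sample(list(neighbors),
                               min(num_dynamic, len(neighbors)))
        victim = max(chosen, key=lambda s: len(neighbors[s]))
        load = len(neighbors[victim])
        if load > 1:
            steal = load // 2                    # steal half of the tasks
            for _ in range(steal):
                my_queue.append(neighbors[victim].pop())
            poll = base_poll                     # reset after a success
        else:
            waits.append(poll)                   # back off: double the interval
            poll = min(poll * 2, max_poll)
            if poll >= max_poll:
                break                            # give up: everyone seems idle
    return my_queue, waits
```

Doubling the interval keeps idle schedulers from flooding the network with steal requests when the whole system is lightly loaded, while resetting after a success keeps latency low when work is available.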
MATRIX Architecture
Task Submission • Worst case: all tasks are submitted to one scheduler • Best case: tasks are evenly distributed across all schedulers
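The two cases above can be sketched directly; a round-robin spread is one simple way to realize the best case (a minimal sketch, not the MATRIX submission path):

```python
def submit_worst(tasks, num_schedulers):
    """Worst case: every task lands on scheduler 0."""
    queues = [[] for _ in range(num_schedulers)]
    queues[0].extend(tasks)
    return queues

def submit_best(tasks, num_schedulers):
    """Best case: tasks are spread round-robin across all schedulers."""
    queues = [[] for _ in range(num_schedulers)]
    for i, task in enumerate(tasks):
        queues[i % num_schedulers].append(task)
    return queues
```

In the worst case, work stealing must redistribute everything from scheduler 0; in the best case there is nothing to rebalance, which is why submission strategy matters for measured load-balancing overhead.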
Execute Unit
Client Monitoring • The client does not have to stay alive • A monitoring program polls the task execution progress periodically • Logs about the system state are recorded for visualization
Evaluation • Focus on the Amazon cloud environment • The "m1.medium" instance: 1 virtual CPU, 2 compute units, 3.7GB memory, 410GB hard disk • Ran MATRIX, Sparrow, and CloudKon up to 64 instances • The number of executing threads is 2 for all systems
Throughput Comparison • Reasons: • MATRIX has near-perfect load balancing when submitting tasks • Sparrow needs to send probe messages to push tasks • The clients of CloudKon need to push tasks to and pull tasks from SQS
Efficiency Comparison • Conclusions: • SQS and DynamoDB are designed specifically for the cloud • The probing and pushing method in Sparrow scales poorly for heterogeneous workloads • The network communication layer of MATRIX needs significant improvement
Exploration of the Work Stealing Parameter Space • Through SimMatrix, up to millions of nodes, billions of cores, and trillions of tasks • Number of tasks to steal • Number of static neighbors • Number of dynamic neighbors
Number of Tasks to Steal • Conclusions: • Steal half of the victim's tasks • A small number of static neighbors is not sufficient • The more tasks (up to half) stolen, the better the performance
Number of Static Neighbors • Conclusions: • The optimal number of static neighbors is an eighth of all nodes • This is impractical at exascale, where it means 128K neighbors • Dynamic neighbors are needed to reduce the neighbor count
Number of Dynamic Neighbors • Conclusions: • A square-root number of dynamic neighbors is optimal • This is reasonable at exascale (about 1K neighbors)
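Taken together, the simulation conclusions reduce to simple closed forms for the two key parameters at a given scale (this is just arithmetic over the stated results, wrapped in a hypothetical helper):

```python
import math

def optimal_parameters(num_nodes, victim_queue_length):
    """Work-stealing parameters suggested by the SimMatrix results:
    steal half the victim's tasks, use ~sqrt(N) dynamic neighbors."""
    return {
        "tasks_to_steal": victim_queue_length // 2,     # steal half
        "dynamic_neighbors": int(math.sqrt(num_nodes)), # sqrt(N) neighbors
    }

# At ~1M nodes, sqrt(N) gives about 1K dynamic neighbors, versus
# N/8 = 128K static neighbors, which is impractical to maintain.
print(optimal_parameters(1_000_000, 200))
```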
Outline • Introduction & Motivation • SLURM++ • MATRIX • Conclusion & Future Work
Conclusions • Applications for exascale computing are becoming ensemble-based and finer-grained • Task schedulers for exascale computing need to be distributed and scalable • SLURM++ should be integrated with MATRIX
Future Work • Re-implement MATRIX • Large-scale evaluation • Integration of SLURM++ and MATRIX • Workflow integration • Data-aware scheduling • Distributed MapReduce framework support
Acknowledgements • Xiaobing Zhou • Hao Chen • Kiran Ramamurthy • Iman Sadooghi • Michael Lang • Ioan Raicu
More Information • http://datasys.cs.iit.edu/~kewang/ • Contact: kwang22@hawk.iit.edu • Questions?