Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting May 10-11, 2005 Argonne, IL. Resource Management and Accounting Working Group. Working group scope Progress since last face-to-face Future Work Other issues. Working Group Scope.
Scalable Systems Software Center
Resource Management and Accounting Working Group
Face-to-Face Meeting
May 10-11, 2005
Argonne, IL
Resource Management and Accounting Working Group
• Working group scope
• Progress since last face-to-face
• Future Work
• Other issues
Working Group Scope
The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting. This working group will focus on the following software components:
• Queue Manager
• Scheduler
• Accounting and Allocation Manager
• Meta Scheduler
Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
• Process Manager
• Cluster Monitor
Resource Management Component Architecture
[Diagram; only the box labels survive]
• Grid Scheduler
• Infrastructure Services
• Allocation Manager
• Cluster Scheduler
• Discovery Service
• Queue Manager
• Node Monitor
• Event Manager
• Security System
• Process Manager
• Node Manager
Resource Management Prototype Demonstration
This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.
Components shown in the diagram: Job Submission Client, Queue Manager, Cluster Scheduler, Allocation Manager, Node Monitor, Process Manager, Discovery Service
Numbered steps from the diagram:
0 Service-Lookup
1 Submit-Job
2 Query-Job
3 Query-Node
4 Create-Reservation
5 Run-Job
6 Exec-Process
7 Query-Job
8 Delete-Job
9 Withdraw-Allocation
General Progress
• New release of RMWG components made available from SSS web site
  • Bamboo Queue Manager v1.1
  • Maui Scheduler v3.2.6p13
  • Gold Accounting and Allocation Manager v2.b2.10.2
General Progress
• Continued adoption of SSS components and interfaces
  • SSS suite running on additional systems in Ames
  • Gold being used in production on University of Utah’s Icebox cluster
General Progress
• Working on integration of SSSRMAP into ssslib
  • Bill Pitre: implementing the SSSRMAP Message Format SDK (Python classes)
  • Craig Steffen: integrating the SSSRMAP Wire Level protocol into ssslib
General Progress
• Paper accepted for presentation and publication at a conference
  • Title: Allocation Management Solutions for High Performance Computing
  • Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA'05)
  • Workshop on “Scheduling and Resource Management for Parallel and Distributed Systems”
General Progress
• New documents in SSS RMWG Notebook
  • Considerations for using SOAP as the basis for SSSRMAP v4
  • Fault Tolerance with Gold
  • Last Quarter’s Weekly RMWG Meeting Notes
Queue Manager Progress
• V1.1 release of Bamboo made available
• SSS suite running on several systems in Ames
• Support for Task Groups and Node Properties added to server
• Added a new mailing feature
• New fountain component created to pull node information from multiple sources
• Simple node information now supported
• Working on adding support for SuperMon, Ganglia and NWPerf
Accounting and Allocation Manager Progress
• New release of Gold available: 2nd Gold Beta v2.b2.10.2
• v2.b2.7.0 incorporated into OSCAR release
• Gold being used in production on University of Utah’s Icebox cluster
• Implemented and tested design for distributed accounting and multi-organizational negotiation in job launching
• Implemented fault tolerance to 50% cluster loss by adding support for a backup Gold server
  • Clients can fail over to a backup Gold server if one is defined
  • The database can be made fault tolerant by using a synchronous multi-master replication system such as PGCluster
  • Documented in RMWG notebook
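The client-side failover mentioned on the slide above can be sketched roughly as follows. This is a minimal illustration in Python, not Gold's actual client code: `send_with_failover`, `AllServersFailed`, the server names, and the `fake_send` transport are all hypothetical names introduced here, and the sketch assumes an unreachable server shows up as a `ConnectionError`.

```python
from typing import Callable, Optional, Sequence


class AllServersFailed(Exception):
    """Raised when neither the primary nor any backup server answers."""


def send_with_failover(
    servers: Sequence[str],
    send: Callable[[str, str], str],
    request: str,
) -> str:
    """Try each server in order and return the first successful reply.

    `servers` lists the primary first, then any backups.
    `send` is a hypothetical transport function (server, request) -> reply
    that raises ConnectionError when the server is unreachable.
    """
    last_error: Optional[Exception] = None
    for server in servers:
        try:
            return send(server, request)
        except ConnectionError as err:
            last_error = err  # remember the failure and try the next server
    raise AllServersFailed(f"no server reachable: {last_error}")


def fake_send(server: str, request: str) -> str:
    # Stand-in transport for illustration: the primary is treated as down.
    if server == "gold-primary":
        raise ConnectionError("primary unreachable")
    return f"{server} handled {request}"
```

Calling `send_with_failover(["gold-primary", "gold-backup"], fake_send, "Query-Account")` falls through to the backup. Note that this covers only the client side; keeping the two servers' data consistent is the job of the database replication mentioned on the slide.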
Accounting and Allocation Manager Progress
• Simplified allocation management for basic configurations by adding the ability to hide the account abstraction layer
  • Enabled account auto-generation, project-level deposits, etc.
• Ported Gold to Tier3 and Tier4 OSes
  • (OS X, IRIX, HP-UX, Solaris); unable to get access to UNICOS
• Enabled support for MySQL database
Cluster Scheduler Progress
• Migrated latest MCOM library into Maui
  • Includes support for encryption, scalability enhancements, SSS return codes, job description extensions, etc.
• Enabled support for partitions, node features
• Enhanced recovery modes for failures and unexpected conditions
• Additional QOS modes for Allocation Manager
  • Fallback QOS, QOS requested vs. delivered
• Fixed additional packaging bugs, buffer overflows
• Started work on multi-taskgroup jobs
Grid Scheduler Progress
• Added support for multi-site authentication (per peer-service symmetric keys)
• Rolling X.509 credential management into MCOM library
• Enabled support for Globus 3.x (had to work around a lot of Globus bugs)
• Enhanced grid job queue and launch
• Reliability - completed Globus failure diagnostics, logging and auto-recovery
• Data Staging - completed Globus/non-Globus data staging failure auto-recovery
• Fairness - implemented Priority, Fairshare, and Usage Limit based policy enforcement
• Statistics - added credential, job, and cluster based usage statistics
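As an illustration of what authentication with per-peer symmetric keys can look like, here is a minimal Python sketch. The slides do not say which algorithm or message layout the grid scheduler uses, so the choice of HMAC-SHA256, the `PEER_KEYS` table, the site names, and the `sign`/`verify` helpers are all assumptions made for this example.

```python
import hashlib
import hmac

# Hypothetical per-peer key table: one shared secret per remote site.
PEER_KEYS = {
    "site-a": b"secret-shared-with-site-a",
    "site-b": b"secret-shared-with-site-b",
}


def sign(peer: str, message: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for `message` using the peer's key."""
    return hmac.new(PEER_KEYS[peer], message, hashlib.sha256).hexdigest()


def verify(peer: str, message: bytes, tag: str) -> bool:
    """Check a tag in constant time; unknown peers are rejected."""
    key = PEER_KEYS.get(peer)
    if key is None:
        return False
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)
```

Because each peer has its own key, a tag produced for one site does not verify under another site's key, so a compromised key exposes only that one peer relationship.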
Future Work
• General release of all components
  • Including new Silver Meta-scheduler
• Increase deployment base
• Integrate SSSRMAP into ssslib
• Portability testing for all components
• Fault Tolerance supporting 25% cluster loss
Future Work: Queue Manager
• Add job group support (mainly for submission)
• Add job submission filter
• Finish the remaining portions of PBS-style job language support
Future Work: Accounting and Allocation Manager
• General release to be made available by mid-year
• Production deployment of Gold on additional sites
• Port Gold GUI from JSP to Perl CGI
• Add support for multi-site authentication (each site having its own symmetric key)
• Documentation to include object customization
Future Work: Cluster Scheduler
• Add support for multi-taskgroup SSS jobs
• Support SSS job extensions and job-level policies
• Peer Diagnostics - add auto-recovery to failed service interfaces
• Resource Utilization - complete development of all resource utilization objectives
• Resource Limits - complete development of all resource limits objectives
• Checkpoint Restart - test with LBNL and optimize resource management for suspended jobs
• Get X.509 credential management working
Future Work: Grid Scheduler
• Release Silver meta-scheduler
  • Targeting end of June for alpha release
  • Need to test Maui/Silver interoperability with new MCOM lib
• Need to test
  • Priority, Fairshare, and Usage Limit based policy enforcement
  • Credential, job, and cluster based usage statistics
• Optimization - add network co-allocation reservation
• General - mature client commands to provide status reporting in a more intuitive manner