190 likes | 205 Views
Learn about the Scalable Systems Software Center and its research goals, software architecture, participants, components, and project status. Explore the challenges, experiences, and lessons in developing systems software for scalable machines.
E N D
Rusty Lusk Mathematics and Computer Science Division Argonne National Laboratory Scalable Systems Software CenterAl Geist, coordinating P.I.
Outline The Scalable Systems Software Center • Goals • Participants • Scope • Structure • Issues • Status • Experiences • Lessons
Current State of Systems Software for Large-Scale Machines • Both proprietary and open-source systems • Machine-specific, PBS, LSF, POE, SLURM, COOAE (Collections Of Odds And Ends), … • Many are monolithic “resource management systems,” combining multiple functions • Job queuing, scheduling, process management, node monitoring, job monitoring, accounting, configuration management, etc. • A few established separate components exist • Maui scheduler • Qbank accounting system • Many home-grown, local pieces of software • Scalability often a weak point
The Scalable Systems Software SciDAC Project • Research goal: to develop a component-based architecture for systems software for scalable machines • Software goal: to demonstrate this architecture with prototype open-source components • One powerful effect: forcing rigorous (and aggressive) definition of what components should do and what should be encapsulated in other components • http://www.scidac.org//ScalableSystems
Participants • Labs • ORNL, ANL, LBNL, PNNL, Ames, SNL, LANL • Universities • NCSA, SDSC, PSC, Clemson • Vendors • Unlimited Scale, IBM, Cray, Intel • Open to anyone who wants to participate
Project Concept Meta Scheduler Meta Monitor Meta Manager Access Control Security Manager Meta Services Interacts with all components Node Configuration & Build Manager System Monitor Accounting Scheduler Resource Allocation Management Process Manager Queue Manager User DB Data Migration High Performance Communication & I/O Checkpoint / Restart Usage Reports User Utilities File System Testing & Validation Application Environment
Structure of Project • Working Groups • Node build, configuration, and global infrastructure • Job submission, queue management, scheduling, and accounting • Process management, system monitoring, and checkpointing • Validation and integration • Quarterly project meetings, weekly working group conference calls • Electronic notebooks for all working documents • www.scidac.org/ScalableSystems
SSS Project Issues • Put minimal constraints on component implementations • Ease merging of existing components into SSS framework • E.g., Maui scheduler • Ease development of new components • Encourage multiple implementations from vendors, others • Define minimal global structure • Components need to find one another • Need common communication method • Need common data format at some level • Each component will compose messages others will read and parse • Message-framing protocols
SSS Project Status – Global • Early decisions on inter-component communication • Lowest level communication is over sockets (at least) • Message content will be XML • Parsers available in all languages • Did not reach consensus on transport protocol (HTTP, SOAP, BEEP, assorted home grown), especially to cope with local security requirements, so multiple protocols are supported • Early implementation work on global issues • Service directory component defined and implemented • SSSlib library for inter-component communication • Handles interaction with service directory • Hides details of transport protocols from component logic • Anyone can add protocols to the library • Bindings for C, C++, Java, Perl, and Python • Event manager for asynchronous communication
SSS Project Status – Individual Component Prototypes • Precise XML interfaces not settled on yet, pending experiments with component prototypes • Both new and existing components • Maui scheduler is existing full-featured scheduler, SSS communication added • QBank accounting system has added SSS communication interface, evolving into Gold • New Checkpoint Manager component being integrated now • System-initiated checkpoints of LAM jobs
SSS Project Status – More Individual Component Prototypes • New Build-and-Configuration Manager completed • Controls how nodes are configured and built • New Node State Manager • Manages nodes as they are installed, reconfigured, added to active pool • New Event Manager for asynchronous communication among components • Components can register for notification of events supplied by other components (useful in monitoring, fault tolerance) • New Queue Manager mediates among user (job submitter), Job Scheduler, and Process Manager • Multiple monitoring components, both new and old • Data warehouse
SSS Project Status – Still More Individual Component Prototypes • New Process Manager component provides SSS interface to MPD scalable process manager • Speaks XML through SSSlib to other SSS components • Invokes MPD to implement SSS process management specification • MPD itself is not an SSS component • Allows MPD development, especially with respect to supporting MPI and MPI-2, to proceed independently • SSS Process Manager abstract definitions have influenced addition of MPD functionality beyond what is needed to implement mpiexec from MPI-2 standard • E.g. separate environment variables for separate processes
Schematic of Process Management Component in Scalable Systems Software Context NSM SD Sched EM MPD’s SSS Components QM PM PM SSS XML application processes mpdrun simple scripts or hairy GUIs using SSS XML QM’s job submission language XML file mpiexec (MPI Standard args) interactive Prototype MPD-based implementation side SSS side Other managers could go here instead
Other Accomplishments • APItest is a component test framework, capable of conducting unit tests on components • Well-suited to complicated network such as this one • Allows testing of one component at a time without testing all at once • Used on SSS components this year • SSS-OSCAR is a public, open-source release of the current state of the component system, tested for compatibility. (Get from web page) • Tested subset of components on 5000-node cluster at NCSA • SSS component architecture put into production on clusters at ORNL, PNNL, ANL (ANL story follows)
New Challenges on Chiba City • Medium-sized, middle-aged cluster at Argonne • Dedicated to computer science scalability research, not applications • Also used by friendly, hungry applications • New requirement: support research requiring specialized kernels and alternate operating systems, for OS scalability research • Want to schedule jobs that require node rebuilds (for new OS’s, kernel module tests, virtual nodes, etc.) as part of “normal” job scheduling • Requires major upgrade of Chiba City systems software
Chiba Commits to SSS • Fork in the road: • Major overhaul of old, crufty, Chiba systems software (open PBS + Maui scheduler + homegrown stuff), OR • Take leap forward and bet on all-new software architecture of SSS • Problems with leaping approach: • SSS interfaces not finalized • Some components don’t yet use library (implement own protocols in open code, not encapsulated in library) • Some components not fully functional yet • Solutions to problems: • Collect components that are adequately functional and integrated (PM, SD, EM, BCM) • Write “stubs” for other critical components (Sched, QM) • Do without some components (CKPT, monitors, accounting) until ready
Features of Adopted Solution • Stubs adequate, at least for time being • Scheduler does FIFO + reservations + backfill, improving • QM implements “PBS compatibility mode” (accepts user PBS scripts) as well as asking Process Manager to start parallel jobs directly • Process Manager wraps MPD, as described above • Single ring of daemons runs as root, managing all jobs for all users • Daemons’s started by Build-and-Config manager at boot time • An MPI program called MPISH (MPI Shell) wraps user jobs for handling file staging and multiple job steps • Python implementation of most components • Each component < 400 lines • Demonstrates feasibility of using SSS component approach to systems software • Running normal Chiba job mix for over six months now • Only systems software on this machine • Moving forward on meeting new requirements for research support
Lessons Learned: This Approach Really Works! • Components can use one another’s data • Functionality only needs to be implemented once • E.g., broadcast of messages • Components are more robust, since they focus on one task • Code volume shrinks because of less duplication of functionality • Easy to add new functionality • File staging • MPISH • Rich infrastructure on which to build new components • Communication, logging, location services • Need not be limited by existing subcomponents of existing systems • Can replace just the functionality needed (get to solve the problem you want to solve, without re-implementing everything). • E.g. having queue manager accept requests for rebuilt nodes before starting jobs.
Summary • The Scalable Systems Software SciDAC project is addressing the problem of systems software for terascale systems. • Component architecture for systems software • Definitions of standard interfaces between components • An infrastructure to support component implementations within this framework • A set of component implementations, continuing to improve • Prototype software suite released • Experimental production use of the component architecture and some of the component implementations • Encourages development of sharable tools and solutions • Scalability testing under way